EE664: Introduction To Parallel Computing: Dr. Gaurav Trivedi Lectures 5-14
1
Overview
Concepts and Terminology
Parallel Computer Memory Architectures
Parallel Programming Models
Designing Parallel Programs
Parallel Algorithm Examples
2
What is Parallel Computing? (1)
3
What is Parallel Computing? (2)
4
Demand for Computational Speed
5
Parallel Computing: Resources
6
Parallel Computing: The computational problem
7
Parallel Computing: what for? (1)
8
Parallel Computing: what for? (2)
9
Parallel Computing: what for? (3)
10
Example: Global Weather Forecasting
Atmosphere modeled by dividing it into 3-dimensional cells.
Calculations of each cell repeated many times to model passage of
time.
13
Why Parallel Computing? (2)
14
Limitations of Serial Computing
15
The future
16
Who and What? (1)
17
Who and What? (2)
18
Concepts and Terminology
19
Basic Design
Basic design
– Memory is used to store both program
instructions and data
– Program instructions are coded data
which tell the computer to do something
– Data is simply information to be used by
the program
A central processing unit (CPU) gets
instructions and/or data from
memory, decodes the instructions
and then sequentially performs them.
20
Flynn's Classical Taxonomy
21
Flynn Matrix
22
Single Instruction, Single Data (SISD)
23
Single Instruction, Multiple Data (SIMD)
A type of parallel computer
Single instruction: All processing units execute the same instruction at any given clock
cycle
Multiple data: Each processing unit can operate on a different data element
This type of machine typically has an instruction dispatcher, a very high-bandwidth
internal network, and a very large array of very small-capacity instruction units.
Best suited for specialized problems characterized by a high degree of regularity, such as
image processing.
Synchronous (lockstep) and deterministic execution
Two varieties: Processor Arrays and Vector Pipelines
Examples:
– Processor Arrays: Connection Machine CM-2, Maspar MP-1, MP-2
– Vector Pipelines: IBM 9000, Cray C90, Fujitsu VP, NEC SX-2, Hitachi S820
24
Multiple Instruction, Single Data (MISD)
A single data stream is fed into multiple processing units.
Each processing unit operates on the data independently via
independent instruction streams.
Few actual examples of this class of parallel computer have ever
existed. One is the experimental Carnegie-Mellon C.mmp computer
(1971).
Some conceivable uses might be:
– multiple frequency filters operating on a single signal stream
– multiple cryptography algorithms attempting to crack a single coded message.
25
Multiple Instruction, Multiple Data (MIMD)
Currently, the most common type of parallel computer. Most modern
computers fall into this category.
Multiple Instruction: every processor may be executing a different
instruction stream
Multiple Data: every processor may be working with a different data
stream
Execution can be synchronous or asynchronous, deterministic or non-
deterministic
Examples: most current supercomputers, networked parallel
computer "grids" and multi-processor SMP computers - including
some types of PCs.
26
Some General Parallel Terminology
Like everything else, parallel computing has its own "jargon". Some of the
more commonly used terms associated with parallel computing are listed
below. Most of these will be discussed in more detail later.
Task
– A logically discrete section of computational work. A task is
typically a program or program-like set of instructions that is
executed by a processor.
Parallel Task
– A task that can be executed by multiple processors safely (yields
correct results)
Serial Execution
– Execution of a program sequentially, one statement at a time. In
the simplest sense, this is what happens on a one-processor
machine. However, virtually all parallel programs have sections
that must be executed serially.
27
Parallel Execution
– Execution of a program by more than one task, with each task being able
to execute the same or different statement at the same moment in time.
Shared Memory
– From a strictly hardware point of view, describes a computer architecture
where all processors have direct (usually bus based) access to common
physical memory. In a programming sense, it describes a model where
parallel tasks all have the same "picture" of memory and can directly
address and access the same logical memory locations regardless of
where the physical memory actually exists.
Distributed Memory
– In hardware, refers to network based memory access for physical
memory that is not common. As a programming model, tasks can only
logically "see" local machine memory and must use communications to
access memory on other machines where other tasks are executing.
28
Communications
– Parallel tasks typically need to exchange data. There are several ways this
can be accomplished, such as through a shared memory bus or over a
network, however the actual event of data exchange is commonly
referred to as communications regardless of the method employed.
Synchronization
– The coordination of parallel tasks in real time, very often associated with
communications. Often implemented by establishing a synchronization
point within an application where a task may not proceed further until
another task(s) reaches the same or logically equivalent point.
– Synchronization usually involves waiting by at least one task, and can
therefore cause a parallel application's wall clock execution time to
increase.
29
Granularity
– In parallel computing, granularity is a qualitative measure of the ratio of
computation to communication.
– Coarse: relatively large amounts of computational work are done
between communication events
– Fine: relatively small amounts of computational work are done between
communication events
Observed Speedup
– Observed speedup of a code which has been parallelized, defined as:
(wall-clock time of serial execution) / (wall-clock time of parallel execution)
– One of the simplest and most widely used indicators for a parallel
program's performance.
30
Parallel Overhead
– The amount of time required to coordinate parallel tasks, as opposed to
doing useful work. Parallel overhead can include factors such as:
Task start-up time
Synchronizations
Data communications
Software overhead imposed by parallel compilers, libraries, tools, operating
system, etc.
Task termination time
Massively Parallel
– Refers to the hardware that comprises a given parallel system - having
many processors. The meaning of many keeps increasing, but currently
BG/L pushes this number to 6 digits.
31
Scalability
– Refers to a parallel system's (hardware and/or software) ability to
demonstrate a proportionate increase in parallel speedup with
the addition of more processors. Factors that contribute to
scalability include:
Hardware - particularly memory-cpu bandwidths and network
communications
Application algorithm
Parallel overhead related
Characteristics of your specific application and coding
32
Parallel Computing
Motives
In principle, n computers operating simultaneously could achieve the
result n times faster; in practice the speedup is less than n for various reasons.
33
Speedup Factor
S(p) = (execution time using one processor, ts) / (execution time using a multiprocessor with p processors, tp)
35
Maximum Speedup
Amdahl's law
[Figure: (a) one processor - total time ts, split into a serial fraction f·ts and a parallelizable fraction (1-f)·ts; (b) multiple processors - the parallelizable part takes (1-f)·ts/p on p processors, so tp = f·ts + (1-f)·ts/p]
36
Speedup factor is given by:
S(p) = ts / (f·ts + (1 - f)·ts/p) = 1 / (f + (1 - f)/p)
As p → ∞, S(p) → 1/f.
37
Superlinear Speedup
Example - searching
(a) Searching each sub-space sequentially: the search space is divided into p sub-spaces, each taking ts/p to search. The solution is found after x complete sub-space searches plus a further time Δt within the next sub-space, i.e. after x·ts/p + Δt (x is indeterminate).
38
(b) Searching each sub-space in parallel: all p sub-spaces are searched simultaneously and the solution is found in time Δt, so
S(p) = (x·ts/p + Δt) / Δt
39
Worst case for the sequential search: solution found in the last sub-space, i.e. x = p - 1. Then the parallel version offers the greatest benefit:
S(p) = ((p - 1)/p · ts + Δt) / Δt, which tends to infinity as Δt tends to zero
Least advantage for the parallel version when the solution is found in the first sub-space of the sequential search:
S(p) = Δt / Δt = 1
Actual speed-up depends upon which sub-space holds the solution but could be extremely large.
40
Conventional Computer
Consists of a processor executing a program stored in a (main)
memory:
[Figure: processor connected to main memory]
41
Parallel Computers
Shared memory vs. Distributed memory
[Figure: shared memory multiprocessor - processors connected through an interconnection network to memory modules forming one address space]
42
Quad Pentium Shared Memory Multiprocessor
[Figure: four Pentium processors sharing memory over an I/O bus]
43
Shared Memory Multiprocessors
Need to address the cache coherency problem!
44
Message-Passing Multicomputer
[Figure: message-passing multicomputer - computers, each with a processor and local memory, exchanging messages over an interconnection network]
45
Interconnection Networks
Hypercube
Trees
Using Switches:
Crossbar
Multistage interconnection networks
46
One-dimensional array
[Figure: computers/processors connected in a line by links]
47
Ring
Two-dimensional Torus
48
Three-dimensional hypercube
[Figure: three-dimensional hypercube with nodes labelled 000-111; nodes whose labels differ in one bit are connected]
49
Four-dimensional hypercube
[Figure: four-dimensional hypercube]
Tree network
[Figure: tree of switch elements - a root switch connected by links through further switch elements down to processors at the leaves]
51
Crossbar switch
[Figure: crossbar switch - processors connected to memories through a grid of switches]
52
Multistage Interconnection Network
Example: Omega network
2 x 2 switch elements (straight-through or crossover connections)
[Figure: 8 x 8 Omega network - inputs 000-111 connected to outputs 000-111 through stages of 2 x 2 switch elements]
53
Embedding a ring onto a hypercube
[Figure: ring embedded onto a three-dimensional hypercube by visiting the nodes in Gray code order]
2-bit Gray code: 00, 01, 11, 10
4-bit Gray code: 0000, 0001, 0011, 0010, 0110, 0111, 0101, 0100, 1100, 1101, ...
55
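For reference, the Gray code ordering used in the embedding can be generated with a simple bit manipulation, the i-th code word being i XOR (i shifted right by one). A minimal sketch (not from the slides):

! Sketch: generate the n-bit Gray code used to embed a ring in a hypercube;
! successive code words differ in exactly one bit (hypercube neighbours).
program gray_code
  implicit none
  integer, parameter :: nbits = 3
  integer :: i
  do i = 0, 2**nbits - 1
     write (*, '(I3, 2X, B3.3)') i, ieor(i, ishft(i, -1))
  end do
end program gray_code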
Distributed Shared Memory
[Figure: distributed shared memory - computers, each holding a processor and part of the shared memory, connected by an interconnection network and exchanging messages]
56
Networked Computers as a Computing Platform
Key Advantages:
Very high performance workstations and PCs readily available at low cost.
The latest processors can be easily incorporated into the system as they
become available.
58
Parallel Algorithm Examples:
Odd Even Transposition Sort
Initial array (worst-case scenario):
6, 5, 4, 3, 2, 1, 0
6, 4, 5, 2, 3, 0, 1 Phase 1
4, 6, 2, 5, 0, 3, 1 Phase 2
4, 2, 6, 0, 5, 1, 3 Phase 1
2, 4, 0, 6, 1, 5, 3 Phase 2
2, 0, 4, 1, 6, 3, 5 Phase 1
0, 2, 1, 4, 3, 6, 5 Phase 2
0, 1, 2, 3, 4, 5, 6 Phase 1
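A serial sketch of the compare-exchange phases traced above (not from the slides); in the parallel version each element, or block of elements, would reside on its own processor and compare-exchange with a neighbour in alternating phases:

! Sketch: odd-even transposition sort of the worst-case array traced above.
program odd_even_sort
  implicit none
  integer, parameter :: n = 7
  integer :: a(n) = (/ 6, 5, 4, 3, 2, 1, 0 /)
  integer :: phase, i, tmp
  do phase = 1, n
     if (mod(phase, 2) == 1) then
        do i = 2, n - 1, 2            ! "Phase 1": compare-exchange pairs (2,3), (4,5), (6,7)
           if (a(i) > a(i+1)) then
              tmp = a(i); a(i) = a(i+1); a(i+1) = tmp
           end if
        end do
     else
        do i = 1, n - 1, 2            ! "Phase 2": compare-exchange pairs (1,2), (3,4), (5,6)
           if (a(i) > a(i+1)) then
              tmp = a(i); a(i) = a(i+1); a(i+1) = tmp
           end if
        end do
     end if
     print *, a                       ! prints the row shown for each phase in the trace
  end do
end program odd_even_sort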
Memory architectures
Shared Memory
Distributed Memory
Hybrid Distributed-Shared Memory
Shared Memory: UMA vs. NUMA
Advantages
– Global address space provides a user-friendly programming perspective
to memory
– Data sharing between tasks is both fast and uniform due to the proximity
of memory to CPUs
Disadvantages:
– Primary disadvantage is the lack of scalability between memory and
CPUs. Adding more CPUs can geometrically increase traffic on the
shared memory-CPU path and, for cache coherent systems, geometrically
increase traffic associated with cache/memory management.
– Programmer responsibility for synchronization constructs that ensure
"correct" access to global memory.
– Expense: it becomes increasingly difficult and expensive to design and
produce shared memory machines with ever increasing numbers of
processors.
Distributed Memory
Like shared memory systems, distributed memory systems vary widely but share a
common characteristic. Distributed memory systems require a communication network
to connect inter-processor memory.
Processors have their own local memory. Memory addresses in one processor do not
map to another processor, so there is no concept of global address space across all
processors.
Because each processor has its own local memory, it operates independently. Changes
it makes to its local memory have no effect on the memory of other processors. Hence,
the concept of cache coherency does not apply.
When a processor needs access to data in another processor, it is usually the task of the
programmer to explicitly define how and when data is communicated. Synchronization
between tasks is likewise the programmer's responsibility.
The network "fabric" used for data transfer varies widely, though it can can be as simple
as Ethernet.
Distributed Memory: Pro and Con
Advantages
– Memory is scalable with number of processors. Increase the number of
processors and the size of memory increases proportionately.
– Each processor can rapidly access its own memory without interference
and without the overhead incurred with trying to maintain cache
coherency.
– Cost effectiveness: can use commodity, off-the-shelf processors and
networking.
Disadvantages
– The programmer is responsible for many of the details associated with
data communication between processors.
– It may be difficult to map existing data structures, based on global
memory, to this memory organization.
Hybrid Distributed-Shared Memory
The largest and fastest computers in the world today employ both shared and
distributed memory architectures.
Overview
Shared Memory Model
Threads Model
Message Passing Model
Data Parallel Model
Other Models
Overview
Although it might not seem apparent, these models are NOT specific to a
particular type of machine or memory architecture. In fact, any of these
models can (theoretically) be implemented on any underlying hardware.
Shared memory model on a distributed memory machine: Kendall Square
Research (KSR) ALLCACHE approach.
– Machine memory was physically distributed, but appeared to the user as a single
shared memory (global address space). Generically, this approach is referred to as
"virtual shared memory".
– Note: although KSR is no longer in business, there is no reason to suggest that a
similar implementation will not be made available by another vendor in the future.
Message passing model on a shared memory machine: MPI on SGI Origin.
The SGI Origin employed the CC-NUMA type of shared memory architecture,
where every task has direct access to global memory. However, the ability to
send and receive messages with MPI, as is commonly done over a network of
distributed memory machines, is not only implemented but is very commonly
used.
Threads Model
In the threads model of parallel programming, a single process can have multiple,
concurrent execution paths.
Perhaps the simplest analogy that can be used to describe threads is the concept of
a single program that includes a number of subroutines:
– The main program a.out is scheduled to run by the native operating system. a.out loads and
acquires all of the necessary system and user resources to run.
– a.out performs some serial work, and then creates a number of tasks (threads) that can be
scheduled and run by the operating system concurrently.
– Each thread has local data, but also, shares the entire resources of a.out. This saves the
overhead associated with replicating a program's resources for each thread. Each thread also
benefits from a global memory view because it shares the memory space of a.out.
– A thread's work may best be described as a subroutine within the main program. Any thread
can execute any subroutine at the same time as other threads.
– Threads communicate with each other through global memory (updating address locations).
This requires synchronization constructs to ensure that no two threads update
the same global address at the same time.
– Threads can come and go, but a.out remains present to provide the necessary shared
resources until the application has completed.
Threads are commonly associated with shared memory architectures and operating
systems.
Threads Model Implementations
OpenMP
– Compiler directive based; can use serial code
– Jointly defined and endorsed by a group of major computer hardware
and software vendors. The OpenMP Fortran API was released October 28,
1997. The C/C++ API was released in late 1998.
– Portable / multi-platform, including Unix and Windows NT platforms
– Available in C/C++ and Fortran implementations
– Can be very easy and simple to use - provides for "incremental
parallelism"
Microsoft has its own implementation for threads, which is not
related to the UNIX POSIX standard or OpenMP.
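As a small illustration of the "incremental parallelism" idea, a single OpenMP directive can parallelize an existing serial loop. A minimal sketch (the array name and bounds are arbitrary):

! Sketch: one OpenMP directive turns a serial loop into a parallel one.
! Compile with an OpenMP flag (e.g. -fopenmp); without it the directive is
! treated as a comment and the code still runs serially.
program omp_loop
  implicit none
  integer, parameter :: n = 100000
  integer :: i
  real :: a(n)
  !$omp parallel do
  do i = 1, n
     a(i) = sqrt(real(i)) * 2.0   ! each thread handles a block of iterations
  end do
  !$omp end parallel do
  print *, 'a(n) = ', a(n)
end program omp_loop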
Message Passing Model
The need for communications between tasks depends upon your problem
You DON'T need communications
– Some types of problems can be decomposed and executed in parallel with virtually
no need for tasks to share data. For example, imagine an image processing
operation where every pixel in a black and white image needs to have its color
reversed. The image data can easily be distributed to multiple tasks that then act
independently of each other to do their portion of the work.
– These types of problems are often called embarrassingly parallel because they are
so straight-forward. Very little inter-task communication is required.
You DO need communications
– Most parallel applications are not quite so simple, and do require tasks to share
data with each other. For example, a 3-D heat diffusion problem requires a task to
know the temperatures calculated by the tasks that have neighboring data.
Changes to neighboring data have a direct effect on that task's data.
Factors to Consider (1)
Visibility of communications
– With the Message Passing Model, communications are explicit
and generally quite visible and under the control of the
programmer.
– With the Data Parallel Model, communications often occur
transparently to the programmer, particularly on distributed
memory architectures. The programmer may not even be able to
know exactly how inter-task communications are being
accomplished.
Factors to Consider (4)
Scope of communications
– Knowing which tasks must communicate with each other is critical
during the design stage of a parallel code. Both of the two
scopings described below can be implemented synchronously or
asynchronously.
– Point-to-point - involves two tasks with one task acting as the
sender/producer of data, and the other acting as the
receiver/consumer.
– Collective - involves data sharing between more than two tasks,
which are often specified as being members in a common group,
or collective.
Collective Communications
Examples
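One concrete sketch (assuming MPI, which the slides do not prescribe) contrasting the two scopes: a point-to-point send/receive between two tasks, followed by a collective broadcast across the whole group. Run with at least two tasks, e.g. mpirun -np 4.

! Sketch: point-to-point vs. collective communication with MPI.
program comm_scopes
  use mpi
  implicit none
  integer :: rank, ierr, status(MPI_STATUS_SIZE)
  real :: x
  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  x = 0.0
  ! Point-to-point: task 0 is the sender/producer, task 1 the receiver/consumer
  if (rank == 0) then
     x = 3.14159
     call MPI_Send(x, 1, MPI_REAL, 1, 0, MPI_COMM_WORLD, ierr)
  else if (rank == 1) then
     call MPI_Recv(x, 1, MPI_REAL, 0, 0, MPI_COMM_WORLD, status, ierr)
  end if
  ! Collective: task 0 broadcasts x to every task in the communicator
  call MPI_Bcast(x, 1, MPI_REAL, 0, MPI_COMM_WORLD, ierr)
  print *, 'task', rank, 'has x =', x
  call MPI_Finalize(ierr)
end program comm_scopes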
Factors to Consider (6)
Efficiency of communications
– Very often, the programmer will have a choice with regard to
factors that can affect communications performance. Only a few
are mentioned here.
– Which implementation for a given model should be used? Using
the Message Passing Model as an example, one MPI
implementation may be faster on a given hardware platform than
another.
– What type of communication operations should be used? As
mentioned previously, asynchronous communication operations
can improve overall program performance.
– Network media - some platforms may offer more than one
network for communications. Which one is best?
Factors to Consider (7)
Barrier
– Usually implies that all tasks are involved
– Each task performs its work until it reaches the barrier. It then stops, or "blocks".
– When the last task reaches the barrier, all tasks are synchronized.
– What happens from here varies. Often, a serial section of work must be done. In other cases,
the tasks are automatically released to continue their work.
Lock / semaphore
– Can involve any number of tasks
– Typically used to serialize (protect) access to global data or a section of code. Only one task at
a time may use (own) the lock / semaphore / flag.
– The first task to acquire the lock "sets" it. This task can then safely (serially) access the
protected data or code.
– Other tasks can attempt to acquire the lock but must wait until the task that owns the lock
releases it.
– Can be blocking or non-blocking
Synchronous communication operations
– Involves only those tasks executing a communication operation
– When a task performs a communication operation, some form of coordination is required with
the other task(s) participating in the communication. For example, before a task can perform a
send operation, it must first receive an acknowledgment from the receiving task that it is OK to
send.
– Discussed previously in the Communications section.
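A minimal sketch of the barrier and lock ideas above, assuming OpenMP (the critical directive stands in for a lock around a shared update, and the barrier is an explicit synchronization point):

! Sketch: lock-style protection of a shared variable and an explicit barrier.
program sync_constructs
  implicit none
  integer :: total, i
  total = 0
  !$omp parallel private(i)
  !$omp do
  do i = 1, 1000
     !$omp critical        ! only one thread at a time may update the shared sum
     total = total + 1
     !$omp end critical
  end do
  !$omp end do
  !$omp barrier            ! every thread waits here until all have arrived
  !$omp end parallel
  print *, 'total =', total
end program sync_constructs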
Automatic vs. Manual Parallelization
Understand the Problem and the Program
Partitioning
Communications
Synchronization
Data Dependencies
Load Balancing
Granularity
I/O
Limits and Costs of Parallel Programming
Performance Analysis and Tuning
Definitions
Example: a loop-carried data dependence
DO 500 J = MYSTART,MYEND
   A(J) = A(J-1) * 2.0
500 CONTINUE
The value of A(J-1) must be computed before the value of A(J), so A(J) exhibits a data
dependence on A(J-1); parallelizing this loop across tasks would require communicating
the boundary value between neighbouring tasks.
task 1 task 2
------ ------
X = 2 X = 4
. .
. .
Y = X**2 Y = X**3
If X is shared, the value of Y in each task depends on which assignment to X happens
last; this task-level data dependence requires synchronization.
If you have access to a parallel file system, investigate using it. If you
don't, keep reading...
Rule #1: Reduce overall I/O as much as possible
Confine I/O to specific serial portions of the job, and then use parallel
communications to distribute data to parallel tasks. For example, Task
1 could read an input file and then communicate required data to
other tasks. Likewise, Task 1 could perform the write operation after
receiving the required data from all other tasks.
For distributed memory systems with shared filespace, perform I/O in
local, non-shared filespace. For example, each processor may have
/tmp filespace which can be used. This is usually much more efficient
than performing I/O over the network to one's home directory.
Create unique filenames for each task's input/output file(s)
Automatic vs. Manual Parallelization
Understand the Problem and the Program
Partitioning
Communications
Synchronization
Data Dependencies
Load Balancing
Granularity
I/O
Limits and Costs of Parallel Programming
Performance Analysis and Tuning
Amdahl's Law
Speedup from the parallelizable fraction P of a program run on N processors:
speedup = 1 / ((1 - P) + P/N)
    N      P = .50    P = .90    P = .99
 -----    --------   --------   --------
    10       1.82       5.26       9.17
   100       1.98       9.17      50.25
  1000       1.99       9.91      90.99
 10000       1.99       9.91      99.02
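The table can be reproduced directly from the formula; a minimal sketch (not from the slides):

! Sketch: compute the Amdahl's law table, speedup = 1 / ((1 - P) + P/N).
program amdahl_table
  implicit none
  integer, parameter :: nvals(4) = (/ 10, 100, 1000, 10000 /)
  real,    parameter :: pvals(3) = (/ 0.50, 0.90, 0.99 /)
  integer :: i, j
  do i = 1, 4
     write (*, '(I6, 3F10.2)') nvals(i), &
          (1.0 / ((1.0 - pvals(j)) + pvals(j) / real(nvals(i))), j = 1, 3)
  end do
end program amdahl_table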
Amdahl's Law
Array Processing
PI Calculation
Simple Heat Equation
1-D Wave Equation
Array Processing
Array elements are distributed so that each processor owns a portion of an array
(subarray).
Independent calculation of array elements ensures there is no need for communication
between tasks.
Distribution scheme is chosen by other criteria, e.g. unit stride (stride of 1) through the
subarrays. Unit stride maximizes cache/memory usage.
Since it is desirable to have unit stride through the subarrays, the choice of a
distribution scheme depends on the programming language.
After the array is distributed, each task executes the portion of the loop corresponding
to the data it owns. For example, with Fortran block distribution:
do j = mystart, myend
do i = 1,n
a(i,j) = fcn(i,j)
end do
end do
Notice that only the outer loop variables are different from the serial solution.
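For illustration, a minimal sketch of how mystart and myend might be computed for a block distribution (an assumption matching the loop above; it takes the number of columns to divide evenly among the tasks, and in an MPI code taskid and ntasks would come from MPI_Comm_rank and MPI_Comm_size):

! Sketch: compute this task's block of columns for the loop above.
program block_distribution
  implicit none
  integer :: n, ntasks, taskid, chunk, mystart, myend
  n = 1000                      ! total number of columns
  ntasks = 4                    ! number of tasks
  taskid = 2                    ! this task's id, 0 .. ntasks-1
  chunk   = n / ntasks          ! assumes n is evenly divisible by ntasks
  mystart = taskid * chunk + 1  ! first column owned by this task
  myend   = mystart + chunk - 1 ! last column owned by this task
  print *, 'task', taskid, 'owns columns', mystart, 'to', myend
end program block_distribution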
Array Processing Solution 1
PI Calculation
One possible implementation (serial pseudocode):
npoints = 10000
circle_count = 0
do j = 1,npoints
  generate 2 random numbers between 0 and 1
  xcoordinate = random1
  ycoordinate = random2
  if (xcoordinate, ycoordinate) inside circle then
    circle_count = circle_count + 1
end do
PI = 4.0*circle_count/npoints
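A runnable serial version of the pseudocode above (a sketch; in the parallel solution each task would evaluate its own share of npoints and the circle counts would then be combined):

! Sketch: serial Monte Carlo estimate of PI, matching the pseudocode above.
program pi_monte_carlo
  implicit none
  integer, parameter :: npoints = 10000
  integer :: j, circle_count
  real :: x, y, pi_est
  circle_count = 0
  do j = 1, npoints
     call random_number(x)          ! x coordinate in [0,1)
     call random_number(y)          ! y coordinate in [0,1)
     if (x*x + y*y <= 1.0) circle_count = circle_count + 1
  end do
  pi_est = 4.0 * real(circle_count) / real(npoints)
  print *, 'PI estimate = ', pi_est
end program pi_monte_carlo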