Why Multiprocessors? Motivation and Opportunity

Multiprocessors provide opportunities to go beyond the performance of a single processor by exploiting parallelism without requiring specialized hardware. They take advantage of existing software and can handle both parallel programs and multi-programmed workloads without excessive complexity. The key models are SIMD, MIMD with centralized shared memory, and MIMD with physically distributed memory using either distributed shared memory or message passing approaches. Effective parallel applications exhibit high computation to communication ratios.

Why Multiprocessors?

Motivation:
- Go beyond the performance offered by a single processor
- Without requiring specialized processors
- Without the complexity of too much multiple issue

Opportunity:
- Software available
  - Parallel programs
  - Multi-programmed machines
Multiprocessors: The SIMD Model

- SISD: Single Instruction stream, Single Data stream
  - Uniprocessor
  - This is the view at the ISA level
  - Tomasulo uncovers data-stream parallelism

- SIMD: Single Instruction stream, Multiple Data streams
  - ISA makes data parallelism explicit
  - Special SIMD instructions
  - Same instruction goes to multiple functional units, but acts on different data
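For illustration only (not from the slides), here is a minimal sketch of the SIMD idea using x86 SSE intrinsics. The intrinsics (_mm_loadu_ps, _mm_add_ps, _mm_storeu_ps) are standard SSE; the function and array names are invented for this example, and the SIMD loop assumes n is a multiple of 4.

#include <xmmintrin.h>   /* SSE intrinsics */

/* SISD-style loop: one instruction operates on one data element at a time. */
void add_scalar(float *c, const float *a, const float *b, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* SIMD-style loop: one instruction operates on four data elements at once.
   Assumes n is a multiple of 4. */
void add_simd(float *c, const float *a, const float *b, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));
    }
}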
SIMD Drawbacks
- SIMD useful for loop-level parallelism
- Model is too inflexible to accommodate parallel programs as well as multi-programmed environments
- Cannot take advantage of uniprocessor performance growth
- SIMD architecture usually used in special-purpose designs
  - Signal or image processing
Multiprocessors: The MIMD Model

- MIMD: Multiple Instruction streams, Multiple Data streams
  - Each processor fetches its own instruction and data

- Advantages:
  - Flexibility: parallel programs, or multi-programmed OS, or both
  - Built using off-the-shelf uniprocessors
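A minimal sketch of the MIMD idea (not from the slides), using POSIX threads: two threads execute different instruction streams on different data. The thread functions and variable names are invented for this illustration.

#include <pthread.h>
#include <stdio.h>

/* First instruction stream: sums an integer array. */
static void *sum_worker(void *arg) {
    int *v = (int *)arg;
    long s = 0;
    for (int i = 0; i < 4; i++) s += v[i];
    printf("sum = %ld\n", s);
    return NULL;
}

/* Second instruction stream: scales a double array in place. */
static void *scale_worker(void *arg) {
    double *v = (double *)arg;
    for (int i = 0; i < 4; i++) v[i] *= 2.0;
    printf("scaled v[0] = %.1f\n", v[0]);
    return NULL;
}

int main(void) {
    int a[4] = {1, 2, 3, 4};
    double b[4] = {0.5, 1.5, 2.5, 3.5};
    pthread_t t1, t2;
    /* Each thread runs its own code on its own data: MIMD in miniature. */
    pthread_create(&t1, NULL, sum_worker, a);
    pthread_create(&t2, NULL, scale_worker, b);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}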
MIMD: The Centralized Shared-Memory Model

[Figure: processors (P), each with a cache ($), on a single bus connecting them to main memory and I/O]

- Single bus connects a shared memory to all processors
- Also called Uniform Memory Access (UMA) machine
- Disadvantage: cannot scale very well, especially with fast processors (more memory bandwidth required)
MIMD: Physically Distributed Memory

[Figure: nodes, each a processor with cache (P+$) plus local memory (M) and I/O, connected by an interconnection network]

- Independent memory for each processor
- High-bandwidth interconnection
- Advantage: cost-effective memory bandwidth scaling
- Advantage: lower latency for local access
- Disadvantage: communication of data between nodes
Communication Models with Physically Distributed Memory

- Distributed Shared Memory (DSM)
  - Memory address space is the same across nodes
  - Also called scalable shared memory
  - Also called NUMA: non-uniform memory access
  - Communication is implicit, via load/store

- Multicomputer, or Message Passing Machine
  - Separate private address spaces for each node
  - Communication is explicit, through messages
  - Synchronous or asynchronous
  - Standard Message Passing Interface (MPI) possible
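As a hedged illustration of the contrast (not part of the slides): in a message-passing machine the exchange below must be written explicitly, whereas on a DSM machine the same exchange would just be a store by one processor and a load by another. The MPI calls shown (MPI_Init, MPI_Comm_rank, MPI_Send, MPI_Recv, MPI_Finalize) are standard; the value being exchanged is arbitrary.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* Explicit communication: rank 0 sends one int to rank 1. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Rank 1 must explicitly receive; there is no shared address space. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}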
Multiprocessing: Classification

Multiprocessing
  SIMD
  MIMD
    Centralized shared memory
    Physically distributed memory
      Distributed shared memory (DSM)
      Message passing machines
DSM vs. Message Passing

Shared Memory:
- Well understood mechanisms for programming
- Program independent of communication pattern
- Low overhead for communicating small items
- Hardware-controlled caching

Message Passing:
- Hardware simplicity
- Communication is explicit – forces programmer to pay attention to what is expensive
Achieving the Desired Communication Model

- Message Passing on top of Shared Memory
  - Considerably easier
  - Difficulty arises in dealing with arbitrary message lengths
- Shared Memory on top of Message Passing
  - Harder, since every load/store has to be emulated
  - Every memory reference may involve the OS
  - One promising direction: use virtual memory to share objects at page level (shared virtual memory)
Challenges in Parallel Processing

- Limited parallelism available in programs
  - 90% parallelizable ==> maximum speedup possible?
  - Exception: super-linear speedup
    - Increased memory/cache available
    - Usually not very great, however
- Large latency of communication
  - 50-10,000 clock cycles
  - 0.5% of instructions access remote memory ==> what is the increase in CPI?
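A hedged worked answer to the two questions above (the 90% and 0.5% figures come from the slide; the remaining numbers are assumed for illustration): by Amdahl's Law, with 90% of the work parallelizable the speedup on p processors is 1 / (0.1 + 0.9/p), so even with unlimited processors the maximum speedup is 1 / 0.1 = 10. For the communication question, if the base CPI is 0.5, a remote access stalls the processor for 400 cycles, and 0.5% of instructions make a remote access, the effective CPI becomes 0.5 + 0.005 × 400 = 2.5, i.e., the machine runs five times slower than if all accesses were local.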
Addressing the Challenges

- Limited parallelism
  - Tackled mainly by redesigning the algorithm or software
- Avoiding large latency
  - Hardware mechanism: caching
  - Software mechanism: restructure to make more accesses local
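As a small single-node analogue of such restructuring (invented for these notes, not from the slides): traversing a row-major array column-by-column touches a new cache line on almost every access, while row-by-row traversal reuses each line; the same placement idea applies to keeping accesses in local rather than remote memory.

#define N 1024
static double a[N][N];

/* Poor locality: strides through memory, defeating the cache. */
double sum_column_major(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

/* Restructured: consecutive accesses hit the same cache line. */
double sum_row_major(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}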
Some Example Applications

- Two classes
  - Parallel programs or program kernels
  - Multi-programmed OS
- Spatial and temporal data access patterns are important
- Computation-to-communication ratio is important
Parallel Application Kernels

- The FFT kernel
  - Used in spectral methods
  - Data represented as an array
  - Computation involves:
    - 1D FFT on each row
    - Transpose
    - 1D FFT on each row again
  - Each processor gets a few rows of data
  - Main communication step is the transpose (all-to-all communication)
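A rough sketch (not from the slides) of the transpose's communication pattern using MPI's all-to-all collective. The row-wise FFTs and the packing/unpacking of blocks around the exchange are omitted; the problem size n, the placeholder data, and the assumption that n is divisible by the number of ranks are all invented for this example.

#include <mpi.h>
#include <stdlib.h>

/* Each of p ranks owns n/p rows of an n x n array; during the transpose,
   every rank exchanges one block of its rows with every other rank. */
int main(int argc, char **argv) {
    int rank, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int n = 1 << 10;                 /* assumed problem size, divisible by p */
    int rows = n / p;                /* rows of the n x n array owned here   */
    int block = rows * (n / p);      /* elements destined for each rank      */

    double *send = malloc((size_t)rows * n * sizeof *send);
    double *recv = malloc((size_t)rows * n * sizeof *recv);
    for (int i = 0; i < rows * n; i++) send[i] = rank;   /* placeholder data */

    /* Block j of 'send' goes to rank j; block i of 'recv' arrives from rank i. */
    MPI_Alltoall(send, block, MPI_DOUBLE, recv, block, MPI_DOUBLE, MPI_COMM_WORLD);

    free(send);
    free(recv);
    MPI_Finalize();
    return 0;
}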
Parallel Application Kernels (continued)

- The LU kernel
  - LU factorization of a matrix
  - Blocking is used
  - Computation (dense matrix multiply) is performed by the processor that owns the destination block
  - Communication happens at regular intervals
Parallel Applications

- Barnes application
  - N-body problem
  - Octree representation
  - Each processor is allocated a subtree
  - Tree expansion as required (communication in this process)
Parallel Applications (continued)

- Ocean application
  - Influence of eddy and boundary currents on ocean flows
  - Involves solving PDEs
  - Ocean divided into a hierarchy of grids (finer grid for more accuracy)
  - Each processor gets a set of grids
  - Communication to exchange boundary conditions at each step of the process
Computation to Communication Ratios

Application   Computation scaling   Communication scaling   Scaling of computation to communication
FFT           (n log n)/p           n/p                     log n
LU            n/p                   sqrt(n/p)               sqrt(n/p)
Barnes        (n log n)/p           (log n)*sqrt(n/p)       sqrt(n/p)
Ocean         n/p                   sqrt(n/p)               sqrt(n/p)
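Reading the FFT row as an example: the ratio is (n log n / p) divided by (n/p), which is log n. In all four rows the computation-to-communication ratio improves as the problem size n grows, while increasing the processor count p alone either leaves it unchanged (FFT) or reduces it (the sqrt(n/p) rows), which suggests why problem size is typically scaled up along with the processor count.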
Multiprogrammed OS workload

- Workload used here:
  - Two independent copies of the compilation of the Andrew benchmark
  - Three steps:
    - Compilation: compute intensive
    - Installing object files in a library: I/O intensive
    - Removing the object files: I/O intensive