
Parallel and Distributed Computing

Module 1: Parallelism Fundamentals


Outline
• Motivation
• Key Concepts
• Challenges
• Parallel computing
• Flynn's Taxonomy
• Multi-core Processors
• Shared vs. Distributed Memory
Motivation
• To address computationally intensive problems and reduce their response time
• Weather forecasting
• Genome sequencing
• Crash simulation testing



Moore’s Law
• The transistor count on an integrated circuit doubles every 24 months, independent of the technology used
– Gordon Moore, co-founder of Intel Corp. (1965)



Moore’s Law
• Pipelined functional units: instruction-level parallelism
• Superscalar architecture
• Out-of-order execution
• Larger Caches



Moore’s Law
• More complexity does not automatically yield more efficiency
• The more functional units are packed into a CPU, the higher the probability that "average" code cannot use them
– The number of independent instructions in a sequential instruction stream is limited
• A faster clock boosts power dissipation, making idling transistors even more wasteful



Power-Performance Dilemma
• Simplify the processor design
• Spend the additional transistors on larger caches
• Hence, multi-core processors came into existence



Parallelization
• Misconception: more hardware ⇒ faster response time
• Billions of CPU hours are wasted as a result
• Many supercomputer users are unaware of the limitations of parallel execution



Classes of Parallelism
• Data-Level Parallelism (DLP): many data items can be operated on in parallel
• Task-Level Parallelism (TLP): large, independent tasks can run in parallel at the same time
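As a concrete illustration (a minimal sketch, not from the slides; it assumes OpenMP, and compress_file/index_file are hypothetical stand-ins for independent tasks), DLP applies one operation across many array elements, while TLP runs distinct tasks side by side:

#include <stdio.h>

#define N 1000000
static double a[N], b[N], c[N];

/* Hypothetical stand-ins for two independent, coarse-grained tasks. */
static void compress_file(void) { /* ... */ }
static void index_file(void)    { /* ... */ }

int main(void)
{
    /* Data-level parallelism: the same operation applied
       to many array elements at once. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    /* Task-level parallelism: two independent tasks
       running at the same time. */
    #pragma omp parallel sections
    {
        #pragma omp section
        compress_file();
        #pragma omp section
        index_file();
    }
    printf("done\n");
    return 0;
}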



Ways to exploit parallelism
• Instruction-Level Parallelism (ILP)
– Exploits data-level parallelism at modest levels
– e.g., pipelining
• Vector Architectures and Graphics Processing Units (GPUs)
– Exploit data-level parallelism
– Apply a single instruction to a collection of data in parallel
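To make the vector/SIMD idea concrete, here is a minimal sketch assuming an x86 CPU with SSE (the slides do not tie the idea to any particular instruction set, and GPUs organize this differently): a single intrinsic adds four floats with one instruction.

#include <stdio.h>
#include <xmmintrin.h>  /* SSE intrinsics (x86) */

int main(void)
{
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    float c[8];

    /* One SIMD instruction operates on four floats at a time:
       a single instruction stream, multiple data elements. */
    for (int i = 0; i < 8; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));
    }
    for (int i = 0; i < 8; i++)
        printf("%.0f ", c[i]);
    printf("\n");
    return 0;
}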



Ways to exploit parallelism
• Thread-Level Parallelism
– Exploits either data-level or task-level parallelism
– Allows for interaction among parallel threads
• Request-Level Parallelism
– Exploits parallelism among largely decoupled tasks
– Specified by the programmer or the operating system
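A minimal sketch of thread-level parallelism (assuming OpenMP, which the slides do not prescribe): the threads partition the data, but they interact through the shared variable sum, and the reduction clause coordinates that interaction safely.

#include <stdio.h>

#define N 1000000

int main(void)
{
    static double x[N];
    for (int i = 0; i < N; i++)
        x[i] = 1.0 / (i + 1);

    double sum = 0.0;
    /* Threads work on different parts of the array but interact
       through the shared variable 'sum'. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += x[i];

    printf("sum = %f\n", sum);
    return 0;
}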



Flynn’s Taxonomy
• Michael Flynn (1966) proposed a simple classification of how hardware supports data-level and task-level parallelism; its abbreviations are still in use today, some 50 years later
• The four classes: SISD, SIMD, MISD, MIMD

[Figure: 2×2 quadrant showing the four classes SISD, SIMD, MISD, MIMD]
Single instruction stream, single data stream (SISD)
• This category is the uniprocessor
• No exploitation of parallelism across instruction or data streams
• Can still have concurrent processing internally
• e.g., pipelined and superscalar processors

[Figure: one instruction stream and one data stream feeding a single processing unit]
Single instruction stream, multiple data streams (SIMD)

[Figure: one instruction stream driving several processing units, each operating on its own data stream (Data Streams 1–3)]


Multiple instruction streams, single data stream (MISD)

[Figure: several processing units, each fed by its own instruction stream (Instruction Streams 1–3), all operating on a single data stream]


Multiple instruction streams, multiple data streams (MIMD)

[Figure: several processing units, each with its own instruction stream (Instruction Streams 1–3) and its own data stream (Data Streams 1–3)]


Parallel Scalability
• Factors that limit parallel execution
• Scalability metrics
• Scalability laws:
– Amdahl's Law
– Gustafson's Law
• Parallel efficiency



Factors that limit parallelism
[Figure: top — a sequence of 12 tasks that needs to be parallelized; bottom — the same tasks distributed across 3 workers (W1, W2, W3), reducing the total time]


• In practice this ideal time reduction is not achieved; one factor is load imbalance
• Some of the resources are underutilized
• Communication between co-workers adds cost
• Tools (shared resources) must be shared among the co-workers
• These factors add overhead that serializes part of the parallel execution


Load imbalance
• Because of load imbalance, the workers execute their tasks at different speeds, so some finish earlier than others
• The gaps in the timeline are idle (unused) regions

[Figure: workers W1–W3 finishing tasks 1–12 at different times, leaving idle gaps]
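One common remedy, shown here as a sketch (dynamic scheduling is a standard OpenMP technique, not something the slide names): hand tasks to whichever worker becomes free, instead of pre-assigning equal chunks.

#include <stdio.h>

/* Tasks with very different costs: task i does about i*i units of work. */
static double do_task(int i)
{
    double s = 0.0;
    for (long k = 0; k < (long)i * i; k++)
        s += 1.0 / (k + 1);
    return s;
}

int main(void)
{
    double total = 0.0;
    /* schedule(dynamic) hands out iterations on demand, so a worker
       that finishes a cheap task immediately grabs the next one
       instead of sitting idle. */
    #pragma omp parallel for schedule(dynamic) reduction(+:total)
    for (int i = 0; i < 1000; i++)
        total += do_task(i);
    printf("total = %f\n", total);
    return 0;
}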


Scalability Metrics
• Scalability is limited by:
– Algorithmic limitations due to mutual dependencies
– Bottlenecks caused by shared resources
– Startup overhead
– Communication


Run-time
• For a fixed problem solved by N workers, normalize the total work so that s + p = 1
• Single-worker runtime: T(1) = s + p
• N-worker runtime: T(N) = s + p/N
• Here s is the fraction of work that must be serialized and p = 1 - s is the fraction that can be parallelized
Scalability Laws
• Application speedup is defined as the quotient of parallel and serial performance for a fixed problem size
• Performance is defined as "work done over time"
• Serial performance: P(1) = (s + p)/T(1) = 1
• Parallel performance: P(N) = (s + p)/T(N) = 1/(s + (1 - s)/N)
Amdahl’s Law
• Gene Amdahl: application speedup is limited to 1/s as N tends to infinity
• "What is the improvement in runtime of an application when the problem is put on N CPUs?"
• Serial performance: P(1) = 1
• Parallel performance: P(N) = 1/(s + (1 - s)/N)
• Speedup: S(N) = P(N)/P(1) = 1/(s + (1 - s)/N); as N → ∞, S(N) → 1/s
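A small, illustrative C program (not part of the lecture) that evaluates Amdahl's formula and shows the speedup saturating at 1/s:

#include <stdio.h>

/* Amdahl's law: speedup for serial fraction s on n workers. */
static double amdahl(double s, int n)
{
    return 1.0 / (s + (1.0 - s) / n);
}

int main(void)
{
    double s = 0.05;               /* 5% of the work is serial */
    for (int n = 1; n <= 1024; n *= 4)
        printf("N = %4d  speedup = %6.2f\n", n, amdahl(s, n));
    printf("limit 1/s = %.1f\n", 1.0 / s);
    return 0;
}

With s = 0.05, the speedup never exceeds 20 no matter how many workers are added.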
Problems
• Bob is given the job to write a program that
will get a speedup of 3.8 on 4 processors.
• He makes it 95% parallel, and goes home
dreaming of a big pay raise.
• Using Amdahl’s law, and assuming the
problem size is the same as the serial
version, and ignoring communication costs,
what speedup will Bob actually get?
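Worked check: with s = 0.05 and N = 4, Amdahl's law gives S = 1/(0.05 + 0.95/4) = 1/0.2875 ≈ 3.48, so Bob falls short of the promised 3.8.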
Gustafson’s Law
• "How much more work can my program do in a given amount of time when I put a larger problem on N CPUs?"
• Scaled speedup: S(N) = s + (1 - s)·N
• Where 0 < s < 1 is the serial fraction and N is the number of workers


Parallel Efficiency
• Parallel efficiency is the speedup per worker: ε = S(N)/N
• ε = 1 means perfect scaling; serial work and overheads push it below 1
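A companion sketch (illustrative) that evaluates Gustafson's scaled speedup and the corresponding parallel efficiency ε = S(N)/N; note how efficiency approaches 1 - s, rather than collapsing toward zero as under Amdahl's law:

#include <stdio.h>

/* Gustafson's law: scaled speedup S = s + (1 - s) * n. */
static double gustafson(double s, int n)
{
    return s + (1.0 - s) * n;
}

int main(void)
{
    double s = 0.05;
    for (int n = 1; n <= 64; n *= 2) {
        double sp  = gustafson(s, n);
        double eff = sp / n;       /* parallel efficiency = S / N */
        printf("N = %2d  scaled speedup = %6.2f  efficiency = %.2f\n",
               n, sp, eff);
    }
    return 0;
}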


Multi-core Processors
• Cores on a die can have separate caches, or share caches at certain levels
• A shared cache reduces latency and improves bandwidth for communication between cores
• But it may lead to cache bandwidth bottlenecks
• Recent multi-core designs integrate a memory controller that connects directly to the memory modules, without a chipset
• This reduces main memory latency and allows inter-socket networks such as HyperTransport or QuickPath
• It also enables efficient cache coherence communication


Dual-core Processor
• Separate L1, L2, and L3 caches
• Example: Intel Montecito

[Figure: two cores (P), each with its own private L1D, L2, and L3 cache]


Quad-core Processor
• Two dual-core chips with separate L1 caches and an L2 cache shared within each dual-core pair (Intel Harpertown)

[Figure: four cores (P), each with a private L1D; each pair of cores shares one L2 cache]


Hexa-core Processor
• L2 caches are shared within dual-core groups; a single L3 cache is shared by the whole chip
• Example: Intel Dunnington

[Figure: six cores (P) with private L1D caches; three L2 caches, each shared by a dual-core group; one chip-wide L3]
Quad-core Processor
• Separate L1 and L2 caches per core
• Shared L3 cache; the L3 group is the whole chip
• A built-in memory interface allows memory and other sockets to be attached without a chipset (via HT/QPI links)
• Examples: AMD Shanghai and Intel Nehalem

[Figure: four cores (P) with private L1D and L2 caches; a chip-wide shared L3; an on-chip memory interface plus HT/QPI links]
Shared-memory
• A system where a number of CPUs work on a common, shared physical address space
• Two varieties:
– Uniform Memory Access (UMA)
– Cache-coherent Non-Uniform Memory Access (ccNUMA)



UMA
• Also known as Symmetric Multiprocessing (SMP)
• Latency and bandwidth are the same for all processors and all memory locations
• The simplest implementation is a dual-core chip



UMA System with two Single-core Chips

[Figure: two single-core sockets (P), each with private L1D and L2 caches, sharing a common front-side bus (FSB) to the chipset and memory]


UMA System with two Dual-core Chips

[Figure: two dual-core sockets; each core has a private L1D, each socket's cores share an L2; both sockets connect through the chipset to a common memory]


Cache Coherence
• A cache coherence mechanism is required
• Because copies of the same cache line may reside in several caches
• If one copy is modified, the contents of the other caches must be updated or invalidated
MESI Protocol
• M: modified
• E: exclusive
• S: shared
• I: invalid

[Figure: two processors P1 and P2 with caches C1 and C2; memory locations A1 and A2 share one cache line, with copies in both caches and in memory]


• 1. C1 requests exclusive ownership of the cache line (CL)
• 2. The CL in C2 is set to state I
• 3. The CL has state E in C1 → A1 is modified in C1 and the CL is set to state M
• 4. C2 requests exclusive CL ownership
• 5. The CL is evicted from C1 and set to state I
• 6. The CL is loaded into C2 and set to state E
• 7. A2 is modified in C2 and the CL is set to state M in C2

[Figure: the same two-cache diagram, annotated with these steps]

Courtesy: Hager, G. (2017). Introduction to High Performance Computing for Scientists and Engineers. CRC Press.
NUMA System

[Figure: two locality domains, each with four cores (private L1D and L2 caches), a shared L3, and a local memory interface with its own memory; the domains are connected by a coherent link]


Distributed Memory

[Figure: several nodes, each with a processor (P), cache (C), local memory (M), and network interface (NI), connected by a communication network]
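Distributed-memory systems are programmed with explicit message passing; the sketch below assumes MPI, the de facto standard interface (not introduced in these slides). Each rank owns private memory, and data moves only through the network.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each process owns its private memory; data moves only via
       explicit messages over the communication network. */
    if (rank == 0) {
        double x = 42.0;
        for (int p = 1; p < size; p++)
            MPI_Send(&x, 1, MPI_DOUBLE, p, 0, MPI_COMM_WORLD);
    } else {
        double x;
        MPI_Recv(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank %d received %.1f\n", rank, x);
    }
    MPI_Finalize();
    return 0;
}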


Reference
• Hager, G. (2017). Introduction to High Performance Computing for Scientists and Engineers. CRC Press.
• Hennessy, J. L., & Patterson, D. A. Computer Architecture: A Quantitative Approach (5th ed.). Morgan Kaufmann.

