
CS 3006

Parallel and Distributed Computing


Lecture 8
Danyal Farhat
FAST School of Computing
NUCES Lahore
Miscellaneous Topics
Outline
• Motivating Parallelism
The Memory / Disk Speed Argument
The Data Communication Argument
• Types of Parallelism
Data Parallelism
Functional Parallelism
Pipelining
• Multi-Processor vs. Multi-Computer
• Cluster vs. Network of Workstations
Outline (Cont.)
• Cache Coherence
• Snooping
• Branch Prediction
• Architecture of an Ideal Parallel Computer - PRAM
• PRAM Classes
• PRAM Arbitration Protocols
• Physical Complexity of an Ideal Parallel Computer
• Communication Costs in Parallel Machines
Outline (Cont.)
• Message Passing Costs in Parallel Computers
Startup time, Per-hop time, Per-word transfer time

Store-and-Forward Routing, Packet Routing, Cut-Through Routing

Simplified Cost Model for Communicating Messages

• Summary
• Additional Resources
Motivating Parallelism
The Memory / Disk Speed Argument

• While clock rates of high-end processors have increased at roughly 40% per
year over the past decade, DRAM access times have improved at only roughly
10% per year over the same interval
• This mismatch in speeds causes significant performance bottlenecks
• Parallel platforms provide increased bandwidth to the memory system
Motivating Parallelism (Cont.)
The Memory / Disk Speed Argument

• Parallel platforms also provide higher aggregate cache capacity
• Some of the fastest growing applications of parallel computing exploit
not their raw computational speed but their ability to pump data to and
from memory and disk faster
The CPU is fast but memory is not; this mismatch matters in the context of
parallel programming
Solution: parallel platforms (increased bandwidth to memory) combined with
the aggregated caches of many processors
Motivating Parallelism (Cont.)
The Data Communication Argument

• As the network has evolved, the vision of the Internet as one large
computing platform has emerged
• In many applications, such as databases and data mining, the volume of
data is such that it cannot be moved
• Any analysis of this data must therefore be performed over the network
using parallel techniques
Types of Parallelism
Data Parallelism
• When there are independent tasks applying the same
operation to different elements of a data set
• Example code
for i = 0 to 99 do
    a[i] = b[i] + c[i]
endfor
• Here the same operation (addition) is applied to the first 100 elements of
the ‘b’ and ‘c’ arrays
• All 100 iterations of the loop could be executed simultaneously (a C sketch
follows below)
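As a concrete, hedged illustration, here is a minimal C sketch of the same
loop using an OpenMP parallel for; the array names and size follow the
pseudocode above, and OpenMP is only one of several ways to exploit this
data parallelism:

#include <stdio.h>
#define N 100

int main(void) {
    double a[N], b[N], c[N];
    for (int i = 0; i < N; i++) {      /* initialize the input arrays */
        b[i] = i;
        c[i] = 2 * i;
    }
    /* Each iteration applies the same operation (addition) to different
       elements, so the iterations are independent and may run in parallel. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = b[i] + c[i];
    printf("a[99] = %.1f\n", a[99]);
    return 0;
}

Compiled with OpenMP support (e.g., gcc -fopenmp), the iterations are
distributed across threads; without it, the pragma is ignored and the loop
simply runs serially.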
Types of Parallelism (Cont.)
Functional Parallelism
• When there are independent tasks applying different
operations to different data elements
• Example code
1. a = 2
2. b = 3
3. m = (a + b) / 2
4. s = (a*a + b*b) / 2
5. v = s - m^2
• Here the third and fourth statements could be executed concurrently (see
the sketch below)
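A hedged C sketch of the same five statements using OpenMP sections, one
possible way to express functional parallelism; the variable names follow
the numbered statements above:

#include <stdio.h>

int main(void) {
    double a = 2.0, b = 3.0;
    double m, s, v;

    /* Functional parallelism: the two sections perform different
       operations on the data and are independent of each other,
       so they may execute concurrently. */
    #pragma omp parallel sections
    {
        #pragma omp section
        m = (a + b) / 2.0;          /* statement 3 */

        #pragma omp section
        s = (a * a + b * b) / 2.0;  /* statement 4 */
    }

    /* Statement 5 depends on both results, so it runs afterwards. */
    v = s - m * m;

    printf("m = %f, s = %f, v = %f\n", m, s, v);
    return 0;
}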
Types of Parallelism (Cont.)
Pipelining
• Usually used for problems where a single instance of the problem cannot be
parallelized
• The output of one stage is the input of the next stage
• The whole computation of each instance is divided into multiple stages,
provided that there are multiple instances of the problem
Types of Parallelism (Cont.)
Pipelining (Cont.)
• An effective method of attaining parallelism on uniprocessor architectures
• Depends on the pipelining capabilities of the processor
The Pentium II uses a 12-stage pipeline
The Pentium 4 uses a 20-stage pipeline at 2.0 GHz (a small simulation sketch
follows below)
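A minimal C sketch (an illustration of the idea, not of real processor
hardware) that simulates several independent problem instances flowing
through a hypothetical three-stage pipeline; in each simulated cycle,
different instances occupy different stages, which is where the parallelism
comes from:

#include <stdio.h>

#define STAGES    3   /* hypothetical pipeline depth      */
#define INSTANCES 5   /* independent instances to process */

int main(void) {
    /* Instance k enters stage 0 at cycle k and leaves after passing
       through all STAGES stages.  Total cycles: (STAGES - 1) + INSTANCES,
       instead of STAGES * INSTANCES if the instances were run serially. */
    int total_cycles = STAGES - 1 + INSTANCES;

    for (int cycle = 0; cycle < total_cycles; cycle++) {
        printf("cycle %d:", cycle);
        for (int stage = 0; stage < STAGES; stage++) {
            int instance = cycle - stage;   /* which instance is here now */
            if (instance >= 0 && instance < INSTANCES)
                printf("  stage %d -> instance %d", stage, instance);
        }
        printf("\n");
    }
    return 0;
}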


Types of Parallelism (Cont.)
• Pipelining Example: Assembly line analogy
Multi-Processor vs. Multi-Computer
Multi-Processor
• Multiple CPUs with a shared memory
• Same address on two different CPUs refers to the same
memory location
• Generally two categories
Centralized Multi-processor
Distributed Multi-processor
Multi-Processor vs. Multi-Computer
(Cont.)
Centralized Multi-Processor
• Additional CPUs are attached to
the system bus, and all the
processors share the same primary
memory
• All the memory is at one place and
has the same access time from
every processor
• Also known as a UMA (Uniform
Memory Access) multiprocessor or
SMP (Symmetric Multiprocessor)

Multi-Processor vs. Multi-Computer
(Cont.)
Distributed Multi-processor
• Distributed collection of memories
forms one logical address space
• Again, the same address on different
processors refers to the same
memory location
• Also known as Non-Uniform Memory
Access (NUMA) architecture
• This is because memory access time
varies significantly, depending on
the physical location of the
referenced address

Multi-Processor vs. Multi-Computer (Cont.)
Multi-Computer
• Distributed memory, multi-CPU computer
• Unlike NUMA architecture, a multicomputer has disjoint
local address spaces
• Each processor has direct access to its local memory only
• The same address on different processors refers to two different
physical memory locations
• Processors interact with each other by passing messages (a minimal MPI
sketch follows below)
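A minimal, hedged MPI sketch of this interaction (MPI is one widely used
message-passing library; the value and tag below are arbitrary): rank 1
cannot read rank 0's memory, so the value must be sent as an explicit
message.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                       /* lives only in rank 0's memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Rank 1 cannot read rank 0's memory directly; it must receive
           an explicit message to obtain the value. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}

Run with at least two processes, e.g., mpirun -np 2 ./a.out.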
Multi-Processor vs. Multi-Computer
(Cont.)
Asymmetric Multi-Computers
• A front-end computer that interacts
with users and I/O devices
• The back-end processors are
dedicated to “number crunching”
• Front-end computer executes a full,
multi-programmed OS and provides
all functions needed for program
development
• The backends are reserved for
executing parallel programs
Multi-Processor vs. Multi-Computer
(Cont.)
Symmetric Multi-Computers
• Every computer executes the same OS
• Users may log into any of the
computers
• This enables multiple users to
concurrently login, edit and compile
their programs
• All the nodes can participate in
execution of a parallel program

Cluster vs. Network of Workstations
Cluster
• Usually a co-located collection of low-cost computers and switches,
dedicated to running parallel jobs
• All computers run the same version of the operating system
• Some of the computers may not have interfaces for users to log in
• A commodity cluster uses high-speed networks for communication, such as
fast Ethernet at 100 Mbps

Network of Workstations
• A dispersed collection of computers
• Individual workstations may have different operating systems and
executable programs
• Users have the power to log in to and power off their workstations
• Network speed is usually slower, typically in the range of tens of Mbps
Cache Coherence
• When multiple CPUs each have their own cache, every CPU’s cache needs to
be kept consistent (continuously updated to reflect current values); this
property is known as cache coherence
Snooping
• Snoopy protocols achieve data consistency between the
cache memory and the shared memory through a bus-
based memory system
• Write-invalidate and write-update policies are used for
maintaining cache consistency
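A toy C sketch of the write-invalidate policy (a simulation of the idea
only, not a real coherence protocol implementation): when one cache writes
a shared line, the write is broadcast on the bus and every other cache that
snoops it invalidates its own copy.

#include <stdio.h>

#define NCACHES 4

typedef enum { INVALID, VALID } state_t;

/* state[c] is cache c's copy of one shared memory line (toy model). */
static state_t state[NCACHES];

/* Write-invalidate: the writer keeps a valid copy; every other cache
   that snoops the write on the bus invalidates its own copy. */
static void bus_write(int writer) {
    for (int c = 0; c < NCACHES; c++)
        state[c] = (c == writer) ? VALID : INVALID;
}

int main(void) {
    for (int c = 0; c < NCACHES; c++) state[c] = VALID;  /* all share the line */

    bus_write(2);   /* cache 2 writes the line */

    for (int c = 0; c < NCACHES; c++)
        printf("cache %d: %s\n", c, state[c] == VALID ? "VALID" : "INVALID");
    return 0;
}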
Branch Prediction
• Branch prediction is a technique used in CPU design that
attempts to guess the outcome of a conditional operation
and prepare for the most likely result

• Interaction of branch prediction with pipelining:
If the prediction is correct, the pipeline is not flushed and no clock
cycles are lost
If the prediction is wrong, the pipeline is flushed and execution restarts
from the correct instruction
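A rough C illustration (not a rigorous benchmark) of why prediction
matters: the same branch is easy to predict when the data follows a regular
pattern (sorted) and hard to predict when it does not (random), and the
mispredictions typically show up as extra run time.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 20)

/* Sum only the elements >= 128; the if() is the conditional branch. */
static long sum_large(const int *data) {
    long sum = 0;
    for (int i = 0; i < N; i++)
        if (data[i] >= 128)      /* predictable if data is sorted,    */
            sum += data[i];      /* hard to predict if data is random */
    return sum;
}

static int cmp_int(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}

int main(void) {
    static int data[N];
    for (int i = 0; i < N; i++) data[i] = rand() % 256;

    clock_t t0 = clock();
    long s1 = sum_large(data);               /* random order: many mispredictions */
    clock_t t1 = clock();

    qsort(data, N, sizeof(int), cmp_int);    /* sorted: branch becomes predictable */
    long s2 = sum_large(data);
    clock_t t2 = clock();

    printf("random %.3fs  sorted %.3fs  (sums %ld %ld)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, s1, s2);
    return 0;
}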
Physical Organization of Parallel Platforms
Architecture of an Ideal Parallel Computer
• Parallel Random Access Machine (PRAM)
An extension to ideal sequential model: Random Access Machine (RAM)
PRAMs consist of p processors
A global memory
 Unbounded size
 Uniformly accessible to all processors with same address space
Processors share a common clock but may execute different instructions in
each cycle
PRAMs can be further classified based on how simultaneous memory accesses
are handled
Graphical Representation of PRAM
Parallel Random Access Machine (PRAM)
• A PRAM has a set of processors of a similar type
• Processors communicate with each other through the shared memory
• N processors can perform independent operations on N data items at a
time; this may lead to simultaneous access to the same memory location by
different processors
PRAM classes define how such simultaneous accesses to the same memory
location are resolved
PRAM Classes
• PRAMs can be divided into four classes
Exclusive-Read, Exclusive-Write (EREW) PRAM
No two processors can perform read/write operations concurrently
Weakest PRAM model, provides minimum memory access concurrency
Concurrent-Read, Exclusive-Write (CREW) PRAM
All processors can read concurrently but can’t write at same time
Multiple write accesses to a memory location are serialized
Exclusive-Read, Concurrent-Write (ERCW) PRAM
No two processors can read a location concurrently, but concurrent writes are allowed
Concurrent-Read, Concurrent-Write (CRCW) PRAM
Most powerful PRAM model
PRAM Arbitration Protocols
• Concurrent reads do not create any semantic inconsistencies
• But what about concurrent writes?
• An arbitration (mediation) mechanism is needed to resolve concurrent
write accesses
PRAM Arbitration Protocols (Cont.)
• Common
Write only if all values that processors are attempting to write are identical
• Arbitrary
Write the data from a randomly selected processor and ignore the rest
• Priority
Follow a predetermined priority order
Processor with highest priority succeeds and the rest fail
• Sum
Write the sum of the data items in all the write requests
The sum-based write-conflict resolution model can be extended to any
associative operator defined on the data being written (see the sketch below)
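A small C sketch (a simulation under assumed write requests, not a real
PRAM) showing how the value of one memory cell would be resolved under each
protocol; the priority order here is simply "lower processor id wins":

#include <stdio.h>

#define P 4   /* number of processors attempting a concurrent write */

int main(void) {
    int request[P] = {5, 5, 7, 5};   /* values each processor tries to write */

    /* Common: succeed only if all requested values are identical. */
    int common_ok = 1;
    for (int i = 1; i < P; i++)
        if (request[i] != request[0]) common_ok = 0;
    printf("Common:    %s\n", common_ok ? "write proceeds" : "write fails");

    /* Arbitrary: pick any one request (here, simply the last one). */
    printf("Arbitrary: cell = %d\n", request[P - 1]);

    /* Priority: lowest processor id wins (assumed priority order). */
    printf("Priority:  cell = %d (processor 0)\n", request[0]);

    /* Sum: write the sum of all requested values (any associative op works). */
    int sum = 0;
    for (int i = 0; i < P; i++) sum += request[i];
    printf("Sum:       cell = %d\n", sum);

    return 0;
}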
Physical Complexity of an Ideal Parallel Computer
• Processors and memories are connected via switches
• Since these switches must operate in O(1) time at the level of words, for
a system of p processors and m words the switch complexity is O(mp)
Switches determine the memory word being accessed by each processor
A switch is a device that opens or closes access to a certain memory bank
or word
• Clearly, for meaningful values of p and m, a true PRAM is not realizable
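For a rough sense of scale (hypothetical numbers): with p = 1024 processors
and m = 2^30 words of shared memory, an O(mp) switching network would
require on the order of 2^40 ≈ 10^12 switches, which is clearly impractical
to build.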
Communication Costs in Parallel Machines
• Along with idling (doing nothing) and contention (conflict over shared
resources, e.g., during resource allocation), communication is a major
overhead in parallel programs
• The communication cost is usually dependent on a
number of features including the following:
Programming model for communication
 Required pattern of the communication in the program
Network topology
Data handling and routing
Associated network protocols
• Usually, distributed systems suffer from major
communication overheads
Message Passing Costs in Parallel Computers
• The total time to transfer a message over a network comprises the
following:
• Startup time (ts): Time spent at the sending and receiving nodes
(preparing the message [adding headers, trailers, and parity information],
executing the routing algorithm, establishing an interface between the
node and the router, etc.)
Message Passing Costs in Parallel Computers (Cont)
• Per-hop time (th): A function of the number of hops (steps); includes
factors such as switch latencies, network delays, etc.
Also known as node latency
Also accounts for the latency of deciding which channel the message shall
be forwarded to next
• Per-word transfer time (tw): Includes all overheads that are determined
by the length of the message, such as link bandwidth and buffering
overheads
If the channel bandwidth is r words/s, then each word takes tw = 1/r
seconds to traverse the link
Message Passing Costs in Parallel Computers (Cont)
Store-and-Forward Routing
• A message traversing multiple hops is completely received at each
intermediate hop before being forwarded to the next hop
• The total communication cost for a message of size m words to traverse l
communication links is
tcomm = ts + (m*tw + th)*l
• In most platforms, th is small and the above expression can be
approximated by
tcomm = ts + m*l*tw
Here ts is the startup time, th is the cost of header transfer at each hop
(step), and m*tw is the cost of transferring m words over one link
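As a worked example with assumed (purely illustrative) parameter values: for
ts = 100 µs, th = 1 µs, and tw = 1 µs/word, a message of m = 1000 words
crossing l = 4 links costs tcomm = 100 + (1000*1 + 1)*4 = 4104 µs, while the
approximation ts + m*l*tw gives 4100 µs.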
Message Passing Costs in Parallel Computers (Cont)
Packet Routing
• Store-and-forward makes poor use of communication resources
• Packet routing breaks messages into packets and pipelines them
through the network
• Since packets may take different paths, each packet must carry
routing information, error checking, sequencing, and other related
header information
Error checking (parity information), sequencing (sequence numbers)
Related headers: layer headers, addressing headers
• The total communication time for packet routing is approximated by
tcomm = ts + l*th + m*tw
• Here the factor tw also accounts for the overheads of packet headers
Message Passing Costs in Parallel Computers (Cont)
Cut-Through Routing
• Takes the concept of packet routing to an extreme by further
dividing messages into basic units called flits or flow control digits

• Since flits are typically small, the header information must be minimized
• This is done by forcing all flits to take the same path, in sequence
• A tracer message first programs all intermediate routers; all flits then
take the same route
Message Passing Costs in Parallel Computers (Cont)
Cut-Through Routing (Cont.)
• Error checks are performed on the entire message, as opposed to on
individual flits
• No sequence numbers are needed
Sequencing information is not needed because all flits follow the same path,
which ensures in-order delivery
• The total communication time for cut-through routing is approximated by
tcomm = ts + l*th + m*tw
• This is identical in form to packet routing; however, tw is typically much
smaller
The message header takes l*th to reach the destination, and the entire
message arrives m*tw after the header
Message Passing Costs in Parallel Computers (Cont.)

Figure: a message transferred (a) through a store-and-forward communication
network, and (b), (c) extending the concept to cut-through routing. Shaded
regions represent the time during which the message is in transit; the
startup time associated with this message transfer is assumed to be zero.
Message Passing Costs in Parallel Computers (Cont)
Simplified Cost Model for Communicating Messages
• The cost of communicating a message between two nodes l hops away using
cut-through routing is given by
tcomm = ts + l*th + m*tw
• In this expression, th is typically much smaller than ts and tw. For this
reason, the second term on the right-hand side can be ignored, particularly
when m is large
• We can therefore approximate the cost of message transfer by
tcomm = ts + m*tw
For communication using flits, the start-up time dominates the node latencies
(a small cost-model comparison sketch follows below)
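A small C sketch that evaluates the three cost expressions side by side
under assumed parameter values (the numbers are illustrative placeholders,
not measurements):

#include <stdio.h>

/* Cost-model parameters (assumed, illustrative values in microseconds). */
static const double ts = 100.0;  /* startup time           */
static const double th = 1.0;    /* per-hop time           */
static const double tw = 1.0;    /* per-word transfer time */

/* Store-and-forward: ts + (m*tw + th)*l */
static double cost_sf(double m, double l) { return ts + (m * tw + th) * l; }

/* Cut-through (and packet) routing: ts + l*th + m*tw */
static double cost_ct(double m, double l) { return ts + l * th + m * tw; }

/* Simplified model (l*th ignored): ts + m*tw */
static double cost_simple(double m)       { return ts + m * tw; }

int main(void) {
    double m = 1000.0;   /* message size in words  */
    double l = 8.0;      /* number of hops (links) */

    printf("store-and-forward: %.0f us\n", cost_sf(m, l));
    printf("cut-through:       %.0f us\n", cost_ct(m, l));
    printf("simplified:        %.0f us\n", cost_simple(m));
    return 0;
}

With these numbers, the store-and-forward cost grows with m*l, while the
cut-through and simplified costs stay close to ts + m*tw.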
Message Passing Costs in Parallel Computers (Cont)
Simplified Cost Model for Communicating Messages (Cont.)

• It is important to note that the original expression for communication
time is valid only for uncongested networks
• Different communication patterns congest different networks to varying
extents
• It is important to understand and account for this in the communication
time accordingly
Summary
• Motivating Parallelism
The Memory / Disk Speed Argument
 Speed mismatch between processor and DRAM can be reduced using parallel
platforms and aggregated caches

The Data Communication Argument


 Analyses on large scale data can be performed over the network using parallel
techniques
Summary (Cont.)
• Types of Parallelism
Data Parallelism
 Independent tasks applying the same operation to different elements of a data set

Functional Parallelism
Independent tasks applying different operations to different data
elements

Pipelining
 Effective method of attaining parallelism on the uniprocessor architectures
Summary (Cont.)
• Multi-Processor vs. Multi-Computer
Multi-Processor - Multiple CPUs with a shared memory
 Centralized Multi-Processor - All the memory is at one place and has Uniform Memory
Access (UMA) architecture
 Distributed Multi-processor - Distributed collection of memories forms one logical
address space. Use Non-Uniform Memory Access (NUMA) architecture
Multi-Computer - Each processor has disjoint memory
 Asymmetric Multi-Computers
 A front-end computer interacts with users and I/O devices
 The back-end processors are dedicated for processing
 Symmetric Multi-Computers
 Every computer executes same OS
 This enables multiple users to concurrently login, edit and compile their programs
Summary (Cont.)
• Cluster vs. Network of Workstations
Cluster - Co-located collection of computers running the same version of OS
Network of Workstations - Dispersed collection of computers, possibly with
different versions of OS
• Cache Coherence
Each CPU’s cache needs to be continuously updated to reflect current values
• Snooping
Snoopy protocols achieve data consistency between cache and memory
through a bus based memory system
Summary (Cont.)
• Branch Prediction
Used in CPU design to guess the outcome of a conditional operation and
prepare for the most likely result

• Architecture of an Ideal Parallel Computer – PRAM


Extension to ideal sequential model: Random Access Machine (RAM)
Consist of p processors
Global memory (Unbounded size, Uniformly accessible to all processors
with same address space)
Summary (Cont.)
• PRAM Classes
Exclusive-Read, Exclusive-Write (EREW) PRAM
Concurrent-Read, Exclusive-Write (CREW) PRAM
Exclusive-Read, Concurrent-Write (ERCW) PRAM
Concurrent-Read, Concurrent-Write (CRCW) PRAM
• PRAM Arbitration Protocols
Concurrent writes can create semantic inconsistencies
We need an arbitration mechanism to resolve concurrent write access
Common, Arbitrary, Priority, Sum protocols can be used
Summary (Cont.)
• Physical Complexity of an Ideal Parallel Computer
Processors and memories are connected via switches

Switches must operate in O(1) time at the level of words, for a system of p
processors and m words, the switch complexity is O(mp)

For meaningful values of p and m, a true PRAM is not realizable


Summary (Cont.)
• Communication Costs in Parallel Machines
• Message Passing Costs in Parallel Computers
Startup time, Per-hop time, Per-word transfer time
Store-and-Forward Routing

Packet Routing

Cut-Through Routing

Simplified Cost Model for Communicating Messages


Additional Resources
• Book: Introduction to Parallel Computing by Ananth Grama
and Anshul Gupta
Chapter 1: Introduction to Parallel Computing
 1.1 Motivating Parallelism

• Flynn, M., “Some Computer Organizations and Their


Effectiveness,” IEEE Transactions on Computers, Vol. C-21,
No. 9, September 1972.
• Kumar, V., Grama, A., Gupta, A., & Karypis, G.
(1994). Introduction to parallel computing (Vol. 110).
Redwood City, CA: Benjamin/Cummings.
Questions?
