Lecture 8 Miscellaneous Topics
• Summary
• Additional Resources
Motivating Parallelism
The Memory / Disk Speed Argument
Multi-Processor vs. Multi-Computer (Cont.)
Distributed Multi-processor
• Distributed collection of memories
forms one logical address space
• Again, the same address on different
processors refers to the same
memory location
• Also known as Non-Uniform Memory
Access (NUMA) architecture, because
memory access time varies
significantly depending on the
physical location of the referenced
address
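The NUMA idea above can be sketched in a few lines. This is only an illustration: the node size and the two latency figures are hypothetical values chosen for the example, not numbers from the lecture. Every processor resolves a global address to the same (node, offset) pair, but the cost of the access depends on which processor issues it.

```python
# Hypothetical parameters, chosen only for illustration.
WORDS_PER_NODE = 1024
LOCAL_NS = 100    # assumed latency when the address lives on the requester's node
REMOTE_NS = 300   # assumed latency when it lives on another node


def home_of(address):
    """Map a global address to (owning node, offset within that node's memory)."""
    return divmod(address, WORDS_PER_NODE)


def access_time_ns(cpu_node, address):
    """Access cost seen by a CPU on cpu_node; non-uniform by construction."""
    owner, _ = home_of(address)
    return LOCAL_NS if owner == cpu_node else REMOTE_NS


addr = 2 * WORDS_PER_NODE + 5          # an address that lives on node 2
print(home_of(addr))                   # same answer whichever processor asks
print(access_time_ns(2, addr), access_time_ns(0, addr))
```

The mapping function is shared by all processors (one logical address space), while only the access time differs between the local and the remote requester.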
Multi-Processor vs. Multi-Computer (Cont.)
Multi-Computer
• Distributed memory, multi-CPU computer
• Unlike NUMA architecture, a multicomputer has disjoint
local address spaces
• Each processor has direct access to its own local memory
only
• The same address on different processors refers to two
different physical memory locations
• Processors interact with each other by passing
messages
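A minimal simulation of these points, assuming a hypothetical `Node` class invented for this sketch (it is not a real message-passing library): each node owns a private memory, the same address on two nodes names two different locations, and data moves only through explicit messages.

```python
from collections import deque


class Node:
    """One computer of a hypothetical multicomputer: private memory plus an inbox."""

    def __init__(self, rank, size=8):
        self.rank = rank
        self.memory = [0] * size   # local address space, invisible to other nodes
        self.inbox = deque()       # messages delivered by the simulated network

    def send(self, dest, value):
        # The only way to affect another node is to hand it a message.
        dest.inbox.append((self.rank, value))

    def recv_into(self, address):
        # Copy the oldest pending message into local memory; return the sender.
        sender, value = self.inbox.popleft()
        self.memory[address] = value
        return sender


a, b = Node(0), Node(1)
a.memory[3] = 42            # address 3 on node 0
print(b.memory[3])          # address 3 on node 1 is a different location
a.send(b, a.memory[3])      # explicit message from node 0 to node 1
b.recv_into(3)
print(b.memory[3])
```

A real multicomputer would use a message-passing library over a network; the in-process queue here only stands in for that transport.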
Multi-Processor vs. Multi-Computer (Cont.)
Asymmetric Multi-Computers
• A front-end computer that interacts
with users and I/O devices
• The back-end processors are
dedicated to “number
crunching”
• Front-end computer executes a full,
multi-programmed OS and provides
all functions needed for program
development
• The back-end processors are reserved for
executing parallel programs
Multi-Processor vs. Multi-Computer (Cont.)
Symmetric Multi-Computers
• Every computer executes the same OS
• Users may log into any of the
computers
• This enables multiple users to
log in, edit, and compile
their programs concurrently
• All the nodes can participate in
the execution of a parallel program
Cluster vs. Network of Workstations
Cluster
• Usually a co-located collection of low-cost computers and switches,
dedicated to running parallel jobs
• All computers run the same version of the operating system
• Some of the computers may not have interfaces for users to log in
• Commodity clusters use high-speed networks for communication, such as
Fast Ethernet at 100 Mbps
Network of Workstations
• A dispersed collection of computers
• Individual workstations may have different operating systems and
executable programs
• Users have the power to log in and power off their workstations
• Ethernet speed for this network is usually slower, typically in the
range of tens of Mbps
Cache Coherence
• In a multi-processor system, each CPU’s cache must stay
consistent with the others: every cached copy of a shared
location must reflect its current value. This property is
known as cache coherence
Snooping
• Snoopy protocols keep the caches and the shared memory
consistent on a bus-based memory system: every cache
monitors (snoops) the transactions on the bus
• Write-invalidate and write-update policies maintain cache
consistency: on a write, the other cached copies are either
invalidated or updated
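The write-invalidate policy can be sketched as a toy simulation. The `Bus` and `Cache` classes below are hypothetical names invented for this example, and the model makes simplifying assumptions (write-through to memory, one word per cache line, no line states); it is not a description of any specific protocol.

```python
class Bus:
    """Hypothetical shared bus: broadcasts writes and fronts the shared memory."""

    def __init__(self):
        self.memory = {}
        self.caches = []

    def broadcast_invalidate(self, writer, address):
        # Every cache except the writer sees the write on the bus.
        for cache in self.caches:
            if cache is not writer:
                cache.snoop_invalidate(address)


class Cache:
    def __init__(self, bus):
        self.bus = bus
        self.lines = {}                 # address -> value, valid lines only
        bus.caches.append(self)

    def read(self, address):
        if address not in self.lines:   # miss: fetch from shared memory
            self.lines[address] = self.bus.memory.get(address, 0)
        return self.lines[address]

    def write(self, address, value):
        self.bus.broadcast_invalidate(self, address)  # other copies become invalid
        self.lines[address] = value
        self.bus.memory[address] = value              # write-through, for simplicity

    def snoop_invalidate(self, address):
        self.lines.pop(address, None)


bus = Bus()
c0, c1 = Cache(bus), Cache(bus)
c0.read(0x10)
c1.read(0x10)            # both caches now hold address 0x10
c0.write(0x10, 7)        # c1's copy is invalidated through the bus
print(c1.read(0x10))     # c1 misses and re-fetches the current value
```

A write-update policy would instead push the new value into the other caches' copies rather than dropping them.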
Branch Prediction
• Branch prediction is a technique used in CPU design that
attempts to guess the outcome of a conditional operation
and prepare for the most likely result
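One common way to make such a guess is a 2-bit saturating counter. The lecture does not say which scheme it has in mind, so the sketch below is an assumption chosen for illustration; `simulate_two_bit` is a hypothetical helper that counts how many outcomes the counter predicts correctly.

```python
def simulate_two_bit(outcomes, state=2):
    """Count the branch outcomes a 2-bit saturating counter predicts correctly.

    States 0-1 predict "not taken"; states 2-3 predict "taken".
    """
    correct = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken == taken:
            correct += 1
        # Move one step toward the observed outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct


# A loop-style branch: taken 9 times, then not taken once, repeated 3 times.
history = ([True] * 9 + [False]) * 3
print(simulate_two_bit(history), "of", len(history))
```

With this history the counter mispredicts only the single loop exit in each repetition, because one "not taken" outcome is not enough to flip its prediction.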
Cut-Through Routing
• All flits of a message are forced to take the same path, in sequence
• For communication using flits, start-up time dominates the node latencies
Message Passing Costs in Parallel Computers (Cont.)
Simplified Cost Model for Communicating Messages (Cont.)
Functional Parallelism
Independent tasks applying different operations to different data
elements
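A small sketch of functional parallelism, assuming two made-up tasks (`total` and `longest`) that share no data. The thread pool is used only to show the structure of independent tasks.

```python
from concurrent.futures import ThreadPoolExecutor


def total(values):
    return sum(values)


def longest(words):
    return max(words, key=len)


samples = [3, 1, 4, 1, 5]
words = ["bus", "cache", "switch"]

# Two independent tasks: different operations, different data, no shared state.
with ThreadPoolExecutor(max_workers=2) as pool:
    sum_future = pool.submit(total, samples)
    word_future = pool.submit(longest, words)
    results = (sum_future.result(), word_future.result())

print(results)
```

In standard CPython, threads generally do not speed up CPU-bound work like this; a process pool would likely be needed for real speedup, so treat the example as a structural illustration only.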
Pipelining
An effective method of attaining parallelism on uniprocessor architectures
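The benefit of pipelining can be estimated with a simple cycle count. The sketch assumes an idealized pipeline in which every stage takes one cycle and there are no stalls; both function names are hypothetical.

```python
def unpipelined_cycles(n_items, n_stages):
    # Each item passes through every stage before the next item starts.
    return n_items * n_stages


def pipelined_cycles(n_items, n_stages):
    # The first item needs n_stages cycles; each later item finishes one cycle after.
    return n_stages + n_items - 1


n, k = 100, 5
print(unpipelined_cycles(n, k), pipelined_cycles(n, k))
print(unpipelined_cycles(n, k) / pipelined_cycles(n, k))
```

As the number of items grows, the ratio approaches the number of stages, which is the upper bound on speedup under these idealized assumptions.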
Summary (Cont.)
• Multi-Processor vs. Multi-Computer
  Multi-Processor - Multiple CPUs with a shared memory
    Centralized Multi-Processor - All the memory is in one place; Uniform Memory
    Access (UMA) architecture
    Distributed Multi-Processor - A distributed collection of memories forms one logical
    address space; Non-Uniform Memory Access (NUMA) architecture
  Multi-Computer - Each processor has its own disjoint local memory
    Asymmetric Multi-Computers
      A front-end computer interacts with users and I/O devices
      The back-end processors are dedicated to processing
    Symmetric Multi-Computers
      Every computer executes the same OS
      Multiple users can log in, edit, and compile their programs concurrently
Summary (Cont.)
• Cluster vs. Network of Workstations
  Cluster - Co-located collection of computers running the same version of the OS
  Network of Workstations - Dispersed collection of computers that may run
  different operating systems
• Cache Coherence
  Caches must be kept up to date with the current values of shared data
• Snooping
  Snoopy protocols achieve data consistency between cache and memory
  through a bus-based memory system
Summary (Cont.)
• Branch Prediction
  Used in CPU design to guess the outcome of a conditional operation and
  prepare for the most likely result
• Switches must operate in O(1) time at the level of words; for a system of p
  processors and m words, the switch complexity is O(mp)
Packet Routing
Cut-Through Routing