CIS620 15 00
Architectures
The simplest multiprocessors are based on a
single bus.
– Two or more CPUs and one or more memory
modules all use the same bus for communication.
– If the bus is busy when a CPU wants to read
memory, it must wait.
– Adding more CPUs results in more waiting.
– This can be alleviated by giving each CPU a private cache.
(Figure: UMA bus-based SMP)
Snooping Caches
– With caches a CPU may have stale data in its
private cache.
– This problem is known as the cache coherence or
cache consistency problem.
– This problem can be controlled by algorithms called
cache coherence protocols.
• In all solutions, the cache controller is specially designed to allow it to eavesdrop on the bus, monitoring all bus requests and taking action in certain cases.
• These devices are called snooping caches.
MESI Cache Coherence Protocol
– When a protocol has the property that not all writes
go directly through to memory (a bit is set instead
and the cache line is eventually written to memory)
we call it a write-back protocol.
– One popular write-back protocol is called the MESI
protocol.
• It is used by the Pentium II and other CPUs.
• Each cache entry can be in one of four states:
– Invalid - the cache entry does not contain valid data
– Shared - multiple caches may hold the line; memory is up to
date
– Exclusive - no other cache holds the line; memory is up to date
– Modified - the entry is valid; memory is invalid; no copies
exist
• Initially all cache entries are invalid
• The first time a CPU reads a line from memory, the line is loaded into its cache and marked E (exclusive)
• If some other CPU reads the data, the first CPU sees this
on the bus, announces that it holds the data as well, and
both entries are marked S (shared)
• If one of the CPUs writes the cache entry, it tells all other CPUs to invalidate their entries (I), and its own entry is now in the M (modified) state.
• If some other CPU now wants to read the modified line, the cache holding it writes the copy back to memory, the CPUs needing it read it from memory, and their entries are marked S.
• If we write to an uncached line and write-allocate is in use, the line is loaded into the cache, written, and marked M.
• If write-allocate is not in use, the write goes directly
to memory and the line is not cached anywhere.
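The transitions above amount to a small state machine per cache line. Below is a minimal C sketch of the state changes one snooping cache controller makes for its own copy of a line; the event names, the simplification that a read miss always yields E, and the trace in main are illustrative assumptions, not the Pentium II's actual controller logic.

```c
#include <stdio.h>

/* Per-line MESI state as described above. */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;

/* Events seen by one cache controller for a given line:
   its own CPU's reads/writes, plus reads/writes it snoops
   from other CPUs on the shared bus. */
typedef enum {
    LOCAL_READ, LOCAL_WRITE,   /* from this cache's CPU     */
    BUS_READ,   BUS_WRITE      /* snooped from another CPU  */
} event_t;

/* Next state of *this* cache's copy of the line. */
static mesi_t next_state(mesi_t s, event_t e)
{
    switch (e) {
    case LOCAL_READ:
        /* A miss loads the line as E; if another cache announces a
           copy during the fill, it would become S instead (omitted). */
        return (s == INVALID) ? EXCLUSIVE : s;
    case LOCAL_WRITE:
        /* Writing invalidates all other copies; our copy becomes M. */
        return MODIFIED;
    case BUS_READ:
        /* Another CPU reads the line: an M copy is written back,
           and every valid copy drops to S. */
        return (s == INVALID) ? INVALID : SHARED;
    case BUS_WRITE:
        /* Another CPU writes the line: our copy is invalidated. */
        return INVALID;
    }
    return s;
}

int main(void)
{
    mesi_t s = INVALID;
    const char *names[] = { "I", "S", "E", "M" };
    event_t trace[] = { LOCAL_READ, BUS_READ, LOCAL_WRITE, BUS_READ };

    for (unsigned i = 0; i < sizeof trace / sizeof trace[0]; i++) {
        s = next_state(s, trace[i]);
        printf("after event %u: %s\n", i, names[s]);
    }
    return 0;   /* prints E, S, M, S */
}
```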
UMA Multiprocessors Using Crossbar Switches
– Even with all possible optimizations, the use of a
single bus limits the size of a UMA multiprocessor
to about 16 or 32 CPUs.
• To go beyond that, a different kind of interconnection
network is needed.
• The simplest circuit for connecting n CPUs to k
memories is the crossbar switch.
– Crossbar switches have long been used in telephone switches.
– At each intersection is a crosspoint - a switch that can be
opened or closed.
– The crossbar is a nonblocking network: no CPU is ever denied a connection because a crosspoint is busy, although it may still wait if the memory module it wants is already in use.
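To make the nonblocking property concrete, here is a small C sketch of an n x k crossbar; the dimensions, function names, and arbitration-by-scan are assumptions for illustration, not any real machine's arbiter. A connection request can only be refused because the target memory module is already in use, never because the switch has run out of paths.

```c
#include <stdbool.h>
#include <stdio.h>

#define N_CPUS 4   /* n CPUs (rows)            */
#define K_MEMS 4   /* k memory modules (cols)  */

/* crosspoint[i][j] is closed (true) when CPU i is connected to memory j. */
static bool crosspoint[N_CPUS][K_MEMS];

/* Try to connect CPU `cpu` to memory module `mem`.
   The only possible conflict is another CPU already using that module;
   the crossbar itself never blocks a request. */
static bool connect_cpu(int cpu, int mem)
{
    for (int i = 0; i < N_CPUS; i++)
        if (crosspoint[i][mem])
            return false;              /* module busy: must wait */
    crosspoint[cpu][mem] = true;       /* close the crosspoint    */
    return true;
}

static void disconnect_cpu(int cpu, int mem)
{
    crosspoint[cpu][mem] = false;      /* open the crosspoint     */
}

int main(void)
{
    printf("CPU 0 -> mem 2: %s\n", connect_cpu(0, 2) ? "ok" : "wait");
    printf("CPU 1 -> mem 3: %s\n", connect_cpu(1, 3) ? "ok" : "wait"); /* ok: different module */
    printf("CPU 2 -> mem 2: %s\n", connect_cpu(2, 2) ? "ok" : "wait"); /* wait: module 2 busy   */
    disconnect_cpu(0, 2);
    printf("CPU 2 -> mem 2: %s\n", connect_cpu(2, 2) ? "ok" : "wait"); /* ok now                */
    return 0;
}
```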
Sun Enterprise 10000
– An example of a UMA multiprocessor based on a crossbar switch is the Sun Enterprise 10000.
• This system consists of a single cabinet with up to 64 CPUs.
• The crossbar switch is packaged on a circuit board with eight plug-in slots on each side.
• Each slot can hold up to four UltraSPARC CPUs and 4 GB
of RAM.
• Data is moved between memory and the caches on a 16 X
16 crossbar switch.
• There are four address buses used for snooping.
UMA Multiprocessors Using Multistage Switching Networks
– To go beyond the limits of machines like the Sun Enterprise 10000, we need a better interconnection network; the number of crosspoints in a crossbar grows as n², which quickly becomes impractical.
– We can use 2 X 2 switches to build large multistage
switching networks.
• One example is the omega network.
• The wiring pattern of the omega network is called the perfect
shuffle.
• The label (number) of the destination memory module can be used, bit by bit, for routing packets through the network.
• The omega network is a blocking network: unlike the crossbar, not every set of requests can be served at once, because requests can contend for the same internal link.
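Routing through the omega network uses the destination memory module's label one bit per stage: at each 2 x 2 switch, a 0 bit selects the upper output and a 1 bit the lower output. The C sketch below illustrates this under assumed parameters (eight memory modules, so three stages); it only prints the chosen outputs.

```c
#include <stdio.h>

#define STAGES 3            /* log2 of the number of memory modules (8 here) */

/* Route a message to memory module `dest` through an omega network:
   stage s examines one bit of the destination label, starting with the
   most significant bit; 0 = upper output, 1 = lower output. */
static void route(unsigned dest)
{
    printf("to module %u:", dest);
    for (int s = STAGES - 1; s >= 0; s--) {
        unsigned bit = (dest >> s) & 1u;
        printf(" %s", bit ? "lower" : "upper");
    }
    printf("\n");
}

int main(void)
{
    route(0);   /* 000: upper upper upper */
    route(5);   /* 101: lower upper lower */
    route(7);   /* 111: lower lower lower */
    return 0;
}
```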
NUMA Multiprocessors
– To scale to more than 100 CPUs, we have to give up
uniform memory access time.
– This leads to the idea of NUMA (NonUniform
Memory Access) multiprocessors.
• They share a single address space across all the CPUs, but
unlike UMA machines local access is faster than remote
access.
• All UMA programs run without change on NUMA
machines, but the performance is worse.
– When the access time to remote memory is not hidden (because there is no caching), the system is called NC-NUMA.
– When coherent caches are present, the system is called CC-
NUMA.
– It is also sometimes known as hardware DSM since it is
basically the same as software distributed shared memory but
implemented by the hardware using a small page size.
• One of the first NC-NUMA machines was the Carnegie
Mellon Cm*.
– This system was implemented with LSI-11 CPUs (the LSI-11 was
a single-chip version of the DEC PDP-11).
– A program running out of remote memory took ten times as long
as one using local memory.
– Note that there is no caching in this type of system so there is no
need for cache coherence protocols.
Cache Coherent NUMA Multiprocessors
– Not having a cache is a major handicap.
– One of the most popular approaches to building
large CC-NUMA (Cache Coherent NUMA)
multiprocessors currently is the directory-
based multiprocessor.
• Maintain a database telling where each cache line is
and what its status is.
• The database is kept in special-purpose hardware that responds in a fraction of a bus cycle.
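As a minimal sketch of such a directory (the structure, field names, and sizes are assumptions, not any particular machine's format), each line of a node's memory gets an entry recording its state and a bitmap of the nodes holding a copy, so a write can invalidate exactly those copies:

```c
#include <stdint.h>
#include <stdio.h>

/* One directory entry per cache line of this node's local memory. */
typedef enum { UNCACHED, SHARED_LINE, MODIFIED_LINE } dstate_t;

typedef struct {
    dstate_t state;
    uint64_t copies;      /* bitmap: bit i set => node i holds a copy */
} dir_entry_t;

#define LINES 1024
static dir_entry_t directory[LINES];

/* A remote node asks to read line `line`. */
static void handle_read(unsigned line, unsigned node)
{
    dir_entry_t *e = &directory[line];
    if (e->state == MODIFIED_LINE)
        printf("fetch dirty copy of line %u back from owner, write to memory\n", line);
    e->copies |= 1ull << node;
    e->state = SHARED_LINE;
}

/* A remote node asks to write line `line`: invalidate all other copies. */
static void handle_write(unsigned line, unsigned node)
{
    dir_entry_t *e = &directory[line];
    for (unsigned i = 0; i < 64; i++)
        if (((e->copies >> i) & 1ull) && i != node)
            printf("send invalidate for line %u to node %u\n", line, i);
    e->copies = 1ull << node;
    e->state = MODIFIED_LINE;
}

int main(void)
{
    handle_read(7, 2);    /* node 2 caches line 7            */
    handle_read(7, 5);    /* node 5 caches line 7 too        */
    handle_write(7, 5);   /* node 5 writes: node 2 gets an invalidate */
    return 0;
}
```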
DASH Multiprocessor
– The first directory-based CC-NUMA
multiprocessor, DASH (Directory Architecture for
SHared Memory), was built at Stanford University
as a research project.
• It has heavily influenced a number of commercial products, such as the SGI Origin 2000.
• The prototype consists of 16 clusters, each one containing
a bus, four MIPS R3000 CPUs, 16 MB of global memory,
and some I/O equipment.
• Each CPU snoops on its local bus, but not on any other
buses, so global coherence needs a different mechanism.
– Each cluster has a directory that keeps track of which
clusters currently have copies of its lines.
– Each cluster in DASH is connected to an interface that
allows the cluster to communicate with other clusters.
• The interfaces are connected in a rectangular grid.
• A cache line can be in one of three states:
– UNCACHED
– SHARED
– MODIFIED
• The DASH protocols are based on ownership and
invalidation.
• At every instant each cache line has a unique owner.
– For UNCACHED or SHARED lines, the line’s home
cluster is the owner
– For MODIFIED lines, the cluster holding the one and
only copy is the owner.
• Requests for a cache line work their way out from the cluster to the global network.
• Maintaining memory consistency in DASH is fairly
complex and slow.
• A single memory access may require a substantial
number of packets to be sent.
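As a small C illustration of the ownership rule only (the types and cluster numbers are assumptions; this is not the DASH hardware or its message protocol):

```c
#include <stdio.h>

typedef enum { UNCACHED, SHARED, MODIFIED } dash_state_t;

typedef struct {
    dash_state_t state;
    int home_cluster;     /* cluster whose memory the line lives in      */
    int holder_cluster;   /* cluster holding the only copy when MODIFIED */
} dash_line_t;

/* The DASH ownership rule: the home cluster owns UNCACHED and SHARED
   lines; the cluster holding the single dirty copy owns a MODIFIED line. */
static int owner(const dash_line_t *l)
{
    return (l->state == MODIFIED) ? l->holder_cluster : l->home_cluster;
}

int main(void)
{
    dash_line_t a = { SHARED,   3, -1 };
    dash_line_t b = { MODIFIED, 3,  9 };
    printf("shared line owned by cluster %d\n", owner(&a));   /* 3 */
    printf("modified line owned by cluster %d\n", owner(&b)); /* 9 */
    return 0;
}
```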
Sequent NUMA-Q Multiprocessor
– The DASH was an important project, but it was never
a commercial system.
– As an example of a commercial CC-NUMA
multiprocessor, consider the Sequent NUMA-Q
2000.
• It uses an interesting and important cache coherence
protocol called SCI (Scalable Coherent Interface).
• The NUMA-Q is based on the standard quad board sold by
Intel containing four Pentium Pro CPU chips and up to 4
GB of RAM.
– The CPU caches on each quad board are kept coherent by snooping on the local bus with the MESI protocol.
– Each quad board is extended with an IQ-Link board
plugged into a slot designed for network controllers.
• The IQ-Link primarily implements the SCI protocol.
• It holds 32 MB of cache, a directory for the cache, a
snooping interface to the local quad board bus and a
custom chip called the data pump that connects it with
other IQ-Link boards.
– It pumps data from the input side to the output side, keeping
data aimed at its node and passing other data unmodified.
– Together, all the IQ-Link boards form a ring.
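The data pump's role can be pictured with a short C sketch; the packet layout, node count, and use of recursion to model the ring hops are assumptions for illustration. Each node keeps packets addressed to it and passes everything else on, unmodified, to the next IQ-Link board.

```c
#include <stdio.h>

#define NODES 4   /* quad boards on the ring (assumed count) */

typedef struct {
    int dest;     /* destination node number */
    int payload;
} packet_t;

/* What one node's data pump does with an incoming packet:
   consume it if it is addressed here, otherwise pass it on unmodified. */
static void pump(int node, packet_t p)
{
    if (p.dest == node) {
        printf("node %d: accepted payload %d\n", node, p.payload);
    } else {
        int next = (node + 1) % NODES;
        printf("node %d: forwarding to node %d\n", node, next);
        pump(next, p);                   /* next hop around the ring */
    }
}

int main(void)
{
    packet_t p = { 2, 42 };
    pump(0, p);    /* injected at node 0, travels 0 -> 1 -> 2 */
    return 0;
}
```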
Distributed Shared Memory
– A collection of CPUs sharing a common paged
virtual address space is called DSM (Distributed
Shared Memory).
• When a CPU accesses a page in its own local RAM, the
read or write just happens without any further delay.
• If the page is in a remote memory, a page fault is
generated.
• The runtime system or OS sends a message to the node
holding the page to unmap it and send it over.
• Read-only pages may be shared.
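The fault-handling path just described can be sketched in C. This is a simulation under stated assumptions: the per-node memories, the page-ownership table, and the "message" to the owner are modeled with plain arrays and memcpy, whereas a real DSM system would use the MMU and a network message layer.

```c
#include <stdio.h>
#include <string.h>

#define NODES     2
#define PAGES     4
#define PAGE_SIZE 4096

/* Simulated per-node memory and a table saying which node owns each page. */
static char memory[NODES][PAGES][PAGE_SIZE];
static int  owner_of[PAGES] = { 0, 0, 1, 1 };

/* Handle a page fault on `node` for `page`:
   ask the current owner to unmap the page and ship it over,
   then map it locally and retry the access (retry omitted). */
static void dsm_page_fault(int node, int page)
{
    int owner = owner_of[page];
    printf("node %d: fault on page %d, owned by node %d\n", node, page, owner);

    /* "Message" to the owner: it unmaps the page and sends the contents. */
    memcpy(memory[node][page], memory[owner][page], PAGE_SIZE);
    owner_of[page] = node;             /* the page now lives here */

    printf("node %d: page %d transferred and mapped locally\n", node, page);
}

int main(void)
{
    strcpy(memory[1][2], "hello from node 1");
    dsm_page_fault(0, 2);              /* node 0 touches a remote page */
    printf("node 0 now reads: %s\n", memory[0][2]);
    return 0;
}
```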
– Pages, however, are an unnatural unit for sharing,
so other approaches have been tried.
– Linda provides processes on multiple machines
with a highly structured distributed shared memory.
• The memory is accessed through a small set of primitive
operations that can be added to existing languages such
as C and FORTRAN.
• The unifying concept behind Linda is that of an abstract
tuple space.
• Four operations are provided on tuples:
• out puts a tuple into the tuple space.
• in retrieves a tuple from the tuple space.
– Tuples are addressed by content, rather than by name.
• read is like in but it does not remove the tuple from the
tuple space.
• eval causes its parameters to be evaluated in parallel
and the resulting tuple to be deposited in the tuple space.
– Various implementations of Linda exist on
multicomputers.
• Broadcasting and directories are used for distributing the
tuples.
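The four operations can be made concrete with a toy, single-process tuple space in C. This is an illustration only: real C-Linda is a language extension that matches arbitrary tuples by content and whose eval spawns parallel evaluation; here tuples are fixed to a (tag, integer) shape, blocking is omitted, and eval is left out.

```c
#include <stdio.h>
#include <string.h>

/* Toy tuple space: tuples are fixed-shape ("tag", int) pairs. */
typedef struct { char tag[16]; int value; int used; } tuple_t;
static tuple_t space[64];

/* out: deposit a tuple into the tuple space. */
static void ts_out(const char *tag, int value)
{
    for (int i = 0; i < 64; i++)
        if (!space[i].used) {
            strncpy(space[i].tag, tag, sizeof space[i].tag - 1);
            space[i].value = value;
            space[i].used = 1;
            return;
        }
}

/* Find a tuple by its content (the tag), the way Linda addresses tuples. */
static int ts_find(const char *tag)
{
    for (int i = 0; i < 64; i++)
        if (space[i].used && strcmp(space[i].tag, tag) == 0)
            return i;
    return -1;
}

/* in: retrieve a matching tuple and remove it (blocking omitted). */
static int ts_in(const char *tag, int *value)
{
    int i = ts_find(tag);
    if (i < 0) return 0;
    *value = space[i].value;
    space[i].used = 0;
    return 1;
}

/* read: like in, but the tuple stays in the tuple space. */
static int ts_read(const char *tag, int *value)
{
    int i = ts_find(tag);
    if (i < 0) return 0;
    *value = space[i].value;
    return 1;
}

int main(void)
{
    int v;
    ts_out("count", 7);
    if (ts_read("count", &v)) printf("read:  %d\n", v);  /* tuple remains */
    if (ts_in("count", &v))   printf("in:    %d\n", v);  /* tuple removed */
    printf("again: %s\n", ts_in("count", &v) ? "found" : "gone");
    return 0;
}
```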
– Orca uses full-blown objects rather than tuples as
the unit of sharing.
– Objects consist of internal state plus operations for
changing the state.
– Each Orca operation consists of a list of (guard, block-of-statements) pairs.
• A guard is a Boolean expression that does not contain any
side effects, or the empty guard, which is simply true.
• When an operation is invoked, all of its guards are
evaluated in an unspecified order.
• If all of them are false, the invoking process is
delayed until one becomes true.
• When a guard is found that evaluates to true, the
block of statements following it is executed.
• Orca has a fork statement to create a new process on
a user-specified processor.
• Operations on shared objects are atomic and
sequentially consistent.
• Orca integrates shared data and synchronization in a
way not present in page-based DSM systems.
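Orca is its own language, so the following is only a C analogue under stated assumptions (pthreads stand in for Orca processes, and the object, guards, and function names are invented for illustration). It sketches the guarded-operation rule: block until some guard is true, then run its statement block atomically on the shared object.

```c
/* build: cc orca_sketch.c -lpthread */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* A shared "object" in the Orca sense: internal state plus operations. */
typedef struct {
    int items;                 /* internal state: items in a buffer */
    pthread_mutex_t lock;
    pthread_cond_t  changed;
} shared_obj_t;

static shared_obj_t buf = { 0, PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER };

/* Operation with guard "items > 0": block until the guard holds,
   then execute the statement block atomically. */
static int take(shared_obj_t *o)
{
    pthread_mutex_lock(&o->lock);
    while (!(o->items > 0))                 /* guard: no side effects */
        pthread_cond_wait(&o->changed, &o->lock);
    int v = o->items--;                     /* guarded statement block */
    pthread_mutex_unlock(&o->lock);
    return v;
}

/* Operation with the empty guard (always true). */
static void put(shared_obj_t *o)
{
    pthread_mutex_lock(&o->lock);
    o->items++;
    pthread_cond_broadcast(&o->changed);    /* let waiters re-check guards */
    pthread_mutex_unlock(&o->lock);
}

static void *producer(void *arg)
{
    (void)arg;
    sleep(1);           /* let the consumer block on its guard first */
    put(&buf);
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);   /* Orca would use fork */
    printf("take() returned %d after its guard became true\n", take(&buf));
    pthread_join(t, NULL);
    return 0;
}
```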