CIS620 15 00
Architectures
The simplest multiprocessors are based on a
single bus.
– Two or more CPUs and one or more memory
modules all use the same bus for communication.
– If the bus is busy when a CPU wants to read
memory, it must wait.
– Adding more CPUs results in more waiting.
– This can be alleviated by giving each CPU a private cache.
(Figure: UMA bus-based SMP)
Snooping Caches
– With caches a CPU may have stale data in its
private cache.
– This problem is known as the cache coherence or
cache consistency problem.
– This problem can be controlled by algorithms called
cache coherence protocols.
• In all solutions, the cache controller is specially designed to allow it to eavesdrop on the bus, monitoring all bus requests and taking action in certain cases.
• These devices are called snooping caches.
MESI Cache Coherence Protocol
– When a protocol has the property that not all writes
go directly through to memory (a bit is set instead
and the cache line is eventually written to memory)
we call it a write-back protocol.
– One popular write-back protocol is called the MESI
protocol.
• It is used by the Pentium II and other CPUs.
• Each cache entry can be in one of four states:
– Invalid - the cache entry does not contain valid data
– Shared - multiple caches may hold the line; memory is up to
date
– Exclusive - no other cache holds the line; memory is up to date
– Modified - the entry is valid; memory is invalid; no copies
exist
• Initially all cache entries are invalid
• The first time a CPU reads a line from memory, the line is loaded into its cache and marked E (exclusive)
• If some other CPU reads the data, the first CPU sees this
on the bus, announces that it holds the data as well, and
both entries are marked S (shared)
• If one of the CPUs writes the cache entry, it tells all other CPUs to invalidate their entries (I), and its own entry is now in the M (modified) state.
• If some other CPU now wants to read the modified line, the cache holding it writes the copy back to memory, the CPUs needing it read it from memory, and their entries are marked S.
• If we write to an uncached line and write-allocate is in use, the line is loaded into the cache, written, and marked M.
• If write-allocate is not in use, the write goes directly
to memory and the line is not cached anywhere.
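The transitions above amount to a small state machine per cache line. Below is a minimal C sketch of the state changes one snooping cache controller makes for its own copy of a line; the event names, the simplification that a read miss always yields E, and the trace in main are illustrative assumptions, not the Pentium II's actual controller logic.

```c
#include <stdio.h>

/* Per-line MESI state as described above. */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;

/* Events seen by one cache controller for a given line:
   its own CPU's reads/writes, plus reads/writes it snoops
   from other CPUs on the shared bus. */
typedef enum {
    LOCAL_READ, LOCAL_WRITE,   /* from this cache's CPU     */
    BUS_READ,   BUS_WRITE      /* snooped from another CPU  */
} event_t;

/* Next state of *this* cache's copy of the line. */
static mesi_t next_state(mesi_t s, event_t e)
{
    switch (e) {
    case LOCAL_READ:
        /* A miss loads the line as E; if another cache announces a
           copy during the fill, it would become S instead (omitted). */
        return (s == INVALID) ? EXCLUSIVE : s;
    case LOCAL_WRITE:
        /* Writing invalidates all other copies; our copy becomes M. */
        return MODIFIED;
    case BUS_READ:
        /* Another CPU reads the line: an M copy is written back,
           and every valid copy drops to S. */
        return (s == INVALID) ? INVALID : SHARED;
    case BUS_WRITE:
        /* Another CPU writes the line: our copy is invalidated. */
        return INVALID;
    }
    return s;
}

int main(void)
{
    mesi_t s = INVALID;
    const char *names[] = { "I", "S", "E", "M" };
    event_t trace[] = { LOCAL_READ, BUS_READ, LOCAL_WRITE, BUS_READ };

    for (unsigned i = 0; i < sizeof trace / sizeof trace[0]; i++) {
        s = next_state(s, trace[i]);
        printf("after event %u: %s\n", i, names[s]);
    }
    return 0;   /* prints E, S, M, S */
}
```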
UMA Multiprocessors Using Crossbar Switches
– Even with all possible optimizations, the use of a
single bus limits the size of a UMA multiprocessor
to about 16 or 32 CPUs.
• To go beyond that, a different kind of interconnection
network is needed.
• The simplest circuit for connecting n CPUs to k
memories is the crossbar switch.
– Crossbar switches have long been used in telephone switches.
– At each intersection is a crosspoint - a switch that can be
opened or closed.
– The crossbar is a nonblocking network: no CPU is ever denied a connection because a crosspoint is busy, although it may still wait if the memory module it wants is already in use.
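To make the nonblocking property concrete, here is a small C sketch of an n x k crossbar; the dimensions, function names, and arbitration-by-scan are assumptions for illustration, not any real machine's arbiter. A connection request can only be refused because the target memory module is already in use, never because the switch has run out of paths.

```c
#include <stdbool.h>
#include <stdio.h>

#define N_CPUS 4   /* n CPUs (rows)            */
#define K_MEMS 4   /* k memory modules (cols)  */

/* crosspoint[i][j] is closed (true) when CPU i is connected to memory j. */
static bool crosspoint[N_CPUS][K_MEMS];

/* Try to connect CPU `cpu` to memory module `mem`.
   The only possible conflict is another CPU already using that module;
   the crossbar itself never blocks a request. */
static bool connect_cpu(int cpu, int mem)
{
    for (int i = 0; i < N_CPUS; i++)
        if (crosspoint[i][mem])
            return false;              /* module busy: must wait */
    crosspoint[cpu][mem] = true;       /* close the crosspoint    */
    return true;
}

static void disconnect_cpu(int cpu, int mem)
{
    crosspoint[cpu][mem] = false;      /* open the crosspoint     */
}

int main(void)
{
    printf("CPU 0 -> mem 2: %s\n", connect_cpu(0, 2) ? "ok" : "wait");
    printf("CPU 1 -> mem 3: %s\n", connect_cpu(1, 3) ? "ok" : "wait"); /* ok: different module */
    printf("CPU 2 -> mem 2: %s\n", connect_cpu(2, 2) ? "ok" : "wait"); /* wait: module 2 busy   */
    disconnect_cpu(0, 2);
    printf("CPU 2 -> mem 2: %s\n", connect_cpu(2, 2) ? "ok" : "wait"); /* ok now                */
    return 0;
}
```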
Sun Enterprise 10000
– An example of a UMA multiprocessor based on a crossbar switch is the Sun Enterprise 10000.
• This system consists of a single cabinet with up to 64 CPUs.
• The crossbar switch is packaged on a circuit board with eight plug-in slots on each side.
• Each slot can hold up to four UltraSPARC CPUs and 4 GB
of RAM.
• Data is moved between memory and the caches on a 16 X
16 crossbar switch.
• There are four address buses used for snooping.
UMA Multiprocessors Using Multistage Switching Networks
– To go beyond the limits of machines like the Sun Enterprise 10000, we need a better interconnection network; the number of crosspoints in a crossbar grows as n², which quickly becomes impractical.
– We can use 2 X 2 switches to build large multistage
switching networks.
• One example is the omega network.
• The wiring pattern of the omega network is called the perfect
shuffle.
• The label (number) of the destination memory module can be used, bit by bit, for routing packets through the network.
• The omega network is a blocking network: unlike the crossbar, not every set of requests can be served at once, because requests can contend for the same internal link.
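Routing through the omega network uses the destination memory module's label one bit per stage: at each 2 x 2 switch, a 0 bit selects the upper output and a 1 bit the lower output. The C sketch below illustrates this under assumed parameters (eight memory modules, so three stages); it only prints the chosen outputs.

```c
#include <stdio.h>

#define STAGES 3            /* log2 of the number of memory modules (8 here) */

/* Route a message to memory module `dest` through an omega network:
   stage s examines one bit of the destination label, starting with the
   most significant bit; 0 = upper output, 1 = lower output. */
static void route(unsigned dest)
{
    printf("to module %u:", dest);
    for (int s = STAGES - 1; s >= 0; s--) {
        unsigned bit = (dest >> s) & 1u;
        printf(" %s", bit ? "lower" : "upper");
    }
    printf("\n");
}

int main(void)
{
    route(0);   /* 000: upper upper upper */
    route(5);   /* 101: lower upper lower */
    route(7);   /* 111: lower lower lower */
    return 0;
}
```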
NUMA Multiprocessors
– To scale to more than 100 CPUs, we have to give up
uniform memory access time.
– This leads to the idea of NUMA (NonUniform
Memory Access) multiprocessors.
• They share a single address space across all the CPUs, but
unlike UMA machines local access is faster than remote
access.
• All UMA programs run without change on NUMA
machines, but the performance is worse.
– When the access time to remote memory is not hidden (because there is no caching), the system is called NC-NUMA.
– When coherent caches are present, the system is called CC-
NUMA.
– It is also sometimes known as hardware DSM since it is
basically the same as software distributed shared memory but
implemented by the hardware using a small page size.
• One of the first NC-NUMA machines was the Carnegie
Mellon Cm*.
– This system was implemented with LSI-11 CPUs (the LSI-11 was
a single-chip version of the DEC PDP-11).
– A program running out of remote memory took ten times as long
as one using local memory.
– Note that there is no caching in this type of system so there is no
need for cache coherence protocols.
Cache Coherent NUMA Multiprocessors
– Not having a cache is a major handicap.
– One of the most popular approaches to building
large CC-NUMA (Cache Coherent NUMA)
multiprocessors currently is the directory-
based multiprocessor.
• Maintain a database telling where each cache line is
and what its status is.
• The database is kept in special-purpose hardware that responds in a fraction of a bus cycle.
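As a minimal sketch of such a directory (the structure, field names, and sizes are assumptions, not any particular machine's format), each line of a node's memory gets an entry recording its state and a bitmap of the nodes holding a copy, so a write can invalidate exactly those copies:

```c
#include <stdint.h>
#include <stdio.h>

/* One directory entry per cache line of this node's local memory. */
typedef enum { UNCACHED, SHARED_LINE, MODIFIED_LINE } dstate_t;

typedef struct {
    dstate_t state;
    uint64_t copies;      /* bitmap: bit i set => node i holds a copy */
} dir_entry_t;

#define LINES 1024
static dir_entry_t directory[LINES];

/* A remote node asks to read line `line`. */
static void handle_read(unsigned line, unsigned node)
{
    dir_entry_t *e = &directory[line];
    if (e->state == MODIFIED_LINE)
        printf("fetch dirty copy of line %u back from owner, write to memory\n", line);
    e->copies |= 1ull << node;
    e->state = SHARED_LINE;
}

/* A remote node asks to write line `line`: invalidate all other copies. */
static void handle_write(unsigned line, unsigned node)
{
    dir_entry_t *e = &directory[line];
    for (unsigned i = 0; i < 64; i++)
        if (((e->copies >> i) & 1ull) && i != node)
            printf("send invalidate for line %u to node %u\n", line, i);
    e->copies = 1ull << node;
    e->state = MODIFIED_LINE;
}

int main(void)
{
    handle_read(7, 2);    /* node 2 caches line 7            */
    handle_read(7, 5);    /* node 5 caches line 7 too        */
    handle_write(7, 5);   /* node 5 writes: node 2 gets an invalidate */
    return 0;
}
```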
DASH Multiprocessor
– The first directory-based CC-NUMA
multiprocessor, DASH (Directory Architecture for
SHared Memory), was built at Stanford University
as a research project.
• It has heavily influenced a number of commercial products, such as the SGI Origin 2000.
• The prototype consists of 16 clusters, each one containing
a bus, four MIPS R3000 CPUs, 16 MB of global memory,
and some I/O equipment.
• Each CPU snoops on its local bus, but not on any other
buses, so global coherence needs a different mechanism.
– Each cluster has a directory that keeps track of which
clusters currently have copies of its lines.
– Each cluster in DASH is connected to an interface that
allows the cluster to communicate with other clusters.
• The interfaces are connected in a rectangular grid.
• A cache line can be in one of three states:
– UNCACHED
– SHARED
– MODIFIED
• The DASH protocols are based on ownership and
invalidation.
• At every instant each cache line has a unique owner.
– For UNCACHED or SHARED lines, the line’s home
cluster is the owner
– For MODIFIED lines, the cluster holding the one and
only copy is the owner.
• Requests for a cache line work their way out from the cluster to the global network.
• Maintaining memory consistency in DASH is fairly
complex and slow.
• A single memory access may require a substantial
number of packets to be sent.
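As a small C illustration of the ownership rule only (the types and cluster numbers are assumptions; this is not the DASH hardware or its message protocol):

```c
#include <stdio.h>

typedef enum { UNCACHED, SHARED, MODIFIED } dash_state_t;

typedef struct {
    dash_state_t state;
    int home_cluster;     /* cluster whose memory the line lives in      */
    int holder_cluster;   /* cluster holding the only copy when MODIFIED */
} dash_line_t;

/* The DASH ownership rule: the home cluster owns UNCACHED and SHARED
   lines; the cluster holding the single dirty copy owns a MODIFIED line. */
static int owner(const dash_line_t *l)
{
    return (l->state == MODIFIED) ? l->holder_cluster : l->home_cluster;
}

int main(void)
{
    dash_line_t a = { SHARED,   3, -1 };
    dash_line_t b = { MODIFIED, 3,  9 };
    printf("shared line owned by cluster %d\n", owner(&a));   /* 3 */
    printf("modified line owned by cluster %d\n", owner(&b)); /* 9 */
    return 0;
}
```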
Sequent NUMA-Q Multiprocessor
– The DASH was an important project, but it was never
a commercial system.
– As an example of a commercial CC-NUMA
multiprocessor, consider the Sequent NUMA-Q
2000.
• It uses an interesting and important cache coherence
protocol called SCI (Scalable Coherent Interface).
• The NUMA-Q is based on the standard quad board sold by
Intel containing four Pentium Pro CPU chips and up to 4
GB of RAM.
– The CPU caches on each quad board are kept coherent by snooping on the local bus with the MESI protocol.
– Each quad board is extended with an IQ-Link board
plugged into a slot designed for network controllers.
• The IQ-Link primarily implements the SCI protocol.
• It holds 32 MB of cache, a directory for the cache, a
snooping interface to the local quad board bus and a
custom chip called the data pump that connects it with
other IQ-Link boards.
– It pumps data from the input side to the output side, keeping
data aimed at its node and passing other data unmodified.
– Together, all the IQ-Link boards form a ring.
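The data pump's role can be pictured with a short C sketch; the packet layout, node count, and use of recursion to model the ring hops are assumptions for illustration. Each node keeps packets addressed to it and passes everything else on, unmodified, to the next IQ-Link board.

```c
#include <stdio.h>

#define NODES 4   /* quad boards on the ring (assumed count) */

typedef struct {
    int dest;     /* destination node number */
    int payload;
} packet_t;

/* What one node's data pump does with an incoming packet:
   consume it if it is addressed here, otherwise pass it on unmodified. */
static void pump(int node, packet_t p)
{
    if (p.dest == node) {
        printf("node %d: accepted payload %d\n", node, p.payload);
    } else {
        int next = (node + 1) % NODES;
        printf("node %d: forwarding to node %d\n", node, next);
        pump(next, p);                   /* next hop around the ring */
    }
}

int main(void)
{
    packet_t p = { 2, 42 };
    pump(0, p);    /* injected at node 0, travels 0 -> 1 -> 2 */
    return 0;
}
```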
Distributed Shared Memory
– A collection of CPUs sharing a common paged
virtual address space is called DSM (Distributed
Shared Memory).
• When a CPU accesses a page in its own local RAM, the
read or write just happens without any further delay.
• If the page is in a remote memory, a page fault is
generated.
• The runtime system or OS sends a message to the node
holding the page to unmap it and send it over.
• Read-only pages may be shared.
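The fault-handling path just described can be sketched in C. This is a simulation under stated assumptions: the per-node memories, the page-ownership table, and the "message" to the owner are modeled with plain arrays and memcpy, whereas a real DSM system would use the MMU and a network message layer.

```c
#include <stdio.h>
#include <string.h>

#define NODES     2
#define PAGES     4
#define PAGE_SIZE 4096

/* Simulated per-node memory and a table saying which node owns each page. */
static char memory[NODES][PAGES][PAGE_SIZE];
static int  owner_of[PAGES] = { 0, 0, 1, 1 };

/* Handle a page fault on `node` for `page`:
   ask the current owner to unmap the page and ship it over,
   then map it locally and retry the access (retry omitted). */
static void dsm_page_fault(int node, int page)
{
    int owner = owner_of[page];
    printf("node %d: fault on page %d, owned by node %d\n", node, page, owner);

    /* "Message" to the owner: it unmaps the page and sends the contents. */
    memcpy(memory[node][page], memory[owner][page], PAGE_SIZE);
    owner_of[page] = node;             /* the page now lives here */

    printf("node %d: page %d transferred and mapped locally\n", node, page);
}

int main(void)
{
    strcpy(memory[1][2], "hello from node 1");
    dsm_page_fault(0, 2);              /* node 0 touches a remote page */
    printf("node 0 now reads: %s\n", memory[0][2]);
    return 0;
}
```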
– Pages, however, are an unnatural unit for sharing,
so other approaches have been tried.
– Linda provides processes on multiple machines
with a highly structured distributed shared memory.
• The memory is accessed through a small set of primitive
operations that can be added to existing languages such
as C and FORTRAN.
• The unifying concept behind Linda is that of an abstract
tuple space.
• Four operations are provided on tuples:
• out puts a tuple into the tuple space.
• in retrieves a tuple from the tuple space.
– Tuples are addressed by content, rather than by name.
• read is like in but it does not remove the tuple from the
tuple space.
• eval causes its parameters to be evaluated in parallel
and the resulting tuple to be deposited in the tuple space.
– Various implementations of Linda exist on
multicomputers.
• Broadcasting and directories are used for distributing the
tuples.
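The four operations can be made concrete with a toy, single-process tuple space in C. This is an illustration only: real C-Linda is a language extension that matches arbitrary tuples by content and whose eval spawns parallel evaluation; here tuples are fixed to a (tag, integer) shape, blocking is omitted, and eval is left out.

```c
#include <stdio.h>
#include <string.h>

/* Toy tuple space: tuples are fixed-shape ("tag", int) pairs. */
typedef struct { char tag[16]; int value; int used; } tuple_t;
static tuple_t space[64];

/* out: deposit a tuple into the tuple space. */
static void ts_out(const char *tag, int value)
{
    for (int i = 0; i < 64; i++)
        if (!space[i].used) {
            strncpy(space[i].tag, tag, sizeof space[i].tag - 1);
            space[i].value = value;
            space[i].used = 1;
            return;
        }
}

/* Find a tuple by its content (the tag), the way Linda addresses tuples. */
static int ts_find(const char *tag)
{
    for (int i = 0; i < 64; i++)
        if (space[i].used && strcmp(space[i].tag, tag) == 0)
            return i;
    return -1;
}

/* in: retrieve a matching tuple and remove it (blocking omitted). */
static int ts_in(const char *tag, int *value)
{
    int i = ts_find(tag);
    if (i < 0) return 0;
    *value = space[i].value;
    space[i].used = 0;
    return 1;
}

/* read: like in, but the tuple stays in the tuple space. */
static int ts_read(const char *tag, int *value)
{
    int i = ts_find(tag);
    if (i < 0) return 0;
    *value = space[i].value;
    return 1;
}

int main(void)
{
    int v;
    ts_out("count", 7);
    if (ts_read("count", &v)) printf("read:  %d\n", v);  /* tuple remains */
    if (ts_in("count", &v))   printf("in:    %d\n", v);  /* tuple removed */
    printf("again: %s\n", ts_in("count", &v) ? "found" : "gone");
    return 0;
}
```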
– Orca uses full-blown objects rather than tuples as
the unit of sharing.
– Objects consist of internal state plus operations for
changing the state.
– Each Orca operation consists of a list of (guard, block-of-statements) pairs.
• A guard is a Boolean expression that does not contain any
side effects, or the empty guard, which is simply true.
• When an operation is invoked, all of its guards are
evaluated in an unspecified order.
• If all of them are false, the invoking process is
delayed until one becomes true.
• When a guard is found that evaluates to true, the
block of statements following it is executed.
• Orca has a fork statement to create a new process on
a user-specified processor.
• Operations on shared objects are atomic and
sequentially consistent.
• Orca integrates shared data and synchronization in a
way not present in page-based DSM systems.
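Orca is its own language, so the following is only a C analogue under stated assumptions (pthreads stand in for Orca processes, and the object, guards, and function names are invented for illustration). It sketches the guarded-operation rule: block until some guard is true, then run its statement block atomically on the shared object.

```c
/* build: cc orca_sketch.c -lpthread */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* A shared "object" in the Orca sense: internal state plus operations. */
typedef struct {
    int items;                 /* internal state: items in a buffer */
    pthread_mutex_t lock;
    pthread_cond_t  changed;
} shared_obj_t;

static shared_obj_t buf = { 0, PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER };

/* Operation with guard "items > 0": block until the guard holds,
   then execute the statement block atomically. */
static int take(shared_obj_t *o)
{
    pthread_mutex_lock(&o->lock);
    while (!(o->items > 0))                 /* guard: no side effects */
        pthread_cond_wait(&o->changed, &o->lock);
    int v = o->items--;                     /* guarded statement block */
    pthread_mutex_unlock(&o->lock);
    return v;
}

/* Operation with the empty guard (always true). */
static void put(shared_obj_t *o)
{
    pthread_mutex_lock(&o->lock);
    o->items++;
    pthread_cond_broadcast(&o->changed);    /* let waiters re-check guards */
    pthread_mutex_unlock(&o->lock);
}

static void *producer(void *arg)
{
    (void)arg;
    sleep(1);           /* let the consumer block on its guard first */
    put(&buf);
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);   /* Orca would use fork */
    printf("take() returned %d after its guard became true\n", take(&buf));
    pthread_join(t, NULL);
    return 0;
}
```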