CA Lecture 13
Multiprocessors and Thread-Level Parallelism
Introduction
Characteristics of Application Domains
Symmetric Shared-Memory Architectures
Performance of Symmetric Shared-Memory Multiprocessors
Distributed Shared-Memory Architectures
Performance of Distributed Shared-Memory Multiprocessors
Synchronization
Models of Memory Consistency: An Introduction
Multithreading: Exploiting Thread-Level Parallelism within a Processor
Introduction
• Increasing demand for parallel processing
– Microprocessors are likely to remain the dominant uniprocessor technology
• Connecting multiple microprocessors together is likely to be more cost-effective than designing a custom parallel processor
– It’s unclear whether architectural innovation can be sustained indefinitely
• Multiprocessors are another way to improve parallelism
– Server and embedded applications exhibit natural parallelism that can be exploited, beyond the ILP available in desktop applications
• Challenges to architecture research and development
– Death of advances in uniprocessor architecture?
– More multiprocessor architectures have failed than succeeded
• a larger design space with more tradeoffs
Taxonomy of Parallel Architectures
Flynn Categories
• SISD (Single Instruction Single Data)
– Uniprocessors
• MISD (Multiple Instruction Single Data)
– no commercial machines of this type have been built; multiple processors on a single data stream
• SIMD (Single Instruction Multiple Data)
– same instruction executed by multiple processors using different data streams
• Each processor has its own data memory (hence multiple data)
• There’s a single instruction memory and control processor
– Simple programming model, Low overhead, Flexibility
– (Phrase reused by Intel marketing for media instructions ~ vector)
– Examples: vector architectures, Illiac-IV, CM-2
• MIMD (Multiple Instruction Multiple Data)
– Each processor fetches its own instructions and operates on its own data
– MIMD is the current winner: the major design emphasis is on machines with <= 128 processors
• Use off-the-shelf microprocessors: cost-performance advantages
• Flexible: can deliver high performance for a single application or run many tasks simultaneously
– Examples: Sun Enterprise 5000, Cray T3D, SGI Origin
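The “media instructions” mentioned above are the form of SIMD most programmers meet today. As a minimal illustrative sketch (not from the lecture), using Intel’s SSE intrinsics in C, a single add instruction operates on four float data lanes at once:

    #include <immintrin.h>   /* Intel SSE intrinsics */
    #include <stdio.h>

    int main(void) {
        /* One instruction stream, multiple data streams:
           addps adds four float lanes in parallel. */
        __m128 a   = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
        __m128 b   = _mm_set_ps(40.0f, 30.0f, 20.0f, 10.0f);
        __m128 sum = _mm_add_ps(a, b);    /* lanes: 11, 22, 33, 44 */

        float out[4];
        _mm_storeu_ps(out, sum);
        printf("%.0f %.0f %.0f %.0f\n", out[0], out[1], out[2], out[3]);
        return 0;
    }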
MIMD Class 1:
Centralized shared-memory multiprocessor
MIMD Hybrid II (Multicomputers):
Message-Passing Multiprocessor
• Data Communication Models for Multiprocessors
– shared memory: access shared address space implicitly via load and store
operations
– message-passing: done by explicitly passing messages among the processors
• can invoke software with Remote Procedure Call (RPC)
• often via library, such as MPI: Message Passing Interface
• also called "synchronous communication," since the communication itself synchronizes the two processes
• Message-Passing Multiprocessor
– The address space can consist of multiple private address spaces that are
logically disjoint and cannot be addressed by a remote processor
– The same physical address on two different processors refers to two different
locations in two different memories
• Multicomputer (cluster): can even consist of completely separate
computers connected on a LAN
– cost-effective for applications that require little or no communication
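A minimal sketch contrasting the two models in C (the MPI calls are the real library API; MPI setup and teardown are omitted): the shared-memory version communicates implicitly through an ordinary store and load, while the message-passing version communicates explicitly.

    /* Shared memory: communication is an ordinary store/load. */
    int shared_x;                               /* visible to all threads */
    void produce(void) { shared_x = 42; }       /* implicit "send"        */
    int  consume(void) { return shared_x; }     /* implicit "receive"     */

    /* Message passing: communication is explicit (MPI library). */
    #include <mpi.h>
    void exchange(int rank) {
        int x = 42;
        if (rank == 0)          /* sender names the receiver explicitly */
            MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)     /* blocking receive also synchronizes   */
            MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }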
Comparisons of Communication Models
Advantages of Shared-Memory Communication Model
• Compatibility with SMP hardware
• Ease of programming when communication patterns are complex or vary
dynamically during execution
• Ability to develop applications using the familiar SMP model, paying attention only to performance-critical accesses
• Lower communication overhead, better use of bandwidth for small items, due to
implicit communication and memory mapping to implement protection in hardware,
rather than through the I/O system
• Hardware-controlled caching to reduce the frequency of remote communication by
caching of all data, both shared and private
Advantages of Message-Passing Communication Model
• The hardware can be simpler (esp. vs. NUMA)
• Communication explicit => simpler to understand; in shared memory it can be hard
to know when communicating and when not, and how costly it is
• Explicit communication focuses programmer attention on costly aspect of parallel
computation, sometimes leading to improved structure in multiprocessor program
• Synchronization is naturally associated with sending messages, reducing the
possibility for errors introduced by incorrect synchronization
• Easier to use sender-initiated communication, which may have some advantages
in performance
Symmetric Shared-Memory Architectures
Caching in shared-memory machines
• private data: used by a single processor
– When a private item is cached, its location is migrated to the cache
– Since no other processor uses the data, the program behavior is identical to that
in a uniprocessor
• shared data: used by multiple processors
– When shared data are cached, the shared value may be replicated in multiple
caches
– advantages: reduce access latency and memory contention
– induce a new problem: cache coherence
Multiprocessor Cache Coherence Problem
• Informally:
– “Any read must return the most recent write”
– Too strict and too difficult to implement
• Better:
– “Any write must eventually be seen by a read”
– All writes are seen in proper order (“serialization”)
• Two rules to ensure this:
– “If P writes x and then P1 reads it, P’s write will be seen by P1 if the read and
write are sufficiently far apart”
– Writes to a single location are serialized: seen in one order
• Latest write will be seen
• Otherwise could see writes in illogical order
(could see older value after a newer value)
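A minimal sketch of the problem with two CPUs and write-back caches, assuming x starts at 0 in memory:

    int x = 0;     /* shared variable, initially 0 in memory */

    /* time  CPU A                    CPU B
       ----  -----------------------  --------------------------
       1     read x  (caches 0)
       2                              read x  (caches 0)
       3     write x = 1  (only A's
             cached copy is updated)
       4                              read x  -> still sees 0 (!)
       Without a coherence protocol, B's stale cached copy is
       never invalidated or updated after A's write. */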
Two Classes of Cache Coherence Protocols
•Snooping Solution (Snoopy Bus)
– Send all requests for data to all processors
– Processors snoop to see if they have a copy and respond accordingly
– Requires broadcast, since caching information is at processors
– Works well with bus (natural broadcast medium)
– Dominates for small scale machines (most of the market)
•Directory-Based Schemes (Section 6.5)
– Directory keeps track of what is being shared in a centralized place (logically)
– Distributed memory => distributed directory for scalability
(avoids bottlenecks)
– Send point-to-point requests to processors via network
– Scales better than Snooping
– Actually existed BEFORE Snooping-based schemes
Basic Snoopy Protocols
• Write strategies
– Write-through: memory is always up-to-date
– Write-back: snoop in caches to find most recent copy
• Coherence strategies
– Write Invalidate: on a write, invalidate all other cached copies
– Write Update: on a write, broadcast the new value to all other cached copies
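A minimal sketch of the two write strategies in C (the line_t type and the memory_write helper are illustrative stubs, not lecture material):

    typedef struct { unsigned addr; int data; int dirty; } line_t;
    void memory_write(unsigned addr, int val);   /* backing store (stub) */

    /* Write-through: memory is updated on every store, so it is
       always up-to-date. */
    void store_write_through(line_t *l, int val) {
        l->data = val;
        memory_write(l->addr, val);
    }

    /* Write-back: only the cached copy is updated; the line is marked
       dirty, and a snoop must locate this most recent copy. */
    void store_write_back(line_t *l, int val) {
        l->data = val;
        l->dirty = 1;
    }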
An Example Snoopy Protocol
Invalidation protocol, write-back cache
• Each cache block is in one state (track these):
– Shared : block can be read
– OR Exclusive : the cache has the only copy; it is writeable and dirty
– OR Invalid : block contains no data
– an extra state bit (shared/exclusive) associated with a valid bit and a
dirty bit for each block
Placing a write miss on the bus when a write hits in the Shared state ensures an exclusive copy (the data itself is not transferred)
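A minimal sketch of the CPU-side transitions of this three-state (MSI-style) protocol; the bus-side transitions are omitted and the names are illustrative:

    typedef enum { INVALID, SHARED, EXCLUSIVE } cache_state_t;

    cache_state_t cpu_write(cache_state_t s) {
        switch (s) {
        case INVALID:    /* write miss: place write miss on bus        */
        case SHARED:     /* write hit in Shared: place write miss on
                            bus to invalidate other copies (no data
                            is transferred)                            */
            return EXCLUSIVE;
        case EXCLUSIVE:  /* write hit: no bus traffic needed           */
            return EXCLUSIVE;
        }
        return INVALID;
    }

    cache_state_t cpu_read(cache_state_t s) {
        if (s == INVALID)    /* read miss: place read miss on bus */
            return SHARED;
        return s;            /* read hit in Shared or Exclusive   */
    }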
Figure 6.11 State Transitions for Each Cache Block
• Requests from CPU: read/write hits and misses to the block; may place a read/write miss on the bus
• Requests from bus: may receive a read/write miss from the bus
Cache Coherence State Diagram
6.5 Distributed Shared-Memory Architectures
Distributed shared-memory architectures
• Separate memory per processor
– Local or remote access via memory controller
– The physical address space is statically distributed
Coherence Problems
• Simple approach: uncacheable
– shared data are marked as uncacheable and only private data are kept in
caches
– very long latency to access memory for shared data
• Alternative: a directory for memory blocks
– A directory alongside each memory tracks the state of every block in every cache
• which caches have copies of the memory block, dirty vs. clean, ...
– Two additional complications
• The interconnect cannot be used as a single point of arbitration like the bus
• Because the interconnect is message oriented, many messages must have
explicit responses
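A minimal sketch of one directory entry, assuming at most 64 caches and a bit-vector representation of the Sharers set (a common implementation choice, not mandated by the lecture):

    #include <stdint.h>

    typedef enum { UNCACHED, DIR_SHARED, DIR_EXCLUSIVE } dir_state_t;

    typedef struct {
        dir_state_t state;    /* Uncached / Shared / Exclusive           */
        uint64_t    sharers;  /* bit i set => cache i holds a copy; in   */
                              /* Exclusive, the single set bit = owner   */
    } dir_entry_t;            /* one entry per block, at its home memory */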
Distributed Directory Multiprocessor
• Keep it simple(r):
– Writes to non-exclusive data => write miss
– Processor blocks until access completes
– Assume messages received and acted upon in order sent
Messages for Directory Protocols
State Transition Diagram for the Directory
[Figure 6.29: state transition diagram for an individual cache block]
Directory Operations: Requests and Actions (cont.)
• Block is Exclusive: current value of the block is held in the cache
of the processor identified by the set Sharers (the owner) => three
possible directory requests:
– Read miss: the owner processor is sent a data fetch message, causing the block in the owner’s cache to transition to Shared and the owner to send the data to the directory, where it is written to memory and sent back to the requesting processor. The identity of the requesting processor is added to the set Sharers, which still contains the identity of the former owner (since it still has a readable copy). The state becomes Shared.
– Data write-back: the owner processor is replacing the block and hence must write it back, making the memory copy up-to-date (the home directory essentially becomes the owner); the block is now Uncached and the Sharers set is empty.
– Write miss: the block has a new owner. A message is sent to the old owner, causing its cache to send the value of the block to the directory, from which it is sent to the requesting processor, which becomes the new owner. Sharers is set to the identity of the new owner, and the state of the block is made Exclusive.
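A minimal sketch of these three cases at the home directory, reusing the dir_entry_t sketched earlier; the message-send helpers and owner_of are hypothetical stubs:

    enum { READ_MISS, DATA_WRITE_BACK, WRITE_MISS };

    int  owner_of(uint64_t sharers);                 /* index of single set bit */
    void send_fetch(int node, int block);            /* fetch; owner -> Shared  */
    void send_fetch_invalidate(int node, int block); /* fetch and invalidate    */
    void send_data(int node, int block);             /* data reply to requester */

    void directory_exclusive(dir_entry_t *e, int req, int requester, int block) {
        int owner = owner_of(e->sharers);
        switch (req) {
        case READ_MISS:                       /* owner keeps a readable copy */
            send_fetch(owner, block);         /* data also written to memory */
            send_data(requester, block);
            e->sharers |= (uint64_t)1 << requester;
            e->state = DIR_SHARED;
            break;
        case DATA_WRITE_BACK:                 /* home becomes the owner      */
            e->sharers = 0;
            e->state = UNCACHED;
            break;
        case WRITE_MISS:                      /* ownership moves             */
            send_fetch_invalidate(owner, block);
            send_data(requester, block);
            e->sharers = (uint64_t)1 << requester;
            e->state = DIR_EXCLUSIVE;
            break;
        }
    }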