CA Lecture 13
Multiprocessors and Thread-Level Parallelism
Introduction
Characteristics of Application Domains
Symmetric Shared-Memory Architectures
Performance of Symmetric Shared-Memory Multiprocessors
Distributed Shared-Memory Architectures
Performance of Distributed Shared-Memory Multiprocessors
Synchronization
Models of Memory Consistency: An Introduction
Multithreading: Exploiting Thread-Level Parallelism within a Processor
Introduction
• Increasing demand for parallel processing
– Microprocessors are likely to remain the dominant uniprocessor technology
• Connecting multiple microprocessors together is likely to be more cost-effective than designing a custom parallel processor
– It’s unclear whether architectural innovation can be sustained indefinitely
• Multiprocessors are another way to improve parallelism
– Server and embedded applications exhibit natural parallelism that can be exploited, beyond the ILP available in desktop applications
• Challenges to architecture research and development
– Death of advances in uniprocessor architecture?
– More multiprocessor architectures have failed than succeeded
• a larger design space with more tradeoffs
Taxonomy of Parallel Architectures
Flynn Categories
• SISD (Single Instruction Single Data)
– Uniprocessors
• MISD (Multiple Instruction Single Data)
– no commercial machines of this type have been built; multiple processors on a single data stream
• SIMD (Single Instruction Multiple Data)
– same instruction executed by multiple processors using different data streams
• Each processor has its own data memory (hence multiple data)
• There’s a single instruction memory and control processor
– Simple programming model, Low overhead, Flexibility
– (Phrase reused by Intel marketing for media instructions ~ vector)
– Examples: vector architectures, Illiac-IV, CM-2
• MIMD (Multiple Instruction Multiple Data)
– Each processor fetches its own instructions and operates on its own data
– MIMD is the current winner: the major design emphasis is on machines with <= 128 processors
• Use off-the-shelf microprocessors: cost-performance advantages
• Flexible: can deliver high performance for a single application or run many tasks simultaneously
– Examples: Sun Enterprise 5000, Cray T3D, SGI Origin
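The “media instructions” mentioned above are the form of SIMD most programmers meet today. As a minimal illustrative sketch (not from the lecture), using Intel’s SSE intrinsics in C, a single add instruction operates on four float data lanes at once:

    #include <immintrin.h>   /* Intel SSE intrinsics */
    #include <stdio.h>

    int main(void) {
        /* One instruction stream, multiple data streams:
           addps adds four float lanes in parallel. */
        __m128 a   = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
        __m128 b   = _mm_set_ps(40.0f, 30.0f, 20.0f, 10.0f);
        __m128 sum = _mm_add_ps(a, b);    /* lanes: 11, 22, 33, 44 */

        float out[4];
        _mm_storeu_ps(out, sum);
        printf("%.0f %.0f %.0f %.0f\n", out[0], out[1], out[2], out[3]);
        return 0;
    }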
MIMD Class 1:
Centralized shared-memory multiprocessor
MIMD Hybrid II (Multicomputers):
Message-Passing Multiprocessor
• Data Communication Models for Multiprocessors
– shared memory: access shared address space implicitly via load and store
operations
– message-passing: done by explicitly passing messages among the processors
• can invoke software with Remote Procedure Call (RPC)
• often via library, such as MPI: Message Passing Interface
• also called "synchronous communication," since the communication itself synchronizes the two processes
• Message-Passing Multiprocessor
– The address space can consist of multiple private address spaces that are
logically disjoint and cannot be addressed by a remote processor
– The same physical address on two different processors refers to two different
locations in two different memories
• Multicomputer (cluster): can even consist of completely separate
computers connected on a LAN
– cost-effective for applications that require little or no communication
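A minimal sketch contrasting the two models in C (the MPI calls are the real library API; MPI setup and teardown are omitted): the shared-memory version communicates implicitly through an ordinary store and load, while the message-passing version communicates explicitly.

    /* Shared memory: communication is an ordinary store/load. */
    int shared_x;                               /* visible to all threads */
    void produce(void) { shared_x = 42; }       /* implicit "send"        */
    int  consume(void) { return shared_x; }     /* implicit "receive"     */

    /* Message passing: communication is explicit (MPI library). */
    #include <mpi.h>
    void exchange(int rank) {
        int x = 42;
        if (rank == 0)          /* sender names the receiver explicitly */
            MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)     /* blocking receive also synchronizes   */
            MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }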
Comparisons of Communication Models
Advantages of Shared-Memory Communication Model
• Compatibility with SMP hardware
• Ease of programming when communication patterns are complex or vary
dynamically during execution
• Ability to develop applications using the familiar SMP model, paying attention only to performance-critical accesses
• Lower communication overhead, better use of bandwidth for small items, due to
implicit communication and memory mapping to implement protection in hardware,
rather than through the I/O system
• Hardware-controlled caching to reduce the frequency of remote communication by
caching of all data, both shared and private
Advantages of Message-Passing Communication Model
• The hardware can be simpler (esp. vs. NUMA)
• Communication explicit => simpler to understand; in shared memory it can be hard
to know when communicating and when not, and how costly it is
• Explicit communication focuses programmer attention on costly aspect of parallel
computation, sometimes leading to improved structure in multiprocessor program
• Synchronization is naturally associated with sending messages, reducing the
possibility for errors introduced by incorrect synchronization
• Easier to use sender-initiated communication, which may have some advantages
in performance
Symmetric Shared-Memory Architectures
Caching in shared-memory machines
• private data: used by a single processor
– When a private item is cached, its location is migrated to the cache
– Since no other processor uses the data, the program behavior is identical to that
in a uniprocessor
• shared data: used by multiple processors
– When shared data are cached, the shared value may be replicated in multiple
caches
– advantages: reduce access latency and memory contention
– induce a new problem: cache coherence
Multiprocessor Cache Coherence Problem
• Informally:
– “Any read must return the most recent write”
– Too strict and too difficult to implement
• Better:
– “Any write must eventually be seen by a read”
– All writes are seen in proper order (“serialization”)
• Two rules to ensure this:
– “If P writes x and then P1 reads it, P’s write will be seen by P1 if the read and
write are sufficiently far apart”
– Writes to a single location are serialized: seen in one order
• Latest write will be seen
• Otherwise could see writes in illogical order
(could see older value after a newer value)
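A minimal sketch of the problem with two CPUs and write-back caches, assuming x starts at 0 in memory:

    int x = 0;     /* shared variable, initially 0 in memory */

    /* time  CPU A                    CPU B
       ----  -----------------------  --------------------------
       1     read x  (caches 0)
       2                              read x  (caches 0)
       3     write x = 1  (only A's
             cached copy is updated)
       4                              read x  -> still sees 0 (!)
       Without a coherence protocol, B's stale cached copy is
       never invalidated or updated after A's write. */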
Two Classes of Cache Coherence Protocols
•Snooping Solution (Snoopy Bus)
– Send all requests for data to all processors
– Processors snoop to see if they have a copy and respond accordingly
– Requires broadcast, since caching information is at processors
– Works well with bus (natural broadcast medium)
– Dominates for small scale machines (most of the market)
•Directory-Based Schemes (Section 6.5)
– Directory keeps track of what is being shared in a centralized place (logically)
– Distributed memory => distributed directory for scalability
(avoids bottlenecks)
– Send point-to-point requests to processors via network
– Scales better than Snooping
– Actually existed BEFORE Snooping-based schemes
Basic Snoopy Protocols
• Write strategies
– Write-through: memory is always up-to-date
– Write-back: snoop in caches to find most recent copy
• Coherence strategies
– Write Invalidate: on a write, invalidate all other cached copies
– Write Update: on a write, broadcast the new value to all other cached copies
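A minimal sketch of the two write strategies in C (the line_t type and the memory_write helper are illustrative stubs, not lecture material):

    typedef struct { unsigned addr; int data; int dirty; } line_t;
    void memory_write(unsigned addr, int val);   /* backing store (stub) */

    /* Write-through: memory is updated on every store, so it is
       always up-to-date. */
    void store_write_through(line_t *l, int val) {
        l->data = val;
        memory_write(l->addr, val);
    }

    /* Write-back: only the cached copy is updated; the line is marked
       dirty, and a snoop must locate this most recent copy. */
    void store_write_back(line_t *l, int val) {
        l->data = val;
        l->dirty = 1;
    }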
An Example Snoopy Protocol
Invalidation protocol, write-back cache
• Each cache block is in one state (track these):
– Shared : block can be read
– OR Exclusive : the cache has the only copy; it is writeable and dirty
– OR Invalid : block contains no data
– an extra state bit (shared/exclusive) associated with a valid bit and a
dirty bit for each block
Placing a write miss on the bus when a write hits in the Shared state ensures an exclusive copy (the data itself is not transferred)
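A minimal sketch of the CPU-side transitions of this three-state (MSI-style) protocol; the bus-side transitions are omitted and the names are illustrative:

    typedef enum { INVALID, SHARED, EXCLUSIVE } cache_state_t;

    cache_state_t cpu_write(cache_state_t s) {
        switch (s) {
        case INVALID:    /* write miss: place write miss on bus        */
        case SHARED:     /* write hit in Shared: place write miss on
                            bus to invalidate other copies (no data
                            is transferred)                            */
            return EXCLUSIVE;
        case EXCLUSIVE:  /* write hit: no bus traffic needed           */
            return EXCLUSIVE;
        }
        return INVALID;
    }

    cache_state_t cpu_read(cache_state_t s) {
        if (s == INVALID)    /* read miss: place read miss on bus */
            return SHARED;
        return s;            /* read hit in Shared or Exclusive   */
    }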
Figure 6.11 State Transitions for Each Cache Block
• Requests from CPU: read/write hits and misses to the block; may place a read/write miss on the bus
• Requests from bus: may receive a read/write miss from the bus
Cache Coherence State Diagram
6.5 Distributed Shared-Memory Architectures
Distributed shared-memory architectures
• Separate memory per processor
– Local or remote access via memory controller
– The physical address space is statically distributed
Coherence Problems
• Simple approach: uncacheable
– shared data are marked as uncacheable and only private data are kept in
caches
– very long latency to access memory for shared data
• Alternative: a directory for memory blocks
– A directory alongside each memory tracks the state of every block in every cache
• which caches have copies of the memory block, dirty vs. clean, ...
– Two additional complications
• The interconnect cannot be used as a single point of arbitration like the bus
• Because the interconnect is message oriented, many messages must have
explicit responses
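A minimal sketch of one directory entry, assuming at most 64 caches and a bit-vector representation of the Sharers set (a common implementation choice, not mandated by the lecture):

    #include <stdint.h>

    typedef enum { UNCACHED, DIR_SHARED, DIR_EXCLUSIVE } dir_state_t;

    typedef struct {
        dir_state_t state;    /* Uncached / Shared / Exclusive           */
        uint64_t    sharers;  /* bit i set => cache i holds a copy; in   */
                              /* Exclusive, the single set bit = owner   */
    } dir_entry_t;            /* one entry per block, at its home memory */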
Distributed Directory Multiprocessor
• Keep it simple(r):
– Writes to non-exclusive data => write miss
– Processor blocks until access completes
– Assume messages received and acted upon in order sent
Messages for Directory Protocols
State Transition Diagram for the Directory
[Figure 6.29: state transition diagram for an individual cache block]
Directory Operations: Requests and Actions (cont.)
• Block is Exclusive: current value of the block is held in the cache
of the processor identified by the set Sharers (the owner) => three
possible directory requests:
– Read miss: the owner processor is sent a data fetch message, causing the block in the owner’s cache to transition to Shared and the owner to send the data to the directory, where it is written to memory and sent back to the requesting processor. The identity of the requesting processor is added to the set Sharers, which still contains the identity of the former owner (since it still has a readable copy). The state becomes Shared.
– Data write-back: the owner processor is replacing the block and hence must write it back, making the memory copy up-to-date (the home directory essentially becomes the owner); the block is now Uncached and the Sharers set is empty.
– Write miss: the block has a new owner. A message is sent to the old owner, causing its cache to send the value of the block to the directory, from which it is sent to the requesting processor, which becomes the new owner. Sharers is set to the identity of the new owner, and the state of the block is made Exclusive.
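A minimal sketch of these three cases at the home directory, reusing the dir_entry_t sketched earlier; the message-send helpers and owner_of are hypothetical stubs:

    enum { READ_MISS, DATA_WRITE_BACK, WRITE_MISS };

    int  owner_of(uint64_t sharers);                 /* index of single set bit */
    void send_fetch(int node, int block);            /* fetch; owner -> Shared  */
    void send_fetch_invalidate(int node, int block); /* fetch and invalidate    */
    void send_data(int node, int block);             /* data reply to requester */

    void directory_exclusive(dir_entry_t *e, int req, int requester, int block) {
        int owner = owner_of(e->sharers);
        switch (req) {
        case READ_MISS:                       /* owner keeps a readable copy */
            send_fetch(owner, block);         /* data also written to memory */
            send_data(requester, block);
            e->sharers |= (uint64_t)1 << requester;
            e->state = DIR_SHARED;
            break;
        case DATA_WRITE_BACK:                 /* home becomes the owner      */
            e->sharers = 0;
            e->state = UNCACHED;
            break;
        case WRITE_MISS:                      /* ownership moves             */
            send_fetch_invalidate(owner, block);
            send_data(requester, block);
            e->sharers = (uint64_t)1 << requester;
            e->state = DIR_EXCLUSIVE;
            break;
        }
    }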