MODULE 4
THREAD LEVEL PARALLELISM
Multiprocessing
• Factors driving the move to multiprocessing:
• The lower efficiencies in silicon and energy use encountered when pushing for more ILP; other than ILP, the only scalable and general-purpose way to increase performance faster than the basic technology allows (from a switching perspective) is through multiprocessing.
• A growing interest in high-end servers as cloud computing grows in importance.
• A growth in data-intensive applications driven by the availability of massive amounts of data on the Internet.
• Highly compute- and data-intensive applications are increasingly run in the cloud.
• An improved understanding of how to use multiprocessors effectively, especially in server environments.
• The advantage of leveraging a design investment by replication rather than unique design.

Thread Level Parallelism
• TLP implies the existence of multiple program counters and is therefore exploited primarily through MIMDs.
• Multiprocessors are defined as computers consisting of tightly coupled processors whose coordination and usage are typically controlled by a single operating system and that share memory through a shared address space.
• Such systems exploit thread-level parallelism through two different software models:
(a) Execution of a tightly coupled set of threads collaborating on a single task, typically called parallel processing.
(b) Execution of multiple, relatively independent processes that may originate from one or more users, which is a form of request-level parallelism.
• Request-level parallelism may be exploited by a single application running on multiple processors, such as a database responding to queries, or by multiple applications running independently, often called multiprogramming.

Multiprocessor Architecture: Issues and Approach
• To take advantage of an MIMD multiprocessor with n processors, we must usually have at least n threads or processes to execute (a short code sketch illustrating this follows below).
• Threads can also be used to exploit data-level parallelism, although the overhead is usually higher.
• This overhead means that grain size must be sufficiently large to exploit the parallelism efficiently; if the grain size is too small, the overhead can make exploiting the parallelism prohibitively expensive in an MIMD.
• Existing shared-memory multiprocessors fall into two classes, depending on the number of processors involved:
(a) Symmetric (shared-memory) multiprocessors (SMPs), or centralized shared-memory multiprocessors (UMA): small to moderate numbers of cores, typically 32 or fewer, where all processors have equal access to a single centralized memory.
(b) Distributed shared memory (DSM), with non-uniform memory access (NUMA): to support larger processor counts, memory must be distributed among the processors rather than centralized.

Centralized Shared-Memory Multiprocessor Architecture
Distributed Shared Memory
• The term shared memory associated with both SMP and DSM refers to the fact that the address space is shared.
• In contrast, clusters and warehouse-scale computers look like individual computers connected by a network, and the memory of one processor cannot be accessed by another processor without the assistance of software protocols.

Challenges of Parallel Processing
• Limited parallelism available in programs.
• Relatively high cost of communications: the large latency of remote access in a parallel processor.
• The problem of inadequate application parallelism must be attacked primarily in software with new algorithms that offer better parallel performance.
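Returning to the point that an MIMD machine with n processors needs at least n threads to be kept busy, here is a minimal sketch (not from the slides) that exploits thread-level parallelism with POSIX threads; the processor count, names, and work division are illustrative assumptions.

/* Sketch: one worker thread per assumed processor, each summing a
 * disjoint chunk of an array; the only sharing is the final reduce. */
#include <pthread.h>
#include <stdio.h>

#define NPROC  4            /* assume an MIMD machine with 4 processors */
#define N      1000000

static double data[N];

struct chunk { int lo, hi; double partial; };

static void *worker(void *arg)
{
    struct chunk *c = arg;
    double s = 0.0;
    for (int i = c->lo; i < c->hi; i++)   /* each thread works on its own chunk */
        s += data[i];
    c->partial = s;
    return NULL;
}

int main(void)
{
    pthread_t tid[NPROC];
    struct chunk c[NPROC];

    for (int i = 0; i < N; i++) data[i] = 1.0;

    /* need at least NPROC threads to keep NPROC processors busy */
    for (int t = 0; t < NPROC; t++) {
        c[t].lo = t * (N / NPROC);
        c[t].hi = (t + 1) * (N / NPROC);
        pthread_create(&tid[t], NULL, worker, &c[t]);
    }

    double total = 0.0;
    for (int t = 0; t < NPROC; t++) {
        pthread_join(tid[t], NULL);
        total += c[t].partial;
    }
    printf("sum = %f\n", total);
    return 0;
}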
Centralized Shared-Memory Architectures
• Key factor: multilevel caches can substantially reduce the memory bandwidth demands of a processor.
• Symmetric shared-memory machines usually support the caching of both shared and private data.
• Private data are used by a single processor, while shared data are used by multiple processors.
• When a private item is cached, its location is migrated to the cache.
• When shared data are cached, the shared value may be replicated in multiple caches.
• Caching of shared data, however, introduces a new problem: cache coherence.

Multiprocessor Cache Coherence
• 2 Aspects: Coherence & Consistency
• Coherence: the behavior of reads and writes to the same memory location.
• Consistency: the behavior of reads and writes with respect to accesses to different memory locations.
A memory system is coherent if:
1. A read by processor P to location X that follows a write by P to X, with no writes of X by another processor occurring between the write and the read by P, always returns the value written by P.
2. A read by a processor to location X that follows a write by another processor to X returns the written value if the read and write are sufficiently separated in time and no other writes to X occur between the two accesses.
3. Writes to the same location are serialized; that is, two writes to the same location by any two processors are seen in the same order by all processors. For example, if the values 1 and then 2 are written to a location, processors can never read the value of the location as 2 and then later read it as 1.
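As an informal illustration of the write-serialization property (my own sketch, not from the slides), the following C11 program uses relaxed atomics, which are sufficient to talk about per-location coherence without a data race: a reader can never observe the value 2 and then later the value 1 at the same location.

/* Two writes of 1 then 2 to the same location X; coherence guarantees all
 * observers see those writes in the same order. */
#include <stdatomic.h>
#include <pthread.h>
#include <assert.h>
#include <stdio.h>

static atomic_int x = 0;        /* the shared location "X" */

static void *writer(void *arg)
{
    (void)arg;
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    atomic_store_explicit(&x, 2, memory_order_relaxed);
    return NULL;
}

static void *reader(void *arg)
{
    (void)arg;
    int first  = atomic_load_explicit(&x, memory_order_relaxed);
    int second = atomic_load_explicit(&x, memory_order_relaxed);
    /* The reader may see 0,0 / 0,1 / 1,2 / 2,2 and so on,
     * but never 2 followed by 1. */
    assert(!(first == 2 && second == 1));
    printf("reader saw %d then %d\n", first, second);
    return NULL;
}

int main(void)
{
    pthread_t w, r;
    pthread_create(&w, NULL, writer, NULL);
    pthread_create(&r, NULL, reader, NULL);
    pthread_join(w, NULL);
    pthread_join(r, NULL);
    return 0;
}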
• Coherence and consistency are complementary: Coherence
defines the behavior of reads and writes to the same memory location, while consistency defines the behavior of reads and writes with respect to accesses to other memory locations.

Basic Schemes for Enforcing Coherence
• In a coherent multiprocessor, the caches provide both migration and replication of shared data items.
• The protocols used to maintain coherence for multiple processors are called cache coherence protocols.
• The key to implementing a cache coherence protocol is tracking the state of any sharing of a data block.
• The state of any cache block is kept using status bits associated with the block.
• There are two classes of protocols:
• (a) Directory Based
• (b) Snooping
Directory based:
• The sharing status of a particular block of physical memory is kept in one location, called the directory.
• In an SMP: a centralized directory, associated with the memory or some other single serialization point.
• In a DSM: distributed directories.
Snooping:
• Rather than keeping the state of sharing in a single directory, every cache that has a copy of the data from a block of physical memory could track the sharing status of the block.
• In an SMP, the caches are typically all accessible via some
broadcast medium (e.g., a bus connecting the per-core caches to the shared cache or memory).
• All cache controllers monitor, or snoop on, the medium to determine whether they have a copy of a block that is requested on a bus or switch access.

Snooping Coherence Protocols
• There are two ways to maintain the coherence requirement:
1) Write invalidate protocol: ensure that a processor has exclusive access to a data item before writing that item.
2) Write update (or write broadcast) protocol:
update all the cached copies of a data item when that item is written. Because a write update protocol must broadcast all writes to shared cache lines, it consumes considerably more bandwidth; for this reason, recent multiprocessors have opted to implement a write invalidate protocol.

Basic Implementation Techniques
• Implementing an invalidate protocol in a multicore requires a bus, or another broadcast medium, to perform invalidates.
• When a write to a block that is shared occurs, the writing processor must acquire bus access to broadcast its invalidation.
• If two processors attempt to write shared blocks at the same time, their attempts to broadcast an invalidate operation will be serialized when they arbitrate for the bus.
• The first processor to obtain bus access will cause any other copies of the block it is writing to be invalidated. If the processors were attempting to write the same block, the serialization enforced by the bus also serializes their writes.
• The normal cache tags can be used to implement the process of snooping, and the valid bit for each block makes invalidation easy to implement.
• To track whether or not a cache block is shared, we can add an extra state bit associated with each cache block, just as we have a valid bit and a dirty bit.
• By adding a bit indicating whether the block is shared, we can decide whether a write must generate an invalidate.
• When a write to a block in the shared state occurs, the cache generates an invalidation on the bus and marks the block as exclusive.
• The core with the sole copy of a cache block is normally called the owner of the cache block.
• When an invalidation is sent, the state of the owner's cache block is changed from shared to unshared (or exclusive).
• If another processor later requests this cache block, the state must be made shared again.

AN EXAMPLE PROTOCOL
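As a rough illustration of a write-invalidate snooping protocol with the states discussed above, here is a minimal sketch in C of an MSI-style controller for a single cache block (my own simplification; the bus functions are placeholder assumptions, not a real API).

/* Processor-side events generate bus traffic only when data or ownership
 * must be obtained; snoop-side events downgrade or invalidate the local copy. */
#include <stdio.h>

typedef enum { INVALID, SHARED, MODIFIED } blk_state_t;

/* Assumed bus primitives; every broadcast is snooped by all other caches. */
void bus_read_miss(unsigned long a)  { printf("BusRd   %lx\n", a); }  /* fetch shared copy   */
void bus_write_miss(unsigned long a) { printf("BusRdX  %lx\n", a); }  /* fetch + invalidate  */
void bus_invalidate(unsigned long a) { printf("BusInv  %lx\n", a); }  /* upgrade: invalidate */
void bus_write_back(unsigned long a) { printf("BusWB   %lx\n", a); }  /* flush dirty block   */

/* Processor-side events. */
blk_state_t cpu_read(blk_state_t s, unsigned long a)
{
    if (s == INVALID) { bus_read_miss(a); return SHARED; }
    return s;                                  /* hit in SHARED or MODIFIED   */
}

blk_state_t cpu_write(blk_state_t s, unsigned long a)
{
    if (s == INVALID)      bus_write_miss(a);  /* need data and ownership     */
    else if (s == SHARED)  bus_invalidate(a);  /* have data, need ownership   */
    return MODIFIED;                           /* this cache is now the owner */
}

/* Snoop-side events: requests from other caches for the same block. */
blk_state_t snoop_read_miss(blk_state_t s, unsigned long a)
{
    if (s == MODIFIED) bus_write_back(a);      /* supply the dirty data       */
    return (s == INVALID) ? INVALID : SHARED;  /* demote owner to sharer      */
}

blk_state_t snoop_write_miss_or_inv(blk_state_t s, unsigned long a)
{
    if (s == MODIFIED) bus_write_back(a);      /* flush before giving up      */
    return INVALID;                            /* another cache becomes owner */
}

int main(void)
{
    blk_state_t s = INVALID;
    s = cpu_read(s, 0x40);          /* Invalid  -> Shared   (BusRd)  */
    s = cpu_write(s, 0x40);         /* Shared   -> Modified (BusInv) */
    s = snoop_read_miss(s, 0x40);   /* Modified -> Shared   (write back) */
    printf("final state = %d\n", s);
    return 0;
}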
Extensions to the Basic Coherence Protocol
• The basic coherence protocol is MSI (Modified, Shared, Invalid).
• Two common extensions are MESI and MOESI.
• MESI adds the state Exclusive to the basic MSI protocol, yielding four states (Modified, Exclusive, Shared, and Invalid).
• The Exclusive state indicates that a cache block is resident in only a single cache but is clean.
• If a block is in the E state, it can be written without generating any invalidates, which optimizes the case where a block is read by a single cache before being written by that same cache.
• The Intel i7 uses a variant of the MESI protocol, called MESIF, which adds a state (Forward) to designate which sharing processor should respond to a request.
• MOESI adds the state Owned to the MESI protocol to indicate that the associated block is owned by that cache and out of date in memory.

Limitations in Symmetric Shared-Memory Multiprocessors and Snooping Protocols
• As the number of processors in a multiprocessor grows, or as the memory demands of each processor grow, any centralized resource in the system can become a bottleneck.
• Multicore chips use three different approaches:
1. The IBM Power8, which has up to 12 processors in a single multicore, uses 8 parallel buses that connect the distributed L3 caches and up to 8 separate memory channels.
2. The Xeon E7 uses three rings to connect up to 32 processors, a distributed L3 cache, and two or four memory channels (depending on the configuration).
3. The Fujitsu SPARC64 X+ uses a crossbar to connect a shared L2 to up to 16 cores and multiple memory channels.
• Techniques for increasing the snoop bandwidth:
1. The tags can be duplicated, which doubles the effective cache-level snoop bandwidth.
2. If the outermost cache on a multicore (typically L3) is shared, we can distribute that cache so that each processor has a portion of the memory and handles snoops for that portion of the address space.
3. We can place a directory at the level of the outermost shared cache (say, L3).

Performance of Symmetric Shared-Memory Multiprocessors
• In a multicore using a snooping coherence protocol, several different phenomena combine to determine performance:
• The overall cache performance is a combination of the behavior of uniprocessor cache miss traffic and the traffic caused by communication.
• The processor count, cache size, and block size can affect these two components of the miss rate in different ways, leading to overall system behavior that is a combination of the two effects.
• The misses that arise from interprocessor communication are often called coherence misses.
• True sharing misses arise from the communication of data through the cache coherence mechanism.
• False sharing arises from the use of an invalidation-based coherence algorithm with a single valid bit per cache block: if the word being written and the word being read are different, the invalidation does not cause a new value to be communicated but only causes an extra cache miss, a false sharing miss.
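To make false sharing concrete, here is a small sketch (my own example, not from the slides); the 64-byte cache-line size and the padding layout are assumptions.

/* Two threads update *different* words that happen to sit in the same cache
 * block, so each write invalidates the other core's copy even though no data
 * is truly shared; padding places the counters in separate blocks. */
#include <pthread.h>
#include <stdalign.h>
#include <stdio.h>

#define ITERS 50000000L

struct counters {
    long a;                        /* written only by thread 1 */
    long b;                        /* written only by thread 2, same block as a */
};

struct padded_counters {
    long a;
    char pad[64 - sizeof(long)];   /* push b into a different cache block */
    long b;
};

static struct counters shared_line;
static alignas(64) struct padded_counters separate_lines;

static void *bump(void *p)
{
    long *x = p;
    for (long i = 0; i < ITERS; i++) (*x)++;
    return NULL;
}

static void run(long *pa, long *pb, const char *label)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump, pa);
    pthread_create(&t2, NULL, bump, pb);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("%s: a=%ld b=%ld\n", label, *pa, *pb);
}

int main(void)
{
    run(&shared_line.a,    &shared_line.b,    "false sharing");
    run(&separate_lines.a, &separate_lines.b, "padded       ");
    return 0;
}

With the unpadded layout, each increment typically invalidates the other core's copy of the block even though the two threads never touch the same word; padding the structure removes that coherence traffic, which is usually visible as a large runtime difference.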
A Commercial Workload
• Consider the memory system behavior of a 4-processor shared-memory multiprocessor running an online transaction processing (OLTP) workload.
• The workload consists of a set of client processes that generate requests and a set of servers that handle them.
• The server processes consume 85% of the user time, with the remainder going to the clients.
• Although the I/O latency is hidden by careful tuning and enough requests to keep the processor busy, the server processes typically block for I/O after about 25,000 instructions.
• Overall, 71% of the execution time is spent in user mode, 18% in the operating system, and 11% idle, primarily waiting for I/O.
• Of the commercial applications studied, the OLTP application stresses the memory system the hardest and shows significant challenges even when evaluated with much larger L3 caches.
• Increasing the block size from 32 to 256 bytes affects four of the miss rate components:
• The true sharing miss rate decreases by more than a factor of 2, indicating some locality in the true sharing patterns.
• The compulsory miss rate significantly decreases, as we would expect.
• The conflict/capacity misses show a small decrease (a factor of 1.26, compared to a factor of 8 increase in block size), indicating that spatial locality is not high in the uniprocessor misses that occur with L3 caches larger than 2 MiB.
• The false sharing miss rate, although small in absolute terms, nearly doubles.

Distributed Shared-Memory and Directory-Based Coherence
The alternative to a snooping-based coherence protocol is a directory protocol.
The directory must be distributed, but the distribution must be
done in such a way that the coherence protocol knows where to find the directory information for any cached block of memory.
A distributed directory retains the characteristic that the
sharing status of a block is always in a single known location.

Directory-Based Cache Coherence Protocols: The Basics
• There are two primary operations that a directory protocol must implement: handling a read miss and handling a write to a shared, clean cache block.
• The directory must also track the state of each potentially shared memory block; these states could be the following:
• Shared: one or more nodes have the block cached, and the value in memory is up to date (as well as in all the caches).
• Uncached: no node has a copy of the cache block.
• Modified: exactly one node has a copy of the cache block, and it has written the block, so the memory copy is out of date. The processor is called the owner of the block.
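To sketch how this bookkeeping might look (a rough illustration of my own; the node count, block size, message names, and home-node mapping are all assumptions), each distributed directory entry can pair one of these states with a bit vector of sharers.

/* Directory-entry bookkeeping for read and write misses in a DSM machine. */
#include <stdint.h>
#include <stdio.h>

#define NODES       8
#define BLOCK_BITS  6                   /* assume 64-byte blocks */

typedef enum { DIR_UNCACHED, DIR_SHARED, DIR_MODIFIED } dir_state_t;

typedef struct {
    dir_state_t state;
    uint32_t    sharers;                /* bit i set => node i has a copy */
} dir_entry_t;

/* The directory is distributed: each block has a fixed "home" node that
 * holds its directory entry, e.g. by interleaving blocks across nodes. */
static int home_node(uint64_t addr) { return (int)((addr >> BLOCK_BITS) % NODES); }

/* Directory action on a read miss from `node`. */
static void dir_read_miss(dir_entry_t *e, int node)
{
    if (e->state == DIR_MODIFIED)
        printf("fetch dirty copy from the owner and write it back home\n");
    e->sharers |= 1u << node;           /* requester becomes a sharer        */
    e->state    = DIR_SHARED;           /* memory copy is (now) up to date   */
}

/* Directory action on a write miss (or upgrade) from `node`. */
static void dir_write_miss(dir_entry_t *e, int node)
{
    uint32_t others = e->sharers & ~(1u << node);
    for (int i = 0; i < NODES; i++)     /* invalidate every other copy       */
        if (others & (1u << i))
            printf("invalidate node %d\n", i);
    if (e->state == DIR_MODIFIED && others)
        printf("fetch the dirty copy from the old owner for the requester\n");
    e->sharers = 1u << node;            /* requester is the sole copy        */
    e->state   = DIR_MODIFIED;          /* memory copy is now out of date    */
}

int main(void)
{
    dir_entry_t e = { DIR_UNCACHED, 0 };
    printf("home of block 0x1040 = node %d\n", home_node(0x1040));
    dir_read_miss(&e, 2);               /* Uncached -> Shared, sharers = {2}     */
    dir_read_miss(&e, 5);               /* Shared, sharers = {2, 5}              */
    dir_write_miss(&e, 5);              /* invalidate node 2, Modified, owner 5  */
    return 0;
}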