MODULE 4

THREAD LEVEL PARALLELISM


Multiprocessing
• Factors driving the move to multiprocessing:
 Lower efficiencies in silicon and energy use when pushing ILP
further.
 Other than ILP, the only scalable and general-purpose way to
increase performance faster than the basic technology allows
(from a switching perspective) is through multiprocessing.
 A growing interest in high-end servers as cloud computing
grows in importance.
• A growth in data-intensive applications driven by the availability of
massive amounts of data on the Internet.
• Highly compute- and data-intensive applications are being done on the
cloud.
• An improved understanding of how to use multiprocessors effectively,
especially in server environments.
• The advantages of leveraging a design investment by replication rather
than unique design.
Thread Level Parallelism

• TLP implies the existence of multiple program counters and
thus is exploited primarily through MIMDs.
• Multiprocessors are defined as computers consisting of tightly
coupled processors whose coordination and usage are typically
controlled by a single operating system and that share memory
through a shared address space.
• Such systems exploit thread-level parallelism through two
different software models.
(a) Execution of a tightly coupled set of threads collaborating
on a single task, which is typically called parallel processing.
(b) Execution of multiple, relatively independent processes
that may originate from one or more users, which is a form of
request-level parallelism.
• Request-level parallelism may be exploited by a single
application running on multiple processors, such as a database
responding to queries, or by multiple applications running
independently, often called multiprogramming.
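• A minimal C++ sketch of the first model, parallel processing: tightly coupled
threads collaborate on a single task (here, summing an array), each thread with
its own program counter. The thread count and problem size are illustrative
assumptions; request-level parallelism would instead correspond to independent
processes or applications.

// Sketch of parallel processing: tightly coupled threads cooperating on a
// single task. Each std::thread runs with its own program counter, which is
// the property TLP/MIMD execution exploits.
#include <algorithm>
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    const std::size_t n = 1 << 20;                 // assumed problem size
    const unsigned nthreads =
        std::max(1u, std::thread::hardware_concurrency());
    std::vector<int> data(n, 1);
    std::vector<long long> partial(nthreads, 0);

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nthreads; ++t) {
        workers.emplace_back([&, t] {
            std::size_t lo = t * n / nthreads;     // this thread's slice
            std::size_t hi = (t + 1) * n / nthreads;
            partial[t] = std::accumulate(data.begin() + lo,
                                         data.begin() + hi, 0LL);
        });
    }
    for (auto& w : workers) w.join();              // wait for all threads

    std::cout << "sum = "
              << std::accumulate(partial.begin(), partial.end(), 0LL)
              << '\n';                             // expect n
}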
Multiprocessor Architecture: Issues and
Approach
• To take advantage of an MIMD multiprocessor with n processors, we must
usually have at least n threads or processes to execute.
• Threads can also be used to exploit data-level parallelism, although the
overhead is usually higher.
• This overhead means that grain size must be sufficiently large to exploit
the parallelism efficiently.
• If the grain size is too small, this overhead can make the exploitation
of the parallelism prohibitively expensive in an MIMD.
• Existing shared-memory multiprocessors fall into two classes,
depending on the number of processors involved:
(a) Symmetric (shared-memory) multiprocessors (SMPs), also called
centralized shared-memory multiprocessors (uniform memory
access, UMA): small to moderate numbers of cores, typically 32
or fewer; all processors have equal access to memory.
(b) Distributed shared memory (DSM), with non-uniform memory
access (NUMA).
• To support larger processor counts, memory must be
distributed among the processors rather than centralized.
Centralized Shared-Memory Multiprocessor Architecture
Distributed Shared Memory
• The term shared memory associated with both SMP and DSM
refers to the fact that the address space is shared.
• In contrast, clusters and warehouse-scale computers look
like individual computers connected by a network, and the
memory of one processor cannot be accessed by another
processor without the assistance of software protocols.
Challenges of Parallel Processing
• Limited parallelism available in programs.
• The relatively high cost of communication: in particular, the large
latency of remote accesses in a parallel processor.
• The problem of inadequate application parallelism must be
attacked primarily in software with new algorithms that offer
better parallel performance.
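• The limited-parallelism challenge is commonly quantified with Amdahl's law;
the short sketch below computes the achievable speedup for an assumed parallel
fraction and processor count (the example numbers are illustrative only).

// Amdahl's law: speedup = 1 / ((1 - f) + f / p), where f is the fraction of
// execution that can run in parallel and p is the number of processors.
#include <cstdio>

double amdahl_speedup(double f, int p) {
    return 1.0 / ((1.0 - f) + f / p);
}

int main() {
    // Even with 100 processors, a 5% sequential fraction caps speedup near 17x.
    std::printf("f = 0.95, p = 100 -> speedup %.1f\n", amdahl_speedup(0.95, 100));
    std::printf("f = 0.99, p = 100 -> speedup %.1f\n", amdahl_speedup(0.99, 100));
}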
Centralized Shared-Memory Architectures

• Key factor: multilevel caches can substantially reduce the
memory bandwidth demands of a processor.
• Symmetric shared-memory machines usually support the caching
of both shared and private data.
• Private data are used by a single processor, while shared data are
used by multiple processors.
• When a private item is cached, its location is migrated to the
cache.
• When shared data are cached, the shared value may be
replicated in multiple caches.
• Caching of shared data, however, introduces a new problem:
cache coherence.
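• The coherence problem can be pictured with the small software model below
(illustrative only; it models each private cache as a plain local copy, whereas
real coherent hardware would prevent the stale read).

// Software model of the cache coherence problem (illustrative; real coherent
// hardware does not behave this way). Each core holds a private cached copy
// of location X; with write-back caches and no coherence protocol, a write by
// core A leaves core B's copy stale.
#include <iostream>

int main() {
    int memory_X = 0;           // memory location X
    int cacheA_X = memory_X;    // core A reads X into its cache
    int cacheB_X = memory_X;    // core B reads X into its cache

    cacheA_X = 1;               // core A writes X (write-back: memory unchanged)

    // Core B re-reads X and hits in its own cache: it still sees 0, not the
    // most recently written value.
    std::cout << "core B reads X = " << cacheB_X << " (stale)\n"
              << "memory holds X = " << memory_X << '\n';
}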
Multiprocessor Cache Coherence

• Two aspects: coherence and consistency.
• Coherence: the behavior of reads and writes to the same memory
location.
• Consistency: the behavior of reads and writes with respect to
accesses to different memory locations.
A memory system is coherent if:
1. A read by processor P to location X that follows a write by P to X,
with no writes of X by another processor occurring between the write
and the read by P, always returns the value written by P.
2. A read by a processor to location X that follows a write by another
processor to X returns the written value if the read and write are
sufficiently separated in time and no other writes to X occur
between the two accesses.
3. Writes to the same location are serialized; that is, two writes to
the same location by any two processors are seen in the same
order by all processors. For example, if the values 1 and then 2
are written to a location, processors can never read the value
of the location as 2 and then later read it as 1.

• Coherence and consistency are complementary: coherence
defines the behavior of reads and writes to the same memory
location, while consistency defines the behavior of reads and
writes with respect to accesses to other memory locations.
Basic Schemes for Enforcing Coherence
• In a coherent multiprocessor, the caches provide both
migration and replication of shared data items.
• The protocols to maintain coherence for multiple
processors are called cache coherence protocols.
• The key to implementing a cache coherence protocol is
tracking the state of any sharing of a data block.
• The state of any cache block is kept using status bits
associated with the block
• There are two classes of protocols:
(a) Directory based
(b) Snooping
• Directory based:
• The sharing status of a particular block of physical memory is
kept in one location, called the directory
• In an SMP: a centralized directory, associated with the memory or
some other single serialization point.
• In a DSM: distributed directories.
• Snooping
• Rather than keeping the state of sharing in a single directory,
every cache that has a copy of the data from a block of
physical memory could track the sharing status of the block.

• In an SMP, the caches are typically all accessible via some
broadcast medium (e.g., a bus connects the per-core caches to
the shared cache or memory).
• All cache controllers monitor or snoop on the medium to
determine whether they have a copy of a block that is
requested on a bus or switch access.
Snooping Coherence Protocols
• There are two ways to maintain the coherence requirement.
• (1) Write invalidate protocol: ensure that a processor has
exclusive access to a data item before writing that item.
• (2) Write update (or write broadcast) protocol:
 Update all the cached copies of a data item when that item
is written.
 Because a write update protocol must broadcast all writes
to shared cache lines, it consumes considerably more
bandwidth.
 Recent multiprocessors have therefore opted to implement
a write invalidate protocol.
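• A compact sketch of the bandwidth argument (the message names and the 64-byte
line size are assumptions, not a real bus protocol): an invalidate-based write
sends one small message per write to a shared line, while an update-based write
pushes the full data to every sharer.

// Illustrative comparison of the two snooping write policies; the message
// names and costs below are assumed for the sketch, not a real protocol.
#include <cstdio>

enum class Policy { WriteInvalidate, WriteUpdate };

// Bus traffic generated when a core writes a line currently held by
// 'sharers' other caches.
void on_write_to_shared_line(Policy p, int sharers, int line_bytes) {
    if (p == Policy::WriteInvalidate) {
        // One invalidation; sharers re-fetch later only if they touch the line.
        std::printf("invalidate: 1 small message, %d copies invalidated\n",
                    sharers);
    } else {
        // Every write broadcasts the new data to all sharers.
        std::printf("update: broadcast %d data bytes to %d sharers\n",
                    line_bytes, sharers);
    }
}

int main() {
    on_write_to_shared_line(Policy::WriteInvalidate, 3, 64);
    on_write_to_shared_line(Policy::WriteUpdate, 3, 64);
}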
Basic Implementation Techniques
• Implementing an invalidate protocol in a multicore requires a
bus, or another broadcast medium, to perform invalidates.
• When a write to a block that is shared occurs, the writing
processor must acquire bus access to broadcast its
invalidation.
• If two processors attempt to write shared blocks at the same
time, their attempts to broadcast an invalidate operation will
be serialized when they arbitrate for the bus.
• The first processor to obtain bus access will cause any other
copies of the block it is writing to be invalidated.
If the processors were attempting to write the same block,
the serialization enforced by the bus would also serialize their
writes.
The normal cache tags can be used to implement the process
of snooping, and the valid bit for each block makes
invalidation easy to implement.
To track whether or not a cache block is shared, we can add an
extra state bit associated with each cache block, just as we
have a valid bit and a dirty bit
• By adding a bit indicating whether the block is shared, we can
decide whether a write must generate an invalidate.
• When a write to a block in the shared state occurs, the cache
generates an invalidation on the bus and marks the block as
exclusive.
• The core with the sole copy of a cache block is normally
called the owner of the cache block.
• When an invalidation is sent, the state of the owner’s
cache block is changed from shared to unshared (or
exclusive).
• If another processor later requests this cache block,
the state must be made shared again.
An Example Protocol
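• A minimal C++ sketch of such an invalidate-based protocol with the three
classic states (Modified, Shared, Invalid). The state and event names are
assumptions, and a real controller also handles data responses, write-backs,
and races on the bus.

// Minimal MSI-style snooping controller sketch (illustrative; a write from
// Invalid would really be a read-exclusive bus transaction that also fetches
// the data, which is omitted here to focus on the state bits).
enum class LineState { Invalid, Shared, Modified };

struct CacheLine {
    LineState state = LineState::Invalid;
};

// The local processor writes the block.
void on_processor_write(CacheLine& line, void (*broadcast_invalidate)()) {
    if (line.state != LineState::Modified) {
        broadcast_invalidate();           // acquire the bus, invalidate other copies
    }
    line.state = LineState::Modified;     // this cache now owns the sole copy
}

// Another core's invalidate for this block is snooped on the bus.
void on_snooped_invalidate(CacheLine& line) {
    line.state = LineState::Invalid;      // our copy is no longer usable
}

// Another core's read miss for a block we have modified: supply the data and
// drop back to Shared (memory is updated as part of the response).
void on_snooped_read_miss(CacheLine& line) {
    if (line.state == LineState::Modified) line.state = LineState::Shared;
}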
Extensions to the Basic Coherence Protocol

• The basic coherence protocol is the MSI (Modified, Shared,
Invalid) protocol.
• Two extensions are:
• MESI adds the state Exclusive to the basic MSI
protocol, yielding four states (Modified, Exclusive,
Shared, and Invalid)
• The exclusive state indicates that a cache block is
resident in only a single cache but is clean.
• If a block is in the E state, it can be written without generating
any invalidates, which optimizes the case where a block is
read by a single cache before being written by that same
cache.

• The Intel i7 uses a variant of the MESI protocol, called MESIF,
which adds a state (Forward) to designate which sharing
processor should respond to a request.
• MOESI adds the state Owned to the MESI protocol
to indicate that the associated block is owned by that
cache and out-of-date in memory.
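• For reference, the extended state sets can be summarized as enumerations (a
naming sketch only; the comments paraphrase the descriptions above, and MESIF's
Forward state is omitted).

// State sets of the extended protocols described above (a naming sketch,
// not an implementation).
enum class MESI {
    Modified,    // sole copy, dirty: memory is out of date
    Exclusive,   // sole copy, clean: can be written without an invalidate
    Shared,      // possibly cached in several places, memory up to date
    Invalid
};

enum class MOESI {
    Modified,
    Owned,       // this cache owns the block and must supply it; other caches
                 // may hold shared copies, but memory is out of date
    Exclusive,
    Shared,
    Invalid
};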
Limitations in Symmetric Shared-Memory
Multiprocessors and Snooping Protocols
• As the number of processors in a multiprocessor grows, or as the memory
demands of each processor grow, any centralized resource in the system
can become a bottleneck.
• The multicore chips use three different approaches:
• 1. The IBM Power8, which has up to 12 processors in a single multicore,
uses 8 parallel buses that connect the distributed L3 caches and up to 8
separate memory channels.
• 2. The Xeon E7 uses three rings to connect up to 32 processors, a
distributed L3 cache, and two or four memory channels (depending on the
configuration).
• 3. The Fujitsu SPARC64 X+ uses a crossbar to connect a shared L2 to up
to 16 cores and multiple memory channels.
• Techniques for increasing the snoop bandwidth:
• 1. The tags can be duplicated. This doubles the effective cache-level snoop
bandwidth.
• 2. If the outermost cache on a multicore (typically L3) is shared, we can
distribute that cache so that each processor has a portion of the memory
and handles snoops for that portion of the address space.
• 3. We can place a directory at the level of the outermost shared cache (say,
L3).
Performance of Symmetric Shared-Memory
Multiprocessors
• In a multicore using a snooping coherence protocol, several
different phenomena combine to determine performance:
• The overall cache performance is a combination of the
behavior of uniprocessor cache miss traffic and the traffic
caused by communication.
• The processor count, cache size, and block size can affect
these two components of the miss rate in different ways,
leading to overall system behavior that is a combination of the
two effects.
• The misses that arise from interprocessor communication are
often called coherence misses.
• True sharing misses arise from the communication of
data through the cache coherence mechanism.
• False sharing arises from the use of an invalidation-based
coherence algorithm with a single valid bit per cache block.
• If the word being written and the word read are different, the
invalidation does not cause a new value to be communicated
but only causes an extra cache miss; this is a false sharing
miss.
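• The sketch below makes false sharing visible (the 64-byte line size, counter
type, and iteration count are assumptions): two threads update different
counters, yet when the counters share a cache line each write invalidates the
other core's copy, so the unpadded version typically runs noticeably slower.

// False sharing sketch: the two threads update *different* words. With both
// words on one cache line, every write invalidates the other core's copy even
// though no value is actually communicated (64-byte lines assumed).
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

struct Unpadded {                          // counters share a cache line
    std::atomic<long> a{0}, b{0};
};
struct Padded {                            // alignas pushes them onto separate lines
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

template <typename Counters>
double run() {
    Counters c;
    auto start = std::chrono::steady_clock::now();
    std::thread t1([&] { for (long i = 0; i < 20000000; ++i)
                             c.a.fetch_add(1, std::memory_order_relaxed); });
    std::thread t2([&] { for (long i = 0; i < 20000000; ++i)
                             c.b.fetch_add(1, std::memory_order_relaxed); });
    t1.join(); t2.join();
    return std::chrono::duration<double>(
               std::chrono::steady_clock::now() - start).count();
}

int main() {
    std::printf("same cache line     : %.3f s\n", run<Unpadded>());
    std::printf("separate cache lines: %.3f s\n", run<Padded>());
}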
A Commercial Workload
• The memory system behavior of a 4-processor shared-memory
multiprocessor when running an online transaction processing (OLTP)
workload.
• The workload consists of a set of client processes that generate requests
and a set of servers that handle them.
• The server processes consume 85% of the user time, with the remainder
going to the clients. Although the I/O latency is hidden by careful tuning
and enough requests to keep the processor busy, the server processes
typically block for I/O after about 25,000 instructions.
• Overall, 71% of the execution time is spent in user mode, 18% in the
operating system, and 11% idle, primarily waiting for I/O.
• Of the commercial applications studied, the OLTP application stresses the
memory system the hardest and shows significant challenges even when
evaluated with much larger L3 caches
• Increasing the block size from 32 to 256 bytes affects four of
the miss rate components:
• The true sharing miss rate decreases by more than a factor of
2, indicating some locality in the true sharing patterns.
• The compulsory miss rate significantly decreases, as we would
expect.
• The conflict/capacity misses show a small decrease (a
factor of 1.26 compared to a factor of 8 increase in block size),
indicating that spatial locality is not high in the
uniprocessor misses that occur with L3 caches larger than 2
MiB.
• The false sharing miss rate, although small in absolute
terms, nearly doubles.
Distributed Shared-Memory and Directory-
Based Coherence
 The alternative to a snooping-based coherence protocol is a
directory protocol.

 The directory must be distributed, but the distribution must be
done in such a way that the coherence protocol knows where to
find the directory information for any cached block of memory.
 A distributed directory retains the characteristic that the
sharing status of a block is always in a single known location.
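• A sketch of how the directory information might be laid out and located (the
node count, block size, and field names are assumptions): each block has a home
node derived from its physical address, and that home node's slice of the
directory records the block's sharing state.

// Distributed directory sketch (illustrative). The home node for any block is
// computed from its physical address, so every node can locate the directory
// entry for a block without broadcasting.
#include <bitset>
#include <cstdint>

constexpr int kNodes = 64;        // assumed number of nodes
constexpr int kBlockBytes = 64;   // assumed cache block size

enum class DirState { Uncached, Shared, Modified };

struct DirectoryEntry {
    DirState state = DirState::Uncached;
    std::bitset<kNodes> sharers;  // which nodes hold a cached copy
};

// Interleave blocks across nodes: physical address -> home node.
inline int home_node(std::uint64_t phys_addr) {
    return static_cast<int>((phys_addr / kBlockBytes) % kNodes);
}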
Directory-Based Cache Coherence Protocols: The Basics

• Two primary operations that a directory protocol must
implement: handling a read miss and handling a write to a
shared, clean cache block.
• To do this, the directory must also track the state of each
memory block. These states could be the following:
• ■ Shared—One or more nodes have the block cached, and the
value in memory is up to date (as well as in all the caches).
• ■ Uncached—No node has a copy of the cache block.
• ■ Modified—Exactly one node has a copy of the cache block,
and it has written the block, so the memory copy is out of date.
The processor is called the owner of the block.
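• Continuing the directory sketch above (reusing DirState and DirectoryEntry;
the fetch and invalidate messages named in the comments are assumptions), the
home node might handle the two primary operations roughly as follows; real
protocols also deal with write-backs, transient states, and races.

// Home-node handling of a read miss (sketch; simplified).
void handle_read_miss(DirectoryEntry& e, int requester) {
    switch (e.state) {
    case DirState::Uncached:
    case DirState::Shared:
        // Memory is up to date: reply with the data and record the sharer.
        e.sharers.set(requester);
        e.state = DirState::Shared;
        break;
    case DirState::Modified:
        // Ask the owner to write the block back (assumed "fetch" message);
        // afterward both the owner and the requester hold clean copies.
        e.sharers.set(requester);     // the owner's bit is already set
        e.state = DirState::Shared;
        break;
    }
}

// Home-node handling of a write miss, or of a write to a shared clean block:
// invalidate all other copies, then record the requester as the sole owner.
void handle_write_miss(DirectoryEntry& e, int requester) {
    // send invalidates to every node in e.sharers except the requester (assumed)
    e.sharers.reset();
    e.sharers.set(requester);
    e.state = DirState::Modified;
}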
