MODULE 4
THREAD LEVEL PARALLELISM
Multiprocessing
• Factors driving the move to multiprocessing:
• The lower efficiencies in silicon and energy use encountered when pushing for more ILP; other than ILP, the only scalable and general-purpose way to increase performance faster than the basic technology allows (from a switching perspective) is through multiprocessing.
• A growing interest in high-end servers as cloud computing grows in importance.
• A growth in data-intensive applications driven by the availability of massive amounts of data on the Internet.
• Highly compute- and data-intensive applications are increasingly run in the cloud.
• An improved understanding of how to use multiprocessors effectively, especially in server environments.
• The advantage of leveraging a design investment by replication rather than unique design.

Thread Level Parallelism
• TLP implies the existence of multiple program counters and is therefore exploited primarily through MIMDs.
• Multiprocessors are defined as computers consisting of tightly coupled processors whose coordination and usage are typically controlled by a single operating system and that share memory through a shared address space.
• Such systems exploit thread-level parallelism through two different software models:
(a) Execution of a tightly coupled set of threads collaborating on a single task, typically called parallel processing.
(b) Execution of multiple, relatively independent processes that may originate from one or more users, which is a form of request-level parallelism.
• Request-level parallelism may be exploited by a single application running on multiple processors, such as a database responding to queries, or by multiple applications running independently, often called multiprogramming.

Multiprocessor Architecture: Issues and Approach
• To take advantage of an MIMD multiprocessor with n processors, we must usually have at least n threads or processes to execute (a short code sketch illustrating this follows below).
• Threads can also be used to exploit data-level parallelism, although the overhead is usually higher.
• This overhead means that grain size must be sufficiently large to exploit the parallelism efficiently; if the grain size is too small, the overhead can make exploiting the parallelism prohibitively expensive in an MIMD.
• Existing shared-memory multiprocessors fall into two classes, depending on the number of processors involved:
(a) Symmetric (shared-memory) multiprocessors (SMPs), or centralized shared-memory multiprocessors (UMA): small to moderate numbers of cores, typically 32 or fewer, where all processors have equal access to a single centralized memory.
(b) Distributed shared memory (DSM), with non-uniform memory access (NUMA): to support larger processor counts, memory must be distributed among the processors rather than centralized.

Centralized Shared-Memory Multiprocessor Architecture
Distributed Shared Memory
• The term shared memory associated with both SMP and DSM refers to the fact that the address space is shared.
• In contrast, clusters and warehouse-scale computers look like individual computers connected by a network, and the memory of one processor cannot be accessed by another processor without the assistance of software protocols.

Challenges of Parallel Processing
• Limited parallelism available in programs.
• Relatively high cost of communications: the large latency of remote access in a parallel processor.
• The problem of inadequate application parallelism must be attacked primarily in software with new algorithms that offer better parallel performance.
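Returning to the point that an MIMD machine with n processors needs at least n threads to be kept busy, here is a minimal sketch (not from the slides) that exploits thread-level parallelism with POSIX threads; the processor count, names, and work division are illustrative assumptions.

/* Sketch: one worker thread per assumed processor, each summing a
 * disjoint chunk of an array; the only sharing is the final reduce. */
#include <pthread.h>
#include <stdio.h>

#define NPROC  4            /* assume an MIMD machine with 4 processors */
#define N      1000000

static double data[N];

struct chunk { int lo, hi; double partial; };

static void *worker(void *arg)
{
    struct chunk *c = arg;
    double s = 0.0;
    for (int i = c->lo; i < c->hi; i++)   /* each thread works on its own chunk */
        s += data[i];
    c->partial = s;
    return NULL;
}

int main(void)
{
    pthread_t tid[NPROC];
    struct chunk c[NPROC];

    for (int i = 0; i < N; i++) data[i] = 1.0;

    /* need at least NPROC threads to keep NPROC processors busy */
    for (int t = 0; t < NPROC; t++) {
        c[t].lo = t * (N / NPROC);
        c[t].hi = (t + 1) * (N / NPROC);
        pthread_create(&tid[t], NULL, worker, &c[t]);
    }

    double total = 0.0;
    for (int t = 0; t < NPROC; t++) {
        pthread_join(tid[t], NULL);
        total += c[t].partial;
    }
    printf("sum = %f\n", total);
    return 0;
}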
Centralized Shared-Memory Architectures
• Key factor: multilevel caches can substantially reduce the memory bandwidth demands of a processor.
• Symmetric shared-memory machines usually support the caching of both shared and private data.
• Private data are used by a single processor, while shared data are used by multiple processors.
• When a private item is cached, its location is migrated to the cache.
• When shared data are cached, the shared value may be replicated in multiple caches.
• Caching of shared data, however, introduces a new problem: cache coherence.

Multiprocessor Cache Coherence
• 2 Aspects: Coherence & Consistency
• Coherence: the behavior of reads and writes to the same memory location.
• Consistency: the behavior of reads and writes with respect to accesses to different memory locations.
A memory system is coherent if:
1. A read by processor P to location X that follows a write by P to X, with no writes of X by another processor occurring between the write and the read by P, always returns the value written by P.
2. A read by a processor to location X that follows a write by another processor to X returns the written value if the read and write are sufficiently separated in time and no other writes to X occur between the two accesses.
3. Writes to the same location are serialized; that is, two writes to the same location by any two processors are seen in the same order by all processors. For example, if the values 1 and then 2 are written to a location, processors can never read the value of the location as 2 and then later read it as 1.
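As an informal illustration of the write-serialization property (my own sketch, not from the slides), the following C11 program uses relaxed atomics, which are sufficient to talk about per-location coherence without a data race: a reader can never observe the value 2 and then later the value 1 at the same location.

/* Two writes of 1 then 2 to the same location X; coherence guarantees all
 * observers see those writes in the same order. */
#include <stdatomic.h>
#include <pthread.h>
#include <assert.h>
#include <stdio.h>

static atomic_int x = 0;        /* the shared location "X" */

static void *writer(void *arg)
{
    (void)arg;
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    atomic_store_explicit(&x, 2, memory_order_relaxed);
    return NULL;
}

static void *reader(void *arg)
{
    (void)arg;
    int first  = atomic_load_explicit(&x, memory_order_relaxed);
    int second = atomic_load_explicit(&x, memory_order_relaxed);
    /* The reader may see 0,0 / 0,1 / 1,2 / 2,2 and so on,
     * but never 2 followed by 1. */
    assert(!(first == 2 && second == 1));
    printf("reader saw %d then %d\n", first, second);
    return NULL;
}

int main(void)
{
    pthread_t w, r;
    pthread_create(&w, NULL, writer, NULL);
    pthread_create(&r, NULL, reader, NULL);
    pthread_join(w, NULL);
    pthread_join(r, NULL);
    return 0;
}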
• Coherence and consistency are complementary: Coherence
defines the behavior of reads and writes to the same memory location, while consistency defines the behavior of reads and writes with respect to accesses to other memory locations.

Basic Schemes for Enforcing Coherence
• In a coherent multiprocessor, the caches provide both migration and replication of shared data items.
• The protocols used to maintain coherence for multiple processors are called cache coherence protocols.
• The key to implementing a cache coherence protocol is tracking the state of any sharing of a data block.
• The state of any cache block is kept using status bits associated with the block.
• There are two classes of protocols:
• (a) Directory Based
• (b) Snooping
Directory based:
• The sharing status of a particular block of physical memory is kept in one location, called the directory.
• In an SMP: a centralized directory, associated with the memory or some other single serialization point.
• In a DSM: distributed directories.
Snooping:
• Rather than keeping the state of sharing in a single directory, every cache that has a copy of the data from a block of physical memory could track the sharing status of the block.
• In an SMP, the caches are typically all accessible via some
broadcast medium (e.g., a bus connecting the per-core caches to the shared cache or memory).
• All cache controllers monitor, or snoop on, the medium to determine whether they have a copy of a block that is requested on a bus or switch access.

Snooping Coherence Protocols
• There are two ways to maintain the coherence requirement:
1) Write invalidate protocol: ensure that a processor has exclusive access to a data item before writing that item.
2) Write update (or write broadcast) protocol:
update all the cached copies of a data item when that item is written. Because a write update protocol must broadcast all writes to shared cache lines, it consumes considerably more bandwidth; for this reason, recent multiprocessors have opted to implement a write invalidate protocol.

Basic Implementation Techniques
• Implementing an invalidate protocol in a multicore requires a bus, or another broadcast medium, to perform invalidates.
• When a write to a block that is shared occurs, the writing processor must acquire bus access to broadcast its invalidation.
• If two processors attempt to write shared blocks at the same time, their attempts to broadcast an invalidate operation will be serialized when they arbitrate for the bus.
• The first processor to obtain bus access will cause any other copies of the block it is writing to be invalidated. If the processors were attempting to write the same block, the serialization enforced by the bus also serializes their writes.
• The normal cache tags can be used to implement the process of snooping, and the valid bit for each block makes invalidation easy to implement.
• To track whether or not a cache block is shared, we can add an extra state bit associated with each cache block, just as we have a valid bit and a dirty bit.
• By adding a bit indicating whether the block is shared, we can decide whether a write must generate an invalidate.
• When a write to a block in the shared state occurs, the cache generates an invalidation on the bus and marks the block as exclusive.
• The core with the sole copy of a cache block is normally called the owner of the cache block.
• When an invalidation is sent, the state of the owner's cache block is changed from shared to unshared (or exclusive).
• If another processor later requests this cache block, the state must be made shared again.

AN EXAMPLE PROTOCOL
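As a rough illustration of a write-invalidate snooping protocol with the states discussed above, here is a minimal sketch in C of an MSI-style controller for a single cache block (my own simplification; the bus functions are placeholder assumptions, not a real API).

/* Processor-side events generate bus traffic only when data or ownership
 * must be obtained; snoop-side events downgrade or invalidate the local copy. */
#include <stdio.h>

typedef enum { INVALID, SHARED, MODIFIED } blk_state_t;

/* Assumed bus primitives; every broadcast is snooped by all other caches. */
void bus_read_miss(unsigned long a)  { printf("BusRd   %lx\n", a); }  /* fetch shared copy   */
void bus_write_miss(unsigned long a) { printf("BusRdX  %lx\n", a); }  /* fetch + invalidate  */
void bus_invalidate(unsigned long a) { printf("BusInv  %lx\n", a); }  /* upgrade: invalidate */
void bus_write_back(unsigned long a) { printf("BusWB   %lx\n", a); }  /* flush dirty block   */

/* Processor-side events. */
blk_state_t cpu_read(blk_state_t s, unsigned long a)
{
    if (s == INVALID) { bus_read_miss(a); return SHARED; }
    return s;                                  /* hit in SHARED or MODIFIED   */
}

blk_state_t cpu_write(blk_state_t s, unsigned long a)
{
    if (s == INVALID)      bus_write_miss(a);  /* need data and ownership     */
    else if (s == SHARED)  bus_invalidate(a);  /* have data, need ownership   */
    return MODIFIED;                           /* this cache is now the owner */
}

/* Snoop-side events: requests from other caches for the same block. */
blk_state_t snoop_read_miss(blk_state_t s, unsigned long a)
{
    if (s == MODIFIED) bus_write_back(a);      /* supply the dirty data       */
    return (s == INVALID) ? INVALID : SHARED;  /* demote owner to sharer      */
}

blk_state_t snoop_write_miss_or_inv(blk_state_t s, unsigned long a)
{
    if (s == MODIFIED) bus_write_back(a);      /* flush before giving up      */
    return INVALID;                            /* another cache becomes owner */
}

int main(void)
{
    blk_state_t s = INVALID;
    s = cpu_read(s, 0x40);          /* Invalid  -> Shared   (BusRd)  */
    s = cpu_write(s, 0x40);         /* Shared   -> Modified (BusInv) */
    s = snoop_read_miss(s, 0x40);   /* Modified -> Shared   (write back) */
    printf("final state = %d\n", s);
    return 0;
}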
Extensions to the Basic Coherence Protocol
• The basic coherence protocol is MSI (Modified, Shared, Invalid).
• Two common extensions are MESI and MOESI.
• MESI adds the state Exclusive to the basic MSI protocol, yielding four states (Modified, Exclusive, Shared, and Invalid).
• The Exclusive state indicates that a cache block is resident in only a single cache but is clean.
• If a block is in the E state, it can be written without generating any invalidates, which optimizes the case where a block is read by a single cache before being written by that same cache.
• The Intel i7 uses a variant of the MESI protocol, called MESIF, which adds a state (Forward) to designate which sharing processor should respond to a request.
• MOESI adds the state Owned to the MESI protocol to indicate that the associated block is owned by that cache and out of date in memory.

Limitations in Symmetric Shared-Memory Multiprocessors and Snooping Protocols
• As the number of processors in a multiprocessor grows, or as the memory demands of each processor grow, any centralized resource in the system can become a bottleneck.
• Multicore chips use three different approaches:
1. The IBM Power8, which has up to 12 processors in a single multicore, uses 8 parallel buses that connect the distributed L3 caches and up to 8 separate memory channels.
2. The Xeon E7 uses three rings to connect up to 32 processors, a distributed L3 cache, and two or four memory channels (depending on the configuration).
3. The Fujitsu SPARC64 X+ uses a crossbar to connect a shared L2 to up to 16 cores and multiple memory channels.
• Techniques for increasing the snoop bandwidth:
1. The tags can be duplicated, which doubles the effective cache-level snoop bandwidth.
2. If the outermost cache on a multicore (typically L3) is shared, we can distribute that cache so that each processor has a portion of the memory and handles snoops for that portion of the address space.
3. We can place a directory at the level of the outermost shared cache (say, L3).

Performance of Symmetric Shared-Memory Multiprocessors
• In a multicore using a snooping coherence protocol, several different phenomena combine to determine performance:
• The overall cache performance is a combination of the behavior of uniprocessor cache miss traffic and the traffic caused by communication.
• The processor count, cache size, and block size can affect these two components of the miss rate in different ways, leading to overall system behavior that is a combination of the two effects.
• The misses that arise from interprocessor communication are often called coherence misses.
• True sharing misses arise from the communication of data through the cache coherence mechanism.
• False sharing arises from the use of an invalidation-based coherence algorithm with a single valid bit per cache block: if the word being written and the word being read are different, the invalidation does not cause a new value to be communicated but only causes an extra cache miss, a false sharing miss.
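To make false sharing concrete, here is a small sketch (my own example, not from the slides); the 64-byte cache-line size and the padding layout are assumptions.

/* Two threads update *different* words that happen to sit in the same cache
 * block, so each write invalidates the other core's copy even though no data
 * is truly shared; padding places the counters in separate blocks. */
#include <pthread.h>
#include <stdalign.h>
#include <stdio.h>

#define ITERS 50000000L

struct counters {
    long a;                        /* written only by thread 1 */
    long b;                        /* written only by thread 2, same block as a */
};

struct padded_counters {
    long a;
    char pad[64 - sizeof(long)];   /* push b into a different cache block */
    long b;
};

static struct counters shared_line;
static alignas(64) struct padded_counters separate_lines;

static void *bump(void *p)
{
    long *x = p;
    for (long i = 0; i < ITERS; i++) (*x)++;
    return NULL;
}

static void run(long *pa, long *pb, const char *label)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump, pa);
    pthread_create(&t2, NULL, bump, pb);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("%s: a=%ld b=%ld\n", label, *pa, *pb);
}

int main(void)
{
    run(&shared_line.a,    &shared_line.b,    "false sharing");
    run(&separate_lines.a, &separate_lines.b, "padded       ");
    return 0;
}

With the unpadded layout, each increment typically invalidates the other core's copy of the block even though the two threads never touch the same word; padding the structure removes that coherence traffic, which is usually visible as a large runtime difference.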
A Commercial Workload
• Consider the memory system behavior of a 4-processor shared-memory multiprocessor running an online transaction processing (OLTP) workload.
• The workload consists of a set of client processes that generate requests and a set of servers that handle them.
• The server processes consume 85% of the user time, with the remainder going to the clients.
• Although the I/O latency is hidden by careful tuning and enough requests to keep the processor busy, the server processes typically block for I/O after about 25,000 instructions.
• Overall, 71% of the execution time is spent in user mode, 18% in the operating system, and 11% idle, primarily waiting for I/O.
• Of the commercial applications studied, the OLTP application stresses the memory system the hardest and shows significant challenges even when evaluated with much larger L3 caches.
• Increasing the block size from 32 to 256 bytes affects four of the miss rate components:
• The true sharing miss rate decreases by more than a factor of 2, indicating some locality in the true sharing patterns.
• The compulsory miss rate significantly decreases, as we would expect.
• The conflict/capacity misses show a small decrease (a factor of 1.26, compared to a factor of 8 increase in block size), indicating that spatial locality is not high in the uniprocessor misses that occur with L3 caches larger than 2 MiB.
• The false sharing miss rate, although small in absolute terms, nearly doubles.

Distributed Shared-Memory and Directory-Based Coherence
The alternative to a snooping-based coherence protocol is a directory protocol.
The directory must be distributed, but the distribution must be
done in such a way that the coherence protocol knows where to find the directory information for any cached block of memory.
A distributed directory retains the characteristic that the
sharing status of a block is always in a single known location.

Directory-Based Cache Coherence Protocols: The Basics
• There are two primary operations that a directory protocol must implement: handling a read miss and handling a write to a shared, clean cache block.
• The directory must also track the state of each potentially shared memory block; these states could be the following:
• Shared: one or more nodes have the block cached, and the value in memory is up to date (as well as in all the caches).
• Uncached: no node has a copy of the cache block.
• Modified: exactly one node has a copy of the cache block, and it has written the block, so the memory copy is out of date. The processor is called the owner of the block.
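To sketch how this bookkeeping might look (a rough illustration of my own; the node count, block size, message names, and home-node mapping are all assumptions), each distributed directory entry can pair one of these states with a bit vector of sharers.

/* Directory-entry bookkeeping for read and write misses in a DSM machine. */
#include <stdint.h>
#include <stdio.h>

#define NODES       8
#define BLOCK_BITS  6                   /* assume 64-byte blocks */

typedef enum { DIR_UNCACHED, DIR_SHARED, DIR_MODIFIED } dir_state_t;

typedef struct {
    dir_state_t state;
    uint32_t    sharers;                /* bit i set => node i has a copy */
} dir_entry_t;

/* The directory is distributed: each block has a fixed "home" node that
 * holds its directory entry, e.g. by interleaving blocks across nodes. */
static int home_node(uint64_t addr) { return (int)((addr >> BLOCK_BITS) % NODES); }

/* Directory action on a read miss from `node`. */
static void dir_read_miss(dir_entry_t *e, int node)
{
    if (e->state == DIR_MODIFIED)
        printf("fetch dirty copy from the owner and write it back home\n");
    e->sharers |= 1u << node;           /* requester becomes a sharer        */
    e->state    = DIR_SHARED;           /* memory copy is (now) up to date   */
}

/* Directory action on a write miss (or upgrade) from `node`. */
static void dir_write_miss(dir_entry_t *e, int node)
{
    uint32_t others = e->sharers & ~(1u << node);
    for (int i = 0; i < NODES; i++)     /* invalidate every other copy       */
        if (others & (1u << i))
            printf("invalidate node %d\n", i);
    if (e->state == DIR_MODIFIED && others)
        printf("fetch the dirty copy from the old owner for the requester\n");
    e->sharers = 1u << node;            /* requester is the sole copy        */
    e->state   = DIR_MODIFIED;          /* memory copy is now out of date    */
}

int main(void)
{
    dir_entry_t e = { DIR_UNCACHED, 0 };
    printf("home of block 0x1040 = node %d\n", home_node(0x1040));
    dir_read_miss(&e, 2);               /* Uncached -> Shared, sharers = {2}     */
    dir_read_miss(&e, 5);               /* Shared, sharers = {2, 5}              */
    dir_write_miss(&e, 5);              /* invalidate node 2, Modified, owner 5  */
    return 0;
}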