Part - B Unit - 5 Multiprocessors and Thread - Level Parallelism
Basics of synchronization
1. MIMDs offer flexibility. With the correct hardware and software support, MIMDs
can function as single-user multiprocessors focusing on high performance for one
application, as multiprogrammed multiprocessors running many tasks simultaneously, or
as some combination of these functions.
With an MIMD, each processor is executing its own instruction stream. In many cases,
each processor executes a different process. Recall from the last chapter that a process is
a segment of code that may be run independently, and that the state of the process
contains all the information necessary to execute that program on a processor. In a
multiprogrammed environment, where the processors may be running independent tasks,
each process is typically independent of the processes on other processors. It is also
useful to be able to have multiple processors executing a single program and sharing the
code and most of their address space. When multiple processes share code and data in
this way, they are often called threads. Today, the term thread is often used in a casual
way to refer to multiple loci of
execution that may run on different processors, even when they do not share an address
space. To take advantage of an MIMD multiprocessor with n processors, we must usually
have at least n threads or processes to execute. The independent threads are typically
identified by the programmer or created by the compiler. Since the parallelism in this
situation is contained in the threads, it is called thread-level parallelism. The important
distinction is that such parallelism is identified at a high level by the software system and
that the threads consist of hundreds to millions of instructions that may be executed in
parallel. In contrast, instruction-level parallelism is identified primarily by the
hardware, though with software help in some cases, and is found and exploited one
instruction at a time.
Existing MIMD multiprocessors fall into two classes, depending on the number of
processors involved, which in turn dictates a memory organization and interconnect
strategy. We refer to the multiprocessors by their memory organization, because what
constitutes a small or large number of processors is likely to change over time.
The first group, which we call centralized shared-memory architectures, shares a single
main memory. Because that memory has a symmetric relationship to all
processors and a uniform access time from any processor, these multiprocessors are often
called symmetric (shared-memory) multiprocessors (SMPs), and this style of architecture
is sometimes called UMA, for uniform memory access. This type of centralized
shared-memory architecture is currently by far the most popular organization.
Distributing the memory among the nodes has two major benefits. First, it is a
cost-effective way to scale the memory bandwidth, if most of the accesses are to the local
memory in the node. Second, it reduces the latency for accesses to the local memory.
These two advantages make distributed memory attractive at smaller processor counts as
processors get ever faster and require more memory bandwidth and lower memory
latency. The key disadvantage for a distributed memory architecture is that
communicating data between processors becomes somewhat more complex and has
higher latency, at least when there is no contention, because the processors no longer
share a single centralized memory. As we will see shortly, the use of distributed memory
leads to two different paradigms for interprocessor communication. Typically, I/O as well
as memory is distributed among the nodes of the multiprocessor, and the nodes may be
small SMPs (2–8 processors). The use of multiple processors in a node together
with a memory and a network interface is quite useful from the cost-efficiency viewpoint.
• Suppose you want to achieve a speedup of 80 with 100 processors. What fraction
of the original computation can be sequential?
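The question above can be worked with Amdahl's law. The sketch below assumes the usual simplification that the program runs either fully sequentially or fully in parallel on all n processors; the function names are illustrative and not part of the notes.

```python
def max_sequential_fraction(target_speedup: float, processors: int) -> float:
    """Largest sequential fraction s that still allows the target speedup.

    Amdahl's law: speedup = 1 / (s + (1 - s) / n).
    Solving for s gives s = (n / speedup - 1) / (n - 1).
    """
    if processors < 2 or not (0 < target_speedup <= processors):
        raise ValueError("need n >= 2 and 0 < speedup <= n")
    return (processors / target_speedup - 1) / (processors - 1)


def amdahl_speedup(sequential_fraction: float, processors: int) -> float:
    """Speedup predicted by Amdahl's law for a given sequential fraction."""
    return 1.0 / (sequential_fraction + (1.0 - sequential_fraction) / processors)


s = max_sequential_fraction(80, 100)
print(f"sequential fraction: {s:.4%}")
print(f"check speedup: {amdahl_speedup(s, 100):.1f}")
```

For a speedup of 80 on 100 processors the sequential fraction works out to 0.25/99, roughly 0.25% of the original computation.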
Message-Passing Multiprocessor
- The address space can consist of multiple private address spaces that are
logically disjoint and cannot be addressed by a remote processor
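A minimal sketch of the message-passing style, using Python threads and queues as stand-ins for nodes and the interconnect. Python threads do share one address space, so the privacy of each node's data is only a discipline here, not something enforced; the names `worker`, `to_worker`, and `from_worker` are illustrative.

```python
import queue
import threading


def worker(inbox: queue.Queue, outbox: queue.Queue) -> None:
    """A 'node' that keeps private state and communicates only by messages."""
    local_sum = 0  # private data: never read or written by the other side
    while True:
        message = inbox.get()
        if message is None:  # sentinel: no more work
            break
        local_sum += message
    outbox.put(local_sum)  # the result travels back as an explicit message


to_worker: queue.Queue = queue.Queue()
from_worker: queue.Queue = queue.Queue()
node = threading.Thread(target=worker, args=(to_worker, from_worker))
node.start()

for value in (1, 2, 3):
    to_worker.put(value)  # explicit send instead of a shared variable
to_worker.put(None)

node.join()
result = from_worker.get()
print(result)
```

The point of the sketch is that the sender never touches `local_sum` directly; all data moves through explicit send and receive operations.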
Multicomputer (cluster):
Cache Coherence
Unfortunately, caching shared data introduces a new problem: each processor views
memory through its own cache, so without additional precautions two processors
could end up seeing two different values for the same location. This difficulty is
generally referred to as the cache coherence problem.
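The problem can be reproduced with a toy model. The sketch below assumes write-through caches with no coherence mechanism at all; `NaiveCache` and the location name "X" are hypothetical.

```python
class NaiveCache:
    """Write-through cache with no coherence protocol (toy model)."""

    def __init__(self, memory: dict) -> None:
        self.memory = memory
        self.lines: dict = {}

    def read(self, addr: str) -> int:
        if addr not in self.lines:  # miss: fetch the value from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]  # hit: return the cached copy, possibly stale

    def write(self, addr: str, value: int) -> None:
        self.lines[addr] = value
        self.memory[addr] = value  # write-through; other caches are not told


memory = {"X": 1}
cache_a = NaiveCache(memory)
cache_b = NaiveCache(memory)

cache_a.read("X")      # A caches X = 1
cache_b.read("X")      # B caches X = 1
cache_a.write("X", 0)  # A updates its copy and memory

stale = cache_b.read("X")
print("memory:", memory["X"], "| cache B sees:", stale)
```

After A's write, memory holds the new value but B keeps returning its old cached copy, which is exactly the two-different-values situation described above.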
• Informally:
– “If P writes x and then P1 reads it, P’s write will be seen by P1 if the read
and write are sufficiently far apart”
– Writes to a single location are serialized: seen in one order
• Latest write will be seen
• Otherwise could see writes in illogical order (could see older
value after a newer value)
The above three properties are sufficient to ensure coherence. When a written value will
be seen is also important; this issue is defined by the memory consistency model. Coherence
and consistency are complementary.
• migration: a data item can be moved to a local cache and used there in a
transparent fashion.
• replication for shared data that are being simultaneously read.
• both are critical to performance in accessing shared data.
To overcome these problems, a hardware solution is adopted: protocols that
maintain coherent caches, known as cache coherence protocols.
These protocols are implemented for tracking the state of any sharing of a data block.
Two classes of Protocols
• Directory based
• Snooping based
Directory based
• Sharing status of a block of physical memory is kept in one location called the
directory.
• Directory-based coherence has slightly higher implementation overhead than
snooping.
• It can scale to larger processor counts.
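A minimal sketch of the directory idea, assuming a simplified model in which a block is either shared by a set of nodes or owned exclusively by one node. The `Directory` class, its method names, and the two-state model are illustrative assumptions, not a protocol given in these notes.

```python
class Directory:
    """Keeps the sharing status of every block in one place (toy model)."""

    def __init__(self) -> None:
        self.sharers: dict = {}  # block -> set of node ids with a read copy
        self.owner: dict = {}    # block -> node id holding the exclusive copy

    def read_miss(self, block: int, node: int) -> None:
        """Record that `node` now holds a shared copy of `block`."""
        previous_owner = self.owner.pop(block, None)
        sharers = self.sharers.setdefault(block, set())
        if previous_owner is not None:
            sharers.add(previous_owner)  # the owner is downgraded to a sharer
        sharers.add(node)

    def write_miss(self, block: int, node: int) -> list:
        """Make `node` the exclusive owner; return the nodes to invalidate."""
        to_invalidate = set(self.sharers.pop(block, set()))
        previous_owner = self.owner.get(block)
        if previous_owner is not None:
            to_invalidate.add(previous_owner)
        to_invalidate.discard(node)
        self.owner[block] = node
        return sorted(to_invalidate)


directory = Directory()
directory.read_miss(0x40, node=0)
directory.read_miss(0x40, node=1)
invalidated = directory.write_miss(0x40, node=2)
print("invalidate nodes:", invalidated)
```

Because the directory knows exactly which nodes hold a copy, invalidations go only to those nodes rather than being broadcast, which is why the approach scales to larger processor counts.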
Snooping
• Every cache that has a copy of data also has a copy of the sharing status of the
block.
• No centralized state is kept.
• Caches are also accessible via some broadcast medium (bus or switch)
• Cache controllers monitor, or snoop on, the medium to determine whether or not
they have a copy of a block that is requested on a bus or switch access.
Snooping protocols are popular with multiprocessors whose caches are attached to a single
shared memory, because they can use the existing physical connection (the bus to memory) to
interrogate the status of the caches. A snoop-based cache coherence scheme is implemented
on a shared bus, or on any communication medium that broadcasts cache misses to all the
processors.
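A minimal sketch of snooping with invalidation on write. It assumes write-through caches and a bus that delivers every invalidation to all other caches instantly; `Bus`, `SnoopingCache`, and their methods are hypothetical names for illustration.

```python
class Bus:
    """Broadcast medium that every cache controller listens to (toy model)."""

    def __init__(self) -> None:
        self.memory: dict = {}
        self.caches: list = []

    def attach(self, cache: "SnoopingCache") -> None:
        self.caches.append(cache)

    def broadcast_invalidate(self, addr: str, source: "SnoopingCache") -> None:
        for cache in self.caches:
            if cache is not source:
                cache.snoop_invalidate(addr)


class SnoopingCache:
    def __init__(self, bus: Bus) -> None:
        self.bus = bus
        self.lines: dict = {}
        bus.attach(self)

    def read(self, addr: str) -> int:
        if addr not in self.lines:  # miss: fetch from shared memory
            self.lines[addr] = self.bus.memory.get(addr, 0)
        return self.lines[addr]

    def write(self, addr: str, value: int) -> None:
        # Invalidate all other copies before the write becomes visible.
        self.bus.broadcast_invalidate(addr, source=self)
        self.lines[addr] = value
        self.bus.memory[addr] = value  # write-through keeps the sketch short

    def snoop_invalidate(self, addr: str) -> None:
        self.lines.pop(addr, None)  # drop the local copy if one is held


bus = Bus()
bus.memory["X"] = 1
cache_a, cache_b = SnoopingCache(bus), SnoopingCache(bus)
cache_a.read("X")
cache_b.read("X")
cache_a.write("X", 0)        # B's copy is invalidated by the broadcast
fresh = cache_b.read("X")    # B misses and refetches the new value
print("cache B sees:", fresh)
```

Unlike the directory approach, no central record of sharers exists here: each cache decides for itself, from what it observes on the bus, whether to drop its copy.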
Write Invalidate
Write Update
Example Protocol
Conclusion
• “End” of uniprocessors speedup => Multiprocessors
• Parallelism challenges: % parallelizable, long latency to remote memory
• Centralized vs. distributed memory
– Small MP vs. lower latency, larger BW for Larger MP
• Message Passing vs. Shared Address
– Uniform access time vs. Non-uniform access time
• Snooping cache over shared medium for smaller MP by invalidating other
cached copies on write
• Sharing cached data => Coherence (values returned by a read), Consistency
(when a written value will be returned by a read)
• Shared medium serializes writes => Write consistency
Implementation Complications
• Write Races:
– Cannot update cache until bus is obtained
• Otherwise, another processor may get bus first,
and then write the same cache block!
– Two step process:
• Arbitrate for bus
• Place miss on bus and complete operation
– If miss occurs to block while waiting for bus, handle miss (invalidate
may be needed) and then restart.
– Split transaction bus:
• Bus transaction is not atomic:
Performance Measurement
• Overall cache performance is a combination of
– Uniprocessor cache miss traffic
– Traffic caused by communication – invalidation and subsequent cache
misses
• Changing the processor count, cache size, and block size can affect these two
components of miss rate
• Uniprocessor miss rate: compulsory, capacity, conflict
• Communication miss rate: coherence misses
– True sharing misses + false sharing misses
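The difference between the two kinds of coherence miss can be sketched as below. This is a deliberately simplified classification that assumes a block of 4 words and looks only at which word the remote processor wrote and which word the local processor then accesses; `BLOCK_WORDS`, `block_of`, and `classify_miss` are illustrative names.

```python
BLOCK_WORDS = 4  # assumed block size in words, chosen for illustration


def block_of(word_address: int) -> int:
    """Block number that contains the given word address."""
    return word_address // BLOCK_WORDS


def classify_miss(remote_written_word: int, local_accessed_word: int) -> str:
    """Classify the local miss that follows a remote write.

    The remote write invalidates the whole block in the local cache.
    If the local processor then touches the very word that was written,
    real communication took place (true sharing). If it touches a
    different word of the same block, the miss exists only because the
    two words happen to share a block (false sharing).
    """
    if block_of(remote_written_word) != block_of(local_accessed_word):
        return "no coherence miss"
    if remote_written_word == local_accessed_word:
        return "true sharing"
    return "false sharing"


print(classify_miss(4, 4))
print(classify_miss(4, 5))
print(classify_miss(3, 4))
```

Larger blocks place more unrelated words together, which is one reason block size affects the coherence component of the miss rate.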
Example Result
Directory Protocols
• Why Synchronize?
Need to know when it is safe for different processes to use shared data
• Issues for Synchronization:
– Uninterruptible instruction to fetch and update memory (atomic
operation);
– User level synchronization operation using this primitive;
– For large scale MPs, synchronization can be a bottleneck; techniques to
reduce contention and latency of synchronization
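A minimal sketch of a user-level lock built on an atomic primitive. Python exposes no hardware test-and-set, so `AtomicFlag` emulates one with an internal `threading.Lock` that stands in for the uninterruptible instruction; `AtomicFlag`, `SpinLock`, and `run_counter` are hypothetical names.

```python
import threading
import time


class AtomicFlag:
    """Emulated test-and-set; the inner Lock models hardware atomicity."""

    def __init__(self) -> None:
        self._guard = threading.Lock()
        self._value = False

    def test_and_set(self) -> bool:
        with self._guard:  # read the old value and set the flag in one step
            old = self._value
            self._value = True
            return old

    def clear(self) -> None:
        with self._guard:
            self._value = False


class SpinLock:
    """User-level lock built only from the atomic primitive above."""

    def __init__(self) -> None:
        self._flag = AtomicFlag()

    def acquire(self) -> None:
        while self._flag.test_and_set():  # True means someone else holds it
            time.sleep(0)                 # yield instead of burning the CPU

    def release(self) -> None:
        self._flag.clear()


def run_counter(threads: int = 4, increments: int = 1000) -> int:
    lock = SpinLock()
    total = [0]

    def work() -> None:
        for _ in range(increments):
            lock.acquire()
            total[0] += 1  # critical section: shared data updated safely
            lock.release()

    pool = [threading.Thread(target=work) for _ in range(threads)]
    for thread in pool:
        thread.start()
    for thread in pool:
        thread.join()
    return total[0]


final_count = run_counter()
print(final_count)
```

Every waiting thread repeatedly hits the same flag, which hints at why synchronization can become a bottleneck as the processor count grows.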
• Key idea: allow reads and writes to complete out of order, but to use
synchronization operations to enforce ordering, so that a synchronized program behaves
as if the processor were sequentially consistent
– By relaxing orderings, may obtain performance advantages
– Also specifies range of legal compiler optimizations on shared data
– Unless synchronization points are clearly defined and programs are
synchronized, the compiler cannot interchange a read and a write of two shared data items,
because doing so might affect the semantics of the program
• 3 major sets of relaxed orderings:
1. W→R ordering (all writes completed before next read)
• Because retains ordering among writes, many programs that operate under
sequential consistency operate under this model, without additional
synchronization. Called processor consistency
2. W→W ordering (all writes completed before next write)
3. R→W and R→R orderings, a variety of models depending on ordering
restrictions and how synchronization operations enforce ordering
• Many complexities in relaxed consistency models: defining precisely what it means for
a write to complete; deciding when a processor can see values that it has written
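The effect of relaxing the W→R ordering can be illustrated with a toy model in which each processor has a FIFO store buffer, so a later read may complete before an earlier write reaches memory. The `Processor` class, the `drain` operation used as the synchronization point, and the two-variable test are assumptions made for this sketch.

```python
class Processor:
    """Processor with a FIFO store buffer (toy model of relaxed W->R order)."""

    def __init__(self, memory: dict) -> None:
        self.memory = memory
        self.store_buffer: list = []  # pending writes; FIFO keeps W->W order

    def write(self, addr: str, value: int) -> None:
        self.store_buffer.append((addr, value))  # not yet visible to others

    def read(self, addr: str) -> int:
        # A processor always sees its own most recent buffered write.
        for buffered_addr, value in reversed(self.store_buffer):
            if buffered_addr == addr:
                return value
        return self.memory[addr]

    def drain(self) -> None:
        """Synchronization point: force all pending writes to complete."""
        for addr, value in self.store_buffer:
            self.memory[addr] = value
        self.store_buffer.clear()


def run(with_sync: bool) -> tuple:
    memory = {"A": 0, "B": 0}
    p1, p2 = Processor(memory), Processor(memory)
    p1.write("A", 1)
    p2.write("B", 1)
    if with_sync:
        p1.drain()
        p2.drain()
    r1 = p1.read("B")  # P1: A = 1; then read B
    r2 = p2.read("A")  # P2: B = 1; then read A
    p1.drain()
    p2.drain()
    return r1, r2


relaxed = run(with_sync=False)
synchronized = run(with_sync=True)
print("without synchronization:", relaxed)
print("with synchronization:", synchronized)
```

Without synchronization both reads return 0, an outcome that no sequentially consistent interleaving of the four operations can produce; with the synchronization points in place the program behaves as if the processors were sequentially consistent.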