
UNIT-V PARALLEL PROCESSING AND MULTICORE COMPUTERS

Parallel Processing: Use of Multiple Processors - Symmetric Multiprocessors - Cache Coherence - Multithreading and Chip Multiprocessors - Clusters - Non-uniform Memory Access Computers - Vector Computation - Multicore Organization.

5.1 Parallel Processing: Multiple Processors [FLYNN’S CLASSIFICATION]


A Taxonomy of Parallel Processor Architectures

SISD
• Single Instruction stream, Single Data stream.
• An example of SISD is the uniprocessor.
• It has a single control unit producing a single stream of instructions.
• It has one processing unit; the processing unit may contain more than one functional unit, all under the
supervision of the one control unit.
• It has one memory unit.

SIMD
• Single Instruction stream, Multiple Data streams.
• It has a single control unit producing a single stream of instructions and multiple streams of data.
• It has more than one processing unit, and each processing unit has its own associated data memory unit.
• In this organization, multiple processing elements work under the control of a single control unit.
• A single machine instruction controls the simultaneous execution of a number of processing elements.
• Each instruction is executed on a different set of data by a different processor.
• The same instruction is applied to many data streams, as in a vector processor.
• All the processing elements of this organization receive the same instruction broadcast from the CU.
• Main memory can also be divided into modules generating multiple data streams, acting as a distributed
memory as shown in the figure.
• Therefore, all the processing elements simultaneously execute the same instruction and are said to be 'lock-
stepped' together.
• Each processor takes the data from its own memory and hence operates on a distinct data stream.
• Every processor must be allowed to complete its instruction before the next instruction is taken for
execution. Thus, the execution of instructions is synchronous.
• Examples of SIMD are the vector processor and the array processor.

Advantage of SIMD:
• The original motivation behind SIMD was to amortize the cost of the control unit over dozens of
execution units.
• Another advantage is the reduced instruction bandwidth and space: SIMD needs only one copy of the
code that is being simultaneously executed, while a message-passing MIMD may need a copy in every
processor, and a shared-memory MIMD will need multiple instruction caches.
• SIMD works best when dealing with arrays in for loops, because the parallelism is achieved by performing
the same operation on independent data (see the sketch below).
• SIMD is at its weakest in case or switch statements, where each execution unit must perform a different
operation on its data, depending on what data it has. Execution units with the wrong data must be disabled
so that units with proper data may continue.
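
To illustrate the for-loop case above, here is a minimal C sketch (an assumption of this example: an x86 processor with SSE and a compiler providing <xmmintrin.h>); one SIMD instruction performs four single-precision additions in lock step, just as one broadcast instruction drives many processing elements:

    #include <xmmintrin.h>   /* SSE intrinsics: __m128 holds four floats */

    /* Scalar version: one addition per loop iteration. */
    void add_scalar(const float *a, const float *b, float *c, int n) {
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }

    /* SIMD version: each instruction adds four elements at once
       (n is assumed to be a multiple of 4 to keep the sketch short). */
    void add_simd(const float *a, const float *b, float *c, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(&a[i]);           /* load 4 floats */
            __m128 vb = _mm_loadu_ps(&b[i]);
            _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));  /* 4 lock-stepped adds */
        }
    }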
MISD
• Multiple Instruction streams, Single Data stream (MISD).
• In this organization, multiple processing elements are organized under the control of multiple control units.
• Each control unit handles one instruction stream, which is processed through its corresponding processing
element.
• Each processing element, however, processes only a single data stream at a time.
• Therefore, to handle multiple instruction streams over a single data stream, multiple control units and
multiple processing elements are organized in this classification.
• All processing elements interact with the common shared memory for the organization of the single data
stream, as shown in the figure.
• The only known example of a computer capable of MISD operation is the C.mmp built by Carnegie-Mellon
University.

MIMD
• Multiple Instruction streams, Multiple Data streams (MIMD). In this organization, multiple processing
elements and multiple control units are organized.
• Compared to MISD, the difference is that multiple instruction streams now operate on multiple data
streams.
• Therefore, to handle multiple instruction streams, multiple control units and multiple processing
elements are organized such that the processing elements handle multiple data streams from the
main memory, as shown in the figure.
• The processors work on their own data with their own instructions. Tasks executed by different processors
can start or finish at different times.
• They are not lock-stepped, as in SIMD computers, but run asynchronously (see the sketch below).
• This classification represents the parallel computer in the true sense: the MIMD organization is what is
really meant by a parallel computer.
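
A hedged C sketch of MIMD-style execution using POSIX threads (function and variable names here are illustrative, not from the text): two threads run different instruction streams on different data, start asynchronously, and may finish at different times:

    #include <pthread.h>
    #include <stdio.h>

    /* Each thread is an independent instruction stream with its own data. */
    void *sum_ints(void *arg) {
        int *a = arg, s = 0;
        for (int i = 0; i < 4; i++) s += a[i];
        printf("sum = %d\n", s);
        return NULL;
    }

    void *scale_doubles(void *arg) {
        double *d = arg;
        for (int i = 0; i < 4; i++) d[i] *= 2.0;  /* different operation, different data */
        printf("d[0] = %f\n", d[0]);
        return NULL;
    }

    int main(void) {
        int a[4] = {1, 2, 3, 4};
        double d[4] = {0.5, 1.5, 2.5, 3.5};
        pthread_t t1, t2;
        pthread_create(&t1, NULL, sum_ints, a);       /* not lock-stepped: */
        pthread_create(&t2, NULL, scale_doubles, d);  /* each runs asynchronously */
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }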

5.2 SYMMETRIC MULTIPROCESSORS


• In symmetric multiprocessing, multiple processors share a common memory and operating system. All of
these processors work together to execute processes. The operating system treats all the processors
equally, and no processor is reserved for special purposes.
• Symmetric multiprocessing is also known as tightly coupled multiprocessing, as all the CPUs are
connected at the bus level and have access to a shared memory.
• All the parallel processors in symmetric multiprocessing have their own private cache memory to decrease
system bus traffic and reduce data access time.
• Symmetric multiprocessing systems allow a processor to execute any process, no matter where its data is
located in memory. The only stipulation is that a process should not be executing on two or more processors
at the same time.
• In general, a symmetric multiprocessing system does not exceed 16 processors, as this number can be
comfortably handled by the operating system.
Uses of Symmetric Multiprocessing
• Symmetric multiprocessing is useful for time-sharing systems, as these have multiple processes running
in parallel; these processes can be scheduled on the parallel processors.
• Symmetric multiprocessing is not that useful in personal computers unless multithreaded programming is
taken into account; the multiple threads can then be scheduled on the parallel processors (see the sketch
below).
• Time-sharing systems that use multithreaded programming can also make use of symmetric
multiprocessing.
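
A minimal sketch of that multithreaded use case in C (assuming a POSIX system; the OS, not the program, decides which processor runs each thread): the program asks how many processors are online and creates one thread per processor for the SMP scheduler to place:

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    void *work(void *arg) {
        printf("thread %ld running on whichever CPU the OS chose\n", (long)arg);
        return NULL;
    }

    int main(void) {
        /* On an SMP the OS treats all processors equally, so the program
           only creates threads; placement is the scheduler's job. */
        long n = sysconf(_SC_NPROCESSORS_ONLN);
        if (n > 64) n = 64;              /* keep the sketch bounded */
        pthread_t t[64];
        for (long i = 0; i < n; i++)
            pthread_create(&t[i], NULL, work, (void *)i);
        for (long i = 0; i < n; i++)
            pthread_join(t[i], NULL);
        return 0;
    }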
Advantages of Symmetric Multiprocessing
• Throughput: The throughput of the system is increased. As there are multiple processors, more
processes are executed, increasing the number of processes completed per unit time.
• Reliability: Symmetric multiprocessing systems are much more reliable than single-processor systems.
Even if a processor fails, the system still endures; only its efficiency decreases a little.
• Incremental growth: A user can enhance the performance of a system by adding an additional
processor.
• Scaling: Vendors can offer a range of products with different price and performance characteristics
based on the number of processors configured in the system.
Disadvantages
• Complex design: Since all the processors are treated equally by the OS, the design and management of
such an OS become difficult.
• Costlier: Because all the processors share a common main memory, a large main memory is required to
accommodate them all.
Characteristics of SMP
• Identical: All the processors are treated equally, i.e., all are identical.
• Communication: Shared memory is the mode of communication among processors.
• Complexity: SMPs are complex in design, as all units share the same memory and data bus.
• Expensive: They are costlier in nature.
An SMP can be defined as a standalone computer system with the following characteristics:
1. There are two or more similar processors of comparable capability.
2. These processors share the same main memory and I/O facilities and are interconnected by a bus
or other internal connection scheme, such that memory access time is approximately the same for each
processor.
3. All processors share access to I/O devices, either through the same channels or through different
channels that provide paths to the same device.
4. All processors can perform the same functions (hence the term symmetric).
5. The system is controlled by an integrated operating system that provides interaction between
processors and their programs at the job, task, file, and data element levels.
6. In an SMP, individual data elements can constitute the level of interaction, and there can be a high
degree of cooperation between processes.
Applications
1. Time Sharing System
2. Multithreading
System Architecture

• There are two or more processors. Each processor is self-contained, including a control unit, ALU,
registers, and, typically, one or more levels of cache.
• Each processor has access to a shared main memory and the I/O devices through some form of
interconnection mechanism.
• The processors can communicate with each other through memory. It may also be possible for
processors to exchange signals directly.
• The memory is often organized so that multiple simultaneous accesses to separate blocks of memory
are possible.
• In some configurations, each processor may also have its own private main memory and I/O channels
in addition to the shared resources. All processors share access to I/O devices, either through the same
channels or through different channels that provide paths to the same device.
• All processors can perform the same functions (hence the term symmetric).
• The system is controlled by an integrated operating system that provides interaction between
processors and their programs at the job, task, file, and data element levels.
• Typically, workstation and PC SMPs have two levels of cache, with the L1 cache internal (on the same
chip as the processor) and the L2 cache either internal or external. Some processors now employ an L3
cache as well.
• The use of caches introduces some new design considerations. Because each local cache
contains an image of a portion of memory, if a word is altered in one cache, it could conceivably
invalidate a word in another cache. To prevent this, the other processors must be alerted that an update
has taken place. This problem is known as the cache coherence problem and is typically addressed in
hardware rather than by the operating system (the sketch below shows how coherence traffic can become
visible to software).
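
A hedged C sketch of how that hardware coherence activity can surface as a software-visible cost (thread names and counts are illustrative): two threads update two different counters that happen to share one cache line, so every write forces the coherence hardware to move the line between the two processors' caches; padding each counter to its own line (64 bytes is assumed here) removes the traffic:

    #include <pthread.h>
    #include <stdio.h>

    /* Both counters live in the same cache line; writes by one thread
       invalidate the line in the other processor's cache ("false sharing").
       Inserting e.g. char pad[56] between them would give each counter its
       own (assumed 64-byte) line and eliminate the coherence traffic. */
    struct { long a, b; } line;

    void *bump_a(void *arg) {
        for (long i = 0; i < 50000000; i++) line.a++;
        return NULL;
    }
    void *bump_b(void *arg) {
        for (long i = 0; i < 50000000; i++) line.b++;
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, bump_a, NULL);
        pthread_create(&t2, NULL, bump_b, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("a=%ld b=%ld\n", line.a, line.b);  /* correct either way; only speed differs */
        return 0;
    }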
Key Issues
1. Simultaneous concurrent processes
2. Scheduling
3. Synchronization
4. Memory management
5. Reliability and fault tolerance
5.4 HARDWARE MULTITHREADING
Multithreading
• A mechanism by which an instruction stream is divided into several smaller streams (threads) that can be
executed in parallel is called multithreading.
Hardware Multithreading
• Increasing utilization of a processor by switching to another thread when one thread is stalled is known as
hardware multithreading.
Thread
• A thread includes the program counter, the register state, and the stack. It is a lightweight process; whereas
threads commonly share a single address space, processes don’t.
Thread Switch
• The act of switching processor control from one thread to another within the same process. It is much less
costly than a process switch.
Process
• A process includes one or more threads, the address space, and the operating system state. Hence, a process
switch usually invokes the operating system, but a thread switch does not (see the sketch below).
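
The address-space distinction above can be made concrete with a short C sketch (POSIX assumed; the counter variable is illustrative): a thread created with pthread_create shares the parent's address space, so the parent sees its update, while a child created with fork gets its own copy, so the parent does not:

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int counter = 0;                 /* one copy per address space */

    void *thread_body(void *arg) {
        counter++;                   /* same address space as main */
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, thread_body, NULL);
        pthread_join(t, NULL);
        printf("after thread:  counter = %d\n", counter);  /* prints 1 */

        if (fork() == 0) {           /* child process: private copy */
            counter++;
            exit(0);
        }
        wait(NULL);
        printf("after process: counter = %d\n", counter);  /* still 1 */
        return 0;
    }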
What are the approaches to hardware multithreading?
There are two main approaches to hardware multithreading.
1. Fine-grained Multithreading
2. Coarse-grained Multithreading
Fine-grained Multithreading
• A version of hardware multithreading that implies switching between threads after every instruction,
resulting in interleaved execution of multiple threads. It switches from one thread to another at each clock
cycle.
• This interleaving is often done in a round-robin fashion, skipping any threads that are stalled at that clock
cycle.
• To make fine-grained multithreading practical, the processor must be able to switch threads on every clock
cycle.
Advantage
• Vertical waste is eliminated.
• Pipeline hazards cannot arise.
• Zero switching overhead.
• Ability to hide latency within a thread, i.e., it can hide the throughput losses that arise from both short and
long stalls.
• Instructions from other threads can be executed when one thread stalls.
• High execution efficiency.
• Potentially less complex than alternative high-performance processors.
Disadvantage
• Clock cycles are wasted if a thread has few operations to execute.
• It needs a lot of threads to keep the pipeline busy.
• It is more expensive than coarse-grained multithreading.
• It slows down the execution of the individual threads, since a thread that is ready to execute without stalls
will be delayed by instructions from other threads.
Coarse-grained Multithreading
• Coarse-grained multithreading was invented as an alternative to fine-grained multithreading.
• A version of hardware multithreading that implies switching between threads only after significant events,
such as a last-level cache miss.
• This change relieves the need to have thread switching be extremely fast and is much less likely to slow
down the execution of an individual thread, since instructions from other threads will only be issued when
a thread encounters a costly stall.
Advantage
• Thread switching does not need to be extremely fast.
• It is much less likely to slow down the execution of an individual thread.
Disadvantage
• It is hard to overcome throughput losses from shorter stalls, due to pipeline start-up costs.
• Since the CPU issues instructions from only one thread, when a stall occurs the pipeline must be emptied.
• The new thread must fill the pipeline before instructions can complete.
• Due to this start-up overhead, coarse-grained multithreading is much more useful for reducing the penalty
of high-cost stalls, where pipeline refill is negligible compared to the stall time.
Simultaneous multithreading (SMT)
• It is a variation on hardware multithreading that uses the resources of a multiple-issue, dynamically
scheduled pipelined processor to exploit thread-level parallelism at the same time it exploits instruction
level parallelism.
• The key insight that motivates SMT is that multiple-issue processors often have more functional unit
parallelism available than most single threads can effectively use.
• Since SMT relies on the existing dynamic mechanisms, it does not switch resources every cycle.
• Instead, SMT is always executing instructions from multiple threads, leaving it to the hardware to
associate instruction slots and renamed registers with their proper threads.
• The four threads at the top show how each would execute running alone on a standard superscalar processor
without multithreading support.
• The three examples at the bottom show how they would execute running together in three multithreading
options.
• The horizontal dimension represents the instruction issue capability in each clock cycle.
• The vertical dimension represents a sequence of clock cycles.
• An empty (white) box indicates that the corresponding issue slot is unused in that clock cycle.
• The shades of gray and color correspond to four different threads in the multithreading processors.
• The additional pipeline start-up effects for coarse multithreading, which are not illustrated in this figure,
would lead to further loss in throughput for coarse multithreading.
Advantage
• Its ability to boost utilization by dynamically scheduling functional units among multiple threads.
• It increases hardware design flexibility.
• It produces better performance and adds resources in a fine-grained manner.
Disadvantage
• It cannot improve performance if any of the shared resources are the limiting bottlenecks for the
performance.

Figure: How four threads use the issue slots of a superscalar processor in different approaches.
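
Whether SMT is actually enabled on a given machine can often be checked from software; a hedged, Linux-specific C sketch (it assumes a kernel recent enough to expose /sys/devices/system/cpu/smt/active):

    #include <stdio.h>

    int main(void) {
        /* Recent Linux kernels report whether SMT is enabled via sysfs. */
        FILE *f = fopen("/sys/devices/system/cpu/smt/active", "r");
        if (!f) {
            puts("SMT status not exposed by this kernel");
            return 1;
        }
        int active = 0;
        fscanf(f, "%d", &active);
        fclose(f);
        printf("SMT active: %d\n", active);   /* 1 = on, 0 = off */
        return 0;
    }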

■ Chip multiprocessing:
In this case, multiple cores are implemented on a single chip and each core handles separate
threads. The advantage of this approach is that the available logic area on a chip is used
effectively without depending on ever-increasing complexity in pipeline design.
■ Blade Servers
A common implementation of the cluster approach is the blade server. A blade server is a server
architecture that houses multiple server modules (“blades”) in a single chassis. It is widely used in
data centers to save space and improve system management. Either self-standing or rack-mounted,
the chassis provides the power supply, and each blade has its own processor, memory, and hard disk.
5.6 NONUNIFORM MEMORY ACCESS
■ Uniform memory access (UMA): All processors have access to all parts of main memory using loads
and stores. The memory access time of a processor to all regions of memory is the same. The access
times experienced by different processors are the same.
■ Nonuniform memory access (NUMA): All processors have access to all parts of main memory
using loads and stores. The memory access time of a processor differs depending on which region
of main memory is accessed. The last statement is true for all processors; however, for different
processors, which memory regions are slower and which are faster differ.
■ Cache-coherent NUMA (CC-NUMA): A NUMA system in which cache coherence is maintained
among the caches of the various processors. A NUMA system without cache coherence is more or
less equivalent to a cluster.
Motivation
■ The objective with NUMA is to maintain a transparent system-wide memory while permitting
multiple multiprocessor nodes, each with its own bus or other internal interconnect system.
System Architecture
■ Non-uniform memory access, or NUMA, is a method of configuring a cluster of microprocessors in
a multiprocessing system so they can share memory locally.
■ The idea is to improve the system's performance and allow it to expand as processing needs evolve.
■ In a NUMA setup, the individual processors in a computing system share local memory and can work
together.
■ Data can flow smoothly and quickly since it goes through intermediate memory instead of a main bus.
■ The NUMA architecture is common in multiprocessing systems. These systems include multiple hardware
resources including memory, input/output devices, chipset, networking devices and storage devices (in
addition to processors).
■ Each collection of resources is a node. Multiple nodes are linked via a high-speed interconnect or bus.
■ Every NUMA system contains a coherent global memory and I/O address space that can be accessed by all
processors in the system.
■ The other components can vary, although at least one node must have memory, one must have I/O
resources, and one must have processors.
■ In this type of memory architecture, a processor is assigned a specific local memory for its own use, and
this memory is placed close to the processor.
■ The signal paths are shorter, which is why these processors can access local memory faster than non-local
memory. Also, since there is no sharing of non-local memory, there is an appreciable drop in delays
(latency) when multiple access requests come in for the same memory location.
How non-uniform memory access works
■ When a processor looks for data at a certain memory address, it first looks in the L1 cache on the
microprocessor.
■ Then it moves to the larger L2 cache chip and finally to a third level of cache (L3). The
NUMA configuration provides this third level.
■ If the processor still cannot find the data, it will look in the remote memory located near the other
microprocessors.
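
Software can exploit this locality explicitly by placing allocations on a chosen node. A minimal sketch using the Linux libnuma library (assumed to be installed; compile with -lnuma; node 0 is an arbitrary example):

    #include <numa.h>
    #include <stdio.h>

    int main(void) {
        if (numa_available() < 0) {   /* system or kernel without NUMA support */
            puts("NUMA not available");
            return 1;
        }
        printf("highest node number: %d\n", numa_max_node());

        /* Allocate 1 MiB placed on node 0, i.e., in memory local to node 0's CPUs. */
        size_t sz = 1 << 20;
        char *buf = numa_alloc_onnode(sz, 0);
        if (buf) {
            buf[0] = 1;               /* touch the page so it is actually placed */
            numa_free(buf, sz);
        }
        return 0;
    }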
Advantages and disadvantages of NUMA
■ One of the biggest advantages of NUMA is the fast movement of data and lower latency in the
multiprocessing system.
■ Additionally, NUMA reduces data replication and simplifies programming.
■ The parallel computers in a NUMA architecture are highly scalable and responsive to data allocation in
local memories.
■ One disadvantage of NUMA is that it can be expensive.
■ The lack of programming standards for larger configurations can make implementation challenging.
5.7 VECTOR COMPUTATION
• An older and more elegant interpretation of SIMD is called a vector architecture.
• It is a great match to problems with lots of data-level parallelism, i.e., parallelism achieved by
performing the same operation on independent data.
• Rather than having 64 ALUs perform 64 additions simultaneously, like the old array processors,
vector architectures pipeline the ALU to get good performance at lower cost.
• The basic idea of a vector architecture is to collect data elements from memory, put them in order into a large
set of registers, operate on them sequentially in registers using pipelined execution units, and then write the
results back to memory (see the sketch below).
• A key feature of vector architectures is thus a set of vector registers. A vector architecture might have
32 vector registers, each with 64 64-bit elements.
• The following figure illustrates how to improve vector performance by using parallel pipelines to execute
a vector add instruction.
• The figure shows the use of multiple functional units to improve the performance of a single vector add
instruction, C = A + B.
• The vector processor (a) on the left has a single add pipeline and can complete one addition per cycle.
• The vector processor (b) on the right has four add pipelines or lanes and can complete four additions per
cycle.
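
A hedged C sketch of the basic idea described above (all names are illustrative; real hardware pipelines the per-element work instead of looping): data elements are gathered from memory into 64-element vector registers, operated on in registers, and written back:

    #define MVL 64                       /* elements per vector register */

    typedef struct { double e[MVL]; } vreg;   /* model of one vector register */

    void vload (vreg *v, const double *m)             { for (int i = 0; i < MVL; i++) v->e[i] = m[i]; }
    void vadd  (vreg *d, const vreg *a, const vreg *b){ for (int i = 0; i < MVL; i++) d->e[i] = a->e[i] + b->e[i]; }
    void vstore(const vreg *v, double *m)             { for (int i = 0; i < MVL; i++) m[i] = v->e[i]; }

    /* C = A + B, processed in MVL-element chunks
       (n is assumed to be a multiple of MVL for brevity). */
    void vector_add(const double *A, const double *B, double *C, int n) {
        vreg v1, v2, v3;
        for (int i = 0; i < n; i += MVL) {
            vload(&v1, &A[i]);       /* collect data elements from memory */
            vload(&v2, &B[i]);
            vadd(&v3, &v1, &v2);     /* operate in registers (pipelined in hardware) */
            vstore(&v3, &C[i]);      /* write results back to memory */
        }
    }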

Vector Lane
• One or more vector functional units and a portion of the vector register file. Inspired by lanes on highways
that increase traffic speed, multiple lanes execute vector operations simultaneously.
• The figure shows the structure of a four-lane vector unit. Going from one lane to four lanes reduces the
number of clocks per vector instruction by roughly a factor of four.
• The figure shows three vector functional units: an FP add, an FP multiply, and a load-store unit.
• For multiple lanes to be advantageous, both the applications and the architecture must support long vectors.
• The elements within a single vector add instruction are interleaved across the four lanes.
• The vector-register storage is divided across the lanes, with each lane holding every fourth element of each
vector register (see the sketch below).
• Each of the vector arithmetic units contains four execution pipelines, one per lane, which act in concert to
complete a single vector instruction.
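
The interleaving just described can be sketched in C (a model, not hardware: in a real four-lane unit the lanes run concurrently, while the outer loop here only shows which elements each lane owns):

    #define VLEN  64   /* vector length */
    #define LANES 4    /* lanes in the vector unit */

    /* Lane L holds elements L, L+4, L+8, ... of each vector register,
       so a single vector add is split across the four lanes. */
    void lane_partitioned_add(const double *a, const double *b, double *c) {
        for (int lane = 0; lane < LANES; lane++)          /* concurrent in hardware */
            for (int i = lane; i < VLEN; i += LANES)
                c[i] = a[i] + b[i];                       /* done by lane's own pipeline */
    }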
5.8 MULTICORE ORGANIZATION
The main variables in a multicore organization are as follows:
■ The number of core processors on the chip
■ The number of levels of cache memory
■ How cache memory is shared among cores
■ Whether simultaneous multithreading (SMT) is employed
• In the simplest organization, the only on-chip cache is the L1 cache, with each core having its own
dedicated L1 cache. Almost invariably, the L1 cache is divided into instruction and data caches for
performance reasons, while L2 and higher-level caches are unified.
• A second organization allocates chip space to memory similarly but uses a shared L2 cache; the Intel
Core Duo has this organization. Finally, as the amount of cache memory available on the chip
continues to grow, performance considerations dictate splitting off a separate, shared L3 cache, with
dedicated L1 and L2 caches for each core processor; the Intel Core i7 is an example of this
organization (the sketch below shows how this sharing can be inspected on a running system).
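
On a running Linux system, how the caches are shared among cores can be inspected directly; a hedged sketch (sysfs paths assumed as on typical Linux kernels) that prints, for CPU 0, each cache level and the set of CPUs sharing it, so a private L1/L2 lists few CPUs while a shared L3 lists many:

    #include <stdio.h>

    int main(void) {
        for (int idx = 0; idx < 8; idx++) {
            char path[128], level[16] = "", shared[128] = "";
            FILE *f;

            snprintf(path, sizeof path,
                     "/sys/devices/system/cpu/cpu0/cache/index%d/level", idx);
            if (!(f = fopen(path, "r"))) break;   /* no more cache levels */
            fscanf(f, "%15s", level);
            fclose(f);

            snprintf(path, sizeof path,
                     "/sys/devices/system/cpu/cpu0/cache/index%d/shared_cpu_list", idx);
            if ((f = fopen(path, "r"))) {
                fscanf(f, "%127s", shared);
                fclose(f);
            }
            printf("L%s cache shared by CPUs %s\n", level, shared);
        }
        return 0;
    }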
