
Module 5

PIPELINE & VECTOR PROCESSING AND MULTI PROCESSORS


Parallel Processing, Pipelining, Arithmetic Pipeline, Instruction Pipeline, RISC Pipeline,
Vector Processing, Array Processors.
5.1 Parallel Processing:

• Parallel processing is a term used to denote a large class of techniques that are used to provide
simultaneous data-processing tasks for the purpose of increasing the computational speed
of a computer system.

• The purpose of parallel processing is to speed up the computer processing capability and
increase its throughput, that is, the amount of processing that can be accomplished during a
given interval of time.

• The amount of hardware increases with parallel processing, and with it, the cost of the system
increases.

• Parallel processing can be viewed from various levels of complexity.

◼ At the lowest level, we distinguish between parallel and serial operations by the type of registers used, e.g., registers with parallel load (parallel) versus shift registers (serial).

◼ At a higher level, it can be achieved by having a multiplicity of functional units that perform identical or different operations simultaneously.

• Fig. below shows one possible way of separating the execution unit into eight
functional units operating in parallel.

◼ A multifunctional organization is usually associated with a complex control unit to coordinate all the activities among the various components.
The operands in the registers are applied to one of the units depending on the operation specified by the instruction associated with the operands. The operation performed in each functional unit is indicated in each block of the diagram. The adder and integer multiplier perform the arithmetic operations with integer numbers. The floating-point operations are separated into three circuits operating in parallel. The logic, shift, and increment operations can be performed concurrently on different data. All units are independent of each other, so one number can be shifted while another number is being incremented.

• There are a variety of ways that parallel processing can be classified. It can be considered from the
o Internal organization of the processors
o Interconnection structure between processors
o The flow of information through the system
One classification introduced by M. J. Flynn considers the organization of a computer system by the number of instructions and data items that are manipulated simultaneously. The normal operation of a computer is to fetch instructions from memory and execute them in the processor. The sequence of instructions read from memory constitutes an instruction stream. The operations performed on the data in the processor constitute a data stream. Parallel processing may occur in the instruction stream, in the data stream, or in both. Flynn's classification divides computers into four major groups as follows:
• Single instruction stream, single data stream (SISD)
• Single instruction stream, multiple data stream (SIMD)
• Multiple instruction stream, single data stream (MISD)
• Multiple instruction stream, multiple data stream (MIMD)
Single instruction stream, single data stream (SISD)
• Represents the organization of a single computer containing a control unit, a
processor unit, and a memory unit.
• Instructions are executed sequentially and the system may or may not have internal parallel processing
capabilities.
• Parallel processing may be achieved by means of multiple functional units or by pipeline processing.

Single instruction stream, multiple data stream (SIMD)


• Represents an organization that includes many processing units under the
supervision of a common control unit.
• All processors receive the same instruction from the control unit but operate on
different items of data.
• The shared memory unit must contain multiple modules so that it can communicate with all the
processors simultaneously.

Multiple instruction stream, single data stream (MISD)


• MISD structure is only of theoretical interest since no practical system has been
constructed using this organization.
Multiple instruction stream, multiple data stream (MIMD)
• MIMD organization refers to a computer system capable of processing several
programs at the same time. e.g., multiprocessor and multicomputer system
We consider parallel processing under the following main topics:

o Pipeline processing
▪ Is an implementation technique where arithmetic sub operations or the phases of
a computer instruction cycle overlap in execution.
o Vector processing
▪ Deals with computations involving large vectors and matrices.
o Array processing
▪ Perform computations on large arrays of data.

5.2 Pipelining
Pipelining is a technique of decomposing a sequential process into sub operations, with each sub
process being executed in a special dedicated segment that operates concurrently with all other segments.
The name “pipeline” implies a flow of information analogous to an industrial assembly line. It is
characteristic of pipelines that several computations can be in progress in distinct segments at the same
time.
• Perhaps the simplest way of viewing the pipeline structure is to imagine that each segment consists
of an input register followed by a combinational circuit.
o The register holds the data.
o The combinational circuit performs the sub operation in the particular segment.
• A clock is applied to all registers after enough time has elapsed to perform all segment activity.
• Example
• The pipeline organization will be demonstrated by means of a simple example.
o To perform the combined multiply and add operations with a stream of numbers
Ai * Bi + Ci for i = 1, 2, 3, …, 7

• Each sub operation is to be implemented in a segment within a pipeline.

R1 ← Ai, R2 ← Bi        Input Ai and Bi
R3 ← R1 * R2, R4 ← Ci   Multiply and input Ci
R5 ← R3 + R4            Add Ci to product

The five registers are loaded with new data every clock pulse. The effect of each clock is shown in Table 9-1. The first clock pulse transfers A1 and B1 into R1 and R2. The second clock pulse transfers the product of R1 and R2 into R3 and C1 into R4. The same clock pulse transfers A2 and B2 into R1 and R2. The third clock pulse operates on all three segments simultaneously. It places A3 and B3 into R1 and R2, transfers the product of R1 and R2 into R3, transfers C2 into R4, and places the sum of R3 and R4 into R5. It takes three clock pulses to fill up the pipe and retrieve the first output from R5. From there on, each clock produces a new output and moves the data one step down the pipeline. This happens as long as new input data flow into the system. When no more input data are available, the clock must continue until the last output emerges out of the pipeline.
• Each segment has one or two registers and a combinational circuit as shown in Fig. 9-2.
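The clock-by-clock behavior of Table 9-1 can be sketched in Python (a rough illustration of my own; the seven data triples are made-up values, and register names R1–R5 follow the example above):

    # Sketch of the three-segment multiply-add pipeline: R5 = Ai*Bi + Ci.
    # Segment 1 loads R1, R2; segment 2 computes R3 = R1*R2 and loads R4;
    # segment 3 computes R5 = R3 + R4. All registers are clocked together.
    A = [1, 2, 3, 4, 5, 6, 7]
    B = [7, 6, 5, 4, 3, 2, 1]
    C = [10, 20, 30, 40, 50, 60, 70]

    R1 = R2 = R3 = R4 = R5 = None
    for clock in range(len(A) + 2):          # 2 extra pulses drain the pipe
        # Compute all new register values from the OLD contents first, so
        # every segment sees the values latched on the previous clock.
        new_R5 = R3 + R4 if R3 is not None else None
        new_R3 = R1 * R2 if R1 is not None else None
        new_R4 = C[clock - 1] if 1 <= clock <= len(C) else None
        new_R1 = A[clock] if clock < len(A) else None
        new_R2 = B[clock] if clock < len(B) else None
        R1, R2, R3, R4, R5 = new_R1, new_R2, new_R3, new_R4, new_R5
        print(f"pulse {clock + 1}: R1={R1} R2={R2} R3={R3} R4={R4} R5={R5}")

The first output A1*B1 + C1 appears in R5 on pulse 3, after which every pulse delivers a new result, matching the table.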

General considerations:
Any operation that can be decomposed into a sequence of sub operations of about the same complexity
can be implemented by a pipeline processor.
• The general structure of a four-segment pipeline is illustrated in Fig. 9-3.
• We define a task as the total operation performed going through all the segments in the pipeline.
• The behavior of a pipeline can be illustrated with a space-time diagram.
• It shows the segment utilization as a function of time.

The space-time diagram of a four-segment pipeline is demonstrated in Fig. 9-4.

• Where a k-segment pipeline with a clock cycle time tp is used to execute n tasks.
o The first task T1 requires a time equal to k·tp to complete its operation.
o The remaining n − 1 tasks emerge at the rate of one per clock cycle and are completed after a time equal to (n − 1)·tp.
o Therefore, to complete n tasks using a k-segment pipeline requires k + (n − 1) clock cycles.
• Consider a non-pipeline unit that performs the same operation and takes a time equal to tn to complete each task.
o The total time required for n tasks is n·tn.
• The speedup of pipeline processing over an equivalent non-pipeline processing is defined by the ratio S = n·tn / (k + n − 1)·tp (see the numerical sketch at the end of this section).
• As n becomes much larger than k − 1, the speedup approaches S = tn/tp.
• If we assume that the time it takes to process a task is the same in the pipeline and non-pipeline circuits, i.e., tn = k·tp, the speedup reduces to S = k·tp/tp = k.
• This shows that the theoretical maximum speedup that a pipeline can provide is k, where k is the number of segments in the pipeline.

• To duplicate the theoretical speed advantage of a pipeline process by means of multiple functional units, it is necessary to construct k identical units that will be operating in parallel.
• This is illustrated in Fig. 9-5, where four identical circuits are connected in parallel.
• Instead of operating with the input data in sequence as in a pipeline, the parallel circuits accept four input data items simultaneously and perform four tasks at the same time.
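A short numerical sketch of the speedup formula (my own; k = 4 segments and tp = 20 ns are assumed values):

    # Pipeline speedup S = n*tn / ((k + n - 1)*tp) for a k-segment pipeline.
    # Assumes the non-pipelined task time tn equals k*tp (same total delay).
    k = 4          # number of pipeline segments (assumed)
    tp = 20        # clock cycle time in ns (assumed)
    tn = k * tp    # non-pipelined time per task

    for n in (1, 10, 100, 1000):
        S = (n * tn) / ((k + n - 1) * tp)
        print(f"n = {n:5d} tasks -> speedup S = {S:.2f}")
    # As n grows, S approaches tn/tp = k = 4, the theoretical maximum.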

5.3 Arithmetic Pipeline


• There are various reasons why the pipeline cannot operate at its maximum
theoretical rate.
o Different segments may take different times to complete their sub operation.
o It is not always correct to assume that a nonpipeline circuit has the same time delay as that of an equivalent pipeline circuit.
• There are two areas of computer design where the pipeline organization is
applicable.
o Arithmetic pipeline
o Instruction pipeline

Arithmetic Pipeline: Introduction


• Pipeline arithmetic units are usually found in very high-speed computers
o Floating-point operations, multiplication of fixed-point numbers, and similar computations encountered in scientific problems
• Floating-point operations are easily decomposed into suboperations as demonstrated in Sec. 10-5.
• An example of a pipeline unit for floating-point addition and subtraction is shown in the following:
o The inputs to the floating-point adder pipeline are two normalized floating-point binary numbers
X = A × 2^a
Y = B × 2^b
A and B are two fractions that represent the mantissas, a and b are the exponents.
• The floating-point addition and subtraction can be performed in four segments, as
shown in Fig. 9-6.
• The suboperations that are performed in the four segments are:
o Compare the exponents.
▪ The larger exponent is chosen as the exponent of the result.
o Align the mantissas.
▪ The exponent difference determines how many times the mantissa associated with the smaller exponent must be shifted to the right.
o Add or subtract the mantissas.
o Normalize the result.
▪ When an overflow occurs, the mantissa of the sum or difference is shifted right and the exponent incremented by one.
▪ If an underflow occurs, the number of leading zeros in the mantissa determines the number of left shifts in the mantissa and the number that must be subtracted from the exponent.
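A minimal sketch of the four suboperations (my own; decimal exponents are assumed for readability and mantissa underflow is not handled — real hardware operates on binary numbers):

    # Four-segment floating-point addition, one function per segment.

    def compare_exponents(a, b):
        # Segment 1: choose the larger exponent and the alignment shift count.
        return (a, a - b) if a >= b else (b, b - a)

    def align(A, a, B, b, diff):
        # Segment 2: shift the mantissa with the smaller exponent right.
        if a < b:
            return A / 10**diff, B
        return A, B / 10**diff

    def add_mantissas(A, B):
        # Segment 3: add (or subtract) the aligned mantissas.
        return A + B

    def normalize(M, exp):
        # Segment 4: on mantissa overflow, shift right and bump the exponent.
        while abs(M) >= 1.0:
            M, exp = M / 10, exp + 1
        return M, exp

    # X = 0.9504 * 10^3, Y = 0.8200 * 10^2 (example values, assumed)
    A, a, B, b = 0.9504, 3, 0.8200, 2
    exp, diff = compare_exponents(a, b)
    A, B = align(A, a, B, b, diff)
    M = add_mantissas(A, B)
    M, exp = normalize(M, exp)
    print(f"Z = {M:.5f} * 10^{exp}")   # -> Z = 0.10324 * 10^4

In a pipelined unit, each of these four functions is a hardware segment, so four different operand pairs can occupy the four segments at once.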
5.4 Instruction Pipeline
• Introduction:
• Pipeline processing can occur not only in the data stream but in the instruction stream as well.
• Consider a computer with an instruction fetch unit and an instruction execution unit designed to provide a two-segment pipeline; the fetch segment can be implemented by means of a first-in, first-out (FIFO) buffer.
• Computers with complex instructions require other phases in addition to the above phases to process an instruction completely.
• In the most general case, the computer needs to process each instruction with the following sequence of steps:
o Fetch the instruction from memory.
o Decode the instruction.
o Calculate the effective address.
o Fetch the operands from memory.
o Execute the instruction.
o Store the result in the proper place.
• There are certain difficulties that will prevent the instruction pipeline from operating at its
maximum rate.
o Different segments may take different times to operate on the incoming
information.
o Some segments are skipped for certain operations.
o Two or more segments may require memory access at the same time, causing one segment to
wait until another is finished with the memory.
Example: four-segment instruction pipeline:
• Assume that:
o The decoding of the instruction can be combined with the calculation of the effective address
into one segment.
o The instruction execution and storing of the result can be combined into one
segment.
• Fig 9-7 shows how the instruction cycle in the CPU can be processed with a four-
segment pipeline.
• Thus, up to four sub operations in the instruction cycle can overlap and up to four different instructions
can be in progress of being processed at the same time.
• An instruction in the sequence may cause a branch out of normal sequence.
o In that case the pending operations in the last two segments are completed and all
information stored in the instruction buffer is deleted.
o Similarly, an interrupt request will cause the pipeline to empty and start again from a new
address value.
• Fig. 9-8 shows the operation of the instruction pipeline.

The time in the horizontal axis is divided into steps of equal duration. The four segments
are represented in the diagram with an abbreviated symbol.
1. FI is the segment that fetches an instruction.
2. DA is the segment that decodes the instruction and calculates the effective address.
3. FO is the segment that fetches the operand.
4. EX is the segment that executes the instruction.

It is assumed that the processor has separate instruction and data memories so that the operation in FI and FO can proceed at the same time. In the absence of a branch instruction, each segment operates on different instructions. Thus, in step 4, instruction 1 is being executed in segment EX; the operand for instruction 2 is being fetched in segment FO; instruction 3 is being decoded in segment DA; and instruction 4 is being fetched from memory in segment FI.
Assume now that instruction 3 is a branch instruction. As soon as this instruction is decoded in segment DA in step 4, the transfer from FI to DA of the other instructions is halted until the branch instruction is executed in step 6. If the branch is taken, a new instruction is fetched in step 7. If the branch is not taken, the instruction fetched previously in step 4 can be used. The pipeline then continues until a new branch instruction is encountered.
Another delay may occur in the pipeline if the EX-segment needs to store the result of the operation in
the data memory while the FO segment needs to fetch an operand. In that case, segment FO must wait
until segment EX has finished its operation.
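A rough sketch (my own) that prints a Fig. 9-8 style space-time diagram for the ideal case, ignoring branches and memory conflicts:

    # Space-time diagram for a four-segment instruction pipeline (FI, DA, FO, EX).
    # Instruction i occupies segment s during step i + s (ideal case: no
    # branches, no memory conflicts, separate instruction and data memories).
    SEGMENTS = ["FI", "DA", "FO", "EX"]
    N = 6                                   # number of instructions (assumed)
    steps = N + len(SEGMENTS) - 1           # k + (n - 1) steps in total

    print("step: " + " ".join(f"{t + 1:>3}" for t in range(steps)))
    for i in range(N):
        row = []
        for t in range(steps):
            s = t - i                       # which segment instr i is in at step t
            row.append(f"{SEGMENTS[s]:>3}" if 0 <= s < len(SEGMENTS) else "  .")
        print(f"  i{i + 1}: " + " ".join(row))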
• In general, there are three major difficulties that cause the instruction pipeline to deviate
from its normal operation.
o Resource conflicts caused by access to memory by two segments at the same time.
▪ Can be resolved by using separate instruction and data memories
o Data dependency conflicts arise when an instruction depends on the result of a previous
instruction, but this result is not yet available.
o Branch difficulties arise from branch and other instructions that change the value of PC.
Data dependency:
o A difficulty that may cause a degradation of performance in an instruction pipeline is due to possible
collision of data or address.
▪ A data dependency occurs when an instruction needs data that are not yet available.
▪ An address dependency may occur when an operand address cannot be calculated
because the information needed by the addressing mode is not available.
o Pipelined computers deal with such conflicts between data dependencies in a variety of ways.
o Hardware interlocks: an interlock is a circuit that detects instructions whose source operands are destinations of instructions farther up in the pipeline.
▪ This approach maintains the program sequence by using hardware to insert the required
delays.
o Operand forwarding: uses special hardware to detect a conflict and then avoid it by routing
the data through special paths between pipeline segments.
▪ This method requires additional hardware paths through multiplexers as well
as the circuit that detects the conflict.
o Delayed load: the compiler for such computers is designed to detect a data conflict
and reorder the instructions as necessary to delay the loading of the conflicting data by
inserting no-operation instructions.
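A sketch of the delayed-load idea (my own; the three-field instruction tuples and register names are hypothetical):

    # Sketch of "delayed load": scan a simple instruction list and insert a
    # NOP whenever an instruction reads a register loaded by the one before.
    program = [
        ("LOAD",  "R1",   ["M[A]"]),
        ("LOAD",  "R2",   ["M[B]"]),
        ("ADD",   "R3",   ["R1", "R2"]),   # needs R2, loaded just above -> conflict
        ("STORE", "M[C]", ["R3"]),
    ]

    scheduled = []
    for op, dest, srcs in program:
        if scheduled:
            prev_op, prev_dest, _ = scheduled[-1]
            if prev_op == "LOAD" and prev_dest in srcs:
                scheduled.append(("NOP", None, []))   # delay the dependent use
        scheduled.append((op, dest, srcs))

    for ins in scheduled:
        print(ins)

The compiler-inserted NOP gives the load one extra cycle to complete before the ADD consumes R2, preserving program order without hardware interlocks.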
Handling of branch instructions

• One of the major problems in operating an instruction pipeline is the occurrence of branch instructions.
▪ An unconditional branch always alters the sequential program flow by loading the program counter with the target address.
▪ In a conditional branch, the control selects the target instruction if the condition is satisfied or the next sequential instruction if the condition is not satisfied.
• Pipelined computers employ various hardware techniques to minimize the performance
degradation caused by instruction branching.

• Prefetch target instruction: To prefetch the target instruction in addition to the instruction
following the branch. Both are saved until the branch is executed.
• Branch target buffer (BTB): The BTB is an associative memory included in the fetch segment of the pipeline.
• Each entry in the BTB consists of the address of a previously executed branch instruction and the target instruction for that branch.
• It also stores the next few instructions after the branch target instruction.
• When the pipeline decodes a branch instruction, it searches the associative memory BTB for the address of the instruction.
• If it is in the BTB, the instruction is available directly and prefetch continues from the new path.
• If the instruction is not in the BTB, the pipeline shifts to a new instruction stream and stores the target instruction in the BTB.
• The advantage of this scheme is that branch instructions that have occurred previously are readily available in the pipeline without interruption.
• Loop buffer: This is a small, very-high-speed register file maintained by the instruction fetch segment of the pipeline.
• When a program loop is detected in the program, it is stored in the loop buffer in its entirety, including
all branches.
• The program loop can be executed directly without having to access memory until the loop mode is
removed by the final branching out.
• Branch prediction: A pipeline with branch prediction uses some additional logic to guess the outcome of a conditional branch instruction before it is executed (a small sketch follows this list).
• The pipeline then begins prefetching the instruction stream from the predicted path.
• A correct prediction eliminates the wasted time caused by branch penalties.
• Delayed branch: in this procedure, the compiler detects the branch instructions and rearranges
the machine language code sequence by inserting useful instructions that keep the pipeline
operating without interruptions.
▪ A procedure employed in most RISC processors.
▪ e.g. no-operation instruction
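To make the branch prediction bullet concrete, here is a minimal sketch of a two-bit saturating-counter predictor (a standard scheme of my choosing, not one named in the text):

    # Two-bit saturating-counter branch predictor: states 0-1 predict
    # "not taken", states 2-3 predict "taken"; each outcome nudges the state.
    state = 2                      # start weakly "taken" (assumed)
    outcomes = [True, True, False, True, True, True, False, True]

    correct = 0
    for taken in outcomes:
        prediction = state >= 2
        correct += (prediction == taken)
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    print(f"correct predictions: {correct}/{len(outcomes)}")

The two-bit state means a single mispredicted iteration (e.g., a loop exit) does not immediately flip the prediction, which suits loop-heavy code.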
5.5 Vector Processing
• In many science and engineering applications, the problems can be formulated in terms of vectors
and matrices that lend themselves to vector processing.
• Computers with vector processing capabilities are in demand in specialized applications, e.g.:
o Long-range weather forecasting
o Petroleum explorations
o Seismic data analysis
o Medical diagnosis
o Artificial intelligence and expert systems
o Image processing
o Mapping the human genome

• To achieve the required level of high performance it is necessary to utilize the fastest and most
reliable hardware and apply innovative procedures from vector and parallel processing
techniques.

Vector Operations
• Many scientific problems require arithmetic operations on large arrays of numbers.
• A vector is an ordered set of a one-dimensional array of data items.
• A vector V of length n is represented as a row vector by V = [V1 V2 … Vn].
• To examine the difference between a conventional scalar processor and a vector processor,
consider the following Fortran DO loop:
DO 20 I = 1, 100
20 C(I) = B(I) + A(I)
• This is implemented in machine language by the following sequence of operations.

    Initialize I = 0
20  Read A(I)
    Read B(I)
    Store C(I) = A(I) + B(I)
    Increment I = I + 1
    If I ≤ 100 go to 20
    Continue

• A computer capable of vector processing eliminates the overhead associated with the time it takes to fetch and execute the instructions in the program loop; the entire loop can be replaced by a single vector statement:

C(1:100) = A(1:100) + B(1:100)
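The same contrast expressed in Python (a sketch of my own; NumPy's vectorized addition stands in for the single vector instruction):

    import numpy as np

    A = np.arange(1, 101, dtype=np.float64)
    B = np.arange(101, 201, dtype=np.float64)

    # Scalar-processor style: one fetch-decode-execute round per element.
    C_scalar = np.empty_like(A)
    for i in range(100):
        C_scalar[i] = A[i] + B[i]

    # Vector-processor style: one "instruction" over the whole operand vectors.
    C_vector = A + B

    assert np.array_equal(C_scalar, C_vector)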

• A possible instruction format for a vector instruction is shown in Fig. 9-11.


o This assumes that the vector operands reside in memory.

• It is also possible to design the processor with a large number of registers and store all operands in registers prior to the addition operation.
o The base address and length in the vector instruction specify a group of CPU
registers.
Matrix Multiplication
• The multiplication of two n × n matrices consists of n² inner products or n³ multiply-add operations.
• Consider, for example, the multiplication of two 3 × 3 matrices A and B.
o c11 = a11·b11 + a12·b21 + a13·b31
o This requires three multiplications and (after initializing c11 to 0) three additions.
• In general, the inner product consists of the sum of k product terms of the form C = A1B1 + A2B2 + A3B3 + … + AkBk.
o In a typical application k may be equal to 100 or even 1000.
• The inner product calculation on a pipeline vector processor is shown in Fig. 9-12.
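A sketch of the idea behind Fig. 9-12 (my own, with assumed values): a four-segment adder pipeline keeps four interleaved partial sums in flight, which are combined at the end:

    # Inner product on a pipelined adder: because the adder pipeline has four
    # segments, four interleaved partial sums are in flight at once, and each
    # product term is folded into the partial sum four positions back.
    k = 12                                   # number of product terms (assumed)
    A = [float(i + 1) for i in range(k)]
    B = [2.0] * k

    partial = [0.0, 0.0, 0.0, 0.0]           # one accumulator per adder segment
    for i in range(k):
        partial[i % 4] += A[i] * B[i]        # A1B1+A5B5+..., A2B2+A6B6+..., etc.

    C = sum(partial)                         # final combination of the four sums
    assert C == sum(a * b for a, b in zip(A, B))
    print(C)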

Memory Interleaving

• Pipeline and vector processors often require simultaneous access to memory from two or more
sources.
• An instruction pipeline may require the fetching of an instruction and an operand at the same time from two different segments, and an arithmetic pipeline usually requires two or more operands to enter the pipeline at the same time.
• Instead of using two memory buses for simultaneous access, the memory can be partitioned into a number of modules connected to common memory address and data buses.
• A memory module is a memory array together with its own address and data registers. Each memory array has its own address register AR and data register DR.
• The AR receives information from a common address bus and the DR communicates with a bidirectional data bus.
• Fig. 9-13 shows a memory unit with four modules.

➢ The two least significant bits of the address can be used to distinguish between the four modules.
➢ The modular system permits one module to initiate a memory access while other modules are in the process of reading or writing a word, and each module can honor a memory request independent of the state of the other modules.
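A small sketch (my own) of this low-order interleaving, with the two least significant bits selecting one of the four modules:

    # Low-order interleaving across 4 modules: consecutive addresses land in
    # consecutive modules, so sequential accesses can overlap in time.
    NUM_MODULES = 4

    def decode(address):
        module = address & (NUM_MODULES - 1)   # two least significant bits
        word = address >> 2                    # word index within the module
        return module, word

    for addr in range(8):
        m, w = decode(addr)
        print(f"address {addr:2d} -> module {m}, word {w}")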

Supercomputers
• A commercial computer with vector instructions and pipelined floating-point arithmetic operations is
referred to as a supercomputer.
o To speed up the operation, the components are packed tightly together to minimize the
distance that the electronic signals have to travel.

• The instruction set contains the standard scalar instructions; this is augmented by instructions that process vectors and combinations of scalars and vectors.

• A supercomputer is a computer system best known for its high computational speed, fast and large
memory systems, and the extensive use of parallel processing.
o It is equipped with multiple functional units and each unit has its own pipeline
Configuration.

• It is specifically optimized for the type of numerical calculations involving vectors and matrices of
floating-point numbers.
• They are limited in their use to a number of scientific applications, such as numerical weather
forecasting, seismic wave analysis, and space research.
• A measure used to evaluate computers in their ability to perform a given number of floating-
point operations per second is referred to as flops.
• A typical supercomputer has a basic cycle time of 4 to 20 ns.
• The examples of supercomputer:
o Cray-1: it uses vector processing with 12 distinct functional units in parallel; a large
number of registers (over 150); multiprocessor configuration (Cray X- MP and Cray
Y-MP)
o Fujitsu VP-200: 83 vector instructions and 195 scalar instructions; 300 megaflops
5.6 Array Processors: Introduction

• An array processor is a processor that performs computations on large arrays of data.


• The term is used to refer to two different types of processors.
o Attached array processor:
▪ Is an auxiliary processor.
▪ It is intended to improve the performance of the host computer in specific
numerical computation tasks.
o SIMD array processor:
▪ Has a single- instruction multiple-data organization.
▪ It manipulates vector instructions by means of multiple functional units
responding to a common instruction.

Attached Array Processor

• Its purpose is to enhance the performance of the computer by providing vector processing for complex
scientific applications.
o Parallel processing with multiple functional units
• Fig. 9-14 shows the interconnection of an attached array processor to a host computer.

• The host computer is a general-purpose commercial computer and the attached processor is a
back-end machine driven by the host computer. The array processor is connected through an
input-output controller to the computer and the computer treats it like an external interface.
• The data for the attached processor are transferred from main memory to a local memory through a
high-speed bus. The general-purpose computer without the attached processor serves the users that need
conventional data processing. The system with the attached processor satisfies the needs for complex
arithmetic applications.

For example, when attached to a VAX 11 computer, the FPS-164/MAX from Floating-Point Systems increases the computing power of the VAX to 100 megaflops.
• The objective of the attached array processor is to provide vector manipulation capabilities to a conventional computer at a fraction of the cost of a supercomputer.

SIMD Array Processor


• A SIMD array processor is a computer with multiple processing units operating in parallel.
• A general block diagram of an array processor is shown in Fig. 9-15.
o It contains a set of identical processing elements (PEs), each having a local memory M.
o Each PE includes an ALU, a floating-point arithmetic unit, and working
registers.
o Vector instructions are broadcast to all PEs simultaneously.
• Masking schemes are used to control the status of each PE during the execution of vector
instructions.
o Each PE has a flag that is set when the PE is active and reset when the PE is inactive.

• For example, the ILLIAC IV computer was developed at the University of Illinois and manufactured by the Burroughs Corp.
o SIMD array processors are highly specialized computers.
o They are suited primarily for numerical problems that can be expressed in vector or matrix form.
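A toy sketch (my own) of broadcast with masking: the control unit issues one vector ADD to all PEs, and each PE's flag decides whether it participates:

    # SIMD broadcast with masking: every processing element (PE) receives the
    # same instruction, but only PEs whose mask flag is set execute it.
    local_a = [1, 2, 3, 4, 5, 6, 7, 8]      # one operand per PE's local memory
    local_b = [8, 7, 6, 5, 4, 3, 2, 1]
    mask    = [1, 1, 0, 1, 0, 1, 1, 0]      # 1 = PE active, 0 = PE idle

    def broadcast_add(a, b, mask):
        # The "control unit" issues one ADD; masked-off PEs keep their old value.
        return [ai + bi if m else ai for ai, bi, m in zip(a, b, mask)]

    print(broadcast_add(local_a, local_b, mask))
    # -> [9, 9, 3, 9, 5, 9, 9, 8]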
