Lecture 8 Unit 4 Pipeline and Vector Processing 2019
PIPELINING AND VECTOR PROCESSING
Unit – 4
2019
Dr. Bonomali Khuntia
Department of Computer Science
Berhampur University
PIPELINING AND VECTOR PROCESSING
Parallel Processing
Pipelining
Arithmetic Pipeline
Instruction Pipeline
RISC Pipeline
Vector Processing
Array Processors
Parallel Processing
Characteristics
- Standard von Neumann machine
- Instructions and data are stored in memory
- One operation at a time
Limitations
Von Neumann bottleneck
PARALLEL PROCESSING
Parallel processing denotes a class of techniques that perform
data-processing tasks simultaneously in order to increase the
computational speed of a computer system.
A parallel processing system is able to perform
concurrent data processing to achieve faster
execution time.
The system may have two or more ALUs and be able
to execute two or more instructions at the same time.
PARALLEL PROCESSING
A multifunctional organization is usually associated
with a complex control unit to coordinate all the
activities among the various components.
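[Figure: processor with multiple functional units. The processor registers connect to memory and to the following units, which operate in parallel: adder-subtractor, integer multiply, logic unit, shift unit, incrementer, floating-point add-subtract, floating-point multiply, floating-point divide.]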
PARALLEL COMPUTERS
Flynn's Classification
Based on the multiplicity of Instruction Streams and Data Streams
Instruction Stream
Sequence of Instructions read from memory
Data Stream
Operations performed on the data in the processor
Parallel processing may occur in the instruction stream, the data stream,
or both.

                                   Number of Data Streams
                                   Single      Multiple
Number of             Single       SISD        SIMD
Instruction Streams   Multiple     MISD        MIMD
SISD (single instruction stream, single data stream)
Characteristics:
One control unit, one processor unit, and one memory unit
Parallel processing may be achieved by means of:
multiple functional units
pipeline processing
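MISD (multiple instruction streams, single data stream)
[Figure: several control unit (CU) and processor (P) pairs, each with its own memory module (M) and its own instruction stream, operate on a single data stream from memory.]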
Characteristics
- There is no computer at present that can be classified as MISD
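SIMD (single instruction stream, multiple data streams)
[Figure: one control unit sends a single instruction stream to multiple processor units; the data streams pass through an alignment network.]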
Characteristics
Only one copy of the program exists.
A single controller executes one instruction at a time.
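MIMD (multiple instruction streams, multiple data streams)
[Figure: processors connected through an interconnection network to a shared memory.]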
Characteristics:
Multiple processing units (multiprocessor system)
Execution of multiple instructions on multiple data
Classification Summary
1. SISD: Instructions are executed sequentially. Parallel processing may
be achieved by means of multiple functional units or by pipeline
processing.
2. SIMD: Includes multiple processing units with a single control unit.
All processors receive the same instruction, but operate on different
data.
3. MISD: There is no computer at present that can be classified as MISD.
4. MIMD: A computer system capable of processing several programs at
the same time.
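
The four classes follow mechanically from the two stream counts. A minimal sketch, assuming a hypothetical helper name `flynn_class` that is not part of the lecture:

```python
def flynn_class(instruction_streams: int, data_streams: int) -> str:
    """Return Flynn's class for the given stream counts (illustrative helper)."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("stream counts must be at least 1")
    # 'S' for a single stream, 'M' for multiple streams
    i = "S" if instruction_streams == 1 else "M"
    d = "S" if data_streams == 1 else "M"
    return f"{i}I{d}D"


print(flynn_class(1, 1))   # one instruction stream, one data stream
print(flynn_class(1, 8))   # one instruction stream, eight data streams
```

The function only names the class; it says nothing about how a real machine in that class is organized.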
PIPELINING
A technique of decomposing a sequential process into suboperations, with
each suboperation being executed in its own dedicated segment that operates
concurrently with all other segments.
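Example: Ai * Bi + Ci for i = 1, 2, 3, ... , 7
Segment 1: R1 <- Ai, R2 <- Bi      (load Ai and Bi from memory)
Segment 2: R3 <- R1 * R2, R4 <- Ci (multiply, and load Ci from memory)
Segment 3: R5 <- R3 + R4           (add Ci to the product)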
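
The register transfers of this example can be traced clock by clock. The following is a rough sketch, not the lecture's own code; the function name `run_pipeline` and the use of `None` for an empty register are assumptions made for illustration.

```python
def run_pipeline(A, B, C):
    """Simulate the three-segment pipeline computing A[i]*B[i] + C[i]."""
    n = len(A)
    R1 = R2 = R3 = R4 = R5 = None      # None marks an empty register (assumption)
    results = []
    # n inputs need n + 2 clock pulses to pass through three segments
    for clock in range(n + 2):
        # Update back to front so each segment reads last cycle's values.
        # Segment 3: R5 <- R3 + R4
        R5 = R3 + R4 if R3 is not None else None
        # Segment 2: R3 <- R1 * R2, R4 <- Ci (R1 holds A[clock - 1] here)
        if R1 is not None:
            R3, R4 = R1 * R2, C[clock - 1]
        else:
            R3 = R4 = None
        # Segment 1: R1 <- Ai, R2 <- Bi
        if clock < n:
            R1, R2 = A[clock], B[clock]
        else:
            R1 = R2 = None
        if R5 is not None:
            results.append(R5)
    return results


print(run_pipeline([1, 2, 3], [4, 5, 6], [7, 8, 9]))
```

The first result appears on the third clock pulse, and one further result appears on every pulse after that.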
GENERAL PIPELINE
General Structure of a 4-Segment Pipeline
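Input -> S1 -> R1 -> S2 -> R2 -> S3 -> R3 -> S4 -> R4 (a common clock drives registers R1 to R4)

Space-time diagram for six tasks T1 to T6:

Clock cycle   1   2   3   4   5   6   7   8   9
Segment 1     T1  T2  T3  T4  T5  T6
Segment 2         T1  T2  T3  T4  T5  T6
Segment 3             T1  T2  T3  T4  T5  T6
Segment 4                 T1  T2  T3  T4  T5  T6

No matter how many segments there are, once the pipeline is full it takes
only one clock period to obtain an output.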
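
A space-time diagram like this can be generated for any number of segments k and tasks n. This is an illustrative sketch; `space_time` and `render` are hypothetical helper names, and 0 is used to mark an idle segment.

```python
def space_time(k: int, n: int):
    """rows[s][c] is the task number in segment s+1 at clock cycle c+1, or 0 if idle."""
    cycles = k + n - 1                  # total clock cycles to finish n tasks
    rows = []
    for seg in range(k):
        row = []
        for cycle in range(cycles):
            task = cycle - seg + 1      # task i is in segment s during cycle i + s - 1
            row.append(task if 1 <= task <= n else 0)
        rows.append(row)
    return rows


def render(rows):
    """Format the diagram as text, one line per segment."""
    lines = []
    for seg, row in enumerate(rows, start=1):
        cells = " ".join(f"T{t}".ljust(3) if t else "   " for t in row)
        lines.append(f"Segment {seg}: {cells}".rstrip())
    return "\n".join(lines)


print(render(space_time(4, 6)))
```

With k = 4 and n = 6 the diagram spans k + n - 1 = 9 clock cycles.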
PIPELINE SPEEDUP
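n: Number of tasks to be performed
k: Number of segments in the pipeline
tp: Clock cycle time of the pipeline
tn: Time to complete one task in the non-pipelined system
Sk: Speedup

$$S_k = \frac{n\,t_n}{(k + n - 1)\,t_p}, \qquad \lim_{n \to \infty} S_k = \frac{t_n}{t_p} \quad (= k \ \text{if}\ t_n = k\,t_p)$$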
Example: k = 4 segments, tp = 20 ns, n = 100 tasks
Pipelined System
(k + n - 1)*tp = (4 + 99) * 20 = 2060 ns
Non-Pipelined System
n*k*tp = 100 * 80 = 8000 ns
Speedup
Sk = 8000 / 2060 = 3.88
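
The arithmetic of this example can be checked with a few lines of code. A small sketch, where the function names are illustrative and tn defaults to k*tp as in the example:

```python
def pipeline_time(k, n, tp):
    """Time for n tasks on a k-segment pipeline with clock period tp."""
    return (k + n - 1) * tp


def nonpipeline_time(n, tn):
    """Time for n tasks when each takes tn without pipelining."""
    return n * tn


def speedup(k, n, tp, tn=None):
    """Sk = n*tn / ((k + n - 1)*tp); tn defaults to k*tp (assumption from the example)."""
    if tn is None:
        tn = k * tp
    return nonpipeline_time(n, tn) / pipeline_time(k, n, tp)


print(pipeline_time(4, 100, 20))         # 2060
print(nonpipeline_time(100, 80))         # 8000
print(round(speedup(4, 100, 20), 2))     # 3.88
```

As n grows, `speedup(4, n, 20)` approaches the number of segments, 4.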
In terms of throughput, a 4-stage pipeline is basically equivalent to a system with 4 identical functional units (P1, P2, P3, P4) operating in parallel.
Challenges in Pipeline
ARITHMETIC PIPELINE
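[Figure: pipeline for floating-point addition. The inputs are the exponents a, b and the mantissas A, B of the two operands; registers R separate the segments.]
Example result: Z = 0.10324 x 10^4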
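
The steps of a floating-point addition can be sketched as four suboperations, one per segment. The four-way split (compare exponents, align mantissas, add mantissas, normalize) is the usual textbook decomposition and is an assumption here, since the figure did not survive extraction. The operands 0.9504 x 10^3 and 0.8200 x 10^2 are also assumed, chosen because they produce the result Z shown on the slide.

```python
from decimal import Decimal


def fp_add(A, a, B, b):
    """Add A*10**a and B*10**b (decimal mantissas); return (mantissa, exponent)."""
    # Segment 1: compare the exponents and keep the larger one
    diff = a - b
    exp = max(a, b)
    # Segment 2: align the mantissa that has the smaller exponent (shift right)
    if diff > 0:
        B = B.scaleb(-diff)
    elif diff < 0:
        A = A.scaleb(diff)
    # Segment 3: add the mantissas
    Z = A + B
    # Segment 4: normalize so that 0.1 <= |Z| < 1
    while abs(Z) >= 1:
        Z = Z.scaleb(-1)
        exp += 1
    while Z != 0 and abs(Z) < Decimal("0.1"):
        Z = Z.scaleb(1)
        exp -= 1
    return Z, exp


# Assumed operands: X = 0.9504 x 10^3, Y = 0.8200 x 10^2
Z, z = fp_add(Decimal("0.9504"), 3, Decimal("0.8200"), 2)
print(Z, z)
```

The sketch runs the four steps one after another for a single pair of operands; in the pipeline, each step would work on a different pair during the same clock cycle.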
INSTRUCTION PIPELINE
Six Phases in an Instruction Cycle
[1] Fetch an instruction from memory
[2] Decode the instruction
[3] Calculate the effective address of the operand
[4] Fetch the operands from memory
[5] Execute the operation
[6] Store the result in the proper place
The pipeline may not perform at its maximum rate due to:
Different segments taking different times to operate
Some segment being skipped for certain operations
Memory access conflicts
INSTRUCTION PIPELINE
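FI = fetch instruction, DA = decode instruction and calculate effective address, FO = fetch operand, EX = execute.

Conventional
i     FI DA FO EX
i+1               FI DA FO EX
i+2                           FI DA FO EX
Pipelined
i     FI DA FO EX
i+1      FI DA FO EX
i+2         FI DA FO EX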
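Four-segment instruction pipeline (flowchart):
Segment 1: Fetch instruction from memory.
Segment 2: Decode instruction and calculate effective address. Branch? If yes, update PC and empty the pipe; if no, continue.
Segment 3: Fetch operand from memory.
Segment 4: Execute instruction. Interrupt? If yes, perform interrupt handling, update PC, and empty the pipe; if no, continue with the next instruction.

There are some difficulties that will prevent the instruction pipeline
from operating at its maximum rate.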
Timing of the instruction pipeline when instruction 3 is a branch:

Step:            1   2   3   4   5   6   7   8   9   10  11  12  13
Instruction 1    FI  DA  FO  EX
            2        FI  DA  FO  EX
   (Branch) 3            FI  DA  FO  EX
            4                FI  -   -   FI  DA  FO  EX
            5                                FI  DA  FO  EX
            6                                    FI  DA  FO  EX
            7                                        FI  DA  FO  EX
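
The timing above can be reproduced with a short model. This is a sketch under stated assumptions: the branch is taken, it is resolved only when it reaches EX, and the instruction fetched right behind it is fetched again afterwards. The function name `timing` is hypothetical.

```python
STAGES = ["FI", "DA", "FO", "EX"]


def timing(n, branch=None):
    """Return {instruction: {step: stage}} for n instructions.

    `branch` is the number of a taken branch instruction (assumption: it is
    resolved in the EX segment, so the next instruction is fetched twice).
    """
    table = {}
    next_fetch = 1
    branch_ex = None
    for i in range(1, n + 1):
        row = {}
        fetch = next_fetch
        if branch is not None and i == branch + 1:
            row[fetch] = "FI"            # fetched behind the branch, then discarded
            fetch = branch_ex + 1        # fetched again once the branch has executed
        for offset, stage in enumerate(STAGES):
            row[fetch + offset] = stage
        if i == branch:
            branch_ex = fetch + len(STAGES) - 1
        table[i] = row
        next_fetch = fetch + 1
    return table


t = timing(7, branch=3)
print(max(max(row) for row in t.values()))   # last step used
```

Without the branch, seven instructions finish in 4 + 7 - 1 = 10 steps; the model shows the branch costing three extra steps.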
Reasons for the pipeline to deviate from its normal operation are:
VECTOR PROCESSING
Vector Processor (computer): Ability to process vectors,
and related data structures such as matrices and multi-
dimensional arrays, much faster than conventional
computers.
Vector Processors may also be pipelined.
VECTOR OPERATIONS
DO 20 I = 1, 100
20 C(I) = B(I) + A(I)
Conventional computer
Initialize I = 1
20 Read A(I)
Read B(I)
Store C(I) = A(I) + B(I)
Increment I = I + 1
If I <= 100 goto 20
Vector computer
C(1:100) = A(1:100) + B(1:100)
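
The contrast between the two forms can be sketched in Python. This is only an analogy: a Python list expression is still executed element by element, so `add_vector` stands in for the single vector instruction rather than implementing one. Both function names are hypothetical.

```python
def add_scalar_loop(a, b):
    """Conventional form: explicit index, test, and branch for every element."""
    c = [0] * len(a)
    i = 0                      # Python lists are 0-based, unlike the Fortran loop
    while i < len(a):
        c[i] = a[i] + b[i]
        i += 1
    return c


def add_vector(a, b):
    """Vector form: one whole-array operation, with no visible loop control."""
    return [x + y for x, y in zip(a, b)]


A = list(range(1, 101))
B = list(range(101, 201))
print(add_scalar_loop(A, B) == add_vector(A, B))
```

The point of the vector form is that the fetch and decode of the loop-control instructions disappear; the element-wise additions themselves remain.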
Matrix Multiplication
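[Figure: pipeline for calculating an inner product. The source operands feed a multiplier pipeline, whose output feeds an adder pipeline.]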
All segment registers in the multiplier and adder are initialized to zero.
After the first four cycles, the products begin to be added to the output of
the adder.
During the next four cycles, 0 is added to the four products entering the
adder pipeline.
At the end of the eighth cycle, the first four products are in the four adder
segments.
At the beginning of the ninth cycle, the output of the adder is A1B1 and the
output of the multiplier is A5B5. The tenth cycle starts the addition A2B2 +
A6B6, and so on.
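When there are no more product terms to be added, the system
inserts four 0s into the multiplier pipeline.
The adder pipeline will then have one partial sum in each of its
four segments, grouped as follows. The four partial sums are
then added to form the final sum.

$$C = (A_1B_1 + A_5B_5 + A_9B_9 + \cdots) + (A_2B_2 + A_6B_6 + A_{10}B_{10} + \cdots) + (A_3B_3 + A_7B_7 + A_{11}B_{11} + \cdots) + (A_4B_4 + A_8B_8 + A_{12}B_{12} + \cdots)$$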
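
The cycle-by-cycle behaviour described above can be simulated with two four-segment queues. This is a minimal sketch under simplifying assumptions: each product is computed when it enters the multiplier pipeline rather than in stages, and the function name `inner_product_pipeline` is hypothetical.

```python
from collections import deque


def inner_product_pipeline(A, B, segments=4):
    """Return (inner product, partial sums left in the adder segments)."""
    mult = deque([0] * segments)    # multiplier pipeline; index -1 is its output
    add = deque([0] * segments)     # adder pipeline; index -1 is its output
    # After the last product, feed `segments` zeros to drain the multiplier.
    products = [a * b for a, b in zip(A, B)] + [0] * segments
    for p in products:              # one loop iteration per clock cycle
        m_out, a_out = mult.pop(), add.pop()
        mult.appendleft(p)
        add.appendleft(m_out + a_out)   # adder output is fed back to its input
    return sum(add), list(add)


total, partial = inner_product_pipeline(range(1, 9), range(1, 9))
print(total, partial)
```

For eight element pairs, the adder ends up holding the four partial sums A1B1 + A5B5, A2B2 + A6B6, A3B3 + A7B7, and A4B4 + A8B8, which are then added together.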
MEMORY INTERLEAVING
Pipeline and vector processors often require
simultaneous access to memory from two or more
sources.
An instruction pipeline may require the fetching of
an instruction and an operand at the same time from
two different segments.
An arithmetic pipeline usually requires two or more
operands to enter the pipeline at the same time.
Instead of using two memory buses for
simultaneous access, the memory can be partitioned
into a number of modules connected to common
memory address and data buses.
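[Figure: multiple-module memory organization. Each of the four memory modules has its own address register (AR) and data register (DR) and is connected to the common address and data buses.]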
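
One common way to spread consecutive addresses over the modules is to let the low-order part of the address select the module. The slides do not state which address bits are used, so the mapping below is an assumption, as are the module count of four and the helper names.

```python
def interleave(address: int, modules: int = 4):
    """Map a word address to (module number, address within the module).

    Assumption: low-order interleaving, so consecutive addresses fall in
    different modules.
    """
    return address % modules, address // modules


def conflict_free(addresses, modules: int = 4) -> bool:
    """True if every address falls in a different module (no access conflict)."""
    used = [interleave(a, modules)[0] for a in addresses]
    return len(set(used)) == len(used)


print(interleave(5))                    # module 1, word 1
print(conflict_free([8, 9, 10, 11]))    # consecutive words: different modules
print(conflict_free([0, 4]))            # both map to module 0
```

Under this mapping, the consecutive operands of a vector can be accessed from different modules at the same time.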
Array processor
An array processor is a processor that is employed to
compute on large arrays of data. The term is used to
refer to two different types of processors.
An attached array processor
SIMD Array Processor
Both are used to manipulate vectors, but they differ in
internal organization.
[Figure: attached array processor. Main memory is connected to the array processor's local memory by a high-speed memory-to-memory bus.]
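[Figure: SIMD array processor. A master control unit, connected to main memory, controls processing elements PE1, PE2, PE3, ..., PEn, each with its own local memory M1, M2, M3, ..., Mn.]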
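
The SIMD organization can be sketched as a master control unit that sends the same operation to every processing element, each working on its own local memory. The class and function names below are illustrative assumptions, and the loop in `broadcast` stands in for what the hardware does in lockstep.

```python
import operator


class ProcessingElement:
    """One PE together with its local memory M (modelled as a dict)."""

    def __init__(self, local_memory):
        self.M = dict(local_memory)

    def execute(self, op, dst, src1, src2):
        # Every PE runs the same operation on its own local data.
        self.M[dst] = op(self.M[src1], self.M[src2])


def broadcast(pes, op, dst, src1, src2):
    """Master control unit: issue one instruction to all PEs."""
    for pe in pes:
        pe.execute(op, dst, src1, src2)


# Four PEs, each holding one element of vectors a and b in local memory
pes = [ProcessingElement({"a": i, "b": 10 * i}) for i in range(1, 5)]
broadcast(pes, operator.add, "c", "a", "b")
print([pe.M["c"] for pe in pes])
```

One broadcast instruction produces all four sums, which is the single-instruction, multiple-data behaviour described for SIMD machines.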