CA Part 3
VECTOR PROCESSOR
Chapter at a Glance
Vector Processor Operations
Vector processors are specialized, heavily pipelined processors that perform efficient operations on entire vectors and matrices at once. This class of processor is suited for applications that can benefit from a high degree of parallelism. Register-register vector processors require all operations to use registers as source and destination operands. Memory-memory vector processors allow operands from memory to be routed directly to the arithmetic unit. A vector is a fixed-length, one-dimensional array of values, or an ordered series of scalar quantities. Various arithmetic operations are defined over vectors, including addition, subtraction, and multiplication.
Vector Instructions: There are three different types of vector instructions. They are:
1. Vector-vector instruction: From different vector registers, one or more vector operands enter a functional pipeline unit and the result is sent to another vector register. This type of vector operation is called a vector-vector instruction, as shown in the figure below, where Va, Vb, Vc are different vector registers. It can be defined by the following two mapping functions f1 and f2.
f1: Va → Vb and f2: Vb × Vc → Va
2. Vector-scalar instruction: In vector-scalar instructions the input operands of the functional unit enter from both a scalar register and a vector register, and produce a vector output, as shown in the figure below, where Va, Vb are different vector registers and S is a scalar register. It can also be defined by the following mapping function f3.
f3: S × Va → Vb
3. Vector-memory instruction: A vector-memory instruction can be defined by the vector load or vector store operations between a vector register and memory. It can also be defined by the following two mapping functions f4 and f5.
f4: M → V [vector load] and f5: V → M [vector store]
Pipeline Chaining
Pipeline chaining is a linking process that occurs when results obtained from one pipeline unit are directly fed into the operand registers of another functional pipe. In other words, intermediate results do not have to be restored into memory and can be used even before the vector operation is completed. Chaining permits successive operations to be issued as soon as the first result becomes available as an operand. The desired functional pipes and operand registers must be properly reserved; otherwise, chaining operations have to be suspended until the demanded resources become available.
CA-84
COMPUTER ARCHITECTURE
Superscalar, Superpipeline and Superscalar-Superpipelined Architecture
Superscalar and super-pipelined processors utilize parallelism to achieve peak performance
that can be several times higher than that of conventional scalar
processors. In order for this
potential to be translated into the speedup of real programs, the compiler must be able to
schedule instructions so that the parallel hardware is effectively utilized.
In superscalar processing, multiple functional units are kept busy by multiple instructions.
Current technology has enabled, and at the same time created the need to issue instructions in
parallel. As execution pipelines have approached the limits of speed, parallel execution has
been required to improve performance.
VLIW Architecture
Very Long Instruction Word (VLIW) refers to a CPU architecture designed to take advantage of
instruction level parallelism. A processor that executes every instruction one after the other i.e. a
non-pipelined scalar architecture may use processor resources inefficiently, potentially leading to
poor performance. The performance can be improved by executing different sub-steps of
sequential instructions simultaneously, or even executing multiple instructions entirely
simultaneously, as in superscalar architectures. The VLIW approach executes operations in parallel based on a fixed schedule determined when programs are compiled, since determining the order of execution of operations is handled by the compiler.
9. In which type of processor is array processing possible? [WBUT 2019]
a) SIMD b) MIMD c) MISD d) SISD
Answer: (a)
elements are all adjacent, then fetching the vector from a set of heavily interleaved
memory banks works very well. The high latency of initiating a main memory access
versus accessing a cache is amortized, because a single access is initiated for the entire
vector rather than to a single word. Thus, the cost of the latency to main memory is seen
only once for the entire vector, rather than once for each word of the vector. In this way
we can speed up memory access in case of vector processing.
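The amortization argument above can be made concrete. The sketch below computes the per-word cost of a vector fetch when a single access of latency L starts the stream and interleaved banks then deliver one word per clock; the latency figure in the usage note is purely illustrative.

```c
/* Sketch: per-word cost of fetching an n-element vector when one
   main-memory access of latency `latency` clocks starts the whole
   stream and the interleaved banks then deliver one word per clock.
   The startup latency is paid once and amortized over n words. */
double cost_per_word(int n, int latency) {
    return (latency + n) / (double)n;
}
```

With an illustrative latency of 50 clocks, a single-word access costs 51 clocks, while each word of a 64-element vector fetch costs under 2 clocks.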
2. Discuss vector instruction format. [WBUT 2006]
OR,
Define the various types of vector instructions. [WBUT 2010, 2014]
OR,
Describe different types of vector instructions. [WBUT 2018]
Answer:
There are three different types of vector instructions, classified on the basis of their mathematical mapping, as given below.
Vector-vector instruction: From different vector registers, one or more vector operands enter a functional pipeline unit and the result is sent to another vector register. This type of vector operation is called a vector-vector instruction, as shown in the figure below, where Va, Vb, Vc are different vector registers. It can be defined by the following two mapping functions f1 and f2.
f1: Va → Vb and f2: Vb × Vc → Va
Fig: Vector-vector instruction (registers Va, Vb, Vc feeding a functional pipeline)
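The two mappings can be made concrete with element-wise operations. The sketch below realizes f1 as negation and f2 as addition; the register names and the register length N are illustrative, not from any real ISA.

```c
/* Sketch of the two vector-vector mappings with concrete element-wise
   operations: f1 as negation (Va -> Vb) and f2 as addition
   (Vb x Vc -> Va). Register names and length N are illustrative. */
enum { N = 8 };  /* vector register length (illustrative) */

/* f1: Va -> Vb (unary mapping, here element-wise negation) */
void f1_negate(const int Va[N], int Vb[N]) {
    for (int i = 0; i < N; i++) Vb[i] = -Va[i];
}

/* f2: Vb x Vc -> Va (binary mapping, here element-wise addition) */
void f2_add(const int Vb[N], const int Vc[N], int Va[N]) {
    for (int i = 0; i < N; i++) Va[i] = Vb[i] + Vc[i];
}
```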
CA-87
POPULAR PUBLICATIONS
Fig: Vector-memory instructions (memory, vector load and vector store paths)
3. Compare superscalar, superpipeline and superscalar-superpipelined architecture.
[WBUT 2007]
Answer:
Superscalar and super-pipelined processors utilize parallelism to achieve peak performance that can be several times higher than that of conventional scalar processors. Superscalar machines can issue several instructions per cycle. A system was developed and used to measure instruction-level parallelism for a series of benchmarks, and an average degree of superpipelining metric was introduced. The simulations suggest that this metric is already high for many machines: these machines already exploit all of the instruction-level parallelism available in many non-numeric applications, even without parallel instruction issue or higher degrees of pipelining.
Superscalar
In superscalar processing, multiple functional units are kept busy by multiple instructions. As execution pipelines have approached the limits of speed, parallel execution has been required to improve performance. Super-pipelined machines can issue only one instruction per cycle, but they have cycle times shorter than the latency of any functional unit. In some cases superscalar machines employ a single fetch-decode-dispatch pipe that drives all of the units. For example, the UltraSPARC still splits execution
after the third stage of a unified pipeline. However, it is becoming more common to have multiple fetch-decode-dispatch pipes feeding the functional units. Superscalar operation is limited by the number of independent operations that can be extracted from an instruction stream.
Super-pipeline
Given a pipeline stage time T, it may be possible to execute at a higher rate by starting operations at intervals of T/n. This can be accomplished in two ways:
1. Further divide each of the pipeline stages into n substages.
2. Provide n pipelines that are overlapped.
The first approach requires faster logic and the ability to subdivide the stages into
segments with uniform latency. The second approach could be viewed in a sense as
staggered superscalar operation, and has associated with it all of the same requirements
except that instructions and data can be fetched with a slight offset in time.
Super-pipelining is limited by the speed of logic and the frequency of unpredictable branches. Stage time cannot productively grow shorter than the inter-stage latch time, and so this is a limit for the number of stages. The MIPS R4000 is sometimes called a super-pipelined machine. The benefit of such extensive pipelining is really only gained for very regular applications such as graphics.
Superscalar-Super-pipeline
We may also combine superscalar operation with super-pipelining, and the result is potentially the product of the speedup factors. However, it is even more difficult to interlock between parallel pipes that are divided into many stages. Also, the memory subsystem must be able to sustain a level of instruction throughput corresponding to the total throughput of the multiple pipelines -- stretching the processor/memory performance gap even more. Of course, with so many pipes and so many stages, branch penalties become huge, and branch prediction becomes a serious bottleneck.
But the real problem may be in finding the parallelism required to keep all of the pipes
and stages busy between branches. Consider that a machine with 12 pipelines of 20
stages must always have access to a window of 240 instructions that are scheduled so as
to avoid all hazards, and that the average of 40 branches that would be present in a block
of that size are all correctly predicted sufficiently in advance to avoid stalling in the
prefetch unit.
The VLIW approach executes operations in parallel based on a fixed schedule determined when programs are compiled, since determining the order of execution of operations is handled by the compiler.
Discuss strip mining and vector stride in vector processors.
[WBUT 2008, 2012]
Answer:
Vector lengths do not often correspond to the length of the vector registers. For shorter vectors, we can use a vector-length register applied to each vector operation. If a vector to be processed has a length greater than that of the vector registers, then strip-mining is used, whereby the original vector is divided into equal-size segments (equal to the size of the vector registers) and these segments are processed in sequence. The process of strip-mining is usually performed by the compiler, but in some architectures it could be done by the hardware. The strip-mined loop consists of a sequence of convoys.
The vector elements are ordered to have a fixed addressing increment between successive elements, called the stride or skew distance; i.e., it is the distance separating elements in memory that will be adjacent in a vector register. The value of the stride can be different for different variables. When a vector is loaded into a vector register with all its elements adjacent, the stride is 1. Non-unit strides can cause major problems for the memory system, which is built around unit stride (i.e., all the elements one after another in different interleaved memory banks). Caches deal with unit stride and behave badly for non-unit stride. To account for non-unit stride, most systems have a stride register that the memory system can use for loading elements of a vector register. However, the memory interleaving may not support rapid loading. The vector-stride technique is used when the elements of vectors are not adjacent.
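A strided load can be pictured as gathering every stride-th word into a contiguous register. In the sketch below, the function name and the use of a plain array as the "vector register" are illustrative; accessing a column of a row-major m × n matrix is the classic stride-n case.

```c
/* Sketch: a strided load gathering non-adjacent memory elements into a
   contiguous "vector register" v. Reading column j of a row-major
   m x n matrix is a stride-n access pattern. */
void strided_load(const double *base, long stride, int count, double v[]) {
    for (int i = 0; i < count; i++)
        v[i] = base[i * stride];   /* elements `stride` words apart */
}
```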
6. What is a vector processor? Give the block diagram to indicate the architecture of a typical vector processor with multiple function pipes.
[WBUT 2008, 2010 - short note]
Answer:
Vector processors are specialized, heavily pipelined processors that perform efficient
operations on entire vectors and matrices at once. This class of processor is suited for
applications that can benefit from a high degree of parallelism. Register-register vector
processors require all operations to use registers as source and destination operands.
Memory-memory vector processors allow operands from memory to be routed directly to
the arithmetic unit. A vector is a fixed-length, one-dimensional array of values, or an
ordered series of scalar quantities. Various arithmetic operations are defined over vectors,
including addition, subtraction, and multiplication.
A vector processor includes a set of vector registers for storing data to be used in the execution of instructions and a vector functional unit coupled to the vector registers for executing instructions. The functional unit executes instructions using operation codes provided to it; these operation codes include a field referencing a special register. The special register contains information about the length and starting point for each vector
instruction. A series of new instructions to enable rapid handling of image pixel data is provided.

Fig: A typical vector processor with multiple function pipes -- a scalar processor with scalar pipes 1..p driven by an instruction processing unit (IPU), high-speed main memory, a vector instruction controller and vector access controller, and vector registers feeding vector pipes 1..m
Answer:
A vector processor is a processor that can operate on entire vectors with one instruction, i.e., the operands of some instructions specify complete vectors. For example, consider the following add instruction:
C = A + B
In both scalar and vector machines this means "add the contents of A to the contents of B and put the sum in C." In a scalar machine the operands are numbers, but in vector processors the operands are vectors, and the instruction directs the machine to compute the pair-wise sum of each pair of vector elements. A processor register, usually called the vector-length register, tells the processor how many individual additions to perform when
it adds the vectors. A key division of vector processors arises from the way the
instructions access their operands. In the memory to memory organization the operands
are fetched from memory and routed directly to the functional unit. Results are streamed
back out to memory as the operation proceeds. In the register to register organization
operands are first loaded into a set of vector registers, each of which can hold a segment
of a register, for example 64 elements. The vector operation then proceeds by fetching the
operands from the vector registers and returning the results to a vector register.
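The role of the vector-length register in the C = A + B example can be sketched as a loop whose trip count is the register's value; everything a single vector instruction would do in hardware is spelled out here in scalar code, and the names are illustrative.

```c
/* Sketch: semantics of C = A + B on a vector machine. The
   vector-length register (`vl` here) tells the hardware how many
   pair-wise additions one instruction performs; elements beyond vl
   are untouched. */
void vector_add(const double *A, const double *B, double *C, int vl) {
    for (int i = 0; i < vl; i++)   /* one instruction: vl pair-wise sums */
        C[i] = A[i] + B[i];
}
```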
10. What is vector chaining? How can it speed up the processing? Explain with a suitable example. [WBUT 2018]
Answer:
In computing, chaining is a technique used in computer architecture in which scalar and vector registers generate interim results which can be used immediately, without additional memory references that would reduce computational speed. That is, in chaining, data is forwarded from one vector functional unit to another unit without waiting for the previous instruction to complete.
Without chaining, the system must wait for the last element of the result to be written before starting a dependent instruction, as shown in Fig. 1(a); it takes three units of time for the 'Load', 'Mul' and 'Add' instructions. With the chaining technique, the system can start a dependent instruction as soon as the first result appears, as shown in Fig. 1(b). Pipelined vector processors have an optimization called chaining: when a vector load operation executes whose result is used by a vector multiply, and that result by a vector add, the machine does not hold the vector multiply until the vector load is finished, but passes the first value of the vector directly from the load unit to the multiply unit, and the first value of that result directly to the add unit.
Fig. 1(a): Load, Mul and Add executed one after another (without chaining)
Fig. 1(b): Load, Mul and Add overlapped in time (with chaining)
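The speedup from chaining can be estimated with a simple timing model: each unit produces one result per clock after a fixed startup latency, and with chaining a dependent unit starts as soon as the first result is forwarded. The model and all latency values below are illustrative.

```c
/* Sketch: clock counts for a dependent Load -> Mul -> Add sequence
   over an n-element vector, each of `units` functional units having
   startup latency `startup` clocks and then one result per clock. */

/* Without chaining: each instruction waits for the previous to finish. */
int time_without_chaining(int n, int startup, int units) {
    return units * (startup + n);
}

/* With chaining: each unit starts after the first result is forwarded,
   so only the startup latencies serialize; results then stream through. */
int time_with_chaining(int n, int startup, int units) {
    return units * startup + n;
}
```

For an illustrative 64-element vector, a 6-clock startup and three chained units, chaining cuts the time from 3 × (6 + 64) = 210 clocks to 3 × 6 + 64 = 82 clocks.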
In vector-register operations, all vector operations except load and store are among the vector registers. All major vector computers use vector-register architecture, including the Cray Research processors (Cray-1, Cray-2). In memory-memory vector operations, all vector operations are memory to memory. The first vector computers were of this type, as were CDC's vector computers.
Gather and scatter are used to process sparse matrices/vectors. The gather operation uses a base address and a set of indices to access a few of the elements of a large vector from memory into one of the vector registers. The scatter operation does the opposite. The masking operations allow conditional execution of an instruction based on a "masking" register.
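Gather and scatter can be sketched as index-driven copies between memory and a contiguous "vector register"; the function names and the use of plain arrays are illustrative.

```c
/* Sketch: gather and scatter using a base address and an index
   vector, as used for sparse vectors. */

/* Gather: pull selected elements of a large vector into a register. */
void gather(const double *base, const int *idx, int n, double v[]) {
    for (int i = 0; i < n; i++) v[i] = base[idx[i]];   /* M[idx[i]] -> v[i] */
}

/* Scatter: write register elements back to the indexed locations. */
void scatter(const double *v, const int *idx, int n, double *base) {
    for (int i = 0; i < n; i++) base[idx[i]] = v[i];   /* v[i] -> M[idx[i]] */
}
```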
2. a) What are strip mining and vector stride with respect to vector processors?
b) Both vector processors and array processors are specialized to operate on
vectors. What are the main differences between them? [WBUT 2005, 2010]
OR,
Differentiate between vector processor and array processor with example.
[WBUT 2018]
Answer:
a) We now discuss the address positions in memory of adjacent elements in a vector; these addresses may not be sequential. For vector processors without caches, we need a technique to fetch elements of a vector that are not adjacent in memory. A vector instruction is said to have stride i if the distance between its two successive data references is i words or i double words. This distance separating elements that are to be gathered into a single register is called the stride. Once a vector is loaded into a vector register it acts as if it had logically adjacent elements.
The vector stride, like the vector starting address, can be put in a general-purpose register. Then an instruction can be used to fetch the vector into a vector register. In some vector processors the loads and stores always have a stride value stored in a register, so that only a single load and a single store instruction are required. Complications in the memory system can occur from supporting strides greater than one. When multiple
accesses contend for a bank, a memory bank conflict occurs and one access must be stalled. A bank conflict, and hence a stall, will occur if

    Number of banks / LCM(Stride, Number of banks) < Bank busy time
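One way to check for a conflict is via the interval between successive references to the same bank: with one access per clock, a stride-s stream returns to a given bank every banks/gcd(stride, banks) clocks, and a stall occurs when that interval is shorter than the bank busy time. A minimal sketch, with all parameter values illustrative:

```c
/* Greatest common divisor (Euclid's algorithm). */
static int gcd(int a, int b) {
    while (b) { int t = a % b; a = b; b = t; }
    return a;
}

/* Sketch: returns 1 if a stride-`stride` access stream over `banks`
   interleaved banks stalls. The same bank is revisited every
   banks/gcd(stride, banks) clocks (one access per clock), so a
   conflict occurs when that interval is below the bank busy time. */
int bank_conflict(int banks, int stride, int busy_time) {
    int revisit = banks / gcd(stride, banks);
    return revisit < busy_time;
}
```

For example, with 8 banks and a 6-clock busy time, stride 1 revisits a bank only every 8 clocks (no stall), while stride 32 hits the same bank on every access (stall).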
When a vector has a length greater than that of the vector registers, segmentation of the long vector into fixed-length segments is necessary. This technique is called strip-mining. One vector segment is processed at a time. As an example, the vector segment length is 64 elements in Cray computers. Until all the vector elements in each segment are processed, the vector register cannot be assigned to another vector operation. Strip-mining is restricted by the number of available vector registers and vector chaining.
b) The SIMD-1 array processor consists of a memory, an array control unit (ACU) and a one-dimensional SIMD array of simple processing elements (PEs). The figures show a 4-processor array and the initial image seen when the model is loaded.
Fig: SIMD-1 array processor -- memory, array control unit (program counter, condition code, ACU-IR and PE-IR registers) and the SIMD PE array
[Fig: Degree of parallelism (vectorization)]
The key difference is that a vector processor passes a vector to a functional unit, whereas an array processor passes each element of a vector to a different arithmetic unit.
Vector-memory instruction: A vector-memory instruction can be defined by the vector load or vector store operations between a vector register and memory. It can also be defined by the following two mapping functions f1 and f2.
f1: M → V [vector load] and f2: V → M [vector store]
Fig: Vector-register architecture -- main memory feeding a vector load-store unit, vector registers and scalar registers, with pipelined functional units for FP add/subtract, FP multiply, FP divide, integer, and logical operations
In the above figure, there are eight 64-element vector registers, and all the functional units are vector functional units.
Vector functional units: Each unit is fully pipelined and can start a new operation on every clock cycle. A control unit is needed to detect hazards. In the above figure, there are five functional units.
Vector load-store unit: This is a vector memory unit that loads or stores a vector to or from memory. Here, vector loads and stores are fully pipelined, so that words can be moved between the vector registers and memory with a bandwidth of one word per clock cycle, after an initial latency.
Set of scalar registers: Scalar registers can also provide data as input to the vector functional units, as well as compute addresses to pass to the vector load-store unit.
Vector Execution Time: The execution time of a sequence of vector operations primarily
depends on three factors:
The length of the operand vectors,
Structural hazards among the operations and
Data dependences.
We can compute the time for a single vector instruction from the vector length and the initiation rate, that is, the rate at which a vector unit consumes new operands and produces new results. All modern supercomputers have vector functional units with multiple parallel pipelines that can produce two or more results per clock cycle.
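The length/initiation-rate relation can be written as a one-line timing model; the startup latency and rate values used below are illustrative.

```c
/* Sketch: time (in clocks) for one vector instruction of length n,
   given a fixed startup latency and an initiation rate of
   `results_per_clock` (e.g. 2 for a unit with two parallel pipelines). */
double vector_instruction_time(int n, int startup, int results_per_clock) {
    return startup + (double)n / results_per_clock;
}
```

For example, a 64-element operation with a 12-clock startup on a two-pipeline unit takes 12 + 64/2 = 44 clocks.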
e) Vector Stride: Refer to Question No. 5 of Short Answer Type Questions.