
POPULAR PUBLICATIONS

VECTOR PROCESSOR

Chapter at a Glance
Vector processors are specialized, heavily pipelined processors that perform efficient operations on entire vectors and matrices at once. This class of processor is suited for applications that can benefit from a high degree of parallelism. Register-register vector processors require all operations to use registers as source and destination operands. Memory-memory vector processors allow operands from memory to be routed directly to the arithmetic unit. A vector is a fixed-length, one-dimensional array of values, or an ordered series of scalar quantities. Various arithmetic operations are defined over vectors, including addition, subtraction, and multiplication.

Vector Instructions: There are three different types of vector instructions. They are:
1. Vector-vector instruction: One or more vector operands enter a functional pipeline unit from different vector registers, and the result is sent to another vector register. This type of vector operation is called a vector-vector instruction, where Va, Vb, Vc are different vector registers. It can be defined by the following two mapping functions f1 and f2:
f1: Va → Vb and f2: Vb × Vc → Va
2. Vector-scalar instruction: In vector-scalar instructions the input operands of the functional unit enter from both a scalar register and a vector register and produce a vector output, where Va, Vb are different vector registers and S is a scalar register. It can be defined by the following mapping function f1:
f1: S × Va → Vb
3. Vector-memory instruction: A vector-memory instruction can be defined by vector load or vector store operations between a vector register and memory. It can be defined by the following two mapping functions f1 and f2:
f1: M → V [vector load] and f2: V → M [vector store]

Pipeline Chaining
Chaining is a linking process that occurs when results obtained from one pipeline unit are directly fed into the operand registers of another functional pipe. In other words, intermediate results do not have to be restored into memory and can be used even before the vector operation is completed. Chaining permits successive operations to be issued as soon as the first result becomes available as an operand. The desired functional pipes and operand registers must be properly reserved; otherwise, chaining operations have to be suspended until the demanded resources become available.

COMPUTER ARCHITECTURE
Superscalar, Superpipelined and Superscalar-Superpipelined Architecture
Superscalar and super-pipelined processors utilize parallelism to achieve peak performance
that can be several times higher than that of conventional scalar
processors. In order for this
potential to be translated into the speedup of real programs, the compiler must be able to
schedule instructions so that the parallel hardware is effectively utilized.
In superscalar processing, multiple functional units are kept busy by multiple instructions.
Current technology has enabled, and at the same time created the need to issue instructions in
parallel. As execution pipelines have approached the limits of speed, parallel execution has
been required to improve performance.
VLIW Architecture
Very Long Instruction Word (VLIW) refers to a CPU architecture designed to take advantage of instruction-level parallelism. A processor that executes every instruction one after the other, i.e. a non-pipelined scalar architecture, may use processor resources inefficiently, potentially leading to poor performance. The performance can be improved by executing different sub-steps of sequential instructions simultaneously, or even executing multiple instructions entirely simultaneously as in superscalar architectures. The VLIW approach executes operations in parallel based on a fixed schedule determined when programs are compiled, since the order of execution of operations is handled by the compiler.

Multiple Choice Type Questions


1. Which of the following types of instructions are useful in handling sparse vectors or sparse matrices often encountered in practical vector processing applications? [WBUT 2007]
a) Vector-scalar instruction b) Masking instruction
c) Vector-memory instruction d) None of these
Answer: (b)
2. The vector stride value is required [WBUT 2009, 2011, 2018]
a) to deal with the length of vectors b) to find the parallelism in vectors
c) to access the elements in multi-dimensional vectors
d) to execute vector instructions
Answer: (c)
3. Basic difference between Vector and Array processors is
[WBUT 2010, 2014]
a) pipelining b) interconnection network c) register d) none of these
Answer: (a)
4. Stride in Vector processor is used to [WBUT 2010, 2014]
a) differentiate different data types b) registers
c) differentiate different data d) none of these
Answer: (c)

5. Array processing is present in [WBUT 2013, 2017]
a) MIMD b) MISD c) SISD d) SIMD
Answer: (d)
6. The vector stride value is required [WBUT 2015]
a) to deal with the length of vectors b) to find the parallelism in vectors
c) to access the elements in multi-dimensional vectors d) none of these
Answer: (c)
7. The task of a vectorizing compiler is [WBUT 2015]
a) to find the length of vectors
b) to convert sequential scalar instructions into vector instructions
c) to process multi-dimensional vectors
d) to execute vector instructions
Answer: (b)
8. Array processors perform computations to exploit [WBUT 2015]
a) temporal parallelism b) spatial parallelism
c) sequential behavior of programs d) modularity of programs
Answer: (b)

9. In which type of processor is array processing possible? [WBUT 2019]
a) SIMD b) MIMD c) MISD d) SISD
Answer: (a)

Short Answer Type Questions


1. How do you speed up memory access in case of vector processing?
[WBUT 2005, 2007]
Answer:
Let r be the vector speed ratio and f be the vectorization ratio. For example, if the time it takes to add a vector of 64 integers using the scalar unit is 10 times the time it takes to do it using the vector unit, then r = 10. Moreover, if the total number of operations in a program is 100 and only 10 of these are scalar (after vectorization), then f = 0.9 (i.e. 90% of the work is done by the vector unit). The achievable speedup is:
Speedup = (Time without the vector unit) / (Time with the vector unit)
In general, the speedup is:
r / [(1 - f)r + f]
So even if the performance of the vector unit is extremely high (r → ∞), we get a speedup no greater than 1/(1 - f), which suggests that the ratio f is crucial to performance, since it places a limit on the attainable speedup. This ratio depends on the efficiency of the compilation.
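The formula can be checked numerically; a minimal sketch (the helper name is ours, not from the text):

```python
# Speedup from vectorization: speedup = r / ((1 - f) * r + f), where r is
# the vector/scalar speed ratio and f the vectorization ratio.

def vector_speedup(r: float, f: float) -> float:
    """Achievable speedup for vector speed ratio r and vectorization ratio f."""
    return r / ((1 - f) * r + f)

# The example from the text: r = 10, f = 0.9
print(vector_speedup(10, 0.9))    # ~5.26, well below r itself

# Even with an extremely fast vector unit, speedup is capped near 1/(1 - f):
print(vector_speedup(1e12, 0.9))  # approaches 1/(1 - 0.9) = 10
```

This makes the point about f concrete: a tenfold-faster vector unit yields barely a 5x program speedup when 10% of the work remains scalar.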
Vector instructions that access memory have a known access pattern. If the vector's elements are all adjacent, then fetching the vector from a set of heavily interleaved memory banks works very well. The high latency of initiating a main memory access versus accessing a cache is amortized, because a single access is initiated for the entire vector rather than for a single word. Thus, the cost of the latency to main memory is paid only once for the entire vector, rather than once for each word of the vector. In this way we can speed up memory access in vector processing.
2. Discuss vector instruction format. [WBUT 2006]
OR,
Define the various types of vector instructions. [WBUT 2010, 2014]
OR,
Describe different types of vector instructions. [WBUT 2018]
Answer:
There are three different types of vector instructions, classified on the basis of their mathematical mapping, as given below.

Vector-vector instruction: One or more vector operands enter a functional pipeline unit from different vector registers, and the result is sent to another vector register. This type of vector operation is called a vector-vector instruction, as shown in the figure below, where Va, Vb, Vc are different vector registers. It can be defined by the following two mapping functions f1 and f2:
f1: Va → Vb and f2: Vb × Vc → Va

[Figure: Vb, Vc and Va registers connected to a functional pipeline]
Fig: Vector-vector instruction

Vector-scalar instruction: In vector-scalar instructions the input operands of the functional unit enter from both a scalar register and a vector register and produce a vector output, as shown in the figure below, where Va, Vb are different vector registers and Sa is a scalar register. It can be defined by the following mapping function f1:
f1: Sa × Va → Vb


Fig: Vector-scalar instruction


Vector-memory instruction: A vector-memory instruction can be defined by vector load or vector store operations between a vector register and memory. It can be defined by the following two mapping functions f1 and f2:
f1: M → V [vector load] and f2: V → M [vector store]

[Figure: memory connected to a vector register by vector load and vector store paths]
Fig: Vector-memory instructions
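The three mappings can be illustrated with a small sketch (Python lists standing in for vector registers and memory; no real ISA is implied):

```python
# Toy model of the three vector instruction types. Lists play the role of
# vector registers; a flat list plays the role of memory.

def vector_vector(vb, vc):        # f2: Vb x Vc -> Va (e.g. elementwise add)
    return [b + c for b, c in zip(vb, vc)]

def vector_scalar(s, va):         # f1: S x Va -> Vb (e.g. scale by a scalar)
    return [s * a for a in va]

def vector_load(mem, base, n):    # f1: M -> V
    return mem[base:base + n]

def vector_store(mem, base, v):   # f2: V -> M
    mem[base:base + len(v)] = v

mem = [1, 2, 3, 4, 0, 0, 0, 0]
va = vector_load(mem, 0, 4)       # [1, 2, 3, 4]
vb = vector_scalar(10, va)        # [10, 20, 30, 40]
vc = vector_vector(va, vb)        # [11, 22, 33, 44]
vector_store(mem, 4, vc)
print(mem)                        # [1, 2, 3, 4, 11, 22, 33, 44]
```

Each call corresponds to one vector instruction operating on whole vectors at once, which is exactly what distinguishes these from scalar instructions.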
3. Compare superscalar, superpipeline and superscalar-superpipelined architectures. [WBUT 2007]
Answer:
Superscalar and super-pipelined processors utilize parallelism to achieve peak performance that can be several times higher than that of conventional scalar processors. Superscalar machines can issue several instructions per cycle. A system was developed and used to measure instruction-level parallelism for a series of benchmarks, and an average degree of superpipelining metric was introduced. Simulations suggest that this metric is already high for many machines. These machines already exploit all of the instruction-level parallelism available in many non-numeric applications, even without parallel instruction issue or higher degrees of pipelining.
Superscalar
In superscalar processing, multiple functional units are kept busy by multiple instructions. As execution pipelines have approached the limits of speed, parallel execution has been required to improve performance. Super-pipelined machines can issue only one instruction per cycle, but they have cycle times shorter than the latency of any functional unit. In some cases superscalar machines employ a single fetch-decode-dispatch pipe that drives all of the units. For example, the UltraSPARC splits execution

after the third stage of a unified pipeline. However, it is becoming more common to have multiple fetch-decode-dispatch pipes feeding the functional units. Superscalar operation is limited by the number of independent operations that can be extracted from an instruction stream.

Super-pipeline
Given a pipeline stage time T, it may be possible to execute at a higher rate by starting operations at intervals of T/n. This can be accomplished in two ways:
Further divide each of the pipeline stages into n substages.
Provide n pipelines that are overlapped.
The first approach requires faster logic and the ability to subdivide the stages into segments with uniform latency. The second approach could be viewed in a sense as staggered superscalar operation, and has associated with it all of the same requirements, except that instructions and data can be fetched with a slight offset in time.
Super-pipelining is limited by the speed of logic and the frequency of unpredictable branches. Stage time cannot productively grow shorter than the inter-stage latch time, and so this is a limit on the number of stages. The MIPS R4000 is sometimes called a super-pipelined machine. The benefit of such extensive pipelining is really only gained for very regular applications such as graphics.
Superscalar-Super-pipeline
We may also combine superscalar operation with super-pipelining, and the result is potentially the product of the speedup factors. However, it is even more difficult to interlock between parallel pipes that are divided into many stages. Also, the memory subsystem must be able to sustain a level of instruction throughput corresponding to the total throughput of the multiple pipelines -- stretching the processor/memory performance gap even more. Of course, with so many pipes and so many stages, branch penalties become huge, and branch prediction becomes a serious bottleneck.
But the real problem may be in finding the parallelism required to keep all of the pipes and stages busy between branches. Consider that a machine with 12 pipelines of 20 stages must always have access to a window of 240 instructions that are scheduled so as to avoid all hazards, and that the average of 40 branches that would be present in a block of that size must all be correctly predicted sufficiently in advance to avoid stalling the prefetch unit.

4. Compare superscalar, super-pipeline and VLIW techniques. [WBUT 2008, 2011, 2014, 2016]
Answer:
Superscalar
A superscalar processor executes more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to redundant functional units on the processor. A superscalar CPU architecture implements a form of parallelism called instruction-level parallelism within a single processor. In superscalar processing, multiple functional units are kept busy by multiple instructions. In some cases superscalar machines still employ a single fetch-decode-dispatch pipe that drives all of the units.
Superscalar operation is limited by the number of independent operations that can be extracted from an instruction stream. It has been shown in early studies on simpler processor models that this is limited, mostly by branches, to a small number. The superscalar technique is traditionally associated with several identifying characteristics:
Instructions are issued from a sequential instruction stream
CPU hardware dynamically checks for data dependencies between instructions at run time (versus software checking at compile time)
Multiple instructions are accepted per clock cycle
Super-pipeline
Given a pipeline stage time T, it may be possible to execute at a higher rate by starting
operations at intervals of T/n. This can be accomplished in two ways:
Further divide each of the pipeline stages into n substages.
Provide n pipelines that are overlapped.
The first approach requires faster logic and the ability to subdivide the stages into segments with uniform latency. The second approach could be viewed in a sense as staggered superscalar operation, and has associated with it all of the same requirements, except that instructions and data can be fetched with a slight offset in time.
Super-pipelining is limited by the speed of logic and the frequency of unpredictable branches. Stage time cannot productively grow shorter than the inter-stage latch time, and so this is a limit on the number of stages. The MIPS R4000 is sometimes called a super-pipelined machine. The benefit of such extensive pipelining is really only gained for very regular applications such as graphics. On more irregular applications, there is little performance advantage.
VLIW
Superscalar and VLIW architectures both exhibit instruction-level parallelism, but differ in their approach. Each allows faster CPU throughput than would otherwise be possible at the same clock rate. Each functional unit is not a separate CPU core but an execution resource within a single CPU, such as an arithmetic logic unit, a bit shifter, or a multiplier. Superscalar CPU design emphasizes improving the instruction dispatcher accuracy, and allowing it to keep the multiple functional units in use at all times.
Very Long Instruction Word (VLIW) refers to a CPU architecture designed to take advantage of instruction-level parallelism. A processor that executes every instruction one after the other, i.e. a non-pipelined scalar architecture, may use processor resources inefficiently, potentially leading to poor performance. The performance can be improved by executing different sub-steps of sequential instructions simultaneously, or even executing multiple instructions entirely simultaneously as in superscalar architectures.

The VLIW approach executes operations in parallel based on a fixed schedule determined when programs are compiled, since the order of execution of operations is handled by the compiler.
5. Discuss strip mining and vector stride in vector processors. [WBUT 2008, 2012]
Answer:
Vector lengths do not often correspond to the length of the vector registers. For shorter vectors, we can use a vector length register applied to each vector operation. If a vector to be processed has a length greater than that of the vector registers, then strip mining is used, whereby the original vector is divided into equal-size segments (equal to the size of the vector registers) and these segments are processed in sequence. The process of strip mining is usually performed by the compiler, but in some architectures it can be done by the hardware. The strip-mined loop consists of a sequence of convoys.
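A strip-mined loop can be sketched as follows (Python lists stand in for vector registers; the segment length of 64 follows the Cray example mentioned later in this chapter):

```python
# Sketch of strip mining: a vector longer than the register length VLEN
# is processed in VLEN-sized segments (the last segment may be shorter).

VLEN = 64  # assumed vector register length, e.g. the Cray segment size

def strip_mined_add(a, b):
    result = []
    for start in range(0, len(a), VLEN):
        # each iteration fills the "vector registers" with one segment
        seg_a = a[start:start + VLEN]
        seg_b = b[start:start + VLEN]
        result.extend(x + y for x, y in zip(seg_a, seg_b))
    return result

a = list(range(150))
b = list(range(150))
c = strip_mined_add(a, b)      # 150 elements handled as 64 + 64 + 22
print(c[:5], len(c))           # [0, 2, 4, 6, 8] 150
```

In a real compiler the loop above would be emitted as vector instructions plus a vector-length-register update per segment; the Python version only models the segmentation.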
The vector elements are ordered with a fixed addressing increment between successive elements, called the stride or skew distance, i.e. the distance separating elements in memory that will be adjacent in a vector register. The value of the stride can be different for different variables. When a vector is loaded into a vector register with stride 1, all the elements of the vector are adjacent in memory. Non-unit strides can cause major problems for the memory system, which is built around unit stride (i.e. all the elements are one after another in different interleaved memory banks). Caches deal with unit stride, and behave badly for non-unit stride. To account for non-unit stride, most systems have a stride register that the memory system can use for loading elements of a vector register. However, the memory interleaving may not support rapid loading. The vector stride technique is used when the elements of vectors are not adjacent.
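Strided access can be sketched concretely; here loading one column of a row-major matrix requires a stride equal to the row length (the helper name is ours):

```python
# Strided vector load from a flat, row-major memory image of a 4x3 matrix.

rows, cols = 4, 3
mem = [r * cols + c for r in range(rows) for c in range(cols)]  # 0..11

def vload(mem, base, stride, n):
    """Load n elements starting at base, 'stride' words apart."""
    return [mem[base + i * stride] for i in range(n)]

row1 = vload(mem, base=1 * cols, stride=1, n=cols)   # unit stride: a row
col1 = vload(mem, base=1, stride=cols, n=rows)       # stride = cols: a column
print(row1, col1)   # [3, 4, 5] [1, 4, 7, 10]
```

Once loaded, both `row1` and `col1` behave as vectors of logically adjacent elements, exactly as the text describes.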
6. What is a vector processor? Give the block diagram to indicate the architecture of a typical vector processor with multiple function pipes. [WBUT 2008, 2010-short note]
Answer:
Vector processors are specialized, heavily pipelined processors that perform efficient operations on entire vectors and matrices at once. This class of processor is suited for applications that can benefit from a high degree of parallelism. Register-register vector processors require all operations to use registers as source and destination operands. Memory-memory vector processors allow operands from memory to be routed directly to the arithmetic unit. A vector is a fixed-length, one-dimensional array of values, or an ordered series of scalar quantities. Various arithmetic operations are defined over vectors, including addition, subtraction, and multiplication.
A vector processor includes a set of vector registers for storing data to be used in the execution of instructions, and a vector functional unit coupled to the vector registers for executing instructions. The functional unit executes instructions using operation codes provided to it, which include a field referencing a special register. The special register contains information about the length and starting point for each vector

instruction. A series of new instructions to enable rapid handling of image pixel data are provided.

[Figure: an instruction processing unit (IPU) and high-speed main memory feed a scalar processor (scalar registers and scalar pipes 1..p) and, through a vector instruction controller and vector access controller, a vector processor (vector registers and vector pipes 1..m)]
Fig: The architecture of a vector processor with multiple function pipes


7. Explain the concept of strip mining used in vector processors. Why do vector processors use memory banks? [WBUT 2009]
Answer:
When a vector has a length greater than that of the vector registers, segmentation of the long vector into fixed-length segments is necessary. This technique is called strip mining. One vector segment is processed at a time. As an example, the vector segment length is 64 elements in Cray computers. Until all the vector elements in each segment are processed, the vector register cannot be assigned to another vector operation. Strip mining is restricted by the number of available vector registers and by vector chaining.
To allow faster access to vector elements stored in memory, the memory of a vector processor is often divided into memory banks. Interleaved memory banks associate successive memory addresses with successive banks cyclically. One memory access (load or store) of a data value in a memory bank takes several clock cycles to complete. Each memory bank allows only one data value to be read or stored in a single memory access, but more than one memory bank may be accessed at the same time. When the elements of a vector stored in an interleaved memory are read into a vector register, the reads are staggered across the memory banks so that one vector element is read from a bank per clock cycle. If one memory access takes n clock cycles, then n elements of a vector may be fetched at a cost of one memory access. This is n times faster than the same number of memory accesses to a single bank.
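The cyclic address-to-bank assignment can be sketched in a few lines (the bank count of 8 is an assumption for illustration):

```python
# Low-order interleaving: successive addresses map to successive banks
# cyclically, so a unit-stride vector read touches a new bank each cycle.

N_BANKS = 8  # assumed number of interleaved banks

def bank_of(addr):
    return addr % N_BANKS   # cyclic assignment of addresses to banks

# a 16-element unit-stride vector starting at address 0:
banks_touched = [bank_of(a) for a in range(16)]
print(banks_touched)        # [0, 1, ..., 7, 0, 1, ..., 7]
```

Because consecutive addresses land in different banks, a bank that needs several cycles per access has time to recover before the stream revisits it, which is the staggering effect described above.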
8. What is a vector array processor? Explain with example. [WBUT 2009]


Answer:
A vector processor is a processor that can operate on entire vectors with one instruction, i.e. the operands of some instructions specify complete vectors. For example, consider the following add instruction:
C = A + B
In both scalar and vector machines this means "add the contents of A to the contents of B and put the sum in C." In a scalar machine the operands are numbers, but in vector processors the operands are vectors, and the instruction directs the machine to compute the pair-wise sum of each pair of vector elements. A processor register, usually called the vector length register, tells the processor how many individual additions to perform when it adds the vectors. A key division of vector processors arises from the way the instructions access their operands. In the memory-to-memory organization the operands are fetched from memory and routed directly to the functional unit, and results are streamed back out to memory as the operation proceeds. In the register-to-register organization operands are first loaded into a set of vector registers, each of which can hold a segment of a vector, for example 64 elements. The vector operation then proceeds by fetching the operands from the vector registers and returning the results to a vector register.

9. Discuss different types of vector instruction. [WBUT 2011]


Answer:
There are three different types of vector instructions, classified on the basis of their mathematical mapping, as given below.
1. Vector-vector instruction: One or more vector operands enter a functional pipeline unit from different vector registers, and the result is sent to another vector register. This type of vector operation is called a vector-vector instruction, where Va, Vb, Vc are different vector registers. It can be defined by the following two mapping functions f1 and f2:
f1: Va → Vb and f2: Vb × Vc → Va
2. Vector-scalar instruction: In vector-scalar instructions the input operands of the functional unit enter from both a scalar register and a vector register and produce a vector output, where Va, Vb are different vector registers and S is a scalar register. It can be defined by the following mapping function f1:
f1: S × Va → Vb
3. Vector-memory instruction: A vector-memory instruction can be defined by vector load or vector store operations between a vector register and memory. It can be defined by the following two mapping functions f1 and f2:
f1: M → V [vector load] and f2: V → M [vector store]

10. What is vector chaining? How can it speed up the processing? Explain with suitable example. [WBUT 2018]


Answer:
In computing, chaining is a technique used in computer architecture in which scalar and vector registers generate interim results which can be used immediately, without the additional memory references that would reduce computational speed; i.e. in chaining, data is forwarded from one vector functional unit to another without waiting for the previous instruction to complete.
Without chaining, the system must wait for the last element of a result to be written before starting the dependent instruction, as shown in Fig. 1(a). It takes three units of time for the 'Load', 'Mul' and 'Add' instructions. But with the chaining technique, the system can start the dependent instruction as soon as the first result appears, as shown in Fig. 1(b). Pipelined vector processors have an optimization called chaining: when a vector load executes whose result is used by a vector multiply, and that result by a vector add, the machine does not hold the vector multiply until the vector load is finished, but passes the first value of the vector right from the load unit to the multiply unit, and the first value of that result directly to the add unit.
[Figure: Load, Mul and Add executing one after another on the time axis]
Fig. 1(a)

[Figure: Load, Mul and Add overlapped in time via chaining]
Fig. 1(b)
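The benefit can be made concrete with a toy timing model (the latency value and one-element-per-cycle rate are our assumptions, not from the text):

```python
# Toy timing model for Load -> Mul -> Add over an n-element vector.
# Assumed: each unit has a fixed startup latency, then produces one
# element per cycle.

def without_chaining(n, latency=4):
    # each unit waits for the previous one to finish the entire vector
    per_op = latency + (n - 1)      # startup + one element per cycle
    return 3 * per_op

def with_chaining(n, latency=4):
    # each unit starts as soon as the first result of its predecessor appears
    return 3 * latency + (n - 1)

n = 64
print(without_chaining(n), with_chaining(n))  # 201 vs 75 cycles
```

Under this model the three operations overlap almost completely, which is the speedup the figures illustrate.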

Long Answer Type Questions


1. What are the different types of vector operations? Give different fields in a vector instruction. [WBUT 2005, 2013]
Answer:
There are two primary types of vector operations:
Vector-register operations
Memory-memory vector operations
In vector-register operations, all vector operations except load and store are among the vector registers. All major vector computers use vector-register architecture, including the Cray Research processors (Cray-1, Cray-2). In memory-memory vector operations, all vector operations are memory to memory. The first vector computers were of this type, as were CDC's vector computers.

The vector instructions are of the following types:

Vector-vector instructions:
f1: Vi → Vj (e.g. MOVE Va, Vb)
f2: Vj × Vk → Vi (e.g. ADD Va, Vb, Vc)
Vector-scalar instructions:
f3: s × Vi → Vj (e.g. ADD R1, Va, Vb)
Vector-memory instructions:
f4: M → V (e.g. Vector Load)
f5: V → M (e.g. Vector Store)
Vector reduction instructions:
f6: V → s (e.g. ADD V, s)
f7: Vi × Vj → s (e.g. DOT Va, Vb, s)
Gather and scatter instructions:
f8: M × Va → Vb (e.g. gather)
f9: Va × Vb → M (e.g. scatter)
Masking instructions:
f10: Va × Vm → Vb (e.g. MMOVE V1, V2, V3)

Gather and scatter are used to process sparse matrices/vectors. The gather operation uses a base address and a set of indices to access from memory "few" of the elements of a large vector into one of the vector registers. The scatter operation does the opposite. The masking operations allow conditional execution of an instruction based on a "masking" register.
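Gather and scatter can be sketched directly from the f8/f9 mappings (Python lists stand in for memory and the index vector; no real ISA is implied):

```python
# Sketch of gather/scatter for sparse vectors: an index vector selects the
# few interesting positions of a long vector in memory.

def gather(mem, base, indices):            # f8: M x Va -> Vb
    return [mem[base + i] for i in indices]

def scatter(mem, base, indices, values):   # f9: Va x Vb -> M
    for i, v in zip(indices, values):
        mem[base + i] = v

mem = [0, 5, 0, 0, 7, 0, 0, 2]
nz = [1, 4, 7]                     # positions of the nonzero elements
packed = gather(mem, 0, nz)        # [5, 7, 2] packed into a vector register
scatter(mem, 0, nz, [x * 2 for x in packed])
print(mem)                         # [0, 10, 0, 0, 14, 0, 0, 4]
```

Only the three nonzero elements are loaded, operated on densely, and written back, which is why these instructions pay off on sparse data.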
2. a) What are strip mining and vector stride, in respect to vector processors?
b) Both vector processors and array processors are specialized to operate on vectors. What are the main differences between them? [WBUT 2005, 2010]
OR,
Differentiate between vector processor and array processor with example. [WBUT 2018]
Answer:
a) Adjacent elements of a vector may occupy different positions in memory, and these addresses may not be sequential. For vector processors without caches, we need a technique to fetch elements of a vector that are not adjacent in memory. A vector instruction is said to have stride i if the distance between its two successive data references is i words or i double words. This distance separating elements that are to be gathered into a single register is called the stride. Once a vector is loaded into a vector register, it acts as if it had logically adjacent elements.
The vector stride, like the vector starting address, can be put in a general-purpose register. Then an instruction can be used to fetch the vector into a vector register. In some vector processors the loads and stores always have a stride value stored in a register, so that only a single load and a single store instruction are required. Complications in the memory system can occur from supporting strides greater than one. When multiple

accesses contend for a bank, a memory bank conflict occurs and one access must be stalled. A bank conflict, and hence a stall, will occur if
Number of banks / LCM(Stride, Number of banks) < Bank busy time
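The condition can be checked numerically. The sketch below tests the interval between two accesses to the same bank, Number of banks / gcd(Stride, Number of banks), against the bank busy time; using this revisit-interval form is our choice for the sketch, as one common way of stating the idea:

```python
# A stall occurs when the stream revisits a bank before it has recovered:
# revisit interval = banks / gcd(stride, banks) accesses between hits on
# the same bank; compare it with the bank busy time (in cycles).
from math import gcd

def bank_conflict(n_banks, stride, busy_time):
    revisit_interval = n_banks // gcd(stride, n_banks)
    return revisit_interval < busy_time

print(bank_conflict(8, 1, 6))   # False: unit stride revisits a bank every 8 accesses
print(bank_conflict(8, 32, 6))  # True: stride 32 hits the same bank every access
```

With 8 banks and a busy time of 6 cycles, unit stride cycles through all 8 banks before returning, while a stride that is a multiple of the bank count hammers a single bank and stalls.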

When a vector has a length greater than that of the vector registers, segmentation of the long vector into fixed-length segments is necessary. This technique is called strip mining. One vector segment is processed at a time. As an example, the vector segment length is 64 elements in Cray computers. Until all the vector elements in each segment are processed, the vector register cannot be assigned to another vector operation. Strip mining is restricted by the number of available vector registers and by vector chaining.

b) The SIMD-1 Array Processor consists of a memory, an Array Control Unit (ACU) and a one-dimensional SIMD array of simple processing elements (PEs). The figure shows a 4-processor array and the initial image seen when the model is loaded.

[Figure: memory, the Array Control Unit (PC, CC, AC-IR, PE-IR and PEC registers) and the 4-element SIMD array]

The ACU is a simple load/store, register-register arithmetic processor. It has 16 general-purpose registers, a Program Counter (PC), a Condition Code register (CC) and an Instruction Register (AC-IR). The Program Counter has two fields: label and offset. The label field is initially set to "main" and the offset to zero. The ACU also uses two other registers, the Processing Element Instruction Register (PE-IR) and the Processing Element Control register (PEC), which are global registers used to communicate with the SIMD array. The Processing Elements operate in lock step, i.e. each active PE (determined by the state of its PEC bit) obeys the same instruction at the same time. Whenever a PE ACC is updated by a PE instruction, the PE sends the new ACC value to each of its neighbors.
When first loaded, the model contains a program which reverses the order of the values held in memory locations 0 and 2 of the Processing Elements and leaves the results in locations 1 and 3 of each of their memories.
A vector processor is also a CPU design that is able to run mathematical operations on multiple data elements simultaneously. This is in contrast to a scalar processor, which handles one element at a time. It is a computer with built-in instructions that perform

multiple calculations on vectors (one-dimensional arrays) simultaneously. It is used to solve the same or similar problems as an array processor; however, a vector processor passes a vector to a functional unit, whereas an array processor passes each element of a vector to a different arithmetic unit.
Vector processors are based on a single-instruction, multiple-data architecture that is distinctly different from SIMD extensions to scalar/superscalar processors. Each vector data path has some data independence from the others, allowing data-path-dependent operations. This allows easier control for wider machines. Single-chip vector processors can still be low power and easy to program even with eight parallel vector units. For many communications algorithms, characterized by high data parallelism, vector single-instruction machines end up being the ideal balance of instruction/programming simplicity and compactness, while still supporting complex processing requirements and high performance.
3. a) How do vector processors improve the speed of instruction execution over
scalar processors? Illustrate with an example.
b) What is vectorizing compiler? Why do we need it in a vector processor?
[WBUT 2015]
Answer:
a) Many performance optimization schemes are used in vector processors. Memory banks
are used to reduce load/store latency. Strip mining is used to generate code so that vector
operation is possible for vector operands whose size is less than or greater than the size of
vector registers. Vector chaining -the equivalent of forwarding in vector processors - is
used in case of data dependency among vector instructions. Special scatter and gather
instructions are provided to efficiently operate on sparse matrices.
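The strip-mining scheme mentioned above can be sketched in plain Python. This is an illustrative model only: the maximum vector length MVL = 64 and the function names are assumptions, not part of the original text. Each strip corresponds to what would be a single vector instruction on real hardware, so an n-element add needs only ceil(n / MVL) vector operations instead of n scalar ones.

```python
# Sketch of strip mining: an n-element vector add is split into
# chunks no longer than the (hypothetical) vector register length MVL.
MVL = 64  # assumed maximum vector length

def strip_mined_add(a, b):
    n = len(a)
    c = [0] * n
    vector_ops = 0
    for start in range(0, n, MVL):
        end = min(start + MVL, n)
        # one vector ADD over the strip a[start:end] + b[start:end]
        c[start:end] = [x + y for x, y in zip(a[start:end], b[start:end])]
        vector_ops += 1
    return c, vector_ops

c, ops = strip_mined_add(list(range(150)), list(range(150)))
print(ops)  # 150 elements -> 3 strips (64 + 64 + 22)
```

On a scalar processor the same 150-element add would issue 150 separate instructions; here it issues only 3 vector operations, which is the source of the speedup.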
b) An intelligent compiler must be developed to detect the concurrency among vector
instructions which can be realized with pipelining or with the chaining of pipelines. A
vectorizing compiler would regenerate parallelism lost in the use of sequential languages.
It is desirable to use high level programming languages with rich parallel constructs on
vector processors. The following four stages have been recognized in the development of
parallelism in advanced programming. The parameter in parentheses indicates the degree
of parallelism explorable at each stage:
Parallel algorithm (A)
High-level language (L)
Efficient object code (O)
Target machine code (M)
The degree of parallelism refers to the number of independent operations that can be
performed simultaneously. In the ideal situation with well-developed parallel user
languages, we should expect A > L > O > M, as in the figure below.

[Figure: degree of parallelism decreasing across parallel algorithm, parallel language, object code, machine code]
Fig: The ideal case of using parallel algorithm

At present any parallelism in an algorithm is lost when it is expressed in a sequential


high-level language. In order to promote parallel processing in machine hardware, an
intelligent compiler is needed to regenerate the parallelism through vectorization as
shown in figure below.
[Figure: the vectorizing compiler regenerates parallelism (vectorization) between the sequential language and object code stages]
Fig: The case of using vectorizing compiler and sequential language
The process to replace a block of sequential code by vector instructions is called
vectorization, and the system software which does this regeneration of parallelism is
called a vectorizing compiler.
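As a minimal illustration of vectorization, the sequential SAXPY loop below (a standard example, not taken from the original text) is rewritten as a single whole-vector operation, which is what a vectorizing compiler would map onto one vector instruction:

```python
# A vectorizing compiler recognizes that the iterations of the scalar
# loop below are independent and replaces the loop with a single
# vector operation (mimicked here by a whole-array expression).

def saxpy_scalar(alpha, x, y):
    # sequential form: one scalar multiply-add per iteration
    out = list(y)
    for i in range(len(x)):
        out[i] = alpha * x[i] + out[i]
    return out

def saxpy_vectorized(alpha, x, y):
    # "vectorized" form: the whole operation expressed at once,
    # as a vector instruction (e.g. V3 <- s * V1 + V2) would do
    return [alpha * xi + yi for xi, yi in zip(x, y)]

print(saxpy_vectorized(2, [1, 2, 3], [10, 20, 30]))  # [12, 24, 36]
```

Both forms compute the same result; the compiler's job is to prove the iterations independent so the second form is legal.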
4. Differentiate between Vectored and Non-vectored interrupts. [WBUT 2017]
Answer:
A vectored interrupt is where the CPU actually knows the address of the Interrupt Service
Routine in advance. All it needs is that the interrupting device sends its unique vector via
a data bus and through its I/O interface to the CPU. The CPU takes this vector, checks an
interrupt table in memory, and then carries out the correct ISR for that device. So the
vectored interrupt allows the CPU to know what ISR to carry out in software
(memory). A non-vectored interrupt is where the interrupting device never sends an
interrupt vector. An interrupt is received by the CPU, and it jumps the program counter to
a fixed address in hardware. So, the difference between vectored and non-vectored
interrupt is that in a vectored interrupt the new address is generated by the processor
automatically. For instance, if the 8085 microprocessor is interrupted through the RST 5.5
pin, the processor will multiply 5.5 by 8 and automatically jump to the resulting hex
address (002CH). While in a non-vectored interrupt it is necessary for the user to provide
the address of the subroutine using the CALL instruction.
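The 8085 example can be sketched as follows. The interrupt table contents and ISR strings are hypothetical illustrations; only the RST address rule (vector number × 8) comes from the text above.

```python
# Sketch of vectored dispatch: the device supplies a vector, and the
# CPU looks the ISR up in an interrupt table instead of jumping to a
# user-supplied CALL address.
def rst_address(rst_number):
    # 8085 RST convention: vector address = RST number * 8
    return int(rst_number * 8)

print(hex(rst_address(5.5)))  # RST 5.5 -> 0x2c (002CH)

# hypothetical interrupt table mapping vector addresses to ISRs
interrupt_table = {
    0x2C: lambda: "serviced RST 5.5 device",
    0x34: lambda: "serviced RST 6.5 device",
}

def handle_vectored_interrupt(vector):
    return interrupt_table[vector]()  # CPU jumps via the table
```

In the non-vectored case there is no table lookup: the CPU always jumps to one fixed address, and the programmer must place a CALL to the right subroutine there.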
6. Explain low-order interleaved memory and its advantages. [WBUT 2018]
Answer:
The idea of interleaving memory is that memory is divided into banks.
Each bank is to be considered as having the same addressable unit as the main memory.
In low-order interleaving, consecutive addresses in the memory will be found in different
memory banks.
Consider a 64-word memory that is 4-way interleaved. This means that there are four
memory banks, each holding 16 words.
If this memory is also low-order interleaved, we have the following allocation of words
to banks.
Bank 0: Words 00, 04, 08, 12, 16, 20, 24, ..., 60
Bank 1: Words 01, 05, 09, 13, 17, 21, 25, ..., 61
Bank 2: Words 02, 06, 10, 14, 18, 22, 26, ..., 62
Bank 3: Words 03, 07, 11, 15, 19, 23, 27, ..., 63
Again, we have not yet specified the size of the memory word.
In traditional (flat) layouts, memory banks can be allocated a continuous block of
memory addresses, which is very simple for the memory controller and gives equal
performance in completely random access scenarios, when compared to performance
levels achieved through interleaving.
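The bank allocation above follows directly from the low-order bits of the address; a minimal sketch for the 4-way example (function names are illustrative):

```python
# Low-order interleaving for the 64-word, 4-way example above:
# the low-order address bits select the bank, the remaining bits
# select the word within the bank.
NUM_BANKS = 4

def bank_of(addr):
    return addr % NUM_BANKS    # low-order bits pick the bank

def offset_in_bank(addr):
    return addr // NUM_BANKS   # remaining bits pick the word

# consecutive addresses land in different banks, so a sequential
# access stream keeps all four banks busy at once
print([bank_of(a) for a in range(8)])  # [0, 1, 2, 3, 0, 1, 2, 3]
```

Word 60, for instance, falls in bank 0 at offset 15, matching the allocation table above.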
6. Write short notes on the following:
a) Scalar and vector processors [WBUT 2006, 2007, 2018]
b) Memory to memory vector architecture [WBUT 2010]
c) Vectorizing compilers [WBUT 2010]
d) Vector registers architectures [WBUT 2011]
e) Vector Stride [WBUT 2012]
f) Array processor & Vector processor [WBUT 2014]
Answer:
a) Scalar and vector processors:
A vector processor is a CPU design that is able to run mathematical operations on
multiple data elements simultaneously. This is in contrast to a scalar processor which
handles one element at a time. A computer with built-in instructions that perform
multiple calculations on vectors (one-dimensional arrays) simultaneously is used to
solve the same or similar problems as an array processor; however, a vector processor
passes a vector to a functional unit, whereas an array processor passes each element of a
vector to a different arithmetic unit.

Vector processors are based on a single-instruction, multiple-data architecture that is
distinctly different from SIMD extensions to scalar/superscalar processors. Each vector
data path has some data independence from the others, allowing data path dependent
operations. This allows easier control for wider machines. Single chip vector processors
can still be low power and easy to program even with eight parallel vector units. For
many communications algorithms, characterized by high data parallelism, vector single
instruction machines end up being the ideal balance of instruction/programming
simplicity and compactness, while still supporting complex processing requirements and
high performance.
A vector processor for executing vector instructions comprises a plurality of vector
registers and a plurality of pipeline arithmetic logic units. The vector registers are
constructed with a circuit which operates at a speed equal to 2n times as fast as the
processing speed of the pipeline arithmetic logic units. Either the read or the write
operation from or to the vector registers is carried out in the time obtained by a
processing cycle of each of the pipeline arithmetic logic units multiplied by n/2.
The simplest processors are scalar processors. Each instruction executed by a scalar
processor typically manipulates one or two data items at a time. RISC processors are in
this category. A scalar processor includes a plurality of scalar arithmetic logic units
and a special function unit. Each scalar unit performs, in a different time interval, the
same operation on a different data item, where each different time interval is one of a
plurality of successive, adjacent time intervals. Each unit provides an output data item in
the time interval in which the unit performs the operation and provides a processed data
item in the last of the successive, adjacent time intervals. The special function unit
provides a special function computation for the output data item of a selected one of the
scalar units, in the time interval in which the selected scalar unit performs the operation,
so as to avoid a conflict in use among the scalar units. A vector processing unit includes
an input data buffer, the scalar processor, and an output orthogonal converter.
b) Memory to memory vector architecture:
To maintain an initiation rate of one word fetched or stored per clock, the memory system
must be capable of producing or accepting this data. This is usually done by creating
multiple memory banks. Having a significant number of banks is useful for dealing with
vector loads or stores that access rows or columns of data.
In the register-to-register machines the vectors have a relatively short length, 64 in the
case of the Cray family, but the startup time is far less than on the memory-to-memory
machines. Thus these machines are much more efficient for operations involving short
vectors, but for long vector operations the vector registers must be loaded with each
segment before the operation can continue.

Vector-memory instruction: A vector-memory instruction can be defined by vector load
or vector store operations between a vector register and memory. It can also be defined by
the following two mapping functions f1 and f2:
f1 : M → V [vector load] and f2 : V → M [vector store]
Fig: Vector memory instructions
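The two mappings can be sketched in Python. The flat-memory model and vector length VL are illustrative assumptions; only the f1 : M → V and f2 : V → M structure comes from the text.

```python
# Sketch of the two vector-memory mappings: f1 loads VL consecutive
# words from memory into a vector register, f2 stores them back.
def vector_load(memory, base, vl):       # f1 : M -> V
    return memory[base:base + vl]

def vector_store(memory, base, vreg):    # f2 : V -> M
    memory[base:base + len(vreg)] = vreg

mem = list(range(16))
v1 = vector_load(mem, 4, 4)   # V1 <- M[4..7]
vector_store(mem, 0, v1)      # M[0..3] <- V1
print(v1)  # [4, 5, 6, 7]
```

A memory-to-memory architecture chains these mappings without an architectural register in between; a register-to-register architecture makes V1 an explicit, fixed-length vector register.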


Register to register machines now dominate the vector computer market, with a number
of offerings from Cray Research Inc., including the Y-MP and the C-90.
The basic processor architecture of the Cray supercomputers has changed little since the
Cray-1 was introduced in 1976. There are 8 vector registers, named V0 through V7,
which each hold 64 64-bit words. There are also 8 scalar registers, which hold single 64
bit words, and 8 address registers (for pointers) that have 20-bit words. Instead of a
cache, these machines have a set of backup registers for the scalar and address registers;
transfer to and from the backup registers is done under program control, rather than by
lower level hardware using dynamic memory referencing patterns.
The original Cray-1 had 12 pipelined data processing units; newer Cray systems have 14.
There are separate pipelines for addition, multiplication, computing reciprocals (to divide
x by y, a Cray computes x·(1/y)), and logical operations. The cycle time of the data
processing pipelines is carefully matched to the memory cycle times. The memory
system delivers one value per clock cycle through the use of 4-way interleaved memory.

c) Vectorizing compilers: Refer to Question No. 4(b) of Long Answer Type Questions.
d) Vector registers architectures:
Each vector register is a fixed-length bank holding a single vector. Each vector register
must have at least two read ports and one write port. This will allow a high degree of
overlap among vector operations to different vector registers. The read and write ports,
which total at least 16 read ports and 8 write ports, are connected to the functional unit
inputs or outputs by a pair of crossbars.

[Figure: main memory feeds a vector load-store unit; eight vector registers connect to FP add/subtract, FP multiply, FP divide, integer, and logical functional units; a set of scalar registers also supplies operands]
Fig: Vector register architecture
In the above figure, there are eight 64-element vector registers, and all the functional
units are vector functional units.
Vector functional units: In a vector functional unit each unit is fully pipelined and can
start a new operation on every clock cycle. A control unit is needed to detect hazards. In
the above figure, there are five functional units.
Vector load-store unit: This is a vector memory unit that loads or stores a vector to or
from memory. Here, vector loads and stores are fully pipelined, so that words can be
moved between the vector registers and memory with a bandwidth of one word per clock
cycle, after an initial latency.
Set of scalar registers: Scalar registers can also provide data as input to the vector
functional units, as well as compute addresses to pass to the vector load-store unit.
Vector Execution Time: The execution time of a sequence of vector operations primarily
depends on three factors:
The length of the operand vectors,
Structural hazards among the operations and
Data dependences.
We can compute the time for a single vector instruction depending on the vector length
and the initiation rate, that is, the rate at which a vector unit consumes new operands and
produces new results. All modern supercomputers have vector functional units with
multiple parallel pipelines that can produce two or more results per clock cycle.
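A first-order sketch of this execution-time model follows; the startup latency and initiation rates used are illustrative numbers, not figures from the text:

```python
# First-order vector execution time: each instruction pays a pipeline
# startup latency, then delivers results at the initiation rate
# (results per clock cycle).
def vector_exec_cycles(vector_length, startup, initiation_rate=1):
    return startup + vector_length / initiation_rate

# a 64-element operation with a 12-cycle startup, one result per clock
print(vector_exec_cycles(64, 12))                     # 76.0 cycles

# two parallel pipelines halve the per-element cost
print(vector_exec_cycles(64, 12, initiation_rate=2))  # 44.0 cycles
```

The model also shows why long vectors amortize startup: at length 1024 the cost per element is barely above one cycle, which is why structural hazards and data dependences (not startup) dominate for long operands.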
e) Vector Stride: Refer to Question No. 5 of Short Answer Type Questions.


f) Array processor & Vector processor:
Vector and array processing are essentially the same because, with slight and rare
differences, a vector processor and an array processor are the same type of processor. A
processor, or central processing unit (CPU), is a computer chip that handles most of the
information and functions processed through a computer. A vector processor employs
multiple vector pipelines. An array processor uses a number of processing elements
operating in parallel. An array processor is a SIMD type processor and requires a host
processor (control processor). An array processor is a synchronous parallel processor
containing multiple ALUs. Each ALU contains local memory. The ALU together with
the local memory is called a processing element (PE). The PEs are synchronized to
perform the same operation simultaneously. The host processor is a scalar processor. The
instructions are fetched and decoded by the control processor. The vector instructions are
sent to PEs for distributed execution over different elements of the vector operand. These
vector elements are contained in the local memories. The PEs are passive devices without
instruction decoding capabilities. Vector and array processing technology is not usually
used in home or office computers. This technology is most often seen in high-traffic
servers. Servers are racks of storage drives designed to house and allow access to
information from several different users at different computers located on a computer
network. Scalar processing technology operates on different principles than vector and
array processing technology and is the most common type of processing hardware used in
the average computer. A superscalar processor is a processor that operates like a scalar
processor, but it has many different units within the CPU which each handle and process
data simultaneously. The higher-performance superscalar processor type is also equipped
with programming that makes it efficiently assign data processing to the available scalar
units within the CPU. Most modern home computer processors are superscalar.
