Study Material On Computer Organization and Architecture
[Figure: Flynn's classification block diagrams. (a) SISD computer: control unit (CU), processor unit (PU) and memory module (MM), with a single instruction stream (IS) and data stream (DS). (b) MIMD, tightly coupled (e.g. the Burroughs D-825): control units CU1...CUn issue instruction streams IS1...ISn to processor units PU1...PUn, which exchange data streams DS1...DSn with shared memory (SM) modules MM1...MMm.]
COA : 2
The categorization depends on the multiplicity of simultaneous events in the system components. Conceptually, only three types of components are needed. Both instructions and data are fetched from memory modules (MM). Instructions are decoded by the control unit (CU), which sends the decoded instruction stream to the processor unit (PU) for execution. Data streams flow bi-directionally between the processors and the memory. Multiple memory modules may be used in the shared memory subsystem. Each instruction stream is generated by an independent control unit, and multiple data streams originate from the subsystem of shared memory modules.
Feng has suggested the use of the degree of parallelism to classify various computer architectures. The maximum number of binary digits (bits) that can be processed within a unit time by a computer system is called the maximum parallelism degree P. Let Pi be the number of bits that can be processed within the ith processor cycle (or the ith clock period). Consider T processor cycles indexed by i = 1, 2, ..., T. The average parallelism degree Pa is defined by

Pa = (P1 + P2 + ... + PT) / T

In general, Pi <= P. Thus we define the utilization rate u of a computer system within T cycles by

u = Pa / P = (P1 + P2 + ... + PT) / (T * P)

If the computing power of the processor is fully utilized (or the parallelism is fully exploited), then we have Pi = P for all i and u = 1 for 100 % utilization. The utilization rate depends on the application program being executed.
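The two definitions above can be checked numerically; the cycle-by-cycle bit counts below are illustrative values of our own, not data from the text.

```python
# Feng's average parallelism degree Pa = (sum of Pi)/T and the
# utilization rate u = Pa/P, as defined above.

def average_parallelism(bits_per_cycle, P):
    """bits_per_cycle[i] is Pi, the bits processed in cycle i; P is the maximum."""
    T = len(bits_per_cycle)
    Pa = sum(bits_per_cycle) / T
    return Pa, Pa / P

# Hypothetical machine with maximum parallelism degree P = 64 bits:
Pa, u = average_parallelism([64, 32, 64, 16], P=64)
print(Pa, u)   # 44.0 0.6875
```

With Pi = P in every cycle the same function returns u = 1, the fully utilized case.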
A bit slice is a string of bits, one from each of the words at the same vertical bit position. For example, the TI-ASC has a word length of 64 and 4 arithmetic pipelines, each 8 bits wide with 8 pipeline stages. Thus there are 8 * 4 = 32 bits per bit slice across the 4 pipes.
[Figure: TI-ASC processing organization: four arithmetic pipelines (pipeline no. 1 to no. 4), each 8 bits wide with 8 stages (Stage-1 ... Stage-8).]
The maximum parallelism degree P(C) of a given computer system C is represented by the product of the word length n and the bit-slice length m, that is,

P(C) = n * m
It is equal to the area of the rectangle defined by the integers n and m.
NUMBER REPRESENTATION TECHNIQUES:
Unsigned representation
Signed representation
Unsigned Representation:
In this representation all N bits contribute to represent the (+ve) magnitude part of the number, giving the range 0 to 2^N - 1.

Signed Representation:
In this representation technique the first bit represents the sign (0 => +ve, 1 => -ve) and the remaining (N - 1) bits represent the magnitude part of the number. The main disadvantage of this technique is that though mathematically +0 and -0 are the same, they have different representations here. The range runs from -(2^(N-1) - 1) through -0 and +0 up to +(2^(N-1) - 1), with -ve overflow and +ve overflow beyond the two ends.

If N = 4 the different combinations are:

0111 => +7   => +(2^(N-1) - 1)
0110 => +6
 .
 .
0001 => +1
0000 => +0
1000 => -0
1001 => -1
 .
 .
1110 => -6
1111 => -7   => -(2^(N-1) - 1)
The above representation technique can also be called sign-magnitude representation. To solve the above-mentioned problem we introduce a different representation technique called 2's complement signed representation.
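A minimal sketch of the N-bit sign-magnitude encoding described above, showing the distinct +0 and -0 patterns (the helper name is our own):

```python
def sign_magnitude(sign, magnitude, n):
    """Encode in n-bit sign-magnitude: sign (0 => +ve, 1 => -ve) then N-1 magnitude bits."""
    assert 0 <= magnitude <= 2 ** (n - 1) - 1, "magnitude overflow"
    return format(sign, 'b') + format(magnitude, f'0{n - 1}b')

print(sign_magnitude(0, 7, 4))   # 0111  => +7
print(sign_magnitude(1, 7, 4))   # 1111  => -7
print(sign_magnitude(0, 0, 4))   # 0000  => +0
print(sign_magnitude(1, 0, 4))   # 1000  => -0, a distinct pattern for the same value
```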
Trick 1:
(1110110)2 = 64 + 32 + 16 + 4 + 2 = 118
           = (2^7 - 2^4) + (2^3 - 2^1) = (128 - 16) + (8 - 2) = 118
(1101111)2 = 64 + 32 + 8 + 4 + 2 + 1 = 111
           = (2^7 - 2^5) + (2^4 - 2^0) = (128 - 32) + (16 - 1) = 111

Trick 2:
(0.111)2 = 1*2^-1 + 1*2^-2 + 1*2^-3 = 0.5 + 0.25 + 0.125 = 0.875
         = 1 - 2^-3 = 1 - (1/8) = 7/8 = 0.875
So in general we have: 1 - 2^-n = (0.111...1)2 with n ones.
So, to have a fixed form before representing a number in the computer's memory, we should make the number normalized. The normalization rules are illustrated by the following 32-bit format:

[Figure: a 32-bit format with 1 sign bit for the significand (0 => +ve, 1 => -ve), an 8-bit biased exponent and a 23-bit (truncated) significand. The exponent is biased, i.e. 128 is added to the exponent, so the range -128 to +127 is shifted to 0 to 255 (+ve half only). The representable range runs from -(1 - 2^-24) * 2^127 through 0 to +(1 - 2^-24) * 2^127.]
The sign bit s of the significand for the four sign combinations:

+0.110111 * 2^(+100101)  =>  s = 0
+0.110111 * 2^(-100101)  =>  s = 0
-0.110111 * 2^(+100101)  =>  s = 1
-0.110111 * 2^(-100101)  =>  s = 1

IEEE 754 single-precision format: 1 sign bit (s), an 8-bit exponent (e) and a 23-bit mantissa (m). Here base (r) = 2, the exponent e is in excess-127 and the mantissa has a hidden 1, thus denoting 1.m.

e     m      Inference
255   != 0   NaN (not a no.) (divide by 0, sq. root of a -ve no.)
255   0      infinity, x = (-1)^s * infinity; + and - represented differently
0     != 0   denormalized, x = (-1)^s * 2^-126 * (0.m)
0     0      zero, x = (-1)^s * 0; again +0 and -0 are possible
Problem :
0 128 1 2 1
Class Work:
A) Evaluate (40A00000)H as an IEEE 754 single-precision number.
B) Express (10)10 in IEEE 754 floating-point format.
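The class work can be checked with Python's struct module, which packs and unpacks IEEE 754 single-precision bit patterns (a sketch for verification, not part of the original solution):

```python
import struct

def decode_ieee754(word):
    """Interpret a 32-bit integer as an IEEE 754 single-precision number."""
    return struct.unpack('>f', word.to_bytes(4, 'big'))[0]

def encode_ieee754(x):
    """Return the 32-bit IEEE 754 single-precision pattern of x."""
    return int.from_bytes(struct.pack('>f', x), 'big')

print(decode_ieee754(0x40A00000))   # 5.0
print(hex(encode_ieee754(10.0)))    # 0x41200000
```

40A00000H has s = 0, e = 129 (so exponent 2) and mantissa 0.25, giving 1.25 * 2^2 = 5.0; (10)10 = 1.25 * 2^3 gives e = 130 = (10000010)2.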
[Figure: flowchart of Booth's multiplication algorithm: examine the bit pair q1 q0 (01: add M; 10: subtract M), arithmetic-shift-right A and Q, decrement the count n, and repeat until n = 0; the result is in AQ. Class work: multiply (7) * (3) = (21) using a table with columns n, A, Q, Q0 and Action (+M, -M, shift).]
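A runnable sketch of Booth's algorithm on n-bit two's-complement operands, following the flowchart above (register names A, Q and the extra bit follow the figure; the width parameter is our own):

```python
def booth_multiply(multiplicand, multiplier, n):
    """Booth's algorithm: A = accumulator, Q = multiplier register, q_1 = extra bit."""
    mask = (1 << n) - 1
    M = multiplicand & mask
    A, Q, q_1 = 0, multiplier & mask, 0
    for _ in range(n):
        q0 = Q & 1
        if (q0, q_1) == (1, 0):        # bit pair 10: A <- A - M
            A = (A - M) & mask
        elif (q0, q_1) == (0, 1):      # bit pair 01: A <- A + M
            A = (A + M) & mask
        # arithmetic right shift of the combined A, Q, q_1
        q_1 = q0
        Q = ((Q >> 1) | ((A & 1) << (n - 1))) & mask
        A = ((A >> 1) | (A & (1 << (n - 1)))) & mask
    result = (A << n) | Q
    if result & (1 << (2 * n - 1)):    # interpret the 2n-bit result as signed
        result -= 1 << (2 * n)
    return result

print(booth_multiply(7, 3, 5))    # 21
print(booth_multiply(-7, 3, 5))   # -21
```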
MOV X, R1        M[X] <- R1

One-Address Instructions:
LOAD  A          AC <- M[A]
ADD   B          AC <- AC + M[B]
STORE T          M[T] <- AC
LOAD  C          AC <- M[C]
ADD   D          AC <- AC + M[D]
MUL   T          AC <- AC * M[T]
STORE X          M[X] <- AC

Zero-Address Instructions:
PUSH  A          TOS <- A
PUSH  B          TOS <- B
ADD              TOS <- (A + B)
PUSH  C          TOS <- C
PUSH  D          TOS <- D
ADD              TOS <- (C + D)
MUL              TOS <- (A + B) * (C + D)
POP   X          M[X] <- TOS
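The zero-address sequence can be traced with a small stack-machine simulator (a sketch; the instruction tuples and sample memory values are our own):

```python
# Execute the zero-address program above for X = (A + B) * (C + D).
def run(program, mem):
    stack = []
    for op, *arg in program:
        if op == 'PUSH':
            stack.append(mem[arg[0]])       # TOS <- M[operand]
        elif op == 'ADD':
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == 'MUL':
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == 'POP':
            mem[arg[0]] = stack.pop()       # M[operand] <- TOS
    return mem

mem = run([('PUSH', 'A'), ('PUSH', 'B'), ('ADD',), ('PUSH', 'C'),
           ('PUSH', 'D'), ('ADD',), ('MUL',), ('POP', 'X')],
          {'A': 2, 'B': 3, 'C': 4, 'D': 5})
print(mem['X'])   # (2 + 3) * (4 + 5) = 45
```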
Computer System Architecture, M. Morris Mano (Exercise)
Ex: 8.12, Page No. 293

X = (A + B * C) / (D - E * F + G * H)

X = {A - B + C * (D * E - F)} / (G + H * K)

Three-address instructions (for X = (A + B * C) / (D - E * F + G * H)):

Three-address instructions (for X = {A - B + C * (D * E - F)} / (G + H * K)):

One-address instructions:
[Figure: (a) Horizontal micro-instruction: each bit of the micro-instruction drives an individual control signal directly. (b) Vertical micro-instruction: function-code fields pass through decoders to produce the individual control signals, with a jump-condition field and a demultiplexer routing encoded bits to their destinations.]
Microinstruction Formats:
The two widely used formats for microinstructions are horizontal and vertical. In a horizontal microinstruction each bit represents a micro-order or a control signal which directly controls a single bus line or sometimes a gate in the machine. However, the length of such a microinstruction may be hundreds of bits.
In vertical microinstructions, many similar control signals can be encoded into a few microinstruction bits. For 16 ALU operations, which would require 16 individual micro-orders in a horizontal microinstruction, only 4 encoded bits are needed in a vertical microinstruction. Similarly, in a vertical microinstruction only 3 bits are needed to select one of 8 registers. However, these encoded bits need to be passed through the respective decoders to obtain the individual control signals.
Some of the microinstruction bits may be passed through a de-multiplexer so that the selected bits can be used for a few different locations in the CPU. For example, a 16-bit field in a microinstruction can be used as a branch address in a branching microinstruction, while the same bits may be utilized for some other control signals in a non-branching microinstruction. In such a case a de-multiplexer is used.
Vertical microinstructions are normally of the order of 32 bits. In certain control units, several levels of control are used. For example, a field in the microinstruction or in the machine instruction may hold the address of a read-only memory which holds the control signals. This secondary ROM can hold large address constants such as interrupt service routine addresses.
In general, horizontal control units are faster yet require wide instruction words, whereas vertical control units require decoders but have shorter microinstructions. Most systems use neither purely horizontal nor purely vertical microinstructions.
Example:
Let us consider a hypothetical architecture with 16 General Purpose Registers (GPRs) (e.g. R0, R1, R2, ..., R15) and 4 operation codes/instructions (e.g. addition, subtraction, multiplication and division). The opcode then needs 2 bits and each register designator needs 4 bits, so the instruction

MUL R3, R4

is encoded as 10 0011 0100 (opcode 10 = MUL, 0011 = R3, 0100 = R4).
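The encoding can be sketched in a few lines; the opcode numbering other than MUL = 10 is our own assumption:

```python
# Pack opcode(2 bits) | rs(4 bits) | rd(4 bits) into a 10-bit word,
# for the hypothetical 4-opcode, 16-register machine above.
OPCODES = {'ADD': 0b00, 'SUB': 0b01, 'MUL': 0b10, 'DIV': 0b11}

def encode(mnemonic, rs, rd):
    return (OPCODES[mnemonic] << 8) | (rs << 4) | rd

word = encode('MUL', 3, 4)
print(format(word, '010b'))   # 1000110100, i.e. 10 0011 0100
```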
[Figure: control store organization: decode lines select micro-words from the control store array.]
Decode Line    Control Signals Generated     Address of next microinstruction
000            C1 C3 C5 C7                   001
001            C2 C4 C5                      010
010            C1 C3                         011
011            C2 C5                         (conditional, see below)

Either:
011            C2 C5                         if the external condition is true, target address 110
110            C2 C4 C7                      111
111            C0 C1 C2 C3 C4 C5 C7          Load next instruction in IR

Or:
011            C2 C5                         if the condition is false, target address 100
100            C1 C2 C3 C5                   101
101            C1 C6 C7                      111
111            C0 C1 C2 C3 C4 C5 C7          Load next instruction in IR
HARDWIRED CONTROL:
Control units use fixed logic circuits to interpret instructions and generate control signals from them.
DESIGN METHODS:
The design of a hardwired control involves various complex tradeoffs between the amount of hardware
used, its speed of operation, and the cost of the design process itself. Because of the large number of control
signals used in a typical CPU and their dependence on the particular instruction set being implemented, the
design methods employed in practice are often ad hoc and heuristic in nature, and therefore cannot easily be
formalized. Three simplified and systematic approaches are:
Method 1: The standard algorithmic approach to sequential circuit design, called the state-table method, since it begins with the construction of a state table for the control unit.
Method 2: A heuristic method based on the use of clocked delay elements for control-signal timing.
Method 3: A related method that uses counters, which we call sequence counters, for timing purposes.
METHOD 2 : Delay Element Method
Consider the problem of generating the following sequence of control signals at times t1, t2, ..., tn using a hardwired control unit:

t1: Activate {C1,j}
t2: Activate {C2,j}
......................................
tn: Activate {Cn,j}

Suppose that an initiation signal called START(t1) is available at t1. START(t1) may be fanned out to {C1,j} to perform the first micro-operation. If START(t1) is also entered into a time-delay element of delay t2 - t1, the output of that circuit, START(t2), can be used to activate {C2,j}. Similarly, another delay element of delay t3 - t2 with input START(t2) can be used to activate {C3,j}, and so on. Thus a sequence of delay elements can be used to generate control signals in a very straightforward manner.
METHOD 3 : Sequence Counter Method
Consider the circuit diagram, which consists of a modulo-k counter whose output is connected to a 1-out-of-k clocked decoder. If the count-enable input is connected to a clock source, the counter cycles continually through its k states. The decoder generates k pulse signals {φi} on its output lines. The {φi} effectively divide the time required for one complete cycle of the counter into k equal parts; the {φi} may be called phase signals. The circuit shown may be called a sequence counter. The figure shows a one-loop flowchart containing six steps that describes the behaviour of a typical CPU. Each pass through the loop constitutes an instruction cycle. Assuming that each step can be performed in an appropriately chosen clock period, one may build a control unit for this CPU around a single (modulo-6) sequence counter. Each signal φi activates some set of control lines in step i of every instruction cycle. It is usually necessary to be able to vary the operations performed in step i depending on certain control signals or condition variables applied to the control unit. These are represented by the signals Cin = {C'in, C''in}. A logic circuit N is therefore needed which combines Cin with the timing signals {φi} generated by the sequence counter.
[Figure: (a) a delay-element cascade: START passes through successive delay elements, activating {C1,j}, {C2,j}, {C3,j}, ...; (b) the equivalent sequence-counter circuit: clock and reset feed a modulo-k sequence counter whose decoder outputs the phase signals φ1, φ2, ..., φk, and a condition test (Is X = 1?) selects between branches A and B. Flowchart step 1: transfer the Program Counter to the memory address register.]
BUS SYSTEM:
INTRODUCTION:
A computer system contains a number of buses which provide pathways among several devices. A shared bus that connects the CPU, memory and I/O is called the System Bus.
A system bus may consist of 50 to 100 separate lines. These lines can be categorized into 3 functional
groups:
Data bus provides a path for moving data between the system modules. A data bus width limits the
maximum number of bits which can be transferred simultaneously between 2 modules e.g. CPU and memory.
Address bus is used to designate the source of data for data bus. The width of the address bus specifies the
maximum possible memory supported by a system.
Control bus is used to control the access to data and address bus and for transmission of commands and
timing signals between the system modules.
Physically a bus is a number of parallel electrical conductors. These circuits are normally imprinted on
printed circuit boards.
SOME OF THE ASPECTS RELATED TO THE BUS:
DEDICATED OR MULTIPLEXED BUSES:
A dedicated bus line is permanently assigned either to a function or to a physical subset of the components of the computer. Examples of functional dedication are the dedicated address bus and data bus. Physical dedication increases the throughput of the bus, as only a few modules are in contention, but it increases the overall size and cost of the system.
In certain computer bus some or all the address lines are also used for data transfer operation, i.e., the
same lines are used for address as well as data lines at different times. This is known as time multiplexing and
the buses are called multiplexed buses.
BUS TIMING:
This concerns the timing of data transfers through the bus. In synchronous buses data is transferred during specific times which are known to the source and destination. This is achieved by using clock pulses. The alternative approach is asynchronous buses, where each item to be transferred has a separate control signal indicating its presence on the bus to the destination.
BUS ARBITRATION:
An important aspect of the bus system is the control of the bus. More than one module connected to the bus may want to access the bus for data transfer, so there must be some method for resolving simultaneous data-transfer requests on the bus. The process of selecting one of the units from the various bus-requesting units is called bus arbitration. There are 2 broad categories of bus arbitration: centralized and distributed.
In the centralized scheme a hardware circuit device, the bus controller or bus arbiter, processes the requests to use the bus. The bus controller may be a separate module or a part of the CPU. The distributed scheme has the access-control logic shared among the various modules. All these modules work together to share access.
DAISY CHAINING:
The Bus Request line, if activated, only indicates that one or more modules require the bus. The Bus Request is responded to by the bus controller only if the Bus Busy line is inactive, that is, the bus is free. The bus controller responds by placing a signal on the Bus Grant line. The Bus Grant signal passes through the modules one by one. On receiving the Bus Grant, the module which was requesting bus access blocks further propagation of the Bus Grant signal, issues a Bus Busy signal and starts using the bus. If the Bus Grant signal passes through a module which had not issued a bus request, the signal is forwarded to the next module.
In this scheme the priority is wired in and cannot be changed. Suppose the assumed priority is (highest to lowest) Module 1, Module 2, ..., Module N. If two modules, say 1 and N, request the bus at the same time, then the bus will be granted to Module 1 first, as the signal has to pass through Module 1 to reach Module N. The basic drawback of this simple scheme is that if the bus requests of Module 1 occur at a high rate, the rest of the modules may not get the bus for quite some time. Another problem occurs when, say, the Bus Grant line between Module 4 and Module 5 fails, or Module 4 is unable to pass the Bus Grant signal; in either case no bus access will be possible beyond Module 4.
POLLING:
In polling, instead of the single Bus Grant line used in daisy chaining, we have a set of poll-count lines connected to all the modules on the bus. The Bus Request and Bus Busy lines are the other 2 control lines for bus control. A request to use the bus is made on the Bus Request line, but it will not be responded to as long as the Bus Busy line is active. The bus controller responds to a signal on the Bus Request line by generating a sequence of numbers on the poll-count lines. These numbers are normally the unique addresses assigned to the connected modules. When the poll count matches the address of a particular module which is requesting the bus, that module activates the Bus Busy signal and starts using the bus. Polling is basically asking each module, one by one, whether it has something to do with the bus.
INDEPENDENT REQUESTING:
In this scheme, each module has its own independent Bus Request and Bus Grant lines. The identification of the requesting unit is almost immediate and requests can be responded to quickly. Priority in such systems can be built into the bus controller and can be changed through a program.
In certain systems a combination of these arbitration schemes is used. The PDP-11 UNIBUS system uses daisy chaining and independent requesting. It has five independent Bus Request lines, and each of these Bus Request lines has a distinct Bus Grant line. Several modules of the same priority may be connected to the same Bus Request line, and the Bus Grant line to these same-priority modules can be daisy-chained.
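The daisy-chain priority behaviour described above can be sketched as a simple scan from the highest-priority module (the function name and 1-based module numbering are our own):

```python
def daisy_chain_grant(requests):
    """requests[i] is True if module i+1 has raised Bus Request.
    The Bus Grant propagates from Module 1 towards Module N and is
    absorbed by the first requesting module it reaches."""
    for i, requesting in enumerate(requests):
        if requesting:
            return i + 1       # this module blocks further propagation
    return None                # grant falls off the end of the chain

# Modules 1 and N request simultaneously: Module 1 wins.
print(daisy_chain_grant([True, False, False, True]))   # 1
print(daisy_chain_grant([False, False, True, False]))  # 3
```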
[Figure (a): the MPU's system data bus connects to a programmable interfacing device, which exchanges data with a peripheral such as a keyboard using the handshake signals STB (strobe) and IBF (input buffer full); the device raises INTR to the MPU, which can also poll a status pin and read the byte via RD.]
Interfacing Device With Handshake Signals For Data Input
1. A peripheral strobes or places a data byte in the input port and informs the interfacing device by sending the handshake signal STB (strobe).
2. The device informs the peripheral that its input port is full; do not send the next byte until this one has been read. This message is conveyed to the peripheral by sending the handshake signal IBF (input buffer full).
3. The MPU keeps checking the status until a byte is available, or the interfacing device informs the MPU, by sending an interrupt, that it has a byte to be read.
Timing Waveforms of the 8155 I/O Ports with Handshake : Input Mode
[Figure: the MPU's system data bus connects to a programmable interfacing device, which sends data to a peripheral such as a printer using the handshake signals OBF (output buffer full) and ACK (acknowledge); the device raises INTR to the MPU, which can also poll a status pin and write the byte via WR.]
1. The MPU writes a byte into the output port of the programmable device by sending control signal
WR.
2. The device informs the peripheral, by sending the handshake signal OBF (Output Buffer Full), that a byte is on the way.
3. The peripheral acknowledges the byte by sending back the ACK (Acknowledge) signal to the device.
4. The device interrupts the MPU to ask for the next byte, or the MPU finds out that the byte has been
acknowledged through the status check.
Timing Waveforms of the 8155 I/O Ports with Handshake : Output Mode
S = (n * tn) / ((k + n - 1) * tp)

Consider a k-segment pipeline with a clock cycle time tp executing n tasks, and a non-pipelined unit that performs the same operation in a time tn per task. S is called the speed-up ratio. When n >> k - 1, then k + n - 1 is approximately n,
so S = tn / tp.
If, further, tn = k * tp,
then S = k * tp / tp = k.
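The speed-up formula above as a one-line function, checked for the limiting case tn = k * tp:

```python
def speedup(n, k, tp, tn):
    """S = n*tn / ((k + n - 1) * tp) for n tasks on a k-segment pipeline."""
    return (n * tn) / ((k + n - 1) * tp)

# With tn = k*tp the speed-up approaches k as n grows large.
k, tp = 4, 20e-9
print(round(speedup(1000, k, tp, tn=k * tp), 3))   # 3.988, close to k = 4
```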
ARITHMETIC PIPELINE

Consider the addition of two floating-point numbers X = A * 10^a and Y = B * 10^b, e.g.

X = 0.9504 * 10^3
Y = 0.8200 * 10^2

Segment 1: Compare the exponents by subtraction: 3 - 2 = 1.
Segment 2: Align the mantissas: Y = 0.0820 * 10^3.
Segment 3: Add the mantissas: Z = X + Y = (0.9504 + 0.0820) * 10^3 = 1.0324 * 10^3.
Segment 4: Normalize the result: Z = 0.10324 * 10^4.

[Figure: the four pipeline segments, with registers R between stages, processing the exponents a, b and the mantissas A, B in parallel.]
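The four segments can be traced for the decimal example above (a sketch using (mantissa, exponent) pairs; rounding to 5 digits is our own choice):

```python
def fp_add(x, y):
    """Four-segment floating-point addition for numbers of the form A * 10^a."""
    (A, a), (B, b) = x, y
    # Segment 1: compare exponents by subtraction.
    d = a - b
    # Segment 2: align mantissas (shift the smaller operand right).
    if d >= 0:
        B, b = B / 10 ** d, a
    else:
        A, a = A / 10 ** (-d), b
    # Segment 3: add mantissas.
    Z = A + B
    # Segment 4: normalize the result to 0.1 <= |Z| < 1.
    while abs(Z) >= 1:
        Z, a = Z / 10, a + 1
    return round(Z, 5), a

print(fp_add((0.9504, 3), (0.8200, 2)))   # (0.10324, 4)
```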
Step:            1   2   3   4   5   6   7   8   9   10  11  12  13
Instruction: 1   FI  DA  FO  EX
             2       FI  DA  FO  EX
    (Branch) 3           FI  DA  FO  EX
             4               FI  -   -   FI  DA  FO  EX
             5                   -   -   -   FI  DA  FO  EX
             6                               FI  DA  FO  EX
             7                                   FI  DA  FO  EX
[Flowchart: four-segment instruction pipeline. Segment 1: fetch instruction from memory. Segment 2: decode instruction and calculate effective address, then test for branch (yes: empty the pipe and update PC). Segment 3: fetch operand from memory. Segment 4: execute instruction, then test for interrupt (yes: interrupt handling, empty the pipe); otherwise update PC and continue.]
The pipeline throughput Hk is defined as the number of tasks (operations) performed per unit time.
Instruction pipeline:
The execution of a stream of instructions can be pipelined by overlapping the execution of the current instruction with the fetch, decode, and operand fetch of subsequent instructions. This is also known as instruction look-ahead.
Processor pipelining:
Here the same data stream is processed by a cascade of processors, each of which handles a specific task. The data stream passes through the first processor, with results stored in a memory block which is also accessible by the second processor. The second processor then passes refined results to the third, and so on. Three classification schemes are:
What are the binary values to be given at the X, Y and Load inputs to perform the addition of the 4 numbers b0, b1, b2, b3 fed through inputs I1 and I2? (Use the minimum number of CLK pulses.) Draw the reservation table accordingly.

CP   Y   X   I1   I2   Load
1    1   1   b0   b1   0
2    1   1   b2   b3   0
3    1   1   0    0    0
4    1   1   0    0    0
5    1   1   0    0    1
6    0   0   0    0    0
7    1   1   0    0    0
8    1   1   0    0    0
9    1   1   0    0    0

[Figure: the three pipeline stages S1, S2, S3 with inputs I1, I2 and controls X, Y, Load.]
Pipeline stages: IF: Instruction Fetch; ID: Instruction Decode; OF: Operand Fetch; OE: Operand Execute; OS: Operand Store. Successive instructions i, i-1, i-2 overlap in the pipeline:

i     IF  ID  OF  OE  OS
i-1       IF  ID  OF  OE  OS
i-2           IF  ID  OF  OE  OS

For this example we have an 8-bit opcode, 32-bit data, 16 GPRs (4 bits to number them) and 16-bit addresses. Consider the program A = B + C; B = A + C; D = D - B.

Memory-to-memory (three-address: 8-bit opcode plus three 16-bit addresses per instruction):
Add  B  C  A
Add  A  C  B
Sub  D  B  D
I = 168 b, D = 288 b, M = 456 b

Register-to-register with reuse of operands (load/store: 8 + 4 + 16 bits; ALU op: 8 + 4 + 4 + 4 bits):
Load  rB  B
Load  rC  C
Add   rA  rB  rC
Store rA  A
Add   rB  rA  rC
Store rB  B
Load  rD  D
Sub   rD  rD  rB
Store rD  D
I = 228 b, D = 192 b, M = 420 b

Register-to-register, with the compiler allocating the operands in registers:
Add  rA  rB  rC
Add  rB  rA  rC
Sub  rD  rD  rB
I = 60 b, D = 0 b, M = 60 b
[Figure: two datapath organizations, each with a PC and increment logic supplying the instruction address; general registers (A, B, D) feed the ALUs and shift logic, with results routed to the D-cache.]
R 0  - R 9    Global registers, common to all procedures
R 10 - R 15   Common to A and D
R 16 - R 25   Local to A
R 26 - R 31   Common to A and B
R 32 - R 41   Local to B
R 42 - R 47   Common to B and C
R 48 - R 57   Local to C

Overlapped register windows (the window of Proc A is R 10 - R 31 plus the globals).
We have G = 10, L = 10, C = 6 and W = 4. The window size is 10 + 12 + 10 = 32 registers, and the register file consists of (10 + 6) * 4 + 10 = 74 registers.
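The register numbering of the figure can be generated from G, L, C and W (a sketch; the wrap-around of the last overlap reproduces the registers common to windows D and A):

```python
def window(w, G=10, L=10, C=6, W=4):
    """Register ranges of 0-based window w: (low overlap, locals, high overlap).
    Physical registers G .. G + (L+C)*W - 1 follow the globals; the last
    window's high overlap wraps around to the first window's low overlap."""
    size = L + C
    lo_start = G + w * size
    local_start = lo_start + C
    hi_start = local_start + L
    wrap = lambda r: G + (r - G) % (size * W)
    return ((wrap(lo_start), wrap(lo_start + C - 1)),
            (local_start, local_start + L - 1),
            (wrap(hi_start), wrap(hi_start + C - 1)))

print(window(0))   # ((10, 15), (16, 25), (26, 31))  -> window A
print(window(3))   # ((58, 63), (64, 73), (10, 15))  -> window D shares R10-R15 with A
```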
RISC CHARACTERISTICS:
RISC Processors: In the SPARC architecture, of the 32 32-bit IU (Integer Unit) registers visible to a procedure, eight are global registers shared by all procedures, and the remaining 24 are window registers associated with each procedure. The concept of using overlapped register windows is the most important feature introduced by the Berkeley RISC architecture. This concept is illustrated in the preceding figure.
CACHE MEMORY:
Elements of Cache Design:

Cache Size
Mapping Function: Direct; Associative; Set associative
Replacement Algorithm: Least-recently used (LRU); First-in-first-out (FIFO); Least-frequently used (LFU); Random
Write Policy: Write through; Write back; Write once
Block Size
Number of Caches: Single- or two-level; Unified or split
Block Size:
As the block size increases from very small to larger sizes, the hit ratio will at first increase because of the principle of locality: the high probability that data in the vicinity of a referenced word will be referenced in the near future. As the block becomes still larger, two specific effects come into play:
1. Larger blocks reduce the number of blocks that fit into the cache. Because each block fetch overwrites older cache contents, a small number of blocks results in data being overwritten shortly after it is fetched.
2. As a block becomes larger, each additional word is farther from the requested word, and therefore less likely to be needed in the near future.
Number Of Caches:
Recently, the use of multiple caches has become the norm. Two aspects of this design issue concern the number of levels of caches and the use of unified versus split caches.
The main disadvantage of the unified cache design is that preference is given to the execution unit when the instruction pre-fetcher requests an instruction. This contention can degrade performance by interfering with efficient use of the instruction pipeline. The split cache structure overcomes this difficulty.
Time Step 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Page Address 4 3 2 5 1 2 3 4 3 2 4 5 1 4 3 1
OPT 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
Anticipatory 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
Swapping 2 2 2 2 2 2 2 2 2 5 5 5 5 5
Hit = 83 % 5 1 1 1 1 1 1 1 1 1 1 1 1
(10/12)*100 %
FIFO 4 4 4 4 1 1 1 1 1 1 1 5 5 5 5 5
HIT = 33% 3 3 3 3 3 3 4 4 4 4 4 1 1 1 1
(4/12)*100 % 2 2 2 2 2 2 2 3 3 3 3 3 4 4 4
5 5 5 5 5 5 5 2 2 2 2 2 3 3
LRU 4 4 4 4 1 1 1 1 1 1 1 5 5 5 5 5
HIT=58% 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1
(7/12)*100% 2 2 2 2 2 2 2 2 2 2 2 2 3 3
5 5 5 5 4 4 4 4 4 4 4 4 4
Results Of Simulation Run Of Three Replacement Algorithms On A Common Page Trace Stream.
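The FIFO and LRU rows of the simulation above can be reproduced programmatically; the hit counts 4 and 7 match the quoted 33 % (4/12) and 58 % (7/12):

```python
from collections import OrderedDict, deque

def fifo_hits(trace, frames):
    """Count hits with FIFO replacement over a page trace."""
    q, resident, hits = deque(), set(), 0
    for p in trace:
        if p in resident:
            hits += 1
        else:
            if len(resident) == frames:
                resident.discard(q.popleft())   # evict oldest arrival
            q.append(p)
            resident.add(p)
    return hits

def lru_hits(trace, frames):
    """Count hits with LRU replacement over a page trace."""
    od, hits = OrderedDict(), 0                 # insertion order = recency order
    for p in trace:
        if p in od:
            hits += 1
            od.move_to_end(p)                   # mark as most recently used
        else:
            if len(od) == frames:
                od.popitem(last=False)          # evict least recently used
            od[p] = True
    return hits

trace = [4, 3, 2, 5, 1, 2, 3, 4, 3, 2, 4, 5, 1, 4, 3, 1]
print(fifo_hits(trace, 4), lru_hits(trace, 4))   # 4 7
```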
Problem:
A virtual memory system has a 16K-word logical address space and an 8K-word physical address space with a page size of 2K words. The page address trace of a program has been found to be:
7 5 3 2 1 0 4 1 6 7 4 2 0 1 3 5
Note the four pages resident in memory after each page-reference change for each of the following replacement policies:
(a) FIFO
(b) LRU; and
(c) Anticipatory Swapping
Where
L= Page address stream length to be processed by the replacement algorithm,
n=Page capacity of MS,
t=Time step when t pages of the address stream have been processed,
Bt(n) = set of pages in MS at time t,
Lt = number of distinct pages encountered at time t.
Time 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Step
Page 4 3 2 5 1 2 3 4 3 2 4 5 1 4 3 1
address
St(1) 4 3 2 5 1 2 3 4 3 2 4 5 1 4 3 1
St(2) 4 3 2 5 1 2 3 4 3 2 4 5 1 4 3
St(3) 4 3 2 5 1 2 3 4 3 2 4 5 1 4
St(4) 4 3 3 5 1 1 1 1 3 2 2 5 5
St(5) 4 4 4 5 5 5 5 1 3 3 2 2
n=1
Hit n=2 x
for n=3 x x x x x x
n=4 x x x x x x x
n=5 x x x x x x x x x x x
Stack processing of LRU Scheme of a given page address trace for different main storage capacity.
t = average time elapsed to access a word by the CPU,
t1 = MS access time,
t2 = SS access time, and
R = access time ratio = t2 / t1.
For a given design with a specified value of H, the access efficiency e = t1 / t can be evaluated in terms of H and R, where

H = hit ratio = N1 / (N1 + N2),
e = 1 / (H + (1 - H) * R)
  = 1 / (R + (1 - R) * H)
Problem:
a) Average access time of the system considering only memory read cycles;
b) Average access time of the system for both read and write requests;
c) The hit ratio taking the write cycle into consideration.
Solution:
Average access time = PR * (avg. access time for read) + (1 - PR) * TMS
= 0.8 * 100 ns + (1 - 0.8) * 500 ns
= 80 + 100 = 180 ns
[Figure: address mapping between main storage and cache. The main-storage address is split into a page field p (p = log2 P bits) and a word-within-page field w (w = log2 W bits), with a tag register holding the page tag (here r = 5, b = 7, w = 4). The cache holds C = 128 page frames (c = log2 C), main storage holds P = 4096 pages, and the cache address consists of the page address in cache plus the word field.]
[Figure: block diagram of the hardware structure realizing the behaviour noted in the flowchart. The CPU's address register presents the page address as a key to a CAM whose M words map page addresses P to cache frame addresses C; on a match the cache memory is accessed with frame address C and word offset W.]
[Flowchart: cache access with counter-based LRU replacement]
Start: search for the referenced page in the cache.
- Hit: all counters having a value less than that of the referenced page frame are incremented by 1, while counters having a value greater than that of the referenced page frame remain unchanged. The counter of the hit page frame is set to zero.
- Miss, cache full: replace the page in the frame having counter value C - 1 and set its counter to 0.
- Miss, cache not full: the new page is brought into a cache page frame not currently occupied.
Stop.

[Flowchart: CAM-based look-up]
Start: search the CAM words for the page address.
- Available: access the cache by the address read from the CAM word.
- Not available: bring the page from main storage into a vacant page frame of the cache and proceed to complete the read/write operation.
Stop.
memory (M2): 1M words, so 1M/2^3 = 2^20/2^3 = 2^17 pages. The mapping is illustrated in the figure.
Now considering 4-way set-associative mapping, total number of sets = 2^11/4 = 512.
Effective access time = 0.95 * 50 ns + (1 - 0.95) * 400 ns = 67.5 ns
Problem 2 on cache mapping: Consider an MM with 256 words/module in 4 modules, 16 words in each block, and a CM of 256 words. 4-way set-associative mapping is used. Show the mapping policy between MM and CM.
MM capacity = 4 * 256 = 2^2 * 2^8 = 2^10 words. No. of pages in MM = 2^10 / 16 = 2^10 / 2^4 = 2^6 = 64.
No. of pages in CM = 256 / 16 = 2^8 / 2^4 = 2^4 = 16.
With 4 pages per set there are 16 / 4 = 4 sets. So it is 4-way set-associative mapping.
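The arithmetic of Problem 2 as a short script (variable names are our own):

```python
# Parameters of Problem 2 above, computed programmatically.
mm_words = 4 * 256            # 4 modules x 256 words each
block    = 16                 # words per block (page)
cm_words = 256
ways     = 4                  # pages per set

mm_pages = mm_words // block  # pages in MM
cm_pages = cm_words // block  # page frames in CM
sets     = cm_pages // ways   # number of sets

print(mm_pages, cm_pages, sets)   # 64 16 4
```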
Memory Hierarchy Technology: There are three dimensions of the locality property: temporal, spatial, and sequential. During the lifetime of a software process, a number of pages are used dynamically. These memory reference patterns are caused by the following locality properties:
(1) Temporal locality: Recently referenced items (instructions or data) are likely to be referenced again in the near future. This is often caused by special program constructs such as iterative loops, process stacks, temporary variables, or subroutines. Once a loop is entered or a subroutine is called, a small code segment will be referenced repeatedly many times. Thus temporal locality tends to cluster the accesses in recently used areas.
(2) Spatial locality: This refers to the tendency of a process to access items whose addresses are near one another. For example, operations on tables or arrays involve accesses to a certain clustered area in the address space. Program segments, such as routines and macros, tend to be stored in the same neighbourhood of the memory space.
(3) Sequential locality: In typical programs, the execution of instructions follows a sequential order (the program order) unless branch instructions create out-of-order execution. The ratio of in-order execution to out-of-order execution is roughly 5 to 1 in ordinary programs. Besides, the access of a large data array also follows a sequential order.
Problem: Complete the following table, replacing the unknown entries of the given memory hierarchy. Achieve an effective memory-access time t = 10.04 us with a cache hit ratio h1 = 0.98 and a hit ratio h2 = 0.9 in the main memory. Also, the total cost of the memory hierarchy is upper-bounded by $15,000.
The maximum capacity of the disk is thus S3 = 39.8 Gbytes.
Next, we want to choose the access time (t2) of the RAM used to build the main memory. The effective memory-access time is calculated as

t = h1*t1 + (1 - h1)*h2*t2 + (1 - h1)*(1 - h2)*1*t3

Substituting all known parameters, we have
10.04 x 10^-6 = 0.98 x 25 x 10^-9 + 0.02 x 0.9 x t2 + 0.02 x 0.1 x 1 x 4 x 10^-3. Thus t2 = 903 ns.
Shared Caches: An alternative approach to maintaining cache coherence is to completely eliminate the
problem by using shared caches attached to shared-memory modules. No private caches are allowed in this
case. This approach will reduce the main memory access time but contributes very little to reducing the overall
memory-access time and to resolving access conflicts.
Shared caches can be built as second-level caches. Sometimes one can make the second-level caches partially shared by different clusters of processors. Various cache architectures are possible if private and shared caches are both used in a memory hierarchy. Use of shared caches alone may work against the scalability of the entire system.
Non-cacheable Data: Another approach is not to cache shared writable data. Shared data are non-cacheable,
and only instructions or private data are cacheable in local caches. Shared data include locks, process queues,
and any other data structures protected by critical sections.
The compiler must tag data as either cacheable or non-cacheable, and special hardware tagging must be used to distinguish them. Caches with cacheable and non-cacheable blocks demand more programmer effort, in addition to support from hardware and compilers.
Cache Flushing: A third approach is to use cache flushing every time a synchronization primitive is executed.
This may work well with transaction processing multiprocessor systems. Cache flushes are slow unless special
hardware is used. This approach does not solve I/O and process migration problems.
Flushing can be made very selective by programmers or by the compiler in order to increase efficiency.
Cache flushing at synchronization, I/O, and process migration may be carried out unconditionally or
selectively. Cache flushing is more often used with virtual address caches.
Hardware Interlocks: An interlock is a circuit that detects instructions whose source operands are destinations
of instructions farther up in the pipeline. Detection of this situation causes the instruction whose source is not
available to be delayed by enough clock cycles to resolve the conflict. This approach maintains the program
sequence by using hardware to insert the required delays.
Operand Forwarding: It uses special hardware to detect a conflict and then avoid it by routing the data
through special paths between pipeline segments. For example, instead of transferring an ALU result into a
destination register, the hardware checks the destination operand, and if it is needed as a source in the next
instruction, it passes the result directly into the ALU input, bypassing the register file. This method requires
additional hardware paths through multiplexers as well as the circuit that detects the conflict.
Delayed Load: The compiler is designed to detect a data conflict and reorder the instructions as necessary to
delay the loading of the conflicting data by inserting no-operation instructions. This method is referred to as
delayed load.
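The delayed-load idea can be sketched as a small compiler-style pass. This is an illustrative assumption: the (op, dest, sources) instruction encoding, the one-slot load delay, and all register/memory names are invented for the example, not taken from any particular machine.

```python
# Hypothetical sketch of delayed load: insert a NOP after a load whose
# result is used by the immediately following instruction (a load-use hazard).
# Instructions are modeled as (op, dest, sources) tuples; names are invented.

def insert_delay_slots(program, delay=1):
    """Return a new instruction list with NOPs separating a load from
    any instruction that reads its destination within `delay` slots."""
    out = []
    for instr in program:
        op, dest, srcs = instr
        # Look back at the last `delay` emitted instructions for a load
        # whose destination is one of our sources.
        for prev in out[-delay:]:
            if prev[0] == "LOAD" and prev[1] in srcs:
                out.append(("NOP", None, ()))   # fill the delay slot
        out.append(instr)
    return out

prog = [
    ("LOAD", "R1", ("MEM_A",)),
    ("ADD",  "R2", ("R1", "R3")),   # uses R1 right after the load -> hazard
]
scheduled = insert_delay_slots(prog)
print([op for op, _, _ in scheduled])  # ['LOAD', 'NOP', 'ADD']
```

In a real compiler the NOP would, where possible, be replaced by a useful independent instruction rather than wasted.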
Handling of Branch Instructions One of the major problems in operating an instruction pipeline is the
occurrence of branch instructions. A branch instruction can be conditional or unconditional. It breaks the
normal sequence of the instruction stream, causing difficulties in the operation of the instruction pipeline.
Prefetch Target Instructions: One way of handling a conditional branch is to prefetch the target instruction in
addition to the instruction following the branch. Both are saved until the branch is executed. If the branch
condition is successful, the pipeline continues from the branch target instruction. An extension of this
procedure is to continue fetching instructions from both places until the branch decision is made. At that time
control chooses the instruction stream of the correct program flow.
Branch Target Buffer: Another possibility is the use of a branch target buffer or BTB. The BTB is an
associative memory included in the fetch segment of the pipeline. Each entry in the BTB consists of the address
of a previously executed branch instruction and the target instruction for that branch. It also stores the next
few instructions after the branch target instruction. When the pipeline decodes a branch instruction, it
searches the associative memory BTB for the address of the instruction. If it is in the BTB, the instruction is
available directly and prefetch continues from the new path. If the instruction is not in the BTB, the pipeline
shifts to a new instruction stream and stores the target instruction in the BTB. The advantage of this scheme is
that branch instructions that have occurred previously are readily available in the pipeline without
interruption.
Loop Buffer: A variation of the BTB is the loop buffer. This is a small and very high-speed register file
maintained by the instruction fetch segment of the pipeline. When a program loop is detected in the program,
it is stored in the loop buffer in its entirety, including all branches. The program loop can be executed directly
without having to access memory until the loop mode is removed by the final branching out.
Branch Prediction: Another procedure that some computers use is branch prediction. A pipeline with branch
prediction uses some additional logic to guess the outcome of a conditional branch instruction before it is
executed. The pipeline then begins prefetching the instruction stream from the predicted path. A correct
prediction eliminates the wasted time caused by branch penalties.
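A 2-bit saturating-counter predictor is one common way to implement the guessing logic; this is an assumption for illustration, since the text does not name a specific scheme.

```python
# Sketch of a 2-bit saturating-counter branch predictor.
# Counter states 0-1 predict "not taken", states 2-3 predict "taken".

def predict_run(outcomes, state=2):
    """Run the predictor over a branch-outcome history; return the number
    of correct predictions."""
    correct = 0
    for taken in outcomes:
        if (state >= 2) == taken:          # compare prediction with outcome
            correct += 1
        # Saturating update toward the actual outcome.
        state = min(3, state + 1) if taken else max(0, state - 1)
    return correct

# A loop branch taken 9 times and then falling through once:
print(predict_run([True] * 9 + [False]))  # 9 of 10 predicted correctly
```

The two-bit hysteresis means a single anomalous outcome (such as a loop exit) does not immediately flip the prediction for the next run of the loop.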
Delayed Branch: A procedure employed in most RISC processors is the delayed branch. In this procedure, the
compiler detects the branch instructions and rearranges the machine language code sequence by inserting
useful instructions (e.g. no-operation instruction) that keep the pipeline operating without interruptions.
A comparative study of various parameters in respect of cache-main storage and main storage-secondary storage
interaction:
Access to storage: the CPU usually has direct access to main storage, whereas all access to secondary storage is
via main storage only.
The key register provides a mask for choosing a particular field or key in the argument word. The entire
argument is compared with each memory word if the key register contains all 1's. Otherwise, only those bits in
the argument that have 1's in their corresponding position of the key register are compared. Thus the key
provides a mask or identifying piece of information, which specifies how the reference to memory is made. To
illustrate with a numerical example, suppose that the argument register A and the key register K have the bit
configuration shown below. Only the three leftmost bits of A are compared with memory words because K has
1's in these positions.
A 101 111100
K 111 000000
Word 1 100 111100 no match
Word 2 101 000001 match
Word 2 matches the unmasked argument field because the three leftmost bits of the argument and the word
are equal.
Each cell is denoted by the letter C with two subscripts, where the first subscript gives
the word number and the second specifies the bit position in the
word. The internal organization of a typical cell Cij is shown in
Figure. It consists of a flip-flop storage element Fij and the
circuits for reading, writing, and matching the cell. The input bit
is transferred into the storage cell during a write operation. The
bit stored is read out during a read operation. The match logic
compares the content of the storage cell with the corresponding
unmasked bit of the argument and provides an output for the
decision logic that sets the bit in Mi.
Match logic: The match logic for each word can be derived from
the comparison algorithm for two binary numbers. First, we
neglect the key bits and compare the argument in A with the bits
stored in the cells of the words. Word i is equal to the argument
in A if Aj = Fij for j = 1, 2, . . . , n. Two bits are equal if they are both 1 or both 0. The equality of two bits can be
expressed logically by the Boolean function
xj = Aj Fij + Aj' Fij'
where xj = 1 if the pair of bits in position j are equal; otherwise, xj = 0.
For a word i to be equal to the argument in A we must have all xj variables equal to 1. This is the condition for
setting the corresponding match bit Mi to 1. The Boolean function for this condition is
Mi = x1 x2 x3 ... xn
and constitutes the AND operation of all pairs of matched bits in a word.
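The masked-match logic above can be sketched in Python: a word matches when every bit position holding a 1 in the key register K agrees between the argument A and the stored word (bit equality being the XNOR xj = Aj Fij + Aj' Fij').

```python
# Masked associative-memory match: only positions where the key K has a 1
# take part in the comparison; the others are "don't care".

def matches(A, K, word):
    """A, K and word are equal-length bit strings, MSB first."""
    return all(a == w for a, k, w in zip(A, K, word) if k == "1")

A = "101111100"
K = "111000000"   # only the three leftmost bits are compared
print(matches(A, K, "100111100"))  # word 1: False (no match)
print(matches(A, K, "101000001"))  # word 2: True (match)
```

This reproduces the numerical example above: word 2 matches because its three leftmost bits equal those of the argument.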
Network Partitioning
The concept of virtual networks leads to the partitioning of a given physical network into logical subnetworks
for multicast communications. The idea is depicted in the figure.
2D Memory Organization:
In 2D memory organization, memory cells are organized as an array of words. Any word can be accessed
randomly.
For an N-bit address there will be M = 2^N words. Let there be B bits per word.
In PC-286 word length = 16 bit, so 64KB memory holds 32K words.
An address decoder is used to decode the address in the N-bit MAR. The address decoder selects one out of 2^N
word lines. In semiconductor memory, there will be B memory elements connected to each word line.
The address decoder thus selects the word lines to select the appropriate word by applying a voltage to it just
before the read or write operation.
To each memory element of a particular column is connected two bit wires, one to sense/write 0 bit and the
other to sense/write 1 bit. These wires are connected to sense amplifiers that sense the voltage on the proper
bit wires to sense a 0 or 1 during a read operation. After sensing the voltage on proper bit wires, the sense
amplifier stores 0 or 1 in the MBR. During a write operation, the data to be written is stored in MBR. The
sense amplifier forces appropriate voltage through proper bit wires reading the data from the MBR. This
forces the corresponding memory elements to store the required bit.
In the 2D organization the number of word lines equals the number of words in the memory, i.e., if there are M
words then M word lines are needed, and the address decoder is correspondingly complex.
2.5D Memory Organization:
Another organization, called the 2.5D organization, uses only √M word lines, allows the use of simple decoders and
also enables memories to be constructed modularly. Consider M words of B bits each. Then
there will be B bit planes, each bit plane containing M memory elements. Therefore, to
implement this memory, B memory chips of size M x 1 are required. Thus the bits of a word are spread over
a number of chips; normally one bit of a word is stored in each chip.
The word address is split into a row address and a column address. A row line and a column line connect each
memory element on each bit plane. To select a word, the most significant bits of the word address are entered
in the row register and the least significant bits are entered in the column register. The voltage of the selected row
is lowered. The column decoder selects only the bits in the column indicated by the column register.
Thus one bit corresponding to the word is selected from each plane.
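The row/column address split can be sketched as follows; the 1K-word (32 x 32) geometry is an illustrative assumption.

```python
# 2.5D address split: the N-bit word address divides into a row part
# (most significant bits) and a column part (least significant bits);
# the same (row, column) pair selects one bit on every bit plane.

def split_address(addr, n_bits, col_bits):
    assert addr < (1 << n_bits)            # address must fit in N bits
    row = addr >> col_bits                 # most significant bits
    col = addr & ((1 << col_bits) - 1)     # least significant bits
    return row, col

# A 1K-word memory (N = 10) arranged as 32 rows x 32 columns:
print(split_address(0b1100100101, 10, 5))  # (25, 5): row 25, column 5
```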
Problem 2: The execution times (in sec) of four programs on three computers are given below
Execution Time
Program Computer A Computer B Computer C
P1 1 10 20
P2 1000 100 40
P3 500 1000 50
P4 100 800 100
Assume that 10^8 instructions were executed in each of the four programs. Calculate the MIPS rating of each program.
Draw a clear conclusion regarding the relative performance of the 3 computers.
Answer: Here T is given in the chart, Ic = 10^8, and MIPS = Ic / (T x 10^6)
In case of computer A
For program 1, MIPS rate = 10^8 / (1 x 10^6) = 100
For program 2, MIPS rate = 10^8 / (10^3 x 10^6) = 0.1
For program 3, MIPS rate = 10^8 / (500 x 10^6) = 0.2
For program 4, MIPS rate = 10^8 / (10^2 x 10^6) = 1
In case of computer B
For program 1, MIPS rate = 10^8 / (10 x 10^6) = 10
For program 2, MIPS rate = 10^8 / (10^2 x 10^6) = 1
For program 3, MIPS rate = 10^8 / (10^3 x 10^6) = 0.1
For program 4, MIPS rate = 10^8 / (800 x 10^6) = 0.125
In case of computer C
For program 1, MIPS rate = 10^8 / (20 x 10^6) = 5
For program 2, MIPS rate = 10^8 / (40 x 10^6) = 2.5
For program 3, MIPS rate = 10^8 / (50 x 10^6) = 2
For program 4, MIPS rate = 10^8 / (100 x 10^6) = 1
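The ratings above can be recomputed in a few lines, which also makes the relative-performance comparison easy to read off:

```python
# MIPS = Ic / (T x 10^6), with Ic = 10^8 instructions and T the execution
# time in seconds, taken from the table in Problem 2.

Ic = 10**8
times = {
    "A": [1, 1000, 500, 100],
    "B": [10, 100, 1000, 800],
    "C": [20, 40, 50, 100],
}
mips = {m: [Ic / (t * 10**6) for t in ts] for m, ts in times.items()}
print(mips["A"])  # [100.0, 0.1, 0.2, 1.0]
print(mips["C"])  # [5.0, 2.5, 2.0, 1.0]
```

Note how strongly the ranking depends on the program chosen: A is fastest on P1 but slowest on P2 and P3, while C is the most consistent performer across the mix.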
Problem 3: Consider the execution of an object code with 200,000 instructions on a 40-MHz processor. The program
consists of four major types of instructions. The instruction mix and the number of cycles (CPI) needed for each
instruction type are given below
Instruction type CPI Instruction Mix
Arithmetic & logic 1 60%
Load/store with cache hit 2 18%
Branch 4 12%
Memory reference with cache miss 8 10%
(a) Calculate the average CPI when the program is executed on a uniprocessor with the above trace result
(b) Calculate the corresponding MIPS rate based on the CPI obtained in part (a)
Answer:
(a) Here Ic = 200,000
and C = (200000 x 60 x 1 / 100) + (200000 x 18 x 2 / 100) + (200000 x 12 x 4 / 100) + (200000 x 10 x 8 / 100) = 448000
CPI = C / Ic = 448000 / 200000 = 2.24 cycles/instruction
(b) MIPS = f / (CPI x 10^6), and here f = 40 x 10^6 Hz. So MIPS = (40 x 10^6) / (2.24 x 10^6) = 17.86
Alternatively, taking Ic = 100: C = total cycles = 60 x 1 + 18 x 2 + 12 x 4 + 10 x 8 = 224
So average CPI = 224 / 100 = 2.24 cycles/instruction
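The weighted-average CPI calculation can be checked directly:

```python
# Average CPI = weighted sum of per-type CPI by instruction-mix fraction,
# then MIPS = f / (CPI x 10^6) at f = 40 MHz.

mix = [(1, 0.60), (2, 0.18), (4, 0.12), (8, 0.10)]  # (CPI, fraction)
Ic = 200_000
cycles = sum(Ic * frac * cpi for cpi, frac in mix)
avg_cpi = cycles / Ic
mips_rate = (40 * 10**6) / (avg_cpi * 10**6)
print(avg_cpi, round(mips_rate, 2))  # 2.24 17.86
```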
Problem 4: A workstation uses a 15-MHz processor with a claimed 10-MIPS rating to execute a given program mix.
Assume one cycle delay for each memory access.
(a) What is the effective CPI of this computer?
(b) Suppose the processor is upgraded with a 30-MHz clock. However, the speed of the memory subsystem
remains unchanged; consequently, two clock cycles are needed per memory access. If 30% of the instructions require
one memory access and another 5% require two memory accesses per instruction, what is the performance of the
upgraded processor with a compatible instruction set and equal instruction counts in the given program mix?
Answer:
(a) f = 15 x 10^6 Hz, MIPS = 10
MIPS = f / (CPI x 10^6), so CPI = f / (MIPS x 10^6) = (15 x 10^6) / (10 x 10^6) = 1.5
(b) Let Ic = instruction count. At 15 MHz each memory access adds 1 cycle of delay; at 30 MHz it adds 2.
Performance ratio = T15 / T30
= [Ic x (0.30 x (CPI + 1) + 0.05 x (CPI + 2) + 0.65 x CPI) x (1/15)] / [Ic x (0.30 x (CPI + 2) + 0.05 x (CPI + 4) + 0.65 x CPI) x (1/30)]
= (1.9 / 15) / (2.3 / 30) = 1.65, i.e. the upgraded processor is about 1.65 times faster on this mix.
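The ratio can be evaluated numerically; Ic cancels, so only the average cycle counts and the clock rates (in MHz) matter:

```python
# T15 / T30 for Problem 4: average cycles per instruction at each clock,
# with memory-access penalties of 1 cycle (15 MHz) and 2 cycles (30 MHz).

CPI = 1.5
cycles_15 = 0.30 * (CPI + 1) + 0.05 * (CPI + 2) + 0.65 * CPI  # 1 extra cycle/access
cycles_30 = 0.30 * (CPI + 2) + 0.05 * (CPI + 4) + 0.65 * CPI  # 2 extra cycles/access
ratio = (cycles_15 / 15) / (cycles_30 / 30)   # Ic cancels out
print(round(cycles_15, 2), round(cycles_30, 2), round(ratio, 2))  # 1.9 2.3 1.65
```

Doubling the clock does not double performance here: the unchanged memory subsystem raises the effective CPI from 1.9 to 2.3, limiting the speedup to about 1.65x.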
The DMA controller has three registers: an address register, a word count register, and a control register. The
address register contains an address to specify the desired location in memory. The word count register holds
the number of words to be transferred. This register is decremented, and the address register incremented, by
one after each word transfer, and the count is internally tested for zero. The control register specifies the mode of transfer.
The CPU initializes the DMA by sending the following information through the data bus:
1. The starting address of the memory block where data are available (for read) or are to be stored (for write)
2. The word count, which is the number of words in the memory block
3. Control to specify the mode of transfer such as read or write
4. A control to start the DMA transfer
The CPU communicates with the DMA
through the address and data buses as with
any interface unit. The DMA has its own
address, which activates the DS (DMA
select) and RS (Register select) lines. The
CPU initializes the DMA through the data
bus. Once the DMA receives the start
control command, it can start the transfer
between the peripheral device and the
memory.
When the peripheral device sends a DMA
request, the DMA controller activates the
BR line, informing the CPU to relinquish
the buses. The CPU responds with its BC
line, informing the DMA that its buses are
disabled. The DMA then puts the current
value of its address register into the address
bus, initiates the RD or WR signal, and
sends a DMA acknowledge to the
peripheral device. The RD and WR lines in
the DMA controller are bi-directional. The
direction of transfer depends on the status
of the BC line. When BC = 0, the RD and
WR are input lines allowing the CPU to communicate with the internal DMA registers. When BC = 1, the RD
and WR are output lines from the DMA controller to the random-access memory to specify the read or write
operation for the data.
When the peripheral device receives a DMA acknowledge, it puts a word in the data bus (for write) or receives
a word from the data bus (for read). Thus the DMA controls the read or write operations and supplies the
address for the memory. The peripheral unit can then communicate with memory through the data bus for
direct transfer between the two units while the CPU is momentarily disabled.
For each word that is transferred, the DMA increments its address register and decrements its word count
register. If the word count does not reach zero, the DMA checks the request line coming from the peripheral.
For a high-speed device, the line will be active as soon as the previous transfer is completed. A second transfer
is then initiated, and the process continues until the entire block is transferred. If the peripheral speed is
slower, the DMA disables the bus request line so that the CPU can continue to execute its program. When the
peripheral requests a transfer, the DMA requests the buses again.
If the word-count register reaches zero, the DMA stops any further transfer and removes its bus request. It
also informs the CPU of the termination by means of an interrupt. When the CPU responds to the interrupt, it
reads the content of the word count register. The zero value of this register indicates that all the words were
transferred successfully. The CPU can read this register at any time to check the number of words already
transferred. A DMA controller may have more than one channel. In this case, each channel has a request and
acknowledge pair of control signals which are connected to separate peripheral devices. Each channel also has
its own address register and word count register within the DMA controller. A priority among the channels
may be established so that channels with high priority are serviced before channels with lower priority.
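The per-word bookkeeping described above can be sketched as a small simulation; the register names follow the text, but the transfer loop itself is an illustrative model, not a description of any specific controller.

```python
# DMA block-transfer bookkeeping: each transferred word increments the
# address register and decrements the word-count register; the transfer
# stops when the count reaches zero.

def dma_transfer(start_addr, word_count, read_word):
    """Simulate a block read; return (words, final_addr, final_count)."""
    addr, count, words = start_addr, word_count, []
    while count != 0:
        words.append(read_word(addr))  # one stolen bus cycle per word
        addr += 1                      # address register incremented
        count -= 1                     # word-count register decremented
    # count == 0: the DMA would now drop its bus request and interrupt the CPU
    return words, addr, count

memory = {100: "w0", 101: "w1", 102: "w2"}
print(dma_transfer(100, 3, memory.__getitem__))  # (['w0', 'w1', 'w2'], 103, 0)
```

The zero final count is exactly what the CPU's interrupt handler reads back to confirm that all words were transferred.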
[Figure: DMA and interrupt break points during an instruction cycle - the DMA REQ/DMA ACK and INTR break points are shown against the sequence of fetch-instruction and fetch-data processor cycles, together with the DMA controller's address lines, data lines, data register, address register, word-count register and control logic. (c) I/O bus configuration.]
Interrupt-driven I/O, though more efficient than simple programmed I/O, still requires the active
intervention of the CPU to transfer data between memory and an I/O module, and any data transfer must
traverse a path through the CPU. Thus, both these forms of I/O suffer from two inherent drawbacks:
1. The I/O transfer rate is limited by the speed with which the CPU can test and service a device.
2. The CPU is tied up in managing an I/O transfer; a number of instructions must be executed for each
I/O transfer.
DMA Function:
The DMA module is capable of mimicking the CPU and, indeed, of taking over control of the system
from the CPU. The technique works as follows:
When the CPU wishes to read or write a block of data, it issues a command to the DMA module, by
sending to the DMA module the following information:
1. Whether a read or write is requested.
2. The address of the I/O device involved.
3. The starting location in memory to read from or write to.
4. The number of words to be read or written.
Thus the CPU has delegated this I/O operation to the DMA module and continues with other work.
The DMA module transfers the entire block of data, one word at a time, directly to or from memory
without going through the CPU.
When the transfer is complete, the DMA module sends an interrupt signal to the CPU.
The DMA module needs to take control of the bus in order to transfer data to and from memory. Thus
the DMA module must use the bus only when the CPU does not need it, or it must force the CPU to
temporarily suspend operation. The latter technique is more common and is referred to as cycle stealing, since
the DMA module in effect steals a bus cycle.
The figure shows where in the instruction cycle the CPU may be suspended. In each case, the CPU is
suspended just before it needs to use the bus. The DMA module then transfers one word and returns controls
to the CPU. This is not an interrupt; the CPU does not save a context and do something else. Rather the CPU
pauses for one bus cycle. The overall effect is to cause the CPU to execute more slowly. For a multiple-word I/O
transfer, DMA is far more efficient than interrupt-driven or programmed I/O.
Single-bus, detached DMA: Here all modules share the same system bus. The DMA module, acting as
a surrogate CPU, uses programmed I/O to exchange data between memory and an I/O module
through the DMA module. This configuration, while it may be inexpensive, is clearly inefficient. As
with CPU-controlled programmed I/O, each transfer of a word consumes two bus cycles.
Single-bus, integrated DMA-I/O: The number of required bus cycles can be cut substantially by
integrating the DMA and I/O functions. This means that there will be a path between the DMA
module and one or more I/O modules that does not include the system bus. The DMA logic may
actually be a part of an I/O module, or it may be a separate module that controls one or more I/O
modules.
I/O bus: the above configuration can be improved by connecting I/O modules to the DMA module
using an I/O bus. This reduces the number of I/O interfaces in the DMA module to one and provides
for an easily expandable configuration.
In the last two configurations, the system bus that the DMA module shares with CPU and memory is used by
the DMA module only to exchange data with memory. The exchange of data between the DMA and I/O
modules takes place off the system bus.
Example:
(1) X=(2,5,8,7) and Y=(9,3,6,4); after the operation B = X > Y is executed, the Boolean vector B=(0,1,1,1) is
generated.
(2) X=(1,2,3,4,5,6,7,8) and B=(1,0,1,0,1,0,1,0); after the compress operation Y = X(B), vector Y=(1,3,5,7).
(3) X=(1,2,4,8), Y=(3,5,6,7) and B=(1,1,0,1,0,0,0,1); after the merge operation Z = X,Y(B), vector
Z=(1,2,3,4,5,6,7,8).
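The compress and merge operations above can be sketched in Python and checked against examples (2) and (3):

```python
# Vector compress and merge under a Boolean control vector B.

def compress(X, B):
    """Keep X[i] at the positions where B has a 1."""
    return [x for x, b in zip(X, B) if b]

def merge(X, Y, B):
    """Scan B left to right: a 1 takes the next element of X, a 0 of Y."""
    xs, ys, Z = iter(X), iter(Y), []
    for b in B:
        Z.append(next(xs) if b else next(ys))
    return Z

print(compress([1, 2, 3, 4, 5, 6, 7, 8], [1, 0, 1, 0, 1, 0, 1, 0]))  # [1, 3, 5, 7]
print(merge([1, 2, 4, 8], [3, 5, 6, 7], [1, 1, 0, 1, 0, 0, 0, 1]))   # [1, 2, 3, 4, 5, 6, 7, 8]
```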
[Figure: the four types of vector instructions - (a) f1: V1 -> V2 (vector-to-vector), (b) f2: V1 -> S (vector reduction to a scalar), (c) f3: V1 x V2 -> V3 (binary vector operation), (d) f4: S x V1 -> V2 (scalar-vector operation).]
C = < N, F, I, M >
Where,
N The number of PEs in the system. Illiac-IV has N=64, BSP has N=16.
F A set of data-routing functions provided by the interconnection network or by alignment network.
I The set of machine instructions for scalar-vector, data-routing, and network-manipulation operations.
M The set of masking schemes, which partitions the set of PEs into the two disjoint subsets of enabled PEs
and disabled PEs.
Inter-PE Communications:
There are fundamental decisions to be made in determining the appropriate architecture of an interconnection network for an
SIMD machine. The decisions are made among the operation modes, control strategies, switching methodologies
and network topologies.
Operation mode: two types of communications can be identified: synchronous and asynchronous.
Synchronous communication is needed for establishing communication paths synchronously for either a data
manipulating function or for a data instruction broadcast. Asynchronous communication is needed for multiprocessing
in which connection requests are issued dynamically.
Control strategy: a typical interconnection network consists of a number of switching elements and
interconnecting links. Interconnection functions are realized by properly setting control of the switching elements.
The control-setting function can be managed by a centralized controller or by the individual switching elements. The
latter strategy is called distributed control and the first strategy corresponds to centralized control.
Switching methodology: the two major switching methodologies are circuit switching and packet
switching. In circuit switching, a physical path is actually established between a source and a destination. In
packet switching, data is put in a packet and routed through the interconnection network without establishing
a physical connection path. In general, circuit switching is much more suitable for bulk data transmission, while
packet switching is more efficient for many short data messages.
Network topology: a network can be depicted by a graph in which nodes represent switching points and
the edges represent communication links. The topologies tend to be regular and can be grouped into two
categories: static and dynamic. In a static topology, links between two processors are passive, and the dedicated
buses cannot be reconfigured for direct connections to other processors. In the dynamic category, on the other
hand, links can be reconfigured by setting the network's active switching elements.
The space of the interconnection networks can be represented by the Cartesian product of the above four sets of
design features: {operation mode} X {control strategy} X {switching methodology} X {network topology}.
Static Vs. Dynamic Networks:
The topological structure of an SIMD array processor is mainly characterized by the data-routing network
used in interconnecting the processing elements. Formally, such an inter-PE communication network can be
specified by a set of data-routing functions.
The SIMD interconnection networks are classified into the following two categories based on network
topologies: static and dynamic networks.
In static networks, topologies can be classified according to the dimensions required for layout, e.g. the
linear array is 1D; the star, ring, tree, mesh and systolic array are 2D; and the completely connected network,
chordal ring, 3-cube and 3-cube-connected-cycle are all 3D.
In dynamic networks, two classes of networks can be described: single-stage versus multistage.
Single-stage networks: a single-stage network is a switching network with N input selectors (IS) and N output
selectors (OS). Each IS is essentially a 1-to-D demultiplexer and each OS is an M-to-1 multiplexer, where
1 <= D <= N and 1 <= M <= N. The crossbar-switching network is a single-stage network with D = M = N. To establish a
desired connecting path, different path control signals will be applied to all IS and OS selectors. The single-
stage network is also called a recirculating network.
Multistage networks: many stages of interconnected switches form a multistage SIMD network. Multistage
networks are described by three characterizing features: the switch box, the network topology, and the control
structure. There are four states of a switch box: straight, exchange, upper broadcast, and lower broadcast. A
multistage network is capable of connecting an arbitrary input terminal to an arbitrary output terminal. Multistage
networks may be one-sided (called full switches, with input/output ports on the same side) or two-sided (having
separate sides for the input and output sections). Two-sided multistage networks can be divided into three classes:
Blocking networks: simultaneous connections of some input-output pairs may conflict in the use of switches
or links. Examples are the data manipulator, baseline, omega and n-cube networks.
Rearrangeable networks can perform all possible connections between inputs and outputs by
rearranging their existing connections. An example is the Benes network.
Nonblocking networks can perform all possible connections between inputs and outputs without
blocking. An example is the Clos network, which can perform one-to-one and one-to-many connections.
Cube Interconnection Networks:
The cube network can be implemented either as a recirculating network or as a multistage network for
SIMD machines. A three-dimensional cube is shown. Vertical lines connect vertices (PEs) whose addresses
differ in the most significant bit position. Vertices at both ends of the diagonal lines differ in the middle bit
position. Horizontal lines differ in the least significant bit position. This unit-cube concept can be extended to
an n-dimensional unit space, called an n cube, with n bits per vertex.
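The cube routing functions can be sketched directly: routing function Ci complements bit i of the PE address, so each PE in an n-cube connects to the n PEs whose addresses differ from its own in exactly one bit position.

```python
# Neighbors of a PE in an n-cube: flip each of the n address bits in turn.

def cube_neighbors(pe, n):
    return sorted(pe ^ (1 << i) for i in range(n))

print(cube_neighbors(0b000, 3))  # [1, 2, 4]: the three PEs adjacent to PE 0
print(cube_neighbors(0b101, 3))  # [1, 4, 7]: the three PEs adjacent to PE 5
```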
An O(n^3) algorithm for SISD matrix multiplication:
For i = 1 to n Do
  For j = 1 to n Do
    Cij = 0 (initializing)
    For k = 1 to n Do
      Cij = Cij + Aik * Bkj (scalar additive multiply)
    End of k loop
  End of j loop
End of i loop

An O(n^2) algorithm for SIMD matrix multiplication:
For i = 1 to n Do
  Par For k = 1 to n Do
    Cik = 0 (vector load)
  For j = 1 to n Do
    Par For k = 1 to n Do
      Cik = Cik + Aij * Bjk (vector multiply)
  End of j loop
End of i loop
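Both algorithms compute the same product; in the sketch below the "Par For k" of the SIMD version is modeled by an ordinary loop, since sequential Python has no true vector step, but the loop structure mirrors the pseudocode.

```python
# SISD (i, j, k order) vs SIMD-style (i, j, vector-over-k) matrix multiply.

def matmul_sisd(A, B):
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

def matmul_simd(A, B):
    n = len(A)
    C = [[0] * n for _ in range(n)]        # "Par For k": vector load C[i][k] = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):             # conceptually one vector step over k
                C[i][k] += A[i][j] * B[j][k]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul_sisd(A, B))                   # [[19, 22], [43, 50]]
print(matmul_sisd(A, B) == matmul_simd(A, B))  # True
```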
[Figure: summation of vector components by recursive doubling over eight PEs - the columns show the operand registers A0 ... A7, the index ranges accumulated at each routing step, and the resulting partial sums S(0) ... S(7).]
[Figure: Architectural configurations of SIMD array processors - (a) Configuration I (Illiac IV): the CU, with its own memory holding data and instructions, drives the PEs over the data and control buses, with inter-PE data routed through an interconnection network; (b) Configuration II (BSP): processing elements PE0 ... PE N-1 access memory modules M0 ... M P-1 through an alignment network.]
Odd parity (total no. of 1s in the message and the parity bit is odd)
Let us use even parity logic in the following example:
[Example figures: in the first case the received and recomputed (recognized) parity bits differ, so the error is
detected; in the second case two bits are in error, the parity bit is unchanged, and the error cannot be detected.]
So the main limitation of the parity bit is that it cannot detect an error when errors occur in an even number of places. So
to solve this problem we introduce SECDED code.
In this method more than one parity bit are generated from distinct group of message/data bits. Now few
points are:
1. How many parity bits are to be introduced?
2. How will be the message bits involvements?
3. How they will be arranged?
4. How does SEC work?
5. How does DED work?
If there are N data bits and P parity bits, then the governing relation is
N + P + 1 <= 2^P (equivalently, N <= 2^P - P - 1; for N = 8 data bits, P = 4 parity bits suffice)
Example:
Transmitted data (bit 8 ... bit 1): 1 0 1 1 0 1 0 1
Received data (bit 8 ... bit 1):    1 0 0 1 0 1 0 1   (bit 6 corrupted)
Parities recomputed from the received data:
P4' = {1,0,0,1} = 0
P3' = {0,1,0,1} = 0
P2' = {1,1,0,0,0} = 0
P1' = {1,0,0,1,0} = 0
  P4 P3 P2 P1     => 1 0 1 0
+ P4' P3' P2' P1' => 0 0 0 0
Syndrome => 1 0 1 0 = 10 (decimal). So bit place no. 10 of the code word has become corrupted; with the
parity bits occupying positions 1, 2, 4 and 8, position 10 holds data bit 6, the flipped bit.
Class Work:
In the DED operation, double-bit errors are detected but cannot be corrected. In this DED
scheme an extra general parity bit (GP) is attached to the old scheme, and the inference is drawn as
shown below.
Original message:
bits 8 ... 1 = 1101 1011, GP = 0
P4 = {1,0,1,1} = 1
P3 = {1,0,1,1} = 1
P2 = {1,0,1,0,1} = 1
P1 = {1,1,1,1,1} = 1
Received message:
bits 8 ... 1 = 1110 1011, GP = 0
P4' = {0,1,1,1} = 1
P3' = {1,0,1,1} = 1
P2' = {1,0,1,1,1} = 0
P1' = {1,1,1,0,1} = 0
  P4 P3 P2 P1     = 1 1 1 1
+ P4' P3' P2' P1' = 1 1 0 0
Syndrome = 0 0 1 1 = 3 (decimal)
So the syndrome is pointing to location 3, but errors have actually occurred at locations 5 and 6. The general
parity bit (GP) detects no error, because errors have occurred in an even number of places. Together these prove
that a double-bit error has occurred, so DED succeeds.
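One consistent reconstruction of the scheme used in the worked examples places data bits d1..d8 at Hamming positions 3, 5, 6, 7, 9, 10, 11, 12 and parity bits P1, P2, P4, P8 at positions 1, 2, 4, 8 (an assumption, but it reproduces the parity groups and syndromes computed above):

```python
# SEC-DED sketch for 8 data bits: Hamming positions plus an overall parity GP.
DATA_POS = [3, 5, 6, 7, 9, 10, 11, 12]    # code-word position of d1..d8

def encode(data):                          # data = [d1..d8]
    word = [0] * 13                        # word[1..12]; index 0 unused
    for bit, pos in zip(data, DATA_POS):
        word[pos] = bit
    for p in (1, 2, 4, 8):                 # even parity over covered positions
        word[p] = sum(word[i] for i in range(1, 13) if i & p and i != p) % 2
    gp = sum(word[1:]) % 2                 # general (overall) parity bit
    return word, gp

def check(word, gp):
    syndrome = 0
    for p in (1, 2, 4, 8):                 # recompute each parity group
        if sum(word[i] for i in range(1, 13) if i & p) % 2:
            syndrome += p
    gp_changed = (sum(word[1:]) % 2) != gp
    if syndrome == 0:
        return "no error"
    return f"single error at position {syndrome}" if gp_changed else "double error"

data = [1, 0, 1, 0, 1, 1, 0, 1]            # d1..d8 of the message 1 0 1 1 0 1 0 1
word, gp = encode(data)

bad = word[:]; bad[10] ^= 1                # flip position 10 (data bit d6)
print(check(bad, gp))                      # single error at position 10

bad2 = word[:]; bad2[9] ^= 1; bad2[10] ^= 1  # flip data bits d5 and d6
print(check(bad2, gp))                     # double error
```

The two cases reproduce the worked examples: a single flip of data bit 6 gives syndrome 10 with GP changed, and the double flip of bits 5 and 6 gives a nonzero syndrome with GP unchanged, signalling an uncorrectable double error.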
However, to have simplicity in decoding, the operation codes may be chosen as in the right-hand side of the diagram.
Such a scheme is useful only if the overhead associated with the comparatively complex decoding
structure is outweighed by the saving in the memory space needed to store the instructions of average programs.
[Figure: carry-lookahead adders - (top) n 1-bit adders producing sum bits z0 ... zn-1, with a carry-lookahead generator forming the carries c0 ... cn-1 from cin and the per-bit p, g signals; (bottom) a 16-bit adder built from four 4-bit adders on operand slices x3:x0 ... x15:x12, with a 4-bit carry-lookahead generator combining each block's group p, g signals and cin to form the block carries c1, c2, c3 and cout.]
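The lookahead principle behind the figure can be sketched in a few lines: with generate gi = xi AND yi and propagate pi = xi XOR yi, every carry follows from ci+1 = gi OR (pi AND ci), so the carries need not ripple bit by bit. The bit-serial loop below is only a behavioral model of that recurrence, not the parallel hardware itself.

```python
# Carry-lookahead recurrence for an n-bit addition.

def cla_add(x, y, n=16, cin=0):
    g = [((x >> i) & (y >> i)) & 1 for i in range(n)]   # generate bits
    p = [((x >> i) ^ (y >> i)) & 1 for i in range(n)]   # propagate bits
    c = [cin]
    for i in range(n):
        c.append(g[i] | (p[i] & c[i]))                  # c_{i+1} = g_i + p_i.c_i
    s = sum((p[i] ^ c[i]) << i for i in range(n))       # sum bit = p_i XOR c_i
    return s, c[n]                                      # (sum, carry-out)

print(cla_add(0xABCD, 0x1234))  # equals ((0xABCD + 0x1234) & 0xFFFF, 0)
```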
Problem 4: Suppose that data on the tape is organized into blocks each containing 32 Kbytes. A gap of 0.4 inch
separates the blocks from each other. The density of recording is 6250 bits/inch. How many bytes may be
stored on the tape reel of 2400 ft.?
Answer:
Block size = 32 Kbytes; recording density = 6250 bytes/inch (bpi); block length = (32 x 10^3)/6250 = 5.12 in
Block separation = 0.4 in; tape length = 2400 ft
Hence no. of blocks on tape = (2400 ft x 12)/(5.12 + 0.4) in = 5217 blocks = 166.944 Mbytes
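The calculation can be checked directly (note it treats the 6250 bpi density as bytes per inch, as the worked answer does):

```python
# Tape capacity: each block occupies its recorded length plus the gap.

block_bytes = 32_000                 # 32 Kbytes (decimal, as in the text)
density = 6250                       # bytes recorded per inch
gap = 0.4                            # inter-block gap in inches
tape_inches = 2400 * 12              # 2400 ft of tape

block_len = block_bytes / density    # 5.12 in
blocks = int(tape_inches / (block_len + gap))
print(blocks, blocks * block_bytes / 1e6)  # 5217 blocks, 166.944 Mbytes
```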
**********************************************************************************************
Problem 5: A disk pack has 19 surfaces. Storage area on each surface has an inner diameter of 22 cm and
outer diameter of 33 cm. Maximum storage density on any track is 2000 bit/cm and minimum spacing between
tracks is 0.25 mm. (a) What is the storage capacity of the pack?
(b) What is the data transfer rate in bytes per sec. at a rotational speed of 3600 RPM.
(c) Using two 16-bit words, suggest a suitable scheme for specifying disk address.
(d) The main memory of a computer has 32-bit word length and 500 nsec. cycle time. Assuming that disk
transfers data to/from the main memory on a cycle stealing basis, evaluate the percentage of memory cycles
stolen during data transfer period?
Answer:
No. of surfaces = 19; inner track diameter = 22 cm; outer track diameter = 33 cm
Track width (total) = (33 - 22)/2 = 5.5 cm; track separation = 0.25 mm
No. of tracks/surface = (5.5 x 10) / 0.25 = 220; minimum track circumference = 22π cm
Maximum track storage density = 2000 bits/cm (on the innermost track)
Data storage capacity/track = 22π x 2000 = 138.23 Kbits
Disk speed = 3600 rpm; rotation time = 1/3600 minute = 16.67 msec
(a) Storage capacity = 19 x 220 x 138.23 Kbits = 577.8 Mbits = 72.225 Mbytes (with 8 bits/byte)
(b) Data transfer rate = 138.23 Kbits / 16.67 msec = 8.2938 Mbits/sec
This is the peak data transfer rate, excluding seek time and rotational latency.
(c) A possible disk addressing scheme could be with the following fields:
Surface (head) no. = 5-bit [0 to 18]
Track (cylinder) no. = 8-bit [0 to 219]
(assuming 128 sectors per track, each sector storing 1 Kbit)
Sector no. = 7-bit [0 to 127]
Thus the two 16-bit words can hold the 20-bit address (surface, track, sector).
The remaining 12 bits, marked x, may be used by the designer to store some useful information.
(d) CPU cycle time = 500 nsec
Data transfer rate = 8.2938 Mbit/sec
Time to read a 32-bit word = 32 bits / 8.2938 Mbit/sec = 3858.3 nsec ≈ 8 CPU cycles
Ideally, 1 out of every 8 CPU cycles should be stolen Percentage disk service time = 100/8 = 12.5%
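Parts (a), (b) and (d) can be cross-checked with a short Python sketch (all figures taken from the problem statement; the innermost-track density assumption is as above):

```python
import math

# Disk-pack capacity, transfer rate and cycle stealing (Problem 5 figures).
SURFACES = 19
INNER_D_CM, OUTER_D_CM = 22, 33
DENSITY = 2000          # bits/cm, assumed to hold on the innermost track
TRACK_PITCH_MM = 0.25
RPM = 3600

tracks = int(((OUTER_D_CM - INNER_D_CM) / 2 * 10) / TRACK_PITCH_MM)  # 220 per surface
bits_per_track = math.pi * INNER_D_CM * DENSITY                      # ~138.23 Kbits
capacity_mbits = SURFACES * tracks * bits_per_track / 10**6          # ~577.8 Mbits
rot_ms = 60_000 / RPM                                                # 16.67 msec
rate_mbps = bits_per_track / (rot_ms * 1000)                         # ~8.29 Mbit/sec
cycles_per_word = round(32 / rate_mbps * 1000 / 500)  # 32-bit word, 500 ns cycle
print(tracks, capacity_mbits, rate_mbps, 100 / cycles_per_word)
```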
**********************************************************************************************
Problem 6: A particular disk drive having a rotational speed of 3600 RPM has the capability for rotational
position sensing. It employs hard sectored disk pack with 128 sector marks on a track. At a particular instance,
a record on sector 1 is sought and on interrogation it is observed that the head is just entering sector 2.
Determine the duration for how long the control unit and IOP can be released prior to reading the record.
Assume a delay of 0.3 msec for sensing the rotational position.
Answer:
Rotation speed = 3600 rpm Rotation time = 16.67 msec
Position sense time = 0.3 msec
Sequence of actions after detecting the head on sector 2:
(i) Time to traverse 127 sectors = (16.67 x 127)/128 msec = 16.54 msec
(ii) Confirm position as sector 1 = 0.3 msec
Total delay = 16.84 msec
So control unit and IOP can be released for 16.84 msec.
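The same computation, as a minimal Python sketch (figures assumed from the problem statement):

```python
# Release time before reading sector 1 (Problem 6 figures).
RPM = 3600
SECTORS = 128
SENSE_MS = 0.3             # rotational position sense delay

rot_ms = 60_000 / RPM                  # 16.67 msec per revolution
# The head is entering sector 2 while sector 1 is sought:
# it must traverse the remaining 127 sectors.
travel_ms = rot_ms * 127 / SECTORS     # ~16.54 msec
release_ms = travel_ms + SENSE_MS      # ~16.84 msec
print(round(release_ms, 2))
```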
Problem 7: A disk drive having 3600 RPM employs 128 sectors per track. What is the average rotational
latency of the drive? The IOP to which the drive is attached requests a record in, say, sector x. On
interrogating the disk system, it is found that the head is on sector (x + 1). For how long can the IOP and
the disk controller be released before the R/W head starts reading sector x?
Answer:
Drive speed = 3600 rpm
Rotation time = 1 / 3600 min = 16.67 msec
Average rotational latency = 8.33 msec
Rotation from sector (x + 1) to sector (x)
= 1 rotation - 1 sector traversal time
= 16.67 msec x (1 - 1/128)
= 16.536 msec.
**********************************************************************************************
Problem 8: A disk drive has the following specifications: 4040 bpi recording density, 6448 x 10³ bit/sec data
transfer rate, 3600 RPM, innermost track diameter of 6.5 inch, total of 400 tracks with track density of 200 tpi.
Determine whether the recording density mentioned refers to the track on the innermost/outermost or central
region of the recording surface.
Answer:
Rotation speed = 3600 rpm
Rotation time = 1/3600 min = 16.67 msec
Data transfer rate = 6448 x 10³ bit/sec
Data/track = 6448 x 10³ bit/sec x 16.67 msec = 107.466 Kbits
For storage density of 4040 bits per in, track circumference = 107.466 Kbits / 4040 = 26.6 in
Track diameter = 26.6 in / π = 8.47 in (approx. 8.5 in)
Innermost track diameter = 6.5 in
Total no. of tracks = 400
Radial width of the tracks = 400 tracks / 200 tpi = 2 in
Outermost track diameter = 6.5 + 2 x 2 = 10.5 in
Hence the recording density refers to the central region of the disk recording surface.
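The location check can be sketched in Python (figures assumed from the problem statement):

```python
import math

# Where on the surface does the quoted density hold? (Problem 8 figures.)
RPM = 3600
RATE_BPS = 6448e3      # bit/sec
DENSITY_BPI = 4040     # bits/inch
INNER_D_IN = 6.5       # innermost track diameter, inches
TRACKS, TPI = 400, 200

bits_per_track = RATE_BPS * 60 / RPM               # ~107.47 Kbits per track
circumference = bits_per_track / DENSITY_BPI       # ~26.6 in
diameter = circumference / math.pi                 # ~8.47 in
outer_d = INNER_D_IN + 2 * TRACKS / TPI            # 10.5 in
print(diameter, (INNER_D_IN + outer_d) / 2)        # ~8.47 in vs centre at 8.5 in
```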
**********************************************************************************************
Problem 9: A floppy disk drive has the following specifications: 77 tracks, 26 sectors per track, 188 bytes/sector,
320 bytes each of preamble and postamble data per track, usable data storage per sector of 128 bytes, 360 RPM, 96
tracks per inch, 3200 bpi average recording density. Compute the unformatted capacity, formatted usable data
storage, data transfer rate, radial distance between innermost and outermost track, average diameter of a
track.
Answer:
For a floppy disk drive
No. of tracks = 77
Sector per track = 26
Sector size = 188 bytes
Usable sector capacity = 128 bytes
Track overhead = 320 bytes (preamble) + 320 bytes (postamble) = 640 bytes
Unformatted capacity = 77 tracks x [640 + (26 sectors x 188)] = 425.656 Kbytes
Formatted capacity = 77 tracks x 26 sectors x 128 = 256.256 Kbytes
Drive speed = 360 rpm
Rotation time = 1/360 min = 166.67 msec
Peak data transfer rate = (640 + 26 x 188)/166.67 msec = 33.168 Kbytes/sec = 265.344 Kbits/sec
Storage density = 3200 bpi
Average track circumference = ((640 + 26 x 188) x 8) bits / 3200 bpi = 13.82 inch
Average track diameter = 13.82 / π = 4.399 inch
At 96 tracks/in, radial width for 77 tracks = 77/96 in = 0.802 in
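All of the floppy figures above can be recomputed with a short Python sketch (problem figures assumed, with the 640-byte track overhead taken as 320 bytes each of preamble and postamble):

```python
import math

# Floppy-drive figures from Problem 9, recomputed.
TRACKS, SECTORS = 77, 26
SECTOR_BYTES, USABLE_BYTES = 188, 128
OVERHEAD_BYTES = 640        # 320-byte preamble + 320-byte postamble per track
RPM, TPI, BPI = 360, 96, 3200

track_bytes = OVERHEAD_BYTES + SECTORS * SECTOR_BYTES     # 5528 bytes/track
unformatted_kb = TRACKS * track_bytes / 1000              # 425.656 Kbytes
formatted_kb = TRACKS * SECTORS * USABLE_BYTES / 1000     # 256.256 Kbytes
rot_ms = 60_000 / RPM                                     # 166.67 msec
rate_kbps = track_bytes / rot_ms                          # ~33.17 Kbytes/sec
circumference = track_bytes * 8 / BPI                     # 13.82 in (average track)
diameter = circumference / math.pi                        # ~4.4 in
radial_width = TRACKS / TPI                               # ~0.80 in
print(unformatted_kb, formatted_kb, round(rate_kbps, 3))
```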
Printing speed: The printing speed in terms of lines per minute of a chain/band printer can be computed as
follows for a 132 print position printer. Let
ta = time for auto spacing of paper in millisecond,
tp = time for a print cycle in microsecond,
ts = synchronization time between a pair of sub-scans in microsecond,
C = number of characters in the set.
Time for one sub-scan including synchronization delay = 44tp + ts,
since there are 132/3 = 44 print cycles in a sub-scan.
Time for one print scan = 3(44tp + ts)
T = Time to print a line = {3C x (44tp + ts)} x 10⁻³ + ta msec
Speed in LPM (Lines Per Minute) = (60 x 10³) / T
For ta = 10 msec, tp = 4.5 microsecond, ts = 10 microsecond, C = 64,
T = 3 x 64 x (44 x 4.5 + 10) x 10⁻³ + 10 = 50 ms (approx.)
LPM = 1200.
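The formula above is easy to parametrize. A small Python helper (the function name is illustrative, not from the text) reproduces both the worked example here and Problem 1 below:

```python
# Chain/band printer speed for a 132-position printer
# (3 sub-scans of 44 print cycles each, per the derivation above).
def lines_per_minute(ta_ms, tp_us, ts_us, c):
    """ta in msec; tp, ts in microseconds; c = character-set size."""
    t_line_ms = (3 * c * (44 * tp_us + ts_us)) * 1e-3 + ta_ms
    return 60_000 / t_line_ms

print(round(lines_per_minute(10, 4.5, 10, 64)))  # ~1200 LPM (worked example above)
print(round(lines_per_minute(15, 5, 35, 50)))    # ~1127 LPM (Problem 1 truncates to 1126)
```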
**********************************************************************************************
Problem 1: A chain printer has following specifications:
Time to space after printing a line : 15 msec
Number of characters in the character set : 50
Number of print positions : 132
Number of sub-scans :3
Print cycle time : 5 micro-sec
Synchronization time after each sub-scan : 35 micro-sec
Find the printing speed in lines per minute.
Answer
In chain printer, number of print positions = 132
No. of sub-scans = 3 Print cycle time = 5 micro-sec
Sub-scan time = (5 micro-sec x 132)/3 = 220 micro-sec Synchronization time after sub-scan = 35 micro-sec
Time for one print scan = (220 + 35) x 3 micro-sec = 765 micro-sec
No. of characters = 50. So line print time = 765 x 50 micro-sec = 38.25 msec
Time to space after printing a line = 15 msec
Printing speed = 60 x 10³ msec/(38.25 + 15) msec = 1126 lines/min, i.e., LPM
Problem 2: Determine the printing speed of a dot matrix printer in characters per second having following
specifications:
Time to print a character : 3 msec
Time to space in between characters : 1 msec
Number of characters in a line : 100
Specify the time to print a character line.
Answer
Time to print a character = 3 msec
Time to space between characters = 1 msec
Printing speed = 1/(4 x 10⁻³) = 250 cps (characters per sec)
No. of characters/line = 100
Time to print a line = 100 x (3 + 1) msec = 0.4 sec
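The dot-matrix arithmetic, as a minimal Python sketch (figures assumed from the problem statement):

```python
# Dot-matrix printer throughput (Problem 2 figures).
PRINT_MS = 3        # time to print one character
SPACE_MS = 1        # inter-character spacing time
CHARS_PER_LINE = 100

cps = 1000 / (PRINT_MS + SPACE_MS)                        # 250 characters/sec
line_sec = CHARS_PER_LINE * (PRINT_MS + SPACE_MS) / 1000  # 0.4 sec per line
print(cps, line_sec)
```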
**********************************************************************************************
Problem 3: Determine the bit rate in MHz of a VDU terminal having following specifications:
Number of characters/line : 80
Number of bits/character :7
Horizontal sweep time : 63.5 micro-sec
Retrace time : 20% of horizontal sweep time.
Answer
For a VDU terminal
No. of characters/line = 80
No. of bits per dot row of a character = 7
So no. of bits in a dot row per line = 560 bits
Horizontal sweep time = 63.5 micro-sec (assumed to include retrace time)
Retrace time = 20% x 63.5 micro-sec = 12.7 micro-sec
Time for movement of beam on a dot row of a character line = 63.5 - 12.7 = 50.8 micro-sec
So bit rate = 560 bits/(50.8 x 10⁻⁶) = 11.02 Mbit/sec
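The bit-rate computation as a Python sketch (figures assumed from the problem statement; the sweep time is taken to include retrace, as above):

```python
# VDU dot-row bit rate (Problem 3 figures).
CHARS_PER_LINE = 80
BITS_PER_CHAR_ROW = 7       # bits in one dot row of a character
SWEEP_US = 63.5             # horizontal sweep, assumed to include retrace
RETRACE_FRACTION = 0.20

bits_per_row = CHARS_PER_LINE * BITS_PER_CHAR_ROW   # 560 bits per dot row
active_us = SWEEP_US * (1 - RETRACE_FRACTION)       # 50.8 us of visible sweep
rate_mbps = bits_per_row / active_us                # ~11.02 Mbit/sec
print(round(rate_mbps, 2))
```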
**********************************************************************************************
Problem 4: Determine the cycle time of the refresh RAM used in a graphic system having following
specifications.
Frame size : 1024 x 1024 pixels
Horizontal sweep time : 63.5 micro-sec
Retrace time : 30% of sweep time
One word of RAM stores : 4 pixels.
Answer
For a graphics terminal refresh RAM:
Frame size = 1024 x 1024 pixels
1 RAM word stores = 4 pixels
No. of words to store one pixel row
= 1024/4 = 256 words
Time for one pixel row = 63.5 - 0.3 x 63.5 = 44.45 micro-sec
RAM cycle time = (44.45 x 103)/256 nsec
= 174 nsec (approx.)
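The refresh-RAM timing as a Python sketch (figures assumed from the problem statement):

```python
# Refresh-RAM cycle time for the graphics frame (Problem 4 figures).
ROW_PIXELS = 1024
PIXELS_PER_WORD = 4
SWEEP_US = 63.5
RETRACE_FRACTION = 0.30

words_per_row = ROW_PIXELS // PIXELS_PER_WORD        # 256 words per pixel row
active_us = SWEEP_US * (1 - RETRACE_FRACTION)        # 44.45 us per pixel row
cycle_ns = active_us * 1000 / words_per_row          # ~174 nsec
print(round(cycle_ns))
```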
**********************************************************************************************