15CS72_ACA_Module3_chapter2finalnotes
Asynchronous Models
When stage Si is ready to transmit, it sends a ready signal to stage Si+1. After stage Si+1 receives the incoming data, it returns an acknowledge signal to Si.
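The handshake can be sketched in a few lines of Python (a minimal sketch; the queue's join()/task_done() calls play the roles of the ready and acknowledge signals, and stage names and data are illustrative, not from the text):

```python
# Minimal sketch of the ready/acknowledge handshake between two
# pipeline stages. Illustrative only.
import queue
import threading

def stage_i(out_q, items):
    for item in items:
        out_q.put(item)   # "ready": offer data to stage S(i+1)
        out_q.join()      # wait for the acknowledge from S(i+1)

def stage_i_plus_1(in_q, n):
    for _ in range(n):
        item = in_q.get()     # receive the incoming data
        print("received", item)
        in_q.task_done()      # "acknowledge": S(i) may now proceed

q = queue.Queue(maxsize=1)    # single-item buffer between the stages
t = threading.Thread(target=stage_i_plus_1, args=(q, 3))
t.start()
stage_i(q, [10, 20, 30])
t.join()
```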
Synchronous Models
Latches are used between every two stages. A latch is built from a master-slave flip-flop, which isolates its inputs from its outputs. Upon the arrival of a clock pulse, all latches transfer data to the next stage simultaneously. The synchronous pipeline is shown below.
The utilization pattern of successive stages in a synchronous pipeline is specified by the reservation table given below.
The clock cycle of the pipeline is denoted τ. Let τi be the time delay of stage Si and d the time delay of a latch. The maximum stage delay is τm = max{τi}, hence

τ = τm + d

The pipeline frequency is defined as the inverse of the clock period and represents the throughput of the pipeline:

f = 1/τ
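As a quick worked check (the stage and latch delays below are illustrative assumptions, not from the text):

```python
# Clock period and frequency of a synchronous pipeline.
stage_delays_ns = [8, 10, 9, 7]    # tau_i for stages S1..S4 (assumed)
d_ns = 1                           # latch delay d (assumed)

tau = max(stage_delays_ns) + d_ns  # tau = tau_m + d = 11 ns
f_mhz = 1e3 / tau                  # f = 1/tau ≈ 90.9 MHz
print(tau, f_mhz)
```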
Clock Skewing
Ideally, the clock pulse arrives at all stages at the same time. However, due to a problem known as clock skewing, the clock pulse may arrive at different stages with a time offset of s. Let tmax be the time delay of the longest logic path within a stage and tmin that of the shortest logic path within a stage. To avoid a race between two successive stages, we must choose τm ≥ tmax + s and d ≤ tmin − s. Hence the clock period is bounded by

d + tmax + s ≤ τ ≤ τm + tmin − s

In the ideal case s = 0, tmax = τm, and tmin = d, which reduces to τ = τm + d as before.
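A numeric sketch of these bounds, with illustrative delay values (in ns, all assumed):

```python
# Bounds on the clock period under clock skew s.
tau_m, d = 10, 1        # max stage delay, latch delay
t_max, t_min = 9, 2     # longest/shortest logic path within a stage
s = 0.5                 # clock skew

lower = d + t_max + s       # 10.5 ns
upper = tau_m + t_min - s   # 11.5 ns
print(f"{lower} ns <= tau <= {upper} ns")
```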
Speedup
Ideally, a linear pipeline with k stages can process n tasks in k + (n−1) clock cycles. Thus the total time required is

Tk = [k + (n−1)]τ

An equivalent non-pipelined processor takes

T1 = nkτ

The speedup factor of the k-stage pipeline is therefore

Sk = T1/Tk = nkτ / [k + (n−1)]τ = nk / [k + (n−1)]

The efficiency is the speedup per stage:

Ek = Sk/k = n / [k + (n−1)]

Obviously, the efficiency approaches 1 when n → ∞, and a lower bound on Ek is 1/k when n = 1.
Throughput
The pipeline throughput Hk is defined as the number of tasks performed per unit time:

Hk = n / ([k + (n−1)]τ) = nf / [k + (n−1)]

Note that Hk approaches f as n → ∞; the maximum throughput equals the pipeline frequency.
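A short worked example tying the three formulas together (k, n, and τ are illustrative assumptions):

```python
# Speedup, efficiency, and throughput of a k-stage linear pipeline.
k, n = 4, 64        # stages, tasks (assumed)
tau_ns = 11         # clock period in ns (assumed)

Sk = n * k / (k + n - 1)         # speedup    ≈ 3.82
Ek = n / (k + n - 1)             # efficiency ≈ 0.96
Hk = n / ((k + n - 1) * tau_ns)  # throughput in tasks per ns
print(Sk, Ek, Hk * 1e3, "Mtasks/s")
```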
Usually the number of pipeline stages would not exceed 10 in real computers. The
optimal choice of the number of pipeline stages should be able to maximize the
performance/cost ratio for the target processing load.
Let t be the time taken to execute a program on a non-pipelined processor. If the same program is executed on a k-stage pipeline, the clock period is p = (t/k) + d. Thus the pipeline has a maximum throughput of f = 1/p = 1/(t/k + d). The total pipeline cost is c + kh, where c covers the cost of all logic stages and h represents the cost of each latch. Thus the pipeline performance/cost ratio (PCR) is given by

PCR = f / (c + kh) = 1 / [(t/k + d)(c + kh)]

The PCR is maximized at the optimal number of stages

k0 = √(t·c / (d·h))
where t is the total flow-through delay of the pipeline. Thus the flow-through delay t, the total stage cost c, the latch delay d, and the latch cost h must all be considered to achieve the optimal value k0.
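A sketch of the optimal stage count, with illustrative cost and delay values (all assumed):

```python
import math

# Optimal number of pipeline stages maximizing the PCR.
t = 50.0   # total flow-through delay (assumed)
c = 20.0   # total cost of all logic stages (assumed)
d = 1.0    # latch delay (assumed)
h = 10.0   # cost per latch (assumed)

k0 = math.sqrt(t * c / (d * h))                 # = 10.0
pcr = lambda k: 1 / ((t / k + d) * (c + k * h))
print(k0, pcr(round(k0)))
```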
Besides the streamline connections from S1 to S2 and from S2 to S3, there is a feedforward connection from S1 to S3 and two feedback connections, from S3 to S2 and from S3 to S1. By following different dataflow patterns, one can use the same pipeline to evaluate different functions.
Reservation tables
In a linear pipeline the data flow is strictly linear, so the reservation table is simple. In a nonlinear pipeline the data may flow forward, backward, or skip stages, so the reservation table is more complex. Multiple reservation tables can be generated for the evaluation of different functions. Two reservation tables are given in the figure below, corresponding to a function X and a function Y, respectively. Each function evaluation is specified by one reservation table.
A static pipeline is specified by a single reservation table. A dynamic pipeline may be
specified by more than one reservation table.
The number of columns in a reservation table is called the evaluation time for the given function. For example, the function X requires eight clock cycles to evaluate, as shown in the figure.
The checkmarks in each row of the reservation table correspond to the time instants
(cycles) that a particular stage will be used. There may be multiple checkmarks in a row,
which means repeated usage of the same stage in different cycles. Multiple checkmarks
in a column mean that multiple stages need to be used in parallel during a particular
clock cycle.
Latency Analysis
A pipeline may have different initiations; the number of clock cycles between two initiations is called the latency between them.
Any attempt by two or more initiations to use the same pipeline stage at the same time is called a collision (resource conflict). Latencies that cause collisions are called forbidden latencies, as shown in the figure.
To detect a forbidden latency, one needs simply to check the distance between any two checkmarks in the same row of the reservation table. For example, the distance between the first mark and the second mark in row S1 of the table for function X is 5, implying that 5 is a forbidden latency. Thus latencies 2, 4, 5 and 7 are all seen to be forbidden for function X, and latencies 2 and 4 are forbidden for function Y.
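This check can be automated. The sketch below assumes the function-X reservation table from the figure (S1 marked in cycles 1, 6 and 8; S2 in cycles 2 and 4; S3 in cycles 3, 5 and 7 — an assumption, since the figure is not reproduced here), and it reproduces the forbidden set {2, 4, 5, 7}:

```python
from itertools import combinations

# Forbidden latencies = distances between marks in the SAME row.
reservation_x = {
    "S1": [1, 6, 8],   # assumed marks for function X
    "S2": [2, 4],
    "S3": [3, 5, 7],
}

def forbidden_latencies(table):
    forb = set()
    for marks in table.values():
        for a, b in combinations(sorted(marks), 2):
            forb.add(b - a)
    return forb

print(sorted(forbidden_latencies(reservation_x)))   # [2, 4, 5, 7]
```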
Latency Sequence: a sequence of permissible (non-forbidden) latencies between successive task initiations.
Latency Cycle: a latency sequence which repeats the same subsequence (cycle) indefinitely.
For function X, latencies 1 and 8 are both permissible, so the latency cycle (1, 8) repeats in successive initiations of new tasks. Three valid latency cycles for the evaluation of function X are shown below. As shown in figure (a) below, the latency between the initiations of X1 and X2 is 1, between X2 and X3 it is 8, and between X3 and X4 it is again 1. Hence the sequence 1, 8, 1, 8, … is applied repeatedly across all initiations.
The average latency of a latency cycle is obtained by dividing the sum of all latencies by
the number of latencies along the cycle. The latency cycle (1, 8) thus has an average latency of (1+8)/2 = 4.5. A constant cycle is a latency cycle which contains only one latency value. For example, figures 6.5(b) and 6.5(c) show constant cycles with latencies of 3 and 6, respectively.
6.2.2 Collision Free Scheduling
The main objective is to minimize the average latency and avoid collisions when scheduling events in a nonlinear pipeline.
Collision Vector
For a reservation table with n columns, the maximum forbidden latency m ≤ n − 1. The permissible latency p should be as small as possible, within the range 1 ≤ p ≤ m − 1.
The combined set of permissible and forbidden latencies can be easily displayed by a collision vector, an m-bit binary vector C = (Cm Cm−1 … C2 C1). The value of Ci = 1 if latency i causes a collision, and Ci = 0 if latency i is permissible.
Hence for the reservation tables given above the collision vector Cx = (1011010) is
obtained for function X, and Cy = (1010) for function Y.
From Cx we can immediately tell that latencies 7, 5, 4, and 2 are forbidden and latencies 6, 3, and 1 are permissible.
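A small sketch that builds the collision vector from the forbidden latencies found earlier:

```python
# Bit C_i (i = 1..m) of the collision vector is 1 iff latency i
# causes a collision; bits are written C_m ... C_1.
def collision_vector(forbidden, m):
    return "".join("1" if i in forbidden else "0" for i in range(m, 0, -1))

print(collision_vector({2, 4, 5, 7}, 7))  # 1011010  (Cx)
print(collision_vector({2, 4}, 4))        # 1010     (Cy)
```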
State Diagram
The initial collision vector is loaded into a shift register. The contents of the register are shifted right, with a 0 entering from the left end after each shift. When a 0 bit emerges from the right end after p shifts, p is a permissible latency; when a 1 bit emerges from the right end after p shifts, p is a forbidden latency.
Consider the initial collision vector CX = (1011010). The next state after p shifts is obtained by bitwise ORing the initial collision vector with the shifted register contents.
For example, starting from CX = (1011010), the state (1111111) is reached after one right shift of the register, and the state (1011011) is reached after either three shifts or six shifts. Refer to the figure given below.
The state diagram is shown below. From the initial state [1011010], only three outgoing transitions are possible, corresponding to the three permissible latencies 6, 3, and 1 in the initial collision vector. Similarly, from state [1011011], one reaches the same state after either three shifts or six shifts. When the number of shifts is m + 1 or greater, all transitions are redirected back to the initial state.
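The shift-and-OR rule can be turned into a short state-diagram generator (a minimal sketch; states are held as integers whose bits are Cm … C1, and the 7-bit formatting matches the CX example):

```python
# State-diagram generator using the shift-and-OR rule.
def state_diagram(cx_bits):
    m = len(cx_bits)
    initial = int(cx_bits, 2)
    edges, stack = {}, [initial]
    while stack:
        s = stack.pop()
        if s in edges:
            continue
        edges[s] = []
        for p in range(1, m + 1):
            if (s >> (p - 1)) & 1:   # a 1 emerges after p shifts: forbidden
                continue
            t = (s >> p) | initial   # shift right p, OR in initial vector
            edges[s].append((p, t))
            stack.append(t)
        edges[s].append((m + 1, initial))  # latencies >= m+1 restart
    return edges

for s, outs in state_diagram("1011010").items():
    print(format(s, "07b"), [(p, format(t, "07b")) for p, t in outs])
```

Running this reproduces the three states [1011010], [1111111] and [1011011] described above.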
Greedy Cycle
From the state diagram, we can determine optimal latency cycles which result in the minimum average latency (MAL). There are infinitely many latency cycles one can trace from the state diagram. For example, (1, 8), (1, 8, 6, 8), (3), (6), and (3, 8) are legitimate cycles traced from the state diagram. A simple cycle is a latency cycle in which each state appears only once. Hence for the state diagram in figure (b), the simple cycles are (3), (6), (8), (1, 8), (3, 8) and (6, 8). Some of the simple cycles are greedy cycles. A greedy cycle is one whose edges are all made with minimum latencies from their respective starting states. For example, in figure (b) the cycles (1, 8) and (3) are greedy cycles; the greedy cycles in figure (c) are (1, 5) and (3). Such cycles must first be simple, and their average latencies must be lower than those of the other simple cycles. The greedy cycle (1, 8) in figure (b) has an average latency of (1+8)/2 = 4.5, which is lower than that of the simple cycle (6, 8), whose average is (6+8)/2 = 7. The greedy cycle (3) has a constant latency of 3, which equals the MAL for evaluating function X without causing a collision. The minimum-latency edges in the state diagrams are marked with asterisks.
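A sketch of the greedy traversal (it follows only the path starting at the initial collision vector, always taking the smallest permissible latency, so for CX it finds the greedy cycle (1, 8) rather than the other greedy cycle (3)):

```python
# Greedy traversal: from each state take the smallest permissible
# latency (or m+1 if none), and stop when a state repeats.
def greedy_cycle(cx_bits):
    m = len(cx_bits)
    initial = int(cx_bits, 2)
    state, path, seen = initial, [], {}
    while state not in seen:
        seen[state] = len(path)
        p = next((i for i in range(1, m + 1)
                  if not (state >> (i - 1)) & 1), m + 1)
        path.append(p)
        state = (state >> p) | initial
    cycle = path[seen[state]:]
    return cycle, sum(cycle) / len(cycle)

print(greedy_cycle("1011010"))   # ([1, 8], 4.5)
```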
Bounds on MAL
1. The MAL is lower bounded by the maximum number of checkmarks in any row of
the reservation table.
2. The MAL is lower than or equal to the average latency of any greedy cycle in the
state diagram.
3. The average latency of any greedy cycle is upper-bounded by the number of 1's
in the initial collision vector plus 1.
In the function-X example, the maximum number of checkmarks in any row is three, so MAL ≥ 3; the greedy cycle (3) achieves this bound, hence a MAL of 3 is optimal for evaluating X.
Consider eight instructions executed in the pipeline in program order for the two statements X = Y + Z and A = B * C, as shown in the figure below.
The shaded boxes correspond to idle cycles when instruction issues are blocked due to
resource latency or conflicts or due to data dependencies. The first two load instructions
issue on consecutive cycles. The add is dependent on both loads and must wait three
cycles before the data (Y and Z) are loaded in.
Similarly, the store of the sum to memory location X must wait three cycles for the add
to finish due to a flow dependence.The total time required is 17 clock cycles. This time
is measured beginning at cycle 4 when the first instruction starts execution until cycle 20
when the last instruction starts execution.
If the original program order is not preserved and the instructions are reordered before execution, the time is reduced to 11 clock cycles, as shown in the figure below. The reordering must not change the end results.
Prefetch Buffers
1. It is used to match the instruction fetch rate to the pipeline consumption rate.
2. Three types of buffers can be used namely sequential buffers, target buffers and
loop buffers. It is shown in figure below.
3. Sequential instructions are loaded into a pair of sequential buffers for
in-sequence pipelining. Instructions from a branch target are loaded into a pair of
target buffers for out-of-sequence pipelining. Both buffers operate in a
first-in-first-out fashion.
4. The branch condition is evaluated. If the branch is taken, instructions from the target buffer are loaded into the pipeline; otherwise, instructions from the sequential buffer are used.
5. Within each pair, one can use one buffer to load instructions from memory and
use another buffer to feed instructions into the pipeline.
6. A third type of prefetch buffer is known as the loop buffer. The loop buffer operates in two steps. First, it contains instructions sequentially ahead of the current instruction, which saves instruction fetch time from memory. Second, it recognizes when the target of a branch falls within the loop boundary. In this case, unnecessary memory accesses can be avoided if the target instruction is already in the loop buffer.
The RAW hazard corresponds to the flow dependence, WAR to the anti-dependence,
and WAW to the output dependence.
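A minimal sketch of this classification, comparing the source and destination registers of an earlier instruction i and a later instruction j (register names and instruction format are illustrative):

```python
# Classify the hazard between an earlier instruction i and a later
# instruction j from their destination/source register sets.
def hazards(i_dst, i_src, j_dst, j_src):
    found = []
    if i_dst in j_src:
        found.append("RAW (flow dependence)")
    if j_dst in i_src:
        found.append("WAR (anti-dependence)")
    if i_dst == j_dst:
        found.append("WAW (output dependence)")
    return found

# add R1, R2, R3  followed by  sub R4, R1, R5  -> RAW on R1
print(hazards("R1", {"R2", "R3"}, "R4", {"R1", "R5"}))
```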
6.3.3 Dynamic Instruction Scheduling and Static Scheduling
6. The multiply instruction cannot be initiated until the preceding load is complete.
This data dependence will stall the pipeline for three clock cycles since the two
loads overlap by one cycle.
7. The two loads, since they are independent of the add and move, can be moved ahead to increase the spacing between them and the multiply instruction. The code rearrangement is shown below.
Tomasulo's Algorithm
1. It was implemented in the floating-point unit of the IBM 360/91.
2. It is a hardware-based approach to dynamic instruction scheduling.
3. It has multiple functional units, e.g. an adder and a multiplier. Each functional unit has a reservation station (RS). Instructions are executed, possibly out of order, as soon as their operands are available.
4. This scheme resolves the resource conflicts.
5. An issued instruction is forwarded to an RS if its operands are not available. It waits in the reservation station until the operands become available; once they are, it is dispatched for execution to the functional unit associated with the RS. All working registers are tagged.
6. When an instruction has completed execution, the result is broadcast on the common data bus along with its tag.
7. The registers as well as the RSs monitor the result bus (common data bus) and update their contents (and ready/busy bits) when a matching tag is found.
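A minimal sketch of the tag-matching mechanism (the data structures and the two-instruction example are simplified assumptions, not the IBM 360/91 design; timing, issue logic, and load/store handling are omitted):

```python
# Tomasulo-style tag matching: registers and reservation-station
# operands hold either a value or a tag; a broadcast on the common
# data bus fills in every location waiting on that tag.
class Operand:
    def __init__(self, value=None, tag=None):
        self.value, self.tag = value, tag   # exactly one is set
    def ready(self):
        return self.tag is None

registers = {"R1": Operand(value=10), "R2": Operand(value=3)}
stations = {}       # tag -> (op, src1, src2)

def issue(tag, op, r_src1, r_src2, r_dst):
    # Copy a value if the register is ready, else copy the producer's tag.
    def read(r):
        o = registers[r]
        return Operand(o.value) if o.ready() else Operand(tag=o.tag)
    stations[tag] = (op, read(r_src1), read(r_src2))
    registers[r_dst] = Operand(tag=tag)     # dst now waits on this RS

def broadcast(tag, value):
    # Common data bus: every register and RS operand with a matching
    # tag latches the value and becomes ready.
    for o in registers.values():
        if o.tag == tag:
            o.value, o.tag = value, None
    for _, s1, s2 in stations.values():
        for o in (s1, s2):
            if o.tag == tag:
                o.value, o.tag = value, None
    del stations[tag]

issue("ADD1", "+", "R1", "R2", "R3")   # R3 <- R1 + R2
issue("MUL1", "*", "R3", "R2", "R4")   # R4 <- R3 * R2 (waits on tag ADD1)
op, s1, s2 = stations["ADD1"]
broadcast("ADD1", s1.value + s2.value) # ADD1 completes; MUL1 becomes ready
print(registers["R3"].value)                         # 13
print(all(o.ready() for o in stations["MUL1"][1:]))  # True
```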
9. The tightest upper bound on the effective pipeline throughput is obtained when b = k − 1 and n → ∞:

H* = f / (pqb + 1)

where p is the probability that an instruction is a branch, q the probability that a branch is taken, and b the branch penalty in cycles.
10. Suppose p = 0.2, q = 0.6, and b = k − 1 = 7. We define the performance degradation factor

D = (f − H*) / f = pqb / (pqb + 1) = 0.84/1.84 ≈ 0.46

The above analysis implies that the pipeline performance can be degraded by 46% with branching.
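A numeric check of this example (k = 8 is implied by b = k − 1 = 7):

```python
# Degradation of pipeline throughput due to branching.
p, q, k = 0.2, 0.6, 8
b = k - 1                       # worst-case branch penalty
pqb = p * q * b                 # 0.84
H_star_over_f = 1 / (pqb + 1)   # ≈ 0.543
D = pqb / (pqb + 1)             # ≈ 0.457 -> about 46% degradation
print(pqb, round(H_star_over_f, 3), round(D, 3))
```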