
Module 3 Chapter 2

6.1 Linear Pipeline Processors

A linear pipeline processor consists of k processing stages that are linearly connected to
perform a fixed function over a stream of data flowing from one end to the other. The
fixed function is partitioned into several subfunctions, one per stage. The external input is
fed into the first stage S1, and the processed results are passed from stage Si to stage Si+1 for
all i = 1, 2, ..., k−1. A linear pipeline is also called a static pipeline.

6.1.1 Asynchronous and Synchronous Pipeline Models

Asynchronous Models
When stage Si is ready to transmit, it sends a ready signal to stage Si+1. After stage Si+1
receives the incoming data, it returns an acknowledge signal to Si.

● Useful in designing communication channels in message-passing multicomputers.

● Different amounts of delay may be experienced in different stages.

Synchronous Models
Latches are used between every two stages. Each latch is a master-slave flip-flop,
which isolates inputs from outputs. Upon the arrival of a clock pulse, all latches
transfer data to the next stage simultaneously. The synchronous pipeline is shown
below.
The utilization pattern of successive stages in a synchronous pipeline is specified by the
reservation table given below.

Clocking and Timing Control

The clock cycle of the pipeline is denoted as τ. Let τ_i be the time delay of stage S_i and
d the time delay of a latch. The maximum stage delay is τ_m = max{τ_i}, so

τ = τ_m + d

The pipeline frequency is defined as the inverse of the clock period; it represents the
maximum throughput of the pipeline (one result per cycle when the pipeline is full):

f = 1/τ

Clock Skewing

Ideally, the clock pulse arrives at all stages at the same time. However, due to a problem
known as clock skewing, the clock pulse may arrive at different stages with a time offset of
s. Let t_max be the time delay of the longest logic path within a stage and t_min that of the
shortest logic path within a stage.

Hence the clock period must satisfy

d + t_max + s ≤ τ ≤ τ_m + t_min − s

In the ideal case s = 0, t_max = τ_m, and t_min = d, which reduces to τ = τ_m + d as before.

Speedup

Ideally, a linear pipeline with k stages can process n tasks in k + (n−1) clock cycles: k
cycles are needed to complete the first task, and the remaining n−1 tasks emerge one per
cycle thereafter. Thus the total time required is

T_k = [k + (n−1)]τ

where τ is the clock period. A nonpipelined processor requires kτ per task, so the time
needed to execute the n tasks is

T_1 = nkτ

The speedup factor of a k-stage pipeline over a nonpipelined processor is

S_k = T_1/T_k = nkτ / [k + (n−1)]τ = nk / (k + n − 1)

As n → ∞, the speedup approaches its maximum value k.

Efficiency: The efficiency E_k of a linear k-stage pipeline is defined as

E_k = S_k/k = n / (k + n − 1)

Obviously, the efficiency approaches 1 as n → ∞, and the lower bound E_k = 1/k is
reached when n = 1.

Throughput

The pipeline throughput H_k is defined as the number of tasks performed per unit time:

H_k = n / [k + (n−1)]τ = nf / (k + n − 1)

Note that H_k = E_k · f. The maximum throughput f occurs as E_k → 1 when n → ∞.
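
As a quick numeric check of these formulas, the sketch below computes S_k, E_k, and H_k for an assumed 4-stage pipeline with an assumed clock period of 10 ns (both values are illustrative, not from the notes):

```python
# Speedup, efficiency and throughput of a k-stage linear pipeline.
def pipeline_metrics(k, n, tau):
    Tk = (k + (n - 1)) * tau   # pipelined time      T_k = [k + (n-1)] * tau
    T1 = n * k * tau           # nonpipelined time   T_1 = n * k * tau
    Sk = T1 / Tk               # speedup             S_k = nk / (k + n - 1)
    Ek = Sk / k                # efficiency          E_k = S_k / k
    Hk = n / Tk                # throughput          H_k = E_k * f
    return Sk, Ek, Hk

for n in (1, 8, 64, 1024):
    Sk, Ek, Hk = pipeline_metrics(k=4, n=n, tau=10e-9)
    print(f"n={n:5d}  S_k={Sk:5.2f}  E_k={Ek:4.2f}  H_k={Hk/1e6:7.2f} MIPS")
```

For n = 1 this reproduces the lower bounds S_k = 1 and E_k = 1/k = 0.25; as n grows, S_k approaches k = 4 and H_k approaches f = 100 MIPS.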

Optimal Number of Stages

The number of pipeline stages usually does not exceed 10 in real computers. The
optimal number of pipeline stages should maximize the performance/cost ratio for the
target processing load.

Let t be the time taken to execute a program on a nonpipelined processor. If the same
program is executed on a k-stage pipeline with an equal total flow-through delay t, the
clock period becomes p = t/k + d, where d is the latch delay.

The pipeline thus has a maximum throughput of f = 1/p = 1/(t/k + d). The total
pipeline cost is c + kh, where c covers the cost of all logic stages and h represents the
cost of each latch. The pipeline performance/cost ratio (PCR) is therefore

PCR = f / (c + kh) = 1 / [(t/k + d)(c + kh)]

The figure below plots PCR as a function of k. The peak of the PCR curve corresponds
to the optimal number of pipeline stages,

k_0 = √(t·c / (d·h))

where t is the total flow-through delay of the pipeline. Thus the total stage cost c, the
latch delay d, and the latch cost h must all be considered to find the optimal value k_0.
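
The trade-off can be checked numerically. In the sketch below the flow-through delay t, latch delay d, and the cost parameters c and h are assumed example values, chosen only to show that a brute-force PCR search agrees with the analytic optimum k_0:

```python
from math import sqrt

# Assumed example parameters (illustrative, not from the notes):
t, d = 100.0, 4.0    # flow-through delay and latch delay, in ns
c, h = 10.0, 5.0     # total stage-logic cost and per-latch cost

def pcr(k):
    # PCR = f / (c + k*h) = 1 / ((t/k + d) * (c + k*h))
    return 1.0 / ((t / k + d) * (c + k * h))

k0 = sqrt(t * c / (d * h))           # analytic optimum k_0 = sqrt(t*c / (d*h))
best_k = max(range(1, 17), key=pcr)  # brute-force search over k = 1..16
print(f"k0 = {k0:.2f}, best integer k = {best_k}")   # k0 = 7.07, best k = 7
```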

6.2 Nonlinear Pipeline Processors

A nonlinear pipeline is a dynamic pipeline in which variable functions can be performed
at different times. A three-stage multifunction pipeline is shown in the figure below.

Besides the streamline connections from S1 to S2 and from S2 to S3, there is a
feedforward connection from S1 to S3 and two feedback connections, from S3 to S2 and
from S3 to S1. By following different data flow patterns, one can use the same pipeline
to evaluate different functions.

Reservation Tables

In a linear pipeline the data flow is linear, so the reservation table is trivial. In a
nonlinear pipeline the data flow is nonlinear, and the reservation table is correspondingly
more complex.

Multiple reservation tables can be generated for the evaluation of different functions.
Two reservation tables are given in the figure below, corresponding to a function X and
a function Y, respectively. Each function evaluation is specified by one reservation table.
A static pipeline is specified by a single reservation table; a dynamic pipeline may be
specified by more than one reservation table.

The number of columns in a reservation table is called the evaluation time of the given
function. For example, function X requires eight clock cycles, as shown in the figure.

The checkmarks in each row of the reservation table correspond to the time instants
(cycles) at which a particular stage is used. Multiple checkmarks in a row mean
repeated usage of the same stage in different cycles; multiple checkmarks in a column
mean that multiple stages are used in parallel during that clock cycle.

Latency Analysis

A pipeline may have several initiations; the number of clock cycles between two
initiations is called the latency between them.

Any attempt by two or more initiations to use the same pipeline stage at the same time
is called a collision (resource conflict). Latencies that cause collisions are called
forbidden latencies, as shown in the figure.

As shown in figure (a), with latency 2, X1 and X2 collide in stage 2 at times 4 and 5.
Other collisions occur at times 6, 7, 8, 9, 10, and so on. The collision patterns for
latency 5 are shown in figure (b), where X1 and X2 are scheduled 5 clock cycles apart;
their first collision occurs at time 6.

To detect a forbidden latency, one simply checks the distance between every pair of
checkmarks in the same row of the reservation table. For example, the distance
between the first mark and the second mark in row S1 of the reservation table for
function X is 5, implying that 5 is a forbidden latency. In this way latencies 2, 4, 5, and 7
are all seen to be forbidden for function X, and latencies 2 and 4 are forbidden for function Y.
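
The row-distance check is mechanical enough to automate. The sketch below encodes a reservation table as a mapping from stage to the cycles in which that stage is used; the marks for function X here are a transcription of the figure and should be treated as an assumed example:

```python
from itertools import combinations

# Reservation table for function X (3 stages, 8 cycles), as read from the
# figure; substitute your own checkmarks as needed.
resv_x = {"S1": [1, 6, 8], "S2": [2, 4], "S3": [3, 5, 7]}

def forbidden_latencies(table):
    forb = set()
    for cycles in table.values():
        for a, b in combinations(sorted(cycles), 2):
            forb.add(b - a)   # two initiations this many cycles apart collide
    return sorted(forb)

print(forbidden_latencies(resv_x))   # -> [2, 4, 5, 7]
```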
Latency Sequence: a sequence of permissible (non-forbidden) latencies between
successive task initiations.

Latency Cycle: a latency sequence which repeats the same subsequence (cycle)
indefinitely.

For function X, latencies 1 and 8 are both permissible, and the latency cycle (1, 8)
repeats in successive initiations of new tasks. Three valid latency cycles for the
evaluation of function X are shown below. As shown in figure (a), the latency between
the initiations of X1 and X2 is 1, between X2 and X3 is 8, and between X3 and X4 is
again 1; the sequence 1, 8, 1, 8, ... is applied repeatedly to all initiations.
The average latency of a latency cycle is obtained by dividing the sum of all latencies by
the number of latencies along the cycle. The latency cycle (1, 8) thus has an average
latency of (1 + 8)/2 = 4.5. A constant cycle is a latency cycle which contains only one
latency value; for example, figures 6.5(b) and 6.5(c) show constant cycles with latencies
3 and 6, respectively.
6.2.2 Collision-Free Scheduling

The main objective in scheduling events in a nonlinear pipeline is to minimize the
average latency between initiations while avoiding collisions.

Collision Vector
For a reservation table with n columns, the maximum forbidden latency m satisfies
m ≤ n−1. The permissible latency p should be as small as possible, and any permissible
latency satisfies 1 ≤ p ≤ m−1.

The combined set of permissible and forbidden latencies can be displayed by a
collision vector, an m-bit binary vector C = (C_m C_m−1 ... C_3 C_2 C_1), where C_i = 1
if latency i causes a collision and C_i = 0 if latency i is permissible.

For the reservation tables given above, the collision vector C_X = (1011010) is
obtained for function X, and C_Y = (1010) for function Y.
From C_X we can immediately tell that latencies 7, 5, 4, and 2 are forbidden and latencies
6, 3, and 1 are permissible.
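
Packing the forbidden set into the m-bit vector is then a one-liner; the sketch below rebuilds C_X and C_Y from the forbidden latencies found above:

```python
def collision_vector(forbidden, m):
    # Bit C_i (i = 1..m) is 1 iff latency i is forbidden; printed C_m ... C_1.
    return "".join("1" if i in forbidden else "0" for i in range(m, 0, -1))

print(collision_vector({2, 4, 5, 7}, m=7))   # -> 1011010  (C_X)
print(collision_vector({2, 4}, m=4))         # -> 1010     (C_Y)
```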

State Diagram

A state diagram of permissible state transitions is constructed from the collision vector.

The collision vector at time 1 is called the initial collision vector, and the initial state is
represented by it. The next state is obtained using an m-bit shift register, as shown below.

The initial collision vector is loaded into the register, and the contents are shifted
right, with a 0 entering from the left end on each shift. When a 0 bit emerges from
the right end after p shifts, latency p is permissible; when a 1 bit emerges after p
shifts, latency p is forbidden.
Consider the initial collision vector C_X = (1011010). The next state after p shifts is
obtained by bitwise-ORing the initial collision vector with the shifted register contents.
For example, the state (1111111) is reached after one right shift of the register, and
state (1011011) is reached after either three or six shifts. Refer to the table given below.

Current state   Shift no. p   Shifted contents   Latency p     Next state = 1011010 OR shifted contents

1011010         1             0101101            permissible   1111111
1011010         2             0010110            forbidden     1011110
1011010         3             0001011            permissible   1011011
1011010         4             0000101            forbidden     1011111
1011010         5             0000010            forbidden     1011010
1011010         6             0000001            permissible   1011011
1011010         7             0000000            forbidden     1011010

(Only the permissible latencies 1, 3, and 6 give rise to actual state transitions; the rows
for forbidden latencies are included only to illustrate the shift-and-OR computation.)

The state diagram is shown below. From the initial state [1011010], only three outgoing
transitions are possible, corresponding to the three permissible latencies 6, 3, and 1 in the
initial collision vector. Similarly, from state [1011011] one reaches the same state after
either three shifts or six shifts. When the number of shifts is m + 1 or greater, all
transitions are redirected back to the initial state.
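
The shift-and-OR rule maps directly onto integer bit operations: testing bit p of the current state decides whether latency p is permissible, and (state >> p) OR initial gives the next state. The sketch below regenerates the transition table above and enumerates every reachable state for C_X:

```python
def transitions(initial, m):
    # Build the state diagram: for each reachable state, map each
    # permissible latency p to the next state (state >> p) | initial.
    # Latencies >= m + 1 always lead back to the initial state.
    states, work = {}, [initial]
    while work:
        s = work.pop()
        if s in states:
            continue
        edges = {p: (s >> p) | initial
                 for p in range(1, m + 1)
                 if (s >> (p - 1)) & 1 == 0}   # bit p clear => permissible
        states[s] = edges
        work.extend(edges.values())
    return states

for s, edges in sorted(transitions(0b1011010, 7).items()):
    print(f"{s:07b}: " + ", ".join(f"{p}->{t:07b}" for p, t in sorted(edges.items())))
# 1011010: 1->1111111, 3->1011011, 6->1011011
# 1011011: 3->1011011, 6->1011011
# 1111111:   (no permissible latency <= m; latency m+1 or more returns to start)
```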

Greedy Cycle
From the state diagram we can determine optimal latency cycles which result in the
MAL (minimum average latency). Infinitely many latency cycles can be traced from the
state diagram; for example, (1, 8), (1, 8, 6, 8), (3), (6), and (3, 8) are all legitimate cycles.
A simple cycle is a latency cycle in which each state appears only once. For the state
diagram in figure (b), the simple cycles are (3), (6), (8), (1, 8), (3, 8), and (6, 8). Some of
the simple cycles are greedy cycles. A greedy cycle is one whose edges are all made with
the minimum latencies from their respective starting states. For example, in figure (b)
the cycles (1, 8) and (3) are greedy cycles, and the greedy cycles in figure (c) are (1, 5)
and (3). Such cycles must first be simple, and their average latencies must be lower
than those of the other simple cycles. The greedy cycle (1, 8) in figure (b) has an average
latency of (1 + 8)/2 = 4.5, which is lower than that of the simple cycle (6, 8), whose
average latency is (6 + 8)/2 = 7. The greedy cycle (3) has a constant latency of 3, which
equals the MAL for evaluating function X without causing a collision. The
minimum-latency edges in the state diagrams are marked with asterisks.
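
A greedy cycle can also be traced programmatically: from each state, repeatedly take the smallest permissible latency until a state repeats. The sketch below reuses transitions from the previous snippet and treats the absence of a permissible latency ≤ m as the m + 1 = 8 transition back to the initial state:

```python
def greedy_trace(start, initial, m):
    # Follow minimum-latency edges from `start` until a state repeats;
    # return the repeating latency cycle and its average latency.
    graph = transitions(initial, m)
    seen, lats, s = {}, [], start
    while s not in seen:
        seen[s] = len(lats)
        edges = graph[s]
        p = min(edges) if edges else m + 1   # fall back to latency m + 1
        s = edges[p] if edges else initial
        lats.append(p)
    cycle = lats[seen[s]:]                   # keep only the repeating part
    return cycle, sum(cycle) / len(cycle)

cx = 0b1011010
print(greedy_trace(cx, cx, 7))          # -> ([1, 8], 4.5): greedy cycle (1, 8)
print(greedy_trace(0b1011011, cx, 7))   # -> ([3], 3.0): cycle (3), the MAL
```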

Pipeline Schedule Optimization

The idea is to insert noncompute delay stages into the original pipeline. This modifies
the reservation table, resulting in a new collision vector and an improved state diagram,
from which an optimal latency cycle, the shortest achievable, can be obtained.

Bounds on MAL
1. The MAL is lower-bounded by the maximum number of checkmarks in any row of
the reservation table.
2. The MAL is lower than or equal to the average latency of any greedy cycle in the
state diagram.
3. The average latency of any greedy cycle is upper-bounded by the number of 1's
in the initial collision vector plus 1.
For function X, the maximum number of checkmarks in any row is 3, so MAL ≥ 3; since
the greedy cycle (3) achieves an average latency of 3, it is optimal. Likewise, C_X
contains four 1's, so the average latency of any greedy cycle is at most 5, which both 4.5
and 3 satisfy.

6.3 Instruction Pipeline Design


A stream of instructions can be executed by a pipeline in an overlapped manner.
6.3.1 Instruction Execution Phases
Pipelined Instruction Processing
A typical instruction pipeline is depicted in the figure below. The fetch stage (F)
fetches instructions from a cache memory, ideally one per cycle. The decode stage (D)
reveals the instruction function to be performed and identifies the resources needed;
resources include general-purpose registers, buses, and functional units. The issue
stage (I) reserves resources, and the operands are also read from registers during this
stage. The instructions are executed in one or several execute stages (E); three
execute stages are shown in the figure.

Consider eight instructions executed in the pipeline in program order for the two
statements X = Y + Z and A = B × C, as shown in the figure below.
The shaded boxes correspond to idle cycles in which instruction issue is blocked due to
resource latency, resource conflicts, or data dependences. The first two load instructions
issue on consecutive cycles. The add depends on both loads and must wait three
cycles before the data (Y and Z) are loaded in.
Similarly, the store of the sum to memory location X must wait three cycles for the add
to finish, due to a flow dependence. The total time required is 17 clock cycles, measured
from cycle 4, when the first instruction starts execution, to cycle 20, when the last
instruction starts execution.

If the original program order is not preserved and the instructions are reordered, the
time is reduced to 11 clock cycles, as shown in the figure below. The reordering must
not change the end results.

Example: The MIPS R4000 Instruction Pipeline

1. It has an eight-stage pipeline, as shown in the figure.
2. The instruction and data memory references are split across two stages.
3. The pipeline operates efficiently because different CPU resources, such as bus
access, ALU operations, and register accesses, are utilized simultaneously on a
noninterfering basis. The overlapped execution of successive instructions is
shown in the figure given below.

6.3.2 Mechanisms for Instruction Pipelining

Multiple Functional Units

Multiple copies of the same pipeline stage can be used simultaneously. To
resolve data or resource dependences among the successive instructions entering the
pipeline, a reservation station (RS) is used with each functional unit. Operations wait
in the RS until their data dependences have been resolved. Each RS is uniquely
identified by a tag, which is monitored by a tag unit. The tag unit keeps checking the
tags from all currently used registers and RSs. This register-tagging technique allows the
hardware to resolve conflicts between the source and destination registers assigned to
multiple instructions. Once the dependences are resolved, the multiple functional units
operate in parallel; this alleviates the bottleneck in the execution stages of the
instruction pipeline.

Prefetch Buffers
1. Prefetch buffers are used to match the instruction fetch rate to the pipeline
consumption rate.
2. Three types of buffers can be used, namely sequential buffers, target buffers, and
loop buffers, as shown in the figure below.
3. Sequential instructions are loaded into a pair of sequential buffers for
in-sequence pipelining, while instructions from a branch target are loaded into a pair of
target buffers for out-of-sequence pipelining. Both buffer types operate in a
first-in-first-out fashion.
4. After the branch condition is evaluated, if the branch is taken, instructions are fed
into the pipeline from the target buffer; otherwise they are fed from the sequential
buffer.
5. Within each pair, one buffer can load instructions from memory while the other
feeds instructions into the pipeline.
6. A third type of prefetch buffer is the loop buffer, which operates in two steps.
First, it holds instructions sequentially ahead of the current instruction, saving
instruction fetch time from memory. Second, it recognizes when the target of a branch
falls within the loop boundary; in this case unnecessary memory accesses are avoided if
the target instruction is already in the loop buffer.

Internal Data Forwarding


1. The throughput of a pipelined processor can be further improved with internal
data forwarding among multiple functional units.
2. Two techniques are used, store-load forwarding and load-load forwarding, shown
in the figure below and sketched in code after this list.
3. Store-load forwarding: a load operation (LD R2, M) from memory into register
R2 is replaced by a move operation (MOVE R2, R1) from register R1 into R2, where
R1 is the register just stored to location M. Since a register transfer is faster than a
memory access, this forwarding reduces memory traffic and thus results in a shorter
execution time.
4. Load-load forwarding: when a second load reads the same memory location
(LD R2, M after LD R1, M), it is replaced by a move operation (MOVE R2, R1),
eliminating one memory access.
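
The two rewrites in points 3 and 4 amount to a small peephole transformation. The sketch below is an illustrative toy pass over (op, dst, src) tuples; it assumes the forwarding register is not redefined between the paired instructions, which a real pass would have to verify:

```python
def forward(instrs):
    # instrs: tuples like ("STO", "M", "R1") = store R1 to M,
    #         ("LD", "R2", "M")  = load M into R2.
    holds, out = {}, []      # holds: memory location -> register mirroring it
    for op, dst, src in instrs:
        if op == "STO":
            holds[dst] = src                       # M now mirrors the register
            out.append((op, dst, src))
        elif op == "LD" and src in holds:
            out.append(("MOVE", dst, holds[src]))  # forward from the register
        elif op == "LD":
            holds[src] = dst                       # later loads of src can copy
            out.append((op, dst, src))
        else:
            out.append((op, dst, src))
    return out

# Store-load: STO M,R1 ; LD R2,M   becomes   STO M,R1 ; MOVE R2,R1
print(forward([("STO", "M", "R1"), ("LD", "R2", "M")]))
# Load-load:  LD R1,M ; LD R2,M    becomes   LD R1,M ; MOVE R2,R1
print(forward([("LD", "R1", "M"), ("LD", "R2", "M")]))
```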
Hazard Avoidance
1. If instructions are executed out of order, incorrect results may be read or
written, thereby producing hazards.
2. Consider two instructions I and J in program order, with J following I. We use
the notation D(I) and R(I) for the domain and range of an instruction I: the domain
contains the input set (operands read) of instruction I, and the range corresponds to the
output set (results written) of instruction I.
3. The hazards occur under the following conditions: a RAW (read-after-write)
hazard when R(I) ∩ D(J) ≠ ∅, a WAR (write-after-read) hazard when D(I) ∩ R(J) ≠ ∅,
and a WAW (write-after-write) hazard when R(I) ∩ R(J) ≠ ∅.

The RAW hazard corresponds to flow dependence, WAR to antidependence, and
WAW to output dependence.
6.3.3 Static and Dynamic Instruction Scheduling

Static Instruction Scheduling

1. Performed by an optimizing compiler.
2. The data dependences between instructions create interlocked relationships
among them; interlocking is resolved by static scheduling.
3. In static scheduling the compiler rearranges the instructions, so that they
execute out of the original program order.
4. The instructions are rearranged such that interlocked instructions are
separated by a distance equal to the stage delay between them.
5. Consider the example given below.

6. The multiply instruction cannot be initiated until the preceding load completes.
This data dependence stalls the pipeline for three clock cycles, since the two
loads overlap by one cycle.
7. The two loads, being independent of the add and move, can be moved ahead to
increase the spacing between them and the multiply instruction. The code
rearrangement is shown below.

8. Through this code rearrangement, the data dependences and program
semantics are preserved, and the multiply can be initiated without delay; pipeline
stalling is thus avoided.

Dynamic Instruction Scheduling

Dynamic instruction scheduling is performed in hardware, as in Tomasulo's
register-tagging scheme built into the IBM 360/91, or the scoreboarding scheme built
into the CDC 6600 processor.

Tomasulo's Algorithm
1. It was implemented in the floating-point unit of the IBM 360/91.
2. It is a hardware-based approach.
3. There are multiple functional units, an adder and a multiplier, and each
functional unit has its own reservation stations (RS). Instructions are executed out of
order as soon as their operands become available.
4. This scheme resolves the resource conflicts.
5. An issued instruction is forwarded to an RS if its operands are not yet available,
and it waits in the reservation station until the operands become available; once they
are available, it is dispatched for execution to the functional unit associated with
that RS. All working registers are tagged.
6. When an instruction completes execution, its result is broadcast on the
common data bus along with its tag.
7. The registers as well as the RSs monitor the result bus (common data bus) and
update their contents (and ready/busy bits) when a matching tag is found, as sketched
below.
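
The tag broadcast in points 6 and 7 can be sketched in a few lines. The class below is an illustrative assumption about the structure, not the actual IBM 360/91 organization: each operand slot of an RS holds either a value or the tag of the RS that will produce it.

```python
class ReservationStation:
    def __init__(self, tag, operands):
        self.tag = tag
        self.operands = operands    # slot -> value, or ("tag", t) if pending

    def ready(self):
        # Dispatch is possible once no operand is still waiting on a tag.
        return not any(isinstance(v, tuple) for v in self.operands.values())

    def snoop(self, tag, value):
        # Monitor the common data bus; capture a result with a matching tag.
        for slot, v in self.operands.items():
            if v == ("tag", tag):
                self.operands[slot] = value

# An add waits for an operand still being produced under (assumed) tag "M1".
rs_add = ReservationStation("A1", {"src1": 3.5, "src2": ("tag", "M1")})
print(rs_add.ready())                    # False: src2 is still pending
rs_add.snoop("M1", 2.0)                  # result 2.0 broadcast with tag M1
print(rs_add.ready(), rs_add.operands)   # True {'src1': 3.5, 'src2': 2.0}
```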

Consider the execution of the two statements X = Y + Z and A = B × C in a seven-stage
pipeline with Tomasulo's algorithm, as shown below. The total execution time is 13
clock cycles, counting from cycle 4 to cycle 15 and ignoring the pipeline startup and
draining time.

6.3.4 Branch Handling Techniques

The performance of the pipeline is affected by the presence of branch instructions.
Effect of Branching
1. A taken branch is a branch instruction whose condition evaluates to true. For a
taken branch, the next instruction fetched for execution is a nonsequential (remote)
instruction called the branch target.
2. The number of pipeline cycles wasted between a taken branch and the fetching
of its branch target is called the delay slot, denoted b.
3. In general, 0 ≤ b ≤ k−1, where k is the number of pipeline stages.
4. When a branch is taken, all the instructions following the branch in the
pipeline become useless and are drained from the pipeline.
5. Let p be the probability of a conditional branch instruction in a typical instruction
stream and q the probability of a successfully executed conditional branch
instruction (a taken branch). Typical values of p = 20% and q = 60% have been
observed in some programs.
6. The penalty paid for branching is pqnbτ, because each taken branch
costs bτ extra pipeline cycles.
7. The total execution time of n instructions, including the effect of branching, is

T_eff = kτ + (n−1)τ + pqnbτ

8. The effective pipeline throughput is

H_eff = n/T_eff = nf / (k + n − 1 + pqnb)

9. The tightest upper bound on the effective pipeline throughput is obtained when
b = k−1 and n → ∞:

H_eff → f / (pq(k−1) + 1)

10. Suppose p = 0.2, q = 0.6, and b = k−1 = 7. The performance degradation
factor is

D = 1 − 1/(pq(k−1) + 1) = 1 − 1/1.84 ≈ 0.46

The above analysis implies that pipeline performance can be degraded by
46% with branching under these assumptions.
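
The 46% figure is easy to reproduce; the short sketch below evaluates the asymptotic throughput ratio and degradation factor for the quoted parameters:

```python
def branch_degradation(p, q, b):
    # Asymptotic (n -> infinity, b = k-1) ratio of effective throughput
    # H_eff to the ideal throughput f, and the degradation D = 1 - ratio.
    ratio = 1.0 / (p * q * b + 1.0)
    return ratio, 1.0 - ratio

ratio, D = branch_degradation(p=0.2, q=0.6, b=7)
print(f"H_eff/f = {ratio:.3f}, degradation = {D:.0%}")   # 0.543, 46%
```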

Branch Prediction and Arithmetic Pipelines: refer to the textbook.

You might also like