15CS72_ACA_Module3_chapter2finalnotes
Asynchronous Models
When stage Si is ready to transmit, it sends a ready signal to stage Si+1. After stage Si+1 receives the incoming data, it returns an acknowledge signal to Si.
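The handshake can be sketched in a few lines of Python (a minimal sketch; the queue's join()/task_done() calls play the roles of the ready and acknowledge signals, and stage names and data are illustrative, not from the text):

```python
# Minimal sketch of the ready/acknowledge handshake between two
# pipeline stages. Illustrative only.
import queue
import threading

def stage_i(out_q, items):
    for item in items:
        out_q.put(item)   # "ready": offer data to stage S(i+1)
        out_q.join()      # wait for the acknowledge from S(i+1)

def stage_i_plus_1(in_q, n):
    for _ in range(n):
        item = in_q.get()     # receive the incoming data
        print("received", item)
        in_q.task_done()      # "acknowledge": S(i) may now proceed

q = queue.Queue(maxsize=1)    # single-item buffer between the stages
t = threading.Thread(target=stage_i_plus_1, args=(q, 3))
t.start()
stage_i(q, [10, 20, 30])
t.join()
```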
Synchronous Models
Latches are used between every two stages. A latch is built from a master-slave flip-flop, which isolates its inputs from its outputs. Upon the arrival of a clock pulse, all latches transfer data to the next stage simultaneously. The synchronous pipeline is shown below.
The utilization pattern of successive stages in a synchronous pipeline is specified by the reservation table given below.
The clock cycle of the pipeline is denoted τ. Let τi be the time delay of stage Si and d the time delay of a latch. The maximum stage delay is τm = max{τi}, hence

τ = τm + d

The pipeline frequency is defined as the inverse of the clock period and represents the throughput of the pipeline:

f = 1/τ
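As a quick worked check (the stage and latch delays below are illustrative assumptions, not from the text):

```python
# Clock period and frequency of a synchronous pipeline.
stage_delays_ns = [8, 10, 9, 7]    # tau_i for stages S1..S4 (assumed)
d_ns = 1                           # latch delay d (assumed)

tau = max(stage_delays_ns) + d_ns  # tau = tau_m + d = 11 ns
f_mhz = 1e3 / tau                  # f = 1/tau ≈ 90.9 MHz
print(tau, f_mhz)
```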
Clock Skewing
Ideally, the clock pulse arrives at all stages at the same time. However, due to a problem known as clock skewing, the clock pulse may arrive at different stages with a time offset of s. Let tmax be the time delay of the longest logic path within a stage and tmin that of the shortest logic path within a stage. To avoid a race between two successive stages, we must choose τm ≥ tmax + s and d ≤ tmin − s. Hence the clock period is bounded by

d + tmax + s ≤ τ ≤ τm + tmin − s

In the ideal case s = 0, tmax = τm, and tmin = d, which reduces to τ = τm + d as before.
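A numeric sketch of these bounds, with illustrative delay values (in ns, all assumed):

```python
# Bounds on the clock period under clock skew s.
tau_m, d = 10, 1        # max stage delay, latch delay
t_max, t_min = 9, 2     # longest/shortest logic path within a stage
s = 0.5                 # clock skew

lower = d + t_max + s       # 10.5 ns
upper = tau_m + t_min - s   # 11.5 ns
print(f"{lower} ns <= tau <= {upper} ns")
```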
Speedup
Ideally, a linear pipeline with k stages can process n tasks in k + (n−1) clock cycles. Thus the total time required is

Tk = [k + (n−1)]τ

An equivalent non-pipelined processor takes

T1 = nkτ

The speedup factor of the k-stage pipeline is therefore

Sk = T1/Tk = nkτ / [k + (n−1)]τ = nk / [k + (n−1)]

The efficiency is the speedup per stage:

Ek = Sk/k = n / [k + (n−1)]

Obviously, the efficiency approaches 1 when n → ∞, and a lower bound on Ek is 1/k when n = 1.
Throughput
The pipeline throughput Hk is defined as the number of tasks performed per unit time:

Hk = n / ([k + (n−1)]τ) = nf / [k + (n−1)]

Note that Hk approaches f as n → ∞; the maximum throughput equals the pipeline frequency.
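A short worked example tying the three formulas together (k, n, and τ are illustrative assumptions):

```python
# Speedup, efficiency, and throughput of a k-stage linear pipeline.
k, n = 4, 64        # stages, tasks (assumed)
tau_ns = 11         # clock period in ns (assumed)

Sk = n * k / (k + n - 1)         # speedup    ≈ 3.82
Ek = n / (k + n - 1)             # efficiency ≈ 0.96
Hk = n / ((k + n - 1) * tau_ns)  # throughput in tasks per ns
print(Sk, Ek, Hk * 1e3, "Mtasks/s")
```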
Usually the number of pipeline stages would not exceed 10 in real computers. The
optimal choice of the number of pipeline stages should be able to maximize the
performance/cost ratio for the target processing load.
Let t be the time taken to execute a program on a non-pipelined processor. If the same program is executed on a k-stage pipeline, the clock period is p = (t/k) + d. Thus the pipeline has a maximum throughput of f = 1/p = 1/(t/k + d). The total pipeline cost is c + kh, where c covers the cost of all logic stages and h represents the cost of each latch. Thus the pipeline performance/cost ratio (PCR) is given by

PCR = f / (c + kh) = 1 / [(t/k + d)(c + kh)]

The PCR is maximized at the optimal number of stages

k0 = √(t·c / (d·h))
where t is the total flow-through delay of the pipeline. Thus the flow-through delay t, the total stage cost c, the latch delay d, and the latch cost h must all be considered to achieve the optimal value k0.
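A sketch of the optimal stage count, with illustrative cost and delay values (all assumed):

```python
import math

# Optimal number of pipeline stages maximizing the PCR.
t = 50.0   # total flow-through delay (assumed)
c = 20.0   # total cost of all logic stages (assumed)
d = 1.0    # latch delay (assumed)
h = 10.0   # cost per latch (assumed)

k0 = math.sqrt(t * c / (d * h))                 # = 10.0
pcr = lambda k: 1 / ((t / k + d) * (c + k * h))
print(k0, pcr(round(k0)))
```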
Besides the streamline connections from S1 to S2 and from S2 to S3, there is a feedforward connection from S1 to S3 and two feedback connections, from S3 to S2 and from S3 to S1. By following different dataflow patterns, one can use the same pipeline to evaluate different functions.
Reservation tables
In a linear pipeline the data flow is strictly linear, so the reservation table is simple. In a nonlinear pipeline the data may flow forward, backward, or skip stages, so the reservation table is more complex. Multiple reservation tables can be generated for the evaluation of different functions. Two reservation tables are given in the figure below, corresponding to a function X and a function Y, respectively. Each function evaluation is specified by one reservation table.
A static pipeline is specified by a single reservation table. A dynamic pipeline may be
specified by more than one reservation table.
The number of columns in a reservation table is called the evaluation time for the given function. For example, the function X requires eight clock cycles to evaluate, as shown in the figure.
The checkmarks in each row of the reservation table correspond to the time instants
(cycles) that a particular stage will be used. There may be multiple checkmarks in a row,
which means repeated usage of the same stage in different cycles. Multiple checkmarks
in a column mean that multiple stages need to be used in parallel during a particular
clock cycle.
Latency Analysis
A pipeline may have different initiations; the number of clock cycles between two initiations is called the latency between them.
Any attempt by two or more initiations to use the same pipeline stage at the same time is called a collision (resource conflict). Latencies that cause collisions are called forbidden latencies, as shown in the figure.
To detect a forbidden latency, one needs simply to check the distance between any two checkmarks in the same row of the reservation table. For example, the distance between the first mark and the second mark in row S1 of the table for function X is 5, implying that 5 is a forbidden latency. Thus latencies 2, 4, 5 and 7 are all seen to be forbidden for function X, and latencies 2 and 4 are forbidden for function Y.
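This check can be automated. The sketch below assumes the function-X reservation table from the figure (S1 marked in cycles 1, 6 and 8; S2 in cycles 2 and 4; S3 in cycles 3, 5 and 7 — an assumption, since the figure is not reproduced here), and it reproduces the forbidden set {2, 4, 5, 7}:

```python
from itertools import combinations

# Forbidden latencies = distances between marks in the SAME row.
reservation_x = {
    "S1": [1, 6, 8],   # assumed marks for function X
    "S2": [2, 4],
    "S3": [3, 5, 7],
}

def forbidden_latencies(table):
    forb = set()
    for marks in table.values():
        for a, b in combinations(sorted(marks), 2):
            forb.add(b - a)
    return forb

print(sorted(forbidden_latencies(reservation_x)))   # [2, 4, 5, 7]
```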
Latency Sequence: a sequence of permissible (non-forbidden) latencies between successive task initiations.
Latency Cycle: a latency sequence which repeats the same subsequence (cycle) indefinitely.
For function X, latencies 1 and 8 are both permissible, so the latency cycle (1, 8) repeats in successive initiations of new tasks. Three valid latency cycles for the evaluation of function X are shown below. As shown in figure (a) below, the latency between the initiations of X1 and X2 is 1, between X2 and X3 it is 8, and between X3 and X4 it is again 1. Hence the sequence 1, 8, 1, 8, … is applied repeatedly across all initiations.
The average latency of a latency cycle is obtained by dividing the sum of all latencies by
the number of latencies along the cycle. The latency cycle (1, 8) thus has an average latency of (1+8)/2 = 4.5. A constant cycle is a latency cycle which contains only one latency value. For example, figures 6.5(b) and 6.5(c) show constant cycles with latencies of 3 and 6, respectively.
6.2.2 Collision Free Scheduling
The main objective is to minimize the average latency and avoid collisions when scheduling events in a nonlinear pipeline.
Collision Vector
For a reservation table with n columns, the maximum forbidden latency m ≤ n − 1. The permissible latency p should be as small as possible, within the range 1 ≤ p ≤ m − 1.
The combined set of permissible and forbidden latencies can be easily displayed by a collision vector, an m-bit binary vector C = (Cm Cm−1 … C2 C1). The value of Ci = 1 if latency i causes a collision, and Ci = 0 if latency i is permissible.
Hence for the reservation tables given above the collision vector Cx = (1011010) is
obtained for function X, and Cy = (1010) for function Y.
From Cx we can immediately tell that latencies 7, 5, 4, and 2 are forbidden and latencies 6, 3, and 1 are permissible.
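A small sketch that builds the collision vector from the forbidden latencies found earlier:

```python
# Bit C_i (i = 1..m) of the collision vector is 1 iff latency i
# causes a collision; bits are written C_m ... C_1.
def collision_vector(forbidden, m):
    return "".join("1" if i in forbidden else "0" for i in range(m, 0, -1))

print(collision_vector({2, 4, 5, 7}, 7))  # 1011010  (Cx)
print(collision_vector({2, 4}, 4))        # 1010     (Cy)
```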
State Diagram
The initial collision vector is loaded into a shift register. The contents of the register are shifted right, with a 0 entering from the left end after each shift. When a 0 bit emerges from the right end after p shifts, p is a permissible latency; when a 1 bit emerges from the right end after p shifts, p is a forbidden latency.
Consider the initial collision vector CX = (1011010). The next state after p shifts is obtained by bitwise ORing the initial collision vector with the shifted register contents.
For example, starting from CX = (1011010), the state (1111111) is reached after one right shift of the register, and the state (1011011) is reached after either three shifts or six shifts. Refer to the figure given below.
The state diagram is shown below. From the initial state [1011010], only three outgoing transitions are possible, corresponding to the three permissible latencies 6, 3, and 1 in the initial collision vector. Similarly, from state [1011011], one reaches the same state after either three shifts or six shifts. When the number of shifts is m + 1 or greater, all transitions are redirected back to the initial state.
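The shift-and-OR rule can be turned into a short state-diagram generator (a minimal sketch; states are held as integers whose bits are Cm … C1, and the 7-bit formatting matches the CX example):

```python
# State-diagram generator using the shift-and-OR rule.
def state_diagram(cx_bits):
    m = len(cx_bits)
    initial = int(cx_bits, 2)
    edges, stack = {}, [initial]
    while stack:
        s = stack.pop()
        if s in edges:
            continue
        edges[s] = []
        for p in range(1, m + 1):
            if (s >> (p - 1)) & 1:   # a 1 emerges after p shifts: forbidden
                continue
            t = (s >> p) | initial   # shift right p, OR in initial vector
            edges[s].append((p, t))
            stack.append(t)
        edges[s].append((m + 1, initial))  # latencies >= m+1 restart
    return edges

for s, outs in state_diagram("1011010").items():
    print(format(s, "07b"), [(p, format(t, "07b")) for p, t in outs])
```

Running this reproduces the three states [1011010], [1111111] and [1011011] described above.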
Greedy Cycle
From the state diagram, we can determine optimal latency cycles which result in the minimum average latency (MAL). There are infinitely many latency cycles one can trace from the state diagram. For example, (1, 8), (1, 8, 6, 8), (3), (6), and (3, 8) are legitimate cycles traced from the state diagram. A simple cycle is a latency cycle in which each state appears only once. Hence for the state diagram in figure (b), the simple cycles are (3), (6), (8), (1, 8), (3, 8) and (6, 8). Some of the simple cycles are greedy cycles. A greedy cycle is one whose edges are all made with minimum latencies from their respective starting states. For example, in figure (b) the cycles (1, 8) and (3) are greedy cycles; the greedy cycles in figure (c) are (1, 5) and (3). Such cycles must first be simple, and their average latencies must be lower than those of the other simple cycles. The greedy cycle (1, 8) in figure (b) has an average latency of (1+8)/2 = 4.5, which is lower than that of the simple cycle (6, 8), whose average is (6+8)/2 = 7. The greedy cycle (3) has a constant latency of 3, which equals the MAL for evaluating function X without causing a collision. The minimum-latency edges in the state diagrams are marked with asterisks.
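A sketch of the greedy traversal (it follows only the path starting at the initial collision vector, always taking the smallest permissible latency, so for CX it finds the greedy cycle (1, 8) rather than the other greedy cycle (3)):

```python
# Greedy traversal: from each state take the smallest permissible
# latency (or m+1 if none), and stop when a state repeats.
def greedy_cycle(cx_bits):
    m = len(cx_bits)
    initial = int(cx_bits, 2)
    state, path, seen = initial, [], {}
    while state not in seen:
        seen[state] = len(path)
        p = next((i for i in range(1, m + 1)
                  if not (state >> (i - 1)) & 1), m + 1)
        path.append(p)
        state = (state >> p) | initial
    cycle = path[seen[state]:]
    return cycle, sum(cycle) / len(cycle)

print(greedy_cycle("1011010"))   # ([1, 8], 4.5)
```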
Bounds on MAL
1. The MAL is lower bounded by the maximum number of checkmarks in any row of
the reservation table.
2. The MAL is lower than or equal to the average latency of any greedy cycle in the
state diagram.
3. The average latency of any greedy cycle is upper-bounded by the number of 1's
in the initial collision vector plus 1.
In the function-X example, the maximum number of checkmarks in any row is three, so MAL ≥ 3; the greedy cycle (3) achieves this bound, hence a MAL of 3 is optimal for evaluating X.
Consider eight instructions executed in the pipeline in program order for the two statements X = Y + Z and A = B * C, as shown in the figure below.
The shaded boxes correspond to idle cycles when instruction issues are blocked due to
resource latency or conflicts or due to data dependencies. The first two load instructions
issue on consecutive cycles. The add is dependent on both loads and must wait three
cycles before the data (Y and Z) are loaded in.
Similarly, the store of the sum to memory location X must wait three cycles for the add
to finish due to a flow dependence.The total time required is 17 clock cycles. This time
is measured beginning at cycle 4 when the first instruction starts execution until cycle 20
when the last instruction starts execution.
If the original program order is not preserved and the instructions are reordered before execution, the time is reduced to 11 clock cycles, as shown in the figure below. The reordering must not change the end results.
Prefetch Buffers
1. It is used to match the instruction fetch rate to the pipeline consumption rate.
2. Three types of buffers can be used namely sequential buffers, target buffers and
loop buffers. It is shown in figure below.
3. Sequential instructions are loaded into a pair of sequential buffers for
in-sequence pipelining. Instructions from a branch target are loaded into a pair of
target buffers for out-of-sequence pipelining. Both buffers operate in a
first-in-first-out fashion.
4. The branch condition is evaluated. If the branch is taken, instructions from the target buffer are loaded into the pipeline; otherwise, instructions from the sequential buffer are used.
5. Within each pair, one can use one buffer to load instructions from memory and
use another buffer to feed instructions into the pipeline.
6. A third type of prefetch buffer is known as the loop buffer. The loop buffer operates in two steps. First, it contains instructions sequentially ahead of the current instruction, which saves instruction fetch time from memory. Second, it recognizes when the target of a branch falls within the loop boundary. In this case, unnecessary memory accesses can be avoided if the target instruction is already in the loop buffer.
The RAW hazard corresponds to the flow dependence, WAR to the anti-dependence,
and WAW to the output dependence.
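A minimal sketch of this classification, comparing the source and destination registers of an earlier instruction i and a later instruction j (register names and instruction format are illustrative):

```python
# Classify the hazard between an earlier instruction i and a later
# instruction j from their destination/source register sets.
def hazards(i_dst, i_src, j_dst, j_src):
    found = []
    if i_dst in j_src:
        found.append("RAW (flow dependence)")
    if j_dst in i_src:
        found.append("WAR (anti-dependence)")
    if i_dst == j_dst:
        found.append("WAW (output dependence)")
    return found

# add R1, R2, R3  followed by  sub R4, R1, R5  -> RAW on R1
print(hazards("R1", {"R2", "R3"}, "R4", {"R1", "R5"}))
```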
6.3.3 Dynamic Instruction Scheduling and Static Scheduling
6. The multiply instruction cannot be initiated until the preceding load is complete.
This data dependence will stall the pipeline for three clock cycles since the two
loads overlap by one cycle.
7. The two loads, since they are independent of the add and move, can be moved ahead to increase the spacing between them and the multiply instruction. The code rearrangement is shown below.
Tomasulo's Algorithm
1. It was implemented in the floating-point unit of the IBM 360/91.
2. It is a hardware-based approach to dynamic instruction scheduling.
3. It has multiple functional units, e.g. an adder and a multiplier. Each functional unit has a reservation station (RS). Instructions are executed, possibly out of order, as soon as their operands are available.
4. This scheme resolves the resource conflicts.
5. An issued instruction is forwarded to an RS if its operands are not available. It waits in the reservation station until the operands become available; once they are, it is dispatched for execution to the functional unit associated with the RS. All working registers are tagged.
6. When an instruction has completed execution, the result is broadcast on the common data bus along with its tag.
7. The registers as well as the RSs monitor the result bus (common data bus) and update their contents (and ready/busy bits) when a matching tag is found.
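A minimal sketch of the tag-matching mechanism (the data structures and the two-instruction example are simplified assumptions, not the IBM 360/91 design; timing, issue logic, and load/store handling are omitted):

```python
# Tomasulo-style tag matching: registers and reservation-station
# operands hold either a value or a tag; a broadcast on the common
# data bus fills in every location waiting on that tag.
class Operand:
    def __init__(self, value=None, tag=None):
        self.value, self.tag = value, tag   # exactly one is set
    def ready(self):
        return self.tag is None

registers = {"R1": Operand(value=10), "R2": Operand(value=3)}
stations = {}       # tag -> (op, src1, src2)

def issue(tag, op, r_src1, r_src2, r_dst):
    # Copy a value if the register is ready, else copy the producer's tag.
    def read(r):
        o = registers[r]
        return Operand(o.value) if o.ready() else Operand(tag=o.tag)
    stations[tag] = (op, read(r_src1), read(r_src2))
    registers[r_dst] = Operand(tag=tag)     # dst now waits on this RS

def broadcast(tag, value):
    # Common data bus: every register and RS operand with a matching
    # tag latches the value and becomes ready.
    for o in registers.values():
        if o.tag == tag:
            o.value, o.tag = value, None
    for _, s1, s2 in stations.values():
        for o in (s1, s2):
            if o.tag == tag:
                o.value, o.tag = value, None
    del stations[tag]

issue("ADD1", "+", "R1", "R2", "R3")   # R3 <- R1 + R2
issue("MUL1", "*", "R3", "R2", "R4")   # R4 <- R3 * R2 (waits on tag ADD1)
op, s1, s2 = stations["ADD1"]
broadcast("ADD1", s1.value + s2.value) # ADD1 completes; MUL1 becomes ready
print(registers["R3"].value)                         # 13
print(all(o.ready() for o in stations["MUL1"][1:]))  # True
```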
9. The tightest upper bound on the effective pipeline throughput is obtained when b = k − 1 and n → ∞:

H* = f / (pqb + 1)

where p is the probability that an instruction is a branch, q the probability that a branch is taken, and b the branch penalty in cycles.
10. Suppose p = 0.2, q = 0.6, and b = k − 1 = 7. We define the performance degradation factor

D = (f − H*) / f = pqb / (pqb + 1) = 0.84/1.84 ≈ 0.46

The above analysis implies that the pipeline performance can be degraded by 46% with branching.
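A numeric check of this example (k = 8 is implied by b = k − 1 = 7):

```python
# Degradation of pipeline throughput due to branching.
p, q, k = 0.2, 0.6, 8
b = k - 1                       # worst-case branch penalty
pqb = p * q * b                 # 0.84
H_star_over_f = 1 / (pqb + 1)   # ≈ 0.543
D = pqb / (pqb + 1)             # ≈ 0.457 -> about 46% degradation
print(pqb, round(H_star_over_f, 3), round(D, 3))
```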