Pipelining

The document discusses pipelining in computer architecture, illustrating its benefits through a laundry example and detailing the RISC instruction set's 5-stage pipeline. It covers various types of hazards that can occur during pipelining, including structural, data, and control hazards, and explains how these can impact performance. Additionally, it highlights the importance of throughput over latency and the need for careful design to mitigate hazards.

Uploaded by

Sangita Dutta
Copyright
© © All Rights Reserved

Pipelining & Hazards

Pipelining: It's Natural!
🞕 Laundry Example
🞕 Ann, Brian, Cathy, and Dave (A, B, C, D) each have one load of clothes to wash, dry, and fold
 Washer takes 30 minutes
 Dryer takes 40 minutes
 “Folder” takes 20 minutes

🞕 One load: 90 minutes


Sequential Laundry
[Figure: timeline from 6 PM to midnight; loads A, B, C, D in task order, each taking 30 + 40 + 20 minutes, run back to back]

🞕 Sequential laundry takes 6 hours for 4 loads


🞕 If they learned pipelining, how long would laundry take?
Pipelined Laundry: Start Work ASAP
[Figure: timeline from 6 PM to midnight; loads A, B, C, D in task order, overlapped, with segments of 30, 40, 40, 40, 40, 20 minutes]

🞕 Pipelined laundry takes 3.5 hours for 4 loads
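
The two laundry totals can be checked with a small sketch. It uses the stage times from the example; the function names are illustrative, and the model assumes a load may simply wait until the next machine is free.

```python
# Stage times in minutes, taken from the laundry example.
STAGES = [30, 40, 20]  # washer, dryer, folder


def sequential_minutes(loads, stages=STAGES):
    """Each load finishes every stage before the next load starts."""
    return loads * sum(stages)


def pipelined_minutes(loads, stages=STAGES):
    """Loads overlap: a load enters a stage once both it and the stage are free."""
    stage_free_at = [0] * len(stages)
    for _ in range(loads):
        t = 0  # time at which this load leaves the previous stage
        for s, duration in enumerate(stages):
            start = max(t, stage_free_at[s])
            t = start + duration
            stage_free_at[s] = t
    return stage_free_at[-1]


print(sequential_minutes(4) / 60)  # hours for 4 loads, one after another
print(pipelined_minutes(4) / 60)   # hours for 4 loads, overlapped
```

The pipelined total is 30 + 4 × 40 + 20 minutes: the 40-minute dryer is the slowest stage and sets the rate.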


RISC Instruction Set
🞕 Every instruction can be implemented in at most 5 clock cycles (stages):
 Instruction fetch cycle (IF): send the PC to memory, fetch the current instruction from memory, and update the PC to the next sequential PC by adding 4 to it.
 Instruction decode/register fetch cycle (ID): decode the instruction and read the registers corresponding to the register source specifiers from the register file.
 Execution/effective address cycle (EX): the ALU computes the effective memory address for a load/store, or performs the operation for a register-register or register-immediate ALU instruction.
 Memory access (MEM): perform the memory access for load/store instructions.
 Write-back cycle (WB): write the result into the destination register for a register-register ALU instruction or a load instruction.
Classic 5-Stage Pipeline for a RISC
🞕 Each cycle the hardware initiates a new instruction and executes some part of five different instructions.
 Simple;
 However, we must ensure that the overlap of instructions in the pipeline cannot cause a conflict (also called a hazard).

                   Clock number
Instruction number  1    2    3    4    5    6    7    8    9
Instruction i       IF   ID   EX   MEM  WB
Instruction i+1          IF   ID   EX   MEM  WB
Instruction i+2               IF   ID   EX   MEM  WB
Instruction i+3                    IF   ID   EX   MEM  WB
Instruction i+4                         IF   ID   EX   MEM  WB
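
A minimal sketch that regenerates this staggered diagram, assuming an ideal pipeline with no stalls. The function name `pipeline_diagram` is illustrative.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_diagram(num_instructions, stages=STAGES):
    """Return one row per instruction; row[c] is the stage occupied in clock c + 1."""
    total_cycles = num_instructions + len(stages) - 1
    rows = []
    for i in range(num_instructions):
        row = [""] * total_cycles
        for s, name in enumerate(stages):
            row[i + s] = name  # instruction i enters stage s in clock i + s + 1
        rows.append(row)
    return rows


for i, row in enumerate(pipeline_diagram(5)):
    print(f"i+{i}: " + " ".join(f"{cell:<3}" for cell in row))
```

Five instructions finish in 5 + 5 - 1 = 9 clocks, and in clock 5 every stage is busy with a different instruction.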
Computer Pipelines
🞕 Pipeline properties
 Execute billions of instructions, so throughput is what matters.
 Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload;
 Pipeline rate is limited by the slowest pipeline stage;
 Multiple tasks operate simultaneously;
 Potential speedup = Number of pipe stages;
 Unbalanced lengths of pipe stages reduce speedup;
 Time to “fill” the pipeline and time to “drain” it reduce speedup.
🞕 The time per instruction on the pipelined processor in ideal conditions is equal to
$$\text{Time per instruction}_{\text{pipelined}} = \frac{\text{Time per instruction on unpipelined machine}}{\text{Number of pipe stages}}$$
† However, the stages may not be perfectly balanced.
† Pipelining yields a reduction in the average execution time per instruction.
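Review: Components of a Computer
[Figure: block diagram. Processor, containing Control and a Datapath with Program Counter (PC), Registers, and Arithmetic & Logic Unit (ALU); Memory, holding Program and Data bytes; Input and Output. The Processor-Memory Interface carries Enable?, Read/Write, Address, Write Data, and Read Data; Input and Output connect through the I/O-Memory Interfaces.]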
CPU and Datapath vs Control

🞕 Datapath: Storage, FU, interconnect sufficient to perform the desired functions
 Inputs are Control Points
 Outputs are signals
🞕 Controller: State machine to orchestrate operation on the data path
 Based on desired function and signals
Making RISC Pipelining Real
🞕 Function units are used in different cycles
 Hence we can overlap the execution of multiple instructions
🞕 Important things to make it real
 Separate instruction and data memories, e.g. I-cache and D-cache, banking
» Eliminates the conflict for access to a single memory.
 The register file is used in two stages (two reads and one write every cycle)
» Read from the register file in ID (second half of CC), and write to it in WB (first half of CC).
 PC
» Increment and store the PC every clock, during the IF stage.
» A branch does not change the PC until the ID stage (an adder computes the potential branch target).
 Staging data between pipeline stages
» Pipeline registers
Pipeline Datapath
🞕 Register files in ID and WB stage
 Read from register in ID (second half of CC), and write to
register in WB (first half of CC).
🞕 IM and DM
Pipelining Performance (1/2)
🞕 Pipelining increases throughput; it does not reduce the execution time of an individual instruction.
 In fact, it slightly increases the execution time of each instruction due to overhead in the control of the pipeline.
 The practical depth of a pipeline is limited by this increase in execution time.
🞕 Pipeline overhead
 Unbalanced pipeline stage;
 Pipeline stage overhead;
 Pipeline register delay;
 Clock skew.
Processor Performance

$$\text{CPU Time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}}$$
🞕 Instructions per program depends on source code, compiler technology, and ISA
🞕 Cycles per instruction (CPI) depends on ISA and µarchitecture
🞕 Time per cycle depends upon the µarchitecture and base technology
CPI for Different Instructions
[Figure: timeline of three instructions: Inst 1 takes 7 cycles, Inst 2 takes 5 cycles, Inst 3 takes 10 cycles]

Total clock cycles = 7 + 5 + 10 = 22
Total instructions = 3
CPI = 22/3 = 7.33

CPI is always an average over a large number of instructions
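
A minimal sketch tying this CPI example to the CPU time equation. The 1 ns cycle time is an assumed value for illustration, and the helper names are hypothetical.

```python
def average_cpi(cycles_per_instruction):
    """Average cycles per instruction over a list of per-instruction cycle counts."""
    return sum(cycles_per_instruction) / len(cycles_per_instruction)


def cpu_time_ns(instruction_count, cpi, cycle_time_ns):
    """CPU time = instructions x cycles per instruction x time per cycle."""
    return instruction_count * cpi * cycle_time_ns


cycles = [7, 5, 10]          # Inst 1, Inst 2, Inst 3
cpi = average_cpi(cycles)
print(round(cpi, 2))         # CPI of the three-instruction example
print(cpu_time_ns(len(cycles), cpi, cycle_time_ns=1.0))  # assumed 1 ns clock
```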
Pipeline Performance (2/2)
🞕 Example 1 (p.C-10): Consider the unpipelined processor in the previous section. Assume that it has a 1 ns clock cycle and that it uses 4 cycles for ALU operations and branches, and 5 cycles for memory operations. Assume that the relative frequencies of these operations are 40%, 20%, and 40%, respectively. Suppose that due to clock skew and setup, pipelining the processor adds 0.2 ns of overhead to the clock. Ignoring any latency impact, how much speedup in the instruction execution rate will we gain from a pipeline?
🞕 Answer
The average instruction execution time on the unpipelined processor is

$$\text{Average instruction execution time} = \text{Clock cycle} \times \text{Average CPI} = 1\,\text{ns} \times \big((40\% + 20\%) \times 4 + 40\% \times 5\big) = 1\,\text{ns} \times 4.4 = 4.4\,\text{ns}$$

$$\text{Speedup from pipelining} = \frac{\text{Average instruction time unpipelined}}{\text{Average instruction time pipelined}} = \frac{4.4\,\text{ns}}{1.2\,\text{ns}} = 3.7 \text{ times}$$
† In the pipeline, the clock must run at the speed of the slowest stage
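
The arithmetic of Example 1 can be reproduced with a short sketch. The variable and function names are illustrative; the numbers are the ones given in the example.

```python
def average_cpi(mix):
    """mix: list of (relative frequency, cycles) pairs whose frequencies sum to 1."""
    return sum(freq * cycles for freq, cycles in mix)


# ALU operations 40% at 4 cycles, branches 20% at 4 cycles, memory operations 40% at 5 cycles.
mix = [(0.40, 4), (0.20, 4), (0.40, 5)]

unpipelined_ns = 1.0 * average_cpi(mix)  # 1 ns clock x average CPI
pipelined_ns = 1.0 + 0.2                 # one instruction per clock, plus 0.2 ns overhead
speedup = unpipelined_ns / pipelined_ns

print(round(unpipelined_ns, 2), round(speedup, 1))
```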
Performance with Pipeline Stall (1/2)

$$\text{Speedup from pipelining} = \frac{\text{Average instruction time unpipelined}}{\text{Average instruction time pipelined}} = \frac{\text{CPI unpipelined} \times \text{Clock cycle unpipelined}}{\text{CPI pipelined} \times \text{Clock cycle pipelined}} = \frac{\text{CPI unpipelined}}{\text{CPI pipelined}} \times \frac{\text{Clock cycle unpipelined}}{\text{Clock cycle pipelined}}$$

$$\text{CPI pipelined} = \text{Ideal CPI} + \text{Pipeline stall clock cycles per instruction} = 1 + \text{Pipeline stall clock cycles per instruction}$$
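Performance with Pipeline Stall (2/2)

Taking the CPI of the unpipelined processor to be 1:

$$\text{Speedup from pipelining} = \frac{\text{CPI unpipelined} \times \text{Clock cycle unpipelined}}{\text{CPI pipelined} \times \text{Clock cycle pipelined}} = \frac{1}{1 + \text{Pipeline stall clock cycles per instruction}} \times \frac{\text{Clock cycle unpipelined}}{\text{Clock cycle pipelined}}$$

In the ideal case (perfectly balanced stages, no pipelining overhead):

$$\text{Clock cycle pipelined} = \frac{\text{Clock cycle unpipelined}}{\text{Pipeline depth}} \quad\Rightarrow\quad \text{Pipeline depth} = \frac{\text{Clock cycle unpipelined}}{\text{Clock cycle pipelined}}$$

$$\text{Speedup from pipelining} = \frac{1}{1 + \text{Pipeline stall clock cycles per instruction}} \times \frac{\text{Clock cycle unpipelined}}{\text{Clock cycle pipelined}} = \frac{1}{1 + \text{Pipeline stall clock cycles per instruction}} \times \text{Pipeline depth}$$

Pipelining speedup is proportional to the pipeline depth and 1/(1 + stall cycles)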
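
As a quick numeric check of this relation, here is a minimal sketch. The function name and the sample stall rates are assumptions chosen for illustration.

```python
def pipeline_speedup(depth, stall_cycles_per_instruction):
    """Speedup = pipeline depth / (1 + pipeline stall clock cycles per instruction)."""
    return depth / (1.0 + stall_cycles_per_instruction)


# A 5-stage pipeline with no stalls reaches the full depth.
print(pipeline_speedup(5, 0.0))
# With an assumed 0.25 stall cycles per instruction the speedup drops.
print(pipeline_speedup(5, 0.25))
```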
Pipeline Hazards
🞕 Hazards: situations that prevent the next instruction in the instruction stream from executing during its designated clock cycle.
 Structural hazards: resource conflict, e.g. using the same unit
 Data hazards: an instruction depends on the results of a
previous instruction
 Control hazards: arise from the pipelining of branches
and other instructions that change the PC.
🞕 Hazards in pipelines can make it necessary to stall the
pipeline.
 Stall will reduce pipeline performance.
Structure Hazards
🞕 Structure Hazards
 Arise when some combination of instructions cannot be accommodated because of a resource conflict (overlapped execution requires pipelined functional units and duplicated resources).
» Occur when some functional unit is not fully pipelined, or
» When a resource has not been duplicated enough.
One Memory Port/Structural Hazards
[Figure: pipeline diagram, time in clock cycles (Cycle 1 to Cycle 7) against instruction order (Load, Instr 1, Instr 2, Instr 3, Instr 4). Each instruction passes through Ifetch, Reg, ALU, DMem, Reg, starting one cycle after its predecessor. In Cycle 4 the Load's DMem access and Instr 3's Ifetch both need the single memory port.]
One Memory Port/Structural Hazards
[Figure: the same pipeline diagram (Cycle 1 to Cycle 7; Load, Instr 1, Instr 2) with a Stall row of bubbles inserted after Instr 2, so that Instr 3 starts its Ifetch one cycle later.]
How do you “bubble” the pipe?
Performance on Structure Hazard
🞕 Example 2 (p.C-14): Let’s see how much the load structure hazard might cost. Suppose that data references constitute 40% of the mix, and that the ideal CPI of the pipelined processor, ignoring the structure hazard, is 1. Assume that the processor with the structure hazard has a clock rate that is 1.05 times higher than the clock rate of the processor without the hazard. Disregarding any other performance losses, is the pipeline with or without the structure hazard faster, and by how much?
🞕 Answer
The average instruction time on the processor without the structure hazard is
$$\text{Average instruction time}_{\text{ideal}} = \text{CPI} \times \text{Clock cycle time}_{\text{ideal}} = 1 \times \text{Clock cycle time}_{\text{ideal}}$$
The average instruction time on the processor with the structure hazard is
$$\text{Average instruction time}_{\text{structure hazard}} = \text{CPI} \times \text{Clock cycle time} = (1 + 0.4 \times 1) \times \frac{\text{Clock cycle time}_{\text{ideal}}}{1.05} \approx 1.3 \times \text{Clock cycle time}_{\text{ideal}}$$
so the processor without the structure hazard is about 1.3 times faster.
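
A minimal sketch of the Example 2 calculation. The function and parameter names are illustrative; the values come from the example.

```python
def relative_instruction_time(data_ref_fraction, stall_cycles, clock_rate_ratio):
    """Average instruction time with the hazard, in units of the ideal clock cycle time.

    CPI grows by one stall per data reference, while the clock cycle shrinks
    by the clock-rate ratio of the processor that has the hazard.
    """
    cpi = 1.0 + data_ref_fraction * stall_cycles
    return cpi / clock_rate_ratio


ratio = relative_instruction_time(data_ref_fraction=0.4, stall_cycles=1, clock_rate_ratio=1.05)
print(round(ratio, 2))  # values above 1 mean the processor without the hazard is faster
```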
Summary of Structure Hazard
🞕 To avoid this structure hazard, the designer could provide a separate memory access for instructions.
 Splitting the cache into separate instruction and data caches, or
 Using a set of buffers, usually called instruction buffers, to hold instructions.
🞕 However, this increases cost.
 Ex1: pipelining functional units or duplicating resources is costly;
 Ex2: supporting both an instruction and a data cache access every cycle requires twice the memory bandwidth, and often higher bandwidth at the pins;
 Ex3: a floating-point multiplier consumes lots of gates.

† If the structure hazard is rare, it may not be worth the cost to avoid it.
Data Hazards
🞕 Data Hazards
 Occur when the pipeline changes the order of read/write accesses to operands so that the order differs from the order seen by sequentially executing instructions on an unpipelined processor.
 An example of pipelined execution

DADD R1, R2, R3
DSUB R4, R1, R5
AND

Three Generic Data Hazards (1/3)
🞕 Read After Write (RAW)
 InstrJ tries to read operand before InstrI writes it

I: ADD R1,R2,R3
J: SUB R4,R1,R3

🞕 Caused by a “true dependence” (in compiler nomenclature). This hazard results from an actual need for communication.
Three Generic Data Hazards (2/3)
🞕 Write After Read (WAR)
 InstrJ writes operand before InstrI reads it

I: SUB R4,R1,R3
J: ADD R1,R2,R3
K: MUL R6,R1,R7

🞕 Called an “anti-dependence” by compiler writers. This results from reuse of the name “R1”.
🞕 Can’t happen in the MIPS 5-stage pipeline because:
 All instructions take 5 stages, and
 Reads are always in stage 2, and
 Writes are always in stage 5

Three Generic Data Hazards (3/3)
🞕 Write After Write (WAW)
 InstrJ writes operand before InstrI writes it.
useless

I: SUB R1,R4,R3
J: ADD R1,R2,R3
K: MUL R6,R1,R7

🞕 This hazard also results from the reuse of the name “R1”
🞕 Hazard when writes occur in the wrong order
🞕 Can’t happen in our basic 5-stage pipeline because:
 All writes are ordered and take place in stage 5
🞕 WAR and WAW hazards occur in complex pipelines
🞕 Notice that Read After Read – RAR is NOT a hazard
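
The three definitions can be sketched as a small classifier over register names. This is an illustrative model only: the `Instr` tuple and `classify_dependences` are hypothetical names, and the result says which dependences exist between two instructions, not whether a given pipeline actually turns them into a hazard.

```python
from collections import namedtuple

# op: mnemonic, dest: destination register (or None), srcs: tuple of source registers.
Instr = namedtuple("Instr", ["op", "dest", "srcs"])


def classify_dependences(i, j):
    """Dependences from earlier instruction i to later instruction j."""
    kinds = set()
    if i.dest is not None and i.dest in j.srcs:
        kinds.add("RAW")  # j reads what i writes
    if j.dest is not None and j.dest in i.srcs:
        kinds.add("WAR")  # j writes what i reads
    if i.dest is not None and i.dest == j.dest:
        kinds.add("WAW")  # both write the same register
    return kinds


# The I/J pairs from the three slides.
print(classify_dependences(Instr("ADD", "R1", ("R2", "R3")), Instr("SUB", "R4", ("R1", "R3"))))
print(classify_dependences(Instr("SUB", "R4", ("R1", "R3")), Instr("ADD", "R1", ("R2", "R3"))))
print(classify_dependences(Instr("SUB", "R1", ("R4", "R3")), Instr("ADD", "R1", ("R2", "R3"))))
```

Two instructions that only read the same register produce an empty set, matching the note that RAR is not a hazard.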
#2: Forwarding (aka bypassing) to Avoid Data Hazard
[Figure: pipeline diagram, time in clock cycles against instruction order: DADD R1,R2,R3; DSUB R4,R1,R5; AND R6,R1,R7; OR R8,R1,R9; XOR. Each instruction passes through Ifetch, Reg, ALU, DMem, Reg; the pipeline register is labeled.]
Another Example of a RAW Data Hazard
🞕 Result of sub is needed by and, or, add, & sw instructions
🞕 Instructions and & or will read old value of r2 from reg file
🞕 During CC5, r2 is written and read – new value is read

Program Execution Order
Time (cycles)     CC1    CC2    CC3    CC4    CC5    CC6    CC7    CC8
value of r2       10     10     10     10     10/20  20     20     20
sub r2, r1, r3    IM     Reg    ALU    DM     Reg
and r4, r2, r5           IM     Reg    ALU    DM     Reg
or  r6, r3, r2                  IM     Reg    ALU    DM     Reg
add r7, r2, r2                         IM     Reg    ALU    DM     Reg
sw  r8, 10(r2)                                IM     Reg    ALU    DM
Solution #1: Stalling the Pipeline
🞕 The and instruction cannot fetch r2 until CC5
 The and instruction remains in the IF/ID register until CC5
🞕 Two bubbles are inserted into ID/EX at end of CC3 & CC4
 Bubbles are NOP instructions: do not modify registers or memory
 Bubbles delay instruction execution and waste clock cycles

Instruction Order
Time (in cycles)  CC1    CC2    CC3    CC4    CC5    CC6    CC7    CC8
value of r2       10     10     10     10     10/20  20     20     20
sub r2, r1, r3    IM     Reg    ALU    DM     Reg
and r4, r2, r5           IM     bubble bubble Reg    ALU    DM     Reg
or  r6, r3, r2                                IM     Reg    ALU    DM
Solution #2: Forwarding ALU Result
🞕 The ALU result is forwarded (fed back) to the ALU input
 No bubbles are inserted into the pipeline and no cycles are wasted
🞕 ALU result exists in either EX/MEM or MEM/WB register

Program Execution Order
Time (in cycles)  CC1    CC2    CC3    CC4    CC5    CC6    CC7    CC8
sub r2, r1, r3    IM     Reg    ALU    DM     Reg
and r4, r2, r5           IM     Reg    ALU    DM     Reg
or  r6, r3, r2                  IM     Reg    ALU    DM     Reg
add r7, r2, r2                         IM     Reg    ALU    DM     Reg
sw  r8, 10(r2)                                IM     Reg    ALU    DM
Double Data Hazard
🞕 Consider the sequence:
add r1,r1,r2
sub r1,r1,r3
and r1,r1,r4
🞕 Both hazards occur
 Want to use the most recent
 When executing AND, forward result of SUB
» ForwardA = 01 (from the EX/MEM pipe stage)
Data Hazard Even with Forwarding
[Figure: pipeline diagram, time in clock cycles against instruction order: LD R1,0(R2); DSUB R4,R1,R6; DAND R6,R1,R7; OR R8,R1,R9. Each instruction passes through Ifetch, Reg, ALU, DMem, Reg, starting one cycle after its predecessor.]
Data Hazard Even with Forwarding
[Figure: the same pipeline diagram (LD R1,0(R2); DSUB R4,R1,R6; AND R6,R1,R7; OR R8,R1,R9) with one bubble inserted into each instruction after the LD, delaying each of them by one cycle.]
How is this detected?
Load Delay
🞕 Not all RAW data hazards can be forwarded
 Load has a delay that cannot be eliminated by forwarding
🞕 In the example shown below …
 The LW instruction does not have data until end of CC4
 AND wants data at beginning of CC4 - NOT possible
🞕 However, load can forward data to the second next instruction

Program Order
Time (cycles)     CC1    CC2    CC3    CC4    CC5    CC6    CC7    CC8
lw  r2, 20(r1)    IF     Reg    ALU    DM     Reg
and r4, r2, r5           IF     Reg    ALU    DM     Reg
or  r6, r3, r2                  IF     Reg    ALU    DM     Reg
add r7, r2, r2                         IF     Reg    ALU    DM     Reg
Stall the Pipeline for one Cycle
🞕 Freeze the PC and the IF/ID registers
 No new instruction is fetched and instruction after load is stalled
🞕 Allow the Load in ID/EX register to proceed
🞕 Introduce a bubble into the ID/EX register
🞕 Load can forward data after stalling next instruction

Program Order
Time (cycles)     CC1    CC2    CC3    CC4    CC5    CC6    CC7    CC8
lw  r2, 20(r1)    IM     Reg    ALU    DM     Reg
and r4, r2, r5           IM     bubble Reg    ALU    DM     Reg
or  r6, r3, r2                         IM     Reg    ALU    DM     Reg
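
One common way to decide when this one-cycle stall is needed is the textbook load-use check: stall when the instruction in ID/EX is a load whose destination matches a source of the instruction in IF/ID. The sketch below assumes that check; the signal names `ID/EX.MemRead`, `IF/ID.RegisterRs`, and `IF/ID.RegisterRt` follow the naming style of these slides but do not appear in them.

```python
def load_use_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """Return True when the pipeline should stall for one cycle.

    id_ex_mem_read: the instruction in ID/EX is a load (assumed MemRead control signal).
    id_ex_rt:       destination register number of that load.
    if_id_rs/rt:    source register numbers of the instruction waiting in IF/ID.
    """
    return bool(id_ex_mem_read and id_ex_rt in (if_id_rs, if_id_rt))


# lw r2, 20(r1) in ID/EX followed by and r4, r2, r5 in IF/ID: stall.
print(load_use_stall(True, 2, 2, 5))
# The same register pattern after a non-load instruction: forwarding suffices.
print(load_use_stall(False, 2, 2, 5))
```

When the check fires, the PC and IF/ID are frozen and a bubble goes into ID/EX, as listed in the bullets.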
Forwarding to Avoid LW-SW Data Hazard
[Figure: pipeline diagram, time in clock cycles against instruction order: DADD R1,R2,R3; LD R4,0(R1); SD R4,12(R1); OR R8,R6,R9; XOR R10,R9,R11. Each instruction passes through Ifetch, Reg, ALU, DMem, Reg, starting one cycle after its predecessor.]
Detecting RAW Hazards
🞕 Pass register numbers along pipeline
 ID/EX.RegisterRs = register number for Rs in ID/EX
 ID/EX.RegisterRt = register number for Rt in ID/EX
 ID/EX.RegisterRd = register number for Rd in ID/EX
🞕 Current instruction being executed is in the ID/EX register
🞕 Previous instruction is in the EX/MEM register
🞕 Second previous is in the MEM/WB register
🞕 RAW data hazards when
1a. EX/MEM.RegisterRd = ID/EX.RegisterRs
1b. EX/MEM.RegisterRd = ID/EX.RegisterRt
(1a, 1b: Fwd from EX/MEM pipeline reg)
2a. MEM/WB.RegisterRd = ID/EX.RegisterRs
2b. MEM/WB.RegisterRd = ID/EX.RegisterRt
(2a, 2b: Fwd from MEM/WB pipeline reg)
Detecting the Need to Forward
🞕 But only if forwarding instruction will write to a
register!
 EX/MEM.RegWrite, MEM/WB.RegWrite
🞕 And only if Rd for that instruction is not R0
 EX/MEM.RegisterRd ≠ 0
 MEM/WB.RegisterRd ≠ 0
Forwarding Conditions
🞕 Detecting RAW hazard with Previous Instruction
 if (EX/MEM.RegWrite and
    (EX/MEM.RegisterRd ≠ 0) and
    (EX/MEM.RegisterRd = ID/EX.RegisterRs))
        ForwardA = 01 (Forward from EX/MEM pipe stage)
 if (EX/MEM.RegWrite and
    (EX/MEM.RegisterRd ≠ 0) and
    (EX/MEM.RegisterRd = ID/EX.RegisterRt))
        ForwardB = 01 (Forward from EX/MEM pipe stage)
🞕 Detecting RAW hazard with Second Previous
 if (MEM/WB.RegWrite and
    (MEM/WB.RegisterRd ≠ 0) and
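
The conditions above can be sketched as a small forwarding-select function. The code `01` for EX/MEM comes from the slides. The codes `10` for MEM/WB and `00` for "no forwarding" are assumptions for illustration, as are the function and key names. Checking EX/MEM first gives the most recent result priority, which is what the double data hazard case needs.

```python
def forward_select(ex_mem, mem_wb, src_reg):
    """Pick the ALU input source for one operand of the instruction in ID/EX.

    ex_mem, mem_wb: dicts with 'reg_write' (bool) and 'rd' (register number).
    src_reg: ID/EX.RegisterRs (for ForwardA) or ID/EX.RegisterRt (for ForwardB).
    """
    if ex_mem["reg_write"] and ex_mem["rd"] != 0 and ex_mem["rd"] == src_reg:
        return "01"  # forward from EX/MEM (previous instruction)
    if mem_wb["reg_write"] and mem_wb["rd"] != 0 and mem_wb["rd"] == src_reg:
        return "10"  # assumed code: forward from MEM/WB (second previous instruction)
    return "00"      # assumed code: no forwarding, use the register file value


# Double data hazard: add r1,r1,r2 is in MEM/WB and sub r1,r1,r3 is in EX/MEM
# while and r1,r1,r4 is in ID/EX; the result of sub must win.
sub_in_ex_mem = {"reg_write": True, "rd": 1}
add_in_mem_wb = {"reg_write": True, "rd": 1}
print(forward_select(sub_in_ex_mem, add_in_mem_wb, src_reg=1))
```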
Control Hazard on Branches: Three Stage Stall

[Figure: pipeline diagram for the instruction sequence 10: BEQ R1,R3,36; 14: AND R2,R3,R5; 18: OR R6,R1,R7; 22: ADD R8,R1,R9; 36: XOR R10,R1,R11. Each instruction passes through Ifetch, Reg, ALU, DMem, Reg, starting one cycle after its predecessor.]

What do you do with the 3 instructions in between?


How do you do it?
Where is the “commit”?
Branch/Control Hazards
🞕 Branch instructions can cause great performance loss
🞕 Branch instructions need two things:
 Branch Result: Taken or Not Taken
 Branch Target
» PC + 4 if Branch is NOT taken
» PC + 4 + 4 × imm if Branch is Taken
🞕 For our pipeline: 3-cycle branch delay
 PC is updated 3 cycles after fetching branch instruction
 Branch target address is calculated in the ALU stage
 Branch result is also computed in the ALU stage
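
A minimal sketch of the two quantities on this slide: the next PC for a branch, and the CPI cost of the 3-cycle branch delay. The 20% branch frequency is an assumed figure for illustration, and the cost model assumes every branch pays the full delay.

```python
def next_pc(pc, imm, taken):
    """PC + 4 if the branch is not taken, PC + 4 + 4 * imm if it is taken."""
    return pc + 4 + 4 * imm if taken else pc + 4


def cpi_with_branch_stalls(branch_fraction, branch_delay=3, ideal_cpi=1.0):
    """Ideal CPI plus the stall cycles contributed by branches."""
    return ideal_cpi + branch_fraction * branch_delay


print(next_pc(100, 5, taken=False))
print(next_pc(100, 5, taken=True))
print(round(cpi_with_branch_stalls(0.2), 2))  # assumed 20% branches
```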
Contents
1. Pipelining Introduction
2. The Major Hurdle of Pipelining—Pipeline Hazards
3. RISC-V ISA and its Implementation

Reading:
 Textbook: Appendix C
 RISC-V ISA
 Chisel Tutorial
