Pipelining
Pipelining: It’s Natural!
🞕 Laundry Example
🞕 Ann, Brian, Cathy, Dave each have one load of clothes to
wash, dry, and fold
Washer takes 30 minutes
Dryer takes 40 minutes
“Folder” takes 20 minutes
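[Figure: sequential laundry timeline. Loads A, B, C, D in task order, each taking 30 + 40 + 20 minutes back to back]
Sequential laundry takes 6 hours for 4 loads
[Figure: pipelined laundry timeline. Loads A, B, C, D overlap, with time segments of 30, 40, 40, 40, 40, 20 minutes]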
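The laundry timings can be checked with a small simulation. This is a minimal sketch, assuming each machine handles one load at a time and a finished load may wait until the next machine is free; the function names are illustrative and not part of the slides.

```python
def sequential_finish_time(stage_minutes, loads):
    """Total minutes when each load finishes all stages before the next starts."""
    return loads * sum(stage_minutes)


def pipelined_finish_time(stage_minutes, loads):
    """Total minutes when loads overlap, one load per stage at a time."""
    stage_free_at = [0] * len(stage_minutes)  # when each stage next becomes free
    finish = 0
    for _ in range(loads):
        ready = 0  # time at which this load is ready for its next stage
        for s, minutes in enumerate(stage_minutes):
            start = max(ready, stage_free_at[s])  # wait for the load and the machine
            ready = start + minutes
            stage_free_at[s] = ready
        finish = ready
    return finish


stages = [30, 40, 20]  # washer, dryer, "folder"
print(sequential_finish_time(stages, 4))  # 360 minutes = 6 hours
print(pipelined_finish_time(stages, 4))   # 210 minutes = 3.5 hours
```

The pipelined total is 30 + 4 × 40 + 20 minutes: once the pipeline is full, the 40-minute dryer sets the pace.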
                    Clock number
Instruction number  1    2    3    4    5    6    7    8    9
Instruction i       IF   ID   EX   MEM  WB
Instruction i+1          IF   ID   EX   MEM  WB
Instruction i+2               IF   ID   EX   MEM  WB
Instruction i+3                    IF   ID   EX   MEM  WB
Instruction i+4                         IF   ID   EX   MEM  WB
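The stage occupancy in the table above follows a simple rule: instruction i+n enters stage s in clock n + s. A short sketch that regenerates the rows, assuming one cycle per stage and no stalls (the helper name `pipeline_rows` is hypothetical):

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_rows(num_instructions, stages=STAGES):
    """Return one row per instruction; each row lists the stage held in each clock."""
    total_cycles = num_instructions + len(stages) - 1
    rows = []
    for n in range(num_instructions):
        row = [""] * total_cycles
        for s, name in enumerate(stages):
            row[n + s] = name  # instruction n is in stage s during clock n + s
        rows.append(row)
    return rows


rows = pipeline_rows(5)
print(len(rows[0]))  # 9 clocks for 5 instructions in a 5-stage pipeline
for n, row in enumerate(rows):
    print(f"i+{n}: " + " ".join(f"{cell:<3}" for cell in row))
```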
Computer Pipelines
🞕 Pipeline properties
Processors execute billions of instructions, so throughput is what matters.
Pipelining doesn’t help the latency of a single task; it helps the throughput
of the entire workload;
Pipeline rate is limited by the slowest pipeline stage;
Multiple tasks operate simultaneously, each in a different stage;
Potential speedup = number of pipe stages;
Unbalanced lengths of pipe stages reduce speedup;
Time to “fill” the pipeline and time to “drain” it reduce speedup.
🞕 The time per instruction on the pipelined processor in ideal
conditions is equal to
$$\text{Time per instruction}_{\text{pipelined}} = \frac{\text{Time per instruction on unpipelined machine}}{\text{Number of pipe stages}}$$
† However, the stages may not be perfectly balanced.
† Pipelining yields a reduction in the average execution time per
instruction.
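The effect of fill and drain time on speedup can be sketched numerically. The model below is an assumption for illustration: stages are perfectly balanced, each takes one cycle, and the unpipelined machine spends one cycle per stage on every instruction.

```python
def pipeline_speedup(num_instructions, num_stages):
    """Speedup of a balanced pipeline over an unpipelined machine."""
    unpipelined_cycles = num_instructions * num_stages
    # The first instruction needs num_stages cycles to fill the pipeline;
    # after that one instruction completes per cycle.
    pipelined_cycles = num_stages + (num_instructions - 1)
    return unpipelined_cycles / pipelined_cycles


print(round(pipeline_speedup(5, 5), 2))          # 2.78
print(round(pipeline_speedup(1_000_000, 5), 2))  # 5.0
```

For short instruction sequences the fill and drain cycles dominate; for long ones the speedup approaches the number of pipe stages.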
Review: Components of a Computer
[Figure: block diagram. Processor (Control, Datapath, Program Counter (PC)) connected to Memory (Program, Bytes) through Enable?, Read/Write, and Address signals; Input]
CPU Time
$$\frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}}$$
🞕 Instructions per program depends on source code,
compiler technology, and ISA
🞕 Cycles per instruction (CPI) depends on ISA
and µarchitecture
🞕 Time per cycle depends upon the µarchitecture and base
technology
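CPI for Different Instructions
[Figure: timeline of three instructions executed back to back. Inst 1 takes 7 cycles, Inst 2 takes 5 cycles, Inst 3 takes 10 cycles, for 22 cycles in total]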
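The cycle counts above give an average CPI, which plugs into the CPU time relation. A minimal sketch; the 1 ns cycle time is an assumed value for illustration, not something this slide states.

```python
def cpu_time(cycles_per_instruction, cycle_time_ns):
    """Return (total cycles, average CPI, total time in ns) for an instruction list."""
    total_cycles = sum(cycles_per_instruction)
    average_cpi = total_cycles / len(cycles_per_instruction)
    # CPU time = instructions x CPI x time per cycle = total cycles x time per cycle
    return total_cycles, average_cpi, total_cycles * cycle_time_ns


total, cpi, time_ns = cpu_time([7, 5, 10], cycle_time_ns=1.0)  # assumed 1 ns clock
print(total)          # 22
print(round(cpi, 2))  # 7.33
print(time_ns)        # 22.0
```

CPI is an average over the instructions executed, so it changes with the instruction mix.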
Pipeline Performance (2/2)
🞕 Example 1 (p.C-10): Consider the unpipelined processor in the previous section. Assume that
it has a 1 ns clock cycle and that it uses 4 cycles for ALU operations and branches, and 5
cycles for memory operations. Assume that the relative frequencies of these operations are
40%, 20%, and 40%, respectively. Suppose that due to clock skew and setup, pipelining
the processor adds 0.2 ns of overhead to the clock. Ignoring any latency impact, how
much speedup in the instruction execution rate will we gain from a pipeline?
🞕 Answer
The average instruction execution time on the unpipelined processor is
$$\text{Clock cycle} \times \text{Average CPI} = 1\,\text{ns} \times ((40\% + 20\%) \times 4 + 40\% \times 5) = 4.4\,\text{ns}$$
† In the pipeline, the clock must run at the speed of the slowest stage plus overhead,
which is 1 + 0.2 = 1.2 ns
$$\text{Speedup from pipelining} = \frac{\text{Average instruction time unpipelined}}{\text{Average instruction time pipelined}} = \frac{4.4\,\text{ns}}{1.2\,\text{ns}} \approx 3.7$$
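Example 1 can be checked numerically. The sketch below uses only the numbers given in the problem statement; the function and variable names are illustrative.

```python
# (relative frequency, cycles on the unpipelined processor)
INSTRUCTION_MIX = {
    "alu": (0.40, 4),
    "branch": (0.20, 4),
    "memory": (0.40, 5),
}


def example1(clock_ns=1.0, overhead_ns=0.2):
    """Return (unpipelined ns, pipelined ns, speedup) per instruction."""
    average_cpi = sum(freq * cycles for freq, cycles in INSTRUCTION_MIX.values())
    unpipelined_ns = clock_ns * average_cpi
    # Pipelined: one instruction per clock, and the clock is stretched by the overhead.
    pipelined_ns = clock_ns + overhead_ns
    return unpipelined_ns, pipelined_ns, unpipelined_ns / pipelined_ns


unpipelined, pipelined, speedup = example1()
print(round(unpipelined, 2))  # 4.4
print(round(pipelined, 2))    # 1.2
print(round(speedup, 2))      # 3.67
```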
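Performance with Pipeline Stall (1/2)
$$\text{Speedup from pipelining} = \frac{\text{Average instruction time unpipelined}}{\text{Average instruction time pipelined}} = \frac{\text{CPI unpipelined} \times \text{Clock cycle unpipelined}}{\text{CPI pipelined} \times \text{Clock cycle pipelined}} = \frac{\text{CPI unpipelined}}{\text{CPI pipelined}} \times \frac{\text{Clock cycle unpipelined}}{\text{Clock cycle pipelined}}$$
With perfectly balanced stages and no overhead:
$$\text{Pipeline depth} = \frac{\text{Clock cycle unpipelined}}{\text{Clock cycle pipelined}}$$
Taking the CPI without stalls to be 1 on both processors:
$$\text{Speedup from pipelining} = \frac{1}{1 + \text{Pipeline stall cycles per instruction}} \times \frac{\text{Clock cycle unpipelined}}{\text{Clock cycle pipelined}} = \frac{1}{1 + \text{Pipeline stall cycles per instruction}} \times \text{Pipeline depth}$$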
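The final speedup expression is easy to evaluate for different stall rates. A minimal sketch, assuming an ideal CPI of 1 and perfectly balanced stages with no clock overhead; the sample values are illustrative only.

```python
def speedup_with_stalls(pipeline_depth, stall_cycles_per_instruction):
    """Speedup = pipeline depth / (1 + pipeline stall cycles per instruction)."""
    return pipeline_depth / (1 + stall_cycles_per_instruction)


print(speedup_with_stalls(5, 0.0))  # 5.0, the ideal case
print(speedup_with_stalls(5, 1.0))  # 2.5, one stall cycle per instruction halves the gain
```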
[Figure: pipeline diagram. Instr 1, Instr 2, and Instr 3 in instruction order, each passing through Ifetch, Reg, ALU, DMem, Reg]
One Memory Port/Structural Hazards
[Figure: pipeline diagram over clock cycles 1 to 7. Load, Instr 1, and Instr 2 each pass through Ifetch, Reg, ALU, DMem, Reg; the next slot is a Stall, shown as a row of five bubbles, followed by one more instruction]
How do you “bubble” the pipe?
Performance on Structural Hazard
🞕 Example 2 (p.C-14): Let’s see how much the load structural hazard might cost.
Suppose that data references constitute 40% of the mix, and that the ideal CPI of the
pipelined processor, ignoring the structural hazard, is 1. Assume that the processor with
the structural hazard has a clock rate that is 1.05 times higher than the clock rate of
the processor without the hazard. Disregarding any other performance losses, is the
pipeline with or without the structural hazard faster, and by how much?
🞕 Answer
The average instruction times on the two processors are
$$\text{Average instruction time}_{\text{ideal}} = \text{CPI} \times \text{Clock cycle time}_{\text{ideal}} = 1 \times \text{Clock cycle time}_{\text{ideal}}$$
$$\text{Average instruction time}_{\text{structural hazard}} = \text{CPI} \times \text{Clock cycle time} = (1 + 0.4 \times 1) \times \frac{\text{Clock cycle time}_{\text{ideal}}}{1.05} \approx 1.3 \times \text{Clock cycle time}_{\text{ideal}}$$
The processor without the structural hazard is therefore about 1.3 times faster.
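Example 2 can be worked through in a few lines. This sketch uses the numbers from the problem statement and treats the ideal clock cycle time as one arbitrary time unit; the function name is illustrative.

```python
def average_instruction_time(ideal_cpi, stall_cycles_per_instruction, clock_cycle_time):
    """Average time per instruction = (ideal CPI + stall cycles) x clock cycle time."""
    return (ideal_cpi + stall_cycles_per_instruction) * clock_cycle_time


ideal_cycle = 1.0  # arbitrary time unit
without_hazard = average_instruction_time(1.0, 0.0, ideal_cycle)
# 40% of instructions are data references, each costing one stall cycle.
# A clock rate 1.05 times higher means a cycle time of ideal_cycle / 1.05.
with_hazard = average_instruction_time(1.0, 0.4 * 1, ideal_cycle / 1.05)

print(round(with_hazard / without_hazard, 2))  # 1.33
```

Even with the faster clock, the design with the hazard takes about 1.3 times as long per instruction.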
Summary of Structural Hazard
🞕 To avoid this structural hazard, the designer could provide a
separate memory access for instructions, either by
splitting the cache into separate instruction and data caches, or
using a set of buffers, usually called instruction buffers, to hold
instructions.
🞕 However, this increases cost.
Ex1: pipelining functional units or duplicating resources is costly;
Ex2: separate instruction and data caches require twice the memory bandwidth,
and often higher bandwidth at the pins, to support both an instruction and
a data cache access every cycle;
Ex3: a floating-point multiplier consumes lots of gates.
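Read After Write (RAW): J tries to read R1 before I writes it
I: ADD R1,R2,R3
J: SUB R4,R1,R3
Write After Read (WAR): J tries to write R1 before I reads it
I: SUB R4,R1,R3
J: ADD R1,R2,R3
K: MUL R6,R1,R7
Write After Write (WAW): J tries to write R1 before I writes it
I: SUB R1,R4,R3
J: ADD R1,R2,R3
K: MUL R6,R1,R7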
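The three instruction sequences above differ only in which registers instructions I and J read and write. A small sketch that classifies the dependence between two instructions, assuming the three-operand form `OP dest,src1,src2` used on the slides; `parse` and `dependences` are hypothetical helper names. Whether a dependence turns into an actual hazard depends on the pipeline.

```python
def parse(instruction):
    """Split 'ADD R1,R2,R3' into (destination, [sources])."""
    _, operands = instruction.split(maxsplit=1)
    dest, *sources = [reg.strip() for reg in operands.split(",")]
    return dest, sources


def dependences(first, second):
    """List the register dependences of `second` on the earlier `first`."""
    dest1, sources1 = parse(first)
    dest2, sources2 = parse(second)
    found = []
    if dest1 in sources2:
        found.append("RAW")  # second reads what first writes
    if dest2 in sources1:
        found.append("WAR")  # second writes what first reads
    if dest1 == dest2:
        found.append("WAW")  # both write the same register
    return found


print(dependences("ADD R1,R2,R3", "SUB R4,R1,R3"))  # ['RAW']
print(dependences("SUB R4,R1,R3", "ADD R1,R2,R3"))  # ['WAR']
print(dependences("SUB R1,R4,R3", "ADD R1,R2,R3"))  # ['WAW']
```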
[Figure: data hazard on R1. Pipeline diagram (Ifetch, Reg, ALU, DMem, Reg, with pipeline registers between stages) for DADD R1,R2,R3, DSUB R4,R1,R5, AND R6,R1,R7, OR R8,R1,R9, and XOR, in instruction order]
Another Example of a RAW Data Hazard
🞕 Result of sub is needed by and, or, add, & sw instructions
🞕 Instructions and & or will read old value of r2 from reg file
🞕 During CC5, r2 is written and read, so the new value is read
[Figure: program execution order over time, CC1 to CC8. sub r2, r1, r3 passes through IM, Reg, ALU, DM, Reg, followed by the dependent instructions and finally sw r8, 10(r2). The value of r2 is 10 in CC1 to CC4, 10/20 in CC5, and 20 in CC6 to CC8]
Solution #1: Stalling the Pipeline
🞕 The and instruction cannot read r2 until CC5
The and instruction remains in the IF/ID register until CC5
[Figure: pipeline diagram with the and instruction held in IF/ID until CC5; the instructions after it, ending with sw r8, 10(r2), are delayed accordingly]
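The number of stall cycles depends on how far the consumer sits behind the producer. A rough sketch, assuming the five-stage pipeline from earlier (register read in stage 2, register write in stage 5), no forwarding, and a register file that can be written and then read in the same cycle, as in the CC5 case above; the function name and parameters are illustrative.

```python
def stall_cycles(distance, write_stage=5, read_stage=2):
    """Stall cycles for a consumer issued `distance` instructions after its producer.

    The producer writes the register in `write_stage`; the consumer reads it in
    `read_stage`, and may read in the same cycle as the write.
    """
    return max(0, (write_stage - read_stage) - distance)


print(stall_cycles(1))  # 2: the next instruction waits in IF/ID from CC3 until CC5
print(stall_cycles(2))  # 1
print(stall_cycles(3))  # 0: the read already lines up with the write-back cycle
```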
Double Data Hazard
🞕 Consider the sequence:
add r1,r1,r2
sub r1,r1,r3
and r1,r1,r4
🞕 Both hazards occur
Want to use the most recent
When executing AND, forward result of SUB
» ForwardA = 01 (from the EX/MEM pipe stage)
Data Hazard Even with Forwarding
[Figure: pipeline diagram over time (clock cycles). A first instruction is followed by DSUB R4,R1,R6, DAND R6,R1,R7, and OR R8,R1,R9, in instruction order, each passing through Ifetch, Reg, ALU, DMem, Reg]
Data Hazard Even with Forwarding
[Figure: pipeline diagram over time (clock cycles) with a stall. A bubble is inserted into DSUB R4,R1,R6, AND R6,R1,R7, and OR R8,R1,R9, delaying each by one cycle]
How is this detected?
Load Delay
🞕 Not all RAW data hazards can be forwarded
Load has a delay that cannot be eliminated by forwarding
🞕 In the example shown below …
The LW instruction does not have data until end of CC4
[Figure: pipeline diagram in program order. The instructions after LW, including or r6, r3, r2, pass through IM, Reg, ALU, DM, Reg]
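One common way to detect the load delay is to compare the load's destination with the source registers of the instruction in decode. The sketch below follows the usual textbook condition and is an assumption here, since the slides do not show the check; the `mem_read` flag and the register numbers in the example are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class IdEx:
    mem_read: bool    # True when the instruction in EX is a load (assumed control signal)
    register_rt: int  # destination register of the load


@dataclass
class IfId:
    register_rs: int  # first source of the instruction being decoded
    register_rt: int  # second source of the instruction being decoded


def must_stall(id_ex, if_id):
    """True when the instruction in decode needs the result of a load still in EX."""
    return id_ex.mem_read and id_ex.register_rt in (if_id.register_rs, if_id.register_rt)


# Hypothetical example: a load into r2 followed by an instruction reading r2 and r5.
print(must_stall(IdEx(mem_read=True, register_rt=2), IfId(register_rs=2, register_rt=5)))   # True
print(must_stall(IdEx(mem_read=False, register_rt=2), IfId(register_rs=2, register_rt=5)))  # False
```

When the check fires, the pipeline inserts one bubble; forwarding from MEM/WB then covers the remaining distance.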
Forwarding to Avoid LW-SW Data Hazard
[Figure: pipeline diagram over time (clock cycles) for DADD R1,R2,R3, LD R4,0(R1), SD R4,12(R1), OR R8,R6,R9, and a final instruction with operands R10,R9,R11, in instruction order, each passing through Ifetch, Reg, ALU, DMem, Reg]
Detecting RAW Hazards
🞕 Pass register numbers along pipeline
ID/EX.RegisterRs = register number for Rs in ID/EX
ID/EX.RegisterRt = register number for Rt in ID/EX
ID/EX.RegisterRd = register number for Rd in ID/EX
🞕 Current instruction being executed is in the ID/EX register
🞕 Previous instruction is in the EX/MEM register
🞕 Second previous is in the MEM/WB register
🞕 RAW data hazards when
1a. EX/MEM.RegisterRd = ID/EX.RegisterRs
1b. EX/MEM.RegisterRd = ID/EX.RegisterRt
(Fwd from EX/MEM pipeline reg)
2a. MEM/WB.RegisterRd = ID/EX.RegisterRs
2b. MEM/WB.RegisterRd = ID/EX.RegisterRt
(Fwd from MEM/WB pipeline reg)
Detecting the Need to Forward
🞕 But only if forwarding instruction will write to a register!
EX/MEM.RegWrite, MEM/WB.RegWrite
🞕 And only if Rd for that instruction is not R0
EX/MEM.RegisterRd ≠ 0
MEM/WB.RegisterRd ≠ 0
Forwarding Conditions
🞕 Detecting RAW hazard with Previous Instruction
if (EX/MEM.RegWrite and
    (EX/MEM.RegisterRd ≠ 0) and
    (EX/MEM.RegisterRd = ID/EX.RegisterRs))
  ForwardA = 01 (Forward from EX/MEM pipe stage)
if (EX/MEM.RegWrite and
    (EX/MEM.RegisterRd ≠ 0) and
    (EX/MEM.RegisterRd = ID/EX.RegisterRt))
  ForwardB = 01 (Forward from EX/MEM pipe stage)
🞕 Detecting RAW hazard with Second Previous
if (MEM/WB.RegWrite and
(MEM/WB.RegisterRd ≠ 0) and
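The forwarding checks can be sketched as a small function. This is an illustration rather than the slides' own logic: the value 01 for EX/MEM comes from the slides, while the value 10 for MEM/WB and the Python names are assumptions. Checking EX/MEM first gives the most recent result priority, which is what the double data hazard requires.

```python
from dataclasses import dataclass

NO_FORWARD = 0b00
FROM_EX_MEM = 0b01  # encoding given on the slides
FROM_MEM_WB = 0b10  # assumed encoding; not shown on the slides


@dataclass
class PipeReg:
    reg_write: bool   # the instruction in this pipeline register writes a register
    register_rd: int  # its destination register number


def forward_select(source_reg, ex_mem, mem_wb):
    """Choose the forwarding source for one ALU operand."""
    if ex_mem.reg_write and ex_mem.register_rd != 0 and ex_mem.register_rd == source_reg:
        return FROM_EX_MEM  # previous instruction holds the most recent value
    if mem_wb.reg_write and mem_wb.register_rd != 0 and mem_wb.register_rd == source_reg:
        return FROM_MEM_WB  # second previous instruction
    return NO_FORWARD


def forwarding_unit(id_ex_rs, id_ex_rt, ex_mem, mem_wb):
    """Return (ForwardA, ForwardB) for the instruction in ID/EX."""
    return (forward_select(id_ex_rs, ex_mem, mem_wb),
            forward_select(id_ex_rt, ex_mem, mem_wb))


# add r1,r1,r2 ; sub r1,r1,r3 ; and r1,r1,r4 with the and instruction in ID/EX:
sub_in_ex_mem = PipeReg(reg_write=True, register_rd=1)
add_in_mem_wb = PipeReg(reg_write=True, register_rd=1)
print(forwarding_unit(1, 4, sub_in_ex_mem, add_in_mem_wb))  # (1, 0): ForwardA = 01
```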
Control Hazard on Branches: Three Stage Stall
[Figure: pipeline diagram of the branch and the instructions that follow it, 14: AND R2,R3,R5, 18: OR R6,R1,R7, and 22: ADD R8,R1,R9, each passing through Ifetch, Reg, ALU, DMem, Reg]
Branch target address is calculated in the ALU stage
Branch result is also computed in the ALU stage
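The cost of a three-cycle branch stall can be estimated with the stall formula from earlier. A minimal sketch, assuming an ideal CPI of 1 and that every branch pays the full stall; the 20% branch frequency is an illustrative value, not a figure from this slide.

```python
def cpi_with_branch_stalls(branch_frequency, stall_cycles=3, ideal_cpi=1.0):
    """Effective CPI when every branch stalls the pipeline for `stall_cycles` cycles."""
    return ideal_cpi + branch_frequency * stall_cycles


cpi = cpi_with_branch_stalls(0.20)  # assumed: 20% of instructions are branches
print(round(cpi, 2))                # 1.6
print(round(5 / cpi, 2))            # speedup of a 5-stage pipeline drops from 5 to about 3.12
```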
Contents
1. Pipelining Introduction
2. The Major Hurdle of Pipelining: Pipeline Hazards
3. RISC-V ISA and its Implementation
Reading:
Textbook: Appendix C
RISC-V ISA
Chisel Tutorial