Pipelining Basic and Intermediate Concepts
Appendix A mainly with some
support from Chapter 3
Pipelining: It's Natural!
• Laundry Example
• Ann, Brian, Cathy, Dave
each have one load of clothes
to wash, dry, and fold
• Washer takes 30 minutes
[Figure: sequential laundry timeline. Task order A, B, C, D; each load takes 30, 40, and 20 minutes to wash, dry, and fold.]
• Sequential laundry takes 6 hours for 4 loads
• If they learned pipelining, how long would laundry take?
Pipelined Laundry
Start work ASAP
[Figure: pipelined laundry timeline from 6 PM to midnight. Task order A, B, C, D; the stage durations along the time axis are 30, 40, 40, 40, 40, and 20 minutes.]
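The two laundry timelines can be checked with a few lines of arithmetic. The sketch below is illustrative only; the function names are made up for this example, and the pipelined formula assumes identical loads that start as soon as the next stage is free.

```python
def sequential_minutes(stage_minutes, loads):
    """Each load runs through every stage before the next load starts."""
    return loads * sum(stage_minutes)


def pipelined_minutes(stage_minutes, loads):
    """Loads overlap; after the first load, one load finishes per slowest-stage time."""
    bottleneck = max(stage_minutes)
    return sum(stage_minutes) + (loads - 1) * bottleneck


stages = [30, 40, 20]  # wash, dry, fold, in minutes
print(sequential_minutes(stages, 4) / 60)  # 6.0 hours
print(pipelined_minutes(stages, 4) / 60)   # 3.5 hours
```

The pipelined total is set by the slowest stage (the 40-minute dryer), which is why unbalanced stages limit the speedup.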
• Problems
– Usually, stages are not balanced
– Pipelining overhead
– Hazards (conflicts)
• Performance (throughput, in terms of the CPU performance equation)
– Decrease of the CPI
– Decrease of cycle time
MIPS Instruction Formats
J-type: opcode (bits 0 to 5), address (bits 6 to 31)
Fixed-field decoding
1st and 2nd Instruction cycles
• Instruction fetch (IF)
IR ← Mem[PC];
NPC ← PC + 4
• Instruction decode & register fetch (ID)
A ← Regs[IR[6..10]];
B ← Regs[IR[11..15]];
Imm ← ((IR[16])^16 ## IR[16..31])
3rd Instruction cycle
• Execution & effective address (EX)
– Memory reference
• ALUOutput ← A + Imm
– Register - Register ALU instruction
• ALUOutput ← A func B
– Register - Immediate ALU instruction
• ALUOutput ← A op Imm
– Branch
• ALUOutput ← NPC + Imm; Cond ← (A op 0)
4th Instruction cycle
• Memory access & branch completion (MEM)
– Memory reference
• PC ← NPC
• LMD ← Mem[ALUOutput] (load)
• Mem[ALUOutput] ← B (store)
– Branch
• if (cond) PC ← ALUOutput; else PC ← NPC
5th Instruction cycle
• Write-back (WB)
– Register - register ALU instruction
• Regs[IR[16..20]] ← ALUOutput
– Register - immediate ALU instruction
• Regs[IR[11..15]] ← ALUOutput
– Load instruction
• Regs[IR[11..15]] ← LMD
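The five instruction cycles can be traced in software. The following is a minimal sketch on a toy machine, not the actual MIPS encoding: instructions are pre-decoded dictionaries (an assumption that replaces the bit-field decoding), the field names `kind`, `rs`, `rt`, `rd`, `imm`, and `func` are hypothetical, and the branch condition is assumed to be "register equals zero".

```python
import operator


def run_instruction(state):
    """Execute one pre-decoded instruction in five steps (IF, ID, EX, MEM, WB)."""
    regs, dmem = state["regs"], state["dmem"]

    # IF: fetch the instruction and compute the next sequential PC
    ir = state["imem"][state["pc"]]
    npc = state["pc"] + 4

    # ID: read both source registers and the immediate
    a = regs[ir.get("rs", 0)]
    b = regs[ir.get("rt", 0)]
    imm = ir.get("imm", 0)

    # EX: effective address, ALU result, or branch target and condition
    kind = ir["kind"]
    cond = False
    if kind in ("load", "store"):
        alu_out = a + imm
    elif kind == "alu_rr":
        alu_out = ir["func"](a, b)
    elif kind == "alu_imm":
        alu_out = ir["func"](a, imm)
    elif kind == "branch":
        alu_out = npc + imm
        cond = (a == 0)  # assumed condition for this sketch

    # MEM: memory access and PC update
    lmd = None
    if kind == "load":
        lmd = dmem[alu_out]
    elif kind == "store":
        dmem[alu_out] = b
    state["pc"] = alu_out if (kind == "branch" and cond) else npc

    # WB: write the result into the register file
    if kind == "alu_rr":
        regs[ir["rd"]] = alu_out
    elif kind == "alu_imm":
        regs[ir["rt"]] = alu_out
    elif kind == "load":
        regs[ir["rt"]] = lmd


state = {"pc": 0, "regs": [0] * 8, "dmem": {16: 7}, "imem": {
    0: {"kind": "load", "rs": 0, "rt": 1, "imm": 16},
    4: {"kind": "alu_imm", "rs": 1, "rt": 2, "imm": 5, "func": operator.add},
    8: {"kind": "alu_rr", "rs": 1, "rt": 2, "rd": 3, "func": operator.sub},
}}
for _ in range(3):
    run_instruction(state)
print(state["regs"][:4], state["pc"])  # [0, 7, 12, -5] 12
```

Each block of the function corresponds to one instruction cycle, and the temporaries `a`, `b`, `imm`, `alu_out`, and `lmd` play the role of the A, B, Imm, ALUOutput, and LMD registers.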
5 Steps of MIPS Datapath
[Figure: MIPS datapath divided into five steps: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back. Components shown: Next PC mux, adder (+4), instruction memory, register file (RS1, RS2, RD), sign extend (Imm), Zero? test, ALU, data memory (LMD), and write-back mux (WB Data).]
5 Steps of MIPS Datapath
[Figure: the same five-step datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB between the steps. Next SEQ PC and the RD field are carried along through the pipeline registers.]
[Figure: instruction classes (load, store, RR ALU, immediate) across steps 2 to 5]
Basic Pipeline
Instr #   Clock number
          1    2    3    4    5    6    7    8    9
i         IF   ID   EX   MEM  WB
i+1            IF   ID   EX   MEM  WB
i+2                 IF   ID   EX   MEM  WB
i+3                      IF   ID   EX   MEM  WB
i+4                           IF   ID   EX   MEM  WB
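The basic pipeline diagram can be generated mechanically, which also makes the cycle count explicit. This is a small sketch for an ideal pipeline with no stalls; the helper names are made up for the example.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def total_cycles(n_instructions, depth=len(STAGES)):
    """Cycles needed to finish n instructions in an ideal pipeline with no stalls."""
    return depth + n_instructions - 1


def diagram(n_instructions):
    """Return one row per instruction: the stage occupied in each clock cycle."""
    width = total_cycles(n_instructions)
    rows = []
    for i in range(n_instructions):
        row = [""] * width
        for s, name in enumerate(STAGES):
            row[i + s] = name  # instruction i enters stage s in cycle i + s
        rows.append(row)
    return rows


for i, row in enumerate(diagram(5)):
    print(f"i+{i}".ljust(5) + "".join(cell.ljust(5) for cell in row))
print(total_cycles(5))   # 9 cycles when pipelined
print(5 * len(STAGES))   # 25 cycles if each instruction ran on its own
```

Once the pipeline is full, one instruction completes every cycle, so five instructions need 9 cycles instead of 25.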
Pipeline Resources
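[Figure: pipeline resources per stage: IM (instruction cache), Reg (register read), ALU, DM (data cache), Reg (register write). The datapath also shows the PC, the +4 adder, the Zero? test, sign extend, and the multiplexers.]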
Performance limitations
DAP Spr.‘98 ©UCB 24
• Throughput = number of instructions completed per unit time (per second, per cycle, etc.)
• Throughput of an unpipelined machine
– 1/time per instruction
– Time per instruction = pipeline depth × time to execute a single stage
– The time to execute a single stage can therefore be rewritten as: time per instruction / pipeline depth
[Figure: FP multiply (stages M1 to M5) in the EX position of the IF, ID, EX, MEM, WB pipeline, shown in three variants, including a partially pipelined and a non-pipelined multiplier.]
To pipeline or Not to pipeline
• Elements to consider
– Effects of pipelining and duplicating units
• Increased costs
• Higher latency (pipeline register overhead)
– Frequency of structural hazard
• Example: unpipelined FP multiply unit in DLX
– Latency: 5 cycles
– Impact on mdljdp2 program?
• Frequency of FP instructions: 14%
– Depends on the distribution of FP multiplies
• Best case: uniform distribution
• Worst case: clustered, back-to-back multiplies
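The best and worst cases can be bounded with simple arithmetic. The sketch below assumes an ideal CPI of 1 and, as a labeled assumption, that the 14% figure applies to FP multiplies; the function names are illustrative.

```python
def worst_case_cpi(multiply_fraction, unit_latency):
    """Upper bound: every multiply arrives right behind another one
    and waits for the unpipelined unit to drain."""
    return 1.0 + multiply_fraction * (unit_latency - 1)


def best_case_cpi(multiply_fraction, unit_latency):
    """Evenly spaced multiplies: no stalls as long as the unit is free again
    before the next multiply arrives; otherwise the unit is the bottleneck."""
    return max(1.0, multiply_fraction * unit_latency)


f, latency = 0.14, 5  # assumed: 14% multiplies, 5-cycle unpipelined unit
print(round(best_case_cpi(f, latency), 2))   # 1.0
print(round(worst_case_cpi(f, latency), 2))  # 1.56
```

With uniform spacing a multiply arrives roughly every 7 instructions, which is longer than the 5-cycle latency, so the unpipelined unit costs nothing; only clustering hurts.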
Resource Duplication
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) with a stall; surviving instruction labels: AND R6, R1, R7 and OR R8, R1, R9.]
• Important
– Read and understand table on page A-36 in the book.
Forwarding Implementation (2/2)
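[Figure: forwarding paths from the EX/MEM and MEM/WB pipeline registers through multiplexers to the ALU inputs; the ID/EX register, the Zero? test, and the data memory are also shown.]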
Stalls in spite of forwarding
LW Rb,b       IF   ID   EX   MEM  WB
LW Rc,c            IF   ID   EX   MEM  WB
LW Re,e                 IF   ID   EX   MEM  WB
ADD Ra,Rb,Rc                 IF   ID   EX   MEM  WB
LW Rf,f                           IF   ID   EX   MEM  WB
SW a,Ra                                IF   ID   EX   MEM  WB
SUB Rd,Re,Rf                                IF   ID   EX   MEM  WB
SW d,Rd                                          IF   ID   EX   MEM  WB
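A load-use interlock can be detected by comparing each instruction with the one directly before it. The sketch below is a simplified model, assuming a five-stage pipeline with full forwarding and a one-cycle load delay; the tuple layout and the unscheduled ordering are hypothetical and only serve as a comparison.

```python
def count_load_stalls(program):
    """Count one-cycle load-use stalls.

    program is a list of (opcode, dest, ex_sources) tuples, where ex_sources
    lists the registers needed at the start of EX. A loaded value is only
    available after MEM, so an instruction that needs it in EX immediately
    after the load must stall for one cycle.
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        if prev_op == "LW" and prev_dest in cur[2]:
            stalls += 1
    return stalls


# The instruction order listed on this slide. Store data is needed in MEM,
# not EX, so it is left out of ex_sources (an assumption of this sketch).
scheduled = [
    ("LW", "Rb", []), ("LW", "Rc", []), ("LW", "Re", []),
    ("ADD", "Ra", ["Rb", "Rc"]), ("LW", "Rf", []), ("SW", None, []),
    ("SUB", "Rd", ["Re", "Rf"]), ("SW", None, []),
]
# Hypothetical unscheduled order for the same computation (not from the slides).
unscheduled = [
    ("LW", "Rb", []), ("LW", "Rc", []), ("ADD", "Ra", ["Rb", "Rc"]),
    ("SW", None, []), ("LW", "Re", []), ("LW", "Rf", []),
    ("SUB", "Rd", ["Re", "Rf"]), ("SW", None, []),
]
print(count_load_stalls(scheduled))    # 0
print(count_load_stalls(unscheduled))  # 2
```

Moving an independent instruction between each load and its first use removes both stalls, at the cost of keeping more registers live at once.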
Compiler Scheduling
• Eliminates load interlocks
• Demands more registers
• Simple scheduling
– Basic block (sequential segment of code)
– Good for simple pipelines
– Percentage of loads that result in a stall
• FP: 13%
• Int: 25%
Example: Dual-port vs. Single-port
• Machine A: Dual ported memory
• Machine B: Single ported memory, but its pipelined
implementation has a 1.05 times faster clock rate
• Ideal CPI = 1 for both
• Loads are 40% of instructions executed
SpeedUpA = Pipeline Depth / (1 + 0) × (clock_unpipe / clock_pipe)
         = Pipeline Depth
SpeedUpB = Pipeline Depth / (1 + 0.4 × 1) × (clock_unpipe / (clock_unpipe / 1.05))
         = (Pipeline Depth / 1.4) × 1.05
         = 0.75 × Pipeline Depth
SpeedUpA / SpeedUpB = Pipeline Depth / (0.75 × Pipeline Depth) = 1.33
• Machine A is 1.33 times faster
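The same comparison can be reproduced numerically. This is a sketch of the speedup formula used in the example; the function name is made up, and the pipeline depth of 5 is an arbitrary assumption because it cancels in the ratio.

```python
def pipeline_speedup(depth, stall_cpi, clock_ratio=1.0):
    """Speedup over the unpipelined machine.

    stall_cpi:   average stall cycles per instruction
    clock_ratio: unpipelined cycle time divided by pipelined cycle time
    """
    return depth / (1.0 + stall_cpi) * clock_ratio


depth = 5  # assumed value; any depth gives the same ratio
speedup_a = pipeline_speedup(depth, stall_cpi=0.0)
speedup_b = pipeline_speedup(depth, stall_cpi=0.4 * 1, clock_ratio=1.05)
print(round(speedup_a / speedup_b, 2))  # 1.33
```

The 5% clock advantage of Machine B does not make up for one stall cycle on 40% of its instructions.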
Control Hazards
Branch              IF    ID    EX    MEM   WB
Branch successor          IF    stall stall IF    ID    EX    MEM   WB
Branch successor+1                                IF    ID    EX    MEM   WB
Branch successor+2                                      IF    ID    EX    MEM   WB
Branch successor+3                                            IF    ID    EX    MEM
Branch successor+4                                                  IF    ID    EX
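[Figure: pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB, showing a second adder and the Zero? test used for branches, together with the PC, instruction cache, register file, sign extend, ALU, and data cache.]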
Reduction of Branch Penalties
Static, compile-time, branch prediction schemes
1 Stall the pipeline
Simple in hardware and software
2 Treat every branch as not taken
Continue execution as if branch were normal instruction
If branch is taken, turn the fetched instruction into a no-op
3 Treat every branch as taken
Useless in MIPS …. Why?
4 Delayed branch
Sequential successors (in delay slots) are executed anyway
No branches in the delay slots
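The cost of each scheme can be compared with the effective CPI. The sketch below is illustrative: the workload numbers (20% branches, 60% of them taken) are hypothetical, and the 3-cycle penalty is taken from the stall diagram on the Control Hazards slide.

```python
def branch_cpi(branch_freq, taken_frac, penalty_taken, penalty_untaken, base_cpi=1.0):
    """Effective CPI given separate penalties for taken and untaken branches."""
    penalty = taken_frac * penalty_taken + (1 - taken_frac) * penalty_untaken
    return base_cpi + branch_freq * penalty


# Hypothetical workload, not from the slides.
branch_freq, taken_frac = 0.20, 0.60

# (penalty if taken, penalty if not taken), in cycles
schemes = {
    "stall the pipeline": (3, 3),
    "predict not taken":  (3, 0),
}
for name, (p_taken, p_untaken) in schemes.items():
    cpi = branch_cpi(branch_freq, taken_frac, p_taken, p_untaken)
    print(name, round(cpi, 2))
```

Under these assumed numbers, predicting not taken pays the penalty only on taken branches, so its CPI is lower than always stalling.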
Delayed Branch
#4: Delayed Branch
– Define branch to take place AFTER a following instruction
branch instruction
sequential successor 1
sequential successor 2
........
sequential successor n (branch delay of length n)
branch target if taken
Compiler organizes code so that the most frequent path is the not-taken one
Cancelling Branch Instructions
Cancelling branch includes the predicted direction
• Incorrect prediction => delay-slot instruction becomes no-op
• Helps the compiler to fill branch delay slots (no requirements for strategies b and c)
• Behavior of a predicted-taken cancelling branch
Untaken Branch   IF    ID    EX    MEM   WB
Instruction i+1        IF    stall stall stall stall (clear the IF/ID register)
Instruction i+2              IF    ID    EX    MEM   WB
Instruction i+3                    IF    ID    EX    MEM   WB
Instruction i+4                          IF    ID    EX    MEM   WB
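[Figure: filling the branch delay slot from before the branch, from the target, and from the fall-through path, illustrated with ADD R1,R2,R3, SUB R4,R5,R6, and OR R7,R8,R9 around the branches "if R2=0 then" and "if R1=0 then".]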
Branch Slot Requirements
Strategy              Requirements                          Improves performance
a) From before        Branch must not depend on the         Always
                      delayed instruction
b) From target        Must be OK to execute the delayed     When branch is taken
                      instruction if branch is not taken
c) From fall through  Must be OK to execute the delayed     When branch is not taken
                      instruction if branch is taken
[Figure: pipeline diagram (IF ID EX M WB) in which execution is suspended, the exception-handling procedure runs, and RFE returns to the interrupted program; labels: Cache, Memory.]
Stopping and Restarting Execution
• TRAP and RFE (return-from-exception) instructions
• IAR register saves the PC of faulting instruction
• Safely save the state of the pipeline
– Force a TRAP on the next IF
– Until the TRAP is taken, turn off all writes for the
faulting instruction and the following ones.
– Exception-handling routine saves the PC of the
faulting instruction
• For delayed branches we need to save more PCs
Exceptions in MIPS
ADD  IF   ID   EX   M    WB
LW        IF   ID   EX   M    WB
ADD            IF   ID   EX   M    WB
                    IF   ID   EX   M    WB
[Figure: pipeline with several functional units between ID and MEM: FP/int multiply (M1 to M7), FP adder (A1 to A4), and FP/int divider (DIV); IF, ID, MEM, and WB are shared.]
Latencies and Initiation Intervals
Functional Unit    Latency   Initiation Interval
Integer ALU        0         1
Data Memory        1         1
FP adder           3         1
FP/int multiply    6         1
FP/int divider     24        25
MULTD  IF   ID   M1   M2   M3   M4   M5   M6   M7   Mem  WB
ADDD        IF   ID   A1   A2   A3   A4   Mem  WB
LD               IF   ID   EX   Mem  WB
SD                    IF   ID   EX   Mem  WB
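The timing of these four instructions can be computed from the number of execute stages per unit. The sketch below assumes independent instructions fetched one per cycle with no stalls; the stage counts are taken as latency + 1 from the latency table, and the helper name is made up.

```python
# Execute stages per instruction: M1-M7, A1-A4, or a single EX stage.
EX_STAGES = {"MULTD": 7, "ADDD": 4, "LD": 1, "SD": 1}


def wb_cycle(fetch_cycle, opcode):
    """Clock cycle in which the instruction reaches WB, assuming no stalls."""
    # ID follows IF by one cycle, then the execute stages, then MEM and WB.
    return fetch_cycle + 1 + EX_STAGES[opcode] + 2


program = ["MULTD", "ADDD", "LD", "SD"]
finish = {op: wb_cycle(i + 1, op) for i, op in enumerate(program)}
print(finish)                          # {'MULTD': 11, 'ADDD': 9, 'LD': 7, 'SD': 8}
print(sorted(finish, key=finish.get))  # ['LD', 'SD', 'ADDD', 'MULTD']
```

The instructions are fetched in program order but complete in a different order, which is the source of the WAW and exception-handling issues in FP pipelines.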
Hazards in FP pipelines
• Structural hazards in DIV unit
• Structural hazards in WB
• WAW hazards are possible (WAR not possible)
• Out-of-order completion
– Exception handling issues
• More frequent RAW hazards
– Longer pipelines
ADD F2, F0, F8 IF stall ID stall stall stall stall stall stall A1 A2 A3 A4 Mem WB
Hazard Detection Logic at ID
• Check for Structural Hazards
– Divide unit must be free, and the register write port must be
available when needed
• Check for RAW hazard
– Check source registers against the destination registers held in the
pipeline latches of instructions that are ahead in the
pipeline, as in the integer pipeline
• Check for WAW hazard
– Determine if any instruction in A1-A4, M1-M7 has
same register destination as this instruction.
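The three checks can be written as a single decision function. This is a simplified sketch rather than the actual interlock logic: the dictionary fields (`unit`, `dest`, `sources`, `result_ready`) are hypothetical, `in_flight` is assumed to hold only instructions that have issued but not yet left their execute stages, and the register write port check is omitted.

```python
def must_stall(instr, in_flight, divider_busy=False):
    """Decide whether the instruction in ID has to wait this cycle."""
    # Structural hazard: the divider is not pipelined.
    if instr["unit"] == "DIV" and divider_busy:
        return True
    for older in in_flight:
        if older["dest"] is None:
            continue
        # RAW: a source register is still being produced and cannot be forwarded yet.
        if older["dest"] in instr["sources"] and not older["result_ready"]:
            return True
        # WAW: an older instruction still executing writes the same register.
        if older["dest"] == instr["dest"]:
            return True
    return False


add = {"unit": "ADD", "dest": "F2", "sources": ["F0", "F8"]}
print(must_stall(add, [{"dest": "F0", "result_ready": False}]))  # True (RAW)
print(must_stall(add, [{"dest": "F0", "result_ready": True}]))   # False
print(must_stall(add, [{"dest": "F2", "result_ready": False}]))  # True (WAW)
```

Stalling on every WAW match is the conservative choice described on the slide; a real design could instead cancel the older write.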