Embedded Systems Design: Pipelining and Instruction Scheduling
Presentation Credit
Computer Organization and Design, David A. Patterson, John L. Hennessy (Third Edition), Chapter 6
Pipelining Introduction
What is Pipelining?
Why Pipelining?
To increase throughput.
To maximize hardware utilization.
Pipelining in Microprocessors
The same pipelining principle applies to microprocessors, which split each instruction into five tasks:
instruction fetch, register read, ALU operation, data access, register write
Pipelining principle
Improve performance by increasing instruction throughput
[Figure: program execution order (in instructions) against time for lw $1, 100($0); lw $2, 200($0); lw $3, 300($0), each passing through Instruction fetch, Reg, ALU, Data access, Reg. Non-pipelined: a new instruction starts every 8 ns. Pipelined: a new instruction starts every 2 ns.]
The time of a single cycle is the maximum time required by any phase.
Pipelining speedup?
Ideal speedup = number of stages
Do we achieve this?
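As a rough check on this question, the sketch below compares non-pipelined and pipelined execution time. The stage latencies (2, 1, 2, 2, 1 ns) are assumed values, chosen so that they sum to the 8 ns instruction time and give the 2 ns cycle used on the preceding slide; the function names are illustrative only.

```python
def single_cycle_time(stage_ns, n_instr):
    """Without pipelining, each instruction takes the sum of all stage latencies."""
    return sum(stage_ns) * n_instr


def pipelined_time(stage_ns, n_instr):
    """The cycle time is the slowest stage. The first instruction needs
    len(stage_ns) cycles; each later one finishes one cycle after its predecessor."""
    cycle = max(stage_ns)
    return (len(stage_ns) + n_instr - 1) * cycle


def speedup(stage_ns, n_instr):
    return single_cycle_time(stage_ns, n_instr) / pipelined_time(stage_ns, n_instr)


# Assumed latencies: fetch, register read, ALU, data access, register write.
STAGES_NS = [2, 1, 2, 2, 1]

if __name__ == "__main__":
    for n in (3, 1000):
        print(n, single_cycle_time(STAGES_NS, n), pipelined_time(STAGES_NS, n),
              round(speedup(STAGES_NS, n), 2))
```

Under these assumptions three instructions take 24 ns without pipelining and 14 ns with it. For long instruction streams the speedup approaches 8/2 = 4 rather than the ideal 5, because the stages are not balanced; hazards reduce it further.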
[Figure: pipelined datapath with the pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB. Components shown: PC, adders, instruction memory, register file, sign extend (16 to 32 bits), shift left 2, ALU with Zero output, data memory, multiplexers.]
Hazards
Hazards are situations that prevent the next instruction in
the instruction stream from executing during its designated
clock cycle.
Hazard types:
Structural Hazards
Same resource is needed multiple times in the same cycle
Data Hazards
Data dependencies limit pipelining
Control Hazards
Next executed instruction may not be the next specified
instruction
Structural hazards
Examples:
Two accesses to a single-ported memory
Two operations need the same function unit
at the same time
Two operations need the same function unit
in successive cycles, but the unit is not pipelined
Solutions:
stalling
add more hardware
Structural hazards
Non-pipelined units
[Figure: instruction stream against time in a pipeline with stages IF, ID, OF, EX, WB. Several instructions need the same non-pipelined FU for two EX cycles, which inserts stall cycles for the instructions behind them.]
A pipeline stall delays all the remaining instructions
Note: this example pipeline differs from the 5-stage MIPS pipeline
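The effect of a non-pipelined function unit can be sketched with a small in-order model. This is a simplified illustration, not the pipeline from the slide: the unit names ("alu", "mul") and latencies are hypothetical, and only the EX stage is modelled.

```python
def ex_start_cycles(instrs):
    """instrs: list of (fu_name, ex_latency) in program order.

    Returns (start_cycles, stall_cycles). An instruction enters EX one cycle
    after its predecessor at the earliest, and only once its function unit,
    which is not pipelined, has finished the previous operation.
    """
    fu_free = {}          # cycle at which each function unit becomes free
    starts = []
    stalls = 0
    prev_start = -1
    for fu, latency in instrs:
        earliest = prev_start + 1
        start = max(earliest, fu_free.get(fu, 0))
        stalls += start - earliest      # cycles lost to the structural hazard
        fu_free[fu] = start + latency
        starts.append(start)
        prev_start = start              # in-order: later instructions wait too
    return starts, stalls


if __name__ == "__main__":
    stream = [("alu", 1), ("mul", 2), ("mul", 2), ("alu", 1)]
    print(ex_start_cycles(stream))
```

In this example the second "mul" waits one cycle for the unit, and the final "alu" instruction is delayed by the same cycle even though its own unit is free.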
Data hazards
Data dependencies:
RaW (read-after-write)
WaW (write-after-write)
WaR (write-after-read)
Hardware solution:
Forwarding / Bypassing
Detection logic
Stalling
Data dependences
Three types: RaW, WaR and WaW
add r1, r2, 5      ; r1 := r2+5
sub r4, r1, r3     ; RaW of r1
                   ; WaR of r2
                   ; WaW of r1
st  r1, 5(r2)      ; M[r2+5] := r1 (st = store)
ld  r5, 0(r4)      ; r5 := M[r4+0] (ld = load)
                   ; memory RaW if 5+r2 = 0+r4
[Figure: pipeline diagrams (stages IF, ID, OF, EX, WB) for an instruction computing r1 := r2+5 followed by an instruction with a RaW dependence on r1, without and with forwarding/bypass circuitry. Forwarding saves two cycles.]
Forwarding
Forwarding cannot prevent all pipeline stalls, for example when a sub directly follows a load and uses the loaded value.
[Figure: a load followed by a dependent sub; no instruction is issued in the stall cycle.]
; Load variable B
; Load variable E
; Add B and E
; Store result at memory for variable A
; Load variable F
; Add B and F
; Store result at memory for variable C
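The commented sequence above can be used to count load-use stalls before and after reordering. This sketch assumes a pipeline with forwarding in which only an instruction that directly follows a load and reads its result stalls for one cycle. The register names (t0 to t5) and the tuple layout are hypothetical; the original instructions are not shown on the slide.

```python
def load_use_stalls(program):
    """program: list of (op, dest, sources).

    Counts one stall for every instruction that reads the register written
    by the load immediately before it.
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        op, dest, _ = prev
        if op == "lw" and dest in cur[2]:
            stalls += 1
    return stalls


original = [
    ("lw", "t1", ("t0",)),         # Load variable B
    ("lw", "t2", ("t0",)),         # Load variable E
    ("add", "t3", ("t1", "t2")),   # Add B and E
    ("sw", None, ("t3", "t0")),    # Store result for variable A
    ("lw", "t4", ("t0",)),         # Load variable F
    ("add", "t5", ("t1", "t4")),   # Add B and F
    ("sw", None, ("t5", "t0")),    # Store result for variable C
]

# Move the load of F up so that no add directly follows the load it needs.
reordered = [original[i] for i in (0, 1, 4, 2, 3, 5, 6)]

if __name__ == "__main__":
    print(load_use_stalls(original), load_use_stalls(reordered))
```

Under these assumptions the original order stalls twice and the reordered one not at all, while both compute the same values.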
Control hazards
Instructions that change the control flow:
branch
jump
call (jump and link)
return
(exception/interrupt and rti, return from interrupt)
Branch example
[Figure: program execution order against clock cycles CC 1 to CC 9, with stages IM, Reg, DM, Reg, for the instructions at addresses 44 (and $12, $2, $5), 48 (or $13, $6, $2), 52 (add $14, $2, $2) and 72 (lw $4, 50($7)).]
Branching (Solution-1)
Squash pipeline:
When we decide to branch, other instructions are already in the pipeline.
We predict that the branch is not taken, so hardware must be added to flush those instructions if the prediction is wrong.
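The cost of predicting not taken can be estimated with a simple CPI model. All numbers below are assumptions for illustration (20% branches, 60% of them taken, three flushed instructions per taken branch); none of them come from the slides.

```python
def effective_cpi(branch_fraction, taken_fraction, flush_penalty, base_cpi=1.0):
    """CPI when every taken branch flushes `flush_penalty` instructions.

    With predict-not-taken, only taken branches are mispredicted.
    """
    return base_cpi + branch_fraction * taken_fraction * flush_penalty


if __name__ == "__main__":
    # Assumed workload: 20% branches, 60% of them taken, 3-cycle flush.
    print(round(effective_cpi(0.2, 0.6, 3), 2))
```

The penalty term shrinks if the branch is resolved earlier in the pipeline (a smaller flush_penalty) or if fewer predictions are wrong.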
[Figure: pipeline diagram against clock cycles for "Branch L" with predict not taken; the branch, the instructions that follow it, and the instruction at label L each pass through IF, ID, EX, MEM, WB.]
Branching (Solution-2)
Intelligent Predictor:
Some branches are predicted as taken, some as untaken.
Dynamic hardware predictors decide at run time whether a branch is predicted taken or not.
They can reach up to 90% prediction accuracy.
When the guess is wrong, the pipeline must be emptied.
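One common dynamic scheme is a 2-bit saturating counter per branch. The slides do not name a specific predictor, so this choice, the class name, and the loop-branch trace are assumptions made for illustration.

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0 and 1 predict not taken,
    states 2 and 3 predict taken."""

    def __init__(self, state=0):
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move one step towards the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_correct(outcomes, predictor):
    """Predict each branch outcome, then train the predictor on the real one."""
    correct = 0
    for taken in outcomes:
        if predictor.predict() == taken:
            correct += 1
        predictor.update(taken)
    return correct


if __name__ == "__main__":
    # Assumed trace: a loop branch taken 9 times, then not taken, repeated 10 times.
    trace = ([True] * 9 + [False]) * 10
    print(count_correct(trace, TwoBitPredictor()), "of", len(trace))
```

On this trace the counter mispredicts twice while warming up and then once per loop exit, giving 88 correct predictions out of 100.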
Branching (Solution-3)
Delayed branch instruction:
The delayed branch always executes the next
sequential instruction. (Branch taken after one
instruction)
Hidden from programmer, because assembler
can automatically handle it.
Non-pipelined
Multi-cycle Implementation
Interlocks implemented in hardware detect when to stall the pipeline.
To avoid two writes in the same cycle:
Track the use of the write port in the ID stage.
Stall an instruction before it is issued for execution.
If the instruction in ID needs to use the write port at the same time as an already issued instruction, the instruction in ID is stalled for one cycle.
To avoid the possibility of a RAW hazard:
Stall the instruction for the RAW hazard.
To avoid the possibility of a WAW hazard:
If a RAW hazard precedes the WAW hazard, the stall for the RAW hazard also handles the WAW hazard.
A WAW hazard without an intervening read occurs only with useless instructions, so there are normally no two consecutive writes. (In some rare cases a WAW hazard can still arise.)
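The write-port interlock described above can be sketched as a reservation check at issue time. This is a simplified model under stated assumptions: one register write port, in-order issue of at most one instruction per cycle, and hypothetical latencies measured from issue to register write.

```python
def issue_cycles(latencies):
    """latencies[i]: cycles from issue to register write for instruction i.

    Returns the cycle in which each instruction is issued. An instruction
    stays in ID as long as its write-back cycle is already reserved.
    """
    reserved = set()    # cycles in which the write port is already in use
    cycle = 0
    issued_at = []
    for latency in latencies:
        while cycle + latency in reserved:
            cycle += 1                  # stall in ID for one cycle, then re-check
        reserved.add(cycle + latency)
        issued_at.append(cycle)
        cycle += 1                      # the next instruction issues a cycle later
    return issued_at


if __name__ == "__main__":
    print(issue_cycles([3, 2, 1]))   # each one would otherwise write in cycle 3 or 4
    print(issue_cycles([1, 1, 1]))   # equal latencies never collide
```

With latencies 3, 2, 1 the second and third instructions are each held back one cycle; with equal latencies no stall is needed.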
Instruction Scheduling
The compiler or programmer schedules instructions (i.e. modifies the sequence of code) to minimize hardware stalls.
Without changing the meaning of the code, the compiler rearranges the order of instructions to reduce pipeline stalls.
Scheduling Constraints
The scheduled/optimized program must generate the same result
as the original program generates.
All the operations executed in the original program must be
executed in scheduled/optimized program.
No over-usage of resources. The assignment of resources in a cycle must comply with the available resources.
Scheduling Constraints
Data Dependence:
Compiler or programmer must schedule the code keeping in view
the following data dependences.
True dependence: write -> read (RAW hazard)
1. a =
2.   = a
Output dependence: write -> write (WAW hazard)
1. a =
2. a =
Anti dependence: read -> write (WAR hazard)
1.   = a
2. a =
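These constraints can be turned into a simple legality check for a proposed schedule. The sketch below covers only the data dependences and the requirement that every operation is executed exactly once; resource limits and memory dependences are not modelled. The three-statement program and all names are hypothetical.

```python
def conflicts(a, b):
    """True if the later operation b has a true, output, or anti dependence on a."""
    return bool(
        a["writes"] & b["reads"]       # true dependence (RAW)
        or a["writes"] & b["writes"]   # output dependence (WAW)
        or a["reads"] & b["writes"]    # anti dependence (WAR)
    )


def is_valid_schedule(program, order):
    """order: list of indices into program giving the new execution order."""
    if sorted(order) != list(range(len(program))):
        return False                   # every operation must appear exactly once
    position = {idx: pos for pos, idx in enumerate(order)}
    for i in range(len(program)):
        for j in range(i + 1, len(program)):
            # Dependent operations must keep their original relative order.
            if conflicts(program[i], program[j]) and position[i] > position[j]:
                return False
    return True


program = [
    {"writes": {"a"}, "reads": {"b", "c"}},   # 0: a = b + c
    {"writes": {"d"}, "reads": {"a"}},        # 1: d = a * 2
    {"writes": {"e"}, "reads": {"b"}},        # 2: e = b - 1
]

if __name__ == "__main__":
    print(is_valid_schedule(program, [0, 2, 1]))   # independent op moved up
    print(is_valid_schedule(program, [1, 0, 2]))   # breaks the RAW on a
```

Any order that passes this check produces the same register results as the original, because only independent operations have changed places.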
Before scheduling:
addiu $t1, $t1, 1
addiu $t2, $t2, 1
beq   $t2, $t3, label
nop

After scheduling:
addiu $t2, $t2, 1
beq   $t2, $t3, label
addiu $t1, $t1, 1
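The addiu/beq example can be reproduced with a small delay-slot filler. This is a minimal sketch, assuming a single delay slot and considering only instructions that precede the branch in the same block; the tuple layout and function name are illustrative.

```python
def fill_delay_slot(block):
    """block: list of (op, dest, sources) whose last entry is the branch.

    Moves the latest earlier instruction that neither the branch nor the
    instructions it would jump over depend on into the delay slot.
    Falls back to a nop when no such instruction exists.
    """
    *body, branch = block
    branch_reads = set(branch[2])
    for i in range(len(body) - 1, -1, -1):
        _, dest, srcs = body[i]
        later = body[i + 1:]
        safe = dest not in branch_reads and all(
            dest not in l_srcs          # no RAW on the moved result
            and dest != l_dest          # no WAW with a later write
            and l_dest not in srcs      # no WAR on the moved instruction's inputs
            for _, l_dest, l_srcs in later
        )
        if safe:
            return body[:i] + later + [branch, body[i]]
    return body + [branch, ("nop", None, ())]


block = [
    ("addiu", "$t1", ("$t1",)),
    ("addiu", "$t2", ("$t2",)),
    ("beq", None, ("$t2", "$t3")),
]

if __name__ == "__main__":
    for instr in fill_delay_slot(block):
        print(instr)
```

Here the increment of $t2 cannot move because the branch reads $t2, so the independent increment of $t1 fills the slot instead of a nop.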