Lec03 - Pipelining (2021)

ADVANCED SYSTEM ARCHITECTURES
Pipelining

Trần Ngọc Thịnh
BK TP.HCM
http://www.cse.hcmut.edu.vn/~tnthinh
©2021, dce
What is pipelining?
Sequential Laundry

[Figure: four loads A-D run one after another from 6 PM to midnight; each load takes 30 min wash, 40 min dry, 20 min fold.]

• Sequential laundry takes 6 hours for 4 loads
• If they learned pipelining, how long would laundry take?

Pipelined Laundry: start work ASAP

[Figure: the same four loads A-D overlapped; after the first 30-min wash, one load finishes every 40 min, the duration of the slowest stage.]

• Pipelined laundry takes 3.5 hours for 4 loads
• Speedup = 6 / 3.5 = 1.7
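The laundry arithmetic above can be checked mechanically; a minimal sketch using the stage times from the figure (30/40/20 min, 4 loads):

```python
# Stage times (minutes) for wash, dry, fold, taken from the slides;
# the 40-minute dryer is the slowest stage and sets the pipeline rate.
WASH, DRY, FOLD = 30, 40, 20
LOADS = 4

# Sequential: every load runs all stages back to back.
sequential = LOADS * (WASH + DRY + FOLD)            # 4 * 90 = 360 min = 6 h

# Pipelined: the first load fills the pipeline (90 min), then one load
# completes per slowest-stage interval.
pipelined = (WASH + DRY + FOLD) + (LOADS - 1) * max(WASH, DRY, FOLD)
# 90 + 3 * 40 = 210 min = 3.5 h

print(sequential / 60, pipelined / 60)              # 6.0 vs 3.5 hours
print(round(sequential / pipelined, 1))             # speedup = 1.7
```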
Pipelining Lessons

• Pipelining doesn't help latency of a single task; it helps throughput of the entire workload
• Pipeline rate limited by slowest pipeline stage
• Multiple tasks operating simultaneously
• Potential speedup = number of pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to "fill" the pipeline and time to "drain" it reduce speedup

Pipelining Example: Laundry

• Pipelined Laundry Observations:
  – At some point, all stages of washing will be operating concurrently
  – Pipelining doesn't reduce the number of stages
    • doesn't help latency of a single task
    • helps throughput of the entire workload
  – As long as we have separate resources, we can pipeline the tasks
  – Multiple tasks operating simultaneously use different resources
Pipelining Example: Laundry

• Pipelined Laundry Observations:
  – Speedup due to pipelining depends on the number of stages in the pipeline
  – Pipeline rate limited by slowest pipeline stage
    • If the dryer needs 45 min, the time for all stages has to be 45 min to accommodate it
    • Unbalanced lengths of pipe stages reduce speedup
  – If one load depends on another, we will have to wait (Delay/Stall for Dependencies)

CPU Pipelining

• 5 stages of a MIPS instruction:
  – Fetch instruction from instruction memory
  – Read registers while decoding instruction
  – Execute operation or calculate address, depending on the instruction type
  – Access an operand from data memory
  – Write result into a register
• We can reduce the cycles to fit the stages:

  Load: Ifetch | Reg/Dec | Exec | Mem | Wr
CPU Pipelining

• Example: resources for a Load instruction
  – Fetch instruction from instruction memory (Ifetch): instruction memory (IM)
  – Read registers while decoding instruction (Reg/Dec): register file & decoder (Reg)
  – Execute operation or calculate address, depending on the instruction type (Exec): ALU
  – Access data memory: data memory (DM)
  – Write result into a register (Wr)

CPU Pipelining

• Note that accessing the source & destination registers is performed in two different parts of the cycle
• We need to decide in which part of the cycle reading of and writing to the register file should take place

[Figure: five instructions (Inst 0-4) flowing through Im | Reg | ALU | Dm | Reg, offset by one clock cycle; register-file writing and reading are marked in different halves of the cycle.]
CPU Pipelining: Example

• Single-cycle, non-pipelined execution
  – Total time for 3 instructions: 24 ns
  – Each lw takes 8 ns: instruction fetch, register read, ALU, data access, register write

[Figure: program execution order - lw $1, 100($0); lw $2, 200($0); lw $3, 300($0) - each instruction occupies a full 8 ns before the next begins.]

CPU Pipelining: Example

• Single-cycle, pipelined execution
  – Improve performance by increasing instruction throughput
  – Total time for 3 instructions = 14 ns
  – Each instruction adds 2 ns to total execution time
  – Stage time limited by slowest resource (2 ns)
  – Assumptions:
    • Write to register occurs in 1st half of clock
    • Read from register occurs in 2nd half of clock

[Figure: the same three lw instructions overlapped, starting 2 ns apart; every stage takes 2 ns.]
CPU Pipelining: Example

• Assumptions:
  – Only consider the following instructions: lw, sw, add, sub, and, or, slt, beq
  – Operation times for instruction classes are:
    • Memory access: 2 ns
    • ALU operation: 2 ns

MIPS dataflow

[Figure: pipelined datapath with IF/ID, ID/EX, EX/MEM, MEM/WB pipeline registers; PC, adder (PC + 4) with mux, and a branch comparator on IR6...10 producing the "branch taken" signal.]
CPU Pipelining Example: (1/2)

• Theoretically:
  – Speedup should be equal to the number of stages (n tasks, k stages, p latency)

CPU Pipelining Example: (2/2)

• If we have 3 consecutive instructions:
  – Non-pipelined needs 8 x 3 = 24 ns
  – Pipelined needs 14 ns
  => Speedup = 24 / 14 = 1.7
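The ideal speedup with n tasks, k balanced stages, and per-stage latency p is commonly written as n·k·p / ((k + n − 1)·p); a small sketch (the formula is the standard one, not quoted from the slides). Note it approaches k only for large n, and the slides' 24/14 = 1.7 is lower still because the non-pipelined datapath takes 8 ns rather than k·p = 10 ns:

```python
def ideal_speedup(n: int, k: int) -> float:
    """Ideal pipeline speedup: n tasks, k balanced stages.
    Unpipelined time n*k*p vs pipelined time (k + n - 1)*p;
    the per-stage latency p cancels out."""
    return (n * k) / (k + n - 1)

print(round(ideal_speedup(3, 5), 2))     # 2.14: fill time dominates with few tasks
print(round(ideal_speedup(1000, 5), 2))  # 4.98: approaches k = 5 for large n
```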
Pipelining MIPS Instruction Set

• MIPS was designed with pipelining in mind
  => Pipelining is easy in MIPS:
  – All instructions are the same length
  – Limited instruction format
  – Memory operands appear only in lw & sw instructions
  – Operands must be aligned in memory

1. All MIPS instructions are the same length
  – Fetch instruction in 1st pipeline stage
  – Decode instructions in 2nd stage
  – If instruction length varies (e.g. 80x86), pipelining will be more challenging

Pipelining MIPS Instruction Set

2. MIPS has a limited instruction format
  – Source register in the same place for each instruction (symmetric)
  – 2nd stage can begin reading at the same time as decoding
  – If the instruction format weren't symmetric, stage 2 would have to be split into 2 distinct stages
  => Total stages = 6 (instead of 5)
Pipelining MIPS Instruction Set

3. Memory operands appear only in lw & sw instructions
  – We can use the execute stage to calculate the memory address
  – Access memory in the next stage
  – If we needed to operate on operands in memory (e.g. 80x86), stages 3 & 4 would expand to:
    • Address calculation
    • Memory access
    • Execute

Pipelining MIPS Instruction Set

4. Operands must be aligned in memory
  – Transfer of more than one data operand can be done in a single stage with no conflicts
  – Need not worry about a single data-transfer instruction requiring 2 data memory accesses
  – Requested data can be transferred between the CPU & memory in a single pipeline stage
Instruction Pipelining Review

• MIPS In-Order Single-Issue Integer Pipeline: ideal operation
  – Fill cycles = number of stages − 1
  – No stall cycles
5 Steps of MIPS Datapath

• Stages: Instruction Fetch (IF) | Instr. Decode / Reg. Fetch (ID) | Execute / Addr. Calc (EX) | Memory Access (MEM) | Write Back (WB)

[Figure: pipelined datapath - Next PC mux, adder (Next SEQ PC = PC + 4), instruction memory, register file (RS1, RS2, RD), sign-extended immediate (Imm), ALU with Zero? output, data memory, WB data mux, and the IF/ID, ID/EX, EX/MEM, MEM/WB pipeline registers.]

• Register-transfer view of the stages:
  IF:  IR <= mem[PC]; PC <= PC + 4
  ID:  A <= Reg[IRrs]; B <= Reg[IRrt]
  EX:  rslt <= A op(IRop) B
  MEM: WB <= rslt
  WB:  Reg[IRrd] <= WB
• Data stationary control
  – local decode for each instruction phase / pipeline stage

Visualizing Pipelining (Figure A.2, Page A-8)

[Figure: five instructions over cycles 1-7; each goes Ifetch | Reg | ALU | DMem | Reg, offset by one cycle. The destination register is written in the first half of the WB cycle; operand registers are read in the second half of the ID cycle.]

• Operation of the ideal integer in-order 5-stage pipeline
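The staggered occupancy chart in the figure can be generated mechanically; a minimal sketch (stage names are the five from the slides, one instruction issued per cycle, no stalls):

```python
# Print the ideal 5-stage pipeline occupancy chart: instruction i
# enters stage s at cycle i + s, so each row is shifted right by one.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def chart(n_instructions: int) -> list[str]:
    rows = []
    total_cycles = n_instructions - 1 + len(STAGES)   # fill = stages - 1
    for i in range(n_instructions):
        cells = [""] * total_cycles
        for s, name in enumerate(STAGES):
            cells[i + s] = name
        rows.append(" | ".join(c.ljust(3) for c in cells))
    return rows

for row in chart(5):
    print(row)
```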
Pipelining Performance Example

• Example: for an unpipelined CPU:
  – Clock cycle = 1 ns; 4 cycles for ALU operations and branches, 5 cycles for memory operations, with instruction frequencies of 40%, 20% and 40%, respectively
  – If pipelining adds 0.2 ns to the machine clock cycle, then the speedup in instruction execution from pipelining is:

  Non-pipelined average instruction execution time = Clock cycle x Average CPI
    = 1 ns x ((40% + 20%) x 4 + 40% x 5) = 1 ns x 4.4 = 4.4 ns

  In the pipelined implementation, five stages are used with an average instruction execution time of 1 ns + 0.2 ns = 1.2 ns

  Speedup from pipelining = Instruction time unpipelined / Instruction time pipelined
    = 4.4 ns / 1.2 ns = 3.7 times faster

Pipeline Hazards

• Hazards are situations in pipelining which prevent the next instruction in the instruction stream from executing during its designated clock cycle, possibly resulting in one or more stall (or wait) cycles.
• Hazards reduce the ideal speedup (increase CPI > 1) gained from pipelining and are classified into three classes:
  – Structural hazards: arise from hardware resource conflicts when the available hardware cannot support all possible combinations of instructions.
  – Data hazards: arise when an instruction depends on the result of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline.
  – Control hazards: arise from the pipelining of conditional branches and other instructions that change the PC.
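The unpipelined-vs-pipelined speedup worked out above follows from the frequencies and cycle counts given in the example; a minimal check:

```python
# Frequencies and cycle counts from the example: ALU ops and branches
# take 4 cycles, memory operations take 5.
freq = {"alu": 0.40, "branch": 0.20, "mem": 0.40}
cycles = {"alu": 4, "branch": 4, "mem": 5}
clock_ns = 1.0
pipeline_overhead_ns = 0.2

avg_cpi = sum(freq[c] * cycles[c] for c in freq)      # 4.4
unpipelined_ns = clock_ns * avg_cpi                   # 4.4 ns per instruction
pipelined_ns = clock_ns + pipeline_overhead_ns        # 1.2 ns per instruction (CPI = 1)
print(round(unpipelined_ns / pipelined_ns, 1))        # 3.7
```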
How do we deal with hazards?

• Often, the pipeline must be stalled
• Stalling the pipeline usually lets some instruction(s) in the pipeline proceed while another/others wait for data, a resource, etc.
• A note on terminology:
  – If we say an instruction was "issued later than instruction x", we mean that it was issued after instruction x and is not as far along in the pipeline

Stalls and performance

• Stalls impede progress of a pipeline and result in deviation from 1 instruction executing per clock cycle
• Pipelining can be viewed as decreasing the CPI or the clock cycle time per instruction
• Let's see what effect stalls have on CPI:

  CPI pipelined = Ideal CPI + Pipeline stall cycles per instruction
                = 1 + Pipeline stall cycles per instruction

• Ignoring overhead and assuming stages are balanced:
Even more pipeline performance issues!

• This results in:

  Clock cycle pipelined = Clock cycle unpipelined / Pipeline depth
  => Pipeline depth = Clock cycle unpipelined / Clock cycle pipelined

• Which leads to:

  Speedup from pipelining = [1 / (1 + Pipeline stall cycles per instruction)] x (Clock cycle unpipelined / Clock cycle pipelined)
                          = Pipeline depth / (1 + Pipeline stall cycles per instruction)

• If there are no stalls, the speedup equals the number of pipeline stages in the ideal case

Structural Hazards

• In pipelined machines, overlapped instruction execution requires pipelining of functional units and duplication of resources to allow all possible combinations of instructions in the pipeline.
• If a resource conflict arises due to a hardware resource being required by more than one instruction in a single cycle, and one or more such instructions cannot be accommodated, then a structural hazard has occurred; for example:
  – when a pipelined machine has a shared single-memory pipeline stage for data and instructions
  => stall the pipeline for one cycle for the memory data access
33 34
dce dce
2021
An example of a structural hazard 2021
How is it resolved?
ALU
Load Mem Reg DM Reg
ALU
ALU
Instruction 1 Mem Reg DM Reg
ALU
ALU
Instruction 2 Mem Reg DM Reg
ALU
ALU
Instruction 3 Mem Reg DM Reg
ALU
Time
Pipeline generally stalled by
Time
What’s the problem here? 35
inserting a “bubble” or NOP 36
35 36
Or alternatively…

  Clock Number
  Inst. #     1    2    3    4      5    6    7    8    9    10
  LOAD        IF   ID   EX   MEM    WB
  Inst. i+1        IF   ID   EX     MEM  WB
  Inst. i+2             IF   ID     EX   MEM  WB
  Inst. i+3                  stall  IF   ID   EX   MEM  WB
  Inst. i+4                         IF   ID   EX   MEM  WB
  Inst. i+5                              IF   ID   EX   MEM
  Inst. i+6                                   IF   ID   EX

• The LOAD instruction "steals" an instruction fetch cycle, which causes the pipeline to stall.
• Thus, no instruction completes on clock cycle 8.

A Structural Hazard Example

• Given that data references are 40% for a specific instruction mix or program, and that the ideal pipelined CPI ignoring hazards is equal to 1.
• A machine with a data-memory-access structural hazard requires a single stall cycle for data references and has a clock rate 1.05 times higher than the ideal machine. Ignoring other performance losses for this machine:

  Average instruction time = CPI x Clock cycle time
                           = (1 + 0.4 x 1) x (Clock cycle time ideal / 1.05)
                           ≈ 1.3 x Clock cycle time ideal

• Therefore the machine without the hazard is better.
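The structural-hazard arithmetic above reduces to one line; a minimal check using the example's numbers:

```python
# Hazard machine: 1 stall per data reference (40% of instructions),
# but a clock 1.05x faster than the ideal machine at CPI = 1.
data_ref_freq = 0.4
stall_per_data_ref = 1
clock_speedup = 1.05

cpi_hazard = 1 + data_ref_freq * stall_per_data_ref   # 1.4
relative_time = cpi_hazard / clock_speedup            # vs ideal machine
print(round(relative_time, 2))   # ~1.33: the hazard machine is ~1.3x slower
```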
Remember the common case!

• All things being equal, a machine without structural hazards will always have a lower CPI.
• But, in some cases it may be better to allow them than to eliminate them.
• These are situations a computer architect might have to consider:
  – Is pipelining functional units or duplicating them costly in terms of HW?
  – Does the structural hazard occur often?
  – What's the common case?

Data Hazards

• Data hazards occur when the pipeline changes the order of read/write accesses to instruction operands, so that the resulting access order differs from the original sequential operand access order of the unpipelined machine, resulting in incorrect execution.
• Data hazards may require one or more instructions to be stalled to ensure correct execution.
• Example:
    ADD R1, R2, R3
    SUB R4, R1, R5
    AND R6, R1, R7
    OR  R8, R1, R9
    XOR R10, R1, R11
  – All the instructions after the ADD use the result of the ADD instruction
  – The SUB and AND instructions need to be stalled for correct execution
Data Hazard on R1

[Figure: add r1,r2,r3 followed by sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11 in the pipeline (IF, ID/RF, EX, MEM, WB); the following instructions read r1 before or while the add writes it back.]

Minimizing Data Hazard Stalls by Forwarding

• Forwarding is a hardware-based technique (also called register bypassing or short-circuiting) used to eliminate or minimize data hazard stalls.
• Using forwarding hardware, the result of an instruction is copied directly from where it is produced (ALU, memory read port, etc.) to where subsequent instructions need it (ALU input register, memory write port, etc.):
  – The ALU result from the EX/MEM register may be fed back to the ALU input latches as needed, instead of the register operand value read in the ID stage.
  – Similarly, the Data Memory Unit result from the MEM/WB register may be fed back to the ALU input latches as needed.
  – If the forwarding hardware detects that a previous ALU operation is to write the register corresponding to a source for the current ALU operation, control logic selects the forwarded result as the ALU input rather than the value read from the register file.
HW Change for Forwarding

[Figure: datapath with forwarding - muxes at the ALU inputs select among the ID/EX register values, the EX/MEM ALU result, and the MEM/WB result; registers, immediate, and data memory shown. The sequence add/sub r4,r1,r3/and r6,r1,r7/or r8,r1,r9/xor r10,r1,r11 flows through with the forwarded paths highlighted.]
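The mux control implied by the figure can be sketched in a few lines. This is a hedged sketch of the standard EX-hazard / MEM-hazard priority rule, not the slides' exact control equations; names are illustrative:

```python
# Forwarding-unit sketch for a 5-stage pipeline: decide where each
# ALU source operand comes from - the register file, the EX/MEM ALU
# result, or the MEM/WB result. The most recent producer wins.
def forward_select(src_reg, exmem_rd, exmem_regwrite, memwb_rd, memwb_regwrite):
    """Return which value the ALU-input mux should pick for src_reg."""
    if exmem_regwrite and exmem_rd != 0 and exmem_rd == src_reg:
        return "EX/MEM"          # result just computed in EX
    if memwb_regwrite and memwb_rd != 0 and memwb_rd == src_reg:
        return "MEM/WB"          # result from two instructions ago
    return "REGFILE"

# add r1,r2,r3 ; sub r4,r1,r3 -> sub's first operand comes from EX/MEM
print(forward_select(1, exmem_rd=1, exmem_regwrite=True,
                     memwb_rd=0, memwb_regwrite=False))   # EX/MEM
```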
Forwarding

• Fix data hazards by forwarding results as soon as they are available to where they are needed

[Figure: add $1,… ; sub $4,$1,$5 ; and $6,$1,$7 ; or $8,$1,$9 ; xor $4,$1,$5 - the add's result is forwarded from the EX/MEM and MEM/WB registers to the ALU inputs of the following instructions.]

Forwarding to Avoid LW-SW Data Hazard

[Figure: add r1,r2,r3 ; lw r4, 0(r1) ; sw r4,12(r1) ; or r8,r6,r9 ; xor r10,r9,r11 - the loaded value of r4 is forwarded from the lw's memory stage to the sw's memory-write port.]
Data Hazard Classification

Given two instructions I and J, with I occurring before J in an instruction stream:

• RAW (read after write): a true data dependence
  J tries to read a source before I writes to it, so J incorrectly gets the old value.
• WAW (write after write): a name dependence
  J tries to write an operand before it is written by I.
  The writes end up being performed in the wrong order.
• WAR (write after read): a name dependence
  J tries to write to a destination before it is read by I, so I incorrectly gets the new value.
• RAR (read after read): not a hazard.

[Figure: four diagrams of I and J sharing an operand, in program order - I (Write) then J (Read) = RAW; I (Read) then J (Write) = WAR; I (Write) then J (Write) = WAW; I (Read) then J (Read) = RAR, not a hazard.]
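The classification above can be stated as three register-set checks; a minimal sketch following the RAW/WAR/WAW definitions (instruction encoding here is illustrative, not from the slides):

```python
# Classify the dependence between instruction I (earlier) and J (later)
# from their written/read registers.
def classify(i_writes, i_reads, j_writes, j_reads):
    kinds = []
    if i_writes and i_writes in j_reads:
        kinds.append("RAW")                 # J reads what I writes
    if j_writes and j_writes in i_reads:
        kinds.append("WAR")                 # J writes what I reads
    if i_writes and i_writes == j_writes:
        kinds.append("WAW")                 # both write the same register
    return kinds or ["none (RAR or independent)"]

# I: ADD R1,R2,R3 ; J: SUB R4,R1,R5 -> true dependence on R1
print(classify("R1", ["R2", "R3"], "R4", ["R1", "R5"]))   # ['RAW']
```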
Read after write (RAW) hazards

• With a RAW hazard, instruction j tries to read a source operand before instruction i writes it.
• Thus, j would incorrectly receive an old or incorrect value.

Write after write (WAW) hazards

• With a WAW hazard, instruction j tries to write an operand before instruction i writes it.
• The writes are performed in the wrong order, leaving the value written by the earlier instruction.

Write after read (WAR) hazards

• With a WAR hazard, instruction j tries to write an operand before instruction i reads it.
• Instruction i would incorrectly receive the newer value of its operand:
  – Instead of getting the old value, it could receive some newer, undesired value.
• Graphically/Example:
    i: SUB R4, R1, R3   (instruction i is a read instruction issued before j)
    j: ADD R1, R2, R3   (instruction j is a write instruction issued after i)

Data Hazards Requiring Stall Cycles

• In some code sequence cases, potential data hazards cannot be handled by bypassing. For example:
    LW  R1, 0(R2)
    SUB R4, R1, R5
    AND R6, R1, R7
    OR  R8, R1, R9
• The LW instruction has the data in clock cycle 4 (MEM cycle).
• The SUB instruction needs the data of R1 at the beginning of that cycle.
• The hazard is prevented by a hardware pipeline interlock causing a stall cycle.
Data Hazard Even with Forwarding

[Figure: a load into r1 followed by sub r4,r1,r6 ; and r6,r1,r7 ; or r8,r1,r9 - the load's data is not available until the end of its MEM stage, too late for the sub's EX stage even with forwarding.]

Data Hazard Even with Forwarding

[Figure: the same sequence with a one-cycle bubble inserted before the sub's EX stage; the and and or instructions are likewise delayed by one cycle.]
Hardware Pipeline Interlocks

• A hardware pipeline interlock detects a data hazard and stalls the pipeline until the hazard is cleared.
• The CPI for the stalled instruction increases by the length of the stall.
• For the previous example, without a stall cycle:

    LW  R1, 0(R1)    IF  ID  EX  MEM  WB
    SUB R4, R1, R5       IF  ID  EX   MEM  WB
    AND R6, R1, R7           IF  ID   EX   MEM  WB
    OR  R8, R1, R9               IF   ID   EX   MEM  WB

• With a stall cycle: stall + forward (a one-cycle stall is inserted after the LW, then the loaded value is forwarded to the SUB).

Data hazards and the compiler

• The compiler should be able to help eliminate some stalls caused by data hazards
• i.e. the compiler can avoid generating a LOAD instruction that is immediately followed by an instruction that uses the result in the LOAD's destination register.
Some example situations

• Dependence requiring stall:
    LW  R1, 45(R2)
    ADD R5, R1, R7
    SUB R8, R6, R7
    OR  R9, R6, R7
  Comparators detect the use of R1 in the ADD and stall the ADD (and SUB and OR) before the ADD begins EX.

• Dependence overcome by forwarding:
    LW  R1, 45(R2)
    ADD R5, R6, R7
    SUB R8, R1, R7
    OR  R9, R6, R7
  Comparators detect the use of R1 in SUB and forward the result of the LOAD to the ALU in time for SUB to begin EX.

• Dependence with accesses in order:
    LW  R1, 45(R2)
    ADD R5, R6, R7
    SUB R8, R6, R7
    OR  R9, R1, R7
  No action is required, because the read of R1 by OR occurs in the second half of the ID phase, while the write of the loaded data occurred in the first half.

Static Compiler Instruction Scheduling (Re-Ordering)

• Rather than allow the pipeline to stall, the compiler could sometimes schedule the pipeline to avoid stalls.
• Compiler pipeline or instruction scheduling involves rearranging the code sequence (instruction reordering) to eliminate or reduce the number of stall cycles.
• Static = at compilation time, by the compiler
• Dynamic = at run time, by hardware in the CPU
Static Compiler Instruction Scheduling Example

• For the code sequence (a, b, c, d, e, and f are in memory):
    a = b + c
    d = e - f
• Assuming loads have a latency of one clock cycle, the following compiler schedule eliminates stalls:

    Original code with stalls:    Scheduled code with no stalls:
    LW   Rb, b                    LW   Rb, b
    LW   Rc, c                    LW   Rc, c
    Stall                         LW   Re, e
    ADD  Ra, Rb, Rc               ADD  Ra, Rb, Rc
    SW   Ra, a                    LW   Rf, f
    LW   Re, e                    SW   Ra, a
    LW   Rf, f                    SUB  Rd, Re, Rf
    Stall                         SW   Rd, d
    SUB  Rd, Re, Rf
    SW   Rd, d

  2 stalls for the original code; no stalls for the scheduled code.

Performance of Pipelines with Stalls

• Hazard conditions in pipelines may make it necessary to stall the pipeline by a number of cycles, degrading performance from the ideal pipelined CPU CPI of 1.

    CPI pipelined = Ideal CPI + Pipeline stall clock cycles per instruction
                  = 1 + Pipeline stall clock cycles per instruction

• If pipelining overhead is ignored and we assume that the stages are perfectly balanced, then the speedup from pipelining is given by:

    Speedup = CPI unpipelined / CPI pipelined
            = CPI unpipelined / (1 + Pipeline stall cycles per instruction)

• When all instructions in the multicycle CPU take the same number of cycles, equal to the number of pipeline stages, then:

    Speedup = Pipeline depth / (1 + Pipeline stall cycles per instruction)
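The last formula above is worth a quick numeric check; a minimal sketch (the 0.5-stall figure is an illustrative input, not from the slides):

```python
# Pipeline speedup with stalls, per the formula above: depth / (1 + stalls).
def speedup(pipeline_depth: int, stall_cycles_per_instr: float) -> float:
    return pipeline_depth / (1 + stall_cycles_per_instr)

print(speedup(5, 0.0))            # 5.0: no stalls, ideal case
print(round(speedup(5, 0.5), 2))  # 3.33: half a stall cycle per instruction
```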
Control Hazard on Branches

• The pipeline cannot keep fetching the correct instructions until the branch condition is known (the branch is resolved).
  – Otherwise the PC may not be correct when needed in IF
• In the current MIPS pipeline, the conditional branch is resolved in stage 4 (MEM), so the successors after the branch incur stall cycles.

[Figure: a conditional branch followed by 18: or r6,r1,r7 ; 22: add r8,r1,r9 ; and the target 36: xor r10,r1,r11 - the branch successors have already entered IF/ID/EX before the branch is resolved in MEM.]
Reducing Branch Stall Cycles

Pipeline: 1- determine whether the branch is taken earlier; 2- compute the taken PC earlier in the pipeline.

• In MIPS:
  – MIPS branch instructions (BEQZ, BNE) test a register for equality to zero.
  – This can be completed in the ID cycle by moving the zero test into that cycle.
  – Both PCs (taken and not taken) must be computed early.
  – Requires an additional adder, because the current ALU is not usable until the EX cycle.
  – This results in just a single-cycle stall on branches: conditional branches are completed in the ID stage, so the branch is resolved in stage 2 (ID) and the branch penalty = 2 − 1 = 1.

[Figure: datapath with the zero test and the branch-target adder moved into the ID stage (adders, Zero?, muxes, sign extend shown).]
Branch Stall Impact

• If CPI = 1 and 30% of instructions are branches, a 3-cycle stall => new CPI = 1.9!
• Two-part solution:
  – Determine whether the branch is taken or not sooner, AND
  – Compute the taken-branch address earlier
• MIPS branch tests if a register = 0 or ≠ 0
• MIPS solution:
  – Move the zero test to the ID/RF stage
  – Add an adder to calculate the new PC in the ID/RF stage
  – 1 clock cycle penalty for a branch versus 3

Four Branch Hazard Alternatives

#1: Stall until the branch direction is clear

#2: Predict Branch Not Taken (most common)
  – Execute successor instructions in sequence
  – "Squash" instructions in the pipeline if the branch is actually taken
  – Advantage of late pipeline state update
  – 47% of MIPS branches are not taken on average
  – PC+4 is already calculated, so use it to get the next instruction

#3: Predict Branch Taken
  – 53% of MIPS branches are taken on average
  – But the branch target address hasn't been calculated yet in MIPS
    • MIPS still incurs a 1-cycle branch penalty
    • Other machines: branch target known before outcome
  – What happens when we hit a not-taken branch?
Predict Branch Not-Taken Scheme

• Assuming the MIPS pipeline with reduced branch penalty = 1:
  – Stall only when the branch is taken
  – Pipeline stall cycles from branches = frequency of taken branches x branch penalty

#4: Delayed Branch

• Define the branch to take effect AFTER the following instruction(s):
    branch instruction
    sequential successor_1
    …
    sequential successor_n
    branch target if taken
• A 1-slot delay allows a proper decision and the branch target address in the 5-stage pipeline
  – MIPS uses this
Delayed Branch Example

[Figure: pipeline diagrams for a not-taken branch (no stall) and a taken branch (no stall) when the single branch delay slot is used; assuming branch penalty = 1 cycle.]

Scheduling Branch Delay Slots

A. From before branch:
    add $1,$2,$3
    if $2=0 then
      delay slot
  becomes:
    if $2=0 then
      add $1,$2,$3

B. From branch target:
    sub $4,$5,$6
    add $1,$2,$3
    if $1=0 then
      delay slot
  becomes:
    add $1,$2,$3
    if $1=0 then
      sub $4,$5,$6

C. From fall through:
    add $1,$2,$3
    if $1=0 then
      delay slot
    or $7,$8,$9
    sub $4,$5,$6
  becomes:
    add $1,$2,$3
    if $1=0 then
      or $7,$8,$9
    sub $4,$5,$6

• A is the best choice: it fills the delay slot & reduces the instruction count (IC)
• In B, the sub instruction may need to be copied, increasing IC
• In B and C, it must be okay to execute the moved instruction when the branch goes the other way
Delayed Branch-delay Slot Scheduling Strategies

The branch-delay slot instruction can be chosen from three cases:

A. An independent instruction from before the branch (common; useful in computation):
   Always improves performance when used. The branch must not depend on the rescheduled instruction.
B. An instruction from the target of the branch:
   Improves performance if the branch is taken and may require instruction duplication. This instruction must be safe to execute if the branch is not taken.
C. An instruction from the fall-through instruction stream:
   Improves performance when the branch is not taken. The instruction must be safe to execute when the branch is taken.

The performance and usability of cases B and C are improved by using a canceling or nullifying branch.

Delayed Branch

• Compiler effectiveness for a single branch delay slot:
  – Fills about 60% of branch delay slots
  – About 80% of instructions executed in branch delay slots are useful in computation
  – About 50% (60% x 80%) of slots are usefully filled
• Delayed branch downside: as processors go to deeper pipelines and multiple issue, the branch delay grows and more than one delay slot is needed
  – Delayed branching has lost popularity compared to more expensive but more flexible dynamic approaches
  – Growth in available transistors has made dynamic approaches relatively cheaper
Evaluating Branch Alternatives

  Pipeline speedup = Pipeline depth / (1 + Branch frequency x Branch penalty)

• Assume: 4% unconditional branches, 6% conditional branches untaken, 10% conditional branches taken.

  Scheduling scheme   Branch penalty        CPI    Speedup vs. unpipelined   Speedup vs. stall
  Stall pipeline      3                     1.60   3.1                       1.0
  Predict not taken   1x0.04 + 3x0.10       1.34   3.7                       1.19
  Predict taken       1x0.14 + 2x0.06       1.26   4.0                       1.29
  Delayed branch      0.5                   1.10   4.5                       1.45

Pipeline Performance Example

• Assume the following MIPS instruction mix:

  Type         Frequency
  Arith/Logic  40%
  Load         30%   (of which 25% are followed immediately by an instruction using the loaded value)
  Store        10%
  Branch       20%   (of which 45% are taken)

• What is the resulting CPI for the pipelined MIPS with forwarding and branch address calculation in the ID stage, when using a branch-not-taken scheme? (Branch penalty = 1 cycle)

  CPI = Ideal CPI + Pipeline stall clock cycles per instruction
      = 1 + stalls by loads + stalls by branches
      = 1 + 0.3 x 0.25 x 1 + 0.2 x 0.45 x 1
      = 1 + 0.075 + 0.09
      = 1.165
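The CPI computation above is a sum of stall contributions; a minimal check using the mix from the example:

```python
# Branch-not-taken scheme, 1-cycle penalties for load-use and taken branches.
load_freq, load_use_frac = 0.30, 0.25    # 25% of loads are load-use pairs
branch_freq, taken_frac = 0.20, 0.45     # 45% of branches are taken
load_stall = branch_stall = 1

cpi = 1 + load_freq * load_use_frac * load_stall \
        + branch_freq * taken_frac * branch_stall
print(round(cpi, 3))   # 1.165
```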
Pipelining Summary

• Pipelining overlaps the execution of multiple instructions.
• With an ideal pipeline, the CPI is one, and the speedup is equal to the number of stages in the pipeline.
• However, several factors prevent us from achieving the ideal speedup, including:
  – Not being able to divide the pipeline evenly
  – The time needed to fill the pipeline and to drain (flush) it
  – Overhead needed for pipelining
  – Structural, data, and control hazards
• Just overlap tasks; easy if tasks are independent

Pipelining Summary

• Speedup vs. pipeline depth; if the ideal CPI is 1, then:

  Speedup = [Pipeline depth / (1 + Pipeline stall CPI)] x (Clock cycle unpipelined / Clock cycle pipelined)

• Hazards limit performance:
  – Structural: need more HW resources
  – Data: need forwarding, compiler scheduling
  – Control: early evaluation & PC, delayed branch, prediction
• Increasing the length of the pipe increases the impact of hazards; pipelining helps instruction bandwidth, not latency
• Compilers reduce the cost of data and control hazards:
  – Load delay slots
  – Branch delay slots
  – Branch prediction
Example

• (1) lw  $1, 40($6)
• (2) add $6, $2, $2
• (3) sw  $6, 50($1)
• (4) add $4, $5, $6
• (5) lw  $6, 10($4)

• B) Identify which dependencies will cause data hazards in the pipeline implementation (without forwarding hardware).

• C) Show the code after adding no-ops (stalls) to avoid hazards (needed for correctness in the absence of forwarding hardware).

• D) How many cycles will it take to execute this code (with no-ops added)?
  – According to the form in (C), it takes 13 cycles to execute this code.

• E) Now assume that the architecture has special hardware to implement forwarding, so that no-ops are added only in the cases where forwarding does not resolve the hazard. How many cycles will it take to execute the code?
  – Instr. (1) forwards the load result ($1) from the MEM/WB register to ID/EXE as an ALU input at cycle 5.
  – Instr. (2) forwards the add result ($6) from the EXE/MEM register to ID/EXE as a store input at cycle 5.
  – Instr. (2) forwards the add result ($6) from the MEM/WB register to ID/EXE as an ALU input at cycle 6.
  – Instr. (4) forwards the add result ($4) from the EXE/MEM register to ID/EXE as an ALU input at cycle 7.
  – With forwarding, it takes 9 cycles to execute the code.
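Part (B) can be mechanized: with a split-cycle register file (write in the first half, read in the second), only a definition within two instructions of a use is a hazard. A hedged sketch for the five-instruction example; the (dest, sources) encoding is illustrative:

```python
# Find RAW dependences in the example that become hazards without
# forwarding: a write is visible to a read only 3+ instructions later.
code = [
    ("lw",  "$1", ["$6"]),         # (1) lw  $1, 40($6)
    ("add", "$6", ["$2", "$2"]),   # (2) add $6, $2, $2
    ("sw",  None, ["$6", "$1"]),   # (3) sw  $6, 50($1)
    ("add", "$4", ["$5", "$6"]),   # (4) add $4, $5, $6
    ("lw",  "$6", ["$4"]),         # (5) lw  $6, 10($4)
]

for i, (_, dest, _) in enumerate(code):
    if dest is None:
        continue                    # sw writes no register
    for j in range(i + 1, min(i + 3, len(code))):
        if dest in code[j][2]:      # reader within 2 instructions: hazard
            print(f"RAW hazard: ({i+1}) writes {dest}, ({j+1}) reads it")
```

The four pairs it reports line up with the four forwarding actions listed in part (E).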