Lecture 5
Scheme that predicts branches using profile information collected from earlier runs, and modifies the prediction based on the last run:
[Figure: misprediction rate (0% to 20%) per benchmark, grouped into Integer and Floating Point]
Dynamic Branch Prediction
• Why does prediction work?
– Underlying algorithm has regularities
– Data that is being operated on has regularities
– Instruction sequence has redundancies that are artifacts of
the way that humans/compilers think about problems
• Is dynamic branch prediction better than static
branch prediction?
– Almost always yes
– There are a small number of important branches in programs
which have dynamic behavior
Dynamic Branch Prediction
[State diagram: four states (Predict Taken, Predict Taken, Predict Not Taken, Predict Not Taken) with transitions labeled T (taken) and NT (not taken)]
• Red: stop, not taken
• Green: go, taken
• Adds hysteresis to decision making process
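The hysteresis can be illustrated with a minimal sketch. It assumes the saturating-counter form of the 2-bit scheme (the slide's exact state transitions may differ), and the class and function names are hypothetical. A 1-bit predictor is included only for comparison.

```python
class SaturatingCounter:
    """n-bit saturating counter; the upper half of its states predicts taken."""

    def __init__(self, bits=2):
        self.max_state = (1 << bits) - 1
        self.threshold = 1 << (bits - 1)
        self.state = self.max_state  # assumption: start in the strongest "taken" state

    def predict(self):
        return self.state >= self.threshold

    def update(self, taken):
        if taken:
            self.state = min(self.max_state, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_mispredictions(counter, outcomes):
    """Predict each outcome, then train the counter on what actually happened."""
    misses = 0
    for taken in outcomes:
        if counter.predict() != taken:
            misses += 1
        counter.update(taken)
    return misses


# A loop branch: taken 9 times, then not taken at loop exit, executed 3 times.
loop_branch = ([True] * 9 + [False]) * 3

one_bit_misses = count_mispredictions(SaturatingCounter(bits=1), loop_branch)
two_bit_misses = count_mispredictions(SaturatingCounter(bits=2), loop_branch)
print(one_bit_misses, two_bit_misses)
```

With these assumptions the 1-bit predictor misses at each loop exit and again on re-entering the loop, while the 2-bit counter needs two wrong outcomes in a row before it changes its prediction, so it misses only at the exits.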
BHT Accuracy
• Mispredict because either:
– Wrong guess for that branch
– Got the branch history of the wrong branch when indexing the table
• 4096 entry table:
[Figure: misprediction rate (0% to 18%) per benchmark for the 4096-entry table, grouped into Integer and Floating Point]
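The second cause of misprediction, picking up another branch's history, follows from how the table is indexed. The sketch below assumes 4-byte instructions, indexing by the low-order bits of the word address, and an initial counter value of 1. All of these are illustrative choices, as are the names.

```python
class BranchHistoryTable:
    """Table of 2-bit counters indexed by low-order branch address bits, no tags."""

    def __init__(self, entries=4096):
        self.entries = entries
        self.counters = [1] * entries  # assumption: start weakly not taken

    def index(self, pc):
        # Assumption: 4-byte instructions, so drop the two byte-offset bits.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        return self.counters[self.index(pc)] >= 2

    def update(self, pc, taken):
        i = self.index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


bht = BranchHistoryTable()
branch_a = 0x1000
branch_b = branch_a + 4 * bht.entries  # a different branch, same table index

# Train only branch_a as taken.
for _ in range(4):
    bht.update(branch_a, True)

aliased = bht.index(branch_a) == bht.index(branch_b)
prediction_for_b = bht.predict(branch_b)
print(aliased, prediction_for_b)
```

Because the table has no tags, branch_b inherits the "taken" state trained by branch_a even though branch_b itself has never executed.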
Correlated Branch Prediction
• Idea: record m most recently executed branches
as taken or not taken, and use that pattern to
select the proper n-bit branch history table
– Behavior of recent branches selects between four predictions of next branch, updating just that prediction
[Diagram: 2-bits per branch predictor, producing the Prediction]
[Figure: misprediction rate (0% to 20%) for spice, gcc, nasa7, fpppp, espresso, matrix300, tomcatv, doduc, li, eqntott; series: 4,096 entries, 2 bits per entry; unlimited entries, 2 bits per entry; 1,024 entries, (2,2)]
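A minimal sketch of an (m, n) correlating predictor follows, assuming a single global history register and one row of 2^m counters per table entry. The names, the zero initial state, and the way entries are counted are illustrative assumptions. Setting m = 0 reduces it to a plain n-bit counter table, which gives a baseline.

```python
class CorrelatingPredictor:
    """(m, n) predictor: m global history bits pick one of 2**m n-bit counters."""

    def __init__(self, m=2, n=2, entries=1024):
        self.m = m
        self.entries = entries
        self.max_count = (1 << n) - 1
        self.threshold = 1 << (n - 1)
        self.history = 0  # outcomes of the last m branches, newest in bit 0
        self.tables = [[0] * (1 << m) for _ in range(entries)]

    def _row(self, pc):
        return self.tables[(pc >> 2) % self.entries]

    def predict(self, pc):
        return self._row(pc)[self.history] >= self.threshold

    def update(self, pc, taken):
        row = self._row(pc)
        count = row[self.history]
        # Update only the counter that the current history selected.
        row[self.history] = min(self.max_count, count + 1) if taken else max(0, count - 1)
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.m) - 1)


def count_mispredictions(predictor, pc, outcomes):
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


# A branch that strictly alternates taken / not taken.
alternating = [True, False] * 10

plain_misses = count_mispredictions(CorrelatingPredictor(m=0), 0x40, alternating)
correlated_misses = count_mispredictions(CorrelatingPredictor(m=2), 0x40, alternating)
print(plain_misses, correlated_misses)
```

On this pattern the history-free table keeps mispredicting every taken outcome, while the (2,2) version mispredicts only during warm-up, because the recent history identifies which phase of the pattern comes next.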
Tournament Predictors
• Local predictor
– Local history table: 1024 10-bit entries recording last 10
branches, indexed by branch address
– The pattern of the last 10 occurrences of that particular branch
is used to index a table of 1K entries with 3-bit saturating counters
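The two-level lookup of the local predictor can be sketched as below. The table sizes come from the slide. The indexing by word address, the zero initial state, and the names are assumptions made for illustration. The global predictor and the chooser are not modeled.

```python
class LocalPredictor:
    """Per-branch 10-bit history selects one of 1K 3-bit saturating counters."""

    HISTORY_BITS = 10
    COUNTER_MAX = 7        # 3-bit counter
    TAKEN_THRESHOLD = 4    # upper half of the counter range predicts taken

    def __init__(self, entries=1024):
        self.entries = entries
        self.histories = [0] * entries
        self.counters = [0] * (1 << self.HISTORY_BITS)

    def _slot(self, pc):
        # Assumption: 4-byte instructions, indexed by low-order word address bits.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        pattern = self.histories[self._slot(pc)]
        return self.counters[pattern] >= self.TAKEN_THRESHOLD

    def update(self, pc, taken):
        slot = self._slot(pc)
        pattern = self.histories[slot]
        count = self.counters[pattern]
        self.counters[pattern] = min(self.COUNTER_MAX, count + 1) if taken else max(0, count - 1)
        mask = (1 << self.HISTORY_BITS) - 1
        self.histories[slot] = ((pattern << 1) | int(taken)) & mask


def run(predictor, pc, outcomes):
    """Return one boolean per outcome: True where the prediction was correct."""
    hits = []
    for taken in outcomes:
        hits.append(predictor.predict(pc) == taken)
        predictor.update(pc, taken)
    return hits


# A branch with a repeating taken, taken, taken, not-taken pattern.
outcomes = [True, True, True, False] * 100
hits = run(LocalPredictor(), 0x400, outcomes)
print(all(hits[-100:]))
```

Once the counters for the four recurring history patterns have warmed up, every outcome of this periodic branch is predicted correctly, including the not-taken one that a single counter would miss.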
Comparing Predictors (Fig. 2.8)
• Advantage of tournament predictor is ability to
select the right predictor for a particular branch
– Particularly crucial for integer benchmarks.
– A typical tournament predictor will select the global predictor
almost 40% of the time for the SPEC integer benchmarks and
less than 15% of the time for the SPEC FP benchmarks
Pentium 4 Misprediction Rate
(per 1000 instructions, not per branch)
6% misprediction rate per branch SPECint (19% of INT instructions are branch)
2% misprediction rate per branch SPECfp (5% of FP instructions are branch)
[Figure: branch mispredictions per 1000 instructions (0 to 14) for SPECint2000 (164.gzip, 175.vpr, 176.gcc, 181.mcf, 186.crafty) and SPECfp2000 (168.wupwise, 171.swim, 172.mgrid, 173.applu, 177.mesa)]
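The per-branch rates and the per-1000-instruction counts are related by the branch frequency. A small sketch of that arithmetic, using the averages quoted on the slide (the function name is hypothetical):

```python
def mispredictions_per_1000(branch_fraction, mispredict_rate_per_branch):
    """Mispredictions per 1000 instructions = 1000 * branches/instr * misses/branch."""
    return 1000 * branch_fraction * mispredict_rate_per_branch


specint = mispredictions_per_1000(branch_fraction=0.19, mispredict_rate_per_branch=0.06)
specfp = mispredictions_per_1000(branch_fraction=0.05, mispredict_rate_per_branch=0.02)

print(f"SPECint: {specint:.1f} mispredictions per 1000 instructions")
print(f"SPECfp:  {specfp:.1f} mispredictions per 1000 instructions")
```

This gives roughly 11.4 for SPECint and 1.0 for SPECfp, which is why the integer benchmarks dominate the chart even though their per-branch rate is only three times higher.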
Branch Target Buffers (BTB)
• Note: Dynamic execution creates WAR and WAW hazards and makes
exceptions harder
Out-of-order execution introduces the possibility of WAR and WAW hazards, which do not
exist in the five-stage integer pipeline and its logical extension to an in-order floating-point
pipeline.
Consider the following MIPS floating-point code sequence:
DIV.D F0,F2,F4
ADD.D F6,F0,F8
SUB.D F8,F10,F14
MUL.D F6,F10,F8
There is an antidependence between the ADD.D and the SUB.D, and if the pipeline executes
the SUB.D before the ADD.D (which is waiting for the DIV.D), it will violate the
antidependence, yielding a WAR hazard. Likewise, to avoid violating output dependences,
such as the write of F6 by MUL.D, WAW hazards must be handled.
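The dependences in this sequence can be enumerated mechanically. The sketch below is a naive pairwise check: it compares every earlier/later instruction pair and ignores intervening writes, which is enough for this four-instruction example. The function name and tuple layout are illustrative.

```python
from itertools import combinations

# (opcode, destination, source 1, source 2) for the sequence above
PROGRAM = [
    ("DIV.D", "F0", "F2", "F4"),
    ("ADD.D", "F6", "F0", "F8"),
    ("SUB.D", "F8", "F10", "F14"),
    ("MUL.D", "F6", "F10", "F8"),
]


def find_dependences(program):
    """Return (kind, earlier index, later index, register) for each pair."""
    deps = []
    for (i, early), (j, late) in combinations(enumerate(program), 2):
        _, early_dst, *early_srcs = early
        _, late_dst, *late_srcs = late
        if early_dst in late_srcs:
            deps.append(("RAW", i, j, early_dst))  # true data dependence
        if late_dst in early_srcs:
            deps.append(("WAR", i, j, late_dst))   # antidependence
        if late_dst == early_dst:
            deps.append(("WAW", i, j, late_dst))   # output dependence
    return deps


for kind, i, j, reg in find_dependences(PROGRAM):
    print(f"{kind} on {reg}: {PROGRAM[i][0]} -> {PROGRAM[j][0]}")
```

It reports the RAW dependence on F0 that stalls ADD.D, the WAR on F8 between ADD.D and SUB.D, the WAW on F6 between ADD.D and MUL.D, and a RAW on F8 from SUB.D to MUL.D.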
Dynamic Scheduling Step 1
• Results to FU from RS, not through registers, over Common Data Bus that
broadcasts results to all FUs
– Avoids RAW hazards by executing an instruction only when its operands are available
• Integer instructions can go past branches (predict taken), allowing FP ops beyond
basic block in FP queue
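The broadcast on the Common Data Bus can be sketched as follows. Each reservation station holds either an operand value (Vj, Vk) or the tag of the station that will produce it (Qj, Qk), and a broadcast fills in every operand waiting on that tag. The class and function names are hypothetical, values are kept as symbolic strings, and the station contents mirror the worked example later in the lecture.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    name: str
    op: str
    vj: Optional[str] = None  # value of first operand, if available
    vk: Optional[str] = None  # value of second operand, if available
    qj: Optional[str] = None  # tag of the producer of the first operand
    qk: Optional[str] = None  # tag of the producer of the second operand

    def ready(self):
        """An instruction may execute only when no operand is still pending."""
        return self.qj is None and self.qk is None


def broadcast(stations, tag, value):
    """Common Data Bus: every station waiting on `tag` captures `value`."""
    for rs in stations:
        if rs.qj == tag:
            rs.vj, rs.qj = value, None
        if rs.qk == tag:
            rs.vk, rs.qk = value, None


stations = [
    ReservationStation("Add1", "SUBD", vj="M(A1)", qk="Load2"),
    ReservationStation("Mult1", "MULTD", vk="R(F4)", qj="Load2"),
    ReservationStation("Mult2", "DIVD", vk="M(A1)", qj="Mult1"),
]

# Load2 completes and puts its result on the bus.
broadcast(stations, "Load2", "M(A2)")
ready_names = [rs.name for rs in stations if rs.ready()]
print(ready_names)
```

After the broadcast, Add1 and Mult1 have both operands and can begin execution, while Mult2 still waits for the result tagged Mult1. No register file read is involved in the hand-off.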
Tomasulo Organization
[Diagram: FP Op Queue; FP Registers; Load Buffers (Load1 to Load6, from memory); Store Buffers (to memory); Reservation Stations (Add1, Add2, Add3 for the FP adders; Mult1, Mult2 for the FP multipliers)]
Tomasulo Example Cycle 1
Instruction status:
Instruction   j     k     Issue   Exec Comp   Write Result
LD     F6     34+   R2    1
LD     F2     45+   R3
MULTD  F0     F2    F4
SUBD   F8     F6    F2
DIVD   F10    F0    F6
ADDD   F6     F8    F2

Load buffers:
Name    Busy   Address
Load1   Yes    34+R2
Load2   No
Load3   No

Reservation Stations:
                            S1    S2    RS    RS
Time   Name    Busy   Op    Vj    Vk    Qj    Qk
       Add1    No
       Add2    No
       Add3    No
       Mult1   No
       Mult2   No
Tomasulo Example Cycle 2
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
Mult2 No
Tomasulo Example Cycle 3
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 Yes MULTD R(F4) Load2
Mult2 No
Tomasulo Example Cycle 4
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 Yes SUBD M(A1) Load2
Add2 No
Add3 No
Mult1 Yes MULTD R(F4) Load2
Mult2 No
Tomasulo Example Cycle 5
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
2 Add1 Yes SUBD M(A1) M(A2)
Add2 No
Add3 No
10 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1
Tomasulo Example Cycle 6
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
1 Add1 Yes SUBD M(A1) M(A2)
Add2 Yes ADDD M(A2) Add1
Add3 No
9 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1
Tomasulo Example Cycle 7
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
0 Add1 Yes SUBD M(A1) M(A2)
Add2 Yes ADDD M(A2) Add1
Add3 No
8 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1
Tomasulo Example Cycle 8
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
2 Add2 Yes ADDD (M-M) M(A2)
Add3 No
7 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1
Tomasulo Example Cycle 9
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
1 Add2 Yes ADDD (M-M) M(A2)
Add3 No
6 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1
Tomasulo Example Cycle 10
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
0 Add2 Yes ADDD (M-M) M(A2)
Add3 No
5 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1
Tomasulo Example Cycle 11
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
4 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1
Tomasulo Example Cycle 12
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
3 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1
Tomasulo Example Cycle 13
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
2 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1
Tomasulo Example Cycle 14
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
1 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1
Tomasulo Example Cycle 15
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
0 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1
Tomasulo Example Cycle 16
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
40 Mult2 Yes DIVD M*F4 M(A1)
Tomasulo Example Cycle 55
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
1 Mult2 Yes DIVD M*F4 M(A1)
Tomasulo Example Cycle 56
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
0 Mult2 Yes DIVD M*F4 M(A1)
Tomasulo Example Cycle 57
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
Mult2 Yes DIVD M*F4 M(A1)
• Complexity
– delays of 360/91, MIPS 10000, Alpha 21264,
IBM PPC 620 in CA:AQA 2/e, but not in silicon!
• Many associative stores (CDB) at high speed
• Performance limited by Common Data Bus
– Each CDB must go to multiple functional units ⇒ high capacitance, high wiring density
– Number of functional units that can complete per cycle limited to one!
» Multiple CDBs ⇒ more FU logic for parallel associative stores
• Non-precise interrupts!
– We will address this later
And In Conclusion … #1
• Leverage Implicit Parallelism for Performance:
Instruction Level Parallelism
• Loop unrolling by compiler to increase ILP
• Branch prediction to increase ILP
• Dynamic HW exploiting ILP
– Works when can’t know dependence at compile time
– Can hide L1 cache misses
– Code for one machine runs well on another
And In Conclusion … #2
• Reservation stations: renaming to larger set of registers +
buffering source operands
– Prevents registers from becoming a bottleneck
– Avoids WAR, WAW hazards
– Allows loop unrolling in HW
• Not limited to basic blocks
(integer unit gets ahead, beyond branches)
• Helps cache misses as well
• Lasting Contributions
– Dynamic scheduling
– Register renaming
– Load/store disambiguation
• 360/91 descendants are Intel Pentium 4, IBM Power 5, AMD
Athlon/Opteron, …