07 Branch Prediction
Tomasulo Review
• Reservation stations: renaming to a larger set
of registers + buffering of source operands
– Prevents registers from becoming a bottleneck
– Avoids WAR, WAW hazards of Scoreboard
– Allows loop unrolling in HW
• Not limited to basic blocks
(integer unit gets ahead, beyond branches)
• Lasting Contributions
– Dynamic scheduling
– Register renaming
– Load/store disambiguation
• 360/91 descendants are Pentium II; PowerPC
604; MIPS R10000; HP-PA 8000; Alpha 21264
Outline
Dynamic Branch Prediction
Dynamic Branch Prediction
• Solution: 2-bit scheme where the prediction changes only after two consecutive mispredictions
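The 2-bit idea can be sketched in a few lines. This is a minimal illustration using the common saturating-counter variant (an assumption; the slide's state diagram is not reproduced here), and the class and function names are hypothetical. It compares a 1-bit and a 2-bit predictor on a loop branch that is taken nine times and then falls through, executed twice.

```python
class OneBitPredictor:
    """Remembers only the last outcome of the branch."""

    def __init__(self, taken=True):
        self.last = taken

    def predict(self):
        return self.last

    def update(self, taken):
        self.last = taken


class TwoBitPredictor:
    """Saturating counter: states 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, state=3):
        self.state = state  # 0 = strongly not taken, 3 = strongly taken

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_mispredictions(predictor, outcomes):
    """Predict each outcome, then train the predictor with the real result."""
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


# Loop branch: taken 9 times, then not taken once; the loop runs twice.
loop_outcomes = ([True] * 9 + [False]) * 2
one_bit_misses = count_mispredictions(OneBitPredictor(), loop_outcomes)
two_bit_misses = count_mispredictions(TwoBitPredictor(), loop_outcomes)
print(one_bit_misses, two_bit_misses)
```

In this sketch the 1-bit predictor mispredicts both the loop exit and the first iteration of the next run, while the 2-bit counter mispredicts only the exits.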
BHT Accuracy
• Mispredict because either:
– Wrong guess for that branch
– Got branch history of the wrong branch when indexing the
table
• With a 4096-entry table, misprediction rates vary from 1%
(nasa7, tomcatv) to 18% (eqntott), with spice at 9%
and gcc at 12%
• A 4096-entry table is about as good as an infinite table
(in Alpha 21164)
• Branch penalty and branch frequency are
also important
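The second cause of misprediction (getting the history of the wrong branch) comes from indexing the table with only part of the branch address. The sketch below assumes 4-byte instructions, a table indexed by the low-order PC bits, and 2-bit saturating counters; the class name and the example addresses are hypothetical.

```python
class BranchHistoryTable:
    """Untagged table of 2-bit counters indexed by low-order PC bits."""

    def __init__(self, entries=4096):
        self.entries = entries
        self.counters = [1] * entries  # start at "weakly not taken"

    def index(self, pc):
        # Assumption: 4-byte instructions, so drop the two low bits.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        return self.counters[self.index(pc)] >= 2

    def update(self, pc, taken):
        i = self.index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


bht = BranchHistoryTable(entries=4096)
pc_a = 0x1000
pc_b = pc_a + 4096 * 4  # a different branch that maps to the same entry

# Train only branch A as taken.
bht.update(pc_a, True)
bht.update(pc_a, True)

same_entry = bht.index(pc_a) == bht.index(pc_b)
aliased_prediction = bht.predict(pc_b)  # branch B inherits A's history
print(same_entry, aliased_prediction)
```

Because the table has no tags, branch B is predicted taken purely from branch A's behavior.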
BHT Accuracy
Correlating Branches
Examples
1-bit predictor (d is 0 or 2)
Correlating Prediction
Performance
Correlating Branches
(2,2) predictor
– The behavior of recent
branches selects between,
say, four predictions of the next
branch, updating just that
prediction
• Simple implementation:
– global history can be stored
in a shift register
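A (2,2) predictor along these lines can be sketched as follows: 2 bits of global history in a shift register select one of four 2-bit counters in the entry for the branch. The table size, the PC indexing, and all names are assumptions made for illustration.

```python
class CorrelatingPredictor:
    """(m, 2) predictor: m global history bits pick one of 2**m 2-bit counters."""

    def __init__(self, m=2, entries=1024):
        self.m = m
        self.entries = entries
        self.history = 0  # global history shift register, newest outcome in bit 0
        self.table = [[1] * (1 << m) for _ in range(entries)]

    def _row(self, pc):
        # Assumption: 4-byte instructions, low-order PC bits index the table.
        return self.table[(pc >> 2) % self.entries]

    def predict(self, pc):
        return self._row(pc)[self.history] >= 2

    def update(self, pc, taken):
        row = self._row(pc)
        counter = row[self.history]
        # Update just the counter that made this prediction.
        row[self.history] = min(3, counter + 1) if taken else max(0, counter - 1)
        # Shift the new outcome into the global history.
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.m) - 1)


predictor = CorrelatingPredictor(m=2, entries=1024)
pattern = [True, False] * 10  # a branch that alternates taken / not taken
misses = 0
for taken in pattern:
    if predictor.predict(0x400) != taken:
        misses += 1
    predictor.update(0x400, taken)
print(misses)
```

On this alternating pattern the sketch mispredicts only during warm-up, because each history value ends up with its own counter.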
Number of Stored Bits
Accuracy of Different Schemes
[Figure: bar chart of the frequency of mispredictions (axis from 0% to 18%) for the benchmarks gcc, doduc, li, spice, fpppp, eqntott, nasa7, tomcatv, espresso, and matrix300, comparing three schemes: 4,096 entries with 2 bits per entry; unlimited entries with 2 bits per entry; 1,024 entries (2,2)]
Tournament Branch Predictor
• Used in Alpha 21264: Track both “local” and global
history
• Intended for mixed types of applications
• Global history: T/NT history of past k branches, e.g. 0
1 0 1 0 1 (NT T NT T NT T)
[Diagram: the PC and the global history feed the local predictor, global predictor, and choice predictor; a mux selects the final NT/T prediction]
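The selection idea can be sketched as below. This is a deliberately simplified model, not the Alpha 21264 organization: it assumes a local predictor of 2-bit counters indexed by PC, a global predictor and a choice predictor of 2-bit counters indexed by global history, and hypothetical names and sizes throughout.

```python
def bump(counter, up):
    """Move a 2-bit saturating counter up or down."""
    return min(3, counter + 1) if up else max(0, counter - 1)


class TournamentPredictor:
    def __init__(self, entries=1024, history_bits=10):
        self.entries = entries
        self.mask = (1 << history_bits) - 1
        self.history = 0                              # global T/NT history
        self.local = [1] * entries                    # indexed by PC
        self.global_table = [1] * (1 << history_bits) # indexed by history
        self.choice = [1] * (1 << history_bits)       # >= 2 means "use global"

    def _components(self, pc):
        local_pred = self.local[(pc >> 2) % self.entries] >= 2
        global_pred = self.global_table[self.history] >= 2
        return local_pred, global_pred

    def predict(self, pc):
        local_pred, global_pred = self._components(pc)
        use_global = self.choice[self.history] >= 2   # the mux select
        return global_pred if use_global else local_pred

    def update(self, pc, taken):
        local_pred, global_pred = self._components(pc)
        # Train the chooser only when the two predictors disagree.
        if local_pred != global_pred:
            self.choice[self.history] = bump(self.choice[self.history],
                                             global_pred == taken)
        i = (pc >> 2) % self.entries
        self.local[i] = bump(self.local[i], taken)
        self.global_table[self.history] = bump(self.global_table[self.history], taken)
        self.history = ((self.history << 1) | int(taken)) & self.mask


predictor = TournamentPredictor()
outcomes = [True, False] * 30  # alternating branch: hard for the local counter
late_misses = 0
for step, taken in enumerate(outcomes):
    if step >= 40 and predictor.predict(0x400) != taken:
        late_misses += 1
    predictor.update(0x400, taken)
print(late_misses)
```

In this run the local counter keeps oscillating, so the chooser learns to trust the global predictor for this branch.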
Predictor Select
Local Predictor Percentage
Performance Comparison
Tournament Branch Predictor
Branch Prediction (Articles)
Reducing Branch Stalls
Need Address
at Same Time as Prediction
• Branch Target Buffer (BTB): the address of the branch indexes the buffer
to get the prediction AND the branch target address (if taken)
Branch Target Buffer flow chart
Example
Branch Target Buffer (summary)
• Branch Target Buffer (BTB): the address of the branch indexes the buffer to get the prediction AND
the branch target address (if taken)
– Note: must check for a branch match now, since the wrong branch's target address cannot be used
• Example: BTB combined with BHT
[Diagram: during FETCH, the PC of the instruction is compared (=?) with the Branch PC field of the BTB; each entry also holds the Predicted PC and extra prediction state bits.
Yes: the instruction is a branch; use the predicted PC as the next PC.
No: the branch is not predicted; proceed normally (Next PC = PC+4).]
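The lookup described above can be sketched as follows. The sketch assumes a direct-mapped buffer, 4-byte instructions, and a 2-bit counter as the extra prediction state; the class, field names, and addresses are hypothetical.

```python
class BranchTargetBuffer:
    """Direct-mapped BTB: each entry holds branch PC (tag), predicted PC, counter."""

    def __init__(self, entries=256):
        self.entries = entries
        self.table = [None] * entries

    def _index(self, pc):
        return (pc >> 2) % self.entries

    def next_pc(self, pc):
        entry = self.table[self._index(pc)]
        # The tag check ensures we never use another branch's target.
        if entry is not None and entry["branch_pc"] == pc and entry["counter"] >= 2:
            return entry["predicted_pc"]
        return pc + 4  # not a predicted-taken branch: fall through

    def update(self, pc, taken, target):
        i = self._index(pc)
        entry = self.table[i]
        if entry is None or entry["branch_pc"] != pc:
            if taken:  # allocate only for taken branches
                self.table[i] = {"branch_pc": pc, "predicted_pc": target, "counter": 2}
            return
        if taken:
            entry["predicted_pc"] = target
            entry["counter"] = min(3, entry["counter"] + 1)
        else:
            entry["counter"] = max(0, entry["counter"] - 1)


btb = BranchTargetBuffer(entries=256)
before = btb.next_pc(0x1000)             # no entry yet: PC + 4
btb.update(0x1000, True, 0x2000)         # branch at 0x1000 was taken to 0x2000
after = btb.next_pc(0x1000)              # hit: predicted PC
aliased = btb.next_pc(0x1000 + 256 * 4)  # same index, different tag: PC + 4
print(hex(before), hex(after), hex(aliased))
```

The third lookup shows why the branch match matters: the aliased address shares the entry but fails the tag check, so fetch proceeds sequentially.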
Return Addresses Prediction
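Procedure returns are indirect jumps whose targets change with the call site, and the summary of this lecture names a return address stack as the way to predict them. A minimal sketch, assuming 4-byte instructions, a small fixed depth, and hypothetical names:

```python
class ReturnAddressStack:
    """Small stack of predicted return addresses, pushed on call, popped on return."""

    def __init__(self, depth=8):
        self.depth = depth
        self.stack = []

    def on_call(self, call_pc):
        if len(self.stack) == self.depth:
            self.stack.pop(0)          # overflow: discard the oldest entry
        self.stack.append(call_pc + 4)  # return to the instruction after the call

    def predict_return(self):
        # None means "no prediction available" (empty stack).
        return self.stack.pop() if self.stack else None


ras = ReturnAddressStack(depth=8)
ras.on_call(0x100)                 # outer call
ras.on_call(0x200)                 # nested call
inner_return = ras.predict_return()
outer_return = ras.predict_return()
empty = ras.predict_return()
print(hex(inner_return), hex(outer_return), empty)
```

Nested calls return in last-in, first-out order, which is why a stack matches them well as long as the call depth stays within the stack size.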
Branch Prediction With n-way Issue
1. Branches will arrive up to n times faster in an
n-issue processor
2. Relative impact of the control stalls will be larger
with the lower potential CPI in an n-issue
processor
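The second point can be checked with simple arithmetic. The sketch below assumes the usual additive model (CPI = ideal CPI + branch frequency × misprediction rate × penalty); the numbers are made up for illustration and are not taken from the slides.

```python
def cpi_with_branch_stalls(ideal_cpi, branch_freq, mispredict_rate, penalty_cycles):
    """Effective CPI when each mispredicted branch costs penalty_cycles."""
    return ideal_cpi + branch_freq * mispredict_rate * penalty_cycles


def stall_fraction(ideal_cpi, branch_freq, mispredict_rate, penalty_cycles):
    """Fraction of execution time spent on branch stalls."""
    total = cpi_with_branch_stalls(ideal_cpi, branch_freq, mispredict_rate, penalty_cycles)
    return (total - ideal_cpi) / total


# Hypothetical workload: 20% branches, 10% mispredicted, 10-cycle penalty.
workload = dict(branch_freq=0.20, mispredict_rate=0.10, penalty_cycles=10)

single_issue = stall_fraction(ideal_cpi=1.0, **workload)   # ideal CPI of 1
four_issue = stall_fraction(ideal_cpi=0.25, **workload)    # ideal CPI of 1/4
print(round(single_issue, 3), round(four_issue, 3))
```

The stall cycles per instruction are the same in both cases, but they make up a much larger share of the time when the ideal CPI is lower.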
Integrated Instruction Fetch
Units
Branches will arrive up to n times faster in
an n-issue processor
1. Integrated branch prediction: branch
predictor becomes part of the instruction
fetch unit
2. Instruction prefetch: fetch ahead to deliver
multiple instructions per cycle
3. Instruction memory access and buffering:
may access multiple cache lines in one
cycle, use prefetch to hide the cost
Instruction Fetch Unit
• Fetch predictor: predicts next fetch addresses to
avoid fetch delay; may pre-predict branch
direction; may be integrated with I-cache
• Branch predictor overrides and trains
fetch predictor
[Diagram blocks: Fetch Predictor, I-cache, Fetch, Branch Predictor, Decode/REN, Out-of-order Execution Engine, In-order commit]
Short Seminar
2. Pentium 4 Tomasulo
Dynamic Branch Prediction
Summary
• Prediction is becoming an important part of scalar
execution.
• Branch History Table: 2 bits for loop accuracy.
• Correlation: Recently executed branches correlated
with next branch.
– Either different branches.
– Or different executions of same branches.
• Tournament Predictor: give more resources to competing
solutions and pick between them.
• Branch Target Buffer: include branch address &
prediction.
• Return address stack for prediction of indirect jump.