07 Branch Prediction

Dynamic branch prediction techniques aim to reduce branch penalty by predicting the outcome of conditional branches before they are resolved. These include using a branch history table (BHT) to track past branch outcomes, correlating predictors that consider the outcomes of recent branches, and tournament predictors that track both local and global branch histories. More advanced predictors like the Alpha 21264's use local predictors for short histories and global predictors for longer histories. Additional techniques like branch target buffers provide the target address at the same time as the prediction to reduce branch stalls.

Dynamic Branch Prediction

Tomasulo Review
• Reservation stations: renaming to a larger set
of registers + buffering source operands
– Prevents registers from becoming a bottleneck
– Avoids WAR, WAW hazards of Scoreboard
– Allows loop unrolling in HW
• Not limited to basic blocks
(integer unit gets ahead, beyond branches)
• Lasting Contributions
– Dynamic scheduling
– Register renaming
– Load/store disambiguation
• 360/91 descendants are Pentium II; PowerPC
604; MIPS R10000; HP-PA 8000; Alpha 21264
Outline

• Dynamic Branch Prediction


– Branch prediction buffer or branch history table
– Correlating branch predictors
– Tournament predictors

• Branch target buffers


• Integrated Instruction fetch unit
• Return address predictors

Dynamic Branch Prediction

• Performance = ƒ(accuracy, cost of misprediction)


• Branch History Table (branch-prediction buffer) is simplest
– Lower bits of PC address index table of 1-bit values
– Says whether or not branch taken last time
– No address check
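• Problem: in a loop, a 1-bit BHT causes two mispredictions per loop execution: one at the loop exit and one on the first iteration of the next execution
(example: a branch taken 9 times and then not taken at exit is mispredicted 2 times in 10, so accuracy is 80%)
• Solution: 2-bit scheme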
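The 80% figure can be checked with a small simulation. This is a minimal sketch that is not part of the original slides: the function name `one_bit_accuracy`, its parameters, and the choice to start in the "not taken" state (as if the loop had already exited once) are assumptions made for illustration.

```python
def one_bit_accuracy(taken_per_run=9, runs=100):
    """Accuracy of a 1-bit predictor on a single loop branch.

    The branch is taken `taken_per_run` times, then not taken once
    (loop exit). The whole loop is executed `runs` times.
    """
    last_outcome = False  # assumed start state: the loop exited last time
    correct = 0
    total = 0
    for _ in range(runs):
        for taken in [True] * taken_per_run + [False]:
            if last_outcome == taken:   # predict "same as last time"
                correct += 1
            last_outcome = taken        # 1-bit state: remember latest outcome
            total += 1
    return correct / total


print(one_bit_accuracy())  # 0.8
```

Each loop execution costs two mispredictions: the first iteration (the stored bit still says "not taken") and the exit (the stored bit says "taken").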

Dynamic Branch Prediction
• Solution: 2-bit scheme where the prediction changes only after two mispredictions in a row:

• Dark: stop, not taken


• Light: go, taken
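One common way to realize a 2-bit scheme is a saturating counter. The sketch below assumes that variant (the exact transitions in the slide's state diagram may differ) and reuses the loop branch from the earlier example; the name `two_bit_accuracy` and the starting state are hypothetical.

```python
def two_bit_accuracy(taken_per_run=9, runs=100):
    """Accuracy of a 2-bit saturating counter on a single loop branch.

    Counter values 0 and 1 predict not taken; 2 and 3 predict taken.
    """
    counter = 3  # assumed start state: strongly taken
    correct = 0
    total = 0
    for _ in range(runs):
        for taken in [True] * taken_per_run + [False]:
            predict_taken = counter >= 2
            if predict_taken == taken:
                correct += 1
            # Move one step toward the actual outcome, saturating at 0 and 3.
            counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
            total += 1
    return correct / total


print(two_bit_accuracy())  # 0.9
```

Only the loop exit is mispredicted: a single not-taken outcome moves the counter from 3 to 2, which still predicts taken on the next loop entry.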

BHT Accuracy
• Mispredictions occur because either:
– The guess for that branch was wrong
– Indexing the table returned the history of a different branch
• With a 4096-entry table, misprediction rates vary by program from 1%
(nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%
• 4096 entries are about as good as an infinite table
(in Alpha 21164)
• Branch penalty and branch frequency are
also important

BHT Accuracy

4096 entry, two bit prediction


Unlimited Entries

Correlating Branches

• Hypothesis: recent branches are correlated; that is,
behavior of recently executed branches affects
prediction of current branch
• Idea: record the m most recently executed branches as
taken or not taken, and use that pattern to select the
proper branch history table
• In general, an (m,n) predictor records the last m
branches to select between 2^m history tables, each
with n-bit counters
– Old 2-bit BHT is then a (0,2) predictor

Examples

Code from eqntott (SPEC92)

b3 has correlation with b1, b2


Branch Prediction Result

1 bit predictor, (d is 0 or 2)
Correlating Prediction
Performance

One bit predictor with one bit correlation

Correlating Branches

(2,2) predictor
– The behavior of the two most recent branches selects between,
say, four predictions of the next branch, updating just that
prediction
• Simple implementation:
– Global history can be stored in a shift register
– The branch address is concatenated with the global branch
history, and the result indexes the table
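The shift-register implementation can be sketched as follows. This is an illustrative model under stated assumptions, not the slide author's code: the class name `CorrelatingPredictor`, the 4-byte instruction alignment, and the use of saturating counters are all assumptions.

```python
class CorrelatingPredictor:
    """(m, n) predictor: m bits of global history, n-bit saturating counters."""

    def __init__(self, m=2, n=2, entries=1024):
        self.m = m
        self.entries = entries
        self.max_count = (1 << n) - 1
        self.history = 0  # m-bit global history shift register
        # One counter per (branch entry, history pattern) pair.
        self.counters = [0] * (entries << m)

    def _index(self, pc):
        # Low-order PC bits (assuming 4-byte instructions) concatenated
        # with the global history bits.
        entry = (pc >> 2) % self.entries
        return (entry << self.m) | self.history

    def predict(self, pc):
        """Return True if the branch is predicted taken."""
        return self.counters[self._index(pc)] > self.max_count // 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(self.counters[i] + 1, self.max_count)
        else:
            self.counters[i] = max(self.counters[i] - 1, 0)
        # Shift the outcome into the global history register.
        mask = (1 << self.m) - 1
        self.history = ((self.history << 1) | int(taken)) & mask
```

With `m=0` the history register is always zero and the model reduces to a plain BHT, which matches the (0,2) description of the old 2-bit scheme.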

Number of Stored Bits

• For an (m,n) predictor:
– Bits = 2^m * n * number of prediction entries
• Example:
– 2-bit predictor with 4096 entries: 2^0 * 2 * 4k = 8k bits
– (2,2) predictor, how many entries for 8k bits: 2^2 * 2 * x = 8k, so x = 1k

• Comparison in the next slide
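The storage formula is easy to check numerically. The helper below is a small sketch; the function name `predictor_bits` is hypothetical.

```python
def predictor_bits(m, n, entries):
    """Total state bits of an (m, n) predictor: 2^m * n * entries."""
    return (2 ** m) * n * entries


print(predictor_bits(0, 2, 4096))  # 8192 bits (8k): 2-bit BHT, 4096 entries
print(predictor_bits(2, 2, 1024))  # 8192 bits (8k): (2,2) predictor, 1024 entries
```

Both configurations use the same 8k bits, which is what makes the accuracy comparison a fair one.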

Accuracy of Different Schemes
[Figure: frequency of mispredictions (0% to 18%) for nasa7, matrix300, tomcatv, doducd, spice, fpppp, gcc, espresso, eqntott, and li under three schemes: 4,096 entries with 2 bits per entry; unlimited entries with 2 bits per entry; 1,024 entries (2,2)]

Tournament Branch Predictor
• Used in Alpha 21264: tracks both “local” and global history
• Intended for mixed types of applications
• Global history: T/NT history of past k branches, e.g. 0 1 0 1 0 1 (NT T NT T NT T)

[Diagram: the PC and the global history feed a local predictor, a global predictor, and a choice predictor; a mux driven by the choice predictor selects the NT/T output]
Predictor Select

Local Predictor Percentage

Performance Comparison

Tournament Branch Predictor

• Local predictor: uses 10-bit local history, 3-bit counters
– PC -> local history table (1Kx10) -> 10 bits -> counters (1Kx3) -> 1 bit NT/T
• Global and choice predictors:
– 12-bit global history (e.g. 010101010101) -> 12 bits -> counters (4Kx2) -> 1 bit NT/T
– Same global history -> counters (4Kx2) -> 1 bit local/global
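The table sizes above can be put together in a small behavioral model. This is a sketch under assumptions, not a description of the actual Alpha 21264 hardware: the class name `TournamentPredictor`, the 4-byte instruction alignment, the zero-initialized tables, and the rule of training the chooser only when the two predictors disagree are all choices made for illustration.

```python
class TournamentPredictor:
    """Local + global predictors with a choice predictor selecting between them."""

    def __init__(self):
        self.local_history = [0] * 1024    # 1K x 10-bit local histories
        self.local_counters = [0] * 1024   # 1K x 3-bit counters
        self.global_counters = [0] * 4096  # 4K x 2-bit counters
        self.choice_counters = [0] * 4096  # 4K x 2-bit; >= 2 means "use global"
        self.global_history = 0            # 12-bit shift register

    def _lookup(self, pc):
        slot = (pc >> 2) % 1024            # assumes 4-byte instructions
        local_pred = self.local_counters[self.local_history[slot]] >= 4
        global_pred = self.global_counters[self.global_history] >= 2
        use_global = self.choice_counters[self.global_history] >= 2
        return slot, local_pred, global_pred, use_global

    def predict(self, pc):
        """Return True if the branch is predicted taken."""
        _, local_pred, global_pred, use_global = self._lookup(pc)
        return global_pred if use_global else local_pred

    def update(self, pc, taken):
        slot, local_pred, global_pred, _ = self._lookup(pc)
        lh = self.local_history[slot]
        gh = self.global_history
        # Assumed policy: train the chooser only when the predictors disagree.
        if local_pred != global_pred:
            step = 1 if global_pred == taken else -1
            self.choice_counters[gh] = min(3, max(0, self.choice_counters[gh] + step))
        step = 1 if taken else -1
        self.local_counters[lh] = min(7, max(0, self.local_counters[lh] + step))
        self.global_counters[gh] = min(3, max(0, self.global_counters[gh] + step))
        # Shift the outcome into both history registers.
        self.local_history[slot] = ((lh << 1) | int(taken)) & 0x3FF
        self.global_history = ((gh << 1) | int(taken)) & 0xFFF
```

The local path is a two-level lookup (PC selects a history, the history selects a counter), while the global and choice tables are indexed by the global history alone.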
Branch Prediction (Articles)
1. Branch prediction using both global and local branch
history information - JNL02 (LGshare)
2. Improving branch prediction by dynamic data flow-
based identification of correlated branches from a
large global history - CNF03
3. Comparison of branch prediction schemes for
superscalar processors ICEEC 2004 - CNF04
4. Better branch prediction through prophet-critic
hybrids - JNL05.pdf

Branch Prediction (Articles)

1. Worst-case execution time analysis for
dynamic branch predictors - CNF04
2. Low-cost branch folding for embedded
applications with small tight loops -
CNF99
3. Simulation Differences Between
Academia and Industry: A Branch
Prediction Case Study (Simplescalar)

Reducing Branch Stalls

• In MIPS, branch predicted as taken


– We need the target address 

• High Performance Instruction Delivery


– Branch target buffer

– integrated instruction fetch unit

– predicting return addresses

Need Address
at Same Time as Prediction
• Branch Target Buffer (BTB): the address of the branch indexes the buffer
to get the prediction AND the branch target address (if taken)
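A BTB lookup can be sketched as a small tagged table consulted at fetch time. The model below is a sketch under assumptions: a direct-mapped table, 4-byte instructions, entries allocated only for taken branches, and the hypothetical names `BranchTargetBuffer`, `next_fetch_pc`, and `update`.

```python
class BranchTargetBuffer:
    """Direct-mapped BTB: maps a branch PC to its predicted target PC."""

    def __init__(self, entries=256):
        self.entries = entries
        self.table = [None] * entries  # each slot: (branch_pc, predicted_target)

    def _slot(self, pc):
        return (pc >> 2) % self.entries  # assumes 4-byte instructions

    def next_fetch_pc(self, pc):
        """Return the predicted next PC for the instruction at `pc`."""
        entry = self.table[self._slot(pc)]
        # The stored branch PC acts as a tag: it must match exactly,
        # otherwise the target of a different branch would be used.
        if entry is not None and entry[0] == pc:
            return entry[1]
        return pc + 4  # no entry: proceed with sequential fetch

    def update(self, pc, taken, target):
        """Record the resolved outcome of the branch at `pc`."""
        slot = self._slot(pc)
        if taken:
            self.table[slot] = (pc, target)
        elif self.table[slot] is not None and self.table[slot][0] == pc:
            self.table[slot] = None  # no longer predicted taken: remove entry
```

Because the lookup needs only the fetch PC, the predicted target is available before the instruction is even decoded.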

Branch Target Buffer flow chart

Example

Prediction accuracy: 90% (for instructions in the buffer)
Hit rate in the buffer: 90% (for branches predicted taken)
Taken branch frequency: 60%

Branch penalty 1 (branch is predicted taken but not taken):
hit rate * incorrect predictions * 2
Branch penalty 2 (branch is taken but not in buffer):
miss rate * taken branches * 2
Penalty = 90% * 10% * 2 + 10% * 60% * 2 = 0.18 + 0.12 = 0.3 clock cycles
Compare that with 0.5 clock cycles for delayed branches
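The arithmetic above can be reproduced with a short function. This is a sketch; the name `btb_branch_penalty` and its parameter names are hypothetical, and the 2-cycle penalty is the value used in the example.

```python
def btb_branch_penalty(accuracy=0.90, hit_rate=0.90, taken_freq=0.60,
                       penalty_cycles=2):
    """Average branch penalty in clock cycles for the BTB example."""
    # Case 1: branch found in the buffer but mispredicted.
    wrong_in_buffer = hit_rate * (1 - accuracy) * penalty_cycles
    # Case 2: branch is taken but missing from the buffer.
    taken_not_in_buffer = (1 - hit_rate) * taken_freq * penalty_cycles
    return wrong_in_buffer + taken_not_in_buffer


print(round(btb_branch_penalty(), 2))  # 0.3
```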
Branch Folding
• Idea: to store one or more target instructions
– instead of, or in addition to, the predicted target address.
• Advantages:
– it allows the branch-target buffer access to take longer
than the time between successive instruction fetches
– allows us to perform an optimization called branch
folding
• Branch Folding:
– zero-cycle unconditional branches, and sometimes zero-
cycle conditional branches.

Branch Target Buffer (summary)
• Branch Target Buffer (BTB): the address of the branch indexes the buffer to get the prediction AND
the branch target address (if taken)
– Note: must now check for a branch match, since the target address of a different branch can't be used
• Example: BTB combined with BHT

[Diagram: the PC of the instruction in FETCH is compared (=?) against the stored Branch PC; each entry also holds a Predicted PC and extra prediction state bits]
– Yes: instruction is a branch; use the predicted PC as the next PC
– No: branch not predicted; proceed normally (Next PC = PC+4)
Return Addresses Prediction

• The target address of a register-indirect branch is hard to predict
– Branch prediction buffer techniques do not work well in
this situation:
– Many callers, one callee
– Jumps to multiple return addresses from a single
address (no PC-target correlation)
• In SPEC89, 85% of such branches are procedure
returns
• Use stack discipline for procedures: save the
return address in a small buffer that acts like a
stack; 8 to 16 entries give a small miss rate
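The stack discipline can be sketched in a few lines. This is a minimal model with assumptions: the class name `ReturnAddressStack`, the default depth of 8, and the policy of discarding the oldest entry on overflow are illustrative choices.

```python
class ReturnAddressStack:
    """Small fixed-depth stack that predicts procedure return targets."""

    def __init__(self, depth=8):
        self.depth = depth
        self.stack = []

    def on_call(self, return_pc):
        """Push the return address when a call instruction is fetched."""
        if len(self.stack) == self.depth:
            self.stack.pop(0)  # assumed overflow policy: drop the oldest entry
        self.stack.append(return_pc)

    def predict_return(self):
        """Pop the predicted target for a return; None if the stack is empty."""
        return self.stack.pop() if self.stack else None
```

A prediction is lost only when the call depth exceeds the buffer depth, which is why a small buffer can already give a low miss rate.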
Accuracy of Return Address
Predictor

Branch Prediction With n-way Issue
1. Branches will arrive up to n times faster in an n-
issue processor
2. Relative impact of the control stalls will be larger
with the lower potential CPI in an n-issue
processor

Integrated Instruction Fetch
Units
 Branches will arrive up to n times faster in
an n-issue processor
1. Integrated branch prediction: branch
predictor becomes part of the instruction
fetch unit
2. Instruction prefetch: fetch ahead to deliver
multiple instructions per cycle
3. Instruction memory access and buffering:
may access multiple cache lines in one
cycle, use prefetch to hide the cost

Instruction Fetch Unit
[Diagram: pipeline blocks Fetch Predictor, I-cache, Branch Predictor, Fetch, Decode/REN, Out-of-order Execution Engine, In-order commit]
• Fetch predictor: predicts next fetch addresses to
avoid fetch delay; may pre-predict branch
direction; may be integrated with I-cache
• Branch predictor: overrides and trains
fetch predictor
Short Seminar

1. Section 2.10 on Pentium 4, Branch prediction
2. Pentium 4 Tomasulo

Dynamic Branch Prediction
Summary
• Prediction is becoming an important part of scalar
execution.
• Branch History Table: 2 bits for loop accuracy.
• Correlation: recently executed branches are correlated
with the next branch.
– Either different branches.
– Or different executions of the same branch.
• Tournament Predictor: give more resources to competing
solutions and pick between them.
• Branch Target Buffer: includes branch address &
prediction.
• Return address stack for prediction of indirect jumps.
