Computer Architecture: Branching
Branches occur every 4-6 instructions (16-25%) in integer programs; somewhat less frequently in scientific ones
Unconditional branches: 20% (of branches)
Conditional branches: 80% (of branches)
66% forward (i.e., slightly over 50% of total branches). Evenly split between Taken and Not-Taken
33% backward. Almost all Taken
Easiest solution
Wait till outcome of the branch is known
The problem with predicting Not Taken is that we are optimizing for the less frequent case! Nonetheless, Not Taken will be the default for dynamic branch prediction since it is so easy to implement.
Basic idea
Use a Branch Prediction Buffer (BPB)
Also called Branch Prediction Table (BPT) or Branch History Table (BHT)
Records previous outcomes of the branch instruction
How it will be indexed, updated, etc.: see later
A prediction using the BPB is attempted when the branch instruction is fetched (IF stage or equivalent)
It is acted upon during the ID stage (when we know we have a branch)
Prediction Outcomes
Has a prediction been made (Y/N)
If not use default Not Taken
Case 1:
NT/NT: 0 penalty
T/T: need to compute address: 0 or 1 bubble
Case 2:
NT/T: delay + 1? bubbles
T/NT: delay + 1? bubbles
Case 3:
NT/NT: 0 penalty
Case 4:
NT/T: delay + 1 bubbles
Hybrid predictors
Choose dynamically the best among 2 predictors
Branch pred. CSE 471 Autumn 01 12
[Figure: two ways to index the BPB with the PC: simple indexing and cache-like]
Simplest design
BPB addressed by lower bits of the PC
One-bit prediction
Prediction = direction of the last time the branch was executed
Will mispredict at first and last iterations of a loop
Known implementation
Alpha 21064. The 1-bit table is associated with an I-cache line, one bit per line (4 instructions)
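The first/last-iteration behavior of the 1-bit scheme can be checked with a small simulation. This is a minimal sketch, not the Alpha 21064 design: the class name OneBitBPB, the 16-entry table, the 4-byte instruction size, and the initial "not taken" state are all assumptions made for illustration.

```python
class OneBitBPB:
    """1-bit branch prediction buffer indexed by low-order PC bits (no tags)."""

    def __init__(self, index_bits=4):
        self.mask = (1 << index_bits) - 1
        # Assumed initial state: every entry predicts not taken (False).
        self.table = [False] * (1 << index_bits)

    def _index(self, pc):
        # Assumes 4-byte instructions, so the two lowest PC bits are dropped.
        return (pc >> 2) & self.mask

    def predict(self, pc):
        return self.table[self._index(pc)]

    def update(self, pc, taken):
        # Remember only the direction of the most recent execution.
        self.table[self._index(pc)] = taken


def run_loop_branch(bpb, pc, iterations, executions):
    """Return, per loop execution, the 1-based iterations that mispredicted."""
    history = []
    for _ in range(executions):
        missed = []
        for i in range(1, iterations + 1):
            taken = i < iterations  # backward branch: taken except on loop exit
            if bpb.predict(pc) != taken:
                missed.append(i)
            bpb.update(pc, taken)
        history.append(missed)
    return history


mispredicts = run_loop_branch(OneBitBPB(), pc=0x400, iterations=5, executions=3)
print(mispredicts)  # [[1, 5], [1, 5], [1, 5]]
```

Each execution of the 5-iteration loop misses on iteration 1 (the bit still records the previous loop exit) and on iteration 5 (the exit itself).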
[Figure: state diagram of a 2-bit predictor with two "predict taken" states and two "predict not taken" states; transitions are labeled "taken" and "not taken"]
Performance of BPBs
Prediction accuracy is only one of several metrics
Other metrics:
Need to take into account branch frequencies
Need to take into account penalties for:
Misfetch (correct prediction, but time is needed to compute the address; e.g., for unconditional branches or T/T if no BTB)
Mispredict (incorrect branch prediction)
These penalties might need to be multiplied by the number of instructions that could have been issued
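One way to combine these metrics into a single number is sketched below. The function name, its parameters, and the sample values are all hypothetical and chosen only to show the arithmetic; they are not measurements from any machine.

```python
def lost_issue_slots_per_instruction(branch_freq,
                                     misfetch_rate, misfetch_penalty,
                                     mispredict_rate, mispredict_penalty,
                                     issue_width=1):
    """Average issue slots lost to branches, per instruction executed.

    branch_freq: fraction of instructions that are branches
    *_rate:      fraction of branches that suffer that event
    *_penalty:   cycles lost per event
    issue_width: instructions that could have been issued per lost cycle
    """
    cycles_per_branch = (misfetch_rate * misfetch_penalty
                         + mispredict_rate * mispredict_penalty)
    return branch_freq * cycles_per_branch * issue_width


# Hypothetical numbers: 20% branches, 10% misfetches costing 1 cycle,
# 10% mispredicts costing 4 cycles.
scalar = lost_issue_slots_per_instruction(0.20, 0.10, 1, 0.10, 4)
four_wide = lost_issue_slots_per_instruction(0.20, 0.10, 1, 0.10, 4, issue_width=4)
print(scalar, four_wide)
```

With these made-up inputs the same predictor costs four times as many issue slots on a 4-wide machine, which is why accuracy alone is not enough.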
Prediction accuracy
2-bit vs. 1-bit
Significant gain: approx. 92% vs. 85% for floating-point SPEC benchmarks and 90% vs. 80% for gcc, but about 88% for both in compress
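Why 2 bits help on loop branches can be seen on a toy trace. The sketch below assumes a plain saturating counter that starts at 0 (strongly not taken); the function name and the trace are made up for illustration and are unrelated to the benchmark figures above.

```python
def mispredictions(outcomes, bits):
    """Count mispredictions of one saturating counter of the given width."""
    top = (1 << bits) - 1        # 1 for a 1-bit scheme, 3 for a 2-bit scheme
    threshold = (top + 1) // 2   # predict taken when counter >= threshold
    counter = 0                  # assumed initial state: strongly not taken
    missed = 0
    for taken in outcomes:
        if (counter >= threshold) != taken:
            missed += 1
        # Saturating update: count up on taken, down on not taken.
        counter = min(top, counter + 1) if taken else max(0, counter - 1)
    return missed


# A 5-iteration loop branch (taken 4 times, then not taken), executed 4 times.
outcomes = ([True] * 4 + [False]) * 4
one_bit = mispredictions(outcomes, bits=1)
two_bit = mispredictions(outcomes, bits=2)
print(one_bit, two_bit)  # 8 6
```

After warm-up the 1-bit scheme misses twice per loop execution (exit and re-entry), while the 2-bit counter misses only on the exit.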
BTB layout
Tag: cache-like (partial) PC
Next PC (target address): target instruction address or I-cache line target address
Prediction: 2-bit counter
During IF, check if there is a hit in the BTB. If so, the instruction must be a branch, and if it is predicted taken we can get the target address during IF. If the prediction is correct, there is no stall
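The IF-stage lookup can be sketched as follows. This is a simplified direct-mapped model with full tags; the class name BTB, the 64-entry size, the 4-byte instruction size, and the policy of allocating entries only for taken branches are assumptions, not details of any particular processor.

```python
class BTB:
    """Direct-mapped branch target buffer: tag, target address, 2-bit counter."""

    def __init__(self, index_bits=6):
        self.index_bits = index_bits
        self.entries = [None] * (1 << index_bits)  # entry: [tag, target, counter]

    def _split(self, pc):
        word = pc >> 2  # assumes 4-byte instructions
        return word & ((1 << self.index_bits) - 1), word >> self.index_bits

    def next_pc(self, pc):
        """IF-stage lookup: return the predicted next fetch address."""
        index, tag = self._split(pc)
        entry = self.entries[index]
        if entry is not None and entry[0] == tag and entry[2] >= 2:
            return entry[1]   # hit and predicted taken: fetch from the target
        return pc + 4         # miss, or predicted not taken: fall through

    def update(self, pc, taken, target):
        """Called once the branch outcome and target are known."""
        index, tag = self._split(pc)
        entry = self.entries[index]
        if entry is not None and entry[0] == tag:
            if taken:
                entry[1] = target
            entry[2] = min(3, entry[2] + 1) if taken else max(0, entry[2] - 1)
        elif taken:
            # Assumed allocation policy: enter only taken branches, weakly taken.
            self.entries[index] = [tag, target, 2]


btb = BTB()
first = btb.next_pc(0x1000)            # miss: falls through to 0x1004
btb.update(0x1000, taken=True, target=0x2000)
second = btb.next_pc(0x1000)           # hit, predicted taken: 0x2000
print(hex(first), hex(second))
```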
Decoupled design
Separate BPB and BTB, of different sizes
Consult the BPB first. If it predicts taken, then go to the BTB (see next slide)
PowerPC 620: 2K-entry BPB + 256-entry BTB
HP PA-8000: 256*3 BPB + 32-entry (fully-associative) BTB
Decoupled BTB
[Figure: decoupled BPB and BTB; field labels: Tag, Hist]
Note: the BPB does not require a tag, so could be much larger
[Code example with three branches, labeled b1, b2, and b3]
General idea: implementation using a global history register and a global PHT
[Figure: global history register indexing a global Pattern History Table (PHT)]
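A global history register indexing a single global PHT can be sketched as below, here with a 5-bit history and 2-bit counters. The class name GAgPredictor, the helper accuracy, the initial counter value, and the test pattern are assumptions made for illustration.

```python
class GAgPredictor:
    """Global history register indexing one global PHT of 2-bit counters."""

    def __init__(self, history_bits=5):
        self.history_bits = history_bits
        self.history = 0                       # last k outcomes, newest in bit 0
        self.pht = [1] * (1 << history_bits)   # assumed start: weakly not taken

    def predict(self):
        return self.pht[self.history] >= 2

    def update(self, taken):
        # Train the counter selected by the current history...
        counter = self.pht[self.history]
        self.pht[self.history] = min(3, counter + 1) if taken else max(0, counter - 1)
        # ...then shift the outcome into the history register.
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask


def accuracy(predictor, outcomes):
    correct = 0
    for taken in outcomes:
        correct += predictor.predict() == taken
        predictor.update(taken)
    return correct / len(outcomes)


# Repeating taken, taken, not-taken pattern: a single 2-bit counter cannot
# learn it, but the history makes each position in the pattern distinguishable.
pattern = [True, True, False] * 200
acc = accuracy(GAgPredictor(history_bits=5), pattern)
print(acc)
```

After a short warm-up, each history value seen in this trace is always followed by the same outcome, so the selected counters stop mispredicting.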
[Figure: GAg(5,2) and GAp(5,2) organizations, each driven by a global history register (GA)]
[Figure: PAg(4,2) and PAp(4,2) organizations, each indexed by the PC]
[Figure: predictor index formed with an XOR involving the PC]
The green, red, and blue arrows might correspond to different indexing functions
[Figure: predictor tables indexed by the PC and by the global history]
Evaluation
The more hardware (real estate) the better!
GAs schemes: for a given number of PHTs (s), the longer the global history G the better; for a given G length, the larger the number of PHTs (s) the better.
Note that the result of a branch might not be known when the GA (or PA) needs to be used again (because we might issue several instructions per cycle). It must be speculatively updated (and corrected if need be). Ditto for PHT but less in a hurry?
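Speculative update with later correction can be sketched with one checkpoint per in-flight branch. This is one possible repair scheme, not a description of any specific processor; the class name SpeculativeHistory, the integer branch tags (assumed to increase in program order), and the dictionary of checkpoints are all assumptions.

```python
class SpeculativeHistory:
    """Global history updated at prediction time and repaired on a mispredict."""

    def __init__(self, bits=5):
        self.mask = (1 << bits) - 1
        self.value = 0
        self.checkpoints = {}  # branch tag -> history value before that branch

    def speculate(self, tag, predicted_taken):
        # Save the pre-branch history, then shift in the *predicted* outcome
        # so the next branch can be predicted before this one resolves.
        self.checkpoints[tag] = self.value
        self.value = ((self.value << 1) | int(predicted_taken)) & self.mask

    def resolve(self, tag, predicted_taken, actual_taken):
        before = self.checkpoints.pop(tag)
        if predicted_taken != actual_taken:
            # Mispredict: younger branches are squashed, so drop their
            # checkpoints and rebuild the history with the real outcome.
            for younger in [t for t in self.checkpoints if t > tag]:
                del self.checkpoints[younger]
            self.value = ((before << 1) | int(actual_taken)) & self.mask


h = SpeculativeHistory()
h.speculate(1, predicted_taken=True)
h.speculate(2, predicted_taken=True)   # predicted before branch 1 resolves
speculative_value = h.value            # 0b11
h.resolve(1, predicted_taken=True, actual_taken=False)
repaired_value = h.value               # 0b0: only the real outcome of branch 1
print(bin(speculative_value), bin(repaired_value))
```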
Program execution
Event selection
Prediction indexing: one level (BPB); two level (history + PHT); decoupled BTB + BPB
Prediction mechanism: static (ISA); 1- or 2-bit saturating counters
Feedback (recovery?): branch outcome; update prediction mechanism; update history (updates might be speculative)
Pentium Pro
512-entry, 4-way set-associative BTB
4 bits of branch history
GAg(4+x,2)?? Where do the extra bits come from in the PC?
Hence addition of a small return stack; 4 to 8 entries are enough (1 in MIPS R10000, 4 in Alpha 21064, 4 in Sparc64, 12 in Alpha 21164)
Checked during IF, in parallel with BTB.
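A small return stack behaves roughly as sketched below. The class name ReturnStack, the 4-byte instruction size, the absence of a delay slot, and the policy of discarding the oldest entry on overflow are assumptions for illustration.

```python
class ReturnStack:
    """Small stack of predicted return addresses."""

    def __init__(self, entries=4):
        self.entries = entries
        self.stack = []

    def on_call(self, call_pc):
        # Assumes 4-byte instructions and no delay slot: return to call_pc + 4.
        if len(self.stack) == self.entries:
            self.stack.pop(0)      # overflow: discard the oldest return address
        self.stack.append(call_pc + 4)

    def on_return(self):
        # Returns None when the stack is empty (no prediction available).
        return self.stack.pop() if self.stack else None


rs = ReturnStack(entries=2)
rs.on_call(0x100)
rs.on_call(0x200)
rs.on_call(0x300)                  # third nested call overflows the 2-entry stack
predictions = [rs.on_return(), rs.on_return(), rs.on_return()]
print(predictions)                 # [772, 516, None], i.e. 0x304, 0x204, no prediction
```

The innermost returns are predicted correctly; only calls nested deeper than the stack size lose their prediction.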
Resume buffer
In some old machines (e.g., IBM 360/91, circa 1967), branch prediction was implemented by fetching both paths (limited to 1 branch)
Similar idea: resume buffer in the MIPS R10000
If a branch is predicted taken, it takes one cycle to compute and fetch the target
During that cycle, save the Not-Taken sequential instructions in a buffer (4 entries of 4 instructions each)
On a mispredict, reload from the resume buffer, thus saving one cycle