Branch Prediction
Prof. Mikko H. Lipasti, University of Wisconsin-Madison
Lecture Overview
Program control flow
  Implicit sequential control flow
  Disruptions of sequential control flow
Branch Prediction
  Branch instruction processing
  Branch instruction speculation
  UCB Study [Lee and Smith, 1984]
  IBM Study [Nair, 1992]
Branch Prediction
Target address generation -> Target speculation
Access register: PC, general-purpose register, link register
Perform calculation:
Comparison of data register(s)
Condition Resolution
[Figure: pipeline with stages Fetch, Decode, Dispatch, Reservation Stations, Branch Execute, Finish, Complete, and Retire, showing the CC reg. used for condition resolution]
[Figure: the same pipeline with an FA-mux at the fetch stage, where PC(seq.) = FA (fetch address)]
BTB entry fields: branch inst. address | information for predict. | branch target address (most recent)
Branch Target Buffer: small cache in fetch stage
  Previously executed branches, address, taken history, target(s)
Fetch stage compares current FA against BTB
  If match, use prediction
  If predict taken, use BTB target
When branch executes, BTB is updated
Optimization:
  Size of BTB: increases hit rate
  Prediction algorithm: increases accuracy of prediction
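The lookup and update flow above can be sketched as a small direct-mapped table. This is only an illustration under stated assumptions: the class name `BranchTargetBuffer`, the 16-entry size, the 4-byte instruction alignment, and the use of a single "last outcome" bit as the taken history are all hypothetical choices, not details from the lecture.

```python
class BranchTargetBuffer:
    """Direct-mapped BTB sketch: each entry holds (tag, taken, target)."""

    def __init__(self, num_entries=16):
        self.num_entries = num_entries
        self.entries = [None] * num_entries

    def _index(self, fetch_address):
        # Assumption: 4-byte aligned instructions, so drop the low two bits.
        return (fetch_address >> 2) % self.num_entries

    def predict(self, fetch_address):
        """Return the predicted target, or None on a miss or predict-not-taken."""
        entry = self.entries[self._index(fetch_address)]
        if entry is None or entry[0] != fetch_address:
            return None  # BTB miss: fetch continues sequentially
        _tag, taken, target = entry
        return target if taken else None

    def update(self, branch_address, taken, target):
        """Record the outcome once the branch has executed."""
        index = self._index(branch_address)
        self.entries[index] = (branch_address, taken, target)


btb = BranchTargetBuffer()
first_lookup = btb.predict(0x1000)      # miss: branch not seen yet
btb.update(0x1000, True, 0x2000)        # branch executes and is taken
second_lookup = btb.predict(0x1000)     # hit: BTB target is used
alias_lookup = btb.predict(0x1040)      # same index, different tag: miss
```

A larger table lowers the chance that two branches map to the same entry, which is the hit-rate optimization listed above; a richer history field than a single bit is the accuracy optimization.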
Software Prediction
Extra bit in each branch instruction
  Set to 0 for not taken
  Set to 1 for taken
Bit set by compiler or user; can use profiling
Static prediction, same behavior every time
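As a rough sketch of how profiling could set the prediction bit: pick the majority outcome seen in a profiling run, then apply that fixed bit to every later execution. The function names `choose_hint_bit` and `static_accuracy` and the example trace are hypothetical, introduced only for illustration.

```python
def choose_hint_bit(profile_outcomes):
    """Return 1 (predict taken) if at least half of the profiled outcomes were taken, else 0."""
    taken = sum(1 for outcome in profile_outcomes if outcome)
    return 1 if 2 * taken >= len(profile_outcomes) else 0


def static_accuracy(hint_bit, outcomes):
    """Fraction of outcomes that match the fixed hint bit."""
    correct = sum(1 for outcome in outcomes if int(outcome) == hint_bit)
    return correct / len(outcomes)


# Hypothetical loop-closing branch: taken nine times, then falls through once.
loop_branch = [True] * 9 + [False]
hint = choose_hint_bit(loop_branch)
accuracy = static_accuracy(hint, loop_branch)
print(hint, accuracy)
```

Because the bit never changes at run time, the final loop exit is mispredicted on every pass through the loop.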
Branch types
Unconditional: always taken or always not taken
Subroutine call: always taken
Loop control: usually taken
Decision: either way, if-then-else
Computed goto: always taken, with changing target
Supervisor call: always taken
Execute: always taken (IBM 370)
IBM1: compiler
IBM2: COBOL (business app)
IBM3: scientific
IBM4: supervisor (OS)
                IBM1   IBM2   IBM3   IBM4   DEC    CDC    Avg
T (taken)       0.640  0.657  0.704  0.540  0.738  0.778  0.676
NT (not taken)  0.360  0.343  0.296  0.460  0.262  0.222  0.324
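One way to read the taken fractions: a predictor that always guesses "taken" is correct exactly as often as branches are taken. The short check below recomputes the average from the per-workload values; the variable and function names are illustrative only.

```python
taken_fraction = {
    "IBM1": 0.640, "IBM2": 0.657, "IBM3": 0.704,
    "IBM4": 0.540, "DEC": 0.738, "CDC": 0.778,
}


def always_taken_accuracy(fraction_taken):
    # Predicting "taken" every time is right exactly when the branch is taken.
    return fraction_taken


average = sum(always_taken_accuracy(f) for f in taken_fraction.values()) / len(taken_fraction)
print(f"average always-taken accuracy: {average:.3f}")
```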
Prediction effectiveness (%) based on opcode only, or history
IBM1  66  64  92  93  94  95  95
IBM2  69  64  95  97  97  97  97
IBM3  71  70  87  91  91  92  92
IBM4  55  54  80  83  84  84  84
DEC   80  74  97  98  98  98  98
CDC   78  78  82  91  94  95  96
History of past several branches encoded by FSM
Current state used to generate prediction
Results:
Workload  Accuracy (%)
IBM1      93
IBM2      97
IBM3      91
IBM4      83
DEC       98
CDC       91
[Figure: state diagram of the history FSM, with transitions labeled T (taken) and N (not taken)]
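A common instance of an FSM that encodes recent branch history is the 2-bit saturating counter. The sketch below assumes that particular FSM; the lecture's exact state machine is not recoverable from the text, so the class name `TwoBitCounter`, the initial state, and the example trace are assumptions.

```python
class TwoBitCounter:
    """2-bit saturating counter: states 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, state=2):
        self.state = state

    def predict(self):
        # The current state alone generates the prediction.
        return self.state >= 2

    def update(self, taken):
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def measure_accuracy(outcomes, predictor):
    """Predict each outcome, then train the predictor on it."""
    correct = 0
    for taken in outcomes:
        if predictor.predict() == taken:
            correct += 1
        predictor.update(taken)
    return correct / len(outcomes)


# Hypothetical loop branch: taken nine times, not taken once, repeated ten times.
trace = ([True] * 9 + [False]) * 10
fsm_accuracy = measure_accuracy(trace, TwoBitCounter())
print(fsm_accuracy)
```

With two bits of state, a single loop exit does not flip the prediction, so the branch is mispredicted only once per pass through the loop.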
Combining prediction accuracy with BTB hit rate (86.5% for 128 sets of 4 entries each), branch prediction can provide a net prediction accuracy of approximately 80%. This implies a 5-20% performance enhancement.
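The roughly 80% figure can be approximated with a back-of-the-envelope product. The sketch assumes that a BTB miss yields no correct prediction and that the per-workload accuracies listed in the results above are the ones being combined; both are simplifying assumptions rather than statements from the study.

```python
btb_hit_rate = 0.865
accuracy_on_hit = {
    "IBM1": 0.93, "IBM2": 0.97, "IBM3": 0.91,
    "IBM4": 0.83, "DEC": 0.98, "CDC": 0.91,
}


def net_accuracy(hit_rate, accuracy):
    # Assumption: only branches that hit in the BTB can be predicted correctly.
    return hit_rate * accuracy


mean_accuracy = sum(accuracy_on_hit.values()) / len(accuracy_on_hit)
net = net_accuracy(btb_hit_rate, mean_accuracy)
print(f"mean accuracy {mean_accuracy:.3f}, net accuracy {net:.3f}")
```

Under these assumptions the product comes out just under 0.80, in line with the approximate net accuracy quoted above.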
[Table: per-benchmark results for tomcatv, gcc, espresso, li, and eqntott]
Branch history table size: direct-mapped array of 2^k entries
Some programs, like gcc, have over 7000 conditional branches
In collisions, multiple branches share the same predictor
  Constructive and destructive interference
  Destructive interference
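Destructive interference can be illustrated with a toy direct-mapped table of 2^k counters. The sketch assumes 2-bit saturating counters, 4-byte aligned branch addresses, and k = 4; the class name `BranchHistoryTable` and the two-branch trace are hypothetical.

```python
class BranchHistoryTable:
    """Direct-mapped array of 2**k two-bit counters, indexed by branch address bits."""

    def __init__(self, k=4):
        self.size = 1 << k
        self.counters = [2] * self.size  # start in the weakly-taken state

    def index(self, branch_address):
        # Assumption: 4-byte aligned instructions; use the next k address bits.
        return (branch_address >> 2) & (self.size - 1)

    def predict(self, branch_address):
        return self.counters[self.index(branch_address)] >= 2

    def update(self, branch_address, taken):
        i = self.index(branch_address)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def run(bht, trace):
    """Return the fraction of correct predictions over (address, taken) pairs."""
    correct = 0
    for address, taken in trace:
        if bht.predict(address) == taken:
            correct += 1
        bht.update(address, taken)
    return correct / len(trace)


# Branch A is always taken; branch B is never taken.
separate_trace = [(0x1000, True), (0x1004, False)] * 20   # different entries
colliding_trace = [(0x1000, True), (0x1040, False)] * 20  # same entry when k = 4
separate = run(BranchHistoryTable(), separate_trace)
colliding = run(BranchHistoryTable(), colliding_trace)
print(separate, colliding)
```

When the two branches share a counter, each one undoes the other's training, so the always-not-taken branch is mispredicted every time. If both branches had the same bias, sharing would be harmless or even helpful, which is the constructive case.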
[Figure: fetch stage with FA-mux, FA (fetch address), I-cache, FAR, and BHT update path, feeding Decode, Dispatch Buffer, and Dispatch; reservation stations for the BRN, SFX, SFX, CFX, LS, and FPU units; Completion Buffer]