L02 Branch Prediction V2021



Course on: “Advanced Computer Architectures”

Branch Prediction Techniques

Prof. Cristina Silvano

Politecnico di Milano
Outline

▪ The Problem of Control Hazards in the Processor Pipeline

▪ Branch Prediction Techniques

• Static Branch Prediction

• Dynamic Branch Prediction

Prof. Cristina Silvano – Politecnico di Milano -2-

The Problem of Control Hazards


Conditional Branch Instructions

▪ A branch is taken if the condition is satisfied: the branch target address is stored in the Program Counter (PC) instead of the address of the next instruction in the sequential instruction stream (PC + 4).

▪ Examples of conditional branches for the MIPS processor: beq (branch on equal) and bne (branch on not equal)
• beq $s1, $s2, L1 # go to L1 if ($s1 == $s2)
• bne $s1, $s2, L1 # go to L1 if ($s1 != $s2)

Execution of conditional branches for 5-stage MIPS
pipeline

Stages traversed by beq $x, $y, offset (e.g. beq $x, $y, L1):

IF: Instruction fetch and PC increment
ID: Register read ($x and $y)
EX: ALU operation ($x - $y) and computation of (PC+4+offset)
ME: Write of PC

• Instruction fetch and PC increment.
• Registers ($x and $y) are read from the Register File.
• ALU operation to compare the registers ($x and $y) and derive the Branch Outcome (branch taken or branch not taken).
• Computation of the Branch Target Address (PC+4+offset): the offset, taken from the 16 least significant bits of the instruction, is sign-extended and added to (PC+4).
• The result of the register comparison from the ALU (Branch Outcome) is used to decide the value to be stored in the PC: (PC+4) or (PC+4+offset).
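The target computation described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the left shift by 2 assumes the offset counts instructions (words), as suggested by the 2-bit left shifter in the datapath figure.

```python
def sign_extend_16(imm16: int) -> int:
    """Interpret the low 16 bits of imm16 as a two's-complement value."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def branch_target_address(pc: int, imm16: int) -> int:
    """BTA = (PC + 4) + (sign-extended offset << 2), wrapped to 32 bits."""
    offset_bytes = sign_extend_16(imm16) << 2  # word offset -> byte offset (assumption)
    return (pc + 4 + offset_bytes) & 0xFFFFFFFF


def next_pc_for_beq(pc: int, imm16: int, x: int, y: int) -> int:
    """Value written to the PC by beq: the BTA if x == y, otherwise PC + 4."""
    taken = (x - y) == 0  # the ALU subtracts and checks for a zero result
    if taken:
        return branch_target_address(pc, imm16)
    return (pc + 4) & 0xFFFFFFFF
```

For example, a beq at address 0x1000 with offset 3 targets 0x1010 when the two registers are equal and falls through to 0x1004 otherwise.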

Execution of conditional branches for 5-stage MIPS
pipeline

Stage                     beq $x, $y, offset (e.g. beq $x, $y, L1)
IF  Instruction Fetch     Instruction fetch and PC increment
ID  Instruction Decode    Register read ($x and $y)
EX  Execution             ALU operation ($x - $y) and (PC+4+offset)
ME  Memory Access         Write of PC
WB  Write Back            -

▪ The Branch Outcome and the Branch Target Address are ready at the end of the EX stage (3rd stage).
▪ Conditional branches are resolved when the PC is updated at the end of the ME stage (4th stage).

Execution of conditional branches for MIPS

▪ Processor resources to execute conditional branches:

[Figure: datapath for conditional branches. Instruction bits [25-21] and [20-16] select the two registers read from the Register File; the ALU compares their contents and its Zero output goes to the control logic of the conditional branch. Instruction bits [15-0] are sign-extended from 16 to 32 bits, shifted left by 2 bits and added to PC+4 (from the fetch unit) to produce the Branch Target Address.]

Implementation of the 5-stage MIPS Pipeline
[Figure: datapath of the 5-stage MIPS pipeline (IF: Instruction Fetch, ID: Instruction Decode, EX: Execution, MEM: Memory Access, WB: Write Back) with the IF/ID, ID/EX, EX/MEM and MEM/WB pipeline registers. The figure highlights the adder that produces the Branch Target Address, the ALU Zero output that gives the Branch Outcome, and the PC Write multiplexer that selects between PC+4 and the Branch Target Address.]

The Problem of Control Hazards

▪ Control hazards: attempt to make a decision on the next instruction to fetch before the branch condition is evaluated.
▪ Control hazards arise from the pipelining of conditional branches and of other instructions that change the PC.
▪ Control hazards reduce the performance with respect to the ideal speedup gained by pipelining, since they can make it necessary to stall the pipeline.

Branch Hazards

▪ To feed the pipeline we need to fetch a new instruction at each clock cycle, but the branch decision (to change or not to change the PC) is made during the MEM stage.
▪ This delay in determining the correct instruction to fetch is called Control Hazard or Conditional Branch Hazard.
▪ If a branch changes the PC to its target address, it is a taken branch.
▪ If a branch falls through, it is not taken or untaken.

Branch Hazards: Example

beq $1, $3, L1            IF    ID    EX    ME    WB
and $12, $2, $5                 IF    ID    EX    ME    WB
or $13, $6, $2                        IF    ID    EX    ME    WB
add $14, $2, $2                             IF    ID    EX    ME    WB
L1: lw $4, 50($7)                                 IF    ID    EX    ME    WB

▪ The branch instruction may or may not change the PC in the MEM stage, but the next 3 instructions are fetched and their execution is started.
▪ If the branch is not taken, the pipeline execution is fine.
▪ If the branch is taken, it is necessary to flush the next 3 instructions in the pipeline before they write their results, and then to fetch the lw instruction at the branch target address (L1).

Branch Hazards: Solutions

▪ If the branch is not taken, paying a three-cycle penalty is not justified → throughput reduction.
▪ Solution: we can assume the branch is not taken, and flush the next 3 instructions in the pipeline only if the branch turns out to be taken. (We cannot assume the branch is taken because we do not yet know the branch target address.)
▪ This solution introduces the idea of branch prediction.
▪ But first, let us be conservative…

Branch Hazards: Conservative assumption

▪ Conservative assumption: stall the pipeline until the branch decision is made (stalling until resolution), then fetch the correct instruction flow.
• Without forwarding: we need to stall for 3 clock cycles

• With forwarding: we need to stall for 2 clock cycles

Branch Stalls without Forwarding

beq $1, $3, L1            IF    ID    EX    ME    WB
and $12, $2, $5                 stall stall stall IF    ID    EX    ME    WB
or $13, $6, $2                                          IF    ID    EX    ME    WB
add $14, $2, $2                                               IF    ID    EX    ME    WB

▪ Conservative assumption: stalling until resolution at the end of the ME stage.
▪ Each branch costs three stalls to fetch the correct instruction flow: (PC+4) or Branch Target Address.

Branch Stalls with Forwarding

beq $1, $3, L1            IF    ID    EX    ME    WB
and $12, $2, $5                 stall stall IF    ID    EX    ME    WB
or $13, $6, $2                                    IF    ID    EX    ME    WB
add $14, $2, $2                                         IF    ID    EX    ME    WB

▪ Conservative assumption: stalling until resolution at the end of the EX stage (when the BO and BTA are known).
▪ Each branch costs two stalls to fetch the correct instruction flow: (PC+4) or Branch Target Address.

Early Evaluation of the PC

▪ To improve performance in case of branch hazards, we need to add hardware resources to:
1. Compare registers to derive the Branch Outcome

2. Compute the Branch Target Address

3. Update the PC register

as soon as possible in the pipeline.

▪ The MIPS processor moves the comparison of registers, the computation of the BTA and the update of the PC earlier in the pipeline, into the ID stage.

MIPS Processor: Early Evaluation of the PC
[Figure: MIPS pipeline datapath with early evaluation of the PC. The register comparator (=) that produces the Branch Outcome and the adder that produces the Branch Target Address (PC+4 plus the sign-extended 16-bit offset shifted left by 2 bits) are placed in the ID stage, so that PC Write can select the next PC at the end of ID.]

MIPS Processor: Early Evaluation of the PC

beq $1, $3, L1            IF    ID    EX    ME    WB
and $12, $2, $5                 stall IF    ID    EX    ME    WB
or $13, $6, $2                              IF    ID    EX    ME    WB
add $14, $2, $2                                   IF    ID    EX    ME    WB

▪ Conservative assumption: stalling until resolution at the end of the ID stage (when the BO and BTA are known).
▪ Each branch costs one stall to fetch the correct instruction flow: (PC+4) or Branch Target Address.

MIPS Processor: Early Evaluation of the PC

▪ Consequence of early evaluation of the branch decision in the ID stage:
• When an ALU instruction (such as addi) is followed by a branch testing its result → we need to introduce one stall before the ID stage of the branch to enable the forwarding (EX-ID) of the result from the EX stage of the previous instruction.
• As usual, we need one stall after the branch for branch resolution.

addi $1, $1, 4            IF    ID    EX    ME    WB
beq $1, $6, L1                  IF    stall ID    EX    ME    WB
and $12, $2, $5                             stall IF    ID    EX    ME    WB

MIPS Processor: Early Evaluation of the PC

▪ Consequence of early evaluation of the branch decision in the ID stage:
• When a load instruction is followed by a branch testing its result → we need to introduce two stalls before the ID stage of the branch to enable the forwarding (ME-ID) of the result from the ME stage of the previous instruction.
• As usual, we need one stall after the branch for branch resolution.

lw $1, 32($2)             IF    ID    EX    ME    WB
beq $1, $6, L1                  IF    stall stall ID    EX    ME    WB
and $12, $2, $5                                   stall IF    ID    EX    ME    WB

MIPS Processor: Early Evaluation of the PC

▪ With the branch decision made during the ID stage, the cost associated with each branch (branch penalty) is reduced:
• We need only a one-clock-cycle stall after each branch

• Or a flush of only the one instruction following the branch

▪ A one-cycle delay for every branch still yields a performance loss of 10% to 30%, depending on the branch frequency.

▪ Pipeline Stall Cycles per Instruction due to Branches = Branch Frequency × Branch Penalty

▪ We will examine some branch prediction techniques to deal with this performance loss.
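As a quick numerical check of the formula above, the following Python sketch computes the stall cycles per instruction and the resulting slowdown. The function names and the ideal CPI of 1 are assumptions made for illustration.

```python
def branch_stall_cycles_per_instruction(branch_frequency: float,
                                        branch_penalty: float) -> float:
    """Pipeline stall cycles per instruction due to branches."""
    return branch_frequency * branch_penalty


def slowdown(branch_frequency: float, branch_penalty: float,
             ideal_cpi: float = 1.0) -> float:
    """Ratio between the CPI with branch stalls and the ideal CPI (assumed 1)."""
    stalls = branch_stall_cycles_per_instruction(branch_frequency, branch_penalty)
    return (ideal_cpi + stalls) / ideal_cpi
```

With a one-cycle penalty, branch frequencies of 10% and 30% give slowdowns of 1.1 and 1.3, which matches the 10% to 30% loss quoted above; with the three-cycle penalty of resolution in the ME stage, a 30% branch frequency gives 1.9.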

Branch Prediction Techniques


Branch Prediction Techniques
▪ Main goal: try to predict the outcome of a branch instruction as early as possible.

▪ The performance of a branch prediction technique depends on:

• Accuracy, measured in terms of the percentage of incorrect predictions given by the predictor.
• Cost of an incorrect prediction, measured in terms of the time lost executing useless instructions (misprediction penalty); it is given by the processor architecture and increases for deeply pipelined processors.
• Branch frequency, given by the application: the importance of accurate branch prediction is higher in programs with higher branch frequency.

Branch Prediction Techniques

▪ There are two types of methods to deal with the performance loss due to branch hazards:
• Static Branch Prediction Techniques: the prediction (taken/untaken) for each branch is fixed at compile time and does not change during the entire execution.
• Dynamic Branch Prediction Techniques: the prediction (taken/untaken) for a branch can change at runtime during the program execution.
▪ In both cases, we must not change the processor state and registers until the Branch Outcome is definitely known.

Static Branch Prediction Techniques


Static Branch Prediction Techniques

▪ Static Branch Prediction is used when the expectation is that the branch behavior of the target application is highly predictable at compile time.
▪ Static Branch Prediction can also be used to assist dynamic predictors.

Static Branch Prediction Techniques

1) Branch Always Not Taken (Predicted-Not-Taken)

2) Branch Always Taken (Predicted-Taken)

3) Backward Taken Forward Not Taken (BTFNT)

4) Profile-Driven Prediction

5) Delayed Branch


1) Branch Always Not Taken

▪ We assume the branch will not be taken, thus the sequential instruction flow we have fetched can continue as if the branch condition was not satisfied.
▪ If the BO at the end of the ID stage turns out to be not taken (the prediction is correct), we preserve performance.

Prediction: untaken. BO: untaken → prediction correct.

Untaken branch            IF    ID    EX    ME    WB
Instruction i+1                 IF    ID    EX    ME    WB
Instruction i+2                       IF    ID    EX    ME    WB
Instruction i+3                             IF    ID    EX    ME    WB
Instruction i+4                                   IF    ID    EX    ME    WB

1) Branch Always Not Taken

▪ If the BO at the end of the ID stage turns out to be taken (the prediction is incorrect):
• We need to flush the next instruction already fetched (the next instruction is turned into a nop) and we restart the execution by fetching the instruction at the Branch Target Address → one-cycle performance penalty.

Prediction: untaken. BO: taken → misprediction.

Taken branch              IF    ID    EX    ME    WB
Instruction i+1                 IF    nop   nop   nop   nop
Branch target                         IF    ID    EX    ME    WB
Branch target+1                             IF    ID    EX    ME    WB
Branch target+2                                   IF    ID    EX    ME    WB

2) Branch Always Taken

▪ An alternative scheme is to consider every branch as taken: as soon as the branch is decoded and the Branch Target Address is computed, we assume the branch to be taken and we begin fetching and executing at the target address.
▪ The predicted-taken scheme makes sense for pipelines where the branch target address is known before the branch outcome.
▪ In the MIPS pipeline, we do not know the branch target address earlier than the branch outcome, so there is no advantage in applying this technique.
• We would need either to move the computation of the BTA forward to the IF stage (before the ID stage) or to use a Branch Target Buffer, a cache that stores the predicted value of the BTA for the next instruction after each branch.

3) Backward Taken Forward Not Taken (BTFNT)

▪ The prediction is based on the branch direction:

• Backward-going branches are predicted as taken
  • Example: the branches at the end of loops go back to the beginning of the next loop iteration → we assume the backward-going branches are always taken.
• Forward-going branches are predicted as not taken
  • Example: the branches going forward to an ELSE label of an IF-ELSE clause → if we assume the conditions associated with the ELSE to be less probable, it is better to consider the forward-going branches always not taken.
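The BTFNT rule only needs to compare the branch address with its target address. A minimal Python sketch follows; the function names and the trace format (branch address, target address, actual outcome) are assumptions made for illustration.

```python
def predict_btfnt(branch_pc: int, target_pc: int) -> bool:
    """Return True (predict taken) for backward branches, False for forward ones."""
    return target_pc <= branch_pc


def btfnt_accuracy(trace) -> float:
    """Fraction of correct BTFNT predictions over (branch_pc, target_pc, taken) tuples."""
    correct = sum(1 for branch_pc, target_pc, taken in trace
                  if predict_btfnt(branch_pc, target_pc) == taken)
    return correct / len(trace)


# A loop-closing branch at 0x120 jumping back to 0x100:
# taken nine times, then not taken once at loop exit.
loop_trace = [(0x120, 0x100, True)] * 9 + [(0x120, 0x100, False)]
```

On this trace the only misprediction is the loop exit, so the accuracy is 9 out of 10.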

4) Profile-Driven Prediction

▪ Let us assume we can profile the behavior of a target application program by executing several runs with different data sets.
▪ The branch prediction is based on profiling information collected from earlier runs.
▪ The profile-driven prediction method can use compiler hints associated with each branch.

5) Delayed Branch Technique

▪ Scheduling technique: the compiler statically schedules an independent instruction in the branch delay slot.
▪ The instruction in the branch delay slot is executed whether or not the branch is taken.
▪ If we assume a branch delay of one cycle (as for MIPS) → we have only one delay slot to fill.
▪ Some deeply pipelined processors can have a branch delay longer than one cycle.

5) Delayed Branch Technique

▪ The MIPS compiler always schedules a branch-independent instruction after the branch.
▪ Example: a previous add instruction with no effect on the branch is scheduled in the Branch Delay Slot.

beq $1, $2, L1            IF    ID    EX    ME    WB
add $4, $5, $6                  IF    ID    EX    ME    WB    <- BRANCH DELAY SLOT
lw $3, 300($0)                        IF    ID    EX    ME    WB
lw $7, 400($0)                              IF    ID    EX    ME    WB
lw $8, 500($0)                                    IF    ID    EX    ME    WB

5) Delayed Branch Technique

▪ The behavior of the branch delay instruction is the same whether or not the branch is taken (and it is not flushed!)
• If the branch is untaken → execution continues with the instruction after the branch (and the instruction in the branch delay slot is NOT flushed!)

BO: untaken

Untaken branch            IF    ID    EX    ME    WB
Branch delay instr.             IF    ID    EX    ME    WB    <- BRANCH DELAY SLOT
Instr. i+1                            IF    ID    EX    ME    WB
Instr. i+2                                  IF    ID    EX    ME    WB
Instr. i+3                                        IF    ID    EX    ME    WB

5) Delayed Branch Technique

• If the branch is taken → execution continues at the branch target (and the instruction in the branch delay slot is NOT flushed!)

BO: taken

Taken branch              IF    ID    EX    ME    WB
Branch delay instr.             IF    ID    EX    ME    WB    <- BRANCH DELAY SLOT
Branch target instr.                  IF    ID    EX    ME    WB
Branch target instr. + 1                    IF    ID    EX    ME    WB
Branch target instr. + 2                          IF    ID    EX    ME    WB

5) Delayed Branch Technique

▪ The job of the compiler is to make the instruction placed in the branch delay slot valid and useful.
▪ There are four ways in which the branch delay slot can be scheduled:
1. From before

2. From target

3. From fall-through

4. From after


5) Delayed Branch Technique: From Before

▪ The branch delay slot is scheduled with an independent instruction from before the branch.
▪ The instruction in the branch delay slot is always executed (whether the branch is taken or untaken).
▪ Then execution continues in the right direction based on the Branch Outcome, and the add instruction in the delay slot is never flushed.

Before scheduling:          After scheduling:
add $1, $2, $3              if $2 == 0 then
if $2 == 0 then                 add $1, $2, $3    (delay slot)
    [branch delay slot]

5) Delayed Branch Technique: From Target
▪ The use of $1 in the branch condition prevents the add instruction (whose destination is $1) from being moved after the branch.
▪ The branch delay slot is scheduled from the target of the branch (usually the target instruction sub needs to be copied because it can be reached by another path).
▪ This strategy is preferred when the branch is taken with high probability, such as loop branches (backward branches).
▪ If the branch is untaken (misprediction), the sub instruction in the delay slot needs to be flushed!

Before scheduling:          After scheduling:
sub $4, $5, $6              sub $4, $5, $6
add $1, $2, $3              add $1, $2, $3
if $1 == 0 then             if $1 == 0 then
    [branch delay slot]         sub $4, $5, $6    (delay slot)

5) Delayed Branch Technique: From Fall-Through

▪ The use of $1 in the branch condition prevents the add instruction (whose destination is $1) from being moved after the branch.
▪ The branch delay slot is scheduled from the not-taken fall-through path.
▪ This strategy is preferred when the branch is not taken with high probability, such as forward branches.
▪ If the branch is taken (misprediction), the or instruction in the delay slot needs to be flushed!

Before scheduling:          After scheduling:
add $1, $2, $3              add $1, $2, $3
if $1 == 0 then             if $1 == 0 then
    [branch delay slot]         or $7, $8, $9     (delay slot)
or $7, $8, $9
sub $4, $5, $6              sub $4, $5, $6

5) Delayed Branch Technique

▪ To make the optimization legal for the target and fall-through cases, the moved instruction must either be flushed or be OK to execute when the branch goes in the unexpected direction.
▪ By OK we mean that the instruction in the branch delay slot is executed but the work is wasted (the program will still execute correctly).
▪ This is the case, for example, if the destination register is an unused temporary register when the branch goes in the unexpected direction.

5) Delayed Branch Technique

▪ In general, compilers are able to fill about 50% of the branch delay slots with valid and useful instructions; the remaining slots are filled with nops.
▪ In deeply pipelined processors, the branch delay is longer than one cycle: many slots must be filled for every branch.
• Since it is more difficult for the compiler to fill all the slots with useful instructions → almost all processors with the delayed branch technique have a single delay slot.

5) Delayed Branch Technique

▪ The main limitations on delayed branch scheduling arise from:
• The restrictions on the instructions that can be scheduled in the delay slot.

• The ability of the compiler to statically predict the outcome of the branch.

5) Delayed Branch Technique

▪ To improve the ability of the compiler to fill the branch delay slot → most processors have introduced a canceling or nullifying branch: the instruction includes the direction in which the branch was predicted.
• When the branch behaves as predicted → the instruction in the branch delay slot is executed normally.
• When the branch is incorrectly predicted → the instruction in the branch delay slot is turned into a nop (flushed).
▪ In this way, the compiler need not be as conservative when filling the delay slot.

5) Delayed Branch Technique

▪ The MIPS architecture has the branch-likely instruction, which behaves as a cancel-if-not-taken branch:
• The instruction in the branch delay slot is executed if the branch is taken.
• The instruction in the branch delay slot is not executed (it is turned into a nop) if the branch is untaken.
▪ Useful approach for backward branches (such as loop branches).

Dynamic Branch Prediction Techniques


Dynamic Branch Prediction

▪ Basic idea: use the past branch behavior to predict the future.
▪ We use hardware to dynamically predict the outcome of a branch: the prediction will depend on the behavior of the branch at run time and will change if the branch changes its behavior during execution.
▪ We start with a simple branch prediction scheme and then examine approaches that increase the branch prediction accuracy.

Dynamic Branch Prediction Schemes
▪ Dynamic branch prediction is based on two interacting hardware blocks placed in the Instruction Fetch stage to predict the next instruction to read from the Instruction Cache:
▪ Branch Outcome Predictor (BOP):
• To predict the direction of a branch (i.e. taken or not taken).

▪ Branch Target Predictor or Branch Target Buffer (BTB):

• To predict the branch target address in case of a taken branch (Predicted Target Address - PTA)

In IF: BOP: T/NT, BTB: --. In ID: BO = BOP? and BTA.

Branch                    IF    ID    EX    ME    WB
Predicted instr.                IF    ID    EX    ME    WB

Dynamic Branch Prediction Schemes
▪ If the branch is predicted by the BOP in the IF stage as not taken → the PC is incremented (the BTB is not useful in this case).
▪ If the BO at the end of the ID stage turns out to be not taken (the prediction is correct), we preserve performance.

BOP: untaken, BTB: --. BO = BOP (untaken) → prediction OK.

Untaken branch            IF    ID    EX    ME    WB
Instruction i+1                 IF    ID    EX    ME    WB    <- prediction
Instruction i+2                       IF    ID    EX    ME    WB
Instruction i+3                             IF    ID    EX    ME    WB
Instruction i+4                                   IF    ID    EX    ME    WB

Dynamic Branch Prediction Schemes
▪ If the BO at the end of the ID stage turns out to be taken (misprediction):
• We need to flush the next instruction already fetched (the next instruction is turned into a nop) and we restart the execution by fetching at the Branch Target Address → one-cycle penalty.

BOP: untaken, BTB: --. BO: taken → misprediction.

Taken branch              IF    ID    EX    ME    WB
Instruction i+1                 IF    nop   nop   nop   nop   <- prediction
Branch target                         IF    ID    EX    ME    WB
Branch target+1                             IF    ID    EX    ME    WB
Branch target+2                                   IF    ID    EX    ME    WB

Dynamic Branch Prediction Schemes
▪ If the branch is predicted by the BOP in the IF stage as taken → the BTB gives the Predicted Target Address (PTA).
▪ If the BO at the end of the ID stage turns out to be taken (the prediction is correct), we preserve performance.

BOP: taken, BTB: PTA. BO = BOP (taken) → prediction OK.

Taken branch              IF    ID    EX    ME    WB
Branch target                   IF    ID    EX    ME    WB    <- prediction
Branch target + 1                     IF    ID    EX    ME    WB
Branch target + 2                           IF    ID    EX    ME    WB
Branch target + 3                                 IF    ID    EX    ME    WB

Dynamic Branch Prediction Schemes
▪ If the BO at the end of the ID stage turns out to be untaken (misprediction):
• We need to flush the instruction already fetched at the predicted target (it is turned into a nop) and we restart the execution by fetching the next sequential instruction (PC+4) → one-cycle penalty.

BOP: taken, BTB: PTA. BO: untaken → misprediction.

Untaken branch            IF    ID    EX    ME    WB
Branch target                   IF    nop   nop   nop   nop   <- prediction
Instr. i+1                            IF    ID    EX    ME    WB
Instr. i+2                                  IF    ID    EX    ME    WB
Instr. i+3                                        IF    ID    EX    ME    WB
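The four cases above (predicted taken or untaken, actual outcome taken or untaken) can be summarized in a small Python sketch. The function names are hypothetical and the one-cycle penalty assumes, as in these slides, that the branch is resolved at the end of the ID stage.

```python
def predicted_fetch_address(pc, predicted_taken, predicted_target):
    """Address fetched right after the branch, chosen in the IF stage."""
    # When the BOP says "taken", the BTB supplies the Predicted Target Address.
    return predicted_target if predicted_taken else pc + 4


def resolve_branch(pc, predicted_taken, predicted_target, taken, target):
    """Return (correct next PC, penalty in cycles) once BO and BTA are known."""
    fetched = predicted_fetch_address(pc, predicted_taken, predicted_target)
    correct = target if taken else pc + 4
    # A wrong fetch costs one flushed instruction (assumes resolution in ID).
    penalty = 0 if fetched == correct else 1
    return correct, penalty
```

Note that this sketch also charges the penalty when the direction is right but the BTB supplies a wrong target, a case the slides above do not discuss explicitly.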

Dynamic Branch Prediction Techniques

1) Branch History Table

2) Correlating Branch Predictors

3) Two-level Adaptive Branch Predictors

4) Branch Target Buffer


1) Branch History Table

▪ Branch History Table (or Branch Prediction Buffer):

• Table containing 1 bit for each entry that says whether the branch was recently taken or not.

• Table indexed by the k low-order bits of the address of the branch instruction (to keep the size of the table limited).
• For locality reasons, we expect the most significant bits of the branch address not to change.

1) Branch History Table

[Figure: the k low-order bits of the n-bit branch address index a BHT with 2^k entries; each entry holds a 1-bit T/NT prediction.]

1) Branch History Table

▪ Prediction: a hint that is assumed to be correct, so fetching begins in the predicted direction.
• If the hint turns out to be wrong, the prediction bit is inverted and stored back. The pipeline is flushed and the correct sequence is executed with a one-cycle penalty.
▪ The table has no tags (every access is a hit), so the prediction bit could have been put there by another branch with the same low-order address bits: but it does not matter. The prediction is just a hint!
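A tagless 1-bit BHT as described above can be sketched in Python as follows. The class and method names are hypothetical; dropping the two lowest address bits before indexing is an assumption based on 4-byte aligned MIPS instructions.

```python
class OneBitBHT:
    """Tagless Branch History Table with 2^k one-bit entries."""

    def __init__(self, k: int, initial_taken: bool = False):
        self.mask = (1 << k) - 1
        self.bits = [initial_taken] * (1 << k)

    def index(self, branch_pc: int) -> int:
        # Skip the 2 byte-offset bits (assumption), then keep the k low-order bits.
        return (branch_pc >> 2) & self.mask

    def predict(self, branch_pc: int) -> bool:
        # No tag check: every access is a hit.
        return self.bits[self.index(branch_pc)]

    def update(self, branch_pc: int, taken: bool) -> None:
        # Storing the outcome is the same as inverting the bit on a misprediction.
        self.bits[self.index(branch_pc)] = taken
```

With k = 4, the branches at 0x40 and 0x80 map to the same entry, so an update made for one of them changes the prediction for the other, while a branch at 0x44 uses a different entry.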

1) Accuracy of the Branch History Table

▪ A misprediction occurs when:

• The prediction is incorrect for that branch

or
• The same index has been referenced by two different branches, and the previous history refers to the other branch (this can occur because there is no tag check).
• To reduce this problem, it is sufficient to increase the number of rows in the BHT (that is, to increase k) or to use a hashing function (such as in GShare).

1) FSM for 1-bit Branch History Table

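[Figure: two-state FSM with states Taken and Not Taken. In state Taken, a taken branch keeps the state and a not taken branch moves to state Not Taken; in state Not Taken, a not taken branch keeps the state and a taken branch moves to state Taken.]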

1) 1-bit Branch History Table
▪ Shortcoming of the 1-bit BHT:
• In a loop branch, even if the branch is almost always taken and then not taken once, the 1-bit BHT will mispredict twice (rather than once) when it is not taken.
▪ That scheme causes two wrong predictions:
• At the last loop iteration, since the prediction bit says taken, while we need to exit from the loop.
• When we re-enter the loop, at the end of the first loop iteration we need to take the branch to stay in the loop, while the prediction bit says to exit from the loop, since the prediction bit was flipped on the previous execution of the last iteration of the loop.
▪ For example, if we consider a loop branch whose behavior is taken nine times and not taken once, the prediction accuracy is only 80% (due to 2 incorrect predictions and 8 correct ones).
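The 80% figure can be reproduced with a short simulation. This is a sketch with hypothetical names; it models a single BHT entry whose bit was left at "not taken" by the previous execution of the loop.

```python
def one_bit_accuracy(outcomes, initial_taken):
    """Simulate one 1-bit predictor entry over a list of branch outcomes."""
    bit = initial_taken
    correct = 0
    for taken in outcomes:
        if bit == taken:
            correct += 1
        bit = taken  # the bit always records the last outcome
    return correct / len(outcomes)


# Loop branch: taken nine times, then not taken once at loop exit.
loop_outcomes = [True] * 9 + [False]

# The previous execution of the loop ended with "not taken".
steady_state_accuracy = one_bit_accuracy(loop_outcomes, initial_taken=False)
```

The first and the last iteration are mispredicted, so `steady_state_accuracy` is 0.8.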

1) 2-bit Branch History Table

▪ The prediction must miss twice before it is changed.

▪ In a loop branch, at the last loop iteration, we do not need to change the prediction.
▪ For each index in the table, the 2 bits are used to encode the four states of a finite state machine.

1) FSM for 2-bit Branch History Table

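One common way to encode the four states is a 2-bit saturating counter, sketched below in Python. This is an assumption made for illustration, since the FSM figure of the slide is not available here and other 2-bit state machines exist; the names are hypothetical.

```python
def two_bit_accuracy(outcomes, state=3):
    """Simulate one 2-bit saturating counter; return (accuracy, final state).

    States 0 and 1 predict not taken, states 2 and 3 predict taken
    (assumed encoding).
    """
    correct = 0
    for taken in outcomes:
        if (state >= 2) == taken:
            correct += 1
        # Increment on taken, decrement on not taken, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct / len(outcomes), state


# Same loop branch as before: taken nine times, then not taken once.
loop_outcomes = [True] * 9 + [False]
first_run = two_bit_accuracy(loop_outcomes, state=3)
second_run = two_bit_accuracy(loop_outcomes, state=first_run[1])
```

The loop exit moves the counter from state 3 to state 2, which still predicts taken, so on re-entry only the exit itself is mispredicted: both runs reach 90% accuracy, against 80% for the 1-bit scheme.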

1) n-bit Branch History Table

▪ Generalization: n-bit saturating counter for each entry in the prediction buffer.
• The counter can take on values between 0 and 2^n - 1.
• When the counter is greater than or equal to one-half of its maximum value (2^n - 1), that is when it is at least 2^(n-1), the branch is predicted as taken.
• Otherwise, it is predicted as untaken.
▪ As in the 2-bit scheme, the counter is incremented on a taken branch and decremented on an untaken branch.
▪ Studies on n-bit predictors have shown that 2-bit predictors behave almost as well.
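The n-bit generalization can be written as a small Python class. This is a sketch with hypothetical names, following the threshold and update rules stated above.

```python
class SaturatingCounter:
    """n-bit saturating counter used as a branch predictor entry."""

    def __init__(self, n_bits: int, value: int = 0):
        self.max_value = (1 << n_bits) - 1   # 2^n - 1
        self.threshold = 1 << (n_bits - 1)   # 2^(n-1), half of the range
        self.value = value

    def predict(self) -> bool:
        """Predict taken when the counter is in the upper half of its range."""
        return self.value >= self.threshold

    def update(self, taken: bool) -> None:
        """Increment on a taken branch, decrement on an untaken one."""
        if taken:
            self.value = min(self.value + 1, self.max_value)
        else:
            self.value = max(self.value - 1, 0)
```

With n_bits = 2 this reduces to the 2-bit scheme (values 2 and 3 predict taken); with n_bits = 3, a counter starting at 0 needs four taken branches before it starts predicting taken.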

1) Accuracy of 2-bit Branch History Table

▪ For the IBM Power architecture executing the SPEC89 benchmarks, a 4K-entry BHT with 2 bits per entry results in:
• Prediction accuracy from 99% to 82% (i.e. misprediction rate from 1% to 18%)

• Almost the same performance as an infinite buffer with 2 bits per entry.

2) Correlating Branch Predictors

▪ The 2-bit BHT uses only the recent behavior of a single branch to predict the future behavior of that branch.
▪ Basic idea: the behavior of recent branches is correlated, that is, the recent behavior of other branches, rather than just the current branch (the one we are trying to predict), can influence the prediction of the current branch.
▪ We try to exploit the correlation existing among different branches: branches are partially based on the same conditions → they can generate information that can influence the behavior of other branches.

2) Example of Correlating Branches

C code:
    if (a == 2) a = 0;      /* bb1 */
L1: if (b == 2) b = 0;      /* bb2 */
L2: if (a != b) { };        /* bb3 */

Assembly code (a in r1, b in r2):
    subi r3,r1,2
    bnez r3,L1      ; bb1
    add r1,r0,r0
L1: subi r3,r2,2
    bnez r3,L2      ; bb2
    add r2,r0,r0
L2: sub r3,r1,r2
    beqz r3,L3      ; bb3
L3:

Branch bb3 is correlated to the previous branches bb1 and bb2.

If the previous branches are both not taken, then a = b = 0, so bb3 (beqz) will be taken and the body of if (a != b) is skipped.
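The correlation in this example can be checked by mimicking the three branches in Python. The function name is hypothetical and the mapping of a and b to r1 and r2 is read off the assembly code above.

```python
def run_branches(a: int, b: int):
    """Return (bb1_taken, bb2_taken, bb3_taken) for the code fragment above."""
    bb1_taken = (a - 2) != 0   # bnez r3,L1 skips "a = 0"
    if not bb1_taken:
        a = 0
    bb2_taken = (b - 2) != 0   # bnez r3,L2 skips "b = 0"
    if not bb2_taken:
        b = 0
    bb3_taken = (a - b) == 0   # beqz r3,L3 skips the body of if (a != b)
    return bb1_taken, bb2_taken, bb3_taken


# Whenever bb1 and bb2 are both not taken, bb3 is taken.
always_correlated = all(
    run_branches(a, b)[2]
    for a in range(-3, 6)
    for b in range(-3, 6)
    if not run_branches(a, b)[0] and not run_branches(a, b)[1]
)
```

A predictor that looks only at the history of bb3 cannot see this; a predictor that also records the outcomes of bb1 and bb2 can.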

2) Correlating Branch Predictors

▪ Branch predictors that use the behavior of other branches to make a prediction are called Correlating Predictors or 2-level Predictors.
▪ Example: a (1,1) Correlating Predictor is a 1-bit predictor with 1 bit of correlation: the behavior of the last branch is used to choose between a pair of 1-bit branch predictors.

2) Correlating Branch Predictors: Example

[Figure: (1,1) correlating predictor. Two 1-bit Branch History Tables with 2^k entries each, T1 (used if the last branch was taken) and T2 (used if the last branch was not taken), are indexed by the k-bit branch address; the Last Branch Result selects which table supplies the Branch Prediction.]

2) Correlating Branch Predictors
 Record whether the most recently executed branches have
been taken or not taken.
 The branch is predicted based on the previously executed
branch by selecting the appropriate 1-bit BHT:
• One prediction is used if the last branch executed was taken.
• Another prediction is used if the last branch executed was not
taken.
 In general, the last branch executed is not the same
instruction as the branch being predicted (although this
can occur in simple loops with no other branches in the
loop).


2) (m, n) Correlating Branch Predictors

 In general, an (m, n) correlating predictor records the last m
branches to choose from 2^m BHTs, each of which is an n-bit
predictor.
 The branch prediction buffer can be indexed by using a
concatenation of low-order bits from the branch address
with the m-bit global history (i.e. the global history of the most
recent m branches).
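
A minimal Python sketch of an (m, n) correlating predictor follows, to make the indexing concrete. The class name, the parameter `k` (number of branch address bits), the counters starting at 0, and the assumption of 4-byte instructions (two low PC bits dropped) are illustrative choices, not taken from the slides.

```python
class CorrelatingPredictor:
    """Sketch of an (m, n) correlating predictor with 2**k entries per BHT."""

    def __init__(self, m, n, k):
        self.m, self.n, self.k = m, n, k
        self.history = 0                      # m-bit global history, newest outcome in bit 0
        self.max_count = (1 << n) - 1
        # All n-bit counters start at 0 (an arbitrary initial state).
        self.counters = [0] * (1 << (k + m))

    def _index(self, pc):
        # k low-order address bits (assuming 4-byte instructions),
        # concatenated with the m-bit global history.
        addr_bits = (pc >> 2) & ((1 << self.k) - 1)
        return (addr_bits << self.m) | self.history

    def predict(self, pc):
        # Predict taken when the counter is in the upper half of its range.
        return self.counters[self._index(pc)] >= (1 << (self.n - 1))

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(self.counters[i] + 1, self.max_count)
        else:
            self.counters[i] = max(self.counters[i] - 1, 0)
        # Shift the new outcome into the global history.
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.m) - 1)


def mispredictions(predictor, pc, outcomes):
    """Count wrong predictions for one branch over a list of outcomes."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


alternating = [i % 2 == 0 for i in range(20)]   # T, NT, T, NT, ...
print(mispredictions(CorrelatingPredictor(m=1, n=1, k=4), 0x40, alternating))
print(mispredictions(CorrelatingPredictor(m=0, n=2, k=4), 0x40, alternating))
```

On this alternating pattern, and with the initial state assumed here, the (1,1) configuration mispredicts only the first outcome, while the (0,2) configuration (a plain 2-bit BHT) mispredicts every taken outcome.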


2) (2, 2) Correlating Branch Predictors
 A (2, 2) correlating predictor has four 2-bit Branch History Tables.
• It uses the 2-bit global history to choose among the 4 BHTs.

[Figure: four 2-bit BHTs side by side, each with 2^k entries. The k-bit
branch address selects a row and the 2-bit global branch history selects one
of the four BHTs, which supplies the 2-bit prediction.]


2) Example of (2, 2) Correlating Predictor
 Example: a (2, 2) correlating predictor with 64 total
entries => 6-bit index composed of the 2-bit global history
and 4 low-order branch address bits.

[Figure: four 2-bit BHTs, each with 2^4 entries. The 4-bit branch address
selects a row and the 2-bit global branch history selects one of the four
BHTs, which supplies the 2-bit prediction.]
2) Example of (2, 2) Correlating Predictor

 Each BHT is composed of 16 entries of 2 bits each.

 The 4-bit branch address is used to choose four entries
(a row).
 The 2-bit global history is used to choose one of the four entries
in a row (one out of four BHTs).
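
As a worked example of this 6-bit index, the short Python sketch below builds it for one branch. The function name `bht_index`, the sample PC value, and the assumption of word-aligned 4-byte instructions are illustrative only.

```python
def bht_index(pc, global_history):
    """6-bit index: 4 low-order branch address bits (row) + 2-bit history (BHT)."""
    row = (pc >> 2) & 0xF            # assumes word-aligned instructions
    column = global_history & 0x3    # selects one of the four BHTs
    return (row << 2) | column


# Branch at a hypothetical address, last two branches were T then NT (binary 10).
print(bht_index(0x00400028, 0b10))   # row 10, column 2
```

The result always lies between 0 and 63, matching the 64 total entries of the example.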


2) Accuracy of Correlating Predictors

 A 2-bit BHT predictor with no global history is simply a
(0, 2) predictor.
 Compare the performance of a simple 2-bit
predictor with 4K entries and a (2,2) correlating
predictor with 1K entries.
 The (2,2) predictor not only outperforms the simple 2-bit
predictor with the same number of total bits (4K total
bits), it often outperforms a 2-bit predictor with an
unlimited number of entries.


3) Two-Level Adaptive Branch Predictors

 The first-level history is recorded in one (or more) k-bit shift
registers called Branch History Registers (BHR), which record the
outcomes of the k most recent branches (e.g. T, NT, NT, T) (used
as a global history).
 The second-level history is recorded in one (or more) tables called
Pattern History Tables (PHT) of two-bit saturating counters (used
as a local history).
 The BHR is used to index the PHT to select which 2-bit counter to
use.
 Once the two-bit counter is selected, the prediction is made using
the same method as in the two-bit counter scheme.
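
The two levels can be sketched in Python as a k-bit shift register indexing a table of 2-bit saturating counters. This is a minimal sketch with one BHR and one PHT; the class name and the initial counter value (1, weakly not taken) are assumptions made for illustration.

```python
class TwoLevelPredictor:
    """One k-bit BHR (first level) indexing one PHT of 2-bit counters (second level)."""

    def __init__(self, k):
        self.k = k
        self.bhr = 0                     # outcomes of the k most recent branches
        self.pht = [1] * (1 << k)        # 2-bit counters, assumed to start at 1

    def predict(self):
        # Same rule as the two-bit counter scheme: 2 or 3 means predict taken.
        return self.pht[self.bhr] >= 2

    def update(self, taken):
        c = self.pht[self.bhr]
        self.pht[self.bhr] = min(c + 1, 3) if taken else max(c - 1, 0)
        # Shift the outcome into the BHR, keeping only k bits.
        self.bhr = ((self.bhr << 1) | int(taken)) & ((1 << self.k) - 1)


predictor = TwoLevelPredictor(k=4)
pattern = [True, True, True, False] * 10   # loop branch: T, T, T, NT repeated
misses = []
for taken in pattern:
    misses.append(predictor.predict() != taken)
    predictor.update(taken)
print(sum(misses), "mispredictions, all during warm-up:", not any(misses[8:]))
```

Once each history pattern has been seen, the counter it selects remembers what followed it, so the periodic loop exit is predicted correctly. A single 2-bit counter would keep mispredicting that exit.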


3) GA Predictor

 The global 2-level predictor uses the correlation between
the current branch and the other branches in the global
history to make the prediction.
 GAs: Global and local predictor
• 2-level predictor: PHT (local history) indexed by the
content of the BHR (global history)

[Figure: the BHR (e.g. T, NT, NT, T) indexes the PHT, which outputs the
T/NT prediction.]


3) GShare Predictor

 Variation of the GA predictor where we want to correlate the BHR
recording the outcomes of the most recent branches (global history)
with the low-order bits of the branch address.
 GShare: we XOR the 4-bit BHR (global history) with the
low-order 4 bits of the PC (branch address) to index the PHT (local history).

[Figure: the 4-bit BHR (e.g. T, NT, NT, T) is XORed with 4 bits of the PC;
the result indexes the PHT, which outputs the T/NT prediction.]
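
The GShare indexing can be sketched in Python as below. The class name, the 4-bit default width, the initial counter value, and the dropping of the two byte-offset bits of the PC (4-byte instructions assumed) are illustrative assumptions.

```python
class GSharePredictor:
    """PHT of 2-bit counters indexed by (low-order PC bits) XOR (global history)."""

    def __init__(self, bits=4):
        self.mask = (1 << bits) - 1
        self.bhr = 0                       # global branch history register
        self.pht = [1] * (1 << bits)       # counters assumed to start at 1

    def index(self, pc):
        # XOR of the low-order branch address bits with the BHR.
        return ((pc >> 2) ^ self.bhr) & self.mask

    def predict(self, pc):
        return self.pht[self.index(pc)] >= 2

    def update(self, pc, taken):
        i = self.index(pc)
        c = self.pht[i]
        self.pht[i] = min(c + 1, 3) if taken else max(c - 1, 0)
        self.bhr = ((self.bhr << 1) | int(taken)) & self.mask


g = GSharePredictor()
g.bhr = 0b1001                             # history T, NT, NT, T
# Two branches at hypothetical addresses see the same history
# but select different PHT entries.
print(g.index(0x1000), g.index(0x1004))
```

With a PHT indexed by the BHR alone, both branches would share one counter for a given history; the XOR spreads them over different entries.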


4) Branch Target Buffer

 The Branch Target Buffer (Branch Target Predictor) is a
cache storing the Predicted Target Address (PTA).
 We access the BTB in the IF stage by using the address
of the fetched branch instruction to index the cache.
 Typical entry of the BTB:

| Tag of Branch Address | Predicted Target Address |

 The Predicted Target Address is expressed as PC-relative.


4) Structure of a Branch Target Buffer
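[Figure: the branch address is split into TAG and INDEX fields. The index
selects a BTB entry; the stored tag is compared (=) with the tag of the
branch address (associative lookup) and the entry holds the Predicted
Target Address. Some validity bits are also needed.
MISS: proceed with PC+4. HIT: the PTA should be used as the next PC.]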


4) Structure of a Branch Target Buffer

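 BTB entry:
• Tag + Predicted Target Address (PTA, expressed as PC-relative
for conditional branches) + Branch Outcome Predictor (BOP:
prediction state bit(s) as in a BHT)

[Figure: the branch address is split into TAG and INDEX fields. The indexed
entry holds Tag, PTA and BOP. The tag comparison (=) gives Hit/Miss, the
BOP gives T/NT, and the entry supplies the PTA.]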
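
A BTB lookup and update can be sketched in Python as follows. The direct-mapped organisation, the 2-bit counter used as Branch Outcome Predictor, the allocation of an entry only when a branch is taken, and all names are assumptions made for this sketch. An empty slot (`None`) plays the role of a cleared validity bit.

```python
class BranchTargetBuffer:
    """Direct-mapped BTB sketch: each entry holds tag, PC-relative PTA and a 2-bit BOP."""

    def __init__(self, index_bits=4):
        self.index_bits = index_bits
        self.entries = [None] * (1 << index_bits)    # None = invalid entry

    def _split(self, pc):
        word = pc >> 2                               # assumes 4-byte instructions
        return word >> self.index_bits, word & ((1 << self.index_bits) - 1)

    def next_pc(self, pc):
        tag, index = self._split(pc)
        entry = self.entries[index]
        if entry is not None and entry["tag"] == tag and entry["bop"] >= 2:
            return pc + entry["offset"]              # hit and predicted taken: use PTA
        return pc + 4                                # miss or predicted not taken

    def update(self, pc, taken, target):
        tag, index = self._split(pc)
        entry = self.entries[index]
        if entry is not None and entry["tag"] == tag:
            c = entry["bop"]
            entry["bop"] = min(c + 1, 3) if taken else max(c - 1, 0)
            if taken:
                entry["offset"] = target - pc
        elif taken:
            # Allocate (or replace) the entry, starting weakly taken.
            self.entries[index] = {"tag": tag, "offset": target - pc, "bop": 2}


btb = BranchTargetBuffer()
print(hex(btb.next_pc(0x100)))        # miss: falls through to PC+4
btb.update(0x100, True, 0x80)         # branch at 0x100 jumps back to 0x80
print(hex(btb.next_pc(0x100)))        # hit: predicted target
```

A branch at a different address that maps to the same index (for example 0x200 here) fails the tag comparison and falls through to PC+4.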


Speculation

 Without branch prediction, the amount of parallelism is quite
limited, since it is confined to a basic block: a straight-line
code sequence with no branches in except at the entry and no
branches out except at the exit.
 Branch prediction techniques can help to achieve a significant amount
of parallelism.
 We can further exploit ILP across multiple basic blocks, overcoming
control dependences by speculating on the outcome of branches and
executing instructions as if our guesses were correct.
 With speculation, we fetch, issue and execute instructions as if our
branch predictions were always correct, providing a mechanism to
handle the situation where the speculation is incorrect.
 Speculation can be supported by the compiler or by the hardware.


References

 An introduction to the branch prediction problem can be found in

Appendix A and Chapter 3 of the reference book: J. Hennessy and
D. Patterson, “Computer Architecture, a Quantitative Approach”,
Morgan Kaufmann, Fourth Edition.
