CA Lecture 4 Module 3

This document summarizes techniques for reducing branch costs in computer architecture, specifically branch prediction. It discusses how as processors add parallelism, branching becomes a limiting factor. Branch destination and test can be known earlier through optimizations, such as in the second cycle rather than the third. Hazard detection units identify dependencies to avoid stalls from control dependencies. Overall it aims to reduce branch costs as parallelism increases in processor design.

Uploaded by

Balaji Reddy
Copyright
© All Rights Reserved

CSCI 6461: Computer Architecture

Branch Prediction

Instructor: M. Lancaster

Corresponding to Hennessy and Patterson,
Fifth Edition,
Section 3.3 and Part of Section 3.9
Reducing Branch Costs

• The frequency of branches and jumps demands that we
also attack stalls arising from control dependencies
• As we add more parallel execution units,
branching becomes a constraining factor
• On an n-issue processor, branches will arrive up to n times faster

September 2012 2
Review of a Branching Optimization
Left: branch destination and test known at end of third cycle of execution
Right: branch destination and test known at end of second cycle of execution

[Figure: two pipelined datapath diagrams (showing the hazard detection unit, forwarding unit, and the PCSrc and IF.Flush signals), each above a pipeline timing diagram over CC 1 to CC 9 for the instruction sequence below]

40 beq $1, $3, 7
44 and $12, $2, $5
48 or $13, $6, $2
52 add $14, $2, $2
72 lw $4, 50($7)



Dynamic Branch Prediction

• Branch prediction buffer
– Simplest scheme
– A small memory indexed by the lower portion of the address of
the branch instruction
• Includes a bit that says whether the branch was taken recently or
not
• No tags
• Useful only to reduce the branch delay when it is longer than the
time to compute the possible target PCs
• Since we only use low-order bits, some other branch instruction
that maps to the same entry could have set the prediction bit
– The prediction is a hint that is assumed to be correct. If it turns
out to be wrong, the prediction bit is inverted and stored back
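
The buffer described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name, the 16-entry size, and the 4-byte instruction alignment are hypothetical choices, not taken from the lecture.

```python
class OneBitPredictionBuffer:
    """Untagged table of 1-bit predictions, indexed by low-order PC bits.

    Hypothetical sketch: the table size and the 4-byte instruction
    alignment are assumptions made for illustration.
    """

    def __init__(self, index_bits=4):
        self.mask = (1 << index_bits) - 1
        self.bits = [False] * (1 << index_bits)  # False = predict not taken

    def _index(self, pc):
        # Drop the two byte-offset bits of a word-aligned instruction
        # address, then keep only the low-order index bits.
        return (pc >> 2) & self.mask

    def predict(self, pc):
        return self.bits[self._index(pc)]

    def update(self, pc, taken):
        # Storing the actual outcome inverts the bit exactly when the
        # prediction was wrong.
        self.bits[self._index(pc)] = taken


buf = OneBitPredictionBuffer(index_bits=4)
buf.update(0x40, taken=True)    # branch at 0x40 was taken
same_entry = buf.predict(0x80)  # 0x80 maps to the same entry as 0x40
other_entry = buf.predict(0x44) # 0x44 maps to a different entry
```

Because there are no tags, the lookup for 0x80 returns the bit that the branch at 0x40 set, which is the aliasing effect noted in the last sub-bullet.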

Dynamic Branch Prediction

• Branch prediction buffer is a cache
• The 1-bit scheme has a shortcoming
– Even if a branch is almost always taken, we will usually
predict incorrectly twice, rather than once, when it is not
taken
• Consider a loop branch that is taken nine times in a row, then not
taken. What is the prediction accuracy for this branch, assuming
the prediction bit for this branch remains in the prediction buffer?
– We mispredict on the first and last predictions. On the first, the
bit is 0 (not taken) because the branch was not taken when the
loop last exited. On the last, the branch is not taken but the bit
predicts taken.
– Accuracy is only 80% (8 of 10 correct), although the branch is
taken 90% of the time
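
The 80% figure can be checked with a short simulation. This is a minimal sketch; the function name is hypothetical, and it assumes the prediction bit starts at "not taken", as it would after a previous exit from the same loop.

```python
def one_bit_accuracy(outcomes, initial_bit, repeats=1):
    """Fraction of correct predictions for a single 1-bit predictor.

    outcomes: list of booleans (True = taken) for one pass of the loop.
    initial_bit: prediction bit before the first pass (assumed value).
    """
    bit = initial_bit
    correct = total = 0
    for _ in range(repeats):
        for taken in outcomes:
            correct += (bit == taken)  # prediction is simply the stored bit
            bit = taken                # bit always records the last outcome
            total += 1
    return correct / total


# Loop branch: taken nine times, then not taken once at loop exit.
loop = [True] * 9 + [False]
steady = one_bit_accuracy(loop, initial_bit=False, repeats=100)
```

Each pass through the loop costs two mispredictions (the first and the last branch), so `steady` comes out at 8 correct predictions in every 10.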

Dynamic Branch Prediction

• To remedy this situation, 2-bit branch prediction schemes
are often used. A prediction must miss twice before it is
changed.
• A specialization of a more general scheme that has an n-bit
saturating counter for each entry in the prediction buffer.
With n bits, the counter can take on the values 0 to 2^n - 1. When the
counter is greater than or equal to one-half of its maximum value
(2^(n-1)), the branch is predicted as taken
• The counter is incremented on a taken branch and decremented on
a not-taken one
• 2-bit counters work almost as well as larger ones
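
The n-bit saturating counter can be sketched as follows. The class and function names are hypothetical, and the loop branch (taken nine times, then not taken) is reused here purely as an illustration.

```python
class SaturatingCounter:
    """n-bit saturating counter for one prediction-buffer entry (sketch)."""

    def __init__(self, n_bits=2, value=0):
        self.max_value = (1 << n_bits) - 1  # counter ranges over 0 .. 2^n - 1
        self.threshold = 1 << (n_bits - 1)  # predict taken at or above 2^(n-1)
        self.value = value

    def predict(self):
        return self.value >= self.threshold

    def update(self, taken):
        # Increment on taken, decrement on not taken, saturating at both ends.
        if taken:
            self.value = min(self.value + 1, self.max_value)
        else:
            self.value = max(self.value - 1, 0)


def loop_accuracy(counter, outcomes, passes):
    """Fraction of correct predictions over repeated passes of a loop."""
    correct = total = 0
    for _ in range(passes):
        for taken in outcomes:
            correct += (counter.predict() == taken)
            counter.update(taken)
            total += 1
    return correct / total


loop = [True] * 9 + [False]
counter = SaturatingCounter(n_bits=2)
loop_accuracy(counter, loop, passes=1)            # warm-up pass, not counted
steady = loop_accuracy(counter, loop, passes=100) # steady-state accuracy
```

After the warm-up pass the single not-taken outcome at loop exit only moves the counter from 3 to 2, so the next pass still starts by predicting taken and only the exit branch is mispredicted.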

The States in a 2 Bit Prediction Scheme

Branch Prediction Buffer

• Implemented via a small special cache accessed with the
instruction address during the IF pipe stage, or as a pair of
bits attached to each block in the instruction cache and
fetched with each instruction.
• If the instruction is a branch and is predicted as taken,
fetching begins from the target as soon as the PC is known.
Otherwise sequential fetching and executing continue. If the
prediction is wrong, the prediction bits are changed as in
the state diagram.

Branch Prediction Buffer

• Useful for many pipelines
• In our five-stage pipeline, the pipeline finds out whether the
branch is taken and what the target of the branch is at
roughly the same time as the branch predictor information
would have been used (the end of the second stage of the
execution of the branch).
• Therefore, this scheme does not help for our pipeline
• Next figure shows the performance of 2-bit prediction on a set of
benchmarks (misprediction rates between 1% and 18%)

Prediction accuracy of a 4096 entry 2-bit prediction
buffer

Increasing the size of the buffer does not help much

Correlating Branch Predictors

• Branch predictions for integer programs are less accurate
• These 2-bit schemes use only the recent behavior of a single
branch to predict the future behavior of that branch
• Look at other branches rather than just the branch we are
trying to predict
    if (aa==2)
        aa=0;
    if (bb==2)
        bb=0;
    if (aa!=bb){

Correlating Branch Predictors

• MIPS code (aa in R1, bb in R2)
        DSUBUI R3,R1,#2
        BNEZ   R3,L1     ;branch b1 (aa!=2)
        DADD   R1,R0,R0  ;aa=0
L1:     DSUBUI R3,R2,#2
        BNEZ   R3,L2     ;branch b2 (bb!=2)
        DADD   R2,R0,R0  ;bb=0
L2:     DSUBU  R3,R1,R2  ;R3=aa-bb
        BEQZ   R3,L3     ;branch b3 (aa==bb)

Branch b3 is correlated with branches b1 and b2: if branches b1
and b2 are both not taken, then aa and bb are both set to 0, so b3
will be taken since they are equal
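
The correlation can be checked by enumerating a few input values. This is a small sketch that mirrors the MIPS fragment in Python; the function name and the range of test values (0 to 3) are assumptions made for illustration.

```python
from itertools import product


def run_fragment(aa, bb):
    """Return (b1_taken, b2_taken, b3_taken) for the code fragment above."""
    b1_taken = aa != 2   # BNEZ is taken when aa != 2
    if not b1_taken:
        aa = 0           # fall through: aa = 0
    b2_taken = bb != 2   # BNEZ is taken when bb != 2
    if not b2_taken:
        bb = 0           # fall through: bb = 0
    b3_taken = aa == bb  # BEQZ is taken when aa == bb
    return b1_taken, b2_taken, b3_taken


results = {(aa, bb): run_fragment(aa, bb)
           for aa, bb in product(range(4), repeat=2)}

# Keep only the runs in which neither b1 nor b2 was taken.
both_fall_through = [r for r in results.values() if not r[0] and not r[1]]
b3_always_taken = all(b3 for _, _, b3 in both_fall_through)
```

A predictor that looks only at the history of b3 cannot see this relationship, because the information lives in the outcomes of b1 and b2.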

Correlating Branch Predictors

• Branch predictors that use the behavior of other branches
to make a prediction are called correlating predictors or
two-level predictors.

Correlating Branch Predictors

Look at the branches with d = 0, 1, and 2

C code          MIPS code (d in R1)
if (d==0)           BNEZ   R1,L1     ;branch b1 (d!=0)
    d=1;            DADDIU R1,R0,#1  ;d==0, so d=1
if (d==1)       L1: DADDIU R3,R1,#-1
                    BNEZ   R3,L2     ;branch b2 (d!=1)
                L2:

Correlating Branch Predictors

Initial value of d | d==0? | b1        | Value of d before b2 | d==1? | b2
0                  | Yes   | Not taken | 1                    | Yes   | Not taken
1                  | No    | Taken     | 1                    | Yes   | Not taken
2                  | No    | Taken     | 2                    | No    | Taken

Possible Execution Sequences

• If b1 is not taken, then b2 will not be taken
• A 1-bit predictor, which sees only each branch's own history,
does not have the capability to take advantage of this
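
A short simulation illustrates how badly a plain 1-bit predictor can do on this code. The sketch makes several assumptions not stated on the slide: d alternates between 2 and 0, each branch has its own prediction bit (no aliasing), and both bits start at "not taken".

```python
def simulate_one_bit(d_values):
    """Count mispredictions of b1 and b2 with one 1-bit predictor each."""
    bits = {"b1": False, "b2": False}  # False = predict not taken (assumed)
    mispredictions = 0
    for d in d_values:
        b1_taken = d != 0                  # b1: BNEZ taken when d != 0
        mispredictions += bits["b1"] != b1_taken
        bits["b1"] = b1_taken
        if d == 0:
            d = 1                          # fall through: d = 1
        b2_taken = d != 1                  # b2: BNEZ taken when d != 1
        mispredictions += bits["b2"] != b2_taken
        bits["b2"] = b2_taken
    return mispredictions


# Four executions of the fragment, eight branches in total.
misses = simulate_one_bit([2, 0, 2, 0])
```

Under these assumptions every one of the eight branches is mispredicted, because each bit always holds the opposite of the next outcome.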

Correlating Branch Predictors

• To develop a branch predictor that uses correlation, let
every branch have two prediction bits: one prediction bit
that is used if the last branch executed was not taken and
another prediction bit that is used if the last branch
executed was taken.
• The last branch executed is usually not the same
instruction as the branch being predicted, although this can
occur.

1-Bit Correlation Prediction

Prediction bits | Prediction if last branch not taken | Prediction if last branch taken
NT/NT           | NT                                  | NT
NT/T            | NT                                  | T
T/NT            | T                                   | NT
T/T             | T                                   | T

• This is a (1,1) predictor since it uses the behavior of the last
branch to choose from among a pair of 1-bit branch
predictors
• An (m,n) predictor uses the behavior of the last m branches to
choose from 2^m branch predictors, each of which is an n-bit
predictor for a single branch
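
A (1,1) predictor can be sketched on the d==0 / d==1 fragment from the earlier slide. The setup is an assumption made for illustration: d alternates between 2 and 0, all prediction bits start at NT/NT, and the "last branch" is initially treated as not taken.

```python
def simulate_one_one(d_values):
    """Count mispredictions of b1 and b2 with a (1,1) predictor."""
    # bits[branch][0] is used if the last branch executed was not taken,
    # bits[branch][1] if it was taken. All bits start at NT (assumed).
    bits = {"b1": [False, False], "b2": [False, False]}
    last_taken = False  # assumed outcome of the branch before the first b1
    mispredictions = 0

    def step(name, taken):
        nonlocal last_taken, mispredictions
        slot = int(last_taken)            # global history selects the bit
        mispredictions += bits[name][slot] != taken
        bits[name][slot] = taken          # only the selected bit is updated
        last_taken = taken

    for d in d_values:
        step("b1", d != 0)                # b1 taken when d != 0
        if d == 0:
            d = 1
        step("b2", d != 1)                # b2 taken when d != 1
    return mispredictions


misses = simulate_one_one([2, 0, 2, 0])
```

With these assumptions only the two branches of the first execution are mispredicted; afterwards the bit selected by the previous branch's outcome already holds the right prediction.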
(m,n) Predictors

• Can yield higher prediction rates than the 2-bit scheme and
requires only a small amount of additional hardware
• We can record the global history of the most recent m branches
in an m-bit shift register, where each bit records whether
the branch was taken or not taken
• The branch prediction buffer can be indexed by using a
concatenation of the low-order bits from the branch
address with the m-bit global history. That is, the address
selects a row in the prediction buffer and the global history
chooses among the entries in that row.
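
The indexing scheme can be sketched as below. This is a hypothetical illustration, not the hardware design: the class name, the default sizes, and the 4-byte instruction alignment are assumptions.

```python
class CorrelatingPredictor:
    """(m,n) predictor sketch: m bits of global history, n-bit counters."""

    def __init__(self, m=2, n=2, address_bits=4):
        self.m = m
        self.address_mask = (1 << address_bits) - 1
        self.history_mask = (1 << m) - 1
        self.history = 0                      # m-bit global shift register
        self.counter_max = (1 << n) - 1
        self.threshold = 1 << (n - 1)         # predict taken at or above this
        self.counters = [0] * (1 << (address_bits + m))

    def _index(self, pc):
        # Low-order address bits select the row; the global history is
        # concatenated below them to select one counter within the row.
        row = (pc >> 2) & self.address_mask
        return (row << self.m) | self.history

    def predict(self, pc):
        return self.counters[self._index(pc)] >= self.threshold

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(self.counters[i] + 1, self.counter_max)
        else:
            self.counters[i] = max(self.counters[i] - 1, 0)
        # Shift the newest outcome into the global history register.
        self.history = ((self.history << 1) | int(taken)) & self.history_mask


# A single branch that strictly alternates taken / not taken.
predictor = CorrelatingPredictor(m=2, n=2, address_bits=4)
pattern = [True, False] * 10
for taken in pattern:                         # warm-up, not counted
    predictor.update(0x40, taken)
misses = 0
for taken in pattern:
    misses += predictor.predict(0x40) != taken
    predictor.update(0x40, taken)
```

In this sketch a (2,2) predictor with 16 rows holds 64 two-bit counters, and once warmed up it follows the alternating pattern, since the two history values seen before a taken and a not-taken outcome select different counters.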

Fig 14

Comparison of 2-bit predictors: first a non-correlating predictor with 4096 entries,
followed by a non-correlating predictor with unlimited entries, and finally a
predictor with 2 bits of global history and 1024 entries

Tournament Predictor for the Alpha 21264

Fraction of Predictions Coming from the Local Predictor for
a Tournament Predictor using SPEC89 Benchmarks

Branch Target Buffers
(Advanced Technique for Instruction Delivery)

• Reduce penalty in our 5-stage pipeline
– Determine next instruction address to fetch by the end of IF
• We must know whether an instruction (not yet decoded) is a
branch and, if so, what the next PC should be
• If at the end of IF we know the instruction is a branch and we
know what the next PC should be, we have zero penalty
– A branch prediction cache that stores the predicted address for
the next instruction after a branch is called a branch target
buffer or branch target cache
– For the classic 5-stage pipeline, a branch prediction buffer is
accessed during the ID cycle. At the end of ID we know the
branch target address (computed in ID), the fall-through
address (computed during IF), and the prediction

Branch Target Buffers

• Reduce penalty in our 5-stage pipeline (continued)
– Thus by the end of ID we know enough to fetch the next
predicted instruction.
– For a branch target buffer, we access the buffer during the IF
stage using the instruction address of the fetched instruction (a
possible branch) to index the buffer
– If we get a hit, then we know the predicted instruction address
at the end of the IF cycle, which is one cycle earlier than for
the branch prediction buffer
– This address is predicted and will be sent out before decoding
the instruction. It must be known whether the fetched
instruction is predicted as a taken branch
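
The lookup performed during IF can be sketched as follows. This is a simplified illustration with assumed details: the class and method names are hypothetical, an unbounded dictionary stands in for the finite tagged cache, instructions are 4 bytes, and only predicted-taken branches are stored. The addresses in the usage lines reuse the `beq` at address 40 with target 72 from the earlier pipeline diagram.

```python
class BranchTargetBuffer:
    """Sketch of a branch target buffer holding predicted-taken branches."""

    def __init__(self):
        self.entries = {}  # branch instruction address -> predicted next PC

    def next_fetch_pc(self, pc):
        # Accessed during IF with the address of the instruction being
        # fetched. A hit means the instruction is predicted to be a taken
        # branch, so fetch continues at the stored target; on a miss,
        # fetch continues at the fall-through address.
        return self.entries.get(pc, pc + 4)

    def resolve(self, pc, taken, target):
        # Called once the branch outcome is known: remember taken
        # branches, and drop an entry whose branch turned out not taken.
        if taken:
            self.entries[pc] = target
        else:
            self.entries.pop(pc, None)


btb = BranchTargetBuffer()
first = btb.next_fetch_pc(40)   # miss: fall through to 44
btb.resolve(40, taken=True, target=72)
second = btb.next_fetch_pc(40)  # hit: fetch from predicted target 72
```

The second lookup yields the predicted target at the end of IF, before the instruction at address 40 has been decoded.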

Fig 3.21 A Branch Target Buffer: The PC of the instruction being fetched is matched
against a set of instruction addresses stored in the first column, which represent the addresses of
known branches. If the PC matches one of these entries, then the instruction being fetched is a taken
branch, and the second field, predicted PC, contains the prediction for the next PC after the branch.
Fetching immediately begins at that address.

Fig 3.22 Steps Involved in Handling an Instruction
with a Branch Target Buffer
