
The Bi-Mode Branch Predictor

The document presents the bi-mode branch predictor, a novel dynamic branch prediction technique that addresses the destructive aliasing problem in existing predictors. By dividing prediction tables into two halves and dynamically selecting the appropriate half based on the program's current mode, the bi-mode predictor improves prediction accuracy while minimizing hardware requirements. The study demonstrates that the bi-mode predictor outperforms the gshare predictor in terms of accuracy and cost-effectiveness across various benchmarks.



Article · January 1998


DOI: 10.1109/MICRO.1997.645792 · Source: IEEE Xplore



To appear in MICRO-30, 1997

The Bi-Mode Branch Predictor


Chih-Chieh Lee, I-Cheng K. Chen, and Trevor N. Mudge
EECS Department, University of Michigan
1301 Beal Ave., Ann Arbor, Michigan 48109-2122
{leecc, icheng, tnm}@eecs.umich.edu

Abstract

Dynamic branch predictors are popular because they can deliver accurate branch prediction without changes to the instruction set architecture or pre-existing binaries. However, to achieve the desired prediction accuracy, existing dynamic branch predictors require considerable amounts of hardware to minimize the interference effects due to aliasing in the prediction tables. We propose a new dynamic predictor, the bi-mode predictor, which divides the prediction tables into two halves and, by dynamically determining the current “mode” of the program, selects the appropriate half of the table for prediction. This approach is shown to preserve the merits of global history based prediction while reducing destructive aliasing and, as a result, improving prediction accuracy. Moreover, it is simple enough that it does not impact a processor’s cycle time. We conclude by conducting a comprehensive study into the mechanism underlying two-level dynamic predictors and investigate the criteria for their optimal designs. The analysis presented provides a general framework for studying branch predictors.

1. Introduction

The ability to minimize stalls or pipeline bubbles that may result from branches is becoming increasingly critical as microprocessor designs implement greater degrees of instruction level parallelism. There are several techniques for reducing branch penalties including guarded execution, basic block enlargement, and static and dynamic branch prediction [PnevmatikatosSohi94, Hwu93, Smith81, FisherFreudenberger92, YehPatt91, PanSoRahmeh92]. Among these, dynamic branch prediction is perhaps the most popular, because it yields good results and can be implemented without changes to the instruction set architecture or pre-existing binaries.

The strength of dynamic branch prediction is that it can track branch behavior closely at run-time, providing a degree of adaptivity that other approaches lack. This adaptivity is especially critical when the behavior of branches can be affected by the input data of different program runs. With the introduction of two-level schemes [YehPatt91], the prediction accuracy of dynamic branch predictors has been pushed above 90%. As a result, two-level dynamic branch predictors have been incorporated in several recent high-performance microprocessors. Perhaps the best known examples, at the time of writing, are the Pentium Pro [Gwennap95] and Alpha 21264 [Gwennap96].

Among two-level predictors, those using global history schemes have been shown to yield the best performance for integer benchmarks [YehPatt93]. However, to achieve high levels of accuracy, current dynamic branch predictors require considerable amounts of hardware because their most significant weakness, the destructive aliasing problem, is most easily solved by increasing the size of the predictors [SechrestLeeMudge96]. This paper proposes a new technique, the bi-mode branch predictor, that is economical and simple enough to avoid critical timing paths. Furthermore, we demonstrate that on the IBS and SPEC CINT95 benchmarks the bi-mode predictor performs on average better than gshare, one of the best global history based predictors, for the same cost. Finally, we conduct a comprehensive study into the mechanism underlying two-level dynamic predictors and investigate the criteria for their optimal designs. The study explains why our proposed scheme performs well and provides a general framework for studying branch predictors.

The report is organized into five sections. In section 2, we summarize the aliasing problem, and then introduce our solution for de-aliasing. Section 3 describes our simulation methodology and presents the simulation results. In section 4 we present an analysis of aliasing in dynamic branch predictors that explains the source of the improved performance for the bi-mode predictor. Finally, in the conclusion we propose future directions for this work.

2. Aliasing and De-aliasing

2.1 The aliasing problem

Branch outcomes are not usually the result of random activities; most of the time they are correlated with past behavior and the behavior of neighboring branches. By keeping track of the history of branch outcomes, it is possible to anticipate with a high degree of certainty which direction future branches will take.

However, current dynamic branch predictors still exhibit performance limits. These are due in part to the restricted
availability of information upon which to base predictions, but more importantly due to shortcomings of design, especially the way that branch outcome history is exploited. In current designs, dynamic predictors spend large amounts of hardware to memorize this branch outcome history. Each static (per-address) branch often has a biased behavior so that it is either usually taken or usually not-taken. This can be exploited by the conventional two-bit counter scheme to predict future outcomes of a particular static branch. However, two-bit counter schemes are limited because branches may behave differently from their biases under some special conditions. These conditions are not difficult to recognize, but recognition requires memory space. Therefore, to achieve very high prediction accuracy, both the per-address bias and the special conditions need to be identified and memorized by dynamic predictors.

Global history (the outcomes of neighboring branches) is a common way to identify special branch conditions. Previous studies have shown that the global history indexed schemes achieve good performance by storing the outcomes of global history patterns in two-bit counters, e.g., the GAg and GAs schemes [PanSoRahmeh92, YehPatt92]. Another way to identify special branch conditions is to use per-address history (the past outcomes of a branch itself), as in the PAg and PAs schemes [YehPatt91]. The per-address history scheme is also shown to be effective, especially for loop-intensive floating-point programs. However, as we noted earlier, [YehPatt93] shows that, for integer programs, global history schemes tend to perform better than per-address history schemes because global schemes can make better predictions for if-then-else branches due to their ability to track correlation with neighboring branches.

Nevertheless, the global history scheme is still limited by destructive aliasing that occurs when two branches have the same global history pattern, but opposite biases [TalcottNemirovskyWood95, YoungGloySmith95]. This is not due to the limited availability of information, but to the indexing method, which does not discriminate between branches with the same global history patterns.

One proposal to overcome the destructive aliasing, gshare, randomizes the index by xor-ing the global history with the branch address [McFarling93]. It provides only limited improvement [SechrestLeeMudge96]. Recently, there have been several new proposals to reduce aliasing problems [ChangEversPatt96, Sprangle97, MichaudSeznecUhlig97]. The best of these [MichaudSeznecUhlig97] employs a hardware hashing scheme. A comparative study of these and the bi-mode scheme can be found in [Lee97]. The study shows that hardware hashing is useful for small low cost systems. For large systems the bi-mode scheme is the most cost-effective scheme to date.

2.2 Proposed branch prediction scheme

The bi-mode branch predictor is aimed at the elimination of destructive aliasing in global history indexed schemes. This scheme, shown in Figure 1, splits the second-level two-bit counter table into two halves. Given a history pattern, two counters, one from each half, are selected. We refer to these as the direction predictors. Meanwhile, another two-bit counter table, indexed by the branch addresses only, is used to provide a final selection for these two counters. The counter table providing selection will be referred to as the choice predictor. The final prediction is then made by the state of the counter selected from the direction predictors and, importantly, only the selected counter will be updated with the branch outcome; the status of the unselected one will not be altered. The choice predictor is always updated with the branch outcome, except when the choice is opposite to the branch outcome but the selected counter of the direction predictors makes a correct final prediction. This partial update policy is particularly effective when the total hardware budget is small.

[Figure 1: Proposed branch prediction scheme diagram. Labels: global history (m bits, m <= n), branch PC (n bits), choice predictor, direction predictors, final prediction for the branch.]

Our proposed scheme can improve global history indexed schemes because although global history patterns are still kept in the second level table, they are dynamically classified before being stored. They are classified by a preliminary prediction from the choice predictor, which is simply a conventional two-bit counter scheme and, as such, typically can provide 80% or better prediction accuracy with relatively modest cost. Thus, the bi-mode scheme divides branches into two groups according to the per-address bias of the choice predictor, and then uses the global history patterns to identify the special conditions for each of the two groups separately. The effect of the choice predictor is to separate the destructive aliases while keeping the harmless aliases together.
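The lookup and update rules of Section 2.2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' simulator: the names `bump` and `BiModePredictor`, the default table sizes, the use of the raw `pc` value as the branch address, and the assignment of bank 0 to the not-taken mode are all choices made here. The direction tables are indexed gshare-style (address xor global history), and the counters start in the weakly-taken and weakly-not-taken states that the paper reports for its experiments.

```python
def bump(counter, taken):
    """Saturating 2-bit up/down counter: values 0..3, >= 2 predicts taken."""
    return min(counter + 1, 3) if taken else max(counter - 1, 0)


class BiModePredictor:
    """Illustrative bi-mode predictor (hypothetical names and sizes)."""

    def __init__(self, index_bits=7, choice_bits=7, history_bits=7):
        self.index_mask = (1 << index_bits) - 1
        self.choice_mask = (1 << choice_bits) - 1
        self.history_mask = (1 << history_bits) - 1
        self.history = 0  # global history register, newest outcome in bit 0
        # Choice predictor: indexed by branch address only, starts weakly taken.
        self.choice = [2] * (1 << choice_bits)
        # Direction predictors. Assumed mapping: bank 0 serves the not-taken
        # mode (weakly not-taken), bank 1 the taken mode (weakly taken).
        self.banks = [[1] * (1 << index_bits), [2] * (1 << index_bits)]

    def _lookup(self, pc):
        choice_idx = pc & self.choice_mask
        dir_idx = (pc ^ self.history) & self.index_mask
        bank = 1 if self.choice[choice_idx] >= 2 else 0
        return choice_idx, dir_idx, bank

    def predict(self, pc):
        _, dir_idx, bank = self._lookup(pc)
        return self.banks[bank][dir_idx] >= 2

    def update(self, pc, taken):
        """Train on the resolved outcome; returns the prediction that was made."""
        choice_idx, dir_idx, bank = self._lookup(pc)
        choice_taken = bank == 1
        prediction = self.banks[bank][dir_idx] >= 2
        # Only the selected direction counter is updated.
        self.banks[bank][dir_idx] = bump(self.banks[bank][dir_idx], taken)
        # Partial update: leave the choice counter alone when it disagreed with
        # the outcome but the selected direction counter predicted correctly.
        if not (choice_taken != taken and prediction == taken):
            self.choice[choice_idx] = bump(self.choice[choice_idx], taken)
        self.history = ((self.history << 1) | int(taken)) & self.history_mask
        return prediction
```

For example, after `p = BiModePredictor()` and repeated calls to `p.update(0x20, False)`, the choice counter for that address moves to the not-taken side and `p.predict(0x20)` returns `False`.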
3. Experimental Results

In this section, we demonstrate that our proposed bi-mode branch predictor is more accurate and cost-effective than one of the best two-level branch predictors, gshare. To evaluate the improvement, we have conducted trace-driven simulations.

3.1 Description of gshare scheme

In gshare, the global history is xor-ed together with the low-order address bits of a branch to form an index. This index is then used to select a 2-bit saturating up-down counter from a pattern history table (PHT) (footnote 1). Depending on the sign bit of the selected 2-bit counter, the branch is either predicted as taken or not taken.

Footnote 1: The pattern history tables are the tables constituting the second-level table of the two-level predictors, as defined in [YehPatt92]. In the two-level predictor model, the number of PHTs is determined by the branch address bits directly used as the index.

To make a fair comparison with the gshare predictor, the best configuration of gshare must be determined and used. This point is often overlooked and the single-PHT gshare configuration is used for comparisons. However, this single-PHT gshare configuration is not the optimal configuration, as was shown in [SechrestLeeMudge96]. To find the best configuration, we exhaustively simulated all pair-wise combinations of history length and address length. In general, the best combination has multiple PHTs. Since the best configuration is different for each benchmark, we present results using the configuration that yields the best accuracy for the average of all the benchmarks studied.

3.2 Description of input trace

To assess the performance of the bi-mode branch predictor, we conducted a trace-driven simulation using the Ultrix version of the Instruction Benchmark Suite (IBS-Ultrix) benchmarks [Uhlig95] and the SPEC CINT95 benchmarks [SPEC95].

The IBS-Ultrix benchmarks are a set of applications designed to reflect realistic workloads. The traces of these benchmarks were generated through hardware monitoring of a MIPS R2000-based workstation. These traces were collected under Ultrix 3.1, and include both kernel and user activities.

For the SPEC CINT95 benchmarks, we use ATOM [EustaceSrivastava95], a code instrumentation tool from Digital Equipment Corporation, to generate and capture address traces. The benchmarks were first instrumented with ATOM, then executed on a DEC 21064 workstation running OSF/1 3.0 to generate traces. These traces contained only user-level instructions. The input to the SPEC95 benchmarks was a reduced input data set and is described in Table 1. The branch statistics of traces from the IBS and the SPEC CINT95 are summarized in Table 2.

Table 1: Description of the input data files used in the SPEC CINT95 programs

  Suite         Benchmark   Input data file
  SPEC CINT95   compress    bigtest.in, reduced
  SPEC CINT95   gcc         jump.i
  SPEC CINT95   go          2stone9.in, train data, reduced
  SPEC CINT95   xlisp       train.lsp
  SPEC CINT95   perl        scrabbl.in, reduced
  SPEC CINT95   vortex      train data, reduced

Table 2: Static and dynamic branch counts in the IBS and SPEC CINT95 programs

  Suite         Benchmark    Static conditional branches   Dynamic conditional branches
  SPEC CINT95   compress     482                           10,114,353
  SPEC CINT95   gcc          16,035                        26,520,618
  SPEC CINT95   go           5,112                         17,873,772
  SPEC CINT95   xlisp        636                           25,008,567
  SPEC CINT95   perl         1,974                         39,714,684
  SPEC CINT95   vortex       6,599                         27,792,020
  IBS-Ultrix    groff        6,333                         11,901,481
  IBS-Ultrix    gs           12,852                        16,307,247
  IBS-Ultrix    mpeg_play    5,598                         9,566,290
  IBS-Ultrix    nroff        5,249                         22,574,884
  IBS-Ultrix    real_gcc     17,361                        14,309,867
  IBS-Ultrix    sdet         5,310                         5,514,439
  IBS-Ultrix    verilog      4,636                         6,212,381
  IBS-Ultrix    video_play   4,606                         5,759,231

3.3 Simulation results

Figure 2 shows the misprediction rates for the best gshare and bi-mode predictors. In our simulation the best configurations of gshare, which are labeled gshare.best, always have multiple PHTs in the second-level table. Note that gshare.best is the best for the averaged results, not necessarily the best for individual benchmarks. For easy comparison with other published results, we also include the misprediction rates for the single-PHT gshare configuration, which is labeled gshare.1PHT. In Figure 2, the vertical axis represents the branch misprediction rate, and the horizontal axis represents the size of the predictors. A lower curve indicates that the scheme has better performance for the same cost. Cost is measured by counting the number of bytes used in the 2-bit counters. Note that the bi-mode predictors naturally have a cost that is 1.5 times that of the next smaller gshare scheme (footnote 2). This reflects the cost of the choice predictors.

Figure 2 shows that the bi-mode predictors outperform gshare predictors for all sizes of predictors measured. This
is indicated by the lower curves. In addition, the bi-mode predictors are more cost effective, because, for predictors larger than 4K bytes, they need less than half the size of gshare predictors to achieve the same misprediction rate.

Footnote 2: In our experiments, all two-bit counters in gshare schemes are initialized to weakly-taken for each benchmark run. For the bi-mode scheme, the choice predictor is reset to weakly-taken, and one bank of the direction predictor is reset to weakly-not-taken and the other bank to weakly-taken.

[Figure 2: Averaged misprediction rates for SPEC CINT95 and IBS-Ultrix. Panels: CINT95-AVERAGE, IBS-AVERAGE. Curves: gshare.1PHT, gshare.best, bi-mode. X axis: predictor size (K bytes), 0.25 to 32. Y axis: misprediction rate (%).]

[Figure 3: Misprediction rates for SPEC CINT95. Panels: compress, gcc, go, xlisp, perl, vortex. Curves: gshare.1PHT, gshare.best, bi-mode. X axis: predictor size (K bytes), 0.25 to 32. Y axis: misprediction rate (%).]

Bi-mode predictors also outperform gshare on most of the individual benchmarks examined; see Figure 3 and Figure 4. Moreover, the single-PHT gshare scheme is worse than the multiple-PHT gshare scheme for all benchmarks except compress and xlisp, where it outperforms even the bi-mode scheme. These two benchmarks, with the fewest static branches, have no aliasing problems and thus can enjoy the benefit from correlation in branch histories. The results of these two small benchmarks correspond to the findings reported by Sechrest et al. [SechrestLeeMudge96]. The case of the go benchmark, where the bi-mode method is beaten by the multiple-PHT gshare scheme, will be discussed in more detail in the next section.

[Figure 4: Misprediction rates for IBS-Ultrix. Panels: groff, gs, mpeg-play, nroff, real-gcc, sdet, verilog, video-play. Curves: gshare.1PHT, gshare.best, bi-mode. X axis: predictor size (K bytes), 0.25 to 32. Y axis: misprediction rate (%).]

4. Analysis

Many branches have a tendency to be either taken or not-taken most of the time. Common examples are branches for error checking and looping. These kinds of branches are usually described as being strongly biased in one direction. As might be expected, strongly biased branches are much easier to predict than weakly biased branches in dynamic branch predictors, and this was confirmed by Chang et al. [Chang94]. In the same study, they also measured the distribution of branch biases for SPEC CINT92. Their measurement showed that on average about 50% of total dynamic branches are attributed to the static branches that are biased in either the taken or not-taken direction for more than 90% of the time.

In this section, our analysis extends the idea of bias to the dynamic branch substreams that arrive at each two-bit counter in the second-level table. Using this concept, we will demonstrate the advantages and drawbacks of two kinds of information used in the two-level scheme, specifically, the branch address and global history. The analysis allows us to explain why the bi-mode scheme can improve on current dynamic branch predictors.

4.1 Bias measurement for global-history based schemes

As we have noted before, the reason that two-level dynamic branch predictors can achieve higher prediction accuracy than the traditional two-bit counter scheme proposed by Smith [Smith81] is that, in addition to the branch address, they incorporate the branch history information to form the index for the second-level two-bit counter table. The index for the second-level table divides the dynamic branch stream into substreams that are directed to a saturating two-bit counter. Ideally, the index should generate highly biased substreams so that the value of the saturating counter selected by the index can stay at one of the saturated values most of the time. Global history, compared to the branch address, can divide a dynamic branch stream into more highly biased substreams, as we will show later. However, if the indexing method mixes oppositely biased substreams together, then destructive aliasing can arise and the associated counter will perform badly as a predictor, because it will oscillate between the two saturated values. Our study will compare using branch addresses with using global history to separate out oppositely biased substreams, and will show how aliasing can degrade the performance of two-level schemes using global history.

To contrast the benefits of address and global history
bits, we consider two alternative two-level gshare style predictors. Both have the same size second-level tables, 256 counters, but differ in that one employs more history bits, representing history-indexed schemes, while the other represents address-indexed schemes. The first scheme xors 8 bits of branch address with 8 bits of global history to form the index into the second-level table (“history-indexed”). The second scheme xors 8 bits of branch address with only 2 bits of global history as the index (“address-indexed”).

We define three bias classes on a stream of branch outcomes: 1) strongly taken (ST) if the outcomes are taken 90% of the time or more; 2) strongly not taken (SNT) if the outcomes are not taken 90% of the time or more; and 3) weakly biased (WB) if neither of the above applies.

We are interested in the stream of branch outcomes, s_ij, from a particular static branch, i, to a particular prediction counter, j. This stream belongs to one of the three bias classes, i.e., exactly one of the following is true: s_ij ∈ ST, s_ij ∈ SNT, or s_ij ∈ WB. A good indexing method will create these streams so that the following two conditions hold:

1. The number of streams that are in the WB class is kept small.
2. Most of the streams incident on a particular prediction counter, j = c, belong to only the ST class, or alternatively, only the SNT class, i.e., s_ic ∈ ST for most i, or s_ic ∈ SNT for most i. A counter should not see an even mix of streams from both classes or its prediction ability will be reduced.

Condition 2 actually states that one of the two strongly biased classes should dominate the other strongly biased class at a counter. When this domination occurs, the counter will be biased at one saturated value with little destructive interference. We will refer to the more frequent strongly-biased class at a counter as the dominant class, and the other less frequent strongly-biased class as the non-dominant class.

To be more precise, we should consider streams weighted by their lengths. If |s_ij| is the number of outcomes in the stream s_ij, we define the normalized count that a branch, i = b, contributes to a particular prediction counter, j = c, to be:

\[ N_{bc} = \frac{|s_{bc}|}{\sum_{i} |s_{ic}|} \]

where the sum is taken over all static branches i.

Thus the two conditions become:

1. \( \sum_{i \,:\, s_{ic} \in WB} N_{ic} \ll \sum_{i \,:\, s_{ic} \notin WB} N_{ic} \)
2. \( \sum_{i \,:\, s_{ic} \in ST} N_{ic} \) should differ greatly from \( \sum_{i \,:\, s_{ic} \in SNT} N_{ic} \). In an ideal situation, one of the sums should be 0.

Table 3 illustrates the normalized count resulting from four streams incident on the same counter c. In this example, there is a total of four static branches (i = 1,..,4) whose addresses are 0x001, 0x005, 0x100 and 0x150, respectively, that used the two-bit counter c for prediction during the program execution (they may also use other counters). These four streams fall into different bias classes with respect to c. The normalized count of the ST class at the counter c is 24%, the SNT class is 60% (40%+20%), and the WB class is 16%. Because the SNT class is more frequent than the ST class, the SNT is the dominant class in the counter c, and the ST is the non-dominant class. In fact, Table 3 shows an undesirable situation because the indexing method has done a poor job of separating the bias classes and the SNT class is not overwhelmingly dominant.

Table 3: An example of calculating the normalized count for a counter c

  Branch address, i   Dynamic count when using counter c, |s_ic|   Count of taken outcomes when using counter c   Bias class   Normalized count from i = b to c, N_bc
  0x001               12                                            11                                             ST           12/50 = 24%
  0x005               20                                            1                                              SNT          20/50 = 40%
  0x100               8                                             3                                              WB           8/50 = 16%
  0x150               10                                            1                                              SNT          10/50 = 20%

Figure 5 illustrates the bias classes for all of the prediction counters for the gcc benchmark. We have performed the same experiments for other SPEC benchmarks, and we select gcc because it is representative of the results from the other benchmarks; see [SechrestLeeMudge96]. The X axis lists all the counters in the second-level table, and the Y axis represents the normalized counts of the three bias classes in each counter. The counters listed on the X axis are sorted according to the normalized dynamic frequency of the WB class. It can be seen that the area of the WB region of the history-indexed scheme is smaller than that of the address-indexed one. This suggests that the scheme employing more branch history can generate more highly biased substreams for predictors. If there is no harmful aliasing problem in the history-indexed scheme, i.e., each counter only needs to deal with substreams of one bias class, the prediction accuracy will be very high [TalcottNemirovskyWood95, YoungGloySmith95].

However, in the usual situation where harmful aliasing does exist, the performance of the history based scheme degrades. As shown in the same figure (Figure 5), the non-dominant class in the history-indexed scheme is larger than the one in the address-indexed scheme. In other words, although the history-indexed scheme selects the greater number of highly biased substreams, it does not separate the taken and not-taken ones as well as the address-indexed scheme.

[Figure 5: Bias breakdown for the gshare scheme in the SPEC CINT95 gcc benchmark. History-indexed on the top, address-indexed on the bottom. Regions: dominant, non-dominant, WB. X axis: individual counters, 1 to 256. Y axis: normalized dynamic counts (%).]

This figure shows the bias of branch outcome substreams arriving at each of 256 counters in a second-level table. The top graph is for the history-indexed scheme (8 bits of branch address xor-ed with 8 bits of global history); the bottom graph is for the address-indexed scheme (8 bits of branch address xor-ed with 2 bits of global history). These two graphs illustrate the difference between the two indexing methods. The address-indexed scheme suffers from a larger number of weakly biased (WB) branch substreams, while the history-indexed scheme suffers from more non-dominant substreams, implying a high degree of destructive interference between strongly but oppositely biased streams (between the SNT and ST classes).

To summarize the analysis above, an ideal dynamic branch predictor should generate as few weakly biased substreams as possible; in other words, the area of the weakly biased region should be as small as possible. At the same time, the resulting substreams merged at each counter should be as unidirectional as possible; in other words, the dominant area in Figure 5 should be large. Unfortunately, neither the address-indexed scheme nor the history-indexed scheme can achieve both of these two design goals simultaneously.

4.2 Bias measurement for the bi-mode scheme

In this subsection, we repeat the analysis above for the bi-mode prediction scheme. The configuration under examination has a 128-counter choice predictor indexed by the branch address and two banks of 128 counters in the second-level table, each of which is indexed by 7 bits of branch address xor-ed with 7 bits of global history. This system has about 50% more bytes than the predictors in the previous subsection, so the following analysis should be viewed qualitatively.

[Figure 6: Bias breakdown for the bi-mode scheme. Regions: dominant, non-dominant, WB. X axis: individual counters, 1 to 256. Y axis: normalized dynamic counts (%).]

This figure shows the bias of branch substreams of each counter for the bi-mode scheme. This bi-mode scheme has a 128-counter choice predictor and two 128-counter direction predictors. As shown in the figure, the dominant substreams dominate most of the counters of the second-level table, implying that interference is reduced significantly.

Figure 6 presents the measurement results. It can be seen that the weakly biased class in the bi-mode scheme is kept as small as the one in the history-indexed scheme, indicating that the advantage of employing history information is preserved. On the other hand, Figure 6 also shows that the bi-mode scheme yields a much larger area for the dominant class than the history-indexed scheme, implying that destructive aliasing has been reduced.

The counting arguments that we employ to classify the ST, SNT, and WB classes are open to the criticism that they do not capture the order in which the ST and SNT runs appear. For example, it is undesirable for them to be intermixed so that the stream changes between the two classes.
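The counting-based classification from Section 4.1 can be made concrete with a short Python sketch that reproduces the Table 3 example. The function names `bias_class` and `counter_profile`, the dictionary layout, and the tie-breaking rule for the dominant class are assumptions made here for illustration; the 90% threshold and the per-branch counts come from the text. Integer arithmetic is used for the threshold test so that a stream sitting exactly at 90% is classified as strongly biased.

```python
def bias_class(total, taken, threshold_pct=90):
    """Classify one stream as ST, SNT, or WB using the 90% rule from the text."""
    not_taken = total - taken
    if taken * 100 >= total * threshold_pct:
        return "ST"
    if not_taken * 100 >= total * threshold_pct:
        return "SNT"
    return "WB"


def counter_profile(streams):
    """streams maps branch address -> (dynamic count |s_ic|, taken count).

    Returns the normalized count per bias class at this counter and the
    dominant strongly biased class (ties go to ST, an arbitrary choice).
    """
    grand_total = sum(total for total, _ in streams.values())
    per_class = {"ST": 0, "SNT": 0, "WB": 0}
    for total, taken in streams.values():
        per_class[bias_class(total, taken)] += total
    shares = {name: count / grand_total for name, count in per_class.items()}
    dominant = "ST" if per_class["ST"] >= per_class["SNT"] else "SNT"
    return shares, dominant


# Streams incident on counter c, copied from Table 3.
table3 = {0x001: (12, 11), 0x005: (20, 1), 0x100: (8, 3), 0x150: (10, 1)}
shares, dominant = counter_profile(table3)
print(shares, dominant)
```

On the Table 3 data this gives normalized class counts of 24% (ST), 60% (SNT) and 16% (WB), with SNT as the dominant class, matching the figures quoted in Section 4.1.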
As a final experiment, we have counted the numbers of
changes between bias classes due to interference. Table 4 shows the results for the history-indexed and bi-mode schemes. The bi-mode scheme has fewer changes, implying that its ST and SNT classes are less intermingled. This means less interference, and further illustrates why our proposed prediction scheme performs better than conventional two-level schemes.

                   Dominant     Non-dominant   WB
history-indexed    3,826,578    3,589,689      2,252,874
bi-mode            3,685,544    2,717,563      2,226,353

Table 4: Numbers of changes between different bias classes for the history-indexed and bi-mode schemes
This table shows numbers of changes between branch outcome streams of different bias classes in the history-indexed and bi-mode schemes. We first count changes for each bias class in a counter, and then accumulate the counts of all counters for a scheme. For example, the count for the dominant class of the history-indexed scheme is the total number of changes of the dominant class due to interference by the other two classes in the scheme.

4.3 Breakdown of misprediction for the gshare and bi-mode schemes

We have also measured the misprediction contributed by the three bias classes for the gshare and bi-mode schemes. Again, for the gshare scheme, the configurations using fewer global history bits and more global history bits are both included for comparison.

[Figure 7 plot: breakdown of misprediction rates (%) by bias class (SNT, ST, WB) for gshare and bi-mode predictors with 256, 1K, and 32K counters]
Figure 7: Misprediction contributed by three bias classes in gcc
In this figure, three schemes are compared: a gshare using fewer history bits (representing the address-indexed scheme), a gshare using more history bits (history-indexed), and the bi-mode scheme. For each scheme, three different sizes of second-level tables are examined: 256, 1K and 32K counters. gshare (m) represents a gshare scheme that uses m-bit global history, while bi-mode (m) represents a bi-mode scheme that uses m-bit global history for its direction predictors. The choice predictor of the bi-mode scheme is half the size of its second-level table. As shown in the figure, the address-indexed scheme always has larger misprediction for the WB class. The history-indexed scheme has less misprediction for the WB class, but has more for the SNT and ST classes due to interference. The bi-mode scheme reduces error from the WB class and reduces, in most cases, the misprediction for the SNT and ST classes by removing interference.
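To make the comparison in this section concrete, the following is a minimal Python sketch of a gshare predictor and a bi-mode predictor; it is an illustration only, not the simulator used for the measurements reported here. The class names, the modulo indexing, the initial counter values, the use of small integers as branch addresses, and the two-branch trace are all assumptions made for the example. The bi-mode update policy follows our reading of the scheme: the choice predictor is indexed by branch address, the two direction banks are indexed by address XOR global history, only the selected bank is trained, and the choice counter is left untouched when it disagrees with the outcome but the final prediction was correct.

```python
def bump(counter, taken):
    """Move a 2-bit saturating counter (0..3) toward the observed outcome."""
    return min(counter + 1, 3) if taken else max(counter - 1, 0)


class Gshare:
    """Single table of 2-bit counters indexed by address XOR global history."""

    def __init__(self, size, history_bits):
        self.size = size
        self.mask = (1 << history_bits) - 1
        self.history = 0
        self.table = [1] * size  # assumed start: weakly not-taken

    def update(self, pc, taken):
        """Predict, then train; returns the prediction that was made."""
        i = (pc ^ self.history) % self.size
        prediction = self.table[i] >= 2
        self.table[i] = bump(self.table[i], taken)
        self.history = ((self.history << 1) | int(taken)) & self.mask
        return prediction


class BiMode:
    """Two direction banks plus a choice predictor, each half the
    second-level table size (hypothetical layout for this sketch)."""

    def __init__(self, second_level_size, history_bits):
        self.half = second_level_size // 2
        self.mask = (1 << history_bits) - 1
        self.history = 0
        self.choice = [1] * self.half      # assumed start: weakly not-taken
        self.banks = ([1] * self.half,     # bank 0: not-taken mode
                      [2] * self.half)     # bank 1: taken mode

    def update(self, pc, taken):
        """Predict, then train; returns the prediction that was made."""
        c = pc % self.half
        mode = 1 if self.choice[c] >= 2 else 0
        i = (pc ^ self.history) % self.half
        prediction = self.banks[mode][i] >= 2
        # Only the bank selected by the choice predictor is trained.
        self.banks[mode][i] = bump(self.banks[mode][i], taken)
        # Skip the choice update when the choice disagreed with the
        # outcome but the selected bank still predicted correctly.
        choice_taken = mode == 1
        if not (choice_taken != taken and prediction == taken):
            self.choice[c] = bump(self.choice[c], taken)
        self.history = ((self.history << 1) | int(taken)) & self.mask
        return prediction


def steady_state_misses(predictor, rounds=50, tail=40):
    """Count mispredictions over the last `tail` rounds of a contrived
    trace: branch 0 is always taken, branch 127 is never taken."""
    misses = 0
    for r in range(rounds):
        for pc, taken in ((0, True), (127, False)):
            correct = predictor.update(pc, taken) == taken
            if r >= rounds - tail and not correct:
                misses += 1
    return misses


gshare_misses = steady_state_misses(Gshare(256, 7))
bimode_misses = steady_state_misses(BiMode(256, 7))
```

The two addresses in the trace are chosen so that, with 7 bits of alternating history, both branches map to the same gshare counter and keep pushing it in opposite directions. In the bi-mode sketch the choice predictor steers the taken branch to one bank and the not-taken branch to the other, so the same index no longer causes destructive aliasing. Real workloads are far less extreme than this trace.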
Figure 7 presents the measurement results for the gcc benchmark. Three different sizes are studied for the branch predictors: 256, 1024, and 32,768 counters in the second level table. For each configuration, the misprediction is broken down into three categories according to the bias classes. In other words, the sum of mispredictions from the three classes is the misprediction rate for the corresponding scheme. For the gshare predictors of the same size, the one using fewer global history bits always has the least error from the strongly-biased classes, but it suffers from poor prediction for the weakly-biased substream. The bi-mode scheme keeps a reduced error for the weakly biased class, while successfully reducing the error from strongly-biased classes.

4.4 go benchmark

In Section 3, we noted that the bi-mode scheme was not the best for the go benchmark. In this section, we provide further analysis.

[Figure 8 plot: breakdown of misprediction rates (%) by bias class (SNT, ST, WB) for gshare and bi-mode predictors with 256, 1K, and 32K counters]
Figure 8: Misprediction contributed by three bias classes in go
This figure shows misprediction due to three bias classes for the go benchmark. As in Figure 7, three schemes with three different sizes of second-level tables are compared. As shown in the figure, the misprediction due to the WB class dominates in go for all three schemes, and thus the interference between SNT and ST classes is not the major concern. To improve prediction accuracy for go, more history bits should be used because it is an effective way to remove the WB class. Note that as more history bits are used, the relative misprediction rate due to the WB class becomes smaller.

The go benchmark is intrinsically hard to predict because about half of its dynamic branches are in the WB class. Figure 8 shows the misprediction contributed by the three bias classes for the go benchmark. It is clear that for all the
schemes and configurations the misprediction for the WB class dominates; destructive aliasing is not the major concern. There is not much room for the bi-mode scheme to improve because it is targeted at eliminating harmful aliasing rather than improving prediction for the weakly biased substreams. As observed in the previous subsection, the dynamic frequency of the weakly biased class is mainly determined by the number of global history bits used. From Figure 8, we see that the error of the WB class is reduced as more global history bits are applied. The prediction accuracy for programs like the go benchmark will only improve if more global history information is employed so that more strongly biased substreams can be generated.

5. Concluding Remarks

In this paper, a new global-history based branch prediction scheme, the bi-mode predictor, is proposed. It is designed to improve predictions by eliminating the aliasing in dynamic branch predictors. Its success relies on dynamically determining the taken or not-taken direction with an accurate but simple choice predictor. This classification can help remove much of the destructive aliasing while keeping the harmless aliasing together for the two-bit counter tables.

A detailed analysis of the mechanism of the two-level scheme's index system was also presented in the paper. From the analysis we found that using more global history bits in an index can move more branches from the weakly biased group to the strongly biased group, but these indices suffer from destructive aliasing. Using more branch address bits reduces the destructive aliasing but increases the weakly biased group. The benefits of using branch addresses and global history cannot be preserved in current two-level schemes simultaneously, but they can in the bi-mode scheme.

The bi-mode scheme can outperform other dynamic predictors, yet there is still room for improvement. One potential shortcoming of the bi-mode scheme is that, though it can distinguish strongly taken and strongly not-taken substreams as shown in Figure 6, it can still suffer from interference between the weakly biased and strongly biased substreams. Therefore, there are at least two directions for future work: one is to find a cost-effective way to reduce the weakly biased substreams, and the other is to further separate the weakly-biased substreams from the strongly-biased substreams for the counters. We are currently investigating these issues.

Acknowledgment

This work was supported by DARPA contract DAAH04-94-G-0327.

References

[Chang94] Chang, P., Hao, E., Yeh, T., and Patt, Y., “Branch Classification: a New Mechanism for Improving Branch Predictor Performance,” IEEE Micro-27, Nov. 1994.

[ChangEversPatt96] Chang, P., Evers, M., and Patt, Y., “Improving Branch Prediction Accuracy by Reducing Pattern History Table Interference,” International Conference on Parallel Architecture and Compilation Techniques, Oct. 1995.

[EustaceSrivastava95] Eustace, A. and Srivastava, A., “ATOM: A flexible interface for building high performance program analysis tools,” Proceedings of the Winter 1995 USENIX Technical Conference on UNIX and Advanced Computing Systems, 303-314, Jan. 1995.

[FisherFreudenberger92] Fisher, J.A., and Freudenberger, S.M., “Predicting Conditional Branch Directions From Previous Runs of a Program,” Proc. 5th Annual Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, Oct. 1992.

[Gwennap95] Gwennap, L., “Intel’s P6 Uses Decoupled Superscalar Design,” Microprocessor Report, Vol. 9, No. 2, Feb. 16, 1995.

[Gwennap96] Gwennap, L., “Digital 21264 Sets New Standard,” Microprocessor Report, Vol. 10, No. 14, Oct. 28, 1996.

[Hwu93] Hwu, W-W., Mahlke, S.A., Chen, W-Y., Chang, P-P., Warter, N.J., Bringmann, R.A., Ouellette, R.G., Hank, R.E., Kiyohara, T., Haab, G.E., Holm, J.G., and Lavery, D.M., “The superblock: An effective technique for VLIW and superscalar compilation,” Journal of Supercomputing, 7(9-50), 1993.

[Lee97] Lee, C-C., “Optimizing High Performance Dynamic Branch Predictors,” Ph.D Dissertation, Univ. of Michigan, Ann Arbor, Nov. 1997.

[McFarling93] McFarling, S., “Combining Branch Predictors,” WRL Technical Note TN-36, Jun. 1993.

[MichaudSeznecUhlig97] Michaud, P., Seznec, A., and Uhlig, R., “Trading Conflict and Capacity Aliasing in Conditional Branch Predictors,” Proc. of the 24th Ann. Int. Symp. on Computer Architecture, May 1997.

[PnevmatikatosSohi94] Pnevmatikatos, D.N. and Sohi, G.S., “Guarded Execution and Branch Prediction in Dynamic ILP Processors,” Proc. of the 21st Ann. Int. Symp. on Computer Architecture, Apr. 1994.

[PanSoRahmeh92] Pan, S.T., So, K., and Rahmeh, J.T., “Improving the Accuracy of Dynamic Branch Prediction Using Branch Correlation,” Proceedings of the 5th Int. Conf. on Architectural Support for Programming Languages and Operating Systems, Oct. 1992.

[SechrestLeeMudge96] Sechrest, S., Lee, C-C., and Mudge, T., “Correlation and Aliasing in Dynamic Branch Predictors,” Proceedings of the 23rd International Symposium on Computer Architecture, May 1996.

[Smith81] Smith, J.E., “A Study of Branch Prediction Strategies,” Proceedings of the 8th International Symposium on Computer Architecture, 135-148, May 1981.
[SPEC95] SPEC CPU’95, Technical Manual, Aug. 1995.

[Sprangle97] Sprangle, E., Chappell, R., Alsup, M., and Patt, Y., “The Agree Predictor: A Mechanism for Reducing Negative Branch History Interference,” Proc. of the 24th Ann. Int. Symp. on Computer Architecture, May 1997.

[TalcottNemirovskyWood95] Talcott, A.R., Nemirovsky, M., and Wood, R.C., “The Influence of Branch Prediction Table Interference on Branch Prediction Scheme Performance,” Proceedings of the 3rd International Conference on Parallel Architectures and Compilation Techniques, Jun. 1995.

[Uhlig95] Uhlig, R., Nagle, D., Mudge, T., Sechrest, S., and Emer, J., “Instruction Fetching: Coping with Code Bloat,” Proceedings of the 22nd International Symposium on Computer Architecture, Italy, Jun. 1995.

[YehPatt91] Yeh, T-Y. and Patt, Y., “Two-Level Adaptive Training Branch Prediction,” Proceedings of the 24th International Symposium on Microarchitecture, 51-61, Nov. 1991.

[YehPatt92] Yeh, T-Y. and Patt, Y., “Alternative Implementations of two-level adaptive branch predictions,” Proceedings of the 19th International Symposium on Computer Architecture, 124-134, May 1992.

[YehPatt93] Yeh, T-Y. and Patt, Y., “A Comparison of Dynamic Branch Predictors that use Two Levels of Branch History,” Proceedings of the 20th International Symposium on Computer Architecture, May 1993.

[YoungGloySmith95] Young, C., Gloy, N., and Smith, M., “A Comparative Analysis of Schemes for Correlated Branch Prediction,” Proceedings of the 22nd International Symposium on Computer Architecture, Italy, Jun. 1995.
