The Bi-Mode Branch Predictor
3 authors, including:
Trevor N. Mudge
University of Michigan
All content following this page was uploaded by Trevor N. Mudge on 08 February 2014.
[...] bi-mode branch predictor is more accurate and cost-effective than one of the best two-level branch predictors, gshare. To evaluate the improvement, we have conducted trace-driven simulations.

3.1 Description of gshare scheme

In gshare, the global history is xor-ed together with the low-order address bits of a branch to form an index. This index is then used to select a 2-bit saturating up-down counter from a pattern history table (PHT)¹. Depending on the sign bit of the selected 2-bit counter, the branch is predicted as either taken or not taken.

To make a fair comparison with the gshare predictor, the best configuration of gshare must be determined and used. This point is often overlooked, and the single-PHT gshare configuration is used for comparisons. However, this single-PHT gshare configuration is not the optimal configuration [...]

3.2 Description of input trace

To assess the performance of the bi-mode branch predictor, we conducted a trace-driven simulation using the Ultrix version of the Instruction Benchmark Suite (IBS-Ultrix) benchmarks [Uhlig95] and the SPEC CINT95 benchmarks [SPEC95].

The IBS-Ultrix benchmarks are a set of applications designed to reflect realistic workloads. The traces of these benchmarks were generated through hardware monitoring of a MIPS R2000-based workstation. These traces were collected under Ultrix 3.1, and include both kernel and user activities.

For the SPEC CINT95 benchmarks, we used ATOM [EustaceSrivastava95], a code instrumentation tool from Digital Equipment Corporation, to generate and capture address traces. The benchmarks were first instrumented with ATOM, then executed on a DEC 21064 workstation running OSF/1 3.0 to generate traces. These traces contained only user-level instructions. The input to the SPEC95 benchmarks was a reduced input data set and is described in Table 1. The branch statistics of the traces from the IBS and the SPEC CINT95 programs are summarized in Table 2.

Table 1: Description of the input data files used in the SPEC CINT95 programs

  Benchmark (SPEC CINT95)   Input data file
  gcc                       jump.i
  go                        2stone9.in, train data, reduced
  xlisp                     train.lsp
  perl                      scrabbl.in, reduced
  vortex                    train data, reduced

Table 2: Static and dynamic branch counts in the IBS and SPEC CINT95 programs

  Benchmark (SPEC CINT95)   Static conditional branches   Dynamic conditional branches
  compress                  482                           10,114,353
  gcc                       16,035                        26,520,618
  go                        5,112                         17,873,772
  xlisp                     636                           25,008,567
  perl                      1,974                         39,714,684
  vortex                    6,599                         27,792,020

1. The pattern history tables are the tables constituting the second-level table of the two-level predictors, as defined in [YehPatt92]. In the two-level predictor model, the number of PHTs is determined by the branch address bits directly used as the index.

3.3 Simulation results

Figure 2 shows the misprediction rates for the best gshare and bi-mode predictors. In our simulation the best configurations of gshare, which are labeled gshare.best, always have multiple PHTs in the second-level table. Note that gshare.best is the best for the averaged results, not necessarily the best for individual benchmarks. For easy comparison with other published results, we also include the misprediction rates for the single-PHT gshare configuration, which is labeled gshare.1PHT. In Figure 2, the vertical axis represents the branch misprediction rate, and the horizontal axis represents the size of the predictors. A lower curve indicates that the scheme has better performance for the same cost. Cost is measured by counting the number of bytes used in the 2-bit counters. Note that the bi-mode predictors naturally have a cost that is 1.5 times that of the next smaller gshare scheme². This reflects the cost of the choice predictors.

Figure 2 shows that the bi-mode predictors outperform gshare predictors for all sizes of predictors measured. This
is indicated by lower curves. In addition, the bi-mode predictors are more cost-effective because, for predictors larger than 4K bytes, they need less than half the size of gshare predictors to achieve the same misprediction rate.

[Figure 2 plots: panels CINT95-AVERAGE and IBS-AVERAGE; vertical axis Misprediction Rate (%), horizontal axis Predictor Size (K bytes) from 0.25 to 32; curves gshare.1PHT, gshare.best, and bi-mode.]

Bi-mode predictors also outperform gshare on most of the individual benchmarks examined (see Figure 3 and Figure 4). Moreover, the single-PHT gshare scheme is worse than the multiple-PHT gshare scheme for all benchmarks except compress and xlisp, where it outperforms even the bi-mode scheme. These two benchmarks, with the fewest static branches, have no aliasing problems and thus can enjoy the benefit from correlation in branch histories. The results of these two small benchmarks correspond to the findings reported by Sechrest et al. [SechrestLeeMudge96]. The case of the go benchmark, where the bi-mode method is beaten by the multiple-PHT gshare scheme, will be discussed in more detail in the next section.

[Per-benchmark plots: surviving panel titles are compress, gcc, and go, followed by groff, gs, mpeg-play, verilog, and video-play; the remaining panel titles and the captions were not recovered. Vertical axis Misprediction Rate (%), horizontal axis Predictor Size (K bytes) from 0.25 to 32; curves gshare.1PHT, gshare.best, and bi-mode.]

2. In our experiments, all two-bit counters in gshare schemes are initialized to weakly-taken for each benchmark run. For the bi-mode scheme, the choice predictor is reset to weakly-taken, one bank of the direction predictor is reset to weakly-not-taken, and the other bank is reset to weakly-taken.

4. Analysis

Many branches have a tendency to be either taken or not taken most of the time. Common examples are branches for error checking and looping. These kinds of branches are usually described as being strongly biased in one direction. As might be expected, strongly biased branches are much easier to predict than weakly biased branches in dynamic branch predictors, and this was confirmed by Chang et al. [Chang94]. In the same study, they also measured the distribution of branch biases for SPEC CINT92. Their measurement showed that on average about 50% of total dynamic branches are attributed to the static branches that are biased in either the taken or not-taken direction for more than 90% of the time.

In this section, our analysis extends the idea of bias to the dynamic branch substreams that arrive at each two-bit counter in the second-level table. Using this concept, we will demonstrate the advantages and drawbacks of two kinds of information used in the two-level scheme, specifically, the branch address and global history. The analysis allows us to explain why the bi-mode scheme can improve on current dynamic branch predictors.

4.1 Bias measurement for global-history based schemes

As we have noted before, the reason that two-level dynamic branch predictors can achieve higher prediction accuracy than the traditional two-bit counter scheme proposed by Smith [Smith81] is that, in addition to the branch address, they incorporate the branch history information to form the index for the second-level two-bit counter table. The index for the second-level table divides the dynamic branch stream into substreams, each of which is directed to a saturating two-bit counter. Ideally, the index should generate highly biased substreams so that the value of the saturating counter selected by the index can stay at one of the saturated values most of the time. Global history, compared to the branch address, can divide a dynamic branch stream into more highly biased substreams, as we will show later. However, if the indexing method mixes oppositely biased substreams together, then destructive aliasing can arise and the associated counter will perform badly as a predictor, because it will oscillate between the two saturated values. Our study compares the use of branch addresses and of global history to separate out oppositely biased substreams, and examines how aliasing can degrade the performance of two-level schemes using global history.

To contrast the benefits of address and global history
bits, we consider two alternative two-level gshare-style predictors. Both have second-level tables of the same size, 256 counters, but they differ in that one employs more history bits, representing history-indexed schemes, while the other represents address-indexed schemes. The first scheme xors 8 bits of branch address with 8 bits of global history to form the index into the second-level table ("history-indexed"). The second scheme xors 8 bits of branch address with only 2 bits of global history as the index ("address-indexed").

We define three bias classes on a stream of branch outcomes: 1) strongly taken (ST) if the outcomes are taken 90% of the time or more; 2) strongly not taken (SNT) if the outcomes are not taken 90% of the time or more; and 3) weakly biased (WB) if neither of the above applies.

We are interested in the stream of branch outcomes, s_ij, from a particular static branch, i, to a particular prediction counter, j. This stream belongs to one of the three bias classes, i.e., exactly one of the following is true: s_ij ∈ ST, s_ij ∈ SNT, or s_ij ∈ WB. A good indexing method will create these streams so that the following two conditions hold:

1. The number of streams that are in the WB class is kept small.
2. Most of the streams incident on a particular prediction counter, j = c, belong to only the ST class, or alternatively, only the SNT class, i.e., s_ic ∈ ST for most i, or s_ic ∈ SNT for most i. A counter should not see an even mix of streams from both classes or its prediction ability will be reduced.

Condition 2 actually states that one of the two strongly biased classes should dominate the other strongly biased class at a counter. When this domination occurs, the counter will be biased at one saturated value with little destructive interference. We will refer to the more frequent strongly biased class at a counter as the dominant class, and the other, less frequent strongly biased class as the non-dominant class.

To be more precise, we should consider streams weighted by their lengths. If |s_ij| is the number of outcomes in the stream s_ij, we define the normalized count that a branch, i = b, contributes to a particular prediction counter, j = c, to be:

$$N_{bc} = \frac{|s_{bc}|}{\sum_{\text{all static branches } i} |s_{ic}|}$$

Thus the two conditions become:

1. (Σ_i N_ic, for those i such that s_ic ∈ WB) << (Σ_i N_ic, for those i such that s_ic ∉ WB)
2. (Σ_i N_ic, for those i such that s_ic ∈ ST) should differ greatly from (Σ_i N_ic, for those i such that s_ic ∈ SNT). In an ideal situation, one of the sums should be 0.

Table 3 illustrates the normalized counts resulting from four streams incident on the same counter c. In this example, there is a total of four static branches (i = 1,..,4) whose addresses are 0x001, 0x005, 0x100 and 0x150, respectively, that used the two-bit counter c for prediction during the program execution (they may also use other counters). These four streams fall into different bias classes with respect to c. The normalized count of the ST class at the counter c is 24%, that of the SNT class is 60% (40% + 20%), and that of the WB class is 16%. Because the SNT class is more frequent than the ST class, SNT is the dominant class in the counter c, and ST is the non-dominant class. In fact, Table 3 shows an undesirable situation, because the indexing method has done a poor job of separating the bias classes and the SNT class is not overwhelmingly dominant.

Table 3

  Branch address, i   Dynamic count when using counter c, |s_ic|   Taken outcomes when using counter c   Bias class   Normalized count from i = b to c, N_bc
  0x001               12                                           11                                    ST           12/50 = 24%
  0x005               20                                           1                                     SNT          20/50 = 40%
  0x100               8                                            3                                     WB           8/50 = 16%
  0x150               10                                           1                                     SNT          10/50 = 20%

Figure 5 illustrates the bias classes for all of the prediction counters for the gcc benchmark. We have performed the same experiments for other SPEC benchmarks, and we select gcc because it is representative of the results from the other benchmarks; see [SechrestLeeMudge96]. The X axis lists all the counters in the second-level table, and the Y axis represents the normalized counts of the three bias classes in each counter. The counters listed on the X axis are sorted according to the normalized dynamic frequency of the WB class. It can be seen that the area of the WB region of the history-indexed scheme is smaller than that of the address-indexed one. This suggests that the scheme employing more branch history can generate more highly biased substreams for predictors. If there is no harmful aliasing problem in the history-indexed scheme, i.e., each counter only needs to deal with substreams of one bias class, the prediction accuracy will be very high [TalcottNemirovskyWood95, YoungGloySmith95].

However, in the usual situation where harmful aliasing does exist, the performance of the history-based scheme degrades. As shown in the same figure (Figure 5), the non-dominant class in the history-indexed scheme is larger than the one in the address-indexed scheme. In other words, although the history-indexed scheme selects the greater number of highly biased substreams, it does not separate the taken and not-taken ones as well as the address-indexed scheme.

To summarize the analysis above, an ideal dynamic branch predictor should generate as few weakly biased substreams as possible; in other words, the area of the weakly biased region should be as small as possible. At the same time, the resulting substreams merged at each counter should be as unidirectional as possible; in other words, the dominant area in Figure 5 should be large. Unfortunately, neither the address-indexed scheme nor the history-indexed scheme can achieve both of these two design goals simultaneously.

[Figure 5 plots: vertical axis Normalized dynamic counts (%) from 0 to 100, horizontal axis Individual counters from 1 to 256; regions WB, non-dominant, and dominant.]

Figure 5: Bias breakdown for the gshare scheme in the SPEC CINT95 gcc benchmark. History-indexed on the top, address-indexed on the bottom.
This figure shows the bias of branch outcome substreams arriving at each of 256 counters in a second-level table. The top graph is for the history-indexed scheme (8 bits of branch address xor-ed with 8 bits of global history); the bottom graph is for the address-indexed scheme (8 bits of branch address xor-ed with 2 bits of global history). These two graphs illustrate the difference between the two indexing methods. The address-indexed scheme suffers from a larger number of weakly biased (WB) branch substreams, while the history-indexed scheme suffers from more non-dominant substreams, implying a high degree of destructive interference between strongly but oppositely biased streams (between the SNT and ST classes).

4.2 Bias measurement for the bi-mode scheme

In this subsection, we repeat the analysis above for the bi-mode prediction scheme. The configuration under examination has a 128-counter choice predictor indexed by the branch address and two banks of 128 counters in the second-level table, each of which is indexed by 7 bits of branch address xor-ed with 7 bits of global history. This system has about 50% more bytes than the predictors in the previous subsection, so the following analysis should be viewed qualitatively.

Figure 6 presents the measurement results. It can be seen that the weakly biased class in the bi-mode scheme is kept as small as the one in the history-indexed scheme, indicating that the advantage of employing history information is preserved. On the other hand, Figure 6 also shows that the bi-mode scheme yields a much larger area for the dominant class than the history-indexed scheme, implying that destructive aliasing has been reduced.

The counting arguments that we employ to classify the ST, SNT, and WB classes are open to the criticism that they do not capture the order in which the ST and SNT runs appear. For example, it is undesirable for them to be intermixed so that the stream changes between the two classes. As a final experiment, we have counted the numbers of
changes between bias classes due to interference. Table 4 shows the results for the history-indexed and bi-mode schemes. The bi-mode scheme has fewer changes, implying that its ST and SNT classes are less intermingled. This means less interference, and further illustrates why our proposed prediction scheme performs better than conventional two-level schemes.

[Table 4: data and beginning of caption not recovered. Surviving caption text:] [...] class in a counter, and then accumulate the counts of all counters for a scheme. For example, the count for the dominant class of the history-indexed scheme is the total number of changes of the dominant class due to interference by the other two classes in the scheme.

4.3 Breakdown of misprediction for the gshare and bi-mode schemes

We have also measured the misprediction contributed by the three bias classes for the gshare and bi-mode schemes. Again, for the gshare scheme, the configurations using fewer global history bits and more global history bits are both included for comparison.

Figure 7 presents the measurement results for the gcc benchmark. Three different sizes are studied for the branch predictors: 256, 1024, and 32,768 counters in the second [...] global history bits always has the least error from the [...]

[Figure 7 plot: vertical axis breakdown of misprediction rates (%); bars for gshare (m) and bi-mode (m) configurations grouped by second-level table size (256, 1K, 32K); legend SNT, ST, WB.]

Figure 7: Misprediction contributed by three bias classes in gcc
In this figure, three schemes are compared: a gshare using fewer history bits (representing the address-indexed scheme), a gshare using more history bits (history-indexed), and the bi-mode scheme. For each scheme, three different sizes of second-level tables are examined: 256, 1K and 32K counters. gshare (m) represents a gshare scheme that uses m-bit global history, while bi-mode (m) represents a bi-mode scheme that uses m-bit global history for its direction predictors. The choice predictor of the bi-mode scheme is half the size of its second-level table. As shown in the figure, the address-indexed scheme always has larger misprediction for the WB class. The history-indexed scheme has less misprediction for the WB class, but has more for the SNT and ST classes due to interference. The bi-mode scheme reduces the error from the WB class and, in most cases, reduces the misprediction for the SNT and ST classes by removing interference.
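The gshare indexing described in Section 3.1 can be illustrated with a short sketch. It is not the simulator used in the paper. The class name `Gshare`, the parameter names, and the choice to align the history with the low-order index bits are assumptions made for illustration, and the branch address is used directly without dropping alignment bits. Counters start at weakly-taken, as stated in footnote 2.

```python
class Gshare:
    """Illustrative single-table gshare predictor with 2-bit counters."""

    def __init__(self, index_bits=8, history_bits=8):
        # Assumption: history_bits <= index_bits, history aligned to low bits.
        self.index_mask = (1 << index_bits) - 1
        self.history_mask = (1 << history_bits) - 1
        self.history = 0
        # 2-bit counters: 0,1 predict not taken; 2,3 predict taken.
        # Every counter starts at weakly-taken (2), as in footnote 2.
        self.pht = [2] * (1 << index_bits)

    def _index(self, pc):
        # XOR the global history with the low-order branch address bits.
        return (pc ^ (self.history & self.history_mask)) & self.index_mask

    def predict(self, pc):
        # The high ("sign") bit of the counter gives the prediction.
        return self.pht[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.pht[i] = min(3, self.pht[i] + 1)
        else:
            self.pht[i] = max(0, self.pht[i] - 1)
        # Shift the outcome into the global history register.
        self.history = ((self.history << 1) | int(taken)) & self.history_mask


# The two 256-counter configurations compared in Section 4.1.
history_indexed = Gshare(index_bits=8, history_bits=8)
address_indexed = Gshare(index_bits=8, history_bits=2)
```

Under these assumptions the two configurations differ only in how many outcome bits are kept in the history register, which is the contrast Section 4.1 draws between history-indexed and address-indexed schemes.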
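As a worked check of the bias classes and the normalized count N_bc defined in Section 4.1, the following sketch recomputes the Table 3 example. The stream data are taken from Table 3; the function and variable names are hypothetical, and the 90% threshold is tested with integer arithmetic to avoid rounding issues.

```python
from fractions import Fraction

# Per static branch address: (dynamic count |s_ic|, taken outcomes),
# copied from the Table 3 example for a single counter c.
streams = {0x001: (12, 11), 0x005: (20, 1), 0x100: (8, 3), 0x150: (10, 1)}


def bias_class(total, taken):
    """Classify one stream as ST, SNT, or WB using the 90% threshold."""
    if taken * 10 >= total * 9:
        return "ST"
    if (total - taken) * 10 >= total * 9:
        return "SNT"
    return "WB"


def class_shares(streams):
    """Sum the normalized counts N_ic of the streams in each bias class."""
    denom = sum(total for total, _ in streams.values())
    shares = {"ST": Fraction(0), "SNT": Fraction(0), "WB": Fraction(0)}
    for total, taken in streams.values():
        shares[bias_class(total, taken)] += Fraction(total, denom)
    return shares


shares = class_shares(streams)
# The dominant class is the more frequent of the two strongly biased classes.
dominant = max(("ST", "SNT"), key=lambda name: shares[name])

for name in ("ST", "SNT", "WB"):
    print(name, f"{float(shares[name]):.0%}")
print("dominant:", dominant)
```

The class sums come out to 24% for ST, 60% for SNT, and 16% for WB, matching the text, with SNT as the dominant class.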
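The bi-mode configuration examined in Section 4.2 can be sketched in the same style. The table sizes, the indexing, and the initial counter states follow Section 4.2 and footnote 2. The update policy is not described in this section: the partial-update rule below (always update the selected bank; skip the choice update when the choice disagreed with the outcome but the selected counter was still correct) is an assumption, and all names are hypothetical.

```python
def _bump(counter, taken):
    """Move a 2-bit saturating counter toward taken (3) or not taken (0)."""
    return min(3, counter + 1) if taken else max(0, counter - 1)


class BiMode:
    """Illustrative bi-mode predictor: a choice table plus two banks."""

    def __init__(self, bits=7):
        n = 1 << bits
        self.mask = n - 1
        self.history = 0
        self.choice = [2] * n            # choice predictor: weakly-taken
        self.banks = [[1] * n, [2] * n]  # bank 0 weakly-not-taken, bank 1 weakly-taken

    def _lookup(self, pc):
        c = pc & self.mask                     # choice index: branch address only
        d = (pc ^ self.history) & self.mask    # bank index: address xor history
        bank = 1 if self.choice[c] >= 2 else 0
        return c, d, bank

    def predict(self, pc):
        _, d, bank = self._lookup(pc)
        return self.banks[bank][d] >= 2

    def update(self, pc, taken):
        c, d, bank = self._lookup(pc)
        prediction = self.banks[bank][d] >= 2
        choice_taken = bank == 1
        # Assumed partial update: leave the choice counter alone when it
        # disagreed with the outcome but the final prediction was correct.
        if not (choice_taken != taken and prediction == taken):
            self.choice[c] = _bump(self.choice[c], taken)
        # Only the selected direction bank is updated.
        self.banks[bank][d] = _bump(self.banks[bank][d], taken)
        self.history = ((self.history << 1) | int(taken)) & self.mask

    def size_bytes(self):
        """Cost as measured in Section 3.3: bytes of 2-bit counters."""
        counters = len(self.choice) + sum(len(b) for b in self.banks)
        return counters * 2 // 8


bimode = BiMode(bits=7)
gshare_256_bytes = 256 * 2 // 8  # the 256-counter gshare tables of Section 4.1
```

With 128 choice counters and two banks of 128 counters, `bimode.size_bytes()` is 96 bytes against 64 bytes for a 256-counter gshare table, which is the factor of 1.5 (about 50% more bytes) noted in Sections 3.3 and 4.2.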
[Uhlig95] Uhlig, R., Nagle, D., Mudge, T., Sechrest, S., and Emer, J. "Instruction Fetching: Coping with Code Bloat," Proceedings of the 22nd International Symposium on Computer Architecture, Italy, Jun. 1995.