Alloy Pact00

The document presents a taxonomy of branch misprediction types and proposes alloyed branch prediction as a technique to address wrong-history mispredictions. It shows that wrong-history mispredictions are an important cause of errors even with simple conflict reduction. Alloyed prediction combines local and global history to allow branches to see both simultaneously and better address wrong-history errors compared to hybrid prediction.

A Taxonomy of Branch Mispredictions, and Alloyed Prediction as a Robust Solution to Wrong-History Mispredictions

Kevin Skadron
Dept. of Computer Science, University of Virginia, Charlottesville, VA 22904

Margaret Martonosi, Douglas W. Clark
Depts. of Electrical Engineering and Computer Science, Princeton University, Princeton, NJ 08544
[email protected] [email protected] [email protected]
Copyright © 2000 IEEE. Published in the Proceedings of the 2000 International Conference on Parallel Architectures and Compilation Techniques, Oct. 15–19, 2000 in Philadelphia, Pennsylvania, USA. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works, must be obtained from the IEEE. Contact: Manager, Copyrights and Permissions / IEEE Service Center / 445 Hoes Lane / P.O. Box 1331 / Piscataway, NJ 08855-1331, USA. Telephone: + Intl. 908-562-3966.

Abstract

The need for accurate conditional-branch prediction is well known: mispredictions waste large numbers of cycles, inhibit out-of-order execution, and waste power on mis-speculated computation. Prior work on branch-predictor organization has focused mainly on how to reduce conflicts in the branch-predictor structures, while relatively little work has explored other causes of mispredictions. Some prior work has identified other categories of mispredictions, but this paper organizes these categories into a broad taxonomy of misprediction types. Using the taxonomy, this paper goes on to show that other categories, especially wrong-history mispredictions, are often more important than conflicts. This is true even if just a very simple conflict-reduction technique is used. Based on these observations, this paper proposes alloying local and global history together in a two-level branch predictor structure. This simple technique, a generalization of the bi-mode predictor, attacks wrong-history mispredictions by making both global and local history simultaneously available. Unlike hybrid prediction, however, alloying gives robust performance for branch-predictor hardware budgets ranging from very large to very small. Finally, this paper shows that individual branch references can also suffer wrong-history mispredictions as they alternate between using global and local history, a phenomenon that favors dynamic rather than static selection in hybrid predictors.

1. Introduction

The question of how better to predict the direction of conditional branches has received intense study in recent years. Two-level [11, 22] and hybrid [10] predictors, which explicitly track prior branch history, have received special attention. Most of this attention examines how to reduce aliasing errors (conflict mispredictions), which arise when unrelated branches happen to collide in a particular branch-predictor entry and overwrite each other's state. Conflicts are undeniably important, but a wealth of excellent techniques have been developed to reduce these destructive conflicts in the pattern history table (PHT; see footnote 1) of two-level predictors. This paper shows that even without using aggressive anti-aliasing techniques, conflicts only account for 15–20% of mispredictions in global-history predictors and 40–50% in local-history predictors. Naturally, these fractions are smaller when aggressive conflict-reduction techniques are applied. A complete elimination of conflicts therefore leaves many or most mispredictions remaining to be solved.

Footnote 1. The PHT is the table of saturating two-bit counters used by most predictor organizations. Different organizations assign branches or branch streams to these two-bit counters differently.

Further reductions in conflict mispredictions are indeed becoming difficult, and prediction accuracies still lie only in the 90–97% range. This paper therefore looks beyond conflict mispredictions and organizes a number of misprediction types into a taxonomy to help characterize their relative importance. The paper then goes on to show that some other misprediction categories are often more important than conflicts, especially the category of wrong-history mispredictions. These arise when the type of history tracked by a two-level predictor (either global or local history) is the wrong type of history for that branch. The paper describes alloyed branch prediction, a generalization of the bi-mode branch predictor proposed by Lee, Chen, and Mudge [9], as an attractive way to attack this category of mispredictions. Finally, the paper shows that individual branches dynamically vary between needing local and global history, and demonstrates that static selection in a hybrid predictor is therefore undesirable. Alloying, on the other hand, allows branches to see both global and local history simultaneously. Alloying has the further advantage over conventional hybrid branch predictors that it does not subdivide the available branch-prediction hardware into distinct and much smaller, and thus less effective, components.

As we developed our taxonomy, the category of wrong-history mispredictions suggested the development of the alloyed predictor. We found that using an alloyed predictor permitted us to add a category to our taxonomy. We therefore describe the alloyed predictor first, so that we can use the alloyed predictor in the rest of the paper as we develop the taxonomy.

The next section presents the simulation methodology used in this paper. Next, Section 3 describes alloyed prediction and presents a brief evaluation of its performance. Section 4 then develops the taxonomy (incorporating alloying) and quantifies the importance of wrong-history mispredictions, and Section 5 further explores the issue of wrong-history mispredictions and how it affects conventional hybrid predictors. It shows that not just static branch instructions, but dynamic branch references can also suffer wrong-history mispredictions. Finally, Section 6 describes related work, and Section 7 concludes the paper.

2. Simulation and Benchmark Details

2.1. Simulator

This paper uses both instruction-level and detailed cycle-level simulation to compare the performance of different branch-predictor configurations. Cycle-level simulations are performed using HydraScalar, our modified, multipath-capable version of SimpleScalar 2.0's sim-outorder [1]. HydraScalar has been configured to approximately model an Alpha 21264 [7]. It performs out-of-order execution with a 64-entry instruction window, and issues up to 4 integer and 2 floating-point instructions each cycle. The two-level, non-blocking cache hierarchy has 2-cycle, 64 KByte first-level instruction and data caches and a 12-cycle, unified, 8 MByte second-level cache. The branch history is updated speculatively at fetch time with suitable repair mechanisms [14], and HydraScalar models multiple layers of misprediction. The branch misprediction latency is 7 cycles. The taxonomy measurements use a modified version of SimpleScalar's instruction-level sim-bpred simulator.

2.2. Benchmarks

These evaluations use not only the SPECint95 benchmarks [19], but also four other primarily integer benchmarks. Table 1 summarizes the benchmarks' characteristics. All are compiled
using gcc version 2.6.3 for the SimpleScalar PISA, with optimization set at -O3 -funroll-loops (-O3 includes inlining). The SPEC programs use "ref" inputs. Some benchmarks come with multiple reference inputs, in which case one has generally been chosen. Xlisp is an exception; it used the 9-queens input. Gnuchess was set to level 10, and the SPLASH benchmarks used the largest input.

                           Conditional branch counts
             Warmup      100 M insts          1 B insts
Benchmark    insts       static    dyn.       static    dyn.
go           925 M       4,627     11.2 M     5,331     112 M
m88ksim      25 M        231       16.2 M     968       162 M
gcc (cc1)    220 M       14,245    14.7 M     20,783    190 M
compress     2575 M      205       11.8 M     203       151 M
li (xlisp)   270 M       271       15.4 M     676       154 M
ijpeg        823 M       657       5.1 M      1,415     58 M
perl         600 M       352       12.9 M     614       129 M
vortex       2450 M      3,134     12.2 M     3,203     124 M
gnuchess     150 M       665       9.6 M      1,127     96 M
wolf         50 M        2,288     15.9 M     2,993     26 M
radiosity    300 M       163       9.4 M      183       92 M
volrend      125 M       57        6.5 M      660       70 M

Table 1. Benchmark summary. Data is given for simulations of both 100 million and 1 billion instructions. "Warmup insts" indicates the length of the preliminary phase of simulation, before statistics-gathering.

Gnuchess comes from the IBS benchmark suite [20]; wolf is the timberwolf circuit router and comes from Smith's Unix-Utils benchmark suite [18], and 1.7% of its instructions are floating-point operations. Radiosity and volrend were chosen from the SPLASH2 suite [21] of parallel applications for shared memory because these two have significant misprediction rates.

Some benchmarks have substantial initial phases in which they generate data (as in compress), read in data, or perform other actions that differ from the main body of the execution. Simulations produce substantially unrepresentative results if this initial phase comprises too much of the simulation [13]. The simulator therefore begins gathering statistics much later in the program. During the preliminary phase, branches and memory references are still presented to the simulator, warming up the predictor and caches. After the warmup phase (whose duration is shown in Table 1) completes, the simulator runs in full-detail, cycle-level mode for a further 1 million instructions to prime all the processor structures. Then statistics are gathered for the next 100 million instructions for cycle-level simulations, and 1 billion instructions for instruction-level simulations; in the latter case, gcc and wolf are short enough to run to completion.

3. Alloying: Description & Performance

3.1. Hybrid Predictors

Hybrid predictors [10] are one way to attack the wrong-history problem. Hybrid predictors combine two or more prediction components, with some way to choose which component to use for each dynamic branch encountered. If one component is a global-history predictor and the other is a local-history predictor, both types of history are therefore available [2]. This reduces the wrong-history problem if the selection mechanism does an effective job of choosing which component to use for each branch. The selector, however, may itself be a large prediction structure. Figure 1 presents a high-level schematic of a hybrid predictor that combines global and local prediction components.

Hybrid predictors have drawbacks. Designing an effective selection mechanism can be difficult. More importantly, hybrid prediction only works well with a large hardware budget. This problem exists because a hybrid predictor must subdivide the available area into these different and smaller components. If the total hardware budget is too small, the subcomponents will be smaller yet and ineffective as a result, yielding poor overall behavior.

[Figure 1 schematic: component #1 (global: GBHR, PHT) and component #2 (local: BHT, PHT) feed a selection stage that produces the taken/not-taken prediction.]

Figure 1. The organization of a hybrid predictor with two different components. (The left-hand component is a global-history predictor, and the right-hand component is a local-history predictor.) The selector can be dynamic, requiring a meta-predictor structure, or static, in which case each branch is assigned to a component at compile time.

3.2. The Importance of Small Branch Predictors

Some readers may wonder why 8 Kbit and 2 Kbit branch predictors are of any interest today, when some processors now use much larger predictors. For example, the Alpha 21264's predictor is about 28 Kbits [7]. But not all processors designed today can afford to devote a large area to the branch predictor. For example, power constraints may dictate a smaller chip size, and cost constraints likewise. Processors for embedded environments are generally both space- and power-constrained. Despite these constraints, smaller branch-prediction environments still require the best branch prediction available, because prediction accuracy remains a powerful lever over performance. Better prediction accuracy also reduces power wasted on mis-speculated computation. One might think a simple bimodal (see footnote 2) organization would be the best choice. The data from Section 3.6 and [15] show otherwise.

Footnote 2. A bimodal or "2-bit" predictor, proposed by Smith in [17], is just a bare PHT, indexed by branch address.

3.3. Alloyed Predictors

This paper proposes an alternative, alloying, as a superior way to expose both global and local history and to attack the wrong-history problem. Alloyed prediction performs competitively with an equal-area hybrid predictor for large hardware budgets, and substantially outperforms hybrid predictors for smaller hardware budgets. In other words, alloying provides robust performance for both large and small hardware budgets. Alloying also outperforms other two-level organizations.

[Figure 2 schematic: the branch address indexes the BHT; BHT bits, GBHR bits, and address bits are combined into the PHT index; the PHT produces the taken/not-taken prediction.]

Figure 2. The organization of a two-level predictor with an alloyed index. This "MAs" predictor combines local history from the per-branch history table (BHT) and global history from the global branch-history register (GBHR) with some address bits to compose the PHT index.

Alloying is a pseudo-hybrid organization that looks just like a two-level, local-history predictor, and merely adds a global-history register. The predictor then alloys global and local history bits into one PHT index. Figure 2 shows the organization we propose. This simple modification attacks the drawbacks of two-level organizations, by exposing both global and local history, and the drawbacks of hybrid organizations, by avoiding the need for a selector and by avoiding the need to subdivide the hardware into multiple branch-prediction components.
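To make the alloyed index concrete, the following is a minimal Python sketch of an MAs-style predictor, not the simulator used in the paper. The class and method names are hypothetical. The default sizes follow the 8 Kbit MAs row of Table 2 (7 global bits, 2 local bits, 2 address bits, a 2K-entry BHT); the bit ordering inside the index, the 4-byte instruction alignment, and the counter initialization are assumptions made for illustration.

```python
class MAsPredictor:
    """Illustrative two-level predictor with an alloyed PHT index (hypothetical names)."""

    def __init__(self, g_bits=7, p_bits=2, a_bits=2, bht_entries=2048):
        self.g_bits, self.p_bits, self.a_bits = g_bits, p_bits, a_bits
        self.bht = [0] * bht_entries   # per-branch (local) history, p_bits wide
        self.gbhr = 0                  # global branch-history register, g_bits wide
        # PHT of 2-bit saturating counters; start at 1 (weakly not-taken), an assumption.
        self.pht = [1] * (1 << (g_bits + p_bits + a_bits))

    def _index(self, pc):
        word = pc >> 2                 # assumption: 4-byte aligned instructions
        local = self.bht[word % len(self.bht)]
        addr = word & ((1 << self.a_bits) - 1)
        # Concatenate [local | global | address]; the ordering is an assumption.
        return (local << (self.g_bits + self.a_bits)) | (self.gbhr << self.a_bits) | addr

    def predict(self, pc):
        return self.pht[self._index(pc)] >= 2   # True means predict taken

    def update(self, pc, taken):
        i = self._index(pc)
        self.pht[i] = min(3, self.pht[i] + 1) if taken else max(0, self.pht[i] - 1)
        slot = (pc >> 2) % len(self.bht)
        self.bht[slot] = ((self.bht[slot] << 1) | int(taken)) & ((1 << self.p_bits) - 1)
        self.gbhr = ((self.gbhr << 1) | int(taken)) & ((1 << self.g_bits) - 1)
```

A caller would invoke predict(pc) at fetch and update(pc, taken) once the branch resolves. This sketch updates history non-speculatively, which is simpler than the speculative update with repair described in Section 2.1.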

We call the organization shown in Figure 2 MAs, because it resembles GAs and PAs predictors (see footnote 3) [22] in concatenating the different types of bits. GAs and PAs predictors try to reduce conflicts in the PHT by concatenating the history, whether global or local, with some bits from the branch address. In this way, two unrelated branches that share the same prior history should be distinguished and mapped to different PHT entries by their differing branch addresses. MAs does this too, as shown by Figure 2. However, to obtain the same degree of anti-aliasing, MAs typically needs fewer address bits than GAs or PAs. This is because alloying global and local history itself provides some anti-aliasing capability: unrelated branches that alias with one kind of history often can be distinguished using the other kind of history.

3.4. Alloyed vs. Bi-Mode Prediction

Alloying is a generalization of the bi-mode predictor proposed by Lee, Chen and Mudge [9] and shown in Figure 3. As described in [9], the bi-mode predictor seems quite different from the MAs predictor in Figure 2. A careful rearrangement, however, shows the similarity. The bi-mode predictor was developed to attack destructive interference between branches that map to the same PHT entry but have opposite biases (i.e., one is taken, one is not taken). Branches that alias but have the same bias are harmless. The bi-mode predictor therefore maintains two PHTs, one for branches with a bias toward taken, one for branches with a bias toward not taken. These PHTs are indexed in the gshare [10] manner of xor'ing a global-branch-history string with bits from the branch PC. A choice predictor, indexed only by the branch PC, uses two-bit counters to learn each branch's bias and therefore indicate which of the PHTs the branch should use.

If the choice predictor is viewed as the BHT of a local-history predictor, and the two direction PHTs are viewed as logical halves of a physically unified table, the similarity between bi-mode and alloying can be seen. The choice predictor is tracking per-branch (i.e., local) history, and the high-order bit of its two-bit counter is used as the highest-order index bit, thereby selecting which half of the PHT to use. In particular, if the bi-mode predictor uses bit-concatenation rather than xor'ing, bi-mode is almost exactly the same as an MAs predictor with one bit of local history.

3.5. Further Alloying Considerations

The MAs organization, as described, may have a longer access time than a conventional two-level predictor or even a conventional hybrid predictor. This is because the MAs predictor would perform two table lookups in series. First it would probe the BHT, in order to get the local-history bits to be concatenated with the global-history and address bits. Only then could the PHT be accessed. Fortunately, if the number of local-history bits is small, this problem can be avoided. The PHT can be broken into multiple physical tables, accessed in parallel, similar to the bi-mode organization (Figure 3). The local history bits are then used as the selector on a multiplexor that chooses the outcome from the appropriate table. This organization is shown in Figure 4. It permits the PHT and BHT lookups to proceed in parallel. Such an organization should be feasible for most MAs configurations, since we never found an MAs organization that needed more than 4 local-history bits. While a 16-way multiplexor will have a non-trivial delay in its own right, this delay should be less than that of a table lookup. A further consideration is that the multiple simultaneous table accesses will dissipate somewhat more power than the single access to one large table. Of course, the roles of lookup time and power cannot be evaluated in the experimental framework used for this work.

Despite the concerns over access time, MAs has two major virtues that make it attractive: it reduces wrong-history mispredictions, and it avoids subdividing the branch-prediction hardware into multiple components that may individually be too small to predict effectively. MAs thus gives robust prediction accuracy for a range of sizes. Conventional hybrid predictors perform poorly at small sizes, and conventional two-level predictors reach a domain of diminishing returns too early and therefore perform poorly at large sizes.
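The rearrangement that lets the PHT and BHT lookups proceed in parallel can be sketched as follows. This is an illustrative Python model, not hardware and not the authors' code: the function names are hypothetical, and it assumes the local-history bits form the high-order bits of the unified PHT index, so that each value of the local history selects one contiguous sub-table.

```python
def split_pht(pht, p_bits):
    """Split a unified PHT into 2**p_bits equal sub-tables (local bits = top index bits)."""
    sub_size = len(pht) >> p_bits
    return [pht[i * sub_size:(i + 1) * sub_size] for i in range(1 << p_bits)]


def serial_lookup(pht, local, rest, rest_bits):
    """Serial form: read the BHT first, then access the PHT with the full index."""
    return pht[(local << rest_bits) | rest]


def parallel_lookup(sub_tables, local, rest):
    """Parallel form: read every sub-table with the global+address bits, then mux."""
    candidates = [table[rest] for table in sub_tables]  # these reads can overlap the BHT read
    return candidates[local]                            # local history acts as the mux select
```

Here rest stands for the concatenated global-history and address bits. Both forms return the same entry for every (local, rest) pair; the parallel form trades one large read for 2**p_bits smaller reads plus a multiplexor, which matches the power concern raised above.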
[Figure 3 schematic: the GBHR is xor'ed with the branch address to index two PHTs; the branch address indexes the choice predictor, which drives the selection of the taken/not-taken prediction.]

Figure 3. The organization of a bi-mode predictor. The "choice predictor" uses two-bit counters to learn for each branch whether it is biased toward taken or not taken. This value is then used to assign the branch to one of the two PHTs.

[Figure 4 schematic: GBHR and branch-address bits are combined to index several PHTs in parallel; the branch address indexes the BHT, whose p bits drive the selection of the taken/not-taken prediction.]

Figure 4. An MAs predictor rearranged to permit simultaneous PHT and BHT access. The original, unified PHT is broken into 2^p separate tables, all accessed simultaneously. The p local-history bits are then used to select which value to use for the final prediction.

Footnote 3. In this naming scheme, the first letter gives the type of history tracked: Global, Per-address (local), or Merged (alloyed). The second letter indicates whether the predictor's PHT is Adaptive (i.e., dynamic), or Static. And the third letter indicates the PHT structure: "g" indicates no anti-aliasing, "s" indicates select or concatenation-style anti-aliasing, and "p" indicates perfect anti-aliasing (no conflicts ever; GAp, PAp, and MAp are ideal in this regard) [22].

As mentioned before, other anti-aliasing schemes have been proposed that outperform GAs and PAs, although often at the cost of additional complexity. We restrict our evaluation of alloying to GAs, PAs, and MAs predictors, and hybrid predictors that use them as components. Comparing predictors that use the same conflict-reduction technique keeps our experiments fair, and should give an accurate picture of alloying's usefulness. Alloying is certainly not restricted to an MAs configuration. Alloying would presumably benefit slightly less from the anti-aliasing than strictly global- or local-history predictors (because alloying already achieves some anti-aliasing), but alloying would presumably be more effective at removing wrong-history mispredictions.

A final important consideration for this paper is that "select"-style anti-aliasing makes comparisons between alloying and hybrid straightforward: it is not obvious how to extend most proposed anti-aliasing techniques to hybrid prediction.

3.6. Performance of Alloyed Prediction

In the interests of space, a detailed evaluation of the performance of alloyed prediction is left to a technical report [15]. That document presents per-benchmark comparisons of MAs against bimodal, GAs, PAs, and hybrid branch predictors, giving both prediction-rate and IPC data. Here we only summarize the results to show that MAs successfully attacks wrong-history mispredictions.

To get the best comparison for each predictor size, the configurations that perform best overall for the entire benchmark suite

           GAs       PAs                                    MAs
Size       index     index    BHT           PHT           index        BHT           PHT
64 Kbits   8g, 7a    8p, 6a   4K entries    16K entries   9g, 4p, 3a   8K entries    16K entries
8 Kbits    5g, 7a    4p, 7a   1K entries    2K entries    7g, 2p, 2a   2K entries    2K entries
2 Kbits    1g, 9a    2p, 7a   512 entries   512 entries   3g, 2p, 4a   512 entries   512 entries

Table 2. Predictor configurations used for equal-total-size comparison. "g" indicates the number of global-history bits, "p" local-history bits, and "a" address bits.
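As a worked check of how a configuration maps to a hardware budget, the short Python sketch below adds up storage for the 8 Kbit row of Table 2. The function name is hypothetical, and two assumptions are made: every PHT entry is a 2-bit counter (as stated in footnote 1), and every BHT entry stores exactly p local-history bits.

```python
def predictor_bits(g, p, a, bht_entries=0, counter_bits=2):
    """Total storage in bits: a PHT of 2**(g+p+a) counters plus a BHT of p-bit histories."""
    pht_entries = 1 << (g + p + a)
    return pht_entries * counter_bits + bht_entries * p


# 8 Kbit row of Table 2
print(predictor_bits(5, 0, 7))                     # GAs: 5g, 7a          -> 8192
print(predictor_bits(0, 4, 7, bht_entries=1024))   # PAs: 4p, 7a, 1K BHT  -> 8192
print(predictor_bits(7, 2, 2, bht_entries=2048))   # MAs: 7g, 2p, 2a, 2K BHT -> 8192
```

Under these assumptions all three organizations occupy the same 8192 bits, which is what "equal-total-size" means for that row.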

           GAs                    PAs                                    selector
Selection  index    PHT           index     BHT          PHT           index    PHT
Dynamic    7g, 7a   16K entries   8p, 4a    1K entries   4K entries    6g, 7a   8K entries
Static     7g, 7a   16K entries   13p, 0a   1K entries   8K entries    na

Table 3. Predictor configurations used for 64 Kbit dynamic and static hybrid predictors.

           GAs                    PAs                                    selector
Selection  index    PHT           index     BHT           PHT           index    PHT
Dynamic    4g, 7a   2K entries    2p, 7a    512 entries   512 entries   3g, 7a   1K entries
Static     4g, 7a   2K entries    2p, 8a    1K entries    1K entries    na

Table 4. Predictor configurations used for 8 Kbit dynamic and static hybrid predictors.

Size       bimodal   GAs     PAs     hybrid
64 Kbits   1.154     1.031   1.034   1.000
8 Kbits    1.092     1.033   1.032   1.029
2 Kbits    1.038     1.036   1.020   na

Table 5. Mean speedup of MAs over each listed predictor organization for a 4-issue processor.

Size       GAs      PAs
64 Kbits   23.1%    22.8%
8 Kbits    19.6%    16.9%
2 Kbits    11.8%    6.8%

Table 6. Mean reduction in misprediction rate achieved by MAs.

must be used. Finding the best composition of PHT index bits was done using brute force, simulating all possible combinations of global, local, and address bits for the desired branch-predictor size (plots of this design space for gcc and m88ksim can be found in [16]). Finding equal-area configurations must also account for the BHT's size. We explored all possible BHT configurations for the chosen size, ranging from wide and short (many local-history bits and few BHT entries) to narrow and tall. The GAs, PAs, and MAs configurations chosen appear in Table 2. The hybrid configurations chosen appear in Tables 3 and 4 (see footnote 4).

Comparison Against Two-Level Predictors. Table 5 reports the average speedup on a 4-issue, out-of-order processor for MAs compared to each alternative: GAs, PAs, and bimodal. Unlike the taxonomy results in Section 4, these results do use a finite BHT. A bimodal predictor of the appropriate size is included to serve as a reference. We also evaluated gshare-style [10] versions, where the history and address strings are xor'd together. Like Sechrest et al. [12], we found little added benefit: xor'ing actually helps MAs slightly more than GAs or PAs. These speedups translate into substantial reductions in the misprediction rate. For some benchmarks like m88ksim, perl, and vortex, a 64 Kbit MAs halves the misprediction rate compared to an equivalent-area GAs! Table 6 reports how much overall MAs reduces the misprediction rate compared to GAs and PAs. Note that the reduction in mispredictions is mostly independent of issue width.

Comparison Against Hybrid Prediction. Of course, GAs and PAs are restricted to one type of history, and suffer from wrong-history mispredictions. We also must evaluate alloying against hybrid prediction, which like alloying can attack the wrong-history problem. Here we complete our brief evaluation of MAs by comparing it against a dynamic-selection hybrid branch predictor [10]. The data are included in Table 5. We only examine configurations of 64 Kbits and 8 Kbits; hybrid prediction is not feasible at 2 Kbits, because the components would simply be too small. Indeed, our data show that 8 Kbits is also too small for a hybrid predictor. An alloyed predictor, in contrast, works well even down to 2 Kbits.

At 64 Kbits, hybrid prediction and MAs provide equivalent performance: the average speedup of MAs over hybrid prediction for all the benchmarks is 1.0, and hybrid never outperforms MAs by more than 1.2%. Both organizations do a good job of eliminating wrong-history mispredictions. Hybrid performs well because 64 Kbits is enough area to subdivide into different components.

The picture is different for an 8 Kbit predictor. Here, MAs is superior for all but one benchmark and usually by a substantial margin, as high as 8.5%. The exception is go, where hybrid is 1.3% better. The overall speedup for MAs compared to hybrid is 2.9%. This seems like a small speedup, but this corresponds to an average 15% reduction in mispredictions. MAs does better at small sizes like 8 Kbits, because a hybrid predictor, whether using dynamic or static selection, simply does not have enough area to subdivide into smaller components.

In summary, MAs performs at least as well as hybrid prediction for large hardware budgets, and outperforms hybrid prediction for smaller hardware budgets. Overall, alloying's robust performance makes it attractive for a range of processors, from high-performance processors with large branch-prediction budgets to small embedded processors.
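The brute-force search over index compositions described in Section 3.6 can be pictured with the Python sketch below. It only enumerates candidate configurations for a given budget; the paper's search then simulated each candidate, which is not reproduced here. The function name is hypothetical, and the rules that a candidate must fill the budget exactly and that the BHT has a power-of-two number of p-bit entries are assumptions made for illustration.

```python
from itertools import product


def candidate_configs(budget_bits, max_bits=16, counter_bits=2):
    """Yield (g, p, a, bht_entries) tuples whose PHT plus BHT storage equals budget_bits."""
    for g, p, a in product(range(max_bits + 1), repeat=3):
        pht_bits = (1 << (g + p + a)) * counter_bits
        if pht_bits > budget_bits:
            continue
        remaining = budget_bits - pht_bits      # bits left over for the BHT
        if p == 0:
            if remaining == 0:                  # no local history, so no BHT
                yield (g, 0, a, 0)
        elif remaining > 0 and remaining % p == 0:
            entries = remaining // p
            if entries & (entries - 1) == 0:    # keep power-of-two BHT sizes only
                yield (g, p, a, entries)


configs = set(candidate_configs(8 * 1024))
print((7, 2, 2, 2048) in configs)   # the 8 Kbit MAs configuration from Table 2
```

Each tuple would then be handed to a simulator and the configuration with the best suite-wide accuracy kept, which is the "wide and short" versus "narrow and tall" BHT trade-off the text describes.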
Footnote 4. While exploring configuration options for hybrid prediction, we made sure to test hybrid organizations that use a simple, 2-bit, bimodal structure. This benefited some benchmarks, especially for small hybrid predictors, but for both 64 Kbits and 8 Kbits, the organization that was best overall did not use any bimodal structures.

4. A Taxonomy of Mispredictions

A taxonomy of mispredictions has three virtues. It shows the relative importance of different misprediction types. It permits designers to tailor branch-prediction solutions to individual misprediction types (a divide-and-conquer approach) rather than devise

a single, all-purpose branch predictor. And it provides better understanding of branch-predictor behavior: merely devising a taxonomy yields insight. Indeed, alloying simply suggested itself while we developed this taxonomy, and this in turn permitted us to extend the taxonomy. We feel this taxonomy is perhaps the most important contribution of this paper.

As mentioned before, a great deal of work has explored ways to prevent conflict mispredictions in two-level predictors. This paper shows that predictors also suffer from other important types of mispredictions. For example, two-level and hybrid predictors are sophisticated structures that can take a long time to learn a branch's behavior, often producing a substantial number of training-time mispredictions [4]. And, as we have already pointed out, most programs suffer severely from wrong-history mispredictions. It is important to understand the relationship among these different sources of mispredictions, but we are aware of no prior work that organizes such misprediction categories into a broad framework and measures the relative importance of so many sources of mispredictions.

[Figure 5 flowchart. Each branch is tested down two sides. Global side: GAs hit: true hit; otherwise GAp hit: destructive PHT interference; otherwise bimodal hit: training misprediction; otherwise PAp hit: wrong type of history; otherwise MAp hit: needs both types of history; otherwise remaining. Local side: PAs hit: true hit; otherwise PAp hit: destructive PHT interference; otherwise bimodal hit: training misprediction; otherwise GAp hit: wrong type of history; otherwise MAp hit: needs both types of history; otherwise remaining.]

Figure 5. A flowchart depicting how the taxonomy categorizes misprediction types. Each dynamic branch flows down both sides until it is either categorized or falls through.

Figure 5 shows the sequence of tests used to classify each misprediction. Note that wrong-history mispredictions are only counted after conflict and training-time mispredictions. This provides a measure of "true" wrong-history mispredictions. Measurements are accomplished by running in parallel several predictor organizations of increasing sophistication. If a branch mispredicts in one organization while predicting correctly in another, the difference between the two configurations isolates the misprediction category. The simulator performs the pictured cascade of tests until the branch either predicts correctly, or the misprediction fails all tests. Remaining branches are either inherently difficult to predict, or fall into a category not yet included in the taxonomy. The depicted process simultaneously categorizes each dynamic branch's behavior for both GAs and PAs predictors.

We by no means claim the taxonomy is comprehensive: the included categories can presumably be refined, and it lacks some [...] helped us discover alloying. And most importantly, it substantially extends prior efforts at categorizing branch mispredictions, by organizing a number of recognized misprediction types into a single classification scheme and by describing a one-pass method for counting them.

4.1. Taxonomy Categories

Destructive PHT and BHT conflicts. All dynamic predictors that track state can suffer from destructive conflicts when unrelated branches map to the same predictor entry. Destructive PHT (pattern history table) conflicts arise when branches map to the same 2-bit PHT counter and these branches go in opposite directions (see footnote 5). These conflict-mispredictions can be identified by running a finite and infinite PHT in parallel (GAs and GAp predictors, or PAs and PAp). The two predictors behave the same, except that the infinite PHT cannot suffer from conflicts. A misprediction in the finite PHT that does not occur in the infinite PHT must therefore be a destructive conflict.

Aliasing in the BHT (branch history table) can also cause mispredictions. To simplify an already complicated measurement, here we omit their impact by assuming an interference-free BHT. This also provides better comparability of GAs and PAs results.

Training-induced mispredictions. If a misprediction is not caused by PHT interference, it can instead occur because the predictor has not yet learned the branch's behavior. This happens especially at the beginning of a program or after a context switch, but also occurs as programs transition from one phase to another. We have yet to devise a precise method for measuring training mispredictions, but the impact of training time can be estimated using a simple bimodal predictor. First eliminate conflict mispredictions. Then training-time mispredictions occur when the main branch predictor fails but an idealized bimodal predictor succeeds.

It may seem odd for a simple bimodal predictor to follow a global-history predictor in our cascade of tests. But recall that we are not comparing the two. We only use the bimodal predictor to indicate if a branch reference could be predicted by some very simple organization. The assumption is that if a branch mispredicts using global history, but predicts correctly in the bimodal organization, the branch is predictable; the global history predictor just has not yet learned its behavior. On the other hand, if not even a simple predictor can predict a reference, then the problem is not training time.

This procedure admittedly neglects the time it takes the bimodal predictor to train. Yet a bimodal predictor is fast-training, and so we feel it provides a good estimate of the effect of training-induced mispredictions. A better method for measuring training mispredictions would be a clear contribution to this taxonomy.

Wrong type of history. Mispredictions can also occur because the predictor tracks the wrong type of history for the branch in question: global instead of local, or vice-versa. These are the wrong-history mispredictions.

Global history can expose correlation among branches, while local history is well suited for branches that follow a consistent pattern. Unfortunately, most programs have some branches that do well with global history and some branches that do well with local
obvious categories: update timing [14], history length, and history history. If the branch predictor only tracks one or the other, some
pollution [5]. This is partly because our work on taxonomies is branches therefore find that the predictor provides the wrong type
still in its beginning stages, partly because we restrict ourselves to
GAs and PAs predictors of fixed history length, and partly because of history.6 Evers et al. showed this to be important in [5]. Our
measurements using the taxonomy are difficult. This is therefore measurements find that these wrong-history mispredictions are es-
only a first step toward a rigorous and thorough analysis of what pecially severe in global-history predictors, comprising 35–50%
causes branch mispredictions, a step that we hope leads to further of the total misprediction rate.
research in understanding branch predictability. We expect that As mentioned, the measurements here separate “true” wrong-
readers will find many ways to improve this taxonomy, and this is 5 Note that constructive conflicts can also occur, so the expected gain
exactly why we seek to disseminate it. from eliminating PHT conflicts would only be the difference.
Yet even this simple taxonomy is an important contribution. It 6 By wrong history, we do not mean that the actual history bits are in-
shows the importance of wrong-history mispredictions. It shows correct; rather, the type of history being tracked is inappropriate for the
the risk of continuing to focus on conflict mispredictions. It has branch at hand.

PHT size      GAs/GAp               PAs/PAp               MAp
32K entries   8 global, 7 address   14 local, 1 address   10 global, 4 local, 1 address
4K entries    5 global, 7 address   10 local, 2 address   7 global, 4 local, 1 address
1K entries    1 global, 9 address   10 local, 0 address   5 global, 4 local, 1 address
Table 7. Predictor configurations used for taxonomy measurements.
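As a rough illustration of what these bit allocations mean, the sketch below forms a PHT index by concatenating the listed numbers of global-history, local-history, and branch-address bits. The field order, the `pc >> 2` shift (which assumes 4-byte instructions), and the names `TABLE7`, `low_bits`, and `pht_index` are assumptions made for illustration; the table itself only fixes how many bits of each kind are used.

```python
# Bit allocations copied from Table 7 as (global, local, address) bit counts,
# keyed by the number of PHT entries.
TABLE7 = {
    32 * 1024: {"GAs/GAp": (8, 0, 7), "PAs/PAp": (0, 14, 1), "MAp": (10, 4, 1)},
    4 * 1024: {"GAs/GAp": (5, 0, 7), "PAs/PAp": (0, 10, 2), "MAp": (7, 4, 1)},
    1 * 1024: {"GAs/GAp": (1, 0, 9), "PAs/PAp": (0, 10, 0), "MAp": (5, 4, 1)},
}


def low_bits(value, n):
    """Return the n least-significant bits of value (0 when n == 0)."""
    return value & ((1 << n) - 1)


def pht_index(pc, global_hist, local_hist, g_bits, l_bits, a_bits):
    """Concatenate history and address bits into one PHT index.

    Assumed layout, most to least significant: global | local | address.
    """
    index = low_bits(global_hist, g_bits)
    index = (index << l_bits) | low_bits(local_hist, l_bits)
    index = (index << a_bits) | low_bits(pc >> 2, a_bits)
    return index


if __name__ == "__main__":
    g, l, a = TABLE7[32 * 1024]["MAp"]
    print(pht_index(0x404, 0b1011001110, 0b0110, g, l, a))
```

In every row the three bit counts sum to the base-2 logarithm of the PHT size (15, 12, and 10 bits), so each configuration addresses exactly one table of the stated size.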
[Figure 6: three bar-chart panels (32K entry PHT, 4K entry PHT, 1K entry PHT). Y-axis: % of branches, 0 to 25. X-axis: benchmarks (m88ksim, compress, vortex, gcc, gnuchess, go, ijpeg, wolf, perl, xlisp, volrend, radiosity). Bar segments: PHT conflict, training, wrong history, combined history, remaining.]
Figure 6. Breakdown of misprediction types for GAs and PAs with 32K-entry, 4K-entry, and 1K-entry PHTs and an interference-free BHT.
KEY: For each benchmark, the left-hand bar represents GAs, and the right-hand bar PAs. Shorter bars mean fewer
mispredictions.
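The categories plotted in Figure 6 come from the cascade of tests in Figure 5. As a minimal sketch of that cascade, assuming the simulator has already recorded, for one dynamic branch, whether each organization predicted it correctly, the function below walks the tests in order and returns the first category that applies. The function names, the dictionary keys, and the category strings are illustrative choices, not the authors' simulator interface.

```python
def classify(primary_hit, alias_free_hit, bimodal_hit, other_history_hit, alloyed_hit):
    """Walk one side of the Figure 5 cascade for a single dynamic branch."""
    if primary_hit:
        return "hit"
    if alias_free_hit:       # same history type, but alias-free PHT
        return "PHT conflict"
    if bimodal_hit:          # a very simple predictor gets it right
        return "training"
    if other_history_hit:    # alias-free predictor of the other history type
        return "wrong history"
    if alloyed_hit:          # alias-free MAp, which sees both history types
        return "combined history"
    return "remaining"


def classify_both(o):
    """Classify one branch for GAs and PAs at once.

    o maps the organization names used in Figure 5 to True (correct
    prediction) or False (misprediction).
    """
    gas = classify(o["GAs"], o["GAp"], o["bimod"], o["PAp"], o["MAp"])
    pas = classify(o["PAs"], o["PAp"], o["bimod"], o["GAp"], o["MAp"])
    return gas, pas


if __name__ == "__main__":
    outcome = {"GAs": False, "GAp": False, "bimod": False,
               "PAs": True, "PAp": True, "MAp": True}
    print(classify_both(outcome))
```

Because the tests run in a fixed order, a misprediction is charged to wrong history only after the conflict and training tests have failed to explain it, which is what the text means by “true” wrong-history mispredictions.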
history mispredictions from those merely caused by aliasing. We argue that the only true wrong-history mispredictions are those that cannot be solved by eliminating conflict or training-time mispredictions. The above techniques are therefore used first, to eliminate all conflict and training mispredictions. Then, if a misprediction remains in a GAs organization while a PAs organization predicts the branch correctly, global history must be the wrong type of history for this branch instance. Similarly, if PAs fails while GAs succeeds, local history must be the wrong type.

A possible drawback of our approach is that the measurement of wrong-history mispredictions is tied to the anticipated predictor size. Yet any predictor under consideration will have some finite size, and the behavior of the branches is dictated by the maximum history length that size can entertain. Some wrong-history mispredictions will therefore occur, even though they might be eliminated by some more idealized organization. At the limit, one might consider infinite history or prediction by partial matching, but this would not measure wrong-history, but rather the intrinsic predictability of the branch. Our approach characterizes the degree to which a particular history length produces wrong-history mispredictions and a different history type of the same length could remove those mispredictions.

Having seen how important wrong-history mispredictions are, we were motivated to explore ways to make both types of history available, and this led us to develop alloyed prediction.

Needs combined history. For some branches, neither type of history alone suffices; instead the branch needs both types of information simultaneously. This occurs if a branch correlates with other branches and also has some self-repeating pattern. The frequency of these combined-history mispredictions can be estimated using an alloyed predictor like the one described in the next section. The best method that we have been able to devise uses an MAp predictor with an alias-free PHT, just as with GAp and PAp. This maintains our “cascade” of tests and continues excluding conflict mispredictions. Mispredictions which do not fall into the preceding categories and which an MAp organization eliminates are combined-history mispredictions. Their number is always the same for both GAs and PAs. Note that a hybrid predictor cannot eliminate this type of misprediction.

Remaining mispredictions. Mispredictions that cannot be eliminated using these techniques fall into a “left-over” category. These remaining mispredictions are either inherently difficult to remove, or fall into a category not yet included in the taxonomy.

4.2. Taxonomy Results

Figure 6 presents a breakdown of misprediction types for GAs and PAs predictors of different sizes: 64 Kbits (32K PHT entries), 8 Kbits (4K PHT entries), and 2 Kbits (1K PHT entries). Because these taxonomy measurements use a perfect BHT, its size is not included in the total area. This does mean that the total misprediction rate for PAs is understated, and the training time for PAs is slightly overstated. Nevertheless, the bar segments faithfully depict the relative importance of PHT conflict, wrong history, combined history, and uncategorizable mispredictions. PAs just lacks an additional segment to show the number of BHT conflicts.

For each branch predictor size, all possible GAs, PAs, and MAs configurations were tested, and the configuration that performs best overall for the entire benchmark suite is the one used for the experiments. That configuration is reported in Table 7. MAp is included to determine the impact of combined history.

As expected, PHT conflicts are important, and as expected, that importance declines with increasing PHT size. Still, even with the simple concatenation-style anti-aliasing used by GAs and PAs, PHT conflicts are often less important than training time and wrong history. This is especially true for global-history predictors. Overall, for each of the three sizes, conflicts comprise an average of 15–20% of mispredictions for the GAs predictor, and 40–52% for the PAs predictor.

Wrong-history mispredictions are instead often the single most important cause of mispredictions for global-history predictors, comprising an average of about 35% of mispredictions for the 8 Kbit and 64 Kbit GAs predictors, and 50% for the 2 Kbit GAs predictor. This is true even though only “true” wrong-history mispredictions are counted (all conflict and training-time mispredictions are first eliminated). Wrong-history mispredictions are less dominant in local-history predictors, comprising about 14.5% and 17.5% of the mispredictions for the 8 Kbit and 64 Kbit PAs predictors, and 3% for the 2 Kbit predictor.

Combined-history mispredictions are usually unimportant (6–7% of mispredictions), although alloying does help eliminate them.

It might seem curious that in the 2 Kbit predictor, wrong-history is especially important for global history, and especially unimportant for local history. (This can be seen in the large wrong-history segments for GAs in Figure 6, and the near-absence of those segments for PAs.) This happens because at that size, the chosen GAs configuration tracks only 1 bit of branch history, while PAs tracks 10 bits. Under these circumstances, PAs frequently has better information than GAs, and GAs almost never has better information than PAs. As the global history grows longer with larger predictors, this problem diminishes. Furthermore, once the global history is long enough, GAs also captures correlation behavior that local history never can. This means that some PAs mispredictions that were uncategorizable at small sizes (see footnote 7) can be predicted correctly by larger GAs predictors. This converts those mispredictions to wrong-history mispredictions for a large PAs. As a result, PAs’s wrong-history component grows with predictor size, and the uncategorizable component shrinks.

Footnote 7: These are properly labeled uncategorizable, because no small predictor organization can make a correct prediction.

5. Static vs. Dynamic Choice

This section extends the information from the taxonomy and shows that dynamic references by the same branch instruction can also suffer wrong-history mispredictions; that is, some individual branches dynamically alternate between needing global history and needing local history.

Hybrid predictors can use either static or dynamic selection to choose which predictor component to use for each branch. Branches that do well with global history are directed to the global-history component, and branches that do well with local history are directed to the local-history component. Grunwald et al. [6] argue that static selection outperforms dynamic selection. Profiling can identify which component branches prefer, and a static assignment can be made at compile time. This dispenses with the dynamic selector, and the extra area can be used to make the prediction components larger. By assigning a branch to one or the other component, static selection also has the advantage that branches only cause conflicts in one component. Static selection does require instruction set support, and also requires high-quality training data to make the profiling accurate. Yet a static selector permanently assigns each branch to one or the other, and so penalizes any branches that alternate between history types, while dynamic selection can accommodate such alternation. The taxonomy’s wrong-history data in Figure 6 only provides cumulative data for programs as a whole. Here we measure whether individual branch sites dynamically vary the type of history they use, using interference-free BHTs, an 8K-entry PHT for PAs, and a 16K-entry PHT for GAs. We find that a significant number of individual branches do indeed switch between using global and local history.

Figure 7. X-Y scatter plots showing, for each branch site, the number of times that a static branch site needs a global predictor or a local predictor.

Figure 7 presents X-Y scatter graphs for four benchmarks, go, m88ksim, gcc and compress. Each point represents one branch site; only branches executed 100,000 times or more are plotted. Each point’s x-coordinate gives the number of times that a particular branch can only be predicted with global history; its y-coordinate gives the number of times that a branch can only be predicted with local history. Branches that consistently use one component lie on one or the other axis. Frequency counts were plotted rather than percentages, because this format reflects how often a branch executes (less-frequently executed branches lie closer to the origin no matter what their behavior).

For all four benchmarks, a substantial amount of mass lies at the origin: branches that are easily predicted by either structure, and branches that are mispredicted by both structures. Mass that lies in the middle of the graph indicates branches that need access to both types of history. These branches are penalized in a static-hybrid predictor. Go and gcc have many such branches; compress has fewer, but they execute a huge number of times. M88ksim’s branches, on the other hand, need only one or the other type of history. Further data can be found in [8].

These data show that for the most part, individual branches do alternate between using global and local history. While some branches might conceivably change between history types just once, preliminary measurements suggest that the frequency of this alternation is rapid for most branches. We are unaware of any prior recognition or characterization of this “individual-branch” wrong-history effect. These data indicate that, unless static selection can do a substantially better job of eliminating conflicts, dynamic selection ought to outperform static selection.

To see which is better in practice, we compared the performance of dynamic-selection and static-selection hybrid predictors at 64 Kbits and 8 Kbits, this time using realistic (finite) BHTs. To make the comparison fair, we used the predictor configurations that perform best overall for the benchmark suite, first testing a wide range of component sizes, selector sizes, and history lengths. The configurations eventually chosen are the ones used in Section 3.6 (see Tables 3 and 4). Note that the equal-area comparison includes the presence/absence of the dynamic selector, so that dynamic selection is appropriately penalized for the area required by its selection table. Indeed, to further benefit static selection, we did not cross-train, using the same input data for measurement as for assigning the static selection.

For 8 Kbit predictors, where conflicts are a more serious problem, static selection is better for a few benchmarks, but only by a small margin. Yet despite our attempts to favor static selection, dynamic selection is still better for 8 of the 12 benchmarks. For 64 Kbits, dynamic selection is almost uniformly superior: static selection is better for only a single benchmark. Curiously, these results contradict those reported by Grunwald et al. [6]. They find that more benchmarks prefer static selection than dynamic selection. We suspect that the difference arises either from area calculations (they use a 4-way associative BHT and do not count the BHT tags against the area) or from training methodology. They apparently use shorter inputs and do no warmup, a procedure which penalizes the dynamic selector.

6. Related Work

A great deal of literature focuses on characterizing why mispredictions happen in two-level predictors. Most, however, focus on PHT interference. For example, Young et al. [23] characterized PHT interference, and showed that while significant amounts of both constructive and destructive interference occur, the destructive interference consistently dominates. Lee et al. [9] observed that conflicting substreams may be strongly biased, just in opposite directions, and this led to the development of the bi-mode predictor. Most recently, Evers et al. [5] moved beyond the question of conflicts to focus on the correlation characteristics of branches, and found that many branches do benefit from global history. Yet for a given prediction, most global history bits go unused, adding to interference, while often the most useful branch outcomes have already been forced out of the history.

Hybrid prediction was originally proposed by McFarling [10]. Chang et al. [2] extended his work, finding, as we did, that the most beneficial components are a global-history predictor and a local-history predictor. They also showed that a global-history selector outperforms a bimodal selector. Most recently, Grunwald et al. [6] have proposed eliminating the selector in favor of a static selection mechanism. Our results suggest this does not work well for two-component hybrid predictors. But the selection mechanism becomes more complex in multi-hybrid predictors [4] that contain more than two components, and in this case static selection might be beneficial.
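To make the contrast between static and dynamic selection concrete, here is a minimal sketch; it is not the configuration evaluated in this paper. It takes per-reference correctness flags for a global-history component and a local-history component of a two-component hybrid, and counts correct predictions under each selection policy. The 2-bit saturating chooser, its initial value, the tie-breaking rule, and the function names are assumptions for illustration, and the sketch ignores the selector's area cost and any conflicts inside the components.

```python
def static_choice(global_ok, local_ok):
    """Profile-style static selection for one branch site.

    The branch is permanently assigned to whichever component is correct
    more often over the whole run (ties go to the global component).
    Returns the number of correct predictions.
    """
    chosen = global_ok if sum(global_ok) >= sum(local_ok) else local_ok
    return sum(chosen)


def dynamic_choice(global_ok, local_ok):
    """Dynamic selection with an assumed 2-bit saturating chooser.

    Counter values 2 and 3 select the global component, 0 and 1 the local
    one. The counter moves only when exactly one component is correct.
    """
    counter = 2
    hits = 0
    for g, l in zip(global_ok, local_ok):
        hits += g if counter >= 2 else l
        if g and not l:
            counter = min(3, counter + 1)
        elif l and not g:
            counter = max(0, counter - 1)
    return hits


if __name__ == "__main__":
    # One branch that needs global history for ten references,
    # then local history for the next ten.
    global_ok = [True] * 10 + [False] * 10
    local_ok = [False] * 10 + [True] * 10
    print(static_choice(global_ok, local_ok), dynamic_choice(global_ok, local_ok))
```

For this alternating branch the static assignment is correct 10 times out of 20, while the chooser loses only the two references it needs to switch components and is correct 18 times.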

Other than the bi-mode work [9], we are unaware of any published work describing alloying. Other researchers have described a variety of aggressive techniques for reducing conflict mispredictions. The YAGS predictor [3] is the most recent. It extends the bi-mode predictor by identifying when branches disagree with the predicted bi-mode bias. It is important to note that MAs can easily be extended to incorporate such anti-aliasing schemes.

7. Conclusions and Future Work

A great deal of prior branch prediction work has focused on ways to improve two-level predictors that use either local or global history, but not both. Such work has mainly focused on reducing conflict mispredictions due to aliasing in the pattern history table.

This paper has presented a new taxonomy of misprediction types to better understand what categories of mispredictions are important. Using the taxonomy, we have shown that conflict mispredictions are important, but other categories are often more important. In particular, wrong-history mispredictions are often the most important source of mispredictions, comprising up to 50% of the total mispredictions. Wrong history occurs when a branch requires one type of history (global or local) but the predictor provides the other type. Hybrid predictors can attack these mispredictions, but an alloyed predictor is superior for several reasons. We also showed that dynamic references by the same branch instruction can also suffer wrong-history mispredictions; that is, some individual branches dynamically alternate between needing global history and needing local history. In conventional hybrid predictors [2, 10], this favors dynamic selection over static selection [6].

An alloyed predictor merges local and global history bits together in a single PHT index. This is a generalization of the bi-mode branch predictor proposed by Lee, Chen, and Mudge [9]. Although such an organization is a minor change to existing two-level designs, it makes both types of history available all the time. This attacks wrong-history mispredictions, as well as a further but less important category of mispredictions in which branches need both types of history simultaneously. Alloying can also reduce PHT aliasing, because branches that alias with one type of history are often distinguished by the other type of history. Our proposed alloyed predictor, MAs, achieves substantially better prediction accuracies than other solo, two-level predictors, and also performs well against hybrid predictors, especially at smaller sizes where the components in a hybrid organization become too small. Conventional hybrid predictors perform poorly at small sizes, and conventional two-level predictors reach a domain of diminishing returns too early and therefore perform poorly at large sizes.

There are several promising avenues for further research. The most important additional work is on understanding branch-prediction behavior, extending our characterization of misprediction sources and individual-branch wrong-history effects. A second area for future work lies in further exploring alloying. For example, alloying might provide further benefits as part of a hybrid predictor, and in particular, might work well with static selection, where it could expose both types of history in one component and possibly obviate the need for dynamic selection. New hash functions might also be considered, and comparisons of alloyed prediction against newly proposed predictors like YAGS [3] would be informative. Finally, isolating exactly which branches benefit most from alloying and characterizing the reasons for this would also be informative.

Acknowledgments

We would like to thank Mikko Lipasti, Tom Conte, and the anonymous reviewers for their helpful comments. This work was supported in part by NSF grant CCR-94-23123, NSF Career Award CCR-95-02516 (Martonosi), and an NDSEG Graduate Fellowship (Skadron). We would also like to thank Adrian V. Lanning, whose senior thesis at the University of Virginia extended our understanding of how branches alternate among using different history types, and the SHRIMP research group at Princeton for the extensive use of their computing resources.

References

[1] D. C. Burger and T. M. Austin. The SimpleScalar tool set, version 2.0. Computer Architecture News, 25(3):13–25, June 1997.
[2] P.-Y. Chang, E. Hao, and Y. N. Patt. Alternative implementations of hybrid branch predictors. In Proc. Micro-28, pp. 252–57, Dec. 1995.
[3] A. N. Eden and T. Mudge. The YAGS branch prediction scheme. In Proc. Micro-31, pp. 69–77, Dec. 1998.
[4] M. Evers, P.-Y. Chang, and Y. N. Patt. Using hybrid branch predictors to improve branch prediction accuracy in the presence of context switches. In Proc. ISCA-23, pp. 3–11, May 1996.
[5] M. Evers, S. J. Patel, R. S. Chappell, and Y. N. Patt. An analysis of correlation and predictability: What makes two-level branch predictors work. In Proc. ISCA-25, pp. 52–61, June 1998.
[6] D. Grunwald, D. Lindsay, and B. Zorn. Static methods in hybrid branch prediction. In Proc. PACT ’98, pp. 222–29, Oct. 1998.
[7] R. E. Kessler, E. J. McLellan, and D. A. Webb. The Alpha 21264 microprocessor architecture. In Proc. ICCD 1998, pp. 90–95, Oct. 1998.
[8] A. V. Lanning. Pipelined branch prediction: Characterizing wrong-history misprediction. Senior thesis, Univ. of Virginia School of Engineering and Applied Science, Apr. 2000.
[9] C.-C. Lee, I.-C. K. Chen, and T. N. Mudge. The bi-mode branch predictor. In Proc. Micro-30, pp. 4–13, Dec. 1997.
[10] S. McFarling. Combining branch predictors. Tech. Note TN-36, DEC WRL, June 1993.
[11] S.-T. Pan, K. So, and J. T. Rahmeh. Improving the accuracy of dynamic branch prediction using branch correlation. In Proc. ASPLOS-V, pp. 76–84, Oct. 1992.
[12] S. Sechrest, C.-C. Lee, and T. Mudge. Correlation and aliasing in dynamic branch predictors. In Proc. ISCA-23, pp. 22–32, May 1996.
[13] K. Skadron, P. S. Ahuja, M. Martonosi, and D. W. Clark. Branch prediction, instruction-window size, and cache size: Performance tradeoffs and simulation techniques. IEEE Trans. Computers, 48(11):1260–81, Nov. 1999.
[14] K. Skadron, D. W. Clark, and M. Martonosi. Speculative updates of local and global branch history: A quantitative analysis. J. Instruction-Level Parallelism, Jan. 2000. (http://www.jilp.org/vol2).
[15] K. Skadron, M. Martonosi, and D. W. Clark. Alloying global and local branch history: A robust solution to wrong-history mispredictions. Tech. Report TR-606-99, Princeton Dept. of Computer Science, Oct. 1999.
[16] K. Skadron, M. Martonosi, and D. W. Clark. Alloying global and local branch history: Taxonomy, performance, and analysis. Tech. Report TR-594-99, Princeton Dept. of Computer Science, Jan. 1999.
[17] J. E. Smith. A study of branch prediction strategies. In Proc. ISCA-8, pp. 135–48, May 1981.
[18] M. D. Smith. Support for Speculative Execution in High-Performance Processors. PhD thesis, Stanford Univ., Nov. 1992.
[19] Standard Performance Evaluation Corp. SPEC CPU95 Benchmarks. http://www.specbench.org/osg/cpu95.
[20] R. Uhlig et al. Instruction fetching: Coping with code bloat. In Proc. ISCA-22, pp. 345–56, June 1995.
[21] S. C. Woo et al. The SPLASH-2 programs: Characterization and methodological considerations. In Proc. ISCA-22, pp. 24–36, June 1995.
[22] T.-Y. Yeh and Y. N. Patt. A comparison of dynamic branch predictors that use two levels of branch history. In Proc. ISCA-20, pp. 257–66, May 1993.
[23] C. Young, N. Gloy, and M. D. Smith. A comparative analysis of schemes for correlated branch prediction. In Proc. ISCA-22, pp. 276–86, June 1995.
