dynamically retrains them when they are in
error. Most mispredictions cost a single cycle.
The line and way predictors are correct 85%
to 100% of the time for most applications, so
training is infrequent. As an additional pre-
caution, a 2-bit hysteresis counter associated
with each fetch block eliminates overtrain-
ing—training occurs only when the current
prediction has been in error multiple times.
Line and way prediction is an important speed
enhancement since the mispredict cost is low
and line/way mispredictions are rare.
Beyond the speed benefits of direct cache
access, line and way prediction has other ben-
efits. For example, frequently encountered
predictable branches, such as loop termina-
tors, avoid the mis-fetch penalty often associ-
ated with a taken branch. The processor also
trains the line predictor with the address of
jumps and subroutine calls that use direct reg-
ister addressing. Code using dynamically
linked library routines will thus benefit after
the line predictor is trained with the target.
This is important since the pipeline delays
required to calculate the indirect (subroutine)
jump address are eight cycles or more.
An instruction cache miss forces the
instruction fetch engine to check the level-two
(L2) cache or system memory for the neces-
sary instructions. The fetch engine prefetch-
es up to four 64-byte (or 16-instruction) cache
lines to tolerate the additional latency. The
result is very high bandwidth instruction
fetch, even when the instructions are not
found in the instruction cache. For instance,
the processor can saturate the available L2
cache bandwidth with instruction prefetches.
Branch prediction
Branch prediction is more important to the
21264’s efficiency than to previous micro-
processors for several reasons. First, the seven-
cycle mispredict cost is slightly higher than
previous generations. Second, the instruction
execution engine is faster than in previous gen-
erations. Finally, successful branch prediction
can utilize the processor’s speculative execution
capabilities. Good branch prediction avoids the
costs of mispredicts and capitalizes on the most
opportunities to find parallelism. The 21164
could accept 20 in-flight instructions at most,
but the 21264 can accept 80, offering many
more parallelism opportunities.
The 21264 implements a sophisticated tour-
nament branch prediction scheme. The scheme
dynamically chooses between two types of
branch predictors—one using local history, and
one using global history—to predict the direc-
tion of a given branch.
8
The result is a tourna-
ment branch predictor with better prediction
accuracy than larger tables of either individual
method, with a 90% to 100% success rate on
most simulated applications/benchmarks.
Together, local and global correlation tech-
niques minimize branch mispredicts. The
processor adapts to dynamically choose the best
method for each branch.
Figure 4, in detailing the structure of the
tournament branch predictor, shows the local-
history prediction path—through a two-level
structure—on the left. The first level holds 10
bits of branch pattern history for up to 1,024
branches. This 10-bit pattern picks from one
of 1,024 prediction counters. The global pre-
dictor is a 4,096-entry table of 2-bit saturat-
ing counters indexed by the path, or global,
history of the last 12 branches. The choice pre-
diction, or chooser, is also a 4,096-entry table
of 2-bit prediction counters indexed by the
path history. The “Local and global branch
predictors” box describes these techniques in
more detail.
The processor inserts the true branch direc-
tion in the local-history table once branches
26
A
LPHA
21264
IEEE MICRO