CA - Slides
COMPUTER ARCHITECTURE
Pavitra Y J
Electronics and Communication Engineering
Increasing instruction fetch bandwidth
• A multiple-issue processor is required to increase instruction fetch
bandwidth and extract more ILP
• A multiple-issue processor will require that the average number of
instructions fetched every clock cycle be at least as large as the
average throughput
• Fetching these instructions requires wide enough paths to the
instruction cache, but the most difficult aspect is handling branches
Branch Target Buffers (BTB)
• To reduce the branch penalty for deeper pipelines, we must know
whether the as-yet-undecoded instruction is a branch and, if so, what
the next program counter (PC) should be
• If the instruction is a branch and we know what the next PC should
be, we can have a branch penalty of zero
• A branch-prediction cache that stores the predicted address for the
next instruction after a branch is called a branch-target buffer or
branch-target cache
BTB
• Because a branch-target buffer predicts the next instruction address
and will send it out before decoding the instruction, we must know
whether the fetched instruction is predicted as a taken branch.
• If the PC of the fetched instruction matches an address in the
prediction buffer, then the corresponding predicted PC is used as the
next PC
• The hardware for this branch-target buffer is essentially identical to
the hardware for a cache
• If a matching entry is found in the branch-target buffer, fetching
begins immediately at the predicted PC
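As a concrete illustration, here is a minimal sketch of the lookup side of a direct-mapped BTB in C. The size, field names, and indexing (which assumes 4-byte instructions) are illustrative assumptions, not a description of any particular processor:

```c
#include <stdint.h>
#include <stdbool.h>

#define BTB_ENTRIES 1024   /* assumed buffer size (power of two) */

typedef struct {
    bool     valid;
    uint64_t tag;          /* full PC of the branch instruction */
    uint64_t target;       /* predicted next PC (taken target) */
} btb_entry_t;

static btb_entry_t btb[BTB_ENTRIES];

/* Look up the fetched PC. On a hit, the predicted target is used as the
   next PC before the instruction is even decoded; on a miss, fetch
   falls through to the next sequential instruction. */
bool btb_lookup(uint64_t pc, uint64_t *next_pc)
{
    btb_entry_t *e = &btb[(pc >> 2) % BTB_ENTRIES]; /* index like a cache */
    if (e->valid && e->tag == pc) {                 /* full tag match */
        *next_pc = e->target;   /* predicted taken: redirect fetch */
        return true;
    }
    *next_pc = pc + 4;          /* predicted not taken / not a branch */
    return false;
}
```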
BTB
• Unlike a branch-prediction buffer, the predicted entry must be
matched to this instruction, because the predicted PC will be sent out
before it is known whether this instruction is even a branch
• If the processor did not check whether the entry matched this PC,
then the wrong PC would be sent out for instructions that were not
branches, resulting in worse performance
• Store only the predicted-taken branches in the branch target buffer,
since an untaken branch should simply fetch the next sequential
instruction, as if it were not a branch
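Continuing the sketch above, the update policy described on this slide (enter only taken branches, drop entries whose branch resolves not taken) might look like the following; it reuses btb_entry_t and btb[] from the lookup sketch:

```c
/* Sketch of the update policy: only taken branches are entered in the
   BTB; an entry whose branch resolves not taken is invalidated, since
   sequential fetch needs no prediction for it. */
void btb_update(uint64_t pc, bool taken, uint64_t target)
{
    btb_entry_t *e = &btb[(pc >> 2) % BTB_ENTRIES];
    if (taken) {               /* insert or refresh the taken branch */
        e->valid  = true;
        e->tag    = pc;
        e->target = target;
    } else if (e->valid && e->tag == pc) {
        e->valid = false;      /* was predicted taken, resolved not taken */
    }
}
```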
Steps for using BTB with 5-stage pipeline
• Dealing with mispredictions and misses is a significant challenge,
since instruction fetch has to stall while the buffer entry is rewritten
• Make this process fast to minimize the penalty
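The penalty table referenced in the exercise below is not reproduced in these slides; for reference, the standard version for a simple five-stage pipeline (as given in Hennessy and Patterson) is:

Instruction in buffer | Prediction | Actual branch | Penalty (cycles)
yes                   | taken      | taken         | 0
yes                   | taken      | not taken     | 2
no                    |            | taken         | 2
no                    |            | not taken     | 0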
Exercise
Determine the total branch penalty for a branch-target buffer, assuming
the penalty cycles for individual mispredictions from the table on the
previous slide (reproduced above).
Make the following assumptions about the prediction accuracy and hit
rate:
■ Prediction accuracy is 90% (for instructions in the buffer).
■ Hit rate in the buffer is 90% (for branches predicted taken).
Exercise
We compute the penalty by looking at the probability of two events:
1. The branch is predicted taken but ends up being not taken
2. The branch is taken but is not found in the buffer
Both carry a penalty of two cycles.
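Working the numbers, as in the standard textbook version of this example:

Probability (branch in buffer, but actually not taken) = 90% × 10% = 0.09
Probability (branch not in buffer, but actually taken) = 10% = 0.10
Branch penalty = (0.09 + 0.10) × 2 = 0.38 clock cycles

So the total branch penalty is about 0.38 clock cycles per branch.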
BTB
• The improvement from dynamic branch prediction will grow as the
pipeline length and, hence, the branch delay grows; in addition,
better predictors will yield a larger performance advantage
• Modern high-performance processors have branch misprediction
delays on the order of 15 clock cycles; clearly, accurate prediction is
critical!
BTB
• One variation on the branch-target buffer is to store one or more
target instructions instead of, or in addition to, the predicted target
address.
• This variation has two potential advantages.
1. It allows the branch-target buffer access to take longer than the time
between successive instruction fetches, possibly allowing a larger
branch-target buffer.
2. Buffering the actual target instructions allows us to perform an
optimization called branch folding. Branch folding can be used to obtain
0-cycle unconditional branches and sometimes 0-cycle conditional
branches.
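A rough sketch of what such an entry and branch folding could look like in C; the entry layout and the restriction to unconditional branches are simplifying assumptions:

```c
#include <stdint.h>
#include <stdbool.h>

/* A BTB entry extended to buffer the target instruction itself. */
typedef struct {
    bool     valid;
    uint64_t tag;            /* PC of the branch */
    uint64_t target;         /* predicted target address */
    uint32_t target_insn;    /* copy of the instruction at 'target' */
    bool     unconditional;  /* safe to fold without a prediction */
} folding_btb_entry_t;

/* On a hit for an unconditional branch, hand the pipeline the buffered
   target instruction instead of the branch and steer fetch past it:
   the branch itself is "folded" away for a 0-cycle cost. */
uint32_t fetch_with_folding(uint64_t pc, uint32_t raw_insn,
                            folding_btb_entry_t *e, uint64_t *next_pc)
{
    if (e->valid && e->tag == pc && e->unconditional) {
        *next_pc = e->target + 4;  /* continue after the folded target */
        return e->target_insn;     /* the branch costs 0 cycles */
    }
    *next_pc = pc + 4;
    return raw_insn;
}
```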
Return Address Predictors
• Procedure returns are important: they account for a significant
fraction of indirect jumps in many programs
• Though procedure returns can be predicted with a branch-target
buffer, the accuracy of such a prediction technique can be low if the
procedure is called from multiple sites and the calls from one site are
not clustered in time
• Many designs use a small buffer of return addresses operating as a stack
• This structure caches the most recent return addresses: pushing a
return address on the stack at a call and popping one off at a return
Return address predictors
• If the cache is sufficiently large (i.e., as large as the maximum call
depth), it will predict the returns perfectly.
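A minimal C sketch of such a return-address stack; the depth and the circular overwrite-on-overflow behavior are illustrative assumptions:

```c
#include <stdint.h>

#define RAS_DEPTH 16              /* assumed depth */

static uint64_t ras[RAS_DEPTH];
static unsigned ras_top;          /* circular: overflowing the stack
                                     overwrites the oldest entries */

void ras_push(uint64_t return_pc) /* at a call instruction */
{
    ras_top = (ras_top + 1) % RAS_DEPTH;
    ras[ras_top] = return_pc;
}

uint64_t ras_pop(void)            /* at a return: the predicted PC */
{
    uint64_t pc = ras[ras_top];
    ras_top = (ras_top + RAS_DEPTH - 1) % RAS_DEPTH;
    return pc;
}
```

As the slide notes, if the call depth never exceeds RAS_DEPTH, every return is predicted perfectly; deeper call chains wrap around and mispredict the oldest returns.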
Integrated Instruction Fetch Units
• To meet the demands of multiple-issue processors, many recent
designers have chosen to implement an integrated instruction fetch
unit as a separate autonomous unit that feeds instructions to the rest
of the pipeline
• Recent designs have used an integrated instruction fetch unit that
integrates several functions:
1. Integrated branch prediction—The branch predictor becomes part of
the instruction fetch unit and is constantly predicting branches, so as to
drive the fetch pipeline.
Integrated Instruction Fetch Units
2. Instruction prefetch—To deliver multiple instructions per clock, the
instruction fetch unit will likely need to fetch ahead. The unit
autonomously manages the prefetching of instructions, integrating it
with branch prediction.
3. Instruction memory access and buffering—When fetching multiple
instructions per cycle, a variety of complexities are encountered,
including the difficulty that fetching multiple instructions may require
accessing multiple cache lines. The instruction fetch unit encapsulates
this complexity, using prefetch to try to hide the cost of crossing cache
blocks. The instruction fetch unit also provides buffering, essentially
acting as an on-demand unit to provide instructions to the issue stage
as needed and in the quantity needed.
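The buffering role described in point 3 is essentially a FIFO between fetch and issue; here is a minimal sketch under assumed sizes and names:

```c
#include <stdint.h>
#include <stdbool.h>

#define FETCH_Q_DEPTH 32

typedef struct {
    uint32_t insn[FETCH_Q_DEPTH];
    unsigned head, tail, count;
} fetch_queue_t;

/* Fetch side: the fetch unit fills the queue ahead of demand;
   a full queue simply stalls further fetch. */
bool fq_push(fetch_queue_t *q, uint32_t insn)
{
    if (q->count == FETCH_Q_DEPTH) return false;
    q->insn[q->tail] = insn;
    q->tail = (q->tail + 1) % FETCH_Q_DEPTH;
    q->count++;
    return true;
}

/* Issue side: drain up to 'want' instructions per cycle, "as needed
   and in the quantity needed"; returns how many were delivered. */
unsigned fq_pop(fetch_queue_t *q, uint32_t *out, unsigned want)
{
    unsigned got = 0;
    while (got < want && q->count > 0) {
        out[got++] = q->insn[q->head];
        q->head = (q->head + 1) % FETCH_Q_DEPTH;
        q->count--;
    }
    return got;
}
```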
Speculation: Implementation Issues and Extensions
• Four issues involve the design trade-offs in speculation; we start
with register renaming, the approach that is often used instead of a
reorder buffer
1. Speculation Support: Register Renaming versus Reorder Buffers
• Register Renaming
With register renaming, an extended set of physical registers replaces
the reorder buffer: if the processor does not issue new instructions for
a period of time, all existing instructions will commit, and the register
values will appear in the portion of the register file that corresponds
to the architecturally visible registers
Register renaming
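To make the renaming idea concrete, here is a minimal sketch of a rename map over a physical register file; the register counts and the trivial free-register handling are simplifying assumptions:

```c
#include <stdint.h>

#define ARCH_REGS 32
#define PHYS_REGS 128

static unsigned map_table[ARCH_REGS]; /* arch reg -> phys reg holding
                                         its latest value */
static uint64_t phys_regs[PHYS_REGS]; /* the extended register file */
static unsigned next_free;            /* stand-in for a real free list */

/* Rename a destination register at issue: allocate a fresh physical
   register so in-flight readers keep the old value (avoiding WAW and
   WAR hazards), then record the new mapping. */
unsigned rename_dest(unsigned arch_reg)
{
    unsigned p = next_free++ % PHYS_REGS; /* real designs recycle freed
                                             registers via a free list */
    map_table[arch_reg] = p;
    return p;
}
```

Once all in-flight instructions commit, map_table designates exactly which physical registers hold the architecturally visible state, matching the observation on the previous slide.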