Instruction Pipelining (II): Reducing Pipeline Branch Penalties
Datorarkitektur Fö 4 - 3
• The fetch unit also has the ability to recognize branch instructions and to generate the target address. Thus, the penalty produced by unconditional branches can be drastically reduced: the fetch unit computes the target address and continues to fetch instructions from that address, which are sent to the queue. Thus, the rest of the pipeline gets a continuous stream of instructions, without stalling.

• The rate at which instructions can be read (from the instruction cache) must be sufficiently high to avoid an empty queue.

• With conditional branches penalties cannot be avoided. The branch condition, which usually depends on the result of the preceding instruction, has to be known in order to determine the following instruction.

Observation
• In the Pentium 4, the instruction cache (trace cache) is located between the fetch unit and the instruction queue (see Fö 2, slide 31).

[Figure: branch is taken; Penalty: 3 cycles (timing rows not recoverable from the extraction)]

Branch is not taken:
Clock cycle →  1   2   3   4     5     6   7   8   9   10
ADD R1,R2      FI  DI  CO  FO    EI    WO
BEZ TARGET         FI  DI  CO    FO    EI  WO
instr i+1              FI  stall stall DI  CO  FO  EI  WO
Penalty: 2 cycles

• The idea with delayed branching is to let the CPU do some useful work during some of the cycles which are shown above to be stalled.

• With delayed branching the CPU always executes the instruction that immediately follows after the branch and only then alters (if necessary) the sequence of execution. The instruction after the branch is said to be in the branch delay slot.
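The delay-slot semantics can be illustrated with a toy interpreter (a minimal sketch: the instruction tuples, register names, and the `delayed` flag are invented for illustration and are not a real ISA):

```python
# Toy interpreter contrasting immediate and delayed branch semantics.
# Instructions: ("ADDI", reg, imm) adds imm to reg; ("BEZ", reg, target)
# branches to target if reg == 0. Word addressing (PC+1) as on the slide.

def run(program, delayed=True):
    """Execute program. With delayed=True the branch takes effect only
    after the instruction in the delay slot has executed (delayed
    branching); with delayed=False the branch takes effect at once."""
    regs = {"R1": 0, "R2": 0, "R3": 0}
    pc = 0
    pending = None          # branch target waiting for its delay slot
    trace = []              # addresses of executed instructions
    while 0 <= pc < len(program):
        op, *args = program[pc]
        trace.append(pc)
        nxt = pending if pending is not None else pc + 1
        pending = None
        if op == "ADDI":
            regs[args[0]] += args[1]
        elif op == "BEZ" and regs[args[0]] == 0:
            if delayed:
                pending = args[1]   # jump only after the next instruction
            else:
                nxt = args[1]       # jump immediately
        pc = nxt
    return regs, trace

prog = [
    ("ADDI", "R2", 5),   # 0
    ("BEZ",  "R1", 4),   # 1  taken (R1 == 0)
    ("ADDI", "R2", 1),   # 2  delay slot: always executed when delayed=True
    ("ADDI", "R3", 7),   # 3  skipped in both modes
    ("ADDI", "R3", 1),   # 4  branch target
]
regs_d, trace_d = run(prog, delayed=True)
assert trace_d == [0, 1, 2, 4] and regs_d["R2"] == 6
regs_i, trace_i = run(prog, delayed=False)
assert trace_i == [0, 1, 4] and regs_i["R2"] == 5
```

The traces show the point of the technique: the delay-slot instruction at address 2 executes even though the branch is taken, so a compiler that moves useful work into that slot wastes no cycle.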
• History information can be used not only to predict the outcome of a conditional branch but also to avoid recalculation of the target address. Together with the bits used for prediction, the target address can be stored for later use in a branch history table.

[Figure: branch history table with columns "Instr. addr. | Target addr. | Pred. bits". The address of the branch instruction is looked up in the table; on a hit the entry is updated ("Update entry"), on a miss a new entry is added ("Add new entry"), and the table decides from where to fetch the next instruction.]

Some explanations to the previous figure:

- Address where to fetch from: If the branch instruction is not in the table, the next instruction (address PC+1) is to be fetched. If the branch instruction is in the table, first of all a prediction based on the prediction bits is made. Depending on the prediction outcome, the next instruction (address PC+1) or the instruction at the target address is to be fetched.

- Update entry: If the branch instruction has been in the table, the respective entry has to be updated to reflect the correct or incorrect prediction.
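The table logic above can be sketched as a small class (an illustrative sketch, not any particular processor's implementation: the class and method names are invented, a dict stands in for the hardware table, and a 2-bit saturating counter plays the role of the prediction bits, with 0 or 1 meaning "predict not taken" and 2 or 3 meaning "predict taken"):

```python
# Minimal branch history table: each entry stores a target address and a
# 2-bit saturating prediction counter, keyed by the branch instruction's
# address. Word addressing (PC+1) as on the slide; no capacity limit or
# replacement policy is modelled.

class BranchHistoryTable:
    def __init__(self):
        self.table = {}   # branch instr. addr. -> [target addr., pred. bits]

    def fetch_address(self, pc):
        """Address where to fetch from: PC+1 on a table miss or a
        not-taken prediction, the stored target on a taken prediction."""
        entry = self.table.get(pc)
        if entry is None or entry[1] < 2:
            return pc + 1
        return entry[0]

    def update(self, pc, taken, target):
        """Update entry (or add a new one) once the outcome is known:
        the counter saturates at 0 and 3."""
        entry = self.table.setdefault(pc, [target, 1])
        entry[0] = target
        entry[1] = min(3, entry[1] + 1) if taken else max(0, entry[1] - 1)

bht = BranchHistoryTable()
assert bht.fetch_address(100) == 101   # miss: fetch PC+1
bht.update(100, True, 40)              # branch at 100 taken to 40
bht.update(100, True, 40)
assert bht.fetch_address(100) == 40    # counter saturated high: predict taken
```

The saturating counter is what makes this a two-bit scheme: a single mispredicted iteration (e.g. a loop exit) does not immediately flip a strongly-taken prediction.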
[Figure: ARM9 pipeline stages, including Decode, Execute, Data memory access, and Register write]

• Execute: shift and ALU operations.

• Data memory access: fetch/store data from/to D-cache.

• Register write: results or loaded data written back to register.

The performance of the ARM9 is significantly superior to the ARM7:

• Higher clock speed due to larger number of pipeline stages.

• More even distribution of tasks among pipeline stages; tasks have been moved away from the execute stage.

[Figure: pipeline with Issue and Decode stages followed by parallel Shift/Address, ALU/Memory 1, Memory 2, and Writeback stages]

• Branch prediction:
  - Dynamic two-bit prediction based on a 64-entry branch history table (branch target address cache, BTAC).
  - If the instruction is not in the BTAC, static prediction is done: taken if backward, not taken if forward.

• Decoupling of the load/store pipeline from the ALU&MAC (multiply-accumulate) pipeline: ALU operations can continue while load/store operations complete (see next slide).
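The static fallback used on a BTAC miss reduces to a one-line address comparison (a sketch under the assumption of plain integer addresses; the function name is invented). Backward branches are predicted taken because they typically close loops, which branch back on every iteration but the last:

```python
# Static backward-taken / forward-not-taken prediction, used when a
# branch misses in the BTAC and no history is available.

def static_predict(branch_addr, target_addr):
    """Return True (predict taken) for a backward branch,
    False (predict not taken) for a forward branch."""
    return target_addr <= branch_addr

# A loop-closing branch jumps backward, so it is predicted taken:
assert static_predict(branch_addr=200, target_addr=120) is True
# A forward skip (e.g. over an else-part) is predicted not taken:
assert static_predict(branch_addr=200, target_addr=260) is False
```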
• Writeback (ALU&MAC pipeline): results written to register.

• Writeback (load/store pipeline): write loaded data to reg.; commit store.