0% found this document useful (0 votes)
11 views

pipe3

Uploaded by

pedro paulo
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
11 views

pipe3

Uploaded by

pedro paulo
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 32

6.

6 Branch Hazards (Control Hazards)

Control Hazards are simple to understand and occur


much less frequently than data hazards.

They can’t be solved as effectively as forwarding is


for data hazards.

There are 2 schemes for resolving these hazards and


one optimization to improve these schemes.

cs 152
cs 152 L1 3L1
.1 3 . DAP Fa97,  U.CB
a) Assume Branch not Taken

Stalling until branches complete is too slow and


reduces potential speedup too much.
Improvement strategy: Assume branch will not be
taken and continue execution down the sequential
instr stream. If the branch is taken, the instrs that
are being fetched and decoded must be discarded,
and execution continues at the branch target.
If branches are taken ½ the time, and if it costs little
to discard the instrs, this scheme halves the cost of
control hazards.
To discard instrs, the original control values are
changed to 0’s (like load hazards). Change also for
instrs in the IF, ID and EX stages when the branch
reaches MEM - Flush instrs in the IF, ID, and EX
stages.

cs 152 L1 3 .2 DAP Fa97,  U.CB


Impact of Branch Instrs on Pipe Efficiency

Program Time (in clock cycles)


execution CC 1 CC 2 CC 3 CC 4 CC 5 CC 6 CC 7 CC 8 CC 9
order
(in instructions)

40 beq $1, $3, 7 IM Reg DM Reg

44 and $12, $2, $5 IM Reg DM Reg

48 or $13, $6, $2 IM Reg DM Reg

52 add $14, $2, $2 IM Reg DM Reg

72 lw $4, 50($7) IM Reg DM Reg

cs 152
cs 152 L1 3L1
.3 3 .3 DAP Fa97,  U.CB
b) Reducing the Delay of Branches

Reduce the cost of the taken branch.


In first implementation, the next PC for a branch is
selected in the MEM stage.

Move the branch execution to the ID stage. Special


HW in ID: adder for address calculation (moved
from MEM stage); Exclusive-OR and AND logic (for
equality testing).

Only one instruction to flush if the branch is taken.


To flush: control line, IF.Flush, that zeros the instr
control signals of the IF/ID pipe regst (the instr
becomes a nop).
04/26/99 4
cs 152
cs 152 L1 3L1
.4 3 .4 DAP Fa97,  U.CB
Datapath for Branch (HW to Flush Instr after branch)
IF.Flush

Hazard
detection
unit
M ID/EX
u
x
WB
EX/MEM
M
Control u M WB
x MEM/WB
0

IF/ID EX M WB

4 Shift
left 2
M
u
x
Registers =
Instruction Data
PC ALU
memory memory M
u
M x
u
x

Sign
extend

M
u
x
Forwarding
unit

IF.Flush control signal actually comes from equality checker (evaluates


branch condition).
cs 152 L1 3 .5 DAP Fa97,  U.CB
Example: Pipelined Branch Execution

Assume the pipeline is optimized for branches not


taken and branch execution is moved to ID stage.

36 sub $10, $4, $8


40 beq $1, $3, 7 #PC-relative branch to 40+4+7*4 = 72

44 and $12, $2, $5


48 or $13, $2, $6
52 add $14, $4, $2
56 slt $15, $6, $7

72 lw $4, 50($7)
04/26/99 6
cs 152
cs 152 L1 3L1
.6 3 .6 DAP Fa97,  U.CB
Branch Execution of Sample Code
and $12, $2, $5 beq $1, $3, 7 sub $10, $4, $8 before<1> before<2>

IF.Flush

Hazard
detection
unit
72 ID/EX
M
u
48 x WB
EX/MEM
M
Control u M WB
x MEM/WB
28
0
IF/ID EX M WB
48 44 72

4
$1
Shift M $4
left 2 u
x
=
Registers
Instruction Data
PC ALU
memory memory M
72 44 $3
u
M $8 x
7 u
x

Sign
extend

10

Forwarding
unit

Clock 3

lw $4, 50($7) bubble (nop) beq $1, $3, 7 sub $10, . . . before<1>

IF.Flush

Hazard
detection
unit
ID/EX
M
u
76 x WB
EX/MEM
M
Control u M WB
x MEM/WB
0
IF/ID EX M WB
76 72

Shift M $1
left 2 u
x
Registers
=
Instruction Data
PC ALU
memory memory M
76 72
u
M $3 x
u
x

Sign
extend

10

Forwarding
unit

Clock 4

cs 152 L1 3 .7 DAP Fa97,  U.CB


Dynamic Branch Prediction

Assuming Branch taken is a crude form of branch


prediction. Try to detect regularities in branch
behavior.

Algorithm: look up the address of the instr to see if a


branch was taken the last time this instr was
executed and, if so, begin fetching new instrs from
the same place as the last time.

04/26/99 8
cs 152
cs 152 L1 3L1
.8 3 .8 DAP Fa97,  U.CB
Implementation of Dynamic Branch Prediction

Implementation: branch prediction buffer or branch


history table, a small memory indexed by the lower
portion of the address of the branch instr. This
memory contains a bit that says wether the branch
was taken or not.
The branch prediction buffer can be implemented
as a small, special buffer accessed with the instr
address during the IF stage.
If the instr is predicted as taken, fetching begins
from the target as soon as the PC is known (this
can be as early as the ID stage).
Otherwise sequential fetching continues. If
prediction is wrong, prediction bit must be
changed.

cs 152 L1 3 .9 DAP Fa97,  U.CB


Example Branch Prediction (1 bit history table)

Consider the following loop with: i=1 initially, j=1, h=10


Loop: g = g + A[ i ];
i = i + j;
if (i != h) go to Loop;

Loop: add $t1, $s3, $s3 #Temp reg $t1 = 2 *i


i j
add $t1, $t1, $t1 #Temp reg $t1 = 4*i
add $t1, $t1, $s5 #$t1 = address of A[ i ]
lw $t0, 0($t1) #Temp reg $t0 = A[ i ]
add $s1, $s1, $t0 #g = g + A[ i ]
add $s3, $s3, $s4 #i = i + j
bne $s3, $s3, Loop # go to Loop if i != j
The algorithm will mispredict the first time (i = 1, bit =0) and last time
(i=10, bit =1).
04/26/99 * 9
cs 152
cs 152 L1 3L1
.103 .9 DAP Fa97,  U.CB
Loops and Prediction

The prediction may be wrong (it may have been put


there by another branch that has the same low
order bits). This doesn’t affect correctness. If the
prediction is wrong, the prediction bit is inverted
and stored back, and the proper bit sequence is
executed.
The prediction accuracy for this branch that is taken
90% of the time is only 80% (2 incorrect predictions
and eight correct ones).
For highly regular branches, we want the accuracy of
the predictor to match the taken branch frequency.
Solution: Use a 2 bit prediction buffer. A prediction
must be wrong twice before it is changed.

04/26/99 10
cs 152
cs 152 L1 3L1
.113 .10 DAP Fa97,  U.CB
Finite State Machine for 2-bit Prediction Scheme

Taken
Not taken
Predict taken Predict taken

Taken
Taken Not taken
Not taken
Predict not taken Predict not taken

Taken

Not taken

A branch that strongly favors taken or not taken will


be mispredicted only once. The 2 bits are used to
encode the four states in the system.

Assume code has exited an earlier loop:


when enters loop bits=10 (2nd state, right top) predicts correctly;
during other prediction bits = 11 (1st state, left top);
at last loop prediction bits = 11, predicts wrongly and changes to10;
after last prediction bits = 10 (2nd state, right top).

cs 152 97108/Patterson
L1 3 .12 DAP Fa97,  U.CB
Figure 06.53
Compiler Scheduling for Branch Delays

A delayed branch always executes the following instr,


but the 2nd instr following the branch will be affected
by the branch.
Compilers and assemblers try to place an instr that
always executes after the branch in the branch delay
slot. The SW tries to make the successor instr valid
and useful.
Limitations of scheduling:
a) restrictions on instrs that are scheduled in the slot,
b) the ability to predict at compile time wether a
branch will be taken or not.
As machines go to longer pipes and towards issuing
multiple instrs / cycle a single delay slot is not very
useful. This strategy is loosing popularity.
Transistors in chip increase, dynamic predictors
become more popular.
04/27/99 12
cs 152
cs 152 L1 3L1
.133 .12 DAP Fa97,  U.CB
3 Ways to Schedule Branch Delay Slot
a. From before b. From target c. From fall through
sub $t4, $t5, $t6
add $s1, $s2, $s3 add $s1, $s2, $s3

if $s2 = 0 then if $s1 = 0 then
add $s1, $s2, $s3
Delay slot Delay slot
if $s1 = 0 then

Delay slot sub $t4, $t5, $t6

Becomes Becomes Becomes

add $s1, $s2, $s3

if $s2 = 0 then if $s1 = 0 then


add $s1, $s2, $s3
add $s1, $s2, $s3 sub $t4, $t5, $t6
if $s1 = 0 then

sub $t4, $t5, $t6

In b) and c) performance is improved only when execution proceeds


in expected direction.
cs 152 L1 3 .14 DAP Fa97,  U.CB
Comments on Branch Delay Scheduling

In a) the slot is scheduled with an independent instr


from before the branch. This is the best choice, it
always improves performance.
b) and c) are used when a) is not possible ($s1
written by add and read by if).
In b) usually the target will have to be copied into
slot because it can be reached by another path.
This strategy is preferred when branch is taken
with high probability (such as loop braches).
In c) the slot is scheduled with the not-taken fall-
through.
To make b) and c) scheduling possible, it is
necessary to be OK to execute the sub instr when
the branch goes in the unexpected direction (work
is wasted but the program executes correctly, e. g.
$t4 is not used when branch takes unexpected
direction).
cs 152 L1 3 .15 DAP Fa97,  U.CB
Execution Model Performance Comparisson

Compare performance of single-cycle, multicycle


and pipelined control for instr mix of gcc (fig 4.54)
instr mix:
22% loads, 11% stores, 49% R-format, 16% branches, 2% jumps
operation times for major functional units: mem access = 2ns, ALU
operations = 2 ns, regs file read / write = 1 ns.

Single-cycle: 8 ns / instruction = cycle duration.

Multicycle:
No. of cycles for each operation (CPI) in multicycle implementation:
loads = 5, stores = 4, R-format = 4, branches = 3, jumps = 3 cycles.
CPI = 0.22*5 + 0.11*4 + 0.49*4 + 0.16*3 + 0.02*3 = 4.04 cycles / instr
Average time for instr execution = 4.04 * 2 ns = 8.08 ns.
cs 152 L1 3 .16 DAP Fa97,  U.CB
Performance Comparisson Continuation
Pipelined: assume
1/2 of loads followed by instr that uses load result; branch delay on
misprediction 1 cycle and 1/4 branches are mispredicted; jumps
always delay 1 cycle.
loads: 1 cycle when no dependency and 2 when there is dependency,
average cycle for loads = 1.5 cycles
stores and R-format: 1 cycle
branches: 1 cycle when prediction is correct and 2 cycles when
wrong, average cycle for branches = 1 + 1*0.25 = 1.25 cycles
jumps: 1 cycle + 1 cycle delay = 2 cycles

CPI = 1.5 *0.22 + 1*0.11 + 1*0.49 + 1.25*0.16 + 2*0.02 = 1.17 cycles


average instr execution time 1.17*2 ns = 2.34 ns
Pipelined control time is:
8 ns / 2.34 ns = 3.4 times faster than single-cycle control,
8.08 ns / 2.34 ns = 3.5 times faster than multicycle control (book does
4.04 ns / 2.34 ns = 1.73 times faster - wrong...see example pg 397
chapter 5).
cs 152 L1 3 .17 DAP Fa97,  U.CB
6.7 Exceptions

Exceptions and interrupts are events other than


branches or jumps that change the normal flow of
instr execution.

Exception: an unexpected event from within the


processor. Ex: arithmetic overflow

Interrupts: a change of control flow from outside the


processor. Ex: Interrupts caused by I/O devices to
communicate with the processor.

cs 152 L1 3 .18 DAP Fa97,  U.CB


What about Interrupts, Traps,
Faults?
External Interrupts:
• Allow pipeline to drain,
• Load PC with interupt address

Faults (within instruction, restartable)


• Force trap instruction into IF (transfer control to the
exception routine)
• disable writes till trap hits WB
• must save multiple PCs or PC + state

Refer to MIPS solution


cs 152 L1 3 .19 DAP Fa97,  U.CB
Example: Arithmetic Overflow Exception

What happens if an over flow exception occurs in the


add instr of the following instr sequence? Instr
addresses are given in hexadecimal.
40 sub $11, $2, $4
44 and $12, $2, $5
48 or $13, $2, $6
4C add $1, $2, $1
50 slt $15, $6, $7
54 lw $16, 50($7)

assume the instrs to be invoked on an exception begin:


4000040 sw $25, 1000($0)
4000044 sw $26, 1004($0)

cs 152 L1 3 .20 DAP Fa97,  U.CB


Bubbles are Inserted
lw $16, 50($7) slt $15, $6, $7 add $1, $2, $1 or $13, . . . and $12, . . .

IF.Flush ID.Flush EX.Flush

Hazard
detection
unit
M ID/EX
u M 0
40000040 u
x 0 10
WB x
0 E X/MEM
M 0 010 M 0
Control u M u WB
x x MEM/WB
0
0
0 Cause 1
IF/ID EX M WB
58 54 50 Except
PC
4 Shift
left 2 $6
M $2
u
x

12 Registers = Data
Instruction ALU
PC memory M
mem ory $7
40000040 u
M $1 x
54 u
x

Sign
extend

M 13 12
$1 u
15 x
Forwarding
unit

Clock 5

sw $25, 1000($0) bubble (nop) bubble bubble or $13, . . .

IF.Flush ID.Flush EX.Flush

Hazard
detection
unit
M ID/EX
u M 00
40000040 u
x 0 00
WB x
0 E X/MEM

Control M 0 000 M 00
u M u WB
x x MEM/WB
0
40000044 0
0 1
EX Cause M WB
IF/ID

Except
PC
4 Shift
left 2
M
u
x

13 Registers = Data
Instruction ALU
PC memory
mem ory M
40000044 u
M x
40000040 u
x

Sign
extend

M 13
u
x

Forwarding
unit

Clock 6

cs 152 L1 3 .21 DAP Fa97,  U.CB


How to Flush Unwanted Instructions

Flush instrs in IF by turning it into nop (IF.Flush


control signal).
Use multiplexors already in ID to zero control signals
of instr in ID stage. A new ID.Flush control signal is
ORed with the stall signal from the Hazard
Detection Unit to flush during ID.
Use a new signal EX.Flush is used to cause new
multiplexors to zero the control lines in EX.
To start fetching instrs at location 40000040, add an
additional input to the PC multiplexor that sends
40000040 to the PC.
If the instr is not stopped in the middle of its execution, the
programmer will see a wrong value of $1 (not the original value
that caused the overflow). The exception must be detected and
flush signals set during the EX of add instr. EX.Flush prevents
this instr from writing in the WB stage.

cs 152 L1 3 .22 DAP Fa97,  U.CB


Datapath with Controls to Handle Exceptions
IF.Flush ID.Flush EX.Flush

Hazard
detection
unit
M ID/EX
40000040 u M
x u
WB x
0 EX/MEM
M M
Control u M u WB
x x MEM/WB
0
0
EX Cause M WB
IF/ID

Except
PC
4 Shift
left 2
M
u
x
Registers = Data
Instruction ALU
PC memory M
memory
u
M x
u
x

Sign
extend

M
u
x
Forwarding
unit

Cause regs records cause of exception; Exception PC saves address of


instr that caused the exception (actually saves address + 4). In example
cs 152 L1 3 .23
EPC = 4C + 4 = 50 (in hexadecimal). DAP Fa97,  U.CB
Other Causes of Exceptions

° I/O device request


° Invoking Operating System service from user
program
° Using an undefined instr
° HW malfunction
How to associate an exception with the
corresponding instr when there are 5 instrs active
in a clok cycle?
Multiple exceptions can occur simultaneously in a
single clock cycle.

cs 152 L1 3 .24 DAP Fa97,  U.CB


Exception Problem

° Exceptions/Interrupts: 5 instructions executing in 5 stage pipeline


• How to stop the pipeline?
• Restart?
• Who caused the interrupt?
Stage Problem interrupts occurring
IF Page fault on instruction fetch; misaligned memory
access; memory-protection violation
ID Undefined or illegal opcode
EX Arithmetic exception
MEM Page fault on data fetch; misaligned memory
access; memory-protection violation; memory error
° Load with data page fault, Add with instruction page fault?
° Solution 1: interrupt vector/instruction 2: interrupt ASAP, restart
everything incomplete

cs 152 L1 3 .25 DAP Fa97,  U.CB


Exception Handling

IAU

npc

I mem detect bad instruction address

Regs lw $2,20($5) PC
detect bad instruction
B A im n op rw

alu detect overflow

D mem detect bad data address

Regs Allow exception to take effect


cs 152 L1 3 .26 DAP Fa97,  U.CB
Solutions to Exception Problem

I/O device requests and HW malfunction: not


associated with a specific instr. Implementation
has some flexibility as to when to interrupt the
pipeline. For I/O request, stop at simplest instr. For
HW malfunction, it is best to stop as soon as
possible because HW is unstable.

Prioritize exceptions so it is easy to determine which


is serviced first. In most MIPs implementations the
HW sorts instrs so that earliest instr is interrupted.
EPC has address of interrupted instrs.
Cause regs records all possible exceptions in a
clock cycle.

cs 152 L1 3 .27 DAP Fa97,  U.CB


Pipeline Changes Order of Exception Ocurrence

instr__1__ _2___3___4___5___6___7___8___9___10
i-3 IF ID EX MEM WB
i-2 IF ID EX MEM WB
i-1 IF ID EX MEM WB
i LW IF ID EX MEM WB
i+1 ADD IF ID EX MEM WB
i+2 IF ID EX MEM WB

cs 152 L1 3 .28 DAP Fa97,  U.CB


HW SW Interface for Exception Handling

By knowing in which stage an exception can occur,


the exception SW can match the exception to the
instr.
Exceptions are collected in the cause reg so that the
HW can interrupt based on later exceptions, once
the earliest one has been serviced.
Precise Interrupts: The pipeline is stopped so that
the instrs prior to the falty instr j complete and the
following are flushed and later restarted.
Implementation: HW associates an interrupt vector
with each instr and disables the effects of the of the
faulty instr j when it is detected. The instr follows
down the pipe and the exception is treated at the
end of the MEM stage of j. This guarantees that the
exceptions of instr i are treated before those of
instr i+1.
cs 152 L1 3 .29 DAP Fa97,  U.CB
HW SW Interface (Cont...)

The HW saves the address in EPC, and the cause of


the falty instr in the cause reg and then jumps to a
prearranged address of the exception handling
routine.
The OS looks at the cause and acts appropriately.
For undefined instr, HW malfunction, arithmetic
overflow, the OS normally kills the program and
returns an indicator of the reason.
For I/O device request or an OS service call, the OS
saves the state of the program, performs the
desired task, and then restores the program to
continue execution.
MIPS and the most machines today support precise
interrupts or precise exceptions because they
simplify the interface with the OS (e.g. to support
virtual memory).
cs 152 L1 3 .30 DAP Fa97,  U.CB
Imprecise Interrupts or Imprecise Exceptions

Imprecise Interrupts: Treat exception when it occurs.


instr__1__ _2___3___4___5___6___7___8___9___10
i-3 IF ID EX MEM WB
i-2 IF ID EX MEM WB
i-1 IF ID EX MEM WB
i LW IF ID EX MEM WB
i+1 ADD IF ID EX MEM WB
i+2 IF ID EX MEM WB

Treats ADD exception at the end of 5th cycle. Stop instrs i-2, i-1, i,
and i+1. Reinitiate pipe at intrs i-2 after exception handling. It
will then detect the pg fault of instr i, stopping instrs i...i+3.
cs 152 L1 3 .31 DAP Fa97,  U.CB
Why use Imprecise Exceptions?

The difficulty of making the HW associate the correct


exception with the correct instr in pipelined
computers sometimes leads designers to relax
precise exception treatment.

Let the OS system determine:


° which instr caused the exception (from the type of
exception and the stage in which it occured)
° which instrs to flush
° from which instr to restart

cs 152 L1 3 .32 DAP Fa97,  U.CB

You might also like