Pipelined Processor Design
Instructor: Huzefa Rangwala, PhD
CS 465
Fall 2014
Review on Single Cycle Datapath
Subset of the core MIPS ISA
Arithmetic/Logic instructions: AND, OR, ADD, SUB,
SLT
Data flow instructions: LW, SW
Branch instructions: BEQ, J
Five steps in processor design
Analyze the instruction
Determine the datapath components
Assemble the components
Determine the control
Design the control unit
Multi-cycle CPU
CS465
Complete Single Cycle Datapath
Delays in Single Cycle Datapath
[Figure: single-cycle datapath annotated with component delays of 2 ns and 1 ns]
What are the delays for lw, sw, R-Type, beq, j instructions?
Single Cycle Implementation
Calculate cycle time assuming negligible delays
except:
memory (2ns), ALU and adders (2ns), register file access
(1ns)
Instruction  Instruction  Register  ALU  Memory  Register  Total
class        fetch        access         access  access    (ns)
R-Type       X            X         X            X         6
Load         X            X         X    X       X         8
Store        X            X         X    X                 7
Branch       X            X         X                      5
Jump         X                                             2
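The totals in the table can be checked with a short calculation. This is a minimal sketch, assuming the delays stated on the slide (memory 2 ns, ALU and adders 2 ns, register file access 1 ns) and that every other delay is negligible; the names `DELAY`, `STEPS`, `instruction_delay`, and `single_cycle_time` are illustrative only.

```python
# Component delays in ns, from the slide; everything else is assumed negligible.
DELAY = {"fetch": 2, "reg_read": 1, "alu": 2, "mem": 2, "reg_write": 1}

# Functional units each instruction class passes through.
STEPS = {
    "R-type": ["fetch", "reg_read", "alu", "reg_write"],
    "load":   ["fetch", "reg_read", "alu", "mem", "reg_write"],
    "store":  ["fetch", "reg_read", "alu", "mem"],
    "branch": ["fetch", "reg_read", "alu"],
    "jump":   ["fetch"],
}

def instruction_delay(kind):
    """Sum the delays of the units used by one instruction class."""
    return sum(DELAY[step] for step in STEPS[kind])

def single_cycle_time():
    """A single-cycle clock must fit the slowest instruction class."""
    return max(instruction_delay(kind) for kind in STEPS)

if __name__ == "__main__":
    for kind in STEPS:
        print(kind, instruction_delay(kind), "ns")
    print("cycle time:", single_cycle_time(), "ns")
```

The clock period comes out as the load delay, which is why every other instruction class wastes part of each cycle.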
Remarks on Single Cycle Datapath
Single cycle datapath ensures the execution of
any instruction within one clock cycle
Functional units must be duplicated if used multiple
times by one instruction, e.g., the ALU. Why?
Functional units can be shared if used by different
instructions
Single cycle datapath is not efficient in time
Clock cycle time is determined by the instruction
taking the longest time, e.g., lw in MIPS
Variable clock cycle time is too complicated
Alternative design/implementation approaches
Multiple clock cycles per instruction
Outline
Today's topic
Pipelining is an implementation technique in
which multiple instructions are overlapped in
execution
Subset of MIPS instructions
lw, sw, and, or, add, sub, slt, beq
Outline
Pipeline high-level introduction
Stages, hazards
Pipelined datapath and control design
Pipeline
CS465
Pipelining is Natural!
Laundry example
Ann, Brian, Cathy, and Dave each have one load of clothes
to wash, dry, and fold
Washer takes 30 minutes
Dryer takes 40 minutes
Folder takes 20 minutes
Sequential Laundry
[Figure: timeline from 6 PM to midnight; loads A, B, C, and D run back to back in task order, each taking 30 + 40 + 20 minutes]
Sequential laundry takes 6 hours for 4 loads
If they learned pipelining, how long would laundry take?
Pipelined Laundry
[Figure: timeline starting at 6 PM; loads A, B, C, and D overlap in task order, with segments of 30, 40, 40, 40, 40, and 20 minutes]
Start work ASAP
Pipelined laundry takes 3.5 hours for 4 loads
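The 6-hour and 3.5-hour figures can be reproduced with a small simulation. This is a sketch under the slide's assumptions (washer 30 min, dryer 40 min, folder 20 min, four loads, each stage handling one load at a time and starting as soon as it can); the function name `pipeline_finish_times` is hypothetical.

```python
def pipeline_finish_times(stage_minutes, num_loads):
    """Return the finish time, in minutes from the start, of each load."""
    # stage_free[s] is the time at which stage s finishes its current load.
    stage_free = [0] * len(stage_minutes)
    finish = []
    for _ in range(num_loads):
        ready = 0  # time at which this load is ready for its next stage
        for s, minutes in enumerate(stage_minutes):
            start = max(ready, stage_free[s])  # wait for the stage to be free
            ready = start + minutes
            stage_free[s] = ready
        finish.append(ready)
    return finish

STAGES = [30, 40, 20]  # washer, dryer, folder
sequential_minutes = 4 * sum(STAGES)
pipelined_minutes = pipeline_finish_times(STAGES, 4)[-1]
print(sequential_minutes / 60, "hours sequential")
print(pipelined_minutes / 60, "hours pipelined")
```

After the first load, a new load finishes every 40 minutes, which is the dryer time: the slowest stage sets the rate.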
Pipelining Lessons (I)
Multiple tasks operating simultaneously using different resources
Pipelining doesn't help latency of a single task; it helps throughput of the entire workload
Pipeline rate is limited by the slowest pipeline stage
Unbalanced lengths of pipeline stages reduce speedup
Pipelining Lessons (II)
Potential speedup = number of pipeline stages
Time to fill the pipeline and time to drain it reduce speedup (startup and wind-down)
Stall for dependencies
MIPS Pipeline
Five stages, one step per stage
1. IF: Instruction fetch from memory
2. ID: Instruction decode & register read
3. EX: Execute operation or calculate address
4. MEM: Access memory operand
5. WB: Write result back to register
Chapter 4 The Processor
Pipeline Performance
Assume time for stages is
100ps for register read or write
200ps for other stages
Compare pipelined datapath with single-cycle
datapath
Instr     Instr fetch  Register read  ALU op  Memory access  Register write  Total time
lw        200ps        100ps          200ps   200ps          100ps           800ps
sw        200ps        100ps          200ps   200ps                          700ps
R-format  200ps        100ps          200ps                  100ps           600ps
beq       200ps        100ps          200ps                                  500ps
Pipeline Performance
Single-cycle (Tc= 800ps)
Pipelined (Tc= 200ps)
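The two clock periods follow directly from the stage times. The sketch below assumes the stage times given earlier (100 ps for register read or write, 200 ps for the other stages) and an ideal pipeline with no hazards; `STAGE_PS` and the function names are illustrative.

```python
STAGE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}

def single_cycle_tc():
    """One cycle must fit the slowest instruction; lw uses all five steps."""
    return sum(STAGE_PS.values())

def pipelined_tc():
    """Every stage gets one cycle, so the slowest stage sets the clock."""
    return max(STAGE_PS.values())

def total_time_ps(n, pipelined):
    """Time to run n instructions, assuming no stalls."""
    if pipelined:
        # The first instruction takes 5 cycles, each later one adds 1 cycle.
        return (n + len(STAGE_PS) - 1) * pipelined_tc()
    return n * single_cycle_tc()

def speedup(n):
    return total_time_ps(n, pipelined=False) / total_time_ps(n, pipelined=True)

print(single_cycle_tc(), pipelined_tc())
print(total_time_ps(3, False), total_time_ps(3, True))
print(speedup(1_000_000))
```

For three instructions the gain is modest because of fill and drain time; for long runs the speedup approaches 800/200 = 4 rather than 5, because the stages are not balanced.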
Pipeline Speedup
If all stages are balanced
i.e., all take the same time
$$\text{Time between instructions}_{\text{pipelined}} = \frac{\text{Time between instructions}_{\text{nonpipelined}}}{\text{Number of stages}}$$
If not balanced, speedup is less
Speedup due to increased throughput
Latency (time for each instruction) does not
decrease
Pipelining and ISA Design
MIPS ISA designed for pipelining
All instructions are 32 bits
Easier to fetch and decode in one cycle
cf. x86: 1- to 17-byte instructions
Few and regular instruction formats
Can decode and read registers in one step
Load/store addressing
Can calculate address in 3rd stage, access memory in
4th stage
Alignment of memory operands
Memory access takes only one cycle
Hazards
Situations that prevent starting the next instruction in the next cycle
Structure hazards
A required resource is busy
Data hazard
Need to wait for previous instruction to complete its data read/write
Control hazard
Deciding on control action depends on previous instruction
Structure Hazards
Conflict for use of a resource
In MIPS pipeline with a single memory
Load/store requires data access
Instruction fetch would have to stall for that cycle
Would cause a pipeline bubble
Hence, pipelined datapaths require separate instruction/data memories
Or separate instruction/data caches
Structural Hazard: One Memory
[Figure: pipeline diagram, time (clock cycles) across and instruction order down: Load, Instr 1, Instr 2, Instr 3, Instr 4, each passing through Mem, Reg, ALU, Mem, Reg]
Solution 1: add more HW
Hazards can always be resolved by waiting
Structural Hazard: One Memory
[Figure: pipeline diagram, time (clock cycles) across and instruction order down: Load, Instr 1, Instr 2, a stall shown as a row of bubbles, then Instr 3]
Hazards can always be resolved by waiting
Data Hazard Example
Data hazard: an instruction depends on the result of a
previous instruction still in the pipeline
add r1, r2, r3
sub r4, r1, r3
and r6, r1, r7
or  r8, r1, r9
xor r10, r1, r11
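The dependences in this sequence can be found mechanically. The sketch below lists every read-after-write pair and then flags the ones that are hazards, under the assumption (not stated on this slide) of the classic five-stage pipeline whose register file is written in the first half of a cycle and read in the second half, so only consumers one or two instructions behind the producer are at risk. The helper names are hypothetical.

```python
def parse(line):
    """Split 'op rd, rs, rt' into (op, dest, [sources])."""
    op, operands = line.split(None, 1)
    regs = [r.strip() for r in operands.split(",")]
    return op, regs[0], regs[1:]

def raw_dependences(program):
    """Return (producer, consumer, register) for each read-after-write pair."""
    deps = []
    last_writer = {}  # register name -> index of the latest instruction writing it
    for i, line in enumerate(program):
        _, dest, sources = parse(line)
        for src in sources:
            if src in last_writer:
                deps.append((last_writer[src], i, src))
        last_writer[dest] = i
    return deps

def hazards(program, max_distance=2):
    """Dependences close enough to need forwarding or a stall (assumed distance)."""
    return [d for d in raw_dependences(program) if d[1] - d[0] <= max_distance]

PROGRAM = [
    "add r1, r2, r3",
    "sub r4, r1, r3",
    "and r6, r1, r7",
    "or r8, r1, r9",
    "xor r10, r1, r11",
]
print(raw_dependences(PROGRAM))
print(hazards(PROGRAM))
```

All four later instructions depend on the add through r1, but under the stated assumption only sub and and read r1 before it has been written back.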
Data Hazard Example
Dependences backward in time are hazards
[Figure: pipeline diagram (IF, ID/RF, EX, MEM, WB) over time (clock cycles) for add r1,r2,r3; sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11]
Compilers can help, but it gets messy and difficult
Data Hazard Solution
[Figure: pipeline diagram (IF, ID/RF, EX, MEM, WB) over time (clock cycles) for add r1,r2,r3; sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11]
Solution: forward result from one stage to another
Data Hazard Even with Forwarding
[Figure: pipeline diagram over time (clock cycles) for lw r1,0(r2) followed by sub r4,r1,r3]
Can't go back in time! Must delay/stall
instruction dependent on loads
Data Hazard Even with Forwarding
[Figure: pipeline diagram over time (clock cycles) for lw r1,0(r2) followed by sub r4,r1,r3, with a stall inserted before sub continues]
Must delay/stall instruction dependent on loads
Sometimes the instruction sequence can be
reordered to avoid pipeline stalls
Code Scheduling to Avoid Stalls
Reorder code to avoid use of load result in the next instruction
C code for A = B + E; C = B + F;

Original order (13 cycles):
        lw   $t1, 0($t0)
        lw   $t2, 4($t0)
stall   add  $t3, $t1, $t2
        sw   $t3, 12($t0)
        lw   $t4, 8($t0)
stall   add  $t5, $t1, $t4
        sw   $t5, 16($t0)

Reordered (11 cycles):
        lw   $t1, 0($t0)
        lw   $t2, 4($t0)
        lw   $t4, 8($t0)
        add  $t3, $t1, $t2
        sw   $t3, 12($t0)
        add  $t5, $t1, $t4
        sw   $t5, 16($t0)
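The 13-cycle and 11-cycle counts can be reproduced by counting load-use stalls. This is a rough sketch, assuming a five-stage pipeline with full forwarding in which the only stall is one cycle when an instruction reads a register loaded by the immediately preceding lw; the parsing helpers are hypothetical and only handle the instruction forms on the slide.

```python
import re

def regs_read(instr):
    """Registers read by lw, sw, or a three-register ALU instruction."""
    op, rest = instr.split(None, 1)
    regs = re.findall(r"\$\w+", rest)
    if op == "sw":
        return regs        # stored value and base register
    return regs[1:]        # lw: base register; ALU ops: the two sources

def load_use_stalls(program):
    """Count instructions that use the result of the lw directly before them."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        if prev.split()[0] == "lw":
            loaded = re.findall(r"\$\w+", prev)[0]
            if loaded in regs_read(cur):
                stalls += 1
    return stalls

def cycles(program, stages=5):
    """Cycles to drain the pipeline: fill time plus one per instruction plus stalls."""
    return len(program) + (stages - 1) + load_use_stalls(program)

ORIGINAL = [
    "lw $t1, 0($t0)", "lw $t2, 4($t0)", "add $t3, $t1, $t2", "sw $t3, 12($t0)",
    "lw $t4, 8($t0)", "add $t5, $t1, $t4", "sw $t5, 16($t0)",
]
REORDERED = [
    "lw $t1, 0($t0)", "lw $t2, 4($t0)", "lw $t4, 8($t0)", "add $t3, $t1, $t2",
    "sw $t3, 12($t0)", "add $t5, $t1, $t4", "sw $t5, 16($t0)",
]
print(cycles(ORIGINAL), cycles(REORDERED))
```

Moving the third lw up separates each load from its first use, so both stalls disappear.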
Control Hazards
Branch determines flow of control
Fetching next instruction depends on branch outcome
Pipeline can't always fetch correct instruction
Still working on ID stage of branch
In MIPS pipeline
Need to compare registers and compute target early in the pipeline
Add hardware to do it in ID stage
Stall on Branch
Wait until branch outcome determined before fetching next instruction
Branch Prediction
Longer pipelines can't readily determine branch outcome early
Stall penalty becomes unacceptable
Predict outcome of branch
Only stall if prediction is wrong
In MIPS pipeline
Can predict branches not taken
Fetch instruction after branch, with no delay
Control Hazard Solution: Predict
Predict: guess one direction, then back up if wrong
Impact: 0 lost cycles per branch instruction if right, 1 if wrong
Need to squash and restart the following instruction if wrong
Prediction scheme
Random prediction: correct 50% of time
History-based prediction: correct 90% of time
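The impact of prediction accuracy can be put into numbers. The sketch below uses the slide's figures (0 lost cycles when right, 1 when wrong; 50% and 90% accuracy). The branch frequency of 20% and the base CPI of 1 in the usage lines are assumed values chosen only for illustration.

```python
def average_branch_penalty(accuracy, mispredict_cycles=1):
    """Expected lost cycles per branch instruction."""
    return (1.0 - accuracy) * mispredict_cycles

def cpi_with_branches(base_cpi, branch_fraction, accuracy, mispredict_cycles=1):
    """Average CPI once misprediction cycles are added in."""
    penalty = average_branch_penalty(accuracy, mispredict_cycles)
    return base_cpi + branch_fraction * penalty

# Expected lost cycles per branch for the two schemes on the slide.
print(average_branch_penalty(0.5))   # random prediction
print(average_branch_penalty(0.9))   # history-based prediction

# Assumed workload: 20% branches, base CPI of 1.
print(cpi_with_branches(1.0, 0.2, 0.5))
print(cpi_with_branches(1.0, 0.2, 0.9))
```

With a one-cycle penalty the difference looks small; in deeper pipelines `mispredict_cycles` grows and accuracy matters much more.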
MIPS with Predict Not Taken
[Figure: two pipeline diagrams, one with the prediction correct and one with the prediction incorrect]
More-Realistic Branch Prediction
Static branch prediction
Based on typical branch behavior
Example: loop and if-statement branches
Predict backward branches taken
Predict forward branches not taken
Dynamic branch prediction
Hardware measures actual branch behavior
e.g., record recent history of each branch
Assume future behavior will continue the trend
When wrong, stall while re-fetching, and update history
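One common way to "record recent history of each branch" is a 2-bit saturating counter per branch. The slide does not name a specific scheme, so the sketch below is an assumed example rather than the predictor the lecture has in mind; the class name, the initial counter value, and the branch address 0x40 are all hypothetical.

```python
class TwoBitPredictor:
    """Per-branch 2-bit saturating counters: 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self):
        self.counters = {}  # branch address -> counter value in 0..3

    def predict(self, pc):
        # Unseen branches start at 1 (weakly not taken); this is an assumption.
        return self.counters.get(pc, 1) >= 2

    def update(self, pc, taken):
        c = self.counters.get(pc, 1)
        self.counters[pc] = min(c + 1, 3) if taken else max(c - 1, 0)

def count_mispredictions(predictor, pc, outcomes):
    """Run one branch through the predictor and count wrong guesses."""
    wrong = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            wrong += 1
        predictor.update(pc, taken)
    return wrong

# A loop branch that is taken 9 times and then falls through, executed twice.
loop_outcomes = [True] * 9 + [False]
predictor = TwoBitPredictor()
first_pass = count_mispredictions(predictor, 0x40, loop_outcomes)
second_pass = count_mispredictions(predictor, 0x40, loop_outcomes)
print(first_pass, second_pass)
```

The second pass mispredicts only the loop exit, because a single not-taken outcome is not enough to flip a strongly-taken counter.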
Pipeline Summary
The BIG Picture
Pipelining improves performance by increasing instruction throughput
Executes multiple instructions in parallel
Each instruction has the same latency
Subject to hazards
Structure, data, control
Instruction set design affects complexity of pipeline implementation
4.6 Pipelined Datapath and Control
MIPS Pipelined Datapath
[Figure: pipelined datapath with the MEM and WB stages marked]
Right-to-left flow leads to hazards
Pipeline registers
Need registers between stages
To hold information produced in previous cycle
Pipeline Operation
Cycle-by-cycle flow of instructions through the pipelined datapath
Single-clock-cycle pipeline diagram
Shows pipeline usage in a single cycle
Highlight resources used
cf. multi-clock-cycle diagram
Graph of operation over time
We'll look at single-clock-cycle diagrams for load & store
IF for Load, Store, ...
ID for Load, Store, ...
EX for Load
MEM for Load
WB for Load (figure annotation: wrong register number)
Corrected Datapath for Load
EX for Store
MEM for Store
WB for Store
Multi-Cycle Pipeline Diagram
Form showing resource usage
Multi-Cycle Pipeline Diagram
Traditional form
Single-Cycle Pipeline Diagram
State of pipeline in a given cycle
Pipelined Control (Simplified)
Observations
No write-control signal is needed for the pipeline registers or the PC, since
they are updated at every clock cycle
To specify the control for the pipeline, set the control
values during each pipeline stage
Control lines can be divided into 5 groups:
IF:  none
ID:  none
ALU: RegDst, ALUOp, ALUSrc
MEM: Branch, MemRead, MemWrite
WB:  MemtoReg, RegWrite
Group these nine control lines into 3 subsets:
ALUControl, MEMControl, WBControl
Control signals are generated at the ID stage; how are they passed
to the other stages?
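One way to picture the answer is that the control values travel down the pipeline registers alongside the instruction, with each stage consuming its own group. The sketch below illustrates that idea only; the signal values for lw and R-format are the usual MIPS settings and are not given on this slide, so treat them, and the helper names `decode` and `advance`, as assumptions.

```python
# Control values produced in ID, grouped by the stage that uses them.
CONTROL = {
    "lw": {
        "EX":  {"RegDst": 0, "ALUOp": 0b00, "ALUSrc": 1},
        "MEM": {"Branch": 0, "MemRead": 1, "MemWrite": 0},
        "WB":  {"MemtoReg": 1, "RegWrite": 1},
    },
    "R-format": {
        "EX":  {"RegDst": 1, "ALUOp": 0b10, "ALUSrc": 0},
        "MEM": {"Branch": 0, "MemRead": 0, "MemWrite": 0},
        "WB":  {"MemtoReg": 0, "RegWrite": 1},
    },
}

def decode(op):
    """ID stage: build the control field written into the ID/EX register."""
    return dict(CONTROL[op])

def advance(bundle, consumed):
    """Drop the group used by the stage just finished; pass the rest along."""
    return {group: values for group, values in bundle.items() if group != consumed}

id_ex = decode("lw")              # carries the EX, MEM, and WB groups
ex_mem = advance(id_ex, "EX")     # carries the MEM and WB groups
mem_wb = advance(ex_mem, "MEM")   # carries only the WB group
print(sorted(id_ex), sorted(ex_mem), sorted(mem_wb))
```

Each pipeline register is therefore a little narrower than the one before it, since control bits already used need not be kept.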
Pipelined Control
Control signals derived from instruction
As in single-cycle implementation
Pipelined Control
Add Forwarding Paths
4.10 Parallelism and Advanced Instruction Level Parallelism
Instruction-Level Parallelism (ILP)
Pipelining: executing multiple instructions in parallel
To increase ILP
Deeper pipeline
Less work per stage => shorter clock cycle
Multiple issue
Replicate pipeline stages => multiple pipelines
Start multiple instructions per clock cycle
CPI < 1, so use Instructions Per Cycle (IPC)
E.g., 4GHz 4-way multiple-issue
16 BIPS, peak CPI = 0.25, peak IPC = 4
But dependencies reduce this in practice
MIPS with Static Dual Issue
Two-issue packets
One ALU/branch instruction
One load/store instruction
64-bit aligned
ALU/branch, then load/store
Pad an unused instruction with nop
Address  Instruction type  Pipeline stages
n        ALU/branch        IF  ID  EX  MEM WB
n + 4    Load/store        IF  ID  EX  MEM WB
n + 8    ALU/branch            IF  ID  EX  MEM WB
n + 12   Load/store            IF  ID  EX  MEM WB
n + 16   ALU/branch                IF  ID  EX  MEM WB
n + 20   Load/store                IF  ID  EX  MEM WB
MIPS with Static Dual Issue
Hazards in the Dual-Issue MIPS
More instructions executing in parallel
EX data hazard
Forwarding avoided stalls with single-issue
Now can't use ALU result in load/store in same
packet
add $t0, $s0, $s1
load $s2, 0($t0)
Split into two packets, effectively a stall
More aggressive scheduling required
Multiple Issue
Static multiple issue
Compiler groups instructions to be issued together
Packages them into issue slots
Compiler detects and avoids hazards
Dynamic multiple issue
CPU examines instruction stream and chooses
instructions to issue each cycle
Compiler can help by reordering instructions
CPU resolves hazards using advanced techniques at
runtime
Speculation
Guess what to do with an instruction
Start operation as soon as possible
Check whether guess was right
If so, complete the operation
If not, roll-back and do the right thing
Common to static and dynamic multiple issue
Examples
Speculate on branch outcome
Roll back if path taken is different
Speculate on load
Roll back if location is updated
Compiler/Hardware Speculation
Compiler can reorder instructions
e.g., move load before branch
Can include fix-up instructions to recover from incorrect guess
Hardware can look ahead for instructions to execute
Buffer results until it determines they are actually needed
Flush buffers on incorrect speculation
Static Multiple Issue
Compiler groups instructions into issue packets
Group of instructions that can be issued on a single cycle
Determined by pipeline resources required
Think of an issue packet as a very long instruction
Specifies multiple concurrent operations
Very Long Instruction Word (VLIW)
Scheduling Static Multiple Issue
Compiler must remove some/all hazards
Reorder instructions into issue packets
No dependencies within a packet
Possibly some dependencies between packets
Varies between ISAs; compiler must know!
Pad with nop if necessary
Thought Question:
In our simple single-issue five-stage pipeline, loads have a use latency of one
clock cycle, which prevents one instruction from using the result without stalling.
How many extra stalls may be needed for loads, or even ALU instructions, in a
2-issue 5-stage pipeline?
Remember we had zero stalls for ALU instructions (zero use latency) in the
single-issue pipeline.
Scheduling Example
Schedule this for dual-issue MIPS
Loop: lw   $t0, 0($s1)       # $t0=array element
      addu $t0, $t0, $s2     # add scalar in $s2
      sw   $t0, 0($s1)       # store result
      addi $s1, $s1, -4      # decrement pointer
      bne  $s1, $zero, Loop  # branch $s1!=0

       ALU/branch              Load/store         cycle
Loop:  nop                     lw  $t0, 0($s1)    1
       addi $s1, $s1, -4       nop                2
       addu $t0, $t0, $s2      nop                3
       bne  $s1, $zero, Loop   sw  $t0, 4($s1)    4
IPC = 5/4 = 1.25 (cf. peak IPC = 2)
Loop Unrolling
Replicate loop body to expose more parallelism
Reduces loop-control overhead
Use different registers per replication
Called register renaming
Avoid loop-carried anti-dependencies
Store followed by a load of the same register
Aka name dependence
Reuse of a register name
Loop Unrolling Example
       ALU/branch              Load/store          cycle
Loop:  addi $s1, $s1, -16      lw  $t0, 0($s1)     1
       nop                     lw  $t1, 12($s1)    2
       addu $t0, $t0, $s2      lw  $t2, 8($s1)     3
       addu $t1, $t1, $s2      lw  $t3, 4($s1)     4
       addu $t2, $t2, $s2      sw  $t0, 16($s1)    5
       addu $t3, $t3, $s2      sw  $t1, 12($s1)    6
       nop                     sw  $t2, 8($s1)     7
       bne  $s1, $zero, Loop   sw  $t3, 4($s1)     8
IPC = 14/8 = 1.75
Closer to 2, but at cost of registers and code size
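Both IPC figures, 5/4 for the original loop and 14/8 for the unrolled loop, come from counting the non-nop slots in each schedule. A minimal sketch, with each cycle written as an (ALU/branch, load/store) pair; the function name `ipc` and the list names are illustrative.

```python
def ipc(schedule):
    """Instructions per cycle: useful (non-nop) slots divided by cycles."""
    issued = sum(1 for packet in schedule for slot in packet if slot != "nop")
    return issued / len(schedule)

ORIGINAL_LOOP = [
    ("nop",                  "lw $t0, 0($s1)"),
    ("addi $s1, $s1, -4",    "nop"),
    ("addu $t0, $t0, $s2",   "nop"),
    ("bne $s1, $zero, Loop", "sw $t0, 4($s1)"),
]

UNROLLED_LOOP = [
    ("addi $s1, $s1, -16",   "lw $t0, 0($s1)"),
    ("nop",                  "lw $t1, 12($s1)"),
    ("addu $t0, $t0, $s2",   "lw $t2, 8($s1)"),
    ("addu $t1, $t1, $s2",   "lw $t3, 4($s1)"),
    ("addu $t2, $t2, $s2",   "sw $t0, 16($s1)"),
    ("addu $t3, $t3, $s2",   "sw $t1, 12($s1)"),
    ("nop",                  "sw $t2, 8($s1)"),
    ("bne $s1, $zero, Loop", "sw $t3, 4($s1)"),
]

print(ipc(ORIGINAL_LOOP), ipc(UNROLLED_LOOP))
```

Unrolling fills slots that were nops before: three of eight slots are empty in the original schedule, but only two of sixteen in the unrolled one.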
Dynamic Multiple Issue
Superscalar processors
CPU decides whether to issue 0, 1, 2, ... each cycle
Avoiding structural and data hazards
Avoids the need for compiler scheduling
Though it may still help
Code semantics ensured by the CPU
Dynamic Pipeline Scheduling
Allow the CPU to execute instructions out of order to avoid stalls
But commit result to registers in order
Example
lw   $t0, 20($s2)
addu $t1, $t0, $t2
sub  $s4, $s4, $t3
slti $t5, $s4, 20
Can start sub while addu is waiting for lw
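Which instructions may run ahead of a stalled load can be worked out from the register dependences. The sketch below is a simplified model, assuming that only true (read-after-write) dependences block an instruction and that name dependences are removed by renaming; `blocked_instructions` and `parse` are hypothetical helpers, not part of any real scheduler.

```python
import re

def parse(instr):
    """Return (op, destination register, set of source registers)."""
    op, rest = instr.split(None, 1)
    regs = re.findall(r"\$\w+", rest)
    return op, regs[0], set(regs[1:])

def blocked_instructions(program, stalled_index):
    """Indices of later instructions that must wait for program[stalled_index]."""
    unavailable = {parse(program[stalled_index])[1]}  # results not yet produced
    blocked = []
    for i in range(stalled_index + 1, len(program)):
        _, dest, sources = parse(program[i])
        if sources & unavailable:
            blocked.append(i)
            unavailable.add(dest)      # its result is delayed as well
        else:
            unavailable.discard(dest)  # a runnable instruction redefines dest
    return blocked

PROGRAM = [
    "lw $t0, 20($s2)",
    "addu $t1, $t0, $t2",
    "sub $s4, $s4, $t3",
    "slti $t5, $s4, 20",
]
print(blocked_instructions(PROGRAM, 0))
```

Only the addu waits on the lw; the sub and the slti that depends on it are free to execute, though their results still commit in program order.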
Dynamically Scheduled CPU
Preserves dependencies
Hold pending operands
Results also sent to any waiting reservation stations
Reorder buffer for register writes
Can supply operands for issued instructions
Register Renaming
Reservation stations and reorder buffer
effectively provide register renaming
On instruction issue to reservation station
If operand is available in register file or reorder buffer
Copied to reservation station
For the issuing instruction, register copy of the operand is no
longer written.
If operand is not yet available
It will be provided to the reservation station by a function unit
bypassing the register file.
Register update may not be required
Speculation
Predict branch and continue issuing
Don't commit until branch outcome determined
Load speculation
Avoid load and cache miss delay
Predict the effective address
Predict loaded value
Load before completing outstanding stores
Bypass stored values to load unit
Don't commit load until speculation cleared
Why Do Dynamic Scheduling?
Why not just let the compiler schedule code?
Not all stalls are predictable
e.g., cache misses
Can't always schedule around branches
Branch outcome is dynamically determined
Different implementations of an ISA have different latencies and hazards
Does Multiple Issue Work?
The BIG Picture
Yes, but not as much as we'd like
Programs have real dependencies that limit ILP
Some dependencies are hard to eliminate
e.g., pointer aliasing
Some parallelism is hard to expose
Limited window size during instruction issue
Memory delays and limited bandwidth
Hard to keep pipelines full
Speculation can help if done well
Power Efficiency
Complexity of dynamic scheduling and speculation requires power
Multiple simpler cores may be better

Microprocessor  Year  Clock Rate  Pipeline Stages  Issue width  Out-of-order/Speculation  Cores  Power
i486            1989  25MHz                                     No                               5W
Pentium         1993  66MHz                                     No                               10W
Pentium Pro     1997  200MHz      10                            Yes                              29W
P4 Willamette   2001  2000MHz     22                            Yes                              75W
P4 Prescott     2004  3600MHz     31                            Yes                              103W
Core            2006  2930MHz     14                            Yes                              75W
UltraSparc III  2003  1950MHz     14                            No                               90W
UltraSparc T1   2005  1200MHz                                   No                               70W
4.11 Real Stuff: The AMD Opteron X4 (Barcelona) Pipeline
The Opteron X4 Microarchitecture
[Figure: Opteron X4 microarchitecture; annotation: 72 physical registers]
The Opteron X4 Pipeline Flow
For integer operations
FP is 5 stages longer
Up to 106 RISC-ops in progress
Bottlenecks
Complex instructions with long dependencies
Branch mispredictions
Memory access delays
4.14 Concluding Remarks
Concluding Remarks
ISA influences design of datapath and control
Datapath and control influence design of ISA
Pipelining improves instruction throughput using parallelism
More instructions completed per second
Latency for each instruction not reduced
Hazards: structural, data, control
Multiple issue and dynamic scheduling (ILP)
Dependencies limit achievable parallelism
Complexity leads to the power wall
4.9 Exceptions
Exceptions and Interrupts
Unexpected events requiring change in flow of control
Different ISAs use the terms differently
Exception
Arises within the CPU
e.g., undefined opcode, overflow, syscall, ...
Interrupt
From an external I/O controller
Dealing with them without sacrificing performance is hard
Handling Exceptions
In MIPS, exceptions managed by a System
Control Coprocessor (CP0)
Save PC of offending (or interrupted)
instruction
In MIPS: Exception Program Counter (EPC)
Save indication of the problem
In MIPS: Cause register
We'll assume 1-bit
0 for undefined opcode, 1 for overflow
Jump to handler at 8000 0180
An Alternate Mechanism
Vectored Interrupts
Handler address determined by the cause
Example:
Undefined opcode: C000 0000
Overflow:         C000 0020
...:              C000 0040
Instructions either
Deal with the interrupt, or
Jump to real handler
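The example addresses are spaced 0x20 bytes apart, so the handler address can be computed from a cause number. A minimal sketch, assuming cause codes 0 for undefined opcode and 1 for overflow (the encoding assumed on the Handling Exceptions slide); the constant and function names are hypothetical.

```python
VECTOR_BASE = 0xC0000000
VECTOR_STRIDE = 0x20  # 32 bytes, room for eight 4-byte MIPS instructions

def handler_address(cause):
    """Address of the vector entry for a given cause code."""
    return VECTOR_BASE + cause * VECTOR_STRIDE

for cause, name in [(0, "undefined opcode"), (1, "overflow")]:
    print(name, format(handler_address(cause), "08X"))
```

Eight instructions per entry is enough either to deal with a simple case in place or to jump to the real handler.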
Handler Actions
Read cause, and transfer to relevant handler
Determine action required
If restartable
Take corrective action
Use EPC to return to program
Otherwise
Terminate program
Report error using EPC, cause, ...
Exceptions in a Pipeline
Another form of control hazard
Consider overflow on add in EX stage
add $1, $2, $1
Prevent $1 from being clobbered
Complete previous instructions
Flush add and subsequent instructions
Set Cause and EPC register values
Transfer control to handler
Similar to mispredicted branch
Use much of the same hardware
Pipeline with Exceptions
Exception Properties
Restartable exceptions
Pipeline can flush the instruction
Handler executes, then returns to the instruction
Refetched and executed from scratch
PC saved in EPC register
Identifies causing instruction
Actually PC + 4 is saved
Handler must adjust
Exception Example
Exception on add in
40        sub  $11, $2, $4
44        and  $12, $2, $5
48        or   $13, $2, $6
4C        add  $1, $2, $1
50        slt  $15, $6, $7
54        lw   $16, 50($7)
Handler
80000180  sw   $25, 1000($0)
80000184  sw   $26, 1004($0)
Exception Example
Multiple Exceptions
Pipelining overlaps multiple instructions
Could have multiple exceptions at once
Simple approach: deal with exception from
earliest instruction
Flush subsequent instructions
Precise exceptions
In complex pipelines
Multiple instructions issued per cycle
Out-of-order completion
Maintaining precise exceptions is difficult!