Pipelined MIPS Processor: Dmitri Strukov ECE 154A

Pipelined MIPS Processor

Dmitri Strukov
ECE 154A
Pipelining Analogy
• Pipelined laundry: overlapping execution
– Parallelism improves performance
– Four loads: speedup = 8/3.5 ≈ 2.3
– Non-stop (n loads): speedup = 2n/(0.5n + 1.5) ≈ 4 = number of stages
Single-Cycle vs. Multicycle vs. Pipelined
[Figure: clock timing comparison. Single-cycle: Instr 1 to Instr 4 are each allotted the same clock period, which is longer than the time most of them need. Multicycle: Instr 1 to Instr 4 take 3, 5, 3, and 4 cycles, saving time relative to single-cycle.]
[Figure: pipelined execution of seven instructions over 11 cycles, drawn as (a) a task-time diagram and (b) a space-time diagram with start-up and drainage regions. Stage legend: f = Fetch, r = Reg read, a = ALU op, d = Data access, w = Writeback.]
MIPS Pipeline
Five stages, one step per stage
1. IF: Instruction fetch from memory
2. ID: Instruction decode & register read
3. EX: Execute operation or calculate address
4. MEM: Access memory operand
5. WB: Write result back to register
[Figure: lw passing through IFetch, Dec, Exec, Mem, WB in cycles 1 to 5]
Pipeline Performance Example
• Assume time for stages is
– 100ps for register read or write
– 200ps for other stages
• Compare pipelined datapath with single-cycle
datapath

Instr     Instr fetch  Register read  ALU op  Memory access  Register write  Total time
lw        200ps        100ps          200ps   200ps          100ps           800ps
sw        200ps        100ps          200ps   200ps                          700ps
R-format  200ps        100ps          200ps                  100ps           600ps
beq       200ps        100ps          200ps                                  500ps
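The totals in the table, and the clock periods they imply, can be checked with a short script. This is a minimal sketch: the stage latencies and the stages used by each instruction come from the table, while the function names and the simple "fill then one per cycle" timing model are assumptions made for illustration.

```python
# Stage latencies in picoseconds, as assumed in the example.
STAGE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}

# Stages each instruction class actually uses.
USES = {
    "lw": ["IF", "ID", "EX", "MEM", "WB"],
    "sw": ["IF", "ID", "EX", "MEM"],
    "R-format": ["IF", "ID", "EX", "WB"],
    "beq": ["IF", "ID", "EX"],
}


def total_time(instr):
    """Time one instruction needs if it only pays for the stages it uses."""
    return sum(STAGE_PS[stage] for stage in USES[instr])


def single_cycle_period():
    """The single-cycle clock must fit the slowest instruction (lw)."""
    return max(total_time(instr) for instr in USES)


def pipelined_period():
    """The pipelined clock must fit the slowest stage."""
    return max(STAGE_PS.values())


def run_time_ps(n_instructions, pipelined):
    """Total time for n instructions, ignoring hazards (an idealization)."""
    if pipelined:
        cycles = len(STAGE_PS) + n_instructions - 1  # fill, then one per cycle
        return cycles * pipelined_period()
    return n_instructions * single_cycle_period()


if __name__ == "__main__":
    for instr in USES:
        print(instr, total_time(instr), "ps")
    print("single-cycle Tc:", single_cycle_period(), "ps")
    print("pipelined Tc:", pipelined_period(), "ps")
```

Note that the 100ps register stages still occupy a full 200ps clock cycle in the pipelined design, which is why the speedup stays below the number of stages.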
Pipeline Performance Example
[Figure: instruction timing, single-cycle (Tc = 800ps) vs. pipelined (Tc = 200ps)]


Pipeline Speedup Example

• If all stages are balanced
– i.e., all take the same time
– Time between instructions (pipelined)
  = Time between instructions (nonpipelined) / Number of stages
• If not balanced, speedup is less
• Speedup due to increased throughput
– Latency (time for each instruction) does not decrease
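Written as a formula, the balanced-stage relation and the resulting ideal speedup look as follows. The symbols T with subscripts are names chosen here for "time between instructions"; the numeric line reuses the 800ps and 200ps clock periods from the performance example, where the stages are not balanced.

```latex
\[
T_{\text{pipelined}} = \frac{T_{\text{nonpipelined}}}{\text{Number of stages}}
\quad\Longrightarrow\quad
\text{Speedup}_{\text{ideal}}
  = \frac{T_{\text{nonpipelined}}}{T_{\text{pipelined}}}
  = \text{Number of stages}
\]
\[
\text{Speedup}_{\text{example}} = \frac{800\,\text{ps}}{200\,\text{ps}} = 4 < 5
\]
```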
Pipelining and ISA Design

• MIPS ISA designed for pipelining
– All instructions are 32 bits
• Easier to fetch and decode in one cycle
• cf. x86: 1- to 17-byte instructions
– Few and regular instruction formats
• Can decode and read registers in one step
– Load/store addressing
• Can calculate address in 3rd stage, access memory in 4th
stage
– Alignment of memory operands
• Memory access takes only one cycle
Graphically Representing MIPS Pipeline

[Figure: one instruction drawn as the resource it uses in each stage: IM, Reg, ALU, DM, Reg]

• Can help with answering questions like:


– How many cycles does it take to execute this code?
– What is the ALU doing during cycle 4?
– Is there a hazard, why does it occur, and how can it be fixed?
Why Pipeline? For Performance!
[Figure: time in clock cycles on the horizontal axis; Inst 0 through Inst 4 in program order, each passing through IM, Reg, ALU, DM, Reg one cycle after the previous instruction; the first cycles are the time to fill the pipeline]
Once the pipeline is full, one instruction is completed every cycle, so CPI = 1
Review from Last Lecture
[Figure: single-cycle, multicycle, and pipelined timing diagrams, including (a) the task-time diagram and (b) the space-time diagram]

Execution time = 1 / Performance = Inst count x CPI x CCT

N = # of stages for pipeline design or ~ maximum number of steps for MC

Design                        Inst count  CPI                                CCT
Single cycle (SC)             1           1                                  1
Multi cycle (MC)              1           N ≥ CPI > 1 (closer to N than 1)   > 1/N
Multi cycle pipelined (MCP)   1           > 1                                > 1/N

CPI_ideal,MCP = N/InstCount + 1 - 1/InstCount
– Large N and/or small InstCount result in worse CPI
– Performance to run one instruction is the same as of CP (i.e., latency for a single instruction is not reduced)

What are the other issues affecting CCT and CPI for MC and MCP?
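The ideal pipelined CPI follows from counting cycles: the first instruction needs N cycles and every later one completes one cycle after its predecessor, so InstCount instructions take N + InstCount - 1 cycles. A minimal sketch, with function names chosen here for illustration and hazards ignored:

```python
def pipeline_cycles(n_stages, inst_count):
    """Cycles to run inst_count instructions on an ideal n_stages pipeline."""
    return n_stages + inst_count - 1


def ideal_cpi(n_stages, inst_count):
    """CPI = cycles / instructions = N/InstCount + 1 - 1/InstCount."""
    return pipeline_cycles(n_stages, inst_count) / inst_count


if __name__ == "__main__":
    for count in (1, 5, 100, 1000):
        print(count, ideal_cpi(5, count))
```

With a single instruction the CPI equals N, and it approaches 1 only as the instruction count grows, which is the "small InstCount results in worse CPI" point.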
Visualizing pipeline - I
[Figure, repeated for cycles 1 through 5: Inst 1 through Inst 5 in program order, with the datapath resources IM, Reg, ALU, DM, Reg drawn for the cycle shown]
One way to visualize the pipeline: a snapshot of what is in the pipeline in a particular cycle
Visualizing pipeline - II
[Figure, built up over several slides: time in cycles 1 to 8 on the horizontal axis; Inst 1 through Inst 5 in program order on the vertical axis, each passing through IM, Reg, ALU, DM, Reg and starting one cycle after the previous instruction]
Hazards
• Situations that prevent starting the next
instruction in the next cycle
• Structure hazards
– A required resource is busy
• Data hazard
– Need to wait for previous instruction to complete
its data read/write
• Control hazard
– Deciding on control action depends on previous
instruction
Structure Hazards

• Conflict for use of a resource


• In MIPS pipeline with a single memory
– Load/store requires data access
– Instruction fetch would have to stall for that cycle
• Would cause a pipeline “bubble”
• Hence, pipelined datapaths require separate
instruction/data memories
– Or separate instruction/data caches
A Single Memory Would Be a Structural Hazard
[Figure: time in clock cycles; lw followed by Inst 1 through Inst 4 in program order, each passing through Mem, Reg, ALU, Mem, Reg; in the same clock cycle, lw is reading data from memory while a later instruction is being read from memory]
• Fix with separate instr and data memories (I$ and D$)
Note that all instructions effectively take 5 cycles, even if an instruction does not use some stages or finishes early. Why?
[Figure: time in clock cycles; Inst 0 through Inst 4 in program order, each passing through IM, Reg, ALU, DM, Reg one cycle apart]
Data Hazards
• An instruction depends on completion of data
access by a previous instruction
– add $s0, $t0, $t1
sub $t2, $s0, $t3
Data Dependencies
Instruction j is said to be data dependent on instruction i if either of the following holds:

1. Instruction i produces a result that may be used by instruction j, or
2. Instruction j is data dependent on instruction k and instruction k is data dependent on instruction i

Typically, satisfying only type 1 data dependencies is sufficient for correct execution of the program, since a type 2 dependency just implies that one instruction depends on another if there exists a chain of type 1 dependencies between the two instructions. A dependency between two instructions will only result in a data hazard if the instructions are close enough together for the simple datapath considered in class. In general, it may also become a hazard in advanced pipelined designs where the processor executes multiple and/or out-of-order instructions.

There are three particular data dependencies:

1. RAW (read after write) – j reads a source after i writes it
2. WAW (write after write) – j writes an operand after it is written by i
3. WAR (write after read) – j writes a destination after it is read by i

Note that RAW is called a “true data dependency” because there is a flow of data between the instructions. WAW and WAR are called “name dependencies”, since the two instructions use the same register or memory location (but there is no flow of data between the instructions).
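The three cases can be told apart by comparing the destination and source registers of two instructions. The sketch below is illustrative only: the `Instr` tuple and the `dependencies` function are hypothetical names, and registers are compared as plain strings with no special handling of `$zero`.

```python
from collections import namedtuple

# dest is the register written (or None); srcs are the registers read.
Instr = namedtuple("Instr", ["dest", "srcs"])


def dependencies(i, j):
    """Return the dependency types of later instruction j on earlier i."""
    found = set()
    if i.dest is not None and i.dest in j.srcs:
        found.add("RAW")  # j reads what i writes: true data dependency
    if i.dest is not None and i.dest == j.dest:
        found.add("WAW")  # both write the same register: name dependency
    if j.dest is not None and j.dest in i.srcs:
        found.add("WAR")  # j overwrites what i reads: name dependency
    return found


if __name__ == "__main__":
    add = Instr("$s0", ("$t0", "$t1"))  # add $s0, $t0, $t1
    sub = Instr("$t2", ("$s0", "$t3"))  # sub $t2, $s0, $t3
    print(dependencies(add, sub))
```

In the simple five-stage pipeline only the RAW case turns into a hazard, because registers are read in ID and written in WB, in program order.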
Register Usage Can Cause Data Hazards
• Dependencies backward in time cause hazards
[Figure: the following instructions in program order, each passing through IM, Reg, ALU, DM, Reg one cycle apart]
add $1,
sub $4,$1,$5
and $6,$1,$7
or $8,$1,$9
xor $4,$1,$5
• Read before write data hazard
Loads Can Cause Data Hazards
• Dependencies backward in time cause hazards
[Figure: the following instructions in program order, each passing through IM, Reg, ALU, DM, Reg one cycle apart]
lw $1,4($2)
sub $4,$1,$5
and $6,$1,$7
or $8,$1,$9
xor $4,$1,$5
• Load-use data hazard
How About Register File Access?
[Figure: time in clock cycles; add $1, followed by Inst 1, Inst 2, and add $2,$1, in program order, each passing through IM, Reg, ALU, DM, Reg; one clock edge controls register writing and the other controls loading of pipeline state registers]
Fix register file access hazard by doing reads in the second half of the cycle and writes in the first half
One Way to “Fix” a Data Hazard
Can fix data hazard by waiting – stall – but impacts CPI
[Figure: add $1, followed by two stall cycles, then sub $4,$1,$5 and and $6,$1,$7, each passing through IM, Reg, ALU, DM, Reg]
How to implement stall?
Forwarding: Another Way to “Fix” a Data Hazard

Fix data hazards by forwarding results as soon as they are available to where they are needed
[Figure: add $1, followed by sub $4,$1,$5, and $6,$1,$7, or $8,$1,$9, and xor $4,$1,$5 in program order, each passing through IM, Reg, ALU, DM, Reg]
Requires extra connection in a datapath!
Forwarding Illustration

[Figure: add $1, followed by sub $4,$1,$5 and and $6,$7,$1 in program order, each passing through IM, Reg, ALU, DM, Reg; forwarding paths labeled EX forwarding and MEM forwarding]
Yet Another Complication!
• Another potential data hazard can occur when there is a conflict between the result of the WB stage instruction and the MEM stage instruction – which should be forwarded?
[Figure: the following instructions in program order, each passing through IM, Reg, ALU, DM, Reg one cycle apart]
add $1,$1,$2
add $1,$1,$3
add $1,$1,$4
Load-Use Data Hazard
• Can’t always avoid stalls by forwarding
– If value not computed when needed
– Can’t forward backward in time!
Code Scheduling to Avoid Stalls
• Reorder code to avoid use of load result in the
next instruction
• C code for A = B + E; C = B + F;

Original order (13 cycles):
        lw  $t1, 0($t0)
        lw  $t2, 4($t0)
stall   add $t3, $t1, $t2
        sw  $t3, 12($t0)
        lw  $t4, 8($t0)
stall   add $t5, $t1, $t4
        sw  $t5, 16($t0)

Reordered (11 cycles):
        lw  $t1, 0($t0)
        lw  $t2, 4($t0)
        lw  $t4, 8($t0)
        add $t3, $t1, $t2
        sw  $t3, 12($t0)
        add $t5, $t1, $t4
        sw  $t5, 16($t0)
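The two cycle counts can be reproduced by counting load-use pairs: a stall is charged whenever an instruction reads the register loaded by the lw immediately before it. This is a rough sketch under stated assumptions: a five-stage pipeline with full forwarding, one stall cycle per load-use pair, and a tiny parser (`parse`, `load_use_stalls`, `total_cycles` are names invented here) that only understands the instruction forms used in this example.

```python
PIPELINE_STAGES = 5

ORIGINAL = [
    "lw $t1, 0($t0)", "lw $t2, 4($t0)", "add $t3, $t1, $t2",
    "sw $t3, 12($t0)", "lw $t4, 8($t0)", "add $t5, $t1, $t4",
    "sw $t5, 16($t0)",
]
REORDERED = [
    "lw $t1, 0($t0)", "lw $t2, 4($t0)", "lw $t4, 8($t0)",
    "add $t3, $t1, $t2", "sw $t3, 12($t0)", "add $t5, $t1, $t4",
    "sw $t5, 16($t0)",
]


def parse(line):
    """Return (opcode, destination register or None, source registers)."""
    op, rest = line.split(None, 1)
    args = [a.strip() for a in rest.split(",")]
    if op in ("lw", "sw"):
        base = args[1][args[1].index("(") + 1:-1]  # "0($t0)" -> "$t0"
        if op == "lw":
            return op, args[0], [base]
        return op, None, [args[0], base]
    return op, args[0], args[1:]


def load_use_stalls(program):
    """Count instructions that use the result of the lw just before them."""
    stalls = 0
    prev = None
    for line in program:
        cur = parse(line)
        if prev is not None and prev[0] == "lw" and prev[1] in cur[2]:
            stalls += 1
        prev = cur
    return stalls


def total_cycles(program):
    """Fill time + one cycle per instruction + one cycle per stall."""
    return (PIPELINE_STAGES - 1) + len(program) + load_use_stalls(program)


if __name__ == "__main__":
    print(total_cycles(ORIGINAL), total_cycles(REORDERED))
```

Moving the third lw up separates each load from its first use by at least one instruction, so both stalls disappear without changing the result.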
MIPS Pipeline Control Path Modifications
• All control signals can be determined during Decode
– and held in the state registers between pipeline stages
[Figure: pipelined datapath with IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers and a Control unit in the decode stage; control signals shown: PCSrc, Branch, RegWrite, ALUSrc, ALUOp, RegDst, MemRead, MemtoReg]
Pipeline Control
• IF Stage: read Instr Memory (always asserted)
and write PC (on System Clock)
• ID Stage: no optional control signals to set

       EX Stage                        MEM Stage                 WB Stage
       RegDst  ALUOp1  ALUOp0  ALUSrc  Brch  MemRead  MemWrite   RegWrite  MemtoReg
R      1       1       0       0       0     0        0          1         0
lw     0       0       0       1       0     1        0          1         1
sw     X       0       0       1       0     0        1          0         X
beq    X       0       1       0       1     0        0          0         X
Datapath with Forwarding Hardware
[Figure: pipelined datapath with a Forward Unit added in the EX stage; its inputs are ID/EX.RegisterRs, ID/EX.RegisterRt, EX/MEM.RegisterRd, and MEM/WB.RegisterRd, and it controls the multiplexers on the ALU inputs]
Data Forwarding Control Conditions
1. EX Forward Unit (forwards the result from the previous instr. to either input of the ALU):
if (EX/MEM.RegWrite
and (EX/MEM.RegisterRd != 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
    ForwardA = 10
if (EX/MEM.RegWrite
and (EX/MEM.RegisterRd != 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
    ForwardB = 10
2. MEM Forward Unit (forwards the result from the previous or second previous instr. to either input of the ALU):
if (MEM/WB.RegWrite
and (MEM/WB.RegisterRd != 0)
and (EX/MEM.RegisterRd != ID/EX.RegisterRs)
and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
    ForwardA = 01
if (MEM/WB.RegWrite
and (MEM/WB.RegisterRd != 0)
and (EX/MEM.RegisterRd != ID/EX.RegisterRt)
and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
    ForwardB = 01
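The forwarding conditions translate almost line for line into code. The sketch below models the slide's conditions as written; the function name, the argument names, and the use of plain integers for register numbers are assumptions made for illustration, and 0b00 stands for "no forwarding, use the register file value".

```python
def forward_control(ex_mem_regwrite, ex_mem_rd,
                    mem_wb_regwrite, mem_wb_rd,
                    id_ex_rs, id_ex_rt):
    """Return (ForwardA, ForwardB) mux selects for the two ALU inputs."""

    def select(src):
        # EX hazard: the instruction one ahead (now in MEM) writes src.
        if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == src:
            return 0b10
        # MEM hazard: the instruction two ahead (now in WB) writes src,
        # and the newer result in EX/MEM does not target the same register.
        if (mem_wb_regwrite and mem_wb_rd != 0
                and ex_mem_rd != src and mem_wb_rd == src):
            return 0b01
        return 0b00

    return select(id_ex_rs), select(id_ex_rt)


if __name__ == "__main__":
    # add $1,$1,$2 ; add $1,$1,$3 ; add $1,$1,$4 with the third add in EX:
    # both older adds write $1, so the newer EX/MEM value must win.
    print(forward_control(True, 1, True, 1, 1, 4))
```

The extra `EX/MEM.RegisterRd != ID/EX.RegisterRs` test in the MEM conditions is what resolves the conflict between the MEM stage and WB stage results: the more recent value is forwarded.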
Load-use Hazard Detection Unit
• Need a Hazard detection Unit in the ID stage that inserts a stall between the load and its use
1. ID Hazard detection Unit:
if (ID/EX.MemRead
and ((ID/EX.RegisterRt = IF/ID.RegisterRs)
or (ID/EX.RegisterRt = IF/ID.RegisterRt)))
    stall the pipeline

• The first line tests to see if the instruction now in the EX stage is a lw; the next two lines check to see if the destination register of the lw matches either source register of the instruction in the ID stage (the load-use instruction)
• After this one cycle stall, the forwarding logic can handle the remaining data hazards
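The detection condition and the control outputs it drives can be sketched as follows. The condition is the one given above; the function names and the dictionary keys (`PC.Write`, `IF/ID.Write`, `zero_ID/EX_control`) are labels picked here to represent freezing the PC and IF/ID registers and inserting a bubble, not signal names confirmed by the slides beyond PC.Write and IF/ID.Write.

```python
def load_use_stall(id_ex_memread, id_ex_rt, if_id_rs, if_id_rt):
    """True when the instruction in EX is a load whose destination (Rt)
    is a source register of the instruction in ID."""
    return bool(id_ex_memread) and id_ex_rt in (if_id_rs, if_id_rt)


def hazard_controls(id_ex_memread, id_ex_rt, if_id_rs, if_id_rt):
    """Control outputs of the hazard unit for one cycle."""
    stall = load_use_stall(id_ex_memread, id_ex_rt, if_id_rs, if_id_rt)
    return {
        "PC.Write": not stall,        # hold the PC during a stall
        "IF/ID.Write": not stall,     # hold the instruction in ID
        "zero_ID/EX_control": stall,  # send a nop (bubble) into EX
    }


if __name__ == "__main__":
    # lw $1, ... in EX and an add that reads $1 in ID: stall one cycle.
    print(hazard_controls(True, 1, 1, 3))
```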
Hazard/Stall Hardware
• Along with the Hazard Unit, we have to implement the stall
• Prevent the instructions in the IF and ID stages from
progressing down the pipeline – done by preventing the PC
register and the IF/ID pipeline register from changing
– Hazard detection Unit controls the writing of the PC
(PC.write) and IF/ID (IF/ID.write) registers
• Insert a “bubble” between the lw instruction (in the EX
stage) and the load-use instruction (in the ID stage) (i.e.,
insert a nop in the execution stream)
– Set the control bits in the EX, MEM, and WB control fields of the
ID/EX pipeline register to 0 (nop). The Hazard Unit controls the
mux that chooses between the real control values and the 0’s.
• Let the lw instruction and the instructions after it in the
pipeline (before it in the code) proceed normally down the
pipeline
Adding the Hazard/Stall Hardware
[Figure: pipelined datapath with a Hazard Unit added in the ID stage; it takes ID/EX.MemRead and ID/EX.RegisterRt as inputs and controls a mux that selects between the Control outputs and 0 for the ID/EX register; the Forward Unit remains in the EX stage]
Visualizing Load-Use Stall
[Figure, built up over several slides: time in cycles 1 to 8; lw $1, add $2, $1, Inst 2, Inst 3, and Inst 4 in program order, each passing through IM, Reg, ALU, DM, Reg]
Can detect the load stall condition in this cycle by looking in the pipeline registers
[Figure, after the stall is inserted: lw $1, nop, add $2, $1, Inst 2, Inst 3 in program order]
Control Hazards
• When the flow of instruction addresses is not sequential (i.e., not PC = PC + 4); incurred by change-of-flow instructions
– Unconditional branches (j, jal, jr)
– Conditional branches (beq, bne)
– Exceptions
• Possible approaches
– Stall (impacts CPI)
– Move decision point as early in the pipeline as possible,
thereby reducing the number of stall cycles
– Delay decision (requires compiler support)
– Predict and hope for the best !
• Control hazards occur less frequently than data hazards,
but there is nothing as effective against control hazards
as forwarding is for data hazards
Datapath Branch and Jump Hardware
[Figure: pipelined datapath with Forward Unit, plus jump hardware: a Jump control signal and a jump target formed from PC+4[31-28] and a shift left 2]
Jumps Incur One Stall
• Jumps not decoded until ID, so one flush is needed
• To flush, set IF.Flush to zero the instruction field of the IF/ID pipeline register (turning it into a nop)
[Figure: j, then a flushed instruction, then the j target, each passing through IM, Reg, ALU, DM, Reg]
Fix jump hazard by waiting – flush
• Fortunately, jumps are very infrequent – only 3% of the SPECint instruction mix
Two “Types” of Stalls
• Nop instruction (or bubble) inserted between two
instructions in the pipeline (as done for load-use
situations)
– Keep the instructions earlier in the pipeline (later in the
code) from progressing down the pipeline for a cycle
(“bounce” them in place with write control signals)
– Insert nop by zeroing control bits in the pipeline register
at the appropriate stage
– Let the instructions later in the pipeline (earlier in the
code) progress normally down the pipeline
• Flushes (or instruction squashing), where an instruction in the pipeline is replaced with a nop instruction (as done for instructions located sequentially after j instructions)
– Zero the control bits for the instruction to be flushed
Supporting ID Stage Jumps
[Figure: pipelined datapath supporting jumps in the ID stage: Jump control signal, jump target built from PC+4[31-28] and a shift left 2, a 0 input at the instruction memory output, and the Forward Unit]
One Way to “Fix” a Branch Control Hazard
Fix branch hazard by waiting – flush – but affects CPI
[Figure: beq, then three flushed instructions, then the beq target and Inst 3, each passing through IM, Reg, ALU, DM, Reg]
Reducing the Delay of Branches
• Move the branch decision hardware back to the EX stage
– Reduces the number of stall (flush) cycles to two
– Adds an and gate and a 2x1 mux to the EX timing path
• Add hardware to compute the branch target address and
evaluate the branch decision to the ID stage
– Reduces the number of stall (flush) cycles to one
(like with jumps)
• But now need to add forwarding hardware in ID stage
– Computing branch target address can be done in parallel with
RegFile read (done for all instructions – only used when needed)
– Comparing the registers can’t be done until after RegFile read, so
comparing and updating the PC adds a mux, a comparator, and an
and gate to the ID timing path
• For deeper pipelines, branch decision points can be even
later in the pipeline, incurring more stalls
ID Branch Forwarding Issues
• MEM/WB “forwarding” is taken care of by the normal RegFile write before read operation
    WB   add3 $1,
    MEM  add2 $3,
    EX   add1 $4,
    ID   beq $1,$2,Loop
    IF   next_seq_instr

• Need to forward from the EX/MEM pipeline stage to the ID comparison hardware for cases like
    WB   add3 $3,
    MEM  add2 $1,
    EX   add1 $4,
    ID   beq $1,$2,Loop
    IF   next_seq_instr

Forwards the result from the second previous instr. to either input of the compare:
if (IDcontrol.Branch
and (EX/MEM.RegisterRd != 0)
and (EX/MEM.RegisterRd = IF/ID.RegisterRs))
    ForwardC = 1
if (IDcontrol.Branch
and (EX/MEM.RegisterRd != 0)
and (EX/MEM.RegisterRd = IF/ID.RegisterRt))
    ForwardD = 1
ID Branch Forwarding Issues, con’t
• If the instruction immediately before the branch produces one of the branch source operands, then a stall needs to be inserted (between the beq and add1) since the EX stage ALU operation is occurring at the same time as the ID stage branch compare operation
    WB   add3 $3,
    MEM  add2 $4,
    EX   add1 $1,
    ID   beq $1,$2,Loop
    IF   next_seq_instr
– “Bounce” the beq (in ID) and next_seq_instr (in IF) in place (ID Hazard Unit deasserts PC.Write and IF/ID.Write)
– Insert a stall between the add in the EX stage and the beq in the ID stage by zeroing the control bits going into the ID/EX pipeline register (done by the ID Hazard Unit)
• If the branch is found to be taken, then flush the instruction currently in IF (IF.Flush)
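The three beq scenarios (no action, forward from EX/MEM, stall) can be combined into one decision function. This is a sketch under assumptions: the ForwardC and ForwardD conditions follow the slide, but the stall test (`id_ex_regwrite` and `id_ex_rd` matching a branch source) is an illustrative formulation, since the slides describe that case only in words, and loads ahead of the branch are not modeled.

```python
def id_branch_hazard(is_branch, branch_rs, branch_rt,
                     id_ex_regwrite, id_ex_rd, ex_mem_rd):
    """Decide how an ID-stage branch compare gets its operands."""
    result = {"stall": False, "ForwardC": 0, "ForwardD": 0}
    if not is_branch:
        return result
    # Producer is still in EX: its ALU result is not ready for the compare.
    if id_ex_regwrite and id_ex_rd != 0 and id_ex_rd in (branch_rs, branch_rt):
        result["stall"] = True
        return result
    # Producer is in MEM: forward from the EX/MEM pipeline register.
    if ex_mem_rd != 0 and ex_mem_rd == branch_rs:
        result["ForwardC"] = 1
    if ex_mem_rd != 0 and ex_mem_rd == branch_rt:
        result["ForwardD"] = 1
    return result


if __name__ == "__main__":
    # beq $1,$2,Loop in ID with add1 $4 in EX and add2 $1 in MEM.
    print(id_branch_hazard(True, 1, 2, True, 4, 1))
```

A producer that has already reached WB needs no entry here, because the register file writes in the first half of the cycle and reads in the second.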
Supporting ID Stage Branches
[Figure: pipelined datapath supporting branches in the ID stage: Branch and PCSrc signals, a Compare unit on the RegFile outputs, the branch target adder (shift left 2 and Add) in ID, IF.Flush, the Hazard Unit, and two Forward Units]
Delayed Branches
• If the branch hardware has been moved to the ID stage, then we can eliminate all branch stalls with delayed branches, which are defined as always executing the next sequential instruction after the branch instruction – the branch takes effect after that next instruction
– MIPS compiler moves an instruction to immediately after the branch that is not affected by the branch (a safe instruction), thereby hiding the branch delay
• With deeper pipelines, the branch delay grows, requiring more than one delay slot
– Delayed branches have lost popularity compared to more expensive but more flexible (dynamic) hardware branch prediction
– Growth in available transistors has made hardware branch prediction relatively cheaper
Scheduling Branch Delay Slots
A. From before branch
    add $1,$2,$3
    if $2=0 then
        delay slot
  becomes
    if $2=0 then
        add $1,$2,$3

B. From branch target
    sub $4,$5,$6
    add $1,$2,$3
    if $1=0 then
        delay slot
  becomes
    add $1,$2,$3
    if $1=0 then
        sub $4,$5,$6

C. From fall through
    add $1,$2,$3
    if $1=0 then
        delay slot
    sub $4,$5,$6
  becomes
    add $1,$2,$3
    if $1=0 then
        sub $4,$5,$6

• A is the best choice, fills delay slot and reduces IC
• In B and C, the sub instruction may need to be copied, increasing IC
• In B and C, must be okay to execute sub when branch fails
Static Branch Prediction
• Resolve branch hazards by assuming a given outcome and
proceeding without waiting to see the actual branch
outcome
1. Predict not taken – always predict branches will not be
taken, continue to fetch from the sequential instruction
stream, only when branch is taken does the pipeline stall
– If taken, flush instructions after the branch (earlier in the
pipeline)
• in IF, ID, and EX stages if branch logic in MEM – three stalls
• In IF and ID stages if branch logic in EX – two stalls
• in IF stage if branch logic in ID – one stall
– ensure that those flushed instructions haven’t changed the
machine state – automatic in the MIPS pipeline since machine
state changing operations are at the tail end of the pipeline
(MemWrite (in MEM) or RegWrite (in WB))
– restart the pipeline at the branch destination
Flushing with Misprediction (Not Taken)

[Figure: the following instructions, each passing through IM, Reg, ALU, DM, Reg; the sub at address 8 is flushed]
 4  beq $1,$2,2
 8  sub $4,$1,$5    (flush)
16  and $6,$1,$7
20  or  $8,$1,$9

• To flush the IF stage instruction, assert IF.Flush to zero the instruction field of the IF/ID pipeline register (transforming it into a nop)
Branching Structures
• Predict not taken works well for “top of the loop” branching structures
    Loop: beq $1,$2,Out
          1st loop instr
          .
          .
          .
          last loop instr
          j Loop
    Out:  fall out instr
– But such loops have jumps at the bottom of the loop to return to the top of the loop – and incur the jump stall overhead

• Predict not taken doesn’t work well for “bottom of the loop” branching structures
    Loop: 1st loop instr
          2nd loop instr
          .
          .
          .
          last loop instr
          bne $1,$2,Loop
          fall out instr
Static Branch Prediction, con’t
• Resolve branch hazards by assuming a given outcome and proceeding
2. Predict taken – predict branches will always be taken
– Predict taken always incurs one stall cycle (if branch destination hardware has been moved to the ID stage)
– Is there a way to “cache” the address of the branch target instruction?
• As the branch penalty increases (for deeper pipelines), a simple static prediction scheme will hurt performance. With more hardware, it is possible to try to predict branch behavior dynamically during program execution
3. Dynamic branch prediction – predict branches at run-time using run-time information
Dynamic Branch Prediction
• A branch prediction buffer (aka branch history table (BHT)) in the IF stage, addressed by the lower bits of the PC, contains bit(s) passed to the ID stage through the IF/ID pipeline register that tell whether the branch was taken the last time it was executed
– Prediction bit may predict incorrectly (may be a wrong prediction for this branch this iteration or may be from a different branch with the same low order PC bits) but this doesn’t affect correctness, just performance
• Branch decision occurs in the ID stage after determining that the fetched instruction is a branch and checking the prediction bit(s)
– If the prediction is wrong, flush the incorrect instruction(s) in pipeline, restart the pipeline with the right instruction, and invert the prediction bit(s)
• A 4096 bit BHT varies from 1% misprediction (nasa7, tomcatv) to 18% (eqntott)
Branch Target Buffer
• The BHT predicts when a branch is taken, but does not tell where it’s taken to!
– A branch target buffer (BTB) in the IF stage caches the branch target address, but we also need to fetch the next sequential instruction. The prediction bit in IF/ID selects which “next” instruction will be loaded into IF/ID at the next clock edge
• Would need a two read port instruction memory
– Or the BTB can cache the branch taken instruction while the instruction memory is fetching the next sequential instruction
[Figure: PC supplying the read address to both the BTB and the Instruction Memory]
• If the prediction is correct, stalls can be avoided no matter which direction they go
1-bit Prediction Accuracy
• A 1-bit predictor will be incorrect twice when not taken
– Assume predict_bit = 0 to start (indicating branch not taken) and loop control is at the bottom of the loop code
    Loop: 1st loop instr
          2nd loop instr
          .
          .
          .
          last loop instr
          bne $1,$2,Loop
          fall out instr
1. First time through the loop, the predictor mispredicts the branch since the branch is taken back to the top of the loop; invert prediction bit (predict_bit = 1)
2. As long as branch is taken (looping), prediction is correct
3. Exiting the loop, the predictor again mispredicts the branch since this time the branch is not taken falling out of the loop; invert prediction bit (predict_bit = 0)
• For 10 times through the loop we have an 80% prediction accuracy for a branch that is taken 90% of the time
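The 80% figure can be reproduced by simulating the prediction bit over the ten branch outcomes of the loop (nine taken, then one not taken). A minimal sketch; the function name and the list encoding of outcomes are choices made here.

```python
def one_bit_accuracy(outcomes, predict_bit=0):
    """Fraction of correct predictions for a 1-bit predictor.

    outcomes is a sequence of booleans (True = branch taken);
    predict_bit starts at 0, meaning "predict not taken".
    """
    correct = 0
    for taken in outcomes:
        if predict_bit == int(taken):
            correct += 1
        else:
            predict_bit = int(taken)  # invert the bit on a misprediction
    return correct / len(outcomes)


if __name__ == "__main__":
    loop = [True] * 9 + [False]  # bne taken 9 times, then falls out
    print(one_bit_accuracy(loop))
```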
2-bit Predictors
• A 2-bit scheme can give 90% accuracy since a prediction must be wrong twice before the prediction bit is changed
    Loop: 1st loop instr
          2nd loop instr
          .
          .
          .
          last loop instr
          bne $1,$2,Loop
          fall out instr
[Figure: four-state FSM with states 11 and 10 (Predict Taken) and 01 and 00 (Predict Not Taken), with transitions labeled Taken and Not taken; annotations for the loop above: right 9 times, wrong on loop fall out, right on 1st iteration]
• BHT also stores the initial FSM state
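The 90% figure can be checked with a small simulation. The sketch below uses a 2-bit saturating counter, which is one common 2-bit scheme and an assumption here: the exact transitions of the FSM drawn on the slide cannot be recovered from the text and may differ. The function name and the choice to start in state 3 (binary 11) are also illustrative.

```python
def two_bit_run(outcomes, state=3):
    """Run a 2-bit saturating counter predictor over branch outcomes.

    States 3 and 2 predict taken, states 1 and 0 predict not taken.
    Returns (number of correct predictions, final state).
    """
    correct = 0
    for taken in outcomes:
        predict_taken = state >= 2
        if predict_taken == taken:
            correct += 1
        # Move one step toward the observed outcome, saturating at 0 and 3.
        state = min(3, state + 1) if taken else max(0, state - 1)
    return correct, state


if __name__ == "__main__":
    loop = [True] * 9 + [False]  # bne taken 9 times, then falls out
    correct, state = two_bit_run(loop)
    print(correct / len(loop), state)
```

After the single misprediction at loop exit the counter still predicts taken, so the first iteration of the next execution of the loop is predicted correctly, unlike with the 1-bit scheme.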
Dealing with Exceptions
• Exceptions (aka interrupts) are just another form of control
hazard. Exceptions arise from
– R-type arithmetic overflow
– Trying to execute an undefined instruction
– An I/O device request
– An OS service request (e.g., a page fault, TLB exception)
– A hardware malfunction
• The pipeline has to stop executing the offending
instruction in midstream, let all prior instructions
complete, flush all following instructions, set a register to
show the cause of the exception, save the address of the
offending instruction, and then jump to a prearranged
address (the address of the exception handler code)
• The software (OS) looks at the cause of the exception and
“deals” with it
Two Types of Exceptions
• Interrupts – asynchronous to program execution
– caused by external events
– may be handled between instructions, so can let the
instructions currently active in the pipeline complete before
passing control to the OS interrupt handler
– simply suspend and resume user program

• Traps (Exception) – synchronous to program execution
– caused by internal events
– condition must be remedied by the trap handler for that instruction, so must stop the offending instruction midstream in the pipeline and pass control to the OS trap handler
– the offending instruction may be retried (or simulated by the OS) and the program may continue or it may be aborted
Where in the Pipeline Exceptions Occur

[Figure: pipeline stages IM, Reg, ALU, DM, Reg]

                          Stage(s)?   Synchronous?
• Arithmetic overflow     EX          yes
• Undefined instruction   ID          yes
• TLB or page fault       IF, MEM     yes
• I/O service request     any         no
• Hardware malfunction    any         no

• Beware that multiple exceptions can occur simultaneously in a single clock cycle
Multiple Simultaneous Exceptions

[Figure: Inst 0 through Inst 4 in program order, each passing through IM, Reg, ALU, DM, Reg; exceptions marked in the same clock cycle: D$ page fault, arithmetic overflow, undefined instruction, I$ page fault]
• Hardware sorts the exceptions so that the earliest instruction is the one interrupted first
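Because older instructions sit further down the pipeline, "earliest instruction first" amounts to picking the pending exception in the latest stage. A minimal sketch of that ordering; the stage list, the dictionary encoding of pending exceptions, and the function name are assumptions made for illustration, not the hardware mechanism itself.

```python
# Pipeline stages from youngest instruction (IF) to oldest (WB).
STAGE_ORDER = ["IF", "ID", "EX", "MEM", "WB"]


def first_to_handle(pending):
    """Pick the exception belonging to the earliest instruction.

    pending maps a stage name to the exception raised there this cycle.
    The instruction in the latest stage is the earliest in program order.
    """
    return max(pending.items(), key=lambda item: STAGE_ORDER.index(item[0]))


if __name__ == "__main__":
    pending = {
        "IF": "I$ page fault",
        "ID": "undefined instruction",
        "EX": "arithmetic overflow",
        "MEM": "D$ page fault",
    }
    print(first_to_handle(pending))
```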
Additions to MIPS to Handle Exceptions (Fig 6.42)
• Cause register (records exceptions) – hardware to
record in Cause the exceptions and a signal to control
writes to it (CauseWrite)
• EPC register (records the addresses of the offending
instructions) – hardware to record in EPC the address
of the offending instruction and a signal to control
writes to it (EPCWrite)
– Exception software must match exception to instruction
• A way to load the PC with the address of the exception
handler
– Expand the PC input mux where the new input is
hardwired to the exception handler address - (e.g., 8000
0180hex for arithmetic overflow)
• A way to flush offending instruction and the ones that
follow it
Datapath with Controls for Exceptions
[Figure: pipelined datapath with exception support: Cause and EPC registers, ID.Flush and EX.Flush signals in addition to IF.Flush, the Hazard Unit, two Forward Units, and 8000 0180hex as an extra input to the PC mux]
Stalling vs. Flushing Example
Stall example:
Inst1: lw $1, 0($2)
Inst2: add $2, $1, $1      <- stall here
Inst3: add $3, $2, $1
Inst4: bne $1, $1, label
Inst5: and $1, $2, $3
Inst6: or $1, $1, $1

Flush example (assuming no delay slot):
Inst1: j Inst4
Inst2: add $2, $1, $1      <- flush here
Inst3: add $3, $2, $1
Inst4: bne $1, $1, label
Inst5: and $1, $2, $3
Inst6: or $1, $1, $1

[Figure: cycle-by-cycle snapshots (cycles 1 to 5) of the pipeline stages IM, Reg, ALU, DM, Reg for both examples; annotations: insert nop (stall example), inst2 replaced by a nop (flush example), forwarding in cycle 5]

Stall example, by cycle:
Cycle    1   2   3   4   5   6   7   8   9   10  11
Instr1   IF  ID  EX  M   W
nop                  EX  M   W
Instr2       IF  ID  ID  EX  M   W
Instr3           IF  IF  ID  EX  M   W
Instr4                   IF  ID  EX  M   W
Instr5                       IF  ID  EX  M   W
Instr6                           IF  ID  EX  M   W

Flush example, by cycle:
Cycle    1   2   3   4   5   6   7   8   9   10  11
Instr1   IF  ID  EX  M   W
Instr2       IF  ID  EX  M   W      (flushed: proceeds as a nop)
Instr3                              (not fetched)
Instr4           IF  ID  EX  M   W
Instr5               IF  ID  EX  M   W
Instr6                   IF  ID  EX  M   W
Pipeline Summary
The BIG Picture

• Pipelining improves performance by increasing instruction throughput
– Executes multiple instructions in parallel
– Each instruction has the same latency
• Subject to hazards
– Structure, data, control
• Instruction set design affects complexity of
pipeline implementation
Other Sample Pipeline Alternatives
• ARM7 (stages IM, Reg, EX)
– IM: PC update, IM access
– Reg: decode, reg access
– EX: ALU op, DM access, shift/rotate, commit result (write back)

• XScale
[Figure: stages IM1, IM2, Reg, SHFT, ALU, DM1, DM2, Reg; stage activities labeled: PC update, BTB access, start IM access; IM access; decode, reg 1 access; shift/rotate, reg 2 access; ALU op; start DM access, exception; DM write, reg write]
Acknowledgments
Some of the slides contain material developed and copyrighted by M.J. Irwin (Penn State), B. Parhami (UCSB), and instructor material for the textbook
