CSE 313
Enhancing Performance
with Pipelining
Pipeline motivation
Need both low CPI and high frequency for
best performance
Want a multicycle for high frequency, but need better
CPI
Idea behind pipelining is to have a
multicycle implementation that operates
like a factory assembly line
Each “worker” in the pipeline performs a
particular task, hands off to the next
“worker”, while getting new work
Pipeline motivation
Tasks should take about the same time – if
one “worker” is much slower than the rest,
then other “workers” will stand idle
Once the assembly line is full, a new
“product” (instruction) comes out of the
back-end of the line each time period
In a computer assembly line (pipeline),
each task is called a stage and the time
period is one clock cycle
MIPS 5-stage pipeline
Like single cycle datapath but with
registers separating each stage
MIPS 5-stage pipeline
5 stages for each instruction
IF:instruction fetch
ID: instruction decode and register file read
EX: instruction execution or effective address calculation
MEM: memory access for load and store
WB: write back results to register file
Delays of all 5 stages are relatively the same
Staging registers are used to hold data and
control as instructions pass between stages
All instructions pass through all 5 stages
As an instruction leaves a stage in a particular
clock period, the next instruction enters it
Pipeline operation for lw
Stage 1: Instruction fetch
Pipeline operation for lw
Stage 2: Instruction decode and register file
read
What happens to the instruction info in IF/ID?
Pipeline operation for lw
Stage 3: Effective address calculation
Pipeline operation for lw
Stage 4: Memory access
Pipeline operation for lw
Stage 5: Write back
Instruction info in IF/ID is gone – won’t work
Modified pipeline with write back
fix
Write register bits from the instruction
must be carried through the pipeline with
the instruction
Pipeline operation for lw
Pipeline usage in each stage for lw
Pipeline operation for sw
Stage 3: Effective address calculation
Pipeline operation for sw
Stage 4: Memory access
Pipeline operation for sw
Stage 5: Write back (nothing)
Pipeline operation for lw, sub
sequence
Pipeline operation for lw, sub
sequence
Pipeline operation for lw, sub
sequence
Pipeline operation for lw, sub
sequence
Pipeline operation for lw, sub
sequence
Pipeline operation for lw, sub
sequence
Graphical pipeline representation
Represent overlap of pipelined instructions
as multiple pipelines skewed by a cycle
Another useful shorthand form
Pipeline control
Basic pipeline control is similar to the
single cycle implementation
Pipeline control
Control for an instruction is generated in
ID and travels with the instruction and
data through the pipeline
When an instruction enters a stage, it’s
control signals set the operation of that
stage
Pipeline control
Multiple instruction example
For the following code fragment
lw $10, 20($1)
sub $11, $2, $3
and $12, $4, $5
or $13, $6, $7
add $14, $8, $9
show the datapath and control usage as the
instruction sequence travels down the
pipeline
Multiple instruction example
Multiple instruction example
Multiple instruction example
Multiple instruction example
Multiple instruction example
Multiple instruction example
Multiple instruction example
Multiple instruction example
Multiple instruction example
How the MIPS ISA simplifies
pipelining
Fixed length instruction simplifies
Fetch – just get the next 32 bits
Decode – single step; don’t have to decode opcode
before figuring out where to get the rest of the fields
Source register fields always in same
location
Can read source registers during decode
Load/store architecture
ALU can be used for both arithmetic and EA calculation
Memory instruction require about same amount of work
as arithmetic ones, easing pipelining of the two together
Memory data must be aligned
Read or write accesses can be done in one cycle
Pipeline hazards
A hazard is a conflict, regarding data, control,
or hardware resources
Data hazards are conflicts for register values
Control hazards occur due to the delay to
execute branch and jump instruction
Structural hazards are conflicts for hardware
resources, such as
A single memory for instructions and data
A multi-cycle, non-pipelined functional unit (such as a
divider)
Data dependences
A read after write (RAW) dependence occurs
when the register written by an instruction is a
source register of a subsequent instruction
lw $10, 20($1)
sub $11, $10, $3
and $12, $4, $11
or $13, $11, $4
add $14, $13, $9
Also have write after read (WAR) and write
after write (WAW) data dependences (later)
Pipelining and RAW dependences
RAW dependences that are close by may
cause data hazards in the pipeline
Consider the following code sequence:
sub $2, $1, $3
and $12, $2, $6
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
What are the RAW dependences?
Pipelining and RAW dependences
Data hazards with first three instructions
hazard
hazard
ok
ok
Forwarding
Most RAW hazards can be eliminated by
forwarding results between pipe stages
at this point, result of sub is available
Detecting forwarding
Rd of the instruction in MEM or WB must
match Rs and/or Rt of the instruction in EX
The instruction in MEM or WB must have
RegWrite=1 (why?)
Rd must not be $0 (why?)
Detecting forwarding from MEM to
EX
To the upper ALU input (ALUupper)
EX/[Link] =1
EX/[Link] not equal 0
EX/[Link] = ID/[Link]
To the lower ALU input (ALUlower)
EX/[Link] =1
EX/[Link] not equal 0
EX/[Link] = ID/[Link]
Forwarding datapaths
Bypass paths feed data from MEM and WB back
to MUXes at the EX ALU inputs
Do we still have to write the register file in WB?
Detecting forwarding from WB to
EX
To the upper ALU input
MEM/[Link] =1
MEM/[Link] not equal 0
MEM/[Link] = ID/[Link]
The value is not being forwarded from MEM (why?)
To the lower ALU input
MEM/[Link] =1
MEM/[Link] not equal 0
MEM/[Link] = ID/[Link]
The value is not being forwarded from MEM
Forwarding control
Control is handled by the forwarding unit
Forwarding example
Show forwarding for the code sequence:
sub $2, $1, $3
and $4, $2, $5
or $4, $4, $2
add $9, $4, $2
Forwarding example
sub produces result in EX
Forwarding example
sub forwards result from MEM to ALUupper
Forwarding example
sub forwards result from WB to ALUlower
and forwards result from MEM to ALUupper
Forwarding example
or forwards result from MEM to ALUupper
RAW hazards involving loads
Loads produce results in MEM – can’t forward
to an immediately following R-type instruction
Called a load-use hazard
RAW hazards involving loads
Solution: stall the stages behind the load
for one cycle, after which the result can be
forwarded
Detecting load-use hazards
Instruction in EX is a load
ID/[Link] =1
Instruction in ID has a source register that
matches the load destination register
ID/[Link]= IF/[Link] OR
ID/[Link] = IF/[Link]
Stalling the stages behind the
load
Force nop (“no operation”) instruction into
EX stage on next clock cycle
Force ID/[Link] input to zero
Force ID/[Link] input to zero
Hold instructions in ID and IF stages for
one clock cycle
Hold the contents of PC
Hold the contents of IF/ID
Control for load-use hazards
Control is handled by the hazard detection
unit
Load-use stall example
Code sequence:
lw $2, 20($1)
and $4, $2, $5
or $4, $4, $2
add $9, $4, $2
Load-use stall example
lw enters ID
Load-use stall example
Load-use hazard detected
Load-use stall example
Force nop into EX and hold ID and IF stages
Load-use stall example
lw result in WB forwarded to and in EX
or reads operand $2 from register file
Load-use stall example
Pipeline advances normally
Control hazards
Taken branches and jumps change the PC
to the target address from which the next
instruction is to be fetched
In our pipeline, the PC is changed when the
taken beq instruction is in the MEM stage
This creates a control hazard in which
sequential instructions in earlier stages
must be discarded
beq instruction that is taken
instri+3 instri+2 instri+1 beq $2,$3,7
instri+1 , instri+2 , instri+3 must be discarded
beq instruction that is taken
In this example, the branch delay is three
Reducing the branch delay
Reducing the branch delay reduces the
number of instructions that have to be
discarded on a taken branch
We can reduce the branch delay to one for
beq by moving both the equality test and
the branch target address calculation into
ID
We need to insert a nop between the beq
and the correctly fetched instruction
Reducing the branch delay
beq with one branch delay
Register equality test done in ID by a
exclusive ORing the register values and
NORing the result
Instruction in ID forced to nop by zeroing
the IF/ID register
Next fetched instruction will be from PC+4
or branch target depending on the beq
outcome
beq with one branch delay
beq in ID; next sequential instruction (and)
in IF
beq with one branch delay
bubble in ID; lw (from taken address) in IF