CS 162 Computer Architecture
Lecture 3: Pipelining Contd.
Instructor: L.N. Bhuyan
[Link]/~bhuyan/cs162
1 1999 ©UCB
Single Cycle Datapath (From Ch 5)
[Figure: single-cycle MIPS datapath. Components: PC, instruction memory (Imem), register file (Regs), sign extend, ALU, data memory (Dmem), PC+4 adder, and branch adder with shift left 2. Control signals: PCSrc, RegDst, RegWrite, ALUSrc, ALUOp, ALU control, MemRead, MemWrite, MemToReg. Instruction fields: 25:21, 20:16, 15:11, 15:0.]
Required Changes to Datapath
° Introduce registers to separate the 5 stages
by putting IF/ID, ID/EX, EX/MEM, and
MEM/WB registers in the datapath.
° The next PC value is computed in the 3rd
step, but we need to fetch the next instruction
in the next cycle: move the PCSrc mux to the
1st stage. The PC is incremented unless
there is a new branch address.
° The branch address is computed in the 3rd
stage. With pipelining, the PC value has
changed by then! Must carry the PC value along
with the instruction. Width of IF/ID register =
(IR: 32) + (PC: 32) = 64 bits.
Changes to Datapath Contd.
° For the lw instruction, we need the write
register address at stage 5. But the IR is now
occupied by another instruction! So, we
must carry the IR destination field along as
we move through the stages. See the
connection in the figure.
Length of ID/EX register = (Reg1: 32) +
(Reg2: 32) + (offset: 32) + (PC: 32) +
(destination register: 5) = 133 bits
Assignment: What are the lengths of the
EX/MEM and MEM/WB registers?
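The width arithmetic above can be checked with a short sketch. The field names in the dictionaries are illustrative labels only; the bit counts are the ones given on the slides.

```python
# Field widths in bits, as listed on the slides.
# The dictionary keys are illustrative labels, not signal names from the text.
IF_ID_FIELDS = {"IR": 32, "PC": 32}
ID_EX_FIELDS = {"Reg1": 32, "Reg2": 32, "offset": 32, "PC": 32, "dest_reg": 5}


def register_width(fields):
    """Total width of a pipeline register: the sum of its field widths."""
    return sum(fields.values())


print("IF/ID:", register_width(IF_ID_FIELDS), "bits")
print("ID/EX:", register_width(ID_EX_FIELDS), "bits")
```

The same function can be applied to the assignment once the fields carried by EX/MEM and MEM/WB have been listed.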
Pipelined Datapath (with Pipeline Regs) (6.2)
[Figure: pipelined datapath with stages Fetch, Decode, Execute, Memory, Write Back, separated by the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers.]
Pipeline register widths: IF/ID = 64 bits, ID/EX = 133 bits, EX/MEM = 102 bits, MEM/WB = 69 bits
Pipelined Control (6.3)
• Start with single-cycle controller
• Group control lines by pipeline stage needed
• Extend pipeline registers with control bits
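[Figure: the Control unit produces the EX, Mem, and WB control groups, which are carried along in the ID/EX, EX/MEM, and MEM/WB pipeline registers.]
Control lines by stage:
  EX: RegDst, ALUOp, ALUSrc
  Mem: Branch, MemRead, MemWrite
  WB: MemToReg, RegWrite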
Pipelined Processor: Datapath + Control
• More work to correctly handle pipeline hazards
[Figure: pipelined datapath with control. The EX, M, and WB control groups travel through the ID/EX, EX/MEM, and MEM/WB registers. Labeled signals: PCSrc, RegWrite, Branch, MemWrite, MemRead, MemToReg, ALUSrc, ALUOp, RegDst. Instruction [15–0] feeds the sign extend and ALU control; Instruction [20–16] and Instruction [15–11] feed the RegDst mux.]
Recap
° If we can keep all pipeline stages busy,
we can retire (complete) up to one
instruction per clock cycle (thereby
achieving single-cycle throughput)
° The pipeline paradox (for MIPS): any
instruction still takes 5 cycles to
execute (even though we can retire one
instruction per cycle)
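The difference between latency and throughput in the recap can be made concrete with a small sketch. It assumes an ideal 5-stage pipeline with no stalls; the function name is hypothetical.

```python
def pipelined_cycles(n_instructions, depth=5):
    """Cycles to finish n instructions on an ideal (stall-free) pipeline."""
    if n_instructions <= 0:
        return 0
    # The first instruction needs `depth` cycles; every later instruction
    # completes one cycle after its predecessor.
    return depth + (n_instructions - 1)


for n in (1, 10, 1000):
    cycles = pipelined_cycles(n)
    # Each instruction still takes 5 cycles, but cycles per instruction
    # approaches 1 as n grows.
    print(n, cycles, cycles / n)
```

Each instruction still spends 5 cycles in the pipeline, yet the average cycles per instruction approaches 1 for long instruction streams.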
Problems for Pipelining
° Hazards prevent next instruction from
executing during its designated clock
cycle, limiting speedup
• Structural hazards: HW cannot support
this combination of instructions (single
memory for instruction and data)
• Data hazards: Instruction depends on
result of prior instruction still in the
pipeline
• Control hazards: conditional branches &
other instructions may stall the pipeline
delaying later instructions
Single Memory is a Structural Hazard
[Figure: pipeline diagram with time (clock cycles) across and instruction order down: Load, Instr 1, Instr 2, Instr 3, Instr 4, each passing through M, Reg, ALU, M, Reg. The data access of Load and the instruction fetch of Instr 3 need the memory in the same cycle.]
• Can’t read same memory twice in same clock cycle
EX: MIPS multicycle datapath: Structural Hazard in Memory
[Figure: multicycle datapath with a single memory for instructions and data. Components: PC, Memory (Address, Data), Instruction Register, Memory Data Register, Registers (Read Reg1, Read Reg2, Write Reg, Write Data, Read data 1, Read data 2), A and B registers, ALU, ALUOut.]
Structural Hazards limit performance
° Example: if there are 1.3 memory accesses per
instruction (30% of instructions executed
are loads and stores)
and only one memory access per cycle,
then
• Average CPI ≥ 1.3
• Otherwise the datapath resource is more than
100% utilized
Structural Hazard Solution: Add more Hardware
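As a rough check of the bound above, the sketch below treats the memory port as the only shared resource and assumes an ideal CPI of 1. The function and parameter names are hypothetical.

```python
def cpi_lower_bound(mem_accesses_per_instr, accesses_per_cycle=1, ideal_cpi=1.0):
    """Lower bound on average CPI when memory bandwidth is the bottleneck.

    Assumption: the memory port is the only contended resource.
    """
    return max(ideal_cpi, mem_accesses_per_instr / accesses_per_cycle)


# One fetch per instruction plus 0.3 data accesses, single memory port.
print(cpi_lower_bound(1.3))
# With two accesses per cycle (more hardware), the bound drops back to the ideal CPI.
print(cpi_lower_bound(1.3, accesses_per_cycle=2))
```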
Speed Up Equation for Pipelining
$$\text{CPI}_{\text{pipelined}} = \text{Ideal CPI} + \text{Pipeline stall clock cycles per instruction}$$

$$\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$

With Ideal CPI = 1:

$$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$
Example: Dual-port vs. Single-port
° Machine A: Dual ported memory
° Machine B: Single ported memory, but its pipelined implementation
has a 1.05 times faster clock rate
° Ideal CPI = 1 for both
° Loads are 40% of instructions executed
SpeedUpA = Pipeline Depth / (1 + 0) x (clockunpipe / clockpipe)
         = Pipeline Depth
SpeedUpB = Pipeline Depth / (1 + 0.4 x 1)
           x (clockunpipe / (clockunpipe / 1.05))
         = (Pipeline Depth / 1.4) x 1.05
         = 0.75 x Pipeline Depth
SpeedUpA / SpeedUpB = Pipeline Depth / (0.75 x Pipeline Depth) = 1.33
° Machine A is 1.33 times faster
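The comparison can be reproduced numerically. This is a minimal sketch of the speedup formula with Ideal CPI = 1; the pipeline depth of 5 is an assumption made only to get concrete numbers, and it cancels in the ratio.

```python
def speedup(depth, stall_cpi, clock_ratio=1.0):
    """Pipeline speedup with ideal CPI = 1.

    clock_ratio is Clock Cycle(unpipelined) / Clock Cycle(pipelined).
    """
    return depth / (1.0 + stall_cpi) * clock_ratio


DEPTH = 5  # assumed for illustration; the ratio A/B does not depend on it

speedup_a = speedup(DEPTH, stall_cpi=0.0)                        # dual-ported memory
speedup_b = speedup(DEPTH, stall_cpi=0.4 * 1, clock_ratio=1.05)  # single port, faster clock

print(round(speedup_a / speedup_b, 2))
```

Each load on Machine B costs one stall cycle, so the 40% load frequency adds 0.4 to the CPI, which outweighs the 5% clock advantage.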
Data Hazard on Register $1 (6.4)
add $1, $2, $3
sub $4, $1, $3
and $6, $1, $7
or $8, $1, $9
xor $10, $1, $11
Data Hazard Solution:
• “Forward” result from one stage to another
[Figure: pipeline diagram with time (clock cycles) across and instruction order down, stages IF, ID/RF, EX, MEM, WB (IM, Reg, ALU, DM, Reg) for:
add $1,$2,$3
sub $4,$1,$3
and $6,$1,$7
or $8,$1,$9
xor $10,$1,$11]
• “or” OK if implement register file properly
Hazard Detection for Forwarding
° A hazard must be detected just before execution so that,
in case of a hazard, the data can be forwarded to the
input of the ALU.
° It can be detected when a source register (Rs or Rt or
both) of the instruction in the EX stage is equal to the
destination register (Rd) of an instruction further along
in the pipeline (in either the MEM or WB stage).
° Compare the Rs and Rt register numbers in the ID/EX
register with Rd in the EX/MEM and MEM/WB registers =>
Need to carry the Rs, Rt, and Rd values from the IF/ID
register to the ID/EX register (only Rd was carried before).
° If they match, forward the data to the input of the ALU
through the multiplexor.
See Fig. 6.43, p. 488 of the text
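The comparison described above can be sketched for one ALU input as follows. The function name and return labels are hypothetical. The RegWrite check and the exclusion of register $0 are assumptions beyond the slide text, added so that instructions that do not write a register never trigger forwarding.

```python
def forward_source(src, ex_mem_rd, ex_mem_regwrite, mem_wb_rd, mem_wb_regwrite):
    """Pick where one ALU operand should come from.

    src        : Rs or Rt register number held in ID/EX
    ex_mem_rd  : destination register number held in EX/MEM
    mem_wb_rd  : destination register number held in MEM/WB
    *_regwrite : assumed control bit saying that instruction writes a register
    """
    # The most recent producer (in the MEM stage) takes priority.
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == src:
        return "EX/MEM"
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == src:
        return "MEM/WB"
    return "REGFILE"


# sub $4,$1,$3 in EX while add $1,$2,$3 is in MEM: $1 comes from EX/MEM.
print(forward_source(1, ex_mem_rd=1, ex_mem_regwrite=True,
                     mem_wb_rd=0, mem_wb_regwrite=False))
# and $6,$1,$7 in EX: sub ($4) is in MEM, add ($1) is in WB.
print(forward_source(1, ex_mem_rd=4, ex_mem_regwrite=True,
                     mem_wb_rd=1, mem_wb_regwrite=True))
```

The function is called once for Rs and once for Rt; its result plays the role of the select input of the multiplexor in front of each ALU operand.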
Forwarding: What about Loads?
• Dependencies backward in time are hazards
[Figure: pipeline diagram with stages IF, ID/RF, EX, MEM, WB for lw $1,0($2) followed by sub $4,$1,$3. The loaded value leaves the DM stage of lw only after the EX stage of sub has begun.]
• Can’t solve with forwarding alone
• Must stall instruction dependent on load
• “Load-Use” hazard
Data Hazard Even with Forwarding
• Must stall pipeline 1 cycle (insert 1 bubble)
[Figure: pipeline diagram with time (clock cycles) across, stages IF, ID/RF, EX, MEM, WB for:
lw $1, 0($2)
sub $4,$1,$6
and $6,$1,$7
or $8,$1,$9
A bubble delays sub, and the and and or instructions behind it, by one cycle.]
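A load-use stall condition could be sketched as below. The slides do not spell out the detection logic, so this follows the usual textbook formulation as an assumption: stall when the instruction in EX is a load whose destination (its Rt field) matches a source register of the instruction in ID. All names are hypothetical.

```python
def must_stall(id_ex_memread, id_ex_rt, if_id_rs, if_id_rt):
    """Return True when the instruction in ID must wait one cycle.

    id_ex_memread      : True if the instruction in EX is a load (assumed control bit)
    id_ex_rt           : register the load will write
    if_id_rs, if_id_rt : source registers of the instruction in ID
    """
    return bool(id_ex_memread and id_ex_rt in (if_id_rs, if_id_rt))


# lw $1, 0($2) in EX, sub $4,$1,$6 in ID: one bubble is needed.
print(must_stall(True, 1, 1, 6))
# The same register match after an ALU instruction needs no stall (forwarding suffices).
print(must_stall(False, 1, 1, 6))
```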
Compiler Schemes to Improve Load Delay
° The compiler detects the data dependency and inserts
nop instructions until the data is available
sub $2, $1, $3
nop
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
° Alternatively, the compiler finds independent instructions to
fill the delay slots
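The nop-insertion scheme can be sketched as a single pass over the instruction list. The tuple format (text, destination, sources) and the `delay` parameter are assumptions for illustration; how many instructions must separate a producer from its consumer depends on the hardware, and the value 1 is chosen here only because it reproduces the listing above.

```python
def insert_nops(program, delay=1):
    """Insert nops so that a consumer is at least `delay` instructions
    behind the instruction that produces its operand.

    program: list of (text, dest, sources), with register numbers as ints.
    """
    out = []  # emitted (text, dest) pairs
    for text, dest, sources in program:
        needed = 0
        # Look back over the last `delay` emitted instructions.
        for back in range(1, delay + 1):
            if back > len(out):
                break
            prev_dest = out[-back][1]
            if prev_dest is not None and prev_dest in sources:
                needed = max(needed, delay - back + 1)
        out.extend([("nop", None)] * needed)
        out.append((text, dest))
    return [text for text, _ in out]


PROGRAM = [
    ("sub $2, $1, $3", 2, (1, 3)),
    ("and $12, $2, $5", 12, (2, 5)),
    ("or $13, $6, $2", 13, (6, 2)),
    ("add $14, $2, $2", 14, (2, 2)),
    ("sw $15, 100($2)", None, (15, 2)),
]

for line in insert_nops(PROGRAM):
    print(line)
```

Filling the slot with an independent instruction instead of a nop keeps the same spacing without wasting the cycle.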
Software Scheduling to Avoid Load Hazards
Try producing fast code for
a = b + c;
d = e - f;
assuming a, b, c, d, e, and f are in memory.
Slow code:           Fast code:
LW  Rb,b             LW  Rb,b
LW  Rc,c             LW  Rc,c
ADD Ra,Rb,Rc         LW  Re,e
SW  a,Ra             ADD Ra,Rb,Rc
LW  Re,e             LW  Rf,f
LW  Rf,f             SW  a,Ra
SUB Rd,Re,Rf         SUB Rd,Re,Rf
SW  d,Rd             SW  d,Rd
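The benefit of the reordering can be counted with a short sketch. It assumes a pipeline with forwarding, where the only stall is one cycle when a load is immediately followed by an instruction that reads the loaded register. The tuple encoding (opcode, destination, sources) is hypothetical.

```python
def count_load_use_stalls(program):
    """Count load-use stalls: an LW immediately followed by a reader of its result."""
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        op, dest, _ = prev
        if op == "LW" and dest in curr[2]:
            stalls += 1
    return stalls


SLOW = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("ADD", "Ra", ("Rb", "Rc")),
    ("SW", None, ("Ra",)), ("LW", "Re", ()), ("LW", "Rf", ()),
    ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]
FAST = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("LW", "Re", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("LW", "Rf", ()), ("SW", None, ("Ra",)),
    ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]

print("slow:", count_load_use_stalls(SLOW))
print("fast:", count_load_use_stalls(FAST))
```

Under these assumptions the slow sequence stalls after LW Rc,c and after LW Rf,f, while the fast sequence always has an independent instruction between each load and its first use.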