Chapter 4 The Processor
Instruction count
4.1 Introduction
A simplified version
A more realistic pipelined version
Instruction Execution
Arithmetic result
Memory address for load/store
Branch target address
CPU Overview
Multiplexers
Use multiplexers
Control
Combinational element
Operate on data
Output is a function of input
Store information
Combinational Elements
AND-gate: Y = A & B
Adder: Y = A + B
Multiplexer: Y = S ? I1 : I0
Arithmetic/Logic Unit: Y = F(A, B)
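The four combinational elements above can be modeled as pure functions, since each output depends only on the current inputs. This is a minimal sketch; the function names and the 32-bit default width are assumptions made for illustration.

```python
def and_gate(a: int, b: int) -> int:
    """Y = A & B (bitwise, so it also works on multi-bit buses)."""
    return a & b


def adder(a: int, b: int, width: int = 32) -> int:
    """Y = A + B, truncated to `width` bits like a fixed-width hardware adder."""
    return (a + b) & ((1 << width) - 1)


def mux(s: int, i0: int, i1: int) -> int:
    """Y = S ? I1 : I0."""
    return i1 if s else i0


def alu(f, a: int, b: int) -> int:
    """Y = F(A, B), where F is the operation selected by the control input."""
    return f(a, b)


if __name__ == "__main__":
    print(and_gate(0b1100, 0b1010))
    print(adder(2, 3))
    print(mux(1, 5, 9))
    print(alu(adder, 2, 3))
```

Because none of these functions keeps any state, calling one twice with the same inputs always gives the same output, which is the defining property of a combinational element.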
Sequential Elements
[Figure: register with inputs D and Clk and output Q]
Sequential Elements
[Figure: register with inputs D, Write, and Clk and output Q]
Clocking Methodology
Datapath
Building a Datapath
Instruction Fetch
32-bit register
Increment by 4 for next instruction
R-Format Instructions
Load/Store Instructions
Branch Instructions
Sign-extend displacement
Shift left 2 places (word displacement)
Add to PC + 4
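The three steps listed above (sign-extend, shift left 2, add to PC + 4) can be sketched as follows. The function names are hypothetical, and the code assumes a 16-bit displacement field and a 32-bit PC.

```python
def sign_extend16(value: int) -> int:
    """Interpret the low 16 bits of `value` as a two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_target(pc: int, displacement16: int) -> int:
    """Branch target = (PC + 4) + (sign-extended displacement << 2)."""
    offset = sign_extend16(displacement16) << 2  # word displacement -> byte offset
    return (pc + 4 + offset) & 0xFFFFFFFF        # wrap to 32 bits


if __name__ == "__main__":
    print(hex(branch_target(0x1000, 3)))       # forward branch
    print(hex(branch_target(0x1000, 0xFFFF)))  # displacement of -1 word
```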
Branch Instructions
Just re-routes wires
Sign-bit wire replicated
R-Type/Load/Store Datapath
Full Datapath
Load/Store: F = add
Branch: F = subtract
R-type: F depends on funct field
ALU control  Function
0000         AND
0001         OR
0010         add
0110         subtract
0111         set-on-less-than
1100         NOR
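As an illustration of the control-code table above, here is a small software model of an ALU that selects its function from the 4-bit control value. The name `alu_execute` and the 32-bit operand width are assumptions; the codes themselves are the ones listed in the table.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(x: int) -> int:
    """Reinterpret a 32-bit pattern as a signed two's-complement value."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def alu_execute(control: int, a: int, b: int) -> int:
    """Apply the function selected by the 4-bit ALU control code."""
    a &= MASK32
    b &= MASK32
    if control == 0b0000:
        result = a & b                      # AND
    elif control == 0b0001:
        result = a | b                      # OR
    elif control == 0b0010:
        result = a + b                      # add
    elif control == 0b0110:
        result = a - b                      # subtract
    elif control == 0b0111:
        result = 1 if to_signed32(a) < to_signed32(b) else 0  # set-on-less-than
    elif control == 0b1100:
        result = ~(a | b)                   # NOR
    else:
        raise ValueError(f"unsupported ALU control code: {control:04b}")
    return result & MASK32                  # keep the result to 32 bits
```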
ALU Control
opcode  ALUOp  Operation         funct   ALU function      ALU control
lw      00     load word         XXXXXX  add               0010
sw      00     store word        XXXXXX  add               0010
beq     01     branch equal      XXXXXX  subtract          0110
R-type  10     add               100000  add               0010
               subtract          100010  subtract          0110
               AND               100100  AND               0000
               OR                100101  OR                0001
               set-on-less-than  101010  set-on-less-than  0111
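The ALU control table can be read as a two-level decode: ALUOp alone decides the operation for loads, stores, and branches, and only R-type instructions consult the funct field. A minimal sketch, with `alu_control` and `FUNCT_TO_CONTROL` as hypothetical names:

```python
# funct field -> 4-bit ALU control, for R-type instructions (ALUOp = 10)
FUNCT_TO_CONTROL = {
    0b100000: 0b0010,  # add
    0b100010: 0b0110,  # subtract
    0b100100: 0b0000,  # AND
    0b100101: 0b0001,  # OR
    0b101010: 0b0111,  # set-on-less-than
}


def alu_control(alu_op: int, funct: int = 0) -> int:
    """Derive the 4-bit ALU control from the 2-bit ALUOp and 6-bit funct."""
    if alu_op == 0b00:        # lw / sw: funct is a don't-care, ALU adds
        return 0b0010
    if alu_op == 0b01:        # beq: funct is a don't-care, ALU subtracts
        return 0b0110
    if alu_op == 0b10:        # R-type: operation depends on funct
        try:
            return FUNCT_TO_CONTROL[funct & 0x3F]
        except KeyError:
            raise ValueError(f"funct {funct:06b} is not in the table") from None
    raise ValueError(f"unsupported ALUOp: {alu_op:02b}")
```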
Format      31:26     25:21  20:16  15:11  10:6   5:0
R-type      0         rs     rt     rd     shamt  funct
Load/Store  35 or 43  rs     rt     address (15:0)
Branch      4         rs     rt     address (15:0)

opcode: bits 31:26
rs: always read
rt: read, except for load
rd (R-type) or rt (load): write for R-type and load
address: sign-extend and add
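Because the fields sit at fixed bit positions in every format, they can all be extracted with shifts and masks before the opcode is even known. The sketch below assumes a 32-bit instruction word; `decode_fields` and `encode_i_type` are hypothetical helper names.

```python
def decode_fields(instr: int) -> dict:
    """Slice a 32-bit instruction word into the fields shown above."""
    return {
        "opcode": (instr >> 26) & 0x3F,   # bits 31:26
        "rs": (instr >> 21) & 0x1F,       # bits 25:21
        "rt": (instr >> 16) & 0x1F,       # bits 20:16
        "rd": (instr >> 11) & 0x1F,       # bits 15:11 (R-type only)
        "shamt": (instr >> 6) & 0x1F,     # bits 10:6  (R-type only)
        "funct": instr & 0x3F,            # bits 5:0   (R-type only)
        "address": instr & 0xFFFF,        # bits 15:0  (load/store/branch)
    }


def encode_i_type(opcode: int, rs: int, rt: int, address: int) -> int:
    """Pack a load/store/branch word; used here only to build test inputs."""
    return (opcode << 26) | (rs << 21) | (rt << 16) | (address & 0xFFFF)


if __name__ == "__main__":
    word = encode_i_type(35, 19, 8, 32)   # opcode 35 is a load
    print(decode_fields(word))
```

Which of the extracted fields is meaningful depends on the opcode: for a load, the register to write is named by rt, while for an R-type instruction it is named by rd.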
R-Type Instruction
Load Instruction
Branch-on-Equal Instruction
Implementing Jumps
Format  31:26  25:0
Jump    2      address
Performance Issues
Pipelining Analogy
Four loads: Speedup = 8/3.5 = 2.3
Non-stop: Speedup = 2n/(0.5n + 1.5) ≈ 4 = number of stages
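The two speedup figures can be checked numerically. The sketch assumes what the formulas imply: four stages of 0.5 hours each, so n loads take 2n hours sequentially and 0.5n + 1.5 hours when pipelined. The function name is hypothetical.

```python
def laundry_speedup(n_loads: int, stage_hours: float = 0.5, n_stages: int = 4) -> float:
    """Speedup of pipelined over sequential execution for n_loads tasks."""
    sequential = n_loads * n_stages * stage_hours                    # 2n hours
    pipelined = (n_stages - 1) * stage_hours + n_loads * stage_hours  # 1.5 + 0.5n hours
    return sequential / pipelined


if __name__ == "__main__":
    print(round(laundry_speedup(4), 1))      # four loads
    print(round(laundry_speedup(10**6), 3))  # approaches the number of stages
```

The fill time of 1.5 hours is a fixed cost, so the speedup only approaches the number of stages as n grows.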
MIPS Pipeline
Pipeline Performance
Instr     Instr fetch  Register read  ALU op  Memory access  Register write  Total time
lw        200ps        100ps          200ps   200ps          100ps           800ps
sw        200ps        100ps          200ps   200ps                          700ps
R-format  200ps        100ps          200ps                  100ps           600ps
beq       200ps        100ps          200ps                                  500ps
Pipeline Performance
Single-cycle (Tc = 800ps)
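The totals in the table, and the 800ps single-cycle clock, follow directly from the per-stage latencies. In the sketch below the stage names IF/ID/EX/MEM/WB stand in for the table columns, and taking the pipelined clock period as the slowest stage is an assumption made for illustration.

```python
# Per-stage latency in picoseconds, in table-column order.
STAGE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}

# Stages each instruction class actually uses.
STAGES_USED = {
    "lw": ["IF", "ID", "EX", "MEM", "WB"],
    "sw": ["IF", "ID", "EX", "MEM"],
    "R-format": ["IF", "ID", "EX", "WB"],
    "beq": ["IF", "ID", "EX"],
}


def total_time_ps(instr: str) -> int:
    """Sum the latencies of the stages this instruction class uses."""
    return sum(STAGE_PS[s] for s in STAGES_USED[instr])


# Single-cycle design: the clock must fit the slowest instruction.
single_cycle_tc = max(total_time_ps(i) for i in STAGES_USED)

# Pipelined design (assumption): the clock must fit the slowest stage.
pipelined_tc = max(STAGE_PS.values())

if __name__ == "__main__":
    for name in STAGES_USED:
        print(name, total_time_ps(name))
    print(single_cycle_tc, pipelined_tc)
```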
Pipeline Speedup
Load/store addressing
Hazards
Data hazard
Control hazard
Structure Hazards
Data Hazards
[Figure: add followed by sub]

Original order:
        lw   $t1, 0($t0)
        lw   $t2, 4($t0)
stall
        add  $t3, $t1, $t2
        sw   $t3, 12($t0)
        lw   $t4, 8($t0)
stall
        add  $t5, $t1, $t4
        sw   $t5, 16($t0)
13 cycles

Reordered:
        lw   $t1, 0($t0)
        lw   $t2, 4($t0)
        lw   $t4, 8($t0)
        add  $t3, $t1, $t2
        sw   $t3, 12($t0)
        add  $t5, $t1, $t4
        sw   $t5, 16($t0)
11 cycles
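The 13-cycle and 11-cycle counts can be reproduced with a small model. It assumes a 5-stage pipeline with full forwarding, where the only stall is a one-cycle bubble when an instruction uses a register loaded by the lw immediately before it. The tuple encoding and function name are illustrative, not part of the original material.

```python
# Each instruction: (mnemonic, destination register or None, source registers)
ORIGINAL = [
    ("lw", "$t1", ["$t0"]),
    ("lw", "$t2", ["$t0"]),
    ("add", "$t3", ["$t1", "$t2"]),
    ("sw", None, ["$t3", "$t0"]),
    ("lw", "$t4", ["$t0"]),
    ("add", "$t5", ["$t1", "$t4"]),
    ("sw", None, ["$t5", "$t0"]),
]

REORDERED = [
    ("lw", "$t1", ["$t0"]),
    ("lw", "$t2", ["$t0"]),
    ("lw", "$t4", ["$t0"]),
    ("add", "$t3", ["$t1", "$t2"]),
    ("sw", None, ["$t3", "$t0"]),
    ("add", "$t5", ["$t1", "$t4"]),
    ("sw", None, ["$t5", "$t0"]),
]


def count_cycles(program, n_stages: int = 5) -> int:
    """Cycles to finish `program`, counting one stall per load-use pair."""
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, curr_sources = curr
        if prev_op == "lw" and prev_dest in curr_sources:
            stalls += 1  # load result is not ready in time, even with forwarding
    # First instruction needs n_stages cycles; each later one adds one cycle.
    return len(program) + (n_stages - 1) + stalls


if __name__ == "__main__":
    print(count_cycles(ORIGINAL), count_cycles(REORDERED))
```

Moving the third lw up separates each load from its first use, which removes both stalls without changing the result of the program.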
Control Hazards
In MIPS pipeline
Stall on Branch
Branch Prediction
In MIPS pipeline
Prediction incorrect
Pipeline Summary
The BIG Picture
Subject to hazards
MEM, WB: right-to-left flow leads to hazards
Pipeline registers
Pipeline Operation
EX for Load
WB for Load
Wrong register number
EX for Store
WB for Store
Traditional form
Pipelined Control
As in single-cycle implementation
Pipelined Control
sub $2, $1, $3
and $12, $2, $5
or  $13, $6, $2
add $14, $2, $2
sw  $15, 100($2)
ID/EX.RegisterRs, ID/EX.RegisterRt
Fwd from EX/MEM pipeline reg
Fwd from MEM/WB pipeline reg
EX/MEM.RegWrite, MEM/WB.RegWrite
EX/MEM.RegisterRd ≠ 0,
MEM/WB.RegisterRd ≠ 0
Forwarding Paths
Forwarding Conditions
EX hazard
MEM hazard
MEM hazard
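The forwarding conditions can be sketched as one selection function, called once per ALU operand (ID/EX.RegisterRs and ID/EX.RegisterRt). The return encoding (10 = forward from EX/MEM, 01 = forward from MEM/WB, 00 = use the register file value) and the function name are assumptions for illustration. Checking the EX hazard first means the MEM hazard only applies when there is no EX hazard for the same register.

```python
def forward_select(ex_mem_regwrite: bool, ex_mem_rd: int,
                   mem_wb_regwrite: bool, mem_wb_rd: int,
                   id_ex_reg: int) -> int:
    """Choose the source for one ALU operand read as register `id_ex_reg`."""
    # EX hazard: the instruction one ahead writes the register we need.
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == id_ex_reg:
        return 0b10
    # MEM hazard: the instruction two ahead writes it, and no EX hazard applies.
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == id_ex_reg:
        return 0b01
    # No hazard: use the value read from the register file.
    return 0b00
```

The `!= 0` checks matter because register 0 always reads as zero, so a result nominally written to it must never be forwarded.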
Need to stall for one cycle
IF/ID.RegisterRs, IF/ID.RegisterRt
ID/EX.MemRead and
((ID/EX.RegisterRt = IF/ID.RegisterRs) or
(ID/EX.RegisterRt = IF/ID.RegisterRt))
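The load-use condition above translates directly into a predicate. This is a minimal sketch with a hypothetical function name; the arguments mirror the pipeline register fields named in the condition.

```python
def load_use_stall(id_ex_memread: bool, id_ex_rt: int,
                   if_id_rs: int, if_id_rt: int) -> bool:
    """True when the instruction in ID must stall behind a load in EX."""
    return bool(id_ex_memread and
                (id_ex_rt == if_id_rs or id_ex_rt == if_id_rt))


if __name__ == "__main__":
    # A load into register 2 in EX, followed by an instruction reading 2 and 5.
    print(load_use_stall(True, 2, 2, 5))
    # Same registers, but the instruction in EX is not a load.
    print(load_use_stall(False, 2, 2, 5))
```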
Stall inserted here
Or, more accurately
Branch Hazards
Flush these instructions (set control values to 0)
        sub  $10, $4, $8
        beq  $1, $3, 7
        and  $12, $2, $5
        or   $13, $2, $6
        add  $14, $4, $2
        slt  $15, $6, $7
        ...
        lw   $4, 50($7)
[Figures: pipeline diagrams showing instructions passing through IF, ID, EX, MEM, WB. In the last diagram, lw $1, addr is followed by beq $1, $0, target, and the beq is stalled for two cycles.]
inner:  ...
        beq ..., ..., inner
        ...
        beq ..., ..., outer
2-Bit Predictor
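One common way to realize a 2-bit predictor is a saturating counter, so that the prediction only flips after two consecutive mispredictions. The sketch below assumes that design; the state encoding (0 and 1 predict not taken, 2 and 3 predict taken) and all names are illustrative choices, not taken from the slides.

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0..3, predict taken when state >= 2."""

    def __init__(self, state: int = 3):
        self.state = state

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool) -> None:
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_mispredictions(outcomes, state: int = 3) -> int:
    """Run the predictor over a sequence of branch outcomes."""
    predictor = TwoBitPredictor(state)
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


if __name__ == "__main__":
    # An inner-loop branch: taken three times, then falls through, repeated.
    inner_loop = [True, True, True, False] * 3
    print(count_mispredictions(inner_loop))
```

For the inner-loop pattern, the single not-taken outcome at loop exit only moves the counter from 3 to 2, so the first iteration of the next pass is still predicted taken.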
Exception
Interrupt
4.9 Exceptions
Handling Exceptions
An Alternate Mechanism
Vectored Interrupts
Example:
C000 0000
C000 0020
C000 0040
Instructions either
Handler Actions
Otherwise
Terminate program
Report error using EPC, cause,
Exceptions in a Pipeline
Exception Properties
Restartable exceptions
Exception Example
Exception on add in
40        sub  $11, $2, $4
44        and  $12, $2, $5
48        or   $13, $2, $6
4C        add  $1, $2, $1
50        slt  $15, $6, $7
54        lw   $16, 50($7)

Handler
80000180  sw   $25, 1000($0)
80000184  sw   $26, 1004($0)
Multiple Exceptions
In complex pipelines
Imprecise Exceptions
Deeper pipeline
Multiple issue
Multiple Issue
Speculation
Speculate on load
Compiler/Hardware Speculation
Static speculation
Dynamic speculation
Two-issue packets
Address  Instruction type  Pipeline Stages
n        ALU/branch        IF  ID  EX  MEM WB
n + 4    Load/store        IF  ID  EX  MEM WB
n + 8    ALU/branch            IF  ID  EX  MEM WB
n + 12   Load/store            IF  ID  EX  MEM WB
n + 16   ALU/branch                IF  ID  EX  MEM WB
n + 20   Load/store                IF  ID  EX  MEM WB
Load-use hazard
Scheduling Example
Loop: lw   $t0, 0($s1)       # $t0=array element
      addu $t0, $t0, $s2     # add scalar in $s2
      sw   $t0, 0($s1)       # store result
      addi $s1, $s1, -4      # decrement pointer
      bne  $s1, $zero, Loop  # branch $s1!=0

      ALU/branch            Load/store       cycle
Loop: nop                   lw $t0, 0($s1)   1
      addi $s1, $s1, -4     nop              2
      addu $t0, $t0, $s2    nop              3
      bne $s1, $zero, Loop  sw $t0, 4($s1)   4
Loop Unrolling
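      ALU/branch            Load/store       cycle
Loop: addi $s1, $s1, -16    lw $t0, 0($s1)   1
      nop                   lw $t1, 12($s1)  2
      addu $t0, $t0, $s2    lw $t2, 8($s1)   3
      addu $t1, $t1, $s2    lw $t3, 4($s1)   4
      addu $t2, $t2, $s2    sw $t0, 16($s1)  5
      addu $t3, $t3, $s2    sw $t1, 12($s1)  6
      nop                   sw $t2, 8($s1)   7
      bne $s1, $zero, Loop  sw $t3, 4($s1)   8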
Superscalar processors
CPU decides whether to issue 0, 1, 2, ... each cycle
Example
lw   $t0, 20($s2)
addu $t1, $t0, $t2
sub  $s4, $s4, $t3
slti $t5, $s4, 20
Can start sub while addu is waiting for lw
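Why sub can start early becomes visible once the read-after-write dependences of the four instructions are listed. The sketch below is an illustrative analysis only, not a model of any real issue logic; the tuple encoding and function name are assumptions.

```python
# Each instruction: (mnemonic, destination register, source registers)
PROGRAM = [
    ("lw", "$t0", ["$s2"]),
    ("addu", "$t1", ["$t0", "$t2"]),
    ("sub", "$s4", ["$s4", "$t3"]),
    ("slti", "$t5", ["$s4"]),
]


def raw_dependences(program) -> dict:
    """Map each instruction index to the earlier instructions it must wait for."""
    deps = {}
    last_writer = {}  # register name -> index of the most recent writer
    for i, (_, dest, sources) in enumerate(program):
        deps[i] = sorted({last_writer[s] for s in sources if s in last_writer})
        last_writer[dest] = i
    return deps


if __name__ == "__main__":
    for index, waits_on in raw_dependences(PROGRAM).items():
        print(PROGRAM[index][0], "waits on", [PROGRAM[j][0] for j in waits_on])
```

In this example addu depends on lw, and slti depends on sub, but sub depends on neither lw nor addu, so nothing forces it to wait for the load.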
Hold pending operands
Can supply operands for issued instructions
Register Renaming
Speculation
Load speculation
Power Efficiency
Microprocessor  Year  Clock Rate  Pipeline Stages  Issue width  Out-of-order/Speculation  Cores  Power
i486            1989  25MHz                                     No                               5W
Pentium         1993  66MHz                                     No                               10W
Pentium Pro     1997  200MHz      10                            Yes                              29W
P4 Willamette   2001  2000MHz     22                            Yes                              75W
P4 Prescott     2004  3600MHz     31                            Yes                              103W
Core            2006  2930MHz     14                            Yes                              75W
UltraSparc III  2003  1950MHz     14                            No                               90W
UltraSparc T1   2005  1200MHz                                   No                               70W
72 physical registers
FP is 5 stages longer
Up to 106 RISC-ops in progress
Bottlenecks
Fallacies
Pitfalls
Concluding Remarks