Computer Architecture
1
What is Computer Architecture?
[Diagram: layers of abstraction from Application down to Physics]
6
Computer Architecture is Constantly Changing
[Layer stack: Application, Algorithm, Programming Language, Operating System/Virtual Machines, Instruction Set Architecture, Microarchitecture, Register-Transfer Level, Gates, Circuits, Devices, Physics]
Application Requirements:
• Suggest how to improve architecture
• Provide revenue to fund development
Technology Constraints:
• Restrict what can be done efficiently
• New technologies make new architectures possible
Architecture provides feedback to guide application and technology research directions
10
Computers Then…
Relays
[from Kurzweil]
Electromechanical
13
Sequential Processor Performance
[Plot: single-processor performance over time; rapid growth in the RISC era, then the move to multi-processor]
From Hennessy and Patterson Ed. 5. Image Copyright © 2011, Elsevier Inc. All rights Reserved.
16
Course Structure
• Recommended Readings
• In-Lecture Questions
• Problem Sets
– Very useful for exam preparation
– Peer Evaluation
• Midterm
• Final Exam
17
Course Content: Computer Organization (ELE 375)
• Basic Pipelined Processor
~50,000 Transistors
Photo of Berkeley RISC I, © University of California (Berkeley)
18
Course Content: Computer Architecture (ELE 475)
(builds on Computer Organization, ELE 375)
• Instruction Level Parallelism
– Superscalar
– Very Long Instruction Word (VLIW)
• Long Pipelines (Pipeline Parallelism)
• Advanced Memory and Caches
• Data Level Parallelism
– Vector
– GPU
• Thread Level Parallelism
– Multithreading
– Multiprocessor
– Multicore
– Manycore
~700,000,000 Transistors
Intel Nehalem Processor, Original Core i7, Image Credit Intel:
https://download.intel.com/pressroom/kits/corei7/images/Nehalem_Die_Shot_3.jpg
22
Architecture vs. Microarchitecture
“Architecture”/Instruction Set Architecture:
• Programmer visible state (Memory & Register)
• Operations (Instructions and how they work)
• Execution Semantics (interrupts)
• Input/Output
• Data Types/Sizes
Microarchitecture/Organization:
• Tradeoffs on how to implement ISA for some metric
(Speed, Energy, Cost)
• Examples: Pipeline depth, number of pipelines, cache
size, silicon area, peak power, execution ordering, bus
widths, ALU widths
23
Software Developments
up to 1955 Libraries of numerical routines
- Floating point operations
- Transcendental functions
- Matrix manipulation, equation solvers, . . .
29
IBM 360: A General-Purpose Register
(GPR) Machine
• Processor State
– 16 General-Purpose 32-bit Registers
• may be used as index and base register
• Register 0 has some special properties
– 4 Floating Point 64-bit Registers
– A Program Status Word (PSW)
• PC, Condition codes, Control flags
• A 32-bit machine with 24-bit addresses
– But no instruction contains a 24-bit address!
• Data Formats
– 8-bit bytes, 16-bit half-words, 32-bit words, 64-bit double-words
35
Image Credit: AMD
Different Architecture
Different Microarchitecture
AMD Phenom X4 IBM POWER7
• X86 Instruction Set • Power Instruction Set
• Quad Core • Eight Core
• 125W • 200W
• Decode 3 Instructions/Cycle/Core • Decode 6 Instructions/Cycle/Core
• 64KB L1 I Cache, 64KB L1 D Cache • 32KB L1 I Cache, 32KB L1 D Cache
• 512KB L2 Cache • 256KB L2 Cache
• Out-of-order • Out-of-order
• 2.6GHz • 4.25GHz
[Diagram: four machine models (Stack, Accumulator, Register-Memory, Register-Register); each shows a processor with an ALU connected to memory, the stack machine marking the top of stack (TOS)]
Number of explicitly named operands: 0 (Stack), 1 (Accumulator), 2 or 3 (Register-Memory), 2 or 3 (Register-Register)
46
Stack-Based Instruction Set Architecture (ISA)
[Diagram: stack machine with top-of-stack (TOS) pointer, processor/ALU, and memory]
• Burroughs B5000 (1960)
• Burroughs B6700
• HP 3000
• ICL 2900
• Symbolics 3600
Modern:
• Inmos Transputer
• Forth machines
• Java Virtual Machine
• Intel x87 Floating Point Unit
47
Evaluation of Expressions
(a + b * c) / (a + d * c - e)
[Expression tree for the formula above]
Reverse Polish form: a b c * + a d c * + e - /
[Evaluation proceeds on a stack: operands are pushed, operators apply to the top of the stack (b*c, then a + b*c, ...)]
61
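Not in the original slides: a minimal Python sketch of evaluating the reverse Polish string above on an explicit stack, the way a stack machine would. The variable values in `env` are made-up examples.

```python
def eval_rpn(tokens, env):
    """Evaluate a reverse Polish expression using a pushdown stack."""
    stack = []
    ops = {'+': lambda x, y: x + y, '-': lambda x, y: x - y,
           '*': lambda x, y: x * y, '/': lambda x, y: x / y}
    for tok in tokens:
        if tok in ops:
            y = stack.pop()          # right operand is on top of the stack
            x = stack.pop()          # left operand is underneath
            stack.append(ops[tok](x, y))
        else:
            stack.append(env[tok])   # "push a" loads the variable onto the stack
    return stack.pop()

env = {'a': 1.0, 'b': 2.0, 'c': 3.0, 'd': 4.0, 'e': 5.0}
# (a + b*c) / (a + d*c - e) = 7 / 8
print(eval_rpn("a b c * + a d c * + e - /".split(), env))  # 0.875
```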
Hardware organization of the stack
• Stack is part of the processor state
  ⇒ stack must be bounded and small, on the order of the number of registers, not the size of main memory
62
Stack Operations and
Implicit Memory References
• Suppose the top 2 elements of the stack are kept in registers and the rest is kept in memory.
  Then each push operation ⇒ 1 memory reference, and each pop operation ⇒ 1 memory reference. No good!
• Better performance by keeping the top N elements in registers, and making memory references only when the register stack overflows or underflows.
  Issue: when to load/unload registers?
65
Stack Size and Expression Evaluation
a b c * + a d c * + e - /
program    stack (size = 4)
push a     R0
push b     R0 R1
push c     R0 R1 R2
*          R0 R1
+          R0
push a     R0 R1
push d     R0 R1 R2
push c     R0 R1 R2 R3
*          R0 R1 R2
+          R0 R1
push e     R0 R1 R2
-          R0 R1
/          R0
a and c are "loaded" twice: not the best use of registers!
69
Machine Model Summary
[Diagram: Stack, Accumulator, Register-Memory, and Register-Register machines, each with an ALU and memory; the stack machine marks the top of stack (TOS)]
C = A + B on each model:
Stack:              Push A; Push B; Add; Pop C
Accumulator:        Load A; Add B; Store C
Register-Memory:    Load R1, A; Add R3, R1, B; Store R3, C
Register-Register:  Load R1, A; Load R2, B; Add R3, R1, R2; Store R3, C
72
Classes of Instructions
• Data Transfer
– LD, ST, MFC1, MTC1, MFC0, MTC0
• ALU
– ADD, SUB, AND, OR, XOR, MUL, DIV, SLT, LUI
• Control Flow
– BEQZ, JR, JAL, TRAP, ERET
• Floating Point
– ADD.D, SUB.S, MUL.D, C.LT.D, CVT.S.W,
• Multimedia (SIMD)
– ADD.PS, SUB.PS, MUL.PS, C.LT.PS
• String
– REP MOVSB (x86)
73
Addressing Modes: How to Get Operands from Memory
Addressing Mode   Instruction               Function
Register          Add R4, R3, R2            Regs[R4] <- Regs[R3] + Regs[R2] **
Displacement      Add R4, R3, 100(R1)       Regs[R4] <- Regs[R3] + Mem[100 + Regs[R1]]
Scaled            Add R4, R3, 100(R1)[R5]   Regs[R4] <- Regs[R3] + Mem[100 + Regs[R1] + Regs[R5] * 4]
** May not actually access memory!
74
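Not in the original slides: a minimal Python sketch of the three addressing modes in the table, modeled on toy Regs/Mem arrays. The register numbers and memory contents are made-up example values.

```python
Regs = {1: 8, 2: 5, 3: 7, 4: 0, 5: 2}
Mem = {100 + 8: 40,             # used by the displacement example
       100 + 8 + 2 * 4: 50}     # used by the scaled example

# Register:      Add R4, R3, R2   (no memory access)
Regs[4] = Regs[3] + Regs[2]

# Displacement:  Add R4, R3, 100(R1)
Regs[4] = Regs[3] + Mem[100 + Regs[1]]

# Scaled:        Add R4, R3, 100(R1)[R5]
Regs[4] = Regs[3] + Mem[100 + Regs[1] + Regs[5] * 4]

print(Regs[4])   # 57 after the scaled example
```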
Data Types and Sizes
• Types
– Binary Integer
– Binary Coded Decimal (BCD)
– Floating Point
• IEEE 754
• Cray Floating Point
• Intel Extended Precision (80-bit)
– Packed Vector Data
– Addresses
• Width
– Binary Integer (8-bit, 16-bit, 32-bit, 64-bit)
– Floating Point (32-bit, 40-bit, 64-bit, 80-bit)
– Addresses (16-bit, 24-bit, 32-bit, 48-bit, 64-bit)
75
ISA Encoding
Fixed Width: Every Instruction has same width
• Easy to decode
(RISC Architectures: MIPS, PowerPC, SPARC, ARM…)
Ex: MIPS, every instruction 4-bytes
Variable Length: Instructions can vary in width
• Takes less space in memory and caches
(CISC Architectures: IBM 360, x86, Motorola 68k, VAX…)
Ex: x86, instructions 1-byte up to 17-bytes
Mostly Fixed or Compressed:
• Ex: MIPS16, Thumb (only two formats, 2 and 4 bytes)
• PowerPC and some VLIWs (store instructions compressed, decompressed into the instruction cache)
(Very) Long Instruction Word:
• Multiple instructions in a fixed-width bundle
• Ex: Multiflow, HP/ST Lx, TI C6000
77
x86 (IA-32) Instruction Encoding
[Format diagram: up to four prefixes (1 byte each), opcode (1, 2, or 3 bytes), 1 byte (if needed), 1 byte (if needed), displacement (0, 1, 2, or 4 bytes), immediate (0, 1, 2, or 4 bytes)]
78
MIPS64 Instruction Encoding
79
Image Copyright © 2011, Elsevier Inc. All rights Reserved.
Real World Instruction Sets
[Table comparing architectures by type, number of operands, number of memory operands, data size, number of registers, address size, and use]
83
Computer Architecture Lecture 1
84
Computer Architecture
ELE 475 / COS 475
Slide Deck 2: Microcode and
Pipelining Review
David Wentzlaff
Department of Electrical Engineering
Princeton University
1
Agenda
• Microcoded Microarchitectures
• Pipeline Review
– Pipelining Basics
– Structural Hazards
– Data Hazards
– Control Hazards
2
What Happens When the Processor is
Too Large?
• Time Multiplex Resources!
5
Microcontrol Unit (Maurice Wilkes, 1954)
First used in EDSAC-2, completed 1958
[Diagram: the op code and a conditional flip-flop index into Matrix A (control signals) and Matrix B (next state address)]
Datapath
[Diagram: bus-based datapath: IR, 32 GPRs (RegSel, RegWrt, enReg), immediate extender (ImmSel, enImm), ALU with control (enALU), PC, memory address register (MA), and memory (MemWrt, enMem), all attached to a single 32-bit bus]
Microinstruction: register to register transfer (17 control signals)
8
Agenda
• Microcoded Microarchitectures
• Pipeline Review
– Pipelining Basics
– Structural Hazards
– Data Hazards
– Control Hazards
9
An Ideal Pipeline
[Diagram: four pipeline stages (1, 2, 3, 4) connected in sequence]
[Diagram: unpipelined MIPS datapath: PC, instruction memory, GPRs, immediate extender, ALU with ALU control, and data memory]
13
Pipelined Datapath
[Diagram: the datapath divided into phases: instruction fetch, decode & register-fetch, execute, memory, and write-back, with pipeline registers (e.g., IR) between them]
Clock period can be reduced by dividing the execution of an instruction into multiple cycles:
tC > max {tIM, tRF, tALU, tDM, tRW}  (= tDM probably)
14
Pipelined Control
[Diagram: the same pipelined datapath with a hardwired controller driving the control signals in each phase: fetch, decode & register-fetch, execute, memory, write-back]
Clock period can be reduced by dividing the execution of an instruction into multiple cycles:
tC > max {tIM, tRF, tALU, tDM, tRW}  (= tDM probably)
However, CPI will increase unless instructions are pipelined
15-18
“Iron Law” of Processor Performance
Time/Program = (Instructions/Program) × (Cycles/Instruction) × (Time/Cycle)
– Instructions per program depends on source code, compiler technology, and ISA
– Cycles per instruction (CPI) depends upon the ISA and the microarchitecture
– Time per cycle depends upon the microarchitecture and the base technology
Microarchitecture                  CPI   Cycle time
Microcoded                         >1    short
Single-cycle unpipelined           1     long
Pipelined                          1     short
Multi-cycle, unpipelined control   >1    short
21
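Not in the original slides: a minimal Python sketch applying the Iron Law to two hypothetical designs. The instruction counts, CPIs, and clock periods are made-up numbers for illustration.

```python
def exec_time(instructions, cpi, cycle_time_ns):
    """Time/Program = (Instructions/Program) * (Cycles/Instruction) * (Time/Cycle)."""
    return instructions * cpi * cycle_time_ns * 1e-9   # seconds

# A single-cycle unpipelined design: CPI = 1, but a long clock cycle.
t_single = exec_time(instructions=1_000_000, cpi=1.0, cycle_time_ns=10.0)

# A pipelined design: CPI slightly above 1 because of hazards, but a short cycle.
t_pipe = exec_time(instructions=1_000_000, cpi=1.3, cycle_time_ns=2.0)

print(f"single-cycle: {t_single*1e3:.2f} ms, pipelined: {t_pipe*1e3:.2f} ms")
print(f"speedup: {t_single / t_pipe:.2f}x")   # about 3.85x with these numbers
```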
CPI Examples
Microcoded machine: instructions take a variable number of cycles (e.g., Inst 1: 7 cycles, Inst 2: 5 cycles, Inst 3: 10 cycles)
Unpipelined machine: each instruction completes all phases (fetch, decode & register-fetch, execute, memory, write-back) before the next one starts
Pipelined machine: instructions overlap, one phase apart:
time          t0  t1  t2  t3  t4  t5  t6  t7 ....
instruction1  IF1 ID1 EX1 MA1 WB1
instruction2      IF2 ID2 EX2 MA2 WB2
instruction3          IF3 ID3 EX3 MA3 WB3
instruction4              IF4 ID4 EX4 MA4 WB4
instruction5                  IF5 ID5 EX5 MA5 WB5
25
Pipeline Diagrams: Space vs. Time
[Diagram: 5-stage pipelined datapath with hardwired controller; phases: fetch, decode & register-fetch, execute, memory, write-back]
Resources vs. time:
time  t0  t1  t2  t3  t4  t5  t6  t7 ....
IF    I1  I2  I3  I4  I5
ID        I1  I2  I3  I4  I5
EX            I1  I2  I3  I4  I5
MA                I1  I2  I3  I4  I5
WB                    I1  I2  I3  I4  I5
26
Instructions Interact With Each Other
in Pipeline
• Structural Hazard: An instruction in the
pipeline needs a resource being used by
another instruction in the pipeline
• Data Hazard: An instruction depends on a
data value produced by an earlier instruction
• Control Hazard: Whether or not an instruction
should be executed depends on a control
decision made by an earlier instruction
27
Agenda
• Microcoded Microarchitectures
• Pipeline Review
– Pipelining Basics
– Structural Hazards
– Data Hazards
– Control Hazards
28
Overview of Structural Hazards
• Structural hazards occur when two instructions need
the same hardware resource at the same time
• Approaches to resolving structural hazards
– Schedule: Programmer explicitly avoids scheduling
instructions that would create structural hazards
– Stall: Hardware includes control logic that stalls until
earlier instruction is no longer using contended resource
– Duplicate: Add more hardware to design so that each
instruction can access independent resources at the same
time
• Simple 5-stage MIPS pipeline has no structural hazards
specifically because ISA was designed that way
29
Example Structural Hazard: Unified Memory
[Diagram: 5-stage pipeline (IF ID EX MEM WB) with separate instruction and data memories: no structural hazard]
30
Example Structural Hazard: Unified Memory
[Diagram: the same pipeline with a single unified memory serving both instruction fetch and data access, so IF and MEM contend for it]
31
Example Structural Hazard: 2-Cycle Memory
[Diagram: the same pipeline where memory access takes two cycles (stages M0 and M1), so back-to-back memory accesses contend for it]
32
Agenda
• Microcoded Microarchitectures
• Pipeline Review
– Pipelining Basics
– Structural Hazards
– Data Hazards
– Control Hazards
33
Overview of Data Hazards
• Data hazards occur when one instruction depends on a
data value produced by a preceding instruction still in
the pipeline
• Approaches to resolving data hazards
– Schedule: Programmer explicitly avoids scheduling
instructions that would create data hazards
– Stall: Hardware includes control logic that freezes earlier
stages until preceding instruction has finished producing
data value
– Bypass: Hardware datapath allows values to be sent to an
earlier stage before preceding instruction has left the
pipeline
– Speculate: Guess that there is not a problem, if incorrect
kill speculative instruction and restart
34
Example Data Hazard
[Diagram: 5-stage pipeline; "r4 <- r1 ..." sits in decode while "r1 <- ..." is still in a later stage, so the register file read returns a stale r1]
...
r1 <- r0 + 10   (ADDI R1, R0, #10)
r4 <- r1 + 17   (ADDI R4, R1, #17)   r1 is stale. Oops!
...
35
Feedback to Resolve Hazards
[Diagram: the same pipeline with a feedback (interlock) path; a nop is muxed into the decode-stage IR while the dependent instruction waits]
...
r1 <- r0 + 10
r4 <- r1 + 17
...
37
Stalled Stages and Pipeline Bubbles
time
t0 t1 t2 t3 t4 t5 t6 t7 ....
(I1) r1 <- (r0) + 10 IF1 ID1 EX1 MA1 WB1
(I2) r4 <- (r1) + 17 IF2 ID2 ID2 ID2 ID2 EX2 MA2 WB2
(I3) IF3 IF3 IF3 IF3 ID3 EX3 MA3 WB3
(I4) stalled stages IF4 ID4 EX4 MA4 WB4
(I5) IF5 ID5 EX5 MA5 WB5
time
t0 t1 t2 t3 t4 t5 t6 t7 ....
IF I1 I2 I3 I3 I3 I3 I4 I5
ID I1 I2 I2 I2 I2 I3 I4 I5
Resource
EX I1 nop nop nop I2 I3 I4 I5
Usage
MA I1 nop nop nop I2 I3 I4 I5
WB I1 nop nop nop I2 I3 I4 I5
[Diagram: pipeline datapath with the stall/interlock logic; Cdest computes the destination register of the instruction in each later stage so it can be compared against the source registers of the instruction in decode]
I-type: op rs rt immediate16
J-type: op immediate26
                                             source(s)   destination
ALU    rd <- (rs) func (rt)                  rs, rt      rd
ALUi   rt <- (rs) op immediate               rs          rt
LW     rt <- M[(rs) + immediate]             rs          rt
SW     M[(rs) + immediate] <- (rt)           rs, rt
BZ     cond (rs)
       true:  PC <- (PC) + immediate         rs
       false: PC <- (PC) + 4                 rs
J      PC <- (PC) + immediate
JAL    r31 <- (PC), PC <- (PC) + immediate               31
JR     PC <- (rs)                            rs
JALR   r31 <- (PC), PC <- (rs)               rs          31
41
Deriving the Stall Signal
Cdest:
  ws = Case opcode
         ALU          => rd
         ALUi, LW     => rt
         JAL, JALR    => R31
  we = Case opcode
         ALU, ALUi, LW => on (if ws != 0)
         JAL, JALR     => on
         ...           => off
Cre:
  re1 = Case opcode
          ALU, ALUi, LW, SW, BZ, JR, JALR => on
          J, JAL                          => off
  re2 = Case opcode
          ALU, SW => on
          ...     => off
Cstall:
  stall = ((rsD == wsE)·weE + (rsD == wsM)·weM + (rsD == wsW)·weW) · re1D
        + ((rtD == wsE)·weE + (rtD == wsM)·weM + (rtD == wsW)·weW) · re2D
42
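Not in the original slides: a minimal Python sketch of the stall equation above over a toy pipeline state (no bypassing). Encoding instructions as (opcode, rs, rt, rd) tuples is an assumption made for this example.

```python
def dest(op, rd, rt):
    """ws/we: destination register and whether the instruction writes one."""
    if op == 'ALU':           return rd, rd != 0
    if op in ('ALUi', 'LW'):  return rt, rt != 0
    if op in ('JAL', 'JALR'): return 31, True
    return 0, False

def reads(op):
    """re1/re2: does the instruction read rs (re1) and rt (re2)?"""
    re1 = op in ('ALU', 'ALUi', 'LW', 'SW', 'BZ', 'JR', 'JALR')
    re2 = op in ('ALU', 'SW')
    return re1, re2

def stall(decode, execute, memory, writeback):
    """True if the instruction in decode must stall."""
    (opD, rsD, rtD, _) = decode
    re1, re2 = reads(opD)
    hazard = False
    for (opX, _, rtX, rdX) in (execute, memory, writeback):
        ws, we = dest(opX, rdX, rtX)
        hazard |= we and ((re1 and rsD == ws) or (re2 and rtD == ws))
    return hazard

# ADDI R1,R0,10 is in execute while ADDI R4,R1,17 is in decode -> stall.
nop = ('NOP', 0, 0, 0)
print(stall(('ALUi', 1, 4, 0), ('ALUi', 0, 1, 0), nop, nop))   # True
```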
Hazards due to Loads & Stores
Stall condition: what if (r1)+7 = (r3)+5 ?
[Diagram: pipeline datapath with the interlock/stall logic]
...
M[(r1)+7] <- (r2)
r4 <- M[(r3)+5]
...
Is there any possible data hazard in this instruction sequence?
43
Data Hazards Due to Loads and Stores
• Example instruction sequence
– Mem[ Regs[r1] + 7 ] <- Regs[r2]
– Regs[r4] <- Mem[ Regs[r3] + 5 ]
44
Overview of Data Hazards
• Data hazards occur when one instruction depends on a
data value produced by a preceding instruction still in
the pipeline
• Approaches to resolving data hazards
– Schedule: Programmer explicitly avoids scheduling
instructions that would create data hazards
– Stall: Hardware includes control logic that freezes earlier
stages until preceding instruction has finished producing
data value
– Bypass: Hardware datapath allows values to be sent to an
earlier stage before preceding instruction has left the
pipeline
– Speculate: Guess that there is not a problem, if incorrect
kill speculative instruction and restart
45
Adding Bypassing to the Datapath
[Diagram: pipeline datapath for the sequence "r1 <- ..." followed by "r4 <- r1 ...": an ASrc mux lets the ALU's A input come either from the decode-stage register read or from the value forwarded out of the execute stage, removing the stall]
48
Bypass and Stall Signals
Split weE into two components: we-bypass and we-stall
we-bypassE = Case opcodeE            we-stallE = Case opcodeE
  ALU, ALUi => on (if ws != 0)         LW        => on (if ws != 0)
  ...       => off                     JAL, JALR => on
                                       ...       => off
49
Fully Bypassed Datapath
[Diagram: pipeline datapath with bypass paths from the E, M, and W stages into both ALU inputs (ASrc and BSrc muxes); PC is stalled for JAL, ...]
Is there still a need for the stall signal?
stall = (rsD == wsE)·(opcodeE == LW)·(wsE != 0)·re1D
      + (rtD == wsE)·(opcodeE == LW)·(wsE != 0)·re2D
50
Overview of Data Hazards
• Data hazards occur when one instruction depends on a
data value produced by a preceding instruction still in
the pipeline
• Approaches to resolving data hazards
– Schedule: Programmer explicitly avoids scheduling
instructions that would create data hazards
– Stall: Hardware includes control logic that freezes earlier
stages until preceding instruction has finished producing
data value
– Bypass: Hardware datapath allows values to be sent to an
earlier stage before preceding instruction has left the
pipeline
– Speculate: Guess that there is not a problem, if incorrect
kill speculative instruction and restart
51
Agenda
• Microcoded Microarchitectures
• Pipeline Review
– Pipelining Basics
– Structural Hazards
– Data Hazards
– Control Hazards
52
Control Hazards
• What do we need to calculate next PC?
– For Jumps
• Opcode, offset and PC
– For Jump Register
• Opcode and Register value
– For Conditional Branches
• Opcode, PC, Register (for condition), and offset
– For all other instructions
• Opcode and PC
– have to know it’s not one of above!
53
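Not in the original slides: a minimal Python sketch of what the list above implies about computing the next PC. The field names and the 4-byte instruction size follow the MIPS-style examples used elsewhere in the deck.

```python
def next_pc(pc, opcode, offset=0, reg_value=0, cond=False):
    if opcode in ('J', 'JAL'):            # jumps: opcode, offset and PC
        return pc + offset
    if opcode in ('JR', 'JALR'):          # jump register: opcode and register value
        return reg_value
    if opcode in ('BEQZ', 'BNEZ'):        # conditional branch: opcode, PC, register, offset
        return pc + offset if cond else pc + 4
    return pc + 4                         # everything else: opcode and PC only

print(next_pc(100, 'J', offset=204))                  # 304
print(next_pc(100, 'BEQZ', offset=200, cond=False))   # 104
```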
Opcode Decoding Bubble
(assuming no branch delay slots for now)
time
t0 t1 t2 t3 t4 t5 t6 t7 ....
(I1) r1 <- (r0) + 10 IF1 ID1 EX1 MA1 WB1
(I2) r3 <- (r2) + 17 IF2 IF2 ID2 EX2 MA2 WB2
(I3) IF3 IF3 ID3 EX3 MA3 WB3
(I4) IF4 IF4 ID4 EX4 MA4 WB4
time
t0 t1 t2 t3 t4 t5 t6 t7 ....
IF I1 nop I2 nop I3 nop I4
ID I1 nop I2 nop I3 nop I4
Resource
Usage EX I1 nop I2 nop I3 nop I4
MA I1 nop I2 nop I3 nop I4
WB I1 nop I2 nop I3 nop I4
[Diagram: the decode stage detects a jump (Jump?) and muxes a nop into the decode-stage IR (IRSrcD) to kill the instruction fetched after the jump, while the PC is redirected to the jump target]
IRSrcD = Case opcodeD
  J, JAL => nop
  ...    => IM
Example:
I1  096  ADD
I2  100  J 304
I3  104  ADD    (killed)
I4  304  ADD
Any interaction between stall and jump?
56
Jump Pipeline Diagrams
time
t0 t1 t2 t3 t4 t5 t6 t7 ....
(I1) 096: ADD IF1 ID1 EX1 MA1 WB1
(I2) 100: J 304 IF2 ID2 EX2 MA2 WB2
(I3) 104: ADD IF3 nop nop nop nop
(I4) 304: ADD IF4 ID4 EX4 MA4 WB4
time
t0 t1 t2 t3 t4 t5 t6 t7 ....
IF I1 I2 I3 I4 I5
ID I1 I2 nop I4 I5
Resource
EX I1 I2 nop I4 I5
Usage
MA I1 I2 nop I4 I5
WB I1 I2 nop I4 I5
[Diagrams: pipelining conditional branches. BEQZ is resolved by a zero detect in the execute stage; IRSrcD and IRSrcE muxes insert nops to kill the instructions fetched (and decoded) after a taken branch, and the PC is redirected from the execute stage]
61
Control Equations for PC and IR Muxes
PCSrc = Case opcodeE
          BEQZ.z, BNEZ.!z => br
          ...             => Case opcodeD
                               J, JAL   => jabs
                               JR, JALR => rind
                               ...      => pc+4
IRSrcD = Case opcodeE
          BEQZ.z, BNEZ.!z => nop
          ...             => Case opcodeD
                               J, JAL, JR, JALR => nop
                               ...              => IM
Give priority to the older instruction, i.e., the execute-stage instruction over the decode-stage instruction.
Resource usage for a taken branch resolved in execute (two instructions killed):
time  t0 t1 t2 t3 t4 t5 t6 t7 ....
IF    I1 I2 I3 I4 I5
ID    I1 I2 I3 nop I5
EX    I1 I2 nop nop I5
MA    I1 I2 nop nop I5
WB    I1 I2 nop nop I5
Reducing the penalty: do the zero detect on the register file output in the decode stage.
[Diagram: zero detect moved next to the register file read ports in decode]
Pipeline diagram now same as for jumps
64
Branch Delay Slots
(expose control hazard to software)
I1  096  ADD
I2  100  BEQZ r1, +200
I3  104  ADD    <- delay slot instruction, executed regardless of branch outcome
I4  304  ADD
time
t0 t1 t2 t3 t4 t5 t6 t7 ....
IF I1 I2 I3 I4
ID I1 I2 I3 I4
Resource
EX I1 I2 I3 I4
Usage
MA I1 I2 I3 I4
WB I1 I2 I3 I4
66
Why an Instruction may not be
dispatched every cycle (CPI>1)
• Full bypassing may be too expensive to implement
– typically all frequently used paths are provided
– some infrequently used bypass paths may increase cycle time and
counteract the benefit of reducing CPI
• Loads have two-cycle latency
– Instruction after load cannot use load result
– MIPS-I ISA defined load delay slots, a software-visible pipeline hazard
(compiler schedules independent instruction or inserts NOP to avoid
hazard). Removed in MIPS-II (pipeline interlocks added in hardware)
• MIPS:“Microprocessor without Interlocked Pipeline Stages”
• Conditional branches may cause bubbles
– kill following instruction(s) if no delay slots
68
Agenda
• Microcoded Microarchitectures
• Pipeline Review
– Pipelining Basics
– Structural Hazards
– Data Hazards
– Control Hazards
69
Acknowledgements
• These slides contain material developed and copyright by:
– Arvind (MIT)
– Krste Asanovic (MIT/UCB)
– Joel Emer (Intel/MIT)
– James Hoe (CMU)
– John Kubiatowicz (UCB)
– David Patterson (UCB)
– Christopher Batten (Cornell)
70
Computer Architecture
ELE 475 / COS 475
Slide Deck 3: Cache Review
David Wentzlaff
Department of Electrical Engineering
Princeton University
1
Agenda
• Memory Technology
• Motivation for Caches
• Classifying Caches
• Cache Performance
2
Agenda
• Memory Technology
• Motivation for Caches
• Classifying Caches
• Cache Performance
3
Naive Register File
[Diagram: register file built from clocked registers; a decoder selects the write address, with separate read address/read data and write address/write data ports]
4
Memory Arrays: Register File
5
Memory Arrays: SRAM
6
Memory Arrays: DRAM
7
Relative Memory Sizes of SRAM vs. DRAM
[Diagram: relative bit-cell sizes of a register file, on-chip SRAM on a logic chip, and DRAM on a memory chip; DRAM offers high capacity but high latency and low bandwidth]
9
Agenda
• Memory Technology
• Motivation for Caches
• Classifying Caches
• Cache Performance
10
CPU-Memory Bottleneck
[Diagram: processor connected to main memory; the performance gap between the two is the bottleneck (Hennessy & Patterson 2011)]
[Diagram: processor with a small memory close to it and a big memory farther away]
13
Memory Hierarchy
[Diagram: Processor <-> Small Fast Memory (RF, SRAM) <-> Big Slow Memory (DRAM)]
[Plot: memory address vs. time for a typical program, showing temporal locality and temporal & spatial locality in the access pattern]
17
Agenda
• Memory Technology
• Motivation for Caches
• Classifying Caches
• Cache Performance
18
Inside a Cache
[Diagram: processor <-> cache <-> main memory, connected by address and data lines; the cache holds tags (copies of memory addresses, e.g., 416, 6848) along with the corresponding data blocks]
19
Basic Cache Algorithm for a Load
20
Classifying Caches
[Diagram: processor <-> cache <-> main memory; cache blocks map to locations in memory]
28
Average Memory Access Time
[Diagram: the processor accesses the cache; a hit is serviced by the cache, a miss goes to main memory]
• Average Memory Access Time = Hit Time + (Miss Rate * Miss Penalty)
29
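Not in the original slides: a minimal Python sketch of the AMAT formula, with made-up hit time, miss rate, and miss penalty values.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average Memory Access Time = Hit Time + Miss Rate * Miss Penalty."""
    return hit_time + miss_rate * miss_penalty

# e.g. 1-cycle hit, 5% miss rate, 100-cycle miss penalty to main memory
print(amat(hit_time=1, miss_rate=0.05, miss_penalty=100))   # 6.0 cycles

# The formula nests for a two-level hierarchy: the L1 miss penalty is
# itself the AMAT of the L2 cache.
l2 = amat(hit_time=10, miss_rate=0.20, miss_penalty=100)    # 30.0 cycles
print(amat(hit_time=1, miss_rate=0.05, miss_penalty=l2))    # 2.5 cycles
```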
Categorizing Misses: The Three C's (compulsory, capacity, conflict)
Plot from Hennessy and Patterson Ed. 5. Image Copyright © 2011, Elsevier Inc. All rights Reserved.
Reduce Miss Rate: Large Cache Size
Plot from Hennessy and Patterson Ed. 5 Image Copyright © 2011, Elsevier Inc. All rights Reserved.
Reduce Miss Rate: High Associativity
36
Acknowledgements
• These slides contain material developed and copyright by:
– Arvind (MIT)
– Krste Asanovic (MIT/UCB)
– Joel Emer (Intel/MIT)
– James Hoe (CMU)
– John Kubiatowicz (UCB)
– David Patterson (UCB)
– Christopher Batten (Cornell)
37
Computer Architecture
ELE 475 / COS 475
Slide Deck 4: Superscalar 1
David Wentzlaff
Department of Electrical Engineering
Princeton University
1
Types of Data Hazards
Consider executing a sequence of instructions of the form rk <- ri op rj
Data-dependence:
  r3 <- r1 op r2      Read-after-Write
  r5 <- r3 op r4      (RAW) hazard
Anti-dependence:
  r3 <- r1 op r2      Write-after-Read
  r1 <- r4 op r5      (WAR) hazard
Output-dependence:
  r3 <- r1 op r2      Write-after-Write
  r3 <- r6 op r7      (WAW) hazard
2
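Not in the original slides: a minimal Python sketch classifying the dependence between an earlier and a later instruction, each given as (destination register, set of source registers), into RAW / WAR / WAW.

```python
def classify(earlier, later):
    dst_e, srcs_e = earlier
    dst_l, srcs_l = later
    kinds = []
    if dst_e in srcs_l:
        kinds.append('RAW')   # later reads what earlier writes (true dependence)
    if dst_l in srcs_e:
        kinds.append('WAR')   # later writes what earlier reads (anti-dependence)
    if dst_l == dst_e:
        kinds.append('WAW')   # both write the same register (output dependence)
    return kinds or ['none']

print(classify(('r3', {'r1', 'r2'}), ('r5', {'r3', 'r4'})))  # ['RAW']
print(classify(('r3', {'r1', 'r2'}), ('r1', {'r4', 'r5'})))  # ['WAR']
print(classify(('r3', {'r1', 'r2'}), ('r3', {'r6', 'r7'})))  # ['WAW']
```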
Introduction to Superscalar Processor
• Processors studied so far are fundamentally
limited to CPI >= 1
• Superscalar processors enable CPI < 1 (IPC > 1)
by executing multiple instructions in parallel
• Can have both in-order and out-of-order
superscalar processors. We will start with in-
order.
3
Baseline 2-Way In-Order Superscalar Processor
[Diagram: fetch 2 instructions (IR0, IR1) at the same time from the instruction cache; a register file with 4 read ports and 2 write ports feeds two execution pipes. Pipe A has an ALU and branch condition logic; Pipe B has an ALU and the data cache address/read path]
Pipe A: Integer Ops., Branches
Pipe B: Integer Ops., Memory
5
Baseline 2-Way In-Order Superscalar Processor
[Diagram: the same pipeline with issue logic / instruction steering between fetch and the two pipes]
Pipe A: Integer Ops., Branches
Pipe B: Integer Ops., Memory
6
Baseline 2-Way In-Order Superscalar Processor: Duplicate Control
[Diagram: IR0 feeds Decode A and IR1 feeds Decode B; the data cache hangs off Pipe B]
ADDIU  F D A0 A1 W
LW     F D B0 B1 W

LW     F D B0 B1 W      Issue logic swaps the pair from its natural position
ADDIU  F D A0 A1 W

LW     F D B0 B1 W
LW     F D D B0 B1 W    Structural hazard on the memory pipe
8
Dual Issue Data Hazards
No Bypassing:
ADDIU R1,R1,1 F D A0 A1 W
ADDIU R3,R4,1 F D B0 B1 W
ADDIU R5,R6,1 F D A0 A1 W
ADDIU R7,R5,1 F D D D D A0 A1 W
Full Bypassing:
ADDIU R1,R1,1 F D A0 A1 W
ADDIU R3,R4,1 F D B0 B1 W
ADDIU R5,R6,1 F D A0 A1 W
ADDIU R7,R5,1 F D D A0 A1 W 9
Dual Issue Data Hazards
Order Matters:
ADDIU R1,R1,1 F D A0 A1 W
ADDIU R3,R4,1 F D B0 B1 W
ADDIU R7,R5,1 F D A0 A1 W
ADDIU R5,R6,1 F D B0 B1 W
10
Fetch Logic and Alignment
Cyc Addr Instr
0 0x000 OpA 0x000 0 0 1 1
0 0x004 OpB
1 0x008 OpC …
1 0x00C J 0x100
… 0x100 2 2
2 0x100 OpD
2 0x104 J 0x204 …
…
3 0x204 OpE 0x200 3 3
3 0x208 J 0x30C
…
…
4 0x30C OpF 0x300 4
4 0x310 OpG
5 0x314 OpH 0x310 4 5
12
With Alignment Constraints
Cyc Addr Instr
? 0x000 OpA 0x000 0 0 1 1
? 0x004 OpB
? 0x008 OpC …
? 0x00C J 0x100
… 0x100 2 2
? 0x100 OpD
? 0x104 J 0x204 …
…
? 0x204 OpE 0x200 3 3 4 4
? 0x208 J 0x30C
…
…
? 0x30C OpF 0x300 5 5
? 0x310 OpG
? 0x314 OpH 0x310 6 6
13
With Alignment Constraints
Cyc Addr Instr
1 0x000 OpA F D A0 A1 W
1 0x004 OpB F D B0 B1 W
2 0x008 OpC F D B0 B1 W
2 0x00C J 0x100 F D A0 A1 W
3 0x100 OpD F D B0 B1 W
3 0x104 J 0x204 F D A0 A1 W
4 0x200 ? F - - - -
4 0x204 OpE F D A0 A1 W
5 0x208 J 0x30C F D A0 A1 W
5 0x20C ? F - - - -
6 0x308 ? F - - - -
6 0x30C OpF F D A0 A1 W
7 0x310 OpG F D A0 A1 W
7 0x314 OpH F D B0 B1 W
14
Precise Exceptions and Superscalars
• Similar to tracking program order for data
dependencies, we need to track order for
exceptions
LW F D B0 B1 W
SYSCALL F D A0 A1 W
15
Bypassing in Superscalar Pipelines
[Diagrams: the two-pipe superscalar datapath with bypass paths added incrementally; each ALU input can receive results forwarded from either pipe's execute or memory stage (paths 1-6 in the figure), so the number of bypass paths grows quickly with issue width]
16-19
Breaking Decode and Issue Stage
• Bypass Network can become very complex
• Can motivate breaking Decode and Issue Stage
D = Decode, Possibly resolve structural Hazards
I = Register file read, Bypassing, Issue/Steer
Instructions to proper unit
OpA F D I A0 A1 W
OpB F D I B0 B1 W
OpC F D I A0 A1 W
OpD F D I B0 B1 W
20
Superscalars Multiply Branch Cost
BEQZ F D I A0 A1 W
OpA F D I B0 - -
OpB F D I - - -
OpC F D I - - -
OpD F D - - - -
OpE F D - - - -
OpF F - - - - -
OpG F - - - - -
OpH F D I A0 A1 W
OpI F D I B0 B1 W
21
Acknowledgements
• These slides contain material developed and copyright by:
– Arvind (MIT)
– Krste Asanovic (MIT/UCB)
– Joel Emer (Intel/MIT)
– James Hoe (CMU)
– John Kubiatowicz (UCB)
– David Patterson (UCB)
– Christopher Batten (Cornell)
22
Computer Architecture
ELE 475 / COS 475
Slide Deck 5: Superscalar 2 and
Exceptions
David Wentzlaff
Department of Electrical Engineering
Princeton University
1
Agenda
• Interrupts
• Out-of-Order Processors
2
Interrupts:
altering the normal flow of control
[Diagram: the program executes Ii-1, Ii, Ii+1; an interrupt at Ii transfers control to the interrupt handler (HI1, HI2, ..., HIn), which then returns to the program]
5
Interrupt Handler
• Saves EPC before re-enabling interrupts to allow nested
interrupts
– need an instruction to move EPC into GPRs
– need a way to mask further interrupts at least until EPC can be saved
• Needs to read a status register that indicates the cause
of the interrupt
• Uses a special indirect jump instruction RFE (return-
from-exception) to resume user code, this:
– enables interrupts
– restores the processor to the user mode
– restores hardware status and control state
6
Synchronous Interrupts
• A synchronous interrupt (exception) is caused by a
particular instruction
7
Exception Handling 5-Stage Pipeline
[Diagram: PC -> Inst. Mem -> Decode -> Execute -> Data Mem -> Writeback]
Asynchronous Interrupts
[Diagram: exception flags (Exc D, Exc E, Exc M) and PCs (PC D, PC E, PC M) travel down the pipeline; EPC and Cause registers capture the offending PC and cause at the commit point; "Select Handler PC" redirects fetch while Kill F/D/E Stage and Kill Writeback squash instructions in flight; asynchronous interrupts are injected at the memory stage]
9
Exception Handling 5-Stage Pipeline
• Hold exception flags in pipeline until commit point (M
stage)
10
Speculating on Exceptions
• Prediction mechanism
– Exceptions are rare, so simply predicting no exceptions is very
accurate!
• Check prediction mechanism
– Exceptions detected at end of instruction execution pipeline, special
hardware for various exception types
• Recovery mechanism
– Only write architectural state at commit point, so can throw away
partially executed instructions after exception
– Launch exception handler after flushing pipeline
time
t0 t1 t2 t3 t4 t5 t6 t7 ....
IF I1 I2 I3 I4 I5
ID I1 I2 I3 nop I5
Resource
EX I1 I2 nop nop I5
Usage
MA I1 nop nop nop I5
WB nop nop nop nop I5
12
Agenda
• Interrupts
• Out-of-Order Processors
13
Out-Of-Order (OOO) Introduction
Name   Frontend  Issue  Writeback  Commit   Hardware
I4     IO        IO     IO         IO       Fixed Length Pipelines, Scoreboard
I2O2   IO        IO     OOO        OOO      Scoreboard
I2OI   IO        IO     OOO        IO       Scoreboard, Reorder Buffer, and Store Buffer
IO3    IO        OOO    OOO        OOO      Scoreboard and Issue Queue
IO2I   IO        OOO    OOO        IO       Scoreboard, Issue Queue, Reorder Buffer, and Store Buffer
14
OOO Motivating Code Sequence
0  MUL   R1, R2, R3
1  ADDIU R11, R10, 1
2  MUL   R5, R1, R4
[Dependence graph: instruction 2 depends on instruction 0 through R1; instruction 1 is independent]
15
I4: In-Order Front-End, Issue,
Writeback, Commit
F D X M W
16
I4: In-Order Front-End, Issue,
Writeback, Commit
X1
X0
F D W
M0 M1
17
I4: In-Order Front-End, Issue,
Writeback, Commit (4-stage MUL)
X1 X2 X3
X0
F D X2 X3
M0 M1 W
Y0 Y1 Y2 Y3
F D I M0 M1
X2 X3 W
Y0 Y1 Y2 Y3
ARF R W
SB R/W W
19
Basic Scoreboard
[Table: one row per register R1..R31 with fields P, F, and Data Avail. (bits 4 3 2 1 0)]
P: Pending, a write to this destination is in flight
F: Which functional unit is writing the register
Data Avail.: Where the write data currently is in the functional unit pipeline
[Pipeline: F D I, execution pipes M0 M1 and Y0 Y1 Y2 Y3, writeback W. Structure usage: ARF: R, W; SB: R, R/W, W]
23
I2O2 Scoreboard
• Similar to I4, but we can now use it to track
structural hazards on Writeback port
• Set bit in Data Avail. according to length of
pipeline
• Architecture conservatively stalls to avoid
WAW hazards by stalling in Decode therefore
current scoreboard sufficient. More
complicated scoreboard needed for
processing WAW Hazards
24
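Not in the original slides: a minimal Python sketch of a pending-bit scoreboard like the one described above. Each register gets a pending bit and a countdown of cycles until its value is written back; the pipeline lengths, and the omission of writeback-port conflict checking, are simplifying assumptions.

```python
LATENCY = {'X': 1, 'Y': 4}   # e.g. 1-cycle ALU pipe, 4-cycle multiplier pipe

class Scoreboard:
    def __init__(self, nregs=32):
        self.pending = [False] * nregs
        self.ready_in = [0] * nregs        # cycles until the write reaches W

    def can_issue(self, srcs, dest):
        if any(self.pending[r] for r in srcs):       # RAW: a source is in flight
            return False
        if dest is not None and self.pending[dest]:  # conservative WAW stall
            return False
        return True

    def issue(self, dest, unit):
        if dest is not None and dest != 0:
            self.pending[dest] = True
            self.ready_in[dest] = LATENCY[unit]

    def tick(self):
        for r in range(len(self.pending)):
            if self.pending[r]:
                self.ready_in[r] -= 1
                if self.ready_in[r] <= 0:
                    self.pending[r] = False   # value written back this cycle

sb = Scoreboard()
sb.issue(dest=1, unit='Y')                     # MUL R1, R2, R3
print(sb.can_issue(srcs=[10], dest=11))        # True: independent ADDIU
print(sb.can_issue(srcs=[1, 4], dest=5))       # False: MUL R5,R1,R4 waits on R1
```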
0 MUL R1, R2, R3 F D I Y0 Y1 Y2 Y3 W
1 ADDIU R11,R10,1 F D I X0 W
2 MUL R5, R1, R4 F D I I I Y0 Y1 Y2 Y3 W
3 MUL R7, R5, R6 F D D D I I I I Y0 Y1 Y2 Y3 W
4 ADDIU R12,R11,1 F F F D D D D I X0 W
5 ADDIU R13,R12,1 F F F F D I X0 W
6 ADDIU R14,R12,2 F D I I X0 W
26
I2OI: In-order Frontend/Issue, Out-of-order Writeback, In-order Commit
[Diagram: F D I feed execution pipes X0 (integer), L0 L1 (load), S0 (store), and Y0-Y3 (multiply); writeback (W) updates the PRF and ROB; the commit stage (C) moves results to the ARF, with finished stores held in the FSB until commit]
[Structure usage: ARF: W; SB: R/W, W; PRF: R, W; ROB: R/W, W, R/W; FSB: W, R/W]
PRF = Physical Register File (Future File), ROB = Reorder Buffer, FSB = Finished Store Buffer (1 entry)
27
Reorder Buffer (ROB)
State S ST V Preg
--
P 1
F 1
P 1
P
F
P
P
--
--
State: {Free, Pending, Finished}
S: Speculative
ST: Store bit
V: Physical Register File Specifier Valid
Preg: Physical Register File Specifier 28
Reorder Buffer (ROB)
State S ST V Preg Next instruction allocates here in D
--
P 1 Tail of ROB
F 1 Speculative because branch is in flight
P 1
P
F Instruction wrote ROB out of order
P
P Head of ROB
--
--
State: {Free, Pending, Finished}
S: Speculative Commit stage is waiting for
ST: Store bit Head of ROB to be finished
V: Physical Register File Specifier Valid
Preg: Physical Register File Specifier 29
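Not in the original slides: a minimal Python sketch of a circular reorder buffer with the fields listed above (State, Speculative, Store, Valid, Preg). The size and the driver sequence are made-up examples.

```python
FREE, PENDING, FINISHED = 'free', 'pending', 'finished'

class ROB:
    def __init__(self, nentries=8):
        self.entries = [dict(state=FREE, spec=False, store=False,
                             valid=False, preg=None) for _ in range(nentries)]
        self.head = 0      # oldest instruction, next to commit
        self.tail = 0      # next entry to allocate in decode

    def allocate(self, preg=None, store=False, spec=False):
        e = self.entries[self.tail]
        assert e['state'] == FREE, "ROB full"
        e.update(state=PENDING, spec=spec, store=store,
                 valid=preg is not None, preg=preg)
        idx, self.tail = self.tail, (self.tail + 1) % len(self.entries)
        return idx

    def finish(self, idx):                 # writeback marks the entry finished
        self.entries[idx]['state'] = FINISHED

    def commit(self):                      # commit only from the head, in order
        e = self.entries[self.head]
        if e['state'] != FINISHED:
            return None
        e['state'] = FREE
        committed, self.head = self.head, (self.head + 1) % len(self.entries)
        return committed

rob = ROB()
i0 = rob.allocate(preg='p1')    # older MUL, long latency
i1 = rob.allocate(preg='p2')    # younger ADDIU, finishes first
rob.finish(i1)
print(rob.commit())             # None: the head (i0) is not finished yet
rob.finish(i0)
print(rob.commit(), rob.commit())  # 0 1: both commit in program order
```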
Finished Store Buffer (FSB)
V Op Addr Data
--
30
0 MUL R1, R2, R3 F D I Y0 Y1 Y2 Y3 W C
1 ADDIU R11,R10,1 F D I X0 W r C
2 MUL R5, R1, R4 F D I I I Y0 Y1 Y2 Y3 W C
3 MUL R7, R5, R6 F D D D I I I I Y0 Y1 Y2 Y3 W C
4 ADDIU R12,R11,1 F F F D D D D I X0 W r C
5 ADDIU R13,R12,1 F F F F D I X0 W r C
6 ADDIU R14,R12,2 F D I I X0 W r C
Cyc D I ROB 0 1 2 3
0 Empty = free entry in ROB
1 0
2 1 0 R1 State of ROB at beginning of cycle
3 2 1 R11
4 R5 Pending entry in ROB
5
6 3 2 R11 Circle=Finished (Cycle after W)
7 R7
8 R1
9 Last cycle before entry is freed from ROB
10 4 3
(Cycle in C stage)
11 5 4 R12
12 6 5 R13 R5
13 R14
14 6 R12
15 R13
16 R7 Entry becomes free and is freed
17 R14 on next cycle
18
19 31
What if First Instruction Causes an
Exception?
0 MUL R1, R2, R3 F D I Y0 Y1 Y2 Y3 W /
1 ADDIU R11,R10,1 F D I X0 W r -- /
2 MUL R5, R1, R4 F D I I I Y0 /
3 MUL R7, R5, R6 F D D D I /
4 ADDIU R12,R11,1 F F F D /
F D I. . .
32
What About Branches?
Option 2
0 BEQZ R1, target F D I X0 W C
1 ADDIU R11,R10,1 F D I X0 /
Squash instructions in ROB
2 ADDIU R5, R1, R4 F D I /
when Branch commits
3 ADDIU R7, R5, R6 F D /
T ADDIU R12,R11,1 F D I . . .
Option 1
0 BEQZ R1, target F D I X0 W C
1 ADDIU R11,R10,1 F D I -
Squash instructions earlier. Has more
2 ADDIU R5, R1, R4 F D -
complexity. ROB needs many ports.
3 ADDIU R7, R5, R6 F -
T ADDIU R12,R11,1 F D I . . .
Option 3
0 BEQZ R1, target F D I X0 W C
1 ADDIU R11,R10,1 F D I X0 W / Wait for speculative instructions to
2 ADDIU R5, R1, R4 F D I X0 W / reach the Commit stage and squash in
3 ADDIU R7, R5, R6 F D I X0 W /
Commit stage
T ADDIU R12,R11,1 F D I X0 W C
33
What About Branches?
• Three possible designs with decreasing
complexity based on when to squash speculative
instructions and de-allocate ROB entry:
1. As soon as branch resolves
2. When branch commits
3. When speculative instructions reach commit
34
Avoiding Stalling Commit on Store
Miss
PRF ARF
W ROB C CSB R
FSB
0 OpA F D I X0 W C CSB=Committed Store Buffer
1 SW F D I S0 W C C C C
2 OpB F D I X0 W W W W C
3 OpC F D I X X X X W C
4 OpD F D I I I I X W C
F D I I
Q
M0 M1 W
Y0 Y1 Y2 Y3
ARF R W
SB R R/W W
I W R/W W
36
Q
Issue Queue (IQ)
Op Imm S V Dest V P Src0 V P Src1
Op: Opcode
Imm.: Immediate
S: Speculative Bit
V: Valid (Instruction has
corresponding Src/Dest)
P: Pending (Waiting on
operands to be produced)
F D I I
Q
M0 F D M0
I
Y0 Q
B
I Y0
Centralized Distributed
38
Advanced Scoreboard
Data Avail.
P 4 3 2 1 0
P: Pending, Write to
R1
Destination in flight
R2 Data Avail.: Where is the
R3 write data in the pipeline
and which functional unit
…
R31
39
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 MUL R1, R2, R3 F D I Y0 Y1 Y2 Y3 W
1 ADDIU R11,R10,1 F D I X0 W
2 MUL R5, R1, R4 F D i I Y0 Y1 Y2 Y3 W
3 MUL R7, R5, R6 F D i I Y0 Y1 Y2 Y3 W
4 ADDIU R12,R11,1 F D i I X0 W
5 ADDIU R13,R12,1 F D i I X0 W
6 ADDIU R14,R12,2 F D i I X0 W
Cyc D I IQ 0 1 2
0
1 0 Dest/Src0/Src1, Circle denotes value
2 1 0 R1/R2/R3 present in ARF
3 2 1 R11/R10
4 3 R5/R1/R4
5 4 R7/R5/R6 Value bypassed so no circle, present
6 5 2 R12/R11 bit
7 6 4 R13/R12 Value set present by
8 5 R14/R12 Instruction 1 in cycle 5, W
9 Stage
10 3
11 6 R14/R12
12
13
40
14
Assume All Instruction in Issue Queue
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 MUL R1, R2, R3 F D i I Y0 Y1 Y2 Y3 W
1 ADDIU R11,R10,1 F D i I X0 W
2 MUL R5, R1, R4 F D i I Y0 Y1 Y2 Y3 W
3 MUL R7, R5, R6 F D i I Y0 Y1 Y2 Y3 W
4 ADDIU R12,R11,1 F D i I X0 W
5 ADDIU R13,R12,1 F D i I X0 W
6 ADDIU R14,R12,2 F D i I X0 W
41
IO2I: In-order Frontend, Out-of-order
Issue/Writeback, In-order Commit
SB X0 PRF ARF
F D I I
Q L0 L1 W ROB
FSB
C
S0
Y0 Y1 Y2 Y3
ARF W
SB R/W W
PRF R W
ROB R/W W R/W
FSB W R/W
42
IQ W R/W
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
0 MUL R1, R2, R3 F D I Y0 Y1 Y2 Y3 W C
1 ADDIU R11,R10,1 F D I X0 W r C
2 MUL R5, R1, R4 F D i I Y0 Y1 Y2 Y3 W C
3 MUL R7, R5, R6 F D i I Y0 Y1 Y2 Y3 W C
4 ADDIU R12,R11,1 F D i I X0 W r C
5 ADDIU R13,R12,1 F D i I X0 W r C
6 ADDIU R14,R12,2 F D i I X0 W r C
43
Out-of-order 2-Wide Superscalar
with 1 ALU
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
0 MUL R1, R2, R3 F D I Y0 Y1 Y2 Y3 W C
1 ADDIU R11,R10,1 F D I X0 W r C
2 MUL R5, R1, R4 F D i I Y0 Y1 Y2 Y3 W C
3 MUL R7, R5, R6 F D i I Y0 Y1 Y2 Y3 W C
4 ADDIU R12,R11,1 F D I X0 W r C
5 ADDIU R13,R12,1 F D i I X0 W r C
6 ADDIU R14,R12,2 F D i I X0 W r C
44
Acknowledgements
• These slides contain material developed and copyright by:
– Arvind (MIT)
– Krste Asanovic (MIT/UCB)
– Joel Emer (Intel/MIT)
– James Hoe (CMU)
– John Kubiatowicz (UCB)
– David Patterson (UCB)
– Christopher Batten (Cornell)
45
Computer Architecture
ELE 475 / COS 475
Slide Deck 6: Superscalar 3
David Wentzlaff
Department of Electrical Engineering
Princeton University
1
Agenda
• Speculation and Branches
• Register Renaming
• Memory Disambiguation
2
Agenda
• Speculation and Branches
• Register Renaming
• Memory Disambiguation
3
Speculation and Branches: I4
0 MUL R1, R2, R3 F D I Y0 Y1 Y2 Y3 W
1 ADDIU R4, R5, 1 F D I X0 X1 X2 X3 W
2 MUL R6, R1, R4 F D I I I Y0 Y1 Y2 Y3 W
3 BEQZ R6, Target F D D D I I I I X0 X1 X2 X3 W
4 ADDIU R8, R9 ,1 F F F D D D D I -- -- -- -- --
5 ADDIU R10,R11,1 F F F F D -- -- -- -- -- --
6 ADDIU R12,R13,1 F -- -- -- -- -- -- --
T F D I . . .
X0 X1 X2 X3
F D I M0 M1 X2 X3 W
Y0 Y1 Y2 Y3 4
Speculation and Branches: I2O2
0 MUL R1, R2, R3 F D I Y0 Y1 Y2 Y3 W
1 ADDIU R4, R5, 1 F D I X0 W
2 MUL R6, R1, R4 F D I I I Y0 Y1 Y2 Y3 W
3 BEQZ R6, Target F D D D I I I I X0 W
4 ADDIU R8, R9 ,1 F F F D D D D I -- --
5 ADDIU R10,R11,1 F F F F D -- -- --
6 ADDIU R12,R13,1 F -- -- -- --
T F D I . . .
F D I M0 M1 W
Y0 Y1 Y2 Y3 5
Speculation and Branches: I2OI
0 MUL R1, R2, R3 F D I Y0 Y1 Y2 Y3 W C
1 ADDIU R4, R5, 1 F D I X0 W r C
2 MUL R6, R1, R4 F D I I I Y0 Y1 Y2 Y3 W C
3 BEQZ R6, Target F D D D I I I I X0 W C
4 ADDIU R8, R9 ,1 F F F D D D D I -- -- --
5 ADDIU R10,R11,1 F F F F D -- -- -- --
6 ADDIU R12,R13,1 F -- -- -- -- --
T F D I . . .
S0
6
Y0 Y1 Y2 Y3
Speculation and Branches: IO3
0 MUL R1, R2, R3 F D I Y0 Y1 Y2 Y3 W
1 ADDIU R4, R5, 1 F D I X0 W
2 MUL R6, R1, R4 F D i I Y0 Y1 Y2 Y3 W
3 BEQZ R6, Target F D i I X0 W
4 ADDIU R8, R9 ,1 F D i I X0 W
5 ADDIU R10,R11,1 F D i I X0 W Speculative
6 ADDIU R12,R13,1 F D i I X0 W Instructions
7 ??? F D Wrote to ARF
8 ??? F D
9 ??? F D
10??? F D
11??? F D
T F D I . . .
SB X0 PRF ARF
F D I
Q I L0
S0
L1 W ROB
FSB
8
C
Y0 Y1 Y2 Y3
Speculation and Branches: IO2I
0 MUL R1, R2, R3 F D I Y0 Y1 Y2 Y3 W C
1 ADDIU R4, R5, 1 F D I X0 W r C
2 MUL R6, R1, R4 F D i I Y0 Y1 Y2 Y3 W C
3 BEQZ R6, Target F D i I X0 W C
4 ADDIU R8, R9 ,1 F D i I X0 W r /
5 ADDIU R10,R11,1 F D i I X0 W r /
Speculative
6 ADDIU R12,R13,1 F D i I X0 / Instructions
7 ??? F D / Wrote to PRF
8 ??? F D / Not ARF
9 ??? F D /
10??? F D /
11??? F D /
12??? F /
13??? /
T F D I . . .
Y0 Y1 Y2 Y3
Agenda
• Speculation and Branches
• Register Renaming
• Memory Disambiguation
10
WAW and WAR “Name” Dependencies
• WAW and WAR are not “True” data dependencies
• RAW is a “True” data dependency because the reader needs the result of the writer
• “Name” dependencies exist because we have a limited number of “Names” (register specifiers or memory addresses)
11-14
Adding More Registers
Breaking all “Name” Dependencies
0 MUL R1, R2, R3 F D I Y0 Y1 Y2 Y3 W C
1 MUL R4, R1, R5 F D i I Y0 Y1 Y2 Y3 W C
2 ADDIU R6, R4, 1 F D i I X0 W C
3 ADDIU R4, R7, 1 F D i I X0 W r C
Manual Register Renaming. What if we could use more registers? Second R4 Write to R8?
0 MUL R1, R2, R3 F D I Y0 Y1 Y2 Y3 W C
1 MUL R4, R1, R5 F D i I Y0 Y1 Y2 Y3 W C
2 ADDIU R6, R4, 1 F D i I X0 W C
3 ADDIU R8, R7, 1 F D i I X0 W r C
15
Register Renaming
• Adding more “Names” (registers/memory)
removes dependence, but architecture
namespace is limited.
– Registers: Larger namespace requires more bits in
instruction encoding. 32 registers = 5 bits, 128
registers = 7 bits.
16
Register Renaming Overview
• 2 Schemes
– Pointers in the Instruction Queue/ReOrder Buffer
– Values in the Instruction Queue/ReOrder Buffer
17
IO2I: Register Renaming with Pointers
FL in IQ and ROB
RT SB X0 PRF ARF
F D I I
Q L0 L1 W ROB
FSB
C
S0
Y0 Y1 Y2 Y3
• All data structures same as in IO2I Except:
– Add two fields to ROB
– Add Rename Table (RT) and Free List (FL) of
registers
• Increase size of PRF to provide more register
“Names” 18
IO2I: Register Renaming with Pointers
FL in IQ and ROB
RT SB X0 PRF ARF
F D I I
Q L0 L1 W ROB
FSB
C
S0
Y0 Y1 Y2 Y3
ARF W
SB R/W W
PRF R W
ROB R/W W R/W
FSB W R/W
IQ W W
RT R/W W
19
FL R/W W
Modified Reorder Buffer (ROB)
State S ST V Preg Areg Ppreg
--
P
F
P
P
F
P
P
--
--
State: {Free, Pending, Finished} Areg: Architectural Register File Specifier
S: Speculative Ppreg: Previous Physical Register
ST: Store bit
V: Destination is valid
Preg: Physical Register File Specifier 20
Rename Table (RT)
P Preg
P: Pending, Write to Destination in flight
R1
Preg: Physical Register Architectural
R2 Register maps to.
R3
…
R31
21
Free List (FL)
Free
Free: Register is free for renaming
p1
p2
If Free == 0, physical register is in use and cannot be
p3 used for renaming
…
pN
22
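Not in the original slides: a minimal Python sketch of renaming destinations through a rename table and a free list, which removes the WAR/WAW "name" dependences in the example sequence above. The physical register names and initial mappings are made up.

```python
free_list = ['p0', 'p1', 'p2', 'p3', 'p4', 'p5']
rename_table = {f'R{i}': f'R{i}' for i in range(8)}   # start mapped to themselves

def rename(dest, srcs):
    # Sources are read through the current mapping *before* the destination
    # gets a fresh physical register.
    mapped_srcs = [rename_table[s] for s in srcs]
    preg = free_list.pop(0)
    rename_table[dest] = preg
    return preg, mapped_srcs

program = [('R1', ['R2', 'R3']),   # MUL   R1, R2, R3
           ('R4', ['R1', 'R5']),   # MUL   R4, R1, R5
           ('R6', ['R4']),         # ADDIU R6, R4, 1
           ('R4', ['R7'])]         # ADDIU R4, R7, 1: WAR/WAW on R4 disappears

for dest, srcs in program:
    preg, msrcs = rename(dest, srcs)
    print(f'{dest} -> {preg}   srcs: {srcs} -> {msrcs}')
```

With these mappings the two writes of R4 land in different physical registers (p1 and p3), so only the true RAW dependences remain.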
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 MUL R1, R2, R3 F D I Y0 Y1 Y2 Y3 W C
1 MUL R4, R1, R5 F D i I Y0 Y1 Y2 Y3 W C
2 ADDIU R6, R4, 1 F D i I X0 W C
3 ADDIU R4, R7, 1 F D i I X0 W r C
RT FL IQ ROB
Cy D I W C R1 R2 R3 R4 R5 R6 R7 0 1 2 3 0 1 2 3
0 p0 p1 p2 p3 p4 p5 p6 p{7,8,9,10}
1 0 p{7,8,9,10}
2 1 0 p7 p{8,9,10} p7/p1/p2 p7/R1/p0
3 2 p8 p{9,10} p8/p7/p4 p8/R4/p3
4 3 p9 p10 p9/p8 p9/R6/p5
5 p10 p10/p6 p10/R4/p8
6 1
7 0
8 3 0 p7 p7/R1/p0
9 3 p0
10 2 p10 p0 p10/R4/p8
11 1 p0
12 2 1 p0 p8/R4/p3
13 2 p9 p{0,3} p9/R6/p5
14 3 p{0,3,5}
15 p{0,3,5,8}
23
Freeing Physical Registers
ADDU R1,R2,R3 <-Assume Arch. Reg R1 maps to Phys. Reg p0
ADDU R4,R1,R5
ADDU R1,R6,R7 <-Next write of Arch Reg R1, Mapped to Phys. Reg p1
ADDU R8,R9,R10
0 ADDU R1,R2,R3 I X W C
1 ADDU R4,R1,R5 I X W C
2 ADDU R1,R6,R7 IX W r C
3 ADDU R8,R9,R10 F D I X W r C
Write p0 Free p0 Alloc p0 Write p0 Read Wrong
value in p0
0 ADDU R1,R2,R3 I X W C
1 ADDU R4,R1,R5 I X W C
2 ADDU R1,R6,R7 I X W r C
3 ADDU R8,R9,R10 F D I X W r C
Write p0 Alloc p2 Write p2 Dealloc p0
• If Arch. Reg Ri mapped to Phys. Reg pj, we can free pj when the next instruction
that writes Ri commits 24
Unified Physical/Architectural
Register File
• Combine PRF and ARF into one register file
• Replace ARF with Architectural Rename Table
• Instead of copying Values, Commit stage
copies Preg pointer into appropriate entry of
Architectural Rename Table
• Unified Physical/Architectural Register file can
be smaller than separate
25
IO2I: Register Renaming with Values in
IQ and ROB
RT SB X0 ARF
F D I I
Q L0 L1 W ROB
FSB
C
S0
Y0 Y1 Y2 Y3
• All data structures same as previous Except:
– Modified ROB (Values instead of Register Specifier)
– Modified RT
– Modified IQ
– No FL
– No PRF, values merged into ROB
26
IO2I: Register Renaming with Values in
IQ and ROB
RT SB X0 ARF
F D I I
Q L0 L1 W ROB
FSB
C
S0
Y0 Y1 Y2 Y3
ARF R W
SB R/W W
ROB R/W W R/W
FSB W R/W
IQ W W
RT R/W W
27
Modified Reorder Buffer (ROB)
State S ST V Value Areg
--
P
F
P
P
F
P
P
--
--
State: {Free, Pending, Finished} Areg: Architectural Register File Specifier
S: Speculative
ST: Store bit
V: Destination is valid
Value: Actual Register Value 28
Modified Issue Queue (IQ)
Op Imm S V Dest V P Src0 V P Src1
Op: Opcode
Imm.: Immediate
S: Speculative Bit
V: Valid (Instruction has
corresponding Src/Dest)
P: Pending (Waiting on
operands to be produced)
29
Modified Rename Table (RT)
V P Preg
V: Valid Bit
R1
P: Pending, Write to Destination in flight
R2 Preg: Index into ROB
R3
…
R31
V:
If V == 0:
Value in ARF is up to date
If V == 1:
Value is in-flight or in ROB
P:
If P == 0:
Value is in ROB
if P == 1:
Value is in flight 30
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 MUL R1, R2, R3 F D I Y0 Y1 Y2 Y3 W C
1 MUL R4, R1, R5 F D i I Y0 Y1 Y2 Y3 W C
2 ADDIU R6, R4, 1 F D i I X0 W C
3 ADDIU R4, R7, 1 F D i I X0 W r C
RT IQ ROB
Cy D I W C R1 R2 R3 R4 R5 R6 R7 0 1 2 3 0 1 2 3
0
1 0
2 1 0 p0 p0/R2/R3 p0/R1
3 2 p1 p1/p0/R5 p1/R4
4 3 p2 p2/p1 p2/R6
5 p3 p3/R7 p3/R4
6 1
7 0
8 3 0 p0/R1
9 3
10 2 p3 p3/R4
11 1
12 2 1 p1/R4
13 2 p2/R6
14 3
15
31
Agenda
• Speculation and Branches
• Register Renaming
• Memory Disambiguation
32
Memory Disambiguation
st R1, 0(R2)
ld R3, 0(R4)
33
In-Order Memory Queue
• Execute all loads and stores in program order
=> Load and store cannot leave IQ for execution
until all previous loads and stores have
completed execution
34
IO2I: With In-Order LD/ST IQ
Int SB X0 PRF ARF
F D I
Q I L0 L1 W ROB
FSB
C
LD/ S0
ST
I Y0 Y1 Y2 Y3
Q
35
Conservative OOO Load Execution
st R1, 0(R2)
ld R3, 0(R4)
• Split execution of store instruction into two phases: address
calculation and data write
F D I
Q I L0 L1 W ROB
FSB
C
S0 FLB
Y0 Y1 Y2 Y3
38
Memory Dependence Prediction
(Alpha 21264)
st r1, (r2)
ld r3, (r4)
39
Acknowledgements
• These slides contain material developed and copyright by:
– Arvind (MIT)
– Krste Asanovic (MIT/UCB)
– Joel Emer (Intel/MIT)
– James Hoe (CMU)
– John Kubiatowicz (UCB)
– David Patterson (UCB)
– Christopher Batten (Cornell)
40
Speculative Loads / Stores
Just like register updates, stores should not modify
the memory until after the instruction is committed
42
Speculative Store Buffer
[Diagram: loads present their address to both the speculative store buffer (entries of V, S, Tag, Data) and the L1 data cache (tags + data); a matching store buffer entry supplies the load data, otherwise the cache does; committed stores drain from the buffer to the cache]
• On store execute:
  – mark entry valid and speculative, and save data and tag of instruction.
• On store commit:
  – clear speculative bit and eventually move data to cache
• On store abort:
  – clear valid bit
43
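Not in the original slides: a minimal Python sketch of a speculative store buffer in front of the data cache, following the execute/commit/abort actions listed above. Modeling the cache as a dict and draining committed data immediately are simplifying assumptions.

```python
class SpeculativeStoreBuffer:
    def __init__(self, cache):
        self.cache = cache                 # backing L1 data cache (a dict here)
        self.entries = []                  # dicts: valid, spec, addr, data

    def store_execute(self, addr, data):
        # Mark entry valid and speculative; save data and address (tag).
        self.entries.append(dict(valid=True, spec=True, addr=addr, data=data))

    def store_commit(self, addr):
        # Clear the speculative bit; here the data drains to the cache at once.
        for e in self.entries:
            if e['valid'] and e['addr'] == addr:
                e['spec'] = False
                self.cache[addr] = e['data']
                e['valid'] = False
                return

    def store_abort(self):
        # Clear the valid bit of all speculative entries (squash them).
        for e in self.entries:
            if e['spec']:
                e['valid'] = False

    def load(self, addr):
        # Youngest matching store buffer entry wins over the cache.
        for e in reversed(self.entries):
            if e['valid'] and e['addr'] == addr:
                return e['data']
        return self.cache.get(addr, 0)

cache = {0x100: 7}
ssb = SpeculativeStoreBuffer(cache)
ssb.store_execute(0x100, 42)            # speculative store
print(ssb.load(0x100))                  # 42: load forwarded from the buffer
ssb.store_abort()                       # mispredict squashes the store
print(ssb.load(0x100), cache[0x100])    # 7 7: cache never saw speculative data
```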
Computer Architecture
ELE 475 / COS 475
Slide Deck 7: VLIW
David Wentzlaff
Department of Electrical Engineering
Princeton University
1
Superscalar Control Logic Scaling
Issue Width W
Issue Group
Previously
Issued Lifetime L
Instructions
• Each issued instruction must somehow check against W*L instructions, i.e., hardware grows as W*(W*L)
• For in-order machines, L is related to pipeline latencies and check is done during
issue (scoreboard)
• For out-of-order machines, L also includes time spent in IQ, SB, and check is
done by broadcasting tags to waiting instructions at completion
• As W increases, larger instruction window is needed to find enough parallelism
to keep machine busy => greater L
=> Out-of-order control logic grows faster than W2 (~W3) 2
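Not in the original slides: a minimal Python sketch of the W*(W*L) check count from the slide. The assumption that the instruction window L must grow with W (here L = 4*W) is an illustrative choice, and it is what drives the roughly cubic growth.

```python
def checks(W, L):
    """Each of W issuing instructions checks against W*L in-flight instructions."""
    return W * (W * L)

for W in (2, 4, 8):
    L = 4 * W    # assume the window grows with W to find enough parallelism
    print(f'W={W}, L={L}: ~{checks(W, L)} comparisons per cycle')
# W=2: 32, W=4: 256, W=8: 2048 -- growing roughly as W^3
```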
Out-of-Order Control Complexity:
MIPS R10000
[A. Ahi et al., MIPS R10000 Superscalar Microprocessor, Hot Chips, 1995 ]
3
Image Credit: MIPS Technologies Inc. / Silicon Graphics Computer Systems
Out-of-Order Control Complexity:
MIPS R10000
Control
Logic
[A. Ahi et al., MIPS R10000 Superscalar Microprocessor, Hot Chips, 1995 ]
4
Image Credit: MIPS Technologies Inc. / Silicon Graphics Computer Systems
Sequential ISA Bottleneck
Sequential source code (e.g., "a = foo(b); for (i=0, i<") --Superscalar compiler--> Sequential machine code
The superscalar compiler finds independent operations but must emit them into a sequential instruction stream; the superscalar processor then checks instruction dependencies at run time to rediscover that parallelism.
5-10
VLIW Less-Than-or-Equals (LEQ)
Scheduling Model
• Each operation may take less than or equal
to its specified latency
– Destination can be written any time after
instruction issue
– Dependent instruction still needs to be
scheduled after instruction latency
• Precise interrupts simplified
• Binary compatibility preserved when latencies
are reduced
11
Early VLIW Machines
• FPS AP120B (1976)
– scientific attached array processor
– first commercial wide instruction machine
– hand-coded vector math libraries using software pipelining and loop
unrolling
• Multiflow Trace (1987)
– commercialization of ideas from Fisher’s Yale group including “trace
scheduling”
– available in configurations with 7, 14, or 28 operations/instruction
– 28 operations packed into a 1024-bit instruction word
• Cydrome Cydra-5 (1987)
– 7 operations encoded in 256-bit instruction word
– rotating register file
12
VLIW Compiler Responsibilities
• Schedule operations to maximize parallel
execution
13
Loop Execution
for (i=0; i<N; i++)
  B[i] = A[i] + C;
Compile
[Schedule on VLIW slots Int1, Int2, M1, M2, FP+, FPx: the loop body is scheduled one operation at a time (add R1, lw, ...), leaving most slots empty each cycle]
14-20
Loop Unrolling
for (i=0; i<N; i++)
B[i] = A[i] + C;
22
Scheduling Loop Unrolled Code
Unroll 4 ways
loop: lw    F1, 0(R1)
      lw    F2, 4(R1)
      lw    F3, 8(R1)
      lw    F4, 12(R1)
      addiu R1, R1, 16
      add.s F5, F0, F1
      add.s F6, F0, F2
      add.s F7, F0, F3
      add.s F8, F0, F4
      sw    F5, 0(R2)
      sw    F6, 4(R2)
      sw    F7, 8(R2)
      sw    F8, 12(R2)
      addiu R2, R2, 16
      bne   R1, R3, loop
Schedule (slots: Int1, Int2, M1, M2, FP+, FPx), one line per cycle:
loop:  lw F1
       lw F2
       lw F3
       add R1 | lw F4 | add.s F5
       add.s F6
       add.s F7
       add.s F8
       sw F5
       sw F6
       sw F7
       add R2 | bne | sw F8
23-27
Software Pipelining
Unroll 4 ways first Int1 Int 2 M1 M2 FP+ FPx
loop: lw F1, 0(R1) lw F1
lw F2, 4(R1) lw F2
lw F3, 8(R1) lw F3
lw F4, 12(R1) add R1 lw F4
addiu R1, R1, 16 add.s F5
add.s F5, F0, F1 add.s F6
add.s F6, F0, F2 add.s F7
add.s F7, F0, F3 add.s F8
add.s F8, F0, F4 sw F5
sw F5, 0(R2) sw F6
sw F6, 4(R2) add R2 sw F7
sw F7, 8(R2) bne sw F8
sw F8, 12(R2)
addiu R2, R2, 16
bne R1, R3, loop
28
Software Pipelining
Unroll 4 ways first Int1 Int 2 M1 M2 FP+ FPx
loop: lw F1, 0(R1) lw F1
lw F2, 4(R1) lw F2
lw F3, 8(R1) lw F3
lw F4, 12(R1) add R1 lw F4
addiu R1, R1, 16 lw F1 add.s F5
add.s F5, F0, F1 lw F2 add.s F6
add.s F6, F0, F2 lw F3 add.s F7
add.s F7, F0, F3 add R1 lw F4 add.s F8
add.s F8, F0, F4 sw F5 add.s F5
sw F5, 0(R2) sw F6 add.s F6
sw F6, 4(R2) add R2 sw F7 add.s F7
sw F7, 8(R2) bne sw F8 add.s F8
sw F8, 12(R2)
sw F5
addiu R2, R2, 16
sw F6
bne R1, R3, loop
add R2 sw F7
bne sw F8
29
Software Pipelining
Unroll 4 ways first Int1 Int 2 M1 M2 FP+ FPx
loop: lw F1, 0(R1) lw F1
lw F2, 4(R1) lw F2
lw F3, 8(R1) lw F3
lw F4, 12(R1) add R1 lw F4
addiu R1, R1, 16 lw F1 add.s F5
add.s F5, F0, F1 lw F2 add.s F6
add.s F6, F0, F2 lw F3 add.s F7
add.s F7, F0, F3 add R1 lw F4 add.s F8
add.s F8, F0, F4 lw F1 sw F5 add.s F5
sw F5, 0(R2) lw F2 sw F6 add.s F6
sw F6, 4(R2) add R2 lw F3 sw F7 add.s F7
sw F7, 8(R2) add R1 bne lw F4 sw F8 add.s F8
sw F8, 12(R2)
sw F5 add.s F5
addiu R2, R2, 16
sw F6 add.s F6
bne R1, R3, loop
add R2 sw F7 add.s F7
bne sw F8 add.s F8
sw F5 30
Software Pipelining
Unroll 4 ways first Int1 Int 2 M1 M2 FP+ FPx
loop: lw F1, 0(R1) lw F1
lw F2, 4(R1) lw F2
lw F3, 8(R1) lw F3
lw F4, 12(R1) add R1 lw F4
prolog
addiu R1, R1, 16 lw F1 add.s F5
add.s F5, F0, F1 lw F2 add.s F6
add.s F6, F0, F2 lw F3 add.s F7
add.s F7, F0, F3 add R1 lw F4 add.s F8
add.s F8, F0, F4 loop: lw F1 sw F5 add.s F5
sw F5, 0(R2) iterate
lw F2 sw F6 add.s F6
sw F6, 4(R2) add R2 lw F3 sw F7 add.s F7
sw F7, 8(R2) add R1 bne lw F4 sw F8 add.s F8
sw F8, 12(R2)
sw F5 add.s F5
addiu R2, R2, 16
sw F6 add.s F6
bne R1, R3, loop epilog
add R2 sw F7 add.s F7
bne sw F8 add.s F8
sw F5 31
Software Pipelining
Unroll 4 ways first Int1 Int 2 M1 M2 FP+ FPx
loop: lw F1, 0(R1) lw F1
lw F2, 4(R1) lw F2
lw F3, 8(R1) lw F3
lw F4, 12(R1) add R1 lw F4
prolog
addiu R1, R1, 16 lw F1 add.s F5
add.s F5, F0, F1 lw F2 add.s F6
add.s F6, F0, F2 lw F3 add.s F7
add.s F7, F0, F3 add R1 lw F4 add.s F8
add.s F8, F0, F4 loop: lw F1 sw F5 add.s F5
sw F5, 0(R2) iterate
lw F2 sw F6 add.s F6
sw F6, 4(R2) add R2 lw F3 sw F7 add.s F7
sw F7, 8(R2) add R1 bne lw F4 sw F8 add.s F8
sw F8, 12(R2)
sw F5 add.s F5
addiu R2, R2, 16
sw F6 add.s F6
bne R1, R3, loop epilog
add R2 sw F7 add.s F7
bne sw F8 add.s F8
How many FLOPS/cycle?
sw F5 32
Software Pipelining
Unroll 4 ways first Int1 Int 2 M1 M2 FP+ FPx
loop: lw F1, 0(R1) lw F1
lw F2, 4(R1) lw F2
lw F3, 8(R1) lw F3
lw F4, 12(R1) add R1 lw F4
prolog
addiu R1, R1, 16 lw F1 add.s F5
add.s F5, F0, F1 lw F2 add.s F6
add.s F6, F0, F2 lw F3 add.s F7
add.s F7, F0, F3 add R1 lw F4 add.s F8
add.s F8, F0, F4 loop: lw F1 sw F5 add.s F5
sw F5, 0(R2) iterate
lw F2 sw F6 add.s F6
sw F6, 4(R2) add R2 lw F3 sw F7 add.s F7
sw F7, 8(R2) add R1 bne lw F4 sw F8 add.s F8
sw F8, 12(R2)
sw F5 add.s F5
addiu R2, R2, 16
sw F6 add.s F6
bne R1, R3, loop epilog
add R2 sw F7 add.s F7
bne sw F8 add.s F8
How many FLOPS/cycle?
sw F5
4 add.s / 4 cycles = 1
33
Software Pipelining vs. Loop Unrolling
[Plot: performance vs. time within one loop; the loop-unrolled schedule pays startup and wind-down overhead on every iteration of the unrolled loop, while the software-pipelined schedule pays them only once]
Software pipelining pays startup/wind-down costs only once per loop, not once per iteration
34
What if there are no loops?
35
Trace Scheduling [Fisher, Ellis]
• Pick string of basic blocks, a trace, that represents most frequent branch path
• Use profiling feedback or compiler heuristics to find common branch paths
• Schedule whole “trace” at once
• Add fixup code to cope with branches jumping out of trace
36-37
Problems with “Classic” VLIW
• Object-code compatibility
– have to recompile all code for every machine, even for two machines in same
generation
• Object code size
– instruction padding wastes instruction memory/cache
– loop unrolling/software pipelining replicates code
• Scheduling variable latency memory operations
– caches and/or memory bank conflicts impose statically unpredictable
variability
• Knowing branch probabilities
– Profiling requires a significant extra step in the build process
• Scheduling for statically unpredictable branches
– optimal schedule varies with branch path
• Precise Interrupts can be challenging
– Does fault in one portion of bundle fault whole bundle?
– EQ Model has problem with single step, etc.
38
VLIW Instruction Encoding
41
Full Predication
– Almost all instructions can be executed
conditionally under predicate
– Instruction becomes NOP if predicate register false
b0: Inst 1 if
Inst 2
br a==b, b2
b3: Inst 7
Inst 8
42
Four basic blocks
Full Predication
– Almost all instructions can be executed
conditionally under predicate
– Instruction becomes NOP if predicate register false

Before: four basic blocks
b0: Inst 1
    Inst 2
    br a==b, b2
b1: Inst 3        (else)
    Inst 4
    br b3
b2: Inst 5        (then)
    Inst 6
b3: Inst 7
    Inst 8

After predication: one basic block
    Inst 1
    Inst 2
    p1,p2 <- cmp(a==b)
    (p1) Inst 3 || (p2) Inst 5
    (p1) Inst 4 || (p2) Inst 6
    Inst 7
    Inst 8

Mahlke et al, ISCA95: On
average >50% branches removed 43
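A rough source-level sketch of what if-conversion does (illustrative C, not from the slides;
the variable names are made up):

  /* Branchy version: separate then/else basic blocks around a branch. */
  if (a == b) { x = x + 1; }      /* then block */
  else        { x = x - 1; }      /* else block */

  /* If-converted version: one basic block; both sides execute under
     predicates and only the enabled result is kept (approximated in C
     with a select). */
  int p   = (a == b);             /* p1 = p, p2 = !p      */
  int x_t = x + 1;                /* executed under one predicate  */
  int x_e = x - 1;                /* executed under the other      */
  x = p ? x_t : x_e;              /* predicated writeback          */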
Predicates and the Datapath
[Figure, repeated across three build slides: the five-stage pipeline datapath (PC, instruction
memory, GPR register file, immediate extender, ALU, data memory, writeback) annotated to show
where predicate values are read and used to suppress the register and memory writes of
predicated instructions.]
47
Leveraging Speculative Execution and
Reacting to Dynamic Events in a VLIW
Speculation:
• Moving instructions across branches
• Moving memory operations past other memory
operations
Dynamic Events:
• Cache Miss
• Exceptions
• Branch Mispredict
48
Code Motion
Before Code Motion:
MUL R1, R2, R3
ADDIU R11,R10,1
MUL R5, R1, R4
MUL R7, R5, R6
SW R7, 0(R16)
ADDIU R12,R11,1
LW R14, 0(R9)
ADD R13,R12,R14
ADD R14,R12,R13
BNEQ R16, target

After Code Motion:
LW R14, 0(R9)
ADDIU R11,R10,1
MUL R1, R2, R3
ADDIU R12,R11,1
MUL R5, R1, R4
ADD R13,R12,R14
MUL R7, R5, R6
ADD R14,R12,R13
SW R7, 0(R16)
BNEQ R16, target
49
Scheduling and Bundling
Before Bundling:
LW R14, 0(R9)
ADDIU R11,R10,1
MUL R1, R2, R3
ADDIU R12,R11,1
MUL R5, R1, R4
ADD R13,R12,R14
MUL R7, R5, R6
ADD R14,R12,R13
SW R7, 0(R16)
BNEQ R16, target

After Bundling:
{LW R14, 0(R9)
 ADDIU R11,R10,1
 MUL R1, R2, R3}
{ADDIU R12,R11,1
 MUL R5, R1, R4}
{ADD R13,R12,R14
 MUL R7, R5, R6}
{ADD R14,R12,R13
 SW R7, 0(R16)
 BNEQ R16, target}
50
VLIW Speculative Execution
Problem: Branches restrict compiler code motion
Solution: Speculative operations that don’t cause exceptions

Original code (can’t move the load above the branch because it might cause a spurious exception):
  Inst 1
  Inst 2
  br a==b, b2
  Load r1
  Use r1
  Inst 3

With a speculative load:
  Load.s r1      <- speculative load never causes an exception, but sets a “poison” bit on the destination register
  Inst 1
  Inst 2
  br a==b, b2
  Chk.s r1       <- check for exception in the original home block; jumps to fixup code if an exception was detected
  Use r1
  Inst 3
54
VLIW Data Speculation
Problem: Possible memory hazards limit code scheduling
Solution: Hardware to check pointer hazards
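A minimal C sketch of the hazard this targets (illustrative; the names are made up): if p and q
may alias, the compiler cannot hoist the load of *q above the store to *p without hardware that
checks the addresses at run time and triggers fixup code on a conflict.

  void f(int *p, int *q) {
      *p = 1;          /* store: might write the same location as *q            */
      int x = *q;      /* load the compiler would like to schedule much earlier */
      use(x);          /* hypothetical consumer                                 */
  }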
61
VLIW Multi-Way Branches
Problem: Long instructions provide few opportunities for branches
Solution: Allow one instruction to branch multiple directions
{ .mii
cmp.eq P1, P2 = R1, R2
cmp.ne P3,P4 = R4, R5
cmp.lt P5,P6 = R8, R9
}
{ .bbb
(P1) br.cond label1
(P2) br.cond label2
(P5) br.cond label3
}
// fall through code here
64
Scheduling Around Dynamic Events
• Cache Miss
– Informing loads (loads nullify subsequent
instructions)
– Elbrus (Soviet/Russian) processor had branch on
cache miss
• Branch Mispredict
– Delay slots with predicated instructions
• Exceptions
– Hard on superscalar also…
65
Clustered VLIW
• 8 cores
• Cores are 2-way multithreaded
• 1-cycle 16KB L1 I&D caches
• 9-cycle 512KB L2 I-cache
• 8-cycle 256KB L2 D-cache
• 32 MB shared L3 cache
• 6 instruction/cycle fetch
  – Two 128-bit bundles
• Up to 12 insts/cycle execute
• 544mm2 in 32nm CMOS
• Over 3 billion transistors 71
IA-64 Instruction Format
128-bit bundle: Instruction 2 (41 bits) | Instruction 1 (41 bits) | Instruction 0 (41 bits) | Template (5 bits)
72
IA-64 Registers
• 128 General Purpose 64-bit Integer Registers
• 128 General Purpose 64/80-bit Floating Point
Registers
• 64 1-bit Predicate Registers
80
Computer Architecture
ELE 475 / COS 475
Slide Deck 8: Branch Prediction
David Wentzlaff
Department of Electrical Engineering
Princeton University
1
Agenda
• Branch Cost Motivation
• Branch Prediction
– Outcome
• Static
• Dynamic
– Target Address
2
Longer Frontends Means More Control
Flow Penalty
[Figure: a superscalar pipeline with a long frontend (F, D, issue queue IQ, issue, register read)
ahead of the execution pipes (X0, S0, L0–L1, Y0–Y3), writeback, ROB, and store buffers; the
branch resolves deep in the pipeline.]
Penalty includes
instructions in IQ
4
Longer Pipeline Frontends Amplify
Branch Cost
[Figure: pipeline with a multi-cycle frontend ahead of execute and the data cache.]
6
Superscalars Multiply Branch Cost
BEQZ F D I A0 A1 W
OpA  F D I B0 -  -
OpB  F D I -  -  -
OpC  F D I -  -  -
OpD  F D -  -  -  -
OpE  F D -  -  -  -
OpF  F -  -  -  -  -
OpG  F -  -  -  -  -
OpH  F D I A0 A1 W
OpI  F D I B0 B1 W
A dual-issue processor has twice the mispredict penalty.
How much work is lost if pipeline doesn’t follow correct instruction flow?
~pipeline width x branch penalty 7
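As a worked reading of the diagram above: the BEQZ resolves at the end of A1, so the dual-issue
machine squashes the seven younger operations it fetched down the wrong path (OpA–OpG); a
single-issue pipeline with the same branch-resolution depth would lose roughly half as many
slots, which is the ~pipeline width x branch penalty rule of thumb.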
Agenda
• Branch Cost Motivation
• Branch Prediction
– Outcome
• Static
• Dynamic
– Target Address
8
Branch Prediction
• Essential in modern processors to mitigate
branch delay latencies
9
Where is the Branch Information
Known?
F D I X M W
Know branch outcome
Know target address for JR, JALR
10
Agenda
• Branch Cost Motivation
• Branch Prediction
– Outcome
• Static
• Dynamic
– Target Address
11
Branch Delay Slots
(expose control hazard to software)
I1 096 ADD
I2 100 BEQZ r1 +200
I3 104 ADD    <- delay slot instructions executed
I4 108 ADD       regardless of branch outcome
I5 304 ADD
12
Static Branch Prediction
Overall probability a branch is taken is ~60-70%, but:
backward branches (e.g., a BEZ back to the top of a loop) are taken ~90% of the time,
while forward branches are taken only ~50% of the time.
13
Static Software Branch Prediction
• Extend ISA to enable compiler to tell microarchitecture if branch is
likely to be taken or not (Can be up to 80% accurate)
BR.T F D X M W
OpA F - - - -
Targ F D X M W
BR.NT F D X M W
OpB F D X M W
OpC F D X M W
16
Dynamic Hardware Branch Prediction:
Exploiting Temporal Correlation
• Exploit structure in program: The way a
branch resolves may be a good indicator of
the way it will resolve the next time it
executes (Temporal Correlation)
[FSM figure: “Predict Taken” and “Predict Not Taken” states with transitions on taken (T) /
not-taken (NT) outcomes.]
17
1-bit Saturating Counter
[FSM figure: saturating-counter predictor states (Predict Taken, Predict Not Taken, Strong Not
Taken) with transitions on taken (T) / not-taken (NT) outcomes.]
20
Branch History Table (BHT)
[Figure: k bits of the fetch PC index a 2^k-entry BHT with 2 bits per entry, accessed in
parallel with the I-Cache; the selected counter feeds FSM output logic that produces the
prediction (T/NT), and the branch outcome (T/NT) updates the entry.]
23
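A small C sketch of the 2-bit saturating counter commonly stored in each BHT entry (a typical
implementation; the exact FSM used on the slides may differ):

  /* Counter values: 0,1 predict not-taken; 2,3 predict taken. */
  typedef unsigned char ctr2_t;

  int predict_taken(ctr2_t c) { return c >= 2; }

  ctr2_t update(ctr2_t c, int taken) {
      if (taken) return (c < 3) ? c + 1 : 3;   /* saturate at strongly taken     */
      else       return (c > 0) ? c - 1 : 0;   /* saturate at strongly not-taken */
  }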
Two-Level Branch Predictor
[Figure: the branch outcome (T/NT) is shifted into a Branch History Register (BHR); the k-bit
BHR selects one of the Pattern History Tables (PHT 0 … PHT 2^k−1), the PC indexes within the
selected PHT, and FSM output logic produces the prediction (T/NT).]
24
Generalized Two-Level Branch
Predictor
[Figure: the same structure with an m-bit PC index and a k-bit BHR selecting among the PHTs,
feeding FSM output logic that produces the prediction (T/NT).]
For non-trivial m and k, > 97% accuracy 25
Tournament Predictors
(ex: Alpha 21264)
[Figure: the PC feeds a Global Predictor and a Local Predictor; a Choice Predictor selects which
of the two supplies the prediction (T/NT).]
• Choice predictor learns whether best to use local or global branch
history in predicting next branch
• Global history is speculatively updated but restored on mispredict
• Claim 90-100% success on range of applications 26
Agenda
• Branch Cost Motivation
• Branch Prediction
– Outcome
• Static
• Dynamic
– Target Address
27
Predicting Target Address
F D I X M W
Know target address for JR, JALR
[Figure: a branch target buffer looked up with the fetch PC; on a hit that is predicted taken,
the stored target becomes the new PC.]
30
Uses of Jump Register (JR)
• Switch statements (jump to address of matching case)
  – BTB works well if same case used repeatedly
[Figure: a small k-entry structure (typically k=8-16) holding return addresses
&fb(), &fc(), &fd().]
32
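For illustration (hypothetical C, not from the slides), a dense switch typically compiles to a
load from a jump table followed by a jump-register instruction, which is exactly the indirect
branch the BTB must predict:

  switch (op) {              /* compiler emits roughly:                */
  case 0: x += 1; break;     /*   load  t0, jump_table[op]             */
  case 1: x -= 1; break;     /*   jr    t0                             */
  case 2: x *= 2; break;
  default: x = 0;  break;    /* BTB predicts well when the same case   */
  }                          /* (same target) repeats                  */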
Acknowledgements
• These slides contain material developed and copyright by:
– Arvind (MIT)
– Krste Asanovic (MIT/UCB)
– Joel Emer (Intel/MIT)
– James Hoe (CMU)
– John Kubiatowicz (UCB)
– David Patterson (UCB)
– Christopher Batten (Cornell)
33
Computer Architecture
ELE 475 / COS 475
Slide Deck 9: Advanced Caches
David Wentzlaff
Department of Electrical Engineering
Princeton University
1
Agenda
• Review
– Three C’s
– Basic Cache Optimizations
• Advanced Cache Optimizations
– Pipelined Cache Write
– Write Buffer
– Multilevel Caches
– Victim Caches
– Prefetching
• Hardware
• Software
– Multiporting and Banking
– Software Optimizations
– Non-Blocking Cache
– Critical Word First/Early Restart
2
Average Memory Access Time
Hit
Main
Processor CACHE Memory
Miss
• Average Memory Access Time = Hit Time + ( Miss Rate * Miss Penalty )
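Illustrative numbers (not from the slides): with a 1-cycle hit time, a 5% miss rate, and a
100-cycle miss penalty, AMAT = 1 + 0.05 x 100 = 6 cycles.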
4
Categorizing Misses: The Three C’s
7
Plot from Hennessy and Patterson Ed. 5 Image Copyright © 2011, Elsevier Inc. All rights Reserved.
Reduce Miss Rate: Large Cache Size
8
Plot from Hennessy and Patterson Ed. 5 Image Copyright © 2011, Elsevier Inc. All rights Reserved.
Reduce Miss Rate: High Associativity
10
Write Performance
[Figure: cache write path. The address is split into Tag (t bits), Index (k bits), and Block
Offset (b bits); the index selects one of 2^k lines of the V/Tag/Data arrays, the stored tag is
compared against the address tag, and the comparison result gates the write enable (WE) of the
data array.]
12
Reducing Write Hit Time
Problem: Writes take two cycles in memory stage, one
cycle for tag check plus one cycle for data write if hit
Solutions:
• Design data RAM that can perform read and write
concurrently, restore old value after tag miss
• Fully-associative (CAM Tag) caches: Word line only
enabled if hit
13
Pipelining Cache Writes
[Figure: F D X M W pipeline with a delayed write buffer in the data cache; a store performs its
tag check in M while the previous store’s buffered data is written into the data array, so
writes are pipelined across consecutive stores.]
16
Pipelined Cache Efficacy
Pipelined
Writes - +
17
Write Buffer to Reduce Read Miss
Penalty
CPU L1 Data Unified
Cache L2 Cache
Write
RF
buffer
Evicted dirty lines for writeback cache
OR
All writes in writethrough cache
Write Buffer
19
Write Buffer Efficacy
Write Buffer
+
20
Multilevel Caches
Problem: A memory cannot be large and fast
Solution: Increasing sizes of cache at each level
21
Presence of L2 influences L1 design
• Use smaller L1 if there is also L2
– Trade increased L1 miss rate for reduced L1 hit time and
reduced L1 miss penalty
– Reduces average access energy
• Use simpler write-through L1 with on-chip L2
– Write-back L2 cache absorbs write traffic, doesn’t go off-chip
– At most one L1 miss request per L1 access (no dirty victim write
back) simplifies pipeline control
– Simplifies coherence issues
– Simplifies error recovery in L1 (can use just parity bits in L1 and
reload from L2 when parity error detected on L1 read)
22
Inclusion Policy
• Inclusive multilevel cache:
– Inner cache holds copies of data in outer cache
– External coherence snoop access need only check
outer cache
• Exclusive multilevel caches:
– Inner cache may hold data not in outer cache
– Swap lines between inner/outer caches on miss
– Used in AMD Athlon with 64KB primary and 256KB
secondary cache
Why choose one type or the other?
23
Itanium-2 On-Chip Caches
(Intel/HP, 2002)
Level 1: 16KB, 4-way s.a.,
64B line, quad-port (2
load+2 store), single
cycle latency
Multilevel
Cache
26
Multilevel Cache Efficacy L1
Multilevel
Cache +
27
Multilevel Cache Efficacy L1, L2, L3
Multilevel
Cache + +
28
Victim Cache
• Small Fully Associative cache for recently evicted lines
– Usually small (4-16 blocks)
• Reduced conflict misses
– More associativity for small number of lines
• Can be checked in parallel or series with main cache
• On Miss in L1, Hit in VC: VC->L1, L1->VC
• On Miss in L1, Miss in VC: L1->VC, VC->? (Can always be clean)
[Figure: CPU with an L1 data cache backed by a unified L2; data evicted from the L1 goes into a
small fully-associative Victim Cache, and on an L1 miss that hits in the victim cache the line
is swapped back into the L1.]
29
Victim Cache Efficacy
Victim Cache
30
Victim Cache Efficacy L1
Victim Cache
+
31
Victim Cache Efficacy L1 and VC
Victim Cache
+ +
32
Prefetching
• Speculate on future instruction and data accesses
and fetch them into cache(s)
– Instruction accesses easier to predict than data accesses
• Varieties of prefetching
– Hardware prefetching
– Software prefetching
– Mixed schemes
Prefetched data
34
Hardware Instruction Prefetching
Instruction prefetch in Alpha AXP 21064
– Fetch two blocks on a miss; the requested block (i) and
the next consecutive block (i+1)
– Requested block placed in cache, and next block in
instruction stream buffer
– If miss in cache but hit in stream buffer, move stream
buffer block into cache and prefetch next block (i+2)
[Figure: on an instruction-cache miss, the requested block (Req block) is fetched from the
unified L2 into the L1 instruction cache and the next sequential (prefetched) block is placed
in the stream buffer.]
35
Hardware Data Prefetching
• Prefetch-on-miss:
– Prefetch b + 1 upon miss on b
• Strided prefetch
– If observe sequence of accesses to block b, b+N, b+2N, then prefetch
b+3N etc.
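A minimal C sketch of one per-stream stride detector (illustrative; prefetch() and the table
organization are assumptions, not any particular machine’s design):

  struct stride_entry { long last_addr; long stride; int confirmed; };

  void observe_access(struct stride_entry *e, long addr) {
      long d = addr - e->last_addr;
      e->confirmed = (d != 0 && d == e->stride);  /* same stride seen twice in a row */
      e->stride    = d;
      e->last_addr = addr;
      if (e->confirmed)
          prefetch(addr + d);   /* hypothetical hook: request the next block early */
  }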
37
Software Prefetching Issues
• Timing is the biggest issue, not predictability
– If you prefetch very close to when the data is required, you might be
too late
– Prefetch too early, cause pollution
– Estimate how long it will take for the data to come into L1, so we can
set P appropriately
– Why is this hard to do?
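A hedged sketch of software prefetching using the GCC/Clang intrinsic __builtin_prefetch; P is
the lookahead distance whose choice is exactly the timing problem described above:

  for (int i = 0; i < N; i++) {
      __builtin_prefetch(&a[i + P]);   /* request a[i+P] now...           */
      sum += a[i];                     /* ...so it is in cache when used  */
  }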
Prefetching
39
Prefetching Efficacy
Prefetching
+ +
40
Increasing Cache Bandwidth
Multiporting and Banking
[Figure: a true multiported data cache accepts two addresses (Address 1, Address 2) and returns
two data words (Data 1, Data 2) per cycle.]
44
Banked Caches
• Partition Address Space into multiple banks
– Use portions of address (low or high order interleaved)
Benefits:
• Higher throughput
Challenges:
• Bank Conflicts
• Extra Wiring
• Uneven utilization
[Figure: each address/data port is routed to its own cache bank (Bank 0, Bank 1, …).]
45
Cache Banking Efficacy
Cache
Banking +
46
Compiler Optimizations
• Restructuring code affects the data block access
sequence
– Group data accesses together to improve spatial locality
– Re-order data accesses to improve temporal locality
• Prevent data from entering the cache
– Useful for variables that will only be accessed once before being
replaced
– Needs mechanism for software to tell hardware not to cache data
(“no-allocate” instruction hints or page table bits)
• Kill data that will never be used again
– Streaming data exploits spatial locality but not temporal locality
– Replace into dead cache locations
47
Loop Interchange
for(j=0; j < N; j++) {
for(i=0; i < M; i++) {
x[i][j] = 2 * x[i][j];
}
}
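With x stored in row-major order, the interchanged loop below (a sketch of the intended
transformation) walks memory sequentially and so improves spatial locality:

  for(i=0; i < M; i++) {
      for(j=0; j < N; j++) {
          x[i][j] = 2 * x[i][j];   /* consecutive j touch consecutive words */
      }
  }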
48
Loop Fusion
for(i=0; i < N; i++)
a[i] = b[i] * c[i];
53
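Only the first loop survives on the slide above; as an illustrative sketch (the second loop here
is made up, not necessarily the one used in lecture), fusion merges two loops over the same
index range so each a[i] is still in the cache, or even a register, when the second statement
reads it:

  /* Before fusion: a[i] is produced in one loop and re-read in another. */
  for(i=0; i < N; i++)
      a[i] = b[i] * c[i];
  for(i=0; i < N; i++)
      d[i] = a[i] + c[i];

  /* After fusion: one pass, better temporal locality for a[] and c[]. */
  for(i=0; i < N; i++) {
      a[i] = b[i] * c[i];
      d[i] = a[i] + c[i];
  }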
Matrix Multiply with Cache Tiling/Blocking
for(jj=0; jj < N; jj=jj+B)
for(kk=0; kk < N; kk=kk+B)
for(i=0; i < N; i++)
for(j=jj; j < min(jj+B,N); j++) {
r = 0;
for(k=kk; k < min(kk+B,N); k++)
r = r + y[i][k] * z[k][j];
x[i][j] = x[i][j] + r;
}
[Figure: the portions of the y, z, and x arrays touched by one (jj, kk) block iteration.]
58
Compiler Memory Optimizations
Efficacy
Cache Miss Rate Miss Penalty Hit Time Bandwidth
Optimization
Compiler
Optimization +
59
Non-Blocking Caches
(aka Out-Of-Order Memory System)
(aka Lockup Free Caches)
• Enable subsequent cache accesses after a cache
miss has occurred
– Hit-under-miss
– Miss-under-miss (concurrent misses)
• Suitable for in-order or out-of-order
processors
• Challenges
– Maintaining ordering when multiple misses might
return out of order
– Load or Store to an already pending miss address
(need merge) 60
Non-Blocking Cache Timeline
[Figure: timeline of a cache miss being serviced while later accesses continue to hit.]
MSHR entry:
– V: Valid
– Block Address: Address of cache block in memory system
– Issued: Issued to Main Memory/Next level of cache
Load/Store entry:
– V: Valid
– MSHR Entry: Entry Number
– Type: {LW, SW, LH, SH, LB, SB}
– Offset: Offset within the block
– Destination: (Loads) Register, (Stores) Store buffer entry
62
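A C-struct sketch of the bookkeeping just listed (field names follow the slide; widths and
encodings are illustrative):

  struct mshr_entry {
      int      valid;        /* V                                    */
      unsigned block_addr;   /* address of the missing cache block   */
      int      issued;       /* sent to main memory / next level?    */
  };

  struct ldst_entry {
      int  valid;                            /* V                          */
      int  mshr;                             /* MSHR entry number          */
      enum { LW, SW, LH, SH, LB, SB } type;  /* access type                */
      int  offset;                           /* offset within the block    */
      int  dest;                             /* loads: register;           */
                                             /* stores: store-buffer entry */
  };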
Non-Blocking Cache Operation
On Cache Miss:
• Check MSHR for matched address
– If found: Allocate new Load/Store entry pointing to MSHR
– If not found: Allocate new MSHR entry and Load/Store entry
– If all entries full in MSHR or Load/Store entry table, stall or
prevent new LDs/STs
On Data Return from Memory:
• Find Load or Store waiting for it
– Forward Load data to processor/Clear Store Buffer
– Could be multiple Loads and Stores
• Write Data to cache
When the Cache Line is Completely Returned:
• De-allocate MSHR entry
63
Non-Blocking Cache with In-Order
Pipelines
• Need Scoreboard for Individual Registers
On Load Miss:
• Mark Destination Register as Busy
64
Non-Blocking Cache Efficacy
Non-blocking
Cache
65
Non-Blocking Cache Efficacy
Non-blocking
Cache + +
66
Critical Word First
• Request the missed word from memory first.
• Rest of cache line comes after “critical word”
– Commonly words come back in rotated order
CPU Time CPU Time
Basic Blocking Cache:
Miss Penalty
Order of fill: 0, 1, 2, 3, 4, 5, 6, 7
Order of fill: 3, 4, 5, 6, 7, 0, 1, 2
67
Early Restart
• Data returns from memory in order
• Processor Restarts when needed word is
returned
CPU Time CPU Time
Basic Blocking Cache:
Miss Penalty
Order of fill: 0, 1, 2, 3, 4, 5, 6, 7
Order of fill: 0, 1, 2, 3, 4, 5, 6, 7
68
Critical Word First and Early Restart
Efficacy
Cache Miss Rate Miss Penalty Hit Time Bandwidth
Optimization
Critical Word
First/Early
Restart
69
Critical Word First and Early Restart
Efficacy
Cache Miss Rate Miss Penalty Hit Time Bandwidth
Optimization
Critical Word
First/Early +
Restart
70
Agenda
• Review
– Three C’s
– Basic Cache Optimizations
• Advanced Cache Optimizations
– Pipelined Cache Write
– Write Buffer
– Multilevel Caches
– Victim Caches
– Prefetching
• Hardware
• Software
– Multiporting and Banking
– Software Optimizations
– Non-Blocking Cache
– Critical Word First/Early Restart
71
Acknowledgements
• These slides contain material developed and copyright by:
– Arvind (MIT)
– Krste Asanovic (MIT/UCB)
– Joel Emer (Intel/MIT)
– James Hoe (CMU)
– John Kubiatowicz (UCB)
– David Patterson (UCB)
– Christopher Batten (Cornell)
72
Computer Architecture
ELE 475 / COS 475
Slide Deck 10: Address Translation
and Protection
David Wentzlaff
Department of Electrical Engineering
Princeton University
1
Memory Management
• From early absolute addressing schemes, to
modern virtual memory systems with support for
virtual machine monitors
3
Bare Machine
Physical Physical
Address Inst. Address Data
PC D Decode E + M W
Cache Cache
4
Dynamic Address Translation
Location-independent programs
Programming and storage management ease
need for a base register prog1
Protection
Physical Memory
Independent programs should not affect
each other inadvertently
need for a bound register
Multiprogramming drives requirement for
resident supervisor to manage context prog2
switches between multiple programs
OS
5
Simple Base and Bound Translation
[Figure: the effective address of a Load is checked against the Bound Register (segment length)
for a bounds violation and added to the Base Register to form the physical address of the
current segment in main memory.]
[Figure: a pipeline with separate Program Base and Data Base registers — instruction fetches are
relocated by the program base and loads/stores by the data base — before the physical addresses
go through the memory controller to main memory (DRAM).]
10
Private Address Space per User
OS
User 1 VA1 pages
Page Table
User 2
Physical Memory
VA1
Page Table
User 3 VA1
12
Page Tables in Physical Memory
PT
User
1
VA1
PT
Physical Memory
User
User 1 Virtual 2
Address Space
VA1
User 2 Virtual
Address Space
13
Linear Page Table
• Page Table Entry (PTE) contains:
– A bit to indicate if a page exists
– PPN (physical page number) for a memory-resident page
– DPN (disk page number) for a page on the disk
– Status bits for protection and usage
• OS sets the Page Table Base Register whenever active
user process changes
[Figure: the PT Base Register plus the VPN of the virtual address (VPN, Offset) select an entry
of the linear page table; each entry holds a PPN or DPN, and the PPN concatenated with the
offset locates the data word in the data pages.]
14
Size of Linear Page Table
With 32-bit addresses, 4-KB pages & 4-byte PTEs:
2^20 PTEs, i.e., 4 MB page table per user per process
4 GB of swap needed to back up full virtual address
space
Larger pages?
• Internal fragmentation (Not all memory in page is used)
• Larger page fault penalty (more time to read from disk)
Hierarchical Page Table
[Figure: the 32-bit virtual address is split into a 10-bit L1 index, a 10-bit L2 index, and an
offset. The root of the current page table (a processor register) points to the Level 1 page
table; Level 1 entries point to Level 2 page tables, whose entries map user pages (e.g.,
User1/VA1, User2/VA1). Pages and page tables may be resident in physical memory or on disk.]
17
Address Translation & Protection
Virtual Address Virtual Page No. (VPN) offset
Kernel/User Mode
Read/Write
Protection Address
Check Translation
Exception?
Physical Address Physical Page No. (PPN) offset
19
TLB Designs
• Typically 16-128 entries, usually fully associative
– Each entry maps a large page, hence less spatial locality across
pages more likely that two entries conflict
– Sometimes larger TLBs (256-512 entries) are 4-8 way set-
associative
– Larger systems sometimes have multi-level (L1 and L2) TLBs
• Random (Clock Algorithm) or FIFO replacement policy
• No process information in TLB
– Flush TLB on Process Context Switch
• TLB Reach: Size of largest virtual address space that can be
simultaneously mapped by TLB
Example: 64 TLB entries, 4KB pages, one page per entry
20
TLB Extensions
• Address Space Identifier (ASID)
– Allow TLB Entries from multiple processes to be in
TLB at same time. ID of address space (Process) is
matched on.
– Global Bit (G) can match on all ASIDs
• Variable Page Size (PS)
– Can increase reach on a per page basis
VRWD tag PPN PS G ASID
22
Handling a TLB Miss
Software (MIPS, Alpha)
TLB miss causes an exception and the operating system
walks the page tables and reloads TLB. A privileged
“untranslated” addressing mode used for walk
23
Hierarchical Page Table Walk:
SPARC v8
Virtual Address Index 1 Index 2 Index 3 Offset
31 23 17 11 0
Context Context Table
Table
Register L1 Table
root ptr
Context
Register L2 Table
PTP L3 Table
PTP
PTE
31 11
Physical Address 0 PPN Offset
Miss? Miss?
Page-Table Base
Register Hardware Page
Table Walker
Physical Physical
Memory Controller Address
Address
Physical Address
Main Memory (DRAM)
• Assumes page tables held in untranslated physical memory
25
Address Translation:
putting it all together
Virtual Address
hardware
Restart instruction hardware or software
TLB
software
Lookup
miss hit
Page Fault
Update TLB Protection Physical
(OS loads page) Address
Fault
(to cache)
SEGFAULT
26
Modern Virtual Memory Systems
Illusion of a large, private, uniform store
Protection & Privacy OS
several users, each with their private
address space and one or more
shared address spaces useri
page table name space
Swapping
Demand Paging Store
Provides the ability to run programs Primary
larger than the primary memory Memory
29
Virtually Addressed Cache
(Virtual Index/Virtual Tag)
Virtual Virtual
Address Address
Inst. Data
PC D Decode E + M W
Cache Cache
Miss? Miss?
Inst.
Page-Table Base Data
TLB Register Hardware Page
TLB
Physical Table Walker
Address Physical
Instruction Memory Controller Address
data
Physical Address
Main Memory (DRAM)
Translate on miss
30
Aliasing in Virtual-Address Caches
[Figure: two virtual addresses that map through the page table to the same physical address can
occupy two different lines of a virtually-addressed cache (e.g., VA1 holding the “1st Copy of
Data at PA”), so a write through one alias leaves the other copy stale.]
31
Cache-TLB Interactions
• Physically Indexed/Physically Tagged
• Virtually Indexed/Virtually Tagged
• Virtually Indexed/Physically Tagged
– Concurrent cache access with TLB Translation
• Both Indexed/Physically Tagged
– Small enough cache or highly associative cache
will have fewer indexes than page size
– Concurrent cache access with TLB Translation
• Physically Indexed/Virtually Tagged
32
Acknowledgements
• These slides contain material developed and copyright by:
– Arvind (MIT)
– Krste Asanovic (MIT/UCB)
– Joel Emer (Intel/MIT)
– James Hoe (CMU)
– John Kubiatowicz (UCB)
– David Patterson (UCB)
– Christopher Batten (Cornell)
33
Computer Architecture
ELE 475 / COS 475
Slide Deck 11: Vector, SIMD, and GPUs
David Wentzlaff
Department of Electrical Engineering
Princeton University
1
Agenda
• Vector Processors
• Single Instruction Multiple Data (SIMD)
Instruction Set Extensions
• Graphics Processing Units (GPU)
2
Vector Programming Model
Scalar Registers Vector Registers
r15 v15
r0 v0
[0] [1] [2] [VLRMAX-1]
Vector Length Register VLR
V1
Vector Arithmetic V2
Instructions + + + + + +
ADDVV V3, V1, V2 V3
[0] [1] [VLR-1]
[Figure: vector load/store instructions access memory using a base address (r1) and a stride (r2).]
5
Vector Code Element-by-Element
Multiplication
# C code
for (i=0; i<64; i++)
  C[i] = A[i] * B[i];

# Scalar Assembly Code
      LI R4, 64
loop: L.D F0, 0(R1)
      L.D F2, 0(R2)
      MUL.D F4, F2, F0
      S.D F4, 0(R3)
      DADDIU R1, 8
      DADDIU R2, 8
      DADDIU R3, 8
      DSUBIU R4, 1
      BNEZ R4, loop

# Vector Assembly Code
      LI VLR, 64
      LV V1, R1
      LV V2, R2
      MULVV.D V3, V1, V2
      SV V3, R3
6
Vector Arithmetic Execution
• Use deep pipeline (=> fast clock) to
execute element operations V1 V2 V3
• Simplifies control of deep pipeline
because elements in vector are
independent
• no data hazards!
• no bypassing needed
Six stage multiply pipeline
V3 <- V1 * V2
7
Interleaved Vector Memory System
Cray-1, 16 banks, 4 cycle bank busy time, 12 cycle latency
• Bank busy time: Time before bank ready to accept next request
Base Stride
Vector Registers
Address
Generator
+
0 1 2 3 4 5 6 7 8 9 A B C D E F
Memory Banks
8
Example Vector Microarchitecture
SRF
VLR
X0 VRF
F D R L0 L1 W
S0 S1
Y0 Y1 Y2 Y3
Commit Point
9
Basic Vector Execution
# C code # Vector Assembly Code
for (i=0; i<4; i++) LI VLR, 4
C[i] = A[i] * B[i]; LV V1, R1
LV V2, R2
VLR = 4 MULVV.D V3, V1, V2
SV V3, R3
LV V2, R2 F D R L0 L1 W
R L0 L1 W
R L0 L1 W
R L0 L1 W
MULVV.D V3, V1, V2 F D D D D D D D R Y0 Y1 Y2 Y3 W
R Y0 Y1 Y2 Y3 W
R Y0 Y1 Y2 Y3 W
R Y0 Y1 Y2 Y3 W
SV V3, R3 F F F F F F F D D D D D D D D D R S0 S1 W
R S0 S1 W
R S0 S1 W
R S0 S1 W
10
Vector Instruction Parallelism
• Can overlap execution of multiple vector instructions
– example machine has 32 elements per vector register and
8 lanes
Load Unit Multiply Unit Add Unit
load
mul
add
time
load
mul
add
Instruction
issue
Complete 24 operations/cycle while issuing 1 short instruction/cycle12
Vector Chaining
• Vector version of register bypassing
– introduced with Cray-1
V1 V2 V3 V4 V5
LV V1
MULVV V3,V1,v2
ADDVV V5,V3, v4
Chain Chain
Load
Unit
Mult. Add
Memory
14
Vector Chaining Advantage
• Without chaining, must wait for last element of result to be
written before starting dependent instruction
Load
Mul
Time Add
20
Vector Stripmining
Problem: Vector registers have finite length
Solution: Break loops into pieces that fit in registers, “Stripmining”

# C code
for (i=0; i<N; i++)
  C[i] = A[i]*B[i];

# Vector Assembly Code
      ANDI R1, N, 63    # N mod 64
      MTC1 VLR, R1      # Do remainder
loop: LV V1, RA
      LV V2, RB
      MULVV.D V3, V1, V2
      SV V3, RC
      DSLL R2, R1, 3    # Multiply by 8
      DADDU RA, RA, R2  # Bump pointer
      DADDU RB, RB, R2
      DADDU RC, RC, R2
      DSUBU N, N, R1    # Subtract elements
      LI R1, 64
      MTC1 VLR, R1      # Reset full length
      BGTZ N, loop      # Any more to do?

[Figure: arrays A, B, C processed as a remainder strip followed by full 64-element strips.]
21
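The same stripmining control flow written in C (a sketch; MVL stands for the maximum vector
length, 64 in the code above):

  int low = 0;
  int vl  = n % MVL;                         /* odd-sized first strip          */
  for (int s = 0; s <= n / MVL; s++) {
      for (int i = low; i < low + vl; i++)   /* one vector instruction's worth */
          C[i] = A[i] * B[i];
      low += vl;
      vl   = MVL;                            /* remaining strips are full length */
  }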
Vector Stripmining
VLR = 4
LV F D R L0 L1 W
R L0 L1 W
R L0 L1 W
R L0 L1 W
LV V2, R2 F D R L0 L1 W
R L0 L1 W
R L0 L1 W
R L0 L1 W
MULVV.D V3, V1, V2 F D D R Y0 Y1 Y2 Y3 W
R Y0 Y1 Y2 Y3 W
R Y0 Y1 Y2 Y3 W
R Y0 Y1 Y2 Y3 W
SV V3, R3 F F D D D D R S0 S1 W
R S0 S1 W
R S0 S1 W
R S0 S1 W
DSLL R2, R1, 3 F F F F D R X W
DADDU RA, RA, R2 F D R X W
DADDU RB, RB, R2 F D R X W
DADDU RC, RC, R2 F D R X W
DSUBU N, N, R1 F D R X W
LI R1, 64 F D R X W
MTC1 VLR, R1 F D R X W
22
BGTZ N, loop F D R X W
Vector Instruction Execution
MULVV C,A,B
Execution using
one pipelined
functional unit
A[6] B[6]
A[5] B[5]
A[4] B[4]
A[3] B[3]
C[2]
C[1]
C[0]
24
Vector Instruction Execution
MULVV C,A,B
A[6] B[6] A[24] B[24] A[25] B[25] A[26] B[26] A[27] B[27]
A[5] B[5] A[20] B[20] A[21] B[21] A[22] B[22] A[23] B[23]
A[4] B[4] A[16] B[16] A[17] B[17] A[18] B[18] A[19] B[19]
A[3] B[3] A[12] B[12] A[13] B[13] A[14] B[14] A[15] B[15]
25
Two Lane Vector Microarchitecture
SRF
VLR
X0 VRF
F D R L0 L1
S0 S1
Y0 Y1 Y2 Y3 W
X0
L0 L1
S0 S1
Y0 Y1 Y2 Y3
26
Vector Stripmining 2-Lanes
VLR = 4
LV F D R L0 L1 W
R L0 L1 W
R L0 L1 W
R L0 L1 W
LV F D D R L0 L1 W
R L0 L1 W
R L0 L1 W
R L0 L1 W
MULVV.D F F D D R Y0 Y1 Y2 Y3 W
R Y0 Y1 Y2 Y3 W
R Y0 Y1 Y2 Y3 W
R Y0 Y1 Y2 Y3 W
SV F F D D D D R S0 S1 W
R S0 S1 W
R S0 S1 W
R S0 S1 W
DSLL R2, R1, 3 F F F F D R X W
DADDU RA, RA, R2 F D R X W
DADDU RB, RB, R2 F D R X W
DADDU RC, RC, R2 F D R X W
DSUBU N, N, R1 F D R X W
LI R1, 64 F D R X W
MTC1 VLR, R1 F D R X W
27
BGTZ N, loop F D R X W
Vector Unit Structure
Vector
Registers
Elements 0, Elements 1, Elements 2, Elements 3,
4, 8, … 5, 9, … 6, 10, … 7, 11, …
Memory Subsystem
28
Vector Unit Structure
Vector
Registers
Elements 0, Elements 1, Elements 2, Elements 3,
4, 8, … 5, 9, … 6, 10, … 7, 11, …
Lane
Memory Subsystem
29
Vector Unit Structure
Functional Unit
Vector
Registers
Elements 0, Elements 1, Elements 2, Elements 3,
4, 8, … 5, 9, … 6, 10, … 7, 11, …
Lane
Memory Subsystem
30
T0 Vector Microprocessor (UCB/ICSI, 1995)
Lane
33
Automatic Code Vectorization
for (i=0; i < N; i++)
C[i] = A[i] * B[i];
Scalar Sequential Code
load
Iter. 1 load
mul
store
load
Iter. 2 load
mul
store 35
Automatic Code Vectorization
for (i=0; i < N; i++)
C[i] = A[i] * B[i];
[Figure: the scalar sequential code (load, load, mul, store for iteration 1, then iteration 2)
next to the vectorized code, where each vector instruction performs that operation for all
iterations at once.]
Vectorization is a massive compile-time
reordering of operation sequencing
requires extensive loop dependence analysis
37
Vector Conditional Execution
Problem: Want to vectorize loops with conditional code:
for (i=0; i<N; i++)
if (A[i]>0) then
A[i] = B[i];
38
Masked Vector Instructions
Simple Implementation Density-Time Implementation
– execute all N operations, turn off – scan mask vector and only execute
result writeback according to mask elements with non-zero masks
M[2]=0 C[4]
M[1]=1
M[2]=0 C[2]
M[0]=0
M[1]=1 C[1] C[1]
M[0]=0 C[0]
40
Vector Reductions
Problem: Loop-carried dependence on reduction variables
sum = 0;
for (i=0; i<N; i++)
sum += A[i]; # Loop-carried dependence on sum
Solution: Re-associate operations if possible, use binary tree to perform reduction
# Rearrange as:
sum[0:VL-1] = 0 # Vector of VL partial sums
for(i=0; i<N; i+=VL) # Stripmine VL-sized chunks
sum[0:VL-1] += A[i:i+VL-1]; # Vector sum
# Now have VL partial sums in one vector register
do {
VL = VL/2; # Halve vector length
sum[0:VL-1] += sum[VL:2*VL-1] # Halve no. of partials
} while (VL>1)
41
Vector Scatter/Gather
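As an illustrative sketch (not the slide’s own example), gather and scatter let a vector memory
operation use an index vector instead of a fixed stride:

  for (i=0; i<N; i++)
      A[i] = B[idx[i]];      /* gather: load through an index vector   */
  for (i=0; i<N; i++)
      C[idx[i]] = A[i];      /* scatter: store through an index vector */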
42
Vector Supercomputers
Epitomized by Cray-1, 1976:
• Scalar Unit
– Load/Store Architecture
• Vector Extension
– Vector Registers
– Vector Instructions
• Implementation
– Hardwired Control
– Highly Pipelined Functional
Units
– Interleaved Memory System
– No Data Caches
– No Virtual Memory
Cray 1 at The Deutsches Museum
Image Credit: Clemens Pfeiffer 43
http://en.wikipedia.org/wiki/File:Cray-1-deutsches-museum.jpg
Cray-1 (1976)
[Figure: Cray-1 register and functional-unit organization — eight 64-element vector registers
(V0–V7) with vector mask and vector length registers; eight S scalar registers backed by 64 T
registers; eight A address registers backed by 64 B registers; functional units for FP add,
FP multiply, FP reciprocal, integer add, integer logic, shift, population count, and address
add/multiply; single-ported memory of 16 banks of 64-bit words with 8-bit SECDED; 80 MW/sec
data load/store and 320 MW/sec instruction-buffer refill; four 64-bit x 16 instruction buffers
with NIP/CIP/LIP instruction registers.]
45
SIMD / Multimedia Extensions
64b
32b 32b
8b 8b 8b 8b 8b 8b 8b 8b
• Very short vectors added to existing ISAs for microprocessors
• Use existing 64-bit registers split into 2x32b or 4x16b or 8x8b
– This concept first used on Lincoln Labs TX-2 computer in 1957, with 36b
datapath split into 2x18b or 4x9b
– Newer designs have 128-bit registers (PowerPC Altivec, Intel SSE2/3/4)
or 256-bit registers (Intel AVX)
• Single instruction operates on all elements within register
16b 16b 16b 16b
4x16b adds + + + +
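For a concrete (Intel-specific) example of the partitioned add shown above, the SSE2 intrinsic
below performs eight 16-bit additions with one instruction; this is an illustrative sketch, not
taken from the slides:

  #include <emmintrin.h>               /* SSE2 */

  __m128i add16x8(__m128i a, __m128i b) {
      return _mm_add_epi16(a, b);      /* eight 16-bit adds in one instruction */
  }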
47
Agenda
• Vector Processors
• Single Instruction Multiple Data (SIMD)
Instruction Set Extensions
• Graphics Processing Units (GPU)
48
Graphics Processing Units (GPUs)
• Original GPUs were dedicated fixed-function devices for
generating 3D graphics (mid-late 1990s) including high-
performance floating-point units
– Provide workstation-like graphics for PCs
– User could configure graphics pipeline, but not really program it
• Over time, more programmability added (2001-2005)
– E.g., New language Cg for writing small programs run on each
vertex or each pixel, also Windows DirectX variants
– Massively parallel (millions of vertices or pixels per frame) but
very constrained programming model
• Some users noticed they could do general-purpose
computation by mapping input and output data to images,
and computation to vertex and pixel shading computations
– Incredibly difficult programming model as had to use graphics
pipeline model for general computation
49
General Purpose GPUs (GPGPUs)
• In 2006, Nvidia introduced GeForce 8800 GPU supporting a
new programming language: CUDA
– “Compute Unified Device Architecture”
– Subsequently, broader industry pushing for OpenCL, a vendor-
neutral version of same ideas.
• Idea: Take advantage of GPU computational performance
and memory bandwidth to accelerate some kernels for
general-purpose computing
• Attached processor model: Host CPU issues data-parallel
kernels to GP-GPU for execution
• This lecture has a simplified version of Nvidia CUDA-style
model and only considers GPU execution for computational
kernels, not graphics
50
Simplified CUDA Programming Model
• Computation performed by a very large number of
independent small scalar threads (CUDA threads or
microthreads) grouped into thread blocks.
// C version of DAXPY loop.
void daxpy(int n, double a, double*x, double*y)
{ for (int i=0; i<n; i++)
y[i] = a*x[i] + y[i]; }
// CUDA version.
__host__ // Piece run on host processor.
int nblocks = (n+255)/256; // 256 CUDA threads/block
daxpy<<<nblocks,256>>>(n,2.0,x,y);
__global__ // Piece run on GPGPU (kernel entry point).
void daxpy(int n, double a, double*x, double*y)
{ int i = blockIdx.x*blockDim.x + threadIdx.x;
if (i<n) y[i]=a*x[i]+y[i]; }
51
“Single Instruction, Multiple Thread”
• GPUs use a SIMT model, where individual scalar
instruction streams for each CUDA thread are grouped
together for SIMD execution on hardware (Nvidia
groups 32 CUDA threads into a warp)
52
Hardware Execution Model
[Figure: a CPU attached to a GPU containing several multithreaded SIMD processors, each with
multiple lanes (Lane 0, Lane 1, …), all sharing GPU memory.]
53
Implications of SIMT Model
• All “vector” loads and stores are scatter-gather,
as individual µthreads perform scalar loads and
stores
– GPU adds hardware to dynamically coalesce individual
µthread loads and stores to mimic vector loads and
stores
• Every µthread has to perform stripmining
calculations redundantly (“am I active?”) as there
is no scalar processor equivalent
• If divergent control flow, need predicates
55
GPGPUs are Multithreaded SIMD
57
Image Credit: NVIDIA [Wittenbrink, Kilgariff, and Prabhu, Hot Chips 2010]
Fermi “Streaming
Multiprocessor” Core
59
Copyright © 2013 David Wentzlaff
60
Computer Architecture
ELE 475 / COS 475
Slide Deck 12: Multithreading
David Wentzlaff
Department of Electrical Engineering
Princeton University
1
Agenda
• Multithreading Motivation
• Course Grain Multithreading
• Simultaneous Multithreading
2
Multithreading
• Difficult to continue to extract instruction-level
parallelism (ILP) or data level parallelism (DLP)
from a single sequential thread of control
• Many workloads can make use of thread-level
parallelism (TLP)
– TLP from multiprogramming (run independent sequential jobs)
– TLP from multithreaded applications (run one job faster using
parallel threads)
• Multithreading uses TLP to improve utilization of
a single processor
3
Pipeline Hazards
t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14
LW r1, 0(r2) F D X MW
LW r5, 12(r1) F D D D D X MW
ADDI r5, r5, #12 F F F F D D D D X MW
SW 12(r1), r5 F F F F D D D D
4
Multithreading
How can we guarantee no dependencies between
instructions in a pipeline?
-- One way is to interleave execution of instructions
from different program threads on same pipeline
7
Multithreading
How can we guarantee no dependencies between
instructions in a pipeline?
-- One way is to interleave execution of instructions
from different program threads on same pipeline
Interleave 4 threads, T1-T4, on non-bypassed 5-stage pipe
t0 t1 t2 t3 t4 t5 t6 t7 t8 t9
8
Simple Multithreaded Pipeline
[Figure: a five-stage pipeline with one PC and one GPR file per thread (four of each), a shared
I$, ALU, and D$, and a thread-select signal that chooses which thread fetches and which register
file is read and written each cycle.]
• Have to carry thread select down pipeline to ensure correct state bits
read/written at each pipe stage
• Appears to software (including OS) as multiple, albeit slower, CPUs
10
Multithreading Costs
• Each thread requires its own user state
– PC
– GPRs
• Other overheads:
– Additional cache/TLB conflicts from competing threads
– (or add larger cache/TLB capacity)
– More OS overhead to schedule more threads (where do all these
threads come from?)
11
Thread Scheduling Policies
• Fixed interleave (CDC 6600 PPUs, 1964)
– Each of N threads executes one instruction every N cycles
– If thread not ready to go in its slot, insert pipeline bubble
– Can potentially remove bypassing and interlocking logic
12
Coarse-Grain Hardware Multithreading
• Some architectures do not have many low-
latency bubbles
• Add support for a few threads to hide
occasional cache miss latency
• Swap threads in hardware on cache miss
13
Denelcor HEP
(Burton Smith, 1982)
http://ftp.arl.army.mil/ftp/histori
c-computers/png/hep2.png
15
MTA Pipeline
• Every cycle, one VLIW instruction from one active thread is launched into pipeline
• Instruction pipeline is 21 cycles long
• Memory operations
[Figure: issue pool, instruction fetch, execution pipeline (W, M, A, C), write pool, retry pool,
and memory pool, with the interconnection network feeding a separate memory pipeline.]
16
MIT Alewife (1990)
18
Oracle/Sun Niagara-3, “Rainbow Falls” 2009
20
Ideal Superscalar Multithreading
[Tullsen, Eggers, Levy, UW, 1995]
Issue width
Time
Time
Partially filled cycle,
i.e., IPC < 4
(horizontal waste)
23
Vertical Multithreading
Issue width
Instruction
issue
Time
Partially filled cycle,
i.e., IPC < 4
(horizontal waste)
24
Vertical Multithreading
Issue width
Instruction
issue
Time
Partially filled cycle,
i.e., IPC < 4
(horizontal waste)
Time
26
Chip Multiprocessing (CMP)
Issue width
Time
Time
Time Time
30
Power 4
[POWER 4 system microarchitecture, Tendler et al, IBM J. Res. & Dev., Jan 2002] Image Credit: IBM
Courtesy of International Business Machines, © International Business Machines. 2 commits
Power 5 (architected
register sets)
2 fetch (PC),
2 initial decodes
[POWER 5 system microarchitecture, Sinharoy et al, IBM J. Res. & Dev., Jul/Sept 2005] Image Credit: IBM 31
Courtesy of International Business Machines, © International Business Machines.
Power 5 data flow ...
Image Credit: Carsten Schulz
[POWER 5 system microarchitecture, Sinharoy et al, IBM J. Res. & Dev., Jul/Sept 2005] Image Credit: IBM
Courtesy of International Business Machines, © International Business Machines.
35
Icount Choosing Policy
Fetch from thread with the least instructions in flight.
38
Copyright © 2013 David Wentzlaff
39
Computer Architecture
ELE 475 / COS 475
Slide Deck 13: Parallel Programming
and Small Multiprocessors
David Wentzlaff
Department of Electrical Engineering
Princeton University
1
Trends in Computation Transistors
(Thousands)
Sequential
Performance
(SpecINT)
Frequency
(MHz)
Typical Power
(Watts)
Cores
3
Data collected by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, C. Batten, and D. Wentzlaff
Symmetric Multiprocessors
Processor Processor
CPU-Memory bus
bridge
I/O bus
Memory
I/O controller I/O controller I/O controller
symmetric
• All memory is equally far Graphics
away from all processors output
• Any processor can do any I/O Networks
(set up a DMA transfer)
4
Synchronization
The need for synchronization arises whenever
there are concurrent processes in a system
(even in a uniprocessor system) producer
consumer
Producer-Consumer: A consumer process
must wait until the producer process has
produced data
P1 P2
Mutual Exclusion: Ensure that only one
process uses a resource at a given time
Shared
Resource
5
A Producer-Consumer Example
tail head
Producer Consumer
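A C sketch of the queue in the figure for one producer and one consumer (illustrative; it
assumes sequential consistency, which is exactly the property examined in the next slides — on
a weaker memory model, fences would be needed between writing the data and publishing the
index):

  #define Q 256
  int buf[Q];
  volatile int head = 0, tail = 0;     /* head: next read, tail: next write */

  void produce(int x) {
      while ((tail + 1) % Q == head) ; /* spin while full                   */
      buf[tail] = x;
      tail = (tail + 1) % Q;           /* publish after the data is written */
  }

  int consume(void) {
      while (head == tail) ;           /* spin while empty                  */
      int x = buf[head];
      head = (head + 1) % Q;
      return x;
  }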
7
Sequential Consistency
A Memory Model
P P P P P P
Sequential Consistency =
arbitrary order-preserving interleaving
of memory references of sequential programs
8
Sequential Consistency
Sequential concurrent tasks: T1, T2
Shared variables: X, Y (initially X = 0, Y = 10)
T1: T2:
Store 1, (X) (X = 1) Load R1, (Y)
Store 11, (Y) (Y = 11) Store R1, (Y’) (Y’= Y)
Load R2, (X)
Store R2, (X’) (X’= X)
If Y is 11 then X cannot be 0
9
Sequential Consistency
Sequential consistency imposes more memory ordering
constraints than those imposed by uniprocessor
program dependencies ( )
T1: T2:
Store 1, (X) (X = 1) Load R1, (Y)
Store 11, (Y) (Y = 11) Store (Y’), R1 (Y’= Y)
Load R2, (X)
additional SC requirements Store (X’), R2 (X’= X)
10
Multiple Consumer Example
tail head Rhead R
Producer Consumer
1 Rtail
Rtail Rhead R
Consumer
2 Rtail
12
Implementation of Semaphores
Semaphores (mutual exclusion) can be implemented
using ordinary Load and Store instructions in the
Sequential Consistency memory model. However,
protocols for mutual exclusion are difficult to design...
Simpler solution:
atomic read-modify-write instructions
Examples: m is a memory location, R is a register
13
Multiple Consumers Example
using the Test&Set Instruction
P:    Test&Set (mutex), Rtemp
      if (Rtemp!=0) goto P
      Load Rhead, (head)          \
spin: Load Rtail, (tail)           |
      if Rhead==Rtail goto spin    |  Critical Section
      Load R, (Rhead)              |
      Rhead=Rhead+1                |
      Store Rhead, (head)         /
V:    Store 0, (mutex)
      process(R)
14
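The same P/V pattern written with C11 atomics (a sketch; atomic_flag_test_and_set from
<stdatomic.h> plays the role of the Test&Set instruction):

  #include <stdatomic.h>

  atomic_flag mutex = ATOMIC_FLAG_INIT;

  void enter(void) {                              /* P */
      while (atomic_flag_test_and_set(&mutex))
          ;                                       /* spin until we observe 0 */
  }
  void leave(void) {                              /* V */
      atomic_flag_clear(&mutex);
  }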
Nonblocking Synchronization
Compare&Swap(m), Rt, Rs:
  if (Rt==M[m])
    then M[m]=Rs;
         Rs=Rt;
         status <- success;
    else status <- fail;
(status is an implicit argument)
15
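In C11 the corresponding primitive is atomic_compare_exchange_strong; a sketch of the
nonblocking retry loop it enables (illustrative):

  #include <stdatomic.h>

  void atomic_add(atomic_int *m, int delta) {
      int expected = atomic_load(m);
      /* retry until the swap reports success; on failure, expected is
         reloaded with the value currently stored in *m                 */
      while (!atomic_compare_exchange_strong(m, &expected, expected + delta))
          ;
  }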
Load-link & Store-conditional
aka Load-reserve, Load-Locked
Special register(s) to hold reservation flag and address,
and the outcome of store-conditional
17
Issues in Implementing
Sequential Consistency
P P P P P P
M
Implementation of SC is complicated by two issues
• Caches
Caches can prevent the effect of a store from
being seen by other processors
SC complications motivate architects to consider
weak or relaxed memory models 18
Memory Fences
Instructions to sequentialize memory accesses
Processors with relaxed or weak memory models permit Loads and Stores to
different addresses to be reordered, remove some/all extra dependencies
imposed by SC
• LL, LS, SL, SS
Memory fences are expensive operations – mem instructions wait for all
relevant instructions in-flight to complete (including stores to retire – need
store acks)
However, cost of serialization only when it is required!
19
Using Memory Fences
tail head
Producer Consumer
Process 1 Process 2
... ...
c1=1; c2=1;
L: if c2==1 then go to L L: if c1==1 then go to L
< critical section> < critical section>
c1=0; c2=0;
21
Mutual Exclusion: second attempt
To avoid deadlock, let a process give up the reservation
(i.e. Process 1 sets c1 to 0) while waiting.
Process 1 Process 2
... ...
L: c1=1; L: c2=1;
if c2==1 then if c1==1 then
{ c1=0; go to L} { c2=0; go to L}
< critical section> < critical section>
c1=0 c2=0
22
A Protocol for Mutual Exclusion
T. Dekker, 1966
Process 1 Process 2
... ...
c1=1; c2=1;
turn = 1; turn = 2;
L: if c2==1 && turn==1 L: if c1==1 && turn==2
then go to L then go to L
< critical section> < critical section>
c1=0; c2=0;
23
N-process Mutual Exclusion
Lamport’s Bakery Algorithm
Process i
Initially num[j] = 0, for all j
Entry Code
choosing[i] = 1;
num[i] = max(num[0], …, num[N-1]) + 1;
choosing[i] = 0;
for(j = 0; j < N; j++) {
while( choosing[j] );
while( num[j] &&
( ( num[j] < num[i] ) ||
( num[j] == num[i] && j < i ) ) );
}
Exit Code
num[i] = 0;
24
Symmetric Multiprocessors
Processor Processor
CPU-Memory bus
bridge
I/O bus
Memory
I/O controller I/O controller I/O controller
symmetric
• All memory is equally far Graphics
away from all processors output
• Any processor can do any I/O Networks
(set up a DMA transfer)
25
Multidrop Memory Bus
Arbitration
Control
Address
Data
Clock
Main
Processor 1 Processor 2
Memory
26
Pipelined Memory Bus
Arbitration
Control
Address
Data
Clock
Main
Processor 1 Processor 2
Memory
27
Pipelined Memory Bus
P1
LD
0x1234abcd
0xDA7E0000
Arbitration
Control
Address
Data
Clock
Main 28
Processor 1 Processor 2
Memory Coherence in SMPs
[Figure: CPU-1 and CPU-2 with private caches on the CPU-Memory bus; a table of the cached and
memory values of X, Y, X’, Y’ after T1 executes and after T2 executes shows how stale copies can
be observed when the caches are not kept coherent.]
[Figure: processor cache, DMA engine, and disk attached to the memory bus.]
Page transfers occur while the Processor is running.
Either Cache or DMA can be the Bus Master and effect transfers.
(DMA stands for “Direct Memory Access”: the I/O device can read/write memory autonomously
from the CPU.)
33
Problems with Parallel I/O
Cached portions
of page Physical
Memory Memory
Bus
Proc.
Cache
DMA transfers
DMA
DISK
Memory Disk: Physical memory may be
stale if cache copy is dirty
A A
Tags and Snoopy read port
State attached to Memory
Proc. R/W R/W
Bus
Data
D (lines)
Cache
35
Shared Memory Multiprocessor
Memory
Bus
Snoopy
P1 Cache Physical
Memory
Snoopy
P2 Cache
37
Write Update (Broadcast) Protocols
write miss:
Broadcast on bus, other processors update
copies (in place)
read miss:
Memory is always up to date
38
Write Invalidate Protocols
write miss:
the address is invalidated in all other
caches before the write is performed
read miss:
if a dirty copy is found in some cache, a write-
back is performed before the memory is read
39
Cache State Transition Diagram
The MSI protocol
[Figure: MSI state machines for the cache line in processor P1 and in processor P2.]
For the copy in P1’s cache:
– M: self-loop on P1 reads or writes; P2 reads -> S (P1 writes back); P2 writes -> I (P1 writes back)
– S: self-loop on P1 reads; P1 writes -> M (P1 issues intent to write); P2 intent to write -> I
– I: P1 read miss -> S; P1 write miss -> M
The state machine for the copy in P2’s cache is symmetric.
41
Observation
[Figure: the MSI state machine drawn generically — the transitions are labeled “other processor
reads/writes” and “read by any processor” rather than naming P1 and P2.]
46
False Sharing
state blk addr data0 data1 ... dataN
47
Acknowledgements
• These slides contain material developed and copyright by:
– Arvind (MIT)
– Krste Asanovic (MIT/UCB)
– Joel Emer (Intel/MIT)
– James Hoe (CMU)
– John Kubiatowicz (UCB)
– David Patterson (UCB)
– Christopher Batten (Cornell)
48
Blackboard Example: Sequential
Consistency
Valid Not Valid
P1 P2 1 1 5 5
1 5 2 2 6 1
2 6 5 3 7 3
3 7 3 4 1 2
4 8 6 5 2 4
7 6 3 6
8 7 4 7
4 8 8 8
49
Analysis of Dekker’s Algorithm
... Process 1 ... Process 2
c1=1; c2=1;
Scenario 1
turn = 1; turn = 2;
L: if c2=1 & turn=1 L: if c1=1 & turn=2
then go to L then go to L
< critical section> < critical section>
c1=0; c2=0;
turn = 1; turn = 2;
L: if c2=1 & turn=1 L: if c1=1 & turn=2
then go to L then go to L
< critical section> < critical section>
c1=0; c2=0;
50
Computer Architecture
ELE 475 / COS 475
Slide Deck 14: Interconnection
Networks
David Wentzlaff
Department of Electrical Engineering
Princeton University
1
Overview of Interconnection
Networks: Buses
4
Overview of Interconnection
Networks: Point-to-point / Switched
SW SW SW SW
SW SW SW SW
SW SW SW SW
5
Overview of Interconnection
Networks: Point-to-point / Switched
SW SW SW SW
SW SW SW SW
SW SW SW SW
6
Explicit Message Passing
(Programming)
• Send(Destination, *Data)
• Receive(&Data)
• Receive(Source, &Data)
• Unicast (one-to-one)
• Multicast (one-to-multiple)
• Broadcast (one-to-all)
7
Message Passing Interface (MPI)
#include <stdio.h>
#include <assert.h>
#include <stdlib.h>
#include <mpi.h>
int main (int argc, char **argv) {
int myid, numprocs, x, y;
int tag = 475;
MPI_Status status;
MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
MPI_Comm_rank(MPI_COMM_WORLD,&myid);
assert(numprocs == 2);
if(myid==0) {
x = 475;
MPI_Send(&x, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);
MPI_Recv(&y, 1, MPI_INT, 1, tag, MPI_COMM_WORLD, &status);
printf(“received number: ELE %d A\n”, y);
}
else {
MPI_Recv(&y, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &status);
y += 105;
MPI_Send(&y, 1, MPI_INT, 0, tag, MPI_COMM_WORLD);
}
MPI_Finalize();
exit(0);
}
8
Message Passing vs. Shared Memory
• Message Passing
– Memory is private
– Explicit send/receive to communicate
– Message contains data and synchronization
– Need to know Destination on generation of data (send)
– Easy for Producer-Consumer
• Shared Memory
– Memory is shared
– Implicit communication via loads and stores
– Implicit synchronization needed via Fences, Locks, and Flags
– No need to know Destination on generation of date (can store in
memory and user of data can pick up later)
– Easy for multiple threads accessing a shared table
– Needs Locks and critical sections to synchronize access
9
Shared Memory Tunneled over
Messaging
• Software
– Turn loads and stores into sends and receives
• Hardware
– Replace bus communications with messages sent
between cores and between cores and memory
SW SW SW SW
10
Shared Memory Tunneled over
Messaging
• Software
– Turn loads and stores into sends and receives
• Hardware
– Replace bus communications with messages sent
between cores and between cores and memory
SW SW SW SW
11
Messaging Tunneled over Shared
Memory
• Use software queues (FIFOs) with locks to
transmit data directly between cores by loads
and stores to memory
tail head
Producer Consumer
12
Interconnect Design
• Switching
• Topology
• Routing
• Flow Control
13
Anatomy of a Message
15
Topology
[Figures: a sequence of example network topologies, one per slide (16–23).]
23
Topology Parameters
• Routing Distance: Number of links between
two points
• Diameter: Maximum routing distance
between any two points
• Average Distance
• Minimum Bisection Bandwidth (Bisection
Bandwidth): The bandwidth of a minimal cut
though the network such that the network is
divided into two sets of nodes
• Degree of a Router
24
Topology Parameters
Diameter: 2√𝑁 - 2
Bisection Bandwidth: 2√𝑁
Degree of a Router: 5
25
Topology Influenced by Packaging
• Wiring grows as
N-1
• Physically hard to
pack into 3-space
(pack in sphere?)
26
Topology Influenced by Packaging
• Packing N dimensions in N-1
space leads to long wires
• Packing N dimensions in N-2
space leads to really long wires
27
Network Performance
• Bandwidth: The rate of data that can be transmitted
over the network (network link) in a given time
• Latency: The time taken for a message to be sent from
sender to receiver
28
Latency
29
Anatomy of Message Latency
T = Thead + L/b
Thead: Head Phit Latency, includes tC , tR , hop
count, and contention
Unloaded Latency:
T0 = HR * tR + HC * tC + L/b
30
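Illustrative numbers (not from the slides): with HR = 3 routers at tR = 2 cycles each, HC = 4
channel traversals at tC = 1 cycle, and a 64-byte message on links of b = 16 bytes/cycle,
T0 = 3x2 + 4x1 + 64/16 = 14 cycles.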
Routing
• Oblivious (routing path independent of state
of network)
– Deterministic
– Non-Deterministic
• Adaptive (routing path depends on state of
network)
33
Flow Control
• Local (Link or hop based) Flow Control
• End-to-end (Long distance)
34
Deadlock
• Deadlock can occur if cycle possible in “Waits-
for” graph
35
Deadlock Example (Waits-for and
Holds analysis)
36
Deadlock Avoidance vs. Deadlock
Recovery
• Deadlock Avoidance
– Protocol designed to never deadlock
• Deadlock Recovery
– Allow Deadlock to occur and then resolve
deadlock usually through use of more buffering
37
Acknowledgements
• These slides contain material developed and copyright by:
– Arvind (MIT)
– Krste Asanovic (MIT/UCB)
– Joel Emer (Intel/MIT)
– James Hoe (CMU)
– John Kubiatowicz (UCB)
– David Patterson (UCB)
– Christopher Batten (Cornell)
38
39
40
41
42
Computer Architecture
ELE 475 / COS 475
Slide Deck 15: Directory Cache
Coherence
David Wentzlaff
Department of Electrical Engineering
Princeton University
1
Coherency Misses
1. True sharing misses arise from the communication of
data through the cache coherence mechanism
• Invalidates due to 1st write to shared block
• Reads by another CPU of modified block in different cache
• Miss would still occur if block size were 1 word
2. False sharing misses when a block is invalidated
because some word in the block, other than the one
being read, is written into
• Invalidation does not cause a new value to be
communicated, but only causes an extra cache miss
• Block is shared, but no word in block is actually shared
miss would not occur if block size were 1 word
2
Example: True v. False Sharing v.
Hit?
• Assume x1 and x2 in same cache block.
P1 and P2 both read x1 and x2 before.
3
MP Performance 4 Processor
Commercial Workload: OLTP, Decision Support (Database),
Search Engine
[Plot: miss contribution versus cache size (1 MB – 8 MB), broken into Instruction,
Capacity/Conflict, Compulsory, False Sharing, and True Sharing components.]
• True sharing and false sharing misses are largely unaffected by cache size
• Uniprocessor cache misses (Instruction, Capacity/Conflict, Compulsory)
improve with cache size increase
4
MP Performance 2MB Cache
Commercial Workload: OLTP, Decision Support
(Database), Search Engine
[Plot: miss contribution versus processor count (1–8), broken into Instruction,
Capacity/Conflict, Compulsory, False Sharing, and True Sharing components.]
• True sharing and false sharing misses grow as the processor count increases
5
Directory Coherence Motivation
• Snoopy protocols require every cache miss to
broadcast
– Requires large bus bandwidth, O(N)
– Requires large cache snooping bandwidth, O(N^2)
aggregate
• Directory protocols enable further scaling
– Directory can track all caches holding a memory block
and use point-to-point messages to maintain
coherence
– Communication done via scalable point-to-point
interconnect
6
Directory Cache Coherence
[Figure: several CPUs (with their caches) connected by an interconnection network instead of a
shared bus.]
7
Distributed Shared Memory
[Figure: each node contains a CPU, a cache, a DRAM bank, a directory, and I/O, and all nodes are
connected by the interconnection network; memory and directory state are distributed across the
nodes.]
10
Address to Home Directory
High Order Bits Determine Home (Directory)
Physical Address (bit 32 … bit 0): Home Node | Index | Offset
12
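A small C sketch of that address split (the field widths are illustrative, not a real machine’s):

  #define OFFSET_BITS 6        /* e.g., 64-byte cache blocks */
  #define INDEX_BITS  14

  static inline int home_node(unsigned long paddr) {
      return (int)(paddr >> (OFFSET_BITS + INDEX_BITS));   /* high-order bits */
  }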
Cache State Transition Diagram
For Directory Coherence
M: Modified
S: Shared
I: Invalid
[Figure: MSI state machine for the cache line in processor P1, with the directory messages on
each transition.]
– I -> M: write miss; P1 sends a Write Miss message and waits for the reply before transitioning
– I -> S: read miss; P1 sends a Read Miss message and waits for the reply before transitioning
– M -> S: receive a Read Miss message; P1 writes the block back
– M -> I: receive an Invalidate message; P1 writes back, reply after
– S -> I: receive an Invalidate message; reply
– M: self-loop on P1 reads or writes; S: self-loop on P1 reads
– Writeback of an evicted line notifies the directory
15
Directory State Transition Diagram
U: Uncached
S: Shared
E: Exclusive
[Figure: state of a cache line as tracked in the directory.]
– U or S, Read Miss from P: Data Value Reply; Sharers = Sharers + {P} ({P} when coming from U)
– U or S, Write Miss from P: Data Value Reply; Sharers = {P}; go to E
– E, Read Miss from P: Fetch from the E node, Data Value Reply; Sharers = {P}; go to S
– E, Write Miss from P: Fetch/Invalidate from the E node, Data Value Reply; Sharers = {P}
– E, Data Write-Back from P: Sharers = {}; go to U
16
Message Types
From Hennessy and Patterson Ed. 5 Image Copyright © 2011, Elsevier Inc. All rights Reserved. 17
Multiple Logical Communication
Channels Needed
• Responses queued behind requests can lead
to deadlock
• Many different message types, need to
determine which message type can create
more messages
• Segregate flows onto different logical/physical
channels
18
Memory Ordering Point
• Just like in bus based snooping protocol, need to
guarantee that state transitions are atomic
• Directory used as ordering point
– Whichever message reaches home directory first wins
– Other requests on same cache line given negative
acknowledgement (NACK)
• NACK causes retry from other node
• Forward progress guarantee needed
– After node acquires line, need to commit at least one
memory operation before transitioning invalidating
line
19
Scalability of Directory Sharer List
Storage
• Full-Map Directory (Bit per cache)
20
Beyond Simple Directory Coherence
• On-chip coherence (Leverage fast on-chip
communications to speed up or simplify
protocol)
• Cache Only Memory Architectures (COMA)
• Large scale directory systems (Scalability of
directory messages and sharer list storage)
21
SGI UV 1000 (Origin Descendant)
Maximum Memory: 16TB
Maximum Processors: 256
Maximum Cores: 2560
Topology: 2D Torus
22
Image Credit: SGI
TILE64Pro
• 64 Cores
• 2D Mesh
• 4 Memory Controllers
• 3 Memory Networks
  – Divide different flows of traffic
• Each node can be a home
[Die diagram: the 8x8 tile array surrounded by four DDR2 controllers, PCIe, 10 Gig Ethernet
(XAUI), Gigabit Ethernet, general-purpose I/O, and SerDes interfaces.]
23
Beyond ELE 475
• Computer Architecture Research
– International Symposium on Computer Architecture (ISCA)
– International Symposium on Microarchitecture (MICRO)
– Architectural Support for Programing Languages and
Operating Systems (ASPLOS)
– International Symposium on High Performance Computer
Architecture (HPCA)
• Build some chips / FPGA
• Parallel Computer Architecture
• ELE 580A Parallel Computation (Princeton Only)
– Graduate Level, Using Primary Sources
24