This document provides an introduction to computer architecture. It defines computer architecture as the design of abstraction and implementation layers that allow efficient information processing using available manufacturing technologies. Computer architecture balances application requirements with technological constraints and provides feedback to guide development. Major generations include vacuum tubes, transistors, integrated circuits, and the ongoing transition to parallelism. The course covers instruction-level parallelism, pipelining, memory hierarchies, and threading across single- and multi-core processors. It highlights the IBM 360 as an influential early general-purpose register machine whose instruction set architecture was designed for compatibility across implementations.

Computer Architecture

ELE 475 / COS 475


Slide Deck 1: Introduction and
Instruction Set Architectures
David Wentzlaff
Department of Electrical Engineering
Princeton University

1
What is Computer Architecture?

In its broadest definition, computer architecture is the design of the abstraction/implementation layers that allow us to execute information processing applications efficiently using manufacturing technologies.

[Figure: stack of layers from Application down to Physics — the gap is too large to bridge in one step]

6
Computer Architecture is Constantly Changing

Abstraction stack: Application, Algorithm, Programming Language, Operating System/Virtual Machines, Instruction Set Architecture, Microarchitecture, Register-Transfer Level, Gates, Circuits, Devices, Physics

Application Requirements:
• Suggest how to improve architecture
• Provide revenue to fund development

Technology Constraints:
• Restrict what can be done efficiently
• New technologies make new architectures possible

Architecture provides feedback to guide application and technology research directions
10
Computers Then…

IAS Machine. Design directed by John von Neumann.


First booted in Princeton NJ in 1952
11
Smithsonian Institution Archives (Smithsonian Image 95-06151)
Major Technology Generations

[Plot, from Kurzweil: computing technology generations over time — Electromechanical, Relays, Vacuum Tubes, and the transistor eras: pMOS, nMOS, Bipolar, CMOS]
13
Sequential Processor Performance

[Plot: single-processor performance over time — the RISC era of rapid growth, followed by the recent move to multi-processors]
16
From Hennessy and Patterson Ed. 5 Image Copyright © 2011, Elsevier Inc. All rights Reserved.
Course Structure
• Recommended Readings
• In-Lecture Questions
• Problem Sets
– Very useful for exam preparation
– Peer Evaluation
• Midterm
• Final Exam

17
Course Content: Computer Organization (ELE 375)
• Basic Pipelined Processor

~50,000 Transistors

18
Photo of Berkeley RISC I, © University of California (Berkeley)
Course Content: Computer Architecture (ELE 475)
• Instruction Level Parallelism
  – Superscalar
  – Very Long Instruction Word (VLIW)
• Long Pipelines (Pipeline Parallelism)
• Advanced Memory and Caches
• Data Level Parallelism
  – Vector
  – GPU
• Thread Level Parallelism
  – Multithreading
  – Multiprocessor
  – Multicore
  – Manycore
~700,000,000 Transistors
Intel Nehalem Processor, Original Core i7, Image Credit Intel: 22
http://download.intel.com/pressroom/kits/corei7/images/Nehalem_Die_Shot_3.jpg
Architecture vs. Microarchitecture
“Architecture”/Instruction Set Architecture:
• Programmer visible state (Memory & Register)
• Operations (Instructions and how they work)
• Execution Semantics (interrupts)
• Input/Output
• Data Types/Sizes
Microarchitecture/Organization:
• Tradeoffs on how to implement ISA for some metric
(Speed, Energy, Cost)
• Examples: Pipeline depth, number of pipelines, cache
size, silicon area, peak power, execution ordering, bus
widths, ALU widths
23
Software Developments

up to 1955: Libraries of numerical routines
  - Floating point operations
  - Transcendental functions
  - Matrix manipulation, equation solvers, ...

1955-60: High-level languages - Fortran 1956
         Operating systems
  - Assemblers, Loaders, Linkers, Compilers
  - Accounting programs to keep track of usage and charges

Machines required experienced operators
• Most users could not be expected to understand these programs, much less write them
• Machines had to be sold with a lot of resident software
25
Compatibility Problem at IBM
By the early 1960s, IBM had 4 incompatible lines of computers!
  701 → 7094
  650 → 7074
  702 → 7080
  1401 → 7010
Each system had its own
• Instruction set
• I/O system and secondary storage: magnetic tapes, drums and disks
• Assemblers, compilers, libraries, ...
• Market niche: business, scientific, real time, ...
⇒ IBM 360
28
IBM 360 : Design Premises
Amdahl, Blaauw and Brooks, 1964
• The design must lend itself to growth and successor
machines
• General method for connecting I/O devices
• Total performance: answers per month rather than bits per microsecond ⇒ programming aids
• Machine must be capable of supervising itself without
manual intervention
• Built-in hardware fault checking and locating aids to reduce
down time
• Simple to assemble systems with redundant I/O devices,
memories etc. for fault tolerance
• Some problems required floating-point larger than 36 bits

29
IBM 360: A General-Purpose Register
(GPR) Machine
• Processor State
– 16 General-Purpose 32-bit Registers
• may be used as index and base register
• Register 0 has some special properties
– 4 Floating Point 64-bit Registers
– A Program Status Word (PSW)
• PC, Condition codes, Control flags
• A 32-bit machine with 24-bit addresses
– But no instruction contains a 24-bit address!
• Data Formats
– 8-bit bytes, 16-bit half-words, 32-bit words, 64-bit double-words

The IBM 360 is why bytes are 8-bits long today!


31
IBM 360: Initial Implementations
                Model 30            ...   Model 70
Storage         8K - 64 KB                256K - 512 KB
Datapath        8-bit                     64-bit
Circuit Delay   30 nsec/level             5 nsec/level
Local Store     Main Store                Transistor Registers
Control Store   Read-only, 1 μsec         Conventional circuits

IBM 360 instruction set architecture (ISA) completely


hid the underlying technological differences between
various models.
Milestone: The first true ISA designed as portable
hardware-software interface!
With minor modifications it still survives today!
33
IBM 360: 47 years later…
The zSeries z11 Microprocessor
• 5.2 GHz in IBM 45nm PD-SOI CMOS technology
• 1.4 billion transistors in 512 mm2
• 64-bit virtual addressing
– original S/360 was 24-bit, and S/370 was 31-bit extension
• Quad-core design
• Three-issue out-of-order superscalar pipeline
• Out-of-order memory accesses
• Redundant datapaths
– every instruction performed in two parallel datapaths and
results compared
• 64KB L1 I-cache, 128KB L1 D-cache on-chip
• 1.5MB private L2 unified cache per core, on-chip
• 24MB eDRAM L3 cache on-chip
• Scales to 96-core multiprocessor with 768MB of shared L4 eDRAM
[IBM, Kevin Shum, HotChips, 2010]
Image Credit: IBM — Courtesy of International Business Machines Corporation, © International Business Machines Corporation.
34
Same Architecture
Different Microarchitecture
                   AMD Phenom X4                 Intel Atom
Instruction Set    x86                           x86
Cores              Quad core                     Single core
Power              125W                          2W
Decode             3 instructions/cycle/core     2 instructions/cycle/core
L1 Caches          64KB I, 64KB D                32KB I, 24KB D
L2 Cache           512KB                         512KB
Execution          Out-of-order                  In-order
Clock              2.6GHz                        1.6GHz

Image Credit: Intel

35
Image Credit: AMD
Different Architecture
Different Microarchitecture
                   AMD Phenom X4                 IBM POWER7
Instruction Set    x86                           Power
Cores              Quad core                     Eight core
Power              125W                          200W
Decode             3 instructions/cycle/core     6 instructions/cycle/core
L1 Caches          64KB I, 64KB D                32KB I, 32KB D
L2 Cache           512KB                         256KB
Execution          Out-of-order                  Out-of-order
Clock              2.6GHz                        4.25GHz

Image Credit: IBM


Courtesy of International Business Machines 36
Image Credit: AMD Corporation, © International Business Machines Corporation.
Where Do Operands Come From and Where Do Results Go?

[Figure: four machine models — Stack, Accumulator, Register-Memory, Register-Register — each a processor with an ALU (and TOS or registers) above memory]

Number of explicitly named operands:   Stack: 0    Accumulator: 1    Register-Memory: 2 or 3    Register-Register: 2 or 3
46
Stack-Based Instruction Set Architecture (ISA)

[Figure: processor with ALU and top-of-stack (TOS) register above memory]

• Burroughs B5000 (1960)
• Burroughs B6700
• HP 3000
• ICL 2900
• Symbolics 3600

Modern
• Inmos Transputer
• Forth machines
• Java Virtual Machine
• Intel x87 Floating Point Unit
47
Evaluation of Expressions
(a + b * c) / (a + d * c - e)

[Figure: expression tree for the formula, and the evaluation stack built while computing it]

Reverse Polish: a b c * + a d c * + e - /
61
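The reverse-Polish form is exactly what a stack machine executes: operands are pushed, operators pop two entries and push the result. A minimal sketch in Python (the variable values in env are made-up inputs, not from the slides):

# Evaluate the reverse-Polish form of (a + b*c) / (a + d*c - e) with an
# explicit stack, mirroring what a stack-machine ISA does in hardware.
def eval_rpn(tokens, env):
    stack = []
    ops = {'+': lambda x, y: x + y, '-': lambda x, y: x - y,
           '*': lambda x, y: x * y, '/': lambda x, y: x / y}
    for t in tokens:
        if t in ops:                      # operator: pop two, push result
            y = stack.pop()
            x = stack.pop()
            stack.append(ops[t](x, y))
        else:                             # operand: "push" its value
            stack.append(env[t])
    return stack.pop()

env = {'a': 2, 'b': 3, 'c': 4, 'd': 5, 'e': 6}   # example values
rpn = list("abc*+adc*+e-/")                      # a b c * + a d c * + e - /
print(eval_rpn(rpn, env))                        # (2+3*4)/(2+5*4-6) = 0.875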
Hardware organization of the stack
• Stack is part of the processor state
  ⇒ stack must be bounded and small ≈ number of registers, not the size of main memory
• Conceptually the stack is unbounded
  ⇒ a part of the stack is included in the processor state; the rest is kept in main memory
62
Stack Operations and
Implicit Memory References
• Suppose the top 2 elements of the stack are kept
in registers and the rest is kept in the memory.
Each push operation  1 memory reference
pop operation  1 memory reference
No Good!
• Better performance by keeping the top N
elements in registers, and memory references are
made only when register stack overflows or
underflows.
Issue - when to Load/Unload registers ?
65
Stack Size and Expression Evaluation
a b c * + a d c * + e - /
program     stack (size = 4)
push a      R0
push b      R0 R1
push c      R0 R1 R2
*           R0 R1
+           R0
push a      R0 R1
push d      R0 R1 R2
push c      R0 R1 R2 R3
*           R0 R1 R2
+           R0 R1
push e      R0 R1 R2
-           R0 R1
/           R0

a and c are "loaded" twice ⇒ not the best use of registers!
69
Machine Model Summary
C = A + B

Stack       Accumulator     Register-Memory      Register-Register
Push A      Load A          Load R1, A           Load R1, A
Push B      Add B           Add R3, R1, B        Load R2, B
Add         Store C         Store R3, C          Add R3, R1, R2
Pop C                                            Store R3, C
72
Classes of Instructions
• Data Transfer
– LD, ST, MFC1, MTC1, MFC0, MTC0
• ALU
– ADD, SUB, AND, OR, XOR, MUL, DIV, SLT, LUI
• Control Flow
– BEQZ, JR, JAL, TRAP, ERET
• Floating Point
– ADD.D, SUB.S, MUL.D, C.LT.D, CVT.S.W,
• Multimedia (SIMD)
– ADD.PS, SUB.PS, MUL.PS, C.LT.PS
• String
– REP MOVSB (x86)
73
Addressing Modes: How to Get Operands from Memory

Addressing Mode     Instruction              Function
Register            Add R4, R3, R2           Regs[R4] <- Regs[R3] + Regs[R2] **
Immediate           Add R4, R3, #5           Regs[R4] <- Regs[R3] + 5 **
Displacement        Add R4, R3, 100(R1)      Regs[R4] <- Regs[R3] + Mem[100 + Regs[R1]]
Register Indirect   Add R4, R3, (R1)         Regs[R4] <- Regs[R3] + Mem[Regs[R1]]
Absolute            Add R4, R3, (0x475)      Regs[R4] <- Regs[R3] + Mem[0x475]
Memory Indirect     Add R4, R3, @(R1)        Regs[R4] <- Regs[R3] + Mem[Mem[Regs[R1]]]
PC relative         Add R4, R3, 100(PC)      Regs[R4] <- Regs[R3] + Mem[100 + PC]
Scaled              Add R4, R3, 100(R1)[R5]  Regs[R4] <- Regs[R3] + Mem[100 + Regs[R1] + Regs[R5] * 4]

** May not actually access memory!
74
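The Function column can be read as executable semantics. A sketch of a few of the modes in Python, with registers and memory modeled as dictionaries (the register numbers, memory contents, and the scale factor of 4 are illustrative, not from the slides):

# Illustrative semantics for three addressing modes from the table above.
Regs = {1: 8, 2: 5, 3: 10, 4: 0, 5: 2}          # toy register file
Mem  = {108: 7, 8: 100, 0x475: 11, 116: 9}      # toy memory contents

def displacement(rd, rs, disp, rbase):          # Add R4, R3, 100(R1)
    Regs[rd] = Regs[rs] + Mem[disp + Regs[rbase]]

def register_indirect(rd, rs, rbase):           # Add R4, R3, (R1)
    Regs[rd] = Regs[rs] + Mem[Regs[rbase]]

def scaled(rd, rs, disp, rbase, ridx):           # Add R4, R3, 100(R1)[R5]
    Regs[rd] = Regs[rs] + Mem[disp + Regs[rbase] + Regs[ridx] * 4]

displacement(4, 3, 100, 1)    # Regs[4] = 10 + Mem[100 + 8] = 17
print(Regs[4])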
Data Types and Sizes
• Types
– Binary Integer
– Binary Coded Decimal (BCD)
– Floating Point
• IEEE 754
• Cray Floating Point
• Intel Extended Precision (80-bit)
– Packed Vector Data
– Addresses
• Width
– Binary Integer (8-bit, 16-bit, 32-bit, 64-bit)
– Floating Point (32-bit, 40-bit, 64-bit, 80-bit)
– Addresses (16-bit, 24-bit, 32-bit, 48-bit, 64-bit)
75
ISA Encoding
Fixed Width: Every Instruction has same width
• Easy to decode
(RISC Architectures: MIPS, PowerPC, SPARC, ARM…)
Ex: MIPS, every instruction 4-bytes
Variable Length: Instructions can vary in width
• Takes less space in memory and caches
(CISC Architectures: IBM 360, x86, Motorola 68k, VAX…)
Ex: x86, instructions 1-byte up to 17-bytes
Mostly Fixed or Compressed:
• Ex: MIPS16, Thumb (only two formats, 2 and 4 bytes)
• PowerPC and some VLIWs (store instructions compressed, decompress into the instruction cache)
(Very) Long Instruction Word:
• Multiple instructions in a fixed-width bundle
• Ex: Multiflow, HP/ST Lx, TI C6000
77
x86 (IA-32) Instruction Encoding

Fields, in order:
• Instruction Prefixes: up to four prefixes (1 byte each)
• Opcode: 1, 2, or 3 bytes
• ModR/M: 1 byte (if needed)
• Scale, Index, Base (SIB): 1 byte (if needed)
• Displacement: 0, 1, 2, or 4 bytes
• Immediate: 0, 1, 2, or 4 bytes

x86 and x86-64 instruction formats: possible instructions are 1 to 18 bytes long
78
MIPS64 Instruction Encoding

79
Image Copyright © 2011, Elsevier Inc. All rights Reserved.
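Because every MIPS instruction is exactly 4 bytes, decoding the fields is a handful of shifts and masks. A minimal sketch for the R-type format (the encoded word below is an illustrative MIPS ADD $3, $1, $2, not taken from the slide):

# Decode the fields of a fixed-width (4-byte) MIPS R-type instruction.
def decode_rtype(word):
    return {
        'opcode': (word >> 26) & 0x3F,   # bits 31..26
        'rs':     (word >> 21) & 0x1F,   # bits 25..21
        'rt':     (word >> 16) & 0x1F,   # bits 20..16
        'rd':     (word >> 11) & 0x1F,   # bits 15..11
        'shamt':  (word >> 6)  & 0x1F,   # bits 10..6
        'funct':  word & 0x3F,           # bits 5..0
    }

# 0x00221820 encodes ADD $3, $1, $2 (opcode 0, funct 0x20)
print(decode_rtype(0x00221820))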
Real World Instruction Sets
Arch       Type      # Oper   # Mem   Data Size       # Regs   Addr Size   Use
Alpha      Reg-Reg   3        0       64-bit          32       64-bit      Workstation
ARM        Reg-Reg   3        0       32/64-bit       16       32/64-bit   Cell Phones, Embedded
MIPS       Reg-Reg   3        0       32/64-bit       32       32/64-bit   Workstation, Embedded
SPARC      Reg-Reg   3        0       32/64-bit       24-32    32/64-bit   Workstation
TI C6000   Reg-Reg   3        0       32-bit          32       32-bit      DSP
IBM 360    Reg-Mem   2        1       32-bit          16       24/31/64    Mainframe
x86        Reg-Mem   2        1       8/16/32/64-bit  4/8/24   16/32/64    Personal Computers
VAX        Mem-Mem   3        3       32-bit          16       32-bit      Minicomputer
Mot. 6800  Accum.    1        1/2     8-bit           0        16-bit      Microcontroller
80
Why the Diversity in ISAs?
Technology Influenced ISA
• Storage is expensive, tight encoding important
• Reduced Instruction Set Computer
– Remove instructions until whole computer fits on die
• Multicore/Manycore
– Transistors not turning into sequential performance
Application Influenced ISA
• Instructions for Applications
– DSP instructions
• Compiler Technology has improved
– SPARC Register Windows no longer needed
– Compiler can register allocate effectively
81
Recap
• ISA vs Microarchitecture
• ISA Characteristics
  – Machine Models
  – Encoding
  – Data Types
  – Instructions
  – Addressing Modes

[Figure: abstraction stack — Application, Algorithm, Programming Language, Operating System/Virtual Machines, Instruction Set Architecture, Microarchitecture, Register-Transfer Level, Gates, Circuits, Devices, Physics]

83
Computer Architecture Lecture 1

Next Class: Microcode and Review of Pipelining

84
Computer Architecture
ELE 475 / COS 475
Slide Deck 2: Microcode and
Pipelining Review
David Wentzlaff
Department of Electrical Engineering
Princeton University

1
Agenda
• Microcoded Microarchitectures
• Pipeline Review
– Pipelining Basics
– Structural Hazards
– Data Hazards
– Control Hazards

2
What Happens When the Processor is
Too Large?
• Time Multiplex Resources!

5
Microcontrol Unit (Maurice Wilkes, 1954)
First used in EDSAC-2, completed 1958
Embed the control logic state table in a memory array

[Figure: the opcode and a conditional flip-flop feed a decoder; Matrix A drives the control lines to the ALU, MUXes, and registers; Matrix B produces the next state address]
6
Microcoded Microarchitecture

[Figure: a controller (ROM) holds fixed microcode instructions; it receives the opcode and busy?/zero? signals from the datapath and drives the datapath's control signals. Main memory (RAM) holds the user program written in macrocode instructions (e.g., x86, MIPS, etc.) and is accessed via Data/Addr with enMem and MemWrt controls]
7
A Bus-based Datapath for RISC

[Figure: a single 32-bit bus connects the PC, IR, a 32-entry GPR file (RegSel chooses rd/rs1/rs2), an immediate extender, ALU input registers A and B, the memory address register MA, and memory. Control signals: ldIR, OpSel, ldA, ldB, ldMA, RegSel, RegWrt, enReg, ImmSel, enImm, enALU, MemWrt, enMem; Opcode, bcompare?, and busy feed the microcode controller]

Microinstruction: register to register transfer (17 control signals)
8
Agenda
• Microcoded Microarchitectures
• Pipeline Review
– Pipelining Basics
– Structural Hazards
– Data Hazards
– Control Hazards

9
An Ideal Pipeline
stage stage stage stage
1 2 3 4

• All objects go through the same stages


• No sharing of resources between any two stages
• Propagation delay through all pipeline stages is equal
• Scheduling of a transaction entering the pipeline is not
affected by the transactions in other stages
• These conditions generally hold for industry assembly
lines, but instructions depend on each other causing
various hazards
11
Unpipelined Datapath for MIPS

[Figure: single-cycle datapath — PC with +4 adder, instruction memory, GPR register file (rs1/rs2 read ports, ws/wd write port), immediate extender, ALU, and data memory. Control signals: PCSrc (br / rind / jabs / pc+4), RegWrite, MemWrite, WBSrc, RegDst, ExtSel, OpSel, BSrc, zero?]
12
Simplified Unpipelined Datapath

[Figure: the same datapath with control omitted — PC, instruction memory, GPRs, immediate extender, ALU, data memory]
13
Pipelined Datapath

[Figure: the datapath divided by pipeline registers into five phases — fetch, decode & register-fetch, execute, memory, write-back]

Clock period can be reduced by dividing the execution of an instruction into multiple cycles
tC > max {tIM, tRF, tALU, tDM, tRW}  ( = tDM probably)
14
Pipelined Control

[Figure: the same five-phase pipelined datapath, now with a hardwired controller generating the control signals]

Clock period can be reduced by dividing the execution of an instruction into multiple cycles
tC > max {tIM, tRF, tALU, tDM, tRW}  ( = tDM probably)
However, CPI will increase unless instructions are pipelined
15
"Iron Law" of Processor Performance

Time/Program = (Instructions/Program) × (Cycles/Instruction) × (Time/Cycle)

– Instructions per program depends on source code, compiler technology, and ISA
– Cycles per instruction (CPI) depends upon the ISA and the microarchitecture
– Time per cycle depends upon the microarchitecture and the base technology

Microarchitecture                   CPI    cycle time
Microcoded                          >1     short
Single-cycle unpipelined            1      long
Pipelined                           1      short
Multi-cycle, unpipelined control    >1     short
21
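Because the three factors multiply, any one can be traded against the others. A small worked instance (the instruction count, CPI values, and clock rates below are illustrative numbers, not from the slides):

# Time/Program = (Instructions/Program) * (Cycles/Instruction) * (Time/Cycle)
def exec_time(instructions, cpi, clock_hz):
    return instructions * cpi * (1.0 / clock_hz)

# Illustrative comparison for the same 1e9-instruction program:
unpipelined = exec_time(1e9, cpi=1.0, clock_hz=500e6)   # CPI 1, long cycle
pipelined   = exec_time(1e9, cpi=1.2, clock_hz=2e9)     # short cycle, some stall cycles
print(unpipelined, pipelined)                            # 2.0 s vs 0.6 s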
CPI Examples
Microcoded machine: Inst 1 takes 7 cycles, Inst 2 takes 5 cycles, Inst 3 takes 10 cycles
  3 instructions, 22 cycles, CPI = 7.33
Unpipelined (single-cycle) machine: Inst 1, Inst 2, Inst 3 each take one long cycle
  3 instructions, 3 cycles, CPI = 1
Pipelined machine: Inst 1, Inst 2, Inst 3 overlap, one instruction completing per (short) cycle
  3 instructions, 3 cycles, CPI = 1
22
Technology Assumptions
• A small amount of very fast memory (caches)
backed up by a large, slower memory
• Fast ALU (at least for integers)
• Multiported Register files (slower!)

Thus, the following timing assumption is reasonable:

tIM ≈ tRF ≈ tALU ≈ tDM ≈ tRW

A 5-stage pipeline will be the focus of our detailed design
 - some commercial designs have over 30 pipeline stages to do an integer add!
23
Pipeline Diagrams
[Figure: five-stage pipelined datapath with hardwired controller — fetch, decode & register-fetch, execute, memory, write-back phases]

We need some way to show multiple simultaneous transactions in both space and time
24
Pipeline Diagrams: Transactions vs. Time
[Figure: same five-stage pipelined datapath]
time t0 t1 t2 t3 t4 t5 t6 t7 ....
instruction1 IF1 ID1 EX1 MA1 WB1
instruction2 IF2 ID2 EX2 MA2 WB2
instruction3 IF3 ID3 EX3 MA3 WB3
instruction4 IF4 ID4 EX4 MA4 WB4
instruction5 IF5 ID5 EX5 MA5 WB5 25
Pipeline Diagrams: Space vs. Time
[Figure: same five-stage pipelined datapath]
time t0 t1 t2 t3 t4 t5 t6 t7 ....
Resources

IF I1 I2 I3 I4 I5
ID I1 I2 I3 I4 I5
EX I1 I2 I3 I4 I5
MA I1 I2 I3 I4 I5
WB I1 I2 I3 I4 I5 26
Instructions Interact With Each Other
in Pipeline
• Structural Hazard: An instruction in the
pipeline needs a resource being used by
another instruction in the pipeline
• Data Hazard: An instruction depends on a
data value produced by an earlier instruction
• Control Hazard: Whether or not an instruction
should be executed depends on a control
decision made by an earlier instruction

27
Agenda
• Microcoded Microarchitectures
• Pipeline Review
– Pipelining Basics
– Structural Hazards
– Data Hazards
– Control Hazards

28
Overview of Structural Hazards
• Structural hazards occur when two instructions need
the same hardware resource at the same time
• Approaches to resolving structural hazards
– Schedule: Programmer explicitly avoids scheduling
instructions that would create structural hazards
– Stall: Hardware includes control logic that stalls until
earlier instruction is no longer using contended resource
– Duplicate: Add more hardware to design so that each
instruction can access independent resources at the same
time
• Simple 5-stage MIPS pipeline has no structural hazards
specifically because ISA was designed that way

29
Example Structural Hazard: Unified Memory

[Figure: five-stage pipeline (IF ID EX MEM WB) with separate instruction and data memories]

30
Example Structural Hazard: Unified Memory

[Figure: the same pipeline with one unified memory shared by the IF and MEM stages — both stages contend for its addr/rdata/wdata ports]
31
Example Structural Hazard: 2-Cycle Memory

[Figure: the pipeline with memory accesses taking two cycles (M0 and M1 stages)]
32
Agenda
• Microcoded Microarchitectures
• Pipeline Review
– Pipelining Basics
– Structural Hazards
– Data Hazards
– Control Hazards

33
Overview of Data Hazards
• Data hazards occur when one instruction depends on a
data value produced by a preceding instruction still in
the pipeline
• Approaches to resolving data hazards
– Schedule: Programmer explicitly avoids scheduling
instructions that would create data hazards
– Stall: Hardware includes control logic that freezes earlier
stages until preceding instruction has finished producing
data value
– Bypass: Hardware datapath allows values to be sent to an
earlier stage before preceding instruction has left the
pipeline
– Speculate: Guess that there is not a problem, if incorrect
kill speculative instruction and restart
34
Example Data Hazard

[Figure: fully pipelined datapath; the value of r1 needed by the instruction in decode is still in flight in a later stage]

...
r1 ← r0 + 10   (ADDI R1, R0, #10)
r4 ← r1 + 17   (ADDI R4, R1, #17)    r1 is stale. Oops!
...
35
Feedback to Resolve Hazards

[Figure: four pipeline stages with feedback paths FB1-FB4 from later stages back to earlier ones]

• Later stages provide dependence information to earlier stages, which can stall (or kill) instructions
• Controlling a pipeline in this manner works provided the instruction at stage i+1 can complete without any interaction from instructions in stages 1 to i (otherwise deadlock)
36
Resolving Data Hazards with Stalls

[Figure: pipelined datapath with a stall condition (interlock) check in decode; when it stalls, a nop is injected into the execute stage]

...
r1 ← r0 + 10
r4 ← r1 + 17
...
37
Stalled Stages and Pipeline Bubbles
time
t0 t1 t2 t3 t4 t5 t6 t7 ....
(I1) r1 (r0) + 10 IF1 ID1 EX1 MA1 WB1
(I2) r4 (r1) + 17 IF2 ID2 ID2 ID2 ID2 EX2 MA2 WB2
(I3) IF3 IF3 IF3 IF3 ID3 EX3 MA3 WB3
(I4) stalled stages IF4 ID4 EX4 MA4 WB4
(I5) IF5 ID5 EX5 MA5 WB5

time
t0 t1 t2 t3 t4 t5 t6 t7 ....
IF I1 I2 I3 I3 I3 I3 I4 I5
ID I1 I2 I2 I2 I2 I3 I4 I5
Resource
EX I1 nop nop nop I2 I3 I4 I5
Usage
MA I1 nop nop nop I2 I3 I4 I5
WB I1 nop nop nop I2 I3 I4 I5

nop  pipeline bubble


38
Stall Control Logic

[Figure: stall logic (Cstall) in the decode stage compares the rs/rt fields of the decoding instruction against the ws destination fields of downstream instructions; on a stall, a nop is inserted into execute]

Compare the source registers of the instruction in the decode stage with the destination registers of the uncommitted instructions.
39
Stall Control Logic (ignoring jumps & branches)

[Figure: Cstall now also uses the we/ws outputs of Cdest logic in each downstream stage and the re1/re2 outputs of Cre logic in decode]

Should we always stall if the rs field matches some rd?
  not every instruction writes a register ⇒ we
  not every instruction reads a register ⇒ re
40
Source & Destination Registers
R-type:  op  rs  rt  rd  func
I-type:  op  rs  rt  immediate16
J-type:  op  immediate26

                                                      source(s)   destination
ALU    rd ← (rs) func (rt)                            rs, rt      rd
ALUi   rt ← (rs) op immediate                         rs          rt
LW     rt ← M[(rs) + immediate]                       rs          rt
SW     M[(rs) + immediate] ← (rt)                     rs, rt      –
BZ     cond (rs)
         true:  PC ← (PC) + immediate                 rs          –
         false: PC ← (PC) + 4                         rs          –
J      PC ← (PC) + immediate                          –           –
JAL    r31 ← (PC), PC ← (PC) + immediate              –           r31
JR     PC ← (rs)                                      rs          –
JALR   r31 ← (PC), PC ← (rs)                          rs          r31
41
Deriving the Stall Signal

Cdest:
  ws = Case opcode
         ALU        ⇒ rd
         ALUi, LW   ⇒ rt
         JAL, JALR  ⇒ R31
  we = Case opcode
         ALU, ALUi, LW  ⇒ (ws ≠ 0)
         JAL, JALR      ⇒ on
         ...            ⇒ off

Cre:
  re1 = Case opcode
          ALU, ALUi, LW, SW, BZ, JR, JALR  ⇒ on
          J, JAL                           ⇒ off
  re2 = Case opcode
          ALU, SW  ⇒ on
          ...      ⇒ off

Cstall:
  stall = ((rsD = wsE)·weE + (rsD = wsM)·weM + (rsD = wsW)·weW) · re1D +
          ((rtD = wsE)·weE + (rtD = wsM)·weM + (rtD = wsW)·weW) · re2D
42
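The stall equation is just a sum of products over register-number comparisons gated by the we/re bits. Writing it out as a boolean function makes it easy to check against the formula (signal names follow the slide; the surrounding pipeline state is assumed, and the example operands are illustrative):

# Stall signal from the slide, as a boolean function. rs_D/rt_D are the source
# fields of the instruction in decode; ws_*/we_* are the destination register
# and write-enable of the instructions in execute, memory, and writeback.
def stall(rs_D, rt_D, re1_D, re2_D,
          ws_E, we_E, ws_M, we_M, ws_W, we_W):
    rs_hazard = ((rs_D == ws_E and we_E) or
                 (rs_D == ws_M and we_M) or
                 (rs_D == ws_W and we_W))
    rt_hazard = ((rt_D == ws_E and we_E) or
                 (rt_D == ws_M and we_M) or
                 (rt_D == ws_W and we_W))
    return (rs_hazard and re1_D) or (rt_hazard and re2_D)

# ADDI R1,R0,10 in execute, ADDI R4,R1,17 in decode -> must stall
print(stall(rs_D=1, rt_D=0, re1_D=True, re2_D=False,
            ws_E=1, we_E=True, ws_M=0, we_M=False, ws_W=0, we_W=False))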
Hazards due to Loads & Stores

[Figure: pipelined datapath with the stall condition check; annotation asks what happens if (r1)+7 = (r3)+5]

...
M[(r1)+7] ← (r2)        Is there any possible data hazard
r4 ← M[(r3)+5]          in this instruction sequence?
...
43
Data Hazards Due to Loads and Stores
• Example instruction sequence
– Mem[ Regs[r1] + 7 ] <- Regs[r2]
– Regs[r4] <- Mem[ Regs[r3] + 5 ]

• What if Regs[r1]+7 == Regs[r3]+5 ?


– Writing and reading to/from the same address
– Hazard is avoided because our memory system
completes writes in a single cycle
– More realistic memory system will require more
careful handling of data hazards due to loads and
stores

44
Overview of Data Hazards
• Data hazards occur when one instruction depends on a
data value produced by a preceding instruction still in
the pipeline
• Approaches to resolving data hazards
– Schedule: Programmer explicitly avoids scheduling
instructions that would create data hazards
– Stall: Hardware includes control logic that freezes earlier
stages until preceding instruction has finished producing
data value
– Bypass: Hardware datapath allows values to be sent to an
earlier stage before preceding instruction has left the
pipeline
– Speculate: Guess that there is not a problem, if incorrect
kill speculative instruction and restart
45
Adding Bypassing to the Datapath

[Figure: pipelined datapath with a bypass path (ASrc mux) from the ALU output in the execute stage back to the ALU's A input]

... When does this bypass help?
(I1)  r1 ← r0 + 10      r1 ← Mem[r0 + 10]      JAL 500
(I2)  r4 ← r1 + 17      r4 ← r1 + 17           r4 ← r31 + 17
46
Deriving the Bypass Signal
time
t0 t1 t2 t3 t4 t5 t6 t7 ....
(I1) r1 (r0) + 10 IF1 ID1 EX1 MA1 WB1
(I2) r4 (r1) + 17 IF2 ID2 ID2 ID2 ID2 EX2 MA2 WB2
(I3) IF3 IF3 IF3 IF3 ID3 EX3 MA3 WB3
(I4) stalled stages IF4 ID4 EX4 MA4 WB4
(I5) IF5 ID5 EX5 MA5 WB5
Each stall or kill introduces a bubble in the pipeline
⇒ CPI > 1
A new datapath, i.e., a bypass, can get the data from
the output of the ALU to its input
time
t0 t1 t2 t3 t4 t5 t6 t7 ....
(I1) r1 (r0) + 10 IF1 ID1 EX1 MA1 WB1
(I2) r4 (r1) + 17 IF2 ID2 EX2 MA2 WB2
(I3) IF3 ID3 EX3 MA3 WB3
(I4) IF4 ID4 EX4 MA4 WB4
(I5) IF5 ID5 EX5 MA5 WB5 47
The Bypass Signal: Deriving it from the Stall Signal
stall = ( ((rsD = wsE)·weE + (rsD = wsM)·weM + (rsD = wsW)·weW)·re1D
        + ((rtD = wsE)·weE + (rtD = wsM)·weM + (rtD = wsW)·weW)·re2D )

ws = Case opcode                   we = Case opcode
       ALU        ⇒ rd                    ALU, ALUi, LW  ⇒ (ws ≠ 0)
       ALUi, LW   ⇒ rt                    JAL, JALR      ⇒ on
       JAL, JALR  ⇒ R31                   ...            ⇒ off

ASrc = (rsD = wsE)·weE·re1D        Is this correct?
No, because only ALU and ALUi instructions can benefit from this bypass.
Split weE into two components: we-bypass, we-stall
48
Bypass and Stall Signals
Split weE into two components: we-bypass, we-stall

we-bypassE = Case opcodeE              we-stallE = Case opcodeE
               ALU, ALUi ⇒ (ws ≠ 0)                 LW        ⇒ (ws ≠ 0)
               ...       ⇒ off                      JAL, JALR ⇒ on
                                                    ...       ⇒ off

ASrc = (rsD = wsE)·we-bypassE·re1D

stall = ((rsD = wsE)·we-stallE + (rsD = wsM)·weM + (rsD = wsW)·weW)·re1D
      + ((rtD = wsE)·weE + (rtD = wsM)·weM + (rtD = wsW)·weW)·re2D
49
Fully Bypassed Datapath

[Figure: pipelined datapath with bypass muxes (ASrc, BSrc) feeding both ALU inputs from the execute, memory, and writeback stages; the PC is also forwarded for JAL, ...]

Is there still a need for the stall signal?
stall = (rsD = wsE)·(opcodeE = LW)·(wsE ≠ 0)·re1D
      + (rtD = wsE)·(opcodeE = LW)·(wsE ≠ 0)·re2D
50
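With full bypassing, the only remaining interlock is the load-use case in the formula above. A sketch of that reduced check (same signal naming as before, with opcode_E for the instruction in execute; the example operands are illustrative):

# Reduced stall condition for a fully bypassed pipeline: only a load in
# execute whose result is needed in decode forces a one-cycle bubble.
def stall_full_bypass(rs_D, rt_D, re1_D, re2_D, opcode_E, ws_E):
    load_in_ex = (opcode_E == 'LW') and (ws_E != 0)
    return load_in_ex and ((rs_D == ws_E and re1_D) or
                           (rt_D == ws_E and re2_D))

# LW R1, 0(R2) in execute, ADD R3,R1,R4 in decode -> stall one cycle
print(stall_full_bypass(rs_D=1, rt_D=4, re1_D=True, re2_D=True,
                        opcode_E='LW', ws_E=1))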
Overview of Data Hazards
• Data hazards occur when one instruction depends on a
data value produced by a preceding instruction still in
the pipeline
• Approaches to resolving data hazards
– Schedule: Programmer explicitly avoids scheduling
instructions that would create data hazards
– Stall: Hardware includes control logic that freezes earlier
stages until preceding instruction has finished producing
data value
– Bypass: Hardware datapath allows values to be sent to an
earlier stage before preceding instruction has left the
pipeline
– Speculate: Guess that there is not a problem, if incorrect
kill speculative instruction and restart
51
Agenda
• Microcoded Microarchitectures
• Pipeline Review
– Pipelining Basics
– Structural Hazards
– Data Hazards
– Control Hazards

52
Control Hazards
• What do we need to calculate next PC?

– For Jumps
• Opcode, offset and PC
– For Jump Register
• Opcode and Register value
– For Conditional Branches
• Opcode, PC, Register (for condition), and offset
– For all other instructions
• Opcode and PC
– have to know it’s not one of above!

53
Opcode Decoding Bubble
(assuming no branch delay slots for now)

time
t0 t1 t2 t3 t4 t5 t6 t7 ....
(I1) r1 (r0) + 10 IF1 ID1 EX1 MA1 WB1
(I2) r3 (r2) + 17 IF2 IF2 ID2 EX2 MA2 WB2
(I3) IF3 IF3 ID3 EX3 MA3 WB3
(I4) IF4 IF4 ID4 EX4 MA4 WB4

time
t0 t1 t2 t3 t4 t5 t6 t7 ....
IF I1 nop I2 nop I3 nop I4
ID I1 nop I2 nop I3 nop I4
Resource
Usage EX I1 nop I2 nop I3 nop I4
MA I1 nop I2 nop I3 nop I4
WB I1 nop I2 nop I3 nop I4

CPI = 2! nop  pipeline bubble 54


Speculate next address is PC+4

[Figure: fetch speculatively fetches PC+4 while the jump is still being decoded; PCSrc selects pc+4 / jabs / rind / br]

I1  096  ADD
I2  100  J 304
I3  104  ADD
I4  304  ADD

A jump instruction kills (not stalls) the following instruction. How?
55
Pipelining Jumps

To kill a fetched instruction: insert a mux before IR (controlled by IRSrcD)

IRSrcD = Case opcodeD
           J, JAL  ⇒ nop
           ...     ⇒ IM

Any interaction between stall and jump?

I1  096  ADD
I2  100  J 304
I3  104  ADD      ← killed
I4  304  ADD
56
Jump Pipeline Diagrams
time
t0 t1 t2 t3 t4 t5 t6 t7 ....
(I1) 096: ADD IF1 ID1 EX1 MA1 WB1
(I2) 100: J 304 IF2 ID2 EX2 MA2 WB2
(I3) 104: ADD IF3 nop nop nop nop
(I4) 304: ADD IF4 ID4 EX4 MA4 WB4

time
t0 t1 t2 t3 t4 t5 t6 t7 ....
IF I1 I2 I3 I4 I5
ID I1 I2 nop I4 I5
Resource
EX I1 I2 nop I4 I5
Usage
MA I1 I2 nop I4 I5
WB I1 I2 nop I4 I5

nop  pipeline bubble


57
Pipelining Conditional Branches

[Figure: the branch condition (BEQZ?, zero?) is not evaluated until the execute stage]

I1  096  ADD
I2  100  BEQZ r1 +200
I3  104  ADD
    108  ...
I4  304  ADD

Branch condition is not known until the execute stage. What action should be taken in the decode stage?
58
Pipelining Conditional Branches

[Figure: when the branch resolves in execute, the instructions already in fetch and decode must be dealt with]

If the branch is taken:
  - kill the two following instructions
  - the instruction at the decode stage is not valid ⇒ the stall signal is not valid

I1  096  ADD
I2  100  BEQZ r1 +200
I3  104  ADD
    108  ...
I4  304  ADD
59
Pipelining Conditional Branches

[Figure: an IRSrcE mux lets the instruction in execute be replaced by a nop as well, and the branch target computed in execute redirects the PC]

If the branch is taken:
  - kill the two following instructions
  - the instruction at the decode stage is not valid ⇒ the stall signal is not valid
60
New Stall Signal
stall = ( ((rsD =wsE).weE + (rsD =wsM).weM + (rsD =wsW).weW).re1D
+ ((rtD =wsE).weE + (rtD =wsM).weM + (rtD =wsW).weW).re2D )
. !((opcodeE=BEQZ).z + (opcodeE=BNEZ).!z)

Don’t stall if the branch is taken. Why?

Instruction at the decode stage is invalid

61
Control Equations for PC and IR Muxes

PCSrc  = Case opcodeE
           BEQZ.z, BNEZ.!z  ⇒ br
           ...              ⇒ Case opcodeD
                                J, JAL    ⇒ jabs
                                JR, JALR  ⇒ rind
                                ...       ⇒ pc+4

IRSrcD = Case opcodeE
           BEQZ.z, BNEZ.!z  ⇒ nop
           ...              ⇒ Case opcodeD
                                J, JAL, JR, JALR  ⇒ nop
                                ...               ⇒ IM

IRSrcE = Case opcodeE
           BEQZ.z, BNEZ.!z  ⇒ nop
           ...              ⇒ stall.nop + !stall.IRD

Give priority to the older instruction, i.e., the execute-stage instruction over the decode-stage instruction.
62
Branch Pipeline Diagrams
(resolved in execute stage)
time
t0 t1 t2 t3 t4 t5 t6 t7 ....
(I1) 096: ADD IF1 ID1 EX1 MA1 WB1
(I2) 100: BEQZ +200 IF2 ID2 EX2 MA2 WB2
(I3) 104: ADD IF3 ID3 nop nop nop
(I4) 108: IF4 nop nop nop nop
(I5) 304: ADD IF5 ID5 EX5 MA5 WB5

time
t0 t1 t2 t3 t4 t5 t6 t7 ....
IF I1 I2 I3 I4 I5
ID I1 I2 I3 nop I5
Resource
EX I1 I2 nop nop I5
Usage
MA I1 I2 nop nop I5
WB I1 I2 nop nop I5

nop  pipeline bubble


63
Reducing Branch Penalty (resolve in decode stage)
• One pipeline bubble can be removed if an extra comparator is used in the Decode stage
  – But might elongate cycle time

[Figure: zero-detect on the register file output in decode, so the branch can redirect the PC one stage earlier]

Pipeline diagram now same as for jumps
64
Branch Delay Slots
(expose control hazard to software)

• Change the ISA semantics so that the instruction


that follows a jump or branch is always executed
– gives compiler the flexibility to put in a useful instruction where normally
a pipeline bubble would have resulted.

I1  096  ADD
I2  100  BEQZ r1 +200
I3  104  ADD      ← delay slot instruction, executed regardless of branch outcome
I4  304  ADD

• Other techniques include more advanced


branch prediction, which can dramatically
reduce the branch penalty... to come later 65
Branch Pipeline Diagrams
(branch delay slot)
time
t0 t1 t2 t3 t4 t5 t6 t7 ....
(I1) 096: ADD IF1 ID1 EX1 MA1 WB1
(I2) 100: BEQZ +200 IF2 ID2 EX2 MA2 WB2
(I3) 104: ADD IF3 ID3 EX3 MA3 WB3
(I4) 304: ADD IF4 ID4 EX4 MA4 WB4

time
t0 t1 t2 t3 t4 t5 t6 t7 ....
IF I1 I2 I3 I4
ID I1 I2 I3 I4
Resource
EX I1 I2 I3 I4
Usage
MA I1 I2 I3 I4
WB I1 I2 I3 I4

66
Why an Instruction may not be
dispatched every cycle (CPI>1)
• Full bypassing may be too expensive to implement
– typically all frequently used paths are provided
– some infrequently used bypass paths may increase cycle time and
counteract the benefit of reducing CPI
• Loads have two-cycle latency
– Instruction after load cannot use load result
– MIPS-I ISA defined load delay slots, a software-visible pipeline hazard
(compiler schedules independent instruction or inserts NOP to avoid
hazard). Removed in MIPS-II (pipeline interlocks added in hardware)
• MIPS:“Microprocessor without Interlocked Pipeline Stages”
• Conditional branches may cause bubbles
– kill following instruction(s) if no delay slots

Machines with software-visible delay slots may execute significant


number of NOP instructions inserted by the compiler. NOPs not
counted in useful CPI (alternatively, increase instructions/program)
67
Other Control Hazards
• Exceptions
• Interrupts

More on this later in the course

68
Agenda
• Microcoded Microarchitectures
• Pipeline Review
– Pipelining Basics
– Structural Hazards
– Data Hazards
– Control Hazards

69
Acknowledgements
• These slides contain material developed and copyright by:
– Arvind (MIT)
– Krste Asanovic (MIT/UCB)
– Joel Emer (Intel/MIT)
– James Hoe (CMU)
– John Kubiatowicz (UCB)
– David Patterson (UCB)
– Christopher Batten (Cornell)

• MIT material derived from course 6.823


• UCB material derived from course CS252 & CS152
• Cornell material derived from course ECE 4750

70
Computer Architecture
ELE 475 / COS 475
Slide Deck 3: Cache Review
David Wentzlaff
Department of Electrical Engineering
Princeton University

1
Agenda
• Memory Technology
• Motivation for Caches
• Classifying Caches
• Cache Performance

2
Agenda
• Memory Technology
• Motivation for Caches
• Classifying Caches
• Cache Performance

3
Naive Register File

[Figure: register file built from flip-flops — a decoder selects the write address; separate read/write data ports, read address, write address, and clock]
4
Memory Arrays: Register File

5
Memory Arrays: SRAM

6
Memory Arrays: DRAM

7
Relative Memory Sizes of SRAM vs. DRAM

[Figure: per-bit area of on-chip SRAM on a logic chip vs. DRAM on a memory chip]

[From Foss, R.C. "Implementing Application-Specific Memory", ISSCC 1996]
8
Memory Technology Trade-offs
Low Capacity
Latches/Registers Low Latency
High Bandwidth
(more and wider ports)

Register File

SRAM
High Capacity
DRAM High Latency
Low Bandwidth

9
Agenda
• Memory Technology
• Motivation for Caches
• Classifying Caches
• Cache Performance

10
CPU-Memory Bottleneck
Main
Processor
Memory

• Performance of high-speed computers is usually limited by


memory bandwidth and latency
• Latency is time for a single access
– Main memory latency is usually >> than processor cycle time
• Bandwidth is the number of accesses per unit time
– If a fraction m of instructions are loads/stores, there are 1 + m memory
accesses per instruction, so CPI = 1 requires at least 1 + m memory accesses
per cycle
• Bandwidth-Delay Product is amount of data that can be in
flight at the same time (Little’s Law)
11
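A quick back-of-the-envelope version of these last two points (the load/store fraction, latency, and bandwidth numbers below are illustrative, not from the slides):

# Memory accesses per cycle needed to sustain CPI = 1, and the
# bandwidth-delay product (Little's Law) for a toy memory system.
m = 0.3                      # assumed fraction of instructions that are loads/stores
accesses_per_cycle = 1 + m   # 1 instruction fetch + m data accesses per instruction
print(accesses_per_cycle)    # 1.3

latency_s = 100e-9           # 100 ns main-memory latency (illustrative)
bandwidth = 16e9             # 16 GB/s memory bandwidth (illustrative)
in_flight = bandwidth * latency_s
print(in_flight)             # 1600 bytes must be in flight to keep the bandwidth busy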
Processor-DRAM Latency Gap

[Hennessy &
Patterson 2011]

• Four-issue 2 GHz superscalar accessing 100 ns DRAM could execute


800 instructions during the time for one memory access!
• Long latencies mean large bandwidth-delay products which can be
difficult to saturate, meaning bandwidth is wasted
12
From Hennessy and Patterson Ed. 5 Image Copyright © 2011, Elsevier Inc. All rights Reserved.
Physical Size Affects Latency

[Figure: a small memory close to the processor vs. a big memory far from it]
• Signals have further to travel


• Fan out to more locations

13
Memory Hierarchy
Small Fast Big Slow
Processor Memory Memory
(RF, SRAM) (DRAM)

• Capacity: Register << SRAM << DRAM


• Latency: Register << SRAM << DRAM
• Bandwidth: on-chip >> off-chip
• On a data access:
– if data is in fast memory -> low-latency access to SRAM
– if data is not in fast memory -> long-latency access to DRAM
• Memory hierarchies only work if the small, fast memory
actually stores data that is reused by the processor 14
Common and Predictable Memory Reference Patterns

[Figure: address vs. time over n loop iterations — instruction fetches loop repeatedly, stack accesses follow subroutine call/return, and data accesses mix argument and scalar accesses]

Temporal Locality: If a location is referenced, it is likely to be referenced again in the near future
Spatial Locality: If a location is referenced, it is likely that locations near it will be referenced in the near future
15
Real Memory Reference Patterns

[Figure: memory address vs. time, one dot per access, showing regions of spatial locality, temporal locality, and combined temporal & spatial locality]

[From Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971)]
16
Caches Exploit Both Types of Locality
Small Fast Big Slow
Processor Memory Memory
(RF, SRAM) (DRAM)

• Exploit temporal locality by remembering the


contents of recently accessed locations
• Exploit spatial locality by fetching blocks of
data around recently accessed locations

17
Agenda
• Memory Technology
• Motivation for Caches
• Classifying Caches
• Cache Performance

18
Inside a Cache

[Figure: the cache sits between the processor and main memory, exchanging addresses and data with both; it holds copies of main-memory locations (e.g., locations 100 and 101), and each cache line stores an address tag together with a data block]
19
Basic Cache Algorithm for a Load

20
Classifying Caches
Address Address
Main
Processor CACHE Memory
Data Data

• Block Placement: Where can a block be


placed in the cache?
• Block Identification: How a block is found if it
is in the cache?
• Block Replacement: Which block should be
replaced on a miss?
• Write Strategy: What happens on a write? 21
Block Placement: Where Can a Block Be Placed in the Cache?

[Figure: memory blocks 0-31 mapping into an 8-block cache organized three ways]

Fully Associative: block 12 can be placed anywhere
(2-way) Set Associative: block 12 can be placed anywhere in set 0 (12 mod 4)
Direct Mapped: block 12 can be placed only into block 4 (12 mod 8)
22
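The mod arithmetic in the figure is all there is to placement. A tiny check using the figure's parameters (an 8-block cache, 4 sets in the 2-way case):

block = 12
num_sets, num_blocks = 4, 8
print(block % num_sets)    # 0 -> set 0 in the 2-way set-associative cache
print(block % num_blocks)  # 4 -> block 4 in the direct-mapped cache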
Block Identification: How to find block
in cache?

• Cache uses index and offset to find


potential match, then checks tag
• Tag check only includes higher order bits
• In this example (Direct-mapped, 8B block,
4 line cache )
24
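For the direct-mapped example on the slide (8-byte blocks, 4 lines), the address splits into tag, index, and offset with a couple of shifts and masks. A sketch with those parameters (the address 0x64 is an illustrative value):

# Split an address into tag / index / offset for a direct-mapped cache
# with 8-byte blocks and 4 lines (the slide's example parameters).
BLOCK_BYTES, NUM_LINES = 8, 4
OFFSET_BITS = BLOCK_BYTES.bit_length() - 1   # 3
INDEX_BITS  = NUM_LINES.bit_length() - 1     # 2

def split(addr):
    offset = addr & (BLOCK_BYTES - 1)
    index  = (addr >> OFFSET_BITS) & (NUM_LINES - 1)
    tag    = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

print(split(0x64))   # address 100 -> (tag 3, index 0, offset 4)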
Block Identification: How to find block
in cache?

• Cache checks all potential blocks with


parallel tag check
• In this example (2-way associative, 8B block,
4 line cache) 25
Block Replacement: Which block to
replace?
• No choice in a direct mapped cache
• In an associative cache, which block from set should be
evicted when the set becomes full?
• Random
• Least Recently Used (LRU)
– LRU cache state must be updated on every access
– True implementation only feasible for small sets (2-way)
– Pseudo-LRU binary tree often used for 4-8 way
• First In, First Out (FIFO) aka Round-Robin
– Used in highly associative caches
• Not Most Recently Used (NMRU)
– FIFO with exception for most recently used block(s)
26
Write Strategy: How are writes
handled?
• Cache Hit
– Write Through – write both cache and memory,
generally higher traffic but simpler to design
– Write Back – write cache only, memory is written
when evicted, dirty bit per block avoids unnecessary
write backs, more complicated
• Cache Miss
– No Write Allocate – only write to main memory
– Write Allocate – fetch block into cache, then write
• Common Combinations
• Write Through & No Write Allocate
• Write Back & Write Allocate
27
Agenda
• Memory Technology
• Motivation for Caches
• Classifying Caches
• Cache Performance

28
Average Memory Access Time

[Figure: processor → cache (hit) → main memory (miss)]

• Average Memory Access Time = Hit Time + (Miss Rate × Miss Penalty)
29
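A worked instance of the formula (the hit time, miss rate, and miss penalty below are illustrative numbers, not from the slides):

def amat(hit_time, miss_rate, miss_penalty):
    # Average Memory Access Time = Hit Time + Miss Rate * Miss Penalty
    return hit_time + miss_rate * miss_penalty

print(amat(hit_time=1, miss_rate=0.05, miss_penalty=100))  # 6.0 cycles on average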
Categorizing Misses: The Three C’s

• Compulsory – first-reference to a block, occur even


with infinite cache
• Capacity – cache is too small to hold all data needed by
program, occur even under perfect replacement policy
(loop over 5 cache lines)
• Conflict – misses that occur because of collisions due
to less than full associativity (loop over 3 cache lines) 30
Reduce Hit Time: Small & Simple
Caches

Plot from Hennessy and Patterson Ed. 4


Image Copyright © 2007-2012 Elsevier Inc. All rights Reserved.
Reduce Miss Rate: Large Block Size

Pros:
• Less tag overhead
• Exploit fast burst transfers from DRAM
• Exploit fast burst transfers over wide on-chip busses
Cons:
• Can waste bandwidth if data is not used
• Fewer blocks -> more conflicts
Plot from Hennessy and Patterson Ed. 5 Image Copyright © 2011, Elsevier Inc. All rights Reserved.
Reduce Miss Rate: Large Cache Size

Empirical Rule of Thumb:


If cache size is doubled, miss rate usually drops by about √2

Plot from Hennessy and Patterson Ed. 5 Image Copyright © 2011, Elsevier Inc. All rights Reserved.
Reduce Miss Rate: High Associativity

Empirical Rule of Thumb:


Direct-mapped cache of size N has about the same miss rate
as a two-way set- associative cache of size N/2
Plot from Hennessy and Patterson Ed. 5 Image Copyright © 2011, Elsevier Inc. All rights Reserved.
Agenda
• Memory Technology
• Motivation for Caches
• Classifying Caches
• Cache Performance

36
Acknowledgements
• These slides contain material developed and copyright by:
– Arvind (MIT)
– Krste Asanovic (MIT/UCB)
– Joel Emer (Intel/MIT)
– James Hoe (CMU)
– John Kubiatowicz (UCB)
– David Patterson (UCB)
– Christopher Batten (Cornell)

• MIT material derived from course 6.823


• UCB material derived from course CS252 & CS152
• Cornell material derived from course ECE 4750

37
Computer Architecture
ELE 475 / COS 475
Slide Deck 4: Superscalar 1
David Wentzlaff
Department of Electrical Engineering
Princeton University

1
Types of Data Hazards
Consider executing a sequence of instructions of the form rk ← ri op rj

Data-dependence
  r3 ← r1 op r2        Read-after-Write
  r5 ← r3 op r4        (RAW) hazard

Anti-dependence
  r3 ← r1 op r2        Write-after-Read
  r1 ← r4 op r5        (WAR) hazard

Output-dependence
  r3 ← r1 op r2        Write-after-Write
  r3 ← r6 op r7        (WAW) hazard
2
Introduction to Superscalar Processor
• Processors studied so far are fundamentally
limited to CPI >= 1
• Superscalar processors enable CPI < 1 (IPC > 1)
by executing multiple instructions in parallel
• Can have both in-order and out-of-order
superscalar processors. We will start with in-
order.

3
Baseline 2-Way In-Order Superscalar Processor

[Figure: two-wide in-order pipeline — the PC and instruction cache feed IR0 and IR1; a shared register file feeds Pipe A (ALU, branch condition) and Pipe B (ALU, data cache)]

Pipe A: Integer Ops., Branches
Pipe B: Integer Ops., Memory
4
[Figure build: fetch 2 instructions at the same time; the register file needs 4 read ports and 2 write ports]
5
[Figure build: issue logic / instruction steering between decode and the two pipes]
6
[Figure build: duplicate control — IR0 → Decode A, IR1 → Decode B]
7
Issue Logic Pipeline Diagrams
OpA F D A0 A1 W CPI = 0.5 (IPC = 2)
OpB F D B0 B1 W
OpC F D A0 A1 W Double Issue Pipeline
Can have two instructions in
OpD F D B0 B1 W same stage at same time
OpE F D A0 A1 W
OpF F D B0 B1 W

ADDIU F D A0 A1 W
LW F D B0 B1 W
Instruction Issue Logic swaps from
LW F D B0 B1 W natural position
ADDIU F D A0 A1 W
LW F D B0 B1 W
Structural
LW F D D B0 B1 W Hazard
8
Dual Issue Data Hazards
No Bypassing:
ADDIU R1,R1,1 F D A0 A1 W
ADDIU R3,R4,1 F D B0 B1 W
ADDIU R5,R6,1 F D A0 A1 W
ADDIU R7,R5,1 F D D D D A0 A1 W

Full Bypassing:
ADDIU R1,R1,1 F D A0 A1 W
ADDIU R3,R4,1 F D B0 B1 W
ADDIU R5,R6,1 F D A0 A1 W
ADDIU R7,R5,1 F D D A0 A1 W 9
Dual Issue Data Hazards
Order Matters:
ADDIU R1,R1,1 F D A0 A1 W
ADDIU R3,R4,1 F D B0 B1 W
ADDIU R7,R5,1 F D A0 A1 W
ADDIU R5,R6,1 F D B0 B1 W

WAR Hazard Possible?

10
Fetch Logic and Alignment

Cyc  Addr   Instr
0    0x000  OpA
0    0x004  OpB
1    0x008  OpC
1    0x00C  J 0x100
2    0x100  OpD
2    0x104  J 0x204
3    0x204  OpE
3    0x208  J 0x30C
4    0x30C  OpF
4    0x310  OpG
5    0x314  OpH

[Figure: instruction-cache lines at 0x000, 0x100, 0x200, 0x300, 0x310, showing which cycle each pair of instructions is fetched in]

Fetching across cache lines is very hard. May need extra ports
11
Fetch Logic and Alignment
Cyc Addr Instr
0 0x000 OpA Ideal, No Alignment Constraints
0 0x004 OpB
1 0x008 OpC OpA F D A0 A1 W
1 0x00C J 0x100 OpB F D B0 B1 W
… OpC F D B0 B1 W
2 0x100 OpD J F D A0 A1 W
2 0x104 J 0x204 OpD F D B0 B1 W
… J F D A0 A1 W
3 0x204 OpE OpE F D B0 B1 W
3 0x208 J 0x30C J F D A0 A1 W
… OpF F D A0 A1 W
4 0x30C OpF OpG F D B0 B1 W
4 0x310 OpG OpH F D A0 A1 W
5 0x314 OpH

12
With Alignment Constraints

[Figure: the same code with aligned fetch groups — jumps that land in the middle of a group waste fetch slots, so fetching OpA through OpH now spreads over more cycles]
13
With Alignment Constraints
Cyc Addr Instr
1 0x000 OpA F D A0 A1 W
1 0x004 OpB F D B0 B1 W
2 0x008 OpC F D B0 B1 W
2 0x00C J 0x100 F D A0 A1 W
3 0x100 OpD F D B0 B1 W
3 0x104 J 0x204 F D A0 A1 W
4 0x200 ? F - - - -
4 0x204 OpE F D A0 A1 W
5 0x208 J 0x30C F D A0 A1 W
5 0x20C ? F - - - -
6 0x308 ? F - - - -
6 0x30C OpF F D A0 A1 W
7 0x310 OpG F D A0 A1 W
7 0x314 OpH F D B0 B1 W
14
Precise Exceptions and Superscalars
• Similar to tracking program order for data
dependencies, we need to track order for
exceptions

LW F D B0 B1 W
SYSCALL F D A0 A1 W

LW is in B pipeline, but commits first in logical


order!

15
Bypassing in Superscalar Pipelines

[Figure series (slides 16-19): the two-wide datapath with bypass paths progressively added — the results of both ALUs and the data cache (points 1-6 in the figure) must each be forwardable to the A and B inputs of both pipes, so the bypass network grows quickly with issue width]
19
Breaking Decode and Issue Stage
• Bypass Network can become very complex
• Can motivate breaking Decode and Issue Stage
D = Decode, Possibly resolve structural Hazards
I = Register file read, Bypassing, Issue/Steer
Instructions to proper unit

OpA F D I A0 A1 W
OpB F D I B0 B1 W
OpC F D I A0 A1 W
OpD F D I B0 B1 W
20
Superscalars Multiply Branch Cost
BEQZ F D I A0 A1 W
OpA F D I B0 - -
OpB F D I - - -
OpC F D I - - -
OpD F D - - - -
OpE F D - - - -
OpF F - - - - -
OpG F - - - - -
OpH F D I A0 A1 W
OpI F D I B0 B1 W
21
Acknowledgements
• These slides contain material developed and copyright by:
– Arvind (MIT)
– Krste Asanovic (MIT/UCB)
– Joel Emer (Intel/MIT)
– James Hoe (CMU)
– John Kubiatowicz (UCB)
– David Patterson (UCB)
– Christopher Batten (Cornell)

• MIT material derived from course 6.823


• UCB material derived from course CS252 & CS152
• Cornell material derived from course ECE 4750

22
Computer Architecture
ELE 475 / COS 475
Slide Deck 5: Superscalar 2 and
Exceptions
David Wentzlaff
Department of Electrical Engineering
Princeton University

1
Agenda
• Interrupts
• Out-of-Order Processors

2
Interrupts: altering the normal flow of control

[Figure: program instructions Ii-1, Ii, Ii+1; control transfers from Ii to an interrupt handler HI1 ... HIn and then returns]

An external or internal event that needs to be processed by another (system) program. The event is usually unexpected or rare from the program's point of view.
3
Causes of Exceptions
Interrupt: an event that requests the attention of the processor
• Asynchronous: an external event
– input/output device service request
– timer expiration
– power disruptions, hardware failure
• Synchronous: an internal exception (a.k.a.
exceptions/trap)
– undefined opcode, privileged instruction
– arithmetic overflow, FPU exception
– misaligned memory access
– virtual memory exceptions: page faults,
TLB misses, protection violations
– software exceptions: system calls, e.g., jumps into kernel
4
Asynchronous Interrupts:
invoking the interrupt handler

• An I/O device requests attention by asserting


one of the prioritized interrupt request lines

• When the processor decides to process the


interrupt
– It stops the current program at instruction Ii, completing all the
instructions up to Ii-1 (a precise interrupt)
– It saves the PC of instruction Ii in a special register (EPC)
– It disables interrupts and transfers control to a designated interrupt
handler running in the kernel mode

5
Interrupt Handler
• Saves EPC before re-enabling interrupts to allow nested
interrupts
– need an instruction to move EPC into GPRs
– need a way to mask further interrupts at least until EPC can be saved
• Needs to read a status register that indicates the cause
of the interrupt
• Uses a special indirect jump instruction RFE (return-
from-exception) to resume user code, this:
– enables interrupts
– restores the processor to the user mode
– restores hardware status and control state

6
Synchronous Interrupts
• A synchronous interrupt (exception) is caused by a
particular instruction

• In general, the instruction cannot be completed and


needs to be restarted after the exception has been
handled
– requires undoing the effect of one or more partially executed instructions

• In the case of a system call trap, the instruction is


considered to have been completed
– syscall is a special jump instruction involving a change to privileged kernel mode
– Handler resumes at instruction after system call

7
Exception Handling 5-Stage Pipeline

[Figure: five-stage pipeline (PC, Inst. Mem, Decode, Execute, Data Mem, Writeback) annotated with where exceptions arise — PC address exception, illegal opcode, overflow, data address exceptions — and where asynchronous interrupts enter]

• How to handle multiple simultaneous exceptions in different pipeline stages?
• How and where to handle external asynchronous interrupts?
8
Exception Handling 5-Stage Pipeline

[Figure: exception flags (Exc D/E/M) and PCs (PC D/E/M) travel down the pipeline to the commit point in the memory stage, where the Cause and EPC registers are written; kill signals go to the earlier stages (Kill F/D/E Stage, Kill Writeback) and the handler PC is selected for fetch; asynchronous interrupts are injected at the commit point]
9
Exception Handling 5-Stage Pipeline
• Hold exception flags in pipeline until commit point (M
stage)

• Exceptions in earlier pipe stages override later


exceptions for a given instruction

• Inject external interrupts at commit point (override


others)

• If exception at commit: update Cause and EPC


registers, kill all stages, inject handler PC into fetch
stage

10
Speculating on Exceptions
• Prediction mechanism
– Exceptions are rare, so simply predicting no exceptions is very
accurate!
• Check prediction mechanism
– Exceptions detected at end of instruction execution pipeline, special
hardware for various exception types
• Recovery mechanism
– Only write architectural state at commit point, so can throw away
partially executed instructions after exception
– Launch exception handler after flushing pipeline

• Bypassing allows use of uncommitted instruction


results by following instructions
11
Exception Pipeline Diagram
time
t0 t1 t2 t3 t4 t5 t6 t7 . . . .
(I1) 096: ADD IF1 ID1 EX1 MA1 nop overflow!
(I2) 100: XOR IF2 ID2 EX2 nop nop
(I3) 104: SUB IF3 ID3 nop nop nop
(I4) 108: ADD IF4 nop nop nop nop
(I5) Exc. Handler code IF5 ID5 EX5 MA5 WB5

time
t0 t1 t2 t3 t4 t5 t6 t7 ....
IF I1 I2 I3 I4 I5
ID I1 I2 I3 nop I5
Resource
EX I1 I2 nop nop I5
Usage
MA I1 nop nop nop I5
WB nop nop nop nop I5

12
Agenda
• Interrupts
• Out-of-Order Processors

13
Out-Of-Order (OOO) Introduction

Name   Frontend   Issue   Writeback   Commit   Hardware
I4     IO         IO      IO          IO       Fixed-length pipelines, Scoreboard
I2O2   IO         IO      OOO         OOO      Scoreboard
I2OI   IO         IO      OOO         IO       Scoreboard, Reorder Buffer, and Store Buffer
IO3    IO         OOO     OOO         OOO      Scoreboard and Issue Queue
IO2I   IO         OOO     OOO         IO       Scoreboard, Issue Queue, Reorder Buffer, and Store Buffer
14
OOO Motivating Code Sequence
0  MUL   R1, R2, R3
1  ADDIU R11,R10,1
2  MUL   R5, R1, R4
3  MUL   R7, R5, R6
4  ADDIU R12,R11,1
5  ADDIU R13,R12,1
6  ADDIU R14,R12,2

[Figure: dependence graph — 0 → 2 → 3, and 1 → 4 → {5, 6}]

• Two independent sequences of instructions enable flexibility in terms of how instructions are scheduled in total order
• We can schedule statically in software or dynamically in hardware
15
I4: In-Order Front-End, Issue,
Writeback, Commit

F D X M W

16
I4: In-Order Front-End, Issue,
Writeback, Commit

X1
X0
F D W
M0 M1

17
I4: In-Order Front-End, Issue,
Writeback, Commit (4-stage MUL)
X1 X2 X3
X0

F D X2 X3
M0 M1 W
Y0 Y1 Y2 Y3

To avoid increasing CPI, this needs full bypassing, which can be
expensive. To help cycle time, add an Issue stage in which the
register file is read and the instruction is "issued" to a functional unit
18
I4: In-Order Front-End, Issue,
Writeback, Commit (4-stage MUL)
SB X0 X1 X2 X3 ARF

F D I M0 M1
X2 X3 W
Y0 Y1 Y2 Y3

ARF R W

SB R/W W
19
Basic Scoreboard
Data Avail.
P F 4 3 2 1 0
P: Pending, Write to
R1
Destination in flight
R2 F: Which functional unit
R3 is writing register
Data Avail.: Where is the

write data in the
R31 functional unit pipeline

• A one in Data Avail. column 'i' means that the result data is
in stage 'i' of functional unit F
• Can use the F and Data Avail. fields to determine when to
bypass and where to bypass from
• A one in column zero means that the functional unit is in
the Writeback stage that cycle
• Bits in the Data Avail. field shift right every cycle. 20
Basic Scoreboard
Data Avail.
P F 4 3 2 1 0
P: Pending, Write to
R1 1
Destination in flight
R2 F: Which functional unit
R3 is writing register
Data Avail.: Where is the

write data in the
R31 functional unit pipeline

• A one in Data Avail. column 'i' means that the result data is
in stage 'i' of functional unit F
• Can use the F and Data Avail. fields to determine when to
bypass and where to bypass from
• A one in column zero means that the functional unit is in
the Writeback stage that cycle
• Bits in the Data Avail. field shift right every cycle. 21
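A minimal C model of this scoreboard (the register count and the five Data Avail. columns follow the slide; everything else, including the helper names, is an illustrative assumption):

#include <stdint.h>
#include <stdbool.h>

#define NUM_REGS  32
#define COLS       5        /* Data Avail. columns 4..0 from the slide */

typedef struct {
    bool    pending;        /* P: a write to this register is in flight */
    uint8_t fu;             /* F: which functional unit is writing it   */
    uint8_t avail;          /* Data Avail.: one bit per column 4..0     */
} sb_entry_t;

static sb_entry_t sb[NUM_REGS];

/* Issue: mark the destination pending and place a single bit in the
 * column matching the remaining latency of the producing unit. */
void sb_issue(int rd, uint8_t fu, int col) {
    sb[rd].pending = true;
    sb[rd].fu      = fu;
    sb[rd].avail   = (uint8_t)(1u << col);
}

/* Each cycle the bit shifts right; a bit in column 0 means the unit is
 * in Writeback this cycle, after which the entry is cleared. */
void sb_tick(void) {
    for (int r = 0; r < NUM_REGS; r++) {
        if (!sb[r].pending) continue;
        if (sb[r].avail & 1u) { sb[r].pending = false; sb[r].avail = 0; }
        else                  { sb[r].avail >>= 1;                      }
    }
}

/* The F and Data Avail. fields together say when and from which stage a
 * consumer can bypass a value once the unit has actually computed it;
 * this helper only reports whether the producer is still in flight. */
bool sb_source_in_flight(int rs) {
    return sb[rs].pending;
}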
0 MUL R1, R2, R3 F D I Y0 Y1 Y2 Y3 W
1 ADDIU R11,R10,1 F D I X0 X1 X2 X3 W
2 MUL R5, R1, R4 F D I I I Y0 Y1 Y2 Y3 W
3 MUL R7, R5, R6 F D D D I I I I Y0 Y1 Y2 Y3 W
4 ADDIU R12,R11,1 F F F D D D D I X0 X1 X2 X3 W
5 ADDIU R13,R12,1 F F F F D I X0 X1 X2 X3 W
6 ADDIU R14,R12,2 F D I X0 X1 X2 X3 W

Cyc D I 4 3 2 1 0 Dest Regs


1 0 RED Indicates if we look at F
2 1 0 Field, we can bypass on this cycle
3 2 1 1 R1
4 1 1 R11
5 1 1
6 3 2 1 1
7 1 1 1 R5
8 1 1
9 1
10 4 3 1
11 5 4 1 1 R7
12 6 5 1 1 R12
13 6 1 1 1 R13
14 1 1 1 1 R14
15 1 1 1 1
16 1 1 1
17 1 1 22
18 1
I2O2: In-order Frontend/Issue, Out-of-
order Writeback/Commit
SB X0 ARF

F D I M0 M1 W
Y0 Y1 Y2 Y3

ARF R W

SB R R/W W
23
I2O2 Scoreboard
• Similar to I4, but we can now use it to track
structural hazards on Writeback port
• Set bit in Data Avail. according to length of
pipeline
• The architecture conservatively avoids WAW hazards
by stalling in Decode, so the current scoreboard is
sufficient. A more complicated scoreboard would be
needed to allow in-flight WAW hazards

24
0 MUL R1, R2, R3 F D I Y0 Y1 Y2 Y3 W
1 ADDIU R11,R10,1 F D I X0 W
2 MUL R5, R1, R4 F D I I I Y0 Y1 Y2 Y3 W
3 MUL R7, R5, R6 F D D D I I I I Y0 Y1 Y2 Y3 W
4 ADDIU R12,R11,1 F F F D D D D I X0 W
5 ADDIU R13,R12,1 F F F F D I X0 W
6 ADDIU R14,R12,2 F D I I X0 W

Cyc D I 4 3 2 1 0 Dest Regs


1 0 RED Indicates if we look at F
2 1 0 Field, we can bypass on this cycle
3 2 1 1 R1
4 1 1 R11
5 1 1
6 3 2 1
7 1 1 R5
8 1 Writes with two cycle
9 1 latency. Structural
10 4 3 1 Hazard
11 5 4 1 1 R7
12 6 5 1 1 R12
13 1 1 1 R13
14 6 1 1
15 1 1 R14
16 1
17 25
18
Early Commit Point?
0 MUL R1, R2, R3 F D I Y0 Y1 Y2 Y3 /
1 ADDIU R11,R10,1 F D I X0 W /
2 MUL R5, R1, R4 F D I I I /
3 MUL R7, R5, R6 F D D D /
4 ADDIU R12,R11,1 F F F /
5 ADDIU R13,R12,1 /
6 ADDIU R14,R12,2

• Limits certain types of exceptions.

26
I2OI: In-order Frontend/Issue, Out-of-
order Writeback, In-order Commit
SB X0 PRF ARF

F D I L0 L1 W ROB
FSB
C

S0
Y0 Y1 Y2 Y3

ARF W
SB R/W W
PRF R W
ROB R/W W R/W
FSB W R/W
27
PRF=Physical Register File(Future File), ROB=Reorder Buffer, FSB=Finished Store Buffer (1 entry)
Reorder Buffer (ROB)
State S ST V Preg
--
P 1
F 1
P 1
P
F
P
P
--
--
State: {Free, Pending, Finished}
S: Speculative
ST: Store bit
V: Physical Register File Specifier Valid
Preg: Physical Register File Specifier 28
Reorder Buffer (ROB)
State S ST V Preg Next instruction allocates here in D
--
P 1 Tail of ROB
F 1 Speculative because branch is in flight
P 1
P
F Instruction wrote ROB out of order
P
P Head of ROB
--
--
State: {Free, Pending, Finished}
S: Speculative Commit stage is waiting for
ST: Store bit Head of ROB to be finished
V: Physical Register File Specifier Valid
Preg: Physical Register File Specifier 29
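One way to model a ROB entry and the commit condition in C (the entry count and field widths are illustrative assumptions, not part of the design on the slide):

#include <stdint.h>
#include <stdbool.h>

typedef enum { ROB_FREE, ROB_PENDING, ROB_FINISHED } rob_state_t;

typedef struct {
    rob_state_t state;   /* Free / Pending / Finished                     */
    bool        spec;    /* S: speculative (an older branch is in flight) */
    bool        store;   /* ST: entry is a store                          */
    bool        v;       /* V: physical destination specifier is valid    */
    uint8_t     preg;    /* Preg: physical register file specifier        */
} rob_entry_t;

#define ROB_SIZE 8
static rob_entry_t rob[ROB_SIZE];
static int rob_head, rob_tail;        /* head = oldest, tail = next free */

/* Decode allocates the next entry at the tail, in program order
 * (full/empty checks omitted in this sketch). */
int rob_alloc(bool is_store, bool has_dest, uint8_t preg, bool spec) {
    int idx = rob_tail;
    rob[idx] = (rob_entry_t){ ROB_PENDING, spec, is_store, has_dest, preg };
    rob_tail = (rob_tail + 1) % ROB_SIZE;
    return idx;
}

/* Writeback may finish younger entries out of order, but only the head
 * may retire, and only once it is Finished and no longer speculative. */
bool rob_can_commit(void) {
    return rob[rob_head].state == ROB_FINISHED && !rob[rob_head].spec;
}

void rob_commit(void) {
    /* copy the value from Preg into the ARF (or drain the store), then free */
    rob[rob_head].state = ROB_FREE;
    rob_head = (rob_head + 1) % ROB_SIZE;
}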
Finished Store Buffer (FSB)
V Op Addr Data
--

• Only need one entry if we support just one
memory instruction in flight at a time.
• A single-entry FSB makes allocation trivial.
• If we support more than one memory instruction,
we need to worry about load/store address
aliasing.

30
0 MUL R1, R2, R3 F D I Y0 Y1 Y2 Y3 W C
1 ADDIU R11,R10,1 F D I X0 W r C
2 MUL R5, R1, R4 F D I I I Y0 Y1 Y2 Y3 W C
3 MUL R7, R5, R6 F D D D I I I I Y0 Y1 Y2 Y3 W C
4 ADDIU R12,R11,1 F F F D D D D I X0 W r C
5 ADDIU R13,R12,1 F F F F D I X0 W r C
6 ADDIU R14,R12,2 F D I I X0 W r C

Cyc D I ROB 0 1 2 3
0 Empty = free entry in ROB
1 0
2 1 0 R1 State of ROB at beginning of cycle
3 2 1 R11
4 R5 Pending entry in ROB
5
6 3 2 R11 Circle=Finished (Cycle after W)
7 R7
8 R1
9 Last cycle before entry is freed from ROB
10 4 3
(Cycle in C stage)
11 5 4 R12
12 6 5 R13 R5
13 R14
14 6 R12
15 R13
16 R7 Entry becomes free and is freed
17 R14 on next cycle
18
19 31
What if First Instruction Causes an
Exception?
0 MUL R1, R2, R3 F D I Y0 Y1 Y2 Y3 W /
1 ADDIU R11,R10,1 F D I X0 W r -- /
2 MUL R5, R1, R4 F D I I I Y0 /
3 MUL R7, R5, R6 F D D D I /
4 ADDIU R12,R11,1 F F F D /
F D I. . .

32
What About Branches?
Option 2
0 BEQZ R1, target F D I X0 W C
1 ADDIU R11,R10,1 F D I X0 /
Squash instructions in ROB
2 ADDIU R5, R1, R4 F D I /
when Branch commits
3 ADDIU R7, R5, R6 F D /
T ADDIU R12,R11,1 F D I . . .

Option 1
0 BEQZ R1, target F D I X0 W C
1 ADDIU R11,R10,1 F D I -
Squash instructions earlier. Has more
2 ADDIU R5, R1, R4 F D -
complexity. ROB needs many ports.
3 ADDIU R7, R5, R6 F -
T ADDIU R12,R11,1 F D I . . .

Option 3
0 BEQZ R1, target F D I X0 W C
1 ADDIU R11,R10,1 F D I X0 W / Wait for speculative instructions to
2 ADDIU R5, R1, R4 F D I X0 W / reach the Commit stage and squash in
3 ADDIU R7, R5, R6 F D I X0 W /
Commit stage
T ADDIU R12,R11,1 F D I X0 W C
33
What About Branches?
• Three possible designs with decreasing
complexity based on when to squash speculative
instructions and de-allocate ROB entry:
1. As soon as branch resolves
2. When branch commits
3. When speculative instructions reach commit

• Base design only allows one branch at a time.


Second branch stalls in decode. Can add more
bits to track multiple in-flight branches.

34
Avoiding Stalling Commit on Store
Miss
PRF ARF
W ROB C CSB R
FSB
0 OpA F D I X0 W C CSB=Committed Store Buffer
1 SW F D I S0 W C C C C
2 OpB F D I X0 W W W W C
3 OpC F D I X X X X W C
4 OpD F D I I I I X W C

With Retire Stage


0 OpA F D I X0 W C
1 SW F D I S0 W C R R R
2 OpB F D I X0 W C
3 OpC F D I X W C
4 OpD F D I X W C 35
IO3: In-order Frontend, Out-of-order
Issue/Writeback/Commit
SB X0 ARF

F D I I
Q
M0 M1 W
Y0 Y1 Y2 Y3

ARF R W
SB R R/W W
I W R/W W
36
Q
Issue Queue (IQ)
Op Imm S V Dest V P Src0 V P Src1
Op: Opcode
Imm.: Immediate
S: Speculative Bit
V: Valid (Instruction has
corresponding Src/Dest)
P: Pending (Waiting on
operands to be produced)

Instruction Ready = (!Vsrc0 || !Psrc0) && (!Vsrc1


|| !Psrc1) && no structural hazards

• For high performance, factor in bypassing


37
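The ready condition above translates almost directly into C; the sketch below mirrors the slide's fields, with the structural-hazard check left abstract (the type and function names are assumptions):

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint8_t op;               /* opcode                              */
    int32_t imm;              /* immediate                           */
    bool    spec;             /* S: speculative                      */
    bool    v_dest;           /* V: instruction has a destination    */
    uint8_t dest;
    bool    v_src0, p_src0;   /* valid / pending for source 0        */
    uint8_t src0;
    bool    v_src1, p_src1;   /* valid / pending for source 1        */
    uint8_t src1;
} iq_entry_t;

extern bool structural_hazard(const iq_entry_t *e);  /* FU/port availability */

/* Instruction Ready = (!Vsrc0 || !Psrc0) && (!Vsrc1 || !Psrc1)
 *                     && no structural hazards                     */
bool iq_ready(const iq_entry_t *e) {
    bool src0_ok = !e->v_src0 || !e->p_src0;
    bool src1_ok = !e->v_src1 || !e->p_src1;
    return src0_ok && src1_ok && !structural_hazard(e);
}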
Centralized vs. Distributed Issue Queue
I
X0 Q
A I X0

F D I I
Q
M0 F D M0

I
Y0 Q
B
I Y0

Centralized Distributed

38
Advanced Scoreboard
Data Avail.
P 4 3 2 1 0
P: Pending, Write to
R1
Destination in flight
R2 Data Avail.: Where is the
R3 write data in the pipeline
and which functional unit

R31

• Data Avail. now contains a functional unit identifier
• A non-empty value in column zero means that the
functional unit is in the Writeback stage that cycle
• Entries in the Data Avail. field shift right every cycle.

39
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 MUL R1, R2, R3 F D I Y0 Y1 Y2 Y3 W
1 ADDIU R11,R10,1 F D I X0 W
2 MUL R5, R1, R4 F D i I Y0 Y1 Y2 Y3 W
3 MUL R7, R5, R6 F D i I Y0 Y1 Y2 Y3 W
4 ADDIU R12,R11,1 F D i I X0 W
5 ADDIU R13,R12,1 F D i I X0 W
6 ADDIU R14,R12,2 F D i I X0 W

Cyc D I IQ 0 1 2
0
1 0 Dest/Src0/Src1, Circle denotes value
2 1 0 R1/R2/R3 present in ARF
3 2 1 R11/R10
4 3 R5/R1/R4
5 4 R7/R5/R6 Value bypassed so no circle, present
6 5 2 R12/R11 bit
7 6 4 R13/R12 Value set present by
8 5 R14/R12 Instruction 1 in cycle 5, W
9 Stage
10 3
11 6 R14/R12
12
13
40
14
Assume All Instruction in Issue Queue
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 MUL R1, R2, R3 F D i I Y0 Y1 Y2 Y3 W
1 ADDIU R11,R10,1 F D i I X0 W
2 MUL R5, R1, R4 F D i I Y0 Y1 Y2 Y3 W
3 MUL R7, R5, R6 F D i I Y0 Y1 Y2 Y3 W
4 ADDIU R12,R11,1 F D i I X0 W
5 ADDIU R13,R12,1 F D i I X0 W
6 ADDIU R14,R12,2 F D i I X0 W

• Better performance than previous?

41
IO2I: In-order Frontend, Out-of-order
Issue/Writeback, In-order Commit
SB X0 PRF ARF

F D I I
Q L0 L1 W ROB
FSB
C

S0
Y0 Y1 Y2 Y3

ARF W
SB R/W W
PRF R W
ROB R/W W R/W
FSB W R/W
42
IQ W R/W
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
0 MUL R1, R2, R3 F D I Y0 Y1 Y2 Y3 W C
1 ADDIU R11,R10,1 F D I X0 W r C
2 MUL R5, R1, R4 F D i I Y0 Y1 Y2 Y3 W C
3 MUL R7, R5, R6 F D i I Y0 Y1 Y2 Y3 W C
4 ADDIU R12,R11,1 F D i I X0 W r C
5 ADDIU R13,R12,1 F D i I X0 W r C
6 ADDIU R14,R12,2 F D i I X0 W r C

0 MUL R1, R2, R3 F D I Y0 Y1 Y2 Y3 W C


1 ADDIU R11,R10,1 F D I X0 W r C
2 MUL R5, R1, R4 F D i I Y0 Y1 Y2 Y3 W C
3 MUL R7, R5, R6 F D i I Y0 Y1 Y2 Y3 W C
4 ADDIU R12,R11,1 F D i I X0 W r C
5 ADDIU R13,R12,1 F D i I X0 W r C
6 ADDIU R14,R12,2 F D i I X0 W r C

43
Out-of-order 2-Wide Superscalar
with 1 ALU
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
0 MUL R1, R2, R3 F D I Y0 Y1 Y2 Y3 W C
1 ADDIU R11,R10,1 F D I X0 W r C
2 MUL R5, R1, R4 F D i I Y0 Y1 Y2 Y3 W C
3 MUL R7, R5, R6 F D i I Y0 Y1 Y2 Y3 W C
4 ADDIU R12,R11,1 F D I X0 W r C
5 ADDIU R13,R12,1 F D i I X0 W r C
6 ADDIU R14,R12,2 F D i I X0 W r C

44
Acknowledgements
• These slides contain material developed and copyright by:
– Arvind (MIT)
– Krste Asanovic (MIT/UCB)
– Joel Emer (Intel/MIT)
– James Hoe (CMU)
– John Kubiatowicz (UCB)
– David Patterson (UCB)
– Christopher Batten (Cornell)

• MIT material derived from course 6.823


• UCB material derived from course CS252 & CS152
• Cornell material derived from course ECE 4750

45
Computer Architecture
ELE 475 / COS 475
Slide Deck 6: Superscalar 3
David Wentzlaff
Department of Electrical Engineering
Princeton University

1
Agenda
• Speculation and Branches
• Register Renaming
• Memory Disambiguation

2
Agenda
• Speculation and Branches
• Register Renaming
• Memory Disambiguation

3
Speculation and Branches: I4
0 MUL R1, R2, R3 F D I Y0 Y1 Y2 Y3 W
1 ADDIU R4, R5, 1 F D I X0 X1 X2 X3 W
2 MUL R6, R1, R4 F D I I I Y0 Y1 Y2 Y3 W
3 BEQZ R6, Target F D D D I I I I X0 X1 X2 X3 W
4 ADDIU R8, R9 ,1 F F F D D D D I -- -- -- -- --
5 ADDIU R10,R11,1 F F F F D -- -- -- -- -- --
6 ADDIU R12,R13,1 F -- -- -- -- -- -- --
T F D I . . .

• No Speculative Instructions Commit State

X0 X1 X2 X3
F D I M0 M1 X2 X3 W
Y0 Y1 Y2 Y3 4
Speculation and Branches: I2O2
0 MUL R1, R2, R3 F D I Y0 Y1 Y2 Y3 W
1 ADDIU R4, R5, 1 F D I X0 W
2 MUL R6, R1, R4 F D I I I Y0 Y1 Y2 Y3 W
3 BEQZ R6, Target F D D D I I I I X0 W
4 ADDIU R8, R9 ,1 F F F D D D D I -- --
5 ADDIU R10,R11,1 F F F F D -- -- --
6 ADDIU R12,R13,1 F -- -- -- --
T F D I . . .

• No Speculative Instructions Commit State


SB X0 ARF

F D I M0 M1 W
Y0 Y1 Y2 Y3 5
Speculation and Branches: I2OI
0 MUL R1, R2, R3 F D I Y0 Y1 Y2 Y3 W C
1 ADDIU R4, R5, 1 F D I X0 W r C
2 MUL R6, R1, R4 F D I I I Y0 Y1 Y2 Y3 W C
3 BEQZ R6, Target F D D D I I I I X0 W C
4 ADDIU R8, R9 ,1 F F F D D D D I -- -- --
5 ADDIU R10,R11,1 F F F F D -- -- -- --
6 ADDIU R12,R13,1 F -- -- -- -- --
T F D I . . .

• Must Squash Instructions in Pipeline after Branch to


prevent PRF Write.
• Can remove from ROB immediately or wait until Commit
SB X0 PRF ARF
F D I L0 L1 W ROB
FSB
C

S0
6
Y0 Y1 Y2 Y3
0 MUL
Speculation and Branches: IO3
R1, R2, R3 F D I Y0 Y1 Y2 Y3 W
1 ADDIU R4, R5, 1 F D I X0 W
2 MUL R6, R1, R4 F D i I Y0 Y1 Y2 Y3 W
3 BEQZ R6, Target F D i I X0 W
4 ADDIU R8, R9 ,1 F D i I X0 W
5 ADDIU R10,R11,1 F D i I X0 W Speculative
6 ADDIU R12,R13,1 F D i I X0 W Instructions
7 ??? F D Wrote to ARF
8 ??? F D
9 ??? F D
10??? F D
11??? F D
T F D I . . .

• No Control speculation for IO3


• Could Stall on Branch
SB X0 ARF
F D I
Q I M0
Y0
M1
Y1 Y2 Y3
W
7
0 MUL
Speculation and Branches: IO2I
R1, R2, R3 F D I Y0 Y1 Y2 Y3 W C
1 ADDIU R4, R5, 1 F D I X0 W r C
2 MUL R6, R1, R4 F D i I Y0 Y1 Y2 Y3 W C
3 BEQZ R6, Target F D i I X0 W C
4 ADDIU R8, R9 ,1 F D i I X0 W r -- Need to clean up
5 ADDIU R10,R11,1 F D i I X0 W -- Speculative state
6 ADDIU R12,R13,1 F D i -- In PRF. Needs
7 ??? F D -- Selective Rollback
8 ??? F D --
9 ??? F D --
10??? F --
11??? -- D
T F D I . . .

SB X0 PRF ARF
F D I
Q I L0
S0
L1 W ROB
FSB
8
C

Y0 Y1 Y2 Y3
0 MUL
Speculation and Branches: IO2I
R1, R2, R3 F D I Y0 Y1 Y2 Y3 W C
1 ADDIU R4, R5, 1 F D I X0 W r C
2 MUL R6, R1, R4 F D i I Y0 Y1 Y2 Y3 W C
3 BEQZ R6, Target F D i I X0 W C
4 ADDIU R8, R9 ,1 F D i I X0 W r /
5 ADDIU R10,R11,1 F D i I X0 W r /
Speculative
6 ADDIU R12,R13,1 F D i I X0 / Instructions
7 ??? F D / Wrote to PRF
8 ??? F D / Not ARF
9 ??? F D /
10??? F D /
11??? F D /
12??? F /
13??? /
T F D I . . .

• Copy ARF to PRF on Mispredict


SB X0 PRF ARF
F D I
Q I L0
S0
L1 W ROB
FSB
9
C

Y0 Y1 Y2 Y3
Agenda
• Speculation and Branches
• Register Renaming
• Memory Disambiguation

10
WAW and WAR “Name” Dependencies
• WAW and WAR are not “True” data dependencies
• RAW is “True” data dependency because reader
needs result of writer
• “Name” dependencies exist because we have
limited number of “Names” (register specifiers or
memory addresses)

Breaking all “Name” Dependencies (Causes problems)


0 MUL R1, R2, R3 F D I Y0 Y1 Y2 Y3 W C
1 MUL R4, R1, R5 F D i I Y0 Y1 Y2 Y3 W C
2 ADDIU R6, R4, 1 F D i I X0 W C
3 ADDIU R4, R7, 1 F D i I X0 W r C

11
WAW and WAR “Name” Dependencies
• WAW and WAR are not “True” data dependencies
• RAW is “True” data dependency because reader
needs result of writer
• “Name” dependencies exist because we have
limited number of “Names” (register specifiers or
memory addresses)

Breaking all “Name” Dependencies (Causes problems)


0 MUL R1, R2, R3 F D I Y0 Y1 Y2 Y3 W C
1 MUL R4, R1, R5 F D i I Y0 Y1 Y2 Y3 W C
2 ADDIU R6, R4, 1 F D i I X0 W C
3 ADDIU R4, R7, 1 F D i I X0 W r C

12
WAW and WAR “Name” Dependencies
• WAW and WAR are not “True” data dependencies
• RAW is “True” data dependency because reader
needs result of writer
• “Name” dependencies exist because we have
limited number of “Names” (register specifiers or
memory addresses)

Breaking all “Name” Dependencies (Causes problems)


0 MUL R1, R2, R3 F D I Y0 Y1 Y2 Y3 W C
1 MUL R4, R1, R5 F D i I Y0 Y1 Y2 Y3 W C
2 ADDIU R6, R4, 1 F D i I X0 W C
3 ADDIU R4, R7, 1 F D i I X0 W r C

13
WAW and WAR “Name” Dependencies
• WAW and WAR are not “True” data dependencies
• RAW is “True” data dependency because reader
needs result of writer
• “Name” dependencies exist because we have
limited number of “Names” (register specifiers or
memory addresses)

Breaking all “Name” Dependencies (Causes problems)


0 MUL R1, R2, R3 F D I Y0 Y1 Y2 Y3 W C
1 MUL R4, R1, R5 F D i I Y0 Y1 Y2 Y3 W C
2 ADDIU R6, R4, 1 F D i I X0 W C
3 ADDIU R4, R7, 1 F D i I X0 W r C
WAW

WAR 14
Adding More Registers
Breaking all “Name” Dependencies
0 MUL R1, R2, R3 F D I Y0 Y1 Y2 Y3 W C
1 MUL R4, R1, R5 F D i I Y0 Y1 Y2 Y3 W C
2 ADDIU R6, R4, 1 F D i I X0 W C
3 ADDIU R4, R7, 1 F D i I X0 W r C

IO2I Microarchitecture Conservatively Stalls


0 MUL R1, R2, R3 F D I Y0 Y1 Y2 Y3 W C
1 MUL R4, R1, R5 F D i I Y0 Y1 Y2 Y3 W C
2 ADDIU R6, R4, 1 F D i I X0 W C
3 ADDIU R4, R7, 1 F D D D D D D D D D D I X0 W C

Manual register renaming: what if we could use more registers? Rename the second write of R4 to R8.
0 MUL R1, R2, R3 F D I Y0 Y1 Y2 Y3 W C
1 MUL R4, R1, R5 F D i I Y0 Y1 Y2 Y3 W C
2 ADDIU R6, R4, 1 F D i I X0 W C
3 ADDIU R8, R7, 1 F D i I X0 W r C
15
Register Renaming
• Adding more “Names” (registers/memory)
removes dependence, but architecture
namespace is limited.
– Registers: Larger namespace requires more bits in
instruction encoding. 32 registers = 5 bits, 128
registers = 7 bits.

• Register Renaming: Change naming of registers


in hardware to eliminate WAW and WAR hazards

16
Register Renaming Overview
• 2 Schemes
– Pointers in the Instruction Queue/ReOrder Buffer
– Values in the Instruction Queue/ReOrder Buffer

• IO2I Uses pointers in IQ and ROB therefore


start with that design.

17
IO2I: Register Renaming with Pointers
FL in IQ and ROB
RT SB X0 PRF ARF

F D I I
Q L0 L1 W ROB
FSB
C

S0
Y0 Y1 Y2 Y3
• All data structures same as in IO2I Except:
– Add two fields to ROB
– Add Rename Table (RT) and Free List (FL) of
registers
• Increase size of PRF to provide more register
“Names” 18
IO2I: Register Renaming with Pointers
FL in IQ and ROB
RT SB X0 PRF ARF

F D I I
Q L0 L1 W ROB
FSB
C

S0
Y0 Y1 Y2 Y3
ARF W
SB R/W W
PRF R W
ROB R/W W R/W
FSB W R/W
IQ W W
RT R/W W
19
FL R/W W
Modified Reorder Buffer (ROB)
State S ST V Preg Areg Ppreg
--
P
F
P
P
F
P
P
--
--
State: {Free, Pending, Finished} Areg: Architectural Register File Specifier
S: Speculative Ppreg: Previous Physical Register
ST: Store bit
V: Destination is valid
Preg: Physical Register File Specifier 20
Rename Table (RT)
P Preg
P: Pending, Write to Destination in flight
R1
Preg: Physical Register Architectural
R2 Register maps to.
R3

R31

21
Free List (FL)
Free
Free: Register is free for renaming
p1
p2
If Free == 0, physical register is in use and cannot be
p3 used for renaming

pN

22
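A simplified C sketch of how Decode could use the Rename Table and Free List together (sizes and helper names are assumptions; the previous mapping returned here is what gets recorded as Ppreg in the modified ROB):

#include <stdint.h>
#include <stdbool.h>

#define NUM_AREGS 32
#define NUM_PREGS 64            /* assumed size of the enlarged PRF */

typedef struct { bool pending; uint8_t preg; } rt_entry_t;

static rt_entry_t rename_table[NUM_AREGS];
static bool       preg_free[NUM_PREGS];

/* Allocate a physical register from the free list; returns -1 if none
 * is free, in which case Decode must stall. */
int fl_alloc(void) {
    for (int p = 0; p < NUM_PREGS; p++)
        if (preg_free[p]) { preg_free[p] = false; return p; }
    return -1;
}

typedef struct { uint8_t psrc0, psrc1, pdest, prev_preg; } renamed_t;

/* Rename one instruction in Decode: sources are looked up *before* the
 * destination is remapped, which removes the WAR hazard; the destination
 * then gets a fresh physical register, which removes the WAW hazard.
 * The pending bit of each source entry tells the IQ whether that
 * operand is still in flight. */
bool rename_decode(int rs0, int rs1, int rd, renamed_t *out) {
    int fresh = fl_alloc();
    if (fresh < 0) return false;              /* stall: no free preg */
    out->psrc0     = rename_table[rs0].preg;
    out->psrc1     = rename_table[rs1].preg;
    out->prev_preg = rename_table[rd].preg;   /* becomes Ppreg in the ROB */
    out->pdest     = (uint8_t)fresh;
    rename_table[rd].preg    = (uint8_t)fresh;
    rename_table[rd].pending = true;
    return true;
}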
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 MUL R1, R2, R3 F D I Y0 Y1 Y2 Y3 W C
1 MUL R4, R1, R5 F D i I Y0 Y1 Y2 Y3 W C
2 ADDIU R6, R4, 1 F D i I X0 W C
3 ADDIU R4, R7, 1 F D i I X0 W r C

RT FL IQ ROB
Cy D I W C R1 R2 R3 R4 R5 R6 R7 0 1 2 3 0 1 2 3
0 p0 p1 p2 p3 p4 p5 p6 p{7,8,9,10}
1 0 p{7,8,9,10}
2 1 0 p7 p{8,9,10} p7/p1/p2 p7/R1/p0
3 2 p8 p{9,10} p8/p7/p4 p8/R4/p3
4 3 p9 p10 p9/p8 p9/R6/p5
5 p10 p10/p6 p10/R4/p8
6 1
7 0
8 3 0 p7 p7/R1/p0
9 3 p0
10 2 p10 p0 p10/R4/p8
11 1 p0
12 2 1 p0 p8/R4/p3
13 2 p9 p{0,3} p9/R6/p5
14 3 p{0,3,5}
15 p{0,3,5,8}

23
Freeing Physical Registers
ADDU R1,R2,R3 <-Assume Arch. Reg R1 maps to Phys. Reg p0
ADDU R4,R1,R5
ADDU R1,R6,R7 <-Next write of Arch Reg R1, Mapped to Phys. Reg p1
ADDU R8,R9,R10

0 ADDU R1,R2,R3 I X W C
1 ADDU R4,R1,R5 I X W C
2 ADDU R1,R6,R7 IX W r C
3 ADDU R8,R9,R10 F D I X W r C
Write p0 Free p0 Alloc p0 Write p0 Read Wrong
value in p0

0 ADDU R1,R2,R3 I X W C
1 ADDU R4,R1,R5 I X W C
2 ADDU R1,R6,R7 I X W r C
3 ADDU R8,R9,R10 F D I X W r C
Write p0 Alloc p2 Write p2 Dealloc p0
• If Arch. Reg Ri mapped to Phys. Reg pj, we can free pj when the next instruction
that writes Ri commits 24
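A sketch of that freeing rule in C (the struct mirrors the Areg/Preg/Ppreg fields added to the ROB; names and sizes are illustrative):

#include <stdint.h>
#include <stdbool.h>

#define NUM_PREGS 64
static bool preg_free[NUM_PREGS];     /* the free list (FL) */

typedef struct {
    bool    dest_valid;  /* V bit of the ROB entry                    */
    uint8_t areg;        /* Areg: architectural destination           */
    uint8_t preg;        /* Preg: mapping created by this instruction */
    uint8_t ppreg;       /* Ppreg: previous mapping of areg           */
} rob_rename_t;

/* Normal commit: once this (younger) write to areg commits, no in-flight
 * instruction can still need the previous mapping, so Ppreg is freed. */
void commit_entry(const rob_rename_t *e) {
    if (e->dest_valid) preg_free[e->ppreg] = true;
}

/* Squash (mispredict/exception): the speculatively allocated register is
 * freed instead, and the rename table entry for areg is restored to
 * ppreg (restore code omitted for brevity). */
void squash_entry(const rob_rename_t *e) {
    if (e->dest_valid) preg_free[e->preg] = true;
}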
Unified Physical/Architectural
Register File
• Combine PRF and ARF into one register file
• Replace ARF with Architectural Rename Table
• Instead of copying Values, Commit stage
copies Preg pointer into appropriate entry of
Architectural Rename Table
• Unified Physical/Architectural Register file can
be smaller than separate

25
IO2I: Register Renaming with Values in
IQ and ROB
RT SB X0 ARF

F D I I
Q L0 L1 W ROB
FSB
C

S0
Y0 Y1 Y2 Y3
• All data structures same as previous Except:
– Modified ROB (Values instead of Register Specifier)
– Modified RT
– Modified IQ
– No FL
– No PRF, values merged into ROB
26
IO2I: Register Renaming with Values in
IQ and ROB
RT SB X0 ARF

F D I I
Q L0 L1 W ROB
FSB
C

S0
Y0 Y1 Y2 Y3
ARF R W
SB R/W W
ROB R/W W R/W
FSB W R/W
IQ W W
RT R/W W

27
Modified Reorder Buffer (ROB)
State S ST V Value Areg
--
P
F
P
P
F
P
P
--
--
State: {Free, Pending, Finished} Areg: Architectural Register File Specifier
S: Speculative
ST: Store bit
V: Destination is valid
Value: Actual Register Value 28
Modified Issue Queue (IQ)
Op Imm S V Dest V P Src0 V P Src1
Op: Opcode
Imm.: Immediate
S: Speculative Bit
V: Valid (Instruction has
corresponding Src/Dest)
P: Pending (Waiting on
operands to be produced)

If Pending, Source Field contains


index into ROB. Like a Preg identifier

29
Modified Rename Table (RT)
V P Preg
V: Valid Bit
R1
P: Pending, Write to Destination in flight
R2 Preg: Index into ROB
R3

R31
V:
If V == 0:
Value in ARF is up to date
If V == 1:
Value is in-flight or in ROB
P:
If P == 0:
Value is in ROB
if P == 1:
Value is in flight 30
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 MUL R1, R2, R3 F D I Y0 Y1 Y2 Y3 W C
1 MUL R4, R1, R5 F D i I Y0 Y1 Y2 Y3 W C
2 ADDIU R6, R4, 1 F D i I X0 W C
3 ADDIU R4, R7, 1 F D i I X0 W r C

RT IQ ROB
Cy D I W C R1 R2 R3 R4 R5 R6 R7 0 1 2 3 0 1 2 3
0
1 0
2 1 0 p0 p0/R2/R3 p0/R1
3 2 p1 p1/p0/R5 p1/R4
4 3 p2 p2/p1 p2/R6
5 p3 p3/R7 p3/R4
6 1
7 0
8 3 0 p0/R1
9 3
10 2 p3 p3/R4
11 1
12 2 1 p1/R4
13 2 p2/R6
14 3
15

31
Agenda
• Speculation and Branches
• Register Renaming
• Memory Disambiguation

32
Memory Disambiguation
st R1, 0(R2)
ld R3, 0(R4)

When can we execute the load?

33
In-Order Memory Queue
• Execute all loads and stores in program order
=> Load and store cannot leave IQ for execution
until all previous loads and stores have
completed execution

• Can still execute loads and stores speculatively,


and out-of-order with respect to other (non-
memory) instructions

• Need a structure to handle memory ordering…

34
IO2I: With In-Order LD/ST IQ
Int SB X0 PRF ARF

F D I
Q I L0 L1 W ROB
FSB
C

LD/ S0
ST
I Y0 Y1 Y2 Y3
Q

35
Conservative OOO Load Execution
st R1, 0(R2)
ld R3, 0(R4)
• Split execution of store instruction into two phases: address
calculation and data write

• Can execute load before store, if addresses known and r4 != r2

• Each load address compared with addresses of all previous


uncommitted stores (can use partial conservative check i.e.,
bottom 12 bits of address)

• Don’t execute load if any previous store address not known


36
(MIPS R10K, 16 entry address queue)
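A possible C rendering of this conservative check (the 16-entry queue size follows the R10K note above; the partial 12-bit address compare is the conservative option mentioned in the slide, and the structure/field names are assumptions):

#include <stdint.h>
#include <stdbool.h>

#define MAX_STORES 16

typedef struct {
    bool     valid;            /* older store still uncommitted */
    bool     addr_known;       /* address phase has executed    */
    uint32_t addr;
} store_q_entry_t;

static store_q_entry_t stq[MAX_STORES];

/* Conservative check before issuing a load to memory:
 *  - if any older store's address is still unknown, the load must wait;
 *  - if a known older store address matches (here a partial compare on
 *    the low 12 bits), the load must also wait, or forward from that
 *    store in a more aggressive design. */
bool load_may_issue(uint32_t load_addr) {
    for (int i = 0; i < MAX_STORES; i++) {
        if (!stq[i].valid) continue;
        if (!stq[i].addr_known) return false;
        if ((stq[i].addr & 0xFFF) == (load_addr & 0xFFF)) return false;
    }
    return true;
}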
Address Speculation
st R1, 0(R2)
ld R3, 0(R4)
• Guess that r4 != r2

• Execute load before store address known

• Need to hold all completed but uncommitted load/store


addresses in program order

• If subsequently find r4==r2, squash load and all following


instructions

=> Large penalty for inaccurate address speculation 37


IO2I: With OOO Load and Stores
SB X0 PRF ARF

F D I
Q I L0 L1 W ROB
FSB
C

S0 FLB

Y0 Y1 Y2 Y3

38
Memory Dependence Prediction
(Alpha 21264)
st r1, (r2)
ld r3, (r4)

• Guess that r4 != r2 and execute load before


store
• If later find r4==r2, squash load and all
following instructions, but mark load
instruction as store-wait
• Subsequent executions of the same load
instruction will wait for all previous stores to
complete
• Periodically clear store-wait bits

39
Acknowledgements
• These slides contain material developed and copyright by:
– Arvind (MIT)
– Krste Asanovic (MIT/UCB)
– Joel Emer (Intel/MIT)
– James Hoe (CMU)
– John Kubiatowicz (UCB)
– David Patterson (UCB)
– Christopher Batten (Cornell)

• MIT material derived from course 6.823


• UCB material derived from course CS252 & CS152
• Cornell material derived from course ECE 4750

40
Speculative Loads / Stores
Just like register updates, stores should not modify
the memory until after the instruction is committed

- A speculative store buffer is a structure introduced to hold


speculative store data.

42
Speculative Store Buffer
Speculative Load Address
L1 Data
Store
Cache
Buffer
V S Tag Data
V S Tag Data
V S Tag Data Tags Data
V S Tag Data
V S Tag Data
V S Tag Data
Store Commit Path
Load Data

• On store execute:
– mark entry valid and speculative, and save data and tag of
instruction.
• On store commit:
– clear speculative bit and eventually move data to cache
• On store abort:
– clear valid bit
43
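The three transitions above could be modeled in C roughly as follows (field names follow the figure; this is a sketch of the entry state machine, not the full forwarding logic):

#include <stdint.h>
#include <stdbool.h>

typedef struct {
    bool     valid;        /* V */
    bool     spec;         /* S: not yet committed */
    uint32_t tag;          /* address tag          */
    uint32_t data;
} ssb_entry_t;

/* On store execute: mark the entry valid and speculative, and capture
 * the address tag and data of the instruction. */
void ssb_execute(ssb_entry_t *e, uint32_t tag, uint32_t data) {
    e->valid = true; e->spec = true; e->tag = tag; e->data = data;
}

/* On store commit: clear the speculative bit; the data can then drain
 * to the L1 data cache in the background. */
void ssb_commit(ssb_entry_t *e) { e->spec = false; }

/* On store abort (squash): simply clear the valid bit. */
void ssb_abort(ssb_entry_t *e)  { e->valid = false; }

/* A speculative load that hits both the store buffer and the cache must
 * take the store-buffer copy; with multiple matching entries it must
 * take the youngest store that is older than the load. */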
Speculative Store Buffer
Speculative Load Address
L1 Data
Store
Cache
Buffer
V S Tag Data
V S Tag Data
V S Tag Data Tags Data
V S Tag Data
V S Tag Data
V S Tag Data
Store Commit Path
Load Data

• If data in both store buffer and cache, which should we use?


Speculative store buffer
• If same address in store buffer twice, which should we use?
Youngest store older than load

44
Computer Architecture
ELE 475 / COS 475
Slide Deck 7: VLIW
David Wentzlaff
Department of Electrical Engineering
Princeton University

1
Superscalar Control Logic Scaling
Issue Width W

Issue Group

Previously
Issued Lifetime L
Instructions

• Each issued instruction must somehow check against W*L instructions, i.e.,
growth in hardware ∝ W*(W*L)
• For in-order machines, L is related to pipeline latencies and check is done during
issue (scoreboard)
• For out-of-order machines, L also includes time spent in IQ, SB, and check is
done by broadcasting tags to waiting instructions at completion
• As W increases, larger instruction window is needed to find enough parallelism
to keep machine busy => greater L
=> Out-of-order control logic grows faster than W² (~W³) 2
Out-of-Order Control Complexity:
MIPS R10000

[A. Ahi et al., MIPS R10000 Superscalar Microprocessor, Hot Chips, 1995 ]
3
Image Credit: MIPS Technologies Inc. / Silicon Graphics Computer Systems
Out-of-Order Control Complexity:
MIPS R10000

Control
Logic

[A. Ahi et al., MIPS R10000 Superscalar Microprocessor, Hot Chips, 1995 ]
4
Image Credit: MIPS Technologies Inc. / Silicon Graphics Computer Systems
Sequential ISA Bottleneck
Sequential Superscalar compiler
source code
a = foo(b);
for (i=0, i<

Find independent
operations

5
Sequential ISA Bottleneck
Sequential Superscalar compiler
source code
a = foo(b);
for (i=0, i<

Find independent Schedule


operations operations

6
Sequential ISA Bottleneck
Sequential Superscalar compiler Sequential
source code machine code
a = foo(b);
for (i=0, i<

Find independent Schedule


operations operations

Superscalar processor

Check instruction
dependencies 7
Sequential ISA Bottleneck
Sequential Superscalar compiler Sequential
source code machine code
a = foo(b);
for (i=0, i<

Find independent Schedule


operations operations

Superscalar processor

Check instruction Schedule


dependencies execution 8
VLIW: Very Long Instruction Word
Int Op 1 Int Op 2 Mem Op 1 Mem Op 2 FP Op 1 FP Op 2

Two Integer Units,


Single Cycle Latency
Two Load/Store Units,
Three Cycle Latency Two Floating-Point Units,
Four Cycle Latency
• Multiple operations packed into one instruction
• Each operation slot is for a fixed function
• Constant operation latencies are specified
• Architecture requires guarantee of:
– Parallelism within an instruction => no cross-operation RAW check
– No data use before data ready => no data interlocks
9
VLIW Equals (EQ) Scheduling Model
• Each operation takes exactly specified
latency
• Efficient register usage (Effectively more
registers)
• No need for register renaming or buffering
– Bypass from functional unit output to inputs
– Register writes whenever functional unit
completes
• Compiler depends on not having registers
visible early

10
VLIW Less-Than-or-Equals (LEQ)
Scheduling Model
• Each operation may take less than or equal
to its specified latency
– Destination can be written any time after
instruction issue
– Dependent instruction still needs to be
scheduled after instruction latency
• Precise interrupts simplified
• Binary compatibility preserved when latencies
are reduced

11
Early VLIW Machines
• FPS AP120B (1976)
– scientific attached array processor
– first commercial wide instruction machine
– hand-coded vector math libraries using software pipelining and loop
unrolling
• Multiflow Trace (1987)
– commercialization of ideas from Fisher’s Yale group including “trace
scheduling”
– available in configurations with 7, 14, or 28 operations/instruction
– 28 operations packed into a 1024-bit instruction word
• Cydrome Cydra-5 (1987)
– 7 operations encoded in 256-bit instruction word
– rotating register file
12
VLIW Compiler Responsibilities
• Schedule operations to maximize parallel
execution

• Guarantees intra-instruction parallelism

• Schedule to avoid data hazards (no interlocks)


– Typically separates operations with explicit NOPs

13
Loop Execution
for (i=0; i<N; i++)
Int1 Int 2 M1 M2 FP+ FPx
B[i] = A[i] + C;
loop:
Compile

loop: lw F1, 0(R1)


addiu R1, R1, 4 Schedule
add.s F2, F0, F1
sw F2, 0(R2)
addiu R2, R2, 4
bne R1, R3,loop

14
Loop Execution
for (i=0; i<N; i++)
Int1 Int 2 M1 M2 FP+ FPx
B[i] = A[i] + C;
loop: add R1 lw
Compile

loop: lw F1, 0(R1) add.s


addiu R1, R1, 4 Schedule
add.s F2, F0, F1
sw F2, 0(R2)
addiu R2, R2, 4
bne R1, R3,loop

15
Loop Execution
for (i=0; i<N; i++)
Int1 Int 2 M1 M2 FP+ FPx
B[i] = A[i] + C;
loop: add R1 lw
Compile

loop: lw F1, 0(R1) add.s


addiu R1, R1, 4 Schedule
add.s F2, F0, F1
sw F2, 0(R2)
addiu R2, R2, 4
sw
bne R1, R3,loop

16
Loop Execution
for (i=0; i<N; i++)
Int1 Int 2 M1 M2 FP+ FPx
B[i] = A[i] + C;
loop: add R1 lw
Compile

loop: lw F1, 0(R1) add.s


addiu R1, R1, 4 Schedule
add.s F2, F0, F1
sw F2, 0(R2)
addiu R2, R2, 4
add R2 sw
bne R1, R3,loop

17
Loop Execution
for (i=0; i<N; i++)
Int1 Int 2 M1 M2 FP+ FPx
B[i] = A[i] + C;
loop: add R1 lw
Compile

loop: lw F1, 0(R1) add.s


addiu R1, R1, 4 Schedule
add.s F2, F0, F1
sw F2, 0(R2)
addiu R2, R2, 4
add R2 bne sw
bne R1, R3,loop

How many FP ops/cycle?


18
Loop Execution
for (i=0; i<N; i++)
Int1 Int 2 M1 M2 FP+ FPx
B[i] = A[i] + C;
loop: add R1 lw
Compile

loop: lw F1, 0(R1) add.s


addiu R1, R1, 4 Schedule
add.s F2, F0, F1
sw F2, 0(R2)
addiu R2, R2, 4
add R2 bne sw
bne R1, R3,loop

How many FP ops/cycle?


1 add.s / 8 cycles = 0.125 19
Loop Unrolling
for (i=0; i<N; i++)
B[i] = A[i] + C;

Unroll inner loop to perform 4


iterations at once
for (i=0; i<N; i+=4)
{
B[i] = A[i] + C;
B[i+1] = A[i+1] + C;
B[i+2] = A[i+2] + C;
B[i+3] = A[i+3] + C;
}

20
Loop Unrolling
for (i=0; i<N; i++)
B[i] = A[i] + C;

Unroll inner loop to perform 4


iterations at once
for (i=0; i<N; i+=4)
{
B[i] = A[i] + C;
B[i+1] = A[i+1] + C;
B[i+2] = A[i+2] + C;
B[i+3] = A[i+3] + C;
}

Need to handle values of N that are not multiples


of unrolling factor with final cleanup loop 21
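In C, the unrolled loop plus the cleanup loop mentioned above might look like this (the function name and signature are assumed for illustration):

/* Unrolled 4x with a cleanup loop for N not a multiple of 4. */
void add_const(float *B, const float *A, float C, int N) {
    int i;
    for (i = 0; i + 3 < N; i += 4) {
        B[i]   = A[i]   + C;
        B[i+1] = A[i+1] + C;
        B[i+2] = A[i+2] + C;
        B[i+3] = A[i+3] + C;
    }
    for (; i < N; i++)        /* cleanup: at most 3 leftover iterations */
        B[i] = A[i] + C;
}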
Scheduling Loop Unrolled Code
Unroll 4 ways
loop: lw F1, 0(r1) Int1 Int 2 M1 M2 FP+ FPx
lw F2, 4(r1)
lw F3, 8(r1) loop:
lw F4, 12(r1)
addiu R1, R1, 16
add.s F5, F0, F1
add.s F6, F0, F2 Schedule
add.s F7, F0, F3
add.s F8, F0, F4
sw F5, 0(R2)
sw F6, 4(R2)
sw F7, 8(R2)
sw F8, 12(R2)
addiu R2, R2, 16
bne R1, R3, loop

22
Scheduling Loop Unrolled Code
Unroll 4 ways
loop: lw F1, 0(r1) Int1 Int 2 M1 M2 FP+ FPx
lw F2, 4(r1)
lw F3, 8(r1) loop: lw F1
lw F4, 12(r1) lw F2
addiu R1, R1, 16 lw F3
add.s F5, F0, F1 add R1 lw F4
add.s F6, F0, F2 Schedule
add.s F7, F0, F3
add.s F8, F0, F4
sw F5, 0(R2)
sw F6, 4(R2)
sw F7, 8(R2)
sw F8, 12(R2)
addiu R2, R2, 16
bne R1, R3, loop

23
Scheduling Loop Unrolled Code
Unroll 4 ways
loop: lw F1, 0(r1) Int1 Int 2 M1 M2 FP+ FPx
lw F2, 4(r1)
lw F3, 8(r1) loop: lw F1
lw F4, 12(r1) lw F2
addiu R1, R1, 16 lw F3
add.s F5, F0, F1 add R1 lw F4 add.s F5
add.s F6, F0, F2 Schedule add.s F6
add.s F7, F0, F3 add.s F7
add.s F8, F0, F4 add.s F8
sw F5, 0(R2) sw F5
sw F6, 4(R2) sw F6
sw F7, 8(R2) sw F7
sw F8, 12(R2)
add R2 bne sw F8
addiu R2, R2, 16
bne R1, R3, loop

24
Scheduling Loop Unrolled Code
Unroll 4 ways
loop: lw F1, 0(r1) Int1 Int 2 M1 M2 FP+ FPx
lw F2, 4(r1)
lw F3, 8(r1) loop: lw F1
lw F4, 12(r1) lw F2
addiu R1, R1, 16 lw F3
add.s F5, F0, F1 add R1 lw F4 add.s F5
add.s F6, F0, F2 Schedule add.s F6
add.s F7, F0, F3 add.s F7
add.s F8, F0, F4 add.s F8
sw F5, 0(R2) sw F5
sw F6, 4(R2) sw F6
sw F7, 8(R2) sw F7
sw F8, 12(R2)
add R2 bne sw F8
addiu R2, R2, 16
bne R1, R3, loop

How many FLOPS/cycle?


25
Scheduling Loop Unrolled Code
Unroll 4 ways
loop: lw F1, 0(r1) Int1 Int 2 M1 M2 FP+ FPx
lw F2, 4(r1)
lw F3, 8(r1) loop: lw F1
lw F4, 12(r1) lw F2
addiu R1, R1, 16 lw F3
add.s F5, F0, F1 add R1 lw F4 add.s F5
add.s F6, F0, F2 Schedule add.s F6
add.s F7, F0, F3 add.s F7
add.s F8, F0, F4 add.s F8
sw F5, 0(R2) sw F5
sw F6, 4(R2) sw F6
sw F7, 8(R2) sw F7
sw F8, 12(R2)
add R2 bne sw F8
addiu R2, R2, 16
bne R1, R3, loop

How many FLOPS/cycle?


4 add.s / 11 cycles = 0.36 26
Software Pipelining
Unroll 4 ways first Int1 Int 2 M1 M2 FP+ FPx
loop: lw F1, 0(R1)
lw F2, 4(R1)
lw F3, 8(R1)
lw F4, 12(R1)
addiu R1, R1, 16
add.s F5, F0, F1
add.s F6, F0, F2
add.s F7, F0, F3
add.s F8, F0, F4
sw F5, 0(R2)
sw F6, 4(R2)
sw F7, 8(R2)
sw F8, 12(R2)
addiu R2, R2, 16
bne R1, R3, loop

27
Software Pipelining
Unroll 4 ways first Int1 Int 2 M1 M2 FP+ FPx
loop: lw F1, 0(R1) lw F1
lw F2, 4(R1) lw F2
lw F3, 8(R1) lw F3
lw F4, 12(R1) add R1 lw F4
addiu R1, R1, 16 add.s F5
add.s F5, F0, F1 add.s F6
add.s F6, F0, F2 add.s F7
add.s F7, F0, F3 add.s F8
add.s F8, F0, F4 sw F5
sw F5, 0(R2) sw F6
sw F6, 4(R2) add R2 sw F7
sw F7, 8(R2) bne sw F8
sw F8, 12(R2)
addiu R2, R2, 16
bne R1, R3, loop

28
Software Pipelining
Unroll 4 ways first Int1 Int 2 M1 M2 FP+ FPx
loop: lw F1, 0(R1) lw F1
lw F2, 4(R1) lw F2
lw F3, 8(R1) lw F3
lw F4, 12(R1) add R1 lw F4
addiu R1, R1, 16 lw F1 add.s F5
add.s F5, F0, F1 lw F2 add.s F6
add.s F6, F0, F2 lw F3 add.s F7
add.s F7, F0, F3 add R1 lw F4 add.s F8
add.s F8, F0, F4 sw F5 add.s F5
sw F5, 0(R2) sw F6 add.s F6
sw F6, 4(R2) add R2 sw F7 add.s F7
sw F7, 8(R2) bne sw F8 add.s F8
sw F8, 12(R2)
sw F5
addiu R2, R2, 16
sw F6
bne R1, R3, loop
add R2 sw F7
bne sw F8
29
Software Pipelining
Unroll 4 ways first Int1 Int 2 M1 M2 FP+ FPx
loop: lw F1, 0(R1) lw F1
lw F2, 4(R1) lw F2
lw F3, 8(R1) lw F3
lw F4, 12(R1) add R1 lw F4
addiu R1, R1, 16 lw F1 add.s F5
add.s F5, F0, F1 lw F2 add.s F6
add.s F6, F0, F2 lw F3 add.s F7
add.s F7, F0, F3 add R1 lw F4 add.s F8
add.s F8, F0, F4 lw F1 sw F5 add.s F5
sw F5, 0(R2) lw F2 sw F6 add.s F6
sw F6, 4(R2) add R2 lw F3 sw F7 add.s F7
sw F7, 8(R2) add R1 bne lw F4 sw F8 add.s F8
sw F8, 12(R2)
sw F5 add.s F5
addiu R2, R2, 16
sw F6 add.s F6
bne R1, R3, loop
add R2 sw F7 add.s F7
bne sw F8 add.s F8
sw F5 30
Software Pipelining
Unroll 4 ways first Int1 Int 2 M1 M2 FP+ FPx
loop: lw F1, 0(R1) lw F1
lw F2, 4(R1) lw F2
lw F3, 8(R1) lw F3
lw F4, 12(R1) add R1 lw F4
prolog
addiu R1, R1, 16 lw F1 add.s F5
add.s F5, F0, F1 lw F2 add.s F6
add.s F6, F0, F2 lw F3 add.s F7
add.s F7, F0, F3 add R1 lw F4 add.s F8
add.s F8, F0, F4 loop: lw F1 sw F5 add.s F5
sw F5, 0(R2) iterate
lw F2 sw F6 add.s F6
sw F6, 4(R2) add R2 lw F3 sw F7 add.s F7
sw F7, 8(R2) add R1 bne lw F4 sw F8 add.s F8
sw F8, 12(R2)
sw F5 add.s F5
addiu R2, R2, 16
sw F6 add.s F6
bne R1, R3, loop epilog
add R2 sw F7 add.s F7
bne sw F8 add.s F8
sw F5 31
Software Pipelining
Unroll 4 ways first Int1 Int 2 M1 M2 FP+ FPx
loop: lw F1, 0(R1) lw F1
lw F2, 4(R1) lw F2
lw F3, 8(R1) lw F3
lw F4, 12(R1) add R1 lw F4
prolog
addiu R1, R1, 16 lw F1 add.s F5
add.s F5, F0, F1 lw F2 add.s F6
add.s F6, F0, F2 lw F3 add.s F7
add.s F7, F0, F3 add R1 lw F4 add.s F8
add.s F8, F0, F4 loop: lw F1 sw F5 add.s F5
sw F5, 0(R2) iterate
lw F2 sw F6 add.s F6
sw F6, 4(R2) add R2 lw F3 sw F7 add.s F7
sw F7, 8(R2) add R1 bne lw F4 sw F8 add.s F8
sw F8, 12(R2)
sw F5 add.s F5
addiu R2, R2, 16
sw F6 add.s F6
bne R1, R3, loop epilog
add R2 sw F7 add.s F7
bne sw F8 add.s F8
How many FLOPS/cycle?
sw F5 32
Software Pipelining
Unroll 4 ways first Int1 Int 2 M1 M2 FP+ FPx
loop: lw F1, 0(R1) lw F1
lw F2, 4(R1) lw F2
lw F3, 8(R1) lw F3
lw F4, 12(R1) add R1 lw F4
prolog
addiu R1, R1, 16 lw F1 add.s F5
add.s F5, F0, F1 lw F2 add.s F6
add.s F6, F0, F2 lw F3 add.s F7
add.s F7, F0, F3 add R1 lw F4 add.s F8
add.s F8, F0, F4 loop: lw F1 sw F5 add.s F5
sw F5, 0(R2) iterate
lw F2 sw F6 add.s F6
sw F6, 4(R2) add R2 lw F3 sw F7 add.s F7
sw F7, 8(R2) add R1 bne lw F4 sw F8 add.s F8
sw F8, 12(R2)
sw F5 add.s F5
addiu R2, R2, 16
sw F6 add.s F6
bne R1, R3, loop epilog
add R2 sw F7 add.s F7
bne sw F8 add.s F8
How many FLOPS/cycle?
sw F5
4 add.s / 4 cycles = 1
33
Software Pipelining vs. Loop Unrolling
Loop Unrolled Wind-down overhead
performance

Startup overhead

Loop Iteration time

Software Pipelined
performance

Loop time
Iteration
Software pipelining pays startup/wind-down costs
only once per loop, not once per iteration
34
What if there are no loops?

Basic block: single entry, single exit
• Branches limit basic block size in control-flow-intensive,
irregular code
• Difficult to find ILP in individual basic blocks

35
Trace Scheduling [ Fisher,Ellis]
• Pick string of basic blocks, a trace, that
represents most frequent branch path
• Use profiling feedback or compiler
heuristics to find common branch paths
• Schedule whole “trace” at once
• Add fixup code to cope with branches
jumping out of trace

36
Trace Scheduling [ Fisher,Ellis]
• Pick string of basic blocks, a trace, that
represents most frequent branch path
• Use profiling feedback or compiler
heuristics to find common branch paths
• Schedule whole “trace” at once
• Add fixup code to cope with branches
jumping out of trace

37
Problems with “Classic” VLIW
• Object-code compatibility
– have to recompile all code for every machine, even for two machines in same
generation
• Object code size
– instruction padding wastes instruction memory/cache
– loop unrolling/software pipelining replicates code
• Scheduling variable latency memory operations
– caches and/or memory bank conflicts impose statically unpredictable
variability
• Knowing branch probabilities
– Profiling requires a significant extra step in the build process
• Scheduling for statically unpredictable branches
– optimal schedule varies with branch path
• Precise Interrupts can be challenging
– Does fault in one portion of bundle fault whole bundle?
– EQ Model has problem with single step, etc.
38
VLIW Instruction Encoding

Group 1 Group 2 Group 3

• Schemes to reduce effect of unused fields


– Compressed format in memory, expand on I-cache refill
• used in Multiflow Trace
• introduces instruction addressing challenge
– Mark parallel groups
• used in TMS320C6x DSPs, Intel IA-64
– Provide a single-op VLIW instruction
• Cydra-5 UniOp instructions
39
Predication
Problem: Mispredicted branches limit ILP
Solution: Eliminate hard to predict branches with predicated
execution

Predication helps with small branch regions and/or branches


that are hard to predict by turning control flow into data flow
Most basic form of predication: conditional moves
• movz rd, rs, rt if ( R[rt] == 0 ) then R[rd] <- R[rs]
• movn rd, rs, rt if ( R[rt] != 0 ) then R[rd] <- R[rs]

C source:            With branches:          With conditional moves:
if (a<b)             slt  R1, R2, R3         slt  R1, R2, R3
  x=a                beq  R1, R0, L1         movn R4, R2, R1
else                 move R4, R2             movz R4, R3, R1
  x=b                j    L2
                     L1: move R4, R3
                     L2:
What if the if-then-else has many instructions? What if it is unbalanced?
40
Full Predication
– Almost all instructions can be executed
conditionally under predicate
– Instruction becomes NOP if predicate register false

41
Full Predication
– Almost all instructions can be executed
conditionally under predicate
– Instruction becomes NOP if predicate register false
b0: Inst 1 if
Inst 2
br a==b, b2

b1: Inst 3 else


Inst 4
br b3
b2: Inst 5
then
Inst 6

b3: Inst 7
Inst 8

42
Four basic blocks
Full Predication
– Almost all instructions can be executed
conditionally under predicate
– Instruction becomes NOP if predicate register false
b0: Inst 1 if
Inst 2
br a==b, b2 Inst 1
Inst 2
b1: Inst 3 else
Inst 4 p1,p2 <- cmp(a==b)
br b3 Predication (p1) Inst 3 || (p2) Inst 5
(p1) Inst 4 || (p2) Inst 6
b2: Inst 5 Inst 7
then
Inst 6 Inst 8
One basic block
b3: Inst 7
Inst 8
Mahlke et al, ISCA95: On
average >50% branches removed 43
Four basic blocks
Predicates and the Datapath
stall PC for JAL, ...

0x4 nop
E M W
IR IR IR
Add
ASrc 31

we
rs1
rs2
A
PC addr D rd1 we
inst IR ws ALU Y addr
wd rd2
Inst GPRs B rdata
Memory Data
Imm Memory R
wdata
Ext wdata
BSrc
MD1 MD2

movz rd, rs, rt if ( R[rt] == 0 ) then R[rd] <- R[rs]


• Suppress Writeback

44
Predicates and the Datapath
stall PC for JAL, ...

0x4 nop
E M W
IR IR IR
Add
ASrc 31

we
rs1
rs2
A
PC addr D rd1 we
inst IR ws ALU Y addr
wd rd2
Inst GPRs B rdata
Memory Data
Imm Memory R
wdata
Ext wdata
BSrc
MD1 MD2

movz rd, rs, rt if ( R[rt] == 0 ) then R[rd] <- R[rs]


• Suppress Writeback

45
Predicates and the Datapath
stall PC for JAL, ...

0x4 nop
E M W
IR IR IR
Add
ASrc 31

we
rs1
rs2
A
PC addr D rd1 we
inst IR ws ALU Y addr
wd rd2
Inst GPRs B rdata
Memory Data
Imm Memory R
wdata
Ext wdata
BSrc
MD1 MD2

movz rd, rs, rt if ( R[rt] == 0 ) then R[rd] <- R[rs]


• Suppress Writeback
• Bypassing value doesn’t work!
• Need initial value (Extra read port on RF) 46
Problems with Full Predication
• Adds another register file
• Need to bypass predicates
• Need original value to use data value
bypassing

47
Leveraging Speculative Execution and
Reacting to Dynamic Events in a VLIW
Speculation:
• Moving instructions across branches
• Moving memory operations past other memory
operations

Dynamic Events:
• Cache Miss
• Exceptions
• Branch Mispredict
48
Code Motion
Before Code Motion After Code Motion
MUL R1, R2, R3 LW R14, 0(R9)
ADDIU R11,R10,1 ADDIU R11,R10,1
MUL R5, R1, R4 MUL R1, R2, R3
MUL R7, R5, R6 ADDIU R12,R11,1
SW R7, 0(R16) MUL R5, R1, R4
ADDIU R12,R11,1 ADD R13,R12,R14
LW R14, 0(R9) MUL R7, R5, R6
ADD R13,R12,R14 ADD R14,R12,R13
ADD R14,R12,R13 SW R7, 0(R16)
BNEQ R16, target BNEQ R16, target

49
Scheduling and Bundling
Before Bundling After Bundling
LW R14, 0(R9) {LW R14, 0(R9)
ADDIU R11,R10,1 ADDIU R11,R10,1
MUL R1, R2, R3 MUL R1, R2, R3}
ADDIU R12,R11,1 {ADDIU R12,R11,1
MUL R5, R1, R4 MUL R5, R1, R4}
ADD R13,R12,R14 {ADD R13,R12,R14
MUL R7, R5, R6 MUL R7, R5, R6}
ADD R14,R12,R13 {ADD R14,R12,R13
SW R7, 0(R16) SW R7, 0(R16)
BNEQ R16, target BNEQ R16, target}

50
VLIW Speculative Execution
Problem: Branches restrict compiler code motion
Solution: Speculative operations that don’t cause exceptions

Speculative load
Inst 1 Load.s r1 never causes
Inst 2 Inst 1 exception, but sets
br a==b, b2 Inst 2 “poison” bit on
br a==b, b2 destination register

Load r1
Use r1 Check for exception in
Inst 3 Chk.s r1 original home block
Use r1 jumps to fixup code if
Can’t move load above branch Inst 3
exception detected
because might cause spurious
exception

54
VLIW Speculative Execution
Problem: Branches restrict compiler code motion
Solution: Speculative operations that don’t cause exceptions

Speculative load
Inst 1 Load.s r1 never causes
Inst 2 Inst 1 exception, but sets
br a==b, b2 Inst 2 “poison” bit on
br a==b, b2 destination register

Load r1
Use r1 Check for exception in
Inst 3 Chk.s r1 original home block
Use r1 jumps to fixup code if
Can’t move load above branch Inst 3
exception detected
because might cause spurious
exception

Particularly useful for scheduling long latency loads early


55
VLIW Data Speculation
Problem: Possible memory hazards limit code scheduling
Solution: Hardware to check pointer hazards

57
VLIW Data Speculation
Problem: Possible memory hazards limit code scheduling
Solution: Hardware to check pointer hazards

Data speculative load


Inst 1 adds address to address
Inst 2 check table
Load.a r1
Store Inst 1 Store invalidates any
Load r1 Inst 2 matching loads in
Use r1 Store address check table
Inst 3 Load.c (ALAT)
Use r1
Can’t move load above store Inst 3 Check if load invalid (or
because store might be to same missing), jump to fixup
address code if so

Requires associative hardware in address check table


60
ALAT (Advanced Load Address Table)
Store
CAM on Address
Register Number Address Size • Load.a adds entry to ALAT
• Store removes entry if
address/size match
• Check instruction (chk.a or
ld.c) checks to make sure that
address is still in ALAT and
intermediary store did not
push it out.
Chk.a/ld.c
CAM on Register • If not found, run recovery
Number code (ex. re-execute load)

61
VLIW Multi-Way Branches
Problem: Long instructions provide few opportunities for branches
Solution: Allow one instruction to branch multiple directions

63
VLIW Multi-Way Branches
Problem: Long instructions provide few opportunities for branches
Solution: Allow one instruction to branch multiple directions

{ .mii
cmp.eq P1, P2 = R1, R2
cmp.ne P3,P4 = R4, R5
cmp.lt P5,P6, R8, R9
}
{ .bbb
(P1) br.cond label1
(P2) br.cond label2
(P5) br.cond label3
}
// fall through code here

64
Scheduling Around Dynamic Events
• Cache Miss
– Informing loads (loads nullify subsequent
instructions)
– Elbrus (Soviet/Russian) processor had branch on
cache miss
• Branch Mispredict
– Delay slots with predicated instructions
• Exceptions
– Hard on superscalar also…

65
Clustered VLIW

• Divide machine into


clusters (local register files
and local functional units)
• Lower bandwidth
between clusters/Higher
latency between clusters
• Used in:
– HP/ST Lx Processor
(printers)
– TI C6x Series DSP

TI C64x+ [TMS320C64x/C64x+ DSP CPU and Instruction Set Reference


Datapath Guide, TI, 2010, SPRU732J] 66
Image Credit: Texas Instruments Incorporated
Intel Itanium, EPIC IA-64
• EPIC is the style of architecture (CISC, RISC)
– Explicitly Parallel Instruction Computing
• IA-64 is Intel’s chosen ISA (x86, MIPS)
– IA-64 = Intel Architecture 64-bit
– An object-code-compatible VLIW
• Merced was first Itanium implementation (8086)
– First customer shipment expected 1997 (actually 2001)
– McKinley, second implementation shipped in 2002
– Recent version, Poulson, eight cores, 32nm, announced
2011
70
Eight Core Itanium “Poulson” [Intel 2011]

Image Credit: Intel

• 8 cores
• Cores are 2-way multithreaded
• 1-cycle 16KB L1 I&D caches
• 9-cycle 512KB L2 I-cache
• 8-cycle 256KB L2 D-cache
• 32 MB shared L3 cache
• 544mm2 in 32nm CMOS
• Over 3 billion transistors
• 6 instruction/cycle fetch
– Two 128-bit bundles
• Up to 12 insts/cycle execute 71
IA-64 Instruction Format
Instruction 2 Instruction 1 Instruction 0 Template

128-bit instruction bundle


• Template bits describe grouping of these
instructions with others in adjacent bundles
• Each group contains instructions that can
execute in parallel
bundle j-1 bundle j bundle j+1 bundle j+2

group i-1 group i group i+1 group i+2

72
IA-64 Registers
• 128 General Purpose 64-bit Integer Registers
• 128 General Purpose 64/80-bit Floating Point
Registers
• 64 1-bit Predicate Registers

• GPRs “rotate” to reduce code size for software


pipelined loops
– Rotation is a simple form of register renaming allowing one
instruction to address different physical registers on each
iteration
73
IA-64 Rotating Register File
Problem: Scheduling of loops requires many register
names and duplicated code in prolog and epilog
Solution: Allocate a new set of registers for each loop
iteration
P7
P6
RRB=3 P5
P4
P3
P2
R1 + P1
P0
Rotating Register Base (RRB) register points to base of
current register set. Value added on to logical register
specifier to give physical register number. Usually, split
into rotating and non-rotating registers. 74
Rotating Register File
(Previous Loop Example)

lw f1, () add.s f5, f4, ... sw f9, () bloop

lw P9, () add.s P13, P12, sw P17, () bloop RRB=8


lw P8, () add.s P12, P11, sw P16, () bloop RRB=7
lw P7, () add.s P11, P10, sw P15, () bloop RRB=6
lw P6, () add.s P10, P9, sw P14, () bloop RRB=5
lw P5, () add.s P9, P8, sw P13, () bloop RRB=4
lw P4, () add.s P8, P7, sw P12, () bloop RRB=3
lw P3, () add.s P7, P6, sw P11, () bloop RRB=2
lw P2, () add.s P6, P5, sw P10, () bloop RRB=1 75
Rotating Register File
(Previous Loop Example)
Three cycle load latency encoded Four cycle add.s latency
as difference of 3 in register encoded as difference of 4 in
specifier number (f4 - f1 = 3) register specifier number (f9 – f5
= 4)

lw f1, () add.s f5, f4, ... sw f9, () bloop

lw P9, () add.s P13, P12, sw P17, () bloop RRB=8


lw P8, () add.s P12, P11, sw P16, () bloop RRB=7
lw P7, () add.s P11, P10, sw P15, () bloop RRB=6
lw P6, () add.s P10, P9, sw P14, () bloop RRB=5
lw P5, () add.s P9, P8, sw P13, () bloop RRB=4
lw P4, () add.s P8, P7, sw P12, () bloop RRB=3
lw P3, () add.s P7, P6, sw P11, () bloop RRB=2
lw P2, () add.s P6, P5, sw P10, () bloop RRB=1 76
Rotating Register File
(Previous Loop Example)
Three cycle load latency encoded Four cycle add.s latency
as difference of 3 in register encoded as difference of 4 in
specifier number (f4 - f1 = 3) register specifier number (f9 – f5
= 4)

lw f1, () add.s f5, f4, ... sw f9, () bloop

lw P9, () add.s P13, P12, sw P17, () bloop RRB=8


lw P8, () add.s P12, P11, sw P16, () bloop RRB=7
lw P7, () add.s P11, P10, sw P15, () bloop RRB=6
lw P6, () add.s P10, P9, sw P14, () bloop RRB=5
lw P5, () add.s P9, P8, sw P13, () bloop RRB=4
lw P4, () add.s P8, P7, sw P12, () bloop RRB=3
lw P3, () add.s P7, P6, sw P11, () bloop RRB=2
lw P2, () add.s P6, P5, sw P10, () bloop RRB=1 77
Rotating Register File
(Previous Loop Example)
Three cycle load latency encoded Four cycle add.s latency
as difference of 3 in register encoded as difference of 4 in
specifier number (f4 - f1 = 3) register specifier number (f9 – f5
= 4)

lw f1, () add.s f5, f4, ... sw f9, () bloop

lw P9, () add.s P13, P12, sw P17, () bloop RRB=8


lw P8, () add.s P12, P11, sw P16, () bloop RRB=7
lw P7, () add.s P11, P10, sw P15, () bloop RRB=6
lw P6, () add.s P10, P9, sw P14, () bloop RRB=5
lw P5, () add.s P9, P8, sw P13, () bloop RRB=4
lw P4, () add.s P8, P7, sw P12, () bloop RRB=3
lw P3, () add.s P7, P6, sw P11, () bloop RRB=2
lw P2, () add.s P6, P5, sw P10, () bloop RRB=1 78
Why Itanium (IA-64) Failed
(My Opinion Only)
• Many Architectural (ISA) visible widgets tied hands of microarchitect
designers
– ALAT
– Predication (increases inter-dependency of instructions)
– Rotating Register file
• First implementations had very low clock rate
• Complex encoding
• Code size bloat
• Did not fundamentally solve some of the dynamic scheduling
• Large compiler complexity
– Profiling needed for good performance
• Limited Static Instruction Level Parallelism (ILP)
• People did build more complex superscalars. Area was not the most
critical constraint. (Code compatibility and Power were critical)
• AMD64!
79
Acknowledgements
• These slides contain material developed and copyright by:
– Arvind (MIT)
– Krste Asanovic (MIT/UCB)
– Joel Emer (Intel/MIT)
– James Hoe (CMU)
– John Kubiatowicz (UCB)
– David Patterson (UCB)
– Christopher Batten (Cornell)

• MIT material derived from course 6.823


• UCB material derived from course CS252 & CS152
• Cornell material derived from course ECE 4750

80
Computer Architecture
ELE 475 / COS 475
Slide Deck 8: Branch Prediction
David Wentzlaff
Department of Electrical Engineering
Princeton University

1
Agenda
• Branch Cost Motivation
• Branch Prediction
– Outcome
• Static
• Dynamic
– Target Address

2
Agenda
• Branch Cost Motivation
• Branch Prediction
– Outcome
• Static
• Dynamic
– Target Address

3
Longer Frontends Means More Control
Flow Penalty

RT
SB X0 ARF

F D I
Q I L0 L1 W ROB
FSB
C

S0
Y0 Y1 Y2 Y3

Penalty includes
instructions in IQ

4
Longer Pipeline Frontends Amplify
Branch Cost

Pentium 3: 10 cycle branch penalty


Pentium 4: 20 cycle branch penalty
Image from: The Microarchitecture of the Pentium 4 Processor by Glenn Hinton et al.
5
Appeared in Intel Technology Journal Q1, 2001. Image courtesy of Intel
Dual Issue and Branch Cost

IR0 Branch Cond.


ALU
PC
addr
rdata A
RF RF
Instr. IR1 Read
Cache
Read
ALU
addr
B rdata

Data
Cache

6
Superscalars Multiply Branch Cost
BEQZ F D I A0 A1 W
OpA F D I B0 - -
OpB F D I - - -
OpC F D I - - - Dual-issue
Processor
OpD F D - - - - has twice
the mispredict
OpE F D - - - - penalty
OpF F - - - - -
OpG F - - - - -
OpH F D I A0 A1 W
OpI F D I B0 B1 W
How much work is lost if pipeline doesn’t follow correct instruction flow?
~pipeline width x branch penalty 7
Agenda
• Branch Cost Motivation
• Branch Prediction
– Outcome
• Static
• Dynamic
– Target Address

8
Branch Prediction
• Essential in modern processors to mitigate
branch delay latencies

Two types of Prediction


1. Predict Branch Outcome
2. Predict Branch/Jump Address

9
Where is the Branch Information
Known?

F D I X M W
Know branch outcome
Know target address for JR, JALR

Know target address for branches, J, JAL

10
Agenda
• Branch Cost Motivation
• Branch Prediction
– Outcome
• Static
• Dynamic
– Target Address

11
Branch Delay Slots
(expose control hazard to software)

• Change the ISA semantics so that the instruction


that follows a jump or branch is always executed
– gives compiler the flexibility to put in a useful instruction where normally
a pipeline bubble would have resulted.

I1 096 ADD
I2 100 BEQZ r1 +200
I3 104 ADD Delay slot instructions executed
I4 108 ADD regardless of branch outcome
I5 304 ADD

12
Static Branch Prediction
Overall probability a branch is taken is ~60-70% but:

Backward branches (e.g., a loop-closing BEQZ) are taken ~90%
of the time; forward branches are taken only ~50% of the time.

13
Static Software Branch Prediction
• Extend ISA to enable compiler to tell microarchitecture if branch is
likely to be taken or not (Can be up to 80% accurate)
BR.T F D X M W
OpA F - - - -
Targ F D X M W
BR.NT F D X M W
OpB F D X M W
OpC F D X M W

What if hint is wrong?


BR.T F D X M W
OpA F - - - -
Targ F - - - -
OpA F D X M W
14
Static Hardware Branch Prediction
1. Always Predict Not-Taken
– What we have been assuming
– Simple to implement
– Know fall-through PC in Fetch
– Poor Accuracy, especially on backward branches
2. Always Predict Taken
– Difficult to implement because don’t know target until
Decode
– Poor accuracy on if-then-else
3. Backward Branch Taken, Forward Branch Not Taken
– Better Accuracy
– Difficult to implement because don’t know target until
Decode
15
Agenda
• Branch Cost Motivation
• Branch Prediction
– Outcome
• Static
• Dynamic
– Target Address

16
Dynamic Hardware Branch Prediction:
Exploiting Temporal Correlation
• Exploit structure in program: The way a
branch resolves may be a good indicator of
the way it will resolve the next time it
executes (Temporal Correlation)

1-bit Saturating Counter


NT

T Predict Predict NT
T NT
T
17
1-bit Saturating Counter
NT

T Predict Predict NT
T NT
T

Iteration Prediction Actual Mispredict? For Backward branch in loop


1 NT T Y • Assume 4 Iterations
• Assume is executed multiple
2 T T times
3 T T
4 T NT Y Always 2 Mispredicts

1 NT T Y
2 T T
3 T T
18
4 T NT Y
2-bit Saturating Counter
Strong Weak Weak Strong
Taken NT Taken NT Not Taken NT Not Taken

T Predict Predict Predict Predict NT


T T NT NT
T T T

Iteration Prediction Actual Mispredict? State


1 NT T Y Strong NT Only 1
Mispredict
2 NT T Y Weak NT
3 T T Weak T
4 T NT Y Strong T

1 T T Weak T
2 T T Strong T
3 T T Strong T
19
4 T NT Y Strong T
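A 2-bit saturating counter is tiny in C as well; this sketch uses states 0-3, with 2 and 3 predicting taken, matching the Strong/Weak states above:

#include <stdbool.h>
#include <stdint.h>

/* 0 = strong not-taken, 1 = weak not-taken,
 * 2 = weak taken,       3 = strong taken. */
typedef uint8_t ctr2_t;

bool ctr2_predict(ctr2_t c) { return c >= 2; }

ctr2_t ctr2_update(ctr2_t c, bool taken) {
    if (taken)  return (c < 3) ? (ctr2_t)(c + 1) : (ctr2_t)3;
    else        return (c > 0) ? (ctr2_t)(c - 1) : (ctr2_t)0;
}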
Other 2-bit FSM Branch Predictors
T
Strong
Taken
Predict
T Jump directly to
T strong from weak
T
Weak
Taken NT
Predict Predict Weak
T NT Not Taken
T

NT

Predict
Strong NT
Not Taken
NT 20
Branch History Table (BHT)
Fetch PC 00

k 2k-entry
I-Cache BHT Index BHT,
2 bits/entry

Instruction
Opcode offset

Branch? Target PC Taken/¬Taken?

4K-entry BHT, 2 bits/entry, ~80-90% correct predictions


21
Exploiting Spatial Correlation
Yeh and Patt, 1992

if (x[i] < 7) then


y += 1;
if (x[i] < 5) then
c -= 4;

If first condition false, second condition also false

Branch History Register, BHR, records the direction


of the last N branches executed by the processor
(Shift Register)
Branch
Outcome BHR
(T/NT)
22
Pattern History Table (PHT)
PHT

Branch
Outcome BHR Indexes
(T/NT) PHT

FSM
Output
Logic

Prediction (T/NT) 23
Two-Level Branch Predictor
PHT 0 PHT 2^(k-1)

Branch
Outcome
(T/NT)
BHR Indexes
PHT …

PC

FSM
Output
Logic
24
Prediction (T/NT)
Generalized Two-Level Branch
Predictor PHT 0 PHT 2^(k-1)
Branch
Outcome BHR
(T/NT) Indexes
PHT …

PC

FSM
m Output
Logic
For non-trivial m and k, > 97% accuracy 25
Prediction (T/NT)
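A compact C sketch of a two-level predictor. For brevity it hashes the global history with the PC into a single PHT, a gshare-style simplification of the generalized per-PC scheme shown above; the table sizes are assumptions:

#include <stdbool.h>
#include <stdint.h>

#define HIST_BITS 10
#define PHT_SIZE  (1u << HIST_BITS)

static uint16_t bhr;                 /* last HIST_BITS branch outcomes   */
static uint8_t  pht[PHT_SIZE];       /* 2-bit counters, initialized weak */

static uint32_t pht_index(uint32_t pc) {
    return ((pc >> 2) ^ bhr) & (PHT_SIZE - 1);
}

bool predict(uint32_t pc) {
    return pht[pht_index(pc)] >= 2;
}

void train(uint32_t pc, bool taken) {
    uint8_t *c = &pht[pht_index(pc)];
    if (taken)  { if (*c < 3) (*c)++; }
    else        { if (*c > 0) (*c)--; }
    bhr = (uint16_t)(((bhr << 1) | (taken ? 1u : 0u)) & (PHT_SIZE - 1));
}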
Tournament Predictors
(ex: Alpha 21264)

PC
Global Local
Predictor Predictor

Choice
Predictor

Prediction (T/NT)
• Choice predictor learns whether best to use local or global branch
history in predicting next branch
• Global history is speculatively updated but restored on mispredict
• Claim 90-100% success on range of applications 26
Agenda
• Branch Cost Motivation
• Branch Prediction
– Outcome
• Static
• Dynamic
– Target Address

27
Predicting Target Address

F D I X M W
Know target address for JR, JALR

Know target address for branches, J, JAL

Even with best possible prediction of branch


outcome, still have to wait for branch target address
to be determined
28
Branch Target Buffer (BTB)
PC
Valid PC Predicted BP State
Target

New PC if FSM
== hit and Output
predicted Logic
taken
Prediction (T/NT)
Hit

Put BTB in Fetch Stage in parallel with PC+4 Speculation logic


29
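A direct-mapped BTB lookup in the fetch stage could be sketched in C as below (the entry count and indexing are assumptions). Only branches and jumps are ever installed, so other instructions simply miss and fall through to PC+4 without being decoded first:

#include <stdbool.h>
#include <stdint.h>

#define BTB_ENTRIES 512

typedef struct {
    bool     valid;
    uint32_t pc;        /* tag: PC of the branch/jump        */
    uint32_t target;    /* predicted target address          */
    uint8_t  ctr;       /* 2-bit outcome predictor state     */
} btb_entry_t;

static btb_entry_t btb[BTB_ENTRIES];

/* Fetch-stage lookup, done in parallel with computing PC+4: only on a
 * hit with a taken prediction does the next PC come from the BTB. */
uint32_t next_pc(uint32_t pc) {
    btb_entry_t *e = &btb[(pc >> 2) & (BTB_ENTRIES - 1)];
    if (e->valid && e->pc == pc && e->ctr >= 2)
        return e->target;
    return pc + 4;
}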
BTB is only for Control Instructions
• BTB contains useful information for
branch and jump instructions only
– Do not update it for other instructions

• For all other instructions the next PC is


PC+4 !
How to achieve this effect without
decoding the instruction?
When do we update BTB information?

30
Uses of Jump Register (JR)
• Switch statements (jump to address of matching case)
BTB works well if same case used repeatedly

• Dynamic function call (jump to run-time function address)


BTB works well if same function usually called, (e.g., in
C++ programming, when objects have same type in
virtual function call)

• Subroutine returns (jump to return address)


BTB works well if we usually return to the same place
– But often one function is called from many distinct call sites!

How well does BTB work for each of these cases?


31
Subroutine Return Stack
Small structure to accelerate JR for subroutine
returns, typically much more accurate than BTBs.
fa() { fb(); }
fb() { fc(); }
fc() { fd(); }

Push the call's return address when a function call is executed;
pop the return address when a subroutine return is decoded.

[Diagram: a small hardware stack with k entries (typically k = 8-16); after the
calls above it holds &fd(), &fc(), &fb() from top to bottom.]
32
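
A minimal C sketch of a return address stack (illustrative only; the depth and the
overwrite-on-overflow behavior are assumptions):

#define RAS_DEPTH 16
static unsigned int ras[RAS_DEPTH];
static unsigned int ras_top;              /* count of pushes; indexed modulo the depth */

void ras_push(unsigned int return_pc) {   /* on an executed call: push the address after the call */
    ras[ras_top % RAS_DEPTH] = return_pc;
    ras_top++;                            /* on overflow the oldest entry is silently overwritten */
}

unsigned int ras_pop(void) {              /* on a decoded subroutine return */
    if (ras_top > 0) ras_top--;
    return ras[ras_top % RAS_DEPTH];      /* predicted return address (stale if the stack underflowed) */
}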
Acknowledgements
• These slides contain material developed and copyright by:
– Arvind (MIT)
– Krste Asanovic (MIT/UCB)
– Joel Emer (Intel/MIT)
– James Hoe (CMU)
– John Kubiatowicz (UCB)
– David Patterson (UCB)
– Christopher Batten (Cornell)

• MIT material derived from course 6.823


• UCB material derived from course CS252 & CS152
• Cornell material derived from course ECE 4750

33
Computer Architecture
ELE 475 / COS 475
Slide Deck 9: Advanced Caches
David Wentzlaff
Department of Electrical Engineering
Princeton University

1
Agenda
• Review
– Three C’s
– Basic Cache Optimizations
• Advanced Cache Optimizations
– Pipelined Cache Write
– Write Buffer
– Multilevel Caches
– Victim Caches
– Prefetching
• Hardware
• Software
– Multiporting and Banking
– Software Optimizations
– Non-Blocking Cache
– Critical Word First/Early Restart

2
Agenda
• Review
– Three C’s
– Basic Cache Optimizations
• Advanced Cache Optimizations
– Pipelined Cache Write
– Write Buffer
– Multilevel Caches
– Victim Caches
– Prefetching
• Hardware
• Software
– Multiporting and Banking
– Software Optimizations
– Non-Blocking Cache
– Critical Word First/Early Restart

3
Average Memory Access Time

[Diagram: Processor -> Cache -> Main Memory; a hit is serviced by the cache, a miss
goes on to main memory.]

• Average Memory Access Time = Hit Time + ( Miss Rate * Miss Penalty )

4
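
A quick worked example with assumed numbers (not from the slides): with a 1-cycle hit
time, a 5% miss rate, and a 100-cycle miss penalty, AMAT = 1 + (0.05 * 100) = 6
cycles; halving the miss rate to 2.5% brings AMAT down to 1 + 2.5 = 3.5 cycles.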
Categorizing Misses: The Three C’s

• Compulsory – first-reference to a block, occur even


with infinite cache
• Capacity – cache is too small to hold all data needed by
program, occur even under perfect replacement policy
(loop over 5 cache lines)
• Conflict – misses that occur because of collisions due
to less than full associativity (loop over 3 cache lines) 5
Reduce Hit Time: Small & Simple
Caches

Plot from Hennessy and Patterson Ed. 4


6
Image Copyright © 2007-2012 Elsevier Inc. All rights Reserved.
Reduce Miss Rate: Large Block Size

Benefits:
• Less tag overhead
• Exploit fast burst transfers from DRAM
• Exploit fast burst transfers over wide on-chip busses

Drawbacks:
• Can waste bandwidth if data is not used
• Fewer blocks -> more conflicts

7
Plot from Hennessy and Patterson Ed. 5 Image Copyright © 2011, Elsevier Inc. All rights Reserved.
Reduce Miss Rate: Large Cache Size

Empirical Rule of Thumb:


If cache size is doubled, miss rate usually drops by about √2

8
Plot from Hennessy and Patterson Ed. 5 Image Copyright © 2011, Elsevier Inc. All rights Reserved.
Reduce Miss Rate: High Associativity

Empirical Rule of Thumb:


Direct-mapped cache of size N has about the same miss rate
as a two-way set-associative cache of size N/2
9
Plot from Hennessy and Patterson Ed. 5 Image Copyright © 2011, Elsevier Inc. All rights Reserved.
Agenda
• Review
– Three C’s
– Basic Cache Optimizations
• Advanced Cache Optimizations
– Pipelined Cache Write
– Write Buffer
– Multilevel Caches
– Victim Caches
– Prefetching
• Hardware
• Software
– Multiporting and Banking
– Software Optimizations
– Non-Blocking Cache
– Critical Word First/Early Restart

10
Write Performance

[Diagram: the address is split into Tag (t bits), Index (k bits), and Block Offset
(b bits); the index selects one of 2^k lines, each holding a valid bit, tag, and
data. The stored tag is compared against the address tag to generate HIT, which
gates the write enable (WE) for the selected data word or byte.]

11
Reducing Write Hit Time
Problem: Writes take two cycles in memory stage, one
cycle for tag check plus one cycle for data write if hit

Solutions:
• Design data RAM that can perform read and write
concurrently, restore old value after tag miss
• Fully-associative (CAM Tag) caches: Word line only
enabled if hit

• Pipelined writes: Hold write data for store in single buffer


ahead of cache, write cache data during next store’s tag
check

12
Pipelining Cache Writes

[Diagram: pipeline F D X M W, with a delayed write buffer sitting between the
pipeline and the cache.]

Data from a store hit is written into the data portion of the cache
during the tag access of the subsequent store
14
Pipelining Cache Writes

[Diagram: the address and store data from the CPU are split into Tag, Index, and
Store Data; the store's address and data are latched into Delayed Write Addr. and
Delayed Write Data registers. On the next store, the tags are checked (=? -> Hit?)
while the previous store's buffered data is written into the data array. On a load,
a mux selects between the data array output and the delayed write data, so a load
immediately following a store still sees the buffered value.]

Data from a store hit is written into the data portion of the cache
during the tag access of the subsequent store
15
Pipelined Cache Efficacy

Cache Miss Rate Miss Penalty Hit Time Bandwidth


Optimization

Pipelined
Writes

16
Pipelined Cache Efficacy

Cache Miss Rate Miss Penalty Hit Time Bandwidth


Optimization

Pipelined
Writes - +

17
Write Buffer to Reduce Read Miss Penalty

[Diagram: CPU (with register file) -> L1 data cache -> write buffer -> unified L2
cache. The write buffer holds evicted dirty lines for a writeback cache, OR all
writes in a writethrough cache.]

Processor is not stalled on writes, and read misses can go ahead of writes to main
memory
Problem: Write buffer may hold the updated value of a location needed by a read miss
Simple scheme: on a read miss, wait for the write buffer to go empty
Faster scheme: check write buffer addresses against the read miss address; if no
match, allow the read miss to go ahead of the writes, else return the value in the
write buffer
18
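
A minimal C sketch of the "faster scheme" (illustrative only; the buffer depth and
matching at cache-line granularity are assumptions):

#define WB_ENTRIES 8
struct wb_entry { int valid; unsigned int line_addr; /* plus the buffered line data */ };
static struct wb_entry write_buffer[WB_ENTRIES];

/* On a read miss: returns a matching write-buffer entry whose value should be
   forwarded, or NULL if the read miss may bypass the buffered writes. */
struct wb_entry *wb_match(unsigned int miss_line_addr) {
    for (int i = 0; i < WB_ENTRIES; i++)
        if (write_buffer[i].valid && write_buffer[i].line_addr == miss_line_addr)
            return &write_buffer[i];
    return NULL;
}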
Write Buffer Efficacy

Cache Miss Rate Miss Penalty Hit Time Bandwidth


Optimization

Write Buffer

19
Write Buffer Efficacy

Cache Miss Rate Miss Penalty Hit Time Bandwidth


Optimization

Write Buffer
+

20
Multilevel Caches
Problem: A memory cannot be large and fast
Solution: Increasing sizes of cache at each level

CPU L1$ L2$ DRAM

Local miss rate = misses in cache / accesses to cache


Global miss rate = misses in cache / CPU memory accesses
Misses per instruction = misses in cache / number of instructions

21
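
A quick worked example with assumed numbers (not from the slides): if the L1 misses
on 5% of CPU accesses and the L2 misses on 20% of the accesses that reach it, the
L2's local miss rate is 20%, while its global miss rate is 0.05 * 0.20 = 1% of all
CPU memory accesses.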
Presence of L2 influences L1 design
• Use smaller L1 if there is also L2
– Trade increased L1 miss rate for reduced L1 hit time and
reduced L1 miss penalty
– Reduces average access energy
• Use simpler write-through L1 with on-chip L2
– Write-back L2 cache absorbs write traffic, doesn’t go off-chip
– At most one L1 miss request per L1 access (no dirty victim write
back) simplifies pipeline control
– Simplifies coherence issues
– Simplifies error recovery in L1 (can use just parity bits in L1 and
reload from L2 when parity error detected on L1 read)

22
Inclusion Policy
• Inclusive multilevel cache:
– Inner cache holds copies of data in outer cache
– External coherence snoop access need only check
outer cache
• Exclusive multilevel caches:
– Inner cache may hold data not in outer cache
– Swap lines between inner/outer caches on miss
– Used in AMD Athlon with 64KB primary and 256KB
secondary cache
Why choose one type or the other?

23
Itanium-2 On-Chip Caches
(Intel/HP, 2002)
Level 1: 16KB, 4-way s.a.,
64B line, quad-port (2
load+2 store), single
cycle latency

Level 2: 256KB, 4-way s.a,


128B line, quad-port (4
load or 4 store), five
cycle latency

Level 3: 3MB, 12-way s.a.,


128B line, single 32B
port, twelve cycle
latency
24
Image Credit: Intel
Power 7 On-Chip Caches [IBM 2009]
32KB L1 I$/core
32KB L1 D$/core
3-cycle latency

256KB Unified L2$/core


8-cycle latency

32MB Unified Shared L3$


Embedded DRAM
25-cycle latency to local
slice

Image Credit: IBM


Courtesy of International Business Machines Corporation,
25
© International Business Machines Corporation.
Multilevel Cache Efficacy

Cache Miss Rate Miss Penalty Hit Time Bandwidth


Optimization

Multilevel
Cache

26
Multilevel Cache Efficacy L1

Cache Miss Rate Miss Penalty Hit Time Bandwidth


Optimization

Multilevel
Cache +

27
Multilevel Cache Efficacy L1, L2, L3

Cache Miss Rate Miss Penalty Hit Time Bandwidth


Optimization

Multilevel
Cache + +

28
Victim Cache
• Small Fully Associative cache for recently evicted lines
– Usually small (4-16 blocks)
• Reduced conflict misses
– More associativity for small number of lines
• Can be checked in parallel or series with main cache
• On Miss in L1, Hit in VC: VC->L1, L1->VC
• On Miss in L1, Miss in VC: L1->VC, VC->? (Can always be clean)

[Diagram: CPU (with register file) <-> L1 data cache <-> unified L2 cache. Data
evicted from L1 goes into a small fully-associative victim cache; on a miss in L1
that hits in the victim cache, the hit data is returned and the two lines swap.]
29
Victim Cache Efficacy

Cache Miss Rate Miss Penalty Hit Time Bandwidth


Optimization

Victim Cache

30
Victim Cache Efficacy L1

Cache Miss Rate Miss Penalty Hit Time Bandwidth


Optimization

Victim Cache
+

31
Victim Cache Efficacy L1 and VC

Cache Miss Rate Miss Penalty Hit Time Bandwidth


Optimization

Victim Cache
+ +

32
Prefetching
• Speculate on future instruction and data accesses
and fetch them into cache(s)
– Instruction accesses easier to predict than data accesses
• Varieties of prefetching
– Hardware prefetching
– Software prefetching
– Mixed schemes

• What types of misses does prefetching


affect?
33
Issues in Prefetching
• Usefulness – should produce hits
• Timeliness – not late and not too early
• Cache and bandwidth pollution
L1
Instruction
CPU Unified L2
Cache
RF L1 Data

Prefetched data

34
Hardware Instruction Prefetching
Instruction prefetch in Alpha AXP 21064
– Fetch two blocks on a miss; the requested block (i) and
the next consecutive block (i+1)
– Requested block placed in cache, and next block in
instruction stream buffer
– If miss in cache but hit in stream buffer, move stream
buffer block into cache and prefetch next block (i+2)
[Diagram: CPU (with register file) <-> L1 instruction cache <-> unified L2 cache.
The requested block goes into the L1 instruction cache, while the prefetched next
block goes into a stream buffer alongside it.]
35
Hardware Data Prefetching
• Prefetch-on-miss:
– Prefetch b + 1 upon miss on b

• One Block Lookahead (OBL) scheme


– Initiate prefetch for block b + 1 when block b is accessed
– Why is this different from doubling block size?
– Can extend to N-block lookahead

• Strided prefetch
– If observe sequence of accesses to block b, b+N, b+2N, then prefetch
b+3N etc.

Example: IBM Power 5 [2003] supports eight independent streams of strided


prefetch per processor, prefetching 12 lines ahead of current access
36
Software Prefetching

for(i=0; i < N; i++) {


prefetch( &a[i + 1] );
prefetch( &b[i + 1] );
SUM = SUM + a[i] * b[i];
}

37
Software Prefetching Issues
• Timing is the biggest issue, not predictability
– If you prefetch very close to when the data is required, you might be
too late
– Prefetch too early, cause pollution
– Estimate how long it will take for the data to come into L1, so we can
set P appropriately
– Why is this hard to do?

for(i=0; i < N; i++) {


prefetch( &a[i + P] );
prefetch( &b[i + P] );
SUM = SUM + a[i] * b[i];
}
Must consider cost of prefetch instructions
38
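
A rough illustration with assumed numbers (not from the slides): if a miss to the
next level takes about 200 cycles and one loop iteration takes about 20 cycles, the
prefetch distance should be roughly P = 200 / 20 = 10 iterations ahead. Estimating
this is hard because the iteration time itself shifts with cache behavior, branch
mispredictions, and dynamic scheduling.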
Prefetching Efficacy

Cache Miss Rate Miss Penalty Hit Time Bandwidth


Optimization

Prefetching

39
Prefetching Efficacy

Cache Miss Rate Miss Penalty Hit Time Bandwidth


Optimization

Prefetching
+ +

40
Increasing Cache Bandwidth
Multiporting and Banking

[Diagram: a two-issue pipeline (shared PC and instruction cache, two register-file
read paths, two ALUs) in which the two instructions share a single data cache port.]

41
Increasing Cache Bandwidth
Multiporting and Banking

[Diagram: the same two-issue pipeline, now with both ALUs generating addresses into
the data cache, i.e., two memory accesses per cycle.]

42
Increasing Cache Bandwidth
Multiporting and Banking

[Diagram: the same two-issue pipeline with two concurrent accesses to the data
cache.]

Challenge: Two stores to the same line, or a Load and a Store to the same line
43
True Multiport Caches
• Large area increase (could be double for 2-port)
• Hit time increase (can be made small)

[Diagram: a single data cache with two full address/data ports (Address 1/Data 1 and
Address 2/Data 2).]
44
Banked Caches
• Partition Address Space into multiple banks
– Use portions of address (low or high order interleaved)
Benefits:
• Higher throughput
Challenges:
• Bank Conflicts
• Extra Wiring
• Uneven utilization
[Diagram: the address space is split across Bank 0 and Bank 1; Address 0/Data 0 go
to Bank 0 and Address 1/Data 1 go to Bank 1, so two accesses to different banks can
proceed in parallel.]

45
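
A minimal C sketch of low-order interleaved bank selection (illustrative only; the
bank count and line size are assumptions):

#define NUM_BANKS  4
#define LINE_BYTES 64

/* Low-order interleaving on line addresses: consecutive cache lines map to
   consecutive banks, so sequential accesses spread across the banks. */
unsigned int bank_select(unsigned int addr) {
    return (addr / LINE_BYTES) % NUM_BANKS;
}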
Cache Banking Efficacy

Cache Miss Rate Miss Penalty Hit Time Bandwidth


Optimization

Cache
Banking +

46
Compiler Optimizations
• Restructuring code affects the data block access
sequence
– Group data accesses together to improve spatial locality
– Re-order data accesses to improve temporal locality
• Prevent data from entering the cache
– Useful for variables that will only be accessed once before being
replaced
– Needs mechanism for software to tell hardware not to cache data
(“no-allocate” instruction hints or page table bits)
• Kill data that will never be used again
– Streaming data exploits spatial locality but not temporal locality
– Replace into dead cache locations

47
Loop Interchange
for(j=0; j < N; j++) {
for(i=0; i < M; i++) {
x[i][j] = 2 * x[i][j];
}
}

for(i=0; i < M; i++) {


for(j=0; j < N; j++) {
x[i][j] = 2 * x[i][j];
}
}

What type of locality does this improve?

48
Loop Fusion
for(i=0; i < N; i++)
a[i] = b[i] * c[i];

for(i=0; i < N; i++)


d[i] = a[i] * c[i];

49
Loop Fusion
for(i=0; i < N; i++)
a[i] = b[i] * c[i];

for(i=0; i < N; i++)


d[i] = a[i] * c[i];

for(i=0; i < N; i++)


{
a[i] = b[i] * c[i];
d[i] = a[i] * c[i];
}

50
Loop Fusion
for(i=0; i < N; i++)
a[i] = b[i] * c[i];

for(i=0; i < N; i++)


d[i] = a[i] * c[i];

for(i=0; i < N; i++)


{
a[i] = b[i] * c[i];
d[i] = a[i] * c[i];
}

What type of locality does this improve?


51
Matrix Multiply, Naïve Code
for(i=0; i < N; i++)
  for(j=0; j < N; j++) {
    r = 0;
    for(k=0; k < N; k++)
      r = r + y[i][k] * z[k][j];
    x[i][j] = r;
  }

[Diagram: access patterns for y (row i over index k), z (column j over index k), and
x (element i,j), with elements marked as not touched, old access, or new access.]
52
Matrix Multiply with Cache Tiling/Blocking
for(jj=0; jj < N; jj=jj+B)
  for(kk=0; kk < N; kk=kk+B)
    for(i=0; i < N; i++)
      for(j=jj; j < min(jj+B,N); j++) {
        r = 0;
        for(k=kk; k < min(kk+B,N); k++)
          r = r + y[i][k] * z[k][j];
        x[i][j] = x[i][j] + r;
      }

[Diagram: the same y, z, and x access patterns, now restricted to B x B tiles.]
53
Matrix Multiply with Cache Tiling/Blocking
for(jj=0; jj < N; jj=jj+B)
  for(kk=0; kk < N; kk=kk+B)
    for(i=0; i < N; i++)
      for(j=jj; j < min(jj+B,N); j++) {
        r = 0;
        for(k=kk; k < min(kk+B,N); k++)
          r = r + y[i][k] * z[k][j];
        x[i][j] = x[i][j] + r;
      }

[Diagram: the same y, z, and x access patterns, restricted to B x B tiles.]

What type of locality does this improve?
54
Compiler Memory Optimizations
Efficacy
Cache Miss Rate Miss Penalty Hit Time Bandwidth
Optimization

Compiler
Optimization

58
Compiler Memory Optimizations
Efficacy
Cache Miss Rate Miss Penalty Hit Time Bandwidth
Optimization

Compiler
Optimization +

59
Non-Blocking Caches
(aka Out-Of-Order Memory System)
(aka Lockup Free Caches)
• Enable subsequent cache accesses after a cache
miss has occurred
– Hit-under-miss
– Miss-under-miss (concurrent misses)
• Suitable for in-order processor or out-of-order
processors
• Challenges
– Maintaining order when multiple misses that might
return out of order
– Load or Store to an already pending miss address
(need merge) 60
Non-Blocking Cache Timeline
Cache Miss

CPU Time CPU Time


Blocking Cache:
Miss Penalty

Cache Miss Hit Stall on use

CPU Time CPU Time


Hit Under Miss:
Miss Penalty

Cache Miss Miss Stall on use

CPU Time CPU Time


Miss Under Miss:
Miss Penalty
Miss Penalty
Time 61
Miss Status Handling Register (MSHR) / Miss Address File (MAF)

MSHR/MAF entry fields:
• V: Valid
• Block Address: Address of the cache block in the memory system
• Issued: Issued to main memory / next level of cache

Load/Store entry fields:
• V: Valid
• MSHR Entry: Entry number of the MSHR this access belongs to
• Type: {LW, SW, LH, SH, LB, SB}
• Offset: Offset within the block
• Destination: (Loads) register, (Stores) store buffer entry
62
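
A minimal C sketch of these two structures (illustrative only; the table sizes and
field types are assumptions):

#define NUM_MSHR      8
#define NUM_LDST_ENT 16

struct mshr_entry {
    int          valid;
    unsigned int block_addr;   /* address of the missing cache block */
    int          issued;       /* request already sent to the next level / main memory? */
};

struct ldst_entry {
    int          valid;
    int          mshr;         /* index of the MSHR entry this access waits on */
    int          type;         /* LW, SW, LH, SH, LB, SB */
    unsigned int offset;       /* offset within the block */
    int          dest;         /* loads: destination register; stores: store-buffer entry */
};

static struct mshr_entry mshr[NUM_MSHR];
static struct ldst_entry ldst[NUM_LDST_ENT];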
Non-Blocking Cache Operation
On Cache Miss:
• Check MSHR for matched address
– If found: Allocate new Load/Store entry pointing to MSHR
– If not found: Allocate new MSHR entry and Load/Store entry
– If all entries full in MSHR or Load/Store entry table, stall or
prevent new LDs/STs
On Data Return from Memory:
• Find Load or Store waiting for it
– Forward Load data to processor/Clear Store Buffer
– Could be multiple Loads and Stores
• Write Data to cache
When Cache Lines is Completely Returned:
• De-allocate MSHR entry
63
Non-Blocking Cache with In-Order
Pipelines
• Need Scoreboard for Individual Registers

On Load Miss:
• Mark Destination Register as Busy

On Load Data Return:


• Mark Destination Register as Available

On Use of Busy Register:


• Stall Processor

64
Non-Blocking Cache Efficacy

Cache Miss Rate Miss Penalty Hit Time Bandwidth


Optimization

Non-blocking
Cache

65
Non-Blocking Cache Efficacy

Cache Miss Rate Miss Penalty Hit Time Bandwidth


Optimization

Non-blocking
Cache + +

66
Critical Word First
• Request the missed word from memory first.
• Rest of cache line comes after “critical word”
– Commonly words come back in rotated order
CPU Time CPU Time
Basic Blocking Cache:
Miss Penalty
Order of fill: 0, 1, 2, 3, 4, 5, 6, 7

CPU Time CPU Time


Blocking Cache with
Critical Word first: Miss Penalty

Order of fill: 3, 4, 5, 6, 7, 0, 1, 2
67
Early Restart
• Data returns from memory in order
• Processor Restarts when needed word is
returned
CPU Time CPU Time
Basic Blocking Cache:
Miss Penalty
Order of fill: 0, 1, 2, 3, 4, 5, 6, 7

CPU Time CPU Time


Blocking Cache with
Early Restart: Miss Penalty

Order of fill: 0, 1, 2, 3, 4, 5, 6, 7
68
Critical Word First and Early Restart
Efficacy
Cache Miss Rate Miss Penalty Hit Time Bandwidth
Optimization

Critical Word
First/Early
Restart

69
Critical Word First and Early Restart
Efficacy
Cache Miss Rate Miss Penalty Hit Time Bandwidth
Optimization

Critical Word
First/Early +
Restart

70
Agenda
• Review
– Three C’s
– Basic Cache Optimizations
• Advanced Cache Optimizations
– Pipelined Cache Write
– Write Buffer
– Multilevel Caches
– Victim Caches
– Prefetching
• Hardware
• Software
– Multiporting and Banking
– Software Optimizations
– Non-Blocking Cache
– Critical Word First/Early Restart

71
Acknowledgements
• These slides contain material developed and copyright by:
– Arvind (MIT)
– Krste Asanovic (MIT/UCB)
– Joel Emer (Intel/MIT)
– James Hoe (CMU)
– John Kubiatowicz (UCB)
– David Patterson (UCB)
– Christopher Batten (Cornell)

• MIT material derived from course 6.823


• UCB material derived from course CS252 & CS152
• Cornell material derived from course ECE 4750

72
Computer Architecture
ELE 475 / COS 475
Slide Deck 10: Address Translation
and Protection
David Wentzlaff
Department of Electrical Engineering
Princeton University

1
Memory Management
• From early absolute addressing schemes, to
modern virtual memory systems with support for
virtual machine monitors

• Can separate into orthogonal functions:


– Translation (mapping of virtual address to physical address)
– Protection (permission to access word in memory)
– Virtual Memory (transparent extension of memory space using slower disk
storage)
• But most modern systems provide support for all
the above functions with a single page-based
system
2
Absolute Addresses
EDSAC, early 50’s
• Only one program ran at a time, with unrestricted
access to entire machine (RAM + I/O devices)
• Addresses in a program depended upon where the
program was to be loaded in memory
• But it was more convenient for programmers to
write location-independent subroutines
How could location independence be achieved?

Linker and/or loader modify addresses of subroutines


and callers when building a program memory image

3
Bare Machine
Physical Physical
Address Inst. Address Data
PC D Decode E + M W
Cache Cache

Physical Memory Controller Physical


Address Address
Physical Address
Main Memory (DRAM)

• In a bare machine, the only kind of address


is a physical address

4
Dynamic Address Translation
Location-independent programs
• Programming and storage management ease
  => need for a base register
Protection
• Independent programs should not affect each other inadvertently
  => need for a bound register
Multiprogramming drives the requirement for a resident supervisor to manage context
switches between multiple programs

[Diagram: physical memory holding the OS and multiple programs (prog1, prog2, ...),
each loaded at a different place.]
5
Simple Base and Bound Translation

[Diagram: the effective address is compared against the Bound Register (segment
length); an address past the bound raises a bounds violation. Otherwise the Base
Register is added to the effective address to form the physical address of the word
within the current segment in physical memory.]

Base and bounds registers are visible/accessible only
when the processor is running in supervisor mode
6
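
A minimal C sketch of the check and translation above (illustrative only, not a
hardware description):

/* Returns 0 and sets *pa on success, or -1 on a bounds violation (which would trap
   to the supervisor). */
int base_and_bound(unsigned long eff_addr, unsigned long base, unsigned long bound,
                   unsigned long *pa) {
    if (eff_addr >= bound)
        return -1;                 /* bounds violation */
    *pa = base + eff_addr;         /* physical address = base + effective address */
    return 0;
}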
Separate Areas for Program and Data
Bounds
Data Bound
Register
 Violation?
data
Effective Address Logical
Load X Register Address segment

Main Memory
Data Base Physical
Register + Address

Program Program Bound Bounds


Address Register  Violation?
Space Program Counter Logical program
Address segment
Program Base Physical
Register + Address

What is an advantage of this separation?


(Scheme used on all Cray vector supercomputers prior to X1, 2002)
7
Base and Bound Machine
Prog. Bound Data Bound
Register Bounds Violation? Register Bounds Violation?
Logical
 Logical

Address Address

Inst. Data
PC + D Decode E + M + Cache W
Cache
Physical Physical
Address Address
Program Base Data Base
Register Register
Physical Physical
Address Address
Memory Controller

Physical Address
Main Memory (DRAM)

[ Can fold addition of base register into (base+offset) calculation using a


carry-save adder (sums three numbers with only a few gate delays
more than adding two numbers) ] 8
Memory Fragmentation
Users 4 & 5 Users 2 & 5
free
OS arrive OS leave OS
Space Space Space
user 1 16K user 1 16K 16K
user 1
user 2 24K user 2 24K 24K
user 4 16K
24K user 4 16K
8K 8K
user 3 32K user 3 32K user 3 32K
24K user 5 24K 24K

As users come and go, the storage is “fragmented”.


Therefore, at some stage programs have to be moved
around to compact the storage.
9
Paged Memory Systems
• Processor-generated address can be interpreted as a pair
<page number, offset>:
page number offset
• A page table contains the physical address of the base of each
page:
1
0 0 0
1 1 Physical
2 2 Memory
3 3 3
Address Space Page Table
of User-1 2
of User-1

Page tables make it possible to store the


pages of a program non-contiguously.

10
Private Address Space per User
OS
User 1 VA1 pages
Page Table

User 2

Physical Memory
VA1

Page Table

User 3 VA1

Page Table free

• Each user has a page table


• Page table contains an entry for each user page
11
Where Should Page Tables Reside?
• Space required by the page tables (PT) is
proportional to the address space, number of
users, (inverse to) size of each page, ...
– Space requirement is large
– Too expensive to keep in registers
• Idea: Keep PTs in the main memory
– needs one reference to retrieve the page base address
and another to access the data word
• doubles the number of memory references!
– Storage space to store PT grows with size of memory

12
Page Tables in Physical Memory
PT
User
1
VA1
PT

Physical Memory
User
User 1 Virtual 2
Address Space

VA1

User 2 Virtual
Address Space

13
Linear Page Table
• Page Table Entry (PTE) contains:
– A bit to indicate if a page exists
– PPN (physical page number) for a memory-resident page
– DPN (disk page number) for a page on the disk
– Status bits for protection and usage
• OS sets the Page Table Base Register whenever the active user process changes

[Diagram: the virtual address is split into VPN and Offset; the PT Base Register
plus the VPN selects a PTE, which holds either a PPN (page in memory) or a DPN
(page on disk); the PPN plus the Offset locates the data word in the data pages.]
14
Size of Linear Page Table
With 32-bit addresses, 4-KB pages & 4-byte PTEs:
=> 2^20 PTEs, i.e., 4 MB page table per user per process
=> 4 GB of swap needed to back up the full virtual address space

Larger pages?
• Internal fragmentation (not all memory in a page is used)
• Larger page fault penalty (more time to read from disk)

What about a 64-bit virtual address space???
• Even 1MB pages would require 2^44 8-byte PTEs (35 TB!)
What is the “saving grace”?
15
Hierarchical Page Table

Virtual address (32 bits): p1 (10-bit L1 index, bits 31-22) | p2 (10-bit L2 index,
bits 21-12) | offset (bits 11-0)

[Diagram: a processor register holds the root of the current page table; p1 indexes
the Level 1 page table to find a Level 2 page table, and p2 indexes that Level 2
table to find the data page. Level 2 page tables and data pages may be in memory or
on disk, and a PTE may mark a nonexistent page.]
16
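
A minimal C sketch of the two-level walk described above (illustrative only;
phys_mem_word is an assumed helper that reads a 32-bit word of physical memory, and
the PTE layout, with the valid bit in bit 0 and the table/page base in the upper
bits, is an assumption):

#include <stdint.h>

#define PTE_VALID 0x1u
extern uint32_t phys_mem_word(uint32_t paddr);   /* assumed helper */

/* Two-level walk for a 32-bit virtual address with 4 KB pages:
   10-bit L1 index, 10-bit L2 index, 12-bit offset.
   Returns the physical address, or -1 on a missing page (page fault). */
int64_t walk(uint32_t root, uint32_t va) {
    uint32_t p1 = (va >> 22) & 0x3FF;
    uint32_t p2 = (va >> 12) & 0x3FF;

    uint32_t l1_pte = phys_mem_word(root + 4 * p1);
    if (!(l1_pte & PTE_VALID)) return -1;                  /* L2 table not resident */

    uint32_t l2_pte = phys_mem_word((l1_pte & ~0xFFFu) + 4 * p2);
    if (!(l2_pte & PTE_VALID)) return -1;                  /* data page not resident */

    return (int64_t)((l2_pte & ~0xFFFu) | (va & 0xFFFu));  /* PPN | offset */
}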
Two-Level Page Tables in Physical
Virtual
Memory Physical
Memory
Address
Spaces Level 1 PT
User 1
VA1
Level 1 PT
User 2
User 1

User2/VA1
VA1 User1/VA1

User 2
Level 2 PT
User 2
17
Address Translation & Protection
Virtual Address Virtual Page No. (VPN) offset
Kernel/User Mode

Read/Write
Protection Address
Check Translation

Exception?
Physical Address Physical Page No. (PPN) offset

• Every instruction and data access needs address


translation and protection checks

A good Virtual Memory (VM) design needs to be fast


(~ one cycle) and space efficient
18
Translation Lookaside Buffers (TLB)
Problem: Address translation is very expensive!
In a two-level page table, each reference
becomes several memory accesses
Solution: Cache translations in TLB
TLB hit => Single-Cycle Translation
TLB miss => Page-Table Walk to refill

virtual address VPN offset

VRWD tag PPN (VPN = virtual page number)

(PPN = physical page number)

hit? physical address PPN offset

19
TLB Designs
• Typically 16-128 entries, usually fully associative
– Each entry maps a large page, hence less spatial locality across
pages => more likely that two entries conflict
– Sometimes larger TLBs (256-512 entries) are 4-8 way set-
associative
– Larger systems sometimes have multi-level (L1 and L2) TLBs
• Random (Clock Algorithm) or FIFO replacement policy
• No process information in TLB
– Flush TLB on Process Context Switch
• TLB Reach: Size of largest virtual address space that can be
simultaneously mapped by TLB
Example: 64 TLB entries, 4KB pages, one page per entry

TLB Reach = _____________________________________?

20
TLB Designs
• Typically 16-128 entries, usually fully associative
– Each entry maps a large page, hence less spatial locality across
pages => more likely that two entries conflict
– Sometimes larger TLBs (256-512 entries) are 4-8 way set-
associative
– Larger systems sometimes have multi-level (L1 and L2) TLBs
• Random (Clock Algorithm) or FIFO replacement policy
• No process information in TLB
– Flush TLB on Process Context Switch
• TLB Reach: Size of largest virtual address space that can be
simultaneously mapped by TLB
Example: 64 TLB entries, 4KB pages, one page per entry

64 entries * 4 KB = 256 KB (if contiguous)


TLB Reach = _____________________________________?

21
TLB Extensions
• Address Space Identifier (ASID)
– Allow TLB Entries from multiple processes to be in
TLB at same time. ID of address space (Process) is
matched on.
– Global Bit (G) can match on all ASIDs
• Variable Page Size (PS)
– Can increase reach on a per page basis
VRWD tag PPN PS G ASID

22
Handling a TLB Miss
Software (MIPS, Alpha)
TLB miss causes an exception and the operating system
walks the page tables and reloads TLB. A privileged
“untranslated” addressing mode used for walk

Hardware (SPARC v8, x86, PowerPC)


A memory management unit (MMU) walks the page
tables and reloads the TLB

If a missing (data or PT) page is encountered during the


TLB reloading, MMU gives up and signals a Page-Fault
exception for the original instruction

23
Hierarchical Page Table Walk:
SPARC v8
Virtual Address Index 1 Index 2 Index 3 Offset
31 23 17 11 0
Context Context Table
Table
Register L1 Table
root ptr
Context
Register L2 Table
PTP L3 Table
PTP

PTE

31 11
Physical Address 0 PPN Offset

MMU does this table walk in hardware on a TLB miss


24
Page-Based Virtual-Memory Machine
(Hardware Page-Table Walk)
Page Fault? Page Fault?
Protection violation? Protection violation?
Virtual Virtual
Address Physical Address Physical
Address Address
Inst. Inst. Decode Data Data
PC D E + M W
TLB Cache TLB Cache

Miss? Miss?
Page-Table Base
Register Hardware Page
Table Walker

Physical Physical
Memory Controller Address
Address
Physical Address
Main Memory (DRAM)
• Assumes page tables held in untranslated physical memory
25
Address Translation: putting it all together

[Flowchart: Virtual Address -> TLB Lookup (hardware).
On a hit: Protection Check. If access is permitted -> Physical Address (to cache);
if denied -> Protection Fault (SEGFAULT).
On a miss: Page Table Walk (hardware or software). If the page is in memory ->
Update TLB, then restart the instruction; if the page is not in memory -> Page Fault
(OS loads the page, in software), then restart the instruction.]
26
Modern Virtual Memory Systems
Illusion of a large, private, uniform store
Protection & Privacy OS
several users, each with their private
address space and one or more
shared address spaces useri
page table  name space
Swapping
Demand Paging Store
Provides the ability to run programs Primary
larger than the primary memory Memory

Hides differences in machine


configurations

The price is address translation on


each memory reference VA mapping PA
TLB
27
Address Translation in CPU Pipeline
Inst Inst. Data Data
PC TLB D Decode E + M W
Cache TLB Cache

TLB miss? Page Fault? TLB miss? Page Fault?


Protection violation? Protection violation?

• Software handlers need restartable exception on TLB fault


• Handling a TLB miss needs a hardware or software
mechanism to refill TLB
• Need to cope with additional latency of TLB:
– slow down the clock?
– pipeline the TLB and cache access?
– virtual address caches
– parallel TLB/cache access
28
Virtual-Address Caches
PA
VA Physical Primary
CPU TLB
Cache Memory

Alternative: place the cache before the TLB


VA
Primary
Memory (StrongARM)
Virtual PA
CPU TLB
Cache

• one-step process in case of a hit (+)


• cache needs to be flushed on a context switch unless address space
identifiers (ASIDs) included in tags (-)
• aliasing problems due to the sharing of pages (-)
• maintaining cache coherence (-) (see later in course)

29
Virtually Addressed Cache
(Virtual Index/Virtual Tag)
Virtual Virtual
Address Address

Inst. Data
PC D Decode E + M W
Cache Cache
Miss? Miss?
Inst.
Page-Table Base Data
TLB Register Hardware Page
TLB
Physical Table Walker
Address Physical
Instruction Memory Controller Address
data
Physical Address
Main Memory (DRAM)
Translate on miss

30
Aliasing in Virtual-Address Caches
Page Table Tag Data
VA1
Data Pages VA1 1st Copy of Data at PA

PA VA2 2nd Copy of Data at PA


VA2
Virtual cache can have two
copies of same physical data.
Two virtual pages share
Writes to one copy not visible
one physical page
to reads of other!

General Solution: Prevent aliases coexisting in cache


Software (i.e., OS) solution for direct-mapped cache
VAs of shared pages must agree in cache index bits; this
ensures all VAs accessing same PA will conflict in direct-
mapped cache (early SPARCs)

31
Cache-TLB Interactions
• Physically Indexed/Physically Tagged
• Virtually Indexed/Virtually Tagged
• Virtually Indexed/Physically Tagged
– Concurrent cache access with TLB Translation
• Both Indexed/Physically Tagged
– Small enough cache or highly associative cache
will have fewer indexes than page size
– Concurrent cache access with TLB Translation
• Physically Indexed/Virtually Tagged
32
Acknowledgements
• These slides contain material developed and copyright by:
– Arvind (MIT)
– Krste Asanovic (MIT/UCB)
– Joel Emer (Intel/MIT)
– James Hoe (CMU)
– John Kubiatowicz (UCB)
– David Patterson (UCB)
– Christopher Batten (Cornell)

• MIT material derived from course 6.823


• UCB material derived from course CS252 & CS152
• Cornell material derived from course ECE 4750

33
Computer Architecture
ELE 475 / COS 475
Slide Deck 11: Vector, SIMD, and GPUs
David Wentzlaff
Department of Electrical Engineering
Princeton University

1
Agenda
• Vector Processors
• Single Instruction Multiple Data (SIMD)
Instruction Set Extensions
• Graphics Processing Units (GPU)

2
Vector Programming Model
Scalar Registers Vector Registers
r15 v15

r0 v0
[0] [1] [2] [VLRMAX-1]
Vector Length Register VLR

3
Vector Programming Model
Scalar Registers Vector Registers
r15 v15

r0 v0
[0] [1] [2] [VLRMAX-1]
Vector Length Register VLR

V1
Vector Arithmetic V2
Instructions + + + + + +
ADDVV V3, V1, V2 V3
[0] [1] [VLR-1]

4
Vector Programming Model
Scalar Registers Vector Registers
r15 v15

r0 v0
[0] [1] [2] [VLRMAX-1]
Vector Length Register VLR

V1
Vector Arithmetic V2
Instructions + + + + + +
ADDVV V3, V1, V2 V3
[0] [1] [VLR-1]

Vector Load and Vector Register


V1
Store Instructions
LV v1, r1, r2

Memory
Base, r1 Stride, r2
5
Vector Code Element-by-Element
Multiplication
# C code # Scalar Assembly Code # Vector Assembly Code
for (i=0; i<64; i++) LI R4, 64 LI VLR, 64
C[i] = A[i] * B[i]; loop: LV V1, R1
L.D F0, 0(R1) LV V2, R2
L.D F2, 0(R2) MULVV.D V3, V1, V2
MUL.D F4, F2, F0 SV V3, R3
S.D F4, 0(R3)
DADDIU R1, 8
DADDIU R2, 8
DADDIU R3, 8
DSUBIU R4, 1
BNEZ R4, loop

6
Vector Arithmetic Execution
• Use deep pipeline (=> fast clock) to
execute element operations V1 V2 V3
• Simplifies control of deep pipeline
because elements in vector are
independent
• no data hazards!
• no bypassing needed
Six stage multiply pipeline

V3 <- V1 * V2

7
Interleaved Vector Memory System
Cray-1, 16 banks, 4 cycle bank busy time, 12 cycle latency
• Bank busy time: Time before bank ready to accept next request

Base Stride
Vector Registers

Address
Generator
+

0 1 2 3 4 5 6 7 8 9 A B C D E F
Memory Banks

8
Example Vector Microarchitecture
SRF
VLR
X0 VRF

F D R L0 L1 W
S0 S1
Y0 Y1 Y2 Y3
Commit Point

9
Basic Vector Execution
# C code # Vector Assembly Code
for (i=0; i<4; i++) LI VLR, 4
C[i] = A[i] * B[i]; LV V1, R1
LV V2, R2
VLR = 4 MULVV.D V3, V1, V2
SV V3, R3
LV V2, R2 F D R L0 L1 W
R L0 L1 W
R L0 L1 W
R L0 L1 W
MULVV.D V3, V1, V2 F D D D D D D D R Y0 Y1 Y2 Y3 W
R Y0 Y1 Y2 Y3 W
R Y0 Y1 Y2 Y3 W
R Y0 Y1 Y2 Y3 W
SV V3, R3 F F F F F F F D D D D D D D D D R S0 S1 W
R S0 S1 W
R S0 S1 W
R S0 S1 W
10
Vector Instruction Parallelism
• Can overlap execution of multiple vector instructions
– example machine has 32 elements per vector register and
8 lanes
Load Unit Multiply Unit Add Unit

time

Instruction
issue
11
Vector Instruction Parallelism
• Can overlap execution of multiple vector instructions
– example machine has 32 elements per vector register and
8 lanes
Load Unit Multiply Unit Add Unit
load
mul
add
time
load
mul
add

Instruction
issue
Complete 24 operations/cycle while issuing 1 short instruction/cycle12
Vector Chaining
• Vector version of register bypassing
– introduced with Cray-1

LV V1
MULVV V3,V1,v2
ADDVV V5,V3, v4

13
Vector Chaining
• Vector version of register bypassing
– introduced with Cray-1

V1 V2 V3 V4 V5
LV V1
MULVV V3,V1,v2
ADDVV V5,V3, v4

Chain Chain

Load
Unit
Mult. Add

Memory

14
Vector Chaining Advantage
• Without chaining, must wait for last element of result to be
written before starting dependent instruction
Load
Mul
Time Add

• With chaining, can start dependent instruction as soon as first


result appears
Load
Mul
Add
15
Chaining (Register File) Vector
Execution
# C code # Vector Assembly Code
for (i=0; i<4; i++) LI VLR, 4
C[i] = A[i] * B[i]; LV V1, R1
VLR = 4
LV V2, R2
MULVV.D V3, V1, V2
SV V3, R3
LV V2, R2 F D R L0 L1 W
R L0 L1 W
R L0 L1 W
R L0 L1 W
MULVV.D V3, V1, V2 F D D D D R Y0 Y1 Y2 Y3 W
R Y0 Y1 Y2 Y3 W
R Y0 Y1 Y2 Y3 W
R Y0 Y1 Y2 Y3 W
SV V3, R3 F F F F D D D D D D R S0 S1 W
R S0 S1 W
R S0 S1 W
16
R S0 S1 W
Chaining (Bypass Network) Vector
Execution
# C code # Vector Assembly Code
for (i=0; i<4; i++) LI VLR, 4
C[i] = A[i] * B[i]; LV V1, R1
VLR = 4
LV V2, R2
MULVV.D V3, V1, V2
SV V3, R3
LV V2, R2 F D R L0 L1 W
R L0 L1 W
R L0 L1 W
R L0 L1 W
MULVV.D V3, V1, V2 F D D D D R Y0 Y1 Y2 Y3 W
R Y0 Y1 Y2 Y3 W
R Y0 Y1 Y2 Y3 W
R Y0 Y1 Y2 Y3 W
SV V3, R3 F F F F D D D D D D R S0 S1 W
R S0 S1 W
R S0 S1 W
17
R S0 S1 W
Chaining (Bypass Network) Vector
Execution and More RF Ports
# C code # Vector Assembly Code
for (i=0; i<4; i++) LI VLR, 4
C[i] = A[i] * B[i]; LV V1, R1
VLR = 4
LV V2, R2
MULVV.D V3, V1, V2
SV V3, R3
LV V2, R2 F D R L0 L1 W
R L0 L1 W
R L0 L1 W
R L0 L1 W
MULVV.D V3, V1, V2 F D D R Y0 Y1 Y2 Y3 W
R Y0 Y1 Y2 Y3 W
R Y0 Y1 Y2 Y3 W
R Y0 Y1 Y2 Y3 W
SV V3, R3 F F D D D D R S0 S1 W
R S0 S1 W
R S0 S1 W
18
R S0 S1 W
Chaining (Bypass Network) Vector
VLR = 8
Execution and More RF Ports
LV V2, R2 F D R L0 L1 W
R L0 L1 W
R L0 L1 W
R L0 L1 W
R L0 L1 W
R L0 L1 W
R L0 L1 W
R L0 L1 W
MULVV.D V3, V1, V2 F D D R Y0 Y1 Y2 Y3 W
R Y0 Y1 Y2 Y3 W
R Y0 Y1 Y2 Y3 W
R Y0 Y1 Y2 Y3 W
R Y0 Y1 Y2 Y3 W
R Y0 Y1 Y2 Y3 W
R Y0 Y1 Y2 Y3 W
R Y0 Y1 Y2 Y3 W
SV V3, R3 F F D D D D R S0 S1 W
R S0 S1 W
R S0 S1 W
R S0 S1 W
R S0 S1 W
R S0 S1 W
R S0 S1 W 19
R S0 S1 W
Vector Stripmining
Problem: Vector registers have finite length
Solution: Break loops into pieces that fit in registers, “Stripmining”

20
Vector Stripmining
Problem: Vector registers have finite length
Solution: Break loops into pieces that fit in registers, “Stripmining”
ANDI R1, N, 63 # N mod 64
for (i=0; i<N; i++) MTC1 VLR, R1 # Do remainder
C[i] = A[i]*B[i]; loop:
A B C LV V1, RA
LV V2, RB
+ Remainder MULVV.D V3, V1, V2
SV V3, RC
DSLL R2, R1, 3 # Multiply by 8
+ 64 elements DADDU RA, RA, R2 # Bump pointer
DADDU RB, RB, R2
DADDU RC, RC, R2
DSUBU N, N, R1 # Subtract elements
+ LI R1, 64
MTC1 VLR, R1 # Reset full length
21
BGTZ N, loop # Any more to do?
Vector Stripmining
VLR = 4
LV F D R L0 L1 W
R L0 L1 W
R L0 L1 W
R L0 L1 W
LV V2, R2 F D R L0 L1 W
R L0 L1 W
R L0 L1 W
R L0 L1 W
MULVV.D V3, V1, V2 F D D R Y0 Y1 Y2 Y3 W
R Y0 Y1 Y2 Y3 W
R Y0 Y1 Y2 Y3 W
R Y0 Y1 Y2 Y3 W
SV V3, R3 F F D D D D R S0 S1 W
R S0 S1 W
R S0 S1 W
R S0 S1 W
DSLL R2, R1, 3 F F F F D R X W
DADDU RA, RA, R2 F D R X W
DADDU RB, RB, R2 F D R X W
DADDU RC, RC, R2 F D R X W
DSUBU N, N, R1 F D R X W
LI R1, 64 F D R X W
MTC1 VLR, R1 F D R X W
22
BGTZ N, loop F D R X W
Vector Instruction Execution
MULVV C,A,B

23
Vector Instruction Execution
MULVV C,A,B

Execution using
one pipelined
functional unit

A[6] B[6]
A[5] B[5]
A[4] B[4]
A[3] B[3]

C[2]

C[1]
C[0]

24
Vector Instruction Execution
MULVV C,A,B

Execution using Execution using


one pipelined four pipelined
functional unit functional units

A[6] B[6] A[24] B[24] A[25] B[25] A[26] B[26] A[27] B[27]
A[5] B[5] A[20] B[20] A[21] B[21] A[22] B[22] A[23] B[23]
A[4] B[4] A[16] B[16] A[17] B[17] A[18] B[18] A[19] B[19]
A[3] B[3] A[12] B[12] A[13] B[13] A[14] B[14] A[15] B[15]

C[2] C[8] C[9] C[10] C[11]

C[1] C[4] C[5] C[6] C[7]


C[0] C[0] C[1] C[2] C[3]

25
Two Lane Vector Microarchitecture
SRF
VLR
X0 VRF

F D R L0 L1
S0 S1
Y0 Y1 Y2 Y3 W
X0
L0 L1
S0 S1
Y0 Y1 Y2 Y3
26
Vector Stripmining 2-Lanes
VLR = 4
LV F D R L0 L1 W
R L0 L1 W
R L0 L1 W
R L0 L1 W
LV F D D R L0 L1 W
R L0 L1 W
R L0 L1 W
R L0 L1 W
MULVV.D F F D D R Y0 Y1 Y2 Y3 W
R Y0 Y1 Y2 Y3 W
R Y0 Y1 Y2 Y3 W
R Y0 Y1 Y2 Y3 W
SV F F D D D D R S0 S1 W
R S0 S1 W
R S0 S1 W
R S0 S1 W
DSLL R2, R1, 3 F F F F D R X W
DADDU RA, RA, R2 F D R X W
DADDU RB, RB, R2 F D R X W
DADDU RC, RC, R2 F D R X W
DSUBU N, N, R1 F D R X W
LI R1, 64 F D R X W
MTC1 VLR, R1 F D R X W
27
BGTZ N, loop F D R X W
Vector Unit Structure

Vector
Registers
Elements 0, Elements 1, Elements 2, Elements 3,
4, 8, … 5, 9, … 6, 10, … 7, 11, …

Memory Subsystem
28
Vector Unit Structure

Vector
Registers
Elements 0, Elements 1, Elements 2, Elements 3,
4, 8, … 5, 9, … 6, 10, … 7, 11, …

Lane

Memory Subsystem
29
Vector Unit Structure
Functional Unit

Vector
Registers
Elements 0, Elements 1, Elements 2, Elements 3,
4, 8, … 5, 9, … 6, 10, … 7, 11, …

Lane

Memory Subsystem
30
T0 Vector Microprocessor (UCB/ICSI, 1995)

Lane

Photo of Berkeley T0, © University of California (Berkeley)


31
http://www1.icsi.berkeley.edu/Speech/spert/t0die.jpg
T0 Vector Microprocessor (UCB/ICSI, 1995)

Vector register Lane


elements striped
over lanes
[24][25] [26] [27][28] [29] [30] [31]
[16][17] [18] [19][20] [21] [22] [23]
[8] [9] [10] [11][12] [13] [14] [15]
[0] [1] [2] [3] [4] [5] [6] [7]

Photo of Berkeley T0, © University of California (Berkeley)


32
http://www1.icsi.berkeley.edu/Speech/spert/t0die.jpg
Vector Instruction Set Advantages
• Compact
– one short instruction encodes N operations
• Expressive, tells hardware that these N operations:
– are independent
– use the same functional unit
– access disjoint registers
– access registers in same pattern as previous instructions
– access a contiguous block of memory (unit-stride
load/store)
– access memory in a known pattern (strided load/store)
• Scalable
– can run same code on more parallel pipelines (lanes)

33
Automatic Code Vectorization
for (i=0; i < N; i++)
C[i] = A[i] * B[i];

34
Automatic Code Vectorization
for (i=0; i < N; i++)
C[i] = A[i] * B[i];
Scalar Sequential Code

load

Iter. 1 load

mul

store

load

Iter. 2 load

mul

store 35
Automatic Code Vectorization
for (i=0; i < N; i++)
C[i] = A[i] * B[i];
Scalar Sequential Code Vectorized Code

load load load

Iter. 1 load load load

mul Time mul mul

store store store

load
Iter. Iter.
Iter. 2 load 1 2 Vector Instruction

mul

store 36
Automatic Code Vectorization
for (i=0; i < N; i++)
C[i] = A[i] * B[i];
Scalar Sequential Code Vectorized Code

load load load

Iter. 1 load load load

mul Time mul mul

store store store

load
Iter. Iter.
Iter. 2 load 1 2 Vector Instruction

mul
Vectorization is a massive compile-time
reordering of operation sequencing
 requires extensive loop dependence analysis
store 37
Vector Conditional Execution
Problem: Want to vectorize loops with conditional code:
for (i=0; i<N; i++)
if (A[i]>0) then
A[i] = B[i];

Solution: Add vector mask (or flag) registers


– vector version of predicate registers, 1 bit per element
…and maskable vector instructions
– vector operation becomes NOP at elements where mask bit is clear
Code example:
CVM # Turn on all elements
LV VA, RA # Load entire A vector
SGTVS.D VA, F0 # Set bits in mask register where A>0
LV VA, RB # Load B vector into A under mask
SV VA, RA # Store A back to memory under mask

38
Masked Vector Instructions
Simple Implementation
– execute all N operations, turn off
result writeback according to mask

M[7]=1 A[7] B[7]


M[6]=0 A[6] B[6]
M[5]=1 A[5] B[5]
M[4]=1 A[4] B[4]
M[3]=0 A[3] B[3]

M[2]=0 C[2]

M[1]=1 C[1]

M[0]=0 C[0]

Write Enable Write data port

39
Masked Vector Instructions
Simple Implementation Density-Time Implementation
– execute all N operations, turn off – scan mask vector and only execute
result writeback according to mask elements with non-zero masks

M[7]=1 A[7] B[7] M[7]=1


M[6]=0 A[6] B[6] M[6]=0 A[7] B[7]
M[5]=1 A[5] B[5] M[5]=1
M[4]=1 A[4] B[4] M[4]=1
M[3]=0 A[3] B[3] M[3]=0 C[5]

M[2]=0 C[4]
M[1]=1
M[2]=0 C[2]
M[0]=0
M[1]=1 C[1] C[1]

Write data port

M[0]=0 C[0]

Write Enable Write data port

40
Vector Reductions
Problem: Loop-carried dependence on reduction variables
sum = 0;
for (i=0; i<N; i++)
sum += A[i]; # Loop-carried dependence on sum
Solution: Re-associate operations if possible, use binary tree to perform reduction
# Rearrange as:
sum[0:VL-1] = 0 # Vector of VL partial sums
for(i=0; i<N; i+=VL) # Stripmine VL-sized chunks
sum[0:VL-1] += A[i:i+VL-1]; # Vector sum
# Now have VL partial sums in one vector register
do {
VL = VL/2; # Halve vector length
sum[0:VL-1] += sum[VL:2*VL-1] # Halve no. of partials
} while (VL>1)

41
Vector Scatter/Gather

Want to vectorize loops with indirect accesses:


for (i=0; i<N; i++)
A[i] = B[i] + C[D[i]]

Indexed load instruction (Gather)


LV vD, rD # Load indices in D vector
LVI vC, rC, vD # Load indirect from rC base
LV vB, rB # Load B vector
ADDV.D vA,vB,vC # Do add
SV vA, rA # Store result

42
Vector Supercomputers
Epitomized by Cray-1, 1976:

• Scalar Unit
– Load/Store Architecture
• Vector Extension
– Vector Registers
– Vector Instructions
• Implementation
– Hardwired Control
– Highly Pipelined Functional
Units
– Interleaved Memory System
– No Data Caches
– No Virtual Memory
Cray 1 at The Deutsches Museum
Image Credit: Clemens Pfeiffer 43
http://en.wikipedia.org/wiki/File:Cray-1-deutsches-museum.jpg
Cray-1 (1976)

[Block diagram: eight 64-element vector registers (V0-V7) with vector mask and
vector length registers, scalar registers (S0-S7) backed by 64 T registers, and
address registers (A0-A7) backed by 64 B registers, feeding pipelined functional
units for FP add/multiply/reciprocal, integer add/logic/shift/population count, and
address add/multiply. Single-port memory, 16 banks of 64-bit words, 8-bit SECDED;
80 MW/sec data load/store, 320 MW/sec instruction buffer refill; four 64-bit x 16
instruction buffers with NIP/CIP/LIP instruction registers.]

memory bank cycle 50 ns    processor cycle 12.5 ns (80 MHz)
44
Agenda
• Vector Processors
• Single Instruction Multiple Data (SIMD)
Instruction Set Extensions
• Graphics Processing Units (GPU)

45
SIMD / Multimedia Extensions
64b
32b 32b

16b 16b 16b 16b

8b 8b 8b 8b 8b 8b 8b 8b
• Very short vectors added to existing ISAs for microprocessors
• Use existing 64-bit registers split into 2x32b or 4x16b or 8x8b
– This concept first used on Lincoln Labs TX-2 computer in 1957, with 36b
datapath split into 2x18b or 4x9b
– Newer designs have 128-bit registers (PowerPC Altivec, Intel SSE2/3/4)
or 256-bit registers (Intel AVX)
• Single instruction operates on all elements within register
16b 16b 16b 16b

16b 16b 16b 16b

4x16b adds + + + +

16b 16b 16b 16b 46
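
As a concrete illustration, a short C sketch using Intel SSE2 intrinsics for the
same kind of packed 16-bit add, eight elements per 128-bit register (the array
length being a multiple of 8 and the use of unaligned loads are simplifying
assumptions):

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>

/* c[i] = a[i] + b[i] on 16-bit elements, eight at a time. */
void add16(int16_t *c, const int16_t *a, const int16_t *b, int n) {
    for (int i = 0; i < n; i += 8) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(c + i), _mm_add_epi16(va, vb));
    }
}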


Multimedia Extensions versus Vectors
• Limited instruction set:
– no vector length control
– no strided load/store or scatter/gather
– unit-stride loads must be aligned to 64/128-bit boundary
• Limited vector register length:
– requires superscalar dispatch to keep multiply/add/load units busy
– loop unrolling to hide latencies increases register pressure
• Trend towards fuller vector support in
microprocessors
– Better support for misaligned memory accesses
– Support of double-precision (64-bit floating-point)
– New Intel AVX spec (announced April 2008), 256b vector registers
(expandable up to 1024b)

47
Agenda
• Vector Processors
• Single Instruction Multiple Data (SIMD)
Instruction Set Extensions
• Graphics Processing Units (GPU)

48
Graphics Processing Units (GPUs)
• Original GPUs were dedicated fixed-function devices for
generating 3D graphics (mid-late 1990s) including high-
performance floating-point units
– Provide workstation-like graphics for PCs
– User could configure graphics pipeline, but not really program it
• Over time, more programmability added (2001-2005)
– E.g., New language Cg for writing small programs run on each
vertex or each pixel, also Windows DirectX variants
– Massively parallel (millions of vertices or pixels per frame) but
very constrained programming model
• Some users noticed they could do general-purpose
computation by mapping input and output data to images,
and computation to vertex and pixel shading computations
– Incredibly difficult programming model as had to use graphics
pipeline model for general computation
49
General Purpose GPUs (GPGPUs)
• In 2006, Nvidia introduced GeForce 8800 GPU supporting a
new programming language: CUDA
– “Compute Unified Device Architecture”
– Subsequently, broader industry pushing for OpenCL, a vendor-
neutral version of same ideas.
• Idea: Take advantage of GPU computational performance
and memory bandwidth to accelerate some kernels for
general-purpose computing
• Attached processor model: Host CPU issues data-parallel
kernels to GP-GPU for execution
• This lecture has a simplified version of Nvidia CUDA-style
model and only considers GPU execution for computational
kernels, not graphics

50
Simplified CUDA Programming Model
• Computation performed by a very large number of
independent small scalar threads (CUDA threads or
microthreads) grouped into thread blocks.
// C version of DAXPY loop.
void daxpy(int n, double a, double*x, double*y)
{ for (int i=0; i<n; i++)
y[i] = a*x[i] + y[i]; }

// CUDA version.
__host__ // Piece run on host processor.
int nblocks = (n+255)/256; // 256 CUDA threads/block
daxpy<<<nblocks,256>>>(n,2.0,x,y);
__device__ // Piece run on GPGPU.
void daxpy(int n, double a, double*x, double*y)
{ int i = blockIdx.x*blockDim.x + threadId.x;
if (i<n) y[i]=a*x[i]+y[i]; }

51
“Single Instruction, Multiple Thread”
• GPUs use a SIMT model, where individual scalar
instruction streams for each CUDA thread are grouped
together for SIMD execution on hardware (Nvidia
groups 32 CUDA threads into a warp)

µT0 µT1 µT2 µT3 µT4 µT5 µT6 µT7


ld x
Scalar mul a
instruction ld y
stream add
st y

SIMD execution across warp

52
Hardware Execution Model
Lane 0 Lane 0 Lane 0
Lane 1 Lane 1 Lane 1
CPU

Lane 15 Lane 15 Lane 15


Core 0 Core 1 Core 15
CPU Memory GPU

GPU Memory

• GPU is built from multiple parallel cores, each core contains a


multithreaded SIMD processor with multiple lanes but with
no scalar processor
• CPU sends whole “grid” over to GPU, which distributes thread
blocks among cores (each thread block executes on one core)
– Programmer unaware of number of cores

53
Implications of SIMT Model
• All “vector” loads and stores are scatter-gather,
as individual µthreads perform scalar loads and
stores
– GPU adds hardware to dynamically coalesce individual
µthread loads and stores to mimic vector loads and
stores
• Every µthread has to perform stripmining
calculations redundantly (“am I active?”) as there
is no scalar processor equivalent
• If divergent control flow, need predicates

55
GPGPUs are Multithreaded SIMD

Image Credit: NVIDIA 56


http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
Nvidia Fermi GF100 GPU

57
Image Credit: NVIDIA [Wittenbrink, Kilgariff, and Prabhu, Hot Chips 2010]
Fermi “Streaming
Multiprocessor” Core

Image Credit: NVIDIA


58
[Wittenbrink, Kilgariff, and Prabhu, Hot Chips 2010]
Acknowledgements
• These slides contain material developed and copyright by:
– Arvind (MIT)
– Krste Asanovic (MIT/UCB)
– Joel Emer (Intel/MIT)
– James Hoe (CMU)
– John Kubiatowicz (UCB)
– David Patterson (UCB)
– Christopher Batten (Cornell)

• MIT material derived from course 6.823


• UCB material derived from course CS252 & CS152
• Cornell material derived from course ECE 4750

59
Copyright © 2013 David Wentzlaff

60
Computer Architecture
ELE 475 / COS 475
Slide Deck 12: Multithreading
David Wentzlaff
Department of Electrical Engineering
Princeton University

1
Agenda
• Multithreading Motivation
• Coarse-Grain Multithreading
• Simultaneous Multithreading

2
Multithreading
• Difficult to continue to extract instruction-level
parallelism (ILP) or data level parallelism (DLP)
from a single sequential thread of control
• Many workloads can make use of thread-level
parallelism (TLP)
– TLP from multiprogramming (run independent sequential jobs)
– TLP from multithreaded applications (run one job faster using
parallel threads)
• Multithreading uses TLP to improve utilization of
a single processor

3
Pipeline Hazards
t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14

LW r1, 0(r2) F D X MW
LW r5, 12(r1) F D D D D X MW
ADDI r5, r5, #12 F F F F D D D D X MW
SW 12(r1), r5 F F F F D D D D

• Each instruction may depend on the next

4
Pipeline Hazards
t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14

LW r1, 0(r2) F D X MW
LW r5, 12(r1) F D D D D X MW
ADDI r5, r5, #12 F F F F D D D D X MW
SW 12(r1), r5 F F F F D D D D

• Each instruction may depend on the next

What is usually done to cope with this?

5
Pipeline Hazards
t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14

LW r1, 0(r2) F D X MW
LW r5, 12(r1) F D D D D X MW
ADDI r5, r5, #12 F F F F D D D D X MW
SW 12(r1), r5 F F F F D D D D

• Each instruction may depend on the next

What is usually done to cope with this?


– interlocks (slow)
– or bypassing (needs hardware, doesn’t help all
hazards)

6
Multithreading
How can we guarantee no dependencies between
instructions in a pipeline?
-- One way is to interleave execution of instructions
from different program threads on same pipeline

7
Multithreading
How can we guarantee no dependencies between
instructions in a pipeline?
-- One way is to interleave execution of instructions
from different program threads on same pipeline
Interleave 4 threads, T1-T4, on non-bypassed 5-stage pipe

t0 t1 t2 t3 t4 t5 t6 t7 t8 t9

T1: LW r1, 0(r2) F D X M W


T2: ADD r7, r1, r4 F D X M W
T3: XORI r5, r4, #12 F D X MW
T4: SW 0(r7), r5 F D X MW
T1: LW r5, 12(r1) F D X MW

8
Multithreading
How can we guarantee no dependencies between
instructions in a pipeline?
-- One way is to interleave execution of instructions
from different program threads on same pipeline
Interleave 4 threads, T1-T4, on non-bypassed 5-stage pipe

t0 t1 t2 t3 t4 t5 t6 t7 t8 t9

T1: LW r1, 0(r2) F D X M W Prior instruction in


T2: ADD r7, r1, r4 F D X M W a thread always
completes write-
T3: XORI r5, r4, #12 F D X MW back before next
T4: SW 0(r7), r5 F D X MW instruction in
T1: LW r5, 12(r1) same thread reads
F D X MW register file

9
Simple Multithreaded Pipeline

[Diagram: a pipeline with one PC and one GPR file per thread (four of each shown); a
thread-select signal chooses which PC fetches from the I$ and which GPR file is read
and written, and the selected thread ID is carried down the pipeline through the
function units and D$ to writeback.]

• Have to carry thread select down pipeline to ensure correct state bits
read/written at each pipe stage
• Appears to software (including OS) as multiple, albeit slower, CPUs

10
Multithreading Costs
• Each thread requires its own user state
– PC
– GPRs

• Also, needs its own system state


– virtual memory page table base register
– exception handling registers
– Other system state

• Other overheads:
– Additional cache/TLB conflicts from competing threads
– (or add larger cache/TLB capacity)
– More OS overhead to schedule more threads (where do all these
threads come from?)

11
Thread Scheduling Policies
• Fixed interleave (CDC 6600 PPUs, 1964)
– Each of N threads executes one instruction every N cycles
– If thread not ready to go in its slot, insert pipeline bubble
– Can potentially remove bypassing and interlocking logic

• Software-controlled interleave (TI ASC PPUs, 1971)


– OS allocates S pipeline slots amongst N threads
– Hardware performs fixed interleave over S slots, executing
whichever thread is in that slot

• Hardware-controlled thread scheduling (HEP, 1982)


– Hardware keeps track of which threads are ready to go
– Picks next thread to execute based on hardware priority
scheme

12
Coarse-Grain Hardware Multithreading
• Some architectures do not have many low-
latency bubbles
• Add support for a few threads to hide
occasional cache miss latency
• Swap threads in hardware on cache miss

13
Denelcor HEP
(Burton Smith, 1982)

BRL HEP Machine


Image Credit:
Denelcor

http://ftp.arl.army.mil/ftp/histori
c-computers/png/hep2.png

First commercial machine to use hardware threading in


main CPU
– 120 threads per processor
– 10 MHz clock rate
– Up to 8 processors
– precursor to Tera MTA / Cray XMT (Multithreaded Architecture)
14
Tera (Cray) MTA (1990)
• Up to 256 processors
• Up to 128 active threads per processor
• Processors and memory modules populate
a sparse 3D torus interconnection fabric
• Flat, shared main memory
– No data cache
– Sustains one main memory access per cycle per processor
• GaAs logic in prototype, 1KW/processor @ 260MHz
– Second version CMOS, MTA-2, 50W/processor
– New version, XMT, fits into AMD Opteron
socket, runs at 500MHz

Image Credit: Tera Computer Company
15
MTA Pipeline

• Every cycle, one VLIW instruction from one active thread is launched into the
pipeline
• Instruction pipeline is 21 cycles long
• Memory operations incur ~150 cycles of latency

[Diagram: instruction fetch selects from an issue pool; instructions flow through
the execution pipeline (stages labeled W, M, A, C) into a write pool and a retry
pool, and the memory pipeline goes through the interconnection network to the
memory pool.]
16
MIT Alewife (1990)

•Modified SPARC chips


– register windows hold
different thread contexts
•Up to four threads per node
•Thread switch on local cache
miss

Image Credit: MIT


17
Oracle/Sun Niagara processors
• Target is datacenters running web servers and
databases, with many concurrent requests
• Provide multiple simple cores, each with multiple
hardware threads, for reduced energy/operation,
though at much lower single-thread performance

• Niagara-1 [2004], 8 cores, 4 threads/core


• Niagara-2 [2007], 8 cores, 8 threads/core
• Niagara-3 [2009], 16 cores, 8 threads/core

18
Oracle/Sun Niagara-3, “Rainbow Falls” 2009

Image Credit: Oracle/Sun

19
From Hot Chips 2009 Presentation by Sanjay Patel
Simultaneous Multithreading (SMT)
for OOO Superscalars

• Techniques presented so far have all been


“vertical” multithreading where each pipeline
stage works on one thread at a time
• SMT uses fine-grain control already present
inside an OOO superscalar to allow
instructions from multiple threads to enter
execution on same clock cycle. Gives better
utilization of machine resources.

20
Ideal Superscalar Multithreading
[Tullsen, Eggers, Levy, UW, 1995]
Issue width

Time

• Interleave multiple threads to multiple issue


slots with no restrictions 21
For most apps, most execution units lie
idle in an OOO superscalar
(For an 8-way superscalar.)

Image From: Tullsen, Eggers, and Levy, "Simultaneous Multithreading:
Maximizing On-chip Parallelism", ISCA 1995. 22
Superscalar Machine Efficiency
Issue width
Instruction
issue
Completely idle cycle
(vertical waste)

Time
Partially filled cycle,
i.e., IPC < 4
(horizontal waste)

23
Vertical Multithreading
Issue width
Instruction
issue

Second thread interleaved


cycle-by-cycle

Time
Partially filled cycle,
i.e., IPC < 4
(horizontal waste)

• What is the effect of cycle-by-cycle interleaving?

24
Vertical Multithreading
Issue width
Instruction
issue

Second thread interleaved


cycle-by-cycle

Time
Partially filled cycle,
i.e., IPC < 4
(horizontal waste)

• What is the effect of cycle-by-cycle interleaving?


– removes vertical waste, but leaves some horizontal
waste 25
Chip Multiprocessing (CMP)
Issue width

Time

• What is the effect of splitting into multiple processors?

26
Chip Multiprocessing (CMP)
Issue width

Time

• What is the effect of splitting into multiple processors?


– reduces horizontal waste,
– leaves some vertical waste, and
– puts upper limit on peak throughput of each thread.
27
Ideal Superscalar Multithreading
[Tullsen, Eggers, Levy, UW, 1995]
Issue width

Time

• Interleave multiple threads to multiple issue


slots with no restrictions 28
OOO Simultaneous Multithreading
[Tullsen, Eggers, Emer, Levy, Stamm, Lo, DEC/UW, 1996]

• Add multiple contexts and fetch engines and


allow instructions fetched from different threads
to issue simultaneously
• Utilize wide out-of-order superscalar processor
issue queue to find instructions to issue from
multiple threads
• OOO instruction window already has most of the
circuitry required to schedule from multiple
threads
• Any single thread can utilize whole machine
29
SMT adaptation to parallelism type
For regions with high thread-level parallelism (TLP), the entire
machine width is shared by all threads. For regions with low
thread-level parallelism (TLP), the entire machine width is
available for instruction-level parallelism (ILP).

[Figure: issue width vs. time for the two cases]

30
Power 4

[POWER 4 system microarchitecture, Tendler et al, IBM J. Res. & Dev., Jan 2002]
Image Credit: IBM. Courtesy of International Business Machines, © International Business Machines.

Power 5
SMT additions over Power 4: 2 fetch (PC), 2 initial decodes,
2 commits (architected register sets)

[POWER 5 system microarchitecture, Sinharoy et al, IBM J. Res. & Dev., Jul/Sept 2005]
Image Credit: IBM. Courtesy of International Business Machines, © International Business Machines. 31
Power 5 data flow ...
Image Credit: Carsten Schulz

[POWER 5 system microarchitecture, Sinharoy et al, IBM J. Res. & Dev., Jul/Sept 2005] Image Credit: IBM
Courtesy of International Business Machines, © International Business Machines.

Why only 2 threads? With 4, one of the shared


resources (physical registers, cache, memory
bandwidth) would be prone to bottleneck
32
Changes in Power 5 to support SMT
• Increased associativity of L1 instruction cache and
the instruction address translation buffers
• Added per thread load and store queues
• Increased size of the L2 (1.92 vs. 1.44 MB) and L3
caches
• Added separate instruction prefetch and buffering
per thread
• Increased the number of virtual registers from 152
to 240
• Increased the size of several issue queues
• The Power5 core is about 24% larger than the
Power4 core because of the addition of SMT
support
33
Pentium-4 Hyperthreading (2002)
• First commercial SMT design (2-way SMT)
– Hyperthreading == SMT
• Logical processors share nearly all resources of the physical
processor
– Caches, execution units, branch predictors
• Die area overhead of hyperthreading ~ 5%
• When one logical processor is stalled, the other can make progress
– No logical processor can use all entries in queues when two threads are
active
• Processor running only one active software thread runs at
approximately same speed with or without hyperthreading
• Hyperthreading dropped on OOO P6 based follow-ons to Pentium-4
(Pentium-M, Core Duo, Core 2 Duo), until revived with Nehalem
generation machines in 2008.
• Intel Atom (in-order x86 core) has two-way vertical multithreading
34
Initial Performance of SMT
• Pentium 4 Extreme SMT yields 1.01 speedup for
SPECint_rate benchmark and 1.07 for SPECfp_rate
– Pentium 4 is dual threaded SMT
– SPECRate requires that each SPEC benchmark be run against a
vendor-selected number of copies of the same benchmark
• Running on Pentium 4, each of 26 SPEC benchmarks paired
with every other (26² runs): speed-ups from 0.90 to 1.58;
average was 1.20
• Power 5, 8-processor server 1.23 faster for SPECint_rate
with SMT, 1.16 faster for SPECfp_rate
• Power 5 running 2 copies of each app speedup between
0.89 and 1.41
– Most gained some
– Floating Point apps had most cache conflicts and least gains

35
Icount Choosing Policy
Fetch from thread with the least instructions in flight.

Why does this enhance throughput?


36
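A minimal C sketch of the ICOUNT fetch choice described above, not taken from the slides: each cycle, fetch from the thread with the fewest instructions currently in flight. The insts_in_flight array and the function name are illustrative assumptions.

#include <stdio.h>

#define NUM_THREADS 4

/* Hypothetical per-thread count of instructions in the front end /
 * issue queues ("in flight"). */
static int insts_in_flight[NUM_THREADS];

/* ICOUNT-style fetch choice: pick the thread with the fewest
 * instructions in flight. */
int icount_choose_fetch_thread(void) {
  int best = 0;
  for (int t = 1; t < NUM_THREADS; t++)
    if (insts_in_flight[t] < insts_in_flight[best])
      best = t;
  return best;
}

int main(void) {
  insts_in_flight[0] = 12;
  insts_in_flight[1] = 3;   /* making quick progress, so it is favored */
  insts_in_flight[2] = 7;
  insts_in_flight[3] = 12;
  printf("fetch from thread %d\n", icount_choose_fetch_thread());
  return 0;
}

This enhances throughput because it favors threads that are draining instructions quickly and keeps any one stalled thread from filling the issue queue.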
Summary: Multithreaded Categories
[Figure: issue-slot occupancy over time (processor cycles) for
Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing, and
Simultaneous Multithreading; shading distinguishes Thread 1 through
Thread 5 and idle slots]
37
Acknowledgements
• These slides contain material developed and copyright by:
– Arvind (MIT)
– Krste Asanovic (MIT/UCB)
– Joel Emer (Intel/MIT)
– James Hoe (CMU)
– John Kubiatowicz (UCB)
– David Patterson (UCB)
– Christopher Batten (Cornell)

• MIT material derived from course 6.823


• UCB material derived from course CS252 & CS152
• Cornell material derived from course ECE 4750

38
Copyright © 2013 David Wentzlaff

39
Computer Architecture
ELE 475 / COS 475
Slide Deck 13: Parallel Programming
and Small Multiprocessors
David Wentzlaff
Department of Electrical Engineering
Princeton University

1
Trends in Computation
[Figure: Transistors (Thousands), Sequential Performance (SpecINT),
Frequency (MHz), and Typical Power (Watts) over time]
2
Data collected by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, C. Batten, and D. Wentzlaff
Trends in Computation
[Figure: Transistors (Thousands), Sequential Performance (SpecINT),
Frequency (MHz), Typical Power (Watts), and Cores over time]

3
Data collected by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, C. Batten, and D. Wentzlaff
Symmetric Multiprocessors

Processor Processor

CPU-Memory bus
bridge

I/O bus
Memory
I/O controller I/O controller I/O controller
symmetric:
• All memory is equally far away from all processors
• Any processor can do any I/O (set up a DMA transfer)
4
Synchronization
The need for synchronization arises whenever
there are concurrent processes in a system
(even in a uniprocessor system) producer

consumer
Producer-Consumer: A consumer process
must wait until the producer process has
produced data
P1 P2
Mutual Exclusion: Ensure that only one
process uses a resource at a given time
Shared
Resource

5
A Producer-Consumer Example
[Figure: shared queue in memory with tail and head pointers;
the producer uses Rtail, the consumer uses Rhead and R]

Producer posting Item x:
      Load  Rtail, (tail)
      Store x, (Rtail)
      Rtail = Rtail + 1
      Store Rtail, (tail)

Consumer:
      Load  Rhead, (head)
spin: Load  Rtail, (tail)
      if Rhead==Rtail goto spin
      Load  R, (Rhead)
      Rhead = Rhead + 1
      Store Rhead, (head)
      process(R)
The program is written assuming
instructions are executed in order. Problems?
6
A Producer-Consumer Example
continued

Producer posting Item x:
      Load  Rtail, (tail)
  (1) Store x, (Rtail)
      Rtail = Rtail + 1
  (2) Store Rtail, (tail)

Consumer:
      Load  Rhead, (head)
spin: Load  Rtail, (tail)          (3)
      if Rhead==Rtail goto spin
      Load  R, (Rhead)             (4)
      Rhead = Rhead + 1
      Store Rhead, (head)
      process(R)

Can the tail pointer get updated before the item x is stored?

Programmer assumes that if 3 happens after 2, then 4


happens after 1.

Problem sequences are:


2, 3, 4, 1
4, 1, 2, 3

7
Sequential Consistency
A Memory Model

P P P P P P

“ A system is sequentially consistent if the result of


any execution is the same as if the operations of all
the processors were executed in some sequential
order, and the operations of each individual processor
appear in the order specified by the program”
Leslie Lamport

Sequential Consistency =
arbitrary order-preserving interleaving
of memory references of sequential programs

8
Sequential Consistency
Sequential concurrent tasks: T1, T2
Shared variables: X, Y (initially X = 0, Y = 10)

T1:                       T2:
Store 1, (X)   (X = 1)    Load  R1, (Y)
Store 11, (Y)  (Y = 11)   Store R1, (Y')   (Y' = Y)
                          Load  R2, (X)
                          Store R2, (X')   (X' = X)

What are the legitimate answers for X' and Y'?

(X', Y') ∈ {(1,11), (0,10), (1,10), (0,11)} ?

If Y' is 11 then X' cannot be 0
9
Sequential Consistency
Sequential consistency imposes more memory ordering
constraints than those imposed by uniprocessor
program dependencies (→)

What are these in our example ?

T1:                       T2:
Store 1, (X)   (X = 1)    Load  R1, (Y)
Store 11, (Y)  (Y = 11)   Store (Y'), R1   (Y' = Y)
                          Load  R2, (X)
                          Store (X'), R2   (X' = X)

(additional SC requirements shown as arrows in the original figure)

Does (can) a system with caches or out-of-order


execution capability provide a sequentially consistent
view of the memory ?

10
Multiple Consumer Example
[Figure: one producer and two consumers (each with its own Rtail,
Rhead, R) sharing the queue's tail and head pointers]

Producer posting Item x:
      Load  Rtail, (tail)
      Store x, (Rtail)
      Rtail = Rtail + 1
      Store Rtail, (tail)

Consumer:
      Load  Rhead, (head)
spin: Load  Rtail, (tail)
      if Rhead==Rtail goto spin
      Load  R, (Rhead)
      Rhead = Rhead + 1
      Store Rhead, (head)
      process(R)

Critical section: needs to be executed atomically by one
consumer → locks

What is wrong with this code?
11
Locks or Semaphores
E. W. Dijkstra, 1965

A semaphore is a non-negative integer, with the


following operations:

P(s): if s>0, decrement s by 1, otherwise wait


probeer te verlagen, literally ("try to reduce”)

V(s): increment s by 1 and wake up one of


the waiting processes
verhogen ("increase")
P’s and V’s must be executed atomically, i.e., without
• interruptions or
• interleaved accesses to s by other processors
Process i:
  P(s)
  <critical section>
  V(s)

The initial value of s determines the maximum no. of processes
in the critical section.

12
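A minimal sketch of P() and V() using C11 atomics; the slides define the operations abstractly, so the semaphore_t type and the busy-wait loop here are assumptions. A real implementation would block the waiting process in the OS rather than spin.

#include <stdatomic.h>

typedef struct { atomic_int count; } semaphore_t;

void sem_init_(semaphore_t *s, int initial) {
  atomic_init(&s->count, initial);
}

/* P(s): wait until s > 0, then atomically decrement it. */
void P(semaphore_t *s) {
  for (;;) {
    int c = atomic_load(&s->count);
    if (c > 0 && atomic_compare_exchange_weak(&s->count, &c, c - 1))
      return;                     /* decremented s while it was > 0 */
    /* otherwise wait (here: spin) and try again */
  }
}

/* V(s): increment s; a waiting process will observe s > 0 and proceed. */
void V(semaphore_t *s) {
  atomic_fetch_add(&s->count, 1);
}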
Implementation of Semaphores
Semaphores (mutual exclusion) can be implemented
using ordinary Load and Store instructions in the
Sequential Consistency memory model. However,
protocols for mutual exclusion are difficult to design...

Simpler solution:
atomic read-modify-write instructions
Examples: m is a memory location, R is a register

Test&Set (m), R:
  R ← M[m];
  if R==0 then M[m] ← 1;

Fetch&Add (m), RV, R:
  R ← M[m];
  M[m] ← R + RV;

Swap (m), R:
  Rt ← M[m];
  M[m] ← R;
  R ← Rt;

13
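As an illustration of how a Test&Set-style instruction is used, here is a minimal C11 spinlock sketch (not from the slides). atomic_flag_test_and_set is the portable analogue; compilers lower it to the machine's atomic read-modify-write instruction or to an LL/SC loop.

#include <stdatomic.h>

typedef struct { atomic_flag locked; } spinlock_t;
/* initialize with: spinlock_t m = { ATOMIC_FLAG_INIT }; */

void lock(spinlock_t *m) {
  /* atomic_flag_test_and_set returns the old value: keep trying
   * until we are the one that changed it from clear (0) to set (1). */
  while (atomic_flag_test_and_set(&m->locked))
    ; /* spin */
}

void unlock(spinlock_t *m) {
  atomic_flag_clear(&m->locked);   /* the V: release the mutex */
}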
Multiple Consumers Example
using the Test&Set Instruction

P:    Test&Set (mutex), Rtemp
      if (Rtemp!=0) goto P
      Load Rhead, (head)           \
spin: Load Rtail, (tail)            |
      if Rhead==Rtail goto spin     |  Critical
      Load R, (Rhead)               |  Section
      Rhead = Rhead + 1             |
      Store Rhead, (head)          /
V:    Store 0, (mutex)
      process(R)

Other atomic read-modify-write instructions (Swap,


Fetch&Add, etc.) can also implement P’s and V’s

What if the process stops or is swapped out while


in the critical section?

14
Nonblocking Synchronization
Compare&Swap(m), Rt, Rs:
  if (Rt==M[m])
    then M[m] ← Rs;
         Rs ← Rt;
         status ← success;
    else status ← fail;

(status is an implicit argument)

try: Load Rhead, (head)


spin: Load Rtail, (tail)
if Rhead==Rtail goto spin
Load R, (Rhead)
Rnewhead = Rhead+1
Compare&Swap(head), Rhead, Rnewhead
if (status==fail) goto try
process(R)

15
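The following C11 sketch mirrors the nonblocking consumer above using atomic_compare_exchange_strong. The queue_t layout, QSIZE, and the function name are assumptions, and the ABA and wrap-around subtleties of real lock-free queues are ignored.

#include <stdatomic.h>

#define QSIZE 1024

typedef struct {
  int buf[QSIZE];
  atomic_long head;     /* next slot to consume */
  atomic_long tail;     /* next slot to produce into */
} queue_t;

/* Returns 1 and writes *out if an item was claimed, 0 if the queue is empty. */
int try_consume(queue_t *q, int *out) {
  for (;;) {
    long head = atomic_load(&q->head);
    long tail = atomic_load(&q->tail);
    if (head == tail)
      return 0;                            /* empty: nothing to claim */
    int item = q->buf[head % QSIZE];
    /* Compare&Swap on head: succeeds only if no other consumer
     * advanced it since we read it ("status==fail" -> goto try). */
    if (atomic_compare_exchange_strong(&q->head, &head, head + 1)) {
      *out = item;                         /* process(R) would go here */
      return 1;
    }
    /* CAS failed: another consumer won the race; retry. */
  }
}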
Load-link & Store-conditional
aka Load-reserve, Load-Locked
Special register(s) to hold reservation flag and address,
and the outcome of store-conditional

Load-link R, (m):
  <flag, adr> ← <1, m>;
  R ← M[m];

Store-conditional (m), R:
  if <flag, adr> == <1, m>
    then cancel other procs' reservation on m;
         M[m] ← R;
         status ← succeed;
    else status ← fail;

try: Load-link Rhead, (head)


spin: Load Rtail, (tail)
if Rhead==Rtail goto spin
Load R, (Rhead)
Rhead = Rhead + 1
Store-conditional Rhead, (head)
if (status==fail) goto try
process(R)
16
Performance of Locks
Blocking atomic read-modify-write instructions
e.g., Test&Set, Fetch&Add, Swap
vs
Non-blocking atomic read-modify-write instructions
e.g., Compare&Swap,
Load-link/Store-conditional
vs
Protocols based on ordinary Loads and Stores

Performance depends on several interacting factors:


degree of contention,
caches,
out-of-order execution of Loads and Stores

17
Issues in Implementing
Sequential Consistency
P P P P P P

M
Implementation of SC is complicated by two issues

• Out-of-order execution capability
(can the second access be issued before the first completes
without violating uniprocessor semantics?)

Load(a); Load(b)     yes
Load(a); Store(b)    yes if a ≠ b
Store(a); Load(b)    yes if a ≠ b
Store(a); Store(b)   yes if a ≠ b

• Caches
Caches can prevent the effect of a store from
being seen by other processors
SC complications motivate architects to consider
weak or relaxed memory models 18
Memory Fences
Instructions to sequentialize memory accesses

Processors with relaxed or weak memory models permit Loads and Stores to
different addresses to be reordered, removing some/all of the extra ordering
dependencies imposed by SC
• Load→Load, Load→Store, Store→Load, Store→Store

Need to provide memory fence instructions to force the serialization of
memory accesses

Examples of relaxed memory models:
• Total Store Order: preserves Load→Load, Load→Store, Store→Store; enforce Store→Load with a fence
• Partial Store Order: preserves Load→Load, Load→Store; enforce Store→Load, Store→Store with fences
• Weak Ordering: enforce Load→Load, Load→Store, Store→Load, Store→Store with fences

Memory fences are expensive operations – mem instructions wait for all
relevant instructions in-flight to complete (including stores to retire – need
store acks)
However, cost of serialization only when it is required!
19
Using Memory Fences
Producer posting Item x:
      Load  Rtail, (tail)
      Store x, (Rtail)
      MFenceSS              <- ensures that tail ptr is not updated
      Rtail = Rtail + 1        before x has been stored
      Store Rtail, (tail)

Consumer:
      Load  Rhead, (head)
spin: Load  Rtail, (tail)
      if Rhead==Rtail goto spin
      MFenceLL              <- ensures that R is not loaded
      Load  R, (Rhead)         before x has been stored
      Rhead = Rhead + 1
      Store Rhead, (head)
      process(R)
20
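In C11 the same fence placement can be written with atomic_thread_fence. The sketch below is an assumed single-producer, single-consumer version (buffer, head, tail, and QSIZE are illustrative names); the release fence plays the role of MFenceSS and the acquire fence the role of MFenceLL.

#include <stdatomic.h>

#define QSIZE 1024
int  buffer[QSIZE];
atomic_long head, tail;

void produce(int x) {
  long t = atomic_load_explicit(&tail, memory_order_relaxed);
  buffer[t % QSIZE] = x;                       /* Store x, (Rtail)     */
  atomic_thread_fence(memory_order_release);   /* MFenceSS: x is visible
                                                  before the tail update */
  atomic_store_explicit(&tail, t + 1, memory_order_relaxed);
}

int consume(void) {
  long h = atomic_load_explicit(&head, memory_order_relaxed);
  while (atomic_load_explicit(&tail, memory_order_relaxed) == h)
    ;                                          /* spin: queue empty    */
  atomic_thread_fence(memory_order_acquire);   /* MFenceLL: read data
                                                  only after seeing tail */
  int x = buffer[h % QSIZE];                   /* Load R, (Rhead)      */
  atomic_store_explicit(&head, h + 1, memory_order_relaxed);
  return x;
}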
Mutual Exclusion Using Load/Store
A protocol based on two shared variables c1 and c2.
Initially, both c1 and c2 are 0 (not busy)

Process 1 Process 2
... ...
c1=1; c2=1;
L: if c2==1 then go to L L: if c1==1 then go to L
< critical section> < critical section>
c1=0; c2=0;

What is wrong? Deadlock!

21
Mutual Exclusion: second attempt
To avoid deadlock, let a process give up the reservation
(i.e. Process 1 sets c1 to 0) while waiting.

Process 1 Process 2
... ...
L: c1=1; L: c2=1;
if c2==1 then if c1==1 then
{ c1=0; go to L} { c2=0; go to L}
< critical section> < critical section>
c1=0 c2=0

• Deadlock is not possible but with a low probability


a livelock may occur.

• An unlucky process may never get to enter the
critical section ⇒ starvation

22
A Protocol for Mutual Exclusion
T. Dekker, 1966

A protocol based on 3 shared variables c1, c2 and turn.


Initially, both c1 and c2 are 0 (not busy)

Process 1 Process 2
... ...
c1=1; c2=1;
turn = 1; turn = 2;
L: if c2==1 && turn==1 L: if c1==1 && turn==2
then go to L then go to L
< critical section> < critical section>
c1=0; c2=0;

• turn == i ensures that only process i can wait


• variables c1 and c2 ensure mutual exclusion
Solution for n processes was given by Dijkstra
and is quite tricky!

23
N-process Mutual Exclusion
Lamport’s Bakery Algorithm
Process i
Initially num[j] = 0, for all j
Entry Code
choosing[i] = 1;
num[i] = max(num[0], …, num[N-1]) + 1;
choosing[i] = 0;
for(j = 0; j < N; j++) {
while( choosing[j] );
while( num[j] &&
( ( num[j] < num[i] ) ||
( num[j] == num[i] && j < i ) ) );
}

Exit Code
num[i] = 0;

24
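A C11 rendering of the bakery entry and exit code above, offered as a sketch rather than a tested implementation. The shared arrays are declared _Atomic so that accesses default to sequentially consistent ordering, matching the SC assumption under which the algorithm is written.

#include <stdatomic.h>

#define N 8
_Atomic int choosing[N];
_Atomic int num[N];

static int max_ticket(void) {
  int m = 0;
  for (int j = 0; j < N; j++) {
    int v = atomic_load(&num[j]);
    if (v > m) m = v;
  }
  return m;
}

void bakery_lock(int i) {                  /* Entry Code for process i */
  atomic_store(&choosing[i], 1);
  atomic_store(&num[i], max_ticket() + 1);
  atomic_store(&choosing[i], 0);
  for (int j = 0; j < N; j++) {
    while (atomic_load(&choosing[j]))      /* wait while j picks a ticket */
      ;
    while (atomic_load(&num[j]) &&         /* wait while j has priority   */
           (atomic_load(&num[j]) < atomic_load(&num[i]) ||
            (atomic_load(&num[j]) == atomic_load(&num[i]) && j < i)))
      ;
  }
}

void bakery_unlock(int i) {                /* Exit Code */
  atomic_store(&num[i], 0);
}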
Symmetric Multiprocessors

Processor Processor

CPU-Memory bus
bridge

I/O bus
Memory
I/O controller I/O controller I/O controller
symmetric:
• All memory is equally far away from all processors
• Any processor can do any I/O (set up a DMA transfer)
25
Multidrop Memory Bus
Arbitration

Control

Address

Data

Clock

Main
Processor 1 Processor 2
Memory

26
Pipelined Memory Bus
Arbitration

Control

Address

Data

Clock

Main
Processor 1 Processor 2
Memory

27
Pipelined Memory Bus
[Figure: example transaction on the pipelined bus; P1 issues a LD to
address 0xDA7E0000 and data 0x1234abcd returns]

Arbitration

Control

Address

Data

Clock

Main 28
Processor 1 Processor 2
Memory Coherence in SMPs
CPU-1 CPU-2

A 100 cache-1 A 100 cache-2

CPU-Memory bus

A 100 memory

Suppose CPU-1 updates A to 200.


write-back: memory and cache-2 have stale values
write-through: cache-2 has a stale value

Do these stale values matter?


What is the view of shared memory for programming?
29
Write-back Caches & SC
prog T1: ST 1, X; ST 11, Y        prog T2: LD Y, R1; ST R1, Y';
                                           LD X, R2; ST R2, X'

                          cache-1       memory                   cache-2
• T1 is executed          X=1, Y=11     X=0, Y=10, X'=, Y'=      (empty)
• cache-1 writes back Y   X=1, Y=11     X=0, Y=11, X'=, Y'=      (empty)
• T2 executed             X=1, Y=11     X=0, Y=11, X'=, Y'=      Y=11, Y'=11, X=0, X'=0
• cache-1 writes back X   X=1, Y=11     X=1, Y=11, X'=, Y'=      Y=11, Y'=11, X=0, X'=0
• cache-2 writes back
  X' & Y'                 X=1, Y=11     X=1, Y=11, X'=0, Y'=11   Y=11, Y'=11, X=0, X'=0

Final result: (X', Y') = (0, 11), an outcome that sequential
consistency forbids.
30
Write-through Caches & SC
prog T1: ST 1, X; ST 11, Y        prog T2: LD Y, R1; ST Y', R1;
                                           LD X, R2; ST X', R2

                 cache-1       memory                   cache-2
(initially)      X=0, Y=10     X=0, Y=10, X'=, Y'=      X=0
• T1 executed    X=1, Y=11     X=1, Y=11, X'=, Y'=      X=0   (stale)
• T2 executed    X=1, Y=11     X=1, Y=11, X'=0, Y'=11   Y=11, Y'=11, X=0, X'=0

Again (X', Y') = (0, 11):
Write-through caches don't preserve sequential consistency either
31
Cache Coherence vs.
Memory Consistency
• A cache coherence protocol ensures that all writes by
one processor are eventually visible to other
processors, for one memory address
– i.e., updates are not lost
• A memory consistency model gives the rules on when a
write by one processor can be observed by a read on
another, across different addresses
– Equivalently, what values can be seen by a load
• A cache coherence protocol is not enough to ensure
sequential consistency
– But if sequentially consistent, then caches must be
coherent
• Combination of cache coherence protocol plus
processor memory reorder buffer implements a given
machine’s memory consistency model
32
Warmup: Parallel I/O
[Figure: processor with cache connected via the Address (A), Data (D),
and R/W lines of the Memory Bus to Physical Memory; a DMA engine
connects the DISK to the same bus]

Page transfers occur while the Processor is running.
Either Cache or DMA can be the Bus Master and effect transfers.

(DMA stands for “Direct Memory Access”, means the I/O device
can read/write memory autonomous from the CPU)
33
Problems with Parallel I/O
[Figure: cached portions of a page sit in the processor's cache while
DMA transfers move the page between Physical Memory and DISK]

Memory → Disk: Physical memory may be stale if cache copy is dirty

Disk → Memory: Cache may hold stale data and not see memory writes
34
Snoopy Cache Goodman & Ravishankar 1983
• Idea: Have cache watch (or snoop upon) DMA
transfers, and then “do the right thing”
• Snoopy cache tags are dual-ported

[Figure: cache Tags and State are dual-ported: one port is used to
drive the Memory Bus when the Cache is Bus Master, the other is a
snoopy read port attached to the Memory Bus]

35
Shared Memory Multiprocessor
[Figure: P1, P2, P3 each with a Snoopy Cache on a shared Memory Bus,
along with Physical Memory and a DMA engine connected to DISKS]

Use snoopy mechanism to keep all processors’ view of


memory coherent
36
Update(Broadcast) vs. Invalidate
Snoopy Cache Coherence Protocols
• Write Update (Broadcast)
– Writes are broadcast and update all other cache
copies
• Write Invalidate
– Writes invalidate all other cache copies

37
Write Update (Broadcast) Protocols
write miss:
Broadcast on bus, other processors update
copies (in place)

read miss:
Memory is always up to date

38
Write Invalidate Protocols
write miss:
the address is invalidated in all other
caches before the write is performed

read miss:
if a dirty copy is found in some cache, a write-
back is performed before the memory is read

39
Cache State Transition Diagram
The MSI protocol

Each cache line has state bits M: Modified


S: Shared
Address tag I: Invalid
state
bits Write miss
(P1 gets line from memory)
P1 reads
Other processor reads M or writes
(P1 writes back)

Read miss Other processor


(P1 gets line from memory) intent to write
(P1 writes back)
S I
Read by any Other processor
processor intent to write Cache state in
processor P1
40
Two Processor Example
(Reading and writing the same cache line)

Access sequence: P1 reads, P1 writes, P2 reads, P2 writes,
P1 reads, P1 writes, P2 writes, P1 writes

[Figure: the MSI state diagram of the previous slide, drawn once for
the line's state in P1 and once for its state in P2, with transitions
labeled by local reads/writes, read/write misses, write-backs, and the
other processor's reads and intent-to-write]
41
Observation
P1 reads
Other processor reads M or writes
P1 writes back Write miss

Other processor
intent to write

Read
miss
S I
Read by any Other processor
processor intent to write

• If a line is in the M state then no other cache can have a


copy of the line!
– Memory stays coherent, multiple differing copies cannot exist
42
MESI: An Enhanced MSI protocol
increased performance for private data (Illinois Protocol)

Each cache line has a tag M: Modified Exclusive


E: Exclusive but unmodified
Address tag S: Shared
state I: Invalid
bits
Write miss
P1 write P1 read
P1 write M E Read miss,
or read Other not shared
P1 intent
processor
Other processor reads to write
reads Other processor
P1 writes back intent to write
Other processor
Read miss, intent to write,
shared
P1 writes back
S I
Read by any Other processor
processor intent to write
Cache state in
processor P1
43
MOESI (Used in AMD Opteron)
Each cache line has a tag M: Modified Exclusive
O: Owned
Address tag E: Exclusive but unmodified
state S: Shared
bits I: Invalid
Write miss
P1 write P1 read
P1 write M E Read miss,
or read Other not shared
P1 intent
processor
Other processor reads to write
reads Other processor
P1 tracks write back intent to write
Other processor
Read miss, intent to write,
shared
P1 writes back
S I
Read by any Other processor Cache state in
processor intent to write
processor P1
O Read by any
44
P1 write processor
MESIF (Used by Intel Core i7)
Each cache line has a tag M: Modified Exclusive
E: Exclusive but unmodified
Address tag S: Shared
state I: Invalid
bits F: Forward
Write miss
P1 write P1 read
P1 write M E Read miss,
or read Other not shared
P1 intent
processor
Other processor reads to write
reads Other processor
P1 writes back intent to write
Other processor
Read miss, intent to write,
shared
P1 writes back
S/F I
Read by any Other processor
processor intent to write
Cache state in
processor P1
45
Scalability Limitations of Snooping
• Caches
– Bandwidth into caches
– Tags need to be dual ported or steal cycles for
snoops
– Need to invalidate all the way to L1 cache
• Bus
– Bandwidth
– Occupancy (As number of cores grows, atomically
utilizing bus becomes a challenge)

46
False Sharing
state blk addr data0 data1 ... dataN

A cache block contains more than one word

Cache-coherence is done at the block-level and


not word-level

Suppose M1 writes wordi and M2 writes wordk and


both words have the same block address.

What can happen?

47
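What can happen is that the two logically private words ping-pong the whole block between the two caches. The sketch below (assumed 64-byte blocks, made-up iteration count) shows the usual demonstration: two threads each increment their own counter, and padding the counters onto separate blocks removes the coherence traffic.

#include <pthread.h>
#include <stdio.h>

#define ITERS 100000000UL

/* Without padding, a and b land in the same (assumed 64-byte) block. */
struct shared_unpadded { volatile unsigned long a, b; };
/* With padding, a and b are at least one block apart. */
struct shared_padded   { volatile unsigned long a;
                         char pad[64];
                         volatile unsigned long b; };

static struct shared_padded s;   /* switch to shared_unpadded to see the slowdown */

static void *bump_a(void *arg) { for (unsigned long i = 0; i < ITERS; i++) s.a++; return arg; }
static void *bump_b(void *arg) { for (unsigned long i = 0; i < ITERS; i++) s.b++; return arg; }

int main(void) {
  pthread_t t1, t2;
  pthread_create(&t1, NULL, bump_a, NULL);
  pthread_create(&t2, NULL, bump_b, NULL);
  pthread_join(t1, NULL);
  pthread_join(t2, NULL);
  printf("a=%lu b=%lu\n", s.a, s.b);  /* padded layout typically runs much faster */
  return 0;
}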
Acknowledgements
• These slides contain material developed and copyright by:
– Arvind (MIT)
– Krste Asanovic (MIT/UCB)
– Joel Emer (Intel/MIT)
– James Hoe (CMU)
– John Kubiatowicz (UCB)
– David Patterson (UCB)
– Christopher Batten (Cornell)

• MIT material derived from course 6.823


• UCB material derived from course CS252 & CS152
• Cornell material derived from course ECE 4750

48
Blackboard Example: Sequential
Consistency
Valid Not Valid
P1 P2 1 1 5 5
1 5 2 2 6 1
2 6 5 3 7 3
3 7 3 4 1 2
4 8 6 5 2 4
7 6 3 6
8 7 4 7
4 8 8 8

49
Analysis of Dekker’s Algorithm
... Process 1 ... Process 2
c1=1; c2=1;
Scenario 1

turn = 1; turn = 2;
L: if c2=1 & turn=1 L: if c1=1 & turn=2
then go to L then go to L
< critical section> < critical section>
c1=0; c2=0;

... Process 1 ... Process 2


c1=1; c2=1;
Scenario 2

turn = 1; turn = 2;
L: if c2=1 & turn=1 L: if c1=1 & turn=2
then go to L then go to L
< critical section> < critical section>
c1=0; c2=0;

50
Computer Architecture
ELE 475 / COS 475
Slide Deck 14: Interconnection
Networks
David Wentzlaff
Department of Electrical Engineering
Princeton University

1
Overview of Interconnection
Networks: Buses

Core Core Core Core

Core Core Core Core

Core Core Core Core

4
Overview of Interconnection
Networks: Point-to-point / Switched

Core Core Core Core

SW SW SW SW

Core Core Core Core

SW SW SW SW

Core Core Core Core

SW SW SW SW

5
Overview of Interconnection
Networks: Point-to-point / Switched

Core Core Core Core

SW SW SW SW

Core Core Core Core

SW SW SW SW

Core Core Core Core

SW SW SW SW

6
Explicit Message Passing
(Programming)
• Send(Destination, *Data)
• Receive(&Data)
• Receive(Source, &Data)

• Unicast (one-to-one)
• Multicast (one-to-multiple)
• Broadcast (one-to-all)

7
Message Passing Interface (MPI)
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <mpi.h>

int main(int argc, char **argv) {
  int myid, numprocs, x, y;
  int tag = 475;
  MPI_Status status;
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  assert(numprocs == 2);
  if (myid == 0) {
    x = 475;
    MPI_Send(&x, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);
    MPI_Recv(&y, 1, MPI_INT, 1, tag, MPI_COMM_WORLD, &status);
    printf("received number: ELE %d A\n", y);
  } else {
    MPI_Recv(&y, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &status);
    y += 105;
    MPI_Send(&y, 1, MPI_INT, 0, tag, MPI_COMM_WORLD);
  }
  MPI_Finalize();
  exit(0);
}
8
Message Passing vs. Shared Memory
• Message Passing
– Memory is private
– Explicit send/receive to communicate
– Message contains data and synchronization
– Need to know Destination on generation of data (send)
– Easy for Producer-Consumer
• Shared Memory
– Memory is shared
– Implicit communication via loads and stores
– Implicit synchronization needed via Fences, Locks, and Flags
– No need to know Destination on generation of data (can store in
memory and user of data can pick up later)
– Easy for multiple threads accessing a shared table
– Needs Locks and critical sections to synchronize access

9
Shared Memory Tunneled over
Messaging
• Software
– Turn loads and stores into sends and receives
• Hardware
– Replace bus communications with messages sent
between cores and between cores and memory

Core Core Core Memory Core Core Core Memory

SW SW SW SW

10
Shared Memory Tunneled over
Messaging
• Software
– Turn loads and stores into sends and receives
• Hardware
– Replace bus communications with messages sent
between cores and between cores and memory

Core Core Core Memory Core Core Core Memory

SW SW SW SW

11
Messaging Tunneled over Shared
Memory
• Use software queues (FIFOs) with locks to
transmit data directly between cores by loads
and stores to memory

tail head
Producer Consumer

Rtail Rtail Rhead R

12
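A sketch of what such a software FIFO might look like in C with pthreads; the channel layout, sizes, and function names are assumptions, not code from the slides. Each destination core owns one sw_channel_t, send copies a message in under the lock, and receive copies it out.

#include <pthread.h>
#include <string.h>

#define QSIZE    64
#define MSGBYTES 64

typedef struct {
  char            msgs[QSIZE][MSGBYTES];
  int             head, tail;            /* consumer / producer indices */
  pthread_mutex_t lock;                  /* init with pthread_mutex_init()
                                            or PTHREAD_MUTEX_INITIALIZER */
} sw_channel_t;

/* Blocking send: copy data into the destination core's FIFO. */
void ch_send(sw_channel_t *ch, const void *data) {
  for (;;) {
    pthread_mutex_lock(&ch->lock);
    if ((ch->tail + 1) % QSIZE != ch->head) {      /* not full */
      memcpy(ch->msgs[ch->tail], data, MSGBYTES);
      ch->tail = (ch->tail + 1) % QSIZE;
      pthread_mutex_unlock(&ch->lock);
      return;
    }
    pthread_mutex_unlock(&ch->lock);               /* full: retry */
  }
}

/* Blocking receive: copy the oldest message out of the FIFO. */
void ch_receive(sw_channel_t *ch, void *data) {
  for (;;) {
    pthread_mutex_lock(&ch->lock);
    if (ch->head != ch->tail) {                    /* not empty */
      memcpy(data, ch->msgs[ch->head], MSGBYTES);
      ch->head = (ch->head + 1) % QSIZE;
      pthread_mutex_unlock(&ch->lock);
      return;
    }
    pthread_mutex_unlock(&ch->lock);               /* empty: retry */
  }
}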
Interconnect Design
• Switching
• Topology
• Routing
• Flow Control

13
Anatomy of a Message

• Flit: flow control digit (Basic unit of flow control)


• Phit: physical transfer digit (Basic unit of data
transferred in one clock) 14
Switching
• Circuit Switched
• Store and Forward
• Cut-through
• Wormhole

15
Topology

[Figures on slides 16-23: example network topologies (figures only)]
Topology Parameters
• Routing Distance: Number of links between
two points
• Diameter: Maximum routing distance
between any two points
• Average Distance
• Minimum Bisection Bandwidth (Bisection
Bandwidth): The bandwidth of a minimal cut
though the network such that the network is
divided into two sets of nodes
• Degree of a Router
24
Topology Parameters
Diameter: 2√𝑁 - 2
Bisection Bandwidth: 2√𝑁
Degree of a Router: 5

25
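As a small worked example (assuming the numbers on this slide describe a k x k 2D grid of N = k*k nodes), the parameters can be computed directly:

#include <math.h>
#include <stdio.h>

/* Sketch matching the figures quoted on the slide: diameter
 * 2*sqrt(N) - 2, bisection bandwidth 2*sqrt(N) links, router
 * degree 5 (4 neighbors plus the local core port). */
int main(void) {
  int k = 8;                       /* assumed 8 x 8 network, N = 64 */
  int N = k * k;
  int diameter  = 2 * (int)sqrt((double)N) - 2;
  int bisection = 2 * (int)sqrt((double)N);
  int degree    = 5;
  printf("N=%d diameter=%d bisection=%d degree=%d\n",
         N, diameter, bisection, degree);
  return 0;
}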
Topology Influenced by Packaging
• Wiring grows as
N-1
• Physically hard to
pack into 3-space
(pack in sphere?)

26
Topology Influenced by Packaging
• Packing N dimensions in N-1
space leads to long wires
• Packing N dimensions in N-2
space leads to really long wires

27
Network Performance
• Bandwidth: The rate of data that can be transmitted
over the network (network link) in a given time
• Latency: The time taken for a message to be sent from
sender to receiver

• Bandwidth can affect latency


– Reduce congestion
– Messages take fewer Flits and Phits
• Latency can affect Bandwidth
– Round trip communication can be limited by latency
– Round trip flow-control can be limited by latency

28
Latency

29
Anatomy of Message Latency
T = Thead + L/b
Thead: Head Phit Latency, includes tC , tR , hop
count, and contention

Unloaded Latency:
T0 = HR * tR + HC * tC + L/b

30
Anatomy of Message Latency
T = Thead + L/b
Thead: Head Phit Latency, includes tC , tR , hop
count, and contention

Unloaded Latency:
T0 = HR * tR + HC * tC + L/b

(HR * tR: shorter routes, faster routers;   HC * tC: faster channels;
L/b: wider channels or shorter messages) 31
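A direct transcription of the unloaded-latency formula into C, with made-up example parameters:

#include <stdio.h>

/* T0 = HR*tR + HC*tC + L/b, per the slide. */
double unloaded_latency(double HR, double tR,   /* router hops, per-router delay  */
                        double HC, double tC,   /* channel hops, per-channel delay */
                        double L,  double b) {  /* message length, channel bandwidth */
  return HR * tR + HC * tC + L / b;
}

int main(void) {
  /* e.g., 4 routers at 2 cycles, 5 channels at 1 cycle, a 512-bit
   * message over 64-bit-per-cycle channels */
  printf("T0 = %.1f cycles\n", unloaded_latency(4, 2, 5, 1, 512, 64));
  return 0;
}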
Interconnection Network Performance

32
Routing
• Oblivious (routing path independent of state
of network)
– Deterministic
– Non-Deterministic
• Adaptive (routing path depends on state of
network)

33
Flow Control
• Local (Link or hop based) Flow Control
• End-to-end (Long distance)

34
Deadlock
• Deadlock can occur if cycle possible in “Waits-
for” graph

35
Deadlock Example (Waits-for and
Holds analysis)

36
Deadlock Avoidance vs. Deadlock
Recovery
• Deadlock Avoidance
– Protocol designed to never deadlock
• Deadlock Recovery
– Allow Deadlock to occur and then resolve
deadlock usually through use of more buffering

37
Acknowledgements
• These slides contain material developed and copyright by:
– Arvind (MIT)
– Krste Asanovic (MIT/UCB)
– Joel Emer (Intel/MIT)
– James Hoe (CMU)
– John Kubiatowicz (UCB)
– David Patterson (UCB)
– Christopher Batten (Cornell)

• MIT material derived from course 6.823


• UCB material derived from course CS252 & CS152
• Cornell material derived from course ECE 4750

38
Computer Architecture
ELE 475 / COS 475
Slide Deck 15: Directory Cache
Coherence
David Wentzlaff
Department of Electrical Engineering
Princeton University

1
Coherency Misses
1. True sharing misses arise from the communication of
data through the cache coherence mechanism
• Invalidates due to 1st write to shared block
• Reads by another CPU of modified block in different cache
• Miss would still occur if block size were 1 word
2. False sharing misses when a block is invalidated
because some word in the block, other than the one
being read, is written into
• Invalidation does not cause a new value to be
communicated, but only causes an extra cache miss
• Block is shared, but no word in block is actually shared
⇒ miss would not occur if block size were 1 word

2
Example: True v. False Sharing v.
Hit?
• Assume x1 and x2 in same cache block.
P1 and P2 both read x1 and x2 before.

Time   P1          P2          True, False, Hit? Why?
1      Write x1                True miss; invalidate x1 in P2
2                  Read x2     False miss; x1 irrelevant to P2
3      Write x1                False miss; x1 irrelevant to P2
4                  Write x2    False miss; x1 irrelevant to P2
5      Read x2                 True miss; invalidate x2 in P1

3
MP Performance 4 Processor
Commercial Workload: OLTP, Decision Support (Database),
Search Engine
[Figure: memory cycles per instruction vs. L3 cache size (1 MB, 2 MB,
4 MB, 8 MB), broken down into Instruction, Capacity/Conflict, Cold,
False Sharing, and True Sharing components; y-axis 0 to 3.25]

• True sharing and false sharing unchanged going from 1 MB to 8 MB
(L3 cache)
• Uniprocessor cache misses improve with cache size increase
(Instruction, Capacity/Conflict, Compulsory)
4
MP Performance 2MB Cache
Commercial Workload: OLTP, Decision Support
(Database), Search Engine
[Figure: memory cycles per instruction vs. processor count (1, 2, 4,
6, 8), broken down into Instruction, Conflict/Capacity, Cold, False
Sharing, and True Sharing components; y-axis 0 to 3]

• True sharing and false sharing increase going from 1 to 8 CPUs
5
Directory Coherence Motivation
• Snoopy protocols require every cache miss to
broadcast
– Requires large bus bandwidth, O(N)
– Requires large cache snooping bandwidth, O(N^2)
aggregate
• Directory protocols enable further scaling
– Directory can track all caches holding a memory block
and use point-to-point messages to maintain
coherence
– Communication done via scalable point-to-point
interconnect

6
Directory Cache Coherence
CPU CPU CPU CPU CPU CPU

Cache Cache Cache Cache Cache Cache

Interconnection Network

Directory Directory Directory Directory


Controller Controller Controller Controller

DRAM Bank DRAM Bank DRAM Bank DRAM Bank

7
Distributed Shared Memory
CPU CPU CPU
DRAM Cache DRAM Cache DRAM Cache
Bank Bank Bank

Directory Directory Directory

IO IO IO

Interconnection Network

IO IO IO
Directory Directory Directory

DRAM DRAM DRAM


Bank Cache Bank Cache Bank Cache
CPU CPU CPU
8
Multicore Chips in a Multi-chip System

CPUs CPUs CPUs


DRAM Cache DRAM Cache DRAM Cache
Bank Bank Bank

Directory Directory Directory

IO IO IO

Interconnection Network

IO IO IO
Directory Directory Directory

DRAM DRAM DRAM


Bank Cache Bank Cache Bank Cache
CPUs CPUs CPUs
9
Non-Uniform Memory Access (NUMA)
• Latency to access memory is different
depending on node accessing it and address
being accessed
• NUMA does not necessarily imply ccNUMA

10
Address to Home Directory
High Order Bits Determine Home (Directory)
Physical Address
32 0
Home Node Index Offset

– OS can control placement in NUMA


– Homes can become hotspots
Low Order Bits Determine Home (Directory)
Physical Address
32 0
Index Home Node Offset
– OS loses control over placement
– Load balanced well
11
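A C sketch of the two mappings, assuming 64 home nodes, 64-byte blocks, and the 33-bit physical address drawn on the slide; the field widths and names are illustrative:

#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 6      /* assumed 64-byte block offset */
#define NODE_BITS   6      /* assumed 64 home nodes        */
#define PA_BITS     33

/* High-order bits choose the home: the OS controls placement through
 * the physical pages it allocates, but popular homes can hotspot. */
unsigned home_from_high_bits(uint64_t pa) {
  return (unsigned)((pa >> (PA_BITS - NODE_BITS)) & ((1u << NODE_BITS) - 1));
}

/* Low-order (block-granularity) bits choose the home: consecutive
 * blocks are spread across nodes, which load balances well, but the
 * OS loses control over placement. */
unsigned home_from_low_bits(uint64_t pa) {
  return (unsigned)((pa >> OFFSET_BITS) & ((1u << NODE_BITS) - 1));
}

int main(void) {
  uint64_t pa = 0x1ABCD1240ULL;   /* arbitrary 33-bit physical address */
  printf("high-bit home = %u, low-bit home = %u\n",
         home_from_high_bits(pa), home_from_low_bits(pa));
  return 0;
}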
Basic Full-Map Directory
State   Sharers/Owner
S       1011001100011
U       xxxxxxxxxxxxxxxx
U       xxxxxxxxxxxxxxxx
E       0001000000000
…

State: state that the directory believes the cache line is in
{Shared, Uncached, Exclusive, (Pending)}

Sharers/Owner:
If State == Shared: bit vector of all of the nodes in the system
that have the line in shared state.
If State == Modified/Exclusive: denotes which node has the line in
its cache exclusively.

12
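A C sketch of what one full-map entry and the directory's response to a write miss might look like; the 64-node limit, the type names, and the send_invalidate callback are assumptions:

#include <stdint.h>

typedef enum { DIR_UNCACHED, DIR_SHARED, DIR_EXCLUSIVE, DIR_PENDING } dir_state_t;

typedef struct {
  dir_state_t state;
  uint64_t    sharers;   /* bit i set => node i has a shared copy  */
  int         owner;     /* valid only when state == DIR_EXCLUSIVE */
} dir_entry_t;

/* Directory reaction to a write miss from node p: every current
 * sharer must be invalidated before p gets the line exclusively. */
void handle_write_miss(dir_entry_t *e, int p,
                       void (*send_invalidate)(int node)) {
  if (e->state == DIR_SHARED) {
    for (int n = 0; n < 64; n++)
      if ((e->sharers >> n) & 1ULL)
        send_invalidate(n);
  }
  /* (If DIR_EXCLUSIVE, the current owner would be asked to write back first.) */
  e->state   = DIR_EXCLUSIVE;
  e->sharers = 0;
  e->owner   = p;
}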
Cache State Transition Diagram
The MSI protocol

Each cache line has state bits M: Modified


S: Shared
Address tag I: Invalid
state
bits Write miss
(P1 gets line from memory)
P1 reads
Other processor reads M or writes
(P1 writes back)

Read miss Other processor


(P1 gets line from memory) intent to write
(P1 writes back)
S I
Read by any Other processor
processor intent to write Cache state in
processor P1
13
Cache State Transition Diagram
For Directory Coherence

M: Modified
S: Shared
I: Invalid
Write miss, P1 Send Write
Miss Message (Waits for
reply before transition)
P1 reads
Receive Read Miss Message M or writes
(P1 writes back)
Invalidate
Message (P1
Read miss, P1 Send Read writes back,
Miss Message (Waits for Reply after)
reply before transition)
S I
Read by P1 Invalidate
Message Cache state in
(Reply) processor P1
14
Cache State Transition Diagram
For Directory Coherence

M: Modified
S: Shared
I: Invalid
Write miss, P1 Send Write
Miss Message (Waits for
reply before transition)
P1 reads
Receive Read Miss Message M or writes
(P1 writes back) Writeback
Invalidate (Notify Directory)
Message (P1
Read miss, P1 Send Read writes back,
Miss Message (Waits for Reply after)
reply before transition)
S I
Read by P1 Invalidate
Message Cache state in
(Reply) processor P1
15
Notify Directory
Directory State Transition Diagram
U: Uncached
S: Shared
E: Exclusive Write Miss P:
Fetch/Invalidate
From E Node,
Data Value Reply
Sharers = {P}
Read Miss P: E
Fetch from E Node,
Data Value Reply Data Write-Back P:
Sharers = {P} Sharers = {}

Write Miss P:
Data Value Reply
Sharers = {P}
S U
Read Miss P: Read Miss P:
Data Value Reply Data Value Reply
Sharers = Sharers = {P} State of Cache
Sharers +{P} Line in Directory16
Message Types

From Hennessy and Patterson Ed. 5 Image Copyright © 2011, Elsevier Inc. All rights Reserved. 17
Multiple Logical Communication
Channels Needed
• Responses queued behind requests can lead
to deadlock
• Many different message types, need to
determine which message type can create
more messages
• Segregate flows onto different logical/physical
channels

18
Memory Ordering Point
• Just like in bus based snooping protocol, need to
guarantee that state transitions are atomic
• Directory used as ordering point
– Whichever message reaches home directory first wins
– Other requests on same cache line given negative
acknowledgement (NACK)
• NACK causes retry from other node
• Forward progress guarantee needed
– After node acquires line, need to commit at least one
memory operation before transitioning invalidating
line

19
Scalability of Directory Sharer List
Storage
• Full-Map Directory (Bit per cache)

• Limited Pointer (Keep list of base-2 encoded


sharers)
Overflow 0x26 0x10

• LimitLess (Software assistance)


Overflow 0x26 0x10

20
Beyond Simple Directory Coherence
• On-chip coherence (Leverage fast on-chip
communications to speed up or simplify
protocol)
• Cache Only Memory Architectures (COMA)
• Large scale directory systems (Scalability of
directory messages and sharer list storage)

21
SGI UV 1000 (Origin Descendant)
Maximum Memory: 16TB
Maximum Processors: 256
Maximum Cores: 2560
Topology: 2D Torus

22
Image Credit: SGI
TILE64Pro
• 64 Cores
• 2D Mesh
• 4 Memory Controllers
• 3 Memory Networks
– Divide different flows of traffic
• Each node can be a home

[Figure: TILE64Pro chip with DDR2 Controllers at top and bottom and
SerDes-based I/O around the edges: 10 Gig Enet (XAUI), PCIe, Gigabit
Enet, and General Purpose I/O]
23
Beyond ELE 475
• Computer Architecture Research
– International Symposium on Computer Architecture (ISCA)
– International Symposium on Microarchitecture (MICRO)
– Architectural Support for Programing Languages and
Operating Systems (ASPLOS)
– International Symposium on High Performance Computer
Architecture (HPCA)
• Build some chips / FPGA
• Parallel Computer Architecture
• ELE 580A Parallel Computation (Princeton Only)
– Graduate Level, Using Primary Sources

24
