Computer Architecture
Lecture 17: Memory Hierarchy and Caches
Announcements
n HW 4: March 18
n Exam: March 20
n Please turn in your feedback form: Very Important
IA-64: A “Complicated” VLIW ISA
Recommended reading:
Huck et al., “Introducing the IA-64 Architecture,” IEEE Micro 2000.
EPIC – Intel IA-64 Architecture
n Gets rid of lock-step execution of instructions within a VLIW instruction
n Idea: More ISA support for static scheduling and parallelization
q Specify dependencies within and between VLIW instructions (explicitly parallel)
+ No lock-step execution
+ Static reordering of stores and loads + dynamic checking
-- Hardware needs to perform dependency checking (albeit aided by software)
-- Other disadvantages of VLIW still exist
n IA-64 Instruction
q Fixed length: 41 bits
q Contains three 7-bit register specifiers
q Contains a 6-bit field for specifying one of the 64 one-bit predicate registers
IA-64 Instruction Bundles and Groups
n Groups of instructions can be executed safely in parallel
q Marked by “stop bits”
Template Bits
n Specify two things
q Stop information: Boundary of independent instructions
q Functional unit information: Where should each instruction be routed
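Because a bundle packs three 41-bit instructions plus the template field, decoding it is plain bit slicing. Below is a minimal C sketch assuming the standard IA-64 layout (a 128-bit bundle: 5-bit template in bits 0-4, then three 41-bit slots); the Bundle type and helper names are illustrative, not from a real decoder.

    #include <stdint.h>
    #include <stdio.h>

    typedef struct { uint64_t lo, hi; } Bundle;  /* 128-bit bundle as two 64-bit words */

    /* Low 5 bits: the template (functional-unit routing + stop positions). */
    static unsigned template_bits(const Bundle *b) { return (unsigned)(b->lo & 0x1F); }

    /* Slot i (i = 0..2) occupies bits [5 + 41*i, 45 + 41*i] of the bundle.
     * Uses the GCC/Clang __int128 extension for a simple 128-bit shift. */
    static uint64_t slot(const Bundle *b, int i) {
        unsigned __int128 raw = ((unsigned __int128)b->hi << 64) | b->lo;
        return (uint64_t)((raw >> (5 + 41 * i)) & ((((uint64_t)1) << 41) - 1));
    }

    int main(void) {
        Bundle b = { 0x123456789ABCDEF0u, 0x0FEDCBA987654321u };
        printf("template=%u slot0=%llx\n", template_bits(&b),
               (unsigned long long)slot(&b, 0));
        return 0;
    }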
Three Things That Hinder Static Scheduling
n Dynamic events (static unknowns)
q Branch direction
q Load hit/miss status
q Memory address
Non-Faulting Loads and Exception Propagation in IA-64
n Idea: Support unsafe code motion
[Figure: unsafe code motion: the load ld.s r1=[a] is hoisted above inst 1, inst 2, …, and the branch (br)]
n ld.a (advanced load) starts the monitoring of any store to the same address as the advanced load
n If no aliasing has occurred since ld.a, ld.c is a NOP
n If aliasing has occurred, ld.c re-loads from memory
Aggressive ST-LD Reordering in IA-64
n Idea: Reorder LD/STs in the presence of unknown address
q A load (and its use) can be hoisted above a store even when the addresses are unknown at compile time; the ld.a/ld.c pair checks for aliasing at run time (see the C sketch below)
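To see why this support matters, consider the hazard it removes; a minimal C illustration, with hypothetical function and variable names:

    /* If p and q may alias, the load *q cannot be hoisted above the
     * store *p without a check; IA-64's ld.a/ld.c let the compiler
     * hoist it anyway and re-load only if aliasing actually occurred. */
    int hoist_hazard(int *p, int *q) {
        *p = 42;        /* store to a compile-time-unknown address */
        return *q + 1;  /* load: safe to hoist above the store only if p != q */
    }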
What We Covered So Far in 447
n ISA → Single-cycle Microarchitectures
n Pipelining
n Out-of-Order Execution
Agenda for the Rest of 447
n The memory hierarchy
n Caches, caches, more caches (high locality, high bandwidth)
n Virtualizing the memory hierarchy
n Main memory: DRAM
n Main memory control, scheduling
n Memory latency tolerance techniques
n Non-volatile memory
n Multiprocessors
n Coherence and consistency
n Interconnection networks
n Multi-core issues
Readings for Today and Next Lecture
n Memory Hierarchy and Caches
Memory (Programmer’s View)
Abstraction: Virtual vs. Physical Memory
n Programmer sees virtual memory
q Can assume the memory is “infinite”
n Reality: Physical memory size is much smaller than what the programmer assumes
n The system (system software + hardware, cooperatively) maps virtual memory addresses to physical memory addresses
q The system automatically manages the physical memory space transparently to the programmer
Idealism
[Figure: Instruction Supply → Pipeline (instruction execution) → Data Supply]
Memory in a Modern System
[Die photo labels: CORE 0-3, L2 CACHE 0-3, SHARED L3 CACHE, DRAM MEMORY CONTROLLER, DRAM INTERFACE, DRAM BANKS]
Ideal Memory
n Zero access time (latency)
n Infinite capacity
n Zero cost
n Infinite bandwidth (to support multiple accesses in parallel)
The Problem
n Ideal memory’s requirements oppose each other
n Bigger is slower
q Bigger → Takes longer to determine the location
Memory Technology: DRAM
n Dynamic random access memory
n Capacitor charge state indicates stored value
q Whether the capacitor is charged or discharged indicates storage of 1 or 0
q 1 capacitor
q 1 access transistor
[Cell schematic: row enable gates the access transistor between the _bitline and the storage capacitor]
n Capacitor leaks through the RC path
q DRAM cell loses charge over time
q DRAM cell needs to be refreshed
Memory Technology: SRAM
n Static random access memory
n Two cross-coupled inverters store a single bit
q Feedback path enables the stored value to persist in the “cell”
q 4 transistors for storage
q 2 transistors for access
[Cell schematic: row select gates the two access transistors between bitline/_bitline and the inverter pair]
Memory Bank Organization and Operation
n Read access sequence:
1. Decode row address & drive word-lines
2. Selected bits drive bit-lines (entire row is read out)
3. Amplify row data
4. Decode column address & select subset of row
• Send to output
5. Precharge bit-lines
• For next access
SRAM (Static Random Access Memory)
n Read sequence:
1. address decode
2. drive row select
3. selected bit-cells drive bitlines
[Cell schematic: row select, bitline, _bitline]
n SRAM
q Faster access (no capacitor)
q Lower density (6T cell)
q Higher cost
q No need for refresh
q Manufacturing compatible with logic process (no capacitor)
The Problem
n Bigger is slower
q SRAM, 512 Bytes, sub-nanosec
q SRAM, KByte~MByte, ~nanosec
q DRAM, Gigabyte, ~50 nanosec
q Hard Disk, Terabyte, ~10 millisec
The Memory Hierarchy
[Figure: hierarchy pyramid: back up everything in the big but slow level at the bottom]
Memory Hierarchy
n Fundamental tradeoff
q Fast memory: small
q Large memory: slow
n Idea: Memory hierarchy
[Figure: CPU (RF) ↔ Cache ↔ Main Memory (DRAM) ↔ Hard Disk]
Locality
n One’s recent past is a very good predictor of one’s near future.
Memory Locality
n A “typical” program has a lot of locality in memory references
q typical programs are composed of “loops” (illustrated in the sketch below)
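Both flavors of locality, temporal and spatial, fall out of such loop-based code; a small self-contained C sketch with arbitrary names and sizes:

    #include <stdio.h>

    int main(void) {
        int a[1024];
        int sum = 0;            /* reused every iteration: temporal locality */
        for (int i = 0; i < 1024; i++)
            a[i] = i;           /* consecutive addresses: spatial locality */
        for (int i = 0; i < 1024; i++)
            sum += a[i];        /* the loop body re-executes: temporal locality
                                   in instruction references, too */
        printf("%d\n", sum);
        return 0;
    }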
Caching Basics: Exploit Temporal Locality
n Idea: Store recently accessed data in automatically managed fast memory (called cache)
n Anticipation: the data will be accessed again soon
Caching Basics: Exploit Spatial Locality
n Idea: Store addresses adjacent to the recently accessed one in automatically managed fast memory
q Logically divide memory into equal-size blocks
q Fetch to cache the accessed block in its entirety
n Anticipation: nearby data will be accessed soon (see the traversal-order sketch below)
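One way to see spatial locality pay off (or not) is traversal order over a 2D array, which C stores row-major; the sizes here are arbitrary:

    #include <stdio.h>

    #define N 1024
    static int m[N][N];   /* row-major: m[i][0..N-1] are adjacent in memory */

    int main(void) {
        long sum = 0;
        /* Row-major walk: consecutive accesses touch consecutive addresses,
         * so every byte of each fetched cache block gets used. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += m[i][j];
        /* A column-major walk (for j, then for i) would touch addresses
         * N*sizeof(int) apart, using one element per fetched block. */
        printf("%ld\n", sum);
        return 0;
    }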
The Bookshelf Analogy
n Book in your hand
n Desk
n Bookshelf
n Boxes at home
n Boxes in storage
Caching in a Pipelined Design
n The cache needs to be tightly integrated into the pipeline
q Ideally, access in 1 cycle so that dependent operations do not stall
n High frequency pipeline → Cannot make the cache large
q But, we want a large cache AND a pipelined design
n Idea: Cache hierarchy
[Figure: CPU (RF) ↔ Level 1 Cache ↔ Level 2 Cache ↔ Main Memory (DRAM)]
A Note on Manual vs. Automatic Management
n Manual: Programmer manages data movement across levels
-- too painful for programmers on substantial programs
q “core” vs “drum” memory in the 1950s
n Automatic: Hardware manages data movement across levels, transparently to the programmer
q You don’t need to know how big the cache is and how it works to write a “correct” program! (What if you want a “fast” program?)
Automatic Management in Memory Hierarchy
n Wilkes, “Slave Memories and Dynamic Storage Allocation,” IEEE Trans. on Electronic Computers, 1965.
n Let hi and mi be the hit and miss rates at level i, ti its access time, and Ti the average access time seen at level i
q hi + mi = 1
n Thus
Ti = hi·ti + mi·(ti + Ti+1)
Ti = (hi + mi)·ti + mi·Ti+1 = ti + mi·Ti+1
(evaluated numerically in the sketch below)
n Keep mi low
q increasing capacity Ci lowers mi, but beware of increasing ti
q lower mi by smarter management (replacement: anticipate what you don’t need; prefetching: anticipate what you will need)
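To make the recurrence concrete, here is a minimal C sketch that evaluates Ti bottom-up; the latencies and miss rates are made-up illustrative numbers, not measurements:

    #include <stdio.h>

    /* T_i = t_i + m_i * T_(i+1), evaluated from the last level upward.
     * Assumed (illustrative) hierarchy: L1 1 ns / 10% miss,
     * L2 5 ns / 20% miss, DRAM 50 ns as the final level. */
    int main(void) {
        double t[] = { 1.0, 5.0, 50.0 };   /* access time per level, ns */
        double m[] = { 0.10, 0.20, 0.0 };  /* miss rate per level */
        int levels = 3;

        double T = t[levels - 1];          /* last level always supplies the data */
        for (int i = levels - 2; i >= 0; i--)
            T = t[i] + m[i] * T;           /* apply the recurrence */

        printf("Average access time: %.2f ns\n", T);  /* 1 + 0.1*(5 + 0.2*50) = 2.5 */
        return 0;
    }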
Caching Basics
n Block (line): Unit of storage in the cache
q Memory is logically divided into cache blocks that map to locations in the cache
[Figure: Address → Tag Store (hit/miss?) and Data Store (data)]
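To sketch how the tag store and data store cooperate on a lookup, here is a toy direct-mapped cache in C; the geometry (64-byte blocks, 64 sets) and every name are illustrative assumptions, not any particular design:

    #include <stdint.h>
    #include <stdbool.h>

    /* Toy direct-mapped cache: 64 sets x 64-byte blocks (4 KB total).
     * An address splits into offset (6 bits), index (6 bits), tag (rest). */
    #define BLOCK_BITS 6
    #define INDEX_BITS 6
    #define NUM_SETS   (1u << INDEX_BITS)

    typedef struct {
        bool     valid;                  /* tag store: valid bit */
        uint64_t tag;                    /* tag store: tag */
        uint8_t  data[1 << BLOCK_BITS];  /* data store: one block */
    } Line;

    static Line cache[NUM_SETS];

    /* Returns true on a hit; on a hit, *out receives the addressed byte. */
    bool lookup(uint64_t addr, uint8_t *out) {
        uint64_t offset = addr & ((1u << BLOCK_BITS) - 1);
        uint64_t index  = (addr >> BLOCK_BITS) & (NUM_SETS - 1);
        uint64_t tag    = addr >> (BLOCK_BITS + INDEX_BITS);

        Line *line = &cache[index];
        if (line->valid && line->tag == tag) {  /* tag store answers: hit */
            *out = line->data[offset];          /* data store supplies the byte */
            return true;
        }
        return false;  /* miss: the block would be fetched and the line filled */
    }

    int main(void) {
        uint8_t byte;
        /* Cold cache: the first lookup necessarily misses. */
        return lookup(0x1234, &byte) ? 0 : 1;
    }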