
Associative Caches

Associative Caches
 Fully associative
 Allow a given block to go in any cache entry
 Requires all entries to be searched at once
 Comparator per entry (expensive)
 n-way set associative
 Each set contains n entries
 Block number determines which set
 (Block number) modulo (#Sets in cache), as in the C sketch below
 Search all entries in a given set at once
 n comparators (less expensive)
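As a concrete illustration of the placement rule above, here is a minimal C sketch (function and variable names are mine, not from the slides) that computes the set a block maps to; direct mapped is simply the special case where the number of sets equals the number of blocks.

```c
#include <stdio.h>

/* Placement rule for a set associative cache:
 * a block may go in any entry of set (block number) mod (#sets). */
unsigned set_index(unsigned block_number, unsigned num_sets)
{
    return block_number % num_sets;
}

int main(void)
{
    /* e.g., a 4-entry cache: direct mapped has 4 sets, 2-way has 2 sets,
     * fully associative has 1 set (so the "index" is always 0). */
    printf("block 12 in a 2-way, 4-entry cache goes to set %u\n", set_index(12, 2));
    printf("block 12 in a direct-mapped, 4-entry cache goes to index %u\n", set_index(12, 4));
    return 0;
}
```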
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 2
Associative Cache Example

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 3


Spectrum of Associativity
 For a cache with 8 entries

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 4


Associativity Example
 Compare 4-block caches
 Direct mapped, 2-way set associative,
fully associative
 Block access sequence: 0, 8, 0, 6, 8
 Direct mapped
Block address | Cache index | Hit/miss | Cache content after access (indices 0-3)
0             | 0           | miss     | Mem[0]
8             | 0           | miss     | Mem[8]
0             | 0           | miss     | Mem[0]
6             | 2           | miss     | Mem[0], Mem[6] (at indices 0 and 2)
8             | 0           | miss     | Mem[8], Mem[6] (at indices 0 and 2)
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 5
Associativity Example
 2-way set associative
Block address | Cache index | Hit/miss | Cache content after access (Set 0; Set 1 stays empty)
0             | 0           | miss     | Mem[0]
8             | 0           | miss     | Mem[0], Mem[8]
0             | 0           | hit      | Mem[0], Mem[8]
6             | 0           | miss     | Mem[0], Mem[6]
8             | 0           | miss     | Mem[8], Mem[6]

 Fully associative
Block address | Hit/miss | Cache content after access
0             | miss     | Mem[0]
8             | miss     | Mem[0], Mem[8]
0             | hit      | Mem[0], Mem[8]
6             | miss     | Mem[0], Mem[8], Mem[6]
8             | hit      | Mem[0], Mem[8], Mem[6]
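To make the example concrete, the following self-contained C sketch replays the access sequence 0, 8, 0, 6, 8 against a 4-entry cache modeled as n-way set associative with LRU replacement (1-way models direct mapped, 4-way models fully associative). The code and names are illustrative only, not the book's; it reports 5, 4, and 3 misses, matching the tables above.

```c
#include <stdio.h>

#define NUM_BLOCKS 4   /* total cache entries, as in the example above */

/* Count misses for a block-address sequence in a NUM_BLOCKS-entry cache
 * organized as `ways`-way set associative with LRU replacement. */
static int count_misses(const int *seq, int n, int ways)
{
    int sets = NUM_BLOCKS / ways;
    int tag[NUM_BLOCKS];        /* block address held by each entry, -1 = invalid */
    int last_use[NUM_BLOCKS];   /* time of last access, used to find the LRU entry */
    for (int i = 0; i < NUM_BLOCKS; i++) { tag[i] = -1; last_use[i] = -1; }

    int misses = 0;
    for (int t = 0; t < n; t++) {
        int set = seq[t] % sets;
        int base = set * ways;
        int hit = -1, victim = base;
        for (int e = base; e < base + ways; e++) {
            if (tag[e] == seq[t]) hit = e;                  /* tag match within the set  */
            if (tag[e] == -1) victim = e;                   /* prefer a non-valid entry  */
            else if (tag[victim] != -1 && last_use[e] < last_use[victim])
                victim = e;                                 /* otherwise least recently used */
        }
        if (hit < 0) { misses++; hit = victim; tag[hit] = seq[t]; }
        last_use[hit] = t;
    }
    return misses;
}

int main(void)
{
    int seq[] = {0, 8, 0, 6, 8};
    printf("direct mapped    : %d misses\n", count_misses(seq, 5, 1));
    printf("2-way set assoc. : %d misses\n", count_misses(seq, 5, 2));
    printf("fully associative: %d misses\n", count_misses(seq, 5, 4));
    return 0;
}
```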


Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 6
How Much Associativity
 Increased associativity decreases miss
rate
 But with diminishing returns
 Simulation of a system with 64KB
D-cache, 16-word blocks, SPEC2000
 1-way: 10.3%
 2-way: 8.6%
 4-way: 8.3%
 8-way: 8.1%

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 7


Set Associative Cache Organization

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 8


Replacement Policy
 Direct mapped: no choice
 Set associative
 Prefer non-valid entry, if there is one
 Otherwise, choose among entries in the set
 Least-recently used (LRU)
 Choose the one unused for the longest time
 Simple for 2-way, manageable for 4-way, too hard
beyond that
 Random
 Gives approximately the same performance
as LRU for high associativity
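For 2-way sets, the LRU state in the bullets above is just one bit per set. The following hypothetical C fragment sketches that bookkeeping; the struct and function names are assumptions for illustration, not hardware or textbook names.

```c
#include <stdbool.h>

/* One 2-way set: LRU needs only a single bit saying which way was used least recently. */
struct two_way_set {
    unsigned tag[2];
    bool     valid[2];
    bool     lru;        /* index (0 or 1) of the least-recently-used way */
};

/* Call on every access that hits or fills `way`: the other way becomes the LRU one. */
static void touch(struct two_way_set *s, int way)
{
    s->lru = (way == 0);   /* if way 0 was just used, way 1 is now least recently used */
}

/* Choose a victim on a miss: prefer a non-valid entry, otherwise the LRU way. */
static int victim(const struct two_way_set *s)
{
    if (!s->valid[0]) return 0;
    if (!s->valid[1]) return 1;
    return s->lru;
}
```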

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 9


Multilevel Caches
 Primary cache attached to CPU
 Small, but fast
 Level-2 cache services misses from
primary cache
 Larger, slower, but still faster than main
memory
 Main memory services L-2 cache misses
 Some high-end systems include L-3 cache

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 10


Multilevel Cache Example
 Suppose we have a processor with a base CPI
of 1.0, assuming all references hit in the primary
cache, and a clock rate of 4 GHz. Assume a
main memory access time of 100 ns, including
all the miss handling. Suppose miss rate per
instruction at the primary cache is 2%. How
much faster will the processor be if we add a
secondary cache that has a 5 ns access time for
either a hit or a miss and is large enough to reduce
the miss rate to main memory to 0.5%?

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 11


Multilevel Cache Example
 Given
 CPU base CPI = 1, clock rate = 4GHz
 Miss rate/instruction = 2%
 Main memory access time = 100ns
 With just primary cache
 Miss penalty = 100ns/0.25ns = 400 cycles
 Effective CPI = 1 + 0.02 × 400 = 9

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 12


Example (cont.)
 Now add L-2 cache
 Access time = 5ns
 Global miss rate to main memory = 0.5%
 Primary miss with L-2 hit
 Penalty = 5ns/0.25ns = 20 cycles
 CPI = 1 + 0.02 × 20 + 0.005 × 400 = 3.4
 Performance ratio = 9/3.4 = 2.6
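The arithmetic above can be packaged as a small C helper; the parameter names are mine, not the text's, and this is just a sketch of the same calculation. It reproduces a CPI of 9 with only the L1 and 3.4 with the L2 added, for a speedup of about 2.6.

```c
#include <stdio.h>

/* Effective CPI = base CPI + (L1 miss rate) x (L2 hit penalty)
 *                          + (global miss rate to memory) x (memory penalty). */
double effective_cpi(double base_cpi, double l1_miss_rate, double l2_penalty,
                     double global_miss_rate, double mem_penalty)
{
    return base_cpi + l1_miss_rate * l2_penalty + global_miss_rate * mem_penalty;
}

int main(void)
{
    double clock_ns    = 0.25;                 /* 4 GHz clock          */
    double mem_penalty = 100.0 / clock_ns;     /* 100 ns = 400 cycles  */
    double l2_penalty  = 5.0 / clock_ns;       /* 5 ns   = 20 cycles   */

    double cpi_l1   = effective_cpi(1.0, 0.02, 0.0,        0.02,  mem_penalty); /* 9.0 */
    double cpi_l1l2 = effective_cpi(1.0, 0.02, l2_penalty, 0.005, mem_penalty); /* 3.4 */
    printf("L1 only: CPI = %.1f, with L2: CPI = %.1f, speedup = %.1f\n",
           cpi_l1, cpi_l1l2, cpi_l1 / cpi_l1l2);
    return 0;
}
```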

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 13


Multilevel Cache Considerations
 Primary cache
 Focus on minimal hit time
 L-2 cache
 Focus on low miss rate to avoid main memory
access
 Hit time has less overall impact
 Results
 L-1 cache usually smaller than a single cache
 L-1 block size smaller than L-2 block size

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 14


Virtual Memory
Virtual Memory
 Use main memory as a “cache” for
secondary (disk) storage
 Managed jointly by CPU hardware and the
operating system (OS)
 Programs share main memory
 Each gets a private virtual address space
holding its frequently used code and data
 Protected from other programs
 CPU and OS translate virtual addresses to
physical addresses
 VM “block” is called a page
 VM translation “miss” is called a page fault
Virtual Memory
Consider a collection of programs running at once on a computer
The total memory required by all the programs may be larger than the amount of
main memory available on the computer
Yet only a fraction of this memory is actively used at any point in time
Main memory holds only the active portions of the many programs, just as a
cache contains only the active portion of one program
To allow multiple programs to share the same memory, each program must be able
to read and write only the portions of main memory that have been assigned to it
We also cannot know, when we compile a program, which other programs will share
memory with it
Virtual Memory
 Programs sharing the memory change dynamically while the
programs are running
 Because of this dynamic interaction, each program is compiled into
its own address space, i.e., a separate range of memory
locations accessible only to that program
 Virtual memory implements the translation of a program's
address space to physical addresses
– This translation process enforces protection of a program's
address space from other programs
 VM also allows a single user program to exceed the size of primary
memory
– Formerly, if a program became too large for memory, it was
up to the programmer to make it fit
Virtual Memory
 Programmers divided programs into pieces (overlays) and then
identified the pieces that were mutually exclusive
 The program never tried to access an overlay that was not loaded,
and the overlays loaded never exceeded the total size of the
memory
– Overlays were organised as modules, each containing both
code and data
 Overlaying one module with another was achieved by calls
between the procedures
Address Translation
 Fixed-size pages (e.g., 4K)

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 20


Address Translation
 The figure shows a virtually addressed memory with pages
mapped to main memory
 This is called address mapping or address translation
– E.g., today the two memory-hierarchy levels controlled by virtual
memory are DRAM and magnetic disk
 Working principle:
– VM simplifies loading the program for execution by
providing relocation
– Relocation maps the virtual addresses used by a program
to different physical addresses before the addresses are
used to access memory
– This relocation allows us to load the program anywhere in
main memory
Address Translation
 Advantages:
– Eliminates the need to find a contiguous block of memory
to allocate to a program
– Formerly, relocation problems required special hardware and
special support in the operating system
• Today, virtual memory also provides this function
– In virtual memory, the address is broken into a virtual page
number and a page offset
Translation Using a Page Table

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 23


Translation Using a Page Table
 The figure shows the translation of a virtual page
number to a physical page number
 Physical page number: the upper portion of the
physical address
 Page offset: the lower portion of the physical
address, which is unchanged by translation
 The number of bits in the page offset field determines the
page size
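As an illustration, assuming 32-bit virtual addresses and the fixed 4K pages mentioned earlier (a 12-bit offset), a virtual address splits as in this small C sketch; the macro names are mine.

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_OFFSET_BITS 12u                    /* 4 KiB page => 12-bit page offset */
#define PAGE_SIZE        (1u << PAGE_OFFSET_BITS)

#define VPN(va)     ((va) >> PAGE_OFFSET_BITS)   /* virtual page number         */
#define OFFSET(va)  ((va) & (PAGE_SIZE - 1u))    /* byte offset within the page */

int main(void)
{
    uint32_t va = 0x00403A2Cu;
    printf("VA 0x%08X -> virtual page 0x%05X, offset 0x%03X\n",
           va, VPN(va), OFFSET(va));
    /* The physical address is (physical page number << 12) | OFFSET(va). */
    return 0;
}
```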
Translation Using a Page Table
 The number of pages addressable with the virtual
address need not match the number of pages
addressable with the physical address
 Having a larger number of virtual pages than physical
pages gives the illusion of an essentially unbounded
amount of virtual memory
 A miss in virtual memory is called a
page fault
Page Fault Penalty
 On page fault, the page must be fetched
from disk
 Takes millions of clock cycles
 Handled by OS code
 Try to minimize page fault rate
 Fully associative placement
 Smart replacement algorithms

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 26


Page Tables
 Stores placement information
 Array of page table entries, indexed by virtual
page number
 Page table register in CPU points to page
table in physical memory
 If page is present in memory
 PTE stores the physical page number
 Plus other status bits (referenced, dirty, …)
 If page is not present
 PTE can refer to location in swap space on
disk
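A hypothetical C sketch of a page table entry and a lookup following these bullets; the field names, widths, and layout are assumptions for illustration, not a real OS or ISA format.

```c
#include <stdint.h>
#include <stdbool.h>

/* One page table entry (PTE), indexed by virtual page number. */
typedef struct {
    uint32_t valid      : 1;   /* page is present in physical memory            */
    uint32_t referenced : 1;   /* set on access, periodically cleared by the OS */
    uint32_t dirty      : 1;   /* set when the page has been written            */
    uint32_t ppn        : 20;  /* physical page number if valid,                */
                               /* otherwise a swap-space location on disk       */
} pte_t;

/* Translate a virtual page number; returns false to signal a page fault. */
bool translate(const pte_t *page_table, uint32_t vpn, uint32_t *ppn_out)
{
    pte_t pte = page_table[vpn];   /* the page table register points at page_table */
    if (!pte.valid)
        return false;              /* page fault: the OS must bring the page in    */
    *ppn_out = pte.ppn;
    return true;
}
```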
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 27
Mapping Pages to Storage

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 28


Mapping Pages to Storage
 Page table: resides in memory; VM
systems locate pages by using a table that
indexes memory
 The page table is indexed with the virtual page number from the virtual
address to find the corresponding physical page
number
 Each program has its own page table, which maps the virtual
address space of that program to main memory
Mapping Pages to Storage
 Page table register: to indicate the location
of the page table in memory, the hardware
includes a register that points to the start
of the page table
Replacement and Writes
 To reduce page fault rate, prefer least-
recently used (LRU) replacement
 Reference bit (aka use bit) in PTE set to 1 on
access to page
 Periodically cleared to 0 by OS
 A page with reference bit = 0 has not been
used recently
 Disk writes take millions of cycles
 Block at once, not individual locations
 Write through is impractical
 Use write-back
 Dirty bit in PTE set when page is written
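Continuing the hypothetical pte_t sketch from the Page Tables slide (still an illustration, with assumed names and <stdint.h> types), the OS's LRU approximation with the reference bit might look like this:

```c
/* Periodically clear reference bits; pages whose bit is still 0 at the next
 * scan have not been used recently and are good replacement candidates. */
void clear_reference_bits(pte_t *page_table, uint32_t num_pages)
{
    for (uint32_t vpn = 0; vpn < num_pages; vpn++)
        page_table[vpn].referenced = 0;
}

/* Pick a victim page: prefer one that is neither referenced nor dirty,
 * since a clean page needs no disk write before being replaced. */
uint32_t choose_victim(const pte_t *page_table, uint32_t num_pages)
{
    uint32_t fallback = 0;
    for (uint32_t vpn = 0; vpn < num_pages; vpn++) {
        if (!page_table[vpn].valid) continue;
        if (!page_table[vpn].referenced && !page_table[vpn].dirty) return vpn;
        fallback = vpn;            /* remember some valid page as a last resort */
    }
    return fallback;
}
```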
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 31
Translation-Lookaside Buffer
Fast Translation Using a TLB
 Page tables are stored in main memory
 Address translation would appear to require
extra memory references
 One memory access to obtain the physical address
 Second memory access to get the data

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 33


Fast Translation Using a TLB
 How can we improve access performance?
– Access to page tables has good locality
 So use a fast cache of PTEs within the CPU
 Called a Translation Look-aside Buffer (TLB)
 Typical: 16–512 PTEs, 0.5–1 cycle for hit, 10–100
cycles for miss, 0.01%–1% miss rate
 Misses could be handled by hardware or software
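A minimal C sketch of the lookup order these bullets describe, using a small fully associative TLB consulted before the in-memory page table; the size, structure, and names are assumptions, not a particular processor's TLB.

```c
#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 64                 /* typical TLBs hold 16-512 PTEs */

typedef struct { bool valid; uint32_t vpn; uint32_t ppn; } tlb_entry_t;
static tlb_entry_t tlb[TLB_ENTRIES];

/* Fast path: look for the virtual page number in the TLB (fully associative here).
 * On a TLB miss the PTE must be fetched from the page table in memory and the
 * TLB refilled; if that PTE is invalid, a page fault is raised instead. */
bool tlb_lookup(uint32_t vpn, uint32_t *ppn_out)
{
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {   /* TLB hit: no extra memory access */
            *ppn_out = tlb[i].ppn;
            return true;
        }
    }
    return false;                                  /* TLB miss: walk the page table   */
}
```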

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 34


Fast Translation Using a TLB

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 35


TLB Misses
 If page is in memory
 Load the PTE from memory and retry
 Could be handled in hardware
 Can get complex for more complicated page table
structures
 Or in software
 Raise a special exception, with optimized handler
 If page is not in memory (page fault)
 OS handles fetching the page and updating
the page table
 Then restart the faulting instruction
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 36
TLB Miss Handler
 TLB miss indicates
 Page present, but PTE not in TLB
 Page not present
 Must recognize TLB miss before
destination register overwritten
 Raise exception
 Handler copies PTE from memory to TLB
 Then restarts instruction
 If page not present, page fault will occur

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 37


Page Fault Handler
 Use faulting virtual address to find PTE
 Locate page on disk
 Choose page to replace
 If dirty, write to disk first
 Read page into memory and update page
table
 Make process runnable again
 Restart from faulting instruction
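To tie the steps together, here is a deliberately tiny, self-contained C simulation of them; the frame count, page size, disk-as-array model, and round-robin victim choice are all simplifying assumptions for illustration, not how a real OS is organized.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NUM_VPAGES 8     /* virtual pages  */
#define NUM_PPAGES 2     /* physical frames */
#define PAGE_SIZE  64    /* tiny pages, illustration only */

typedef struct { int valid, dirty; int loc; } pte_t;  /* loc = frame if valid, else disk slot */

static uint8_t disk[NUM_VPAGES][PAGE_SIZE];   /* swap space, one slot per virtual page */
static uint8_t phys[NUM_PPAGES][PAGE_SIZE];   /* physical memory                        */
static pte_t   pt[NUM_VPAGES];                /* page table, indexed by VPN             */
static int     resident[NUM_PPAGES];          /* which VPN currently owns each frame    */
static int     next_victim = 0;               /* trivial round-robin replacement        */

/* Page-fault handler: follows the steps listed on this page. */
static void handle_page_fault(int vpn)
{
    int frame = next_victim;                          /* choose a page (frame) to replace */
    next_victim = (next_victim + 1) % NUM_PPAGES;
    int old = resident[frame];
    if (old >= 0) {                                   /* the victim frame holds a page    */
        if (pt[old].dirty)
            memcpy(disk[old], phys[frame], PAGE_SIZE);/* if dirty, write to disk first    */
        pt[old].valid = 0;
        pt[old].loc   = old;                          /* its PTE now names a disk slot    */
    }
    memcpy(phys[frame], disk[vpn], PAGE_SIZE);        /* read the page into memory        */
    pt[vpn].valid = 1;                                /* update the page table            */
    pt[vpn].dirty = 0;
    pt[vpn].loc   = frame;
    resident[frame] = vpn;                            /* process can now restart the      */
}                                                     /* faulting instruction             */

int main(void)
{
    for (int v = 0; v < NUM_VPAGES; v++) pt[v].loc = v;   /* every page starts on disk */
    resident[0] = resident[1] = -1;
    handle_page_fault(3);                                 /* touching page 3 faults     */
    printf("page 3 now in frame %d\n", pt[3].loc);
    return 0;
}
```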

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 38


TLB and Cache Interaction
 If cache tag uses
physical address
 Need to translate
before cache lookup
 Alternative: use virtual
address tag
 Complications due to
aliasing
 Different virtual
addresses for shared
physical address

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 39


Memory Protection
 Different tasks can share parts of their
virtual address spaces
 But need to protect against errant access
 Requires OS assistance
 Hardware support for OS protection
 Privileged supervisor mode (aka kernel mode)
 Privileged instructions
 Page tables and other state information only
accessible in supervisor mode
 System call exception (e.g., syscall in MIPS)
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 40
§5.8 A Common Framework for Memory Hierarchies
The Memory Hierarchy
The BIG Picture
 Common principles apply at all levels of
the memory hierarchy
 Based on notions of caching
 At each level in the hierarchy
 Block placement
 Finding a block
 Replacement on a miss
 Write policy
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 41
Block Placement
 Determined by associativity
 Direct mapped (1-way associative)
 One choice for placement
 n-way set associative
 n choices within a set
 Fully associative
 Any location
 Higher associativity reduces miss rate
 Increases complexity, cost, and access time

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 42


Finding a Block

Associativity         | Location method                        | Tag comparisons
Direct mapped         | Index                                  | 1
n-way set associative | Set index, then search entries in set  | n
Fully associative     | Search all entries                     | #entries
Fully associative     | Full lookup table                      | 0

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 43


 Hardware caches
 Reduce comparisons to reduce cost
 Virtual memory
 Full table lookup makes full associativity feasible
 Benefit in reduced miss rate
Replacement
 Choice of entry to replace on a miss
 Least recently used (LRU)
 Complex and costly hardware for high associativity
 Random
 Close to LRU, easier to implement
 Virtual memory
 LRU approximation with hardware support

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 45


Write Policy
 Write-through
 Update both upper and lower levels
 Simplifies replacement, but may require write
buffer
 Write-back
 Update upper level only
 Update lower level when block is replaced
 Need to keep more state
 Virtual memory
 Only write-back is feasible, given disk write
latency

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 46


Sources of Misses
 Compulsory misses (aka cold start misses)
 First access to a block
 Capacity misses
 Due to finite cache size
 A replaced block is later accessed again
 Conflict misses (aka collision misses)
 In a non-fully associative cache
 Due to competition for entries in a set
 Would not occur in a fully associative cache of
the same total size

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 47


Cache Design Trade-offs

Design change          | Effect on miss rate          | Negative performance effect
Increase cache size    | Decreases capacity misses    | May increase access time
Increase associativity | Decreases conflict misses    | May increase access time
Increase block size    | Decreases compulsory misses  | Increases miss penalty; for very large
                       |                              | block sizes, may increase miss rate
                       |                              | due to pollution

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 48


§5.9 Using a Finite State Machine to Control A Simple Cache
Cache Control
 Example cache characteristics
 Direct-mapped, write-back, write allocate
 Block size: 4 words (16 bytes)
 Cache size: 16 KB (1024 blocks)
 32-bit byte addresses
 Valid bit and dirty bit per block
 Blocking cache
 CPU waits until access is complete

Address fields: Tag = bits 31-14 (18 bits), Index = bits 13-4 (10 bits), Offset = bits 3-0 (4 bits)
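Assuming the field boundaries above, a small C sketch of how the controller would split a 32-bit byte address; the macro names are mine.

```c
#include <stdint.h>
#include <stdio.h>

/* 16 KB cache, 1024 blocks of 16 bytes: 4-bit offset, 10-bit index, 18-bit tag. */
#define OFFSET(addr)  ((addr) & 0xFu)            /* bits 3-0   */
#define INDEX(addr)   (((addr) >> 4) & 0x3FFu)   /* bits 13-4  */
#define TAG(addr)     ((addr) >> 14)             /* bits 31-14 */

int main(void)
{
    uint32_t addr = 0x12345678u;
    printf("tag=0x%05X index=0x%03X offset=0x%X\n", TAG(addr), INDEX(addr), OFFSET(addr));
    return 0;
}
```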

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 49


Interface Signals

CPU ↔ Cache signals:    Read/Write, Valid, Address (32 bits), Write Data (32 bits), Read Data (32 bits), Ready
Cache ↔ Memory signals: Read/Write, Valid, Address (32 bits), Write Data (128 bits), Read Data (128 bits), Ready
Memory side: multiple cycles per access

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 50


Finite State Machines
 Use an FSM to
sequence control steps
 Set of states, transition
on each clock edge
 State values are binary
encoded
 Current state stored in a
register
 Next state
= fn (current state,
current inputs)
 Control output signals
= fo (current state)
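A hedged C sketch of this next-state structure, using the four states of the simple blocking cache controller described in this chapter (Idle, Compare Tag, Write-Back, Allocate); the input signal names and encoding are assumptions for illustration.

```c
#include <stdbool.h>

/* States of a simple blocking write-back cache controller. */
typedef enum { IDLE, COMPARE_TAG, WRITE_BACK, ALLOCATE } cache_state_t;

/* Inputs sampled on each clock edge (signal names are illustrative). */
typedef struct {
    bool cpu_request;    /* valid request from the CPU                */
    bool hit;            /* tag match and valid bit set               */
    bool dirty;          /* victim block has been modified            */
    bool mem_ready;      /* memory has finished the current transfer  */
} fsm_inputs_t;

/* Next state = fn(current state, current inputs). */
cache_state_t next_state(cache_state_t s, fsm_inputs_t in)
{
    switch (s) {
    case IDLE:        return in.cpu_request ? COMPARE_TAG : IDLE;
    case COMPARE_TAG: if (in.hit) return IDLE;                       /* hit: access done     */
                      return in.dirty ? WRITE_BACK : ALLOCATE;       /* miss: maybe write back */
    case WRITE_BACK:  return in.mem_ready ? ALLOCATE : WRITE_BACK;   /* wait for memory      */
    case ALLOCATE:    return in.mem_ready ? COMPARE_TAG : ALLOCATE;  /* refill, then recheck */
    }
    return IDLE;
}
```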
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 51
Cache Controller FSM

Could partition into separate states to reduce clock cycle time

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 52


§5.10 Parallelism and Memory Hierarchies: Cache Coherence
Cache Coherence Problem
 Suppose two CPU cores share a physical
address space
 Write-through caches

Time step | Event               | CPU A’s cache | CPU B’s cache | Memory
0         |                     |               |               | 0
1         | CPU A reads X       | 0             |               | 0
2         | CPU B reads X       | 0             | 0             | 0
3         | CPU A writes 1 to X | 1             | 0             | 1
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 53
Coherence Defined
 Informally: Reads return most recently
written value
 Formally:
 P writes X; P reads X (no intervening writes)
 read returns written value
 P1 writes X; P2 reads X (sufficiently later)
 read returns written value
 c.f. CPU B reading X after step 3 in example
 P1 writes X, P2 writes X
 all processors see writes in the same order
 End up with the same final value for X

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 54


Cache Coherence Protocols
 Operations performed by caches in
multiprocessors to ensure coherence
 Migration of data to local caches
 Reduces bandwidth for shared memory
 Replication of read-shared data
 Reduces contention for access
 Snooping protocols
 Each cache monitors bus reads/writes
 Directory-based protocols
 Caches and memory record sharing status of
blocks in a directory
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 55
Invalidating Snooping Protocols
 Cache gets exclusive access to a block
when it is to be written
 Broadcasts an invalidate message on the bus
 Subsequent read in another cache misses
 Owning cache supplies updated value

CPU activity        | Bus activity      | CPU A’s cache | CPU B’s cache | Memory
CPU A reads X       | Cache miss for X  | 0             |               | 0
CPU B reads X       | Cache miss for X  | 0             | 0             | 0
CPU A writes 1 to X | Invalidate for X  | 1             |               | 0
CPU B reads X       | Cache miss for X  | 1             | 1             | 1

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 56
Memory Consistency
 When are writes seen by other processors
 “Seen” means a read returns the written value
 Can’t be instantaneously
 Assumptions
 A write completes only when all processors have seen
it
 A processor does not reorder writes with other
accesses
 Consequence
 P writes X then writes Y
 all processors that see new Y also see new X
 Processors can reorder reads, but not writes

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 57


§5.13 The ARM Cortex-A8 and Intel Core i7 Memory Hierarchies
Multilevel On-Chip Caches

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 58


2-Level TLB Organization

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 59


Supporting Multiple Issue
 Both have multi-banked caches that allow
multiple accesses per cycle assuming no
bank conflicts
 Core i7 cache optimizations
 Return requested word first
 Non-blocking cache
 Hit under miss
 Miss under miss
 Data prefetching

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 60


§5.14 Going Faster: Cache Blocking and Matrix Multiply
DGEMM
 Combine cache blocking and subword
parallelism
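As a sketch of the cache-blocking half of that combination (without the subword-parallel AVX intrinsics the chapter's final version adds), a blocked DGEMM can look like the following; this follows the standard column-major formulation and is illustrative, not the book's exact code.

```c
/* C = C + A * B for n x n matrices stored column-major, processed in
 * BLOCKSIZE x BLOCKSIZE tiles so that each tile fits in the cache.
 * Assumes n is a multiple of BLOCKSIZE, as the textbook example does. */
#define BLOCKSIZE 32

static void do_block(int n, int si, int sj, int sk,
                     const double *A, const double *B, double *C)
{
    for (int i = si; i < si + BLOCKSIZE; ++i)
        for (int j = sj; j < sj + BLOCKSIZE; ++j) {
            double cij = C[i + j * n];               /* column-major: element C[i][j] */
            for (int k = sk; k < sk + BLOCKSIZE; ++k)
                cij += A[i + k * n] * B[k + j * n];
            C[i + j * n] = cij;
        }
}

void dgemm_blocked(int n, const double *A, const double *B, double *C)
{
    for (int sj = 0; sj < n; sj += BLOCKSIZE)
        for (int si = 0; si < n; si += BLOCKSIZE)
            for (int sk = 0; sk < n; sk += BLOCKSIZE)
                do_block(n, si, sj, sk, A, B, C);
}
```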

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 61


§5.15 Fallacies and Pitfalls
Pitfalls
 Byte vs. word addressing
 Example: 32-byte direct-mapped cache,
4-byte blocks
 Byte 36 maps to block 1 (block address = 36/4 = 9; 9 mod 8 = 1)
 Word 36 maps to block 4 (block address = 36; 36 mod 8 = 4)
 Ignoring memory system effects when
writing or generating code
 Example: iterating over rows vs. columns of
arrays
 Large strides result in poor locality
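A small C illustration of the second pitfall: both loops compute the same sum over a large array, but the column-order loop strides through memory and has far worse locality (the array size is arbitrary, chosen only to exceed typical caches).

```c
#include <stddef.h>

#define N 2048
static double a[N][N];          /* row-major in C: a[i][j] and a[i][j+1] are adjacent */

double sum_row_order(void)      /* unit stride: good spatial locality */
{
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

double sum_column_order(void)   /* stride of N doubles: poor locality, many more misses */
{
    double s = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += a[i][j];
    return s;
}
```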

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 62


Pitfalls
 In multiprocessor with shared L2 or L3
cache
 Less associativity than cores results in conflict
misses
 More cores ⇒ need to increase associativity
 Using AMAT to evaluate performance of
out-of-order processors
 Ignores effect of non-blocked accesses
 Instead, evaluate performance by simulation

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 63


Pitfalls
 Extending address range using segments
 E.g., Intel 80286
 But a segment is not always big enough
 Makes address arithmetic complicated
 Implementing a VMM on an ISA not
designed for virtualization
 E.g., non-privileged instructions accessing
hardware resources
 Either extend ISA, or require guest OS not to
use problematic instructions
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 64
§5.16 Concluding Remarks
Concluding Remarks
 Fast memories are small, large memories are
slow
 We really want fast, large memories ☹
 Caching gives this illusion ☺
 Principle of locality
 Programs use a small part of their memory space
frequently
 Memory hierarchy
 L1 cache ⇒ L2 cache ⇒ … ⇒ DRAM memory
⇒ disk
 Memory system design is critical for
multiprocessors

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 65
