CSC 258
Week 12
© Course Director-EECS2021
This Week’s Learning Goals
1. Describe how caches can be used to reduce the effective memory latency.
Week’s Plan
• Associativity
• Memory Hierarchy Performance
• Dependability (Time permitting)
• Memory System Summary
• Virtual Memory and TLB (Time permitting)
Recap:
Last week, we looked at locality and the memory hierarchy.
- Locality is the big idea: it allows us to build high-performance memory systems based on predicting what data we will need.
- We introduced the idea of the memory wall: increases in processor performance were making memory effectively further away (more processor cycles per access).
Finally, we examined a direct-mapped cache, and this week we’re going to build on that design.
Activity: Direct-Mapped Caching
Assume you have a direct-mapped cache that stores 4 blocks, each containing 4 words.
a) For an 8-bit address, indicate which bits indicate the block and which indicate the tag.
b) Given the following sequence of addresses, indicate the number of hits and misses and the final state of the cache:
80, 90, 84, 40, 84, 90, A4, B8, 44, 64, A4, B8
Activity ANS: Direct-Mapped Caching
Assume you have a direct-mapped cache that stores 4 blocks, each containing 4 words.
a) For an 8-bit address, indicate which bits indicate the block and which indicate the tag.
b) Given the following sequence of addresses, indicate the number of hits and misses and the final state of the cache:
80, 90, 84, 40, 84, 90, A4, B8, 44, 64, A4, B8
a) Bits [3:0] are the byte offset within a block (4 words = 16 bytes), bits [5:4] are the cache index, and bits [7:6] are the tag.
b) 3 hits, 9 misses. Each row shows the block address, cache index, hit/miss, and the contents (tag; block) of each cache line after the access:

Block  Index  Hit/miss  Line 0       Line 1       Line 2       Line 3
8      0      M         10; MEM[8]
9      1      M         10; MEM[8]   10; MEM[9]
8      0      H         10; MEM[8]   10; MEM[9]
4      0      M         01; MEM[4]   10; MEM[9]
8      0      M         10; MEM[8]   10; MEM[9]
9      1      H         10; MEM[8]   10; MEM[9]
A      2      M         10; MEM[8]   10; MEM[9]   10; MEM[A]
B      3      M         10; MEM[8]   10; MEM[9]   10; MEM[A]   10; MEM[B]
4      0      M         01; MEM[4]   10; MEM[9]   10; MEM[A]   10; MEM[B]
6      2      M         01; MEM[4]   10; MEM[9]   01; MEM[6]   10; MEM[B]
A      2      M         01; MEM[4]   10; MEM[9]   10; MEM[A]   10; MEM[B]
B      3      H         01; MEM[4]   10; MEM[9]   10; MEM[A]   10; MEM[B]
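A quick way to check traces like this is to simulate them. Below is a minimal Python sketch (not part of the slides; the names are illustrative) of the direct-mapped cache above.

# Minimal simulator for the cache above: 4 blocks of 4 words,
# so 16 bytes per block, 2 index bits, and the rest is the tag.
NUM_BLOCKS = 4
BLOCK_BYTES = 16

def simulate(addresses):
    cache = [None] * NUM_BLOCKS        # one tag (or None) per cache line
    hits = misses = 0
    for addr in addresses:
        block = addr // BLOCK_BYTES    # drop the 4 byte-offset bits
        index = block % NUM_BLOCKS     # low 2 bits of the block address
        tag = block // NUM_BLOCKS      # remaining high bits
        if cache[index] == tag:
            hits += 1
        else:
            misses += 1
            cache[index] = tag         # load the block, evicting the old one
    return hits, misses

trace = [0x80, 0x90, 0x84, 0x40, 0x84, 0x90,
         0xA4, 0xB8, 0x44, 0x64, 0xA4, 0xB8]
print(simulate(trace))  # (3, 9): 3 hits and 9 misses, as in the table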
Follow-up: Performance Evaluation
a) What is the miss rate of the sequence? With a 100-cycle miss penalty, what is the AMAT?
b) Is that miss rate a fair evaluation of the performance of the cache? Why or why not?
c) Using that miss rate and assuming (1) a CPI of 1 without memory stalls, (2) 40% load/stores, and (3) a 100-cycle miss penalty, what is the CPI with memory stalls using this cache?
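One possible answer sketch for a) and c), assuming a 1-cycle hit time and that the miss rate applies to instruction fetches as well as to loads/stores:
Miss rate = 9/12 = 75%, so AMAT = 1 + 0.75 × 100 = 76 cycles.
CPI with stalls = 1 + (1 + 0.4) × 0.75 × 100 = 106.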
Associativity
If two blocks hash to the same value, they can’t both be stored. To reduce the impact of this, caches are often associative.
A direct-mapped cache has associativity 1: a block can be placed in only one place in the cache.
A 2-way set associative cache can store two blocks that hash to the same value: there are two places where a block may be placed in the cache.
In a fully associative cache, a block can be placed in any location in the cache.
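To make the placement rule concrete, here is a minimal Python sketch (names are illustrative) that computes where a block may go for a given associativity; direct mapped is the 1-way case, and fully associative is the case where the number of ways equals the number of blocks.

# For a cache with num_blocks entries split into sets of `assoc` ways,
# a block may go in any way of exactly one set.
def placement(block_addr, num_blocks, assoc):
    num_sets = num_blocks // assoc
    set_index = block_addr % num_sets  # which set the block hashes to
    tag = block_addr // num_sets       # identifies the block within the set
    return set_index, tag

print(placement(12, 8, 1))  # direct mapped: (4, 1) -> only entry 4
print(placement(12, 8, 2))  # 2-way: (0, 3) -> either way of set 0
print(placement(12, 8, 8))  # fully associative: (0, 12) -> any entry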
Associative Caches
Fully associative
• Allow a given block to go in any cache entry
• Requires all entries to be searched at once
• Comparator per entry (expensive)
Associative Cache Example
Spectrum of Associativity
For a cache with 8 entries
Associativity Example
Compare 4-block caches for three configurations: direct mapped, 2-way set associative, and fully associative.
4-way Set Associative Cache Design
Associativity Doesn’t Always Mean “Better”
Designers need to balance the amount of associativity with the expected workload.
Example: SPEC2000
Increased associativity decreases miss rate, with diminishing returns.
Cache Eviction
Every load brings in a block.
• Each cache has a finite size: it can store some maximum number of blocks.
• Based on associativity, it can store a set number of blocks with a specific hash.
• Every time a load is performed from memory, the block must be stored.
• This means that another block might need to be evicted.
Replacement Policy
Direct mapped: there is no choice, so no policy is needed.
Set associative:
• Prefer to evict an invalid (empty) entry, if there is one.
• Otherwise, we must choose among the entries in the set.
Least-recently used (LRU):
• Choose the entry unused for the longest time (see the sketch below).
• True LRU is hard to implement, so approximations are used.
Random:
• Gives approximately the same performance as LRU at high associativity.
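A minimal Python sketch of exact LRU within one set (illustrative; real hardware typically approximates this with a few status bits per entry):

# LRU within one set, modeled as a list ordered from least- to
# most-recently used: refresh on a hit, evict from the front when full.
def access(lru_set, tag, assoc):
    if tag in lru_set:                 # hit: refresh recency
        lru_set.remove(tag)
        lru_set.append(tag)
        return "hit"
    if len(lru_set) == assoc:          # set full: evict LRU (front)
        lru_set.pop(0)
    lru_set.append(tag)                # insert as most recently used
    return "miss"

s = []
for t in [8, 4, 8, 10, 4]:             # one 2-way set
    print(t, access(s, t, 2))
# 8 miss, 4 miss, 8 hit, 10 miss (evicts 4), 4 miss (evicts 8)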
Activity: Associative Caching
Assume you have a 2-way set associative cache that stores a total of 4 blocks, each containing 4 words. LRU is used to evict items.
a) For an 8-bit address, indicate which bits indicate the block and which indicate the tag.
b) Given the following sequence of addresses, indicate the number of hits and misses and the final state of the cache:
80, 90, 84, 40, 84, 90, A4, B8, 44, 64, A4, B8
Activity ANS: Associative Caching
Assume you have a 2-way set associative cache that stores a total of 4 blocks, each containing 4 words. LRU is used to evict items.
a) For an 8-bit address, indicate which bits indicate the block and which indicate the tag.
b) Given the following sequence of addresses, indicate the number of hits and misses and the final state of the cache:
80, 90, 84, 40, 84, 90, A4, B8, 44, 64, A4, B8
a) Bits [3:0] are the byte offset (4 words = 16 bytes), bit [4] is the set index (2 sets of 2 ways), and bits [7:5] are the tag.
b) 4 hits, 8 misses. Each row shows the block address, set index, hit/miss, and the contents (tag; block) of each way after the access:

Block  Set  Hit/miss  Set 0, way 0   Set 0, way 1   Set 1, way 0   Set 1, way 1
8      0    M         100; MEM[8]
9      1    M         100; MEM[8]                   100; MEM[9]
8      0    H         100; MEM[8]                   100; MEM[9]
4      0    M         100; MEM[8]    010; MEM[4]    100; MEM[9]
8      0    H         100; MEM[8]    010; MEM[4]    100; MEM[9]
9      1    H         100; MEM[8]    010; MEM[4]    100; MEM[9]
A      0    M         100; MEM[8]    101; MEM[A]    100; MEM[9]
B      1    M         100; MEM[8]    101; MEM[A]    100; MEM[9]    101; MEM[B]
4      0    M         010; MEM[4]    101; MEM[A]    100; MEM[9]    101; MEM[B]
6      0    M         010; MEM[4]    011; MEM[6]    100; MEM[9]    101; MEM[B]
A      0    M         101; MEM[A]    011; MEM[6]    100; MEM[9]    101; MEM[B]
B      1    H         101; MEM[A]    011; MEM[6]    100; MEM[9]    101; MEM[B]
Follow-up: Performance Evaluation
Based on the hits and misses from the breakout:
a) What is the miss rate of the sequence? With a 100-cycle miss penalty, what is the AMAT?
b) Is that miss rate a fair evaluation of the performance of the cache? Why or why not?
c) Using that miss rate and assuming (1) a CPI of 1 without memory stalls, (2) 40% load/stores, and (3) a 100-cycle miss penalty, what is the CPI with memory stalls using this cache?
How does this compare with the direct mapped cache from earlier?
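A possible worked comparison, under the same assumptions as before: the 2-way cache has 4 hits and 8 misses, so the miss rate falls from 9/12 (75%) to 8/12 (about 67%). AMAT = 1 + (8/12) × 100 ≈ 67.7 cycles (versus 76), and CPI ≈ 1 + 1.4 × (8/12) × 100 ≈ 94.3 (versus 106): the extra associativity helps on this trace.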
Activity: Associative Caching
Earlier, I mentioned that “some workloads actually benefit from less associativity.”
For the direct mapped and associative caches we’ve seen so far …
b) Generate a new sequence that performs much better on the direct-mapped cache.
Multilevel Caches
Modern memory systems are composed of a sequence of caches.
• Level-1: Primary cache attached to CPU is small and fast
• Level-2: Services misses from primary cache and stores more
• Main memory services Level-2 cache misses
Multilevel Cache Considerations
L1 cache: the focus is on minimizing hit time.
L2 cache: the focus is on reducing the miss rate, to avoid the penalty of a main memory access. Hit time matters less here: it is small compared to a main memory access.
Writeback Policy
On a store, when is the lower level updated? Write-through updates it on every store; write-back marks the block dirty and updates the lower level only when the block is evicted.
Multilevel Cache Example
Given …
CPU base CPI = 1, clock rate = 4 GHz
Primary cache miss rate per instruction = 2%
Main memory access time = 100 ns
Now add an L2 cache with a 20-cycle access time and a 0.5% global miss rate to main memory. What is the total CPI?
Multilevel Cache Example, Continued
At 4 GHz, one cycle is 0.25 ns, so the 100 ns main-memory access costs 400 cycles.
Total CPI = Base CPI + Primary stalls per instruction + Secondary stalls per instruction
= 1 + 2% × 20 + 0.5% × 400
= 1 + 0.02 × 20 + 0.005 × 400 = 3.4
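For comparison, with only the primary cache every primary miss goes to main memory: CPI = 1 + 0.02 × 400 = 9. Adding the L2 cache therefore improves performance by a factor of 9 / 3.4 ≈ 2.6.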
Dependability
Performance is only one consideration when designing a memory system.
We also need to consider dependability.
Dependability
If a value is stored at an address, then when that address is loaded, the same value will be returned.
Dependability Measures
Reliability: mean time to failure (MTTF)
Service interruption: mean time to repair (MTTR)
Mean time between failures: MTBF = MTTF + MTTR
Availability = MTTF / (MTTF + MTTR) (a numeric example follows below)
Improving availability:
• Increase MTTF: fault avoidance, fault tolerance, fault forecasting
• Reduce MTTR: improved tools and processes for diagnosis and repair
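A hypothetical numeric example: with MTTF = 1,000,000 hours and MTTR = 24 hours, availability = 1,000,000 / 1,000,024 ≈ 99.998%.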
Increasing Dependability
Most memory systems today use error-correcting codes (ECC) to detect and correct single-bit (and sometimes multi-bit) errors in stored data.
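As a minimal illustration of the idea, here is a sketch of simple even parity, which only detects single-bit errors; real memories use stronger codes such as SECDED Hamming codes, which can also correct them.

# Even parity: store one extra bit so the total number of 1 bits is
# even. Any single flipped bit changes the parity and is detected.
def parity(word):
    return bin(word).count("1") % 2

data = 0b10110010
stored_parity = parity(data)

corrupted = data ^ (1 << 3)                 # flip one stored bit
print(parity(corrupted) != stored_parity)   # True: error detected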
Cache Coherence
We will not discuss coherence in detail in this course. It’s a huge topic. But be aware that it is the major issue that makes caching in parallel systems difficult.
Cache coherence refers to the uniformity of shared data stored across the memory system.
• Example: when a value is stored, how is the stored value propagated back to memory and/or other caches?
Types of Misses: Designing with the Three C’s
Compulsory misses
• First access to a block
Capacity misses
• Occur when a block that was replaced is accessed later
• Caused by limited cache size
Conflict misses (or collision misses)
• Occur when two blocks compete for space in the cache and evict each other
• Would not occur in a fully associative cache of the same total size
Cache Design Trade-offs
• Larger cache: fewer capacity misses, but potentially longer hit time
• Higher associativity: fewer conflict misses, but potentially longer hit time
• Larger blocks: fewer compulsory misses, but higher miss penalty
Activity: Designing a Cache
You are designing a processor and currently have a 2-way, 32-entry set-associative cache that stores 8-word blocks.
The processor design is currently being tested. What is your response when you are told that the following things are occurring?
a) A benchmark never completely fills the cache but has a large number of misses; the same lines appear to be reloaded again and again.
b) A benchmark is reading data sequentially from a file, and miss rates are high: it appears to be loading one line at a time and not reusing data.
c) Tricky: on a very short benchmark, miss rates are very high – near 50%.
Virtual Memory
• Use main memory as a “cache” for secondary (disk) storage
• Managed jointly by CPU hardware and the operating system (OS)
• Programs share main memory
• Each gets a private virtual address space holding its frequently used code and data
• Protected from other programs
• CPU and OS translate virtual addresses to physical addresses
• A VM “block” is called a page
• A VM translation “miss” is called a page fault
Address Translation
Fixed-size pages (e.g., 4K)
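A minimal Python sketch of the translation step (the page-table contents here are a made-up example):

# Translate a virtual address using fixed 4 KiB pages: the low 12 bits
# are the page offset and pass through unchanged.
PAGE_SIZE = 4096
page_table = {0x2: 0x7, 0x3: 0x1}   # virtual page number -> physical page number

def translate(vaddr):
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn not in page_table:
        raise RuntimeError("page fault: the OS must bring the page in")
    return page_table[vpn] * PAGE_SIZE + offset

print(hex(translate(0x2ABC)))  # 0x7abc: VPN 0x2 maps to PPN 0x7, same offset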
Page Fault Penalty
• On page fault, the page must be fetched from disk
• Takes millions of clock cycles
• Handled by OS code
• Try to minimize page fault rate
• Fully associative placement
• Smart replacement algorithms
Page Tables
• Stores placement information
• Array of page table entries (PTEs), indexed by virtual page number
• Page table register in the CPU points to the page table in physical memory
• If a page is present in memory:
• The PTE stores the physical page number
• Plus other status bits (referenced, dirty, …)
• If a page is not present:
• The PTE can refer to a location in swap space on disk
Replacement and Writes
• To reduce the page fault rate, prefer least-recently used (LRU) replacement
• Reference bit (aka use bit) in PTE set to 1 on access to the page
• Periodically cleared to 0 by the OS
• A page with reference bit = 0 has not been used recently (see the sketch below)
• Disk writes take millions of cycles
• Write a block at a time, not individual locations
• Write-through is impractical
• Use write-back
• Dirty bit in PTE set when page is written
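A sketch of how the reference bit approximates LRU (illustrative; the OS’s actual bookkeeping is more involved):

# The OS periodically clears every reference bit; a page whose bit is
# still 0 at eviction time has not been touched since the last sweep.
page_table = {0x2: {"ref": 0}, 0x3: {"ref": 0}, 0x5: {"ref": 0}}

def touch(vpn):
    page_table[vpn]["ref"] = 1    # hardware sets this on every access

def pick_victim():
    for vpn, pte in page_table.items():
        if pte["ref"] == 0:       # prefer a page not recently used
            return vpn
    return next(iter(page_table)) # all recently used: fall back to any

touch(0x2)
touch(0x5)
print(hex(pick_victim()))         # 0x3: the only page not referenced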
Fast Translation Using a TLB
• Address translation would appear to require extra memory references
• One to access the PTE
• Then the actual memory access
• But access to page tables has good locality
• So, use a fast cache of PTEs within the CPU
• Called a Translation Lookaside Buffer (TLB)
• Typical: 16–512 PTEs, 0.5–1 cycle for a hit, 10–100 cycles for a miss, 0.01%–1% miss rate
• Misses could be handled by hardware or software
TLB Misses
• If page is in memory:
• Load the PTE from memory and retry (see the sketch below)
• Could be handled in hardware
• Can get complex for more complicated page-table structures
• Or in software
• Raise a special exception, with an optimized handler
• If page is not in memory (page fault):
• OS handles fetching the page and updating the page table
• Then restart the faulting instruction
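A minimal sketch of the lookup order (TLB first, then the page table; the structures here are toy stand-ins):

# The TLB caches page-table entries so most translations avoid the
# extra memory access for the PTE.
tlb = {}
page_table = {0x2: 0x7, 0x3: 0x1}  # virtual page -> physical page

def lookup(vpn):
    if vpn in tlb:                  # TLB hit: translate immediately
        return tlb[vpn]
    if vpn in page_table:           # TLB miss, page present:
        tlb[vpn] = page_table[vpn]  # load the PTE into the TLB and retry
        return tlb[vpn]
    raise RuntimeError("page fault: OS fetches the page, then restarts")

print(lookup(0x2))  # miss: filled from the page table (prints 7)
print(lookup(0x2))  # hit: served from the TLB (prints 7)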
Quiz-like Question 2
Simulate the performance of a cache on the following (hex) address loads:
40 48 4c 40 50 58 5c 40 60 48 4c 44 40 60 58 5c
The cache is direct-mapped and stores 2 blocks of 2 words each. It uses a FIFO eviction policy.
Quiz-like Question 3
Simulate the performance of a cache on the following (hex) address loads:
40 48 4c 40 50 58 5c 40 60 48 4c 44 40 60 58 5c
The cache is 2-way set associative and stores 4 words. It uses a FIFO eviction policy.
Quiz-like Question 4
Simulate the performance of a cache on the following (hex) address loads:
40 48 4c 40 50 58 5c 40 60 48 4c 44 40 60 58 5c
The cache is 2-way set associative and stores 4 words. It uses an LRU eviction policy.
Quiz-like Question 5
Simulate the performance of a cache on the following (hex) address loads:
40 48 4c 40 50 58 5c 40 60 48 4c 44 40 60 58 5c
The cache is fully associative and stores 4 words. It uses a FIFO eviction policy.
Coming Up
Don’t Forget!
• Practice problems
• #5.7*, 5.8, 5.10*, 5.11*, 5.12*
• #1.12*, 1.13*, 1.14*, 1.15*
• Final exam review (based on Chapters 1–5) is available.
All the best for exams!
Thank you