Week12 Updated

The document outlines the learning goals and weekly plan for CSC 258, focusing on cache memory, its configurations, and performance evaluation. It discusses concepts such as locality, memory hierarchy, direct-mapped and associative caching, and the impact of cache configurations on system performance. Additionally, it covers multilevel caches, write policies, and includes activities for practical understanding of the topics discussed.


CSC 258
© Course Director-EECS2021
This Week’s Learning Goals
1. Describe how the introduction of caches can reduce the effective memory latency.

2. Describe how a sequence of memory requests will be handled by a cache, factoring in both line size and associativity.

3. Quantify the impact of different cache configurations on latency and overall system performance.

Week’s Plan
• Associativity
• Memory Hierarchy Performance
• Dependability (Time permitting)
• Memory System Summary
• Virtual Memory and TLB (Time permitting)

Recap:
Last week, we looked at locality and the memory hierarchy.
- Locality is the big idea: it allows us to build high-performance memory systems based on predicting what data we will need.
- We introduced the idea of the memory wall: increases in processor performance were making memory effectively further away, measured in processor cycles.

Finally, we examined a direct-mapped cache, and this week we're going to build on that design.

Activity: Direct-Mapped Caching

Assume you have a direct mapped cache that stores 4 blocks, each containing 4 words.

a) For an 8-bit address, indicate which bits indicate the block and which indicate the tag.

b) Given the following sequence of addresses, indicate the number of hits and misses
and the final state of the cache:
80, 90, 84, 40, 84, 90, A4, B8, 44, 64, A4, B8

Activity ANS: Direct-Mapped Caching
Assume you have a direct mapped cache that stores 4 blocks, each containing 4 words.

a) For an 8-bit address, indicate which bits indicate the block and which indicate the tag.

b) Given the following sequence of addresses, indicate the number of hits and misses
and the final state of the cache:
80, 90, 84, 40, 84, 90, A4, B8, 44, 64, A4, B8
a) Bits [3:0] select the byte within the block, bits [5:4] are the block index, and bits [7:6] are the tag.

b)
Block | Cache | Hit/ | Cache line contents after access (tag; data)
addr  | index | miss | line 0     | line 1     | line 2     | line 3
------+-------+------+------------+------------+------------+-----------
  8   |   0   |  M   | 10; MEM[8] |            |            |
  9   |   1   |  M   | 10; MEM[8] | 10; MEM[9] |            |
  8   |   0   |  H   | 10; MEM[8] | 10; MEM[9] |            |
  4   |   0   |  M   | 01; MEM[4] | 10; MEM[9] |            |
  8   |   0   |  M   | 10; MEM[8] | 10; MEM[9] |            |
  9   |   1   |  H   | 10; MEM[8] | 10; MEM[9] |            |
  A   |   2   |  M   | 10; MEM[8] | 10; MEM[9] | 10; MEM[A] |
  B   |   3   |  M   | 10; MEM[8] | 10; MEM[9] | 10; MEM[A] | 10; MEM[B]
  4   |   0   |  M   | 01; MEM[4] | 10; MEM[9] | 10; MEM[A] | 10; MEM[B]
  6   |   2   |  M   | 01; MEM[4] | 10; MEM[9] | 01; MEM[6] | 10; MEM[B]
  A   |   2   |  M   | 01; MEM[4] | 10; MEM[9] | 10; MEM[A] | 10; MEM[B]
  B   |   3   |  H   | 01; MEM[4] | 10; MEM[9] | 10; MEM[A] | 10; MEM[B]
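
To double-check the trace, here is a minimal Python sketch (not from the slides) of this cache: 4 lines of 16-byte blocks (4 words x 4 bytes), so index = block MOD 4 and tag = block DIV 4.

# Direct-mapped cache simulator for the activity above.
def simulate_direct_mapped(addresses, num_blocks=4, block_bytes=16):
    cache = [None] * num_blocks          # each line holds a tag (or None if empty)
    hits = misses = 0
    for addr in addresses:
        block = addr // block_bytes      # strip the byte offset
        index = block % num_blocks       # which cache line
        tag = block // num_blocks        # the remaining high bits
        if cache[index] == tag:
            hits += 1
        else:
            misses += 1
            cache[index] = tag           # a miss always evicts whatever was there
    return hits, misses

trace = [0x80, 0x90, 0x84, 0x40, 0x84, 0x90,
         0xA4, 0xB8, 0x44, 0x64, 0xA4, 0xB8]
print(simulate_direct_mapped(trace))     # (3, 9): 3 hits, 9 misses, as in the table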
Follow-up: Performance Evaluation

Based on the hits and misses from the breakout:

a) What is the miss rate of the sequence? With a 100 cycle miss penalty, what is AMAT?

b) Is that miss rate a fair evaluation of the performance of the cache? Why or why not?

c) Using that miss rate and assuming (1) a CPI of 1 without memory stalls, (2) 40%
load/stores, and (3) a 100 cycle miss penalty, what is the CPI with memory stalls using
this cache?
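
For reference, the standard formulas, sketched with the numbers from the trace above (9 misses in 12 accesses, i.e., a 75% miss rate) and assuming a 1-cycle hit time:

AMAT = hit time + miss rate × miss penalty = 1 + 0.75 × 100 = 76 cycles

For (c), one common reading counts 1 instruction fetch plus 0.4 data accesses per instruction and applies the miss rate to both:

CPI = 1 + 1.4 × 0.75 × 100 = 106

If the miss rate is applied to data accesses only, CPI = 1 + 0.4 × 0.75 × 100 = 31. For (b), note that a 12-access trace starting from an empty cache is dominated by compulsory misses, which inflates the measured miss rate.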

Associativity

Most caches use some form of hashing.

Caches are smaller than the memory they are caching, so they can't store everything!

If two blocks hash to the same value, they can't both be stored. To reduce the impact of this, caches are often associative.
A direct-mapped cache has associativity 1: a block can be placed in only one place in the cache.
A 2-way set associative cache can store two blocks that hash to the same value: there are two places a block may be placed in the cache.
In a fully associative cache, a block can be placed in any location in the cache.
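
To make the hashing concrete, here is an illustrative Python sketch (the sizes are assumptions, not from the slides: byte addresses, 16-byte blocks, a 64-block cache):

# How associativity changes the set-index/tag split of a block address.
def split_address(addr, num_blocks=64, ways=1, block_bytes=16):
    num_sets = num_blocks // ways        # fully associative: ways == num_blocks, one set
    block = addr // block_bytes          # strip the byte offset
    index = block % num_sets             # set selector: this is "the hash"
    tag = block // num_sets              # everything above the index bits
    return tag, index

addr = 0x1234
for ways in (1, 2, 64):                  # direct-mapped, 2-way, fully associative
    print(ways, split_address(addr, ways=ways))

More ways means fewer index bits and a longer tag; in the fully associative case the index disappears entirely.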

Associative Caches
Fully associative
• Allow a given block to go in any cache entry
• Requires all entries to be searched at once
• Comparator per entry (expensive)

n-way set associative
• Each set contains n entries
• Block number determines which set
• (Block number in memory) modulo (#Sets in cache)
• Search all entries in a given set at once
• n comparators (less expensive)

Associative Cache Example

Spectrum of Associativity
For a cache with 8 entries

Associativity Example
Compare 4-block caches under three configurations:
• Direct mapped; 2-way set associative; fully associative
• Block access sequence: 0, 8, 0, 6, 8

Direct mapped

Block | Cache | Hit/ | Cache content after access
addr  | index | miss | line 0 | line 1 | line 2 | line 3
------+-------+------+--------+--------+--------+-------
  0   |   0   | miss | Mem[0] |        |        |
  8   |   0   | miss | Mem[8] |        |        |
  0   |   0   | miss | Mem[0] |        |        |
  6   |   2   | miss | Mem[0] |        | Mem[6] |
  8   |   0   | miss | Mem[8] |        | Mem[6] |

Index = (block address) MOD (number of blocks in the cache)
Associativity Example
2-way set associative

Block | Set   | Hit/ | Cache content after access
addr  | index | miss | Set 0, way 0 | Set 0, way 1
------+-------+------+--------------+-------------
  0   |   0   | miss | Mem[0]       |
  8   |   0   | miss | Mem[0]       | Mem[8]
  0   |   0   | hit  | Mem[0]       | Mem[8]
  6   |   0   | miss | Mem[0]       | Mem[6]
  8   |   0   | miss | Mem[8]       | Mem[6]

(Set 1 is never used: every block in this sequence maps to set 0.)

Set index = (block address) MOD (number of sets in the cache)

Fully associative

Block | Hit/ | Cache content after access
addr  | miss |
------+------+-------------------------------
  0   | miss | Mem[0]
  8   | miss | Mem[0]  Mem[8]
  0   | hit  | Mem[0]  Mem[8]
  6   | miss | Mem[0]  Mem[8]  Mem[6]
  8   | hit  | Mem[0]  Mem[8]  Mem[6]

Index: not applicable (a block may be placed anywhere)
The Cost of Associativity

• An associative cache is larger and slower than a direct-mapped cache that stores the same amount of data.
• To implement associativity, we have to search more locations to determine whether a block is in the cache.
  • We must compare the tag at each possible location (N comparators instead of just 1).
  • We must steer the data from the correct location to the output (a larger mux).
4-way Set Associative Cache Design

Associative Doesn't Always Mean "Better"
Designers need to balance the amount of associativity with the expected workload.

The overhead on the previous slide pushes us to reduce associativity. Workloads that need multiple blocks from the same set at the same time push associativity up. And some workloads actually benefit from less associativity.

There isn't a single correct answer: it depends on the workload and the context of the cache.

Example: Spec2000
Increased associativity decreases miss rate … with diminishing returns

Simulation of a system with a 64 KB D-cache and 16-word blocks on SPEC2000:
1-way: 10.3%
2-way: 8.6%
4-way: 8.3%
8-way: 8.1%

Cache Eviction
Every load brings in a block.
• Each cache has a finite size.
• It can store some maximum number of blocks.
• Based on associativity, it can store a set number of
blocks with a specific hash.
• Every time a load is performed from memory, the block
must be stored.
• This means that another block might need to be evicted.

Replacement Policy
Direct mapped: there is no choice, so no policy is needed.
Set associative:
• Prefer to evict a non-valid (empty) entry, if there is one.
• Otherwise, we must choose among the entries in the set.
Least-recently used (LRU): choose the entry unused for the longest time.
• True LRU is hard to implement, so we use approximations!
Random: gives approximately the same performance as LRU for high associativity.
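
A minimal sketch of victim selection within one set (assuming each entry tracks a valid bit and a last-used timestamp; illustrative, not a particular hardware design):

import random

def choose_victim(entries, policy="lru"):
    # entries: list of dicts like {"valid": bool, "last_used": int}
    for i, e in enumerate(entries):
        if not e["valid"]:
            return i                      # prefer an empty (invalid) entry
    if policy == "lru":                   # evict the entry unused for the longest time
        return min(range(len(entries)), key=lambda i: entries[i]["last_used"])
    return random.randrange(len(entries)) # random: nearly as good at high associativity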

Activity: Associative Caching

Assume you have a 2-way set associative cache that stores a total of 4 blocks, each
containing 4 words. LRU is used to evict items.

a) For an 8-bit address, indicate which bits indicate the block and which indicate the tag.

b) Given the following sequence of addresses, indicate the number of hits and misses
and the final state of the cache:
80, 90, 84, 40, 84, 90, A4, B8, 44, 64, A4, B8

Activity ANS: Associative Caching
Assume you have a 2-way set associative cache that stores a total of 4 blocks, each
containing 4 words. LRU is used to evict items.

a) For an 8-bit address, indicate which bits indicate the block and which indicate the tag.

b) Given the following sequence of addresses, indicate the number of hits and misses
and the final state of the cache:
80, 90, 84, 40, 84, 90, A4, B8, 44, 64, A4, B8
a) Bits [3:0] select the byte within the block, bit [4] is the set index, and bits [7:5] are the tag.

b)
Block | Set   | Hit/ | Cache contents after access (tag; data)
addr  | index | miss | Set 0, way 0 | Set 0, way 1 | Set 1, way 0 | Set 1, way 1
------+-------+------+--------------+--------------+--------------+-------------
  8   |   0   |  M   | 100; MEM[8]  |              |              |
  9   |   1   |  M   | 100; MEM[8]  |              | 100; MEM[9]  |
  8   |   0   |  H   | 100; MEM[8]  |              | 100; MEM[9]  |
  4   |   0   |  M   | 100; MEM[8]  | 010; MEM[4]  | 100; MEM[9]  |
  8   |   0   |  H   | 100; MEM[8]  | 010; MEM[4]  | 100; MEM[9]  |
  9   |   1   |  H   | 100; MEM[8]  | 010; MEM[4]  | 100; MEM[9]  |
  A   |   0   |  M   | 100; MEM[8]  | 101; MEM[A]  | 100; MEM[9]  |
  B   |   1   |  M   | 100; MEM[8]  | 101; MEM[A]  | 100; MEM[9]  | 101; MEM[B]
  4   |   0   |  M   | 010; MEM[4]  | 101; MEM[A]  | 100; MEM[9]  | 101; MEM[B]
  6   |   0   |  M   | 010; MEM[4]  | 011; MEM[6]  | 100; MEM[9]  | 101; MEM[B]
  A   |   0   |  M   | 101; MEM[A]  | 011; MEM[6]  | 100; MEM[9]  | 101; MEM[B]
  B   |   1   |  H   | 101; MEM[A]  | 011; MEM[6]  | 100; MEM[9]  | 101; MEM[B]
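
As with the direct-mapped activity, the trace can be replayed mechanically. A minimal Python sketch (not from the slides; 16-byte blocks, 2 sets of 2 ways, true LRU, with each set kept as a list ordered from least to most recently used):

# 2-way set-associative cache with LRU, matching the activity above.
def simulate_set_assoc(addresses, num_sets=2, ways=2, block_bytes=16):
    sets = [[] for _ in range(num_sets)]   # per set: list of tags, LRU at the front
    hits = misses = 0
    for addr in addresses:
        block = addr // block_bytes
        index = block % num_sets
        tag = block // num_sets
        s = sets[index]
        if tag in s:
            hits += 1
            s.remove(tag)                  # re-append below to mark as MRU
        else:
            misses += 1
            if len(s) == ways:
                s.pop(0)                   # set full: evict the LRU tag
        s.append(tag)                      # this tag is now most recently used
    return hits, misses

trace = [0x80, 0x90, 0x84, 0x40, 0x84, 0x90,
         0xA4, 0xB8, 0x44, 0x64, 0xA4, 0xB8]
print(simulate_set_assoc(trace))           # (4, 8): 4 hits, 8 misses, as in the table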
Follow-up: Performance Evaluation
Based on the hits and misses from the breakout:

a) What is the miss rate of the sequence? With a 100 cycle miss penalty, what is AMAT?

b) Is that miss rate a fair evaluation of the performance of the cache? Why or why not?

c) Using that miss rate and assuming (1) a CPI of 1 without memory stalls, (2) 40%
load/stores, and (3) a 100 cycle miss penalty, what is the CPI with memory stalls using
this cache?

How does this compare with the direct mapped cache from earlier?

Activity: Associative Caching

Earlier, I mentioned that, “Some workloads actually benefit from less associativity.”

For the direct mapped and associative caches we’ve seen so far ….

a) Generate a sequence that performs much better on the associative cache.

b) Generate a new sequence that performs much better on the direct mapped cache.

Multilevel Caches
Modern memory systems are composed of a sequence of caches.
• Level-1: Primary cache attached to CPU is small and fast
• Level-2: Services misses from primary cache and stores more
• Main memory services Level-2 cache misses

Some high-end systems include an L-3 cache.

Other caches exist in the system that are not part of the memory caching hierarchy: the TLB, for example.

Multilevel Cache Considerations

L-1 cache
The focus is on minimizing hit time.
L-2 cache
The focus is on reducing miss rate to avoid the penalty of a
main memory access.
Hit time has less overall impact: it is small compared to a main memory access.

Write Policies

When data is stored in a cache, it also needs to be written to other caches and to memory.
• Write-through
  • Update both upper and lower levels on every write
  • Simplifies replacement, but may require a write buffer
• Write-back
  • Update the upper level only
  • Update the lower level when the block is replaced
  • Need to keep more state (a dirty bit), but more efficient
Multilevel Cache Example
Given …
CPU base CPI = 1, clock rate = 4 GHz
Miss rate per instruction = 2%
Main memory access time = 100 ns

With a single cache …
Miss penalty = 100 ns / 0.25 ns = 400 cycles
Effective CPI = 1 + 0.02 × 400 = 9 (!)

Multilevel Cache Example, Continued

Next, add a Level-2 cache with …
Access time = 5 ns
Global miss rate per instruction to main memory = 0.5%

Primary miss with Level-2 hit:
Penalty = 5 ns / 0.25 ns = 20 cycles
Primary miss with Level-2 miss:
Still 400 cycles

Total CPI = base CPI + primary stalls per instruction + secondary stalls per instruction
Hence, CPI = 1 + 0.02 × 20 + 0.005 × 400 = 3.4

Performance ratio = 9 / 3.4 ≈ 2.6
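
The arithmetic is easy to sanity-check (a Python sketch of this example's numbers):

# Reproduce the multilevel-cache CPI example.
cycle_ns = 1 / 4.0                       # 4 GHz clock -> 0.25 ns per cycle
mem_penalty = 100 / cycle_ns             # 400 cycles to main memory
l2_penalty = 5 / cycle_ns                # 20 cycles to the L2 cache

cpi_single = 1 + 0.02 * mem_penalty                       # 9.0
cpi_multi = 1 + 0.02 * l2_penalty + 0.005 * mem_penalty   # 3.4
print(cpi_single, cpi_multi, cpi_single / cpi_multi)      # 9.0 3.4 ~2.65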

Dependability
Performance is only one consideration when designing a memory system.
We also need to consider dependability.

A system is dependable if its service is delivered as specified.

For example: "If a value is stored at an address, then when that address is loaded, the same value will be returned."

Dependability
If a value is stored at an address, then when that address is loaded, the same
value will be returned.

This seems easy, but consider:
• Hardware failures
• Faults due to interference
• Slightly different issue: Maintaining consistency across caches

Dependability Measures
Reliability: mean time to failure (MTTF)
Service interruption: mean time to repair (MTTR)
Mean time between failures: MTBF = MTTF + MTTR
Availability = MTTF / (MTTF + MTTR)
Improving availability:
• Increase MTTF: fault avoidance, fault tolerance, fault forecasting
• Reduce MTTR: improved tools and processes for diagnosis and repair
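
For example (illustrative numbers, not from the slides): a component with MTTF = 1,000,000 hours and MTTR = 24 hours has

Availability = 1,000,000 / (1,000,000 + 24) ≈ 99.998%

so even a large MTTF leaves room for improvement by shrinking MTTR.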

Increasing Dependability
Most memory systems today use error-detecting and error-correcting codes to find and correct single-bit (and sometimes multi-bit) errors in stored data.

In important applications, data can be replicated: stored across multiple devices so that a failure in one device does not cause a loss of data.
Consider RAID systems.
Cache Coherence
We will not discuss coherence in detail in this course. It’s a huge topic. But be
aware that it is the major issue that makes caching in parallel systems difficult.

Cache coherence refers to the uniformity of shared data stored across the
memory system.
• Example: When a value is stored, how is the stored value propagated back to
memory and/or other caches?

Types of Misses: Designing with the Three C's
Compulsory misses
• The first access to a block.
Capacity misses
• Occur when a block that was replaced is accessed later.
• Caused by limited cache size.
Conflict misses (or collision misses)
• Occur when two blocks compete for space in the cache and evict each other.
• Would not occur in a fully associative cache of the same total size.

Cache Design Trade-offs

Design change          | Effect on miss rate         | Negative performance effect
-----------------------+-----------------------------+---------------------------------
Increase cache size    | Decreases capacity misses   | May increase access time
Increase associativity | Decreases conflict misses   | May increase access time
Increase block size    | Decreases compulsory misses | Increases miss penalty; for very
                       |                             | large block sizes, may increase
                       |                             | miss rate due to pollution

Activity: Designing a Cache

You are designing a processor and currently have a 2-way, 32-entry set-associative cache
that stores 8-word blocks.

The processor design is currently being tested. What is your response when you are told
that the following things are occurring?

a) A benchmark never completely fills the cache but has a large number of misses; the same lines appear to be reloaded again and again.
b) A benchmark is reading data sequentially from a file, and miss rates are high: it appears to be loading one line at a time and not reusing data.
c) Tricky: on a very short benchmark, miss rates are very high – near 50%.

Virtual Memory
• Use main memory as a “cache” for secondary (disk)
storage
• Managed jointly by CPU hardware and the operating
system (OS)
• Programs share main memory
• Each gets a private virtual address space holding its
frequently used code and data
• Protected from other programs
• CPU and OS translate virtual addresses to physical
addresses
• VM “block” is called a page
• VM translation “miss” is called a page fault

Address Translation
Fixed-size pages (e.g., 4K)
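
A minimal sketch of the translation step (assuming 4 KiB pages and a page table modelled as a Python dict from virtual page number to physical page number; the names and mappings are illustrative):

PAGE_SIZE = 4096                         # 4 KiB pages -> the low 12 bits are the offset

# Hypothetical page table: virtual page number -> physical page number.
page_table = {0: 5, 1: 9, 2: 3}

def translate(vaddr):
    vpn, offset = divmod(vaddr, PAGE_SIZE)  # split page number from page offset
    if vpn not in page_table:
        raise RuntimeError("page fault")    # the OS would fetch the page from disk
    return page_table[vpn] * PAGE_SIZE + offset

print(hex(translate(0x1ABC)))            # VPN 1 maps to PPN 9 -> 0x9abc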

Page Fault Penalty
• On page fault, the page must be fetched from disk
• Takes millions of clock cycles
• Handled by OS code
• Try to minimize page fault rate
• Fully associative placement
• Smart replacement algorithms

Page Tables
• Stores placement information
• Array of page table entries, indexed by virtual page
number
• Page table register in CPU points to page table in physical
memory
• If page is present in memory
• PTE stores the physical page number
• Plus other status bits (referenced, dirty, …)
• If page is not present
• PTE can refer to location in swap space on disk

Replacement and Writes
• To reduce page fault rate, prefer least-recently used
(LRU) replacement
• Reference bit (aka use bit) in PTE set to 1 on access to
page
• Periodically cleared to 0 by OS
• A page with reference bit = 0 has not been used recently
• Disk writes take millions of cycles
• Block at once, not individual locations
• Write through is impractical
• Use write-back
• Dirty bit in PTE set when page is written

48
Fast Translation Using a TLB
• Address translation would appear to require
extra memory references
• One to access the PTE
• Then the actual memory access
• But access to page tables has good locality
• So, use a fast cache of PTEs within the CPU
• Called a Translation Look-aside Buffer (TLB)
• Typical: 16–512 PTEs, 0.5–1 cycle for hit, 10–100 cycles
for miss, 0.01%–1% miss rate
• Misses could be handled by hardware or software
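
Continuing the sketch from the address-translation slide (reusing PAGE_SIZE and page_table from there), a TLB is just a small cache consulted before the page table:

tlb = {}                                 # tiny cache of recent VPN -> PPN translations

def translate_with_tlb(vaddr):
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn in tlb:                       # TLB hit: no page-table access needed
        ppn = tlb[vpn]
    else:                                # TLB miss: walk the page table, refill the TLB
        ppn = page_table[vpn]            # a KeyError here is the page-fault path
        tlb[vpn] = ppn                   # a real TLB must also evict entries when full
    return ppn * PAGE_SIZE + offset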

TLB Misses
• If page is in memory
• Load the PTE from memory and retry
• Could be handled in hardware
• Can get complex for more complicated page table
structures
• Or in software
• Raise a special exception, with optimized handler
• If page is not in memory (page fault)
• OS handles fetching the page and updating the page
table
• Then restart the faulting instruction

Important practice questions…


Quiz-like Question 1
Simulate the performance of a cache on the following (hex) address loads:
40 48 4c 40 50 58 5c 40 60 48 4c 44 40 60 58 5c
The cache is direct-mapped and stores 4 words. It uses a FIFO eviction policy.

For more details, please check the example cache configurations in the FINAL EXAM practice questions in the PRACTICE MODULE on eClass.
Quiz-like Question 2
Simulate the performance of a cache on the following (hex) address loads.
40 48 4c 40 50 58 5c 40 60 48 4c 44 40 60 58 5c
The cache is direct-mapped and stores 2 blocks of two words. It uses a FIFO
eviction policy.

Quiz-like Question 3
Simulate the performance of a cache on the following (hex) address loads.
40 48 4c 40 50 58 5c 40 60 48 4c 44 40 60 58 5c
The cache is 2-way set associative and stores 4 words. It uses a FIFO
eviction policy.

Quiz-like Question 4
Simulate the performance of a cache on the following (hex) address loads.
40 48 4c 40 50 58 5c 40 60 48 4c 44 40 60 58 5c
The cache is 2-way set associative and stores 4 words. It uses an LRU
eviction policy.

Quiz-like Question 5
Simulate the performance of a cache on the following (hex) address loads.
40 48 4c 40 50 58 5c 40 60 48 4c 44 40 60 58 5c
The cache is fully associative and stores 4 words. It uses a FIFO eviction
policy.
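
To check your hand simulations for Questions 1 to 5, the earlier sketches generalize to one configurable simulator. A sketch with assumptions: it treats "stores 4 words" as four one-word (4-byte) blocks, and Question 2's cache as two 8-byte blocks; adjust the parameters if your reading of a configuration differs.

# Generic cache simulator: any associativity, FIFO or LRU eviction.
def simulate(addresses, num_blocks, ways, block_bytes, policy="fifo"):
    num_sets = num_blocks // ways
    sets = [[] for _ in range(num_sets)]   # per set: tags in eviction order
    hits = 0
    for addr in addresses:
        block = addr // block_bytes
        index, tag = block % num_sets, block // num_sets
        s = sets[index]
        if tag in s:
            hits += 1
            if policy == "lru":            # LRU refreshes on a hit; FIFO does not
                s.remove(tag)
                s.append(tag)
        else:
            if len(s) == ways:
                s.pop(0)                   # evict front: oldest (FIFO) or LRU
            s.append(tag)
    return hits, len(addresses) - hits     # (hits, misses)

trace = [0x40, 0x48, 0x4C, 0x40, 0x50, 0x58, 0x5C, 0x40,
         0x60, 0x48, 0x4C, 0x44, 0x40, 0x60, 0x58, 0x5C]
print(simulate(trace, num_blocks=4, ways=1, block_bytes=4))                # Question 1
print(simulate(trace, num_blocks=4, ways=2, block_bytes=4, policy="lru"))  # Question 4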

Coming Up
Don’t Forget!

• Practice problems
• #5.7*, 5.8, 5.10*, 5.11*, 5.12*
• #1.12*, 1.13*, 1.14*, 1.15*
• Final exam review (based on ch. 1, 2, 3, 4, 5) is available.

All the best for Exams!
Thank you