
Computer Architecture and Operating Systems
Lecture 8: Memory and Caches

Andrei Tatarnikov
[email protected]
@andrewt0301
Processor-Memory Performance Gap
 Computer performance depends on:
 Processor performance
 Memory performance
Memory Challenge
 Make memory appear as fast as the processor
 Ideal memory:
 Fast
 Cheap (inexpensive)
 Large (capacity)
 But we can choose only two!
Memory Technology
 Static RAM (SRAM)
 0.5 – 2.5 ns, $500 – $1000 per GB
 Dynamic RAM (DRAM)
 50 – 70 ns, $10 – $20 per GB
 Flash Memory
 5 000 – 50 000 ns, $0.75 – $1.00 per GB
 Magnetic Disk
 5 000 000 – 20 000 000 ns, $0.05 – $0.10 per GB
 Ideal Memory
 Access time of SRAM
 Capacity and cost/GB of disk
Locality
There is no need for a large memory to be fast: just exploit locality
 Temporal Locality:
 Locality in time
 If data was used recently, it is likely to be used again soon
 How to exploit: keep recently accessed data in higher levels of the memory hierarchy
 Spatial Locality:
 Locality in space
 If data was used recently, nearby data is likely to be used soon
 How to exploit: when accessing data, bring nearby data into higher levels of the memory hierarchy too (see the C sketch below)
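
A minimal C sketch (not from the slides) of both kinds of locality: the sum variable is reused on every iteration (temporal locality), and row-major traversal touches consecutive addresses (spatial locality).

    #include <stdio.h>

    #define N 1024
    static double a[N][N];

    int main(void) {
        double sum = 0.0;                /* reused every iteration: temporal locality */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)  /* row-major order visits consecutive addresses: spatial locality */
                sum += a[i][j];
        printf("sum = %f\n", sum);
        return 0;
    }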
Taking Advantage of Locality
 Memory hierarchy
 Store everything on disk
 Copy recently accessed (and nearby) items from disk to smaller DRAM memory
 Main memory
 Copy more recently accessed (and nearby) items from DRAM to smaller SRAM memory
 Cache memory attached to CPU
Memory Hierarchy
 [Figure: memory hierarchies of a personal mobile device, a laptop or desktop, and a server]
How Does It Work?
 Block (aka line): unit of copying
 May be multiple words
 If accessed data is present in the upper level
 Hit: access satisfied by the upper level
 Hit ratio: hits/accesses
 If accessed data is absent
 Miss: block copied from the lower level
 Time taken: miss penalty
 Miss ratio: misses/accesses = 1 – hit ratio
 Then accessed data is supplied from the upper level
 [Figure: Processor, L1, L2, Memory levels of the hierarchy]
Hits and Misses
 On cache hit, CPU proceeds normally
 On cache miss
 Stall the CPU pipeline
 Fetch block from next level of hierarchy
 Instruction cache miss
 Restart instruction fetch
 Data cache miss
 Complete data access
Miss Types
 Compulsory: the first time the data is accessed
 Capacity: the cache is too small to hold all the data of interest
 Conflict: the data of interest maps to the same cache location as other data
Memory Performance
 Hit: data found in that level of the memory hierarchy
 Miss: data not found (must go to the next level)
 Hit Rate = # hits / # memory accesses = 1 – Miss Rate
 Miss Rate = # misses / # memory accesses = 1 – Hit Rate
 Average memory access time (AMAT): average time for the processor to access data
 AMAT = t_cache + MR_cache × (t_MM + MR_MM × t_VM)
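
As a quick illustration with hypothetical numbers (not from the slides): with t_cache = 1 ns, MR_cache = 10%, t_MM = 100 ns, MR_MM = 1%, and t_VM = 10 ms, AMAT = 1 + 0.1 × (100 + 0.01 × 10 000 000) ≈ 10 011 ns, so even a rare miss to a very slow level can dominate the average.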
Cache Memory
 Cache memory
 The level of the memory hierarchy closest to the CPU
 Given accesses X1, …, Xn–1, Xn
 How do we know if the data is present?
 Where do we look?
Direct Mapped Cache
 Location determined by address
 Direct mapped: only one choice
 (Block address) modulo (#Blocks in cache)
 #Blocks is a power of 2
 Use low-order address bits
 [Figure: memory blocks mapped onto cache locations]
Tags and Valid Bits
 How do we know which particular block is stored in a cache location?
 Store the block address as well as the data
 Actually, only need the high-order bits
 Called the tag (see the sketch below)
 What if there is no data in a location?
 Valid bit: 1 = present, 0 = not present
 Initially 0
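
A minimal C sketch of how an address is split into tag, index, and offset; the geometry here (64 blocks of 4 bytes) is a hypothetical example, not a configuration from the lecture.

    #include <stdint.h>
    #include <stdio.h>

    #define OFFSET_BITS 2   /* 4-byte blocks (hypothetical) */
    #define INDEX_BITS  6   /* 64 blocks (hypothetical) */

    int main(void) {
        uint32_t addr   = 0x000012B4;
        uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);                 /* byte within the block */
        uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1); /* low-order bits select the cache line */
        uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);               /* high-order bits stored as the tag */
        printf("addr=0x%08x tag=0x%x index=%u offset=%u\n", addr, tag, index, offset);
        return 0;
    }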
Direct Mapped Cache Example
 8 blocks, 1 word/block, direct mapped
 Initial state

Index V Tag Data
000 N
001 N
010 N
011 N
100 N
101 N
110 N
111 N
Direct Mapped Cache Example
Word addr Binary addr Hit/miss Cache block
22 10 110 Miss 110

Index V Tag Data
000 N
001 N
010 N
011 N
100 N
101 N
110 Y 10 Mem[10110]
111 N
Direct Mapped Cache Example
Word addr Binary addr Hit/miss Cache block
26 11 010 Miss 010

Index V Tag Data
000 N
001 N
010 Y 11 Mem[11010]
011 N
100 N
101 N
110 Y 10 Mem[10110]
111 N
Direct Mapped Cache Example
Word addr Binary addr Hit/miss Cache block
22 10 110 Hit 110
26 11 010 Hit 010

Index V Tag Data
000 N
001 N
010 Y 11 Mem[11010]
011 N
100 N
101 N
110 Y 10 Mem[10110]
111 N
Direct Mapped Cache Example
Word addr Binary addr Hit/miss Cache block
16 10 000 Miss 000
3 00 011 Miss 011
16 10 000 Hit 000

Index V Tag Data
000 Y 10 Mem[10000]
001 N
010 Y 11 Mem[11010]
011 Y 00 Mem[00011]
100 N
101 N
110 Y 10 Mem[10110]
111 N
Direct Mapped Cache Example
Word addr Binary addr Hit/miss Cache block
18 10 010 Miss 010

Index V Tag Data
000 Y 10 Mem[10000]
001 N
010 Y 10 Mem[10010]
011 Y 00 Mem[00011]
100 N
101 N
110 Y 10 Mem[10110]
111 N
Associative Caches
 Fully associative
 Allow a given block to go in any cache entry
 Requires all entries to be searched at once
 Comparator per entry (expensive)
 n-way set associative
 Each set contains n entries
 Block number determines which set
 (Block number) modulo (#Sets in cache)
 Search all entries in a given set at once
 n comparators (less expensive)
Associative Cache Examples
 [Figure: examples of associative cache organizations]
Spectrum of Associativity
 For a cache with 8 entries
 [Figure: associativity configurations for an 8-entry cache]
Associativity Example
 Compare 4-block caches
 Direct mapped, 2-way set associative, fully associative
 Block access sequence: 0, 8, 0, 6, 8
 Direct mapped
Block address Cache index Hit/miss Cache content after access (indices 0–3)
0 0 miss Mem[0]
8 0 miss Mem[8]
0 0 miss Mem[0]
6 2 miss Mem[0] Mem[6]
8 0 miss Mem[8] Mem[6]
Associativity Example
 2-way set associative (blocks 0, 6, and 8 all map to set 0, so set 1 stays empty)
Block address Cache index Hit/miss Cache content after access (Set 0 | Set 1)
0 0 miss Mem[0]
8 0 miss Mem[0] Mem[8]
0 0 hit Mem[0] Mem[8]
6 0 miss Mem[0] Mem[6]
8 0 miss Mem[8] Mem[6]
 Fully associative
Block address Hit/miss Cache content after access
0 miss Mem[0]
8 miss Mem[0] Mem[8]
0 hit Mem[0] Mem[8]
6 miss Mem[0] Mem[8] Mem[6]
8 hit Mem[0] Mem[8] Mem[6]
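
The direct mapped behavior above can be replayed with a minimal C sketch (an illustration, not the lecture's code): a 4-block direct mapped cache on the sequence 0, 8, 0, 6, 8 reports the same five misses, because blocks 0 and 8 keep evicting each other from index 0.

    #include <stdio.h>

    #define BLOCKS 4

    int main(void) {
        int tag[BLOCKS], valid[BLOCKS] = {0};
        int seq[] = {0, 8, 0, 6, 8};
        int hits = 0;

        for (int i = 0; i < 5; i++) {
            int block = seq[i];
            int index = block % BLOCKS;    /* (block address) modulo (#blocks in cache) */
            if (valid[index] && tag[index] == block) {
                hits++;
                printf("block %d -> index %d: hit\n", block, index);
            } else {                       /* miss: fetch the block, replacing the old one */
                valid[index] = 1;
                tag[index] = block;
                printf("block %d -> index %d: miss\n", block, index);
            }
        }
        printf("%d hits, %d misses\n", hits, 5 - hits);
        return 0;
    }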
How Much Associativity?
 Increased associativity decreases the miss rate
 But with diminishing returns
 Simulation of a system with a 64KB D-cache, 16-word blocks, SPEC2000
 1-way: 10.3%
 2-way: 8.6%
 4-way: 8.3%
 8-way: 8.1%
Replacement Policy
 Direct mapped
 No choice
 Set associative
 Prefer non-valid entry, if there is one
 Otherwise, choose among entries in the set
 Least-recently used (LRU)
 Choose the one unused for the longest time
 Simple for 2-way (see the sketch below), manageable for 4-way, too hard beyond that
 Random
 Gives approximately the same performance as LRU for high associativity
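
A minimal sketch of why LRU is simple for 2-way (an assumed design, not from the slides): a single bit per set is enough, because whenever one way is touched, the other way becomes the least recently used.

    #include <stdio.h>

    #define SETS 4

    static int lru[SETS];    /* lru[s] = the way in set s to evict next */

    static void touch(int set, int way) { lru[set] = 1 - way; }  /* the other way is now LRU */
    static int  victim(int set)         { return lru[set]; }

    int main(void) {
        touch(0, 0);         /* way 0 of set 0 is used */
        touch(0, 1);         /* then way 1 is used */
        printf("evict way %d of set 0\n", victim(0));  /* prints way 0 */
        return 0;
    }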
Write-Through
 On data-write hit, could just update the block in cache
 But then cache and memory would be inconsistent
 Write through: also update memory
 But it makes writes take longer
 e.g., if base CPI = 1, 10% of instructions are stores, and a write to memory takes 100 cycles
 Effective CPI = 1 + 0.1 × 100 = 11
 Solution: write buffer (sketched below)
 Holds data waiting to be written to memory
 CPU continues immediately
 Only stalls on a write if the write buffer is already full
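
A minimal C sketch of the write buffer idea (a hypothetical design, not the lecture's): stores go into a small FIFO, the CPU continues immediately, and it stalls only when the buffer is full.

    #include <stdio.h>

    #define DEPTH 4

    static unsigned buf[DEPTH];
    static int head, count;

    static int buffer_store(unsigned data) {  /* CPU side: enqueue a pending write */
        if (count == DEPTH) return 0;         /* buffer full: the CPU must stall */
        buf[(head + count++) % DEPTH] = data; /* buffered: the CPU continues */
        return 1;
    }

    static int buffer_drain(unsigned *data) { /* memory side: retire the oldest write */
        if (count == 0) return 0;
        *data = buf[head];
        head = (head + 1) % DEPTH;
        count--;
        return 1;
    }

    int main(void) {
        for (unsigned i = 0; i < 6; i++)
            printf("store %u: %s\n", i, buffer_store(i) ? "buffered" : "stall");
        for (unsigned d; buffer_drain(&d); )
            printf("wrote %u to memory\n", d);
        return 0;
    }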
Write-Back
 Alternative: on a data-write hit, just update the block in the cache
 Keep track of whether each block is dirty
 When a dirty block is replaced
 Write it back to memory
 Can use a write buffer to allow the replacing block to be read first
Write Allocation
 What should happen on a write miss?
 Alternatives for write-through
 Allocate on miss: fetch the block
 Write around: don’t fetch the block
 Since programs often write a whole block before reading it (e.g., initialization)
 For write-back
 Usually fetch the block
Multilevel Caches
 Primary cache attached to the CPU
 Small, but fast
 Level-2 cache services misses from the primary cache
 Larger, slower, but still faster than main memory
 Main memory services L2 cache misses
 Some high-end systems include an L3 cache
Measuring Cache Performance
 Components of CPU time
 Program execution cycles
 Includes cache hit time
 Memory stall cycles
 Mainly from cache misses
 With simplifying assumptions:
Memory Stall Cycles = (Memory Accesses / Program) × Miss Rate × Miss Penalty
                    = (Instructions / Program) × (Misses / Instruction) × Miss Penalty
Cache Performance Example
 Given
 I-cache miss rate = 2%
 D-cache miss rate = 4%
 Miss penalty = 100 cycles
 Base CPI (ideal cache) = 2
 Loads & stores are 36% of instructions
 Miss cycles per instruction
 I-cache: 0.02 × 100 = 2
 D-cache: 0.36 × 0.04 × 100 = 1.44
 Actual CPI = 2 + 2 + 1.44 = 5.44
 The CPU with an ideal cache is 5.44/2 = 2.72 times faster
Average Access Time
 Hit time is also important for performance
 Average memory access time (AMAT)
 AMAT = Hit time + Miss rate × Miss penalty
 Example
 CPU with a 1 ns clock, hit time = 1 cycle, miss penalty = 20 cycles, I-cache miss rate = 5%
 AMAT = 1 + 0.05 × 20 = 2 ns
 2 cycles per instruction
Overall Performance Summary
 As CPU performance increases
 The miss penalty becomes more significant
 Decreasing base CPI
 A greater proportion of time is spent on memory stalls
 Increasing clock rate
 Memory stalls account for more CPU cycles
 Can't neglect cache behavior when evaluating system performance
Example: How Caches Affect Performance
Matrix Multiplication

Loop order: i, j, k
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            for (int k = 0; k < n; k++) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
Running time: 13.714264 sec. Performance: ~153 MFLOPS

Loop order: i, k, j
    for (int i = 0; i < n; i++) {
        for (int k = 0; k < n; k++) {
            for (int j = 0; j < n; j++) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
Running time: 2.739385 sec. Performance: ~795 MFLOPS

Loop order: j, k, i
    for (int j = 0; j < n; j++) {
        for (int k = 0; k < n; k++) {
            for (int i = 0; i < n; i++) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
Running time: 19.074106 sec. Performance: ~113 MFLOPS
Memory Access Patterns
 [Figure: access patterns over matrices A, B, and C for loop orders i,j,k; i,k,j; and j,k,i]
 With the i,k,j order the innermost loop walks B and C along rows (stride-1, good spatial locality); with j,k,i it walks A and C down columns, so nearly every access misses
Any Questions?

.text
__start: addi t1, zero, 0x18     # t1 = 24
         addi t2, zero, 0x21     # t2 = 33
cycle:   beq  t1, t2, done       # loop until t1 == t2
         slt  t0, t1, t2         # t0 = (t1 < t2) ? 1 : 0
         bne  t0, zero, if_less  # if t1 < t2, subtract the other way
         nop
         sub  t1, t1, t2         # t1 >= t2: t1 = t1 - t2
         j    cycle
         nop
if_less: sub  t2, t2, t1         # t1 < t2: t2 = t2 - t1
         j    cycle
done:    add  t3, t1, zero       # t3 = gcd(24, 33) = 3, computed by repeated subtraction

