CA_Lecture_08
Systems
Lecture 8: Memory and Caches
Andrei Tatarnikov
[email protected]
@andrewt0301
Processor-Memory
Performance Gap
Computer performance depends on:
Processor performance
Memory performance
2
Memory Challenge
Make memory appear as fast as processor
Ideal memory:
Fast
Cheap (inexpensive)
Large (capacity)
(Figure: example memory configurations for a laptop or desktop and for a server)
7
How Does It Work?
Block (aka line): unit of copying
May be multiple words
10
Memory Performance
Hit: data found in that level of memory hierarchy
Miss: data not found (must go to next level)
Hit Rate = # hits / # memory accesses = 1 – Miss Rate
Miss Rate = # misses / # memory accesses = 1 – Hit Rate
Average memory access time (AMAT): average time for
processor to access data
AMAT = t_cache + MR_cache × (t_MM + MR_MM × t_VM)
where t_x is the access time and MR_x the miss rate of level x: cache, main memory (MM), and virtual memory (VM)
11
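To make the formula concrete, here is a minimal sketch in C that evaluates AMAT; the latency and miss-rate values are illustrative assumptions, not figures from this lecture.

#include <stdio.h>

int main(void) {
    /* Illustrative values only (not taken from the lecture). */
    double t_cache  = 1.0;     /* cache hit time, in cycles               */
    double t_mm     = 100.0;   /* main-memory access time, in cycles      */
    double t_vm     = 1.0e6;   /* virtual-memory (disk) access, in cycles */
    double mr_cache = 0.05;    /* cache miss rate                         */
    double mr_mm    = 0.001;   /* main-memory miss rate (page faults)     */

    /* AMAT = t_cache + MR_cache * (t_MM + MR_MM * t_VM) */
    double amat = t_cache + mr_cache * (t_mm + mr_mm * t_vm);
    printf("AMAT = %.1f cycles\n", amat);   /* 1 + 0.05 * (100 + 1000) = 56 cycles */
    return 0;
}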
Cache Memory
Cache memory: the level of the memory hierarchy closest to the CPU
Given accesses X1, …, Xn–1, Xn, how do we know whether the requested data is in the cache, and where do we look for it?
12
Direct Mapped Cache
Location determined by address
Direct mapped: only one choice
(Block address) modulo (#Blocks in cache)
#Blocks is a power of 2
Use low-order address bits as the index
(Figure: main memory blocks mapping onto cache locations)
13
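As a sketch of this address breakdown, assuming a direct-mapped cache with 8 one-word blocks (the constant and function names are made up for illustration):

#include <stdio.h>

#define NUM_BLOCKS 8u   /* must be a power of 2 */

/* Cache index = (block address) mod #blocks = the low-order bits. */
static unsigned cache_index(unsigned block_addr) {
    return block_addr & (NUM_BLOCKS - 1u);
}

/* Tag = the remaining high-order bits of the block address. */
static unsigned cache_tag(unsigned block_addr) {
    return block_addr / NUM_BLOCKS;
}

int main(void) {
    unsigned block = 13;   /* arbitrary example block address */
    printf("block %u -> index %u, tag %u\n",
           block, cache_index(block), cache_tag(block));
    return 0;
}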
Tags and Valid Bits
How do we know which particular block is stored in a
cache location?
Store block address as well as the data
Actually, only need the high-order bits
Called the tag
What if there is no data in a location?
Valid bit: 1 = present, 0 = not present
Initially 0
14
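A minimal sketch of how one direct-mapped cache entry could be modelled, with a valid bit and a tag; the struct layout and names are assumptions for illustration:

#include <stdbool.h>
#include <stdint.h>

#define NUM_BLOCKS 8

struct cache_line {
    bool     valid;   /* 0 = nothing stored in this location yet */
    uint32_t tag;     /* high-order bits of the block address    */
    uint32_t data;    /* one word of cached data                 */
};

static struct cache_line cache[NUM_BLOCKS];   /* valid bits start out 0 */

/* Returns true on a hit and stores the cached word into *word. */
bool cache_lookup(uint32_t block_addr, uint32_t *word) {
    uint32_t index = block_addr % NUM_BLOCKS;
    uint32_t tag   = block_addr / NUM_BLOCKS;
    if (cache[index].valid && cache[index].tag == tag) {
        *word = cache[index].data;
        return true;    /* hit */
    }
    return false;       /* miss: fetch from the next level and fill the entry */
}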
Direct Mapped Cache Example
8 blocks, 1 word/block, direct mapped
Initial state: all valid bits are 0 (cache is empty)
15
Direct Mapped Cache Example
Word addr | Binary addr | Hit/miss | Cache block
22        | 10 110      | miss     | 110
22
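To spell out the mapping in the row above: word address 22 = 10110 in binary; with 8 blocks, the cache index is 22 mod 8 = 6 = 110 (the three low-order bits) and the tag is 22 / 8 = 2 = 10 (the remaining high-order bits). Since the entry at index 110 is initially invalid, the access is a miss and block 110 is filled.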
Spectrum of Associativity
For a cache with 8 entries
23
Associativity Example
Compare 4-block caches
Direct mapped, 2-way set associative, fully associative
Block access sequence: 0, 8, 0, 6, 8
Direct mapped
Block addr | Cache index | Hit/miss | Cache content after access (blocks 0–3)
0          | 0           | miss     | [0] Mem[0]
8          | 0           | miss     | [0] Mem[8]
0          | 0           | miss     | [0] Mem[0]
6          | 2           | miss     | [0] Mem[0], [2] Mem[6]
8          | 0           | miss     | [0] Mem[8], [2] Mem[6]
24
Associativity Example
2-way set associative
Block addr | Set index | Hit/miss | Set 0 content (2 ways)
0          | 0         | miss     | Mem[0]
8          | 0         | miss     | Mem[0], Mem[8]
0          | 0         | hit      | Mem[0], Mem[8]
6          | 0         | miss     | Mem[0], Mem[6]   (Mem[8] replaced)
8          | 0         | miss     | Mem[8], Mem[6]   (Mem[0] replaced)
Set 1 remains empty: blocks 0, 6, and 8 all map to set 0.
Fully associative
Block addr | Hit/miss | Cache content after access (any of 4 entries)
0          | miss     | Mem[0]
8          | miss     | Mem[0], Mem[8]
0          | hit      | Mem[0], Mem[8]
6          | miss     | Mem[0], Mem[8], Mem[6]
8          | hit      | Mem[0], Mem[8], Mem[6]
25
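The traces above can be reproduced with a small simulation; this sketch models only the 4-block direct-mapped case (the variable names are illustrative), and changing the index function would give the other two organizations:

#include <stdbool.h>
#include <stdio.h>

#define BLOCKS 4

int main(void) {
    int  tags[BLOCKS];
    bool valid[BLOCKS] = { false };
    int  seq[] = { 0, 8, 0, 6, 8 };   /* block access sequence from the slide */
    int  hits = 0;

    for (int i = 0; i < 5; i++) {
        int block = seq[i];
        int index = block % BLOCKS;   /* direct mapped: only one possible location */
        int tag   = block / BLOCKS;
        if (valid[index] && tags[index] == tag) {
            hits++;
            printf("%d: hit  (index %d)\n", block, index);
        } else {
            valid[index] = true;      /* fill the entry on a miss */
            tags[index]  = tag;
            printf("%d: miss (index %d)\n", block, index);
        }
    }
    printf("%d hits out of 5 accesses\n", hits);   /* 0 hits, as in the table */
    return 0;
}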
How Much Associativity
Increased associativity decreases miss rate
But with diminishing returns
Simulation of a system with a 64 KB D-cache, 16-word blocks, SPEC2000
1-way: 10.3%
2-way: 8.6%
4-way: 8.3%
8-way: 8.1%
26
Replacement Policy
Direct mapped
No choice
Set associative
Prefer non-valid entry, if there is one
Otherwise, choose among entries in the set
Least-recently used (LRU)
Choose the one unused for the longest time
Simple for 2-way, manageable for 4-way, too hard beyond that
Random
Gives approximately the same performance as LRU for high associativity
27
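For a 2-way set-associative cache, LRU can be tracked with a single bit per set, as in this sketch; the data structures and names are assumptions made for illustration:

#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS 4
#define WAYS     2

struct way { bool valid; uint32_t tag; };

static struct way    sets[NUM_SETS][WAYS];
static unsigned char lru[NUM_SETS];   /* index of the least-recently used way */

/* Returns true on a hit; on a miss, fills a non-valid way if there is one,
   otherwise evicts the LRU way. */
bool access_block(uint32_t block_addr) {
    uint32_t set = block_addr % NUM_SETS;
    uint32_t tag = block_addr / NUM_SETS;

    for (int w = 0; w < WAYS; w++) {
        if (sets[set][w].valid && sets[set][w].tag == tag) {
            lru[set] = (unsigned char)(1 - w);   /* the other way is now LRU */
            return true;                         /* hit */
        }
    }
    /* Miss: prefer a non-valid entry, otherwise replace the LRU way. */
    int victim = lru[set];
    for (int w = 0; w < WAYS; w++) {
        if (!sets[set][w].valid) { victim = w; break; }
    }
    sets[set][victim].valid = true;
    sets[set][victim].tag   = tag;
    lru[set] = (unsigned char)(1 - victim);
    return false;                                /* miss */
}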
Write-Through
On data-write hit, could just update the block in cache
But then cache and memory would be inconsistent
Write through: also update memory
But makes writes take longer
e.g., if base CPI = 1, 10% of instructions are stores, write
to memory takes 100 cycles
Effective CPI = 1 + 0.1×100 = 11
Solution: write buffer
Holds data waiting to be written to memory
CPU continues immediately
Only stalls on write if write buffer is already full
28
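The effective-CPI arithmetic from this slide, written out as a small sketch (the 10% store fraction and 100-cycle penalty are the slide's example figures):

#include <stdio.h>

int main(void) {
    double base_cpi       = 1.0;
    double store_fraction = 0.10;    /* 10% of instructions are stores       */
    double write_penalty  = 100.0;   /* cycles for a write-through to memory */

    /* Without a write buffer, every store stalls for the full write. */
    double effective_cpi = base_cpi + store_fraction * write_penalty;
    printf("Effective CPI = %.1f\n", effective_cpi);   /* 1 + 0.1 * 100 = 11 */
    return 0;
}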
Write-Back
Alternative: On data-write hit, just update the block in
cache
Keep track of whether each block is dirty
When a dirty block is replaced, write its contents back to memory
34
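A sketch of the dirty-bit bookkeeping a write-back cache could use on a write; the structure and helper names are assumptions for illustration:

#include <stdbool.h>
#include <stdint.h>

#define NUM_BLOCKS 8

struct wb_line {
    bool     valid;
    bool     dirty;   /* set when the cached copy differs from memory */
    uint32_t tag;
    uint32_t data;
};

static struct wb_line cache[NUM_BLOCKS];
static uint32_t main_memory[1024];   /* stand-in for the next level */

static void     memory_write(uint32_t block_addr, uint32_t data) { main_memory[block_addr] = data; }
static uint32_t memory_read(uint32_t block_addr)                 { return main_memory[block_addr]; }

/* On a write, update only the cache and mark the line dirty;
   a dirty victim is written back to memory when it is replaced. */
void cache_write(uint32_t block_addr, uint32_t value) {
    uint32_t index = block_addr % NUM_BLOCKS;
    uint32_t tag   = block_addr / NUM_BLOCKS;

    if (!(cache[index].valid && cache[index].tag == tag)) {      /* write miss */
        if (cache[index].valid && cache[index].dirty) {
            uint32_t victim_addr = cache[index].tag * NUM_BLOCKS + index;
            memory_write(victim_addr, cache[index].data);        /* write back */
        }
        cache[index].valid = true;
        cache[index].tag   = tag;
        cache[index].data  = memory_read(block_addr);            /* allocate on miss */
    }
    cache[index].data  = value;
    cache[index].dirty = true;   /* remember: must be written back on eviction */
}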
Overall Performance Summary
As CPU performance increases, the miss penalty becomes more significant
Decreasing base CPI: a greater proportion of time is spent on memory stalls
Increasing clock rate: memory stalls account for more CPU cycles
Can't neglect cache behavior when evaluating system performance
35
Example: How Caches Affect Performance
Matrix Multiplication
Loop order: i, j, k
for (int i = 0; i < n; i++) {
    for (int j = 0; j < n; j++) {
        for (int k = 0; k < n; k++) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}
Running time: 13.714264 sec.   Performance: ~153 MFLOPS

Loop order: i, k, j
for (int i = 0; i < n; i++) {
    for (int k = 0; k < n; k++) {
        for (int j = 0; j < n; j++) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}
Running time: 2.739385 sec.   Performance: ~795 MFLOPS

Loop order: j, k, i
for (int j = 0; j < n; j++) {
    for (int k = 0; k < n; k++) {
        for (int i = 0; i < n; i++) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}
Running time: 19.074106 sec.   Performance: ~113 MFLOPS
36
Memory Access Patterns
(Figure: access patterns of A, B, and C for each loop order: i, j, k; i, k, j; j, k, i)
With row-major storage, the i, k, j order is fastest: the innermost loop walks B and C along rows, giving sequential, cache-friendly accesses, while A[i][k] stays fixed. In the i, j, k order the innermost loop strides down a column of B, and in the j, k, i order it strides down columns of both A and C, so most accesses miss in the cache.
37
Any Questions?
.text
         # Computes gcd(0x18, 0x21) = gcd(24, 33) by repeated subtraction.
__start: addi t1, zero, 0x18      # t1 = 24
         addi t2, zero, 0x21      # t2 = 33
cycle:   beq  t1, t2, done        # when t1 == t2, the GCD is found
         slt  t0, t1, t2          # t0 = 1 if t1 < t2
         bne  t0, zero, if_less   # branch when t1 < t2
         nop
         sub  t1, t1, t2          # t1 >= t2: t1 = t1 - t2
         j    cycle
         nop
if_less: sub  t2, t2, t1          # t1 < t2: t2 = t2 - t1
         j    cycle
done:    add  t3, t1, zero        # result: t3 = gcd(24, 33) = 3
38