Supplemental Material on Cache, from ECE-341 Memory Systems
Memory System Building Blocks: Types of Memory
Read-Only Memory
• ROM: read-only memory, "factory" programmed; dense and cheap
• PROM: programmable read-only memory ("burn" once)
• EPROM: erasable PROM
• EAPROM: electrically alterable PROM (rewritable ~1000 times)
Read/Write Memory
• Physical characteristics: static vs. dynamic; volatile vs. nonvolatile; destructive vs. nondestructive read; removable vs. permanent
• Logical organization: addressed; associative; sequential access
Disk Access Example
• Assume:
  • 512 B sector
  • 15,000 rpm (revolutions per minute)
  • 4 ms average seek time
  • 100 MB/s transfer rate
  • 0.2 ms controller overhead
  • idle disk
• Average read time:
  4 ms seek time
  + ½ / (15,000/60) = 2 ms rotational latency
  + 512 B / 100 MB/s ≈ 0.005 ms transfer time
  + 0.2 ms controller delay
  = 6.2 ms
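To double-check the arithmetic, a minimal C sketch of the same calculation (variable names are illustrative, not from the slides):

    #include <stdio.h>

    int main(void) {
        double seek_ms   = 4.0;      /* average seek time            */
        double rpm       = 15000.0;  /* spindle speed                */
        double sector_b  = 512.0;    /* sector size in bytes         */
        double xfer_mbps = 100.0;    /* transfer rate, MB/s          */
        double ctrl_ms   = 0.2;      /* controller overhead          */

        double rot_ms  = 0.5 / (rpm / 60.0) * 1000.0;           /* half a revolution: 2 ms */
        double xfer_ms = sector_b / (xfer_mbps * 1e6) * 1000.0; /* ~0.005 ms               */

        printf("average read = %.3f ms\n", seek_ms + rot_ms + xfer_ms + ctrl_ms);
        return 0;  /* prints 6.205 ms, i.e. ~6.2 ms */
    }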
Disks Approximation
Read/write a cylinder of sectors as an atomic action.
[Figure: disk platters, tracks, and cylinders]
Non-volatile Storage
Disk:
• Rotating magnetic storage; not direct access
• Transfers data in sector chunks
• Slow: mechanical startup time, move head to track, wait for sector to rotate, read data
• Request queue managed by the operating system
Flash:
• More expensive than disk, but faster (100–1000x)
• Smaller, lower power, more robust than disk
• Less expensive and slower than DRAM
• Wears out after 1000s of accesses
Memory Access Time Differences
• SRAM   0.5 ns – 2.5 ns    ~500 ps            5×10² ps
• DRAM   50 ns – 70 ns      ~50,000 ps         5×10⁴ ps
• Disk   5 ms – 20 ms       ~5,000,000,000 ps  5×10⁹ ps
Memory System Objectives
• Provide lots of storage → make the memory as big as possible
• Make the system fast → build the system with the fastest memory
• Make the system cheap → use the cheapest memory, and as little of it as possible
These goals conflict; a memory hierarchy is how real systems balance them.
Memory Interleaving
• Split main memory into a number of physically separate components called modules or banks
[Figure: main memory split into banks 0–3]
Memory Interleaving: High- vs. Low-Order — Base-10 Example
• A 100-word memory using 10 banks (numbered 0–9)
• Low-order digit interleaving: the low digit of the address selects the bank; the high digit is the offset within the bank (a sketch follows below)
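A minimal C sketch of the two schemes for this base-10 example (function and constant names are assumptions for illustration):

    #include <stdio.h>

    #define BANKS 10   /* 100-word memory, 10 banks of 10 words */

    /* low-order interleaving: low digit picks the bank, high digit is the offset */
    void low_order(int addr, int *bank, int *offset) {
        *bank   = addr % BANKS;
        *offset = addr / BANKS;
    }

    /* high-order interleaving: high digit picks the bank, low digit is the offset */
    void high_order(int addr, int *bank, int *offset) {
        *bank   = addr / BANKS;
        *offset = addr % BANKS;
    }

    int main(void) {
        int b, o;
        low_order(47, &b, &o);   /* word 47 -> bank 7, offset 4 */
        printf("low-order:  bank %d, offset %d\n", b, o);
        high_order(47, &b, &o);  /* word 47 -> bank 4, offset 7 */
        printf("high-order: bank %d, offset %d\n", b, o);
        return 0;
    }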
Memory addresses
? 256M x 4 256M x 4 8
256M x 4 256M x 4
Sel Sel
Data Bus
01234567
Bank Select
3 high-order address bits 28 low-order address bits
(decoded)
Address Bus
15
2 GB Memory Using 16 256M-by-4-bit RAM Chips
• Where is byte 0? Where is byte 1? Where is byte 256M?
[Figure: eight banks (0–7), each a pair of 256M×4 chips forming one byte]
• If it takes 100 ps to fetch one byte, how long for two bytes (say bytes 0 and 1)?
Hierarchical Storage Organization
[Figure: storage pyramid — registers (which the CPU can reference directly), then L1, L2, and L3 caches, then main memory, then secondary storage; moving up the pyramid, access time decreases and access speed increases]
Capacity prefixes: 2¹⁰ kilobyte, 2²⁰ megabyte, 2³⁰ gigabyte, 2⁴⁰ terabyte, 2⁵⁰ petabyte
Memory Management
Simple storage analogy:
• 10,000 books at the library (main memory)
• Checkout maximum: 10 (cache)
• Which do you keep at home? The ones you will need most frequently in the future.
• Another example (see the sketch below):
  var A: array[1..500] of integer;          (A is a 1×500 vector)
  var B: array[1..500] of integer;          (B is a 1×500 vector)
  var C: array[1..500, 1..500] of integer;  (C is a 500×500 matrix)
  A = B*C;
• Hopefully:
  • B is stored in a cache (it is reused for every element of A)
  • The instructions for the loop are stored in a cache
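A C rendering of the loop this example implies (the loop structure is assumed; the slides only give the declarations), showing where the locality comes from:

    /* A (1x500) = B (1x500) * C (500x500) */
    #define N 500
    int A[N], B[N], C[N][N];

    void vec_mat(void) {
        for (int j = 0; j < N; j++) {    /* one element of A per iteration            */
            int sum = 0;
            for (int k = 0; k < N; k++)  /* B is re-read on every j: temporal locality */
                sum += B[k] * C[k][j];   /* C is walked down a column: long strides    */
            A[j] = sum;
        }
    }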
Principle of Locality: Non-uniform Access
• Temporal locality
  • a referenced item is likely to be referenced again in the near future
  • variables are reused:
      lw   $5, 147($8)
      add  $2, $5, $2
      sw   $2, 147($8)
• Spatial locality
  • data near a referenced item is likely to be referenced in the near future
  • moving through an array:
      top: addi $8, $8, 4
           lw   $5, 147($8)
           add  $5, $5, $2
           bne  $5, $9, top
[Figure: flowchart for a reference to M[x] — is M[x] in the cache? If yes, a cache "hit" services the access from the cache. If no, find a free slot (writing an occupied slot back to main memory if necessary) and move M[x] from main memory into the cache]
Memory Management: Caches
Significant spatial locality: bring in a block of consecutively addressed words.
• Terminology:
  • Given 2ⁿ words in main memory, group them into "blocks" of l words
  • There are M (= 2ⁿ/l) logical blocks
[Figure: memory blocks of l words each (e.g. a b c d) mapping into a cache of C lines numbered 0, 1, 2, …, i−2, i−1]
Note: for simplicity, we'll use word addresses and ignore possible byte-offset address bits.
Memory Management: Caches
M = (2ⁿ words of memory) / (l words per cache line)
Consider a computer with:
• a main memory of 64 kw and a cache of 1 kw
• a block size (l) of 8 words
[Figure: same block-to-line mapping as above]
Mapping Function: Main Memory Addressing
Example: A = 29 = 0000000000011101₂
• Ã = 29 div 8 = 3, so A is a word in the 4th block (like shifting right 3 bits)
• 29 mod 8 = 5, the 6th word in the block
The address therefore splits into [block #][word #].
Mapping Function: Cache Addressing
• M: number of main memory blocks
• l: number of words in a block/line
• C: number of cache lines
A 16-bit word address (bits 15 … 3 | 2 1 0) splits into Ã = memory block number (high 13 bits) and word offset (low 3 bits).
With a 1 kw cache and a block size of 8 w: 8192 memory blocks and 128 cache lines, so 8192/128 = 2⁶ memory blocks map to the same place!
Mapping Function: Cache Presence and Placement
• Tag bits: keep a tag that uniquely identifies which memory block currently occupies each cache line
• Valid bit: indicates that a cache line has a meaningful entry
[Figure: memory address split into tag | index | word offset (remember, we're still ignoring byte-offset address bits); the index selects a cache line, the line's stored tag is compared with the address tag, and (tag match AND valid) signals a hit that selects the requested word from the line's data]
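A C sketch of this presence check for the running example (1 kw cache, 8-word lines); the struct layout and names are illustrative, not a definitive implementation:

    #include <stdint.h>
    #include <stdbool.h>

    #define LINES       128   /* C: cache lines    */
    #define WORDS_PER   8     /* l: words per line */
    #define OFFSET_BITS 3     /* log2(l)           */
    #define INDEX_BITS  7     /* log2(C)           */

    typedef struct {
        bool     valid;
        uint32_t tag;
        uint32_t data[WORDS_PER];
    } line_t;

    static line_t cache[LINES];

    /* returns true on a hit and places the word in *out */
    bool cache_read(uint32_t addr, uint32_t *out) {
        uint32_t offset = addr & (WORDS_PER - 1);
        uint32_t index  = (addr >> OFFSET_BITS) & (LINES - 1);
        uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

        if (cache[index].valid && cache[index].tag == tag) {  /* hit = valid && tag match */
            *out = cache[index].data[offset];
            return true;
        }
        return false;  /* miss: fetch the block from memory, fill the line, set tag and valid */
    }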
Mapping Function: Direct Mapping
Main memory contents (5-bit word addresses):
  00000–00111: a b c d e f g h      10000–10111: q r s t u v w x
  01000–01111: i j k l m n o p      11000–11111: y z A B C D E F
Cache: 2 lines with 4-word blocks. Each address splits into tag (2 bits) | line index (1 bit) | offset (2 bits).
Consider the read sequence 2, 3, 22, 1, 20, 4:
  ref  binary  tag  line  offset  result  action
   2   00010   00    0     10     miss    load a b c d into line 0, tag 00
   3   00011   00    0     11     hit
  22   10110   10    1     10     miss    load u v w x into line 1, tag 10
   1   00001   00    0     01     hit
  20   10100   10    1     00     hit
   4   00100   00    1     00     miss    replace line 1 with e f g h, tag 00
Final cache state:
  line 0: valid 1, tag 00, data a b c d
  line 1: valid 1, tag 00, data e f g h
Example: 8-bit addresses, 8-word blocks, direct-mapped cache, 8 cache lines
Address split: tag (2 bits) | line index (3 bits) | offset (3 bits).
Cache tags by line — each line can only hold memory blocks whose middle (index) bits match its line number; the stored tag says which one it currently holds:
  line 000: tag 00  →  memory block 00 000 (addresses 00 000 x...x)
  line 001: tag 10  →  memory block 10 001 (addresses 10 001 x...x)
  line 010: tag 01  →  memory block 01 010 (addresses 01 010 x...x)
  line 100: tag 11  →  memory block 11 100 (addresses 11 100 x...x)
  line 101: tag 10  →  memory block 10 101 (addresses 10 101 x...x)
  line 110: tag 00  →  memory block 00 110 (addresses 00 110 x...x)
  (lines 011 and 111 empty)
Mapping Function: Direct Mapping
• The index bits select the line; hit = (stored tag == address tag) && valid
• Example weakness: loop L calls procedure P, and both map into the same cache line
  • L and P would alternately replace each other through the loop, providing no cache benefit
Example: 8-bit addresses, 8-word blocks, direct-mapped cache, 8 cache lines
Address 10 001 000 → tag 10, line (index) 001, offset 000.
[Figure: cache line 001 holds one refill line tagged 10, i.e. main memory block 10 001 (addresses 10 001 x...x)]
Example: 8-bit addresses, 8-word blocks, fully associative cache, 8 cache lines
Address 00000 000 → tag 00000, offset 000 (no index bits; the whole block number is the tag).
Fully associative cache: a block can be placed in any location in the cache.
[Figure: any main memory block can refill any cache line]
Mapping Functions: Associative Mapping
• Any block can be in any slot in the cache (no index bits)
• Check the tags in all cache lines to see if the required block is in the cache
• Tags are larger (the line/index bits are now part of the tag)
• Comparisons are done in parallel (sequential search would be too slow)
Example: 8-bit addresses, 8-word blocks, two-way set-associative cache
The 8 cache lines are organized as 4 sets of 2 lines each.
Address split: tag (3 bits) | set (2 bits — was line/index) | offset (3 bits); the set field selects a set, and the tags of both lines in that set are compared (a lookup sketch follows below).
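For comparison with the direct-mapped sketch above, the same lookup for this two-way set-associative organization (struct and names are again illustrative; real hardware compares both ways in parallel):

    #include <stdint.h>
    #include <stdbool.h>

    #define SETS 4   /* 8 lines / 2 ways */
    #define WAYS 2

    typedef struct { bool valid; uint32_t tag; uint32_t data[8]; } way_t;
    static way_t sets[SETS][WAYS];

    bool sa_read(uint32_t addr, uint32_t *out) {
        uint32_t offset = addr & 7;         /* 3 offset bits (8-word blocks) */
        uint32_t set    = (addr >> 3) & 3;  /* 2 set bits (4 sets)           */
        uint32_t tag    = addr >> 5;        /* remaining 3 bits are the tag  */

        for (int w = 0; w < WAYS; w++)      /* check both ways of the set    */
            if (sets[set][w].valid && sets[set][w].tag == tag) {
                *out = sets[set][w].data[offset];
                return true;
            }
        return false;  /* miss: choose a victim way (e.g. LRU) and refill */
    }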
Mapping Functions: Address Format Summary
[Figure: direct-mapped cache with lines 000–111 between main memory and the CPU]
• Memory address: [memory block number | word], block number in the high-order bits, word offset in the low-order bits
• Direct mapping: the block number splits further into [tag | cache line | word]
• Fully associative: the entire block number is the tag: [tag | word]
Cache Replacements and Writes
Replacement strategies: which line to evict on a miss.
Cache Consistency — Writes: potential problems
Multiple-CPU system using a shared memory and bus:
• Caches reduce memory/bus "bottlenecks"
• A CPU write to its cache leaves other caches with wrong/stale copies
• DMA may also write to memory that is cached
Possible solutions (a write-back sketch follows below):
• write-through
  • any cache write also updates shared memory (and other caches?)
• write-back
  • a write to a cache line also sets a line-update bit (dirty bit)
  • when a modified line is replaced, the entire line is copied back to shared memory
• write-once
  • only the first write to a line is write-through
  • caches watch bus transactions, invalidating copies of a modified line
  • when a modified line is replaced, the entire line is copied to memory
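A minimal C sketch of the write-back bookkeeping described above — the dirty bit and the copy-back on eviction (types and names are illustrative):

    #include <stdint.h>
    #include <stdbool.h>
    #include <string.h>

    typedef struct {
        bool     valid, dirty;   /* dirty = line modified since it was loaded */
        uint32_t tag;
        uint32_t data[8];
    } wb_line_t;

    /* write one word; the line is pushed to memory only later, on eviction */
    void wb_write(wb_line_t *line, uint32_t offset, uint32_t value) {
        line->data[offset] = value;
        line->dirty = true;                 /* set the line-update (dirty) bit */
    }

    /* on replacement: copy the whole line back only if it was modified */
    void wb_evict(wb_line_t *line, uint32_t *memory_block) {
        if (line->valid && line->dirty)
            memcpy(memory_block, line->data, sizeof line->data);
        line->valid = line->dirty = false;
    }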
Example System: DECStation 3100
• MIPS R2000 processor; pipeline similar to the textbook pipeline
• 32-bit words
• Separate data & instruction caches, each 64 KB (16K words)
• One-word blocks (lines); 16-bit tag, 14-bit index
• Write-through policy (writes to memory on every write)
• Performance:
  • CPI without cache misses: 1.2
  • 10-cycle penalty for a cache miss
Example: gcc program execution
• 11% of the instructions are writes; with one-word blocks and write-through, 11% of the time there is a 10-cycle penalty (1.1 cycles per instruction)
• Effective CPI = 1.2 + 1.1 = 2.3
• To limit the penalty, a 4-word-deep write buffer was added to the system; analysis indicated a buffer size of 4 is sufficient so that stalls are rare
Example System: Intrinsity FastMATH
• Embedded MIPS processor
• Performance
Performance: Byte Transfer Rates
Example assumptions: 1 cycle to send the address, 15 cycles per DRAM access, 1 cycle to send a word; cache block of 4 words.
• Regular memory: 1-word bus, 1-word-wide memory [CPU ↔ Mem, 1 word wide]
  Miss penalty = 1 + 4×15 + 4×1 = 65 cycles
Performance: Byte Transfer Rates (same assumptions)
• Wide memory: 2-word bus, 2-word-wide memory [CPU ↔ Mem, 2 words wide]
  Miss penalty = 1 + 2×15 + 2×1 = 33 cycles
Performance: Byte Transfer Rates (same assumptions)
• Interleaved memory: 4 banks of 1-word-wide memory [CPU ↔ 4 × Mem, 1-word bus]
  All four banks start their 15-cycle access in parallel, then send their words one at a time:
  Miss penalty = 1 + 15 + 4×1 = 20 cycles
Performance: System
• Write-through case:
  CPU time = (exec clock cycles + stall clock cycles) × clock cycle time
  stall clock cycles = (instructions / program) × miss rate × miss penalty
  write stall cycles = (writes / program) × write miss rate × write miss penalty + write buffer stalls
Performance: System Example
• Assume:
  • Instruction cache miss rate: 5%
  • Data cache miss rate: 10%
  • CPI without memory stalls: 4
  • Miss penalty: 12 cycles
  • Load/store mix: 33% of instructions
• What if the CPU is made faster (reduced CPI or increased clock rate), but the memory stays the same? (Worked below.)
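Working through these numbers (the arithmetic is implied by the assumptions, not shown on the slide):
  memory-stall cycles per instruction = 0.05 × 12 + (0.33 × 0.10) × 12 ≈ 0.6 + 0.4 = 1.0
  effective CPI = 4 + 1 = 5
If the CPU alone is made twice as fast (CPI cut to 2) while memory stays the same, CPI = 2 + 1 = 3: stalls grow from 1 of 5 cycles to 1 of 3 — exactly the Amdahl's Law point made two slides below.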
Performance: System Example
• Assume:
  • I-cache miss rate: 2%
  • D-cache miss rate: 4%
  • Miss penalty: 100 cycles
  • Base CPI (ideal cache): 2
  • Loads & stores: 36% of instructions
(Worked below.)
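Working through these numbers:
  miss cycles per instruction: I-cache = 0.02 × 100 = 2; D-cache = 0.36 × 0.04 × 100 = 1.44
  effective CPI = 2 + 2 + 1.44 = 5.44
so the machine runs at well under half its ideal-cache speed.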
Performance: Amdahl's Law
As one performance factor is improved, the others become more significant.
• Example: cut the non-stall CPI from 4 to 2, retaining 1 stall cycle per instruction
  • The original design stalled 1 of 5 cycles (20%); the new one stalls 1 of 3 cycles (33%)
  • Doubling the clock rate instead has the same effect (the miss penalty doubles in cycles)
• Improving both the CPU clock rate and the CPI is a double hit:
  • lower CPI: stalls have a more pronounced impact
  • higher clock rate: higher miss penalty (in cycles)
Performance: Average Access Time
• Hit time is also important for performance
• Average memory access time (AMAT) = hit time + miss rate × miss penalty
Multilevel Caches
• Primary (L-1) cache attached to the CPU: fast, but small
• A larger, slower level-2 cache services misses from the primary cache
Multilevel Caches: Example
• Given:
  • CPU base CPI: 1; clock rate: 4 GHz (0.25 ns cycle)
  • Miss rate per instruction: 2%
  • Main memory access: 100 ns
• With just the primary cache:
  • Miss penalty: 100 ns / 0.25 ns = 400 cycles
  • Effective CPI = base CPI + (miss rate × miss penalty) = 1 + (0.02 × 400) = 9
• Now add an L-2 cache (5 ns access) that cuts the miss rate to main memory to 0.5%:
  • Primary cache miss with L-2 hit: penalty = 5 ns / 0.25 ns = 20 cycles
  • Primary and L-2 cache miss: penalty = 400 + 20 cycles
  • CPI = 1 + (2% × 20) + (0.5% × 400) = 3.4
• Resulting effects:
  • the L-1 cache is usually smaller than the cache in a single-cache-level CPU
  • the L-1 block size is smaller than the L-2 block size
Performance with Advanced CPUs
• Out-of-order CPUs can execute instructions during a cache miss
• A pending store stays in the load/store unit
[Figure: reservation-station entry — rdy, op, dest, opnd1, opnd2]
Performance with Advanced CPUs: Radix Sort vs. Quicksort
[Figure: performance plots comparing radix sort and quicksort]
Matrix Multiply
[Figure: C = A × B]
Matrix Multiply: C = A*B
[Figure: triple-loop code with the middle and innermost loops highlighted — a C rendering follows below]
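The slide's own code did not survive extraction; a standard C rendering of the loop nest it labels (column-major indexing, as in textbook DGEMM) is:

    /* C = A * B, all n-by-n, column-major storage */
    void dgemm(int n, double *A, double *B, double *C) {
        for (int i = 0; i < n; ++i)             /* outer loop     */
            for (int j = 0; j < n; ++j) {       /* middle loop    */
                double cij = C[i + j*n];
                for (int k = 0; k < n; ++k)     /* innermost loop */
                    cij += A[i + k*n] * B[k + j*n];
                C[i + j*n] = cij;
            }
    }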
Software Optimization: Blocked Algorithms
Operate on cache-sized sub-blocks (tiles) so that each block is reused while it resides in the cache.
Matrix Multiply
[Figure: blocked C = A × B — the inner loops operate on sub-blocks; a blocked C version follows below]
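A blocked version in the same style (BLOCKSIZE is an illustrative tile size; n is assumed to be a multiple of it for brevity):

    #define BLOCKSIZE 32

    /* multiply one BLOCKSIZE x BLOCKSIZE tile of C = A * B */
    static void do_block(int n, int si, int sj, int sk,
                         double *A, double *B, double *C) {
        for (int i = si; i < si + BLOCKSIZE; ++i)
            for (int j = sj; j < sj + BLOCKSIZE; ++j) {
                double cij = C[i + j*n];
                for (int k = sk; k < sk + BLOCKSIZE; ++k)
                    cij += A[i + k*n] * B[k + j*n];
                C[i + j*n] = cij;
            }
    }

    void dgemm_blocked(int n, double *A, double *B, double *C) {
        for (int sj = 0; sj < n; sj += BLOCKSIZE)
            for (int si = 0; si < n; si += BLOCKSIZE)
                for (int sk = 0; sk < n; sk += BLOCKSIZE)
                    do_block(n, si, sj, sk, A, B, C);
    }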
DGEMM Access Pattern
• Loop nest: i outer (fixed in the inner loops), j middle, k innermost
[Figure: access patterns of the C, A, and B arrays, with older and newer accesses shaded]
Blocked DGEMM Access Pattern