Supplemental Material On Cache From ECE-341 Memory

Chapter 5.1-5.

Memory Systems

space takes space
speed is expensive
1
Memory system building blocks
• Differing characteristics
• speed
• size
• cost
• power
• volatile/non-volatile
• direct/sequential access

Type                                    Volatility     Access Speed    Cost per GB
Static Random Access Memory (SRAM)      volatile       0.5ns - 2.5ns   $2000 - $5000   (fastest, most expensive)
Dynamic Random Access Memory (DRAM)     volatile       50ns - 70ns     $20 - $75
Flash                                   non-volatile   5µs - 50µs      $0.75 - $1.00
Magnetic disk                           non-volatile   5ms - 20ms      $0.05 - $0.20   (slowest, cheapest)

2
Memory system building blocks Types of Memory

Read-Only Memory
ROM: read-only memory, “factory” programmed, dense, cheap
PROM: programmable read-only memory (“burn” once)
EPROM: erasable PROM
EAPROM: electrically alterable ROM (~1000 times)

Read/Write Memory
• Physical Characteristics: static vs dynamic; volatile vs nonvolatile;
  destructive read vs nondestructive read; removable vs permanent
• Logical Organization: addressed; associative; sequential access

Flash, Disk, Optical, Tape...

3
Disk Access Example
• Assume
  • 512B sector
  • 15,000 rpm (revolutions per minute)
  • 4ms average seek time
  • 100MB/s transfer rate
  • 0.2ms controller overhead
  • idle disk (no queueing delay)

• Average read time
  = 4ms seek time
  + ½ / (15,000/60) = 2ms rotational latency
  + 512B / 100MB/s ≈ 0.005ms transfer time
  + 0.2ms controller delay
  = 6.2ms
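To sanity-check this arithmetic, a minimal sketch in C (the function name and parameters are illustrative, not from the slides):

#include <stdio.h>

/* Average disk read time = seek + half rotation + transfer + controller.
   Times in ms; rpm in revolutions/minute; rate in MB/s. */
double avg_read_ms(double seek_ms, double rpm, double sector_bytes,
                   double rate_mb_s, double ctrl_ms)
{
    double rotation_ms = 0.5 / (rpm / 60.0) * 1000.0;              /* half a revolution */
    double transfer_ms = sector_bytes / (rate_mb_s * 1e6) * 1000.0;
    return seek_ms + rotation_ms + transfer_ms + ctrl_ms;
}

int main(void)
{
    /* 512B sector, 15,000 rpm, 4ms seek, 100MB/s, 0.2ms controller */
    printf("%.3f ms\n", avg_read_ms(4.0, 15000.0, 512.0, 100.0, 0.2));  /* ~6.205 ms */
    return 0;
}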

4
Disks approximation

• Moving-Head Disk Storage


• platters stacked on spindle
• r/w head for each platter surface (side)
• boom/arm moves in/out to desired cylinder of tracks

5
Disks approximation

Possible physical sector numbering

6
Disks approximation

Read/write cylinder of
sectors as atomic action

7
Disks approximation

Interleaved access example


read first sector
process first sector
(might skip multiple sectors)
read second sector

8
Non-volatile Storage

Disk
• rotating magnetic storage; not direct access
• transfers data in sector chunks
• mechanical startup time:
  • move head to track
  • wait for sector to rotate
  • read data
• slow: request queue managed by the operating system

Flash
• more expensive than disk, but faster (100-1000x)
• smaller, lower power, more robust than disk
• less expensive, but slower, than DRAM
• wears out after 1000s of accesses
9
Memory access time differences
• SRAM  0.5ns - 2.5ns   ~500 ps            = 5×10² ps
• DRAM  50ns - 70ns     ~50,000 ps         = 5×10⁴ ps
• Disk  5ms - 20ms      ~5,000,000,000 ps  = 5×10⁹ ps

10
Memory System Objectives
• Provide lots of storage → make the memory as big as possible
• Make the system fast   → build the system with the fastest memory
• Make the system cheap  → use the cheapest memory, and as little memory as possible

• Ideal system: the speed of SRAM with the cost and capacity of disk

• Target solution: make a cheap, large, slow system appear like a fast one
  • on-chip cache using fast SRAMs
  • primary memory using DRAMs
  • secondary memory using magnetic disks, etc.

11
Memory Interleaving
• Split main memory into a number of physically separate components called
  modules or banks

(figure: banks 0-3 connected through a memory port controller to CPU channels)

• Goal: hide slow memory by making several memory requests simultaneously,
  each to a different module

High-order interleaving: consecutive addresses stored in the same physical module

Low-order interleaving: consecutive addresses stored in consecutive modules

12
Memory Interleaving High vs Low Order Base-10 example

100-word memory using 10 banks (0-9), two-digit addresses

High-order digit interleaving: address = [ bank | offset ]  (bank = leading digit)

Low-order digit interleaving:  address = [ offset | bank ]  (bank = trailing digit)
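A small sketch of the two numbering schemes for this base-10 example (function names are illustrative):

#include <stdio.h>

/* 100-word memory split across 10 banks: two ways to split a 2-digit address. */
void high_order(int addr, int *bank, int *offset)
{
    *bank   = addr / 10;   /* leading digit: consecutive addresses, same bank       */
    *offset = addr % 10;
}

void low_order(int addr, int *bank, int *offset)
{
    *bank   = addr % 10;   /* trailing digit: consecutive addresses, adjacent banks */
    *offset = addr / 10;
}

int main(void)
{
    int b, o;
    high_order(47, &b, &o);  printf("high: bank %d offset %d\n", b, o);  /* 4, 7 */
    low_order(47, &b, &o);   printf("low:  bank %d offset %d\n", b, o);  /* 7, 4 */
    return 0;
}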

13
Memory addresses

• Consider building a 2 GB, byte-addressable memory
  • 31 address lines (2^31 = 2G)
  • 8 data lines

• Build using 256M x 4-bit RAM modules/chips
  • chip size is denoted by its depth x width, e.g. a 4Gbit chip is 4G x 1
  • a data byte requires a pair of chips

• How many address lines are needed per chip?
• How many chips are needed for a 2GB memory?
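The next slides answer these questions (28 address lines per chip, 16 chips); a minimal sketch of the arithmetic, with illustrative names:

#include <stdio.h>

int main(void)
{
    long long mem_bits  = 2LL * 1024 * 1024 * 1024 * 8;   /* 2 GB memory          */
    long long chip_bits = 256LL * 1024 * 1024 * 4;        /* one 256M x 4 chip    */
    printf("chips needed: %lld\n", mem_bits / chip_bits); /* 16                   */
    printf("address lines per chip: %d\n", 28);           /* 256M = 2^28 locations */
    return 0;
}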
14
(figure: one bank of the 2GB memory is a pair of 256M x 4 chips, one supplying the
4 high-order bits and one the 4 low-order bits of each data byte; 8 such banks,
selected 0-7 off the data bus; the 3 high-order address bits are decoded into a
bank select, and the 28 low-order address bits go to every chip)
15
2 GB memory using 16 256M-by-4-bit RAM chips

Where is byte 0?  Where is byte 1?  Where is byte 256M?

(figure: the 8 chip pairs, 0-7, between the address bus and data bus; the 3
high-order address bits select the pair, the 28 low-order address bits select
the byte within the pair)
16
2 GB memory using 16 256M-by-4-bit RAM chips

• If it takes 100ps to fetch one byte, how long for two bytes (say bytes 0 and 1)?

17
2 GB memory using 16 256M-by-4-bit RAM chips

• What if the 3 high-order bits were instead the low-order bits?

• Where would bytes 0 and 1 be?

18
Hierarchical Storage Organization

registers                 — CPU can reference directly
L1 / L2 / L3 cache        — cache
primary storage           — code and data must first move to primary storage
Solid State Drive "disk"  — prefetch cache
secondary storage

Moving up the hierarchy (toward registers): access time decreases, access speed
increases, cost per bit increases, capacity decreases.

2^10 kilobyte   2^20 megabyte   2^30 gigabyte   2^40 terabyte   2^50 petabyte
19
Hierarchical Storage Organization

(same hierarchy, simplified: registers — cache — primary storage — secondary storage)
20
Memory Management
Simple storage analogy
• 10,000 books at library (main memory)
• Checkout Max: 10 (cache)
• Which do you keep at home?
• The ones you need most frequently in the future

Memory and Software


• Consider the following program fragment:
sum = 0;
for i = 1 to 500 do
sum := sum + x[i];

• The variable sum is accessed 1001 times


• The variable i is accessed 1000 times
• Each x[i] is accessed once

⇒ Store sum and i in registers (but not the x[i] values)


21
Memory Management Memory and Software

• Another example:
  var A: array[1..500] of integer;           (A is a 1×500 vector)
  var B: array[1..500] of integer;           (B is a 1×500 vector)
  var C: array[1..500, 1..500] of integer;   (C is a 500×500 matrix)
  A = BC;

• Each B[i] will be accessed 500 times
  (figure: example for dimensions 1..5)

• Hopefully
  • B is stored in a cache
  • The instructions for the loop are stored in a cache
22
Principle of Locality Non-uniform Access

• Temporal Locality
• referenced item likely to be referenced again in the near future
• variables are reused
lw  $5, 147($8)
add $2, $5, $2
sw  $2, 147($8)

• Spatial Locality
• data near referenced item likely to be referenced in the near future
• move through array
top: addi $8, $8, 4
     lw   $5, 147($8)
     add  $5, $5, $2
     bne  $5, $9, top

• Programs exhibit both temporal and spatial locality


23
Memory Management Caches

Cache memory keeps a copy of a subset of main memory

read M[x]:   M[x] in cache?  yes → cache "hit": read M[x] to CPU
                             no  → move M[x] from main memory to the cache,
                                   then read M[x] to CPU

write M[x]:  M[x] in cache?  yes → write M[x] to cache
                             no  → free slot? if not, write a slot back to
                                   memory; then write M[x] to cache / to memory

(flowchart: the read cache operation and write cache operation between the CPU
and main memory)
24
Memory Management Caches

• Presence: How do we know whether the item is in cache memory?


• Placement: Where in the cache would a value be?
• Replacement: What part of the cache must be deleted to make room for
the new items?
• Writes: How are writes handled?

• Efficiency: expect spatial locality


• rather than fetching one word at a time, fetch a collection of contiguous words
• single lookup saves time
• use banks with low order interleaving

25
Memory Management Caches
Significant spatial locality: bring in block of consecutive addressed words

• Terminology:
  • Given 2^n words in main memory, group them into "blocks" of l words
    → there are M (= 2^n / l) logical blocks
  • A cache has "slots" or "lines" for C blocks (C << M)

(figure: a memory block of l consecutive words {a, b, c, d} is copied as a unit
into one of the C cache lines 0, 1, 2, ...)

Note: for simplicity, we'll use word addresses and ignore possible byte offset address bits
26
Memory Management Caches

M = (2^n words of memory) / (l words per cache line)

Consider a computer with:
• a main memory of 64kw and a cache of 1kw
• a block size (l) of 8 words

M (number of main memory blocks) = 64kw / 8w = 8192

C (number of cache lines or cache slots) = 1kw / 8w = 128

(figure: a block {a, b, c, d} of l words occupying one cache line)
27
Mapping Function Main Memory Addressing

• Given a 16-bit word address A = A15 A14 ... A3 A2 A1 A0,
  the word can be found in memory block Ã = A15 A14 ... A3

  address = [ block number: bits 15..3 | word: bits 2..0 ]

  (main memory: 64kw; cache: 1kw; block size: 8w; memory blocks: 8192; cache lines: 128)

Example: A = 29 = 0000000000011101
  Ã = 29 div 8 = 3  → the word is in the 4th block   (like shifting right 3 bits)
  29 mod 8 = 5      → the 6th word in the block

28
Mapping Function Cache Addressing

M: number of main memory blocks        l: number of words in a block/line
C: number of cache lines               Ã: memory block number

• The requested word is stored in cache line Ã mod C

• With C = 128 (2^7) and l = 8 (2^3), this is the lower 7 bits of Ã: A9 ... A3

  address = [ tag: bits 15..10 | cache line: bits 9..3 | word: bits 2..0 ]

• There are 2^6 = 64 memory blocks that map to the same place!
  (main memory: 64kw; cache: 1kw; block size: 8w; memory blocks: 8192;
   cache lines: 128; 8192/128 = 64)
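A minimal sketch of this decomposition for the running 64kw/1kw/8w example (variable names are illustrative):

#include <stdio.h>

int main(void)
{
    unsigned a     = 29;
    unsigned block = a / 8;          /* A >> 3 : memory block number Ã            */
    unsigned word  = a % 8;          /* A2..A0 : word within the block            */
    unsigned line  = block % 128;    /* A9..A3 : cache line, Ã mod C              */
    unsigned tag   = block / 128;    /* A15..A10: which of the 64 blocks that map
                                        to this line is actually present          */
    printf("block=%u word=%u line=%u tag=%u\n", block, word, line, tag);
    /* prints: block=3 word=5 line=3 tag=0 */
    return 0;
}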
29
Mapping Function Cache Presence and Placement

• Tag bits: keep a tag that uniquely identifies which memory block currently
  occupies each cache line
• Valid bit: indicates that a cache line holds a meaningful entry

(figure: memory address = Tag | Index | word offset | byte offset; the index
selects a cache line holding a valid bit, a tag, and the data words; a hit
requires valid = 1 and a tag match. Remember, we're still ignoring byte
address bits.)
30
Mapping Function Direct Mapping

Consider a cache with 2 lines of 4 words each and 5-bit word addresses:
address = [ tag (2 bits) | line/index (1 bit) | offset (2 bits) ]

Read sequence: 2, 3, 22, 1, 20, 4

Main memory contents (address: value):
00000 a   00100 e   01000 i   01100 m   10000 q   10100 u   11000 y   11100 C
00001 b   00101 f   01001 j   01101 n   10001 r   10101 v   11001 z   11101 D
00010 c   00110 g   01010 k   01110 o   10010 s   10110 w   11010 A   11110 E
00011 d   00111 h   01011 l   01111 p   10011 t   10111 x   11011 B   11111 F

Initial cache state (index / valid / tag / data words 00 01 10 11):
0:  valid=0
1:  valid=0
31
Mapping Function Direct Mapping — trace of reads 2, 3, 22, 1, 20, 4
(condensed from the step-by-step frames of slides 32-41; memory contents as above)

Read 2  = 00|0|10 (tag|line|offset): line 0 invalid → "miss"
          load block 00000-00011 → line 0: valid=1, tag=00, data = a b c d
Read 3  = 00|0|11: line 0 valid, tag matches → "hit"  (word d)
Read 22 = 10|1|10: line 1 invalid → "miss"
          load block 10100-10111 → line 1: valid=1, tag=10, data = u v w x
Read 1  = 00|0|01: line 0 valid, tag matches → "hit"  (word b)
Read 20 = 10|1|00: line 1 valid, tag matches → "hit"  (word u)
Read 4  = 00|1|00: line 1 valid but tag 10 ≠ 00 → "miss"
          replace line 1 with block 00100-00111 → line 1: valid=1, tag=00, data = e f g h

32-41
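The same trace can be reproduced in a few lines of C; a minimal sketch of a direct-mapped lookup, assuming the 2-line, 4-word-block geometry above (structure and names are illustrative):

#include <stdio.h>

#define LINES 2                      /* cache lines               */
#define BSIZE 4                      /* words per block           */

struct line { int valid, tag; };

int main(void)
{
    struct line cache[LINES] = {{0,0},{0,0}};
    int trace[] = {2, 3, 22, 1, 20, 4};

    for (int i = 0; i < 6; i++) {
        int addr  = trace[i];
        int block = addr / BSIZE;    /* strip the 2 offset bits   */
        int index = block % LINES;   /* 1 index bit               */
        int tag   = block / LINES;   /* remaining 2 bits          */
        if (cache[index].valid && cache[index].tag == tag)
            printf("%2d hit\n", addr);
        else {                       /* miss: refill the line     */
            printf("%2d miss\n", addr);
            cache[index].valid = 1;
            cache[index].tag   = tag;
        }
    }
    return 0;                        /* prints: 2 miss, 3 hit, 22 miss,
                                        1 hit, 20 hit, 4 miss     */
}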
Example: 8-bit addresses, 8-word blocks, 8 cache lines — direct-mapped cache

address = [ tag (2 bits) | line/index (3 bits) | offset (3 bits) ],  e.g. 00 000 000

(figure: cache lines 000-111 and their tags; main memory blocks labeled tag+index —
00 000 x...x, 00 110 x...x, 01 010 x...x, 10 001 x...x, 10 101 x...x, 11 100 x...x;
each memory block can only go into the cache line named by its index bits;
one refill line = one block)
42
Example: 8-bit addresses, 8-word blocks, 8 cache lines — direct-mapped cache

address = [ tag (2 bits) | line/index (3 bits) | offset (3 bits) ],  e.g. 00 000 000

cache line   tag    cached memory block (tag + index)
000          00     00 000 x...x
001          10     10 001 x...x
010          01     01 010 x...x
011          --     (invalid)
100          11     11 100 x...x
101          10     10 101 x...x
110          00     00 110 x...x
111          --     (invalid)

one refill line = one block
43
Mapping Function Direct Mapping

hit = Valid && (tag match); the index bits select the line

• A given block can only be in one specific slot

• This is inflexible: only one of the M/C blocks that share a line can be in
  the cache at a time
  (main memory: 64kw; cache: 1kw; block size: 8w; memory blocks: 8192;
   cache lines: 128; M/C = 8192/128 = 64)

Example: Loop L calls procedure P, each mapped into the same cache line
• L and P would alternately replace each other through the loop, providing no cache benefit

• More expensive alternatives...
44
Example: 8-bit addresses, 8-word blocks, 8 cache lines — direct-mapped cache

(figure: memory blocks 00 000, 01 000, 10 000, and 11 000 all map to cache
line 000; after accessing address 00 000 000, line 000 holds tag 00 — only one
of the four competing blocks can be cached at a time)
45
Example: 8-bit addresses, 8-word blocks, 8 cache lines — direct-mapped cache

(figure, continued: accessing address 10 001 000 fills cache line 001 with
tag 10, i.e. block 10 001 x...x)

46
Example: 8-bit addresses, 8-word blocks, 8 cache lines — fully associative cache

address = [ tag (5 bits) | offset (3 bits) ],  e.g. 00000 000

Fully associative cache: a block can be placed in any location in the cache
(figure: any main memory block may occupy any cache line; the tag stores the
full block number)
47
Mapping Functions Associative Mapping

• Any block can be in any slot in the cache (no index bits)
• Check tags in all cache lines to see if required block is in the cache
• Tags are larger (line/index bits now part of tag)
• Comparisons done in parallel (too slow if sequential)

• More flexible and results in a higher number of “hits”


• But, requires much more complex circuitry and thus more expensive

Set Associative Mapping

Combination of the direct and associative methods
• Create groups (sets) of j slots in the cache; a j-way cache has S = C/j sets
• The set number of A is found by Ã mod S
48
Example: 8-bit addresses, 8-word blocks, 8 cache lines — two-way set-associative cache

address = [ tag (3 bits) | set (2 bits, was line/index) | offset (3 bits) ],  e.g. 000 00 000

(figure: 4 sets (00-11) of 2 ways each; each way holds a 3-bit tag and a block,
e.g. set 00: tag 101; set 01: tags 000, 001; set 10: tags 001, 111; set 11: tag 000)
49
(figure: the same 8-line cache organized three ways —
Direct-mapped cache: lines 000-111, one block per line, 2-bit tags;
Fully associative cache: any block in any line, full block-number tags;
Two-way set-associative cache: sets 00-11 of two ways each, 3-bit tags;
one refill line = one block)
50
Memory Management Caches

• First two questions:  1. Presence?  2. Placement?
  (then: 3. Replacement?  4. Writes?)
• Cache presence: How do we know whether the item is in cache memory?
• Placement: Where in the cache would a value be?

memory address = [ memory block number | word ]   (high-order bits ... low-order bits)

direct mapping cache:     [ tag | cache line | word ]
fully associative cache:  [ tag | word ]
set associative cache:    [ tag | cache set | word ]

Which is the best? (judged on performance)
Note: for simplicity, ignoring cost and possible byte offset address bits
51
Memory Management Performance

• Definitions: cache hit, cache miss, hit rate, miss rate, hit time, miss penalty
  (flowchart, as before: if M[x] is not in the cache and no line is free, write
   a line back to memory, move M[x] into the cache, then read M[x] to the CPU)

On a cache miss:
• stall the CPU pipeline
• fetch the block from the next level of the memory hierarchy
• instruction cache miss: restart instruction fetch
• data cache miss: complete data access

access time = hit time + (miss rate × miss penalty)
            = hit time + ((1 - hit rate) × miss penalty)

A memory system's behavior is a major factor in determining performance


52
Memory Management Performance

Caching strategies leverage a program's temporal and spatial locality

• Larger blocks should reduce the miss rate (spatial locality)
• But what if the cache size is kept fixed?
  • larger blocks → fewer lines
  • more competition → increased miss rate
  • larger blocks → pollution
• Larger blocks increase the miss penalty
  • can override the benefit of the reduced miss rate
  • early restart and critical-word-first can help

(figure: CPU connected over a bus to interleaved banks 0-3)

access time: hit time + (miss rate × miss penalty)
53


Memory Management Caches

• Cache presence: How do we know whether the item is in cache memory?


• Placement: Where in the cache would a value be?
• Replacement: What part of the cache must be deleted to make room for
the new items?
• Writes: How are writes handled?

54
Cache Replacements and Writes

Replacement Strategies:
• Direct Mapping: only one place to put it
• Associative/Set Associative:
  • FIFO
  • Least Recently Used
  • Least Frequently Used (hard to implement)
  (figure: a 2-way set-associative cache — sets 00-11, tags, two ways)

Memory Writes:
• Writing a word to the cache → the corresponding word in memory becomes incorrect
• If one word per line, just write to the cache
• If (line size > one word), need to first read the line into the cache
55
Cache Consistency Writes: potential problems

Multiple-CPU system using a shared memory and bus:
• Caches reduce memory/bus "bottlenecks"
• A CPU write to its cache → other caches wrong/stale
• DMA may also write to memory that is cached

Possible solutions?
write-through
• any cache write also updates shared memory (and other caches?)
write-back
• a write to a cache line also sets a line update bit (dirty bit)
• when a modified line is replaced, the entire line is copied back to shared memory
write-once
• only the first write to a line is write-through
• caches watch bus transactions, invalidating copies of the modified line
• when a modified line is replaced, the entire line is copied to memory
56
Example System DECStation 3100

• MIPS R2000 processor
  • pipeline similar to the textbook pipeline
  • 32-bit words
• Separate data & instruction caches
  • each 64KB (16K words)
  • one-word blocks (lines)
  • 16-bit tag, 14-bit index
• Write-through policy (one-word blocks → a memory write on every write)
• Performance
  • CPI without cache misses: 1.2
  • 10-cycle penalty for a cache miss

Example: gcc program execution
• 11% of the instructions are writes
• 11% of the time, a 10-cycle penalty (1.1)
• effective CPI = 1.2 + 1.1 = 2.3
• To limit the penalty, a 4-word-deep write buffer was added to the system
• Analysis indicated a buffer size of 4 is sufficient so that stalls are rare

57
Example System Intrinsity FastMATH

• Embedded MIPS processor
  • 12-stage pipeline
• Data and instruction caches
  • each 16KB: 256 lines × 16 words/line
  • write-through or write-back
• Performance
  • 1 bus cycle for address transfer
  • 15 bus cycles per DRAM access
  • 1 bus cycle per data transfer
• SPEC2000 miss rates
  • I-cache: 0.4%
  • D-cache: 11.4%
  • weighted average: 3.24%
  • not worth it to instead form a single cache: 3.18% miss rate, but would
    lose the benefit of concurrent accesses
58
Performance byte transfer rates

Cache / memory relationship — assume a system with:
• 1 clock cycle to send the address
• 15 clock cycles per DRAM access
• 1 clock cycle to send a word of data
• cache block of 4 words
• each word is 4 bytes (32 bits)

(figure: CPU with cache and a 4-entry buffer connected over a bus to memory
banks 0..n)

59
Performance byte transfer rates

Examples (1 cycle to send the address, 15-cycle DRAM access,
1 cycle to send a word, cache block of 4 words):

• Regular memory: 1-word bus, 1-word-wide memory   (CPU — 1 word — Mem)

miss penalty:  1 + 4×15 + 4×1 = 65 clock cycles
               (address + DRAM accesses + transfers)
transfer rate: 4×4 bytes / 65 = 0.25 bytes per cycle

60
Performance byte transfer rates

• Wide memory: 2-word bus, 2-word-wide memory   (CPU — 2 words — Mem)

miss penalty:  1 + 2×15 + 2×1 = 33 clock cycles
transfer rate: 16/33 = 0.48 bytes per cycle

61
Performance byte transfer rates

• Interleaved memory: 4 banks of 1-word-wide memory, 1-word bus
  (CPU — 1 word — Mem banks 0-3)

miss penalty:  1 + 1×15 + 4×1 = 20 clock cycles
               (the four DRAM accesses overlap; the transfers are sequential)
transfer rate: 16/20 = 0.8 bytes per cycle

Even more beneficial on writes
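A sketch tabulating all three organizations under the stated bus and DRAM timings (names are illustrative):

#include <stdio.h>

/* Miss penalty in cycles for a 4-word block: 1 cycle for the address,
   15 cycles per DRAM access, 1 cycle per word on the bus. */
int main(void)
{
    int narrow      = 1 + 4*15 + 4*1;   /* 1-word bus, 1-word memory: 65 */
    int wide        = 1 + 2*15 + 2*1;   /* 2-word bus and memory:     33 */
    int interleaved = 1 + 1*15 + 4*1;   /* 4 banks, 1-word bus:       20 */
    printf("penalties: %d %d %d cycles\n", narrow, wide, interleaved);
    printf("rates: %.2f %.2f %.2f bytes/cycle\n",
           16.0/narrow, 16.0/wide, 16.0/interleaved);   /* 0.25 0.48 0.80 */
    return 0;
}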

62
Performance System

• Write-through case:

CPU time = (exec clock cycles + stall clock cycles) × clock cycle time

stall clock cycles = (instructions / program) × miss rate × miss penalty

Examining reads and writes separately:

read stall cycles  = (reads / program) × read miss rate × read miss penalty

write stall cycles = (writes / program) × write miss rate × write miss penalty
                     + write buffer stalls

63
Performance System Example

• Assume
  • Instruction cache miss rate: 5%
  • Data cache miss rate: 10%
  • CPI without memory stalls: 4
  • Miss penalty: 12 cycles
  • Load/store instruction mix: 33%

• How much better is a perfect/no-penalty cache system? (see the sketch below)

Instruction miss cycles = IC × 5% × 12 = 0.6 IC
Data miss cycles        = IC × 33% × 10% × 12 ≈ 0.4 IC
→ 1 stall cycle expected per instruction
CPI with memory stalls: 5 (= 4 + 1)
Perfect vs penalty cache system: 5/4 = 1.25
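A minimal sketch of this CPI calculation (the helper name is illustrative); the same function also reproduces the example on the next slide:

#include <stdio.h>

/* Effective CPI = base CPI + I-miss stalls + D-miss stalls, per instruction. */
double cpi(double base, double i_mr, double d_mr,
           double ls_frac, double penalty)
{
    return base + i_mr*penalty + ls_frac*d_mr*penalty;
}

int main(void)
{
    /* this slide: base 4, I-miss 5%, D-miss 10%, 33% load/store, penalty 12  */
    printf("%.2f\n", cpi(4, 0.05, 0.10, 0.33, 12));   /* ~5.00 */
    /* next slide: base 2, I-miss 2%, D-miss 4%, 36% load/store, penalty 100  */
    printf("%.2f\n", cpi(2, 0.02, 0.04, 0.36, 100));  /*  5.44 */
    return 0;
}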

• What if the CPU is made faster, but the memory stays the same?
• reduce CPI or increase CPU clock rate

64
Performance System Example

• Assume
  • I-cache miss rate: 2%
  • D-cache miss rate: 4%
  • Miss penalty: 100 cycles
  • Base CPI (ideal cache): 2
  • Loads & stores: 36% of instructions

• Miss cycles per instruction
  • I-cache: 2% × 100 = 2
  • D-cache: 36% × 4% × 100 = 1.44

• Actual CPI: 2 + (2 + 1.44) = 5.44

• The ideal CPU is 5.44/2 = 2.72 times faster

65
Performance Amdahl's Law

As one performance factor is improved, the others become more significant

• Faster CPU clock rate or lower CPI
  → increases the fraction of time spent in memory stalls
  Eventually stalls become the most significant factor

• Example:
  Cut the non-stall CPI from 4 to 2, retaining 1 stall cycle per instruction
  • Original design stalled 1 of 5 cycles (20%), now 1 of 3 cycles (33%)
  • If we instead double the clock rate, same effect (higher miss penalty)

• Consider improving the CPU clock rate and reducing the CPI
  • double hit!
    lower CPI: more pronounced impact of stalls
    higher clock rate: higher miss penalty

66
Performance Average Access Time

• Hit time is also important for performance

• Average memory access time (AMAT)
  • AMAT = Hit time + Miss time
         = Hit time + (Miss rate × Miss penalty)

• Example
  • CPU clock: 1ns
  • hit time: 1 cycle
  • miss penalty: 20 cycles
  • I-cache miss rate: 5%

• AMAT = 1 + (0.05 × 20) = 2ns, i.e. 2 cycles per instruction
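As a one-line sketch (illustrative name; times in cycles, 1ns clock):

/* AMAT = hit time + miss rate x miss penalty */
double amat(double hit_time, double miss_rate, double miss_penalty)
{
    return hit_time + miss_rate * miss_penalty;
}
/* amat(1, 0.05, 20) returns 2.0 cycles = 2ns at the 1ns clock */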

67
Multilevel Caches
• Primary cache attached to the CPU: fast, but small

• Level-2 cache services misses from primary cache


• Larger, slower, but still faster than main memory
• Some high-end systems include L-3 cache
• Main memory services last level cache misses

68
Multilevel Caches Example

• Given
  • CPU base CPI: 1
  • Clock rate: 4GHz (0.25ns cycle)
  • Miss rate/instruction: 2%
  • Main memory access: 100ns

• With just a primary cache
  • Miss penalty: 100ns/0.25ns = 400 cycles
  • Effective CPI = Base CPI + (Miss rate × Miss penalty) = 1 + (0.02 × 400) = 9

• Add an L-2 cache
  • Access time: 5ns
  • Global miss rate to memory: 0.5%
  • Primary cache miss with L-2 hit: penalty = 5ns/0.25ns = 20 cycles
  • Primary and L-2 cache miss: penalty = 400 + 20 cycles
  • CPI = 1 + (2% × 20) + (0.5% × 400) = 3.4

• Performance ratio = 9/3.4 = 2.6
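A minimal sketch of the one-level vs two-level comparison (variable names are illustrative):

#include <stdio.h>

int main(void)
{
    double base  = 1.0;
    double l1_mr = 0.02,  l1_pen = 20;    /* L-1 miss served by L-2: 5ns/0.25ns  */
    double l2_mr = 0.005, l2_pen = 400;   /* global miss to memory: 100ns/0.25ns */

    double one_level = base + l1_mr * 400;                   /* 9.0 */
    double two_level = base + l1_mr*l1_pen + l2_mr*l2_pen;   /* 3.4 */
    printf("CPI: %.1f vs %.1f, ratio %.1f\n",
           one_level, two_level, one_level/two_level);       /* 2.6 */
    return 0;
}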
69
Multilevel Caches Considerations

• Primary cache
  • Focus on minimal hit time

• L-2 cache
  • Focus on a low miss rate, to avoid main memory accesses
  • Hit time has less overall impact

• Resulting effects:
  • the L-1 cache is usually smaller than the cache in a single-cache-level CPU
  • the L-1 block size is smaller than the L-2 block size

70
Performance with Advanced CPUs

• Out-of-order CPUs can execute instructions during a cache miss
  • a pending store stays in the load/store unit
  • dependent instructions wait in reservation stations (rdy, op, dest, opnd1, opnd2)
  • instructions with no dependencies continue

• The effect of a miss depends on the program's data flow
  • much harder to analyze mathematically
  • evaluation using a software simulation model

71
Performance with Advanced CPUs Radix Sort vs Quicksort

For large arrays, radix sort requires fewer operations per item sorted —
but actual performance shows more clock cycles per item!

72
Matrix Multiply

(figure: C = A × B)

73
Matrix Multiply C = A*B

DGEMM: Double-precision GEneral Matrix Multiply

• Example (page 215): multiplying two 32×32 matrices

for each row                     (outer loop)
    for each column                  (middle loop)
        compute sum-of-products          (innermost loop)

74
Software optimization blocked algorithms

• Dramatically improve performance by minimizing memory block reloads
  • seek full utilization of the data stored in a cache line before it is replaced
• Matrix multiplication example
  • store arrays so memory access is sequential
  • different from simply utilizing row-major or column-major order!

Inner loops of the DGEMM algorithm — matrix multiply of one row of A:
reads one row of A repeatedly, reads all N×N elements of B, writes one row of C.

for (int j = 0; j < n; ++j)
{
    double cij = C[i+j*n];
    for (int k = 0; k < n; k++)
        cij += A[i+k*n] * B[k+j*n];
    C[i+j*n] = cij;
}
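For reference, a minimal complete version, adding the outer i loop that the fragment above elides and reusing its indexing:

/* Unoptimized DGEMM, C = C + A*B for n x n matrices stored column-major. */
void dgemm (int n, double* A, double* B, double* C)
{
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
        {
            double cij = C[i+j*n];            /* cij = C[i][j]          */
            for (int k = 0; k < n; k++)
                cij += A[i+k*n] * B[k+j*n];   /* cij += A[i][k]*B[k][j] */
            C[i+j*n] = cij;                   /* C[i][j] = cij          */
        }
}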

75
Matrix Multiply

(figure: C = A × B, highlighting the elements touched by the inner loops)

76
DGEMM Access Pattern

i: outer loop — fixed in the inner loops
j: middle loop
k: innermost loop

C: i = row (fixed), j = column
A: i = row, k = column
B: k = row, j = column
(figure shades older accesses, new accesses, and not-yet-touched elements)

If the cache can hold all three N×N matrices, all is well.

Otherwise, the miss penalty will become a key factor.
77
Software optimization blocked algorithms

Algorithm rewritten to compute a submatrix that will fit in the cache
• blocking factor: BLOCKSIZE

#define BLOCKSIZE 32
void do_block (int n, int si, int sj, int sk,
               double *A, double *B, double *C)
{
    for (int i = si; i < si+BLOCKSIZE; ++i)
        for (int j = sj; j < sj+BLOCKSIZE; ++j)
        {
            double cij = C[i+j*n];              /* cij = C[i][j]          */
            for (int k = sk; k < sk+BLOCKSIZE; k++)
                cij += A[i+k*n] * B[k+j*n];     /* cij += A[i][k]*B[k][j] */
            C[i+j*n] = cij;                     /* C[i][j] = cij          */
        }
}

void dgemm (int n, double* A, double* B, double* C)
{
    for (int sj = 0; sj < n; sj += BLOCKSIZE)
        for (int si = 0; si < n; si += BLOCKSIZE)
            for (int sk = 0; sk < n; sk += BLOCKSIZE)
                do_block(n, si, sj, sk, A, B, C);
}

78
Blocked DGEMM Access Pattern

(figure: GFLOPS — giga floating-point operations per second — for the
unoptimized vs the blocked DGEMM)
79
