Lec17-cache-3
Spring 2022
Overview
1 Example 1
2 Example 2
3 Example 3
4 Performance Issues
Example 1
Cache Example
short A[10][4];
int sum = 0;
int j, i;
double mean;

// forward loop
for (j = 0; j <= 9; j++)
    sum += A[j][0];
mean = sum / 10.0;

// backward loop
for (i = 9; i >= 0; i--)
    A[i][0] = A[i][0] / mean;

• Assume separate instruction and data caches, so we consider only the data cache
• The cache has space for 8 blocks
• A block contains one word
• A[10][4] is an array of words located at 7A00–7A27 in row-major order
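To make the examples below easy to check, here is a minimal sketch that replays this access pattern against a direct-mapped cache (assuming, as in the slides, 8 one-word blocks with block index = word address mod 8; the simulator is illustrative and not part of the original example):

#include <stdio.h>

#define NUM_BLOCKS 8        /* cache holds 8 one-word blocks */
#define BASE       0x7A00   /* word address of A[0][0]       */

int main(void) {
    int tag[NUM_BLOCKS];                /* tag = remaining address bits */
    int valid[NUM_BLOCKS] = {0};
    int hits = 0, misses = 0;

    /* Access A[j][0] for j = 0..9, then A[i][0] for i = 9..0.
       Row-major layout: A[r][0] is at word address BASE + 4*r. */
    int order[20];
    for (int k = 0; k < 10; k++) order[k] = k;          /* forward loop  */
    for (int k = 0; k < 10; k++) order[10 + k] = 9 - k; /* backward loop */

    for (int k = 0; k < 20; k++) {
        int addr  = BASE + 4 * order[k];
        int block = addr % NUM_BLOCKS;  /* least significant 3 bits */
        int t     = addr / NUM_BLOCKS;  /* the rest serves as the tag */
        if (valid[block] && tag[block] == t) hits++;
        else { misses++; valid[block] = 1; tag[block] = t; }
    }
    printf("hits=%d misses=%d\n", hits, misses);  /* expect hits=2 */
    return 0;
}

Running it reports 2 hits out of 20 accesses, matching the direct-mapped analysis below.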
Cache Example
Memory word addresses and array contents (40 elements):

Address (hex)   Address (binary)       Array contents
7A00            0111 1010 0000 0000    A[0][0]
7A01            0111 1010 0000 0001    A[0][1]
7A02            0111 1010 0000 0010    A[0][2]
7A03            0111 1010 0000 0011    A[0][3]
7A04            0111 1010 0000 0100    A[1][0]
...             ...                    ...
7A24            0111 1010 0010 0100    A[9][0]
7A25            0111 1010 0010 0101    A[9][1]
7A26            0111 1010 0010 0110    A[9][2]
7A27            0111 1010 0010 0111    A[9][3]
Direct Mapping
• The least significant 3 bits of the address determine the cache block position
• No replacement algorithm is needed in direct mapping
• We get cache hits only for i == 9 and i == 8 (2 hits in total)
• Only 2 of the 8 cache positions are used
• Very inefficient cache utilization
Cache contents after each access (blocks 1–3 and 5–7 are never used):

Forward loop:
Block   j=0      j=1      j=2      j=3      j=4      j=5      j=6      j=7      j=8      j=9
0       A[0][0]  A[0][0]  A[2][0]  A[2][0]  A[4][0]  A[4][0]  A[6][0]  A[6][0]  A[8][0]  A[8][0]
4       -        A[1][0]  A[1][0]  A[3][0]  A[3][0]  A[5][0]  A[5][0]  A[7][0]  A[7][0]  A[9][0]

Backward loop:
Block   i=9      i=8      i=7      i=6      i=5      i=4      i=3      i=2      i=1      i=0
0       A[8][0]  A[8][0]  A[8][0]  A[6][0]  A[6][0]  A[4][0]  A[4][0]  A[2][0]  A[2][0]  A[0][0]
4       A[9][0]  A[9][0]  A[7][0]  A[7][0]  A[5][0]  A[5][0]  A[3][0]  A[3][0]  A[1][0]  A[1][0]
Associative Mapping
• Any memory block can be placed in any cache position; with LRU replacement, the forward loop fills all 8 positions, and A[8][0] and A[9][0] then replace the least recently used entries A[0][0] and A[1][0]
• The backward loop gets hits for i = 9 down to i = 2 (8 hits in total)

Forward loop:
Block   j=0      j=1      j=2      j=3      j=4      j=5      j=6      j=7      j=8      j=9
0       A[0][0]  A[0][0]  A[0][0]  A[0][0]  A[0][0]  A[0][0]  A[0][0]  A[0][0]  A[8][0]  A[8][0]
1       -        A[1][0]  A[1][0]  A[1][0]  A[1][0]  A[1][0]  A[1][0]  A[1][0]  A[1][0]  A[9][0]
2       -        -        A[2][0]  A[2][0]  A[2][0]  A[2][0]  A[2][0]  A[2][0]  A[2][0]  A[2][0]
3       -        -        -        A[3][0]  A[3][0]  A[3][0]  A[3][0]  A[3][0]  A[3][0]  A[3][0]
4       -        -        -        -        A[4][0]  A[4][0]  A[4][0]  A[4][0]  A[4][0]  A[4][0]
5       -        -        -        -        -        A[5][0]  A[5][0]  A[5][0]  A[5][0]  A[5][0]
6       -        -        -        -        -        -        A[6][0]  A[6][0]  A[6][0]  A[6][0]
7       -        -        -        -        -        -        -        A[7][0]  A[7][0]  A[7][0]

Backward loop:
Block   i=9      i=8      i=7      i=6      i=5      i=4      i=3      i=2      i=1      i=0
0       A[8][0]  A[8][0]  A[8][0]  A[8][0]  A[8][0]  A[8][0]  A[8][0]  A[8][0]  A[8][0]  A[0][0]
1       A[9][0]  A[9][0]  A[9][0]  A[9][0]  A[9][0]  A[9][0]  A[9][0]  A[9][0]  A[1][0]  A[1][0]
2       A[2][0]  A[2][0]  A[2][0]  A[2][0]  A[2][0]  A[2][0]  A[2][0]  A[2][0]  A[2][0]  A[2][0]
3       A[3][0]  A[3][0]  A[3][0]  A[3][0]  A[3][0]  A[3][0]  A[3][0]  A[3][0]  A[3][0]  A[3][0]
4       A[4][0]  A[4][0]  A[4][0]  A[4][0]  A[4][0]  A[4][0]  A[4][0]  A[4][0]  A[4][0]  A[4][0]
5       A[5][0]  A[5][0]  A[5][0]  A[5][0]  A[5][0]  A[5][0]  A[5][0]  A[5][0]  A[5][0]  A[5][0]
6       A[6][0]  A[6][0]  A[6][0]  A[6][0]  A[6][0]  A[6][0]  A[6][0]  A[6][0]  A[6][0]  A[6][0]
7       A[7][0]  A[7][0]  A[7][0]  A[7][0]  A[7][0]  A[7][0]  A[7][0]  A[7][0]  A[7][0]  A[7][0]

Tags and LRU counters are not shown, but they are needed.
Set Associative Mapping
• Since all accessed blocks have even addresses (7A00, 7A04, 7A08, ...), they all map to set 0, so only half of the cache is used
• With the LRU replacement policy, we get hits for i = 9, 8, 7, and 6
• Random replacement would have better average performance in this case
• If the i loop were a forward loop, we would get no hits!
Forward loop (set 0 = blocks 0–3; set 1 = blocks 4–7 is never used):
Block   j=0      j=1      j=2      j=3      j=4      j=5      j=6      j=7      j=8      j=9
0       A[0][0]  A[0][0]  A[0][0]  A[0][0]  A[4][0]  A[4][0]  A[4][0]  A[4][0]  A[8][0]  A[8][0]
1       -        A[1][0]  A[1][0]  A[1][0]  A[1][0]  A[5][0]  A[5][0]  A[5][0]  A[5][0]  A[9][0]
2       -        -        A[2][0]  A[2][0]  A[2][0]  A[2][0]  A[6][0]  A[6][0]  A[6][0]  A[6][0]
3       -        -        -        A[3][0]  A[3][0]  A[3][0]  A[3][0]  A[7][0]  A[7][0]  A[7][0]

Backward loop:
Block   i=9      i=8      i=7      i=6      i=5      i=4      i=3      i=2      i=1      i=0
0       A[8][0]  A[8][0]  A[8][0]  A[8][0]  A[8][0]  A[4][0]  A[4][0]  A[4][0]  A[4][0]  A[0][0]
1       A[9][0]  A[9][0]  A[9][0]  A[9][0]  A[5][0]  A[5][0]  A[5][0]  A[5][0]  A[1][0]  A[1][0]
2       A[6][0]  A[6][0]  A[6][0]  A[6][0]  A[6][0]  A[6][0]  A[6][0]  A[2][0]  A[2][0]  A[2][0]
3       A[7][0]  A[7][0]  A[7][0]  A[7][0]  A[7][0]  A[7][0]  A[3][0]  A[3][0]  A[3][0]  A[3][0]

Tags and LRU counters are not shown, but they are needed.
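The same pattern can be replayed for the 4-way set-associative case. Below is a hedged sketch (set index = least significant bit of the word address; per-entry recency counters stand in for real LRU hardware, which this code only approximates in spirit):

#include <stdio.h>

#define WAYS 4
#define SETS 2
#define BASE 0x7A00

int main(void) {
    int tag[SETS][WAYS], valid[SETS][WAYS] = {{0}}, last_use[SETS][WAYS];
    int hits = 0, time = 0;

    int order[20];
    for (int k = 0; k < 10; k++) { order[k] = k; order[10 + k] = 9 - k; }

    for (int k = 0; k < 20; k++, time++) {
        int addr = BASE + 4 * order[k];
        int set  = addr % SETS;       /* all example addresses land in set 0 */
        int t    = addr / SETS;
        int way  = -1;
        for (int w = 0; w < WAYS; w++)
            if (valid[set][w] && tag[set][w] == t) { way = w; hits++; break; }
        if (way < 0) {                /* miss: pick an invalid or LRU way */
            way = 0;
            for (int w = 0; w < WAYS; w++) {
                if (!valid[set][w]) { way = w; break; }
                if (last_use[set][w] < last_use[set][way]) way = w;
            }
            valid[set][way] = 1;
            tag[set][way] = t;
        }
        last_use[set][way] = time;    /* update recency on hit or fill */
    }
    printf("hits=%d\n", hits);        /* expect 4: i = 9, 8, 7, 6 */
    return 0;
}

It prints hits=4, matching the table above.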
Comments on the Example
• In practice:
  • Hit rates as low as in this example are very rare.
  • A set-associative cache with an LRU replacement scheme is usually used.
  • Larger and more numerous blocks (i.e., more cache memory) greatly improve the hit rate.
Example 2
Question:
How many total bits are required for a direct-mapped cache with 16 KiB of data and
4-word blocks, assuming a 32-bit address?
Answer:
• In a 32-bit address CPU, 16 KiB of data is 4096 words.
• With a block size of 4 words, there are 1024 blocks.
• Each block has 4 × 32 = 128 bits of data, plus a tag of (32 − 10 − 2 − 2) = 18 bits, plus a valid bit.
• Thus, the total cache size is 2^10 × (4 × 32 + 18 + 1) = 2^10 × 147 bits.
• The total number of bits in the cache is about 147 / (32 × 4) ≈ 1.15 times as many as needed just for the storage of the data.
Question:
How many total bits are required for a fully associative cache with 16 KiB of data and
4-word blocks, assuming a 32-bit address?
Answer:
• In a 32-bit address CPU, 16 KiB of data is 4096 words.
• With a block size of 4 words, there are 1024 blocks.
• Each block has 4 × 32 = 128 bits of data, plus a tag of (32 − 2 − 2) = 28 bits, plus a valid bit.
• Thus, the total cache size is 2^10 × (4 × 32 + 28 + 1) = 2^10 × 157 bits.
• The total number of bits in the cache is about 157 / (32 × 4) ≈ 1.23 times as many as needed just for the storage of the data.
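Both calculations can be sanity-checked with a few lines of C (parameters taken from the questions; the snippet is purely illustrative):

#include <stdio.h>

int main(void) {
    int words_per_block = 4;
    int blocks          = 1024;            /* 16 KiB of data / 16 B per block */
    int data_bits       = words_per_block * 32;

    int dm_tag = 32 - 10 - 2 - 2;          /* direct-mapped: minus index and offsets */
    int fa_tag = 32 - 2 - 2;               /* fully associative: no index field      */

    int dm_line = data_bits + dm_tag + 1;  /* +1 valid bit -> 147 bits per block */
    int fa_line = data_bits + fa_tag + 1;  /* -> 157 bits per block              */

    printf("direct-mapped:     %d bits total (%.2fx data)\n",
           blocks * dm_line, dm_line / (double)data_bits);
    printf("fully associative: %d bits total (%.2fx data)\n",
           blocks * fa_line, fa_line / (double)data_bits);
    return 0;
}

This prints the 1.15x and 1.23x overhead factors derived above.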
Example 3
Question
We have designed a direct-mapped cache with 64-bit addresses, and the address bits used to
access the cache are as shown below. [Address-field diagram omitted.]
Performance Issues
Q1: Where Can a Block Be Placed in the Upper Level?
Q3: Which Entry Should Be Replaced on a Miss?
• Direct mapped: only one choice
• Set associative or fully associative:
  • Random
  • LRU (Least Recently Used)
Note that:
• For a 2-way set-associative cache, random replacement has a miss rate about 1.1 times that of LRU
• For high associativity (4-way and beyond), exact LRU is too costly to implement
Q4: What Happens on a Write?
• Write-Through:
  • The information is written to both the block in the cache and the block in the lower level of memory
  • Combined with a write buffer, write waits can be eliminated
  • Pro: read misses don't result in writes
  • Pro: easier to implement
• Write-Back:
  • The information is written only to the block in the cache
  • The modification is written to the lower level only when the block is replaced
  • Needs a dirty bit, which tracks whether the block is clean or modified
  • Virtual memory always uses write-back
  • Pro: writes happen at the speed of the cache
  • Pro: repeated writes require only one write to the lower level
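The two policies can be sketched in C as below (line_t, memory_write_word, and memory_write_block are hypothetical helpers invented for illustration; a real write policy lives in the cache controller hardware, so this is a sketch of the logic, not an implementation):

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical cache line; the dirty bit is needed only for write-back. */
typedef struct {
    bool     valid;
    bool     dirty;
    uint32_t tag;
    uint32_t data[4];    /* 4-word block */
} line_t;

/* Assumed lower-level memory interface (declarations only). */
void memory_write_word(uint32_t addr, uint32_t word);
void memory_write_block(uint32_t addr, const uint32_t *block);

/* Write-through: update the cache line AND the lower level on every store. */
void store_write_through(line_t *line, uint32_t addr, uint32_t word) {
    line->data[(addr >> 2) & 3] = word;
    memory_write_word(addr, word);      /* a write buffer would hide this latency */
}

/* Write-back: update only the cache line and mark it dirty. */
void store_write_back(line_t *line, uint32_t addr, uint32_t word) {
    line->data[(addr >> 2) & 3] = word;
    line->dirty = true;
}

/* On replacement, a dirty block is written back once, covering many stores. */
void evict(line_t *line, uint32_t block_addr) {
    if (line->dirty)
        memory_write_block(block_addr, line->data);
    line->valid = false;
    line->dirty = false;
}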
Performance Consideration
Performance
How fast machine instructions can be brought into the processor and how fast they can be
executed.
• Two key factors are performance and cost, i.e., price/performance ratio.
• For a hierarchical memory system with a cache, the processor can access instructions and data more quickly when the data it wants are in the cache.
• Therefore, the impact of a cache on performance depends on the hit and miss rates.
Cache Hit Rate and Miss Penalty
• High hit rates over 0.9 are essential for high-performance computers.
• A penalty is incurred because extra time is needed to bring a block of data from a
slower unit to a faster one in the hierarchy.
Miss Penalty
Total access time seen by the processor when a miss occurs.
21/28
Average Memory Access Time
h × C + (1 − h) × M
• h: hit rate
• M: miss penalty
• C: cache access time
Question: Memory Access Time Example
• Assume 8 cycles to read a single memory word
• 15 cycles to load an 8-word block from main memory (previous example)
• Cache access time = 1 cycle
• For every 100 instructions, statistically 30 instructions are data reads/writes
• Instruction fetches: 100 memory accesses, assume hit rate = 0.95
• Data reads/writes: 30 memory accesses, assume hit rate = 0.90
Calculate: (1) execution cycles without a cache; (2) execution cycles with a cache. (A worked sketch follows.)
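A worked sketch of the answer (assuming, since the slides do not say, that a miss costs the 1-cycle cache access plus the 15-cycle block transfer, i.e. M = 16 cycles, and that without a cache every access takes the full 8-cycle word read):
(1) Without cache: (100 + 30) × 8 = 1040 cycles.
(2) With cache:
    instruction fetches: 100 × (0.95 × 1 + 0.05 × 16) = 175 cycles
    data accesses: 30 × (0.90 × 1 + 0.10 × 16) = 75 cycles
    total ≈ 250 cycles, roughly a 4× improvement.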
Caches on Processor Chips
• In high-performance processors, two levels of caches are normally used, L1 and L2.
• The L1 cache must be very fast, as it determines the memory access time seen by the processor.
• The L2 cache can be slower, but it should be much larger than the L1 cache to ensure a
high hit rate. Its speed is less critical because it only affects the miss penalty of the L1
cache.
• Average access time on such a system:
  t_avg = h1 × C1 + (1 − h1) × [h2 × C2 + (1 − h2) × M]
  where h1 and h2 are the L1 and L2 hit rates, C1 and C2 the L1 and L2 access times, and M the miss penalty for going to main memory.
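For illustration (numbers assumed here, not from the lecture): with h1 = 0.95, C1 = 1 cycle, h2 = 0.9, C2 = 10 cycles, and M = 100 cycles, the average access time is 0.95 × 1 + 0.05 × (0.9 × 10 + 0.1 × 100) = 0.95 + 0.05 × 19 = 1.9 cycles.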
Larger Block Size
• Pro: If all items in a larger block are needed in a computation, it is better to load these items into the cache in a single miss.
• Con: Larger blocks are effective only up to a certain size, beyond which too many items will remain unused before the block is replaced.
• Con: Larger blocks take longer to transfer and thus increase the miss penalty.
Miss Rate vs. Block Size vs. Cache Size
Miss rate goes up if the block size becomes a significant fraction of the cache size,
because the number of blocks that can be held in the same-size cache becomes smaller
(increasing capacity misses).
Enhancement
Write buffer:
• A read request is served first.
• A write request is stored in the write buffer first and sent to memory whenever there is no pending read request.
• The address of a read request must be compared with the addresses in the write buffer, so that a read of data still waiting in the buffer returns the up-to-date value.
Prefetch:
• Prefetch data into the cache before they are needed, while the processor is busy
executing instructions that do not result in a read miss.
• Prefetch instructions can be inserted by the programmer or the compiler (see the sketch below).
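For example, GCC and Clang expose a software-prefetch hint via __builtin_prefetch; a minimal sketch (the look-ahead distance of 16 elements is an arbitrary illustrative choice):

/* Sum an array while hinting a later element into the cache.
   __builtin_prefetch(addr, rw, locality): rw = 0 for read,
   locality 0..3 (3 = keep in all cache levels); both default
   to 0 and 3 respectively when omitted. */
long sum_with_prefetch(const int *a, long n) {
    long s = 0;
    for (long i = 0; i < n; i++) {
        if (i + 16 < n)                     /* stay within bounds */
            __builtin_prefetch(&a[i + 16]); /* hint: we'll read this soon */
        s += a[i];
    }
    return s;
}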
Load-Through Approach
• Instead of waiting for the whole block to be transferred, the processor resumes execution
as soon as the required word is loaded into the cache.