
CENG 3420

Computer Organization & Design


Lecture 17: Cache-3 Discussions
Bei Yu
CSE Department, CUHK
[email protected]

(Textbook: Chapters 5.3–5.4)

Spring 2022
Overview

1 Example 1

2 Example 2

3 Example 3

4 Performance Issues

2/28
Example 1
Cache Example

short A[10][4];
int sum = 0;
int j, i;
double mean;

// forward loop
for (j = 0; j <= 9; j++)
    sum += A[j][0];
mean = sum / 10.0;

// backward loop
for (i = 9; i >= 0; i--)
    A[i][0] = A[i][0] / mean;

• Assume separate instruction and data caches
• So we consider only the data
• Cache has space for 8 blocks
• A block contains one word (byte)
• A[10][4] is an array of words located at 7A00-7A27 in row-major order

4/28
Cache Example

Memory word address (hex)   Memory word address (binary)   Array contents (40 elements)

(7A00)                      0111 1010 0000 0000            A[0][0]
(7A01)                      0111 1010 0000 0001            A[0][1]
(7A02)                      0111 1010 0000 0010            A[0][2]
(7A03)                      0111 1010 0000 0011            A[0][3]
(7A04)                      0111 1010 0000 0100            A[1][0]
...                         ...                            ...
(7A24)                      0111 1010 0010 0100            A[9][0]
(7A25)                      0111 1010 0010 0101            A[9][1]
(7A26)                      0111 1010 0010 0110            A[9][2]
(7A27)                      0111 1010 0010 0111            A[9][3]

Tag fields for the three mapping schemes:
• Direct mapped: 8 blocks in cache, so 3 bits encode the cache block number (the remaining 13 bits form the tag)
• Set-associative: 4 blocks per set and 2 cache sets, so 1 bit encodes the cache set number (the remaining 15 bits form the tag)
• Associative: no index bits; the entire 16-bit address serves as the tag

To simplify discussion: 16-bit word (byte) address; i.e. 1 word = 1 byte.
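Since 1 word = 1 byte here, the addresses above follow from a simple row-major calculation: the address of A[j][k] is the base address plus 4 × j + k. A minimal illustrative sketch (not from the lecture; the base address 0x7A00 is taken from the table):

#include <stdio.h>

int main(void) {
    unsigned base = 0x7A00;                    /* address of A[0][0], from the table */
    for (int j = 0; j < 10; j++)
        for (int k = 0; k < 4; k++)            /* row-major: each row is contiguous  */
            printf("A[%d][%d] -> %04X\n", j, k, base + 4 * j + k);
    return 0;
}

Running it prints the 40 addresses 7A00 through 7A27, matching the table.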

5/28
Direct Mapping
• The least significant 3 bits of the address determine the cache block (location)
• No replacement algorithm is needed in direct mapping
• Only when i == 9 and i == 8 do we get cache hits (2 hits in total)
• Only 2 of the 8 cache positions are used
• Very inefficient cache utilization

Content of the data cache after each loop pass (time line):

Pass:    j=0     j=1     j=2     j=3     j=4     j=5     j=6     j=7     j=8     j=9     i=9     i=8     i=7     i=6     i=5     i=4     i=3     i=2     i=1     i=0
Block 0: A[0][0] A[0][0] A[2][0] A[2][0] A[4][0] A[4][0] A[6][0] A[6][0] A[8][0] A[8][0] A[8][0] A[8][0] A[8][0] A[6][0] A[6][0] A[4][0] A[4][0] A[2][0] A[2][0] A[0][0]
Block 4: -       A[1][0] A[1][0] A[3][0] A[3][0] A[5][0] A[5][0] A[7][0] A[7][0] A[9][0] A[9][0] A[9][0] A[7][0] A[7][0] A[5][0] A[5][0] A[3][0] A[3][0] A[1][0] A[1][0]
Blocks 1-3 and 5-7: never used

Tags not shown but are needed.
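This behaviour can be replayed with a short simulation. The sketch below is only an illustration (not the lecture's code): an 8-block cache of one-byte blocks, with the low 3 bits of the address as the block index and the remaining 13 bits as the tag.

#include <stdio.h>

int main(void) {
    unsigned tag[8];                     /* tag stored per cache block           */
    int valid[8] = {0};                  /* all blocks start empty               */
    int hits = 0;

    /* 20 accesses: A[j][0] for j = 0..9, then A[i][0] for i = 9..0 */
    for (int pass = 0; pass < 20; pass++) {
        int row = (pass < 10) ? pass : 19 - pass;
        unsigned addr = 0x7A00 + 4 * row;        /* address of A[row][0]         */
        unsigned index = addr & 0x7;             /* low 3 bits -> block 0..7     */
        unsigned t = addr >> 3;                  /* remaining 13 bits -> tag     */
        if (valid[index] && tag[index] == t) {
            hits++;                              /* hit                          */
        } else {
            valid[index] = 1;                    /* miss: fill the block         */
            tag[index] = t;
        }
    }
    printf("%d hits out of 20 accesses\n", hits);    /* prints 2, as above */
    return 0;
}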


6/28
Associative Mapping

• With the LRU replacement policy, we get cache hits for i = 9, 8, ..., 2

• If the i loop were a forward one, we would get no hits!

Content of the data cache after each loop pass (time line):

Pass:    j=0     j=1     j=2     j=3     j=4     j=5     j=6     j=7     j=8     j=9     i=9     i=8     i=7     i=6     i=5     i=4     i=3     i=2     i=1     i=0
Block 0: A[0][0] A[0][0] A[0][0] A[0][0] A[0][0] A[0][0] A[0][0] A[0][0] A[8][0] A[8][0] A[8][0] A[8][0] A[8][0] A[8][0] A[8][0] A[8][0] A[8][0] A[8][0] A[8][0] A[0][0]
Block 1: -       A[1][0] A[1][0] A[1][0] A[1][0] A[1][0] A[1][0] A[1][0] A[1][0] A[9][0] A[9][0] A[9][0] A[9][0] A[9][0] A[9][0] A[9][0] A[9][0] A[9][0] A[1][0] A[1][0]
Block 2: -       -       A[2][0] A[2][0] A[2][0] A[2][0] A[2][0] A[2][0] A[2][0] A[2][0] A[2][0] A[2][0] A[2][0] A[2][0] A[2][0] A[2][0] A[2][0] A[2][0] A[2][0] A[2][0]
Block 3: -       -       -       A[3][0] A[3][0] A[3][0] A[3][0] A[3][0] A[3][0] A[3][0] A[3][0] A[3][0] A[3][0] A[3][0] A[3][0] A[3][0] A[3][0] A[3][0] A[3][0] A[3][0]
Block 4: -       -       -       -       A[4][0] A[4][0] A[4][0] A[4][0] A[4][0] A[4][0] A[4][0] A[4][0] A[4][0] A[4][0] A[4][0] A[4][0] A[4][0] A[4][0] A[4][0] A[4][0]
Block 5: -       -       -       -       -       A[5][0] A[5][0] A[5][0] A[5][0] A[5][0] A[5][0] A[5][0] A[5][0] A[5][0] A[5][0] A[5][0] A[5][0] A[5][0] A[5][0] A[5][0]
Block 6: -       -       -       -       -       -       A[6][0] A[6][0] A[6][0] A[6][0] A[6][0] A[6][0] A[6][0] A[6][0] A[6][0] A[6][0] A[6][0] A[6][0] A[6][0] A[6][0]
Block 7: -       -       -       -       -       -       -       A[7][0] A[7][0] A[7][0] A[7][0] A[7][0] A[7][0] A[7][0] A[7][0] A[7][0] A[7][0] A[7][0] A[7][0] A[7][0]

Tags not shown but are needed; LRU counters not shown but are needed.
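The associative schemes can be replayed the same way. The sketch below is again an illustration only (the SETS/WAYS parameters are my own): it models a SETS × WAYS cache of one-byte blocks with LRU replacement. One set of 8 ways reproduces the 8 hits above, 2 sets of 4 ways reproduces the 4 hits of the set-associative case on the next slide, and 8 sets of 1 way reproduces the direct-mapped case.

#include <stdio.h>

#define SETS 1        /* try 1x8 (fully associative), 2x4 (set-associative), 8x1 (direct) */
#define WAYS 8

int main(void) {
    unsigned tag[SETS][WAYS];
    int valid[SETS][WAYS] = {0};
    unsigned stamp[SETS][WAYS] = {0};            /* last-use time, used for LRU       */
    unsigned now = 0;
    int hits = 0;

    for (int pass = 0; pass < 20; pass++) {
        int row = (pass < 10) ? pass : 19 - pass;    /* j = 0..9, then i = 9..0       */
        unsigned addr = 0x7A00 + 4 * row;
        unsigned set = addr % SETS;                  /* low bits select the set       */
        unsigned t = addr / SETS;                    /* remaining bits form the tag   */
        now++;

        int way = -1, victim = 0;
        for (int w = 0; w < WAYS; w++) {
            if (valid[set][w] && tag[set][w] == t) { way = w; break; }
            if (!valid[set][w] || stamp[set][w] < stamp[set][victim]) victim = w;
        }
        if (way >= 0) {
            hits++;                          /* hit: only the LRU stamp changes       */
        } else {
            way = victim;                    /* miss: replace the least recently used */
            valid[set][way] = 1;
            tag[set][way] = t;
        }
        stamp[set][way] = now;
    }
    printf("%d hits out of 20 accesses\n", hits);    /* 8 for 1x8, 4 for 2x4, 2 for 8x1 */
    return 0;
}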
7/28
Set Associative Mapping
• Since all accessed blocks have even addresses (7A00, 7A04, 7A08, ...), they all map to set 0, so only half of the cache is used
• With the LRU replacement policy, we get hits for i = 9, 8, 7 and 6
• Random replacement would have better average performance here
• If the i loop were a forward one, we would get no hits!

Content of the data cache after each loop pass (time line):

Pass:           j=0     j=1     j=2     j=3     j=4     j=5     j=6     j=7     j=8     j=9     i=9     i=8     i=7     i=6     i=5     i=4     i=3     i=2     i=1     i=0
Set 0, block 0: A[0][0] A[0][0] A[0][0] A[0][0] A[4][0] A[4][0] A[4][0] A[4][0] A[8][0] A[8][0] A[8][0] A[8][0] A[8][0] A[8][0] A[8][0] A[4][0] A[4][0] A[4][0] A[4][0] A[0][0]
Set 0, block 1: -       A[1][0] A[1][0] A[1][0] A[1][0] A[5][0] A[5][0] A[5][0] A[5][0] A[9][0] A[9][0] A[9][0] A[9][0] A[9][0] A[5][0] A[5][0] A[5][0] A[5][0] A[1][0] A[1][0]
Set 0, block 2: -       -       A[2][0] A[2][0] A[2][0] A[2][0] A[6][0] A[6][0] A[6][0] A[6][0] A[6][0] A[6][0] A[6][0] A[6][0] A[6][0] A[6][0] A[6][0] A[2][0] A[2][0] A[2][0]
Set 0, block 3: -       -       -       A[3][0] A[3][0] A[3][0] A[3][0] A[7][0] A[7][0] A[7][0] A[7][0] A[7][0] A[7][0] A[7][0] A[7][0] A[7][0] A[3][0] A[3][0] A[3][0] A[3][0]
Set 1 (blocks 4-7): never used

Tags not shown but are needed; LRU counters not shown but are needed.
8/28
Comments on the Example

• In this example, associative mapping performs best, then set-associative, and direct mapping last.

• What are the advantages and disadvantages of each scheme?

• In practice,
• Hit rates as low as in this example are very rare.
• Usually set-associative mapping with an LRU replacement scheme is used.

• Larger blocks and more blocks (i.e., more cache memory) greatly improve the cache hit rate.

9/28
Example 2
Question:
How many total bits are required for a direct-mapped cache with 16 KiB of data and
4-word blocks, assuming a 32-bit address?

Answer:
• In a 32-bit address CPU, 16 KiB of data is 4096 words.
• With a block size of 4 words, there are 1024 blocks.
• Each block has 4 × 32 = 128 bits of data, plus a tag of (32 − 10 − 2 − 2) = 18 bits, plus a valid bit.
• Thus, the total cache size is 2^10 × (4 × 32 + 18 + 1) = 2^10 × 147 bits = 150,528 bits (about 18.4 KiB).
• The total number of bits in the cache is therefore about 1.15 times (147 / (32 × 4)) as many as needed just for the storage of the data.
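The arithmetic above can be wrapped in a small helper for experimenting with other configurations. This is only a sketch of the calculation (the function and parameter names are mine, not from the textbook); it assumes 32-bit words and one valid bit per block:

#include <stdio.h>

/* Number of bits needed to index n items (n a power of two). */
static int log2i(long n) {
    int b = 0;
    while ((1L << b) < n) b++;
    return b;
}

/* Total storage bits of a direct-mapped cache:
 * #blocks x (data bits + tag bits + 1 valid bit). */
static long cache_bits_direct(int addr_bits, long data_bytes, int words_per_block) {
    long blocks    = data_bytes / 4 / words_per_block;   /* 4 bytes per word      */
    int  byte_off  = 2;                                  /* byte within a word    */
    int  block_off = log2i(words_per_block);             /* word within the block */
    int  index     = log2i(blocks);                      /* which cache block     */
    int  tag       = addr_bits - index - block_off - byte_off;
    return blocks * (words_per_block * 32 + tag + 1);
}

int main(void) {
    printf("total bits = %ld, data-only bits = %d\n",
           cache_bits_direct(32, 16 * 1024, 4), 16 * 1024 * 8);
    /* prints: total bits = 150528 (= 1024 x 147), data-only bits = 131072 */
    return 0;
}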

11/28
Question:
How many total bits are required for a fully associative cache with 16 KiB of data and
4-word blocks, assuming a 32-bit address?

Answer:
• In a 32-bit address CPU, 16 KiB of data is 4096 words.
• With a block size of 4 words, there are 1024 blocks.
• Each block has 4 × 32 = 128 bits of data, plus a tag of (32 − 2 − 2) = 28 bits, plus a valid bit.
• Thus, the total cache size is 2^10 × (4 × 32 + 28 + 1) = 2^10 × 157 bits = 160,768 bits.
• The total number of bits in the cache is therefore about 1.23 times (157 / (32 × 4)) as many as needed just for the storage of the data.
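Only the tag width changes in the fully associative case, since there is no index field. A companion to the earlier sketch (reusing its log2i helper; again the names are my own):

/* Fully associative: tag = address bits - block-offset bits - byte-offset bits. */
static long cache_bits_assoc(int addr_bits, long data_bytes, int words_per_block) {
    long blocks = data_bytes / 4 / words_per_block;
    int  tag    = addr_bits - log2i(words_per_block) - 2;   /* 32 - 2 - 2 = 28     */
    return blocks * (words_per_block * 32 + tag + 1);       /* 1024 x 157 = 160768 */
}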

12/28
Example 3
Question
We have designed a 64-bit address direct-mapped cache, and the bits of the address used to
access the cache are as shown below:

Table: Bits of the address used in accessing the cache

Tag      Index    Offset
63-10    9-5      4-0

1 What is the block size of the cache in words?


2 Find the ratio of the total bits required for such a cache implementation to the data
storage bits.
3 Beginning from power on, the following byte-addressed cache references are
recorded as shown below.

Table: Recorded byte-addressed cache references

Hex   00   04   10   84   E8   A0   400   1E   8C   C1C   B4   884
Dec    0    4   16  132  232  160  1024   30  140  3100  180  2180

Find the hit ratio.


14/28
1 Each cache block consists of four 8-byte words. The total offset is 5 bits: three of
those 5 bits are the word offset (the offset into an 8-byte word), and the remaining two
bits are the block offset. Two bits allow us to enumerate 2^2 = 4 words.
2 The ratio is 1.21. The cache stores a total of 32 lines × 4 words/block × 8 bytes/word
= 1024 bytes = 8192 bits. In addition to the data, each line contains 54 tag bits and
1 valid bit, so the total number of bits required is 8192 + 54 × 32 + 1 × 32 = 9952 bits,
and 9952 / 8192 ≈ 1.21.
3 The hit ratio is 4/12 ≈ 33%.
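The hit count in part 3 can be checked mechanically. A minimal illustrative sketch (assuming the cache starts empty, as stated; 32 lines, index = bits 9-5, tag = bits 63-10):

#include <stdio.h>

int main(void) {
    /* byte addresses from the reference table */
    unsigned long addr[] = {0x00, 0x04, 0x10, 0x84, 0xE8, 0xA0,
                            0x400, 0x1E, 0x8C, 0xC1C, 0xB4, 0x884};
    int n = sizeof addr / sizeof addr[0];

    unsigned long tag[32];              /* one tag per cache line        */
    int valid[32] = {0};
    int hits = 0;

    for (int i = 0; i < n; i++) {
        unsigned long index = (addr[i] >> 5) & 0x1F;    /* bits 9-5      */
        unsigned long t     = addr[i] >> 10;            /* bits 63-10    */
        if (valid[index] && tag[index] == t) {
            hits++;
        } else {
            valid[index] = 1;           /* miss: fill the line           */
            tag[index] = t;
        }
    }
    printf("hits = %d out of %d\n", hits, n);    /* prints 4 out of 12 */
    return 0;
}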

15/28
Performance Issues
Q1: Where Can a Block Be Placed in the Upper Level?

Scheme name          # of sets                        Blocks per set
Direct mapped        # of blocks                      1
Set associative      (# of blocks) / Associativity    Associativity
Fully associative    1                                # of blocks

Q2: How Is an Entry Found?

Scheme name          Location method                           # of comparisons
Direct mapped        Index                                     1
Set associative      Index the set; compare the set's tags     Degree of associativity
Fully associative    Compare all tags                          # of blocks
17/28
Q3: Which Entry Should Be Replaced on a Miss?
• Direct mapped: only one choice
• Set associative or fully associative:
• Random
• LRU (Least Recently Used)

Note that:
• For a 2-way set-associative cache, random replacement has a miss rate about 1.1× that of LRU
• For higher associativity (4-way and above), LRU is too costly

18/28
Q4: What Happens on a Write?
• Write-Through:
  • The information is written to both the block in the cache and the block in the lower level of memory
  • Combined with a write buffer, write waits can be eliminated
  • Pro: read misses don't result in writes
  • Pro: easier to implement

• Write-Back:
  • The information is written only to the block in the cache
  • The modification is written to the lower level only when the block is replaced
  • Needs a dirty bit to track whether the block is clean or not
  • Virtual memory always uses write-back
  • Pro: writes occur at the speed of the cache
  • Pro: repeated writes require only one write to the lower level
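As a rough illustration of the two policies (a hedged sketch only; the cache structure and names below are mine, not the textbook's), a single write might be handled like this. Note that the dirty bit is the only extra state write-back needs:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define BLOCKS 8
#define MEM_WORDS 64

static uint32_t memory[MEM_WORDS];            /* toy lower-level memory              */

struct line  { uint32_t tag; uint32_t data; bool valid; bool dirty; };
struct cache { struct line line[BLOCKS]; bool write_back; };

/* Write one word; for simplicity, allocate on a write miss under both policies. */
void cache_write(struct cache *c, uint32_t addr, uint32_t value) {
    struct line *l = &c->line[addr % BLOCKS];
    uint32_t tag = addr / BLOCKS;

    if (!(l->valid && l->tag == tag)) {                        /* write miss          */
        if (c->write_back && l->valid && l->dirty)             /* evict a dirty block: */
            memory[l->tag * BLOCKS + addr % BLOCKS] = l->data; /* one deferred write  */
        l->data = memory[addr];                                /* fetch the block     */
        l->tag = tag;  l->valid = true;  l->dirty = false;
    }
    l->data = value;                                           /* update cached copy  */
    if (c->write_back)
        l->dirty = true;          /* write-back: lower level updated only on eviction */
    else
        memory[addr] = value;     /* write-through: lower level updated on every write */
}

int main(void) {
    struct cache c = { .write_back = true };
    cache_write(&c, 3, 42);           /* repeated writes to the same block ...        */
    cache_write(&c, 3, 43);           /* ... cost only one eventual write to memory   */
    cache_write(&c, 3 + BLOCKS, 7);   /* a conflicting address forces the write-back  */
    printf("memory[3] = %u\n", (unsigned)memory[3]);    /* prints 43                  */
    return 0;
}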

19/28
Performance Consideration

Performance
How fast machine instructions can be brought into the processor and how fast they can be
executed.

• Two key factors are performance and cost, i.e., price/performance ratio.
• For a hierarchical memory system with cache, the processor is able to access
instructions and data more quickly when the data wanted are in the cache.
• Therefore, the impact of a cache on performance is dependent on the hit and miss
rates.

20/28
Cache Hit Rate and Miss Penalty

• High hit rates over 0.9 are essential for high-performance computers.

• A penalty is incurred because extra time is needed to bring a block of data from a
slower unit to a faster one in the hierarchy.

• During that time, the processor is stalled.

• The waiting time depends on the details of the cache operation.

Miss Penalty
Total access time seen by the processor when a miss occurs.

21/28
Miss Penalty

Example: Consider a computer with the following parameters:


Access times to the cache and the main memory are t and 10t respectively. When a cache
miss occurs, a block of 8 words will be transferred from the MM to the cache. It takes 10t
to transfer the first word of the block and the remaining 7 words are transferred at a rate
of one word per t seconds.

• Miss penalty = t + 10t + 7 × t + t = 19t


• First t: Initial cache access that results in a miss.
• Last t: Move data from the cache to the processor.

22/28
Average Memory Access Time

h × C + (1 − h) × M

• h: hit rate
• M: miss penalty
• C: cache access time

• High cache hit rates (> 90%) are essential


• Miss penalty must also be reduced
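Transcribed directly into code (the numbers in the example are made up for illustration, not taken from the slides):

#include <stdio.h>

/* Average memory access time: h = hit rate, c = cache access time, m = miss penalty. */
static double amat(double h, double c, double m) {
    return h * c + (1.0 - h) * m;
}

int main(void) {
    /* illustrative values only: h = 0.95, C = 1 cycle, M = 20 cycles */
    printf("average access time = %.2f cycles\n", amat(0.95, 1.0, 20.0));   /* 1.95 */
    return 0;
}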

23/28
Question: Memory Access Time Example
• Assume 8 cycles to read a single memory word;
• 15 cycles to load an 8-word block from main memory (previous example);
• cache access time = 1 cycle
• For every 100 instructions, statistically 30 instructions are data reads/writes
• Instruction fetch: 100 memory accesses: assume hit rate = 0.95
• Data read/write: 30 memory accesses: assume hit rate = 0.90

Calculate: (1) Execution cycles without cache; (2) Execution cycles with cache.

24/28
Caches on Processor Chips

• In high-performance processors, two levels of caches are normally used, L1 and L2.
• L1 caches must be very fast, as they determine the memory access time seen by the processor.
• L2 cache can be slower, but it should be much larger than the L1 cache to ensure a
high hit rate. Its speed is less critical because it only affects the miss penalty of the L1
cache.
• Average access time on such a system:

h1 · C1 + (1 − h1) · [h2 · C2 + (1 − h2) · M]

• h1 (h2): the L1 (L2) hit rate
• C1: the access time of the L1 cache
• C2: the miss penalty to transfer data from the L2 cache to L1
• M: the miss penalty to transfer data from MM to L2 and then to L1
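The two-level formula transcribes the same way; this sketch is meant as a drop-in companion to the single-level amat() helper shown earlier (the names are mine):

/* Two-level average access time: h1, h2 = L1/L2 hit rates, c1 = L1 access time,
 * c2 = penalty of an L1 miss served by L2, m = penalty of going all the way to MM. */
static double amat_two_level(double h1, double c1, double h2, double c2, double m) {
    return h1 * c1 + (1.0 - h1) * (h2 * c2 + (1.0 - h2) * m);
}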

25/28
Larger Block Size

• Take advantage of spatial locality.

• Pro: If all items in a larger block are needed in a computation, it is better to load these items into the cache in a single miss.

• Con: Larger blocks are effective only up to a certain size, beyond which too many items will remain unused before the block is replaced.

• Con: Larger blocks take a longer time to transfer and thus increase the miss penalty.

• Block sizes of 16 to 128 bytes are most popular.

26/28
Miss Rate vs. Block Size vs. Cache Size

The miss rate goes up if the block size becomes a significant fraction of the cache size,
because the number of blocks that can be held in a cache of the same size is smaller
(increasing capacity misses).

27/28
Enhancement

Write buffer:
• Read requests are served first.
• Write requests are stored in the write buffer first and sent to memory whenever there is no
pending read request.
• The address of a read request should be compared with the addresses in the write buffer
(see the sketch at the end of this slide).
Prefetch:
• Prefetch data into the cache before they are needed, while the processor is busy
executing instructions that do not result in a read miss.
• Prefetch instructions can be inserted by the programmer or the compiler.
Load-through approach:
• Instead of waiting for the whole block to be transferred, the processor resumes execution
as soon as the required word is loaded into the cache.
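A minimal sketch of the write-buffer address check mentioned above (the structure and names are purely illustrative, not from the textbook): on a read, the buffer is searched newest-to-oldest so that the latest pending write to that address wins.

#include <stdint.h>
#include <stdio.h>

#define WB_ENTRIES 4
#define MEM_WORDS 16

static uint32_t memory[MEM_WORDS];                     /* toy main memory            */

struct wb_entry { uint32_t addr; uint32_t data; };
struct write_buffer { struct wb_entry e[WB_ENTRIES]; int count; };

/* On a processor read, check the write buffer before going to memory. */
uint32_t read_with_buffer_check(const struct write_buffer *wb, uint32_t addr) {
    for (int i = wb->count - 1; i >= 0; i--)           /* newest entry first         */
        if (wb->e[i].addr == addr)
            return wb->e[i].data;                      /* forward the buffered value */
    return memory[addr];                               /* otherwise read lower level */
}

int main(void) {
    struct write_buffer wb = { .e = { {5, 99} }, .count = 1 };   /* one pending write */
    printf("%u %u\n", (unsigned)read_with_buffer_check(&wb, 5),  /* 99 from buffer    */
                      (unsigned)read_with_buffer_check(&wb, 6)); /* 0 from memory     */
    return 0;
}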

28/28
