
Cache Memory

Part III

Chapter 5: Exploiting Memory Hierarchy


Next . . .

❑ Random Access Memory and its Structure

❑ Memory Hierarchy and the need for Cache Memory

❑ The Basics of Caches

❑ Cache Performance and Memory Stall Cycles

❑ Improving Cache Performance

❑ Multilevel Caches



Hit Rate and Miss Rate
❑ Hit Rate = Hits / (Hits + Misses)
❑ Miss Rate = Misses / (Hits + Misses)
❑ I-Cache Miss Rate = Miss rate in the Instruction Cache
❑ D-Cache Miss Rate = Miss rate in the Data Cache
❑ Example:
❑ Out of 1000 instructions fetched, 150 missed in the I-Cache
❑ 25% are load-store instructions, 50 missed in the D-Cache
❑ What are the I-cache and D-cache miss rates?
❑ I-Cache Miss Rate = 150 / 1000 = 15%
❑ D-Cache Miss Rate = 50 / (25% × 1000) = 50 / 250 = 20%
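The two calculations above map directly onto a few lines of Python; a minimal sketch, with illustrative names:

    # Miss rate = misses / total accesses to that cache
    def miss_rate(misses, accesses):
        return misses / accesses

    i_cache = miss_rate(150, 1000)        # 0.15 = 15%; every fetch accesses the I-cache
    d_cache = miss_rate(50, 0.25 * 1000)  # 0.20 = 20%; only loads/stores access the D-cache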



Memory Stall Cycles
❑ The processor stalls on a Cache miss
❑ When fetching instructions from the Instruction Cache (I-cache)
❑ When loading or storing data into the Data Cache (D-cache)
Memory stall cycles = Combined Misses × Miss Penalty
❑ Miss Penalty: clock cycles to process a cache miss
Combined Misses = I-Cache Misses + D-Cache Misses
I-Cache Misses = I-Count × I-Cache Miss Rate
D-Cache Misses = LS-Count × D-Cache Miss Rate
LS-Count (Load & Store) = I-Count × LS Frequency
❑ Cache misses are often reported per thousand instructions



Memory Stall Cycles Per Instruction
❑ Memory Stall Cycles Per Instruction =
I-Cache Miss Rate × Miss Penalty +
LS Frequency × D-Cache Miss Rate × Miss Penalty
❑ Combined Misses Per Instruction =
I-Cache Miss Rate + LS Frequency × D-Cache Miss Rate
❑ Therefore, Memory Stall Cycles Per Instruction =
Combined Misses Per Instruction × Miss Penalty
❑ Miss Penalty is assumed equal for I-cache & D-cache
❑ Miss Penalty is assumed equal for Load and Store
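A minimal Python sketch of this formula (the identifiers are illustrative, not from the text):

    # Combined misses per instruction times a single, shared miss penalty
    def stall_cycles_per_instruction(i_miss_rate, ls_frequency, d_miss_rate, miss_penalty):
        combined_misses = i_miss_rate + ls_frequency * d_miss_rate
        return combined_misses * miss_penalty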



Example on Memory Stall Cycles
❑ Consider a program with the given characteristics
❑ Instruction count (I-Count) = 10⁶ instructions
❑ 30% of instructions are loads and stores
❑ D-cache miss rate is 5% and I-cache miss rate is 1%
❑ Miss penalty is 100 clock cycles for instruction and data caches
❑ Compute combined misses per instruction and memory stall cycles
❑ Combined misses per instruction in I-Cache and D-Cache
❑ 1% + 30% × 5% = 0.025 combined misses per instruction
❑ Equal to 25 misses per 1000 instructions
❑ Memory stall cycles
❑ 0.025 × 100 (miss penalty) = 2.5 stall cycles per instruction
❑ Total memory stall cycles = 10⁶ × 2.5 = 2,500,000
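The same arithmetic in Python, mirroring the stall_cycles_per_instruction sketch above:

    combined = 0.01 + 0.30 * 0.05   # 0.025 combined misses per instruction
    per_instr = combined * 100      # 2.5 stall cycles per instruction
    total = 10**6 * per_instr       # 2,500,000 total memory stall cycles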



CPU Time with Memory Stall Cycles

CPU Time = I-Count × CPI_MemoryStalls × Clock Cycle

CPI_MemoryStalls = CPI_PerfectCache + Memory Stall Cycles per Instruction

❑ CPI_PerfectCache = CPI for an ideal cache (no cache misses)

❑ CPI_MemoryStalls = CPI in the presence of memory stalls

❑ Memory stall cycles increase the CPI



Example on CPI with Memory Stalls
❑ A processor has CPI of 1.5 without any memory stalls
❑ Average cache miss rate is 2% for instruction and data
❑ 50% of instructions are loads and stores
❑ Cache miss penalty is 100 clock cycles for I-cache and D-cache
❑ What is the impact on the CPI?
❑ Answer:
Mem Stalls per Instruction = 0.02 × 100 (instruction) + 0.5 × 0.02 × 100 (data) = 3
CPI_MemoryStalls = 1.5 + 3 = 4.5 cycles per instruction
CPI_MemoryStalls / CPI_PerfectCache = 4.5 / 1.5 = 3

Processor is 3 times slower due to memory stall cycles


CPI_NoCache = 1.5 + (1 + 0.5) × 100 = 151.5 (a lot worse)
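As a short, self-contained Python sketch of the example (the function name is illustrative):

    def cpi_with_stalls(cpi_perfect, i_miss_rate, ls_freq, d_miss_rate, penalty):
        # Instruction-fetch stalls plus load/store stalls, added to the base CPI
        return cpi_perfect + (i_miss_rate + ls_freq * d_miss_rate) * penalty

    cpi = cpi_with_stalls(1.5, 0.02, 0.5, 0.02, 100)  # 4.5
    slowdown = cpi / 1.5                              # 3.0: three times slower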



Designing Memory to Support Caches

❑ One-Word-Wide Memory Organization
❑ CPU, Cache, Bus, and Memory have word width: 32 or 64 bits
❑ Wide Memory Organization
❑ CPU and Multiplexer: 1 word
❑ Cache, Bus, and Memory: N words (Alpha: 256 bits; Ultra SPARC: 512 bits)
❑ Interleaved Memory Organization
❑ CPU, Cache, and Bus: 1 word
❑ Memory: N independent banks (bank 0, bank 1, bank 2, bank 3)


Memory Interleaving
❑ Memory interleaving is more flexible than wide access
❑ A block address is sent only once to all memory banks
❑ Words of a block are distributed (interleaved) across all banks
❑ Banks are accessed in parallel
❑ Words are transferred one at a time on each bus cycle

[Timing diagram: all banks receive the same block address during one bus cycle and access it in parallel; word 0 (bank 0) through word 3 (bank 3) are then transferred one per bus cycle. Interleaved Memory Organization.]



Estimating the Miss Penalty
❑ Timing Model: Assume the following …
❑ 1 memory bus cycle to send address
❑ 15 memory bus cycles for DRAM access time
❑ 1 memory bus cycle to send data
❑ Cache Block is 4 words
❑ One-Word-Wide Memory Organization
Miss Penalty = 1 + 4 × 15 + 4 × 1 = 65 memory bus cycles
❑ Wide Memory Organization (2-word wide)
Miss Penalty = 1 + 2 × 15 + 2 × 1 = 33 memory bus cycles
❑ Interleaved Memory Organization (4 banks)
Miss Penalty = 1 + 1 × 15 + 4 × 1 = 20 memory bus cycles
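All three penalties follow from one timing model; a minimal sketch with the slide's assumed latencies as defaults:

    # 1 cycle to send the address + serialized DRAM accesses + word transfers on the bus
    def miss_penalty(dram_accesses, word_transfers, addr=1, dram=15, xfer=1):
        return addr + dram_accesses * dram + word_transfers * xfer

    one_word_wide = miss_penalty(4, 4)  # 65 bus cycles
    two_word_wide = miss_penalty(2, 2)  # 33 bus cycles
    interleaved   = miss_penalty(1, 4)  # 20 bus cycles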



Next . . .

❑ Random Access Memory and its Structure

❑ Memory Hierarchy and the need for Cache Memory

❑ The Basics of Caches

❑ Cache Performance and Memory Stall Cycles

❑ Improving Cache Performance

❑ Multilevel Caches



Improving Cache Performance
❑ Average Memory Access Time (AMAT)
AMAT = Hit Time + Miss Rate × Miss Penalty

❑ Used as a framework for optimizations


❑ Reduce the Hit time
❑ Small and simple caches
❑ Reduce the Miss Rate
❑ Larger cache size, higher associativity, and larger block size
❑ Reduce the Miss Penalty
❑ Multilevel caches
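The AMAT framework in code, as a one-line helper that the optimizations below can be checked against:

    # Average Memory Access Time, in clock cycles
    def amat(hit_time, miss_rate, miss_penalty):
        return hit_time + miss_rate * miss_penalty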



Small and Simple Caches
❑ Hit time is critical: it affects the processor clock rate
❑ A fast clock cycle demands small and simple L1 cache designs
❑ A small cache reduces the indexing time and the hit time
❑ Indexing the cache is a time-consuming portion of the hit time
❑ Tag comparison also adds to the hit time
❑ A direct-mapped cache overlaps the tag check with the data transfer
❑ An associative cache needs an additional mux, which increases the hit time
❑ The size of L1 caches has not increased much
❑ L1 caches are the same size on the Alpha 21264 and 21364
❑ Same also on the UltraSparc II and III, and the AMD K6 and Athlon
❑ Reduced from 16 KB in the Pentium III to 8 KB in the Pentium 4



Larger Size and Higher Associativity
❑ Cache misses:
❑ Compulsory misses are caused by the first reference to a datum
❑ Capacity misses occur regardless of associativity or block size, solely due to the finite size of the cache
❑ Conflict misses could have been avoided, had the cache not evicted an entry earlier
❑ Increasing the cache size reduces capacity misses and conflict misses
❑ A larger cache spreads out references over more blocks
❑ Drawbacks: longer hit time and higher cost
❑ Larger caches are especially popular as 2nd-level caches
❑ Higher associativity also improves miss rates
❑ Eight-way set associative is as effective as fully associative



Miss Rate versus Cache Size

[Figure: miss rate versus cache size on the integer portion of SPEC CPU2000.]



Larger Block Size

❑ Simplest way to reduce miss rate is to increase block size


❑ However, it increases conflict misses if cache is small
[Figure: miss rate versus block size (16 to 256 bytes) for cache sizes from 1K to 256K. Larger blocks reduce compulsory misses, but conflict misses increase when the cache is small.]

❑ 64-byte blocks are common in L1 caches
❑ 128-byte blocks are common in L2 caches



Next . . .
❑ Random Access Memory and its Structure

❑ Memory Hierarchy and the need for Cache Memory

❑ The Basics of Caches

❑ Cache Performance and Memory Stall Cycles

❑ Improving Cache Performance

❑ Multilevel Caches



Multilevel Caches

❑ The top-level cache should be kept small to keep pace with processor speed

[Diagram: the I-Cache and D-Cache feed a Unified L2 Cache, which connects to Main Memory.]

❑ Adding another cache level
❑ Can reduce the memory gap
❑ Can reduce memory bus loading
❑ Local miss rate
❑ Number of misses in a cache / Memory accesses to this cache
❑ Miss Rate_L1 for the L1 cache, and Miss Rate_L2 for the L2 cache
❑ Global miss rate
❑ Number of misses in a cache / Memory accesses generated by the CPU
❑ Miss Rate_L1 for the L1 cache, and Miss Rate_L1 × Miss Rate_L2 for the L2 cache
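To make local versus global rates concrete, a small numeric sketch (the 4% and 25% figures are the same ones reused in the example at the end of this chapter):

    miss_rate_L1 = 0.04        # local = global for the L1 cache
    miss_rate_L2_local = 0.25  # fraction of the accesses reaching L2 that miss
    miss_rate_L2_global = miss_rate_L1 * miss_rate_L2_local  # 0.01 = 1% of CPU accesses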



Multilevel Cache Policies
❑ Multilevel Inclusion

❑ L1 cache data is always present in L2 cache

❑ A miss in L1 that hits in L2 copies the block from L2 into L1

❑ A miss in L1 and L2 brings a block into L1 and L2

❑ A write in L1 causes data to be written in L1 and L2

❑ Typically, write-through policy is used from L1 to L2

❑ Typically, write-back policy is used from L2 to main memory

❑ To reduce traffic on the memory bus

❑ A replacement or invalidation in L2 must be propagated to L1



Multilevel Cache Policies – cont’d
❑ Multilevel exclusion
❑ L1 data is never found in L2 cache – Prevents wasting space
❑ A cache miss in L1 that hits in L2 results in a swap of blocks
❑ Cache miss in both L1 and L2 brings the block into L1 only
❑ Block replaced in L1 is moved into L2
❑ Example: AMD Athlon
❑ Same or different block size in L1 and L2 caches
❑ Choosing a larger block size in L2 can improve performance
❑ However, different block sizes complicate the implementation
❑ Pentium 4 has 64-byte blocks in L1 and 128-byte blocks in L2



Two-Level Cache Performance – 1/2
❑ Average Memory Access Time:
AMAT = Hit Time_L1 + Miss Rate_L1 × Miss Penalty_L1
❑ Miss Penalty for the L1 cache in the presence of an L2 cache:
Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2
❑ Average Memory Access Time with a 2nd-level cache:
AMAT = Hit Time_L1 + Miss Rate_L1 × (Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2)
❑ Memory Stall Cycles per Instruction =
Memory Accesses per Instruction × (AMAT – Hit Time_L1)



Two-Level Cache Performance – 2/2

❑ Average memory stall cycles per instruction =
Memory Accesses per Instruction × Miss Rate_L1 × (Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2)
❑ Average memory stall cycles per instruction =
Misses per instruction_L1 × Hit Time_L2 + Misses per instruction_L2 × Miss Penalty_L2
❑ Misses per instruction_L1 = MEM accesses per instruction × Miss Rate_L1
❑ Misses per instruction_L2 = MEM accesses per instruction × Miss Rate_L1 × Miss Rate_L2



Example on Two-Level Caches
❑ Problem:
❑ Miss Rate_L1 = 4%, Miss Rate_L2 = 25%
❑ Hit time of L1 cache is 1 cycle and of L2 cache is 10 cycles
❑ Miss penalty from L2 cache to memory is 100 cycles
❑ Memory access per instruction = 1.25 (25% data accesses)
❑ Compute AMAT and memory stall cycles per instruction
❑ Solution:
AMAT = 1 + 4% × (10 + 25% × 100) = 2.4 cycles
Misses per instruction in L1 = 4% × 1.25 = 5%
Misses per instruction in L2 = 4% × 25% × 1.25 = 1.25%
Memory stall cycles per instruction = 5% × 10 + 1.25% × 100 = 1.75
Can also be obtained as: (2.4 – 1) × 1.25 = 1.75 cycles
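The same computation as a standalone Python sketch of the two-level formulas above:

    def two_level_amat(hit_L1, miss_L1, hit_L2, miss_L2, penalty_L2):
        # L1 hit time plus the fraction of accesses that pay the L2 lookup/miss cost
        return hit_L1 + miss_L1 * (hit_L2 + miss_L2 * penalty_L2)

    amat = two_level_amat(1, 0.04, 10, 0.25, 100)  # 2.4 cycles
    stalls = 1.25 * (amat - 1)                     # 1.75 stall cycles per instruction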

