Memory Hierarchy Design
Programs would like an unlimited amount of fast memory; the economical solution is a memory hierarchy, which exploits locality and the cost-performance of different memory technologies.
Principle of locality: most programs do not access all code or data uniformly. Locality occurs in time (temporal locality) and in space (spatial locality).
Guidelines: smaller hardware can be made faster, so the hierarchy combines memories of different speeds and sizes.
The goal is to provide a memory system with cost per byte almost as low as the cheapest level of memory and speed almost as fast as the fastest level. Each level maps addresses from a slower, larger memory to a smaller but faster memory higher in the hierarchy.
As part of this address mapping, addresses are also checked; hence the protection scheme for scrutinizing addresses is also part of the memory hierarchy.
Memory Hierarchy
[Figure: the memory hierarchy, from registers and cache down to main memory; upper levels are faster but smaller, lower levels are larger but slower in capacity and speed.]
[Figure: processor vs. memory performance, 1980-2010; processor performance has grown far faster than memory performance, widening the gap.]
The importance of memory hierarchy has increased with advances in performance of processors.
When a word is not found in the cache, it is fetched from memory and placed in the cache along with an address tag. Multiple words (a block) are moved together for efficiency reasons.
Key design decision: set-associative placement
- A set is a group of blocks in the cache. A block is first mapped onto a set, and then the set is searched to find the block.
- The set is chosen by the address of the data: (Block address) MOD (Number of sets in cache). With n blocks in a set, the placement is called n-way set associative (see the lookup sketch after this list).
Cache data can be read or written; writes use one of two strategies:
- Write through: update the cache and write through to update main memory.
- Write back: update only the copy in the cache; memory is updated when the block is replaced.
- Both strategies can use a write buffer, which allows the cache to proceed as soon as the data are placed in the buffer rather than waiting the full latency to write the data into memory.
The metric used to measure the benefit is the miss rate: the number of accesses that miss divided by the total number of accesses.
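To make the placement rule concrete, here is a minimal C sketch of an n-way set-associative lookup; the geometry (64 sets, 4 ways) and all names (cache_lookup, cache_block) are illustrative assumptions, not anything from the text.

#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS 64   /* assumed geometry: 64 sets...                */
#define WAYS      4   /* ...of 4 blocks each (4-way set associative) */

struct cache_block {
    bool     valid;
    uint64_t tag;
};

static struct cache_block cache[NUM_SETS][WAYS];

/* The set is chosen by (block address) MOD (number of sets);
   every block in that set is then searched for a matching tag. */
bool cache_lookup(uint64_t block_address)
{
    uint64_t set = block_address % NUM_SETS;
    uint64_t tag = block_address / NUM_SETS;

    for (int way = 0; way < WAYS; way++) {
        if (cache[set][way].valid && cache[set][way].tag == tag)
            return true;    /* hit */
    }
    return false;           /* miss: the block must be fetched from memory */
}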
Causes of high miss rates: the three Cs model sorts all misses into three categories.
- Compulsory: the very first access to a block cannot be in the cache; these misses would occur even with an infinite cache.
- Capacity: the cache cannot contain all the blocks needed by the program, so blocks are discarded and later retrieved.
- Conflict: when the block placement strategy is not fully associative, a block can be discarded and later retrieved because too many blocks map to its set.
Miss rate can be a misleading measure for several reasons, so misses per instruction are often used instead of misses per memory reference:
Misses / Instruction = Miss rate x (Memory accesses / Instruction count)
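A quick worked example of this formula in C, using made-up numbers (2% miss rate, 1.5 memory accesses per instruction):

#include <stdio.h>

int main(void)
{
    double miss_rate = 0.02;              /* assumed: 2% of accesses miss       */
    double accesses_per_instr = 1.5;      /* assumed: loads/stores plus fetches */

    /* Misses per instruction = miss rate x (memory accesses / instruction count) */
    double misses_per_instr = miss_rate * accesses_per_instr;

    printf("misses per instruction = %.3f\n", misses_per_instr);   /* prints 0.030 */
    return 0;
}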
Cache Optimizations
Six basic cache optimizations
1. Larger block size to reduce miss rate: increasing the block size exploits spatial locality and reduces compulsory misses, but it also increases the miss penalty.
2. Larger caches to reduce miss rate: the drawback is a longer hit time for the larger cache memory, plus higher cost and power.
3. Higher associativity to reduce miss rate: increasing associativity reduces conflict misses.
4. Multilevel caches to reduce miss penalty: an additional level of cache is introduced between the original cache and memory (L1 is the original cache, L2 the added cache). The L1 cache is small enough that its speed matches the clock cycle time; the L2 cache is large enough to capture many accesses that would otherwise go to main memory.
Average memory access time can then be redefined as
Hit time L1 + Miss rate L1 x (Hit time L2 + Miss rate L2 x Miss penalty L2)
5. Giving priority to read misses over writes to reduce miss penalty: the write buffer is a good place to implement this optimization, but write buffers create hazards (read-after-write hazards).
6. Avoiding address translation during indexing of the cache to reduce hit time: caches must cope with the translation of a virtual address from the processor into a physical address to access memory. A common optimization is to use the page offset, the part that is identical in both the virtual and physical addresses, to index the cache.
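A small C sketch of the two-level formula; the cycle counts and miss rates below are invented purely to illustrate the arithmetic.

#include <stdio.h>

/* AMAT = Hit time L1 + Miss rate L1 x (Hit time L2 + Miss rate L2 x Miss penalty L2) */
static double amat_two_level(double hit_l1, double miss_rate_l1,
                             double hit_l2, double miss_rate_l2,
                             double miss_penalty_l2)
{
    return hit_l1 + miss_rate_l1 * (hit_l2 + miss_rate_l2 * miss_penalty_l2);
}

int main(void)
{
    /* Assumed: 1-cycle L1 hit, 4% L1 miss rate, 10-cycle L2 hit,
       20% local L2 miss rate, 100-cycle miss penalty to memory. */
    double amat = amat_two_level(1.0, 0.04, 10.0, 0.20, 100.0);
    printf("AMAT = %.2f cycles\n", amat);   /* 1 + 0.04*(10 + 0.20*100) = 2.20 */
    return 0;
}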
Reducing miss penalty or miss rate via parallelism: hardware prefetching and compiler prefetching.
First optimization: small and simple caches to reduce hit time
- Indexing the tag memory and then comparing takes time.
- A small cache helps hit time, since a smaller memory takes less time to index. E.g., the L1 caches are the same size for three generations of AMD microprocessors: K6, Athlon, and Opteron.
- An L2 cache small enough to fit on chip with the processor avoids the time penalty of going off chip.
- Simple direct mapping allows the tag check to be overlapped with the data transmission, since there is no choice of block to make.
Access time estimates for 90 nm technology using the CACTI model 4.0: the median ratios of access time relative to direct-mapped caches are 1.32, 1.39, and 1.43 for 2-way, 4-way, and 8-way caches.
[Figure: access time (ns) vs. cache size from 16 KB to 1 MB for 1-way, 2-way, 4-way, and 8-way caches.]
Second optimization: way prediction to reduce hit time
- The cache predicts which way (block within the set) the next access will hit, so the multiplexer is set early to select the desired block and only one tag comparison is performed that clock cycle, in parallel with reading the cache data.
- On a misprediction, the other blocks are checked for matches in the next clock cycle.
- Prediction accuracy is roughly 85%. Drawback: the CPU pipeline is hard to design if a hit can take 1 or 2 cycles. Way prediction is used more for instruction caches than for data caches.
Third optimization: trace caches to reduce hit time (Pentium 4)
1. A trace cache holds dynamic traces of the executed instructions rather than static sequences of instructions as determined by layout in memory, with a built-in branch predictor.
2. It caches the micro-ops rather than x86 instructions; decoding/translating from x86 to micro-ops happens only on a trace cache miss.
Advantage: better utilization of long blocks (execution does not exit in the middle of a block or enter at a label in the middle of a block).
Disadvantages: complicated address mapping, since addresses are no longer aligned to power-of-2 multiples of the word size; and instructions may appear multiple times in multiple dynamic traces due to different branch outcomes.
Fourth optimization: pipelined cache access to increase bandwidth
- Pipelining cache access maintains bandwidth but increases latency.
- Instruction cache access pipeline stages: 1 for the Pentium, 2 for the Pentium Pro through Pentium III, 4 for the Pentium 4.
- Costs: a greater penalty on mispredicted branches and more clock cycles between the issue of a load and the use of the data.
Fifth optimization: nonblocking caches to increase bandwidth
- A nonblocking (lockup-free) cache allows the data cache to continue supplying hits during a miss: hit under miss reduces the effective miss penalty by working during the miss instead of ignoring CPU requests.
- Hit under multiple miss, or miss under miss, may further lower the effective miss penalty by overlapping multiple misses. This significantly increases the complexity of the cache controller, since there can be multiple outstanding memory accesses, and it requires multiple memory banks (otherwise multiple misses cannot be supported). The Pentium Pro allows 4 outstanding memory misses.
[Figure: average memory stall time for integer and floating-point SPEC92 programs as more outstanding misses are allowed.]
- FP programs on average: AMAT = 0.68 -> 0.52 -> 0.34 -> 0.26
- Int programs on average: AMAT = 0.24 -> 0.20 -> 0.19 -> 0.19
- (8 KB data cache, direct mapped, 32-byte blocks, 16-cycle miss penalty, SPEC92.)
Sixth optimization: multibanked caches to increase bandwidth
- Banking works best when the accesses naturally spread themselves across the banks, so the mapping of addresses to banks affects the behavior of the memory system.
- A simple mapping that works well is sequential interleaving: spread block addresses sequentially across the banks. E.g., with 4 banks, bank 0 has all blocks whose address modulo 4 is 0, bank 1 has all blocks whose address modulo 4 is 1, and so on (see the sketch below).
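A minimal sketch of that mapping in C, assuming 4 banks as in the example (the function name bank_of is ours):

#include <stdio.h>

#define NUM_BANKS 4   /* example from the text: 4 banks */

/* Sequential interleaving: bank i holds all blocks whose
   block address modulo NUM_BANKS equals i. */
static unsigned bank_of(unsigned long block_address)
{
    return (unsigned)(block_address % NUM_BANKS);
}

int main(void)
{
    for (unsigned long addr = 0; addr < 8; addr++)
        printf("block %lu -> bank %u\n", addr, bank_of(addr));
    /* blocks 0,4 -> bank 0; 1,5 -> bank 1; 2,6 -> bank 2; 3,7 -> bank 3 */
    return 0;
}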
Seventh optimization: early restart and critical word first to reduce miss penalty
- Don't wait for the full block to arrive before restarting the CPU.
- Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution.
- Because of spatial locality the CPU tends to want the next sequential word, so it is not clear how large the benefit of early restart alone is.
- Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; the CPU continues execution while the rest of the words in the block are filled in.
- Long blocks are more popular today, and critical word first is widely used.
Eighth optimization: merging write buffer to reduce miss penalty
- A write buffer allows the processor to continue while waiting for a write to memory to complete.
- If the buffer already contains modified blocks, the addresses can be checked to see whether the address of the new data matches the address of a valid write buffer entry; if so, the new data are combined with that entry.
- This merging increases the effective block size of writes for a write-through cache when the writes are to sequential words or bytes, since multiword writes are more efficient to memory.
- The Sun T1 (Niagara) processor, among many others, uses write merging.
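The merging check can be sketched in C as follows; the buffer geometry (4 entries of 4 words) and all names are assumptions for illustration, not a description of any particular processor.

#include <stdbool.h>
#include <stdint.h>

#define ENTRIES          4   /* assumed: 4 write-buffer entries */
#define WORDS_PER_ENTRY  4   /* assumed: 4 words per entry      */

struct wb_entry {
    bool     valid;
    uint64_t block_addr;               /* aligned block this entry covers   */
    uint32_t word_mask;                /* which words of the block are held */
    uint32_t data[WORDS_PER_ENTRY];
};

static struct wb_entry write_buffer[ENTRIES];

/* Try to merge a one-word write (word address addr) into an existing
   valid entry for the same block; returns true if the write was merged,
   false if a new entry is needed (or the processor must stall). */
bool write_merge(uint64_t addr, uint32_t value)
{
    uint64_t block  = addr / WORDS_PER_ENTRY;
    unsigned offset = (unsigned)(addr % WORDS_PER_ENTRY);

    for (int i = 0; i < ENTRIES; i++) {
        if (write_buffer[i].valid && write_buffer[i].block_addr == block) {
            write_buffer[i].data[offset] = value;        /* combine the data  */
            write_buffer[i].word_mask   |= 1u << offset; /* mark word present */
            return true;
        }
    }
    return false;
}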
Ninth optimization: compiler optimizations to reduce miss rate
- Blocking: improve temporal locality by accessing blocks of data repeatedly instead of walking down whole columns or rows (see the tiled matrix multiply sketch below).
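A standard illustration of blocking is a tiled matrix multiply; in the sketch below the matrix size N and blocking factor B are arbitrary example values, with B chosen so the tiles being reused fit in the cache.

#define N 512   /* matrix dimension (illustrative)                         */
#define B  32   /* blocking factor; pick so the reused tiles fit in cache  */

/* Computes x = x + y*z one B-wide strip at a time, so the submatrices of
   y and z are reused while they are still cache-resident instead of
   streaming whole rows and columns through the cache. */
void matmul_blocked(double x[N][N], double y[N][N], double z[N][N])
{
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < jj + B; j++) {
                    double r = 0.0;
                    for (int k = kk; k < kk + B; k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;
                }
}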
Merging arrays example
/* Before: two sequential arrays */
int val[SIZE];
int key[SIZE];

/* After: one array of structures */
struct merge {
    int val;
    int key;
};
struct merge merged_array[SIZE];
This reduces conflicts between val and key and improves spatial locality.
[Figure: conflict misses in caches that are not fully associative vs. blocking factor.]
Lam et al. [1991] found that a blocking factor of 24 had one fifth of the misses of a factor of 48, even though both fit in the cache.
Tenth optimization: hardware prefetching to reduce miss penalty or miss rate
[Figure: speedup from hardware prefetching on SPECint2000 and SPECfp2000 benchmarks, including gap, mcf, wupwise, galgel, facerec, swim, applu, lucas, fma3d, mgrid, and equake.]
Eleventh optimization: compiler-controlled prefetching to reduce miss penalty or miss rate
- Issuing prefetch instructions takes time: is the cost of issuing the prefetches less than the savings in reduced misses?
- Wider superscalar processors reduce the difficulty of finding issue bandwidth for the prefetches.
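As a sketch of what compiler-inserted prefetching looks like, using the GCC/Clang __builtin_prefetch intrinsic; the loop, array, and prefetch distance are our own example, and the distance would need tuning per machine.

#define N 100000
#define PREFETCH_DISTANCE 16   /* elements ahead; machine-dependent tuning knob */

double sum_with_prefetch(const double *a)
{
    double sum = 0.0;
    for (long i = 0; i < N; i++) {
        /* Start fetching data needed a few iterations from now, so the
           memory latency overlaps with useful work in the loop. */
        if (i + PREFETCH_DISTANCE < N)
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0 /* read */, 1 /* low temporal locality */);
        sum += a[i];
    }
    return sum;
}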
The techniques to improve hit time, bandwidth, miss penalty, and miss rate generally affect the other components of the average memory access time equation, as well as the complexity of the memory hierarchy.