03-Memory
Board-Based Systems
Cache and Memory
• cache
• performance
• cache partitioning
• multi-level cache
• memory
• off-die memory designs
Outline for memory design
Area comparison of memory tech.
System environments and memory
Performance factors
Virtual address
Factors:
1. physical word size
   • processor cache
2. block / line size
   • cache memory
3. cache hit time
   • cache size, organization
4. cache miss time
   • memory and bus
5. virtual-to-real translation time
6. number of processor requests per cycle
Design target miss rates
• beyond 1MB: double the size, half the miss rate (sketched below)
• system effects limit the hit rate
• L3: multiple 256KB arrays
• L1: usually less than 64KB
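A minimal sketch of that rule of thumb in Python, assuming some known miss rate at 1MB as an anchor point (the function name and the 0.01 value are illustrative, not from the slides, and the rule only applies beyond 1MB):

```python
# Rule-of-thumb design-target miss rate: beyond 1MB, doubling the
# cache size roughly halves the miss rate, i.e. miss rate ~ 1/size.
def target_miss_rate(cache_size_mb, miss_rate_at_1mb=0.01):
    return miss_rate_at_1mb / cache_size_mb

print(target_miss_rate(1))   # 0.01
print(target_miss_rate(2))   # 0.005   (double the size, half the miss rate)
print(target_miss_rate(4))   # 0.0025
```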
Analysis: multi-level cache miss rate
• L2 cache analysis by statistical inclusion
  • if the L2 cache is > 4 x the size of the L1 cache, then
  • assume statistically that the contents of L1 lie in L2
• relevant L2 miss rates
  • local miss rate: no. of L2 misses / no. of L2 references
  • global miss rate: no. of L2 misses / no. of processor references
  • solo miss rate: no. of misses without an L1 / no. of processor references
  • inclusion => solo miss rate = global miss rate
• miss penalty calculation
  • L1 miss rate x (miss-in-L1, hit-in-L2 penalty) plus
  • L2 global miss rate x (miss-in-L1, miss-in-L2 penalty - miss-in-L1, hit-in-L2 penalty)
Multi-level cache example
              L1    L2
Miss rate     4%    1%
- delays:
  miss in L1, hit in L2:            2 cycles
  miss in L1, miss in L2 (memory):  15 cycles
- assume one reference/instruction
L1 delay is 1 ref/instr x 0.04 misses/ref x 2 cycles/miss = 0.08 cpi
L2 delay is 1 ref/instr x 0.01 misses/ref x (15 - 2) cycles/miss = 0.13 cpi
Total effect of the 2-level system is 0.08 + 0.13 = 0.21 cpi
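As a quick check of the arithmetic, here is a minimal sketch in Python of the two-level miss-penalty formula from the previous slide; the function and parameter names are illustrative, not from the slides.

```python
# Two-level miss-penalty formula:
#   cpi_mem = refs/instr * [ L1_miss_rate * L2_hit_penalty
#                          + L2_global_miss_rate * (L2_miss_penalty - L2_hit_penalty) ]
def memory_cpi(refs_per_instr, l1_miss_rate, l2_global_miss_rate,
               l2_hit_penalty, l2_miss_penalty):
    l1_delay = refs_per_instr * l1_miss_rate * l2_hit_penalty
    l2_delay = refs_per_instr * l2_global_miss_rate * (l2_miss_penalty - l2_hit_penalty)
    return l1_delay, l2_delay, l1_delay + l2_delay

# Values from the example: 4% L1 miss rate, 1% global L2 miss rate,
# 2-cycle L2 hit penalty, 15-cycle L2 miss penalty, 1 reference/instruction.
print(memory_cpi(1.0, 0.04, 0.01, 2, 15))   # approximately (0.08, 0.13, 0.21)
```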
Memory design
• logical inclusion
• embedded RAM
• off-die: DRAM
• basic memory model
• Strecker’s model
Physical memory system
Hierarchy of caches
Name  Location            Size         Access        Transfer size
L0    Registers           <256 words   <1 cycle      word
L1    Core local          <64K         <4 cycles     Line
L2    On chip             <64M         <30 cycles    Line
L3    DRAM on chip        <1G          <60 cycles    >= Line
M0    Off-chip cache
M1    Local main memory   <16G         <150 cycles   >= Line
M2    Cluster memory
Hierarchy of caches
• Working set – how much memory an “iteration” requires
• if it fits in a cache level, then that level gives the worst case
• if it does not, the hit rate typically determines performance
• double the cache level size, half the miss rate – a good rule of thumb
• if 90% hit rate and 10x memory access time, performance is about 50% (see the sketch below)
• and that’s for 1 core
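A minimal sketch of that arithmetic, assuming a hit time of 1 cycle and a miss costing 10x that (the numbers come from the bullet above; the variable names are illustrative):

```python
# Effective access time when 90% of references hit and a miss costs 10x a hit.
hit_rate = 0.90
hit_time = 1.0                    # assumed unit: 1 cycle
miss_time = 10 * hit_time

avg_time = hit_rate * hit_time + (1 - hit_rate) * miss_time   # = 1.9 cycles
relative_performance = hit_time / avg_time                     # ~0.53, i.e. ~50%

print(avg_time, relative_performance)
```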
Logical inclusion
• multiprocessors with L1 and L2 caches
• important: knowing that the L1 cache does NOT contain a line
• it is sufficient to determine that
  • the L2 cache does not have the line
• need to ensure that
  • all the contents of L1 are always in L2
• this property: logical inclusion
Logical inclusion techniques
• passive
  • control cache size, organization, and policies
    • no. of L2 sets >= no. of L1 sets
    • L2 set size >= L1 set size
    • compatible replacement algorithms
  • but: highly restrictive and difficult to guarantee
• active
• whenever a line is replaced or invalidated in the L2
• ensure it is not present in L1, or evict it from L1 (see the sketch below)
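A minimal sketch of the active approach, using simple set-based caches; the class and method names are illustrative assumptions, not from the slides.

```python
# Active enforcement of logical inclusion: whenever a line is replaced or
# invalidated in L2, back-invalidate it in every L1 above that L2, so an L1
# can never hold a line that the L2 does not.
class L1Cache:
    def __init__(self):
        self.lines = set()            # addresses of lines currently held

    def invalidate(self, addr):
        self.lines.discard(addr)      # harmless if the line is not present

class InclusiveL2:
    def __init__(self, l1_caches):
        self.lines = set()
        self.l1_caches = l1_caches    # the L1 caches this L2 backs

    def replace_or_invalidate(self, addr):
        self.lines.discard(addr)
        for l1 in self.l1_caches:     # enforce inclusion by back-invalidation
            l1.invalidate(addr)
```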
Memory system design outline
• memory chip technology
• on-die or off-die
• static versus dynamic:
• SRAM versus DRAM
• access protocol: talking to memory
• synchronous vs asynchronous DRAMs
• simple memory performance model
• Strecker’s model for memory banks
Why BIG memory?
Memory
• many times, computation is limited by memory
  • not by processor organization or cycle time
• model description
  • each processor generates 1 reference per cycle
  • requests are randomly/uniformly distributed over the modules
  • any busy module serves 1 request
  • all unserviced requests are dropped each cycle
  • assume there are no queues
• B(m,n) = m[1 - (1 - 1/m)^n]
• relative performance: P_rel = B(m,n) / n
Deriving Strecker’s model
• Prob[a given processor does not reference a given module]
  = 1 - 1/m
• Prob[no processor references the module]
  = P[idle]
  = (1 - 1/m)^n
• Prob[module busy]
  = 1 - (1 - 1/m)^n
• the average number of busy modules is B(m,n)
• B(m,n) = m[1 - (1 - 1/m)^n]
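A minimal sketch of the model in Python; the function names are illustrative, and the usage numbers below are hypothetical, not from the examples that follow.

```python
def strecker_bandwidth(m, n):
    """B(m, n) = m * [1 - (1 - 1/m)**n]: expected number of busy modules."""
    return m * (1 - (1 - 1 / m) ** n)

def relative_performance(m, n):
    """P_rel = B(m, n) / n: fraction of offered requests that are served."""
    return strecker_bandwidth(m, n) / n

# e.g. 8 requests per memory cycle spread over 4 modules:
print(strecker_bandwidth(4, 8))     # ~3.6 modules busy on average
print(relative_performance(4, 8))   # ~0.45
```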
Example 1
• 2 dual-core processor dice share memory
• Ts = 24 ns
• each die has 2 processors
• sharing 4MB L2
• miss rate is 0.001 misses/reference
• each processor makes 3 references/cycle @ 4 GHz
2 x 2 x 3 x 0.001 = 0.012 refs/cycle
Ts = 24 ns x 4 cycles/ns = 96 cycles
n = 0.012 x 96 = 1.152 processor requests / Ts; if m = 4
success rate B(m,n) = B(4, 1.152) = 0.81
Relative performance = B/n = 0.81/1.152 = 0.7
Example 2