Large and Fast: Exploiting Memory Hierarchy

This document discusses computer memory hierarchies and caches. It begins by noting that while on-chip memory is fast, programs require more memory than can fit on chip. To address this, computers use a memory hierarchy with multiple cache levels between fast on-chip memory and slower main memory. The caches exploit locality to achieve fast average memory access times. Caches are organized into blocks and use tags, indexes, and offsets to determine hits and misses. Direct mapped and set associative mappings are described, cache misses are classified, and reducing capacity misses by increasing cache size is discussed. The document concludes with virtual memory, page tables, and the TLB.


CSE313 – Computer Architecture
Large and Fast: Exploiting Memory Hierarchy
The memory dilemma
 Ch 6 assumption: on-chip instruction and data memories hold the entire program and its data and can be accessed in one cycle
 Reality check
 In high performance machines, programs may require 100's of megabytes or even gigabytes of memory to run
 Embedded processors have smaller needs but there is also less room for on-chip memory
 Basic problem
 We need much more memory than can fit on the microprocessor chip
 But we do not want to incur stall cycles every time the pipeline accesses instructions or data
 At the same time, we need the memory to be economical for the machine to be competitive in the market
Solution: a hierarchy of memories
Another view
Typical characteristics of each level
 First level (L1) is separate on-chip instruction and data caches placed where our instruction and data memories reside
 16-64KB for each cache (desktop/server machine)
 Fast, power-hungry, not-so-dense static RAM (SRAM)

 Second level (L2) consists of another, larger unified cache
 Holds both instructions and data
 256KB-4MB
 On or off-chip
 SRAM

 Third level is main memory
 64-512MB
 Slower, lower-power, denser dynamic RAM (DRAM)

 Final level is I/O (e.g., disk)
Caches and the pipeline
 L1 instruction and data caches and L2 cache

 [Diagram: the L1 instruction cache and L1 data cache sit alongside the pipeline; both connect to a shared L2 cache, which connects to main memory (mm)]
Memory hierarchy operation

(1) Search L1 for the instruction or data
    If found (cache hit), done
(2) Else (cache miss), search the L2 cache
    If found, place it in L1 and repeat (1)
(3) Else, search main memory
    If found, place it in L2 and repeat (2)
(4) Else, get it from I/O (Chapter 8)

 Steps (1)-(3) are performed in hardware
 1-3 cycles to get from the L1 caches
 5-20 cycles to get from the L2 cache
 50-200 cycles to get from main memory
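To make these latencies concrete, here is a minimal C sketch of the average memory access time (AMAT) for such a hierarchy. The cycle counts are taken from the ranges above; the hit rates are assumptions chosen only for illustration.

/* Minimal AMAT sketch: 2-cycle L1, 10-cycle L2, 100-cycle main memory
 * (from the ranges above); the hit rates below are assumed values. */
#include <stdio.h>

int main(void) {
    double l1_hit = 0.95, l2_hit = 0.80;                 /* assumed hit rates */
    double l1_lat = 2.0, l2_lat = 10.0, mm_lat = 100.0;  /* cycles            */

    /* AMAT = L1 time + L1 miss rate * (L2 time + L2 miss rate * memory time) */
    double amat = l1_lat + (1.0 - l1_hit) * (l2_lat + (1.0 - l2_hit) * mm_lat);

    printf("average memory access time = %.2f cycles\n", amat);  /* 3.50 */
    return 0;
}

With these assumed hit rates the hierarchy averages 3.5 cycles per access, far closer to the L1 latency than to main memory's.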
Principle of locality
 Programs access a small portion of memory within a short time period

 Temporal locality: recently accessed memory locations will likely be accessed soon

 Spatial locality: memory locations near recently accessed locations will likely be accessed soon
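As a small illustration (a C sketch, not from the slides), both kinds of locality show up in an ordinary array loop:

/* Summing an array: the loop code and the variable sum are reused on every
 * iteration (temporal locality), while a[i] walks through consecutive
 * memory locations (spatial locality). */
#include <stddef.h>

double sum_array(const double *a, size_t n) {
    double sum = 0.0;                  /* reused every iteration: temporal   */
    for (size_t i = 0; i < n; i++) {   /* same loop code refetched: temporal */
        sum += a[i];                   /* neighboring elements: spatial      */
    }
    return sum;
}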
POL makes memory hierarchies work
 A large percentage of the time (typically >90%) the instruction or data is found in L1, the fastest memory

 Cheap, abundant main memory is accessed more rarely

  The memory hierarchy operates at nearly the speed of expensive on-chip SRAM with about the cost of main memory (DRAM)
How caches exploit the POL
 On a cache miss, a block of several instructions or data, including the requested item, is returned

 [Diagram: the requested instruction (instruction i) is returned together with instructions i+1, i+2, and i+3 as one block]

 The entire block is placed into the cache so that future searches for items in the block will be successful
How caches exploit the POL
 Consider the sequence of instruction and data accesses in this loop, with a block size of 4 words

Loop: lw   $t0, 0($s1)
      addu $t0, $t0, $s2
      sw   $t0, 0($s1)
      addi $s1, $s1, -4
      bne  $s1, $zero, Loop
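One way to read this example (assuming the five loop instructions sit in one or two 4-word blocks and the words addressed through $s1 are stored contiguously): the first iteration misses on the instruction block(s) and on the data block containing 0($s1); later iterations re-execute the same instructions (temporal locality), and as $s1 steps down by 4 it touches neighboring words brought in with the same data block (spatial locality), so most subsequent accesses hit.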
Searching the cache
 The cache is much smaller than main memory

 Multiple memory blocks must share the same cache location
Searching the cache
 Need a way to determine whether the desired instruction or data is held in the cache
 Need a scheme for replacing blocks when a new block needs to be brought in on a miss
Cache organization alternatives
 Direct mapped: each block can be placed in only one cache location

 Set associative: each block can be placed in any of n cache locations (the n ways of its set)

 Fully associative: each block can be placed in any cache location
Cache organization alternatives
 Searching for block 12 in caches of size 8 blocks

 [Diagram: the possible placements of block 12, by set number, in a direct mapped, a set associative, and a fully associative cache of 8 blocks]
Searching a direct mapped cache
 Need log2(number of sets) of the address bits (the index) to select the block location
 The block offset is used to select the desired byte, half-word, or word within the block
 The remaining bits (the tag) are used to determine if this is the desired block or another that shares the same cache location

 Example: assume a data cache with 16 byte blocks and 8 sets
 4 block offset bits
 3 index bits
 25 tag bits
 [Diagram: the memory address split into tag, index, and block offset fields]
Searching a direct mapped cache
 The block is placed in the set selected by its index bits
 Number of sets = cache size / block size

 Same example as above: a data cache with 16 byte blocks and 8 sets gives 4 block offset bits, 3 index bits, and 25 tag bits
 [Diagram: the memory address split into tag, index, and block offset fields]
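A minimal C sketch of this address split, assuming 32-bit addresses and the example parameters above (16 byte blocks, 8 sets); the example address is arbitrary:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t addr   = 0x1234ABCD;            /* arbitrary example address  */
    uint32_t offset = addr & 0xF;            /* low 4 bits: byte in block  */
    uint32_t index  = (addr >> 4) & 0x7;     /* next 3 bits: set number    */
    uint32_t tag    = addr >> 7;             /* remaining 25 bits: tag     */
    printf("tag=0x%x index=%u offset=%u\n",
           (unsigned)tag, (unsigned)index, (unsigned)offset);
    return 0;
}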
Direct mapped cache organization
 64KB instruction cache with 16 byte (4 word) blocks

 4K sets (64KB/16B)  need 12 address bits (the index) to pick one of the sets; with 4 block offset bits, the remaining 16 bits of a 32-bit address form the tag
Direct mapped cache organization
 The data section of the cache holds the instructions
Direct mapped cache organization
 The tag section holds the part of the memory address used to distinguish different blocks
Direct mapped cache organization
 A valid bit associated with each set indicates if the instructions are valid or not
Direct mapped cache access
 The index bits are used to select one of the sets
Direct mapped cache access
 The data, tag, and valid bit from the selected set are simultaneously accessed
Direct mapped cache access
 The tag from the selected entry is compared with the tag field of the address
Direct mapped cache access
 A match between the tags and a valid bit that is set indicates a cache hit
Direct mapped cache access
 The block offset selects the desired instruction
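The access sequence above can be sketched in C roughly as follows, assuming the earlier example of 8 sets and 16 byte blocks; this is only an illustration of the hit check, not the hardware itself:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NUM_SETS   8
#define BLOCK_SIZE 16

struct cache_set {
    bool     valid;                   /* valid bit for the stored block */
    uint32_t tag;                     /* tag of the stored block        */
    uint8_t  data[BLOCK_SIZE];        /* the block itself (4 words)     */
};

static struct cache_set dm_cache[NUM_SETS];

/* Returns true on a hit and copies the addressed word into *word_out. */
bool dm_lookup(uint32_t addr, uint32_t *word_out) {
    uint32_t offset = addr % BLOCK_SIZE;              /* byte within block */
    uint32_t index  = (addr / BLOCK_SIZE) % NUM_SETS; /* select the set    */
    uint32_t tag    = addr / (BLOCK_SIZE * NUM_SETS); /* remaining bits    */

    struct cache_set *set = &dm_cache[index];
    if (set->valid && set->tag == tag) {              /* tag match + valid */
        memcpy(word_out, &set->data[offset & ~0x3u], sizeof *word_out);
        return true;                                  /* cache hit         */
    }
    return false;                                     /* cache miss        */
}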
Set associative cache
 The block is placed in one way of the set selected by its index bits
 Number of sets = cache size / (block size × ways)

 [Diagram: a 4-way set associative cache, ways 0-3]
Set associative cache operation
 The index bits are used to select one of the sets
Set associative cache operation
 The data, tag, and valid bit from all ways of the selected entry are simultaneously accessed
Set associative cache operation
 The tags from all ways of the selected entry are compared with the tag field of the address
Set associative cache operation
 A match between the tags and a valid bit that is set indicates a cache hit (hit in way 1 shown)
Set associative cache operation
 The data from the way that had a hit is returned through the MUX
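For comparison, a C sketch of the set associative lookup (assumed parameters: 4 ways, 4 sets, 16 byte blocks). The hardware compares all ways in parallel, whereas the sketch loops over them:

#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS   4
#define NUM_WAYS   4
#define BLOCK_SIZE 16

struct cache_line { bool valid; uint32_t tag; uint8_t data[BLOCK_SIZE]; };
static struct cache_line sa_cache[NUM_SETS][NUM_WAYS];

/* Returns the way that hit, or -1 on a miss. */
int sa_lookup(uint32_t addr) {
    uint32_t index = (addr / BLOCK_SIZE) % NUM_SETS;
    uint32_t tag   = addr / (BLOCK_SIZE * NUM_SETS);
    for (int way = 0; way < NUM_WAYS; way++) {
        if (sa_cache[index][way].valid && sa_cache[index][way].tag == tag)
            return way;               /* hit: this way's data goes through the MUX */
    }
    return -1;                        /* miss in every way */
}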
Cache misses
 A cache miss occurs when the block is not found in the cache
 The block is requested from the next level of the hierarchy (L1 misses go to L2, L2 misses to main memory)
 When the block returns, it is loaded into the cache and provided to the requester
 A copy of the block remains in the lower levels of the hierarchy
 The cache miss rate is found by dividing the total number of misses by the total number of accesses (misses/accesses)
 The hit rate is 1 - miss rate
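For example (with made-up numbers): if 1000 accesses produce 40 misses, the miss rate is 40/1000 = 4% and the hit rate is 1 - 0.04 = 96%.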
Classifying cache misses
 Compulsory misses
 Caused by the first access to a block that has never been in the cache
 Capacity misses
 Due to the cache not being big enough to hold all the blocks that are needed
 Conflict misses
 Due to multiple blocks competing for the same set
 A fully associative cache with a “perfect” replacement policy has no conflict misses
Cache miss classification
 Direct mapped examples: a cache of size two blocks
 Blocks A and B map to set 0, C and D to set 1

 Access pattern 1: A, B, C, D, A, B, C, D
 [Diagram: the cache contents after each access, with the classification of each miss left as ??]

 Access pattern 2: A, A, B, A
 [Diagram: the cache contents after each access, with the classification of each miss left as ??]
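One possible classification, assuming LRU replacement and the usual convention that conflict misses are those that would not occur in a fully associative cache of the same size: in pattern 1 the first four misses are compulsory, and the second round of misses would also occur in a two-block fully associative cache (four blocks cannot fit in two), so they are capacity misses; in pattern 2 the misses on the first A and on B are compulsory, while the final miss on A is a conflict miss, since a fully associative cache would still hold both A and B.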
Reducing capacity misses
 Increase the cache size
 More cache blocks can be simultaneously held in the cache
 Drawback: increased access time
Block replacement policy
 Determines what block to replace on a cache miss to make room for the new block
 Least recently used (LRU)
 Pick the one that has been unused for the longest time
 Based on temporal locality
 Requires ordering bits to be kept with each set
 Too expensive beyond 4-way

 Random
 Pseudo-randomly pick a block
 Generally not as effective as LRU (higher miss rates)
 Simple even for highly associative organizations

 Most recently used (MRU)
 Keep track of which block was accessed last
 Randomly pick a block from other than that one
 Compromise between LRU and random
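As a small illustration of the bookkeeping cost, here is a C sketch of LRU for a 2-way set associative cache (sizes assumed). At 2 ways a single bit per set suffices; the state grows quickly at higher associativity, which is why LRU gets expensive beyond 4-way.

#include <stdint.h>

#define NUM_SETS 4

/* For 2 ways, LRU needs only one bit per set: which way was used last. */
static uint8_t last_used[NUM_SETS];

void lru_touch(int set, int way) { last_used[set] = (uint8_t)way; }

int lru_victim(int set)          { return 1 - last_used[set]; }  /* the other way */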
Cache writes and block replacement
 With a write-back cache, when a block is written to, copies of the block in the lower levels are not updated
 If this block is chosen for replacement on a miss, we need to save it to the next level
 Solution: a dirty bit is associated with each cache block
 The dirty bit is set if the block is written to
 A block with a set dirty bit that is chosen for replacement is written to the next level before being overwritten with the new block
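A C sketch of this replacement path with a dirty bit; the cache line layout and the write_back_to_next_level helper are assumptions for illustration only:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

struct line { bool valid, dirty; uint32_t tag; uint8_t data[16]; };

void write_back_to_next_level(const struct line *victim);  /* assumed helper */

void replace_block(struct line *victim, uint32_t new_tag, const uint8_t *new_data) {
    if (victim->valid && victim->dirty)
        write_back_to_next_level(victim);   /* save the modified block first */
    memcpy(victim->data, new_data, sizeof victim->data);
    victim->tag   = new_tag;
    victim->valid = true;
    victim->dirty = false;                  /* freshly loaded block is clean */
}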
Virtual memory
 A single program may require more main memory than is present in the machine

 Multiple programs must share main memory without interfering with each other

 With virtual memory, portions of different programs are loaded from I/O to memory on demand

 When memory gets full, portions of programs are swapped out to I/O
Implementing virtual memory
 Separate different programs in memory by assigning different memory addresses to each
 Identify whether the desired instructions or data are in memory or not
 Generate an interrupt when they are not in memory and must be retrieved from I/O
 Provide support in the OS to retrieve the desired instructions or data, replacing others if necessary
 Prevent users from accessing instructions or data they do not own
Memory pages
 Transfers between disk and main memory occur in pages, whose size is defined in the ISA
 Page size (4-64KB) is large compared to a cache block (16-128B) in order to amortize the high cost of a disk access (~5-10ms)
 The tradeoffs in increasing page size are similar to those for cache block size

 [Diagram: the L2 cache exchanges cache blocks (16-128B) with main memory; main memory exchanges pages (4-64KB) with the disk]
Virtual and physical addresses
 The virtual addresses in your program are translated during program execution into the physical addresses used to address memory
 Some virtual addresses may refer to data that is not in memory (not memory resident)
Address translation
 The virtual page number (vpn) part of the virtual address is translated into a physical page number (ppn) that points to the desired page
 The low-order page offset bits point to which byte is being accessed within a page
 The ppn + page offset form the physical address

 [Diagram: 4GB of virtual address space, 2^12 = 4KB pages, 1GB of main memory in the machine]
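With the numbers in this figure (32-bit virtual addresses, 4KB pages, 1GB of physical memory), a C sketch of the split and translation looks roughly like this; page_table here is just an assumed in-memory array of ppns:

#include <stdint.h>

#define PAGE_OFFSET_BITS  12                /* 4KB pages                 */
#define NUM_VIRTUAL_PAGES (1u << 20)        /* 32 - 12 = 20 vpn bits     */

static uint32_t page_table[NUM_VIRTUAL_PAGES];   /* vpn -> ppn (assumed) */

uint32_t translate(uint32_t vaddr) {
    uint32_t vpn    = vaddr >> PAGE_OFFSET_BITS;               /* 20-bit vpn */
    uint32_t offset = vaddr & ((1u << PAGE_OFFSET_BITS) - 1);  /* 12 bits    */
    uint32_t ppn    = page_table[vpn];      /* 18-bit ppn for 1GB of memory  */
    return (ppn << PAGE_OFFSET_BITS) | offset;       /* physical address     */
}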
Address translation
 Another view of the physical address

 [Diagram: physical address = ppn | page offset; the ppn locates the first byte of the page in memory, and the page offset selects the addressed byte within the 2^(page offset bits)-byte page]
Address translation
 Address translation is performed by the hardware and the operating system (OS)

 The OS maintains for each program:
 What pages are associated with it
 Where on disk each page resides
 What pages are memory resident
 The ppn associated with the vpn of each memory resident page
Address translation
 For each program, the OS sets up a page table in memory that holds the ppn corresponding to the vpn of each memory resident page

 The page table register in hardware provides the base address of the page table for the currently running process

 Each program has a unique page table and page table register value that is loaded by the OS
Address translation
 Page table access: the address of a page table entry is the page table register (the base of the page table) + the vpn (the offset into the page table)
 The page table is located in memory  requires loads/stores to access!
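A minimal sketch of that address calculation, assuming 4-byte page table entries (the entry size is an assumption, not from the slide):

#include <stdint.h>

/* The entry for a given vpn lives in memory at base + vpn * entry size,
 * so reading it costs an ordinary load from main memory. */
uint32_t pte_address(uint32_t page_table_register, uint32_t vpn) {
    return page_table_register + vpn * (uint32_t)sizeof(uint32_t);
}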
The TLB: faster address translation
 Major problem: for each instruction or data access we have to first access the page table in memory to get the physical address
 Solution: cache the address translations in a Translation Lookaside Buffer (TLB) in hardware
 The TLB holds the ppns of the most recently accessed pages
 Hardware first checks the TLB for the vpn/ppn pair; if not found (TLB miss), then the page table is accessed to get the ppn
 The ppn is loaded into the TLB for later access
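A C sketch of that check; the size and the sequential search are assumptions for illustration, since a real TLB compares its entries in parallel in hardware:

#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 16

struct tlb_entry { bool valid; uint32_t vpn, ppn; };
static struct tlb_entry tlb[TLB_ENTRIES];

/* Returns true on a TLB hit and writes the ppn; on a miss the page table
 * in memory must be walked and the translation loaded into the TLB. */
bool tlb_lookup(uint32_t vpn, uint32_t *ppn_out) {
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *ppn_out = tlb[i].ppn;
            return true;                    /* TLB hit  */
        }
    }
    return false;                           /* TLB miss */
}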
