
CS2115 Computer Organization

2023/2024 Sem A

Chapter 6: Memory

Dr. Nan Guan


Department of Computer Science
City University of Hong Kong

1
Memory
• By the word "memory", we mean

– Wide sense: state elements


• Main memory, hard disk, flash memory, register, cache …

– Narrow sense: main memory

2
The Memory Wall

• Processor performance has increased significantly over the
years, but memory performance has not kept pace.

• The slow memory becomes the performance bottleneck.
3
Principle of Locality

• Temporal locality: accesses to the same memory location
occur close together in time

• Spatial locality: accesses to memory locations that are
close to each other in address space

4
Memory Hierarchy
• Organize memory system into a hierarchy with faster (but
smaller) memory closer to processor

5
Memory Hierarchy

6
Cache Memory

7
Direct-Mapped Caches
Simplified Example: suppose the cache has 8 lines, the memory has 32 words,
and each cache line stores one word.

  Cache index   Memory addresses that map to it
  000           00000 01000 10000 11000
  001           00001 01001 10001 11001
  010           00010 01010 10010 11010
  011           00011 01011 10011 11011
  100           00100 01100 10100 11100
  101           00101 01101 10101 11101
  110           00110 01110 10110 11110
  111           00111 01111 10111 11111

8
Direct-Mapped Caches
Simplified Example: suppose the cache has 8 lines, the memory has 32 words,
and each cache line stores one word.

  Cache_Index = Memory_Address mod Cache_Size
  Cache_Tag   = Memory_Address / Cache_Size   (integer division)

  Index  Tag  Candidate memory addresses
  000    00   00000 01000 10000 11000
  001    00   00001 01001 10001 11001
  010    00   00010 01010 10010 11010
  011    00   00011 01011 10011 11011
  100    00   00100 01100 10100 11100
  101    00   00101 01101 10101 11101
  110    00   00110 01110 10110 11110
  111    00   00111 01111 10111 11111

9
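The index/tag computation above can be sketched in Python (a minimal illustration; the function name is my own):

```python
def direct_mapped_lookup(address, cache_size=8):
    """Map a word address to (index, tag) for a direct-mapped cache.

    cache_size is the number of cache lines (8 in the slide's example).
    """
    index = address % cache_size   # Cache_Index = Memory_Address mod Cache_Size
    tag = address // cache_size    # Cache_Tag = Memory_Address / Cache_Size
    return index, tag

# Words 0b00011 (3) and 0b01011 (11) both map to line 011, with different tags:
print(direct_mapped_lookup(0b00011))  # (3, 0)
print(direct_mapped_lookup(0b01011))  # (3, 1)
```

The tag is what lets the cache tell which of the candidate words currently occupies a line.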
Direct-Mapped Caches
• When accessing a cache with 2^W lines:
– Index into the cache with W address bits (the index bits)
– Read out the valid bit, tag, and data
– If valid bit == 1 and the tag matches → Cache Hit
– Otherwise → Cache Miss
– Offset bits select the byte/word within a cache line

A cache line is invalid if its value is no longer consistent with memory,
e.g., because the data was modified by other elements (DMA, other cores, …).

Example: 8-line direct-mapped cache (W = 3), 32-bit BYTE address,
each cache line stores one word:

  Address: 0000 0000 0000 0000 0000 0000 1110 1000
           [      tag bits (27)      ][index][offset]
  Cache line format (lines 0–7): valid bit | tag (27 bits) | data (32 bits)
  HIT = (valid bit == 1) AND (stored tag =? address tag)

10
Direct-Mapped Caches
64-line direct-mapped cache → 64 indexes → 6 index bits

Read Mem[0x0000400C]:
  0000 0000 0000 0000 0100 0000 0000 1100
  TAG: 0x000040   INDEX: 0x3   OFFSET: 0x0

Cache (index: valid bit | tag (24 bits) | data (32 bits)):
   0:  1  0x000058  0xDEADBEEF
   1:  1  0x000058  0x00000000
   2:  0  0x000058  0x00000007
   3:  1  0x000040  0x42424242
   4:  1  0x000007  0x6FBA2381
  ...
  63:  1  0x000058  0xF7324A32

Index 3: valid bit is 1 and the tag matches → Hit!
Read from cache: 0x42424242
11
Direct-Mapped Caches
64-line direct-mapped cache → 64 indexes → 6 index bits

Read Mem[0x00004004]:
  0000 0000 0000 0000 0100 0000 0000 0100
  TAG: 0x000040   INDEX: 0x1   OFFSET: 0x0

Cache (index: valid bit | tag (24 bits) | data (32 bits)):
   0:  1  0x000058  0xDEADBEEF
   1:  1  0x000058  0x00000000
   2:  0  0x000058  0x00000007
   3:  1  0x000040  0x42424242
   4:  1  0x000007  0x6FBA2381
  ...
  63:  1  0x000058  0xF7324A32

Index 1: the stored tag (0x000058) does not match → Miss!
Load from memory: 0x12345678
The line's tag is updated to 0x000040 and its data to 0x12345678.

12
Direct-Mapped Caches
64-line direct-mapped cache → 64 indexes → 6 index bits

Read Mem[0x0000400C]:
  0000 0000 0000 0000 0100 0000 0000 1100
  TAG: 0x000040   INDEX: 0x3   OFFSET: 0x0

Cache (index: valid bit | tag (24 bits) | data (32 bits)):
   0:  1  0x000058  0xDEADBEEF
   1:  1  0x000058  0x00000000
   2:  0  0x000058  0x00000007
   3:  1  0x000040  0x42424242
   4:  1  0x000007  0x6FBA2381
  ...
  63:  1  0x000058  0xF7324A32

Index 3: Hit! Read data: 0x42424242

Would Read Mem[0x00004008] be a hit?
  INDEX: 0x2 → the valid bit is 0 and the stored tag (0x000058)
  does not match the address tag (0x000040) → miss

What are the addresses of the data in indexes 0, 1, and 2?
  TAG: 0x58 → 0000 0000 0000 0000 0101 1000 iiii ii00
  (substitute iiiiii with index 0, 1, 2) → 0x5800, 0x5804, 0x5808
13
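The tag/index/offset decomposition used in these examples can be sketched as follows (a minimal illustration; the helper name is my own):

```python
def split_address(addr, index_bits=6, offset_bits=2):
    """Split a 32-bit byte address into (tag, index, offset) for the
    slide's 64-line direct-mapped cache (one 4-byte word per line)."""
    offset = addr & ((1 << offset_bits) - 1)                 # byte within the word
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)  # cache line number
    tag = addr >> (offset_bits + index_bits)                 # remaining 24 bits
    return tag, index, offset

tag, index, offset = split_address(0x0000400C)
print(f"tag={tag:#x} index={index:#x} offset={offset:#x}")  # tag=0x40 index=0x3 offset=0x0
```

Running it on 0x00004008 gives index 0x2, confirming why that access checks cache line 2.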
Fully-Associative Cache
• A memory block can be stored in any cache line

  Cache lines   Memory words
  000           00000 01000 10000 11000
  001           00001 01001 10001 11001
  010           00010 01010 10010 11010
  011           00011 01011 10011 11011
  100           00100 01100 10100 11100
  101           00101 01101 10101 11101
  110           00110 01110 10110 11110
  111           00111 01111 10111 11111
  (any memory word can be placed in any cache line)

14
Fully-Associative Cache
• A memory block can be stored in any cache line
Example: 32-bit BYTE address, each cache line stores 1 word

  Address: 0000 0000 0000 0000 0000 0000 1110 1000
           [        tag bits        ][offset bits]

• To check whether an access is a cache hit
– Need to compare the tag against all cache lines (line 1, line 2, …, line N)

15
Fully-Associative Cache
• When a cache miss occurs, we need to load the missed data
from memory into the cache
• Some data currently stored in the cache will be replaced
• The replacement policy decides which data to replace
• Many different policies
– Random, FIFO (First-In-First-Out), LRU (Least Recently Used) …

16
Fully-Associative Cache
• FIFO replacement policy
– Each cache line has an age (young → old)
– On a miss, the loaded data becomes the youngest;
all other entries' ages increase by 1
– On a hit, nothing changes
Example (young → old): [a b c d] → miss f → [f a b c] → hit b → [f a b c]

• LRU replacement policy
– Same as FIFO on a miss
– On a hit, the accessed data becomes the youngest;
the previously younger entries' ages increase by 1
Example (young → old): [a b c d] → miss f → [f a b c] → hit b → [b f a c]
17
FIFO Algorithm
• FIFO (First-In-First-Out): replaces the earliest-loaded page
• Example:
  [Figure: reference sequence and resulting cache contents]

18
LRU Algorithm
• LRU (Least Recently Used): replaces the page that has not been
used for the longest time
• Example:
  [Figure: reference sequence and resulting cache contents]

19
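The two policies can be compared with a small simulation (a sketch under the slides' age model; function and variable names are my own):

```python
def simulate(refs, capacity, policy="FIFO"):
    """Count misses for a reference string under FIFO or LRU replacement.

    The cache list is kept ordered youngest-first, mirroring the age model.
    """
    cache, misses = [], 0
    for x in refs:
        if x in cache:                  # hit
            if policy == "LRU":         # LRU: accessed entry becomes youngest
                cache.remove(x)
                cache.insert(0, x)
            # FIFO: a hit leaves the ordering unchanged
        else:                           # miss
            misses += 1
            if len(cache) == capacity:  # evict the oldest entry
                cache.pop()
            cache.insert(0, x)          # loaded data is the youngest
    return misses

refs = ["a", "b", "c", "d", "a", "f", "b"]
print(simulate(refs, 4, "FIFO"), simulate(refs, 4, "LRU"))  # 5 6
```

On this sequence LRU takes one extra miss: its hit on `a` refreshes `a`, so `b` gets evicted by `f` and misses later.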
Direct-Mapped v.s. Fully-Associative
• Fully-associative cache is more flexible
– Higher utilization of the resource
• a memory block can be stored in any cache line, so lower chance of conflict
– More difficult to search
• Need to search the entire cache
• Direct-mapped cache is simpler
– Easier to search
• Only need to check one cache line
[Figure: cache and memory mapping, as on slide 8]

20
Set-Associative Cache
• Can we combine the easy searching of direct-mapped and the
flexibility of fully-associative caches?
• Yes: set-associative
— Each cache block can go in only 1 set (same modulo addressing)
— But, each set can have multiple blocks (associative searching)
• Easy to search:
— You know which set based on the address
— Only have to check the few cache blocks in the set
• Flexible:
— Can put each cache block anywhere in the set
— Reduces (but doesn’t eliminate) conflict misses
21
Set-Associative Cache
Simplified Example: 4 sets, 2 cache lines per set (Set 0: lines 0–1,
Set 1: lines 2–3, Set 2: lines 4–5, Set 3: lines 6–7); memory has 32 words.

• Use modulo addressing
– 0 → (0 mod 4) = 0 → anywhere in Set 0
– 1 → (1 mod 4) = 1 → anywhere in Set 1
– 2 → (2 mod 4) = 2 → anywhere in Set 2
– 8 → (8 mod 4) = 0 → anywhere in Set 0
– 9 → (9 mod 4) = 1 → anywhere in Set 1
– 18 → (18 mod 4) = 2 → anywhere in Set 2
We know which set any address will go to, so we only
have to search that set.

• Just look at the last 2 bits
– 0 → 000000 → Set 0
– 1 → 000001 → Set 1
– 2 → 000010 → Set 2
– 8 → 001000 → Set 0
– 9 → 001001 → Set 1
– 18 → 010010 → Set 2
(the remaining bits are the tag bits)

22
Set-Associative Cache
• Address: suppose the cache has 2^3 = 8 sets
  32-bit BYTE address: 0000 0000 0000 0000 0000 0000 1110 1000
                       [    tag bits    ][set index bits][offset bits]

• Way number: the number of cache lines in each set
– Called an X-way set-associative cache if the way number = X
– Example: 8 cache lines, set number = 4, way number = 2
  (Set 0: lines 0–1, Set 1: lines 2–3, Set 2: lines 4–5, Set 3: lines 6–7)

• The replacement policy is applied within each set
23
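The set-index computation can be sketched the same way as in the direct-mapped case (a minimal illustration; names are my own):

```python
def set_lookup(addr, set_bits=3, offset_bits=2):
    """Split a 32-bit byte address into (tag, set index, offset) for a
    set-associative cache with 2**set_bits sets (8 sets on this slide)."""
    offset = addr & ((1 << offset_bits) - 1)
    set_index = (addr >> offset_bits) & ((1 << set_bits) - 1)
    tag = addr >> (offset_bits + set_bits)  # the tag is compared against
    return tag, set_index, offset           # every way within the set

# Two addresses that fall into the same set but carry different tags:
for addr in (0x0C, 0x2C):
    tag, s, off = set_lookup(addr)
    print(f"addr={addr:#04x} -> set {s}, tag {tag}")
# both map to set 3, with tags 0 and 1; either way of set 3 may hold them
```

Only the ways of the selected set need to be searched, which is the whole point of the design.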
Set-Associative Cache
Example: 4-way set-associative cache
  Address fields: tag (20 bits) | index (10 bits) | byte offset (2 bits)
  Set number = 1024, way number = 4

  Each way stores (Tag, V, Data) for indexes 0–1023. The index selects one
  set; the four stored tags are compared (=) with the address tag in
  parallel, and a 4-to-1 multiplexor selects the 32-bit data of the
  matching way; a match with V = 1 raises "hit".
24
Quantifying Cache Performance
• With a perfect cache:

CPU time = CPU clock cycles × Clock cycle time

• If the processor is stalled for memory accesses:

CPU time = (CPU clock cycles + Memory stall cycles) × Clock cycle time

Memory stall cycles (MemStall) = Number of misses × Miss penalty

25
Example #1
• Assume the CPI (cycles per instruction) of a computer is 1. If 30% of the
instructions access data memory, the miss penalty is 100 cycles, and the
overall miss rate of accesses to both data and instructions is 5%, how much
faster would the computer be if all memory accesses were cache hits?
• If all memory accesses are cache hits:
  CPU time_ideal = CPU clock cycles × Clock cycle time
                 = IC × CPI × Clock cycle time     (IC: instruction count,
                 = IC × 1.0 × Clock cycle time      the total number of instructions)
• With cache misses:
  Memory stall cycles = IC × (1.0 + 0.3) × miss rate × miss penalty
                      = IC × 1.3 × 0.05 × 100 = 6.5 × IC
  CPU time_MemStall = (IC × 1.0 + IC × 6.5) × Clock cycle time
                    = 7.5 × IC × Clock cycle time

Speedup = 7.5
26
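Example #1 can be reproduced numerically (a sketch; the parameter names are my own):

```python
def speedup_from_perfect_cache(cpi=1.0, data_access_frac=0.3,
                               miss_rate=0.05, miss_penalty=100):
    """Ratio of CPU time with cache misses to CPU time with a perfect
    cache. The clock cycle time and IC cancel out of the ratio."""
    # 1 instruction fetch + data_access_frac data accesses per instruction
    accesses_per_instr = 1.0 + data_access_frac
    stall_cycles_per_instr = accesses_per_instr * miss_rate * miss_penalty
    return (cpi + stall_cycles_per_instr) / cpi

print(speedup_from_perfect_cache())  # ≈ 7.5
```

Note how the miss penalty dominates: 6.5 of the 7.5 cycles per instruction are stalls.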
Memory Hierarchy

27
RAM
• RAM: Random-Access Memory
– allows data items to be read or written in almost the same amount of
time irrespective of the physical location of the data inside the memory
(other types of memory do not have this property)
• Two major types:
– SRAM (static random-access memory)
– DRAM (dynamic random-access memory)
• SRAM and DRAM are both volatile memory
– The data is lost when powered off

28
SRAM vs DRAM
• SRAM (static random-access memory)
– Use flip-flops to store bits
– Usually used for Cache and registers

• DRAM (dynamic random-access memory)


– Use capacitors to store bits
– Access speed is much slower than SRAM

29
SRAM vs DRAM

30
Hard Disk Drive
• HDD: Hard Disk Drive
– an electro-mechanical data storage device that uses magnetic storage on
one or more rigid, rapidly rotating platters coated with magnetic material
– HDDs are a type of non-volatile memory (NVM), retaining stored data
even when powered off
– Used to store mass data
Solid State Disk
• SSD: Solid State Disk
– uses integrated circuit assemblies to store data persistently, typically
using flash memory
– Compared with HDDs, SSDs are typically more resistant to physical
shock, run silently, and have quicker access times and lower latency
32
Flash Memory
• Toshiba developed flash memory in the 1980s
• Two main types of flash memory: NAND (for data) and NOR
(for code)
• Started to be heavily used because of digital cameras
• Exponential growth with smartphones/tablets

33
Comparison of SSD and HDD

[Figure: price trend (US$/GB) of SSD vs HDD, 2009–2019; vertical axis 0.0–2.5]

34
Basic NAND Flash Cell
• Store charges in the floating gate
• The threshold voltage (Vth) represents the data

[Figure: floating-gate cell — control gate, gate oxide, floating gate,
tunnel oxide, source (S) and drain (D) on the substrate. Two Vth
distributions: erased (1, low Vth) and programmed (0, high Vth),
separated by the read voltage.]

Note: a high threshold voltage represents 0, a low threshold voltage represents 1.

35
Flash Cell Organization

[Figure: NAND array — bit lines (BL) crossed by word lines (WL 0–3),
with source gate lines and a source line; one word line selects one page.]

• Page: 4 KB to 32 KB — the unit of read and write
• Block: 64 to 1024 pages — the unit of erase
36
Read from Flash
• To read page 2: apply the read voltage (Vread = 2.5 V) to page 2's word
line and the pass voltage (Vpass = 5.0 V) to all other word lines.
• On page 2, cells with Vth above Vread (2.9 V, 3.5 V) read as 0, and
cells with Vth below Vread (2.4 V, 2.1 V) read as 1.

[Figure: four pages of cells with their threshold voltages; page 2 is read
with Vread = 2.5 V while pages 1, 3, and 4 get Vpass = 5.0 V]

Values for page 2: 0 0 1 1
37
Write to Flash
• One can only charge an individual flash cell (not discharge it)
– i.e., an individual bit can only change from 1 to 0, not from 0 to 1
• In order to write a 1, we have to erase the entire block
– Erase: set every bit in the block to 1
• After that, we can write the 0 bits as needed

[Figure: floating-gate cell and the erased (1) / programmed (0) Vth
distributions, as on slide 35]

38
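The 1→0-only constraint can be modeled with a bitmask (a toy model; the names and the 8-bit page size are my own):

```python
PAGE_BITS = 8
ERASED = (1 << PAGE_BITS) - 1        # erase sets every bit of the page to 1

def program(page, data):
    """Program a page: cells can only be charged, i.e. bits only go 1 -> 0."""
    result = page & data             # existing 0 bits can never return to 1
    if result != data:
        raise ValueError("cannot write 1 over 0: erase the block first")
    return result

page = program(ERASED, 0b10110100)   # freshly erased page accepts any value
try:
    page = program(page, 0b11110100) # needs bit 6 to go 0 -> 1: not allowed
except ValueError as err:
    print(err)
page = program(ERASED, 0b11110100)   # after an erase, the write succeeds
```

Real devices erase whole blocks of pages at once, which is why updates are done out-of-place (next slide's topic).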
Garbage Collection
• Three basic operations: read (page), write (page), and erase (block)
• Out-of-place update: an update writes the new data to a clean page and
marks the old page invalid
• Garbage collection reclaims the space of invalid pages
– To get continuous clean space

Example (Block 1 and Block 2; pages are clean, valid, or invalid):
  1. Write page 0 and 1      → pages 0 and 1 valid in block 1
  2. Update page 0           → old page 0 becomes invalid; new copy written
  3. Write page 2 and 3
  4. Update page 2           → old page 2 becomes invalid; new copy written
  5. Garbage collection on block 1: copy its valid pages to block 2,
     then erase block 1
39
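The garbage-collection step can be sketched as follows (a toy model of the slide's example; names and page contents are my own):

```python
def garbage_collect(block, spare):
    """Relocate valid pages from `block` into clean pages of `spare`,
    then erase `block`. A block is a list of (state, data) pairs,
    where state is 'valid', 'invalid', or 'clean'."""
    for state, data in block:
        if state == "valid":         # only live data is copied out
            idx = next(i for i, (s, _) in enumerate(spare) if s == "clean")
            spare[idx] = ("valid", data)
    erased = [("clean", None)] * len(block)   # erase makes all pages clean
    return erased, spare

block1 = [("invalid", "P0_old"), ("valid", "P1"),
          ("invalid", "P2_old"), ("valid", "P3")]
block2 = [("valid", "P0"), ("valid", "P2"),
          ("clean", None), ("clean", None)]
block1, block2 = garbage_collect(block1, block2)
print(block2)   # all four pages now valid in block 2; block 1 is clean
```

The cost of the erase is amortized: one block erase reclaims every invalid page in it at once.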
SLC Threshold Voltage Distribution
• SLC: single level cell (1 bit per cell)

[Figure: two Vth probability distributions. The lower-voltage state reads
as 1, the higher-voltage state as 0; a single read reference voltage
(Vread) between them separates the two states.]
40
MLC Threshold Voltage Distribution
• MLC: multi-level cell (2 bits per cell)

[Figure: four Vth probability distributions, from lowest to highest
voltage, separated by three read reference voltages (V1, V2, V3).
From the lowest to the highest state the bits are:
  LSB: 1 1 0 0
  MSB: 1 0 0 1 ]
41
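Reading an MLC cell amounts to comparing its Vth against the three reference voltages (a sketch; the reference values are placeholders, the bit mapping follows the figure):

```python
def mlc_read(vth, v1=1.0, v2=2.0, v3=3.0):
    """Map a cell's threshold voltage to (LSB, MSB) using the state order
    from the figure: (1,1) (1,0) (0,0) (0,1), lowest -> highest Vth."""
    states = [(1, 1), (1, 0), (0, 0), (0, 1)]    # (LSB, MSB) per state
    level = sum(vth >= v for v in (v1, v2, v3))  # references Vth exceeds
    return states[level]

print(mlc_read(0.5), mlc_read(2.5))  # (1, 1) (0, 0)
```

Note that adjacent states differ in exactly one bit (a Gray code), so a small Vth error corrupts at most one of the two bits.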
Different Densities of Flash
• Single Level Cell (SLC): 1 bit per cell
– two Vth states (erased = 1, programmed = 0), one read reference voltage
• Multi Level Cell (MLC): 2 bits per cell
– four states, three read reference voltages (V1–V3)
• Triple Level Cell (TLC): 3 bits per cell
– eight states, seven read reference voltages (V1–V7)
• Quad Level Cell (QLC): 4 bits per cell
– sixteen states, fifteen read reference voltages
42
Comparison among Different Types
Types of flash      SLC    MLC    TLC    QLC
bits/cell           1      2      3      4
P/E cycles (K)      100    3      1      <1
Read time (µs)      30     50     75     110
Program time (µs)   300    600    1000   2000
Erase time (µs)     1500   3000   4500   5000
Price ($/GB)        0.37   0.33   0.2    0.12

Higher density → lower cost, but worse performance and worse lifetime

43
Exercise
• Suppose you have a $100 budget to build the memory hierarchy for a
computer. SRAM costs $10/MB, DRAM costs $1/MB, and flash costs $0.1/MB.
• There are 3 options:
– Option 1: $50 for SRAM, $49 for DRAM, and $1 for Flash
– Option 2: $10 for SRAM, $20 for DRAM, and $70 for Flash
– Option 3: $60 for SRAM, $5 for DRAM, and $35 for Flash
• Which option do you think is the best? Why?

44
Exercise
• In a flash memory, a high voltage (2.5 V–5 V) represents 0 and a low
voltage (0 V–2.5 V) represents 1. Suppose the current voltage in each cell
is as shown in the following figure. What voltages should we apply to
pages 1, 2, 3, and 4, respectively, in order to read the value stored in
the cell in the blue circle in the picture?

45
