How the Streamlined Architecture of NVM
Express Enables High Performance PCIe SSDs
Peter Onufryk
Director of Engineering
IDT
Flash Memory Summit 2012
Santa Clara, CA 1
The Need for a Large Number of Parallel Commands
[Figure: NVMe NAND flash controller attached to the host over PCIe x8 Gen3 (BW ~6 GBps), driving many flash channels with several NAND devices on each]
NAND with 8KB page:
TREAD = 75 µs → 109 MBps read BW
TPROG = 1 ms → 8 MBps write BW
Need (to fill the ~6 GBps link):
55 parallel 8KB reads
732 parallel 8KB writes
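The figures on this slide follow from dividing the link bandwidth by the bandwidth of a single NAND operation. A minimal sketch of that arithmetic, assuming 8KB means 8192 bytes and that results are rounded to the nearest integer (the constant and function names are illustrative, not from the slides):

```python
PAGE_BYTES = 8 * 1024   # 8KB NAND page (assumed to mean 8192 bytes)
T_READ_S = 75e-6        # TREAD = 75 µs
T_PROG_S = 1e-3         # TPROG = 1 ms
LINK_BW = 6e9           # ~6 GBps for PCIe x8 Gen3, as estimated on the slide

def per_op_bandwidth(page_bytes, latency_s):
    """Bytes per second delivered by one page operation at a time."""
    return page_bytes / latency_s

def parallel_ops_needed(link_bw, page_bytes, latency_s):
    """Concurrent page operations needed to fill the link (rounded)."""
    return round(link_bw / per_op_bandwidth(page_bytes, latency_s))

read_bw = per_op_bandwidth(PAGE_BYTES, T_READ_S)
write_bw = per_op_bandwidth(PAGE_BYTES, T_PROG_S)
reads_needed = parallel_ops_needed(LINK_BW, PAGE_BYTES, T_READ_S)
writes_needed = parallel_ops_needed(LINK_BW, PAGE_BYTES, T_PROG_S)

print(f"read BW  ~{read_bw / 1e6:.0f} MBps, need {reads_needed} parallel reads")
print(f"write BW ~{write_bw / 1e6:.0f} MBps, need {writes_needed} parallel writes")
```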
Scalable Queuing Interface
[Figure: Host with Controller Management and Core 0, Core 1, ... Core N. Controller Management owns the Admin Submission Queue and Admin Completion Queue. Each core owns one or more I/O Submission Queues and an I/O Completion Queue. Each completion queue has its own MSI-X interrupt from the NVMe Controller.]
• Enables NUMA optimized drivers
  One or more I/O submission queues, a completion queue, and an MSI-X interrupt per core
  High performance and low latency command issue
  No locking between cores
• Up to 2^32 outstanding commands
  Support for up to 64K I/O submission and completion queues
  Each queue supports up to 64K outstanding commands
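The per-core layout and the 2^32 figure can be sketched as below. This is an illustration only: `assign_queues` is a hypothetical helper, and numbering completion queues and MSI-X vectors by core is an assumption. Queue ID 0 is left for the admin queue pair.

```python
MAX_IO_QUEUES = 64 * 1024     # up to 64K I/O queues (from the slide)
MAX_QUEUE_DEPTH = 64 * 1024   # up to 64K commands per queue (from the slide)
max_outstanding = MAX_IO_QUEUES * MAX_QUEUE_DEPTH   # 64K * 64K = 2**32

def assign_queues(num_cores, sqs_per_core=1):
    """Hypothetical helper: give every core private queues and an interrupt.

    Because no queue is shared between cores, a core can issue and reap
    commands without taking a lock.
    """
    layout = {}
    next_sq = 1   # ID 0 is reserved for the admin queue pair
    for core in range(num_cores):
        cq_id = core + 1   # assumption: one completion queue per core
        sq_ids = list(range(next_sq, next_sq + sqs_per_core))
        next_sq += sqs_per_core
        layout[core] = {"sq_ids": sq_ids, "cq_id": cq_id, "msix_vector": cq_id}
    return layout

layout = assign_queues(num_cores=4, sqs_per_core=2)
for core, queues in layout.items():
    print(core, queues)
```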
Efficient Queuing Interface
[Figure: Queue process. The submission queue ring and the completion queue ring live in host memory. The NVMe controller exposes a Submission Queue Tail Doorbell and a Completion Queue Head Doorbell. Numbered arrows 1 to 8 correspond to the steps below.]
Command Submission
1. Host writes command to submission queue
2. Host writes updated submission queue tail pointer to doorbell
Command Processing
3. Controller fetches command
4. Controller processes command
Command Completion
5. Controller writes completion to completion queue
6. Controller generates MSI-X interrupt
7. Host processes completion
8. Host writes updated completion queue head pointer to doorbell
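The eight steps can be traced with a toy model of one queue pair. This is a sketch, not driver code: the class and attribute names are made up, queue-full checks are omitted, and the host finds new completions by looking at the controller's tail index for brevity (a real host would use the P bit in each completion entry).

```python
class QueuePair:
    """Toy model of one submission/completion queue pair."""

    def __init__(self, depth):
        self.depth = depth
        self.sq = [None] * depth     # submission ring in host memory
        self.cq = [None] * depth     # completion ring in host memory
        self.sq_tail = 0             # host producer index
        self.cq_head = 0             # host consumer index
        self.sq_tail_doorbell = 0    # controller register
        self.cq_head_doorbell = 0    # controller register
        self.ctrl_sq_head = 0        # controller fetch position
        self.ctrl_cq_tail = 0        # controller post position
        self.interrupts = 0

    def host_submit(self, cid, opcode):
        self.sq[self.sq_tail] = {"cid": cid, "opcode": opcode}    # step 1
        self.sq_tail = (self.sq_tail + 1) % self.depth
        self.sq_tail_doorbell = self.sq_tail                      # step 2

    def controller_run(self):
        while self.ctrl_sq_head != self.sq_tail_doorbell:
            cmd = self.sq[self.ctrl_sq_head]                      # step 3
            self.ctrl_sq_head = (self.ctrl_sq_head + 1) % self.depth
            status = 0                                            # step 4 (pretend success)
            self.cq[self.ctrl_cq_tail] = {                        # step 5
                "cid": cmd["cid"], "status": status,
                "sq_head": self.ctrl_sq_head,
            }
            self.ctrl_cq_tail = (self.ctrl_cq_tail + 1) % self.depth
            self.interrupts += 1                                  # step 6

    def host_reap(self):
        done = []
        while self.cq_head != self.ctrl_cq_tail:
            done.append(self.cq[self.cq_head]["cid"])             # step 7
            self.cq_head = (self.cq_head + 1) % self.depth
        self.cq_head_doorbell = self.cq_head                      # step 8
        return done

qp = QueuePair(depth=4)
qp.host_submit(cid=10, opcode=0x02)
qp.host_submit(cid=11, opcode=0x01)
qp.controller_run()
completed = qp.host_reap()
print(completed, qp.sq_tail_doorbell, qp.cq_head_doorbell)
```

Note that the host touches a controller register only twice per batch (steps 2 and 8); everything else is ordinary memory traffic.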
NVMe Command Arbitration
[Figure: Command arbitration tree]
Admin SQ (ASQ): high strict priority
Urgent SQs: round robin (RR) among the queues, medium strict priority
High, medium, and low priority SQs: RR among the queues of each class, then weighted round robin (WRR) across the three classes using the high, medium, and low WRR priorities, low strict priority
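One way to read the arbitration tree is as three strict priority levels, with round robin inside each group and weighted round robin across the high, medium, and low classes. The sketch below assumes a simple credit scheme for the weights (each class may issue up to its weight per round, and weights are at least 1); the slide does not say how a controller implements WRR, and all names here are illustrative.

```python
from collections import deque

class RoundRobin:
    """Round robin over a group of submission queues."""

    def __init__(self, queues):
        self.queues = list(queues)
        self.next = 0

    def pop(self):
        for i in range(len(self.queues)):
            idx = (self.next + i) % len(self.queues)
            if self.queues[idx]:
                self.next = (idx + 1) % len(self.queues)
                return self.queues[idx].popleft()
        return None   # every queue in the group is empty

def arbitrate(admin, urgent, classes, weights):
    """Drain all queues and return commands in the order they are selected."""
    order = []
    urgent_rr = RoundRobin(urgent)
    class_rr = {name: RoundRobin(qs) for name, qs in classes.items()}
    credits = dict(weights)
    while True:
        if admin:                               # highest strict priority
            order.append(admin.popleft())
            continue
        cmd = urgent_rr.pop()                   # next strict priority
        if cmd is not None:
            order.append(cmd)
            continue
        for name in ("high", "medium", "low"):  # lowest strict priority: WRR
            if credits[name] > 0:
                cmd = class_rr[name].pop()
                if cmd is not None:
                    credits[name] -= 1
                    order.append(cmd)
                    break
        else:
            if any(q for qs in classes.values() for q in qs):
                credits = dict(weights)         # start a new WRR round
                continue
            return order

order = arbitrate(
    admin=deque(["A0"]),
    urgent=[deque(["U0", "U1"]), deque(["U2"])],
    classes={"high": [deque(["H0", "H1", "H2"])],
             "medium": [deque(["M0", "M1"])],
             "low": [deque(["L0"])]},
    weights={"high": 2, "medium": 1, "low": 1},
)
print(order)
```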
Fixed Sized Commands & Completions
Submission Queue Entry (64B)
DWord    Fields (bit 31 ... bit 0)
0        Command Identifier | FUSE | Opcode
1        Namespace Identifier
2-3      (no field labeled)
4-5      Metadata Pointer
6-7      PRP Entry 1
8-9      PRP Entry 2
10-15    (no field labeled)
Completion Queue Entry (16B)
DWord    Fields (bit 31 ... bit 0)
0-1      (no field labeled)
2        SQ Identifier | SQ Head Pointer
3        Status Field | P | Command Identifier
Legend:
Standard Fields Used By All Commands
Standard Fields Optionally Used By Commands
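A short sketch of how the two fixed layouts can be packed and parsed. The field order comes from the slide. The exact bit positions (Opcode in bits 7:0, FUSE in bits 9:8, Command Identifier in bits 31:16 of DWord 0; P in bit 16 and Status Field in bits 31:17 of completion DWord 3) and little-endian byte order are assumptions taken from my reading of the NVMe specification and should be checked against it. The function names are hypothetical.

```python
def pack_sqe(opcode, cid, nsid, prp1, prp2=0, mptr=0, fuse=0,
             cdw10_15=(0, 0, 0, 0, 0, 0)):
    """Build a 64-byte submission queue entry (bit positions are assumed)."""
    dw0 = (opcode & 0xFF) | ((fuse & 0x3) << 8) | ((cid & 0xFFFF) << 16)
    fields = [
        (dw0, 4),    # DWord 0: Command Identifier | FUSE | Opcode
        (nsid, 4),   # DWord 1: Namespace Identifier
        (0, 8),      # DWords 2-3: not labeled on the slide
        (mptr, 8),   # DWords 4-5: Metadata Pointer
        (prp1, 8),   # DWords 6-7: PRP Entry 1
        (prp2, 8),   # DWords 8-9: PRP Entry 2
    ]
    fields += [(dw, 4) for dw in cdw10_15]   # DWords 10-15
    return b"".join(value.to_bytes(size, "little") for value, size in fields)

def unpack_cqe(raw):
    """Decode DWords 2 and 3 of a 16-byte completion queue entry."""
    dw2 = int.from_bytes(raw[8:12], "little")
    dw3 = int.from_bytes(raw[12:16], "little")
    return {
        "sq_head": dw2 & 0xFFFF,
        "sq_id": dw2 >> 16,
        "cid": dw3 & 0xFFFF,
        "phase": (dw3 >> 16) & 1,
        "status": dw3 >> 17,
    }

sqe = pack_sqe(opcode=0x02, cid=0x1234, nsid=1, prp1=0x1000)
cqe_raw = b"".join(dw.to_bytes(4, "little")
                   for dw in (0, 0, (3 << 16) | 7, (1 << 16) | 0x1234))
cqe = unpack_cqe(cqe_raw)
print(len(sqe), cqe)
```

Because every entry has the same size and the same field offsets, neither function needs a length field or a parsing loop.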
Benefit of Fixed Sized Commands
[Figure: Submission queues in PCIe memory feed the NVMe controller front end: candidate queue selector, arbiter and element fetch, command issue logic, then command processing / firmware. With fixed sized commands, each queue element holds exactly one command (Cmd 0 ... Cmd 7 in queue elements 0 ... 7). With variable sized commands, commands (Cmd 0 ... Cmd 3) occupy differing numbers of queue elements and pass through element buffers (0 ... N).]
Fixed Sized Commands Simplify Command Parsing, Arbitration, and Error Handling
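The parsing benefit can be shown with address arithmetic. With fixed 64-byte entries the controller can compute the location of any queue slot directly; with variable sized commands the start of each command depends on the length of every command before it. A minimal sketch with illustrative names:

```python
SQE_BYTES = 64   # fixed submission queue entry size from the slides

def fixed_slot_address(queue_base, slot):
    """Fixed sized entries: slot N always starts at base + N * 64."""
    return queue_base + slot * SQE_BYTES

def variable_command_offsets(command_lengths):
    """Variable sized commands: each start depends on all earlier lengths."""
    offsets, position = [], 0
    for length in command_lengths:
        offsets.append(position)
        position += length
    return offsets

print(hex(fixed_slot_address(0x10000, 3)))
print(variable_command_offsets([64, 128, 32]))
```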
Simple Optimized Command Set
Admin Commands
  Create I/O Submission Queue
  Delete I/O Submission Queue
  Create I/O Completion Queue
  Delete I/O Completion Queue
  Get Log Page
  Identify
  Abort
  Set Features
  Get Features
  Asynchronous Event Request
  Firmware Activate (optional)
  Firmware Image Download (optional)
NVM Admin Commands
  Format NVM (optional)
  Security Send (optional)
  Security Receive (optional)
NVM I/O Commands
  Read
  Write
  Flush
  Write Uncorrectable (optional)
  Compare (optional)
  Dataset Management (optional)
10 Required Admin Commands
3 Required NVM I/O Commands
NVM Creates New Challenges and Opportunities
[Figure: NVM controller with tiered storage. An NVMe NAND flash controller attached over PCIe runs a flash translation layer (logical to physical mapping, wear leveling) that maps the storage logical block address range onto physical NAND flash pages. Tiers shown: DRAM, SLC NAND flash, MLC (2-bit) NAND flash, TLC, and other NVM (MRAM, PCM …).]
NVMe Data Set Management Hints
[Figure: Traditional storage command set: host commands to the controller carry only the operation (Write, Read), an LBA, and a number of logical blocks (Num LB). NVMe command set with Data Set Management (DSM): Write and Read commands carry LBA, Num LB, and a DSM field, and a separate DSM command is also sent to the controller.]
NVMe Data Set Management Range Attributes
[Figure: Write and Read commands each carry LBA, Num LB, and DSM fields. A DSM command carries 1 to 256 LBA ranges, each with its own DSM attributes.]
• Overall DSM Command
  Deallocate
  Integral write dataset
  Integral read dataset
• Per DSM Range
  Access size (in logical blocks)
  Written in near future
  Sequential read
  Sequential write
  Access latency (longer, typical, small)
  Access frequency
    o Typical read and write
    o Infrequent read and write
    o Infrequent write, frequent read
    o Frequent write, infrequent read
    o Frequent read and write
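The attribute list above can be modeled as plain data. This sketch only captures which hints exist and the 1 to 256 range limit from the slide; the class and field names are hypothetical, and the on-the-wire bit encoding is deliberately left out.

```python
from dataclasses import dataclass
from enum import Enum

class AccessFrequency(Enum):
    TYPICAL = "typical read and write"
    INFREQUENT = "infrequent read and write"
    INFREQUENT_WRITE_FREQUENT_READ = "infrequent write, frequent read"
    FREQUENT_WRITE_INFREQUENT_READ = "frequent write, infrequent read"
    FREQUENT = "frequent read and write"

class AccessLatency(Enum):
    LONGER = "longer"
    TYPICAL = "typical"
    SMALL = "small"

@dataclass
class DsmRange:
    """Hints attached to one LBA range (field names are illustrative)."""
    start_lba: int
    num_blocks: int
    access_size_blocks: int = 0
    written_in_near_future: bool = False
    sequential_read: bool = False
    sequential_write: bool = False
    access_latency: AccessLatency = AccessLatency.TYPICAL
    access_frequency: AccessFrequency = AccessFrequency.TYPICAL

@dataclass
class DsmCommand:
    """Attributes that apply to the DSM command as a whole, plus its ranges."""
    ranges: list
    deallocate: bool = False
    integral_write_dataset: bool = False
    integral_read_dataset: bool = False

    def __post_init__(self):
        if not 1 <= len(self.ranges) <= 256:
            raise ValueError("a DSM command carries 1 to 256 ranges")

hint = DsmCommand(ranges=[
    DsmRange(start_lba=0, num_blocks=2048, sequential_read=True,
             access_frequency=AccessFrequency.INFREQUENT_WRITE_FREQUENT_READ),
])
print(hint)
```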
Out-Of-Order Data
[Figure: A host Read(7-0) arrives over PCIe at the NVMe NAND flash controller. The requested data units are spread across NAND devices, some of which are busy with other reads or an erase, so the units appear in the controller buffer out of order: D7 D0 D6 D5 D1 D2 D4 D3.]
Possible Sources of Out-Of-Order Data
• NAND or page TRead variation
• Target/LUN conflict
  o Operations associated with same command (e.g., multiple reads to NAND)
  o Different operation (e.g., previously issued program or erase)
• NAND error handling
  o ECC correction time variation, read-retry, …
• Flash channel conflict
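Traditional Scatter Gather List (SGL)
[Figure: Host read with a traditional SGL. Host physical memory holds scattered chunks C0 ... C5. The SGL is made of segments of (address, length) entries: C0 a, C1 b, C2 c, C3 d in the first segment and C4 e, C5 f in the second. The controller delivers NAND data D0 ... D7 into the chunks.]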
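With an SGL, the chunk lengths (a ... f in the figure) are arbitrary, so finding where a given byte of the transfer belongs means walking the list and summing lengths. A minimal sketch of that lookup, with an SGL modeled as a list of (address, length) tuples; the function name and the example addresses are made up:

```python
def sgl_lookup(sgl, offset):
    """Return (host address, bytes left in that chunk) for a transfer offset.

    The loop is the point: the cost grows with the number of entries that
    precede the target, which is awkward when data arrives out of order.
    """
    for address, length in sgl:
        if offset < length:
            return address + offset, length - offset
        offset -= length
    raise ValueError("offset is beyond the end of the SGL")

sgl = [(0x1000, 100), (0x5000, 300), (0x2000, 50)]
print(sgl_lookup(sgl, 0))
print(sgl_lookup(sgl, 420))
```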
I/O Operation and Host Memory
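[Figure: A process calls read(bufPtr, numBytes). The buffer starts at bufPtr, at some page offset inside the first page, and spans numBytes across contiguous pages C0 ... C8 of process virtual memory. Those pages map to scattered pages C0 ... C8 in host physical memory.]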
NVMe Physical Region Pages (PRPs) Read
[Figure: The same read(bufPtr, numBytes), described with PRPs. The buffer spans pages C0 ... C8 of process virtual memory, which map to scattered pages in host physical memory. The PRP entries are shown as three tables of four (page address, offset) entries: C0 with an offset, then C1, C2, C3 without one; C4, C5, C6, C7; and C8 followed by unused entries. NAND data units D0 ... D8 line up one to one with pages C0 ... C8.]
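Because every PRP entry after the first describes exactly one page, the destination of any byte of the transfer can be computed by division rather than by walking a list. The sketch below assumes a 4096-byte memory page and models the page table as a dictionary from virtual page number to page-aligned physical address; both are illustrative assumptions, as are the function names.

```python
PAGE_SIZE = 4096   # assumed memory page size

def build_prps(virt_addr, num_bytes, page_table):
    """One entry per page touched; only the first entry keeps an offset."""
    first_vpn = virt_addr // PAGE_SIZE
    last_vpn = (virt_addr + num_bytes - 1) // PAGE_SIZE
    prps = [page_table[vpn] for vpn in range(first_vpn, last_vpn + 1)]
    prps[0] += virt_addr % PAGE_SIZE
    return prps

def prp_lookup(prps, byte_offset):
    """Host physical address for a transfer offset, found in constant time."""
    first_offset = prps[0] % PAGE_SIZE
    index, within_page = divmod(first_offset + byte_offset, PAGE_SIZE)
    page_base = prps[index] - (first_offset if index == 0 else 0)
    return page_base + within_page

page_table = {5: 0x9000, 6: 0x3000, 7: 0xF000}   # made-up mapping
prps = build_prps(virt_addr=5 * PAGE_SIZE + 0x100, num_bytes=8192,
                  page_table=page_table)
print([hex(p) for p in prps])
print(hex(prp_lookup(prps, 0xF00)))
```

A data unit that arrives late or early can be written straight to its page, which is what makes out-of-order delivery simple.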
Summary
• Scalable and Efficient Queuing Interface
  Low overhead command issue and completion
  Parallel command execution
• Fixed Sized Commands
  Straightforward command fetch, parsing, and arbitration
• Simple Command Set (3 required I/O commands)
  Fast command processing
• Data Set Management Hints
  Controller optimization of data placement
• Physical Region Pages (PRPs)
  Simplified out-of-order data delivery