OFA Intro RDMA 2011-08-23
Verbs
An API used by an application to control and conduct an RDMA operation.
[Figure: WRITE and READ operations between requester and responder]
- A compute platform consists of
  - a CPU/memory complex and
  - an I/O subsystem
- I/O resources are owned by the OS
- It provides I/O services to the supported applications
- Typically, applications demand three types of I/O - storage, networking and IPC
- Usually, this means different wires, and different I/O protocols…
… proposition, RDMA seeks to eliminate the OS as part of the I/O subsystem
• Storage I/O
– block storage, e.g. SCSI (SAS, SATA, FC…)
• typically a kernel app doesn’t impact user apps
– file or object I/O, e.g. NFS, Lustre
• Sockets-based I/O
– both IP networking and IPC
– both are based on sockets
Applications exist in virtual memory, but…
[Figure: two apps on separate compute platforms exchanging messages through a Messaging Service]
- Gets more interesting when the two processes reside in disjoint physical spaces
- This requires a networking construct to transport the messages
[Figure: message layout (header, payload, end) and the send and receive buffers of the two apps]
Unfortunately…
- Sockets is synchronous. The application waits for buffer copies on both ends
Messages can range in size from very small up to very, very large.
The channel handles the transmission and delivery of the whole message.
Remember this?
- Applications exist in virtual space
- Device adapters (NICs, HBAs…) exist in physical space
- Each compute platform occupies a disjoint physical address space
Therefore, three address translations are required end-to-end:
- a virtual-to-physical address translation
- a physical-to-physical ‘translation’, and
- a physical-to-virtual translation
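The three translations can be pictured as three table lookups in a row. The following is a minimal sketch only: the page size, the table contents and the names `translate`, `sender_v2p`, `wire_p2p` and `receiver_p2v` are all hypothetical and chosen for illustration, not taken from any real OS or adapter.

```python
PAGE_SIZE = 4096  # assumed page size for this illustration


def translate(page_table, address):
    """Map an address through a page table (dict: page number -> page number)."""
    page, offset = divmod(address, PAGE_SIZE)
    return page_table[page] * PAGE_SIZE + offset


# Hypothetical tables, one per translation step.
sender_v2p = {2: 7}     # sender virtual page 2 lives in sender physical page 7
wire_p2p = {7: 5}       # sender physical page 7 lands in receiver physical page 5
receiver_p2v = {5: 9}   # receiver physical page 5 is receiver virtual page 9


def end_to_end(sender_va):
    """Follow one address through all three translations."""
    sender_pa = translate(sender_v2p, sender_va)        # virtual-to-physical
    receiver_pa = translate(wire_p2p, sender_pa)        # physical-to-physical
    receiver_va = translate(receiver_p2v, receiver_pa)  # physical-to-virtual
    return sender_pa, receiver_pa, receiver_va
```

The offset inside the page is untouched at every step; only the page number changes.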
www.openfabrics.org © 2011 OpenFabrics Alliance, Inc. 08/23/2011 46
Sockets-based address translation
[Figure: on each side, an app buffer in virtual space, an OS buffer in physical space, and a NIC]
Channel adapters contain address translation tables.
An I/O Channel connects virtual address spaces. The virtual address spaces
can exist in disjoint physical address spaces.
Remote Direct Memory Access means that an application can directly read
and write remote virtual memory
[Figure: the RDMA Messaging Service set against the Application, Session, Transport, Network, Link and Phy layers]
The messaging service contains a full network stack.
[Figure: sockets stack: Transport (TCP), Network, Link, Phy]
- Sending app uses the sockets API to post a message to the OS’ I/O service
- Using the I/O subsystem, the OS delivers bytes to the receiver.
- The OS uses the sockets API to copy a stream of bytes to the receiver
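As a rough illustration of the sockets path described above, the sketch below pushes a message through a local socket pair using Python's standard `socket` module. The socket pair is only a stand-in for two separate platforms, and the message text is made up; the point is that the sender hands bytes to the OS and the receiver gets a byte stream back that it must reassemble itself.

```python
import socket

# A connected pair of local sockets stands in for sender and receiver.
sender, receiver = socket.socketpair()

message = b"hello over a byte stream"

# The sending app hands the bytes to the OS' I/O service.
sender.sendall(message)

# The receiver sees a stream of bytes, not a message, so it loops
# until it has collected as many bytes as it expects.
received = b""
while len(received) < len(message):
    chunk = receiver.recv(4096)
    if not chunk:  # peer closed the connection early
        break
    received += chunk

sender.close()
receiver.close()
```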
RDMA architecture – 3 components
1. Software Transport Interface
2. RDMA Protocols
3. Network Transport Service
[Figure: the three components set against the layers Application, Session, Transport, Network, Link, Phy]
Let’s agree not to be too literal about the functional definitions of each layer.
s/w xport interface
• Virtual channel adapter architecture
- a queued interface - ordered operations
- queues are mapped into application virtual space
- a message oriented architecture
- applications put requests on the queue
- app pulls completions off the CQ
Work queues drive the channel interface, CQs signal completed work
This simple observation accounts for much of the RDMA value proposition:
A Work Request (WR) is a data structure that describes a piece of work to be completed:
- a message to be sent,
- a message to be received…
An application posts a WR to a queue.
You will learn a great deal about these data structures in the next part of the course.
Once posted to a work queue, a WR becomes an element of that queue, called a Work Queue Element (WQE – pronounced wookie).
Similarly, the elements on a completion queue are called CQEs – pronounced cookie.
OFED deals strictly in Work Requests…you won’t find references to WQEs in the
code. (But you will find them in the architecture, if you go looking).
- App posts a WR to the queue and returns immediately.
It does not wait for a buffer copy.
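The post-and-return behaviour can be modelled in a few lines. This is a toy model under stated assumptions, not the verbs API: `WorkRequest`, `QueuePairModel`, `adapter_progress` and the `"success"` status string are hypothetical names invented for this sketch, and the channel adapter is reduced to a method that the caller invokes by hand.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class WorkRequest:
    wr_id: int          # chosen by the application to recognise the completion
    opcode: str         # e.g. "SEND"
    buffer: bytearray   # the message buffer the work refers to


class QueuePairModel:
    """Toy model: one send queue plus one completion queue."""

    def __init__(self):
        self.send_queue = deque()
        self.completion_queue = deque()

    def post_send(self, wr):
        # The WR becomes a WQE on the queue; the call returns immediately
        # and no buffer is copied.
        self.send_queue.append(wr)

    def adapter_progress(self):
        # Stand-in for the channel adapter: execute the oldest WQE and
        # signal the completed work with a CQE.
        if self.send_queue:
            wqe = self.send_queue.popleft()
            self.completion_queue.append((wqe.wr_id, "success"))

    def poll_cq(self):
        # The app pulls completions off the CQ; None means nothing has finished yet.
        return self.completion_queue.popleft() if self.completion_queue else None
```

In this model a call to `poll_cq()` right after `post_send()` returns `None`; the completion only shows up once the adapter has processed the WQE.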
• To send a message, the user posts a Send WR to the send queue.
You will learn a great deal more about these verbs during the next section.
RDMA protocols
• Memory-to-memory transfer protocols
Every SEND/RECEIVE:
- consumes a WQE from the Requester’s SEND queue,
- consumes a WQE from the Responder’s RECEIVE queue.
- Send WQE defines source buffer
- Receive WQE defines destination buffer
- Receive buffers must be pre-posted
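The rules above can be sketched as a single function that consumes one WQE from each side. This is an illustrative model only: `deliver_send` and its status strings are hypothetical, and what the model does when no receive buffer has been pre-posted (it reports the condition and leaves the send WQE queued) is an assumption of the sketch, not a description of real adapter behaviour.

```python
from collections import deque


def deliver_send(send_queue, recv_queue):
    """Move one message from the requester to the responder.

    send_queue holds source buffers (bytes); recv_queue holds
    pre-posted destination buffers (bytearray).
    """
    if not send_queue:
        return "nothing to send"
    if not recv_queue:
        # No pre-posted receive buffer: the message cannot be placed.
        return "receiver not ready"
    if len(send_queue[0]) > len(recv_queue[0]):
        return "receive buffer too small"
    source = send_queue.popleft()        # consumes a WQE from the SEND queue
    destination = recv_queue.popleft()   # consumes a WQE from the RECEIVE queue
    destination[:len(source)] = source   # data lands in the destination buffer
    return "delivered"
```

Calling it with an empty receive queue shows why receive buffers have to be posted before the matching send arrives.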
[Figure: requester and responder buffers; a buffer descriptor carries the VA and R-KEY]
The means for passing control of the buffer back and forth is not specified in the
RDMA architecture; it is defined by the upper layer protocol
- RDMA READ is exactly the same as RDMA WRITE, except for data transfer direction
(and yes, the RDMA READ REQUEST is posted to the SEND queue)
• A single ended operation – reads and writes are opaque to the responder
while they are happening
• The requester accesses the responder’s memory with the help of the
responder side channel hardware
Operations on the SEND queue are completed in the order in which they were
posted.
[Figure: send queue holding 1 SEND, 2 RDMA RD, 3 SEND, delivered to the responder in posted order]
Transport
• Transport service
The transport layer provides (at least) two services:
- Reliable, Connection-oriented (RC). TCP is an example of a reliable connected service.
- Unreliable Datagram (UD). UDP is an example of an unreliable datagram service.
How the service is provided is a function of the underlying wire protocol.
A ‘lossless wire’ (i.e. one that does not drop packets) is NOT the same thing as a
reliable transport.
A connection is created between two QPs when they have exchanged a set of
attributes, and have agreed that they are connected.
This guarantees isolation and protection of the channel between the two
applications.
* Actually there are a couple more operations, but they are not covered here.
Verbs introduction
The Queues (work queues plus completion queues) represent the inputs and
outputs to the channel.
Thus the Verbs plus the Queues together represent the Channel Interface
1. Open an HCA
2. Create a Protection Domain (what the heck is that?)
3. Create a Queue Pair
4. Create (a) Completion Queue(s)
5. Register memory regions (coming up shortly)
6. Post work requests
7. Wait for completions
Part 2 of this course will take us through the verbs in some detail.
You will hear people talk about Verbs, sometimes in reference to the abstract
semantic description and sometimes in reference to the API.
It is rarely confusing.
Transport independence
• The RDMA Consortium spiffed up the Verbs spec while defining iWARP
• All three (iWARP, InfiniBand and RoCE) present the same interface (APIs) and semantic behavior
This means the application can manipulate the channel directly - no need
for a context switch to a privileged entity
- L-KEY is passed to the local CA as part of a WR
- R-KEY controls remote buffer access (RDMA ops)
[Figure: requester and responder; a memory region is identified by its VA and R-KEY, with an L-KEY used locally]
- A PD binds together:
- an application
- one or more memory regions, and
- one or more QPs
The PD ensures that only QPs that are registered to that virtual memory can access it.
For connected service, the combination of two PDs plus the connection between QPs creates a protected, isolated channel between application virtual memory
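One way to picture the protection check is as a lookup the responder-side adapter performs before touching memory. The sketch below is a simplified model under stated assumptions: `MemoryRegion`, `ResponderAdapter`, `register` and `rdma_write` are hypothetical names, PDs are plain strings, and raising `PermissionError` is just this model's way of refusing an access.

```python
from dataclasses import dataclass


@dataclass
class MemoryRegion:
    pd: str            # protection domain the region was registered in
    base_va: int       # virtual address where the region starts
    data: bytearray    # the registered memory itself
    r_key: int         # key a remote requester must present


class ResponderAdapter:
    """Toy responder-side channel adapter holding registered regions."""

    def __init__(self):
        self.regions = {}  # r_key -> MemoryRegion

    def register(self, pd, base_va, length, r_key):
        region = MemoryRegion(pd, base_va, bytearray(length), r_key)
        self.regions[r_key] = region
        return region

    def rdma_write(self, qp_pd, r_key, va, payload):
        # qp_pd is the PD of the local QP on which the request arrived.
        region = self.regions.get(r_key)
        if region is None:
            raise PermissionError("unknown R-KEY")
        if region.pd != qp_pd:
            raise PermissionError("QP and memory region belong to different PDs")
        offset = va - region.base_va
        if offset < 0 or offset + len(payload) > len(region.data):
            raise PermissionError("access outside the registered region")
        region.data[offset:offset + len(payload)] = payload
```

A write that arrives on a QP from another PD, or with the wrong R-KEY, is refused even though the address itself is valid.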
• Connected service:
– the receiver validates incoming packets based on the connection ID
– that creates a protected QP-to-QP connection…
– …but it doesn’t create end-to-end protection!
– we’ll need something more to complete the chain of protection
[Figure: Q-Keys; apps and their QPs grouped into PDs]
[Figure: the RDMA messaging service (VERBS as the S/W transport i/f, RDMA protocols, Transport) set against the layers Application, Session, Transport, Network, Link, Phy]
Upper Layer Protocols (ULPs) allow a standard application to execute over an RDMA network
[Figure: ULPs such as VNIC, SDP and SRP sit above user verbs and kernel verbs, the H/W drivers and the hardware]
Main focus of this course…
you already know what the black box can do.
[Figure: OpenFabrics stack: hardware (InfiniBand HCA, iWARP R-NIC), hardware specific drivers (provider), OpenFabrics kernel level verbs / API, mid-layer services (SA client, MAD, SMA, connection managers), kernel bypass]
Key:
- VNIC: Virtual NIC
- UDAPL: User Direct Access Programming Lib
- HCA: Host Channel Adapter
- R-NIC: RDMA NIC
Introduction to wire protocols
[Figure: IB network, IP network, Enet fabric]
Wire level protocols
verbs (common to all three)

             iWARP       InfiniBand   RoCE
adapter      RNIC        HCA          ‘NIC’
             RDMAP
             DDP
             MPA         Xport i/f    Xport i/f
             TCP         IB Xport     IB Xport
             IP          IB N/W       IB N/W
             Enet link   IB Link      Enet link
ref model                                    iWARP             InfiniBand   RoCE
(OSI, sort of…)
Session      s/w transport interface
             RDMA protocols                  RDMAP, DDP, MPA   Xport i/f    Xport i/f
Transport    Transport                       TCP               IB Xport     IB Xport
Network                                      IP                IB N/W       IB N/W
Link                                         Enet link         IB Link      Enet link
Phy                                          Enet phy          IB phy       Enet phy
[Figure: sockets send path: the app's send bfr in user space is copied to an anon bfr in the kernel, with a user/kernel switch]