
Large-Scale Data & Systems Group

RDMA Tutorial

Jana Giceva
Large-Scale Data & Systems (LSDS) Group
Imperial College London
http://lsds.doc.ic.ac.uk
<[email protected]>

What is RDMA?

Remote Direct Memory Access

RDMA is a hardware mechanism through which the network card (NIC) can
directly access all or parts of the main memory of a remote node without
involving the processor.



RDMA properties

Remote – data is transferred between nodes in a network
Direct – no CPU or OS kernel is involved in the data transfer
Memory – data is transferred between two applications and their virtual address spaces
Access – support for send, receive, read, write, and atomic operations

Main highlights of RDMA

▪ Zero-copy data transfer
▪ Bypasses the CPU
▪ Bypasses the OS kernel
▪ Message-based transactions



Benefits of using RDMA

✓ High throughput (bandwidth)

✓ Low end-to-end latencies

✓ Low CPU utilization
  One-sided RDMA operations do not involve the remote CPU at all.

✓ Low memory bus contention
  No data is copied between user space and the kernel, or vice versa.

✓ Asynchronous operations
  Great for overlapping communication and computation.



Traditional TCP/IP sockets vs RDMA

[Figure: side-by-side comparison of the two network stacks. With traditional
sockets, application data crosses from user space into the kernel and passes
through the socket layer, TCP/UDP and IP processing, and the Ethernet driver
before reaching the network card. With RDMA, the application talks to the
Verbs API (uverbs): a command channel goes through the kernel module only for
setup, while the data channel bypasses the kernel and goes directly to the
RDMA-enabled channel adapter (network card).
src: InfiniBand Trade Association: Introduction to IB for end users]
Setting up the RDMA data channels

Buffers need to be registered with the network card before they can be used.

During the registration process:

▪ Pin the memory so that it cannot be swapped out by the operating system.
▪ Store the address-translation information in the NIC.
▪ Set permissions for the memory region.
▪ Return a local and a remote key, which the adapters use when executing
  the RDMA operations.
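
As a concrete illustration, here is a minimal sketch of the registration step
using the C verbs API (libibverbs). The helper name register_buffer is
hypothetical; the protection domain pd is assumed to have been allocated with
ibv_alloc_pd, and error handling is omitted.

#include <infiniband/verbs.h>
#include <stdlib.h>

struct ibv_mr *register_buffer(struct ibv_pd *pd, size_t size)
{
    void *buf = malloc(size);
    if (!buf)
        return NULL;

    /* ibv_reg_mr pins the pages, installs the address translation in the
     * NIC, and sets the access permissions for the region. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, size,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);

    /* On success, mr->lkey is the local key and mr->rkey the remote key
     * used by the adapters when executing RDMA operations. */
    return mr;
}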



Work Queues

RDMA communication is based on a set of three queues:

▪ Send
▪ Receive
▪ Completion

The send and receive queues are always created together, as a Queue Pair (QP);
they schedule the work to be done. A completion queue is used to notify the
application when the work has been completed.
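
A minimal sketch of creating these queues with libibverbs follows; ctx is
assumed to be an open device context, pd a protection domain, and the queue
depths are arbitrary illustrative values.

struct ibv_cq *cq = ibv_create_cq(ctx, /*cqe=*/16, NULL, NULL, 0);

struct ibv_qp_init_attr attr = {
    .send_cq = cq,           /* where send completions are reported */
    .recv_cq = cq,           /* where receive completions are reported */
    .cap     = {
        .max_send_wr  = 16,  /* outstanding send work requests */
        .max_recv_wr  = 16,  /* outstanding receive work requests */
        .max_send_sge = 1,
        .max_recv_sge = 1,
    },
    .qp_type = IBV_QPT_RC,   /* reliable connection transport */
};

/* The send and receive queues come into existence together, as a QP. */
struct ibv_qp *qp = ibv_create_qp(pd, &attr);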



Queue Elements

Applications issue a job using a work request, or work queue element (WQE).

A work request is a small struct with a pointer to a buffer:

▪ In a send queue, it points to the message to be sent.
▪ In a receive queue, it shows where an incoming message should be placed.

Once a work request has been completed, the adapter creates a completion
queue element (CQE) and enqueues it in the completion queue.
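
A sketch of what this looks like in libibverbs follows; mr and cq are the
memory region and completion queue from the previous steps, and the wr_id
value is arbitrary.

struct ibv_sge sge = {
    .addr   = (uintptr_t)mr->addr,   /* the buffer the WQE points at */
    .length = mr->length,
    .lkey   = mr->lkey,
};

struct ibv_send_wr wr = {
    .wr_id      = 1,                 /* echoed back in the completion */
    .sg_list    = &sge,
    .num_sge    = 1,
    .opcode     = IBV_WR_SEND,
    .send_flags = IBV_SEND_SIGNALED, /* request a CQE on completion */
};

/* After the work request completes, the adapter enqueues a CQE: */
struct ibv_wc wc;
while (ibv_poll_cq(cq, 1, &wc) == 0)
    ;                                /* busy-poll until a CQE arrives */
/* wc.status == IBV_WC_SUCCESS means the work request completed OK. */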



RDMA’s network stack overview

Application
▪ Posts work requests to a queue
▪ Each work request is a message, a unit of work
▪ Verbs interface – allows the application to request services

RDMA adapter driver
▪ Maintains the work queues
▪ Manages address translation
▪ Provides completion and event mechanisms

RDMA-supporting NIC and network protocols
▪ Transport layer: reliable/unreliable, datagram, etc.
▪ Packetizes messages
▪ Implements the RDMA protocol
▪ Implements end-to-end reliability and assures reliable delivery

src: InfiniBand Trade Association: Introduction to IB for end users



Network protocols supporting RDMA

▪ InfiniBand (IB)
• QDR 4x – 32 Gbps
• FDR 4x – 54 Gbps
• EDR 4x – 100 Gbps

▪ RoCE – RDMA over Converged Ethernet
• 10 Gbps
• 40 Gbps

▪ iWARP – internet Wide Area RDMA Protocol



RDMA is just a mechanism

RDMA itself does not specify the semantics of a data transfer.

RDMA networks support two types of memory access models:

▪ One-sided – RDMA read and write + atomic operations
▪ Two-sided – RDMA send and receive



RDMA Send and Receive

Traditional message passing, where both the source and the destination
processes are actively involved in the communication.

Both need to have created their queues:

▪ A queue pair of a send and a receive queue.
▪ A completion queue for the queue pair.

The sender’s work request has a pointer to the buffer it wants to send;
the WQE is enqueued in the send queue.

The receiver’s work request has a pointer to an empty buffer for receiving
the message; the WQE is enqueued in the receive queue.
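
A minimal sketch of both sides in libibverbs follows; qp and mr are the queue
pair and registered memory region from the earlier steps, and error handling
is omitted.

/* Receiver: post an empty, registered buffer to the receive queue. */
struct ibv_sge rsge = {
    .addr = (uintptr_t)mr->addr, .length = mr->length, .lkey = mr->lkey,
};
struct ibv_recv_wr rwr = { .wr_id = 2, .sg_list = &rsge, .num_sge = 1 };
struct ibv_recv_wr *bad_rwr;
ibv_post_recv(qp, &rwr, &bad_rwr);

/* Sender: post the message buffer to the send queue. */
struct ibv_sge ssge = {
    .addr = (uintptr_t)mr->addr, .length = mr->length, .lkey = mr->lkey,
};
struct ibv_send_wr swr = {
    .wr_id = 3, .sg_list = &ssge, .num_sge = 1,
    .opcode = IBV_WR_SEND, .send_flags = IBV_SEND_SIGNALED,
};
struct ibv_send_wr *bad_swr;
ibv_post_send(qp, &swr, &bad_swr);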



Example RDMA send

[Figure, shown in four animation steps: System A and System B each have a
send queue, a receive queue, and a completion queue, plus a region of
registered memory inside host memory. System A holds the buffer to transfer;
System B holds the buffer where the incoming data is to be placed.]


RDMA Read and Write

Only the sender side is active; the receiver side is passive.

The passive side issues no operation, uses no CPU cycles, and gets no
indication that a “read” or a “write” happened.

To issue an RDMA read or write, the work request must include:

1. the remote side’s virtual memory address, and
2. the remote side’s memory registration key.

The active side must obtain the passive side’s address and key beforehand.
Typically, the traditional RDMA send/receive mechanisms are used for this.
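
A sketch of a one-sided RDMA write in libibverbs follows; remote_addr and
remote_rkey are assumed to have been exchanged out of band (e.g. via
send/receive), and qp and mr are as set up earlier.

struct ibv_sge sge = {
    .addr = (uintptr_t)mr->addr, .length = mr->length, .lkey = mr->lkey,
};

struct ibv_send_wr wr = {
    .wr_id      = 4,
    .sg_list    = &sge,
    .num_sge    = 1,
    .opcode     = IBV_WR_RDMA_WRITE,  /* IBV_WR_RDMA_READ for a read */
    .send_flags = IBV_SEND_SIGNALED,
    .wr.rdma    = {
        .remote_addr = remote_addr,   /* passive side's virtual address */
        .rkey        = remote_rkey,   /* passive side's registration key */
    },
};

struct ibv_send_wr *bad_wr;
ibv_post_send(qp, &wr, &bad_wr);
/* The passive side's CPU is never involved and sees no indication. */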



Using the verbs API

[Figure: walkthrough of a simple IB verbs program performing an RDMA write.
src: https://blog.zhaw.ch/icclab/infiniband-an-introduction-simple-ib-verbs-program-with-rdma-write/]
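
For orientation, a hedged sketch of the typical verbs setup sequence follows.
This is not the program from the source above; connection establishment and
the QP state transitions via ibv_modify_qp are omitted for brevity.

#include <infiniband/verbs.h>

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num); /* enumerate NICs */
    struct ibv_context *ctx  = ibv_open_device(devs[0]);  /* open the first */
    struct ibv_pd *pd        = ibv_alloc_pd(ctx);         /* protection domain */

    static char buf[4096];
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, sizeof(buf),  /* register memory */
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE);

    struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);
    struct ibv_qp_init_attr attr = {
        .send_cq = cq, .recv_cq = cq,
        .cap = { .max_send_wr = 16, .max_recv_wr = 16,
                 .max_send_sge = 1, .max_recv_sge = 1 },
        .qp_type = IBV_QPT_RC,
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &attr);

    /* ... exchange QP number, address, and rkey with the peer, bring the
     * QP to RTS with ibv_modify_qp, then post work requests as shown on
     * the previous slides ... */

    ibv_destroy_qp(qp); ibv_destroy_cq(cq); ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd); ibv_close_device(ctx); ibv_free_device_list(devs);
    return 0;
}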


Challenges of using RDMA

Using the Verbs API adds extra complexity for the developer. Upper-layer
protocols hide this complexity behind familiar interfaces for sockets and
IP-based apps, file systems, and block storage.

[Figure: upper-layer protocols (MPI, NFS_RDMA, SDP, RDS, VNIC, SRP, cluster
file systems) sitting between applications and the RDMA adapter driver and
IB fabric interconnect.
src: InfiniBand Trade Association: Introduction to IB for end users]

MPI: Message Passing Interface
▪ Widely used in HPC
▪ Examples: OpenMPI, MVAPICH, Intel MPI, etc.

File systems:
▪ Lustre – parallel distributed FS for Linux
▪ NFS_RDMA – Network FS over RDMA


RDMA References

▪ IB Trade Association introduction:
  https://cw.infinibandta.org/document/dl/7268

▪ First steps for programming with IB verbs:
  https://thegeekinthecorner.wordpress.com/2010/08/13/building-an-rdma-capable-application-with-ib-verbs-part-1-basics/

▪ Figures from https://zcopy.wordpress.com/category/getting-started/

▪ More details:
  http://www.mellanox.com/related-docs/prod_software/RDMA_Aware_Programming_user_manual.pdf


Overview of our new EDR cluster
▪ EDR InfiniBand

▪ 36-port Mellanox switch

▪ 18-node cluster (EDR NICs)

▪ 1 server with 4 Xeon E5-4660 v4 processors:
  ▪ 64 cores (128 with HT enabled)
  ▪ 512 GB RAM
  ▪ 2 EDR NICs, 1 x 10G NIC, 1 x 1G NIC

▪ 8 servers with 2 Xeon E5-2630 v4 processors:
  ▪ 20 cores (40 with HT enabled)
  ▪ 32 GB RAM
  ▪ 2 EDR NICs

