Introduction to RDMA

This post is a rough translation of the following article:

What is RDMA

RDMA stands for Remote Direct Memory Access, a way of moving buffers between two applications across a network. RDMA differs from traditional network interfaces because it bypasses the operating system. This allows programs that implement RDMA to have:

  1. The absolute lowest latency
  2. The highest throughput
  3. The smallest CPU footprint

 

How Can We Use It

To make use of RDMA we need a network interface card that implements an RDMA engine.

We call this an HCA (Host Channel Adapter). The adapter creates a channel from its RDMA engine through the PCI Express bus to the application memory. A good HCA implements in hardware all the logic needed to execute the RDMA protocol over the wire. This includes segmentation and reassembly as well as flow control and reliability. So from the application's perspective we deal with whole buffers.

In RDMA we set up data channels using a kernel driver. We call this the command channel. We use the command channel to establish data channels, which allow us to move data while bypassing the kernel entirely. Once we have established these data channels we can read and write buffers directly.

The API for establishing these data channels is called “verbs”. The verbs API is maintained in an open source Linux project called the OpenFabrics Enterprise Distribution (OFED). There is an equivalent project for Windows, WinOF, located at the same site.

The verbs API is different from the sockets programming API you might be used to. But once you learn a few concepts it is actually a lot easier to use, and it makes your programs much simpler to design.

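To make "segmentation and reassembly" concrete, here is a minimal Python sketch of what the HCA does on our behalf so that the application only ever sees whole buffers. It is a toy model, not the verbs API: the function names `segment` and `reassemble`, the 4096-byte MTU, and the simple in-order sequence check are all assumptions chosen for illustration.

```python
def segment(buffer: bytes, mtu: int) -> list[tuple[int, bytes]]:
    """Split a buffer into (sequence number, payload) packets of at most mtu bytes."""
    if mtu <= 0:
        raise ValueError("mtu must be positive")
    return [(seq, buffer[off:off + mtu])
            for seq, off in enumerate(range(0, len(buffer), mtu))]


def reassemble(packets: list[tuple[int, bytes]]) -> bytes:
    """Rebuild the buffer, rejecting any gap or reordering in the sequence numbers."""
    parts = []
    for expected, (seq, payload) in enumerate(packets):
        if seq != expected:
            raise ValueError(f"expected packet {expected}, got {seq}")
        parts.append(payload)
    return b"".join(parts)


message = bytes(range(256)) * 40        # a 10,240-byte application buffer
packets = segment(message, mtu=4096)    # assumed MTU, for illustration only
received = reassemble(packets)
print(len(packets), received == message)
```

With a real HCA none of this code runs on the CPU. The application hands over one buffer and gets one buffer back on the other side.
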
Queue Pairs

RDMA operations start by “pinning” memory. When you pin memory you are telling the kernel that this memory is owned by the application and must not be swapped out. Now we tell the HCA to address the memory and prepare a channel from the card to the memory. We refer to this as registering a Memory Region (MR). We can now use the registered memory in any of the RDMA operations we want to perform. The diagram below shows the registered region and the buffers within that region that are in use by the communication queues.

RDMA communication is based on a set of three queues. The send queue (SQ) and receive queue (RQ) are responsible for scheduling work. They are always created in pairs and are referred to as a Queue Pair (QP). A Completion Queue (CQ) is used to notify us when the instructions placed on the work queues have been completed.

A user places instructions on its work queues that tell the HCA which buffers it wants to send or receive. These instructions are small structs called work requests (WR) or Work Queue Elements (WQE). WQE is pronounced “WOOKIE”, like the creature from Star Wars. A WQE primarily contains a pointer to a buffer. A WQE placed on the send queue contains a pointer to the message to be sent. A WQE placed on the receive queue contains a pointer to a buffer where an incoming message from the wire can be placed.

RDMA is an asynchronous transport mechanism, so we can queue a number of send or receive WQEs at a time. The HCA will process these WQEs in order, as fast as it can. When a WQE is processed the data is moved. Once the transaction completes, a Completion Queue Element (CQE) is created and placed on the Completion Queue (CQ). We call a CQE a “COOKIE”.
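The relationship between registered memory, work queues, and the completion queue can be sketched with a small Python model. This is only a toy under stated assumptions: `ToyHCA`, `WQE`, `CQE`, and the `"success"` status string are hypothetical names invented for this example. In the real libibverbs C API the corresponding calls are `ibv_reg_mr`, `ibv_post_send`, `ibv_post_recv`, and `ibv_poll_cq`, and the data movement happens in hardware.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class WQE:
    """Work Queue Element: a small struct that mainly points at a buffer."""
    wr_id: int          # caller-chosen id, echoed back in the matching CQE
    buffer: bytearray   # stands in for a pointer into registered memory


@dataclass
class CQE:
    """Completion Queue Element, created once a WQE has been processed."""
    wr_id: int
    status: str         # hypothetical status string for this sketch


@dataclass
class QueuePair:
    send_queue: deque = field(default_factory=deque)
    recv_queue: deque = field(default_factory=deque)


class ToyHCA:
    def __init__(self):
        self.registered = []   # buffers registered as memory regions
        self.qp = QueuePair()
        self.cq = deque()

    def register(self, size):
        """Model pinning and registering a buffer of the given size."""
        buf = bytearray(size)
        self.registered.append(buf)
        return buf

    def post_send(self, wqe):
        """Queue a send WQE; only registered buffers are accepted."""
        if not any(wqe.buffer is b for b in self.registered):
            raise ValueError("buffer is not in a registered memory region")
        self.qp.send_queue.append(wqe)

    def process(self):
        """Drain the send queue in FIFO order, producing one CQE per WQE."""
        while self.qp.send_queue:
            wqe = self.qp.send_queue.popleft()
            # A real HCA would move the data over the wire here.
            self.cq.append(CQE(wqe.wr_id, "success"))

    def poll_cq(self):
        """Return the next CQE, or None if the completion queue is empty."""
        return self.cq.popleft() if self.cq else None


hca = ToyHCA()
for wr_id in (1, 2, 3):                 # queue several WQEs before any is processed
    hca.post_send(WQE(wr_id, hca.register(16)))
hca.process()

completed = []
while (cqe := hca.poll_cq()) is not None:
    completed.append(cqe.wr_id)
print(completed)                        # prints [1, 2, 3]
```

The point of the model is the ordering: we post three WQEs up front, do nothing while they are processed, and then collect three CQEs in the same order.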

A Simple Example

Let's look at a simple example. In this example we will move a buffer from the memory of System A to the memory of System B. This is what we call message passing semantics. The operation is a SEND, the most basic form of RDMA operation.

Step 1: Systems A and B have each created their QPs and Completion Queues and registered a region in memory for RDMA to take place. System A identifies a buffer that it wants to move to System B. System B has an empty buffer allocated where the data will be placed.

 

Step 2: System B creates a WQE (“WOOKIE”) and places it on its receive queue. This WQE contains a pointer to the memory buffer where the data will be placed. System A also creates a WQE, which points to the buffer in its memory that will be transmitted, and places it on its send queue.

Step 3: The HCA is always working in hardware, looking for WQEs on the send queue. The HCA on System A consumes the WQE and begins streaming the data from the memory region to System B. When data begins arriving at System B, the HCA there consumes the WQE on the receive queue to learn where it should place the data. The data streams over a high-speed channel, bypassing the kernel.

Step 4: When the data movement completes, the HCA creates a CQE (“COOKIE”). It is placed on the Completion Queue and indicates that the transaction has completed. For every WQE consumed, a CQE is generated. So a CQE is created on System A's CQ, indicating that the send operation completed, and another on System B's CQ, indicating that the receive operation completed. A CQE is always generated, even if there was an error. The CQE contains a field indicating the status of the transaction.
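The four steps can be walked through in a small Python model of two systems. This is a sketch under stated assumptions, not real verbs code: `Endpoint`, `send`, the `"success"` and `"error"` status strings, and the `byte_len` field are hypothetical names for illustration. The sketch also assumes System B has posted its receive WQE before the transfer starts, and it treats a receive buffer that is too small as the one error case.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class CQE:
    wr_id: int
    status: str        # "success" or "error" (hypothetical values)
    byte_len: int = 0  # bytes moved; 0 when the transfer failed


class Endpoint:
    """One system with a send queue, a receive queue, and a completion queue."""

    def __init__(self, name):
        self.name = name
        self.send_queue = deque()   # entries are (wr_id, buffer) pairs, i.e. WQEs
        self.recv_queue = deque()
        self.cq = deque()

    def post_send(self, wr_id, buf):
        self.send_queue.append((wr_id, buf))

    def post_recv(self, wr_id, buf):
        self.recv_queue.append((wr_id, buf))


def send(a, b):
    """Steps 3 and 4: consume one WQE on each side, move the data, post CQEs."""
    if not b.recv_queue:
        raise RuntimeError("no receive WQE posted; not modelled in this sketch")
    wr_a, src = a.send_queue.popleft()
    wr_b, dst = b.recv_queue.popleft()
    if len(src) > len(dst):
        # A CQE is generated on both sides even though the transfer failed.
        a.cq.append(CQE(wr_a, "error"))
        b.cq.append(CQE(wr_b, "error"))
        return
    dst[:len(src)] = src            # the "wire": copy straight into B's buffer
    a.cq.append(CQE(wr_a, "success", len(src)))
    b.cq.append(CQE(wr_b, "success", len(src)))


# Step 1: both systems exist, A has a message and B has an empty buffer.
a, b = Endpoint("A"), Endpoint("B")
payload = bytearray(b"hello from A")
landing = bytearray(64)

# Step 2: B posts a receive WQE, A posts a send WQE.
b.post_recv(wr_id=100, buf=landing)
a.post_send(wr_id=1, buf=payload)

# Steps 3 and 4: the data moves and each side gets a CQE.
send(a, b)
cqe_a, cqe_b = a.cq.popleft(), b.cq.popleft()
print(cqe_a, cqe_b, bytes(landing[:cqe_b.byte_len]))
```

Note that neither side's application code takes part in the copy. Each one only posts a WQE and later reads a CQE, which is where the status field tells it whether the transfer worked.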
