RDMAvisor: Toward Deploying Scalable and Simple RDMA as a Service in
Datacenters
Zhi Wang, Xiaoliang Wang, Cam-Tu Nguyen, Zhuzhong Qian, Baoliu Ye, Sanglu Lu
Nanjing University / Huawei
Submitted to USENIX ATC ’17, Paper #9
Abstract
RDMA is increasingly adopted by cloud computing platforms to provide network services with low CPU overhead, low latency, and high throughput. However, it is still challenging for developers to rapidly deploy RDMA-aware applications in the datacenter, since performance depends on many low-level details of RDMA operations. To address this problem, we present a simple and scalable RDMA as a Service (RaaS) layer that mitigates the impact of these operational details. RaaS provides careful message buffer management to improve CPU/memory utilization and the scalability of RDMA operations. These optimized designs lead to a simple and flexible programming model for both novice and expert users. We have implemented a prototype of RaaS, named RDMAvisor, and evaluated its performance on a cluster with a large number of connections. Our experimental results demonstrate that RDMAvisor achieves high throughput for thousands of connections and maintains low CPU and memory overhead through adaptive RDMA transport selection.
1 Introduction
1.1 Background and Motivation
Remote Direct Memory Access (RDMA) provides a messaging service that directly accesses the virtual memory of remote machines. Since data is copied by the network interface cards (NICs), RDMA involves the operating system only minimally and achieves low-latency data transport through stack bypass and copy avoidance. It has been widely used by the high-performance computing (HPC) community and is closely coupled with the InfiniBand (IB) network.
Recently, owing to the decreasing price of RDMA hardware and the compatible design of the RDMA over Converged Ethernet (RoCE) standard, distributed computing platforms have been testing and deploying RDMA to alleviate the communication bottleneck of existing TCP/IP-based environments [17, 18]. To make the RDMA network scale to hundreds of thousands of nodes in the datacenter, the IP-routable RoCE (RoCEv2) [4] protocol was also defined and quickly evaluated in Microsoft datacenters [10, 18]. By leveraging priority-based flow control (PFC) [3] and quantized congestion notification (QCN) [18], these deployments have demonstrated the potential of RDMA to replace TCP for intra-datacenter communications [10].
RDMA defines asynchronous network programming interfaces, the RDMA verbs, for submitting work requests to the channel adapter and returning completion status. Three transport types are provided: Reliable Connection (RC), Unreliable Connection (UC), and Unreliable Datagram (UD). The reliable transport guarantees lossless transfer and ensures in-order delivery of messages by using acknowledgments. The unreliable transports provide no delivery or ordering guarantee, but consume less bandwidth and incur lower delay because no ACK/NACK packets are exchanged. The RDMA verbs offer users both channel semantics and memory semantics. The channel semantics are two-sided operations, often called Send/Receive, which follow the communication style of a classic I/O channel. The memory semantics are one-sided operations, namely Read and Write, which allow the initiator of the transfer to specify the source or destination location without involving the CPU of the other endpoint. Table 1 shows the operations available in each transport mode.
1.2 Challenges and Our Solution
Although rich transport modes and operations are provided, it remains a challenging task to achieve advanced capabilities for applications deployed in RDMA-capable datacenters. With regard to the specific demands of different applications and the shared environment of
arXiv:1802.01870v1 [cs.DC] 6 Feb 2018