NUMA-remote memory access. The table in Figure 2 shows
memory access latencies in nanoseconds for a four-socket
machine with Intel Xeon Platinum 8160 processors and a
total of 192 cores.
Fig. 2. Memory access latency in nanoseconds in a server with 4 NUMA
zones. The diagonal values in bold correspond to NUMA-local accesses.
One consequence of the NUMA organization is that I/O is
also bound to a given zone, thus playing a role in networked
computation. External devices communicate with processors
through adapters that are physically connected to one NUMA
zone. Section II-C provides a more detailed explanation of this
aspect. The result is that memory accesses originating from a
device and targeting an address whose page resides in NUMA-
remote memory must traverse the interconnect in order to be
served. As we will see in our evaluation, this can impact the
performance of RDMA even though the interconnect latency is
an order of magnitude lower than the network latency.
B. Remote Direct Memory Access
RDMA gives a node the ability to remotely access the
physical memory of another node directly, without involving
the CPU of the remote node. Bypassing the CPU avoids two
significant overheads present in standard TCP/IP stacks. First,
everything is performed in user space, so there is no CPU
overhead due to mode or context switches. Second,
direct memory access (DMA) is leveraged by the RDMA-
capable network interface controller (RNIC) to avoid unnec-
essary memory copies, as detailed below.
All RDMA interactions share certain procedures to set up
communication and register remotely accessible memory. Each
endpoint maintains a queue pair (QP), which consists of a send
queue and a receive queue. Work requests are added to these
queues during operation, depending on the communication
protocol used. During creation, this structure is associated with
a remotely accessible memory region.
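As an illustration, the following sketch outlines these setup steps with the libibverbs API in C; the device context ctx, the buffer buf, BUF_SIZE, and the queue depths are placeholder assumptions, and error handling is omitted.

    /* Minimal RDMA setup sketch (libibverbs); ctx, buf, and BUF_SIZE are
     * placeholders and error handling is omitted. */
    #include <infiniband/verbs.h>

    struct ibv_pd *pd = ibv_alloc_pd(ctx);          /* protection domain */

    /* Register the buffer: its pages are pinned and an rkey is generated
     * that the remote side presents for one-sided accesses. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, BUF_SIZE,
            IBV_ACCESS_LOCAL_WRITE |
            IBV_ACCESS_REMOTE_READ |
            IBV_ACCESS_REMOTE_WRITE);

    struct ibv_cq *cq = ibv_create_cq(ctx, 128, NULL, NULL, 0);

    /* A reliably connected QP consisting of a send and a receive queue. */
    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .cap     = { .max_send_wr = 128, .max_recv_wr = 128,
                     .max_send_sge = 1,  .max_recv_sge = 1 },
        .qp_type = IBV_QPT_RC,
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &attr);

The buffer's virtual address and the rkey produced by the registration are then exchanged out of band so that the peer can issue one-sided operations against this region.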
Endpoints exchange data via one-sided or two-sided verbs.
Two-sided verbs are somewhat analogous to the socket model
offered by TCP/IP; this type of RDMA communication is so
named because both sides must actively participate in the
transfer. We avoid a detailed discussion of this form of
communication as it does not pertain to our study.
To issue a one-sided RDMA operation, the requester posts a
work request containing the local and remote addresses, the
transfer size, and the remote key to its send queue. These verbs
only require action from the requester and, in contrast to
two-sided communication, the requester knows the virtual
address on the remote node. In the case of read operations, the
remote data is copied to the local address and a work completion
is added to the requester's completion queue. For writes, the
remote RNIC acknowledges the request and a work completion
is added to the requester's completion queue. One-sided
communication requires a reliable connection, and a QP cannot
serve multiple connections [17]. Communication is faster than
with two-sided verbs, with the caveat that scalability can suffer
when the number of connections increases significantly.
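As a concrete sketch, and continuing with the libibverbs placeholders introduced above (qp, cq, mr, plus an assumed remote_addr and rkey obtained during connection setup), a one-sided read could be posted and completed as follows; error handling is again omitted.

    /* Post a one-sided RDMA read and poll for its completion
     * (placeholder identifiers; requires <stdint.h> for uintptr_t). */
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,  /* where the remote data lands */
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id               = 1,
        .sg_list             = &sge,
        .num_sge             = 1,
        .opcode              = IBV_WR_RDMA_READ, /* IBV_WR_RDMA_WRITE for writes */
        .send_flags          = IBV_SEND_SIGNALED,
        .wr.rdma.remote_addr = remote_addr,
        .wr.rdma.rkey        = rkey,
    };
    struct ibv_send_wr *bad_wr;
    ibv_post_send(qp, &wr, &bad_wr);     /* only the requester acts */

    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;                                /* spin until the work completion arrives */

The remote CPU runs no code for this transfer; the remote RNIC performs the memory access and, in the case of writes, acknowledges the request.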
As mentioned before, one-sided interactions are agnostic to the
target machine's operating system while sharing the same
physical hardware resources as the local operating system and
applications. Their performance can therefore be significantly
affected by factors that cannot be optimized at runtime by
either the requesting application or the software on the
receiving machine. One consequence of this transparency is
that the NUMA balancing optimization [2] cannot be employed
to move memory pages between NUMA zones, because pages
must be pinned upon memory registration with the RNIC.
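Because registered pages can no longer migrate, the NUMA placement of an RDMA buffer has to be decided before registration, for example by allocating it on an explicitly chosen node. A minimal sketch, assuming libnuma is available and that node, size, and pd are chosen by the application:

    /* Allocate the RDMA buffer on a chosen NUMA node before registering it,
     * since registration pins the pages in place (placeholders: node, size, pd). */
    #include <numa.h>
    #include <infiniband/verbs.h>

    void *buf = NULL;
    if (numa_available() != -1)
        buf = numa_alloc_onnode(size, node); /* memory bound to the given node */

    struct ibv_mr *mr = ibv_reg_mr(pd, buf, size,
            IBV_ACCESS_LOCAL_WRITE |
            IBV_ACCESS_REMOTE_READ |
            IBV_ACCESS_REMOTE_WRITE);
    /* From this point on, automatic NUMA balancing cannot move these pages. */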
C. NUMA and RDMA I/O
Modern architectures offer a mechanism for I/O to directly
access the last-level cache. For example, the Intel machines
used in this study use Data Direct I/O (DDIO) to achieve
direct cache access. Given modern I/O speeds and cache sizes,
it is practical to let I/O target the cache directly and avoid the
overhead of a detour through main memory. Previously,
incoming data would be written to main memory and local
accesses would then read it into the cache. With technology
like DDIO, I/O latency improves for accesses to cached memory,
and local computation benefits from I/O placing data directly
into the cache.
The intent of this technology is to transparently improve
latency and throughput for I/O operations. It is important to
note, however, that DDIO is enabled by default and currently
applies only to the cache in the same NUMA zone as the I/O
controller; physical memory in a remote NUMA zone is
accessed via normal DMA [18]. As we will show in our
evaluation, this behavior can work against RDMA and
negatively impact performance.
III. RELATED WORK
Extensive investigation of the role of NUMA locality in
system performance has solidified it as an important
consideration when designing high-performance applications
for modern multicore machines. Integrating NUMA-awareness
into algorithms and data structures improves performance [8],
[9], [12], [24], [42].
Recent literature abounds with systems that exploit RDMA
for different types of computation. Because of its reduced
communication latency, RDMA is an ideal technology for,
among others, distributed transactional systems [10], [14],
[20], [22], [39], [40], distributed shared memory [3], [7],
[13], data transfer and storage [6], [33], and group commu-
nication [38]. Most systems using one-sided communication
implement a similar pattern resembling the traditional
client-server model, using one-sided writes for message passing
and one-sided reads for direct data access. Their designs