Figure 1: Hardware components of a node in an RDMA cluster
2 Background
Figure 1 shows the relevant hardware components of a
machine in an RDMA cluster. A NIC with one or more
ports connects to the PCIe controller of a multi-core CPU.
The PCIe controller reads/writes the L3 cache to service
the NIC’s PCIe requests; on modern Intel servers [4], the
L3 cache provides counters for PCIe events.
2.1 PCI Express
The current fastest PCIe link is PCIe “3.0 x16,” the 3rd
generation PCIe protocol, using 16 lanes. The bandwidth
of a PCIe link is the per-lane bandwidth times the number
of lanes. PCIe is a layered protocol, and the layer headers
add overhead that is important to understand for efficiency.
RDMA operations generate 3 types of PCIe transaction
layer packets (TLPs): read requests, write requests, and
read completions (there is no transaction-layer response
for a write). Figure 2a lists the bandwidth and header
overhead for the PCIe generations in our clusters. Note
that the header overhead of 20–26 bytes is comparable to
the common size of data items used in services such as
memcached [25] and RPCs [15].
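The arithmetic in this paragraph can be sketched as follows. This is a minimal illustration, not a measurement tool: the per-lane bandwidths and header sizes are the Figure 2a values, while the function names and the 32-byte item size in the usage example are hypothetical choices made for this sketch.

```python
# Per-lane bandwidth (MB/s) and TLP header sizes (bytes), taken from Figure 2a.
PCIE_GENERATIONS = {
    "2.0": {"lane_mb_per_s": 500.0, "request_hdr": 24, "completion_hdr": 20},
    "3.0": {"lane_mb_per_s": 984.6, "request_hdr": 26, "completion_hdr": 22},
}


def link_bandwidth_mb_per_s(gen, lanes):
    """Link bandwidth is the per-lane bandwidth times the number of lanes."""
    return PCIE_GENERATIONS[gen]["lane_mb_per_s"] * lanes


def header_overhead_fraction(payload_bytes, header_bytes):
    """Fraction of one TLP's bytes that is header rather than payload."""
    return header_bytes / (payload_bytes + header_bytes)


if __name__ == "__main__":
    # PCIe 3.0 x16 link bandwidth, excluding physical layer encoding overhead.
    print(link_bandwidth_mb_per_s("3.0", 16))
    # Hypothetical 32-byte item carried in a single PCIe 3.0 write TLP.
    hdr = PCIE_GENERATIONS["3.0"]["request_hdr"]
    print(round(header_overhead_fraction(32, hdr), 3))
```

For small items, the header fraction approaches one half, which is why the header sizes matter for the workloads cited above.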
MMIO writes vs. DMA reads
There are important differences between the two methods
of transferring data from a CPU to a PCIe device. CPUs
write to mapped device memory (MMIO) to initiate PCIe
writes. To avoid generating a PCIe write for each store
instruction, CPUs use an optimization called “write
combining,” which combines stores to generate cache
line–sized PCIe transactions. PCIe devices have DMA
engines and can read from DRAM using DMA. DMA reads are
not restricted to cache lines, but a read response larger
than the CPU’s read completion combining size (C_rc) is
split into multiple completions. C_rc is 128 bytes for the
Intel CPUs used in our measurements (Table 2); we assume
128 bytes for the AMD CPU [4, 3]. A DMA read always uses
less host-to-device PCIe bandwidth than an equal-sized
MMIO write; Figure 2b shows an analytical comparison. This
is an important factor, and we show how it affects the
performance of higher-layer protocols in the subsequent
sections.
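One way to reproduce an analytical comparison of this kind is sketched below. It rests on assumptions that the text does not state explicitly: a 64-byte cache line, one PCIe 3.0 write TLP (26-byte header) per write-combined cache line for MMIO, and one read completion TLP (22-byte header) per C_rc-sized chunk for DMA. The function names are hypothetical.

```python
CACHE_LINE = 64      # bytes; assumed write-combining granularity
C_RC = 128           # bytes; read completion combining size from the text
WRITE_HDR = 26       # bytes; PCIe 3.0 request TLP header (Figure 2a)
COMPLETION_HDR = 22  # bytes; PCIe 3.0 completion TLP header (Figure 2a)


def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)


def mmio_bytes(x):
    """CPU-to-device bytes for an x-byte MMIO write.

    Assumes every write-combined transaction carries a full cache line
    plus one write TLP header.
    """
    return ceil_div(x, CACHE_LINE) * (CACHE_LINE + WRITE_HDR)


def dma_read_bytes(x):
    """CPU-to-device bytes for an x-byte DMA read.

    The payload is returned in completions of at most C_RC bytes, each
    with its own completion header. The read request itself travels
    device-to-CPU and is not counted here.
    """
    return x + ceil_div(x, C_RC) * COMPLETION_HDR


if __name__ == "__main__":
    for size in (16, 64, 128, 256, 1024):
        print(size, mmio_bytes(size), dma_read_bytes(size))
```

Under these assumptions, the DMA cost stays below the MMIO cost for every transfer size, because MMIO pays a larger header per 64-byte line and rounds the payload up to whole cache lines.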
Gen   Bitrate   Per-lane b/w   Request   Completion
2.0   5 GT/s    500 MB/s       24 B      20 B
3.0   8 GT/s    984.6 MB/s     26 B      22 B
(a) Speed and header sizes for PCIe generations. Lane
bandwidth excludes physical layer encoding overhead.
(b) CPU-to-device PCIe traffic for an x-byte transfer with
DMA and MMIO, assuming PCIe 3.0 and C_rc = 128 bytes.
Figure 2: PCIe background
PCIe counters
Our contributions rely on understanding the PCIe
interaction between NICs and CPUs. Although precise PCIe
analysis requires expensive PCIe analyzers or
proprietary/confidential NIC manuals, PCIe counters
available on modern CPUs can provide several useful
insights.² For each counter, the number of captured events
per second is its counter rate. Our analysis primarily uses
counters for DMA reads (PCIeRdCur) and DMA writes
(PCIeItoM).
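The counter-rate definition, and the cache line granularity at which these counters observe traffic (see the footnote), can be illustrated with a short sketch. The 64-byte cache line and both function names are assumptions made for this example; reading the actual hardware counters is outside its scope.

```python
CACHE_LINE = 64  # bytes; assumed cache line size


def counter_rate(count_start, count_end, seconds):
    """Counter rate: events captured per second over a sampling interval."""
    return (count_end - count_start) / seconds


def cache_lines_touched(addr, length):
    """Number of cache lines covered by `length` bytes starting at `addr`.

    A counter that observes cache line-level activity would record one
    event per line touched, regardless of how few bytes are read.
    """
    first_line = addr // CACHE_LINE
    last_line = (addr + length - 1) // CACHE_LINE
    return last_line - first_line + 1


if __name__ == "__main__":
    # Two hypothetical samples of a counter taken 2 seconds apart.
    print(counter_rate(1000, 4000, 2.0))
    # A 4-byte read that straddles a cache line boundary.
    print(cache_lines_touched(62, 4))
```

The second call corresponds to the footnote's example: a 4-byte read covering bytes 62 through 65 touches two cache lines, so it would be counted as two PCIe reads.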
2.2 RDMA
RDMA is a network feature that allows direct access to
the memory of a remote computer. RDMA-providing
networks include InfiniBand, RoCE (RDMA over Converged
Ethernet), and iWARP (Internet Wide Area RDMA Protocol).
RDMA networks usually provide high bandwidth and low
latency: NICs with 100 Gbps of per-port bandwidth and
∼2 µs round-trip latency are commercially available. The
performance and scalability of an RDMA-based
communication protocol depend on several factors,
including the operation (verb) type, transport,
optimization flags, and operation initiation method.
2.2.1 RDMA verbs and transports
RDMA hosts communicate using queue pairs (QPs); hosts
create QPs consisting of a send queue and a receive queue,
and post operations to these queues using the verbs API.
We call the host initiating a verb the requester and the
destination host the responder. For some verbs, the
responder does not actually send a response. On completing
a verb, the requester’s NIC optionally signals completion
by DMA-ing a completion entry (CQE) to a completion
queue (CQ) associated with the QP. Verbs can be made
unsignaled by setting a flag in the request; these verbs do
not generate a CQE, and the application detects
completion using application-specific methods.
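The relationship between queue pairs, signaled verbs, and completion entries can be pictured with a toy model. This is a pure-Python simulation written for illustration only; it is not the verbs API, and every class, field, and method name in it is hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class WorkRequest:
    wr_id: int
    verb: str              # label only, e.g. "WRITE" or "SEND"
    signaled: bool = True  # unsignaled verbs produce no CQE


@dataclass
class ToyQueuePair:
    send_queue: deque = field(default_factory=deque)
    recv_queue: deque = field(default_factory=deque)
    completion_queue: deque = field(default_factory=deque)

    def post_send(self, wr):
        """Requester posts an operation to the send queue."""
        self.send_queue.append(wr)

    def nic_complete_one(self):
        """Pretend the NIC finished the oldest posted verb.

        A CQE (here, just the wr_id) is added to the CQ only if the
        verb was posted as signaled.
        """
        wr = self.send_queue.popleft()
        if wr.signaled:
            self.completion_queue.append(wr.wr_id)

    def poll_cq(self):
        """Return and remove all CQEs currently in the CQ."""
        entries = list(self.completion_queue)
        self.completion_queue.clear()
        return entries


if __name__ == "__main__":
    qp = ToyQueuePair()
    qp.post_send(WorkRequest(1, "WRITE", signaled=False))
    qp.post_send(WorkRequest(2, "WRITE", signaled=False))
    qp.post_send(WorkRequest(3, "WRITE", signaled=True))
    for _ in range(3):
        qp.nic_complete_one()
    print(qp.poll_cq())
```

In this model only the third request yields a CQE; the application would have to learn about the first two through some application-specific method.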
The two types of verbs are memory verbs and messaging
verbs. Memory verbs include RDMA reads, writes,
² The CPU intercepts cache line-level activity between the PCIe
controller and the L3 cache, so the counters can miss some critical
information. For example, the counters indicate 2 PCIe reads when the
NIC reads a 4-byte chunk straddling 2 cache lines.