NUMA-remote memory access. The table in Figure 2 shows
memory access latencies in nanoseconds for a four-socket
machine with Intel Xeon Platinum 8160 processors and a
total of 192 cores.
Fig. 2. Memory access latency in nanoseconds in a server with 4 NUMA
zones. The diagonal values in bold correspond to NUMA-local accesses.
One consequence of the NUMA organization is that I/O is
also bound to a given zone, thus playing a role in networked
computation. External devices communicate with processors
through adapters that are physically connected to one NUMA
zone. Section II-C provides a more detailed explanation of this
aspect. The result is that memory accesses originating from a
device and targeting an address whose page resides in NUMA-
remote memory must traverse the interconnect in order to be
served. As we will see in our evaluation, this can impact the
performance of RDMA even though the interconnect latency is
an order of magnitude lower than the network latency.
B. Remote Direct Memory Access
RDMA gives a node the ability to remotely access the
physical memory of another node directly, without involving
the CPU of the remote node. Bypassing the CPU avoids two
significant overheads present in standard TCP/IP stacks. First,
everything is performed in user space, so there is no CPU
overhead due to mode or context switches. Second,
direct memory access (DMA) is leveraged by the RDMA-
capable network interface controller (RNIC) to avoid unnec-
essary memory copies, as detailed below.
All RDMA interactions share certain procedures to set up
communication and register remotely accessible memory. Each
endpoint maintains a queue pair (QP), which consists of a send
queue and a receive queue. Work requests are added to these
queues during operation, depending on the communication
protocol used. During creation, this structure is associated with
a remotely accessible memory region.
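As an illustration, the following sketch outlines these setup steps with the libibverbs API in C; the device context ctx, the buffer buf, BUF_SIZE, and the queue depths are placeholder assumptions, and error handling is omitted.

    /* Minimal RDMA setup sketch (libibverbs); ctx, buf, and BUF_SIZE are
     * placeholders and error handling is omitted. */
    #include <infiniband/verbs.h>

    struct ibv_pd *pd = ibv_alloc_pd(ctx);          /* protection domain */

    /* Register the buffer: its pages are pinned and an rkey is generated
     * that the remote side presents for one-sided accesses. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, BUF_SIZE,
            IBV_ACCESS_LOCAL_WRITE |
            IBV_ACCESS_REMOTE_READ |
            IBV_ACCESS_REMOTE_WRITE);

    struct ibv_cq *cq = ibv_create_cq(ctx, 128, NULL, NULL, 0);

    /* A reliably connected QP consisting of a send and a receive queue. */
    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .cap     = { .max_send_wr = 128, .max_recv_wr = 128,
                     .max_send_sge = 1,  .max_recv_sge = 1 },
        .qp_type = IBV_QPT_RC,
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &attr);

The buffer's virtual address and the rkey produced by the registration are then exchanged out of band so that the peer can issue one-sided operations against this region.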
Endpoints exchange data via one-sided or two-sided verbs.
Two-sided verbs are somewhat analogous to the socket model
offered by TCP/IP; this type of RDMA communication is so
named because both sides must actively participate in the
transfer. We avoid a detailed discussion of this form of
communication as it does not pertain to our study.
To issue a one-sided RDMA operation, the requester posts a
work request containing the local and remote addresses, the
transfer size, and the remote key to its send queue. These verbs
only require action from the requester and, in contrast to
two-sided communication, the requester knows the virtual
address on the remote node. In the case of read operations, the
remote data is copied to the local address and a work completion
is added to the requester's completion queue. For writes, the
remote RNIC acknowledges the request and a work completion
is added to the requester's completion queue. One-sided
communication requires a reliable connection, and a QP cannot
serve multiple connections [17]. Communication is faster than
with two-sided verbs, with the caveat that scalability can suffer
when the number of connections increases significantly.
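As a concrete sketch, and continuing with the libibverbs placeholders introduced above (qp, cq, mr, plus an assumed remote_addr and rkey obtained during connection setup), a one-sided read could be posted and completed as follows; error handling is again omitted.

    /* Post a one-sided RDMA read and poll for its completion
     * (placeholder identifiers; requires <stdint.h> for uintptr_t). */
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,  /* where the remote data lands */
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id               = 1,
        .sg_list             = &sge,
        .num_sge             = 1,
        .opcode              = IBV_WR_RDMA_READ, /* IBV_WR_RDMA_WRITE for writes */
        .send_flags          = IBV_SEND_SIGNALED,
        .wr.rdma.remote_addr = remote_addr,
        .wr.rdma.rkey        = rkey,
    };
    struct ibv_send_wr *bad_wr;
    ibv_post_send(qp, &wr, &bad_wr);     /* only the requester acts */

    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;                                /* spin until the work completion arrives */

The remote CPU runs no code for this transfer; the remote RNIC performs the memory access and, in the case of writes, acknowledges the request.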
As mentioned before, one-sided interactions are agnostic to the
target machine's operating system while sharing the same
physical hardware resources as the local operating system and
applications. Their performance can therefore be significantly
affected by factors that cannot be optimized at runtime by
either the requesting application or the software on the
receiving machine. One consequence of this transparency is
that the NUMA balancing optimization [2] cannot be employed
to move memory pages between NUMA zones, because pages
must be pinned upon memory registration with the RNIC.
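Because registered pages can no longer migrate, the NUMA placement of an RDMA buffer has to be decided before registration, for example by allocating it on an explicitly chosen node. A minimal sketch, assuming libnuma is available and that node, size, and pd are chosen by the application:

    /* Allocate the RDMA buffer on a chosen NUMA node before registering it,
     * since registration pins the pages in place (placeholders: node, size, pd). */
    #include <numa.h>
    #include <infiniband/verbs.h>

    void *buf = NULL;
    if (numa_available() != -1)
        buf = numa_alloc_onnode(size, node); /* memory bound to the given node */

    struct ibv_mr *mr = ibv_reg_mr(pd, buf, size,
            IBV_ACCESS_LOCAL_WRITE |
            IBV_ACCESS_REMOTE_READ |
            IBV_ACCESS_REMOTE_WRITE);
    /* From this point on, automatic NUMA balancing cannot move these pages. */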
C. NUMA and RDMA I/O
Modern architectures offer a mechanism for I/O to directly
access the last-level cache. For example, the Intel machines
used in this study use Data Direct I/O (DDIO) to achieve
direct cache access. Given modern I/O speeds and cache sizes,
it is practical to let I/O target the cache directly and avoid the
overhead of a detour through main memory. Previously,
incoming data would be written to main memory and local
accesses would then read it into the cache. With technology
like DDIO, I/O latency improves for accesses to cached memory,
and local computation benefits from I/O placing data directly
into the cache.
The intent of this technology is to transparently improve
latency and throughput for I/O operations. It is important to
note, however, that DDIO is enabled by default and currently
applies only to the cache in the same NUMA zone as the I/O
controller; physical memory in a remote NUMA zone is
accessed via normal DMA [18]. As we will show in our
evaluation, this behavior can work against RDMA and
negatively impact performance.
III. RELATED WORK
Extensive investigation of the role of NUMA locality in
system performance has solidified it as an important
consideration when designing high-performance applications
for modern multicore machines. Integrating NUMA-awareness
into algorithms and data structures improves performance [8],
[9], [12], [24], [42].
Recent literature abounds with systems that exploit RDMA
for different types of computation. Because of its reduced
communication latency, RDMA is an ideal technology for,
among others, distributed transactional systems [10], [14],
[20], [22], [39], [40], distributed shared memory [3], [7],
[13], data transfer and storage [6], [33], and group commu-
nication [38]. Most systems using one-sided communication
implement a similar pattern resembling the traditional
client-server model, using one-sided writes for message passing
and one-sided reads for direct data access. Their designs