Implementation of TCP Large Receive Offload on a Multi-core NPU Platform


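"Implementation of TCP Large Receive Offload on Multi-core NPU Platform"

In today's networks, Ethernet is advancing far faster than memory and CPU technology, so protocol processing has become the main bottleneck limiting TCP performance. Conventional network interface cards (NICs) usually support techniques such as checksum offload and TCP Segmentation Offload (TSO), which move part of the processing onto the NIC hardware and lighten the load on the host. This paper proposes an implementation of TCP Large Receive Offload (LRO) on a multi-core Network Processing Unit (NPU) platform to improve TCP performance.

LRO is a network optimization technique in which the NIC merges consecutive packets of a flow, reducing the number of small packets the host has to process. Implementing LRO on a multi-core NPU platform puts the NPU's hardware resources to work and raises data processing efficiency. In particular, the authors introduce a mechanism called "active ACK" that makes it possible to aggregate very large packets (for example, 64KB), which further improves transfer efficiency. The paper presents experimental results to demonstrate the effectiveness of the proposal. Keywords: multi-core NPU, TCP, LRO.

The work helps explain how hardware acceleration can be used to optimize TCP performance in modern network architectures. Under high-speed traffic, LRO can reduce per-packet processing overhead, raise throughput, and lower CPU utilization. In practice, it can be applied in data centers, cloud computing services, and other high-bandwidth network settings. Implementing LRO on a multi-core NPU relieves processing pressure on the server and improves scalability, which matters as network traffic keeps growing.

The active ACK mechanism is one of the paper's novel contributions. As described in this summary, a conventional TCP receiver acknowledges data frequently as packets arrive, whereas with LRO and active ACK a single acknowledgment can be sent after a certain amount of data has accumulated, which reduces the number of network exchanges and improves transfer efficiency.

In short, the paper explores how LRO on a multi-core NPU platform can improve TCP performance, validates the approach experimentally, and proposes the active ACK mechanism to strengthen the aggregation of large packets. The implementation strategy is a useful reference for future network hardware design and TCP performance optimization.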

Implementation of TCP Large Receive Offload on Multi-core NPU Platform

Li Jie
School of Computer, National University of Defense Technology, Changsha, China
Email: lijie13d@nudt.edu.cn

Chen Shuhui
School of Computer, National University of Defense Technology, Changsha, China
Email: shchen@nudt.edu.cn

Su Jinshu
School of Computer, National University of Defense Technology, Changsha, China
Email: sjs@nudt.edu.cn

Abstract: Ethernet is now developing much faster than memory and CPU technologies, and protocol processing has become the bottleneck of TCP performance on end systems. Modern NICs usually support offload techniques such as checksum offload and TCP Segmentation Offload (TSO), which allow the end system to offload some processing work onto the NIC hardware. In this paper, we propose an implementation of Large Receive Offload (LRO) on a multi-core NPU platform to improve TCP performance. In particular, we employ a so-called active ACK mechanism to make the aggregation of very large packets (64KB) possible. We present experimental results to demonstrate the effectiveness of our proposal.

Keywords: Multi-core NPU, TCP, LRO

I. INTRODUCTION

In the last few years, Ethernet bandwidth has increased from 1Gbps to 100Gbps, while memory bandwidth has grown only from DDR2's 8533.33MBps [1] to DDR4's 19200MBps [2], and the top speed of processors settled around 4GHz and has not increased much since 2005 [3]. This performance gap makes memory access and protocol processing, rather than link capacity, the bottleneck of TCP. The constantly increasing network bandwidth places a severe burden on the CPU; optimizing the TCP processing mechanism can mitigate this and improve TCP performance on end systems.
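To make the gap concrete, the following back-of-the-envelope sketch estimates how many CPU cycles are available per received packet at line rate. The full-sized 1500-byte MTU frames, the 38 bytes of per-frame wire overhead, and the single 4GHz core are illustrative assumptions of this sketch, not figures or measurements from the paper.

```python
# Rough per-packet CPU cycle budget when receiving at line rate.
# Assumptions (illustrative, not taken from the paper):
#   - every frame carries a full 1500-byte MTU payload
#   - 38 bytes of per-frame wire overhead
#     (preamble 8 + Ethernet header 14 + FCS 4 + inter-frame gap 12)
#   - a single 4 GHz core handles all protocol processing

WIRE_OVERHEAD_BYTES = 38


def packets_per_second(link_bps: float, mtu_bytes: int = 1500) -> float:
    """Full-sized frames per second that a saturated link delivers."""
    bits_on_wire = (mtu_bytes + WIRE_OVERHEAD_BYTES) * 8
    return link_bps / bits_on_wire


def cycles_per_packet(link_bps: float, cpu_hz: float = 4e9,
                      mtu_bytes: int = 1500) -> float:
    """CPU cycles available per frame if one core must keep up."""
    return cpu_hz / packets_per_second(link_bps, mtu_bytes)


if __name__ == "__main__":
    for gbps in (1, 10, 100):
        bps = gbps * 1e9
        print(f"{gbps:>3} Gbps: {packets_per_second(bps):>12,.0f} pkt/s, "
              f"{cycles_per_packet(bps):>9,.1f} cycles/pkt")
```

Under these assumptions the budget shrinks from roughly 49,000 cycles per packet at 1Gbps to fewer than 500 at 100Gbps, which is why reducing the number of packets the host has to handle becomes attractive.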

Traditional TCP acceleration techniques such as checksum optimization [4] [5] [6] [7], zero-copy [8] and interrupt coalescing [9] focus on the host side, and protocol processing is still done by the host CPU. A TCP Offload Engine (TOE) [10] can offload the entire TCP protocol processing workload and improve TCP performance dramatically, but its implementation is very complex and often causes security and compatibility issues [11]. TSO [12] optimizes TCP's data sending path by offloading the data segmentation and checksum calculation functions; the technique has become rather mature because of its simplicity. LRO [13] improves TCP performance by reducing the number of packet headers processed by the CPU, but it works in the NIC driver layer, so the packet aggregation job is still done by the host CPU.

A multi-core Network Processing Unit (NPU) is an integrated circuit with a feature set specifically targeted at the networking application domain. It usually has excellent packet processing capability for the following reasons:

1) More than a dozen hardware-based, low-switching-overhead threads. The large number of hardware contexts enables software to more effectively leverage the inherent parallelism exhibited by packet processing applications.

2) Favorable I/O features. A multi-core NPU can import packets from the network interface to memory with high throughput. Moreover, its flexible dispatching component can distribute packets to different threads or cores according to application configurations and pipeline them with the corresponding processing threads.

3) Well-designed message passing mechanism among different threads. A multi-core NPU often employs a crossbar structure or SRAM as its message transfer medium, which makes thread synchronization efficient and elegant.
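The dispatching described in item 2 can be illustrated with a small sketch. The paper only states that packets are distributed according to application configuration; hashing the TCP 4-tuple so that every packet of a flow lands on the same thread is an assumption made here for illustration, and `FlowKey` and `dispatch` are hypothetical names, not part of any NPU SDK.

```python
import zlib
from typing import NamedTuple


class FlowKey(NamedTuple):
    """TCP 4-tuple identifying a flow (hypothetical helper type)."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int


def dispatch(flow: FlowKey, num_threads: int) -> int:
    """Map a flow to a worker thread index.

    Assumed policy: a deterministic hash of the 4-tuple, so all packets
    of one flow reach the same thread and per-flow state needs no
    cross-thread locking.
    """
    key = f"{flow.src_ip}:{flow.src_port}>{flow.dst_ip}:{flow.dst_port}"
    return zlib.crc32(key.encode()) % num_threads


if __name__ == "__main__":
    flow = FlowKey("10.0.0.1", "10.0.0.2", 40000, 80)
    print("flow handled by thread", dispatch(flow, 16))
```

Pinning a flow to one thread keeps its packets in order relative to each other, which is convenient for any per-flow processing such as aggregation.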

Different TCP flows are weakly correlated and can be processed concurrently. This fact naturally leads to the idea of employing a multi-core NPU's excellent packet processing capability to accelerate TCP processing on an end system.

In this paper, we propose to use a multi-core NPU as the NIC and implement LRO on it. Our implementation reduces the number of packets processed by the host network stack and the number of interrupts generated by the NIC, and thereby improves TCP performance on an end system. The experimental results demonstrate the effectiveness of our proposal. Furthermore, our implementation only involves the NIC hardware and driver layer: from the perspective of the kernel network stack and user applications, there is no difference between the multi-core NPU and a normal NIC, so our implementation does not suffer from TOE's compatibility and security problems.

II. RELATED WORK

Large Receive Offload is a NIC driver layer technique for increasing TCP data receiving throughput. It works by aggregating multiple small data packets of the same flow into far fewer large ones; the aggregated packets are then delivered to the kernel network stack for further processing, as shown in Figure 1.
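The aggregation idea can be sketched in a few lines. This is a simplified illustration, not the paper's or any driver's implementation: it merges only payload bytes of in-order segments, ignores TCP flags, options, checksums and sequence-number wraparound, and the names `Segment` and `lro_aggregate` are hypothetical. The 64KB cap mirrors the figure quoted in the abstract.

```python
from dataclasses import dataclass
from typing import Dict, Hashable, Iterable, List

MAX_AGG_BYTES = 64 * 1024  # illustrative cap, matching the 64KB in the abstract


@dataclass
class Segment:
    flow: Hashable   # flow identifier, e.g. the TCP 4-tuple
    seq: int         # sequence number of the first payload byte
    payload: bytes


def lro_aggregate(segments: Iterable[Segment],
                  max_bytes: int = MAX_AGG_BYTES) -> List[Segment]:
    """Merge in-order segments of the same flow into larger packets.

    Simplification: sequence-number wraparound is not handled.
    """
    pending: Dict[Hashable, Segment] = {}   # one open aggregate per flow
    delivered: List[Segment] = []           # what the host stack would see
    for seg in segments:
        cur = pending.get(seg.flow)
        in_order = cur is not None and seg.seq == cur.seq + len(cur.payload)
        if in_order and len(cur.payload) + len(seg.payload) <= max_bytes:
            cur.payload += seg.payload      # extend the open aggregate
            continue
        if cur is not None:
            delivered.append(cur)           # flush on gap or size limit
        pending[seg.flow] = Segment(seg.flow, seg.seq, seg.payload)
    delivered.extend(pending.values())      # flush whatever is still open
    return delivered


if __name__ == "__main__":
    segs = [Segment("A", 1000 + i * 1460, b"x" * 1460) for i in range(45)]
    out = lro_aggregate(segs)
    print(len(segs), "segments ->", len(out), "packets")
```

In this example, 45 in-order segments of 1460 bytes collapse into two packets handed to the stack, because the 45th segment would push the first aggregate past the 64KB cap.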

LRO was first proposed by Grossman [13] and implemented in the NIC driver program for the Neterion Xframe-II. When
