Implementation of TCP Large Receive Offload on
Multi-core NPU Platform
Li Jie
School of Computer
National University of
Defense Technology
Changsha, China
Email: lijie13d@nudt.edu.cn
Chen Shuhui
School of Computer
National University of
Defense Technology
Changsha, China
Email: shchen@nudt.edu.cn
Su Jinshu
School of Computer
National University of
Defense Technology
Changsha, China
Email: sjs@nudt.edu.cn
Abstract—Ethernet bandwidth is growing much faster than memory
and CPU performance, so protocol processing has become the
bottleneck of TCP performance on end systems. Modern NICs
usually support offload techniques such as checksum offload and
TCP Segmentation Offload (TSO), which allow the end system to
move some processing work onto the NIC hardware. In this paper,
we propose an implementation of Large Receive Offload (LRO) on a
multi-core NPU platform to improve TCP performance. In
particular, we employ a so-called active ACK mechanism that makes
aggregation of very large packets (64 KB) possible. We present
experimental results that demonstrate the effectiveness of our
proposal.
Keywords—Multi-core NPU, TCP, LRO
I. INTRODUCTION
In the last few years, Ethernet bandwidth has increased from
1 Gbps to 100 Gbps, whereas memory bandwidth has only grown from
DDR2’s 8533.33 MBps [1] to DDR4’s 19200 MBps [2], and the top
clock speed of processors has settled around 4 GHz, with little
increase since 2005 [3]. Because of this performance gap, memory
access and protocol processing, rather than link capacity, have
become the bottleneck of TCP. The constantly increasing network
bandwidth places a heavy burden on the CPU; optimizing the TCP
processing mechanism can mitigate this and improve TCP
performance on end systems.
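To make the gap concrete, the following back-of-the-envelope sketch estimates how many CPU cycles are available per packet at a given link speed. The figures are illustrative assumptions rather than measurements from this paper: full-size frames occupying 1538 bytes on the wire (1500-byte payload plus 38 bytes of Ethernet header, FCS, preamble and inter-frame gap) and a single core at the 4 GHz clock mentioned above. The function names are hypothetical.

```python
# Rough per-packet cycle budget; all defaults are illustrative assumptions.

def packets_per_second(link_bps, frame_bytes=1538):
    """Packets per second on a saturated link carrying full-size frames."""
    return link_bps / (frame_bytes * 8)


def cycles_per_packet(link_bps, cpu_hz=4e9, frame_bytes=1538):
    """CPU cycles one core can spend on each packet at line rate."""
    return cpu_hz / packets_per_second(link_bps, frame_bytes)


if __name__ == "__main__":
    for gbps in (1, 10, 100):
        budget = cycles_per_packet(gbps * 1e9)
        print(f"{gbps:>3} Gbps: about {budget:.0f} cycles per packet")
```

Under these assumptions the per-packet budget shrinks by a factor of 100 when the link goes from 1 Gbps to 100 Gbps while the clock stays fixed, which is why reducing the number of packets the host must handle is attractive.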
Traditional TCP acceleration techniques such as checksum
optimization [4] [5] [6] [7], zero-copy [8] and interrupt
coalescing [9] focus on the host side, so protocol processing is
still done by the host CPU. TOE [10] can offload the entire TCP
protocol processing workload and improve TCP performance
dramatically, but its implementation is very complex and often
causes security and compatibility issues [11]. TSO [12] optimizes
TCP’s data sending path by offloading the data segmentation and
checksum calculation functions; thanks to its simplicity, the
technique has become rather mature. LRO [13] improves TCP
performance by reducing the number of packet headers processed by
the CPU, but it works in the NIC driver layer, so the packet
aggregation job is still done by the host CPU.
A multi-core Network Processing Unit (NPU) is an integrated
circuit whose feature set is specifically targeted at the
networking application domain. It usually has excellent packet
processing capability for the following reasons:
1) More than a dozen hardware-based threads with low
switching overhead. The large number of hardware
contexts enables software to exploit more effectively the
inherent parallelism of packet processing applications.
2) Favorable I/O features. A multi-core NPU can import
packets from the network interface to memory with high
throughput. Moreover, its flexible dispatching component
can distribute packets to different threads or cores
according to the application configuration and operate in
a pipeline with the corresponding processing threads.
3) A well-designed message passing mechanism among
different threads. A multi-core NPU often employs a
crossbar structure or SRAM as its message transfer medium,
which makes thread synchronization efficient and
elegant.
Different TCP flows are weakly correlated and can be processed
concurrently. This naturally leads to the idea of employing a
multi-core NPU’s packet processing capability to accelerate TCP
processing on an end system. In this paper, we propose to use a
multi-core NPU as the NIC and implement LRO on it. Our
implementation reduces the number of packets processed by the
host network stack and the number of interrupts generated by the
NIC, and thereby improves TCP performance on an end system. The
experimental results demonstrate the effectiveness of our
proposal. Furthermore, our implementation involves only the NIC
hardware and driver layer: from the perspective of the kernel
network stack and user applications, the multi-core NPU is no
different from a normal NIC, so our implementation does not
suffer from TOE’s compatibility and security problems.
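Processing flows concurrently works only if every packet of a given flow reaches the same thread, so that per-flow state is never shared. A minimal sketch of such a dispatch rule is given below. It is an illustration only: the NPU's dispatcher is a hardware component whose actual hashing scheme is not described here, and `FlowKey`, `pick_thread` and the choice of CRC32 are assumptions made for this example.

```python
import zlib
from collections import namedtuple

# Hypothetical flow identifier: the TCP/IP 4-tuple.
FlowKey = namedtuple("FlowKey", "src_ip dst_ip src_port dst_port")


def pick_thread(key, num_threads):
    """Map a flow to a thread index; the same flow always maps to the same thread."""
    data = "{}|{}|{}|{}".format(*key).encode("ascii")
    # CRC32 stands in for whatever hash the real dispatcher uses (assumption).
    return zlib.crc32(data) % num_threads


if __name__ == "__main__":
    flow = FlowKey("10.0.0.1", "10.0.0.2", 40000, 80)
    print(pick_thread(flow, 16))
```

Because the mapping depends only on the flow key, packets of one flow are never reordered across threads, while distinct flows spread over the available hardware threads.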
II. RELATED WORK
Large Receive Offload is a NIC driver layer technique for
increasing TCP receive throughput. It works by aggregating
multiple small data packets of the same flow into fewer, larger
ones; the aggregated packets are then delivered to the kernel
network stack for further processing, as shown in Figure 1.
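The aggregation step described above can be sketched as follows. This is a simplified illustration, not the implementation evaluated in this paper: the `Segment` type, the `aggregate` function and the 64 KB default cap are assumptions for the example, and details such as header rewriting, checksums, timeouts and sequence number wraparound are ignored.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    flow: tuple      # hypothetical flow key, e.g. the TCP/IP 4-tuple
    seq: int         # sequence number of the first payload byte
    payload: bytes


def aggregate(segments, max_bytes=64 * 1024):
    """Merge contiguous segments of the same flow into larger segments.

    A pending aggregate is flushed when the next segment of its flow is
    not contiguous or would exceed max_bytes. Returns segments in flush order.
    """
    pending = {}   # flow -> aggregate currently being built
    out = []
    for seg in segments:
        cur = pending.get(seg.flow)
        if cur is not None:
            contiguous = seg.seq == cur.seq + len(cur.payload)
            fits = len(cur.payload) + len(seg.payload) <= max_bytes
            if contiguous and fits:
                cur.payload += seg.payload   # extend the aggregate
                continue
            out.append(pending.pop(seg.flow))  # flush and start over
        pending[seg.flow] = Segment(seg.flow, seg.seq, seg.payload)
    out.extend(pending.values())  # flush whatever is left at the end
    return out
```

The host stack then sees one header per aggregate instead of one per original packet, which is where the savings in header processing come from.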
LRO was first proposed by Grossman [13] and implemented
in the NIC driver program for Neterion Xframe-II. When
978-1-5090-1325-8/16/$31.00 ©2016 IEEE ICTC 2016