PacketUsher: a DPDK-Based Packet I/O Engine for Commodity PC
Zhijun Xu∗, Liang Zhou†, Li Feng‡, Yujun Zhang†, Jun Zhang§ and Huadong Ma∗
∗Beijing University of Posts and Telecommunications, Beijing, China
Email: {xuzhijun, mhd}@bupt.edu.cn
†Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Email: {zhouliang, zhmj}@ict.ac.cn
‡Faculty of Information Technology, Macau University of Science and Technology, Macau, China
Email: lfeng@must.edu.mo
§Inner Mongolia University, Hohhot, China
Email: zhangjun@imu.edu.cn
Abstract—Deploying network applications on commodity PCs is increasingly important because of their flexibility and low cost. Due to high packet I/O overheads, however, the performance of these applications is low. In this paper, we present PacketUsher, an efficient packet I/O engine built on the DPDK libraries. By replacing the standard I/O routine with PacketUsher, we can remarkably accelerate both I/O-intensive and compute-intensive applications on commodity PCs. As a case study of an I/O-intensive application, our RFC 2544 benchmark over PacketUsher achieves the same testing results as a dedicated commercial device. For a compute-intensive application, our application-layer traffic generator over PacketUsher achieves more than 4 times its original performance and outperforms existing frameworks by about 3 times.
Index Terms—network application; packet I/O; DPDK; I/O-
intensive; compute-intensive
I. INTRODUCTION
Software packet processing on commodity PCs is an ideal way to deploy network applications, especially since the rise of Network Function Virtualization (NFV) [1]. It is inexpensive to operate, makes it easy to switch between vendors, and is well suited to accommodate future software innovations [2]. While a significant step forward in some respects, it is a step backwards in others: the flexibility of commodity PCs comes at the cost of discouragingly low performance, which is mainly limited by packet I/O overheads. For example, the sendto() system call of FreeBSD takes 942 ns on average to transmit a packet, and RouteBricks (a software router) reports that 66% of its CPU cycles are spent on packet I/O [3].
To address the issue of costly packet I/O, current works aim to bypass the operating system and design novel packet I/O frameworks that take direct control of the hardware. Research [4] demonstrates that replacing the raw packet I/O APIs of a general-purpose OS with novel packet I/O frameworks such as Netmap [5] can transparently accelerate software routers, including Open vSwitch [6] and Click [7].
PF_RING [8] is a novel packet I/O framework for commodity PCs. Its zero-copy version can achieve line-rate (14.881 Mpps) packet I/O on a 10 Gbit/s link [9]. However, this version is not free for commercial companies or common users. The open-source Netmap usually takes 90 CPU cycles to send or receive a packet [5], but it is not convenient to deploy (it sometimes requires recompiling the Linux kernel) and suffers packet loss at high frame rates. Intel DPDK [10] is a set of open-source libraries for high-performance packet processing. It reduces the cost of packet I/O to less than 80 CPU cycles [10], and many companies (Intel, 6WIND, Radisys, etc.) already support DPDK in their products. In this paper, we use the DPDK libraries to design an efficient packet I/O engine for common users.
We argue that a packet I/O engine for commodity PCs should have four properties: low coupling with user applications, multi-thread safety, a simple packet I/O API, and high-speed packet I/O performance. This design goal motivates us to implement the PacketUsher packet I/O engine on commodity PCs. PacketUsher brings a noticeable performance improvement for both I/O-intensive and compute-intensive applications. For example, it makes our RFC 2544 benchmark (I/O-intensive) produce the same testing results as dedicated hardware, and gives our application-layer traffic generator (compute-intensive) a performance improvement of more than 4 times.
The remainder of this paper is organized as follows. Section II introduces background knowledge. Section III presents PacketUsher and its performance evaluation. Section IV presents two case studies.
II. BACKGROUND KNOWLEDGE
A. Overheads of Standard Packet I/O
The standard packet I/O mechanism in a general-purpose OS is interrupt-driven. It has three main overheads: interrupt handling, buffer allocation and memory copy.
Interrupt handling: At high frame rates, the interrupt-driven mechanism suffers from receive livelock [12]. Previous works [3] [4] [13] use batch processing to mitigate receive livelock; however, some received packets may still be dropped if the OS fails to handle interrupt requests in time. Another possible method is to replace the interrupt-driven mechanism with polling, which periodically checks for the arrival of packets on the NICs. Its drawback is that custom drivers must be used instead of standard ones. A compromise is Linux NAPI [14], which uses an interrupt to signal the arrival of packets and then uses polling to receive packets in batches.
Buffer allocation: Buffer allocation is another time-consuming operation. Allocating a buffer for every transmitted or received packet consumes substantial system resources. Previous works