AE: An Asymmetric Extremum Content Defined
Chunking Algorithm for Fast and
Bandwidth-Efficient Data Deduplication
Yucheng Zhang†, Hong Jiang‡, Dan Feng†*, Wen Xia†, Min Fu†, Fangting Huang†, Yukun Zhou†
†Wuhan National Laboratory for Optoelectronics, School of Computer, Huazhong University of Science and Technology, Wuhan, China
‡University of Nebraska-Lincoln, Lincoln, NE, USA
*Corresponding author: dfeng@hust.edu.cn
Abstract—Data deduplication, a space-efficient and
bandwidth-saving technology, plays an important role in
bandwidth-efficient data transmission in various data-intensive
network and cloud applications. Rabin-based and MAXP-based
Content-Defined Chunking (CDC) algorithms, while robust
in finding suitable cut-points for chunk-level redundancy
elimination, face the key challenges of (1) low chunking
throughput that renders the chunking stage the deduplication
performance bottleneck and (2) large chunk-size variance that
decreases deduplication efficiency. To address these challenges,
this paper proposes a new CDC algorithm called the Asymmetric
Extremum (AE) algorithm. The main idea behind AE is based on the observation that the extreme value in an asymmetric local range is not likely to be replaced by a new extreme value when the data stream is locally modified, which addresses the boundaries-shift problem and motivates AE's use of an asymmetric (rather than symmetric, as in MAXP) local range to identify cut-points, simultaneously achieving high chunking throughput and low chunk-size variance. As a result,
AE simultaneously addresses the problems of low chunking
throughput in MAXP and Rabin and high chunk-size variance
in Rabin. The experimental results based on four real-world
datasets show that AE improves the throughput performance
of the state-of-the-art CDC algorithms by 3x while attaining
comparable or higher deduplication efficiency.
I. INTRODUCTION
According to a study by the International Data Corporation (IDC), the amount of digital information generated worldwide was about 1.8ZB in 2012, and that amount will reach 40ZB by 2020 [1]. Efficiently storing and transferring such large volumes of digital data is a challenging problem. Moreover, IDC also reports that about three quarters of this digital information is duplicated. As a result, data deduplication, a space- and bandwidth-efficient technology that prevents redundant data from being stored in storage devices and transmitted over networks, is one of the most important methods for tackling this challenge. Due to its significant data reduction efficiency,
chunk-level deduplication is used in various fields, such as
storage systems [2], [3], Redundancy Elimination (RE) in
networks [4], [5], file-transfer systems (rsync [6]) and remote-
file systems (LBFS [7]).
Chunk-level deduplication schemes divide the input data
stream into chunks and then hash each chunk to generate its
fingerprint that uniquely identifies the chunk. Duplicate chunks
can be removed if their fingerprints are matched with those of
previously stored or transmitted chunks.
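As a rough sketch of this duplicate-detection step (for illustration only; the function name deduplicate and the choice of SHA-1 as the fingerprinting hash are assumptions of the example, not part of any particular system), the fingerprint-matching logic can be expressed as:

import hashlib

def deduplicate(chunks):
    # Fingerprint index of previously stored (or transmitted) chunks.
    seen = set()
    unique_chunks = []
    for chunk in chunks:
        # Hash the chunk to obtain a fingerprint that uniquely identifies it
        # (SHA-1 is assumed here purely for illustration).
        fp = hashlib.sha1(chunk).digest()
        if fp in seen:
            continue            # duplicate chunk: removed, not stored again
        seen.add(fp)
        unique_chunks.append(chunk)
    return unique_chunks

In practice the fingerprint index persists across backup sessions or transfers, so a chunk already present at the destination is never stored or sent again.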
As the first and key stage in the chunk-level deduplication workflow, the chunking
algorithm is responsible for dividing the input data stream into
chunks of either fixed size or variable size for redundancy de-
tection. Fixed-Size Chunking (FSC) [8] marks chunks’ bound-
aries by their positions and thus is simple and extremely fast.
The main drawback of FSC is its low deduplication efficiency
that stems from the boundaries-shift problem. For example, if
one byte is inserted at the beginning of an input data stream, all
current chunk boundaries declared by FSC will be shifted and
no duplicate chunks will be identified and eliminated. Content-
Defined Chunking (CDC) divides the input data stream into
variable-sized chunks. It solves the boundaries-shift problem
by declaring chunk boundaries depending on local content. As
a result, the CDC algorithm outperforms the FSC algorithm
in terms of deduplication efficiency and has been widely used
in bandwidth- and storage-efficient applications [9], [10].
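To make the contrast concrete, the following sketch places a fixed-size chunker next to a highly simplified content-defined chunker (for illustration only: the window size, the mask, and the use of CRC32 in place of a rolling Rabin fingerprint are assumptions of the example, and the minimum/maximum chunk-size limits used by practical CDC schemes are omitted):

import zlib

def fixed_size_chunks(data, size=4096):
    # FSC: boundaries depend only on absolute positions, so inserting a
    # single byte at the front shifts every subsequent boundary.
    return [data[i:i + size] for i in range(0, len(data), size)]

def content_defined_chunks(data, window=48, mask=0x1FFF):
    # Simplified CDC: declare a cut-point whenever a checksum of the last
    # `window` bytes has its low 13 bits equal to zero, giving an expected
    # average chunk size of about 8 KB. Boundaries move with the content,
    # not with absolute positions. (Rabin-based CDC uses a rolling Rabin
    # fingerprint instead of recomputing CRC32 at every position.)
    chunks, start = [], 0
    for i in range(window, len(data) + 1):
        if zlib.crc32(data[i - window:i]) & mask == 0:
            chunks.append(data[start:i])
            start = i
    if start < len(data):
        chunks.append(data[start:])
    return chunks

Inserting a single byte at the front of data shifts every boundary returned by fixed_size_chunks, whereas content_defined_chunks typically re-declares the same cut-points after the first affected chunk, so most chunks and their fingerprints remain unchanged.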
To provide the necessary basis to facilitate the discussion of and
comparison among different CDC algorithms, we list below
some key properties that a desirable CDC algorithm should
have.
1) Content defined. To avoid the boundaries-shift problem,
the algorithm should declare the chunk boundaries based
on local content, i.e., the cut-points for chunking must be
content defined.
2) Low computational overhead. CDC algorithms need to
check almost every byte in an input data stream to
find the chunk boundaries. This means that the algorithm's execution time is approximately proportional to the number of bytes in the input data stream, which can
take up significant CPU resources. Hence, in order to
achieve higher deduplication throughput, the chunking
algorithm should be simple and devoid of time-consuming
operations.
3) Small chunk size variability. The variance of chunk size
has a significant impact on the deduplication efficiency.
The smaller the chunk-size variance, the higher the deduplication efficiency that can be achieved [11].
4) Ability to identify and eliminate low-entropy strings.
The content of real data may sometimes include low-
entropy strings [12]. These strings include very few
distinct characters but a large number of repetitive bytes.
In order to achieve higher deduplication efficiency, it is
desirable for the algorithm to be capable of detecting and
eliminating these duplicate strings.
5) Fewer limitations on chunk size. Minimum and maximum