RMD: A Resemblance and Mergence based Approach for High Performance Deduplication

Panfeng Zhang∗, Ping Huang†, Xubin He†, Hua Wang∗, Lingyu Yan‡ and Ke Zhou∗

∗School of Computer, Huazhong University of Science and Technology, Wuhan, China
Wuhan National Laboratory for Optoelectronics, Wuhan, China
†Virginia Commonwealth University, USA
‡School of Computer Science, Hubei University of Technology

{panfzhang, k.zhou, hwang}@hust.edu.cn  {phuang, xhe2}@vcu.edu  yanranyaya@126.com
Abstract—Data deduplication, a data redundancy elimination technique, has been employed in almost all kinds of application environments to reduce storage space. However, one of the main challenges facing deduplication technology is providing a fast key-value fingerprint index for large datasets, as index performance is critical to overall deduplication performance. This paper proposes RMD, a resemblance and mergence based deduplication scheme that aims to provide quick responses to fingerprint queries. The key idea of RMD is to leverage a bloom filter array and a data resemblance algorithm to dramatically reduce the query range for deduplication. Moreover, RMD utilizes a mergence-based approach to merge resembling segments into relevant bins, and exploits a frequency-based Fingerprint Retention Policy to reduce bin capacity, which improves both query throughput and the data deduplication ratio. Extensive experiments with real-world datasets show that RMD achieves high query performance and outperforms several state-of-the-art deduplication schemes.
I. INTRODUCTION
As a space-efficient technology to reduce storage overhead, deduplication has attracted great attention and gained popularity in various storage systems, such as primary storage systems [1]–[3], secondary backup systems [4]–[6], and high-performance data centers [7], [8]. As a global data redundancy removal technology, deduplication identifies duplicate data content, stores only one copy of the data, and replaces other identical copies with indirect references rather than storing full copies. Typically, data deduplication relies on fingerprinting to find duplicate data instead of byte-wise comparison. In deduplication systems, the fingerprint of a file or chunk is calculated using a cryptographic hash algorithm (e.g., SHA-1). Fingerprints serve as proxies for testing content uniqueness. A fingerprint index is established to map fingerprints to the physical addresses of files or chunks. A duplicate file or chunk can be identified by checking for the existence of its fingerprint in the fingerprint index.
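The fingerprint-based workflow described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the in-memory dictionary and list stand in for the fingerprint index and chunk store, and the names (`FingerprintIndex`, `write`) are hypothetical.

```python
import hashlib

def fingerprint(chunk: bytes) -> bytes:
    # A cryptographic hash (here SHA-1) serves as the chunk's fingerprint.
    return hashlib.sha1(chunk).digest()

class FingerprintIndex:
    """Toy index mapping fingerprints to storage addresses."""
    def __init__(self):
        self.index = {}   # fingerprint -> address
        self.store = []   # unique chunks; list position acts as the address

    def write(self, chunk: bytes) -> int:
        fp = fingerprint(chunk)
        if fp in self.index:          # duplicate: return a reference only
            return self.index[fp]
        addr = len(self.store)        # unique: store exactly one copy
        self.store.append(chunk)
        self.index[fp] = addr
        return addr

idx = FingerprintIndex()
a = idx.write(b"hello world")
b = idx.write(b"hello world")   # duplicate detected via the index
assert a == b and len(idx.store) == 1
```

In a real system this dictionary is far too large for memory, which is exactly the scaling problem the rest of this section discusses.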
In recent years, data-intensive applications have become popular in cloud environments [9], [10], producing huge amounts of data. Data deduplication has been leveraged to facilitate the management of such “big data”. One of the main challenges with data deduplication is providing a high-performance and scalable key-value fingerprint index for deduplicating large-scale datasets [11], [12]. For example, to support a unique dataset of 800 TB with an average chunk size of 8 KB, at least 2 TB of SHA-1 (20-byte) fingerprints will be generated, which is far too large to fit in memory [13]. Therefore, for practical considerations, the commonly adopted approach is to store the entire fingerprint index either in a disk system [11] or on flash [12], with part of the fingerprints cached in a memory buffer. However, accessing disk or flash is much slower than accessing memory, creating the well-known fingerprint bottleneck problem. In this paper, we are mainly concerned with the case where fingerprints are stored in an HDD-based storage system, and we optimize the fingerprint index organization on disk. Fortunately, due to the existence of locality [13], in most cases it suffices to check against only a portion of the fingerprints using near-exact deduplication, without significantly losing deduplication efficiency.
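The 2 TB figure in the paragraph above follows from a simple back-of-the-envelope calculation:

```python
# Back-of-the-envelope check of the 2 TB fingerprint index figure.
DATASET = 800 * 2**40      # 800 TB of unique data
CHUNK   = 8 * 2**10        # 8 KB average chunk size
FP_SIZE = 20               # SHA-1 digest, 20 bytes

num_chunks = DATASET // CHUNK
index_bytes = num_chunks * FP_SIZE
print(num_chunks)               # ~1.07e11 chunks
print(index_bytes / 2**40)      # ~1.95 TiB, i.e. roughly 2 TB
```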
Our goal is to design an efficient fingerprint index store that provides high query performance while minimally sacrificing deduplication efficiency. To this end, we propose RMD, a new resemblance and mergence based near-exact deduplication scheme, which provides very high index query performance by reducing the search space via a resemblance-based segment store organization. The key idea of RMD is to exploit the data resemblance theory and a bloom filter array to narrow the query range. Specifically, RMD uses a resemblance algorithm to detect resembling segments and clusters them into the same bin. Due to this clustering, RMD can rapidly find all the potential segments to compare against when identifying duplicate chunks, whereas in other existing deduplication approaches those segments might be distributed across a large index space. Meanwhile, RMD leverages a bloom filter array, a high-performance, scalable data structure suitable for membership queries. Overall, by combining a bloom filter array with the clustering of resembling segments, RMD can quickly locate the most resembling segments for deduplication, significantly reducing the disk I/O accesses incurred by fingerprint queries and thus improving deduplication performance.
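To make the bloom filter component concrete, the sketch below shows a minimal Bloom filter supporting the membership test used to decide whether a bin could contain a match. This is an illustrative implementation under assumed parameters (bit-array size, hash count), not RMD's actual bloom filter array; the class and method names are hypothetical.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: a fast probabilistic membership test
    with no false negatives and a tunable false-positive rate."""
    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item: bytes):
        # Derive k bit positions by salting the hash with the index i.
        for i in range(self.num_hashes):
            h = hashlib.sha1(i.to_bytes(2, "big") + item).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, item: bytes):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item: bytes) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

# One filter per bin: a bin is read from disk only when its filter
# reports a possible match, so most bins are skipped without any I/O.
bf = BloomFilter()
bf.add(b"segment-feature-1")
assert bf.might_contain(b"segment-feature-1")   # inserted items always hit
```

Because a negative answer is definitive, an array of such filters lets RMD rule out most bins in memory and confine disk lookups to the few resembling candidates.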
II. BACKGROUND AND MOTIVATION
In this section, we first provide the necessary background
knowledge for RMD, and then motivate our work by analyz-