A High-precision Duplicate Image Deduplication Approach

Ming Chen 1, Shupeng Wang 2, and Liang Tian 3

1. National Engineering Laboratory for Disaster Backup and Recovery, Beijing University of Posts and Telecommunications, Beijing, China
2. Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
3. College of Computer and Information Engineering, Xinxiang University, Xinxiang, China
Email: cm19834@163.com, wangshupeng@iie.ac.cn, gaa252@gmail.com
Abstract—Deduplication has been widely used in backup and archive systems to improve storage utilization effectively. However, traditional deduplication can only eliminate exactly identical images; it cannot handle duplicate images that have the same visual perception but different codes. To address this problem, this paper proposes a high-precision duplicate image deduplication approach. The main idea of the proposed approach is to eliminate duplicate images through five stages: feature extraction, high-dimensional indexing, accuracy optimization, centroid selection, and deduplication evaluation. Experimental results on a real dataset demonstrate that the proposed approach not only saves storage space effectively but also significantly improves the retrieval precision of duplicate images. In addition, the selected centroid images meet the requirements of human perception.
Index Terms—image deduplication; B+ tree; accuracy
optimization; centroid selection; fuzzy synthetic evaluation
I. INTRODUCTION
Recently, with the development of the Internet and the popularity of digital products, the volume of global digital resources has been growing at an alarming rate. In 2007, for the first time ever, the total volume of digital resources exceeded the global storage capacity, and it was estimated that by 2011 only half of the digital information produced could be stored [1]. Hence, the data explosion problem cannot be solved simply by adding more storage devices. To meet the demand for storage space, Professor Kai Li of Princeton University proposed a new technology known as global compression, or deduplication. Deduplication identifies redundant data, eliminates all but one copy, and creates local pointers through which users can still access the information. This technology has attracted widespread attention from both industry and academia [2, 3, 4, 5].
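As an illustration of this mechanism (a minimal sketch only, not the approach proposed in this paper), the following Python code fingerprints each file by its content hash, keeps a single stored copy per fingerprint, and records a pointer from every duplicate to the retained copy; the function name and the dictionary-based pointer table are assumptions made for the example.

import hashlib
from pathlib import Path

def exact_dedup(paths):
    # Exact (bit-level) deduplication: files with identical byte
    # streams share one stored copy; all other occurrences are
    # reduced to pointers (here, entries in a dictionary).
    store = {}     # fingerprint -> path of the single retained copy
    pointers = {}  # original path -> path of the retained copy
    for p in map(Path, paths):
        digest = hashlib.sha256(p.read_bytes()).hexdigest()
        if digest not in store:
            store[digest] = p        # first occurrence is kept
        pointers[p] = store[digest]  # duplicates only reference it
    return store, pointers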
However, traditional deduplication judges two data items to be redundant only if their underlying bit-streams are identical. This restriction is too strict for many applications [6]. For example, on an image storage platform, because of the encoding rules, even a tiny transformation completely changes the bit-stream of an image. Consequently, traditional deduplication can only eliminate exactly identical images; it cannot handle duplicate images that have the same visual perception but different codes.
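To make this concrete, the short sketch below (an illustration only; it assumes the Pillow library and uses hypothetical file names) re-encodes a photograph at a lower JPEG quality and compares the SHA-256 digests of the two files: the pictures look the same, but their bit-streams, and hence their hashes, differ, so exact deduplication treats them as unrelated objects.

import hashlib
from PIL import Image

original = "photo.jpg"     # hypothetical input file
recoded = "photo_q70.jpg"  # same picture, lower JPEG quality

# Re-encode the same picture; the visual content barely changes.
Image.open(original).save(recoded, "JPEG", quality=70)

h1 = hashlib.sha256(open(original, "rb").read()).hexdigest()
h2 = hashlib.sha256(open(recoded, "rb").read()).hexdigest()

# The byte streams now differ, so a bit-level deduplicator
# will not recognize the two files as duplicates.
print(h1 == h2)  # almost certainly False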
However, in practical applications, because of network transmission requirements or storage space restrictions, users often upload modified images, and images with the same content often exist in different versions that vary in resolution or quality. From a visual point of view, images that have the same visual perception but different codes can be regarded as redundant. Therefore, in large-scale data center storage and data clouds, effectively eliminating redundant copies of images can significantly improve storage utilization; such storage optimization has important practical significance.
At present, research on image deduplication has not yet produced satisfactory results. In 2011, Katiyar proposed an application-aware framework for video deduplication [6]. The framework used ordinal signatures to construct video signatures and measured the similarity of compared video sequences with Sequence Shape Similar (SSS). Finally, the proper centroid video was selected from each duplicate video collection by minimizing the compression ratio and maximizing the quality of the compressed videos. However, this framework had two defects. First, it did not consider deduplication accuracy; erroneous deduplication would cause losses to users and degrade the quality of service. Second, it did not consider system scalability: for 1017 videos, it required $\binom{1017}{2}$ pairwise video-comparisons, which would take nearly 2 hours [6].
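A quick back-of-the-envelope computation (illustrative Python, not taken from [6]) shows why all-pairs comparison does not scale: the number of comparisons grows quadratically with the collection size.

from math import comb

# Pairwise comparisons required by an all-pairs scheme: C(n, 2) = n*(n-1)/2
for n in (1017, 10_000, 1_000_000):
    print(f"{n:>9} items -> {comb(n, 2):,} comparisons")
# 1017 videos already require 516,636 comparisons (the case reported in [6]);
# one million items would require roughly 5 * 10^11.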
In addition, the Targeted Public Distribution System (TPDS) of India was a mechanism for ensuring the access and availability of food grains and other essential commodities at subsidized prices to households [7]. To bogus ration cards appear