2nd INTERNATIONAL CONFERENCE ON ADVANCED COMPUTING AND SOFTWARE ENGINEERING (ICACSE-19)
A Study of Data De-duplication Methods
Dinesh Mishra¹ and Dr. Sanjeev Patwa²
Abstract: Data is among the most important assets of any organization, whether for productivity or for profit. The rapid growth of data
in varied forms is a serious challenge to handle and process. Data is being generated at an ever higher rate and must be stored in databases
without duplication. Deduplication is an approach that removes duplicated data from databases and supports data backup. Numerous
deduplication algorithms exist; they detect and eliminate redundant data and store a single unique copy of the
data contents. In this paper, we first survey the background and key features of data de-duplication, and then classify the research in
data de-duplication according to the key strategy of the de-duplication process. This summary and classification of the state of the art
on de-duplication helps identify and understand the most important design considerations for data de-duplication systems. Finally, we
outline open problems and future research directions for de-duplication-based storage systems.
Keywords: Data de-duplication; data reduction; Level of de-duplication; de-duplication approaches; storage systems.
I. INTRODUCTION
Deduplication is becoming increasingly important because
it can effectively reduce the storage space used on cloud
servers. The exponential growth of data volumes makes it
necessary to explore techniques such as data
deduplication to keep data manageable and reduce
archive and backup costs. With the rapid growth of cloud
data volume, deduplication technology has become
important to cloud storage: it eliminates redundant
copies of user-uploaded data to save storage space and
reduce the management cost of cloud storage servers [1].
The use of the cloud by companies and individuals for
storing, backing up, and sharing data
has increased enormously over the past few years. Data
deduplication is a commonly used method to reduce
storage requirements in data centers and enterprise
servers. It operates by identifying and removing duplicate
blocks of data over long ranges. For example, consider a
corporate logo used in many slide decks of that
corporation. An enterprise storage server using
deduplication can store only the first occurrence of the
logo and replace subsequent occurrences with pointers to
the stored copy [25]. De-duplication is a data
compression technique for reducing redundant data [5].
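The logo example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the mechanism of any particular storage server: the `DedupStore` class, its method names, and the use of SHA-256 to recognize identical content are all hypothetical choices made for the sketch.

```python
import hashlib


class DedupStore:
    """Toy store (hypothetical): each unique payload is kept only once."""

    def __init__(self):
        self.blobs = {}  # digest -> bytes, the single stored copy
        self.files = {}  # file name -> digest, acting as the "pointer"

    def put(self, name, data):
        # Assumption: SHA-256 identifies identical content.
        digest = hashlib.sha256(data).hexdigest()
        # Store the payload only on its first occurrence.
        if digest not in self.blobs:
            self.blobs[digest] = data
        # Every occurrence, first or later, is recorded as a pointer.
        self.files[name] = digest
        return digest

    def get(self, name):
        # Follow the pointer back to the single stored copy.
        return self.blobs[self.files[name]]

    def physical_bytes(self):
        return sum(len(b) for b in self.blobs.values())


store = DedupStore()
logo = b"\x89PNG" + bytes(1000)  # stand-in for the corporate logo
for deck in ("q1.pptx", "q2.pptx", "q3.pptx"):
    store.put(deck + "/logo.png", logo)
```

After the three `put` calls the store holds three file entries but only one physical copy of the logo, so `physical_bytes()` reports the size of a single logo rather than three.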
Today, on average, 13% of IT budgets is spent on
storage capacity. According to IDC's Digital Universe
study, data will grow even more quickly [3].
¹ Ph.D Scholar, Dept. of Comp. Sc. & Engg., School of Engg. & Tech, MODY University, Lakshmangarh, Rajasthan, India
² Asstt. Prof., Dept. of Comp. Sc. & Engg., School of Engg. & Tech, MODY University, Lakshmangarh, Rajasthan, India
E-mail: ¹dmishra1475@gmail.com, ²sanjeevpatwa.cet@modyuniversity.ac.in
These effects create problems such as degraded
performance and higher operational costs. To
overcome these problems and keep systems manageable, the
concept of de-duplication was introduced.
Data de-duplication refers to the elimination of
redundant data by physically storing only the data that is
unique. This technique effectively reduces storage
capacity requirements and is applicable whenever
multiple copies of the same data set need to be stored.
De-duplication reduces the required data storage capacity,
since only a single copy of the data is stored. Research
in the area of data de-duplication includes [17], [20].
In general, data de-duplication increases the speed of
services and reduces costs. It improves the efficiency of
disk-based backups.
• De-duplication reduces storage cost, as it
  reduces the amount of physical capacity
  required for a backup job.
• Because de-duplication curtails the amount of disk
  needed to support a backup job, it also
  reduces the power, space, and cooling
  requirements of the disks.
II. DE-DUPLICATION PROCESS
The de-duplication process has four main stages:
chunking, fingerprinting, indexing, and writing [25].
Figure 1: De-duplication process (Chunking → Fingerprinting → Indexing → Writing)
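The four stages can be illustrated with a minimal Python sketch. The text names the stages but does not fix their details, so the choices here are assumptions made for illustration only: fixed-size chunks of 4096 bytes, SHA-256 fingerprints, an in-memory dictionary as the index, and a Python list standing in for the storage device. The function names are hypothetical.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size, chosen for illustration


def chunk(data, size=CHUNK_SIZE):
    """Chunking: split the input into fixed-size pieces."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def fingerprint(piece):
    """Fingerprinting: derive a hash that identifies the chunk content."""
    return hashlib.sha256(piece).hexdigest()


def deduplicate(data, index, storage):
    """Run all four stages; return the list of fingerprints (the recipe)."""
    recipe = []
    for piece in chunk(data):
        fp = fingerprint(piece)
        if fp not in index:               # Indexing: has this chunk been seen?
            storage.append(piece)         # Writing: store unique chunks only
            index[fp] = len(storage) - 1  # remember where the chunk lives
        recipe.append(fp)
    return recipe


def restore(recipe, index, storage):
    """Rebuild the original data by following the recipe through the index."""
    return b"".join(storage[index[fp]] for fp in recipe)


index, storage = {}, []
data = b"A" * 8192 + b"B" * 4096 + b"A" * 4096
recipe = deduplicate(data, index, storage)
```

In this run the input splits into four chunks, three of which are identical, so only two chunks are written while the recipe still lists four fingerprints; `restore(recipe, index, storage)` returns the original bytes. Fixed-size chunking is used here only because it is the simplest way to show the stages.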
Electronic copy available at: https://ssrn.com/abstract=3351012