RMD: A Resemblance and Mergence based Approach for High Performance Deduplication

Panfeng Zhang∗, Ping Huang†, Xubin He†, Hua Wang∗, Lingyu Yan‡ and Ke Zhou∗

∗School of Computer, Huazhong University of Science and Technology, Wuhan, China
Wuhan National Laboratory for Optoelectronics, Wuhan, China
†Virginia Commonwealth University, USA
‡School of Computer Science, Hubei University of Technology

{panfzhang, k.zhou, hwang}@hust.edu.cn  {phuang, xhe2}@vcu.edu  yanranyaya@126.com
Abstract—Data deduplication, a data redundancy elimination technique, has been employed in almost all kinds of application environments to reduce storage space. However, one of the main challenges facing deduplication technology is providing a fast key-value fingerprint index for large datasets, as index performance is critical to overall deduplication performance. This paper proposes RMD, a resemblance and mergence based deduplication scheme that aims to provide quick responses to fingerprint queries. The key idea of RMD is to leverage a bloom filter array and a data resemblance algorithm to dramatically reduce the query range for deduplication. Moreover, RMD utilizes a mergence-based approach to merge resembling segments into relevant bins, and exploits a frequency-based Fingerprint Retention Policy to reduce bin capacity, which improves both query throughput and the data deduplication ratio. Extensive experiments with real-world datasets show that RMD achieves high query performance and outperforms several state-of-the-art deduplication schemes.
I. INTRODUCTION
As a space-efficient technology to reduce storage overhead, deduplication has attracted great attention and gained popularity in various storage systems, such as primary storage systems [1]–[3], secondary backup systems [4]–[6], and high-performance data centers [7], [8]. As a global data redundancy removal technology, deduplication identifies duplicate data content, stores only one copy of the data, and replaces other identical copies with indirect references rather than storing full copies. Typically, data deduplication relies on fingerprinting to find duplicate data instead of byte-wise comparison. In deduplication systems, the fingerprint of a file or chunk is calculated using a cryptographic hash algorithm (e.g., SHA-1). Fingerprints serve as proxies for testing content uniqueness. A fingerprint index is established to map fingerprints to the physical addresses of files or chunks. A duplicate file or chunk can be identified by checking for the existence of its fingerprint in the fingerprint index.
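The fingerprint-based workflow described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the in-memory dictionary and list stand in for the fingerprint index and chunk store, and the names (`FingerprintIndex`, `write`) are hypothetical.

```python
import hashlib

def fingerprint(chunk: bytes) -> bytes:
    # A cryptographic hash (here SHA-1) serves as the chunk's fingerprint.
    return hashlib.sha1(chunk).digest()

class FingerprintIndex:
    """Toy index mapping fingerprints to storage addresses."""
    def __init__(self):
        self.index = {}   # fingerprint -> address
        self.store = []   # unique chunks; list position acts as the address

    def write(self, chunk: bytes) -> int:
        fp = fingerprint(chunk)
        if fp in self.index:          # duplicate: return a reference only
            return self.index[fp]
        addr = len(self.store)        # unique: store exactly one copy
        self.store.append(chunk)
        self.index[fp] = addr
        return addr

idx = FingerprintIndex()
a = idx.write(b"hello world")
b = idx.write(b"hello world")   # duplicate detected via the index
assert a == b and len(idx.store) == 1
```

In a real system this dictionary is far too large for memory, which is exactly the scaling problem the rest of this section discusses.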
In recent years, data-intensive applications have become popular in cloud environments [9], [10], producing huge amounts of data. Data deduplication has been leveraged to facilitate the management of such “big data”. One of the main challenges with data deduplication is providing a high-performance and scalable key-value fingerprint index for deduplicating large-scale datasets [11], [12]. For example, to support a unique dataset of 800 TB with an average chunk size of 8 KB, at least 2 TB of SHA-1 (20-byte) fingerprints will be generated, which is far too large to fit in memory [13]. Therefore, for practical considerations, the commonly adopted approach is to store the entire fingerprint index either in a disk system [11] or on flash [12], with part of the fingerprints cached in a memory buffer. However, accessing disk or flash is much slower than accessing memory, creating the well-known fingerprint bottleneck problem. In this paper, we are mainly concerned with the case where fingerprints are stored in an HDD-based storage system, and we optimize the fingerprint index organization on disk. Fortunately, due to the existence of locality [13], in most cases it suffices to check against only a portion of the fingerprints using near-exact deduplication, without significantly losing deduplication efficiency.
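The 2 TB figure in the paragraph above follows from a simple back-of-the-envelope calculation:

```python
# Back-of-the-envelope check of the 2 TB fingerprint index figure.
DATASET = 800 * 2**40      # 800 TB of unique data
CHUNK   = 8 * 2**10        # 8 KB average chunk size
FP_SIZE = 20               # SHA-1 digest, 20 bytes

num_chunks = DATASET // CHUNK
index_bytes = num_chunks * FP_SIZE
print(num_chunks)               # ~1.07e11 chunks
print(index_bytes / 2**40)      # ~1.95 TiB, i.e. roughly 2 TB
```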
Our goal is to design an efficient fingerprint index store that provides high query performance while minimally sacrificing deduplication efficiency. To this end, we propose RMD, a new resemblance and mergence based near-exact deduplication scheme, which provides very high index query performance by reducing the search space via a resemblance-based segment store organization. The key idea of RMD is to exploit the data resemblance theory and a bloom filter array to narrow the query range. Specifically, RMD uses a resemblance algorithm to detect resembling segments and clusters them into the same bin. Due to this clustering, RMD can rapidly find all the potential segments to compare against when identifying duplicate chunks, whereas in other existing deduplication approaches those segments might be distributed across a large index space. Meanwhile, RMD leverages a bloom filter array, a high-performance, scalable data structure suitable for membership queries. Overall, by combining a bloom filter array with the clustering of resembling segments, RMD can quickly locate the most resembling segments for deduplication, significantly reducing the disk I/O accesses incurred by fingerprint queries and thus improving deduplication performance.
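To make the bloom filter component concrete, the sketch below shows a minimal Bloom filter supporting the membership test used to decide whether a bin could contain a match. This is an illustrative implementation under assumed parameters (bit-array size, hash count), not RMD's actual bloom filter array; the class and method names are hypothetical.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: a fast probabilistic membership test
    with no false negatives and a tunable false-positive rate."""
    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item: bytes):
        # Derive k bit positions by salting the hash with the index i.
        for i in range(self.num_hashes):
            h = hashlib.sha1(i.to_bytes(2, "big") + item).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, item: bytes):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item: bytes) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

# One filter per bin: a bin is read from disk only when its filter
# reports a possible match, so most bins are skipped without any I/O.
bf = BloomFilter()
bf.add(b"segment-feature-1")
assert bf.might_contain(b"segment-feature-1")   # inserted items always hit
```

Because a negative answer is definitive, an array of such filters lets RMD rule out most bins in memory and confine disk lookups to the few resembling candidates.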
II. BACKGROUND AND MOTIVATION
In this section, we first provide the necessary background
knowledge for RMD, and then motivate our work by analyz-