AE: An Asymmetric Extremum Content Defined
Chunking Algorithm for Fast and
Bandwidth-Efficient Data Deduplication
Yucheng Zhang†, Hong Jiang‡, Dan Feng†*, Wen Xia†, Min Fu†, Fangting Huang†, Yukun Zhou†
†Wuhan National Laboratory for Optoelectronics, School of Computer, Huazhong University of Science and Technology, Wuhan, China
‡University of Nebraska-Lincoln, Lincoln, NE, USA
*Corresponding author: dfeng@hust.edu.cn
Abstract—Data deduplication, a space-efficient and
bandwidth-saving technology, plays an important role in
bandwidth-efficient data transmission in various data-intensive
network and cloud applications. Rabin-based and MAXP-based
Content-Defined Chunking (CDC) algorithms, while robust
in finding suitable cut-points for chunk-level redundancy
elimination, face the key challenges of (1) low chunking
throughput that renders the chunking stage the deduplication
performance bottleneck and (2) large chunk-size variance that
decreases deduplication efficiency. To address these challenges,
this paper proposes a new CDC algorithm called the Asymmetric
Extremum (AE) algorithm. The main idea behind AE is based on the observation that the extreme value in an asymmetric local range is not likely to be replaced by a new extreme value when the data stream is locally modified, which addresses the boundaries-shift problem and motivates AE's use of an asymmetric (rather than symmetric, as in MAXP) local range to identify cut-points, simultaneously achieving high chunking throughput and low chunk-size variance. As a result,
AE simultaneously addresses the problems of low chunking
throughput in MAXP and Rabin and high chunk-size variance
in Rabin. The experimental results based on four real-world
datasets show that AE improves the throughput performance
of the state-of-the-art CDC algorithms by 3x while attaining
comparable or higher deduplication efficiency.
I. INTRODUCTION
According to a study by the International Data Corporation (IDC), the amount of digital information generated worldwide was about 1.8ZB in 2012, and that amount will reach 40ZB by 2020 [1]. Efficiently storing and transferring such large volumes of digital data is a challenging problem. Moreover, IDC also reports that about three quarters of this digital information is duplicated. As a result, data deduplication, a space- and bandwidth-efficient technology that prevents redundant data from being stored in storage devices and transmitted over networks, is one of the most important methods for tackling this challenge. Due to its significant data reduction efficiency,
chunk-level deduplication is used in various fields, such as
storage systems [2], [3], Redundancy Elimination (RE) in
networks [4], [5], file-transfer systems (rsync [6]) and remote-
file systems (LBFS [7]).
Chunk-level deduplication schemes divide the input data
stream into chunks and then hash each chunk to generate its
fingerprint that uniquely identifies the chunk. Duplicate chunks
can be removed if their fingerprints are matched with those of
previously stored or transmitted chunks.
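As a rough sketch of this duplicate-detection step (for illustration only; the function name deduplicate and the choice of SHA-1 as the fingerprinting hash are assumptions of the example, not part of any particular system), the fingerprint-matching logic can be expressed as:

import hashlib

def deduplicate(chunks):
    # Fingerprint index of previously stored (or transmitted) chunks.
    seen = set()
    unique_chunks = []
    for chunk in chunks:
        # Hash the chunk to obtain a fingerprint that uniquely identifies it
        # (SHA-1 is assumed here purely for illustration).
        fp = hashlib.sha1(chunk).digest()
        if fp in seen:
            continue            # duplicate chunk: removed, not stored again
        seen.add(fp)
        unique_chunks.append(chunk)
    return unique_chunks

In practice the fingerprint index persists across backup sessions or transfers, so a chunk already present at the destination is never stored or sent again.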
As the first and key stage in the chunk-level deduplication workflow, the chunking
algorithm is responsible for dividing the input data stream into
chunks of either fixed size or variable size for redundancy de-
tection. Fixed-Size Chunking (FSC) [8] marks chunks’ bound-
aries by their positions and thus is simple and extremely fast.
The main drawback of FSC is its low deduplication efficiency
that stems from the boundaries-shift problem. For example, if
one byte is inserted at the beginning of an input data stream, all
current chunk boundaries declared by FSC will be shifted and
no duplicate chunks will be identified and eliminated. Content-
Defined Chunking (CDC) divides the input data stream into
variable-sized chunks. It solves the boundaries-shift problem
by declaring chunk boundaries depending on local content. As
a result, the CDC algorithm outperforms the FSC algorithm
in terms of deduplication efficiency and has been widely used
in bandwidth- and storage-efficient applications [9], [10].
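To make the contrast concrete, the following sketch places a fixed-size chunker next to a highly simplified content-defined chunker (for illustration only: the window size, the mask, and the use of CRC32 in place of a rolling Rabin fingerprint are assumptions of the example, and the minimum/maximum chunk-size limits used by practical CDC schemes are omitted):

import zlib

def fixed_size_chunks(data, size=4096):
    # FSC: boundaries depend only on absolute positions, so inserting a
    # single byte at the front shifts every subsequent boundary.
    return [data[i:i + size] for i in range(0, len(data), size)]

def content_defined_chunks(data, window=48, mask=0x1FFF):
    # Simplified CDC: declare a cut-point whenever a checksum of the last
    # `window` bytes has its low 13 bits equal to zero, giving an expected
    # average chunk size of about 8 KB. Boundaries move with the content,
    # not with absolute positions. (Rabin-based CDC uses a rolling Rabin
    # fingerprint instead of recomputing CRC32 at every position.)
    chunks, start = [], 0
    for i in range(window, len(data) + 1):
        if zlib.crc32(data[i - window:i]) & mask == 0:
            chunks.append(data[start:i])
            start = i
    if start < len(data):
        chunks.append(data[start:])
    return chunks

Inserting a single byte at the front of data shifts every boundary returned by fixed_size_chunks, whereas content_defined_chunks typically re-declares the same cut-points after the first affected chunk, so most chunks and their fingerprints remain unchanged.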
To provide the necessary basis to facilitate the discussion of and
comparison among different CDC algorithms, we list below
some key properties that a desirable CDC algorithm should
have.
1) Content defined. To avoid the boundaries-shift problem,
the algorithm should declare the chunk boundaries based
on local content, i.e., the cut-points for chunking must be
content defined.
2) Low computational overhead. CDC algorithms need to
check almost every byte in an input data stream to
find the chunk boundaries. This means that the algorithm's execution time is approximately proportional to the number of bytes in the input data stream, which can
take up significant CPU resources. Hence, in order to
achieve higher deduplication throughput, the chunking
algorithm should be simple and devoid of time-consuming
operations.
3) Small chunk size variability. The variance of chunk size
has a significant impact on the deduplication efficiency.
The smaller the chunk-size variance, the higher the deduplication efficiency that can be achieved [11].
4) Ability to identify and eliminate low-entropy strings.
The content of real data may sometimes include low-
entropy strings [12]. These strings include very few
distinct characters but a large number of repetitive bytes.
In order to achieve higher deduplication efficiency, it is
desirable for the algorithm to be capable of detecting and
eliminating these duplicate strings.
5) Fewer limitations on chunk size. Minimum and maximum