local source deduplication if (4) is satisfied, as is the case when the network bandwidth is very low, and also outperform global source deduplication since (5) is always satisfied due to the high cloud latency:
$$\mathit{BWS}_{LG} = T_L + T_G \cdot \frac{P_L}{L} + \frac{C}{B} \cdot \frac{P_G}{L} \qquad (3)$$

$$\frac{P_G}{P_L} + \frac{B\,T_G}{C} < 1 \qquad (4)$$

$$\frac{P_L}{L} + \frac{T_L}{T_G} < 1. \qquad (5)$$
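To make the trade-off behind (3)-(5) concrete, the following minimal sketch evaluates the per-chunk backup window of local-only, global-only, and local-global source deduplication and checks conditions (4) and (5). All parameter values are hypothetical, chosen only to illustrate a low-bandwidth, high-cloud-latency setting; they are not measurements from the paper.

```python
# Minimal sketch of the backup-window model in (3)-(5).
# All numbers below are hypothetical illustrations, not measurements.

C = 8 * 1024          # average chunk size (bytes)
B = 128 * 1024        # network bandwidth (bytes/s), deliberately low
T_L = 0.1e-3          # local duplicate-detection latency per chunk (s)
T_G = 20e-3           # global (cloud) detection latency per chunk (s)
L = 100e9             # logical dataset size (bytes)
P_L = 60e9            # physical size after local source dedup (bytes)
P_G = 40e9            # physical size after global source dedup (bytes)

# Per-chunk backup windows: every chunk is checked locally; survivors of
# local dedup (fraction P_L/L) are checked globally; survivors of global
# dedup (fraction P_G/L) are transmitted, taking C/B seconds per chunk.
bws_local = T_L + (C / B) * (P_L / L)                   # local-only
bws_global = T_G + (C / B) * (P_G / L)                  # global-only
bws_lg = T_L + T_G * (P_L / L) + (C / B) * (P_G / L)    # eq. (3)

cond4 = P_G / P_L + B * T_G / C < 1   # LG beats local-only, eq. (4)
cond5 = P_L / L + T_L / T_G < 1       # LG beats global-only, eq. (5)

print(f"BWS_L={bws_local*1e3:.2f} ms  BWS_G={bws_global*1e3:.2f} ms  "
      f"BWS_LG={bws_lg*1e3:.2f} ms  (4) holds: {cond4}  (5) holds: {cond5}")
```

With these values, $\mathit{BWS}_{LG} \approx 37.1$ ms beats both $\mathit{BWS}_L \approx 37.6$ ms and $\mathit{BWS}_G \approx 45.0$ ms, consistent with (4) and (5) both holding.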
We can define a metric for deduplication efficiency that balances the cloud storage cost saving against the backup window shrinking in source deduplication based cloud backup services. It is well understood that deduplication efficiency is proportional to deduplication effectiveness, commonly defined by the duplicate elimination ratio $R = L/P$, and inversely proportional to the average backup window size per chunk $\mathit{BWS}$, given the average chunk size $C$. Based on this understanding, and to better quantify and compare the deduplication efficiency of a wide variety of deduplication techniques, we propose a new metric, called ''bytes saved per second,'' expressed in (6), to measure the deduplication efficiency $\mathit{DE}$ of different deduplication schemes on the same platform. Our local-global source deduplication design can achieve high efficiency owing to its global deduplication effectiveness and reduced backup window:
$$\mathit{DE} = \frac{C}{\mathit{BWS}} \left(1 - \frac{1}{R}\right). \qquad (6)$$
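Plugging the hypothetical numbers from the sketch above into (6) gives a feel for the metric: $1 - 1/R = 1 - P/L$ is simply the fraction of bytes saved, while $C/\mathit{BWS}$ is the byte throughput of the backup process.

```python
# Deduplication efficiency (bytes saved per second), eq. (6).
# Hypothetical values carried over from the sketch above.
C, L, P_G = 8 * 1024, 100e9, 40e9
BWS_LG = 0.0371                  # seconds per chunk, from eq. (3)
R = L / P_G                      # duplicate elimination ratio, R = L/P
DE_LG = (C / BWS_LG) * (1 - 1 / R)
print(f"DE_LG = {DE_LG/1024:.0f} KiB saved per second")   # ~129 KiB/s
```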
Different from the traditional deduplication based cloud backup services that are oblivious to file-level semantic knowledge, we optimize the efficiency of source deduplication based cloud backup services by exploiting application awareness. We can divide the backup dataset into $n$ application data subsets according to file semantics, and improve the deduplication effectiveness and decrease the deduplication latency with a dedicated deduplication process for each kind of application data. For application $i$, we define $L_i$, $P_{Li}$, and $P_{Gi}$ as its logical data subset size and its physical data subset sizes after traditional local and global source deduplication, respectively; and $P_{ALi}$ and $P_{AGi}$ as its physical data subset sizes after local and global application-aware source deduplication, respectively. We assume that its average latencies per chunk for local and global application-aware duplicate detection are $T_{ALi}$ and $T_{AGi}$, respectively; then $T_{ALi} < T_L$ and $T_{AGi} < T_G$ due to the index lookup optimization enabled by application awareness. According to our observation in Section 3 that the amount of data shared among different types of applications is negligibly small, we have $P_{Li} \approx P_{ALi}$ and $P_{Gi} \approx P_{AGi}$ for any $i$. Formulas (7) and (8) estimate the upper bounds of the average backup window size per chunk $\mathit{BWS}_{ALG}$ and the physical dataset size $P_{AG}$ for application-aware local-global source deduplication based cloud backup services. Formula (9) then follows, indicating that the efficiency of application-aware local-global source deduplication $\mathit{DE}_{ALG}$ is higher than that of local-global source deduplication $\mathit{DE}_{LG}$:
$$\begin{aligned}
\mathit{BWS}_{ALG} &= \sum_{i=1}^{n} \frac{L_i}{L}\left(T_{ALi} + T_{AGi}\cdot\frac{P_{ALi}}{L_i} + \frac{C}{B}\cdot\frac{P_{AGi}}{L_i}\right) \\
&< \sum_{i=1}^{n} \frac{L_i}{L}\left(T_L + T_G\cdot\frac{P_{ALi}}{L_i} + \frac{C}{B}\cdot\frac{P_{AGi}}{L_i}\right) \\
&\approx \sum_{i=1}^{n} \frac{L_i}{L}\,T_L + T_G\sum_{i=1}^{n}\frac{P_{Li}}{L} + \frac{C}{B}\sum_{i=1}^{n}\frac{P_{Gi}}{L} \\
&= T_L + T_G\cdot\frac{P_L}{L} + \frac{C}{B}\cdot\frac{P_G}{L} = \mathit{BWS}_{LG} \qquad (7)
\end{aligned}$$
$$P_{AG} = \sum_{i=1}^{n} P_{AGi} \approx \sum_{i=1}^{n} P_{Gi} = P_G \qquad (8)$$
$$\mathit{DE}_{ALG} = \frac{C}{\mathit{BWS}_{ALG}}\left(1 - \frac{P_{AG}}{L}\right) > \frac{C}{\mathit{BWS}_{LG}}\left(1 - \frac{P_G}{L}\right) = \mathit{DE}_{LG}. \qquad (9)$$
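The inequality chain in (7)-(9) can be sanity-checked numerically. The sketch below partitions a hypothetical dataset into per-application subsets, assumes application-aware detection latencies strictly below the application-oblivious ones ($T_{ALi} < T_L$, $T_{AGi} < T_G$) and negligible cross-application sharing ($P_{ALi} \approx P_{Li}$, $P_{AGi} \approx P_{Gi}$), and confirms $\mathit{BWS}_{ALG} < \mathit{BWS}_{LG}$ and $\mathit{DE}_{ALG} > \mathit{DE}_{LG}$. Every number is an illustrative assumption, not a result from the paper.

```python
# Numerical sanity check of (7)-(9) under hypothetical per-application data.
C, B = 8 * 1024, 128 * 1024        # chunk size (bytes), bandwidth (bytes/s)
T_L, T_G = 0.1e-3, 20e-3           # application-oblivious latencies (s/chunk)

# Per-application subsets i = 1..n; latencies chosen strictly below T_L, T_G.
apps = [
    # L_i,   P_Li~P_ALi, P_Gi~P_AGi, T_ALi,   T_AGi
    (50e9,   30e9,       20e9,       0.06e-3, 12e-3),
    (30e9,   20e9,       14e9,       0.05e-3, 10e-3),
    (20e9,   10e9,        6e9,       0.07e-3, 14e-3),
]
L = sum(a[0] for a in apps)
P_L = sum(a[1] for a in apps)
P_G = sum(a[2] for a in apps)

# Eq. (3): application-oblivious local-global backup window per chunk.
bws_lg = T_L + T_G * (P_L / L) + (C / B) * (P_G / L)

# Eq. (7), first line: application-aware backup window per chunk.
bws_alg = sum(
    (L_i / L) * (T_ALi + T_AGi * (P_Li / L_i) + (C / B) * (P_Gi / L_i))
    for (L_i, P_Li, P_Gi, T_ALi, T_AGi) in apps
)

# Eq. (8): P_AG ~ P_G, so (9) reduces to comparing the backup windows.
de_lg = (C / bws_lg) * (1 - P_G / L)
de_alg = (C / bws_alg) * (1 - P_G / L)
assert bws_alg < bws_lg and de_alg > de_lg
print(f"BWS_ALG={bws_alg*1e3:.2f} ms < BWS_LG={bws_lg*1e3:.2f} ms; "
      f"DE_ALG={de_alg/1024:.0f} KiB/s > DE_LG={de_lg/1024:.0f} KiB/s")
```

With these numbers, $\mathit{BWS}_{ALG} \approx 32.1$ ms versus $\mathit{BWS}_{LG} \approx 37.1$ ms, so the application-aware scheme saves about 15 percent of the backup window at equal storage savings.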
3 DEDUPLICATION ANALYSIS ON PERSONAL DATA
In this section, we investigate how data redundancy, the space utilization efficiency of popular data chunking methods, and the computational overhead of typical hash functions change across different applications of personal computing, to motivate our research. We perform a preliminary experimental study on datasets collected from desktops in our research group, volunteers' personal laptops, personal workstations for image processing and financial analysis, and a shared home server. Table 1 outlines the key dataset characteristics: the number of devices, the applications, and the dataset size for each studied workload. To the best of our knowledge, this is the first systematic deduplication analysis on personal storage.
Observation 1
A disproportionately large percentage of storage space is occupied by a very small number of large files with very low chunk-level redundancy after file-level deduplication.
Implication
File-level deduplication using weak hash functions for these large files is sufficient to avoid hash collisions for small datasets in the personal computing environment.
To reveal the relationship between file count and storage capacity under various file sizes, we collect statistics on the distribution of file count and storage space occupied by files of different sizes in the datasets listed in Table 1.
TABLE 1
Datasets Used for Deduplication Analysis