Large-Scale Graph Similarity Join: A Scalable Prefix-Filtering Algorithm Implemented with MapReduce


"高效的图相似性与使用MapReduce的可伸缩前缀过滤结合在一起" 在大数据时代，图数据的处理和分析变得越来越重要，尤其是在社交网络、生物信息学和推荐系统等领域。图相似性连接（Graph Similarity Join）是图数据分析中的一个关键任务，它旨在找出两个大型图数据集中所有相似的图对。本文提出了一种基于MapReduce的高效算法，该算法能够有效地处理大规模图数据集上的图相似性连接问题。首先，该算法的核心是可伸缩的前缀过滤（Scalable Prefix-Filtering）。传统的前缀过滤方法通常受限于内存限制，无法处理具有大量节点和边的q-gram字母表。然而，该算法采用了一种新的策略，即使在超出内存容量的情况下，也能有效地进行前缀过滤，从而筛选出可能相似的图对。这极大地减少了计算资源的需求，提高了处理效率。其次，为了进一步提升性能，论文提出了一种有效的候选减少策略（Effective Candidate Reduction Strategy）。这个策略能够在过滤阶段减少大量的无效候选对，从而显著降低数据通信成本。在分布式环境中，数据通信通常是性能瓶颈，因此这种策略对于提高整体系统的并行性和可扩展性至关重要。再者，该算法还引入了两轮数据访问提案（Two-Round Data Access Proposal）。通过优化数据访问模式，算法能够在两轮迭代中减少对存储设备的访问次数，降低了数据访问开销。这不仅减少了I/O操作，也加快了计算速度，使得大规模图数据处理变得更加高效。实验结果表明，该提案在多个大型真实和合成数据集上均优于现有的最先进的方法。无论是在计算时间、内存使用还是系统扩展性方面，新算法都表现出显著的优势。这些改进对于应对不断增长的图数据规模以及满足实时分析的需求具有重要意义。这篇论文提供的是一种创新的解决方案，将高效的图相似性计算与MapReduce框架相结合，解决了大规模图数据处理中的挑战。这种方法有望在图挖掘、网络分析和相关应用中发挥重要作用，推动图数据处理技术的发展。

Efficient Graph Similarity Join with Scalable Prefix-Filtering Using MapReduce

Jun Pang, Yu Gu, Jia Xu, Yubin Bao, and Ge Yu

College of Information Science and Engineering,

Northeastern University, Liaoning, 110819, China

pangjun@research.neu.edu.cn, {guyu,baoyubin,yuge}@ise.neu.edu.cn

School of Information System and Management,

National University of Defense Technology, Changsha, 410073, China

xujia.neu@gmail.com

Abstract. The graph similarity join retrieves all pairs of similar graphs from graph datasets. In this paper, we propose an efficient MapReduce-friendly algorithm that tackles the graph similarity join problem on large-scale graph datasets. In particular, the efficiency of our algorithm is guaranteed by: 1) scalable prefix-filtering suitable for a q-gram alphabet that exceeds the memory; 2) an effective candidate reduction strategy that greatly cuts down the data communication cost; 3) a two-round data access proposal that reduces the data access overhead. Extensive experiments on large-scale real and synthetic datasets demonstrate that our proposal outperforms the state-of-the-art method with higher system scalability and faster speed.

1 Introduction

With the rapid growth of graph data generated and collected by many applications in social networks, bioinformatics and chemistry, there is a huge demand for effective analysis tools for big graph datasets. The graph similarity join provides an indispensable functionality for such analysis tasks. However, most previous graph similarity join algorithms are in-memory algorithms and cannot handle large graph datasets. Worse still, graph similarity functions, e.g. graph edit distance, are commonly computationally expensive [1], which makes the performance of the graph similarity join on large-scale datasets a serious concern.

To solve the problems above, a potential solution is to resort to popular distributed computation paradigms such as MapReduce [2][3]. However, to the best of our knowledge, no work on large-scale graph similarity joins based on MapReduce has been reported. In this paper, we parallelize the GSimJoin algorithm [6], which is the state-of-the-art centralized graph similarity join method with edit distance constraints. In particular, we optimize this parallel algorithm with scalable prefix-filtering and compression techniques.
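As a rough illustration of what prefix-filtering does, the following single-machine Python sketch generates candidate pairs from token sets. It is not the MR-GSimJoin algorithm: it assumes each graph has already been reduced to a set of q-gram tokens, and it uses a plain set-overlap threshold (`min_overlap`) in place of the prefix length that an edit distance constraint would yield. The function name `prefix_filter_candidates` and the sample records are hypothetical.

```python
from collections import Counter, defaultdict


def prefix_filter_candidates(records, min_overlap):
    """Return candidate pairs (id_a, id_b) that may share >= min_overlap tokens.

    records: dict mapping a record id to a set of tokens (assumed here to be
    the q-grams of one graph). Uses the standard prefix-filtering principle:
    if two sets share at least min_overlap tokens, then their prefixes of
    length len(set) - min_overlap + 1, taken under one global token order,
    share at least one token.
    """
    # Global order: rare tokens first, ties broken by the token itself.
    freq = Counter(tok for toks in records.values() for tok in toks)

    def order(tok):
        return (freq[tok], tok)

    index = defaultdict(list)  # prefix token -> ids of records seen so far
    candidates = set()
    for rid in sorted(records):
        toks = sorted(records[rid], key=order)
        prefix_len = len(toks) - min_overlap + 1
        if prefix_len <= 0:
            continue  # fewer tokens than min_overlap: can never qualify
        for tok in toks[:prefix_len]:
            for other in index[tok]:
                candidates.add((other, rid))
            index[tok].append(rid)
    return candidates


if __name__ == "__main__":
    sample = {
        "g1": {"a", "b", "c", "d"},
        "g2": {"b", "c", "d", "e"},
        "g3": {"w", "x", "y", "z"},
    }
    print(prefix_filter_candidates(sample, min_overlap=3))
```

The surviving pairs are only candidates: each still has to be verified with the actual similarity function. In a MapReduce setting, one natural mapping (an assumption here, not a detail given in this excerpt) is to emit each prefix token as a key so that records sharing a prefix token meet in the same reducer.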

We propose the progressive MR-GSimJoin algorithm in Section 2. Extensive experimental results are reported in Section 3. We discuss related work in Section 4, and Section 5 concludes this paper.

F. Li et al. (Eds.): WAIM 2014, LNCS 8485, pp. 415–418, 2014.

© Springer International Publishing Switzerland 2014
