Exploring Computation Locality of Graph Mining
Algorithms on MapReduce
Qiuhong Li, Ke Dai, Wei Wang, Peng Wang
School of Computer Science, Fudan University, Shanghai, China
Email: {09110240012, ke, weiwang1, pengwang5}@fudan.edu.cn
Abstract—Previous implementations of graph mining algorithms on MapReduce ignore the locality characteristic of distributed systems. In a distributed system, locality means that operations take place on local compute nodes without communication with remote compute nodes. In this paper we present LI-MR (Local Iteration MapReduce), a framework to improve a class of graph operators that can be expressed as repeated matrix-vector multiplications. LI-MR considers the locality of subgraphs and adopts a coarse-grained communication unit for MapReduce. In particular, for subgraphs, only part of the operations need synchronization. We propose a method to implement random data access on Hadoop by writing results to HBase. With the range-query support provided by HBase, LI-MR allows subgraphs to complete their computation with enough information held in main memory; because of the locality of subgraphs, the information needed for the computation is limited. In this way, the LI-MR framework combines in-memory computation with the MapReduce model for graph algorithms.
I. INTRODUCTION
With the rapid development of the Internet, more and more very large-scale web applications have appeared. Parallel graph processing techniques are necessary for web-scale applications, and thus have drawn increasing attention from researchers recently. For example, Google proposes Pregel [13] for large-scale graph processing, which uses a vertex-centric method to process graphs and messages for communication. Since Pregel is designed for sparse graphs, performance suffers when most vertices continuously send messages to most other vertices. Therefore, the scalability of Pregel is in doubt for graphs with billions of vertices. An increasingly popular large-scale data processing paradigm is MapReduce[8], in which processing is specified by a map phase and a reduce phase. However, MapReduce cannot support iterative applications well, because large invariant input data travels over the network repeatedly; it is therefore not suitable for the large number of graph algorithms that essentially employ iterative procedures. Hadoop[1] divides the input to a MapReduce job into fixed-size pieces called input splits, or just splits. Hadoop creates one map task for each split, which runs the user-defined map function for each record in the split. This kind of map-task scheduling is suitable for large batch-processing applications such as log analysis. However, it is not suitable for graph applications, because it allocates resources only according to data size.
To support iterative algorithms, HaLoop [6] improves the performance of iterative applications by adopting a map cache and a reduce cache to avoid transferring invariant input data over the network repeatedly across iterations. However, HaLoop does not consider computation locality for graph applications. The workload is determined by several factors, such as the type of graph algorithm and the graph structure. We argue that graph partitioning techniques are essential for graph algorithms on Hadoop.
In this paper, we present LI-MR (Local Iteration MapReduce), an improved MapReduce platform that better exploits the locality of graph processing. The main idea of LI-MR is to divide the original graph into several subgraphs, each of which can be processed by a map function with the help of a mapper cache. The principle of LI-MR is to avoid transferring invariant data in Hadoop and to supply the variant data to multi-iteration graph applications with the help of HBase and a local cache. Previous solutions for graph computation on Hadoop generally use a one-pass MapReduce job to implement the join of multiple data sources, which may lead to a large amount of redundant I/O, especially when the variant data is small.
In the LI-MR framework, a distinction is made in the relevant data space between the original data G and the data that is updated during the iterative computation. The former is called "invariant" data and the latter "variant" data. Our key observation is that the invariant data can permanently reside on the compute nodes and need not be communicated, whereas the variant data may need to be communicated to the compute nodes. For the variant data, we propose a global index structure supported by HBase that can be queried by the compute nodes when necessary. The strength of MapReduce, however, lies in the fact that it combines sequential and parallel computation. We exploit locality by fetching data from HBase once and serving several sequential subgraph computations with it. For this purpose, we use a local cache that resides on the mapper. By implementing suitable graph partitioning strategies and a proper execution order, subgraphs can share the information fetched from HBase. In this way, our LI-MR framework combines in-memory graph computation with MapReduce, which can thus benefit from exploiting the locality of graph computation.
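The mapper-cache idea can be sketched as follows. This is an illustrative sketch under our own assumptions, not the paper's implementation: a plain dictionary stands in for the HBase-backed global index, and the class and function names are hypothetical. It shows one map task whose cache fetches each variant value from the "remote" store once and then serves several subgraph computations locally.

```python
class MapperCache:
    """Caches variant data fetched from the (simulated) remote store,
    so several subgraph computations in one map task share one fetch."""
    def __init__(self, remote_store):
        self.remote = remote_store  # stands in for the HBase global index
        self.local = {}
        self.fetches = 0  # count remote accesses to show reuse

    def get(self, vertex):
        if vertex not in self.local:
            self.local[vertex] = self.remote[vertex]
            self.fetches += 1
        return self.local[vertex]

def process_subgraph(edges, cache):
    """One local iteration over an invariant subgraph: for each edge
    (u, v) with weight w, accumulate w * x[v] into u's new value."""
    result = {}
    for u, v, w in edges:
        result[u] = result.get(u, 0.0) + w * cache.get(v)
    return result

# Variant data: the current vector x, held in the simulated global index.
remote_store = {0: 1.0, 1: 2.0, 2: 3.0}
cache = MapperCache(remote_store)

# Two subgraphs assigned to the same map task; both reference vertex 2,
# so its value is fetched from the "remote" store only once.
sub_a = [(0, 2, 0.5), (1, 2, 0.5)]
sub_b = [(2, 0, 1.0), (2, 1, 1.0)]
out_a = process_subgraph(sub_a, cache)
out_b = process_subgraph(sub_b, cache)
```

Only three remote fetches serve four edge computations here; with larger subgraphs and a partitioning that groups vertices with shared neighborhoods, the savings grow accordingly.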
To exemplify our idea, we consider several important web applications.
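One such application, PageRank, directly illustrates the repeated matrix-vector multiplication pattern. The following minimal sketch (illustrative code under our own naming, not the paper's implementation) runs power iteration on a toy graph with no dangling vertices; each iteration is one matrix-vector multiplication with the column-stochastic link matrix.

```python
def pagerank(out_links, d=0.85, iters=50):
    """Power iteration: rank <- d * M @ rank + (1 - d) / n each round."""
    n = len(out_links)
    rank = {v: 1.0 / n for v in out_links}
    for _ in range(iters):
        # Start each vertex at the teleport term, then add link mass.
        new_rank = {v: (1.0 - d) / n for v in out_links}
        for u, outs in out_links.items():
            share = d * rank[u] / len(outs)  # u spreads its rank evenly
            for v in outs:
                new_rank[v] += share
        rank = new_rank
    return rank

# Tiny 3-vertex cycle: by symmetry, every rank converges to 1/3.
g = {0: [1], 1: [2], 2: [0]}
r = pagerank(g)
```

Each pass over the edges is exactly the matrix-vector product that LI-MR's subgraph-local computation targets: the edge list is the invariant data, while the rank vector is the variant data that must be refreshed between iterations.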
There is a class of graph algorithms that fall under the GIM-V[11] operator, such as PageRank, HADI[12] for diameter estimation, Random Walk with Restart[14], Adsorption[5], and MAD[16]. GIM-V is a generalization of normal matrix-vector