Exploring Computation Locality of Graph Mining
Algorithms on MapReduce
Qiuhong Li, Ke Dai, Wei Wang, Peng Wang
School of Computer Science, Fudan University, Shanghai, China
Email: {09110240012, ke, weiwang1, pengwang5}@fudan.edu.cn
Abstract—Previous implementations of graph mining algorithms on MapReduce ignore the locality characteristic of distributed systems. In a distributed system, locality means that operations take place on local compute nodes without communication with remote compute nodes. In this paper we present LI-MR (Local Iteration MapReduce), a framework to improve a class of graph operators that can be expressed as repeated matrix-vector multiplications. LI-MR considers the locality of subgraphs and adopts a coarse-grained communication unit for MapReduce. In particular, for subgraphs, only part of the operations need synchronization. We propose a method to implement random data access on Hadoop by writing results to HBase. With the range-query support provided by HBase, LI-MR allows subgraphs to complete their computation with enough information held in main memory; because of the locality of subgraphs, the information needed for the computation is limited. In this way, the LI-MR framework combines in-memory computation with the MapReduce model for graph algorithms.
I. INTRODUCTION
With the rapid development of the Internet, more and more very large-scale web applications have appeared. Parallel graph processing techniques are necessary for web-scale applications, and thus have drawn increasing attention from researchers recently. For example, Google proposes Pregel [13] for large-scale graph processing, which uses a vertex-centric method to process graphs and messages for communication. Since Pregel is designed for sparse graphs, performance suffers when most vertices continuously send messages to most other vertices. Therefore, the scalability of Pregel is in doubt for graphs with billions of vertices. An increasingly popular large-scale data processing paradigm is MapReduce[8], in which processing is specified by a map phase and a reduce phase. However, MapReduce cannot support iterative applications well, because large invariant input data travels over the network repeatedly; it is therefore not suitable for the large number of graph algorithms that essentially employ iterative procedures. Hadoop[1] divides the input to a MapReduce job into fixed-size pieces called input splits, or just splits. Hadoop creates one map task for each split, which runs the user-defined map function for each record in the split. This kind of map-task scheduling is suitable for large batch-processing applications such as log analysis. However, it is not suitable for graph applications, because it allocates resources only according to data size.
To support iterative algorithms, HaLoop [6] improves the performance of iterative applications by adopting a map cache and a reduce cache to avoid transferring invariant input data over the network repeatedly across iterations. However, HaLoop does not consider computation locality for graph applications. The workload is determined by several factors, such as the type of graph algorithm and the graph structure. We argue that graph partitioning techniques are essential for graph algorithms on Hadoop.
In this paper, we present LI-MR (Local Iteration MapReduce), an improved MapReduce platform that better exploits the locality of graph processing. The main idea of LI-MR is to divide the original graph into several subgraphs, each of which can be processed by a map function with the help of a mapper cache. The principle of LI-MR is to avoid transferring invariant data in Hadoop and to supply the variant data to multi-iteration graph applications with the help of HBase and a local cache. Previous solutions for graph computation on Hadoop generally use a one-pass MapReduce job to implement the join of multiple data sources, which may lead to a large amount of redundant I/O, especially when the variant data is small.
In the LI-MR framework, a distinction is made in the relevant data space between the original data G and the data that is updated during the iterative computation. The former is called "invariant" data and the latter "variant" data. Our key observation is that the invariant data can permanently reside on the compute nodes and need not be communicated, whereas the variant data may need to be communicated to the compute nodes. For the variant data, we propose a global index structure supported by HBase that can be queried by the compute nodes when necessary. The strength of MapReduce, however, lies in the fact that it combines sequential and parallel computation. We exploit locality by fetching data from HBase once and serving several sequential subgraph computations with it. For this purpose, we use a local cache that resides on the mapper. By implementing suitable graph partitioning strategies and a proper execution order, subgraphs can share the information fetched from HBase. In this way, our LI-MR framework combines in-memory graph computation with MapReduce, which can thus benefit from exploiting the locality of graph computation.
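The mapper-cache idea can be sketched as follows. This is an illustrative sketch under our own assumptions, not the paper's implementation: a plain dictionary stands in for the HBase-backed global index, and the class and function names are hypothetical. It shows one map task whose cache fetches each variant value from the "remote" store once and then serves several subgraph computations locally.

```python
class MapperCache:
    """Caches variant data fetched from the (simulated) remote store,
    so several subgraph computations in one map task share one fetch."""
    def __init__(self, remote_store):
        self.remote = remote_store  # stands in for the HBase global index
        self.local = {}
        self.fetches = 0  # count remote accesses to show reuse

    def get(self, vertex):
        if vertex not in self.local:
            self.local[vertex] = self.remote[vertex]
            self.fetches += 1
        return self.local[vertex]

def process_subgraph(edges, cache):
    """One local iteration over an invariant subgraph: for each edge
    (u, v) with weight w, accumulate w * x[v] into u's new value."""
    result = {}
    for u, v, w in edges:
        result[u] = result.get(u, 0.0) + w * cache.get(v)
    return result

# Variant data: the current vector x, held in the simulated global index.
remote_store = {0: 1.0, 1: 2.0, 2: 3.0}
cache = MapperCache(remote_store)

# Two subgraphs assigned to the same map task; both reference vertex 2,
# so its value is fetched from the "remote" store only once.
sub_a = [(0, 2, 0.5), (1, 2, 0.5)]
sub_b = [(2, 0, 1.0), (2, 1, 1.0)]
out_a = process_subgraph(sub_a, cache)
out_b = process_subgraph(sub_b, cache)
```

Only three remote fetches serve four edge computations here; with larger subgraphs and a partitioning that groups vertices with shared neighborhoods, the savings grow accordingly.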
To exemplify our idea, we consider several important web applications.
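One such application, PageRank, directly illustrates the repeated matrix-vector multiplication pattern. The following minimal sketch (illustrative code under our own naming, not the paper's implementation) runs power iteration on a toy graph with no dangling vertices; each iteration is one matrix-vector multiplication with the column-stochastic link matrix.

```python
def pagerank(out_links, d=0.85, iters=50):
    """Power iteration: rank <- d * M @ rank + (1 - d) / n each round."""
    n = len(out_links)
    rank = {v: 1.0 / n for v in out_links}
    for _ in range(iters):
        # Start each vertex at the teleport term, then add link mass.
        new_rank = {v: (1.0 - d) / n for v in out_links}
        for u, outs in out_links.items():
            share = d * rank[u] / len(outs)  # u spreads its rank evenly
            for v in outs:
                new_rank[v] += share
        rank = new_rank
    return rank

# Tiny 3-vertex cycle: by symmetry, every rank converges to 1/3.
g = {0: [1], 1: [2], 2: [0]}
r = pagerank(g)
```

Each pass over the edges is exactly the matrix-vector product that LI-MR's subgraph-local computation targets: the edge list is the invariant data, while the rank vector is the variant data that must be refreshed between iterations.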
There is a class of graph algorithms that fall under the GIM-V[11] operator, such as PageRank, HADI[12] for diameter estimation, Random Walk with Restart[14], Adsorption[5], and MAD[16]. GIM-V is a generalization of normal matrix-vector