VPCH: A Consistent Hashing Algorithm for Better Load Balancing in a Hadoop
Environment
Qi Liu*, Weidong Cai, Jian Shen, Baowei Wang,
Zhangjie Fu
Department of Computer and Software
Nanjing University of Information Science and
Technology
Nanjing, China
qrankl@163.com
Nigel Linge
The University of Salford
Salford, Greater Manchester, UK
n.linge@salford.ac.uk
Abstract—MapReduce (MR) is a popular programming model for processing large data sets on data clusters or grids, e.g. in a Hadoop environment. Load balancing, as a key factor affecting the performance of task distribution, has recently attracted considerable attention. Current MR implementations distribute tasks to reducers by hashing keys and applying a modulo operation, which can lead to uneven data distribution and skewed loads, thereby degrading the performance of the entire distribution system. In this paper, a virtual partition consistent hashing (VPCH) algorithm is proposed for the reduce stage of MR processes, in order to balance job allocation. According to the results, our method reduces task execution time whether or not the MJR (mapreduce.job.reduce.slowstart.completedmaps) parameter is set.
Keywords- MapReduce; Load Balancing; Consistent Hashing
I. INTRODUCTION
In recent years, with the explosive growth of data and data processing on the Internet, cloud computing has been widely studied in both academia and industry, in order to provide users with a distributed system offering on-demand services, computing power and storage resources. Proposed by Google in 2004, MapReduce (MR) [1] has become the most popular distributed computing model used in cloud environments, where large-scale datasets can be handled and processed transparently by map and reduce procedures on the cloud infrastructure.
Besides map and reduce, other internal processes integrated into MR have also been analyzed and optimized [2]. Taking the partition procedure as an example, Hadoop uses a hash function with a modulo operation to compute partition keys, so that, in principle, each partition holds roughly the same number of records. However, such a method can cause a skewed allocation of tasks to different reducers. In this paper, a new scheme called VPCH is proposed to implement virtual partitioning. This scheme ensures that the load on each reducer during the reduce phase is relatively balanced, and the total execution time is reduced thanks to the even distribution of tasks across reducers.
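For reference, Hadoop's stock hash partitioning amounts to taking the key's hash modulo the number of reduce tasks; the following simplified Java sketch mirrors the behavior of the built-in HashPartitioner. Because nothing in this rule accounts for how many records share a key, a handful of heavy keys or an unlucky residue pattern can overload a single reducer.

    // Simplified sketch of Hadoop's default hash-modulo partitioning.
    // Every record with the same key lands on reducer (hash(key) mod R),
    // so heavily repeated keys pile up on one reducer and skew its load.
    public class HashModuloPartitioner<K, V>
            extends org.apache.hadoop.mapreduce.Partitioner<K, V> {

        @Override
        public int getPartition(K key, V value, int numReduceTasks) {
            // Mask the sign bit so the modulo result is always non-negative.
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }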
A practical Hadoop environment has been implemented in our laboratory, retaining both the original and the refined partitioning schemes so that users can select and compare either of them for their actual tasks and/or for performance evaluation. According to our results, the VPCH algorithm shortens execution time for both the reduce phase and the overall task completion.
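To make the comparison concrete, the sketch below shows a generic consistent-hash partitioner with virtual nodes, the family of techniques VPCH belongs to. It is an illustration under our own assumptions (the class name VirtualNodePartitioner and the choice of 100 virtual nodes per reducer are made up for the example), not the exact VPCH algorithm, which is detailed in Section III. A job opts into a custom scheme via Job.setPartitionerClass(), which is how the original and refined partitioners can coexist on the same cluster.

    // Illustrative consistent hashing with virtual nodes (an assumption-based
    // sketch, not the exact VPCH scheme of Section III). Each reducer owns
    // several positions on a hash ring; a key is routed to the first virtual
    // node clockwise from its own hash, smoothing the load compared with a
    // plain modulo.
    import java.util.SortedMap;
    import java.util.TreeMap;

    public class VirtualNodePartitioner<K, V>
            extends org.apache.hadoop.mapreduce.Partitioner<K, V> {

        private static final int VIRTUAL_NODES_PER_REDUCER = 100; // assumed value

        private final TreeMap<Integer, Integer> ring = new TreeMap<>();
        private int builtFor = -1;

        private void buildRing(int numReduceTasks) {
            ring.clear();
            for (int r = 0; r < numReduceTasks; r++) {
                for (int v = 0; v < VIRTUAL_NODES_PER_REDUCER; v++) {
                    int pos = ("reducer-" + r + "#" + v).hashCode() & Integer.MAX_VALUE;
                    ring.put(pos, r); // ring position -> reducer id
                }
            }
            builtFor = numReduceTasks;
        }

        @Override
        public int getPartition(K key, V value, int numReduceTasks) {
            if (builtFor != numReduceTasks) {
                buildRing(numReduceTasks);
            }
            int h = key.hashCode() & Integer.MAX_VALUE;
            // Walk clockwise to the first virtual node at or after h,
            // wrapping to the start of the ring if none is found.
            SortedMap<Integer, Integer> tail = ring.tailMap(h);
            return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
        }
    }

    // A job selects the scheme explicitly, e.g.:
    //   job.setPartitionerClass(VirtualNodePartitioner.class);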
The rest of the paper is organized as follows. Related work is given in Section II, followed by Section III, where our load balancing approach is detailed. In Section IV, the testing environment and corresponding scenarios are designed for the verification and evaluation of the VPCH approach. Finally, conclusions and future work on load balancing strategies in a cloud platform are discussed in Section V.
II. RELATED WORK
There are various approaches to balancing the workload across the data nodes of an MR system. The authors in [3] repartition tasks from slow workers to faster ones by monitoring Map and Reduce jobs in real time, ensuring that all available nodes finish their jobs at roughly the same time. This method can handle all kinds of load skew, but it alters Hadoop substantially and requires complicated modification and configuration, as well as extra network cost for redistributing tasks.
Some studies address the problem of mapping sets of tasks onto sets of processors according to context information (e.g. execution history [4], sampled data skew [5], etc.), so that the overall execution time can be minimized. However, these methods were not verified on a practical cloud platform, and consequently ignored the corresponding configuration of a cloud computing environment, e.g. the number of reducers.
Neural Network (NN) algorithms have also been employed in the cloud. An adaptive partitioning algorithm called HAP was proposed in [6], where reduce jobs are distributed according to a work threshold estimated with an SVM in a heterogeneous environment. On the other hand, extra execution time is consumed when splitting and merging the input key-value (K-V) chains at the reduce phase. In addition, the training time of these models needs to be taken into account as well.