Microprocessors and Microsystems 47 (2016) 178–187
MapReduce short jobs optimization based on resource reuse
Yuliang Shi a,∗, Kaihui Zhang a, Lizhen Cui a, Lei Liu a, Yongqing Zheng a, Shidong Zhang a, Han Yu b

a Shandong University, Jinan, China
b Nanyang Technological University, Singapore, Singapore
Article info
Article history:
Received 13 January 2016
Revised 18 April 2016
Accepted 17 May 2016
Available online 18 May 2016
Keywords:
Hadoop
Short job
Performance optimization
Resource utilization
Task scheduling
Abstract
Hadoop is an open-source implementation of MapReduce for processing large datasets in a massively parallel manner. It was designed to execute large-scale jobs on an enormous number of computing nodes that provide computation and storage. In practice, however, Hadoop is frequently employed to process short jobs, which suffer from poor response times and run inefficiently. To address this problem, this paper analyses the process of job execution and identifies the issues that cause short jobs to run inefficiently in Hadoop. Based on the multi-wave characteristic of task execution when the cluster is overloaded, we develop a mechanism based on resource reuse to optimize short job execution, which reduces the frequency of resource allocation and recovery. Experimental results suggest that the developed mechanism improves resource utilization and significantly reduces the runtime of short jobs.
©2016 Elsevier B.V. All rights reserved.
1. Introduction
Many enterprises, financial institutions and media organizations are under pressure to process large-scale datasets, yet conventional data-processing tools and computing models cannot handle them. Hadoop, an open-source implementation of MapReduce [1] proposed by Google, provides an effective solution for handling large-scale datasets. MapReduce jobs submitted to Hadoop are divided into Map tasks and Reduce tasks that run in a massively parallel manner on multiple nodes, so the runtimes of jobs are reduced significantly. Hadoop hides many details of parallel computing, such as distributing data blocks to computing nodes and rerunning failed tasks, allowing users to focus on their specific business logic. Moreover, Hadoop provides good scalability, high availability and fault tolerance, which have made it the mainstream computing framework for data-intensive and compute-intensive applications. The academic community has therefore paid close attention to Hadoop and addressed many of its problems, such as unfairness [2–5], stragglers [4,6,7] and data skew [8–11].
Hadoop was originally designed for long-running jobs on a large number of computing nodes, but it is often used to handle short
∗ Corresponding author.
E-mail addresses: shiyuliang@sdu.edu.cn (Y. Shi), kaihuizhang@126.com (K. Zhang), clz@sdu.edu.cn (L. Cui), l.liu@sdu.edu.cn (L. Liu), zhengyongqing@dareway.com.cn (Y. Zheng), zsd@sdu.edu.cn (S. Zhang), han.yu@ntu.edu.sg (H. Yu).
jobs in practice. A short job is one whose runtime is below a user-specified threshold. Short jobs can be distinguished from long jobs by the size of their input datasets, the number of tasks a job is divided into, the resources required by each task, the runtime of each task, and the runtime users expect. Since Hadoop does not take the characteristics of short jobs into account, short jobs run inefficiently.
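The distinguishing criteria above can be expressed as a simple predicate. The following is a minimal sketch, assuming a single user-chosen runtime threshold and illustrative limits; none of these values or names come from the paper:

```python
def is_short_job(input_bytes, num_tasks, avg_task_runtime_s, expected_runtime_s,
                 threshold_s=60, max_input_bytes=1 << 30, max_tasks=100):
    """Hypothetical heuristic: a job counts as 'short' only when every
    criterion named above falls below its (user-chosen) limit."""
    return (input_bytes <= max_input_bytes          # size of the input dataset
            and num_tasks <= max_tasks              # number of tasks the job is split into
            and avg_task_runtime_s <= threshold_s   # runtime of each task
            and expected_runtime_s <= threshold_s)  # runtime the user expects
```

Under these assumed limits, a 100 MB job with ten five-second tasks would count as short, while a 10 GB job with hundreds of long tasks would not.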
The hardware configurations of the nodes in a cluster, the job scheduling algorithm and the cluster load are crucial factors affecting job performance. When scheduling tasks, Hadoop assumes that the nodes in a cluster are homogeneous. As a cluster gradually expands, however, the hardware configurations of newly added nodes are significantly better than those of the old ones, so the tasks a job is split into run more efficiently on new nodes than on old ones. When the cluster is heavily loaded, some of a job's tasks cannot immediately obtain sufficient resources to run and are placed in a waiting queue. Running tasks release their occupied resources when they finish; Hadoop then picks an appropriate task from the waiting queue and assigns the available resources to it according to the user-specified scheduling algorithm. Thus, if the resources requested by a job's tasks exceed the available resources the cluster can offer, the tasks are executed in multiple waves. In TaoBao's Hadoop cluster, over 70% of Map tasks run in more than two waves. Cluster load therefore has a decisive influence on the response time and runtime of a job.
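The multi-wave effect described above reduces to simple arithmetic; the slot counts in the sketch below are illustrative, not measurements from the paper:

```python
import math

def num_waves(total_tasks, concurrent_slots):
    # When a job's tasks outnumber the slots the cluster can run at once,
    # the tasks execute in ceil(tasks / slots) successive waves, each wave
    # waiting for the previous one to release its resources.
    return math.ceil(total_tasks / concurrent_slots)

# e.g. 300 Map tasks on a cluster with only 100 free slots run in 3 waves
```

Each additional wave adds a full round of resource allocation and recovery, which is precisely the overhead the resource-reuse mechanism in this paper aims to avoid.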
The purpose of this paper is to improve the execution perfor-
mance of short jobs. This paper analyzes the execution process
of jobs and describes the disadvantages of executing short jobs in
http://dx.doi.org/10.1016/j.micpro.2016.05.007