Scheduling Real-time Workflow on
MapReduce-based Cloud
Fei Teng and Hao Yang and Tianrui Li and Yan Yang and Zhao Li
School of Information Science and Technology
Southwest Jiaotong University
Chengdu, China, 610031
Email: fteng@swjtu.edu.cn
Abstract—As a popular programming model in cloud-based
data processing environments, MapReduce and its open-source
implementation Hadoop are widely applied in both industry
and academic research. A key challenge in a MapReduce-based
cloud is the ability to automatically control resource allocation to
real-time workflows so that they achieve their custom-defined deadlines.
Current research on deadline-related MapReduce schedulers
only supports soft real-time scheduling, in which extension
of the deadline is allowed. In this paper, the hard real-time
scheduling problem with strict deadlines on a MapReduce-based
cloud is studied. We propose the SPS scheduler, which guarantees
job completion before the specified deadline for real-time
workflows. SPS supports job preemption with low context-switch
overhead, so it can make online scheduling decisions as
workflows arrive randomly in the cloud. Experiments on Hadoop
show that SPS effectively meets deadline constraints even when
workflow demands exceed the cluster resources.
I. INTRODUCTION
Nowadays, data is generated, disseminated and consumed
at an unprecedented speed, and has become a powerful
resource that sheds light on understanding the world
and informs better decisions. Most Internet companies
operate as data factories, collecting data streams from countless
digital devices and producing various intelligent computing
services. These cloud services are evaluated by Quality of
Service (QoS), among which response time is a key
performance metric [1]. For example, in some search engine
companies, data mining for public opinion monitoring is
run at a fixed interval, and each mining operation should
complete before the next batch of data arrives. Two distinct features
deserve attention. First, these services are repeated
at a fixed period, working as a sequence of concatenated
workflows. Second, the predecessor service terminates when
its successor arrives, which means the periodic service has an
unbreakable, hard deadline. This paper targets the scheduling of such real-
time services, where the quality of service depends on both
functional correctness and the time at which the service is delivered.
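In standard real-time terminology, these two features describe an implicit-deadline periodic task model, which can be sketched as follows (our notation, not the paper's; the formal problem statement appears in Section 3). For a service with period T_i, the j-th instance is released at r_{i,j} = jT_i, and its deadline coincides with the release of its successor:

```latex
d_{i,j} \;=\; r_{i,j} + T_i \;=\; (j+1)\,T_i ,
\qquad j = 0, 1, 2, \dots
```

That is, the relative deadline equals the period, so a late instance is not merely degraded but terminated by the arrival of the next one.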
In industry, a number of public cloud service providers have
rolled out hosted versions of MapReduce-based clusters as
their data processing frameworks. For example, the world's
largest Hadoop cluster is run by Facebook to operate thou-
sands of online social media applications per second. When a
workflow is submitted to Hadoop, the MapReduce runtime packages
each service request as a MapReduce job that contains a
large number of map and reduce tasks. The scheduler inside the
runtime is responsible for assigning tasks to compute nodes
and monitoring their completion, which exerts great influence
on the QoS and the performance of the cluster. In this paper,
we study the hard real-time scheduling problem on a MapReduce-
based cloud, and propose the Shortest Period Scheduler (SPS) to
guarantee response time before a custom-defined deadline.
Moreover, SPS is computationally efficient and supports
online scheduling of dynamically arriving service requests.
The rest of this paper is organized as follows. Section
2 reviews related work. Section 3 formulates the real-time
scheduling problem for a MapReduce-based cloud. Section 4
proposes a practical scheduler, SPS, and implements it on
Hadoop. Experimental results are evaluated in Section 5.
Finally, Section 6 summarizes the paper.
II. RELATED WORK
MapReduce is a programming model originally designed by
Google to exploit large clusters for parallel computa-
tions. The basis of the MapReduce model is a map function and a
reduce function [2]. Each data split running the map function is a
separate map task. All map outputs are partitioned
and passed to reduce tasks, and the output of each reduce task
is appended to a final output file for its partition.
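As an illustration of this map/partition/reduce flow, the canonical word-count example can be sketched as a minimal single-process simulation (this is our illustrative Python, not Hadoop code; the function and variable names are ours):

```python
from collections import defaultdict

def map_fn(_, line):
    """Map: emit a (word, 1) pair for every word in one input split."""
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: sum all partial counts collected for one key."""
    yield word, sum(counts)

def run_mapreduce(splits, map_fn, reduce_fn):
    # Shuffle phase: partition map output by key, grouping the values
    # destined for the same reduce task.
    groups = defaultdict(list)
    for split_id, split in enumerate(splits):
        for key, value in map_fn(split_id, split):
            groups[key].append(value)
    # Reduce phase: one reduce call per key; the results together
    # form the final output for all partitions.
    output = {}
    for key, values in groups.items():
        for out_key, out_value in reduce_fn(key, values):
            output[out_key] = out_value
    return output

result = run_mapreduce(["a b a", "b c"], map_fn, reduce_fn)
print(result)  # {'a': 2, 'b': 2, 'c': 1}
```

On a real cluster the map tasks, the shuffle, and the reduce tasks each run in parallel across nodes; the sequential loops above only mirror the data flow.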
Apache Hadoop is an open-source implementation of
Google's MapReduce-based cloud, which basically contains
two elementary modules. One is the Hadoop Distributed File
System (HDFS), which provides high-throughput access to appli-
cation data; the other is MapReduce, which supports parallel
processing of large data sets. The default MapReduce sched-
uler on Hadoop is based on the First In First Out (FIFO) algorithm,
where jobs are executed in the order of their submission. The
pluggable scheduler interface permits the development of schedulers
optimized for particular workloads and applications [3].
Facebook's Fair scheduler and Yahoo's Capacity scheduler were
implemented to share a cluster among different cloud users
[4]. Later on, Wolf [5] develops the CIRCUMFLEX scheduler,
which allows users to adjust the priority levels assigned to their
jobs.
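To make the pluggability concrete: on classic (MRv1) Hadoop of this era, an administrator selects the scheduler by pointing the mapred.jobtracker.taskScheduler property at a TaskScheduler subclass in mapred-site.xml. The fragment below is a sketch for enabling the Fair scheduler; exact property and class names vary by Hadoop version:

```xml
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
```

A custom scheduler such as SPS can be deployed the same way, by substituting its own class name.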
To shorten job makespan, Zaharia [6] proposes delay
scheduling to address the conflict between data locality and
fairness. Both the Tarazu scheduler [7] and the network-aware sched-
uler [8] tackle delays caused by straggler tasks.
With a target deadline, co-scheduler [9] proposes a latency
978-1-4799-0048-0/13/$31.00 ©2013 IEEE