distributed autonomous resources of different administrative domains. A meta-scheduler [20] does not
have control over the clusters, so it is difficult to predict resource availability and allocate resources.
In general, these works all focus on cluster-level, coarse-grained optimization. Unlike Grid systems,
computing and storage resources in a MapReduce system are closely coupled and centrally managed,
so fine-grained, data-driven scheduling strategies are needed. Therefore, traditional grid-enabled
data-aware research cannot be directly applied to MapReduce systems.
Recent work on improving the data locality of MapReduce systems can be categorized into
‘task scheduling’ and ‘data placement’.
Task scheduling: One way to improve data locality is to schedule tasks according to system status,
including both available slots and data distribution. Zhang et al. [8] improve data locality by
introducing a next-k-node method to predict the resource availability of nodes. Their scheduler uses
a FIFO-like job queue that emphasizes job priority, but it fails to take overall job execution into
account, which limits system throughput. Zhang et al. [9] prove that scheduling multiple tasks
simultaneously helps to improve data locality. Zaharia et al. [10] present a data locality optimization,
FairShare delay scheduling, which delays jobs to pursue high system-wide data locality. According
to their results, delay scheduling can achieve almost 100% data locality. Luo et al. [16] develop a
hierarchical MapReduce framework for cross-domain MapReduce execution, which provides
high-level data-aware scheduling across clusters.
Data placement: The other way to increase data locality is to place or rearrange data blocks
according to task requirements. Seo et al. [15] propose an inter-block prefetching mechanism to
accelerate remotely executed Map tasks. However, it cannot prevent remote job execution and still
requires data block transfers, which increases network load. Xie et al. [11] improve data locality
through data placement in a heterogeneous environment, which indirectly improves data locality
and system throughput at the data storage level.
Among all these works, that of Zaharia et al. [10] is the most widely cited and the most closely
related to ours. However, it is based on static delay times. According to Zaharia et al. [10] and the
Hadoop source code (version 0.21.0), delay scheduling has three levels of delay (local, rack, and
any) with progressive delay times (4.5 s by default). It does not consider runtime resource status or
resource competition. The obvious shortcoming of this work is that a static delay time cannot adapt
to resource availability that changes dynamically at runtime. Moreover, their model does not
consider job deadlines, and it is based on FairShare scheduling, which takes overall slowdown as
its cost [8].
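To make the static-delay policy concrete, the following is a minimal Java sketch of multi-level delay
scheduling in the spirit of Zaharia et al. [10]. The class names, the hasTaskAtLevel helper, and the
single per-level constant are illustrative assumptions for exposition, not the actual Hadoop 0.21.0 code.

import java.util.List;

// Minimal sketch of multi-level delay scheduling (after Zaharia et al. [10]).
// All names and the 4.5 s constant are illustrative assumptions.
public class DelayScheduler {

    // Locality levels, from best to worst.
    enum Locality { NODE_LOCAL, RACK_LOCAL, ANY }

    // Static per-level wait before relaxing locality (assumed 4.5 s).
    private static final long DELAY_PER_LEVEL_MS = 4500;

    // Hypothetical job handle that remembers when it was first skipped.
    static class Job {
        long firstSkippedAt = -1; // -1 means the job is not currently waiting

        boolean hasTaskAtLevel(String node, Locality level) {
            // Placeholder: a real scheduler would check the job's pending
            // tasks against the block locations of their input splits.
            return false;
        }
    }

    // Locality level the job may currently use: the bound is relaxed one
    // step (node -> rack -> any) per DELAY_PER_LEVEL_MS of waiting. The
    // threshold is fixed at configuration time, which is exactly the
    // static-delay behaviour criticized above.
    Locality allowedLevel(Job job, long now) {
        if (job.firstSkippedAt < 0) return Locality.NODE_LOCAL;
        long waited = now - job.firstSkippedAt;
        if (waited < DELAY_PER_LEVEL_MS) return Locality.NODE_LOCAL;
        if (waited < 2 * DELAY_PER_LEVEL_MS) return Locality.RACK_LOCAL;
        return Locality.ANY;
    }

    // Offer a free slot on the given node to jobs in fair-share order.
    Job schedule(List<Job> jobsInFairShareOrder, String node, long now) {
        for (Job job : jobsInFairShareOrder) {
            Locality allowed = allowedLevel(job, now);
            for (Locality level : Locality.values()) {
                if (level.ordinal() > allowed.ordinal()) break;
                if (job.hasTaskAtLevel(node, level)) {
                    job.firstSkippedAt = -1; // launched: reset the wait clock
                    return job;
                }
            }
            if (job.firstSkippedAt < 0) {
                job.firstSkippedAt = now; // skipped: start counting wait time
            }
        }
        return null; // no launchable task; the slot stays idle for now
    }
}

Note that DELAY_PER_LEVEL_MS is fixed before the job runs, so locality is relaxed at the same
pace whether or not a local slot is about to become free; this is the adaptivity gap that motivates
our work.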
Some works acknowledge the problem of job deadlines and focus on improving service quality
through performance-driven scheduling [18] and deadline scheduling [17]. Kc and Anyanwu [17]
maintain deadline constraints by introducing an advanced scheduling strategy that controls the Map
and Reduce phases at a high level. Polo et al. [18] introduce a dynamic job priority mechanism and
a dynamic scheduling policy to meet the performance requirements of MapReduce jobs. However,
Kc and Anyanwu [17] and Polo et al. [18] share a common shortcoming: neither solves the critical
issue of data locality. They pay attention only to job deadlines while neglecting the strong
dependence between tasks and data in MapReduce systems, which inevitably affects job execution.
To sum up, there are several unsolved problems in existing works:
(i) Earlier theoretical work [19] fails to exploit the native advantage of MapReduce systems that
data and computing resources are closely coupled.
(ii) Delay decisions are not based on dynamic resource availability and do not consider runtime
resource status or resource competition. An inappropriate delay will slow job execution or break
the job deadline.
(iii) Existing performance-driven scheduling methods [17, 18] fail to consider the data dependence
of Map tasks, leading to longer job execution times.
This paper proposes the DLD scheduling algorithm to address these issues. DLD integrates
data-aware scheduling with performance-driven scheduling, improving MapReduce system
throughput through enhanced delay scheduling based on real-time resource availability estimation
and a deadline mechanism, as sketched below.
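As a preview, the fragment below sketches, in the same illustrative Java style, the kind of delay
decision DLD makes. The method name, parameters, and the two-part test are assumptions for
exposition only; the actual DLD model and its estimation formulas are specified later in this paper.

// Schematic deadline-aware delay decision (illustrative assumption,
// not the exact DLD formulation).
public class DeadlineAwareDelay {

    // Decide whether a job should keep waiting for a node-local slot.
    //   expectedLocalSlotWaitMs  - estimated time until a slot frees on a
    //                              node holding the task's input block,
    //                              from real-time availability estimation
    //   remoteExecutionPenaltyMs - extra runtime expected if the task must
    //                              fetch its input over the network
    //   slackToDeadlineMs        - budget left before the job deadline
    static boolean shouldDelay(long expectedLocalSlotWaitMs,
                               long remoteExecutionPenaltyMs,
                               long slackToDeadlineMs) {
        // Waiting pays off only if a local slot is expected sooner than
        // the cost of running the task remotely...
        if (expectedLocalSlotWaitMs >= remoteExecutionPenaltyMs) {
            return false;
        }
        // ...and only if that wait still fits within the deadline slack.
        return expectedLocalSlotWaitMs < slackToDeadlineMs;
    }
}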