distributed autonomous resources of different administrative domains. A meta-scheduler [20] does not
have control over the clusters, so it is difficult to predict resource availability and allocate resources.
In general, these works all focus on cluster-level, coarse-grained optimization. Unlike Grid systems,
computing and storage resources in a MapReduce system are closely coupled and centrally managed,
so fine-grained, data-driven scheduling strategies are needed. Therefore, traditional grid-enabled
data-aware research cannot be directly applied to MapReduce systems.
Recent work on improving the data locality of MapReduce systems can be categorized into
‘task scheduling’ and ‘data placement’.
Task scheduling: One way to improve data locality is to schedule tasks according to system status,
including both available slots and data distribution. Zhang et al. [8] improve data locality by
introducing a next-k-node method to predict the resource availability of nodes. Their scheduler uses
a FIFO-like job queue that emphasizes job priority, but it fails to take overall job execution into
account, which limits system throughput. Zhang et al. [9] prove that scheduling multiple tasks
simultaneously helps to improve data locality. Zaharia et al. [10] present a data locality optimization,
FairShare delay scheduling, which delays jobs to pursue high system-wide data locality. According
to their results, delay scheduling can achieve almost 100% data locality. Luo et al. [16] develop a
hierarchical MapReduce framework for cross-domain MapReduce execution, which provides
high-level data-aware scheduling across clusters.
Data placement: The other way to increase data locality is to place or rearrange data blocks
according to task requirements. Seo et al. [15] propose an inter-block prefetching mechanism to
accelerate remotely executed Map tasks. However, it cannot prevent remote job execution and still
requires data block transfers, which increases network load. Xie et al. [11] improve data locality
through data placement in a heterogeneous environment, which indirectly improves data locality
and system throughput at the data storage level.
Among all these works, that of Zaharia et al. [10] is the most widely cited and the most closely
related to ours. However, it is based on static delay times. According to Zaharia et al. [10] and the
Hadoop source code (version 0.21.0), delay scheduling has three levels of delay (local, rack, and
any) with progressive delay times (4.5 s by default). It does not consider runtime resource status or
resource competition. The obvious shortcoming of this work is that a static delay time cannot adapt
to resource availability that changes dynamically at runtime. Moreover, their model does not
consider job deadlines, and it is based on FairShare scheduling, which takes overall slowdown as
its cost [8].
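To make the static-delay policy concrete, the following is a minimal Java sketch of multi-level delay
scheduling in the spirit of Zaharia et al. [10]. The class names, the hasTaskAtLevel helper, and the
single per-level constant are illustrative assumptions for exposition, not the actual Hadoop 0.21.0 code.

import java.util.List;

// Minimal sketch of multi-level delay scheduling (after Zaharia et al. [10]).
// All names and the 4.5 s constant are illustrative assumptions.
public class DelayScheduler {

    // Locality levels, from best to worst.
    enum Locality { NODE_LOCAL, RACK_LOCAL, ANY }

    // Static per-level wait before relaxing locality (assumed 4.5 s).
    private static final long DELAY_PER_LEVEL_MS = 4500;

    // Hypothetical job handle that remembers when it was first skipped.
    static class Job {
        long firstSkippedAt = -1; // -1 means the job is not currently waiting

        boolean hasTaskAtLevel(String node, Locality level) {
            // Placeholder: a real scheduler would check the job's pending
            // tasks against the block locations of their input splits.
            return false;
        }
    }

    // Locality level the job may currently use: the bound is relaxed one
    // step (node -> rack -> any) per DELAY_PER_LEVEL_MS of waiting. The
    // threshold is fixed at configuration time, which is exactly the
    // static-delay behaviour criticized above.
    Locality allowedLevel(Job job, long now) {
        if (job.firstSkippedAt < 0) return Locality.NODE_LOCAL;
        long waited = now - job.firstSkippedAt;
        if (waited < DELAY_PER_LEVEL_MS) return Locality.NODE_LOCAL;
        if (waited < 2 * DELAY_PER_LEVEL_MS) return Locality.RACK_LOCAL;
        return Locality.ANY;
    }

    // Offer a free slot on the given node to jobs in fair-share order.
    Job schedule(List<Job> jobsInFairShareOrder, String node, long now) {
        for (Job job : jobsInFairShareOrder) {
            Locality allowed = allowedLevel(job, now);
            for (Locality level : Locality.values()) {
                if (level.ordinal() > allowed.ordinal()) break;
                if (job.hasTaskAtLevel(node, level)) {
                    job.firstSkippedAt = -1; // launched: reset the wait clock
                    return job;
                }
            }
            if (job.firstSkippedAt < 0) {
                job.firstSkippedAt = now; // skipped: start counting wait time
            }
        }
        return null; // no launchable task; the slot stays idle for now
    }
}

Note that DELAY_PER_LEVEL_MS is fixed before the job runs, so locality is relaxed at the same
pace whether or not a local slot is about to become free; this is the adaptivity gap that motivates
our work.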
Some works acknowledge the problem of job deadlines and focus on improving service quality
through performance-driven scheduling [18] and deadline scheduling [17]. Kc and Anyanwu [17]
maintain deadline constraints by introducing an advanced scheduling strategy that controls the Map
and Reduce phases at a high level. Polo et al. [18] introduce a dynamic job priority mechanism and
a dynamic scheduling policy to meet the performance requirements of MapReduce jobs. However,
Kc and Anyanwu [17] and Polo et al. [18] share a common shortcoming: neither solves the critical
issue of data locality. They pay attention only to job deadlines while neglecting the strong
dependence between tasks and data in MapReduce systems, which inevitably affects job execution.
To sum up, there are several unsolved problems in existing works:
(i) Earlier theoretical work [19] fails to exploit the native advantage of MapReduce systems that
data and computing resources are closely coupled.
(ii) Delay decisions are not based on dynamic resource availability and do not consider runtime
resource status or resource competition. An inappropriate delay will slow job execution or break
the job deadline.
(iii) Existing performance-driven scheduling methods [17, 18] fail to consider the data dependence
of Map tasks, leading to longer job execution times.
This paper proposes the DLD scheduling algorithm to address these issues. DLD integrates
data-aware scheduling with performance-driven scheduling, improving MapReduce system
throughput through enhanced delay scheduling based on real-time resource availability estimation
and a deadline mechanism, as sketched below.
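As a preview, the fragment below sketches, in the same illustrative Java style, the kind of delay
decision DLD makes. The method name, parameters, and the two-part test are assumptions for
exposition only; the actual DLD model and its estimation formulas are specified later in this paper.

// Schematic deadline-aware delay decision (illustrative assumption,
// not the exact DLD formulation).
public class DeadlineAwareDelay {

    // Decide whether a job should keep waiting for a node-local slot.
    //   expectedLocalSlotWaitMs  - estimated time until a slot frees on a
    //                              node holding the task's input block,
    //                              from real-time availability estimation
    //   remoteExecutionPenaltyMs - extra runtime expected if the task must
    //                              fetch its input over the network
    //   slackToDeadlineMs        - budget left before the job deadline
    static boolean shouldDelay(long expectedLocalSlotWaitMs,
                               long remoteExecutionPenaltyMs,
                               long slackToDeadlineMs) {
        // Waiting pays off only if a local slot is expected sooner than
        // the cost of running the task remotely...
        if (expectedLocalSlotWaitMs >= remoteExecutionPenaltyMs) {
            return false;
        }
        // ...and only if that wait still fits within the deadline slack.
        return expectedLocalSlotWaitMs < slackToDeadlineMs;
    }
}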