Future Generation Computer Systems 43–44 (2015) 51–60
A self-adaptive scheduling algorithm for reduce start time
Zhuo Tang a,∗, Lingang Jiang a, Junqing Zhou a, Kenli Li a, Keqin Li a,b
a College of Information Science and Engineering, Hunan University, Changsha 410082, China
b Department of Computer Science, State University of New York, New Paltz, NY 12561, USA
Highlights
• This paper illustrates the reasons for the waste of system slot resources caused by reduce tasks waiting around.
• The model determines the start times of reduce tasks dynamically according to the job context.
• As an optimal scheduling algorithm, SARS can decrease the reduce completion time of jobs.
Article info
Article history:
Received 28 December 2013
Received in revised form
1 August 2014
Accepted 15 August 2014
Available online 25 August 2014
Keywords:
Big data
Hadoop
MapReduce
Reduce
Self-adaptive
Task scheduling
Abstract
MapReduce is by far one of the most successful realizations of large-scale data-intensive cloud computing
platforms. Deciding when to start the reduce tasks is a key problem for improving MapReduce
performance. Existing implementations may cause reduce tasks to block: when the output of the map
tasks becomes large, the performance of a MapReduce scheduling algorithm is seriously affected.
By analyzing the current MapReduce scheduling mechanism, this paper illustrates the reasons for the
waste of system slot resources, which results in reduce tasks waiting around, and proposes an optimal
reduce scheduling policy, called SARS (Self-Adaptive Reduce Scheduling), for the start times of reduce tasks on
the Hadoop platform. SARS decides the start time point of each reduce task dynamically according to
each job's context, including the task completion time and the size of the map output. By estimating the job
completion time, the reduce completion time, and the system average response time, the experimental results
show that, compared with other algorithms, the reduce completion time decreases sharply.
The average response time is reduced by 11% to 29% when the SARS algorithm is
applied to the traditional job scheduling algorithms FIFO, FairScheduler, and CapacityScheduler.
© 2014 Elsevier B.V. All rights reserved.
1. Introduction
MapReduce is an excellent model for distributed computing,
introduced by Google in 2004 [1]. It has emerged as an important
and widely used programming model for distributed and
parallel computing, owing to its ease of use, generality, and scalability.
Among its open-source implementations, Hadoop has
been widely used in industry around the world [2] and has
been used and extended by scientists as the basis of their own research
work [3]. It has been deployed in production environments at companies
such as Google, Yahoo, Amazon, and Facebook. Because of its
short development history, Hadoop can still be improved in many aspects,
such as intermediate data management and reduce task scheduling [4].
This paper focuses mainly on the reduce scheduling problem,
which concerns the start times of the reduce tasks.
∗ Corresponding author. Tel.: +86 18627568501.
E-mail address: ztang@hnu.edu.cn (Z. Tang).
Map and Reduce are the two stages of a MapReduce job.
In Hadoop, each reduce task contains three functioning
phases: copy, sort, and reduce [5]. The goal of the copy phase is
to read the outputs of the map tasks. The sort phase sorts the intermediate
data produced by the map tasks, which will be the input
to the reduce phase. Finally, the eventual results are produced
by the reduce phase; the copy and sort phases thus
preprocess the input data of reduce. In real applications, copying
and sorting may consume a considerable amount of time, especially
the copy phase. In the theoretical model, the reduce functions
start only after all map tasks are finished [6]. However, in the Hadoop
implementation, all copy actions of the reduce tasks start as soon as
the first map task is finished [7]. But within a slot duration, if any
map task is still running, the copy actions wait around, which
leads to the waste of reduce slot resources.
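The slot waste described above can be illustrated with a toy simulation. This is not Hadoop code; the task counts, timings, and the `reduce_slot_time` helper are made-up assumptions used only to contrast two start-time policies: starting a reduce task when the first map finishes (the Hadoop behavior described here) versus starting it after the last map finishes.

```python
# Toy model (illustrative assumptions, not Hadoop internals): a reduce
# task that starts before all maps finish holds its slot idle until the
# last map output is available.

def reduce_slot_time(map_finish_times, reduce_start, work_time=10):
    """Total time a reduce task occupies its slot: idle waiting for the
    last map output, plus its own copy/sort/reduce work (work_time)."""
    last_map = max(map_finish_times)
    waiting = max(0, last_map - reduce_start)  # idle slot time
    return waiting + work_time

# Assumed example: four map waves finishing at t = 5, 10, 15, 20.
maps = [5, 10, 15, 20]

# Policy 1: start when the first map finishes (t = 5). The slot is
# held idle for 15 time units before the real work can complete.
early = reduce_slot_time(maps, reduce_start=min(maps))

# Policy 2: start after all maps finish (t = 20). The slot is
# occupied only for the actual work.
late = reduce_slot_time(maps, reduce_start=max(maps))

print(early, late)  # prints: 25 10
```

The model deliberately ignores that, in real Hadoop, early copying can overlap with still-running maps and hide some transfer latency; the point is only that an early-started reduce task pins a slot for the entire remaining map phase, which is the waste SARS aims to avoid by choosing the start time adaptively.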
Existing MapReduce frameworks often treat a job as a whole
process, without considering the differences between the map
and reduce tasks. Since map and reduce task
execution times are not related, it is not accurate to compute the
http://dx.doi.org/10.1016/j.future.2014.08.011
0167-739X/© 2014 Elsevier B.V. All rights reserved.