Microprocessors and Microsystems 47 (2016) 178–187
MapReduce short jobs optimization based on resource reuse
Yuliang Shi a,∗, Kaihui Zhang a, Lizhen Cui a, Lei Liu a, Yongqing Zheng a, Shidong Zhang a, Han Yu b

a Shandong University, Jinan, China
b Nanyang Technological University, Singapore, Singapore
Article info
Article history:
Received 13 January 2016
Revised 18 April 2016
Accepted 17 May 2016
Available online 18 May 2016
Keywords:
Hadoop
Short job
Performance optimization
Resource utilization
Task scheduling
Abstract
Hadoop is an open-source implementation of MapReduce for processing large datasets in a massively parallel manner. It was designed to execute large-scale jobs on an enormous number of computing nodes that provide computation and storage. In practice, however, Hadoop is frequently employed to process short jobs, which suffer from poor response times and run inefficiently. To address this problem, this paper analyses the process of job execution and identifies the issues that cause short jobs to run inefficiently in Hadoop. Based on the multi-wave characteristic of task execution when the cluster is overloaded, we develop a mechanism based on resource reuse to optimize short job execution, which reduces the frequency of resource allocation and recovery. Experimental results suggest that the developed mechanism improves resource utilization and significantly reduces the runtime of short jobs.
©2016 Elsevier B.V. All rights reserved.
1. Introduction
Many enterprises, financial institutions and media organizations are under pressure to process large-scale datasets, yet conventional data-processing tools and computing models cannot handle them. Hadoop, an open-source implementation of MapReduce [1] proposed by Google, provides an effective solution for handling large-scale datasets. MapReduce jobs submitted to Hadoop are divided into Map tasks and Reduce tasks that run in a massively parallel manner on multiple nodes, so the runtimes of jobs are reduced significantly. Hadoop hides many details of parallel computing, such as distributing data blocks to computing nodes and rerunning failed tasks, allowing users to focus on their specific business logic. Moreover, Hadoop provides good scalability, high availability and fault tolerance, which have made it the mainstream computing framework for data-intensive and compute-intensive applications. The academic community has therefore paid close attention to Hadoop and addressed many of its problems, such as unfairness [2–5], stragglers [4,6,7] and data skew [8–11].
Hadoop was originally designed for long-running jobs on a large number of computing nodes, but it is often used to handle short
∗ Corresponding author.
E-mail addresses: shiyuliang@sdu.edu.cn (Y. Shi), kaihuizhang@126.com (K. Zhang), clz@sdu.edu.cn (L. Cui), l.liu@sdu.edu.cn (L. Liu), zhengyongqing@dareway.com.cn (Y. Zheng), zsd@sdu.edu.cn (S. Zhang), han.yu@ntu.edu.sg (H. Yu).
jobs in practice. A short job is one whose runtime is below a user-specified threshold. Short jobs can be distinguished from long jobs by the size of their input datasets, the number of tasks a job is divided into, the resources required by each task, the runtime of each task, and the runtime users expect. Since Hadoop does not take the characteristics of short jobs into account, short jobs run inefficiently.
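The distinguishing criteria above can be expressed as a simple predicate. The following is a minimal sketch, assuming a single user-chosen runtime threshold and illustrative limits; none of these values or names come from the paper:

```python
def is_short_job(input_bytes, num_tasks, avg_task_runtime_s, expected_runtime_s,
                 threshold_s=60, max_input_bytes=1 << 30, max_tasks=100):
    """Hypothetical heuristic: a job counts as 'short' only when every
    criterion named above falls below its (user-chosen) limit."""
    return (input_bytes <= max_input_bytes          # size of the input dataset
            and num_tasks <= max_tasks              # number of tasks the job is split into
            and avg_task_runtime_s <= threshold_s   # runtime of each task
            and expected_runtime_s <= threshold_s)  # runtime the user expects
```

Under these assumed limits, a 100 MB job with ten five-second tasks would count as short, while a 10 GB job with hundreds of long tasks would not.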
The hardware configurations of the nodes in a cluster, the job scheduling algorithm and the cluster load are crucial factors affecting job performance. When scheduling tasks, Hadoop assumes that the nodes in a cluster are homogeneous. As a cluster gradually expands, however, the hardware configurations of newly added nodes are significantly better than those of the old ones, so the tasks a job is split into run more efficiently on new nodes than on old ones. When the cluster is heavily loaded, some of a job's tasks cannot immediately obtain sufficient resources to run and are placed in a waiting queue. Running tasks release their occupied resources when they finish; Hadoop then picks an appropriate task from the waiting queue and assigns the available resources to it according to the user-specified scheduling algorithm. Thus, if the resources requested by a job's tasks exceed the available resources the cluster can offer, the tasks are executed in multiple waves. In TaoBao's Hadoop cluster, over 70% of Map tasks run in more than two waves. Cluster load therefore has a decisive influence on the response time and runtime of a job.
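The multi-wave effect described above reduces to simple arithmetic; the slot counts in the sketch below are illustrative, not measurements from the paper:

```python
import math

def num_waves(total_tasks, concurrent_slots):
    # When a job's tasks outnumber the slots the cluster can run at once,
    # the tasks execute in ceil(tasks / slots) successive waves, each wave
    # waiting for the previous one to release its resources.
    return math.ceil(total_tasks / concurrent_slots)

# e.g. 300 Map tasks on a cluster with only 100 free slots run in 3 waves
```

Each additional wave adds a full round of resource allocation and recovery, which is precisely the overhead the resource-reuse mechanism in this paper aims to avoid.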
The purpose of this paper is to improve the execution perfor-
mance of short jobs. This paper analyzes the execution process
of jobs and describes the disadvantages of executing short jobs in
http://dx.doi.org/10.1016/j.micpro.2016.05.007