Scheduling Real-time Workflow on
MapReduce-based Cloud
Fei Teng and Hao Yang and Tianrui Li and Yan Yang and Zhao Li
School of Information Science and Technology
Southwest Jiaotong University
Chengdu, China, 610031
Email: fteng@swjtu.edu.cn
Abstract—As a popular programming model in cloud-based
data processing environments, MapReduce and its open-source
implementation Hadoop are widely applied in both industry
and academic research. A key challenge in a MapReduce-based
cloud is the ability to automatically control resource allocation to
real-time workflows so that they achieve their custom-defined deadlines.
Current research on deadline-related MapReduce schedulers
only supports soft real-time scheduling, in which extension
of the deadline is allowed. In this paper, the hard real-time
scheduling problem with strict deadlines on a MapReduce-based
cloud is studied. We propose the SPS scheduler, which guarantees
job completion before the specified deadline for real-time
workflows. SPS supports job preemption with low context-switch
overhead, so it can make online scheduling decisions as
workflows arrive randomly in the cloud. Experiments on Hadoop
show that SPS effectively meets deadline constraints even when
workflow demands exceed the cluster resources.
I. INTRODUCTION
Nowadays, data is generated, disseminated and consumed
at an unprecedented speed, and has become a powerful
resource that sheds light on understanding the world
and informs better decisions. Most Internet companies
operate as data factories, collecting data streams from countless
digital devices and producing various intelligent computing
services. These cloud services are evaluated by Quality of
Service (QoS), among which response time is a key
performance metric [1]. For example, in some search engine
companies, data mining for public opinion monitoring is
run at a fixed interval, and each mining operation should
complete before the next batch of data arrives. Two distinct features
deserve attention. First, these services are repeated
at a fixed period, working as a sequence of concatenated
workflows. Second, the predecessor service terminates when
its successor arrives, which means the periodic service has an
unbreakable, hard deadline. This paper targets the scheduling of such real-
time services, where the quality of service depends on both
functional correctness and the time at which the service is delivered.
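In standard real-time terminology, these two features describe an implicit-deadline periodic task model, which can be sketched as follows (our notation, not the paper's; the formal problem statement appears in Section 3). For a service with period T_i, the j-th instance is released at r_{i,j} = jT_i, and its deadline coincides with the release of its successor:

```latex
d_{i,j} \;=\; r_{i,j} + T_i \;=\; (j+1)\,T_i ,
\qquad j = 0, 1, 2, \dots
```

That is, the relative deadline equals the period, so a late instance is not merely degraded but terminated by the arrival of the next one.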
In industry, a number of public cloud service providers have
rolled out hosted versions of MapReduce-based clusters as
their data processing frameworks. For example, the world's
largest Hadoop cluster is run by Facebook to operate thou-
sands of online social media applications per second. When a
workflow is submitted to Hadoop, the MapReduce runtime packages
each service request as a MapReduce job that contains a
large number of map and reduce tasks. The scheduler inside the
runtime is responsible for assigning tasks to compute nodes
and monitoring their completion, which exerts great influence
on the QoS and the performance of the cluster. In this paper,
we study the hard real-time scheduling problem on a MapReduce-
based cloud, and propose the Shortest Period Scheduler (SPS) to
guarantee response time before a custom-defined deadline.
Moreover, SPS is computationally efficient and supports
online scheduling of dynamically arriving service requests.
The rest of this paper is organized as follows. Section
2 reviews related work. Section 3 formulates the real-time
scheduling problem for a MapReduce-based cloud. Section 4
proposes a practical scheduler, SPS, and implements it on
Hadoop. Experimental results are evaluated in Section 5.
Finally, Section 6 summarizes the paper.
II. RELATED WORK
MapReduce is a programming model originally designed by
Google to exploit large clusters for parallel computa-
tions. The basis of the MapReduce model is a map function and a
reduce function [2]. Each data split running the map function is a
separate map task. All map outputs are partitioned
and passed to reduce tasks, and the output of each reduce task
is appended to a final output file for its partition.
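As an illustration of this map/partition/reduce flow, the canonical word-count example can be sketched as a minimal single-process simulation (this is our illustrative Python, not Hadoop code; the function and variable names are ours):

```python
from collections import defaultdict

def map_fn(_, line):
    """Map: emit a (word, 1) pair for every word in one input split."""
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: sum all partial counts collected for one key."""
    yield word, sum(counts)

def run_mapreduce(splits, map_fn, reduce_fn):
    # Shuffle phase: partition map output by key, grouping the values
    # destined for the same reduce task.
    groups = defaultdict(list)
    for split_id, split in enumerate(splits):
        for key, value in map_fn(split_id, split):
            groups[key].append(value)
    # Reduce phase: one reduce call per key; the results together
    # form the final output for all partitions.
    output = {}
    for key, values in groups.items():
        for out_key, out_value in reduce_fn(key, values):
            output[out_key] = out_value
    return output

result = run_mapreduce(["a b a", "b c"], map_fn, reduce_fn)
print(result)  # {'a': 2, 'b': 2, 'c': 1}
```

On a real cluster the map tasks, the shuffle, and the reduce tasks each run in parallel across nodes; the sequential loops above only mirror the data flow.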
Apache Hadoop is an open-source implementation of
Google's MapReduce-based cloud, which basically contains
two elementary modules. One is the Hadoop Distributed File
System (HDFS), which provides high-throughput access to appli-
cation data; the other is MapReduce, which supports parallel
processing of large data sets. The default MapReduce sched-
uler on Hadoop is based on the First In First Out (FIFO) algorithm,
where jobs are executed in the order of their submission. The
pluggable scheduler interface permits the development of schedulers
optimized for particular workloads and applications [3].
Facebook's Fair scheduler and Yahoo's Capacity scheduler were
implemented to share a cluster among different cloud users
[4]. Later on, Wolf [5] develops the CIRCUMFLEX scheduler,
which allows users to adjust the priority levels assigned to their
jobs.
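To make the pluggability concrete: on classic (MRv1) Hadoop of this era, an administrator selects the scheduler by pointing the mapred.jobtracker.taskScheduler property at a TaskScheduler subclass in mapred-site.xml. The fragment below is a sketch for enabling the Fair scheduler; exact property and class names vary by Hadoop version:

```xml
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
```

A custom scheduler such as SPS can be deployed the same way, by substituting its own class name.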
To shorten job makespan, Zaharia [6] proposes delay
scheduling to address the conflict between data locality and
fairness. Both the Tarazu scheduler [7] and the network-aware sched-
uler [8] tackle delays caused by straggler tasks.
With a target deadline, co-scheduler [9] proposes a latency
978-1-4799-0048-0/13/$31.00 ©2013 IEEE