Deep Learning-based Job Placement in Distributed Machine Learning Clusters
Yixin Bao*, Yanghua Peng*, Chuan Wu*
*Department of Computer Science, The University of Hong Kong, Email: {yxbao,yhpeng,cwu}@cs.hku.hk
Abstract—Production machine learning (ML) clusters commonly host a variety of distributed ML workloads, e.g., speech recognition and machine translation. While server sharing among jobs improves resource utilization, interference among co-located ML jobs can lead to significant performance degradation. Existing cluster schedulers (e.g., Mesos) are interference-oblivious in their job placement, resulting in suboptimal resource efficiency. Interference-aware job placement has been studied in the literature, but prior solutions rely on detailed workload profiling and explicit interference modeling, which is not a general approach. This paper presents Harmony, a deep learning-driven ML cluster scheduler that places training jobs in a manner that minimizes interference and maximizes performance (i.e., minimizes training completion time). Harmony is based on a carefully designed deep reinforcement learning (DRL) framework augmented with reward modeling. The DRL employs state-of-the-art techniques to stabilize training and improve convergence, including the actor-critic algorithm, job-aware action space exploration, and experience replay. Given the common lack of reward samples corresponding to different placement decisions, we build an auxiliary reward prediction model, trained on historical samples, to produce rewards for unseen placements. Experiments with real ML workloads in a Kubernetes cluster of 6 GPU servers show that Harmony outperforms representative schedulers by 25% in terms of average job completion time.
I. INTRODUCTION
Nowadays most leading IT companies operate machine learning (ML) clusters of GPU servers. A variety of ML workloads run on these clusters to support the companies' services. For example, an online news company may run language models for news parsing, text classification for fake news detection, and a personalized recommendation system for advertisement display.
To train on large datasets or large models, ML workloads are commonly run using distributed ML frameworks, e.g., TensorFlow [1], MXNet [2] and Caffe2 [3]. In a distributed ML job, the dataset is divided among separate workers, which each train on their partition and exchange computed model parameters with one another (either directly or through parameter servers (PSs)) to derive the global parameters. The workers and PSs may well be placed on different physical servers, either because they cannot all be hosted on one server, or to utilize resource fragments across servers [4].
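To make the PS pattern concrete, below is a minimal sketch of synchronous data-parallel training. The class and function names, the toy least-squares model, and the single-process simulation are our illustrative assumptions, not Harmony's implementation or any framework's actual API; real jobs would use TensorFlow or MXNet primitives across processes and machines.

```python
import numpy as np

class ParameterServer:
    """Holds the global model; workers pull parameters and push gradients."""
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)   # global model parameters
        self.lr = lr

    def pull(self):
        return self.w.copy()     # workers fetch the current parameters

    def push(self, grads):
        # aggregate gradients from all workers, then apply one SGD step
        self.w -= self.lr * np.mean(grads, axis=0)

def worker_gradient(w, X, y):
    # gradient of 0.5/n * ||Xw - y||^2 on this worker's data shard
    return X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
true_w = rng.normal(size=10)
y = X @ true_w

shards = np.array_split(np.arange(1000), 4)  # dataset divided over 4 workers
ps = ParameterServer(dim=10)

for step in range(200):
    w = ps.pull()
    grads = [worker_gradient(w, X[idx], y[idx]) for idx in shards]
    ps.push(grads)               # parameter exchange via the PS

print("parameter error:", np.linalg.norm(ps.w - true_w))
```

Each pull/push pair in this loop corresponds to the per-step parameter exchange whose traffic patterns, when workers and PSs land on different servers, drive the placement problem studied in this paper.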
How to efficiently place different ML jobs onto servers to achieve high resource efficiency and training throughput is a fundamental challenge faced by cluster operators.
Many existing cluster schedulers (e.g., Borg [5], Mesos [6]) tend to over-subscribe server resources such as CPU and memory, allocating more resources to jobs than the servers' capacity in order to maximize utilization (assuming that not all jobs fully use their requested resources at all times). However, even without over-subscription, ML jobs co-located on the same server may interfere with each other and experience unpredictable performance. This is because the jobs share underlying resources such as CPU caches, disk I/O, network I/O, and buses (e.g., QPI, PCIe), beyond the resources typically accounted for by modern cluster schedulers. For example, when the GPU cards on a server are allocated to different ML jobs, the PCIe bus is shared as the jobs shuffle data between their allocated CPUs and GPUs, and the QPI bus is shared when two allocated GPUs are not attached to the same CPU in a non-uniform memory access (NUMA) architecture.
Different levels of interference (i.e., resource contention) occur when different types of ML jobs are co-located, depending on the models being trained and the behavior of the training programs written by users. Some ML jobs are CPU-intensive, e.g., CTC [7]; some are disk-I/O-intensive, e.g., AlexNet [8], which reads images for preprocessing; and some consume substantial network bandwidth, such as VGG-16 [9], due to a large model size (number of parameters) and small mini-batch sizes (leading to more frequent parameter exchanges among workers).
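As a back-of-the-envelope illustration of the network claim, per-worker PS traffic grows with model size and shrinks with per-step time. The step times below and the 32-bit-parameter assumption are our own illustrative figures, not measurements from this paper; VGG-16's roughly 138M parameters is a commonly cited count.

```python
# Rough per-worker bandwidth estimate for PS-based training:
# each step pushes gradients and pulls parameters (~2x the model size).
def ps_bandwidth_gbps(num_params, step_time_s, bytes_per_param=4):
    traffic_bytes = 2 * num_params * bytes_per_param  # push + pull per step
    return traffic_bytes * 8 / step_time_s / 1e9      # Gbit/s

# VGG-16 has ~138M parameters. A small mini-batch finishes a step quickly,
# so parameter exchanges happen more often and sustained bandwidth rises.
print(ps_bandwidth_gbps(138e6, step_time_s=0.5))  # ~17.7 Gbit/s
print(ps_bandwidth_gbps(138e6, step_time_s=2.0))  # ~4.4 Gbit/s with larger batches
```

Under these assumed numbers, a single VGG-16 worker with short steps can saturate a 10 Gbps NIC, which is why co-locating two such jobs on one server invites severe network contention.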
It is natural to co-locate jobs with low levels of mutual interference to optimize performance. However, the schedulers used in production ML clusters (e.g., Yarn [10], Mesos [6]) are largely interference-oblivious, due mainly to the difficulty of estimating the potential interference levels among many jobs. In the literature, a number of studies have showcased the potential and effectiveness of interference-aware scheduling, e.g., considering network contention among MapReduce jobs [11] [12] or cache access intensity of HPC jobs [13]. These studies build an explicit interference model for the target performance metric based on certain observations or assumptions, and rely on hand-crafted heuristics to incorporate interference into scheduling [11] [13] [14]. They often require detailed application profiling under tens of interference sources, along with careful tuning of the coefficients in the performance model or the thresholds in the heuristics. Generality is an issue with these white-box approaches: when the workload type or hardware configuration changes, the heuristics may no longer work well.
In this paper, we pursue a black-box approach for ML job
placement that embraces interference while not relying on