Deep Learning-based Job Placement in Distributed Machine Learning Clusters
Yixin Bao*, Yanghua Peng*, Chuan Wu*
*Department of Computer Science, The University of Hong Kong, Email: {yxbao,yhpeng,cwu}@cs.hku.hk
Abstract—Production machine learning (ML) clusters commonly host a variety of distributed ML workloads, e.g., speech recognition and machine translation. While server sharing among jobs improves resource utilization, interference among co-located ML jobs can lead to significant performance degradation. Existing cluster schedulers (e.g., Mesos) are interference-oblivious in their job placement, resulting in suboptimal resource efficiency. Interference-aware job placement has been studied in the literature, but prior solutions rely on detailed workload profiling and explicit interference modeling, which is not a general approach. This paper presents Harmony, a deep learning-driven ML cluster scheduler that places training jobs in a manner that minimizes interference and maximizes performance (i.e., minimizes training completion time). Harmony is based on a carefully designed deep reinforcement learning (DRL) framework augmented with reward modeling. The DRL employs state-of-the-art techniques to stabilize training and improve convergence, including the actor-critic algorithm, job-aware action space exploration, and experience replay. Given the common lack of reward samples corresponding to different placement decisions, we build an auxiliary reward prediction model, trained on historical samples, to produce rewards for unseen placements. Experiments with real ML workloads in a Kubernetes cluster of 6 GPU servers show that Harmony outperforms representative schedulers by 25% in terms of average job completion time.
I. INTRODUCTION
Nowadays most leading IT companies operate machine learning (ML) clusters of GPU servers. A variety of ML workloads run on these clusters to support the companies' services. For example, an online news company may run language models for news parsing, text classification for fake news detection, and a personalized recommendation system for advertisement display.
To train on large datasets or large models, ML workloads are commonly run using distributed ML frameworks, e.g., TensorFlow [1], MXNet [2] and Caffe2 [3]. In a distributed ML job, the dataset is divided among separate workers, which each train on their partition and exchange computed model parameters with one another (either directly or through parameter servers (PSs)) to derive the global parameters. The workers and PSs may well be placed on different physical servers, either because they cannot all be hosted on one server, or to utilize resource fragments across servers [4].
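To make the PS pattern concrete, below is a minimal sketch of synchronous data-parallel training. The class and function names, the toy least-squares model, and the single-process simulation are our illustrative assumptions, not Harmony's implementation or any framework's actual API; real jobs would use TensorFlow or MXNet primitives across processes and machines.

```python
import numpy as np

class ParameterServer:
    """Holds the global model; workers pull parameters and push gradients."""
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)   # global model parameters
        self.lr = lr

    def pull(self):
        return self.w.copy()     # workers fetch the current parameters

    def push(self, grads):
        # aggregate gradients from all workers, then apply one SGD step
        self.w -= self.lr * np.mean(grads, axis=0)

def worker_gradient(w, X, y):
    # gradient of 0.5/n * ||Xw - y||^2 on this worker's data shard
    return X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
true_w = rng.normal(size=10)
y = X @ true_w

shards = np.array_split(np.arange(1000), 4)  # dataset divided over 4 workers
ps = ParameterServer(dim=10)

for step in range(200):
    w = ps.pull()
    grads = [worker_gradient(w, X[idx], y[idx]) for idx in shards]
    ps.push(grads)               # parameter exchange via the PS

print("parameter error:", np.linalg.norm(ps.w - true_w))
```

Each pull/push pair in this loop corresponds to the per-step parameter exchange whose traffic patterns, when workers and PSs land on different servers, drive the placement problem studied in this paper.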
How to efficiently place different ML jobs onto servers to achieve high resource efficiency and training throughput is a fundamental challenge faced by cluster operators.
Many existing cluster schedulers (e.g., Borg [5], Mesos [6]) tend to over-subscribe server resources such as CPU and memory, allocating more resources to jobs than the servers' capacity in order to maximize utilization (assuming that not all jobs fully use their requested resources at all times). However, even without over-subscription, ML jobs co-located on the same server may interfere with each other and experience unpredictable performance. This is because the jobs share underlying resources such as CPU caches, disk I/O, network I/O, and buses (e.g., QPI, PCIe), beyond the resources typically accounted for by modern cluster schedulers. For example, when the GPU cards on a server are allocated to different ML jobs, the PCIe bus is shared as the jobs shuffle data between their allocated CPUs and GPUs, and the QPI bus is shared when two allocated GPUs are not attached to the same CPU in a non-uniform memory access (NUMA) architecture.
Different levels of interference (i.e., resource contention) occur when different types of ML jobs are co-located, depending on the models being trained and the behavior of the training programs written by users. Some ML jobs are CPU-intensive, e.g., CTC [7]; some are disk-I/O-intensive, e.g., AlexNet [8], which reads images for preprocessing; and some consume substantial network bandwidth, such as VGG-16 [9], due to a large model size (number of parameters) and small mini-batch sizes (leading to more frequent parameter exchanges among workers).
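As a back-of-the-envelope illustration of the network claim, per-worker PS traffic grows with model size and shrinks with per-step time. The step times below and the 32-bit-parameter assumption are our own illustrative figures, not measurements from this paper; VGG-16's roughly 138M parameters is a commonly cited count.

```python
# Rough per-worker bandwidth estimate for PS-based training:
# each step pushes gradients and pulls parameters (~2x the model size).
def ps_bandwidth_gbps(num_params, step_time_s, bytes_per_param=4):
    traffic_bytes = 2 * num_params * bytes_per_param  # push + pull per step
    return traffic_bytes * 8 / step_time_s / 1e9      # Gbit/s

# VGG-16 has ~138M parameters. A small mini-batch finishes a step quickly,
# so parameter exchanges happen more often and sustained bandwidth rises.
print(ps_bandwidth_gbps(138e6, step_time_s=0.5))  # ~17.7 Gbit/s
print(ps_bandwidth_gbps(138e6, step_time_s=2.0))  # ~4.4 Gbit/s with larger batches
```

Under these assumed numbers, a single VGG-16 worker with short steps can saturate a 10 Gbps NIC, which is why co-locating two such jobs on one server invites severe network contention.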
It is natural to co-locate jobs with low levels of mutual interference to optimize performance. However, the schedulers used in production ML clusters (e.g., Yarn [10], Mesos [6]) are largely interference-oblivious, due mainly to the difficulty of estimating the potential interference levels among many jobs. In the literature, a number of studies have showcased the potential and effectiveness of interference-aware scheduling, e.g., considering network contention among MapReduce jobs [11] [12] or cache access intensity of HPC jobs [13]. These studies build an explicit interference model for the target performance metric based on certain observations or assumptions, and rely on hand-crafted heuristics to incorporate interference into scheduling [11] [13] [14]. They often require detailed application profiling under tens of interference sources, along with careful tuning of the coefficients in the performance model or the thresholds in the heuristics. Generality is an issue with these white-box approaches: when the workload type or hardware configuration changes, the heuristics may no longer work well.
In this paper, we pursue a black-box approach for ML job
placement that embraces interference while not relying on