Rapid Deployment of Anomaly Detection Models
for Large Number of Emerging KPI Streams
Jiahao Bu†¶, Ying Liu†¶, Shenglin Zhang‡∗, Weibin Meng†¶, Qitong Liu§, Xiaotian Zhu§, Dan Pei†¶
†Tsinghua University  ‡Nankai University  §Tencent
¶Beijing National Research Center for Information Science and Technology (BNRist)
Email: bjh16@mails.tsinghua.edu.cn, liuying@cernet.edu.cn, zhangsl@nankai.edu.cn, mwb16@mails.tsinghua.edu.cn,
{cloudliu, markxtzhu}@tencent.com, peidan@tsinghua.edu.cn
Abstract—Internet-based services monitor and detect anomalies on KPIs (Key Performance Indicators, e.g., CPU utilization, number of queries per second, response latency) of their applications and systems in order to keep their services reliable. This paper identifies a common, important, yet little-studied problem in KPI anomaly detection: the rapid deployment of anomaly detection models for a large number of emerging KPI streams, without manual algorithm selection, parameter tuning, or new anomaly labeling for any newly emerging KPI stream. We propose ADS (Anomaly Detection through Self-training), the first framework that tackles this problem, via clustering and semi-supervised learning. Our extensive experiments using real-world data show that, with the labels of only the 5 cluster centroids of 70 historical KPI streams, ADS achieves an averaged best F-score of 0.92 on 81 new KPI streams, almost the same as a state-of-the-art supervised approach, and greatly outperforming a state-of-the-art unsupervised approach by 61.40% on average.
I. INTRODUCTION
Internet-based services (e.g., online games, online shopping, social networks, search engines) monitor KPIs (Key Performance Indicators, e.g., CPU utilization, number of queries per second, response latency) of their applications and systems in order to keep their services reliable. Anomalies in a KPI stream (e.g., a spike or dip) likely indicate underlying failures of Internet services [1]–[5], such as server failures, network overload, and external attacks, and should be detected accurately and rapidly.
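To make the notion of a spike or dip concrete, the sketch below flags points that deviate sharply from a short rolling window of recent values (a simple k-sigma rule on toy data; this is an illustrative baseline, not any specific algorithm from the paper, and the window size and threshold are arbitrary choices):

```python
import numpy as np

def detect_anomalies(kpi, window=10, k=3.0):
    """Flag points deviating more than k standard deviations
    from the mean of the preceding `window` points."""
    kpi = np.asarray(kpi, dtype=float)
    flags = np.zeros(len(kpi), dtype=bool)
    for i in range(window, len(kpi)):
        ref = kpi[i - window:i]
        mu, sigma = ref.mean(), ref.std()
        if sigma > 0 and abs(kpi[i] - mu) > k * sigma:
            flags[i] = True
    return flags

# A mildly fluctuating KPI stream with one injected spike:
stream = [9.0, 11.0] * 10
stream[15] = 100.0  # spike
print(np.where(detect_anomalies(stream))[0])  # → [15]
```

Note that once the spike enters the reference window, the inflated standard deviation suppresses further alarms, which is one reason such simple rules need careful per-stream tuning.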
Despite the rich body of literature on KPI anomaly detection [2], [6]–[12], there remains one common and important scenario that has not been studied or well handled by any of these approaches. Specifically, when a large number of KPI streams emerge continuously and frequently, operators need to deploy accurate anomaly detection models for these new KPI streams as quickly as possible (e.g., within 3 weeks at most), in order to prevent Internet-based services from suffering false alarms (due to low precision) and/or missed alarms (due to low recall), which in turn impact user experience and revenue. A large number of new KPI streams emerge for the following two reasons. First, new products can be launched frequently, such as on gaming platforms. For example, in a top gaming company G studied in this paper, on average over ten new games are launched per quarter, which results in more than 6000 new KPI streams every 10 days on average.
∗Shenglin Zhang is the corresponding author.
Second, with the popularity of DevOps and microservices, software upgrades are becoming more and more frequent [13]. Many of these upgrades change the patterns of existing KPI streams, rendering the previous anomaly detection algorithms/parameters outdated.
Unfortunately, none of the existing anomaly detection approaches, whether traditional statistical algorithms, supervised learning, or unsupervised learning, can handle the above scenario well. For traditional statistical algorithms [6]–[9], to achieve the best accuracy, operators have to manually select an anomaly detection algorithm and tune its parameters for each KPI stream, which is infeasible for the large number of emerging KPI streams. Supervised learning based methods [2], [10] require manually labeling anomalies for each new KPI stream, which is not feasible for the large number of emerging KPI streams either. Unsupervised learning based methods [11], [12] require no algorithm selection, parameter tuning, or manual labels, but they either suffer from low accuracy [14] or require a large amount of training data for each new KPI stream (e.g., six months' worth of data) [12], which does not satisfy the requirement of rapid deployment (e.g., within 3 weeks) of accurate anomaly detection.
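To illustrate why per-stream tuning is unavoidable for fixed-parameter statistical detectors, consider a fixed absolute threshold applied to two hypothetical streams (toy data; the streams and the threshold value are assumptions made for illustration only):

```python
import numpy as np

def threshold_detector(kpi, limit):
    """Flag every point above a fixed absolute threshold."""
    return np.asarray(kpi) > limit

smooth = [10, 11, 10, 12, 11, 10, 50]   # low variance, one real spike at the end
bursty = [10, 80, 15, 90, 20, 85, 300]  # routine bursts plus one real spike

limit = 40  # tuned for the smooth stream
print(np.where(threshold_detector(smooth, limit))[0])  # → [6]
print(np.where(threshold_detector(bursty, limit))[0])  # → [1 3 5 6]
```

The limit that is correct for the smooth stream raises three false alarms on the bursty one, so each stream would need its own parameter value, which does not scale to thousands of emerging streams.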
In this paper, we propose ADS, the first framework that enables the rapid deployment of anomaly detection models (within at most 3 weeks) for a large number of emerging KPI streams, without manual algorithm selection, parameter tuning, or new anomaly labeling for any newly emerging KPI stream.
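The "Self-training" in ADS's name refers to a standard semi-supervised technique: fit a model on the few available labeled samples, pseudo-label the unlabeled samples from new streams, and refit on the union. The sketch below illustrates that loop with a toy nearest-centroid classifier on made-up two-dimensional features; it is a minimal illustration of the general technique, not the paper's actual model or feature set:

```python
import numpy as np

def fit_centroids(X, y):
    """Per-class mean feature vector (a toy base classifier)."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict(centroids, X):
    """Assign each sample to the class of the nearest centroid."""
    classes = sorted(centroids)
    dists = np.stack([np.linalg.norm(X - centroids[c], axis=1) for c in classes])
    return np.array(classes)[dists.argmin(axis=0)]

def self_train(X_labeled, y_labeled, X_unlabeled, rounds=3):
    """Self-training loop: fit on labeled data, pseudo-label the
    unlabeled data, then refit on the union for a few rounds."""
    X, y = X_labeled, y_labeled
    for _ in range(rounds):
        centroids = fit_centroids(X, y)
        pseudo = predict(centroids, X_unlabeled)
        X = np.vstack([X_labeled, X_unlabeled])
        y = np.concatenate([y_labeled, pseudo])
    return fit_centroids(X, y)

# Hypothetical features per data point: [KPI value, deviation score].
# Labeled points come from a historical stream; 0 = normal, 1 = anomaly.
X_l = np.array([[10.0, 0.1], [11.0, 0.2], [95.0, 8.5]])
y_l = np.array([0, 0, 1])
X_u = np.array([[10.0, 0.0], [12.0, 0.3], [90.0, 8.0]])  # unlabeled new stream

model = self_train(X_l, y_l, X_u)
print(predict(model, np.array([[10.0, 0.1], [92.0, 8.2]])))  # → [0 1]
```

The unlabeled points from the new stream end up contributing to the decision boundary without any new manual labels, which is the property ADS exploits.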
Our idea for ADS is based on the following two observations. (1) In practice, many KPI streams (e.g., the number of queries per server in a well load-balanced server cluster) are similar due to their implicit associations, so we can potentially apply similar anomaly detection algorithms and parameters to these similar KPI streams. (2) Clustering methods such as ROCKA [15] can be used to group many KPI streams into clusters according to their similarities. The number of clusters is largely determined by the nature of the service (e.g., shopping, gaming, social network, search) and the type of KPI (e.g., number of queries, CPU usage, memory usage), but not by the scale of the entire system. Thus, for a given service, the number of clusters can be orders of magnitude smaller than the number of KPI streams, and there is a good chance that a newly emerging KPI stream falls
978-1-5386-6808-5/18/$31.00 ©2018 IEEE