Rapid Deployment of Anomaly Detection Models
for Large Number of Emerging KPI Streams
Jiahao Bu†¶, Ying Liu†¶, Shenglin Zhang‡∗, Weibin Meng†¶, Qitong Liu§, Xiaotian Zhu§, Dan Pei†¶
†Tsinghua University  ‡Nankai University  §Tencent
¶Beijing National Research Center for Information Science and Technology (BNRist)
Email: bjh16@mails.tsinghua.edu.cn, liuying@cernet.edu.cn, zhangsl@nankai.edu.cn, mwb16@mails.tsinghua.edu.cn,
{cloudliu, markxtzhu}@tencent.com, peidan@tsinghua.edu.cn
Abstract—Internet-based services monitor and detect anomalies on KPIs (Key Performance Indicators, e.g., CPU utilization, number of queries per second, response latency) of their applications and systems in order to keep their services reliable. This paper identifies a common, important, yet little-studied problem in KPI anomaly detection: the rapid deployment of anomaly detection models for a large number of emerging KPI streams, without manual algorithm selection, parameter tuning, or new anomaly labeling for any newly emerging KPI stream. We propose ADS (Anomaly Detection through Self-training), the first framework that tackles this problem, via clustering and semi-supervised learning. Our extensive experiments using real-world data show that, with the labels of only the 5 cluster centroids of 70 historical KPI streams, ADS achieves an averaged best F-score of 0.92 on 81 new KPI streams, almost the same as a state-of-the-art supervised approach, and greatly outperforming a state-of-the-art unsupervised approach by 61.40% on average.
I. INTRODUCTION
Internet-based services (e.g., online games, online shopping, social networks, search engines) monitor KPIs (Key Performance Indicators, e.g., CPU utilization, number of queries per second, response latency) of their applications and systems in order to keep their services reliable. Anomalies in a KPI stream (e.g., a spike or dip) likely indicate underlying failures of Internet services [1]–[5], such as server failures, network overload, and external attacks, and should be detected accurately and rapidly.
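To make the notion of a spike or dip concrete, the sketch below flags points that deviate sharply from a short rolling window of recent values (a simple k-sigma rule on toy data; this is an illustrative baseline, not any specific algorithm from the paper, and the window size and threshold are arbitrary choices):

```python
import numpy as np

def detect_anomalies(kpi, window=10, k=3.0):
    """Flag points deviating more than k standard deviations
    from the mean of the preceding `window` points."""
    kpi = np.asarray(kpi, dtype=float)
    flags = np.zeros(len(kpi), dtype=bool)
    for i in range(window, len(kpi)):
        ref = kpi[i - window:i]
        mu, sigma = ref.mean(), ref.std()
        if sigma > 0 and abs(kpi[i] - mu) > k * sigma:
            flags[i] = True
    return flags

# A mildly fluctuating KPI stream with one injected spike:
stream = [9.0, 11.0] * 10
stream[15] = 100.0  # spike
print(np.where(detect_anomalies(stream))[0])  # → [15]
```

Note that once the spike enters the reference window, the inflated standard deviation suppresses further alarms, which is one reason such simple rules need careful per-stream tuning.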
Despite the rich body of literature on KPI anomaly detection [2], [6]–[12], there remains one common and important scenario that has not been studied or well handled by any of these approaches. Specifically, when a large number of KPI streams emerge continuously and frequently, operators need to deploy accurate anomaly detection models for these new KPI streams as quickly as possible (e.g., within 3 weeks at most), in order to prevent Internet-based services from suffering false alarms (due to low precision) and/or missed alarms (due to low recall), which in turn impact user experience and revenue. A large number of new KPI streams emerge for the following two reasons. First, new products can be launched frequently, such as on gaming platforms. For example, in a top gaming company G studied in this paper, on average over ten new games are launched per quarter, which results in more than 6000 new KPI streams every 10 days on average.
∗Shenglin Zhang is the corresponding author.
Second, with the popularity of DevOps and microservices, software upgrades are becoming more and more frequent [13]. Many of these upgrades change the patterns of existing KPI streams, rendering the previous anomaly detection algorithms/parameters outdated.
Unfortunately, none of the existing anomaly detection approaches, whether traditional statistical algorithms, supervised learning, or unsupervised learning, can handle the above scenario well. For traditional statistical algorithms [6]–[9], to achieve the best accuracy, operators have to manually select an anomaly detection algorithm and tune its parameters for each KPI stream, which is infeasible for the large number of emerging KPI streams. Supervised learning based methods [2], [10] require manually labeling anomalies for each new KPI stream, which is not feasible for the large number of emerging KPI streams either. Unsupervised learning based methods [11], [12] require no algorithm selection, parameter tuning, or manual labels, but they either suffer from low accuracy [14] or require a large amount of training data for each new KPI stream (e.g., six months' worth of data) [12], which does not satisfy the requirement of rapid deployment (e.g., within 3 weeks) of accurate anomaly detection.
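To illustrate why per-stream tuning is unavoidable for fixed-parameter statistical detectors, consider a fixed absolute threshold applied to two hypothetical streams (toy data; the streams and the threshold value are assumptions made for illustration only):

```python
import numpy as np

def threshold_detector(kpi, limit):
    """Flag every point above a fixed absolute threshold."""
    return np.asarray(kpi) > limit

smooth = [10, 11, 10, 12, 11, 10, 50]   # low variance, one real spike at the end
bursty = [10, 80, 15, 90, 20, 85, 300]  # routine bursts plus one real spike

limit = 40  # tuned for the smooth stream
print(np.where(threshold_detector(smooth, limit))[0])  # → [6]
print(np.where(threshold_detector(bursty, limit))[0])  # → [1 3 5 6]
```

The limit that is correct for the smooth stream raises three false alarms on the bursty one, so each stream would need its own parameter value, which does not scale to thousands of emerging streams.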
In this paper, we propose ADS, the first framework that enables the rapid deployment of anomaly detection models (within at most 3 weeks) for a large number of emerging KPI streams, without manual algorithm selection, parameter tuning, or new anomaly labeling for any newly emerging KPI stream.
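The "Self-training" in ADS's name refers to a standard semi-supervised technique: fit a model on the few available labeled samples, pseudo-label the unlabeled samples from new streams, and refit on the union. The sketch below illustrates that loop with a toy nearest-centroid classifier on made-up two-dimensional features; it is a minimal illustration of the general technique, not the paper's actual model or feature set:

```python
import numpy as np

def fit_centroids(X, y):
    """Per-class mean feature vector (a toy base classifier)."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict(centroids, X):
    """Assign each sample to the class of the nearest centroid."""
    classes = sorted(centroids)
    dists = np.stack([np.linalg.norm(X - centroids[c], axis=1) for c in classes])
    return np.array(classes)[dists.argmin(axis=0)]

def self_train(X_labeled, y_labeled, X_unlabeled, rounds=3):
    """Self-training loop: fit on labeled data, pseudo-label the
    unlabeled data, then refit on the union for a few rounds."""
    X, y = X_labeled, y_labeled
    for _ in range(rounds):
        centroids = fit_centroids(X, y)
        pseudo = predict(centroids, X_unlabeled)
        X = np.vstack([X_labeled, X_unlabeled])
        y = np.concatenate([y_labeled, pseudo])
    return fit_centroids(X, y)

# Hypothetical features per data point: [KPI value, deviation score].
# Labeled points come from a historical stream; 0 = normal, 1 = anomaly.
X_l = np.array([[10.0, 0.1], [11.0, 0.2], [95.0, 8.5]])
y_l = np.array([0, 0, 1])
X_u = np.array([[10.0, 0.0], [12.0, 0.3], [90.0, 8.0]])  # unlabeled new stream

model = self_train(X_l, y_l, X_u)
print(predict(model, np.array([[10.0, 0.1], [92.0, 8.2]])))  # → [0 1]
```

The unlabeled points from the new stream end up contributing to the decision boundary without any new manual labels, which is the property ADS exploits.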
Our idea for ADS is based on the following two observations. (1) In practice, many KPI streams (e.g., the number of queries per server in a well load-balanced server cluster) are similar due to their implicit associations, so we can potentially apply similar anomaly detection algorithms and parameters to these similar KPI streams. (2) Clustering methods such as ROCKA [15] can be used to group many KPI streams into clusters according to their similarities. The number of clusters is largely determined by the nature of the service (e.g., shopping, gaming, social network, search) and the type of KPI (e.g., number of queries, CPU usage, memory usage), but not by the scale of the entire system. Thus, for a given service, the number of clusters can be orders of magnitude smaller than the number of KPI streams, and there is a good chance that a newly emerging KPI stream falls
978-1-5386-6808-5/18/$31.00 ©2018 IEEE