978-1-5386-8097-1/18/$31.00 ©2018 IEEE 911
2018 14th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD)
A Density-Based Method for Outlier Detection on
Massive Numerical Data
Keyan Cao∗, Anchen Miao∗, Ning Jin∗, Yuanwei Qi∗ and Ibrahim Musa∗
∗Information & Control Engineering Faculty, Shenyang Jianzhu University, Shenyang, China
Abstract—As large amounts of data are generated with the
development of the Internet, detecting outliers and obtaining
useful information from such data have become important issues.
To address this problem, this paper presents a density-based
outlier detection algorithm for numerical attributes. To reduce
the number of iterations of the k-means algorithm and improve
its efficiency, a high-density set is first selected as the candidate
set of clustering centers based on the density distribution; the
initial centers are then chosen by a method based on the
maximum distance product. After that, the data is preprocessed
by the k-means clustering algorithm, with the whole clustering
process implemented on the MapReduce programming model. A
candidate set of abnormal points is obtained from each cluster by
an appropriate pruning method. Finally, the ultimate outliers are
determined by the density-based LOF algorithm. The experimental
results show that the maximum-distance-product-based method for
initializing clustering centers improves the efficiency of clustering,
and that the proposed algorithm achieves better accuracy, scalability
and speedup on outlier detection for numerical attributes.
Index Terms—Clustering; maximum distance product; k-means;
LOF; MapReduce
I. INTRODUCTION
Outlier detection, which aims to discover specific behaviors and
potential, hidden, valuable information in data, is a hot issue in
data management [1]. According to Hawkins's definition, an outlier
is an observation that deviates so much from the other observations
as to arouse suspicion that it was generated by a different
mechanism; such a point is also called an isolated point or
abnormal point. Nowadays outlier detection technology is widely
used in network intrusion detection, fraud detection, medical
diagnosis and other fields [3].
In recent years, researchers have proposed a large number of
outlier detection algorithms, including statistical-based,
distance-based, density-based and cluster-based methods. The
earliest were statistical-based methods [4], whose main idea is to
first assume that the given data set obeys a certain distribution
model (e.g. a normal or Poisson distribution), then analyze the
model with inconsistency tests, and finally identify the objects
that deviate strongly from the distribution curve as outliers.
Distance-based methods were first presented by Knorr and Ng
[5, 6]: the distance between data objects is represented according
to a model, and the outliers are the objects whose distances to the
rest of the data set are larger than those of other objects. Such
detection methods mainly include index-based algorithms,
nested-loop methods and others. To solve the problem that
distance-based methods cannot detect local outliers, Breunig [7]
gave the definitions of local outliers and density-based outliers
together with a dedicated measure, the local outlier factor
(LOF) [8]. The LOF algorithm enables the measurement and mining
of local anomalies: the larger the LOF of a point, the more likely
it is to be abnormal; otherwise it is likely to be normal.
Cluster-based methods divide the data set into clusters, and the
data objects that do not belong to any cluster are outliers [9].
Cluster-based methods can discover not only clusters but also
outliers; however, their main purpose is to obtain clusters.
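To make the LOF measure concrete, the following is a minimal
from-scratch sketch of the quantities Breunig et al. define
(k-distance, reachability distance, local reachability density,
and finally LOF). The tiny data set and the choice k = 3 are
illustrative assumptions, not values from this paper.

```python
import math

def knn(data, i, k):
    """Indices of the k nearest neighbors of point i (excluding i itself)."""
    order = sorted(range(len(data)), key=lambda j: math.dist(data[i], data[j]))
    return [j for j in order if j != i][:k]

def k_distance(data, i, k):
    """Distance from point i to its k-th nearest neighbor."""
    return math.dist(data[i], data[knn(data, i, k)[-1]])

def reach_dist(data, i, j, k):
    """Reachability distance of i with respect to j."""
    return max(k_distance(data, j, k), math.dist(data[i], data[j]))

def lrd(data, i, k):
    """Local reachability density of point i."""
    nbrs = knn(data, i, k)
    return len(nbrs) / sum(reach_dist(data, i, j, k) for j in nbrs)

def lof(data, i, k):
    """LOF: mean lrd of i's neighbors divided by the lrd of i."""
    nbrs = knn(data, i, k)
    return sum(lrd(data, j, k) for j in nbrs) / (len(nbrs) * lrd(data, i, k))

# Four points form a tight cluster; (8, 8) is isolated.
data = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.1), (1.0, 0.8), (8.0, 8.0)]
scores = [lof(data, i, k=3) for i in range(len(data))]
# cluster points score near 1; the isolated point scores far above 1
```

As the text notes, a point with LOF close to 1 sits in a region of
density comparable to its neighbors, while a much larger value flags
a local anomaly.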
According to the kind of attribute, outliers can be divided into
categorical attribute outliers and numerical attribute outliers.
Most real data has numerical attributes, so this paper mainly
focuses on outlier detection over numerical attribute data. Our
algorithm first chooses a high-density data set, then selects the
initial centers by the maximum distance product and performs
clustering, with the whole process parallelized on MapReduce [10].
After that, it generates a candidate set by applying pruning rules
to the clustering results, and finally calculates the LOF values of
the candidate set with the LOF algorithm to identify the outliers.
II. PREPROCESSING
This process preprocesses the data with the k-means clustering
algorithm. To decrease the number of iterations of the k-means
algorithm and increase its efficiency, a high-density set is first
selected as a candidate set of centers based on the density
distribution; a method based on the maximum distance product is
then given to choose the initial centers; the whole clustering
process is parallelized; and finally, a pruning strategy is
proposed for generating the candidate set of outliers from the
formed clusters.
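As an illustration of this initialization idea, the sketch below
selects a high-density candidate set and then picks each new center
so as to maximize the product of its distances to the centers
already chosen. This is one plausible reading of the description:
the density measure (a fixed `radius`), the candidate-set size
(`top_fraction`) and the tie-breaking rules are our own assumptions,
not the paper's exact formulation.

```python
import math

def density(data, i, radius):
    """Number of points within `radius` of point i (a simple density measure)."""
    return sum(1 for p in data if math.dist(data[i], p) <= radius)

def init_centers(data, k, radius=1.0, top_fraction=0.8):
    """Choose k initial centers from a high-density candidate set by the
    maximum distance product (a sketch; details are assumptions)."""
    # 1. Keep the densest points as the candidate set of centers.
    order = sorted(range(len(data)),
                   key=lambda i: density(data, i, radius), reverse=True)
    cand = [data[i] for i in order[:max(k, int(len(data) * top_fraction))]]
    # 2. First center: the highest-density candidate.
    centers = [cand[0]]
    # 3. Each next center maximizes the product of distances to chosen centers.
    while len(centers) < k:
        best = max((p for p in cand if p not in centers),
                   key=lambda p: math.prod(math.dist(p, c) for c in centers))
        centers.append(best)
    return centers

data = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.2),   # dense region A
        (5.0, 5.0), (5.1, 5.2), (5.2, 5.1),   # dense region B
        (9.0, 0.5)]                           # sparse point
centers = init_centers(data, k=2)
# one center falls in each dense region; the sparse point is never chosen
```

Restricting the choice to high-density candidates keeps isolated
points (likely outliers) from being picked as initial centers, while
the distance product spreads the chosen centers apart.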
A. Initialization of k-means centers
The traditional k-means algorithm assigns initial centers randomly,
which leads to a large number of iterations; the algorithm is
therefore sensitive to the initial centers. The main purpose of this
paper is to find a method for selecting initial clustering centers
that makes clustering more efficient. The density distribution of
our data is shown in Fig. 1.
The goal is to find a set of initial centers that can reflect
the distribution characteristics of the data, and then divide the