Discovering Local Outlier Based on Rough
Clustering
Hongjuan Mi
Information Engineering School, Lanzhou Commercial College
Lanzhou Gansu, PR China
hjm347@hotmail.com
Abstract—We define the density at a data point based on a kernel
function and introduce weights to refine the rough k-means
algorithm. We then construct a formula for calculating a local
outlier score from the clusters generated by the refined rough
k-means algorithm. Experiments on a synthetic data set and a
real-world data set verify that the new local outlier detection
technique is both accurate and efficient.
Keywords—data mining; outlier detection; clustering; rough k-means; density
I. INTRODUCTION
An outlier in a data set is defined as a data point that is very
different from the remainder of the dataset based on some
measure. In some data mining applications, outliers are
neglected as noises because of the attention to the majority of
objects. However, “One person’s noise is another person’s
signal”. For some applications, outliers, rare events, often
contain useful information on abnormal behavior of the system
described by the dataset. To date, several approaches to outlier
detection have been proposed, including statistical model-based,
depth-based, distance-based, and density-based approaches. In
addition, clustering algorithms such as BIRCH, ROCK, DBSCAN,
DENCLUE, CURE, and SNN density-based clustering specifically
include techniques for handling outliers.
Researchers have attempted to apply these algorithms for
detecting outliers to tasks such as fraud detection, intrusion
detection, data cleaning, public health, medical treatment,
ecosystem disturbances, video surveillance, medicine, and
weather prediction.
The statistical-model-based approach to outlier detection
assumes that the data follow a known parametric distribution.
Such approaches do not work well even in moderately
high-dimensional spaces. To improve the situation, depth-based
methods from computational statistics have been developed.
These methods avoid the problem of distribution fitting, but
they are not expected to be practical for large datasets with
more than four dimensions [1]. To overcome these limitations,
researchers have turned to various non-parametric approaches,
including distance-based approaches [2] and density-based
approaches [3].
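As a minimal illustration of the statistical-model-based approach (a sketch, not a method from this paper), the snippet below assumes a Gaussian model and flags points whose distance from the mean exceeds a threshold in standard-deviation units; the function name `zscore_outliers` and the threshold value are assumptions for illustration. It also shows a known weakness: a single extreme value inflates the estimated standard deviation, so a loose threshold is needed to catch it.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Flag points more than `threshold` standard deviations from
    the mean, under an assumed (one-dimensional) Gaussian model."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [x for x in values if abs(x - mean) / std > threshold]

data = [10.1, 9.8, 10.0, 10.3, 9.9, 10.2, 25.0]
# The extreme value 25.0 inflates the standard deviation, so the
# default threshold of 3.0 misses it; a threshold of 2.0 catches it.
print(zscore_outliers(data, threshold=2.0))  # → [25.0]
```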
Knorr and Ng first presented the notion of distance-based
outliers. They defined a point as a distance-based outlier if at
least a user-defined fraction of the points in the dataset lie
further than some user-defined minimum distance from that point.
Although several variations on the idea of distance-based
outlier detection exist, the basic notion is the same: an
outlier is an object that is abnormal in that it is distant from
most other points. One of the simplest realizations is to use
the distance to the k-nearest neighbor. In [4] the notion of
distance-based outliers is extended by using the distance to the
kth-nearest neighbor to rank the outliers, and a very efficient
algorithm to compute the top-n outliers in this ranking is
given. However, this approach typically takes O(m²) time (where
m is the number of objects). The approach is also sensitive to
the choice of parameters, and it cannot handle data sets with
regions of widely differing densities.
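The kth-nearest-neighbor ranking idea can be sketched naively as follows. This deliberately uses the O(m²) pairwise computation discussed above rather than the efficient top-n algorithm of [4]; the function names `knn_distance_scores` and `top_n_outliers` are illustrative assumptions.

```python
import math

def knn_distance_scores(points, k):
    """Score each point by the distance to its k-th nearest
    neighbor; larger scores suggest outliers. This naive version
    computes all O(m^2) pairwise distances."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(
            math.dist(p, q) for j, q in enumerate(points) if j != i
        )
        scores.append(dists[k - 1])
    return scores

def top_n_outliers(points, k, n):
    """Return the n points with the largest k-th-NN distance."""
    scores = knn_distance_scores(points, k)
    ranked = sorted(range(len(points)),
                    key=lambda i: scores[i], reverse=True)
    return [points[i] for i in ranked[:n]]

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(top_n_outliers(pts, k=2, n=1))  # → [(10, 10)]
```

Note that a fixed k-th-NN distance threshold would exhibit exactly the weakness noted above: in regions of widely differing densities, points in a sparse but legitimate cluster receive scores comparable to true outliers.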
The approaches mentioned above regard being an outlier as
a binary property. For many applications, the situation is more
complex, and it becomes more meaningful to assign each
object a degree of being an outlier.
Density-based approaches to outlier detection grew out of
DBSCAN; in this line of work a local outlier factor (LOF) is
computed for each point [3]. The LOF of an object p is the
average of the ratios of the local reachability densities of p's
MinPts-nearest neighbors to that of p itself. However, the size
of a point's neighborhood is determined by the area containing a
user-supplied minimum number of points (MinPts), whose selection
is difficult. Like distance-based approaches, these approaches
also have O(m²) time complexity.
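A minimal sketch of the LOF computation described above, assuming Euclidean distance and small data (function names are illustrative; a practical implementation would use a spatial index to avoid the O(m²) cost):

```python
import math

def lof_scores(points, k):
    """Sketch of the local outlier factor: LOF(p) is the average
    ratio of the local reachability density (lrd) of each of p's
    k nearest neighbors to p's own lrd."""
    m = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    def neighbours(i):
        order = sorted((j for j in range(m) if j != i),
                       key=lambda j: dist[i][j])
        return order[:k]

    def k_distance(i):
        return dist[i][neighbours(i)[-1]]

    def lrd(i):
        nbrs = neighbours(i)
        # reach-dist(i, j) = max(k-distance(j), d(i, j))
        reach = [max(k_distance(j), dist[i][j]) for j in nbrs]
        return len(nbrs) / sum(reach)

    lrds = [lrd(i) for i in range(m)]
    return [sum(lrds[j] for j in neighbours(i)) / (k * lrds[i])
            for i in range(m)]

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = lof_scores(pts, k=2)
# Points deep inside a cluster score near 1; the isolated point
# (5, 5) gets a much larger LOF.
print(scores.index(max(scores)))  # → 4
```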
Cluster analysis finds groups of strongly related objects,
while anomaly detection finds objects that are not strongly
related to other objects. Doubtlessly, clustering can be used for
outlier detection. A systematic approach is to first cluster all
data points and then assess the degree to which a data point
belongs to a cluster in order to classify it as an outlier.
Obviously, the quality of the clustering heavily impacts the
quality of the detected outliers.
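The cluster-then-score pattern can be sketched as follows. This sketch uses plain k-means with distance to the nearest centroid as the outlier score; it is a stand-in for illustration only, not the refined rough k-means method this paper develops, and the function names and seeding strategy are assumptions.

```python
import math

def kmeans(points, k, iters=20):
    """Plain k-means (Lloyd's algorithm) as a stand-in clusterer;
    the first k points seed the centroids, for determinism."""
    centroids = list(points[:k])
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        centroids = [
            tuple(sum(col) / len(cl) for col in zip(*cl))
            if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids

def centroid_distance_scores(points, centroids):
    """Outlier score = distance to the nearest centroid; points
    far from every cluster center score highest."""
    return [min(math.dist(p, c) for c in centroids) for p in points]

pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9), (5, 20)]
cents = kmeans(pts, k=2)
scores = centroid_distance_scores(pts, cents)
print(pts[scores.index(max(scores))])  # → (5, 20)
```

Scoring by distance to a hard centroid is exactly where the clustering quality matters: a poor clustering misplaces the centroids and distorts every score downstream.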
Rough set theory provides the ability to deal with incomplete
and approximate information. Lingras and West proposed the rough
k-means clustering algorithm [5]. Peters analyzed the algorithm
and made some refinements.
In rough clustering each cluster has two approximations, a
lower and an upper approximation. The lower approximation is
a subset of the upper approximation. The members of the lower
approximation certainly belong to the cluster; therefore they
cannot belong to any other cluster. The data objects in an
upper approximation may belong to the cluster. Since their
The work was supported by Gansu Natural Science Foundation (3ZS051-
A25-045)
978-1-4244-9857-4/11/$26.00 ©2011 IEEE