大规模分布式数据集中的异常检测：基于细胞密度的方法

需积分: 10 116 浏览量更新于2024-09-08 收藏 679KB PDF 举报

"该文章探讨了在大型分布式数据集中检测异常值的问题，提出了一种基于细胞密度的异常检测机制（CDOD），旨在解决大量分布式数据集中的异常检测挑战，同时避免数据集成带来的问题，如大规模数据处理和敏感信息泄露。" 在大数据时代，异常值（Outliers）的识别是知识发现和数据挖掘（KDD）领域中的关键概念。异常值通常表示数据中的异常实例或观察结果，可能蕴含着重要的信息或者揭示潜在的问题。在诸如大型分布式系统、数据挖掘、无线传感器网络、健康监控、环境科学和统计等多个研究领域，异常检测都有着广泛的应用。传统的异常检测方法，尤其是基于密度的方法（Density-based），如DBSCAN（Density-Based Spatial Clustering of Applications with Noise），在识别异常点方面表现出了很好的鲁棒性。然而，随着每天生成的分布式数据量日益庞大，直接在单个数据库中寻找偏离的数据变得愈发困难。数据集成成为一大挑战，不仅需要处理海量数据，还可能导致数据安全性和敏感信息的泄露。针对这一问题，文章提出的Cell Density based Outlier Detection（CDOD）机制采用集中式检测范式，避免了昂贵的数据集成过程，同时减少了信息泄露的风险。CDOD方法将数据空间划分为小的“细胞”（Cells），并计算每个细胞的密度，通过比较细胞间的密度差异来识别可能的异常点。这种方法能够有效地处理大型分布式数据库中的数据，并在大量数据库、实例和属性中展现出对异常值的稳健检测能力。实验结果显示，CDOD在多种场景下都能有效地检测出异常值，证明了其在处理大规模数据集时的适用性和准确性。这种方法对于实时监控、故障检测以及预防性维护等应用具有重要意义，因为它能帮助用户及时发现并分析可能存在的异常情况，从而做出相应的决策。文章通过提出CDOD机制，为大型分布式数据集中的异常检测提供了一种有效且安全的解决方案，对大数据环境下的数据分析和处理做出了重要贡献。

International Journal of Computer Applications (0975 - 8887)

Volume 122 - No. 8, July 2015

Spotting Outliers in Large Distributed Datasets using

Cell Density based Approach

A.Rama Satish

Associate Professor, Dept. of CSE

DVR & Dr HS MIC College of Technology

Kanchikacherla,Krishna District,A.P., India

Dr.P.Bala Krishna Prasad

Principal

Eluru College of Engineering & Technology

Eluru , Krishna Distrct, A.P., India

ABSTRACT

Outliers are abnormal instances or observations. Detecting data

outliers is a very important concept in Knowledge data discovery.

Outlier detection has been studied in the context of a large number

of research areas like large distributed systems, data mining,

wireless sensor networks(WSN), health monitoring, environmental

science, statistics, etc., Density based (DB) outlier detection

techniques are robust in detecting outliers. In many applications,

too much voluminous distributed data is generating every day.

Finding deviating observations in the large distributed database

rather than in any individual database is not a simple task.

Integrating distributed database cause two major problems. First,

render massive data from different databases. In addition, data

integration may cause violation of data security and leakage of

sensitive information. In this work we propose cell density based

mechanism for outlier detection (CDOD) in large distributed

databases. A centralized detection paradigm is used; it allows

overcoming the expensive data integration and information

leakage. The experimental results show robustness for ﬁnding

outliers in large number of databases, instances and attributes.

Keywords

Data Mining, KDD,Large distributed databases, Density based

outlier detection.

1. INTRODUCTION

Outlier detection is great signiﬁcant research problem in data

mining. This mainly aims to detect a speciﬁc number of

data objects that are considerably dissimilar, exceptional and

inconsistent with respect to the majority records in the input

databases[3, 6]. Outliers arise due to machine level errors,

changes in system behaviour, fraudulent behaviour, human errors,

system faults, or simply through natural deviations in populations.

Detection of potential outliers is important for identifying the errors

and removes their contaminating effect on the dataset to make

the data clean for processing. Outlier detection methods can be

classiﬁed between univariate methods and multivariate method.

Different approaches are devised based on different assumptions

to detect outliers. The best way of detecting outliers in distributed

databases is global versus local outlier detection approach. All

data objects are considered as reference set in global approaches

but the local approaches contains a (small) subset of data objects.

The general design of outlier detection technique contains the

primary ingredients of nature of data, outlier detection technique,

knowledge disciplines, application domains, ﬁnally outliers.

Knowledge Disciplines

Nature of Data Outliers

Application Domains

Outlier Detection Techniques

Fig. 1. A General Design of an Outlier Detection Technique

Figure 1 illustrated that any outlier detection technique has the

following primary ingredients:

—Nature of data, nature of outliers, and other constraints and

assumptions that collectively constitute the problem formulation.

—Application domain in which the technique is applied. Some of

the techniques are developed in a more generic fashion but are

still feasible in one or more domains while others directly target

a particular application domain.

—The concept and ideas used from one or more knowledge

disciplines.

In many applications, too much voluminous distributed data is

generating every day. The increase in number of applications

it is necessary to collect and store a large amount of data

in multiple proprietary or distributed databases for knowledge

discovery. Credit card transactions are scattered across a number of

distributed community data centres[18]. Detecting irregular credit

card spending patterns is the best example for outlier detection in

large distributed database. These kinds of abnormalities are called

global outliers. Integrating distributed database cause two major

problems. First, render massive data from different databases. In

addition, data integration may cause violation of data security and

leakage of sensitive information. Finding deviating observations in

the large distributed database rather than in any individual database

is not a simple task.For the past decades, most of the existing

outlier detecting research work is focused on the centralized outlier

detection mechanism where all the data are stored and processed in

a central manner. Optimizing or boosting techniques are required

下载后可阅读完整内容，剩余6页未读，立即下载

uuddoop

粉丝: 1

大规模分布式数据集中的异常检测：基于细胞密度的方法

Trainable Frontend For Robust and Far-Field Keyword Spotting.pdf

Trainable_Segmentation-master.zip

Trainable Relation Extraction framework-开源

Sequential Cells

arduino ASRPRO

给我一个用Scala编写的复杂一点的和药相关的spark实例，包含代码和数据

arraylist和linklist区别 面试

rtb10.4.mltbx

连续在线关键字识别是什么

最新资源

arraylist和linklist区别面试