International Journal of Computer Applications (0975 - 8887)
Volume 122 - No. 8, July 2015
Spotting Outliers in Large Distributed Datasets using
Cell Density based Approach
A. Rama Satish
Associate Professor, Dept. of CSE
DVR & Dr HS MIC College of Technology
Kanchikacherla, Krishna District, A.P., India
Dr. P. Bala Krishna Prasad
Principal
Eluru College of Engineering & Technology
Eluru, Krishna District, A.P., India
ABSTRACT
Outliers are abnormal instances or observations. Detecting outliers
is an important task in knowledge discovery in databases (KDD).
Outlier detection has been studied in a large number of research
areas, such as large distributed systems, data mining, wireless
sensor networks (WSN), health monitoring, environmental science,
and statistics. Density based (DB) outlier detection techniques are
robust in detecting outliers. In many applications, large volumes of
distributed data are generated every day. Finding deviating
observations across a large distributed database, rather than in any
individual database, is not a simple task. Integrating distributed
databases causes two major problems. First, it requires moving
massive amounts of data from the different databases. Second, data
integration may violate data security and leak sensitive
information. In this work we propose a cell density based mechanism
for outlier detection (CDOD) in large distributed databases. A
centralized detection paradigm is used; it avoids both the expensive
data integration and the information leakage. The experimental
results show that the approach remains robust in finding outliers
across large numbers of databases, instances and attributes.
Keywords
Data Mining, KDD, Large distributed databases, Density based
outlier detection.
1. INTRODUCTION
Outlier detection is a significant research problem in data
mining. It aims to detect a specific number of data objects that
are considerably dissimilar, exceptional and inconsistent with
respect to the majority of records in the input databases [3, 6].
Outliers arise due to machine level errors, changes in system
behaviour, fraudulent behaviour, human errors, system faults, or
simply through natural deviations in populations. Detecting
potential outliers is important for identifying errors and removing
their contaminating effect on the dataset, so that the data are
clean for processing. Outlier detection methods can be classified
into univariate and multivariate methods. Different approaches have
been devised based on different assumptions about what constitutes
an outlier. For distributed databases, an important distinction is
between global and local outlier detection approaches. Global
approaches use all data objects as the reference set, whereas local
approaches use only a (small) subset of the data objects.
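The difference between a global and a local reference set can be
illustrated with a minimal sketch. It is not the CDOD method proposed
in this paper; the function names, the choice of a z-score as the
global score, and the neighbour-distance ratio as the local score are
all assumptions made for illustration on one-dimensional toy data.

```python
from statistics import mean, pstdev

def global_scores(values):
    """Global view: score each value against ALL values (z-score).
    Assumes the values are not all identical (non-zero deviation)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [abs(v - mu) / sigma for v in values]

def _nearest(values, i, k):
    """Indices of the k values closest to values[i], excluding i."""
    others = [j for j in range(len(values)) if j != i]
    return sorted(others, key=lambda j: abs(values[j] - values[i]))[:k]

def _knn_distance(values, i, k):
    """Mean distance from values[i] to its k nearest neighbours."""
    return mean(abs(values[j] - values[i]) for j in _nearest(values, i, k))

def local_scores(values, k=2):
    """Local view: compare each value's neighbour distance with the
    neighbour distances of its own k nearest neighbours only.
    Assumes no duplicated values (otherwise a distance may be zero)."""
    knn = [_knn_distance(values, i, k) for i in range(len(values))]
    return [
        knn[i] / mean(knn[j] for j in _nearest(values, i, k))
        for i in range(len(values))
    ]

# Hypothetical toy data: a dense group near 0, a sparse group from
# 10 to 18, and the value 1.5 sitting just outside the dense group.
data = [0.0, 0.1, 0.2, 0.3, 0.4, 1.5, 10.0, 12.0, 14.0, 16.0, 18.0]
g = global_scores(data)
l = local_scores(data, k=2)
print("most outlying (global):", data[g.index(max(g))])
print("most outlying (local): ", data[l.index(max(l))])
```

On this toy data the global score ranks 18.0 highest, because it is
farthest from the overall mean, while the local score ranks 1.5
highest, because it is far from its neighbours relative to how
tightly those neighbours are packed.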
The general design of an outlier detection technique involves the
following primary ingredients: the nature of the data, the outlier
detection technique itself, knowledge disciplines, application
domains, and finally the outliers that are reported.
[Figure 1: block diagram relating Knowledge Disciplines, Nature of
Data, Application Domains, Outlier Detection Techniques, and Outliers]
Fig. 1. A General Design of an Outlier Detection Technique
Figure 1 illustrates that any outlier detection technique has the
following primary ingredients:
—Nature of data, nature of outliers, and other constraints and
assumptions that collectively constitute the problem formulation.
—Application domain in which the technique is applied. Some of
the techniques are developed in a more generic fashion but are
still feasible in one or more domains while others directly target
a particular application domain.
—The concept and ideas used from one or more knowledge
disciplines.
In many applications, large volumes of distributed data are
generated every day. As the number of applications increases, it
becomes necessary to collect and store large amounts of data in
multiple proprietary or distributed databases for knowledge
discovery. For example, credit card transactions are scattered
across a number of distributed community data centres [18], and
detecting irregular credit card spending patterns is a typical
example of outlier detection in a large distributed database. These
kinds of abnormalities are called global outliers. Integrating
distributed databases causes two major problems. First, it requires
moving massive amounts of data from the different databases. Second,
data integration may violate data security and leak sensitive
information. Finding deviating observations across the large
distributed database, rather than in any individual database, is
not a simple task. Over the past decades, most existing outlier
detection research has focused on centralized mechanisms, where all
the data are stored and processed in a central manner. Optimizing
or boosting techniques are required