Discovering Local Outlier Based on Rough
Clustering
Hongjuan Mi
Information Engineering School, Lanzhou Commercial College
Lanzhou Gansu, PR China
hjm347@hotmail.com
Abstract—We define the density at a data point based on a kernel
function and introduce weights to refine the rough k-means
algorithm. We then construct a formula for calculating a local
outlier score from the clusters generated by the refined rough
k-means algorithm. Experiments on a synthetic data set and a
real-world data set verify that the new local outlier detection
technique is both accurate and efficient.
Keywords—data mining; outlier detection; clustering; rough k-means; density
I. INTRODUCTION
An outlier in a data set is defined as a data point that is very
different from the remainder of the dataset based on some
measure. In some data mining applications, outliers are
neglected as noises because of the attention to the majority of
objects. However, “One person’s noise is another person’s
signal”. For some applications, outliers, rare events, often
contain useful information on abnormal behavior of the system
described by the dataset. To date, several approaches to outlier
detection have been proposed, including statistical model-based,
depth-based, distance-based, and density-based approaches. In
addition, clustering algorithms such as BIRCH, ROCK, DBSCAN,
DENCLUE, CURE, and SNN density-based clustering specifically
include techniques for handling outliers.
Researchers have attempted to apply these algorithms for
detecting outliers to tasks such as fraud detection, intrusion
detection, data cleaning, public health, medical treatment,
ecosystem disturbances, video surveillance, medicine, and
weather prediction.
The statistical-model-based approach to outlier detection
assumes that the data follow a known parametric distribution.
Such approaches do not work well even in moderately
high-dimensional spaces. To improve the situation, depth-based
methods from computational statistics have been developed.
These methods avoid the problem of distribution fitting, but
they are not expected to be practical for large datasets with
more than four dimensions [1]. To overcome these limitations,
researchers have turned to various non-parametric approaches,
including distance-based approaches [2] and density-based
approaches [3].
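As a minimal illustration of the statistical-model-based approach (a sketch, not a method from this paper), the snippet below assumes a Gaussian model and flags points whose distance from the mean exceeds a threshold in standard-deviation units; the function name `zscore_outliers` and the threshold value are assumptions for illustration. It also shows a known weakness: a single extreme value inflates the estimated standard deviation, so a loose threshold is needed to catch it.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Flag points more than `threshold` standard deviations from
    the mean, under an assumed (one-dimensional) Gaussian model."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [x for x in values if abs(x - mean) / std > threshold]

data = [10.1, 9.8, 10.0, 10.3, 9.9, 10.2, 25.0]
# The extreme value 25.0 inflates the standard deviation, so the
# default threshold of 3.0 misses it; a threshold of 2.0 catches it.
print(zscore_outliers(data, threshold=2.0))  # → [25.0]
```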
Knorr and Ng first presented the notion of distance-based
outliers. They defined a point as a distance-based outlier if at
least a user-defined fraction of the points in the dataset lie
further than some user-defined minimum distance from that point.
Although several variations on the idea of distance-based
outlier detection exist, the basic notion is the same: an
outlier is an object that is abnormal in that it is distant from
most other points. One of the simplest realizations is to use
the distance to the k-nearest neighbor. In [4] the notion of
distance-based outliers is extended by using the distance to the
kth-nearest neighbor to rank the outliers, and a very efficient
algorithm to compute the top-n outliers in this ranking is
given. However, this approach typically takes O(m²) time (where
m is the number of objects). The approach is also sensitive to
the choice of parameters, and it cannot handle data sets with
regions of widely differing densities.
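The kth-nearest-neighbor ranking idea can be sketched naively as follows. This deliberately uses the O(m²) pairwise computation discussed above rather than the efficient top-n algorithm of [4]; the function names `knn_distance_scores` and `top_n_outliers` are illustrative assumptions.

```python
import math

def knn_distance_scores(points, k):
    """Score each point by the distance to its k-th nearest
    neighbor; larger scores suggest outliers. This naive version
    computes all O(m^2) pairwise distances."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(
            math.dist(p, q) for j, q in enumerate(points) if j != i
        )
        scores.append(dists[k - 1])
    return scores

def top_n_outliers(points, k, n):
    """Return the n points with the largest k-th-NN distance."""
    scores = knn_distance_scores(points, k)
    ranked = sorted(range(len(points)),
                    key=lambda i: scores[i], reverse=True)
    return [points[i] for i in ranked[:n]]

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(top_n_outliers(pts, k=2, n=1))  # → [(10, 10)]
```

Note that a fixed k-th-NN distance threshold would exhibit exactly the weakness noted above: in regions of widely differing densities, points in a sparse but legitimate cluster receive scores comparable to true outliers.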
The approaches mentioned above regard being an outlier as
a binary property. For many applications, the situation is more
complex, and it becomes more meaningful to assign each
object a degree of being an outlier.
Density-based approaches to outlier detection grew out of
DBSCAN; in this line of work a local outlier factor (LOF) is
computed for each point [3]. The LOF of an object p is the
average of the ratios of the local reachability densities of p's
MinPts-nearest neighbors to that of p itself. However, the size
of a point's neighborhood is determined by the area containing a
user-supplied minimum number of points (MinPts), whose selection
is difficult. Like distance-based approaches, these approaches
also have O(m²) time complexity.
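A minimal sketch of the LOF computation described above, assuming Euclidean distance and small data (function names are illustrative; a practical implementation would use a spatial index to avoid the O(m²) cost):

```python
import math

def lof_scores(points, k):
    """Sketch of the local outlier factor: LOF(p) is the average
    ratio of the local reachability density (lrd) of each of p's
    k nearest neighbors to p's own lrd."""
    m = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    def neighbours(i):
        order = sorted((j for j in range(m) if j != i),
                       key=lambda j: dist[i][j])
        return order[:k]

    def k_distance(i):
        return dist[i][neighbours(i)[-1]]

    def lrd(i):
        nbrs = neighbours(i)
        # reach-dist(i, j) = max(k-distance(j), d(i, j))
        reach = [max(k_distance(j), dist[i][j]) for j in nbrs]
        return len(nbrs) / sum(reach)

    lrds = [lrd(i) for i in range(m)]
    return [sum(lrds[j] for j in neighbours(i)) / (k * lrds[i])
            for i in range(m)]

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = lof_scores(pts, k=2)
# Points deep inside a cluster score near 1; the isolated point
# (5, 5) gets a much larger LOF.
print(scores.index(max(scores)))  # → 4
```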
Cluster analysis finds groups of strongly related objects,
while anomaly detection finds objects that are not strongly
related to other objects. Doubtlessly, clustering can be used for
outlier detection. A systematic approach is to first cluster all
data points and then assess the degree to which a data point
belongs to a cluster in order to classify it as an outlier.
Obviously, the quality of the clustering heavily impacts the
quality of the detected outliers.
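The cluster-then-score pattern can be sketched as follows. This sketch uses plain k-means with distance to the nearest centroid as the outlier score; it is a stand-in for illustration only, not the refined rough k-means method this paper develops, and the function names and seeding strategy are assumptions.

```python
import math

def kmeans(points, k, iters=20):
    """Plain k-means (Lloyd's algorithm) as a stand-in clusterer;
    the first k points seed the centroids, for determinism."""
    centroids = list(points[:k])
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        centroids = [
            tuple(sum(col) / len(cl) for col in zip(*cl))
            if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids

def centroid_distance_scores(points, centroids):
    """Outlier score = distance to the nearest centroid; points
    far from every cluster center score highest."""
    return [min(math.dist(p, c) for c in centroids) for p in points]

pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9), (5, 20)]
cents = kmeans(pts, k=2)
scores = centroid_distance_scores(pts, cents)
print(pts[scores.index(max(scores))])  # → (5, 20)
```

Scoring by distance to a hard centroid is exactly where the clustering quality matters: a poor clustering misplaces the centroids and distorts every score downstream.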
Rough set theory provides the ability to deal with incomplete
and approximate information. Lingras and West proposed the rough
k-means clustering algorithm [5]. Peters analyzed the algorithm
and made some refinements.
In rough clustering each cluster has two approximations, a
lower and an upper approximation. The lower approximation is
a subset of the upper approximation. The members of the lower
approximation certainly belong to the cluster; therefore they
cannot belong to any other cluster. The data objects in an
upper approximation may belong to the cluster. Since their
The work was supported by Gansu Natural Science Foundation (3ZS051-
A25-045)
978-1-4244-9857-4/11/$26.00 ©2011 IEEE