978-1-5386-8097-1/18/$31.00 ©2018 IEEE 911
2018 14th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD)
A Density-Based Method for Outlier Detection on
Massive Numerical Data
Keyan Cao∗, Anchen Miao∗, Ning Jin∗, Yuanwei Qi∗ and Ibrahim Musa∗
∗Information & Control Engineering Faculty, Shenyang Jianzhu University, Shenyang, China
Abstract—As large amounts of data are generated with the
development of the Internet, detecting outliers and obtaining
useful information from such data have become important issues.
To address this problem, this paper presents a density-based
outlier detection algorithm for numerical attributes. To reduce
the number of iterations of the k-means algorithm and improve
its efficiency, a high-density set is first selected as the candidate
set of clustering centers based on the density distribution; the
initial centers are then chosen by a method based on the
maximum distance product. After that, the data is preprocessed
by the k-means clustering algorithm, with the whole clustering
process implemented on the MapReduce programming model. A
candidate set of abnormal points is obtained from each cluster by
an appropriate pruning method. Finally, the ultimate outliers are
determined by the density-based LOF algorithm. The experimental
results show that the maximum-distance-product-based method for
initializing clustering centers improves the efficiency of clustering,
and that the proposed algorithm achieves better accuracy, scalability
and speedup on outlier detection for numerical attributes.
Index Terms—Clustering; maximum distance product; k-means;
LOF; MapReduce
I. INTRODUCTION
Outlier detection, which aims to discover specific behaviors and
potential, hidden, valuable information in data, is a hot issue in
data management [1]. According to Hawkins's definition, an outlier
is an observation that deviates so much from the other observations
as to arouse suspicion that it was generated by a different
mechanism; such a point is also called an isolated point or
abnormal point. Nowadays outlier detection technology is widely
used in network intrusion detection, fraud detection, medical
diagnosis and other fields [3].
In recent years, researchers have proposed a large number of
outlier detection algorithms, including statistical-based,
distance-based, density-based and cluster-based methods. The
earliest were statistical-based methods [4], whose main idea is to
first assume that the given data set obeys a certain distribution
model (e.g. a normal or Poisson distribution), then analyze the
model with inconsistency tests, and finally identify the objects
that deviate strongly from the distribution curve as outliers.
Distance-based methods were first presented by Knorr and Ng
[5, 6]: the distance between data objects is represented according
to a model, and the outliers are the objects whose distances to the
rest of the data set are larger than those of other objects. Such
detection methods mainly include index-based algorithms,
nested-loop methods and others. To solve the problem that
distance-based methods cannot detect local outliers, Breunig [7]
gave the definitions of local outliers and density-based outliers
together with a dedicated measure, the local outlier factor
(LOF) [8]. The LOF algorithm enables the measurement and mining
of local anomalies: the larger the LOF of a point, the more likely
it is to be abnormal; otherwise it is likely to be normal.
Cluster-based methods divide the data set into clusters, and the
data objects that do not belong to any cluster are outliers [9].
Cluster-based methods can discover not only clusters but also
outliers; however, their main purpose is to obtain clusters.
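To make the LOF measure concrete, the following is a minimal
from-scratch sketch of the quantities Breunig et al. define
(k-distance, reachability distance, local reachability density,
and finally LOF). The tiny data set and the choice k = 3 are
illustrative assumptions, not values from this paper.

```python
import math

def knn(data, i, k):
    """Indices of the k nearest neighbors of point i (excluding i itself)."""
    order = sorted(range(len(data)), key=lambda j: math.dist(data[i], data[j]))
    return [j for j in order if j != i][:k]

def k_distance(data, i, k):
    """Distance from point i to its k-th nearest neighbor."""
    return math.dist(data[i], data[knn(data, i, k)[-1]])

def reach_dist(data, i, j, k):
    """Reachability distance of i with respect to j."""
    return max(k_distance(data, j, k), math.dist(data[i], data[j]))

def lrd(data, i, k):
    """Local reachability density of point i."""
    nbrs = knn(data, i, k)
    return len(nbrs) / sum(reach_dist(data, i, j, k) for j in nbrs)

def lof(data, i, k):
    """LOF: mean lrd of i's neighbors divided by the lrd of i."""
    nbrs = knn(data, i, k)
    return sum(lrd(data, j, k) for j in nbrs) / (len(nbrs) * lrd(data, i, k))

# Four points form a tight cluster; (8, 8) is isolated.
data = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.1), (1.0, 0.8), (8.0, 8.0)]
scores = [lof(data, i, k=3) for i in range(len(data))]
# cluster points score near 1; the isolated point scores far above 1
```

As the text notes, a point with LOF close to 1 sits in a region of
density comparable to its neighbors, while a much larger value flags
a local anomaly.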
According to the kind of attribute, outliers can be divided into
categorical attribute outliers and numerical attribute outliers.
Most real data has numerical attributes, so this paper mainly
focuses on outlier detection over numerical attribute data. Our
algorithm first chooses a high-density data set, then selects the
initial centers by the maximum distance product and performs
clustering, with the whole process parallelized on MapReduce [10].
After that, it generates a candidate set by applying pruning rules
to the clustering results, and finally calculates the LOF values of
the candidate set with the LOF algorithm to identify the outliers.
II. PREPROCESSING
This process preprocesses the data with the k-means clustering
algorithm. To decrease the number of iterations of the k-means
algorithm and increase its efficiency, a high-density set is first
selected as a candidate set of centers based on the density
distribution; a method based on the maximum distance product is
then given to choose the initial centers; the whole clustering
process is parallelized; and finally, a pruning strategy is
proposed for generating the candidate set of outliers from the
formed clusters.
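As an illustration of this initialization idea, the sketch below
selects a high-density candidate set and then picks each new center
so as to maximize the product of its distances to the centers
already chosen. This is one plausible reading of the description:
the density measure (a fixed `radius`), the candidate-set size
(`top_fraction`) and the tie-breaking rules are our own assumptions,
not the paper's exact formulation.

```python
import math

def density(data, i, radius):
    """Number of points within `radius` of point i (a simple density measure)."""
    return sum(1 for p in data if math.dist(data[i], p) <= radius)

def init_centers(data, k, radius=1.0, top_fraction=0.8):
    """Choose k initial centers from a high-density candidate set by the
    maximum distance product (a sketch; details are assumptions)."""
    # 1. Keep the densest points as the candidate set of centers.
    order = sorted(range(len(data)),
                   key=lambda i: density(data, i, radius), reverse=True)
    cand = [data[i] for i in order[:max(k, int(len(data) * top_fraction))]]
    # 2. First center: the highest-density candidate.
    centers = [cand[0]]
    # 3. Each next center maximizes the product of distances to chosen centers.
    while len(centers) < k:
        best = max((p for p in cand if p not in centers),
                   key=lambda p: math.prod(math.dist(p, c) for c in centers))
        centers.append(best)
    return centers

data = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.2),   # dense region A
        (5.0, 5.0), (5.1, 5.2), (5.2, 5.1),   # dense region B
        (9.0, 0.5)]                           # sparse point
centers = init_centers(data, k=2)
# one center falls in each dense region; the sparse point is never chosen
```

Restricting the choice to high-density candidates keeps isolated
points (likely outliers) from being picked as initial centers, while
the distance product spreads the chosen centers apart.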
A. Initialization of k-means centers
The traditional k-means algorithm assigns initial centers randomly,
which leads to a large number of iterations; the algorithm is
therefore sensitive to the initial centers. The main purpose of this
paper is to find a method for selecting initial clustering centers
that makes clustering more efficient. The density distribution of
our data is shown in Fig. 1.
The goal is to find a set of initial centers that can reflect
the distribution characteristics of the data, and then divide the