A New k-NN-Centroid-Inspired Density-Based Clustering Algorithm
Xiaochun Wang, Yiqin Chen
School of Software Engineering
Xi’an Jiaotong Unversity
Xi’an, China
{xiaocchunwang@mail, chenyiqin@stu}.xjtu.edu.cn
Xia Li Wang
School of Information Engineering
Changan University
Xi’an, China
xlwang@chd.edu.cn
Abstract—Density-based clustering algorithms are well known
for identifying clusters possessing very different local densities
and existing in different regions of data space. However, the
parameters required by most popular density-based clustering
algorithms, such as DBSCAN, are hard to determine but have
significant impacts on the clustering results. In this paper, we
present a new density-based clustering algorithm in which the
selection of appropriate parameters is less difficult but more
meaningful. Experiments performed on several datasets show
the effectiveness of our approach.
Keywords-density-based clustering; centroid; k nearest
neighbors; k nearest neighbors-based centroid
I. INTRODUCTION
Being an important branch of data mining techniques,
many clustering algorithms have been introduced in the past
few decades, including partition-based clustering [1,2],
density-based clustering [3,4], grid-based clustering [5,6],
graph-based clustering [7,8] and hierarchical clustering
[9,10]. By taking neighbourhood characteristics into account,
density-based clustering algorithms count to be the most
advanced and robust approach, and embody no principle
with clearly defined algorithmic properties. To find clusters
of different shapes, sizes, and densities, recent efforts focus
on their effectiveness and ability to discover clusters of
increasingly complex shapes [9,10], and the scalability to the
size of large datasets to be clustered [11].
Compared with other clustering algorithms, traditional
density-based clustering algorithms using only global
parameter settings have difficulty revealing the skewed
distribution often associated with modern large
multidimensional real world data sets, as illustrated in the
left plot of Fig.1, which is referred to in this paper as density-
separated clustering problems. In this illustration, there are
two kinds of boundary points, those which reside in the
denser cluster whose k nearest neighbors are all data points
in the same cluster and therefore have similar densities, and
those which reside on the touching side of two clusters of
different densities and, therefore, whose k nearest neighbors
come from different clusters and may have different
densities. Further, the values for input parameters required
by modern density-based clustering algorithms are usually
difficult to determine, and, for slightly different parameter
settings, very different partitions of a data set can be
produced, as manifested in the right plot of Fig.1, which is
referred to in the following as distance separated clustering
problems. In this case, two clusters of similar density exist.
If, for example, k is selected to be no more than 8, two
clusters can be separated easily. However, if k is set to be 16,
k-nearest neighbors based clustering algorithm can not set
these two clusters apart.
Figure 1. (left) density separated clusters, (right) distance separated
clusters
Motivated by these limitations, in this paper, we propose
a new algorithm for local-density based cluster analysis. In
this new algorithm, the concept of centroid based on k-
nearest neighbours of each object in a data set is introduced
whose distance to the data object is used to represents the
degree of how well a data point is surrounded by others, and
is next used to determine a suitable value for k which is
subsequently used to detect clusters in a data set. Another
challenge to meet in this paper is how to define a density
measure for data points residing on the touching side of two
clusters of different densities. For this matter, we propose the
concept of the first neighborhood of a data point.
In the following, we first provide a brief discussion of
some related work on density-based clustering in Section 2.
We then present the concept of k-nearest neighbours based
centroid and the proposed clustering strategy in Section 3. In
Section 4, a performance evaluation of the effectiveness of
the proposed algorithm is conducted. Finally, conclusions are
made in Section 5.
II. R
ELATED WORK
Being a very popular branch in data clustering, density
based clustering algorithms rely on a local density criterion
to form clusters and can detect clusters of different shapes,
sizes, and densities. Early density based clustering