algorithms is represented respectively by the mean value, the ‘‘mode’’ and the ‘‘medoid’’
of all objects belonging to the cluster. These algorithms run in two steps. First, k objects
are generated at random to represent the k clusters. Second, an iterative control strategy is
used to optimize the objective function, which is typically the average or the sum of the
distances between the objects in a cluster and its representative object.
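As a rough illustration of this two-step scheme, the following minimal k-means-style sketch (our illustration, not pseudo-code from any of the cited papers) alternates between assigning objects to their nearest representative and recomputing each representative as the mean of its cluster:

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Minimal k-means: each cluster is represented by its mean."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick k objects as the initial representatives.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every object to its nearest representative.
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Step 2: update each representative (the mean of its cluster),
        # which decreases the sum-of-distances objective described above.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)])
        if np.allclose(new_centers, centers):
            break  # converged
        centers = new_centers
    return labels, centers
```

The k-modes and k-medoids variants differ only in the update step, where the representative is recomputed as the mode or the medoid of the cluster instead of the mean.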
CLARANS [17] is an improved k-medoids-type algorithm that requires two parameters
from the user. It is well suited to huge data sets, and its efficiency is much higher than that
of PAM and CLARA, which are presented in [18].
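As a hedged sketch of the CLARANS search strategy, assuming (as in [17]) that the two user parameters are numlocal (the number of random restarts) and maxneighbor (the number of unsuccessful swap attempts before declaring a local optimum):

```python
import random

def clarans_sketch(dist, n, k, numlocal, maxneighbor):
    """Randomized search over sets of k medoids, in the spirit of CLARANS.

    dist(i, j) gives the distance between objects i and j.
    """
    def cost(medoids):
        # Sum of each object's distance to its nearest medoid.
        return sum(min(dist(i, m) for m in medoids) for i in range(n))

    best, best_cost = None, float("inf")
    for _ in range(numlocal):                  # numlocal random restarts
        current = random.sample(range(n), k)   # random initial medoid set
        current_cost = cost(current)
        fails = 0
        while fails < maxneighbor:
            # A "neighbor" solution swaps one medoid for one non-medoid.
            neighbor = current.copy()
            neighbor[random.randrange(k)] = random.choice(
                [i for i in range(n) if i not in current])
            c = cost(neighbor)
            if c < current_cost:               # downhill move: accept and reset
                current, current_cost, fails = neighbor, c, 0
            else:
                fails += 1
        if current_cost < best_cost:           # keep the best local optimum
            best, best_cost = current, current_cost
    return best, best_cost
```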
Streaming clustering algorithms for mining dynamic data sets are also very popular [6–10].
In [6], the authors proposed a constant-factor approximation algorithm for the k-median
problem in the data stream model, based on the Small-Space algorithm. The data set needs
to be scanned only once; however, the authors also show negative results implying that the
algorithm cannot be improved in a certain sense.
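A hedged sketch of the divide-and-conquer idea behind Small-Space; base_cluster is a hypothetical placeholder for any constant-factor k-median subroutine, not a routine from [6]:

```python
def small_space_sketch(points, k, n_pieces, base_cluster):
    """Two-level divide-and-conquer in the spirit of Small-Space.

    base_cluster(weighted_points, k) -> list of (center, weight) pairs;
    it stands in for any constant-factor k-median subroutine.
    """
    # Level 1: split the data into pieces small enough to cluster in memory,
    # and reduce each piece to k weighted centers.
    size = max(1, len(points) // n_pieces)
    weighted_centers = []
    for start in range(0, len(points), size):
        piece = [(p, 1) for p in points[start:start + size]]
        weighted_centers.extend(base_cluster(piece, k))
    # Level 2: cluster the weighted centers to obtain the final k medians.
    return base_cluster(weighted_centers, k)
```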
A framework for clustering evolving data streams proposed in [8] is divided into an online
component, which periodically stores detailed summary statistics, and an offline
component, which uses only these summary statistics to extract the clusters. In addition, a
pyramidal time frame is designed to store the ‘‘snapshots’’.
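The summary statistics in [8] are additively maintainable; below is a minimal sketch, with field names of our choosing, of a micro-cluster summary that absorbs stream points and merges with other summaries (the actual framework in [8] additionally keeps time statistics for the pyramidal snapshots):

```python
import numpy as np

class MicroCluster:
    """Additive summary statistics for a group of d-dimensional stream points.

    Stores the count, the linear sum and the squared sum, so the mean and
    variance can be recovered and two summaries can be merged by addition.
    """
    def __init__(self, d):
        self.n = 0
        self.ls = np.zeros(d)   # linear sum of the points
        self.ss = np.zeros(d)   # sum of the squared points

    def absorb(self, x):
        x = np.asarray(x, float)
        self.n += 1
        self.ls += x
        self.ss += x * x

    def merge(self, other):
        self.n += other.n
        self.ls += other.ls
        self.ss += other.ss

    def centroid(self):
        return self.ls / self.n

    def variance(self):
        return self.ss / self.n - self.centroid() ** 2
```

Because the summaries are additive, the offline component can extract clusters from stored snapshots alone, e.g. by clustering the micro-cluster centroids.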
However, this framework is not suited to arbitrarily shaped clusters. As a result, a density-
based clustering algorithm over an evolving data stream with noise is proposed in [9].
A novel clustering algorithm based on passing messages between data points is presented in
[19, 20]. The authors of [19] devised a method called ‘‘affinity propagation’’, which takes
as input measures of similarity between pairs of data points and clusters them by
exchanging messages between the points. Two incremental affinity propagation (IAP)
clustering algorithms based on message passing are proposed in [20]: IAP clustering based
on k-medoids (IAPKM) and IAP clustering based on nearest-neighbor assignment
(IAPNA).
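For concreteness, here is a compact sketch of the responsibility and availability message updates defined in [19]; the damping factor and the use of the diagonal of the similarity matrix S as exemplar preferences follow the usual conventions:

```python
import numpy as np

def affinity_propagation_sketch(S, n_iter=200, damping=0.5):
    """Affinity propagation on a similarity matrix S (diagonal = preferences)."""
    n = S.shape[0]
    R = np.zeros((n, n))  # responsibilities: point i -> candidate exemplar k
    A = np.zeros((n, n))  # availabilities:  candidate exemplar k -> point i
    for _ in range(n_iter):
        # r(i,k) = s(i,k) - max_{k' != k} [ a(i,k') + s(i,k') ]
        M = A + S
        idx = M.argmax(axis=1)
        first = M[np.arange(n), idx].copy()
        M[np.arange(n), idx] = -np.inf
        second = M.max(axis=1)
        R_new = S - first[:, None]
        R_new[np.arange(n), idx] = S[np.arange(n), idx] - second
        R = damping * R + (1 - damping) * R_new
        # a(i,k) = min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())
        col = Rp.sum(axis=0)
        A_new = np.minimum(0, col[None, :] - Rp)
        # a(k,k) = sum_{i' != k} max(0, r(i',k))
        np.fill_diagonal(A_new, col - R.diagonal())
        A = damping * A + (1 - damping) * A_new
    return np.argmax(A + R, axis=1)  # each point's chosen exemplar
```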
Apart from the clustering algorithms introduced above, the algorithms with the greatest
influence on our clustering algorithm are DBSCAN [21] and OPTICS [5]. To our
knowledge, DBSCAN is the first density-based clustering algorithm that is not grid-based.
Most density-based clustering algorithms are partitioning algorithms, and so is ICA.
Unlike grid-based clustering algorithms, where density is defined over grid cells, the
density in DBSCAN is specified by two parameters, ε and MinPts, which are preset by the
user. The objects in a cluster are divided into two subsets: core objects and border objects.
For each core object in a cluster, the neighborhood of radius ε has to contain at least
MinPts objects; for each border object, there must be a corresponding core object whose
neighborhood contains the border object. Based on this definition of a cluster, the authors
of [21] give detailed pseudo-code.
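A minimal sketch (ours, not the pseudo-code of [21]) of how the core/border definitions drive the usual seed-expansion procedure:

```python
import numpy as np

def dbscan_sketch(X, eps, min_pts):
    """Label points by expanding clusters from core objects; -1 marks noise.

    A point is a core object if its eps-neighborhood (including itself)
    contains at least min_pts objects.
    """
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    neighbors = [np.where(D[i] <= eps)[0] for i in range(n)]
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or len(neighbors[i]) < min_pts:
            continue  # already assigned, or not a core object
        # Grow a new cluster outward from core object i.
        labels[i] = cluster
        seeds = list(neighbors[i])
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster              # core or border object
                if len(neighbors[j]) >= min_pts:
                    seeds.extend(neighbors[j])   # core object: keep expanding
        cluster += 1
    return labels
```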
OPTICS is a more mature method that can both automatically find the hierarchical
clustering structure and present the inherent structure of the data set to users. As with
DBSCAN, users of OPTICS also need to preset ε and MinPts, though the reachability-plot
is rather insensitive to these parameters.
The authors of [5] suggest that the parameter values merely have to be ‘‘large’’ enough to
yield a good result. Based on the reachability-plot, automatic techniques are proposed to
find the clusters. Compared with DBSCAN, in our opinion, the biggest advantage of
OPTICS is the reachability-plot, which presents the inherent structure of the data set
intuitively. Note that the reachability-plot is independent of the dimensionality of the data
set. Users can obtain a good clustering result by setting proper parameters based on the
reachability-plot.
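To make the reachability-plot concrete, here is a simplified sketch (ours; [5] maintains the seed list more carefully) that computes each object's core-distance and emits a processing order together with reachability values:

```python
import heapq
import numpy as np

def optics_sketch(X, eps, min_pts):
    """Return a processing order and reachability values (inf = undefined)."""
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    # Core-distance: distance to the min_pts-th nearest neighbor within eps
    # (the neighborhood includes the point itself), else undefined (inf).
    core = np.full(n, np.inf)
    for p in range(n):
        d = np.sort(D[p][D[p] <= eps])
        if len(d) >= min_pts:
            core[p] = d[min_pts - 1]
    reach = np.full(n, np.inf)
    processed = np.zeros(n, bool)
    order = []
    for start in range(n):
        if processed[start]:
            continue
        heap = [(np.inf, start)]          # lazy-deletion priority queue
        while heap:
            r, p = heapq.heappop(heap)
            if processed[p]:
                continue                  # stale queue entry
            processed[p] = True
            order.append(p)
            if core[p] == np.inf:
                continue                  # not a core object: do not expand
            for q in np.where(D[p] <= eps)[0]:
                if not processed[q]:
                    # Reachability-distance of q with respect to p.
                    new_r = max(core[p], D[p, q])
                    if new_r < reach[q]:
                        reach[q] = new_r
                        heapq.heappush(heap, (new_r, q))
    return order, reach
```

Plotting the reachability values in the returned order yields the reachability-plot; valleys in the plot correspond to clusters, independently of the dimensionality of the data.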