distance measures, especially the mean-based ones, were intro-
duced by Yager, with further discussion of their possible effect on controlling the hierarchical clustering process [289].
The common criticism for classical HC algorithms is that they
lack robustness and are, hence, sensitive to noise and outliers.
Once an object is assigned to a cluster, it will not be considered
again, which means that HC algorithms are not capable of cor-
recting possible previous misclassification. The computational
complexity for most HC algorithms is at least $O(N^2)$, and this high cost limits their application to large-scale data sets.
Other disadvantages of HC include the tendency to form spher-
ical shapes and the reversal phenomenon, in which the normal hier-
archical structure is distorted.
In recent years, with the requirement for handling large-scale
data sets in data mining and other fields, many new HC tech-
niques have appeared and greatly improved the clustering per-
formance. Typical examples include CURE [116], ROCK [117],
Chameleon [159], and BIRCH [295].
The main motivations of BIRCH lie in two aspects: the ability
to deal with large data sets and the robustness to outliers [295].
In order to achieve these goals, a new data structure, clustering
feature (CF) tree, is designed to store the summaries of the
original data. The CF tree is a height-balanced tree, with each
internal vertex composed of entries defined as $[\mathbf{CF}_i, \mathrm{child}_i]$, $i = 1, \ldots, B$, where $\mathbf{CF}_i$ is a representation of cluster $i$ and is defined as $\mathbf{CF}_i = (N_i, \mathbf{LS}_i, SS_i)$, where $N_i$ is the number of data objects in the cluster, $\mathbf{LS}_i$ is the linear sum of the objects, and $SS_i$ is the squared sum of the objects, $\mathrm{child}_i$ is a pointer to the $i$th child node, and $B$ is a threshold parameter that determines the maximum number of entries in the vertex; each leaf is composed of entries in the form of $[\mathbf{CF}_i]$, $i = 1, \ldots, L$, where $L$ is the threshold parameter that controls the maximum number of entries in the leaf. Moreover, the leaves must follow the restriction that the diameter $D = \left(\sum_{i=1}^{N}\sum_{j=1}^{N}\|x_i - x_j\|^2 / (N(N-1))\right)^{1/2}$
of each entry in the leaf is less than a threshold $T$. The CF
tree structure captures the important clustering information of
the original data while reducing the required storage. Outliers
are eliminated from the summaries by identifying the objects
sparsely distributed in the feature space. After the CF tree is
built, an agglomerative HC is applied to the set of summaries to
perform global clustering. An additional step may be performed
to refine the clusters. BIRCH can achieve a computational
complexity of $O(N)$.
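To make the CF summary concrete, the following is a minimal sketch (ours, not the BIRCH implementation) of a clustering feature stored as the triple $(N, \mathbf{LS}, SS)$: the triples of merged clusters simply add, and the diameter test needs only the triple; the threshold value T = 0.5 below is an arbitrary illustrative choice.

import numpy as np

class ClusteringFeature:
    """Sketch of a CF entry as the triple (N, LS, SS)."""
    def __init__(self, dim):
        self.n = 0                    # N: number of objects summarized
        self.ls = np.zeros(dim)       # LS: linear sum of the objects
        self.ss = 0.0                 # SS: squared sum of the objects

    def add(self, x):
        """Absorb one data object into the summary."""
        self.n += 1
        self.ls += x
        self.ss += float(np.dot(x, x))

    def merge(self, other):
        """CF triples are additive, so merging two summaries is O(1)."""
        self.n += other.n
        self.ls += other.ls
        self.ss += other.ss

    def diameter(self):
        """Average pairwise distance D, computable from (N, LS, SS) alone:
        D^2 = (2*N*SS - 2*||LS||^2) / (N*(N - 1))."""
        if self.n < 2:
            return 0.0
        d2 = 2 * self.n * self.ss - 2 * float(np.dot(self.ls, self.ls))
        return (max(d2, 0.0) / (self.n * (self.n - 1))) ** 0.5

# A leaf entry absorbs a new object only while its diameter stays below
# the threshold T.
T = 0.5
cf = ClusteringFeature(dim=2)
for x in np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1]]):
    cf.add(x)
print(cf.n, cf.diameter(), cf.diameter() < T)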
Noticing the restriction of centroid-based HC, which is
unable to identify arbitrary cluster shapes, Guha, Rastogi, and
Shim developed a HC algorithm, called CURE, to explore more
sophisticated cluster shapes [116]. The crucial feature of CURE
lies in the usage of a set of well-scattered points to represent
each cluster, which makes it possible to find rich cluster shapes
other than hyperspheres and avoids both the chaining effect
[88] of the minimum linkage method and the tendency of centroid-based methods to favor clusters with similar sizes. These representative
points are further shrunk toward the cluster centroid according
to an adjustable parameter in order to weaken the effects of
outliers. CURE utilizes a random sampling (and partitioning) strategy
to reduce computational complexity. Guha et al. also proposed
another agglomerative HC algorithm, ROCK, to group data
with qualitative attributes [117]. They used a novel measure
“link” to describe the relation between a pair of objects and their
common neighbors. Like CURE, a random sampling strategy is
used to handle large data sets. Chameleon is constructed from
graph theory and will be discussed in Section II-E.
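CURE's representative-point mechanism can be sketched in a few lines. The sketch below uses our own simplifying choices: a farthest-point heuristic for picking well-scattered points, and arbitrary values for the number of representatives and the shrinking fraction alpha; it is an illustration of the idea, not the exact procedure of [116].

import numpy as np

def shrink_representatives(points, num_reps=4, alpha=0.3):
    """Choose well-scattered points for one cluster, then shrink them
    toward the cluster centroid by the fraction alpha."""
    centroid = points.mean(axis=0)
    # Start from the point farthest from the centroid, then greedily add
    # the point farthest from the representatives chosen so far.
    reps = [points[np.argmax(np.linalg.norm(points - centroid, axis=1))]]
    while len(reps) < min(num_reps, len(points)):
        dist_to_reps = np.min(
            [np.linalg.norm(points - r, axis=1) for r in reps], axis=0)
        reps.append(points[np.argmax(dist_to_reps)])
    # Shrinking damps the influence of outliers among the representatives.
    return np.array([r + alpha * (centroid - r) for r in reps])

def cluster_distance(reps_a, reps_b):
    """Distance between two clusters: the minimum distance between their
    shrunk representative sets."""
    return min(float(np.linalg.norm(a - b)) for a in reps_a for b in reps_b)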
Relative hierarchical clustering (RHC) is another exploration
that considers both the internal distance (distance between a
pair of clusters which may be merged to yield a new cluster)
and the external distance (distance from the two clusters to the
rest), and uses the ratio of them to decide the proximities [203].
Leung et al. showed an interesting hierarchical clustering based
on scale-space theory [180]. They interpreted clustering using
a blurring process, in which each datum is regarded as a light
point in an image, and a cluster is represented as a blob. Li
and Biswas extended agglomerative HC to deal with both nu-
meric and nominal data. The proposed algorithm, called simi-
larity-based agglomerative clustering (SBAC), employs a mixed
data measure scheme that pays extra attention to less common
matches of feature values [183]. Parallel techniques for HC are
discussed in [69] and [217].
C. Squared Error-Based Clustering (Vector Quantization)
In contrast to hierarchical clustering, which yields successive levels of clusters by iterative fusions or divisions, partitional clustering assigns a set of objects into $K$ clusters with no hier-
archical structure. In principle, the optimal partition, based on
some specific criterion, can be found by enumerating all pos-
sibilities. But this brute force method is infeasible in practice,
due to the expensive computation [189]. Even for a small-scale
clustering problem (organizing 30 objects into 3 groups), the
number of possible partitions is $S(30,3) \approx 3.4 \times 10^{13}$, where $S$ denotes the Stirling number of the second kind. Therefore, heuristic
algorithms have been developed in order to seek approximate
solutions.
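To see the scale of this enumeration concretely, the Stirling number of the second kind, which counts the partitions of $n$ distinct objects into $k$ nonempty groups, can be evaluated exactly in a few lines of Python (the helper name stirling2 is ours):

from math import comb, factorial

def stirling2(n, k):
    """Stirling number of the second kind: the number of ways to partition
    n distinct objects into k nonempty, unlabeled groups."""
    return sum((-1) ** j * comb(k, j) * (k - j) ** n
               for j in range(k + 1)) // factorial(k)

# Even the toy problem from the text is far beyond brute-force enumeration.
print(stirling2(30, 3))   # 34314651811530, roughly 3.4e13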
One of the important factors in partitional clustering is the
criterion function [124]. The sum of squared error function is
one of the most widely used criteria. Suppose we have a set of
objects $\{x_j\}$, $j = 1, \ldots, N$, and we want to organize them into $K$ subsets $C = \{C_1, \ldots, C_K\}$. The squared error criterion then is defined as

$$J(\Gamma, M) = \sum_{i=1}^{K}\sum_{j=1}^{N} \gamma_{ij}\,\|x_j - m_i\|^2$$

where
$\Gamma = [\gamma_{ij}]$ is a partition matrix, with $\gamma_{ij} = 1$ if $x_j$ belongs to cluster $i$ and $\gamma_{ij} = 0$ otherwise, subject to $\sum_{i=1}^{K}\gamma_{ij} = 1$ for all $j$;
$M = [m_1, \ldots, m_K]$ is the cluster prototype or centroid (means) matrix;
$m_i = \frac{1}{N_i}\sum_{j=1}^{N}\gamma_{ij}x_j$ is the sample mean for the $i$th cluster;
$N_i = \sum_{j=1}^{N}\gamma_{ij}$ is the number of objects in the $i$th cluster.
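As a quick illustration (our own NumPy sketch, not code from [124]), the criterion can be evaluated for a hard partition as follows; the array labels plays the role of the partition matrix $\Gamma$, storing the cluster index of each object.

import numpy as np

def squared_error(X, labels, K):
    """Sum of squared error J for a hard partition of the rows of X.
    labels[j] is the index i of the cluster containing x_j; the centroid
    m_i is the sample mean of cluster i."""
    J = 0.0
    for i in range(K):
        members = X[labels == i]          # objects with gamma_ij = 1
        if len(members) == 0:
            continue
        m_i = members.mean(axis=0)        # m_i = (1/N_i) sum_j gamma_ij x_j
        J += float(np.sum((members - m_i) ** 2))
    return J

# Three objects assigned to two clusters.
X = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 4.0]])
labels = np.array([0, 0, 1])
print(squared_error(X, labels, K=2))      # 0.5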
Note the relation between the sum of squared error criterion
and the scatter matrices defined in multiclass discriminant anal-
ysis [75],