distance measures, especially the mean-based ones, were introduced by Yager, together with a discussion of their possible effect on controlling the hierarchical clustering process [289].
A common criticism of classical HC algorithms is that they lack robustness and are, hence, sensitive to noise and outliers. Once an object is assigned to a cluster, it will not be considered again, which means that HC algorithms are not capable of correcting possible previous misclassifications. The computational complexity for most HC algorithms is at least $O(N^2)$, and this high cost limits their application to large-scale data sets. Other disadvantages of HC include the tendency to form spherical clusters and the reversal phenomenon, in which the normal hierarchical structure is distorted.
In recent years, with the requirement for handling large-scale
data sets in data mining and other fields, many new HC tech-
niques have appeared and greatly improved the clustering per-
formance. Typical examples include CURE [116], ROCK [117],
Chameleon [159], and BIRCH [295].
The main motivations of BIRCH lie in two aspects, the ability
to deal with large data sets and the robustness to outliers [295].
In order to achieve these goals, a new data structure, clustering
feature (CF) tree, is designed to store the summaries of the
original data. The CF tree is a height-balanced tree, with each internal vertex composed of entries defined as $[\mathrm{CF}_i, \mathrm{child}_i]$, $i = 1, \ldots, B$, where $\mathrm{CF}_i$ is a representation of the $i$th cluster and is defined as a triple $\mathrm{CF}_i = (N_i, \mathrm{LS}_i, \mathrm{SS}_i)$, where $N_i$ is the number of data objects in the cluster, $\mathrm{LS}_i$ is the linear sum of the objects, and $\mathrm{SS}_i$ is the squared sum of the objects, $\mathrm{child}_i$ is a pointer to the $i$th child node, and $B$ is a threshold parameter that determines the maximum number of entries in the vertex; each leaf is composed of entries in the form of $[\mathrm{CF}_i]$, $i = 1, \ldots, L$, where $L$ is the threshold parameter that controls the maximum number of entries in the leaf. Moreover, the leaves must follow the restriction that the diameter $D = \left(\sum_{i=1}^{N}\sum_{j=1}^{N}\|\mathbf{x}_i - \mathbf{x}_j\|^2 / (N(N-1))\right)^{1/2}$ of each entry in the leaf is less than a threshold $T$. The CF
tree structure captures the important clustering information of
the original data while reducing the required storage. Outliers
are eliminated from the summaries by identifying the objects
sparsely distributed in the feature space. After the CF tree is
built, an agglomerative HC is applied to the set of summaries to
perform global clustering. An additional step may be performed
to refine the clusters. BIRCH can achieve a computational complexity of $O(N)$.
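To illustrate why the $(N, \mathrm{LS}, \mathrm{SS})$ summaries suffice, the following minimal Python sketch (the class ClusteringFeature and its methods are hypothetical illustrations, not BIRCH's published implementation) maintains such a triple, merges two subclusters by adding their summaries, and recovers the diameter $D$ without revisiting the original objects.

    import numpy as np

    class ClusteringFeature:
        # Hypothetical sketch of a CF summary (N, LS, SS) stored in a CF tree entry.
        def __init__(self, n, ls, ss):
            self.n, self.ls, self.ss = n, ls, ss

        @classmethod
        def from_points(cls, points):
            pts = np.atleast_2d(np.asarray(points, dtype=float))
            return cls(pts.shape[0], pts.sum(axis=0), float((pts ** 2).sum()))

        def merge(self, other):
            # Merging two subclusters only requires adding their summaries.
            return ClusteringFeature(self.n + other.n, self.ls + other.ls, self.ss + other.ss)

        def diameter(self):
            # D = sqrt( sum_i sum_j ||x_i - x_j||^2 / (N (N - 1)) ), recovered from
            # (N, LS, SS) alone, since sum_i sum_j ||x_i - x_j||^2 = 2 N SS - 2 ||LS||^2.
            if self.n < 2:
                return 0.0
            pairwise_sq = 2 * self.n * self.ss - 2 * float(self.ls @ self.ls)
            return float(np.sqrt(max(pairwise_sq, 0.0) / (self.n * (self.n - 1))))

    a = ClusteringFeature.from_points([[0.0, 0.0], [1.0, 0.0]])
    b = ClusteringFeature.from_points([[0.5, 1.0]])
    print(a.merge(b).diameter())  # absorb b into a only if this stays below the threshold T

In this spirit, a new object is absorbed by the closest leaf entry only if the resulting diameter remains below the threshold $T$; otherwise a new entry (and possibly a node split) is created, which keeps each summary compact.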
Noticing the restriction of centroid-based HC, which is
unable to identify arbitrary cluster shapes, Guha, Rastogi, and
Shim developed a HC algorithm, called CURE, to explore more
sophisticated cluster shapes [116]. The crucial feature of CURE
lies in the usage of a set of well-scattered points to represent
each cluster, which makes it possible to find rich cluster shapes other than hyperspheres and avoids both the chaining effect [88] of the minimum linkage method and the tendency of centroid-based methods to favor clusters with similar sizes. These representative points are further shrunk toward the cluster centroid according to an adjustable parameter in order to weaken the effects of outliers. CURE utilizes a random sampling (and partitioning) strategy to reduce the computational complexity. Guha et al. also proposed
another agglomerative HC algorithm, ROCK, to group data
with qualitative attributes [117]. They used a novel measure
“link” to describe the relation between a pair of objects and their
common neighbors. Like CURE, a random sampling strategy is
used to handle large data sets. Chameleon is constructed from
graph theory and will be discussed in Section II-E.
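To make these two ideas concrete, the short Python sketch below (the helpers shrink_representatives and rock_links are hypothetical illustrations, not the published algorithms in full) shows CURE's shrinking of well-scattered representative points toward the cluster centroid by an adjustable fraction alpha, and ROCK's "link" count, i.e., the number of neighbors shared by a pair of objects under a similarity threshold.

    import numpy as np

    def shrink_representatives(rep_points, centroid, alpha):
        # CURE-style step: move each well-scattered representative point a fraction
        # alpha of the way toward the cluster centroid, damping the effect of outliers.
        rep = np.asarray(rep_points, dtype=float)
        c = np.asarray(centroid, dtype=float)
        return rep + alpha * (c - rep)

    def rock_links(objects, theta):
        # ROCK-style step: two objects are neighbors if their Jaccard similarity is at
        # least theta; link(i, j) is then the number of neighbors the pair has in common.
        def jaccard(a, b):
            a, b = set(a), set(b)
            return len(a & b) / len(a | b) if (a | b) else 1.0
        n = len(objects)
        adj = np.array([[jaccard(objects[i], objects[j]) >= theta for j in range(n)]
                        for i in range(n)], dtype=int)
        np.fill_diagonal(adj, 0)
        return adj @ adj  # entry (i, j) counts the common neighbors of objects i and j

Setting alpha close to 1 collapses the representatives onto the centroid and recovers centroid-like behavior, while alpha close to 0 keeps the scattered points and is therefore more sensitive to outliers.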
Relative hierarchical clustering (RHC) is another exploration
that considers both the internal distance (distance between a
pair of clusters which may be merged to yield a new cluster)
and the external distance (distance from the two clusters to the
rest), and uses their ratio to decide the proximities [203]. Leung et al. presented an interesting hierarchical clustering method based on scale-space theory [180]. They interpreted clustering as a blurring process, in which each datum is regarded as a light
point in an image, and a cluster is represented as a blob. Li
and Biswas extended agglomerative HC to deal with both nu-
meric and nominal data. The proposed algorithm, called simi-
larity-based agglomerative clustering (SBAC), employs a mixed
data measure scheme that pays extra attention to less common
matches of feature values [183]. Parallel techniques for HC are discussed in [69] and [217].
C. Squared Error-Based Clustering (Vector Quantization)
In contrast to hierarchical clustering, which yields a succes-
sive level of clusters by iterative fusions or divisions, partitional
clustering assigns a set of objects into $K$ clusters with no hierarchical structure. In principle, the optimal partition, based on some specific criterion, can be found by enumerating all possibilities. But this brute force method is infeasible in practice, due to the expensive computation [189]. Even for a small-scale clustering problem (organizing 30 objects into 3 groups), the number of possible partitions is approximately $2 \times 10^{14}$. Therefore, heuristic
algorithms have been developed in order to seek approximate
solutions.
One of the important factors in partitional clustering is the
criterion function [124]. The sum of squared error function is
one of the most widely used criteria. Suppose we have a set of objects $\mathbf{x}_j \in \mathbb{R}^d$, $j = 1, \ldots, N$, and we want to organize them into $K$ subsets $C = \{C_1, \ldots, C_K\}$. The squared error criterion then is defined as

$$J(\mathbf{\Gamma}, \mathbf{M}) = \sum_{i=1}^{K} \sum_{j=1}^{N} \gamma_{ij} \, \|\mathbf{x}_j - \mathbf{m}_i\|^2$$

where
$\mathbf{\Gamma} = [\gamma_{ij}]$ is a partition matrix, with $\gamma_{ij} = 1$ if $\mathbf{x}_j \in$ cluster $i$ and $\gamma_{ij} = 0$ otherwise, subject to $\sum_{i=1}^{K} \gamma_{ij} = 1$ for all $j$;
$\mathbf{M} = [\mathbf{m}_1, \ldots, \mathbf{m}_K]$ is the cluster prototype or centroid (means) matrix;
$\mathbf{m}_i = \frac{1}{N_i}\sum_{j=1}^{N} \gamma_{ij}\mathbf{x}_j$ is the sample mean for the $i$th cluster;
$N_i = \sum_{j=1}^{N} \gamma_{ij}$ is the number of objects in the $i$th cluster.
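As a quick numerical check of this criterion, the short Python sketch below (the helper sum_squared_error is a hypothetical illustration) evaluates $J$ for a hard partition encoded as a label vector, computing the cluster means and the within-cluster squared distances exactly as defined above.

    import numpy as np

    def sum_squared_error(X, labels, K):
        # X: (N, d) data matrix; labels[j] in {0, ..., K-1} encodes the partition
        # matrix (gamma_ij = 1 iff labels[j] == i). Returns J(Gamma, M).
        X = np.asarray(X, dtype=float)
        labels = np.asarray(labels)
        J = 0.0
        for i in range(K):
            members = X[labels == i]                  # objects assigned to cluster i
            if members.size == 0:
                continue
            m_i = members.mean(axis=0)                # sample mean of the ith cluster
            J += float(((members - m_i) ** 2).sum())  # squared distances to m_i
        return J

    # Two well-separated groups yield a much smaller J than a random split of the same data.
    X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 10.0])
    print(sum_squared_error(X, np.repeat([0, 1], 50), K=2))
    print(sum_squared_error(X, np.random.randint(0, 2, size=100), K=2))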
Note the relation between the sum of squared error criterion
and the scatter matrices defined in multiclass discriminant anal-
ysis [75],