A Comprehensive Guide to Cluster Analysis and Its Applications

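"Clustering is the first comprehensive book on cluster analysis. It covers clustering methods from basic to advanced, including but not limited to proximity measures, hierarchical clustering, partitional clustering, neural network-based clustering, kernel-based clustering, sequential data clustering, large-scale data clustering, data visualization, high-dimensional data clustering, and cluster validation. The book suits readers of different levels and backgrounds and requires no prior knowledge of clustering; abundant examples and references make complex topics easier to follow. It is published by IEEE Press and sponsored by the IEEE Computational Intelligence Society."

Clustering is a core concept in data mining and machine learning. It is an unsupervised learning method whose goal is to group similar data together into so-called clusters. This process helps reveal the intrinsic structure of the data and uncover unknown patterns and relationships without relying on predefined classes or labels. "Clustering" in the title refers to the study of cluster analysis as a whole, a field applied widely in areas such as market segmentation, bioinformatics, and social network analysis.

According to the description, the book starts from the basics and works up through the following methods and techniques:

1. **Proximity measures**: the basis for quantifying how similar two data points are. Common measures include Euclidean distance, Manhattan distance, and cosine similarity.
2. **Hierarchical clustering**: divided into agglomerative and divisive methods, which build a tree (dendrogram) that shows how the data are nested into clusters.
3. **Partitional clustering**: for example the K-means algorithm, which fixes the number of clusters in advance and iteratively assigns each data point to the nearest cluster center.
4. **Neural network-based clustering**: uses the parallel processing and learning capabilities of neural networks, for example self-organizing maps (SOM).
5. **Kernel-based clustering**: uses a kernel function to map the data into a high-dimensional space, so that data that are hard to separate in the original space become separable in the new one.
6. **Sequential data clustering**: clustering of time series or other ordered data, taking the order of the data points into account.
7. **Large-scale data clustering**: algorithms for large data sets, such as distributed or approximate methods, that cope with memory limits and computational cost.
8. **Data visualization**: graphical representations, such as scatter plots and heat maps, that help in understanding clustering results.
9. **High-dimensional data clustering**: clustering in high-dimensional spaces faces the "curse of dimensionality" and calls for dimensionality reduction techniques such as principal component analysis (PCA) or other dedicated strategies.
10. **Cluster validation**: the assessment of clustering quality with internal and external validation indices, such as the silhouette coefficient and the Calinski-Harabasz index.

The description also notes that the book is published by IEEE Press, a well-known technical publisher, in a series devoted to computational intelligence, and that it is sponsored by the IEEE Computational Intelligence Society, a professional organization for computational intelligence research and applications.

"Clustering" offers a broad and detailed treatment of cluster analysis and is a valuable resource for readers who want to learn the field. After reading it, readers should understand the core concepts of clustering, be familiar with the main families of methods, and be able to apply them to practical problems.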

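As a small illustration of the proximity measures named in the list above, the following sketch computes Euclidean distance, Manhattan distance, and cosine similarity for plain Python sequences. The function names and sample points are chosen here for illustration and do not come from the book.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (undefined for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


p, q = (1.0, 2.0), (4.0, 6.0)
print(euclidean(p, q))   # 5.0
print(manhattan(p, q))   # 7.0
print(cosine_similarity((1.0, 0.0), (0.0, 1.0)))  # 0.0 (orthogonal vectors)
```

Euclidean and Manhattan distance are dissimilarities (smaller means closer), while cosine similarity grows with resemblance, so the two kinds of measure should not be mixed without conversion.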
6 CLUSTER ANALYSIS

on the discussion of feature extraction in Chapter 9 in the context of data visualization and dimensionality reduction. Feature selection is more often used in the context of supervised classification with class labels available (Jain et al., 2000; Sklansky and Siedlecki, 1993). Jain et al. (2000), Liu and Yu (2005), and Theodoridis and Koutroumbas (2006) all provided good reviews of the feature selection techniques for supervised learning. A method of simultaneous feature selection and clustering, under the framework of finite mixture models, was proposed in Law et al. (2004). Kim et al. (2000) employed the genetic algorithm for feature selection in a K-means algorithm. Mitra et al. (2002) introduced a maximum information compression index to measure feature similarity and examine feature redundancy. More discussions on feature selection in clustering were given in Dy and Brodley (2000), Roth and Lange (2004), and Talavera (2000).

2. Clustering algorithm design or selection. This step usually consists of determining an appropriate proximity measure and constructing a criterion function. Intuitively, data objects are grouped into different clusters according to whether they resemble one another or not. Almost all clustering algorithms are explicitly or implicitly connected to some particular definition of proximity measure. Some algorithms even work directly on the proximity matrix, as defined in Chapter 2. Once a proximity measure is determined, clustering could be construed as an optimization problem with a specific criterion function. Again, the obtained clusters are dependent on the selection of the criterion function. The subjectivity of cluster analysis is thus inescapable.
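To make the two ingredients of this step concrete, the sketch below builds a proximity matrix from pairwise Euclidean distances and evaluates one possible criterion function, the sum of squared errors (SSE) of each point to its cluster centroid. Both choices are assumptions made for illustration; the text does not prescribe a particular measure or criterion, and the function names are hypothetical.

```python
import math


def proximity_matrix(points):
    """Symmetric matrix of pairwise Euclidean distances."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]


def sse(points, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        centroid = [sum(m[d] for m in members) / len(members) for d in range(dim)]
        total += sum((m[d] - centroid[d]) ** 2 for m in members for d in range(dim))
    return total


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
print(proximity_matrix(points)[0])      # distances from the first point
print(sse(points, [0, 0, 1, 1]))        # 4.0: nearby points grouped together
print(sse(points, [0, 1, 0, 1]))        # 100.0: distant points grouped together
```

Under this criterion the first partition is preferred because it has the lower SSE. A different criterion function could rank the same partitions differently, which is the subjectivity the paragraph describes.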

[Figure 1.2: flow chart with the labels Data Samples, Feature Selection or Extraction, Clustering Algorithm Design or Selection, Clusters, Cluster Validation, Result Interpretation, and Knowledge.]

Fig. 1.2. Clustering procedure. The basic process of cluster analysis consists of four steps with a feedback pathway. These steps are closely related to each other and determine the derived clusters.

DEFINITION OF CLUSTERS 7

Clustering is ubiquitous, and a wealth of clustering algorithms has been developed to solve different problems from a wide variety of fields. However, there is no universal clustering algorithm to solve all problems. “It has been very difficult to develop a unified framework for reasoning about it (clustering) at a technical level, and profoundly diverse approaches to clustering” (Kleinberg, 2002). Therefore, it is important to carefully investigate the characteristics of a problem in order to select or design an appropriate clustering strategy. Clustering algorithms that are developed to solve a particular problem in a specialized field usually make assumptions in favor of the application of interest. For example, the K-means algorithm is based on the Euclidean measure and hence tends to generate hyperspherical clusters. However, if the real clusters are in other geometric forms, K-means may no longer be effective, and we need to resort to other schemes. Similar considerations must be kept in mind for mixture-model clustering, in which data are assumed to come from some specific models that are already known in advance.
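The remark about K-means and non-spherical clusters can be checked on a toy data set. The sketch below does not run K-means; it only evaluates the sum-of-squared-error criterion that K-means minimizes, for two candidate partitions of two long, thin, parallel strips of points. The data and the helper name `sse` are made up for this illustration.

```python
def sse(points, labels):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        centroid = [sum(m[d] for m in members) / len(members) for d in range(dim)]
        total += sum((m[d] - centroid[d]) ** 2 for m in members for d in range(dim))
    return total


# Two elongated "true" clusters: horizontal strips at y = 0 and y = 1.
strip_a = [(float(x), 0.0) for x in range(11)]
strip_b = [(float(x), 1.0) for x in range(11)]
points = strip_a + strip_b

by_strip = [0] * 11 + [1] * 11                      # the intended grouping
by_half = [0 if x < 5 else 1 for x, _ in points]    # cut across both strips

print(sse(points, by_strip))  # 220.0
print(sse(points, by_half))   # 60.5
```

The partition that cuts across both strips has a much lower SSE than the intended grouping, so an algorithm driven by this criterion would favor compact, roughly round groups over the elongated ones.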

3. Cluster validation. Given a data set, each clustering algorithm can always produce a partition whether or not there really exists a particular structure in the data. Moreover, different clustering approaches usually lead to different clusters of data, and even for the same algorithm, the selection of a parameter or the presentation order of input patterns may affect the final results. Therefore, effective evaluation standards and criteria are critically important to provide users with a degree of confidence for the clustering results. These assessments should be objective and show no preference for any algorithm. Also, they should be able to provide meaningful insights in answering questions like how many clusters are hidden in the data, whether the clusters obtained are meaningful from a practical point of view or just artifacts of the algorithms, or why we choose one algorithm instead of another. Generally, there are three categories of testing criteria: external indices, internal indices, and relative indices. They are defined on three types of clustering structures, known as partitional clustering, hierarchical clustering, and individual clusters (Gordon, 1998; Halkidi et al., 2002; Jain and Dubes, 1988). Tests for situations in which no clustering structure exists in the data are also considered (Gordon, 1998) but seldom used because users are usually confident of the presence of clusters in the data of interest. External indices are based on some prespecified structure, which is the reflection of prior information on the data and is used as a standard to validate the clustering solutions. Internal tests are not dependent on external information (prior knowledge). Instead, they examine the clustering structure directly from the original data. Relative criteria emphasize the comparison of different clustering structures in order to provide a reference to decide which one may best reveal the characteristics of the objects. Cluster validation will be discussed in Chapter 10, with a focus on the methods for estimating the number of clusters.
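As one example of an external index, the sketch below computes the Rand index, which compares a clustering against a prespecified reference labeling by counting the pairs of objects on which the two agree (both place the pair together, or both place it apart). The Rand index is chosen here for illustration; the passage above does not name a specific index, and the function name is hypothetical.

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Fraction of object pairs on which two labelings agree (needs >= 2 objects)."""
    pairs = list(combinations(range(len(labels_true)), 2))
    agreements = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


reference = [0, 0, 1, 1]
print(rand_index(reference, [1, 1, 0, 0]))  # 1.0: same partition, labels renamed
print(rand_index(reference, [0, 0, 1, 2]))  # 5 of 6 pairs agree
print(rand_index(reference, [0, 1, 0, 1]))  # 2 of 6 pairs agree
```

Because the index looks only at whether pairs are grouped together, it is unaffected by how the clusters happen to be numbered.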


4. Result interpretation. The ultimate goal of clustering is to provide users with meaningful insights from the original data so that they can develop a clear understanding of the data and therefore effectively solve the problems encountered. Anderberg (1973) saw cluster analysis as “a device for suggesting hypotheses.” He also suggested that “a set of clusters is not itself a finished result but only a possible outline.” Experts in the relevant fields are encouraged to interpret the data partition, integrating other experimental evidence and domain information without restricting their observations and analyses to any specific clustering result. Consequently, further analyses and experiments may be required.

It is interesting to observe that the flow chart in Fig. 1.2 also includes a feedback pathway. Cluster analysis is not a one-shot process. In many circumstances, clustering requires a series of trials and repetitions. Moreover, there are no universally effective criteria to guide the selection of features and clustering schemes. Validation criteria provide some insights into the quality of clustering solutions, but even choosing an appropriate criterion is a demanding problem.
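One simple form of such repetition is to rerun an algorithm from several starting points and keep the result that scores best under a chosen criterion. The sketch below does this for a basic one-dimensional K-means (Lloyd's iterations) with the sum of squared errors as the score. The algorithm, the score, the function names, and the toy data are all assumptions made for illustration, not a procedure taken from the text.

```python
import random


def kmeans_1d(values, k, rng, max_iter=100):
    """One K-means run on 1-D data from random initial centers; returns (sse, labels)."""
    centers = rng.sample(values, k)  # initial centers drawn from the data
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: (v - centers[c]) ** 2) for v in values]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = sum(members) / len(members)
    sse = sum((v - centers[lab]) ** 2 for v, lab in zip(values, labels))
    return sse, labels


def best_of_restarts(values, k, n_restarts=10, seed=0):
    """Repeat the run with different initial centers and keep the lowest-SSE result."""
    rng = random.Random(seed)
    runs = [kmeans_1d(values, k, rng) for _ in range(n_restarts)]
    return min(runs, key=lambda run: run[0])


data = [0.0, 1.0, 2.0, 100.0, 101.0, 102.0]
best_sse, best_labels = best_of_restarts(data, k=2)
print(best_sse, best_labels)
```

On data this well separated every start reaches the same partition; on harder data the runs can differ, and the choice of score used to pick among them is itself a judgment call, as the paragraph above notes.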

1.3. CLUSTERING APPLICATIONS

Clustering has been applied in a wide variety of fields, as illustrated below with a number of typical applications (Anderberg, 1973; Everitt et al., 2001; Hartigan, 1975).

1. Engineering (computational intelligence, machine learning, pattern recognition, mechanical engineering, electrical engineering). Typical applications of clustering in engineering range from biometric recognition and speech recognition, to radar signal analysis, information compression, and noise removal.

2. Computer sciences. We have seen more and more applications of clustering in web mining, spatial database analysis, information retrieval, textual document collection, and image segmentation.

3. Life and medical sciences (genetics, biology, microbiology, paleontology, psychiatry, clinic, phylogeny, pathology). These areas accounted for the major applications of clustering in its early stage and will continue to be one of the main playing fields for clustering algorithms. Important applications include taxonomy definition, gene and protein function identification, disease diagnosis and treatment, and so on.

4. Astronomy and earth sciences (geography, geology, remote sensing). Clustering can be used to classify stars and planets, investigate land formations, partition regions and cities, and study river and mountain systems.
