Data Clustering: Theory, Algorithms, and Practical Applications

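Data Clustering: Theory, Algorithms, and Applications is a classic reference on data classification. It examines the basic concepts of cluster analysis, the commonly used methods, and practical applications in a range of fields. The book is co-authored by Gan, Ma, and Wu and is published jointly by the American Statistical Association and the Society for Industrial and Applied Mathematics, with the aim of providing affordable, high-quality publications at the intersection of statistics and applied probability.

In data mining and machine learning, cluster analysis is an important unsupervised learning method: it looks for intrinsic structure and similarity in a data set and automatically divides the data into groups, or clusters. The book sets out the core theory of the subject, including:

1. **Clustering basics**: the book may cover the goals of clustering, its types (such as hierarchical, partitional, and density-based clustering), and indices for judging clustering quality (such as the silhouette coefficient and the Calinski-Harabasz index).
2. **Clustering algorithms**: several classic algorithms are introduced, such as K-Means, DBSCAN (density-based clustering), spectral clustering, and hierarchical clustering (both agglomerative and divisive), with their working principles, strengths, weaknesses, and suitable use cases.
3. **Application cases**: the applied part of the book may touch on market segmentation, bioinformatics, image segmentation, social network analysis, and other fields, showing how clustering is used to solve real problems.
4. **Distance measures and similarity**: clustering usually depends on a suitable distance or similarity measure, such as Euclidean distance, Manhattan distance, or cosine similarity. The book may discuss how to choose among these measures and how the choice affects the result.
5. **Data preprocessing**: before clustering, data may need cleaning, standardization, dimension reduction, and similar steps, which are also a significant part of the book.
6. **Clustering optimization and performance evaluation**: the book may discuss how to improve the performance of clustering algorithms and how to evaluate clustering results with cross-validation and other methods.
7. **Recent developments**: beyond basic theory and classic methods, the book may also cover recent research and technical advances in cluster analysis.

The book suits newcomers to data analysis as an introductory text, and it also gives experienced data scientists a resource for understanding cluster analysis in depth, helping readers balance theory with practice.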

List of Tables

7.6 The dissimilarity matrix of the data set given in Figure 7.9.
11.1 Description of the chameleon algorithm, where n is the number of data in the database and m is the number of initial subclusters.
11.2 The properties of the ROCK algorithm, where n is the number of data points in the data set, m_m is the maximum number of neighbors for a point, and m_a is the average number of neighbors.
14.1 Description of Gaussian mixture models in the general family.
14.2 Description of Gaussian mixture models in the diagonal family. B is a diagonal matrix.
14.3 Description of Gaussian mixture models in the diagonal family. I is an identity matrix.
14.4 Four parameterizations of the covariance matrix in the Gaussian model and their corresponding criteria to be minimized.
15.1 List of some subspace clustering algorithms.
15.2 Description of the MAFIA algorithm.
17.1 Some indices that measure the degree of similarity between C and P based on the external criteria.
19.1 Some MATLAB commands related to reading and writing files.
19.2 Permission codes for opening a file in MATLAB.
19.3 Some values of precision for the fwrite function in MATLAB.
19.4 MEX-file extensions for various platforms.
19.5 Some MATLAB clustering functions.
19.6 Options of the function pdist.
19.7 Options of the function linkage.
19.8 Values of the parameter distance in the function kmeans.
19.9 Values of the parameter start in the function kmeans.
19.10 Values of the parameter emptyaction in the function kmeans.
19.11 Values of the parameter display in the function kmeans.
20.1 Some members of the vector class.
20.2 Some members of the list class.

List of Algorithms

Algorithm 5.1 Nonmetric MDS
Algorithm 5.2 The pseudocode of the SOM algorithm
Algorithm 7.1 The SLINK algorithm
Algorithm 7.2 The pseudocode of the CLINK algorithm
Algorithm 8.1 The fuzzy k-means algorithm
Algorithm 8.2 Fuzzy k-modes algorithm
Algorithm 9.1 The conventional k-means algorithm
Algorithm 9.2 The k-means algorithm treated as an optimization problem
Algorithm 9.3 The compare-means algorithm
Algorithm 9.4 An iteration of the sort-means algorithm
Algorithm 9.5 The k-modes algorithm
Algorithm 9.6 The k-probabilities algorithm
Algorithm 9.7 The k-prototypes algorithm
Algorithm 10.1 The VNS heuristic
Algorithm 10.2 Al-Sultan’s tabu search–based clustering algorithm
Algorithm 10.3 The J-means algorithm
Algorithm 10.4 Mutation (s)
Algorithm 10.5 The pseudocode of GKA
Algorithm 10.6 Mutation (s) in GKMODE
Algorithm 10.7 The SARS algorithm
Algorithm 11.1 The procedure of the chameleon algorithm
Algorithm 11.2 The CACTUS algorithm
Algorithm 11.3 The dynamic system–based clustering algorithm
Algorithm 11.4 The ROCK algorithm
Algorithm 12.1 The STING algorithm
Algorithm 12.2 The OptiGrid algorithm
Algorithm 12.3 The GRIDCLUS algorithm
Algorithm 12.4 Procedure NEIGHBOR_SEARCH(B,C)
Algorithm 12.5 The GDILC algorithm
Algorithm 13.1 The BRIDGE algorithm
Algorithm 14.1 Model-based clustering procedure
Algorithm 14.2 The COOLCAT clustering algorithm
Algorithm 14.3 The STUCCO clustering algorithm procedure
Algorithm 15.1 The PROCLUS algorithm


Preface

Cluster analysis is an unsupervised process that divides a set of objects into homogeneous groups. There have been many clustering algorithms scattered in publications in very diversified areas such as pattern recognition, artificial intelligence, information technology, image processing, biology, psychology, and marketing. As such, readers and users often find it very difficult to identify an appropriate algorithm for their applications and/or to compare novel ideas with existing results.

In this monograph, we shall focus on a small number of popular clustering algorithms and group them according to some specific baseline methodologies, such as hierarchical, center-based, and search-based methods. We shall, of course, start with the common ground and knowledge for cluster analysis, including the classification of data and the corresponding similarity measures, and we shall also provide examples of clustering applications to illustrate the advantages and shortcomings of different clustering architectures and algorithms.

This monograph is intended not only for statistics, applied mathematics, and computer science senior undergraduates and graduates, but also for research scientists who need cluster analysis to deal with data. It may be used as a textbook for introductory courses in cluster analysis or as source material for an introductory course in data mining at the graduate level. We assume that the reader is familiar with elementary linear algebra, calculus, and basic statistical concepts and methods.

The book is divided into four parts: basic concepts (clustering, data, and similarity measures), algorithms, applications, and programming languages. We now briefly describe the content of each chapter.

Chapter 1. Data clustering. In this chapter, we introduce the basic concepts of clustering. Cluster analysis is defined as a way to create groups of objects, or clusters, in such a way that objects in one cluster are very similar and objects in different clusters are quite distinct. Some working definitions of clusters are discussed, and several popular books relevant to cluster analysis are introduced.

Chapter 2. Data types. The type of data is directly associated with data clustering, and it is a major factor to consider in choosing an appropriate clustering algorithm. Five data types are discussed in this chapter: categorical, binary, transaction, symbolic, and time series. They share a common feature that nonnumerical similarity measures must be used. There are many other data types, such as image data, that are not discussed here, though we believe that once readers get familiar with these basic types of data, they should be able to adjust the algorithms accordingly.


Chapter 3. Scale conversion. Scale conversion is concerned with the transformation between different types of variables. For example, one may convert a continuous measured variable to an interval variable. In this chapter, we first review several scale conversion techniques and then discuss several approaches for categorizing numerical data.
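
As a rough illustration of categorizing numerical data, the sketch below uses equal-width binning, one common approach. The function name `equal_width_bins` and the sample ages are hypothetical, and the preface does not say which categorization methods the chapter actually presents.

```python
def equal_width_bins(values, k):
    """Assign each numeric value to one of k equal-width intervals (0..k-1)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    lo, hi = min(values), max(values)
    if lo == hi:
        # All values are identical, so everything falls into the first bin.
        return [0 for _ in values]
    width = (hi - lo) / k
    # The maximum value would land in bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


# Hypothetical sample data: ages converted to four ordered categories.
ages = [18, 22, 25, 31, 40, 58]
age_bins = equal_width_bins(ages, 4)
print(age_bins)
```

Equal-width bins are simple but sensitive to outliers, since a single extreme value stretches every interval.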

Chapter 4. Data standardization and transformation. In many situations, raw data should be normalized and/or transformed before a cluster analysis. One reason to do this is that objects in raw data may be described by variables measured with different scales; another reason is to reduce the size of the data to improve the effectiveness of clustering algorithms. Therefore, we present several data standardization and transformation techniques in this chapter.
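
To make the point about different scales concrete, here is a minimal sketch of z-score standardization, a widely used technique. The function name `z_score_columns`, the sample data, and the choice of the population standard deviation are all assumptions made for illustration, not details taken from the book.

```python
from statistics import mean, pstdev


def z_score_columns(rows):
    """Rescale every column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return [
        # A constant column has zero spread; map it to 0.0 instead of dividing by zero.
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Hypothetical records: (height in cm, income in dollars), on very different scales.
raw = [[170.0, 60000.0], [180.0, 80000.0], [160.0, 40000.0]]
scaled = z_score_columns(raw)
for row in scaled:
    print(row)
```

Without this step, a Euclidean distance on the raw records would be dominated almost entirely by the income column.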

Chapter 5. Data visualization. Data visualization is vital in the final step of data-mining applications. This chapter introduces various techniques of visualization with an emphasis on visualization of clustered data. Some dimension reduction techniques, such as multidimensional scaling (MDS) and self-organizing maps (SOMs), are discussed.

Chapter 6. Similarity and dissimilarity measures. In the literature of data clustering, a similarity measure or distance (dissimilarity measure) is used to quantitatively describe the similarity or dissimilarity of two data points or two clusters. Similarity and distance measures are basic elements of a clustering algorithm, without which no meaningful cluster analysis is possible. Due to the important role of similarity and distance measures in cluster analysis, we present a comprehensive discussion of different measures for various types of data in this chapter. We also introduce measures between points and measures between clusters.
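
The distinction between point-level and cluster-level measures can be sketched with a few standard definitions. The function names and sample values below are hypothetical, and these are only a handful of well-known measures, not a summary of the chapter's catalogue.

```python
import math


def euclidean(p, q):
    """Straight-line distance between two numeric points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def simple_matching_dissimilarity(p, q):
    """Fraction of attributes on which two categorical records disagree."""
    return sum(a != b for a, b in zip(p, q)) / len(p)


def nearest_neighbor_distance(c1, c2, dist=euclidean):
    """Cluster-to-cluster distance: the closest pair across the two clusters."""
    return min(dist(p, q) for p in c1 for q in c2)


def farthest_neighbor_distance(c1, c2, dist=euclidean):
    """Cluster-to-cluster distance: the most distant pair across the two clusters."""
    return max(dist(p, q) for p in c1 for q in c2)


cluster_a = [(0, 0), (1, 0)]
cluster_b = [(4, 0), (6, 0)]
print(euclidean((0, 0), (3, 4)), manhattan((0, 0), (3, 4)))
print(simple_matching_dissimilarity(("red", "small", "round"), ("red", "large", "round")))
print(nearest_neighbor_distance(cluster_a, cluster_b))
print(farthest_neighbor_distance(cluster_a, cluster_b))
```

The cluster-level functions take the point-level measure as a parameter, which mirrors how a cluster distance is usually built on top of a chosen point distance.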

Chapter 7. Hierarchical clustering techniques. Hierarchical clustering algorithms and partitioning algorithms are two major clustering algorithms. Unlike partitioning algorithms, which divide a data set into a single partition, hierarchical algorithms divide a data set into a sequence of nested partitions. There are two major hierarchical algorithms: agglomerative algorithms and divisive algorithms. Agglomerative algorithms start with every single object in a single cluster, while divisive ones start with all objects in one cluster and repeat splitting large clusters into small pieces. In this chapter, we present representations of hierarchical clustering and several popular hierarchical clustering algorithms.
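
The idea of a sequence of nested partitions can be illustrated with a naive agglomerative procedure that uses the nearest-neighbor (single-link) distance between clusters. This is a simple quadratic-search sketch written for readability, not the SLINK algorithm listed in the book, and the function name and sample points are hypothetical.

```python
import math


def single_link_agglomerative(points):
    """Return every partition produced while merging n singletons into one cluster.

    Clusters are lists of point indices. At each step the two clusters with the
    smallest nearest-neighbor distance are merged.
    """
    clusters = [[i] for i in range(len(points))]
    history = [[sorted(c) for c in clusters]]
    while len(clusters) > 1:
        best = None  # (distance, index of first cluster, index of second cluster)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(
                    math.dist(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
        history.append([sorted(c) for c in clusters])
    return history


# Hypothetical 2-D data: two tight pairs and one distant point.
points = [(0, 0), (0, 1), (5, 0), (5, 2), (20, 0)]
history = single_link_agglomerative(points)
for partition in history:
    print(partition)
```

Each printed partition is obtained from the previous one by merging exactly two clusters, so the partitions are nested. A divisive method would produce a comparable sequence in the opposite direction.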

Chapter 8. Fuzzy clustering algorithms. Clustering algorithms can be classified into two categories: hard clustering algorithms and fuzzy clustering algorithms. Unlike hard clustering algorithms, which require that each data point of the data set belong to one and only one cluster, fuzzy clustering algorithms allow a data point to belong to two or more clusters with different probabilities. There is also a huge number of published works related to fuzzy clustering. In this chapter, we review some basic concepts of fuzzy logic and present three well-known fuzzy clustering algorithms: fuzzy k-means, fuzzy k-modes, and c-means.
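
To show what belonging to several clusters looks like numerically, the sketch below computes the membership degrees of one point for fixed cluster centers, using the update rule commonly given for fuzzy c-means. The function name, the `fuzzifier` parameter (often written m, assumed here to be 2), and the sample data are assumptions, and the book's own notation may differ.

```python
import math


def fuzzy_memberships(point, centers, fuzzifier=2.0):
    """Membership degree of one point in each cluster; the degrees sum to 1."""
    dists = [math.dist(point, c) for c in centers]
    # A point that coincides with a center belongs fully to that center.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (fuzzifier - 1.0)
    # u_i = 1 / sum_k (d_i / d_k) ** (2 / (m - 1))
    return [1.0 / sum((d_i / d_k) ** power for d_k in dists) for d_i in dists]


centers = [(0.0, 0.0), (4.0, 0.0)]
memberships = fuzzy_memberships((1.0, 0.0), centers)
print(memberships)
```

A hard clustering algorithm would assign this point entirely to the first center, whereas the fuzzy memberships keep a small weight on the second one.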

Chapter 9. Center-based clustering algorithms. Compared to other types of clustering algorithms, center-based clustering algorithms are more suitable for clustering large data sets and high-dimensional data sets. Several well-known center-based clustering algorithms (e.g., k-means, k-modes) are presented and discussed in this chapter.
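
As an example of a center-based method, here is a minimal sketch of the textbook k-means iteration: assign each point to its nearest center, then recompute each center as the mean of its points. The function name, the fixed initial centers, and the sample data are hypothetical, and this is not claimed to match any of the book's numbered algorithms line for line.

```python
import math


def k_means(points, initial_centers, max_iter=100):
    """Alternate assignment and center updates until the labels stop changing."""
    centers = [tuple(c) for c in initial_centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for every point.
        new_labels = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # No point changed cluster, so the procedure has converged.
        labels = new_labels
        # Update step: move each center to the mean of its assigned points.
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # An empty cluster keeps its previous center.
                centers[j] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return labels, centers


# Hypothetical 2-D data with two well-separated groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centers = k_means(points, initial_centers=[points[0], points[3]])
print(labels)
print(centers)
```

Each pass touches every point once per center, which is part of why center-based methods scale to large data sets. The result depends on the initial centers, and the procedure can stop at a partition that is only locally optimal.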

Chapter 10. Search-based clustering algorithms. A well-known problem associated with most of the clustering algorithms is that they may not be able to find the globally optimal clustering that fits the data set, since these algorithms will stop if they find a local optimal partition of the data set. This problem led to the invention of search-based clus-
