Understanding Spectral Clustering: A Modern Clustering Algorithm

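"Spectral Clustering Tutorial"

In recent years, spectral clustering has become one of the most popular modern clustering algorithms. It is simple to implement, can be solved efficiently by standard linear algebra software, and often outperforms traditional clustering algorithms such as k-means. On first contact spectral clustering can seem somewhat mysterious, and it is not immediately clear why it works. The goal of this tutorial is to give some intuition for it. We describe different graph Laplacians and their basic properties, present the most common spectral clustering algorithms, and derive those algorithms from scratch by several different approaches. We also discuss the advantages and disadvantages of the different spectral clustering algorithms.

The core idea of spectral clustering is to view the data set as a graph: each data point is a node, and the edges between nodes carry their similarity or connection strength. The graph Laplacian plays the key role here, since it encodes how the nodes are connected to one another. The common graph Laplacians are the unnormalized graph Laplacian and the normalized graph Laplacians. The unnormalized graph Laplacian is the difference between the degree matrix and the weighted adjacency matrix of the graph; the normalized graph Laplacians additionally rescale by the node degrees, so that nodes of different degree are treated comparably. Each of these matrices defines an eigenvalue problem whose eigenvectors can be used for clustering.

The basic recipe is to compute the eigenvectors belonging to the k smallest eigenvalues of the graph Laplacian, use them as new coordinates for the data points, and then apply k-means or another clustering method in the new coordinates. This captures the global structure of the data and is particularly effective for clusters with non-convex shapes.

The tutorial explains in detail how to construct the graph, how to compute the graph Laplacian, and how to extract from these matrices the information needed for clustering. It also compares clustering based on Laplacian eigenvectors with distance-based methods such as k-means, pointing out how they differ in handling noise, outliers, and unevenly distributed data.

Keywords: spectral clustering; graph Laplacian

The tutorial presents the theoretical foundations of spectral clustering in an accessible way and illustrates them with examples, so that both newcomers and experienced practitioners can understand and apply the method.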

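4. 0 is an eigenvalue of $L_{\mathrm{rw}}$ with the constant one vector $\mathbb{1}$ as eigenvector. 0 is an eigenvalue of $L_{\mathrm{sym}}$ with eigenvector $D^{1/2}\mathbb{1}$.

5. $L_{\mathrm{sym}}$ and $L_{\mathrm{rw}}$ are positive semi-definite and have $n$ non-negative real-valued eigenvalues $0 = \lambda_1 \le \dots \le \lambda_n$.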

Proof. Part (1) can be proved similarly to Part (1) of Proposition 1.

Part (2) can be seen immediately by multiplying the eigenvalue equation $L_{\mathrm{sym}} w = \lambda w$ with $D^{-1/2}$ from the left and substituting $u = D^{-1/2} w$.

Part (3) follows directly by multiplying the eigenvalue equation $L_{\mathrm{rw}} u = \lambda u$ with $D$ from the left.

Part (4): The first statement is obvious as $L_{\mathrm{rw}} \mathbb{1} = 0$; the second statement follows from (2).

Part (5): The statement about $L_{\mathrm{sym}}$ follows from (1), and then the statement about $L_{\mathrm{rw}}$ follows from (2). $\square$
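Parts (2) to (5) can be checked numerically on a small graph. The following is a minimal sketch, assuming the standard definitions $L = D - W$, $L_{\mathrm{sym}} = D^{-1/2} L D^{-1/2}$ and $L_{\mathrm{rw}} = D^{-1} L$; the weight matrix `W` is made up for illustration and all variable names are hypothetical.

```python
import numpy as np

# Small symmetric weight matrix of a connected graph (values are made up).
W = np.array([[0.0, 1.0, 0.5, 0.0],
              [1.0, 0.0, 2.0, 0.0],
              [0.5, 2.0, 0.0, 1.5],
              [0.0, 0.0, 1.5, 0.0]])
d = W.sum(axis=1)                       # degrees d_i (all positive here)
D = np.diag(d)
L = D - W                               # unnormalized Laplacian
L_sym = np.diag(d ** -0.5) @ L @ np.diag(d ** -0.5)
L_rw = np.diag(1.0 / d) @ L

ones = np.ones(len(d))
# Part (4): L_rw 1 = 0 and L_sym D^{1/2} 1 = 0.
part4_rw = np.allclose(L_rw @ ones, 0.0)
part4_sym = np.allclose(L_sym @ (np.sqrt(d) * ones), 0.0)

# Part (2): if L_sym w = lam w, then u = D^{-1/2} w satisfies L_rw u = lam u.
lam, V = np.linalg.eigh(L_sym)          # eigenvalues in ascending order
U = V / np.sqrt(d)[:, None]             # each column becomes u = D^{-1/2} w
part2 = np.allclose(L_rw @ U, U * lam)

# Part (3): the same pairs (lam, u) solve L u = lam D u.
part3 = np.allclose(L @ U, (D @ U) * lam)

# Part (5): eigenvalues are non-negative and the smallest one is 0.
part5 = bool(lam.min() > -1e-10 and abs(lam[0]) < 1e-10)
```

Such a check is no substitute for the proof, but it is a quick way to catch sign or normalization mistakes when implementing the Laplacians.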

As is the case for the unnormalized graph Laplacian, the multiplicity of the eigenvalue 0 of the normalized graph Laplacian is related to the number of connected components:

Proposition 4 (Number of connected components and spectra of $L_{\mathrm{sym}}$ and $L_{\mathrm{rw}}$) Let $G$ be an undirected graph with non-negative weights. Then the multiplicity $k$ of the eigenvalue 0 of both $L_{\mathrm{rw}}$ and $L_{\mathrm{sym}}$ equals the number of connected components $A_1, \dots, A_k$ in the graph. For $L_{\mathrm{rw}}$, the eigenspace of 0 is spanned by the indicator vectors $\mathbb{1}_{A_i}$ of those components. For $L_{\mathrm{sym}}$, the eigenspace of 0 is spanned by the vectors $D^{1/2}\mathbb{1}_{A_i}$.

Proof. The proof is analogous to the one of Proposition 2, using Proposition 3. $\square$
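Proposition 4 can be illustrated on a graph with two connected components. This is only a sketch under the same assumed definitions ($L_{\mathrm{rw}} = D^{-1} L$, $L_{\mathrm{sym}} = D^{-1/2} L D^{-1/2}$); the edge weights, the tolerance, and the helper `zero_multiplicity` are hypothetical choices made for the example.

```python
import numpy as np

# Graph with two connected components: nodes {0, 1, 2} and {3, 4}.
W = np.zeros((5, 5))
for i, j, w in [(0, 1, 1.0), (1, 2, 2.0), (3, 4, 0.5)]:
    W[i, j] = W[j, i] = w
d = W.sum(axis=1)
L = np.diag(d) - W
L_rw = L / d[:, None]                                   # D^{-1} L
L_sym = L / np.sqrt(d)[:, None] / np.sqrt(d)[None, :]   # D^{-1/2} L D^{-1/2}

def zero_multiplicity(M, tol=1e-8):
    """Count eigenvalues of M that are numerically zero."""
    return int(np.sum(np.abs(np.linalg.eigvals(M)) < tol))

k_rw = zero_multiplicity(L_rw)
k_sym = zero_multiplicity(L_sym)

# Indicator vectors of the components lie in the null space of L_rw;
# D^{1/2} times an indicator vector lies in the null space of L_sym.
ind = np.array([[1, 1, 1, 0, 0],
                [0, 0, 0, 1, 1]], dtype=float).T
null_rw = np.allclose(L_rw @ ind, 0.0)
null_sym = np.allclose(L_sym @ (np.sqrt(d)[:, None] * ind), 0.0)
```

Both counts equal 2 here, matching the two components of the example graph.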

4 Spectral Clustering Algorithms

Now we would like to state the most common spectral clustering algorithms. For references and the history of spectral clustering we refer to Section 9. We assume that our data consists of $n$ "points" $x_1, \dots, x_n$ which can be arbitrary objects. We measure their pairwise similarities $s_{ij} = s(x_i, x_j)$ by some similarity function which is symmetric and non-negative, and we denote the corresponding similarity matrix by $S = (s_{ij})_{i,j=1 \dots n}$.
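The text only requires the similarity function to be symmetric and non-negative. As one possible illustration, the sketch below builds $S$ from a Gaussian similarity $s_{ij} = \exp(-\|x_i - x_j\|^2 / (2\sigma^2))$ for points in Euclidean space. The choice of this kernel, the function name `gaussian_similarity`, and the parameter `sigma` are assumptions made for the example, not something this section prescribes.

```python
import numpy as np

def gaussian_similarity(X, sigma=1.0):
    """Return S with s_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_norms = np.sum(X ** 2, axis=1)
    # Squared distances via ||a||^2 + ||b||^2 - 2 a.b
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    sq_dists = np.maximum(sq_dists, 0.0)    # clip tiny negative round-off
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

# Three made-up points in the plane: two close together, one far away.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
S = gaussian_similarity(X)
```

Nearby points receive a similarity close to 1 and distant points a similarity close to 0, so `S` is symmetric and non-negative as required.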

Unnormalized spectral clustering

Input: Similarity matrix $S \in \mathbb{R}^{n \times n}$, number $k$ of clusters to construct.

• Construct a similarity graph by one of the ways described in Section 2. Let $W$ be its weighted adjacency matrix.
• Compute the unnormalized Laplacian $L$.
• Compute the first $k$ eigenvectors $u_1, \dots, u_k$ of $L$.
• Let $U \in \mathbb{R}^{n \times k}$ be the matrix containing the vectors $u_1, \dots, u_k$ as columns.
• For $i = 1, \dots, n$, let $y_i \in \mathbb{R}^k$ be the vector corresponding to the $i$-th row of $U$.
• Cluster the points $(y_i)_{i=1,\dots,n}$ in $\mathbb{R}^k$ with the k-means algorithm into clusters $C_1, \dots, C_k$.

Output: Clusters $A_1, \dots, A_k$ with $A_i = \{ j \mid y_j \in C_i \}$.
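The steps above can be sketched in a few lines of NumPy and SciPy. This is a minimal illustration, not a reference implementation: it assumes $W$ is already given, takes the "first $k$ eigenvectors" to be those of the $k$ smallest eigenvalues, and uses SciPy's `kmeans2` as the k-means step. The function name and the toy data are hypothetical.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def unnormalized_spectral_clustering(W, k):
    """Cluster graph nodes given a symmetric, non-negative weight matrix W."""
    d = W.sum(axis=1)                   # degrees
    L = np.diag(d) - W                  # unnormalized Laplacian L = D - W
    _, vecs = np.linalg.eigh(L)         # eigh sorts eigenvalues ascending
    U = vecs[:, :k]                     # first k eigenvectors as columns
    # Row i of U is the embedded point y_i; k-means groups these rows.
    _, labels = kmeans2(U, k, minit="++")
    return labels

# Toy data (made up): two groups of points on a line.
x = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
W = np.exp(-(x[:, None] - x[None, :]) ** 2 / 2.0)    # assumed Gaussian weights
np.fill_diagonal(W, 0.0)                             # no self-loops
labels = unnormalized_spectral_clustering(W, 2)
```

Because k-means is initialized randomly, the numeric label assigned to each group can differ between runs; only the grouping itself is meaningful.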

There are two different versions of normalized spectral clustering, depending on which of the normalized graph Laplacians is used. We name both algorithms after two popular papers; for more references and history please see Section 9.

Normalized spectral clustering according to Shi and Malik (2000)

Input: Similarity matrix $S \in \mathbb{R}^{n \times n}$, number $k$ of clusters to construct.

• Construct a similarity graph by one of the ways described in Section 2. Let $W$ be its weighted adjacency matrix.
• Compute the unnormalized Laplacian $L$.
• Compute the first $k$ generalized eigenvectors $u_1, \dots, u_k$ of the generalized eigenproblem $Lu = \lambda Du$.
• Let $U \in \mathbb{R}^{n \times k}$ be the matrix containing the vectors $u_1, \dots, u_k$ as columns.
• For $i = 1, \dots, n$, let $y_i \in \mathbb{R}^k$ be the vector corresponding to the $i$-th row of $U$.
• Cluster the points $(y_i)_{i=1,\dots,n}$ in $\mathbb{R}^k$ with the k-means algorithm into clusters $C_1, \dots, C_k$.

Output: Clusters $A_1, \dots, A_k$ with $A_i = \{ j \mid y_j \in C_i \}$.
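A possible sketch of this variant uses `scipy.linalg.eigh`, which solves the generalized symmetric eigenproblem $Lu = \lambda Du$ when the second matrix is passed as its `b` argument. The sketch assumes that every node has positive degree, so that $D$ is positive definite; the function name and the toy data are hypothetical.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.cluster.vq import kmeans2

def shi_malik_spectral_clustering(W, k):
    """Normalized spectral clustering via the problem L u = lambda D u."""
    d = W.sum(axis=1)
    D = np.diag(d)                      # must be positive definite (d_i > 0)
    L = D - W                           # unnormalized Laplacian
    _, vecs = eigh(L, D)                # generalized problem, ascending order
    U = vecs[:, :k]                     # first k generalized eigenvectors
    # Row i of U is the embedded point y_i; k-means groups these rows.
    _, labels = kmeans2(U, k, minit="++")
    return labels

# Toy data (made up): two groups of points on a line.
x = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
W = np.exp(-(x[:, None] - x[None, :]) ** 2 / 2.0)    # assumed Gaussian weights
np.fill_diagonal(W, 0.0)                             # no self-loops
labels = shi_malik_spectral_clustering(W, 2)
```

Solving the generalized problem directly avoids forming the non-symmetric matrix $D^{-1}L$, which lets a symmetric eigensolver be used.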

Note that this algorithm uses the generalized eigenvectors of $L$, which according to Proposition 3 correspond to the eigenvectors of the matrix $L_{\mathrm{rw}}$. So in fact, the algorithm works with eigenvectors of the normalized Laplacian $L_{\mathrm{rw}}$, and hence is called normalized spectral clustering. The next algorithm also uses a normalized Laplacian, but this time the matrix $L_{\mathrm{sym}}$ instead of $L_{\mathrm{rw}}$. As we will see, this algorithm needs to introduce an additional row normalization step which is not needed in the other algorithms. The reasons will become clear in Section 7.

Normalized spectral clustering according to Ng, Jordan, and Weiss (2002)

Input: Similarity matrix $S \in \mathbb{R}^{n \times n}$, number $k$ of clusters to construct.

• Construct a similarity graph by one of the ways described in Section 2. Let $W$ be its weighted adjacency matrix.
• Compute the normalized Laplacian $L_{\mathrm{sym}}$.
• Compute the first $k$ eigenvectors $u_1, \dots, u_k$ of $L_{\mathrm{sym}}$.
• Let $U \in \mathbb{R}^{n \times k}$ be the matrix containing the vectors $u_1, \dots, u_k$ as columns.
• Form the matrix $T \in \mathbb{R}^{n \times k}$ from $U$ by normalizing the rows to norm 1, that is set $t_{ij} = u_{ij} / (\sum_k u_{ik}^2)^{1/2}$.
• For $i = 1, \dots, n$, let $y_i \in \mathbb{R}^k$ be the vector corresponding to the $i$-th row of $T$.
• Cluster the points $(y_i)_{i=1,\dots,n}$ with the k-means algorithm into clusters $C_1, \dots, C_k$.

Output: Clusters $A_1, \dots, A_k$ with $A_i = \{ j \mid y_j \in C_i \}$.
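The distinguishing step of this variant is the row normalization of $U$. The sketch below illustrates it under the assumed definition $L_{\mathrm{sym}} = D^{-1/2} L D^{-1/2} = I - D^{-1/2} W D^{-1/2}$, and further assumes that all degrees are positive and that no row of $U$ is exactly zero. The function name and the toy data are hypothetical.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def njw_spectral_clustering(W, k):
    """Normalized spectral clustering using L_sym and row normalization."""
    d = W.sum(axis=1)                   # degrees, assumed positive
    inv_sqrt_d = 1.0 / np.sqrt(d)
    # L_sym = I - D^{-1/2} W D^{-1/2}
    L_sym = np.eye(len(d)) - inv_sqrt_d[:, None] * W * inv_sqrt_d[None, :]
    _, vecs = np.linalg.eigh(L_sym)     # eigenvalues ascending
    U = vecs[:, :k]                     # first k eigenvectors as columns
    # t_ij = u_ij / sqrt(sum_k u_ik^2); assumes no row of U is all zeros.
    T = U / np.linalg.norm(U, axis=1, keepdims=True)
    # Row i of T is the embedded point y_i; k-means groups these rows.
    _, labels = kmeans2(T, k, minit="++")
    return labels

# Toy data (made up): two groups of points on a line.
x = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
W = np.exp(-(x[:, None] - x[None, :]) ** 2 / 2.0)    # assumed Gaussian weights
np.fill_diagonal(W, 0.0)                             # no self-loops
labels = njw_spectral_clustering(W, 2)
```

After normalization every embedded point lies on the unit sphere in $\mathbb{R}^k$, so k-means compares directions rather than magnitudes, which would otherwise scale with $\sqrt{d_i}$.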

All three algorithms stated above look rather similar, apart from the fact that they use three different graph Laplacians. In all three algorithms, the main trick is to change the representation of the abstract data points $x_i$ to points $y_i \in \mathbb{R}^k$. It is due to the properties of the graph Laplacians that this change of representation is useful. We will see in the next sections that this change of representation enhances the cluster properties in the data, so that clusters can be trivially detected in the new representation. In particular, the simple k-means clustering algorithm has no difficulty detecting the clusters in this new representation. Readers not familiar with k-means can read up on this algorithm in numerous
