
Power Iteration Clustering
Frank Lin frank@cs.cmu.edu
William W. Cohen wcohen@cs.cmu.edu
Carnegie Mellon University, 5000 Forbes Ave., Pittsburgh, PA 15213 USA
Abstract
We present a simple and scalable graph clus-
tering method called power iteration cluster-
ing (PIC). PIC finds a very low-dimensional
embedding of a dataset using truncated
power iteration on a normalized pair-wise
similarity matrix of the data. This em-
bedding turns out to be an effective cluster
indicator, consistently outperforming widely
used spectral methods such as NCut on real
datasets. PIC is very fast on large datasets,
running over 1,000 times faster than an NCut
implementation based on the state-of-the-art
IRAM eigenvector computation technique.
1. Introduction
We present a simple and scalable clustering method
called power iteration clustering (hereafter PIC). In
essence, it finds a very low-dimensional data embed-
ding using truncated power iteration on a normalized
pair-wise similarity matrix of the data points, and this
embedding turns out to be an effective cluster indica-
tor.
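
As a preview, the core computation is small enough to sketch in a few lines of NumPy. This is an illustrative sketch under our own naming (pic_embedding, num_iters), with a fixed iteration count standing in for the convergence test developed in Section 2; it is not a verbatim transcription of the algorithm:

```python
import numpy as np

def pic_embedding(A, num_iters=50):
    # A: (n, n) symmetric, nonnegative pairwise similarity matrix.
    # Row-normalize to obtain W = D^{-1} A.
    W = A / A.sum(axis=1, keepdims=True)
    v = np.random.rand(A.shape[0])      # random starting vector
    for _ in range(num_iters):          # truncated power iteration
        v = W @ v
        v /= np.abs(v).sum()            # rescale to keep entries bounded
    return v                            # low-dimensional cluster indicator
```

The returned vector is the one-dimensional embedding; clusters are then read off by running k-means on its entries.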
In presenting PIC, we make connections to and make
comparisons with spectral clustering, a well-known, el-
egant and effective clustering method. PIC and spec-
tral clustering methods are similar in that both em-
bed data points in a low-dimensional subspace derived
from the similarity matrix, and this embedding pro-
vides clustering results directly or through a k-means
algorithm. They are different in what this embedding
is and how it is derived. In spectral clustering the
embedding is formed by the bottom eigenvectors of
the Laplacian of a similarity matrix. In PIC the em-
bedding is an approximation to an eigenvalue-weighted
linear combination of all the eigenvectors of a nor-
malized similarity matrix. This embedding turns out
to be very effective for clustering, and in comparison
to spectral clustering, the cost (in space and time) of
explicitly calculating eigenvectors is replaced by that
of a small number of matrix-vector multiplications.
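
To make "eigenvalue-weighted linear combination" concrete, recall the standard expansion behind power iteration. Assuming $W$ has a full set of eigenvectors $e_1, \ldots, e_n$ with eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n$, and writing the starting vector as $v^{(0)} = c_1 e_1 + \cdots + c_n e_n$, repeated multiplication gives

$$v^{(t)} = W^t v^{(0)} = c_1 \lambda_1^t e_1 + c_2 \lambda_2^t e_2 + \cdots + c_n \lambda_n^t e_n.$$

Stopping the iteration early therefore yields a single vector in which every eigenvector is present, each weighted by its starting coefficient and a power of its eigenvalue.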
We test PIC on a number of different types of datasets
and obtain comparable or better clusters than existing
spectral methods. However, the highlights of PIC are
its simplicity and scalability — we demonstrate that
a basic implementation of this method is able to par-
tition a network dataset of 100 million edges within
a few seconds on a single machine, without sampling,
grouping, or other preprocessing of the data.
This work is presented as follows: we begin by describ-
ing power iteration and how its convergence property
indicates cluster membership and how we can use it
to cluster data (Section 2). Then we show experimen-
tal results of PIC on a number of real and synthetic
datasets and compare them to those of spectral cluster-
ing, both in cluster quality (Section 3) and scalability
(Section 4). Next, we survey related work (Section 5),
differentiating PIC from clustering methods that em-
ploy matrix powering and from methods that modify
the “traditional” spectral clustering to improve ac-
curacy or scalability. Finally, we conclude with why
we believe this simple and scalable clustering method
is very practical — easily implemented, parallelized,
and well-suited to very large datasets.
2. Power Iteration Clustering
2.1. Notation and Background
Given a dataset $X = \{x_1, x_2, \ldots, x_n\}$, a similarity function $s(x_i, x_j)$ is a function where $s(x_i, x_j) = s(x_j, x_i)$ and $s \geq 0$ if $i \neq j$, and following previous work (Shi & Malik, 2000), $s = 0$ if $i = j$. An affinity matrix $A \in \mathbb{R}^{n \times n}$ is defined by $A_{ij} = s(x_i, x_j)$. The degree matrix $D$ associated with $A$ is a diagonal matrix with $d_{ii} = \sum_j A_{ij}$. A normalized affinity matrix $W$ is defined as $D^{-1}A$.
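As a concrete illustration of these definitions, the following sketch builds $A$, $D$, and $W$ from a data matrix; the Gaussian kernel is only one example choice of similarity function $s$, and the names here are ours:

```python
import numpy as np

def normalized_affinity(X, sigma=1.0):
    # Pairwise squared Euclidean distances between rows of X.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    A = np.exp(-sq / (2 * sigma ** 2))    # A_ij = s(x_i, x_j), symmetric
    np.fill_diagonal(A, 0.0)              # s = 0 if i = j
    D = np.diag(A.sum(axis=1))            # d_ii = sum_j A_ij
    W = A / A.sum(axis=1, keepdims=True)  # W = D^{-1} A (row-stochastic)
    return A, D, W
```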
Below we will view $W$ interchangeably as a matrix, and an undirected graph with nodes