Unified Spectral Clustering with Optimal Graph
Zhao Kang¹, Chong Peng², Qiang Cheng³, Zenglin Xu¹∗
¹School of Computer Science and Engineering, University of Electronic Science and Technology of China, China
²Department of Computer Science, Southern Illinois University, Carbondale, USA
³Institute of Biomedical Informatics and Department of Computer Science, University of Kentucky, Lexington, USA
zkang@uestc.edu.cn, pchong@siu.edu, qiang.cheng@uky.edu, zlxu@uestc.edu.cn
∗Corresponding author.
Abstract
Spectral clustering has found extensive use in many areas. Most traditional spectral clustering algorithms work in three separate steps: similarity graph construction; continuous label learning; and discretization of the learned labels by k-means clustering. This common practice has two potential flaws, which may lead to severe information loss and performance degradation. First, a predefined similarity graph might not be optimal for subsequent clustering, and it is well accepted that the similarity graph strongly affects the clustering results. To this end, we propose to automatically learn similarity information from data, and we simultaneously enforce the constraint that the similarity matrix has exactly c connected components if there are c clusters. Second, the discrete solution may deviate from the spectral solution, since the k-means method is well known to be sensitive to the initialization of cluster centers. In this work, we transform the candidate solution into a new one that better approximates the discrete one. Finally, these three subtasks are integrated into a unified framework, with each subtask iteratively boosted by the results of the others toward an overall optimal solution. It is known that the performance of a kernel method is largely determined by the choice of kernel. To tackle the practical problem of selecting the most suitable kernel for a particular data set, we further extend our model with a multiple kernel learning capability. Extensive experiments demonstrate the superiority of the proposed method over existing clustering approaches.
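For reference, the "exactly c connected components" requirement above is expressible as an optimization constraint thanks to a standard fact from spectral graph theory (well known, not a contribution of this paper): for a nonnegative symmetric similarity matrix W over n points, the multiplicity of the zero eigenvalue of the graph Laplacian equals the number of connected components of the graph, i.e.,

\[
L = D - W, \qquad D_{ii} = \sum\nolimits_j W_{ij}, \qquad \mathrm{rank}(L) = n - c \;\Longleftrightarrow\; \text{the graph of } W \text{ has exactly } c \text{ connected components}.
\]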
Introduction
Clustering is a fundamental technique in machine learning, pattern recognition, and data mining (Huang et al. 2017). Over the past decades, a variety of clustering algorithms have been developed, such as k-means clustering and spectral clustering.
With the benefits of simplicity and effectiveness, the k-means clustering algorithm is often adopted in various real-world problems. To deal with the nonlinear structure of many practical data sets, the kernel k-means (KKM) algorithm has been developed (Schölkopf, Smola, and Müller 1998), where data points are mapped through a nonlinear transformation into a higher-dimensional feature space in which they
are linearly separable. KKM usually achieves better performance than standard k-means. To cope with noise and outliers, the robust kernel k-means (RKKM) algorithm (Du et al. 2015) has been proposed, in which the squared $\ell_2$-norm of the error reconstruction term is replaced by the $\ell_{2,1}$-norm. RKKM demonstrates superior performance on a number of benchmark data sets. However, the performance of such model-based methods heavily depends on how well the data fit the model, and in most cases we do not know the distribution of the data in advance. To some extent, this problem is alleviated by multiple kernel learning.
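To make the kernel trick above concrete, the following is a minimal sketch of kernel k-means in Python (our illustration, assuming an RBF kernel and Lloyd-style updates; it is not the RKKM algorithm, which additionally replaces the squared error with the $\ell_{2,1}$-norm):

import numpy as np

def rbf_kernel(X, gamma=1.0):
    # K[i, j] = exp(-gamma * ||x_i - x_j||^2)
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-gamma * d2)

def kernel_kmeans(K, c, n_iter=100, seed=0):
    # Lloyd-style kernel k-means that works directly on the kernel matrix K,
    # so cluster centroids in feature space are never formed explicitly.
    n = K.shape[0]
    labels = np.random.default_rng(seed).integers(0, c, size=n)
    diag = np.diag(K)
    for _ in range(n_iter):
        dist = np.empty((n, c))
        for j in range(c):
            mask = labels == j
            nj = max(int(mask.sum()), 1)
            # ||phi(x_i) - mu_j||^2 expanded via the kernel trick:
            # K_ii - (2/n_j) sum_{l in C_j} K_il + (1/n_j^2) sum_{l,m in C_j} K_lm
            dist[:, j] = (diag
                          - 2.0 * K[:, mask].sum(axis=1) / nj
                          + K[np.ix_(mask, mask)].sum() / nj ** 2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

# Usage: labels = kernel_kmeans(rbf_kernel(X, gamma=0.5), c=3)

Note that only inner products in the feature space are ever needed, which is precisely what lets KKM capture nonlinear structure without forming the mapping explicitly.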
Spectral clustering is another widely used clustering method (Kumar, Rai, and Daume 2011). It enjoys the advantage of exploring the intrinsic data structure by exploiting different similarity graphs of the data points (Yang et al. 2015). There are three common strategies for constructing the similarity graph: the k-nearest-neighbor (knn) graph, the ε-neighborhood graph, and the fully connected graph. Here, several open issues arise (Huang, Nie, and Huang 2015): 1) how to choose a proper neighbor number k or radius ε; 2) how to select an appropriate similarity metric to measure the similarity among data points; 3) how to counteract the adverse effects of noise and outliers; 4) how to tackle data with structures at different scales of size and density. Unfortunately, all of these issues heavily influence the clustering results (Zelnik-Manor and Perona 2004). Nowadays, data are often high-dimensional, heterogeneous, and without prior knowledge, so it is a fundamental challenge to define a pairwise similarity graph for effective spectral clustering.
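A minimal sketch of the three constructions follows (our illustration; σ, k, and ε are the hand-picked hyperparameters that issues 1) and 2) above refer to, and the ε-neighborhood rule is expressed here as a similarity threshold rather than a distance radius):

import numpy as np

def gaussian_affinity(X, sigma=1.0):
    # Fully connected graph: W[i, j] = exp(-||x_i - x_j||^2 / (2 * sigma^2))
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def knn_graph(W, k=10):
    # knn graph: keep each point's k strongest similarities, then symmetrize
    # so the result is a valid undirected affinity matrix.
    n = W.shape[0]
    A = np.zeros_like(W)
    neighbors = np.argsort(-W, axis=1)[:, 1:k + 1]  # column 0 is the point itself
    rows = np.repeat(np.arange(n), neighbors.shape[1])
    A[rows, neighbors.ravel()] = W[rows, neighbors.ravel()]
    return np.maximum(A, A.T)

def eps_graph(W, eps=0.5):
    # ε-neighborhood graph: keep only edges whose similarity exceeds eps.
    A = np.where(W > eps, W, 0.0)
    np.fill_diagonal(A, 0.0)
    return A

# Usage: A = knn_graph(gaussian_affinity(X, sigma=1.0), k=10)

Each construction commits to fixed hyperparameters before clustering begins, which is exactly the limitation that motivates learning the similarity graph from data.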
Recently, Zhu, Change Loy, and Gong (2014) constructed robust affinity graphs for spectral clustering by identifying discriminative features. Their method adopts a random forest approach, motivated by the observation that tree leaf nodes contain discriminative data partitions, which can be exploited to capture subtle and weak data affinities. This approach shows better performance than other state-of-the-art methods, including Euclidean-distance-based knn (Wang et al. 2008), dominant neighbourhoods (Pavan and Pelillo 2007), consensus of knn (Premachandran and Kakarala 2013), and non-metric unsupervised manifold forests (Pei, Kim, and Zha 2013).
The second step of spectral clustering is to use the spectrum of the similarity graph to reveal the cluster structure of the data. Due to the discrete constraint on the cluster labels,