A Probabilistic Approach to Latent Cluster Analysis
Zhipeng Xie, Rui Dong, Zhengheng Deng, Zhenying He, Weidong Yang
School of Computer Science
Fudan University, Shanghai, China
{xiezp, 11210240011, 11210240082, zhenying, wdyang}@fudan.edu.cn

Abstract
Faced with a large number of clustering solutions,
cluster ensemble methods provide an effective
approach to aggregating them into a better one. In
this paper, we propose a novel cluster ensemble
method from a probabilistic perspective. It assumes
that each clustering solution is generated from a
latent cluster model under the control of two
probabilistic parameters. The cluster ensemble
problem is thus reformulated as a maximum
likelihood optimization problem, and an EM-style
algorithm is designed to solve it, with the number
of clusters determined automatically.
Experimental results show that the proposed
algorithm outperforms the state-of-the-art methods
including EAC-AL, CSPA, HGPA, and MCLA.
Furthermore, our algorithm is stable in the
predicted numbers of clusters.
1 Introduction
The goal of cluster analysis is to discover the underlying
structure of a dataset (Jain et al., 1999; Jain, 2010). It
normally partitions a set of objects so that the objects within
the same group are similar while those from different groups
are dissimilar. A large number of clustering algorithms have
been proposed, e.g., k-Means, Spectral Clustering,
Hierarchical Clustering, and Self-Organizing Maps,
yet no single one is able to achieve this
goal for all datasets. On the same data, different algorithms,
or even multiple runs of the same algorithm with different
parameters, often lead to clustering solutions that are distinct
from each other.
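As a toy illustration of this instability (our own sketch, not from this paper; the data set and all parameter values are arbitrary choices), the following Python snippet runs k-Means five times with different random seeds on the same synthetic data and measures the pairwise agreement between the resulting partitions; adjusted Rand scores below 1.0 indicate distinct solutions.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic data with overlapping clusters, so that k-Means has
# several competing local optima.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=2.0, random_state=0)

# One initialization per run (n_init=1), so different seeds may
# converge to different local optima.
labelings = [KMeans(n_clusters=4, n_init=1, random_state=s).fit_predict(X)
             for s in range(5)]

# Pairwise agreement between runs; a score of 1.0 means identical partitions.
for i in range(len(labelings)):
    for j in range(i + 1, len(labelings)):
        print(i, j, adjusted_rand_score(labelings[i], labelings[j]))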
Confronted with a large number of clustering solutions,
cluster ensemble (or clustering aggregation) methods have
emerged, which try to combine different clustering solutions
into a consensus one in order to improve upon the quality of
the component solutions (Vega-Pons and
Ruiz-Shulcloper, 2011). Cluster ensemble methods usually
consist of two or three phases: an ensemble generation phase
that produces a variety of clustering solutions; an optional
ensemble selection phase that selects a subset of these
solutions; and finally a consensus phase
that induces a unified partition by combining the component
ones. In the generation phase, different clustering solutions
can be generated by different clustering algorithms, by the same
algorithm with different parameter settings or initializations,
or by injecting random disturbances into the data set, such as data
resampling (Minaei-Bidgoli et al., 2004), random projection
(Fern and Brodley, 2003), and random feature selection
(Strehl and Ghosh, 2002). Following the generation phase, an
optional ensemble selection phase selects or prunes these
clustering solutions according to their quality and
diversity (Fern and Lin, 2008; Azimi and Fern, 2009).
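As a concrete sketch of the generation phase (our own illustration; the function generate_ensemble and its parameter values are assumptions, not part of any cited method), the snippet below combines two of the strategies above: it varies the number of clusters and the random seed while bootstrap-resampling the data.

import numpy as np
from sklearn.cluster import KMeans

def generate_ensemble(X, ks=(2, 3, 4, 5), runs_per_k=5, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    ensemble = []
    for k in ks:
        for _ in range(runs_per_k):
            # Bootstrap resampling: fit on a resampled copy of the data,
            # then assign every original object to its nearest centroid.
            idx = rng.integers(0, n, size=n)
            km = KMeans(n_clusters=k, n_init=1,
                        random_state=int(rng.integers(1 << 31))).fit(X[idx])
            ensemble.append(km.predict(X))
    return ensemble  # one label vector per clustering solution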
In this paper, we focus on the final phase, clustering
combination. Many combination algorithms exist, and they
can be categorized according to the kind of information they
exploit. The algorithm proposed here falls
into the category that makes use of the pairwise similarities
between objects, which form a co-association matrix in the
context of cluster ensembles. Any clustering algorithm can
be applied to this new similarity matrix to find a consensus
partition. Evidence Accumulation Clustering (Fred and
Jain, 2005), or EAC for short, performs hierarchical
clustering with average linkage (AL) or single linkage (SL) on
the co-association matrix, where a maximum lifetime criterion
is proposed to determine the number of clusters.
The Cluster-based Similarity Partitioning Algorithm (CSPA)
(Strehl and Ghosh, 2002) uses a graph-partitioning algorithm
instead, but requires the number of clusters to be specified
manually. Another algorithm, HGPA (Strehl and Ghosh, 2002),
can be thought of as an approximation to CSPA. Outside this
category, the MCLA algorithm clusters the clusters themselves
based on inter-cluster similarities, and then assigns each
object to its closest meta-cluster. For a thorough list of
related algorithms, please refer to the survey paper by
Vega-Pons and Ruiz-Shulcloper (2011).
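To make the co-association idea concrete, here is a minimal sketch (our own simplified rendering under stated assumptions, not the authors' code): each entry of the matrix counts the fraction of base clusterings that place a pair of objects in the same cluster, and average-linkage hierarchical clustering on the induced distances yields a consensus partition in the spirit of EAC-AL. For simplicity the number of clusters is passed in explicitly rather than chosen by the lifetime criterion.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def co_association(labelings):
    # C[i, j] = fraction of base clusterings that put objects i and j together.
    n = len(labelings[0])
    C = np.zeros((n, n))
    for labels in labelings:
        labels = np.asarray(labels)
        C += (labels[:, None] == labels[None, :]).astype(float)
    return C / len(labelings)

def consensus_al(labelings, n_clusters):
    C = co_association(labelings)
    D = 1.0 - C                    # turn similarity into distance
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method='average')
    return fcluster(Z, t=n_clusters, criterion='maxclust')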
Although these methods have achieved some success, they
are still deficient in several aspects: first, they lack a
theoretical underpinning; second, they assume that all
clustering solutions are of the same quality, and thus assign
the same weight to each clustering solution; last but not
least, most of them (except EAC) require the number of
clusters to be specified manually. As to the maximum
lifetime criterion adopted by the EAC algorithm, it is more or
less a rule of thumb that lacks justification. As we shall
see in the experiments, the maximum lifetime criterion is
unstable in determining the number of clusters.
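For reference, the sketch below is our reading of that rule (not code from the cited paper): the chosen number of clusters is the k whose k-cluster partition survives the longest range of merge heights in the dendrogram.

import numpy as np

def max_lifetime_k(Z):
    # Z is a SciPy linkage matrix over n objects (n - 1 merge rows);
    # Z[i, 2] is the height at which the i-th merge occurs.
    n = Z.shape[0] + 1
    heights = Z[:, 2]
    best_k, best_life = None, -np.inf
    for k in range(2, n):
        # The k-cluster partition is created by merge n-k-1 and destroyed
        # by merge n-k; its lifetime is the gap between those heights.
        life = heights[n - k] - heights[n - k - 1]
        if life > best_life:
            best_k, best_life = k, life
    return best_k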