Iam-On et al. [23] refined the CA matrix by considering the
shared neighbors between clusters to improve the consensus
results. Wang [24] introduced a dendrogram-like hierarchical
data structure termed CA-tree to facilitate the CA-based
ensemble clustering process. Lourenço et al. [29] proposed
a new ensemble clustering approach which is based on the
EAC paradigm and is able to determine the probabilistic
assignments of data objects to clusters. Liu et al. [31]
employed spectral clustering on the CA matrix and developed
an efficient ensemble clustering approach termed spectral
ensemble clustering (SEC).
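For concreteness, the CA matrix that underlies the EAC paradigm can be constructed as in the following minimal NumPy sketch (the function name is ours); consensus methods such as hierarchical or spectral clustering [31] are then applied to this matrix:

```python
import numpy as np

def co_association_matrix(base_labels):
    """Co-association (CA) matrix of a clustering ensemble.

    base_labels is a list of length-n integer label vectors, one per
    base clustering.  Entry (i, j) of the CA matrix is the fraction of
    base clusterings in which objects i and j fall into the same cluster.
    """
    n = len(base_labels[0])
    ca = np.zeros((n, n))
    for labels in base_labels:
        labels = np.asarray(labels)
        # same[i, j] = 1 when objects i and j share a label in this clustering
        ca += (labels[:, None] == labels[None, :]).astype(float)
    return ca / len(base_labels)
```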
The graph partitioning-based approaches [17], [19], [28], [30] address the ensemble clustering problem by constructing a graph model to reflect the ensemble information. The consensus clustering is then obtained by partitioning the graph into a certain number of segments. Strehl and Ghosh [17] proposed three graph partitioning-based ensemble clustering algorithms, i.e., CSPA, HGPA, and MCLA. Fern and Brodley [19] constructed a bipartite graph for the clustering ensemble by treating both clusters and objects as nodes, and obtained the consensus clustering by partitioning the bipartite graph. Yu et al. [28] designed a double affinity propagation-based ensemble clustering framework, which is able to handle noisy attributes and obtains the final clustering by the normalized cut algorithm.
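The bipartite formulation of [19] can be illustrated by the object-cluster incidence matrix below, a minimal sketch assuming that cluster labels in each base clustering are coded 0, ..., k-1 (the function name is ours):

```python
import numpy as np

def bipartite_incidence(base_labels):
    """Object-cluster incidence matrix of the bipartite ensemble graph.

    Every cluster of every base clustering is a node on one side of the
    graph and every object a node on the other side; entry (i, c) is 1
    iff object i belongs to cluster c.  Partitioning this bipartite
    graph (e.g., spectrally) yields a consensus clustering.
    """
    n = len(base_labels[0])
    blocks = []
    for labels in base_labels:
        labels = np.asarray(labels)
        block = np.zeros((n, labels.max() + 1))
        block[np.arange(n), labels] = 1.0
        blocks.append(block)
    return np.hstack(blocks)  # shape: n x (total number of clusters)
```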
The median partition-based approaches [18], [21], [25], [32] formulate the ensemble clustering problem as an optimization problem, which aims to find a median partition (or clustering) that maximizes the similarity between this clustering and the multiple base clusterings. The median partition problem is NP-hard [21], and finding the globally optimal solution in the huge space of all possible clusterings is computationally infeasible for large datasets. Cristofor and Simovici [18] proposed to obtain an approximate solution using a genetic algorithm, where clusterings are treated as chromosomes. Topchy et al. [21] cast the median partition problem as a maximum likelihood problem and solved it approximately with the EM algorithm. Franek and Jiang [25] cast the median partition problem as a Euclidean median problem by embedding clusterings in vector spaces. Huang et al. [32] formulated the median partition problem as a binary linear programming problem and obtained an approximate solution by means of factor graph theory.
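The objective that the median partition maximizes can be stated compactly in code. The sketch below uses NMI as the similarity measure, which is one common choice (the cited works differ in both the similarity and the search strategy); it only evaluates the objective, since exhaustive search over all partitions is infeasible:

```python
from sklearn.metrics import normalized_mutual_info_score

def median_partition_objective(candidate, base_labels):
    """Sum of similarities (here: NMI) between a candidate consensus
    clustering and all base clusterings.  The median partition is the
    clustering maximizing this value; since the search space grows
    exponentially, the cited works resort to genetic algorithms, EM,
    or linear programming relaxations rather than exhaustive search."""
    return sum(normalized_mutual_info_score(candidate, labels)
               for labels in base_labels)
```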
These algorithms attempt to solve the ensemble clustering problem in various ways [17]–[29], [31], [32]. However, a common limitation of most existing methods is that they treat all clusters and all base clusterings in the ensemble equally, and may therefore suffer from low-quality clusters or low-quality base clusterings. To partially address this limitation, some weighted ensemble clustering approaches have recently been presented [30], [37], [38]. Li and Ding [37] cast the ensemble clustering problem as a non-negative matrix factorization problem and proposed a weighted consensus clustering approach, where each base clustering is assigned a weight in order to improve the consensus result. Yu et al. [38] exploited feature selection techniques to weight and select the base clusterings. In fact, clustering selection [38] can be viewed as a 0–1 weighting scheme, where 1 indicates selecting a clustering and 0 indicates removing it. Huang et al. [30] proposed to evaluate and weight the base clusterings based on the concept of normalized crowd agreement index (NCAI), and devised two weighted consensus functions to obtain the final clustering result.
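As a rough illustration of such clustering-level weighting, the sketch below weights each base clustering by its average NMI with the other members of the ensemble; this is a simplified stand-in for the NCAI of [30], whose exact normalization may differ:

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def crowd_agreement_weights(base_labels):
    """Weight each base clustering by its average agreement (NMI) with
    the other base clusterings, normalized to sum to one.  A simplified
    stand-in for the NCAI weighting of [30]."""
    m = len(base_labels)
    w = np.zeros(m)
    for i in range(m):
        w[i] = np.mean([normalized_mutual_info_score(base_labels[i],
                                                     base_labels[j])
                        for j in range(m) if j != i])
    return w / w.sum()
```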
Although the above-mentioned weighted ensemble clustering approaches [30], [37], [38] are able to estimate the reliability of base clusterings and weight them accordingly, they generally treat a base clustering as a whole and neglect the local diversity of clusters inside the same base clustering. To explore the reliability of clusters, Alizadeh et al. [45] proposed to evaluate clusters in the ensemble by averaging the normalized mutual information (NMI) [17] between clusterings, which incurs a very high computational cost and is not feasible for large datasets. Zhong et al. [39] exploited the Euclidean distances between objects to estimate cluster reliability, which requires access to the original data features and is only applicable to numerical data. However, in the more general formulation of ensemble clustering [17]–[25], [32], the original data features are not available in the consensus process. Moreover, by measuring the within-cluster similarity based on Euclidean distances, the efficacy of the method in [39] relies heavily on implicit assumptions about the data distribution, which introduces an unstable factor into the consensus process. Different from [39], in this paper, our approach requires no access to the original data features. We propose to estimate the uncertainty of clusters by considering the cluster labels in the entire ensemble based on an entropic criterion, and then present the concept of ECI to evaluate cluster reliability without making any assumptions about the data distribution. Further, to obtain the consensus clustering results, two novel consensus functions are developed based on cluster uncertainty estimation and a local weighting strategy. Extensive experiments on a variety of real-world datasets show that our approach exhibits significant advantages in clustering accuracy and efficiency over the state-of-the-art approaches.
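To convey the flavor of such an entropic criterion (the precise definition of ECI is given later in this paper; the sketch and its names are ours), the uncertainty of a single cluster can be measured by how unpredictably its members are scattered over the clusters of the base clusterings:

```python
import numpy as np

def cluster_uncertainty(member_indices, base_labels):
    """Entropy-style uncertainty of one cluster w.r.t. the ensemble.

    For each base clustering, compute the entropy of the distribution
    of the cluster's members over that clustering's clusters, and
    accumulate.  A cluster that every base clustering keeps together
    scores 0; a cluster whose members are scattered scores high.
    """
    h = 0.0
    for labels in base_labels:
        member_labels = np.asarray(labels)[member_indices]
        _, counts = np.unique(member_labels, return_counts=True)
        p = counts / counts.sum()
        h -= np.sum(p * np.log2(p))
    return h
```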
III. PRELIMINARIES
A. Entropy
In this section, we briefly review the concept of entropy. In
information theory [46], the entropy is a measure of the uncer-
tainty associated with a random variable. The formal definition
of entropy is provided in Definition 1.
Definition 1: For a discrete random variable X, the entropy
H(X) is defined as
H(X) = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x)  (1)

where \mathcal{X} is the set of values that X can take, and p(x) is the probability mass function of X.
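As a quick numerical illustration of (1) (a minimal sketch; the function name is ours):

```python
import numpy as np

def entropy(p):
    """Shannon entropy, in bits, of a discrete distribution given as an
    array of probabilities; zero-probability outcomes contribute nothing."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))  # fair coin: 1.0 bit of uncertainty
print(entropy([1.0]))       # certain outcome: 0.0 bits
```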
The joint entropy is a measure of the uncertainty associated
with a set of random variables. The formal definition of joint
entropy is provided in Definition 2.