$$
J\left(Z, \{\lambda_l\}_{l=1}^{k}\right)
= \sum_{l=1}^{k} \frac{1}{n_l^2} \sum_{z(j)=z(j')=l}
\left[ \sum_{i=1}^{m} \left( \lambda_{li}\, d_{jj'i} + \eta\, \lambda_{li} \log \lambda_{li} \right) + \eta \log m \right].
\tag{5}
$$
Here, $\eta$ is a positive parameter that controls the strength of the incentive for subspace clustering on more dimensions, $n_l$ is the number of objects assigned to the $l$th cluster, $\lambda_l$ is the weight vector for the $l$th cluster, used to regulate the cluster size, $\lambda_{li}$ is the weight of the $i$th feature in the $l$th cluster, and $d_{jj'i}$ is the distance between the $j$th and the $j'$th objects along the $i$th dimension. For instance, the distance is given by
$$
d_{jj'i} = \frac{\left| x_{ji} - x_{j'i} \right|}
{\frac{1}{n^2} \sum_{j_1=1}^{n} \sum_{j_2=1}^{n} \left| x_{j_1 i} - x_{j_2 i} \right|}.
$$
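To make the normalization concrete, the following NumPy sketch (the function name and the vectorized layout are ours, not from COSA) computes the full tensor of values $d_{jj'i}$ for a data matrix:

```python
import numpy as np

def cosa_dimension_distances(X):
    """Per-dimension pairwise distances d_{jj'i} for an n x m data
    matrix X, each dimension rescaled by its mean absolute pairwise
    difference as in the formula above.

    Returns an (n, n, m) array D with D[j, jp, i] = d_{jj'i}.
    """
    X = np.asarray(X, dtype=float)
    n, _ = X.shape
    # Raw per-dimension absolute differences |x_{ji} - x_{j'i}|.
    diff = np.abs(X[:, None, :] - X[None, :, :])     # (n, n, m)
    # Scale of dimension i: (1/n^2) sum_{j1,j2} |x_{j1 i} - x_{j2 i}|.
    scale = diff.sum(axis=(0, 1)) / (n * n)          # (m,)
    # Guard constant dimensions, whose scale would be zero.
    scale = np.where(scale > 0, scale, 1.0)
    return diff / scale
```

Note that materializing all pairs costs $O(n^2 m)$ memory, which mirrors the $n^2 m$ term in the complexity discussion below.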
In order to minimize (5) and find the solution clusters efficiently, Friedman and Meulman proposed an iterative approach that builds a weighted dissimilarity matrix among objects. A hierarchical clustering method based on nearest neighbors is then used to cluster this matrix. The computational process of COSA may not be scalable to large data sets. The computational complexity of building the weighted dissimilarity matrix is $O(hnmL + n^2 m)$, where $n$ is the number of objects, $m$ is the number of dimensions, $L$ is a predefined parameter specifying the number of nearest neighbors of a given object, and $h$ is the number of iterations; the first term of the complexity accounts for calculating the weights of all dimensions for each object, and the second term for creating the matrix. In other words, COSA may not be practical for large-volume and high-dimensional data.
3 ENTROPY WEIGHTING k-MEANS
In this section, we present a new k-means type algorithm for
soft subspace clustering of high-dimensional sparse data. In
the new algorithm, we consider the weight of a dimension in a cluster to represent the probability that the dimension contributes to forming the cluster. The entropy of the dimension weights represents the certainty of the dimensions in the identification of a cluster. Therefore, we modify the objective function (1) by adding a weight entropy term to it, so that we can simultaneously minimize the within-cluster dispersion and maximize the negative weight entropy to stimulate more dimensions to contribute to the identification of clusters. In this way, we avoid the problem of identifying clusters by only a few dimensions in sparse data.
The new objective function is written as follows:
$$
F(W, Z, \Lambda) = \sum_{l=1}^{k} \left[ \sum_{j=1}^{n} \sum_{i=1}^{m} w_{lj}\, \lambda_{li}\, (z_{li} - x_{ji})^2
+ \gamma \sum_{i=1}^{m} \lambda_{li} \log \lambda_{li} \right]
\tag{6}
$$
subject to
$$
\begin{cases}
\sum_{l=1}^{k} w_{lj} = 1, & 1 \le j \le n,\ 1 \le l \le k,\ w_{lj} \in \{0, 1\}, \\
\sum_{i=1}^{m} \lambda_{li} = 1, & 1 \le l \le k,\ 1 \le i \le m,\ 0 \le \lambda_{li} \le 1.
\end{cases}
$$
The first term in (6) is the sum of the within-cluster dispersions, and the second term is the negative weight entropy. The positive parameter $\gamma$ controls the strength of the incentive for clustering on more dimensions.
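For concreteness, (6) can be evaluated directly from a candidate solution. Below is a minimal NumPy sketch with our own naming: X is the $n \times m$ data matrix, W the $k \times n$ binary assignment matrix, Z the $k \times m$ matrix of cluster centers, Lam the $k \times m$ weight matrix $\Lambda$, and gamma the parameter $\gamma$:

```python
import numpy as np

def ewkm_objective(X, W, Z, Lam, gamma):
    """Evaluate F(W, Z, Lambda) of (6)."""
    # Within-cluster dispersion:
    # sum_l sum_j sum_i w_lj * lam_li * (z_li - x_ji)^2.
    sq = (Z[:, None, :] - X[None, :, :]) ** 2    # (k, n, m)
    dispersion = np.einsum('lj,li,lji->', W, Lam, sq)
    # Weight entropy term: gamma * sum_l sum_i lam_li * log(lam_li);
    # the small epsilon clamps log(0), and lam*log(lam) -> 0 anyway.
    entropy = gamma * np.sum(Lam * np.log(Lam + 1e-12))
    return dispersion + entropy
```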
Next, we present the entropy weighting k-means algorithm (EWKM) to solve the above minimization problem.
3.1 EWKM Algorithm
Minimization of $F$ in (6) with the constraints forms a class of constrained nonlinear optimization problems whose solutions are unknown. The usual method for optimizing $F$ is partial optimization for $\Lambda$, $Z$, and $W$. In this method, we first fix $Z$ and $\Lambda$ and minimize the reduced $F$ with respect to $W$. Then, we fix $W$ and $\Lambda$ and minimize the reduced $F$ with respect to $Z$. Finally, we fix $W$ and $Z$ and minimize the reduced $F$ to solve $\Lambda$, as sketched below.
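A minimal sketch of this alternating scheme follows (all names are ours). It assumes the standard k-means forms of the $W$ and $Z$ updates, that is, assignment to the nearest center under the $\lambda$-weighted distance and recomputation of centers as cluster means, and uses formula (7) of Theorem 1 below for the $\Lambda$ update:

```python
import numpy as np

def ewkm(X, k, gamma, n_iter=50, seed=0):
    """Alternating minimization of F in (6): in each iteration, fix
    two of (W, Z, Lambda) and minimize over the third."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, m = X.shape
    Z = X[rng.choice(n, k, replace=False)].copy()  # initial centers
    Lam = np.full((k, m), 1.0 / m)                 # uniform weights
    for _ in range(n_iter):
        # Fix Z, Lam; minimize over W: assign each object to the
        # center minimizing sum_i lam_li * (z_li - x_ji)^2.
        sq = (Z[:, None, :] - X[None, :, :]) ** 2  # (k, n, m)
        labels = np.einsum('li,lji->lj', Lam, sq).argmin(axis=0)
        # Fix W, Lam; minimize over Z: each center is its cluster mean.
        for l in range(k):
            if (labels == l).any():
                Z[l] = X[labels == l].mean(axis=0)
        # Fix W, Z; minimize over Lam by (7), with
        # D_lt = sum_j w_lj * (z_lt - x_jt)^2.
        D = np.stack([((X[labels == l] - Z[l]) ** 2).sum(axis=0)
                      for l in range(k)])
        E = np.exp(-(D - D.min(axis=1, keepdims=True)) / gamma)
        Lam = E / E.sum(axis=1, keepdims=True)
    return labels, Z, Lam
```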
We can extend the standard k-means clustering process to minimize $F$ by adding an additional step in each iteration to compute the weights for each cluster. The formula for computing $\Lambda$ is given in the following theorem:
Theorem 1. Given matrices $W$ and $Z$ are fixed, $F$ is minimized if
$$
\lambda_{lt} = \frac{\exp\left( \frac{-D_{lt}}{\gamma} \right)}
{\sum_{i=1}^{m} \exp\left( \frac{-D_{li}}{\gamma} \right)},
\tag{7}
$$
where
$$
D_{lt} = \sum_{j=1}^{n} w_{lj}\, (z_{lt} - x_{jt})^2.
$$
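Numerically, (7) is a softmax of $-D_{lt}/\gamma$ over the $m$ dimensions. In the sketch below (naming is ours), shifting each row by its minimum before exponentiating leaves the ratio unchanged and avoids underflow when $D_{lt}/\gamma$ is large:

```python
import numpy as np

def update_weights(D, gamma):
    """Lambda update of (7) for a k x m dispersion matrix D, where
    D[l, t] = sum_j w_lj * (z_lt - x_jt)^2."""
    # A common per-row shift cancels in the ratio; it only keeps
    # exp() from underflowing to an all-zero row.
    E = np.exp(-(D - D.min(axis=1, keepdims=True)) / gamma)
    return E / E.sum(axis=1, keepdims=True)  # rows sum to 1
```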
Proof. We use the Lagrange multiplier technique to obtain the following unconstrained minimization problem:
$$
\min\, F_1\left( \{\lambda_{li}\}, \{\alpha_l\} \right)
= \sum_{l=1}^{k} \left[ \sum_{j=1}^{n} \sum_{i=1}^{m} w_{lj}\, \lambda_{li}\, (z_{li} - x_{ji})^2
+ \gamma \sum_{i=1}^{m} \lambda_{li} \log \lambda_{li} \right]
- \sum_{l=1}^{k} \alpha_l \left( \sum_{i=1}^{m} \lambda_{li} - 1 \right),
\tag{8}
$$
where $\alpha = [\alpha_1, \ldots, \alpha_k]$ is a vector containing the Lagrange multipliers corresponding to the constraints. The optimization problem in (8) can be decomposed into $k$ independent minimization problems:
$$
\min\, F_{1l}(\lambda_{li}, \alpha_l)
= \sum_{j=1}^{n} \sum_{i=1}^{m} w_{lj}\, \lambda_{li}\, (z_{li} - x_{ji})^2
+ \gamma \sum_{i=1}^{m} \lambda_{li} \log \lambda_{li}
- \alpha_l \left( \sum_{i=1}^{m} \lambda_{li} - 1 \right)
$$
for $l = 1, \ldots, k$. By setting the gradient of $F_{1l}$ with respect to $\lambda_{lt}$ and $\alpha_l$ to zero, we obtain
$$
\frac{\partial F_{1l}}{\partial \alpha_l} = -\left( \sum_{i=1}^{m} \lambda_{li} - 1 \right) = 0
\tag{9}
$$
and
$$
\frac{\partial F_{1l}}{\partial \lambda_{lt}}
= \sum_{j=1}^{n} w_{lj}\, (z_{lt} - x_{jt})^2
+ \gamma \left( 1 + \log \lambda_{lt} \right) - \alpha_l = 0.
\tag{10}
$$
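From (10), we obtain
$$
\log \lambda_{lt} = \frac{\alpha_l - \gamma - D_{lt}}{\gamma}
\quad \Longrightarrow \quad
\lambda_{lt} = \exp\left( \frac{\alpha_l - \gamma}{\gamma} \right) \exp\left( \frac{-D_{lt}}{\gamma} \right),
$$
where $D_{lt} = \sum_{j=1}^{n} w_{lj} (z_{lt} - x_{jt})^2$. Substituting this expression into (9) to eliminate $\alpha_l$ gives $\exp\left( \frac{\alpha_l - \gamma}{\gamma} \right) = 1 \big/ \sum_{i=1}^{m} \exp\left( \frac{-D_{li}}{\gamma} \right)$, and hence
$$
\lambda_{lt} = \frac{\exp\left( \frac{-D_{lt}}{\gamma} \right)}{\sum_{i=1}^{m} \exp\left( \frac{-D_{li}}{\gamma} \right)},
$$
which is exactly (7). This completes the proof.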