x and y, we have
$$D(x\|y) = \sum_i x_i \log \frac{x_i}{y_i}. \quad (2)$$
It is easy to note that in most cases $D(x\|y) \neq D(y\|x)$, and that $D(x\|y) + D(y\|z) \geq D(x\|z)$ cannot be guaranteed. So $D$ is not a metric. If we let "dist" be $D$ in Eq. (1), we obtain the objective function of information-theoretic K-means clustering (Info-Kmeans) as follows:
$$O_1:\ \min_{\{c_k\}} \sum_k \sum_{x \in c_k} p_x\, D(x\|m_k), \quad (3)$$
where each instance $x$ is normalized to a discrete distribution, and $m_k = \sum_{x \in c_k} p_x x / \sum_{x \in c_k} p_x$ is the arithmetic mean of the instances assigned to cluster $c_k$. Let $|\cdot|$ denote the sum of the feature values of a vector. Since $|x| = 1$, we have $|m_k| = 1$, $\forall k$.
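To make the objective concrete, here is a minimal NumPy sketch of Eqs. (2) and (3), assuming strictly positive, normalized instance vectors and explicit instance weights $p_x$; the function names (kl_divergence, info_kmeans_objective) are illustrative choices, not part of the original formulation.

```python
import numpy as np

def kl_divergence(x, y):
    """D(x || y) between two discrete distributions, as in Eq. (2).
    Assumes all entries of x and y are strictly positive."""
    return np.sum(x * np.log(x / y))

def info_kmeans_objective(X, p, labels, K):
    """O_1 in Eq. (3): sum_k sum_{x in c_k} p_x D(x || m_k)."""
    total = 0.0
    for k in range(K):
        members, weights = X[labels == k], p[labels == k]
        if members.shape[0] == 0:
            continue  # empty clusters contribute nothing
        # m_k = (sum_{x in c_k} p_x x) / (sum_{x in c_k} p_x), so |m_k| = 1
        m_k = (weights[:, None] * members).sum(axis=0) / weights.sum()
        total += sum(w * kl_divergence(x, m_k) for w, x in zip(weights, members))
    return total

# Example: 4 normalized instances on 3 features, uniform weights, 2 clusters
X = np.array([[.2, .3, .5], [.1, .6, .3], [.4, .4, .2], [.3, .3, .4]])
p = np.full(4, 0.25)
print(info_kmeans_objective(X, p, np.array([0, 0, 1, 1]), K=2))
```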
It has been pointed out that Info-Kmeans actually aims to minimize the loss of mutual information between the instance variable and the feature variable after clustering [3]. This explains why Info-Kmeans belongs to the family of information-theoretic clustering methods.
2.2. The problem of Info-Kmeans
Though it has a clear physical meaning, Info-Kmeans has long been criticized for performing relatively poorly on high-dimensional sparse data [28]. In this section, however, we highlight an implementation challenge of Info-Kmeans. We believe this challenge is one of the major factors that degrade the performance of Info-Kmeans.
Assume that we use Info-Kmeans to cluster a text corpus. To this end, we must compute the KL-divergence between each text vector $x$ and each centroid $m_k$. In practice, we usually let $x = x/|x|$ and then compute $D(x\|m_k)$ by Eq. (2). Note that Eq. (2) implies that all the feature values of $x$ are positive real numbers. Unfortunately, this is not the case for high-dimensional data, which are usually very sparse in their feature space.
To illustrate this, we examine the computation of the KL-divergence in the $i$th dimension. Letting $D_i$ denote $x_i \log(x_i/m_{k,i})$, we have the following four scenarios:
1. Case 1: $x_i > 0$ and $m_{k,i} > 0$. In this case, the computation of $D_i$ is straightforward, and the result can be any real number.
2. Case 2: $x_i = 0$ and $m_{k,i} = 0$. In this case, we can simply omit this feature, or equivalently let $D_i = 0$.
3. Case 3: $x_i = 0$ and $m_{k,i} > 0$. In this case, $\log(x_i/m_{k,i}) = \log 0 = -\infty$, which implies that the direct computation is infeasible. However, by L'Hôpital's rule [1], $\lim_{x \to 0^{+}} x \log(x/a) = 0$ for $a > 0$. So we can let $x := x_i$ and $a := m_{k,i}$, and thus have $D_i = 0$.
4. Case 4: $x_i > 0$ and $m_{k,i} = 0$. In this case, $D_i = +\infty$, which is hard to handle in practice.
We summarize the four cases in Table 1. In general, for Cases 1 and 2, the computation of $D_i$ is logically reasonable. However, the computation of $D_i$ in Case 3 is somewhat problematic; it cannot reveal any difference between $x_i$ and $m_{k,i}$, although $m_{k,i}$ may deviate heavily from zero. Nevertheless, the most difficult case is Case 4. It leads to an infinite $D$ and hinders the instance from being properly assigned. This is particularly troublesome for high-dimensional sparse data, since the centroids of such data typically contain many zero-valued features. We call this problem the "zero-feature dilemma".
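The dilemma is easy to reproduce in code. The sketch below uses a hypothetical helper, kl_term_per_dim, to evaluate the per-dimension term $D_i$ under the four cases of Table 1; a single Case 4 feature is enough to make the whole divergence $D(x\|m_k)$ infinite.

```python
import numpy as np

def kl_term_per_dim(x_i, m_ki):
    """Per-dimension term D_i = x_i log(x_i / m_{k,i}) under the four cases."""
    if x_i > 0 and m_ki > 0:      # Case 1: finite, can be any real number
        return x_i * np.log(x_i / m_ki)
    if x_i == 0 and m_ki == 0:    # Case 2: feature omitted
        return 0.0
    if x_i == 0 and m_ki > 0:     # Case 3: lim_{x -> 0+} x log(x/a) = 0
        return 0.0
    return np.inf                 # Case 4: the zero-feature dilemma

for pair in [(0.2, 0.1), (0.0, 0.0), (0.0, 0.3), (0.2, 0.0)]:
    print(pair, '->', kl_term_per_dim(*pair))
# The last pair yields inf, so D(x || m_k) becomes infinite and the instance
# can no longer be meaningfully compared against that centroid.
```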
Problem definition: Design a new information-theoretic K-means algorithm that avoids the zero-feature dilemma and is particularly suitable for clustering high-dimensional sparse data.
3. The SAIL algorithm
In this section, we propose a new algorithm, called Summation-bAsed Incremental Learning (SAIL), for Info-Kmeans clustering.
3.1. SAIL: theoretical foundation
Let H(x) denote the Shannon entropy of a discrete
distribution x. We first have the following lemma:
Lemma 1.
$$D(x\|y) = -H(x) + H(y) + (x-y)^{T}\,\nabla H(y). \quad (4)$$
Proof. Since $H(y) = -\sum_{i=1}^{d} y_i \log y_i$, it is easy to see that
$$\nabla H(y) = -(\log y_1, \ldots, \log y_d)^{T} - \log e\,(1, \ldots, 1)^{T}.$$
Accordingly, we have
$$-H(x) + H(y) + (x-y)^{T}\nabla H(y) = \underbrace{\sum_{i=1}^{d} x_i \log\frac{x_i}{y_i}}_{(a)} - \log e \underbrace{\sum_{i=1}^{d}(x_i - y_i)}_{(b)}.$$
Since $(a) = D(x\|y)$ and $(b) = 0$ provided that $\sum_{i=1}^{d} x_i = \sum_{i=1}^{d} y_i = 1$, the lemma follows. $\square$
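As a sanity check, Lemma 1 can be verified numerically. The sketch below uses natural logarithms (so $\log e = 1$) and two randomly generated, strictly positive distributions; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(5); x /= x.sum()          # strictly positive distribution
y = rng.random(5); y /= y.sum()          # strictly positive distribution

H = lambda p: -np.sum(p * np.log(p))     # Shannon entropy (natural log)
grad_H = lambda p: -np.log(p) - 1.0      # gradient of H, with log e = 1

lhs = np.sum(x * np.log(x / y))               # D(x || y), Eq. (2)
rhs = -H(x) + H(y) + (x - y) @ grad_H(y)      # right-hand side of Eq. (4)
print(np.isclose(lhs, rhs))                   # expected: True
```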
Based on $D(x\|y)$ in Eq. (4), we now derive SAIL, a new variant of Info-Kmeans. Specifically, we have the following theorem:
Theorem 1. Given $p_{c_k} = \sum_{x \in c_k} p_x$, the objective function of Info-Kmeans $O_1$ in Eq. (3) is equivalent to
$$O_2:\ \min_{\{c_k\}} \sum_k p_{c_k}\, H(m_k). \quad (5)$$
Proof. By Eq. (4), we have $D(x\|m_k) = -H(x) + H(m_k) + (x-m_k)^{T}\nabla H(m_k)$. As a result,
$$\sum_k \sum_{x \in c_k} p_x\, D(x\|m_k) = \underbrace{\sum_k p_{c_k} H(m_k)}_{(a)} - \underbrace{\sum_x p_x H(x)}_{(b)}$$
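Note that term (b) does not depend on how the instances are assigned to clusters, so $O_1$ and $O_2$ differ only by a constant, which is what Theorem 1 asserts. The sketch below is a small numerical check of this decomposition under uniform instance weights and an arbitrary fixed assignment; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, K = 6, 4, 2
X = rng.random((n, d))
X /= X.sum(axis=1, keepdims=True)        # each instance is a distribution
p = np.full(n, 1.0 / n)                  # uniform instance weights p_x
labels = np.arange(n) % K                # an arbitrary fixed assignment

H = lambda v: -np.sum(v * np.log(v))     # Shannon entropy (natural log)

O1 = O2 = 0.0
for k in range(K):
    w, members = p[labels == k], X[labels == k]
    m_k = (w[:, None] * members).sum(axis=0) / w.sum()   # weighted centroid
    O1 += sum(wi * np.sum(xi * np.log(xi / m_k)) for wi, xi in zip(w, members))
    O2 += w.sum() * H(m_k)               # p_{c_k} H(m_k), term (a)

const = np.sum(p * np.array([H(xi) for xi in X]))        # term (b)
print(np.isclose(O1, O2 - const))                        # expected: True
```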
Table 1
Four cases in KL-divergence computation.

Case        i           ii     iii    iv
x_i         > 0         = 0    = 0    > 0
m_{k,i}     > 0         = 0    > 0    = 0
D_i         (−∞, +∞)    = 0    = 0    +∞