concepts: evidential clustering and constraint-based clustering. EV-
CLUS uses the Dempster–Shafer theory to assign a mass function to
each instance. It provides a credal partition, which subsumes the
notions of crisp, fuzzy and possibilistic partitions. CEVCLUS extends this by taking advantage of prior information: such background knowledge is integrated as an additional term in the cost function.
Given a dataset $X = [x_i]_{i=1}^{n}$ and the number of clusters $K$, CEVCLUS aims to obtain a clustering partition by minimizing a loss function consisting of two terms:
$$
L_{CEVCLUS}(M, a, b) = L_{EVCLUS}(M, a, b) + \alpha \frac{1}{2(|ML| + |CL|)} L_{CONST}(M) \tag{2}
$$
where $L_{EVCLUS}(M, a, b)$ represents the loss of conventional evidential clustering (EVCLUS), and is expressed as:
$$
L_{EVCLUS}(M, a, b) = \frac{1}{\sum_{i<j} d_{ij}} \sum_{i<j} \frac{\left( a\, CF_{ij} + b - d_{ij} \right)^2}{d_{ij}} \tag{3}
$$
In Eq. (3) the matrix of mass functions $M = (m_{ik})$ corresponds to a credal partition. The mass $m_{ik}$ represents the degree of belief that the instance $x_i$ is assigned to the cluster $C_k$, which is determined by the distance between $x_i$ and the representative point $\mathrm{mean}(C_k)$ of cluster $C_k$. $CF_{ij} = 1 - pl_{i \times j}(\theta)$ represents the degree of conflict between two instances, where $pl_{i \times j}(\theta)$ is the plausibility that the instances $x_i$ and $x_j$ are in the same cluster and $pl_{i \times j}(\bar{\theta})$ is the plausibility that the two instances are in different clusters. $a$ and $b$ are two coefficients.
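To make these quantities concrete, the following minimal Python sketch computes the pairwise conflict and evaluates the stress of Eq. (3). It assumes the simplified case in which each mass function assigns belief to singleton clusters only (each row of `masses` sums to one); in that case the plausibility that two instances share a cluster reduces to the inner product of their mass vectors. General credal partitions with non-singleton focal sets require the full Dempster–Shafer combination and are not covered here, and the names `masses`, `dissim`, `a`, `b` are illustrative rather than taken from the original papers.

```python
import numpy as np

def pairwise_conflict(masses):
    """CF_ij = 1 - pl_{i x j}(theta), assuming each mass function assigns
    belief to singleton clusters only (masses is an n x K matrix whose
    rows sum to one); then pl_{i x j}(theta) = sum_k m_ik * m_jk."""
    return 1.0 - masses @ masses.T

def evclus_stress(masses, dissim, a, b):
    """Stress of Eq. (3) for a credal partition and a dissimilarity matrix."""
    cf = pairwise_conflict(masses)
    iu = np.triu_indices(dissim.shape[0], k=1)     # pairs with i < j
    residual = a * cf[iu] + b - dissim[iu]
    return np.sum(residual ** 2 / dissim[iu]) / np.sum(dissim[iu])
```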
In the second term of Eq. (2), $\alpha \frac{1}{2(|ML| + |CL|)} L_{CONST}(M)$ represents the loss of violating pairwise constraints, where $\alpha \geq 0$ is a hyperparameter that controls the importance of the constraints, namely the Must-Link (ML) and Cannot-Link (CL) sets. $L_{CONST}(M)$ can be formulated as follows:
$$
L_{CONST}(M) = \sum_{(x_i, x_j) \in ML} \left[ pl_{i \times j}(\bar{\theta}) + 1 - pl_{i \times j}(\theta) \right] + \sum_{(x_i, x_j) \in CL} \left[ pl_{i \times j}(\theta) + 1 - pl_{i \times j}(\bar{\theta}) \right] \tag{4}
$$
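A minimal sketch of the constraint penalty in Eq. (4) is given below, assuming the plausibilities $pl_{i \times j}(\theta)$ and $pl_{i \times j}(\bar{\theta})$ have already been computed as matrices; the names `pl_same` and `pl_diff` are illustrative. The normalization by $2(|ML| + |CL|)$ and the weighting by $\alpha$ are applied in Eq. (2), not inside this term.

```python
def constraint_loss(pl_same, pl_diff, must_link, cannot_link):
    """L_CONST(M) of Eq. (4): penalize must-link pairs that appear separated
    and cannot-link pairs that appear merged."""
    loss = 0.0
    for i, j in must_link:
        loss += pl_diff[i, j] + 1.0 - pl_same[i, j]
    for i, j in cannot_link:
        loss += pl_same[i, j] + 1.0 - pl_diff[i, j]
    return loss
```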
2.4. Constrained clustering via spectral regularization (CCSR)
In this approach [25], a regularization framework has been developed for spectral clustering [26] that makes use of constraint information. It adapts the spectral embedding to the pairwise constraints through optimization: it constructs a smooth data representation space, parameterized by a spectral embedding, that is most consistent with the pairwise constraints.
Given a dataset $X = [x_i]_{i=1}^{n}$ and the number of clusters $K$, it initially forms a sparse symmetric similarity matrix $W = [w_{ij}]$, where $w_{ij}$ denotes the similarity between $x_i$ and $x_j$. $W$ is assumed to be non-negative and symmetric. Then the normalized graph Laplacian is obtained as $\bar{L} = I - D^{-1/2} W D^{-1/2}$, where $I$ is the identity matrix of size $n \times n$ and $D = \mathrm{diag}(d_1, \ldots, d_n)$ with $d_i = \sum_{j=1}^{n} w_{ij}$.
Next, the $m$ eigenvectors $v_1, \ldots, v_m$ of $\bar{L}$ corresponding to the $m$ smallest eigenvalues are computed. Given $F_m = (v_1, \ldots, v_m)$, the new data representation is written as $F = F_m A$. This indicates that obtaining a representation consistent with the prior information of pairwise constraints is equivalent to determining a suitable coefficient matrix $A$. In fact, the coefficient matrix $A$ can be obtained by minimizing the following loss function:
$$
L(A) = \sum_{(i, j, t_{ij}) \in S} \left( u_i^{T} A A^{T} u_j - t_{ij} \right)^2 \tag{5}
$$
where $u_i^{T}$ denotes the $i$th row of $F_m$, i.e. $F_m = (u_1, \ldots, u_n)^{T}$, and $S = \{(i, j, t_{ij})\}$ is a set of pairwise constraints in which $t_{ij}$ is a binary indicator taking the value 1 or 0 according to whether $x_i$ and $x_j$ belong to the same cluster or not. Let $M = A A^{T}$; it is positive semi-definite. Then the objective function in Eq. (5) can be re-written as:
$$
L(M) = \sum_{(i, j, t_{ij}) \in S} \left( u_i^{T} M u_j - t_{ij} \right)^2 \tag{6}
$$
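The objective of Eq. (6) can be evaluated directly from the spectral embedding and the constraint set; the sketch below uses illustrative variable names and is not part of the solver of [25]:

```python
import numpy as np

def ccsr_objective(U, M, constraints):
    """L(M) of Eq. (6). U is the n x m spectral embedding whose i-th row is
    u_i^T; constraints is a list of (i, j, t_ij) with t_ij in {0, 1}."""
    return sum((U[i] @ M @ U[j] - t) ** 2 for i, j, t in constraints)
```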
$M$ can be obtained by solving the semi-definite programming problem derived from Eq. (6); the details are given in [25]. Once $M$ is obtained, the clustering partition can finally be produced by applying any conventional clustering algorithm, such as K-means, to the new data representation constructed from the rows of $F_m M^{1/2}$.
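Once $M$ has been obtained (the semidefinite program of [25] is not reproduced here), the final clustering step described above might look as follows; $M$ is assumed to be positive semi-definite so that its square root is well defined, and scikit-learn's K-means is used only as one possible choice of conventional clustering algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

def ccsr_final_partition(U, M, K):
    """Cluster the rows of F_m M^{1/2} with K-means, where U = F_m."""
    # Symmetric square root of the PSD matrix M via its eigendecomposition.
    vals, vecs = np.linalg.eigh(M)
    M_sqrt = vecs @ np.diag(np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T
    F = U @ M_sqrt                      # new data representation
    return KMeans(n_clusters=K, n_init=10).fit_predict(F)
```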
2.5. Semi-supervised Naïve Bayesian classifier using EM
In this approach, the semi-supervised learning task is performed by applying naive Bayes to the case of labeled and unlabeled data with the Expectation-Maximization (EM) technique [27]. First, a naive Bayes classifier with generative model parameters $\theta$ is built in the standard supervised fashion from the limited amount of labeled data $x_j \in X_L$:
$$
\hat{\theta} = \arg\max_{\theta} P(X_L \mid \theta) = \arg\max_{\theta} \prod_{x_j \in X_L} \sum_{i} P(c_i \mid \theta)\, P(x_j \mid c_i; \theta) \tag{7}
$$
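For this supervised initialization, the parameter estimates reduce to class priors and per-class feature likelihoods. The sketch below assumes discrete count features (as in text classification) and Laplace smoothing, which are common but not stated choices; the function and variable names are illustrative.

```python
import numpy as np

def fit_naive_bayes(X_labeled, y_labeled, n_classes, alpha=1.0):
    """Estimate theta = (class priors, per-class feature probabilities)
    from labeled count data (n_docs x n_features), with Laplace smoothing."""
    n_features = X_labeled.shape[1]
    priors = np.zeros(n_classes)
    likelihoods = np.zeros((n_classes, n_features))
    for c in range(n_classes):
        Xc = X_labeled[y_labeled == c]
        priors[c] = (len(Xc) + 1.0) / (len(X_labeled) + n_classes)
        counts = Xc.sum(axis=0) + alpha
        likelihoods[c] = counts / counts.sum()
    return priors, likelihoods
```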
Then, we perform the E-step of the Expectation-Maximization (EM) technique by classifying the unlabeled data $x_j \in X_U$ with the naive Bayes model $\hat{\theta}$. The class of an unlabeled instance is determined by finding the generative model component with the highest probability of having generated it, as shown in Eq. (8):
$$
c_i = \arg\max_{c_i} P(c_i \mid x_j; \hat{\theta}) = \arg\max_{c_i} \frac{P(c_i \mid \hat{\theta})\, P(x_j \mid c_i; \hat{\theta})}{P(x_j \mid \hat{\theta})} \tag{8}
$$
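This E-step classification is a direct application of Bayes' rule; the sketch below reuses the hypothetical `priors` and `likelihoods` from the previous snippet and works in log space. The denominator $P(x_j \mid \hat{\theta})$ of Eq. (8) is omitted because it does not depend on the class and therefore does not affect the argmax.

```python
import numpy as np

def classify(X, priors, likelihoods):
    """Return argmax_c P(c | x_j; theta) for each row of the count matrix X,
    computed in log space for numerical stability (Eq. (8))."""
    log_joint = np.log(priors) + X @ np.log(likelihoods).T   # shape (n, K)
    return np.argmax(log_joint, axis=1)
```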
As the M-step of the Expectation-Maximization (EM) technique, a new naive Bayes classifier is built by re-estimating the parameters $\hat{\theta}$ with the updated labeled dataset according to Eq. (7). The classifier parameters $\hat{\theta}$ are iteratively improved by repeating the above process of classifying the unlabeled data and rebuilding the naive Bayes model on the labeled data, until the procedure converges to a stable classifier and a stable set of labels for the data. In fact, convergence is indicated by no change in the log probability of the full dataset $X$ with labels $Y$ in Eq. (9):
$$
l(\theta \mid X, Y) = \log P(\theta) + \sum_{x_j \in X_U} \log \sum_{i} P(c_i \mid \theta)\, P(x_j \mid c_i; \theta) + \sum_{x_j \in X_L} \log \left[ P(y_j = c_i \mid \theta)\, P(x_j \mid y_j = c_i; \theta) \right] \tag{9}
$$
3. Description of our approach

In this section, we describe our approach, shown in Algorithm 1, in two steps. First, the density-based parameters, namely the minimum number of data points MinPts and a radius Eps, are determined using both labeled and unlabeled data for each class. Next, local clusters are constructed by applying DBSCAN to the target dataset with the different parameter sets, and the final clustering result is obtained by integrating these local clusters.
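As a high-level illustration only, the second step might be sketched as follows; the per-class parameter pairs are assumed to come from the procedure of Section 3.1, scikit-learn's DBSCAN is used as a stand-in implementation, and the integration of the local clusters into the final partition is deliberately left out because it is described later in this section rather than here.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def build_local_clusters(X, param_sets):
    """Run DBSCAN once per (eps, min_pts) pair and collect the resulting
    local clusters as index arrays; param_sets is assumed to be the
    per-class output of the parameter-determination step (Section 3.1)."""
    local_clusters = []
    for eps, min_pts in param_sets:
        labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X)
        for k in set(labels) - {-1}:          # -1 marks noise in scikit-learn
            local_clusters.append(np.where(labels == k)[0])
    # Integrating these local clusters into the final partition is the
    # subject of the following subsections and is not reproduced here.
    return local_clusters
```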
3.1. Determination of density-based parameter sets
For density-based clustering approaches, different sets of parameters (the minimum number of data points MinPts and a radius Eps) can lead to different clustering results. We assume that each class of instances differs in size, shape and density; thus it is necessary to define an individual set of parameters for each cluster.