Section IV presents the experimental results with some discussions. Finally, a conclusion is drawn in Section V.
II. OVERVIEW OF RELATED WORK
This section briefly reviews the k-means-type competitive learning methods, the imbalanced classification methods, the nonlinear clustering methods, and the imbalanced data clustering methods.
A. k-Means-Type Competitive Learning
Adaptive k-means is the simplest competitive learning method [8], [23]. Suppose the data point arriving at time $t$ is $x_t$ and there are $K$ seed points $m_1, m_2, \ldots, m_K$ representing the centroids of the $K$ clusters. Accordingly, the values of the $m_j$'s at time $t$ are denoted as $m_1(t), \ldots, m_K(t)$. For simplicity, we will hereinafter use $m_j$ and $m_j(t)$ interchangeably without further distinction. When $x_t$ arrives, the winner seed point is selected by the following indicator function:
$$I_{j,x_t}=\begin{cases}1 & \text{if } j=c=\arg\min_{1\le i\le K}\|m_i-x_t\|^2\\[2pt] 0 & \text{otherwise}\end{cases}\tag{1}$$
where the $c$th seed point, that is, the winner, has the minimum distance to $x_t$. Then, the winner seed point is updated by moving it toward $x_t$, controlled by a small learning rate $\alpha_c$:
$$m_j(t+1)=\begin{cases}m_j(t)+\alpha_c\left(x_t-m_j(t)\right) & \text{if } I_{j,x_t}=1\\[2pt] m_j(t) & \text{otherwise}\end{cases}\tag{2}$$
for $j = 1, \ldots, K$.
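For concreteness, the following is a minimal NumPy sketch of one adaptive k-means step, combining the winner selection (1) with the update rule (2). The function name, array layout, and default learning rate are illustrative assumptions, not from the original paper.

```python
import numpy as np

def adaptive_kmeans_step(x_t, m, alpha_c=0.01):
    """One online step: pick the winner by (1), move it by (2).

    x_t : (d,) incoming data point
    m   : (K, d) seed points m_1, ..., m_K (updated in place)
    """
    # Winner c = arg min_i ||m_i - x_t||^2, as in (1).
    c = int(np.argmin(np.sum((m - x_t) ** 2, axis=1)))
    # Move only the winner toward x_t with learning rate alpha_c, as in (2);
    # all other seed points are left unchanged.
    m[c] += alpha_c * (x_t - m[c])
    return c
```

Iterating this step over a shuffled dataset for several epochs yields the online clustering behavior described above.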
To overcome the dead-unit problem that some seed points may never win, FSCL [13] uses a frequency-weighted distance to determine the winner indicator:
$$I_{j,x_t}=\begin{cases}1 & \text{if } j=c=\arg\min_{1\le i\le K}\gamma_i\|m_i-x_t\|^2\\[2pt] 0 & \text{otherwise}\end{cases}\tag{3}$$
where $\gamma_i = n_i / \sum_{l=1}^{K} n_l$ is the frequency weight and $n_i$ is the cumulative number of winning times of $m_i$. The seed-point update remains the same as in (2). By adopting this frequency weight, a seed point that has rarely won in the past gains more chance to win in the future.
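To illustrate how the frequency weight changes the competition, the sketch below extends the previous step with the FSCL winner rule (3); the winning counts are assumed to be initialized to ones so that the weights are well defined (again, names and defaults are our own).

```python
import numpy as np

def fscl_step(x_t, m, n, alpha_c=0.01):
    """One FSCL step: frequency-weighted winner selection (3),
    followed by the same winner update as in (2).

    n : (K,) cumulative winning counts, initialized to ones
    """
    gamma = n / n.sum()                      # frequency weights gamma_i
    d2 = np.sum((m - x_t) ** 2, axis=1)      # squared distances ||m_i - x_t||^2
    c = int(np.argmin(gamma * d2))           # winner by (3)
    m[c] += alpha_c * (x_t - m[c])           # update the winner as in (2)
    n[c] += 1                                # frequent winners get handicapped
    return c
```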
Furthermore, RPCL [14] improves FSCL by utilizing a rival penalization mechanism so that the number of clusters can be determined automatically. In addition to updating the winner seed point, RPCL updates the rival seed point in the opposite direction. The rival seed point is selected as the second closest one to $x_t$:
$$I_{j,x_t}=\begin{cases}1 & \text{if } j=c=\arg\min_{1\le l\le K}\gamma_l\|m_l-x_t\|^2\\[2pt] -1 & \text{if } j=r=\arg\min_{1\le l\le K,\, l\ne c}\gamma_l\|m_l-x_t\|^2\\[2pt] 0 & \text{otherwise}\end{cases}\tag{4}$$
and the seed points are updated by
$$m_j(t+1)=\begin{cases}m_j(t)+\alpha_c\left(x_t-m_j(t)\right) & \text{if } I_{j,x_t}=1\\[2pt] m_j(t)-\alpha_r\left(x_t-m_j(t)\right) & \text{if } I_{j,x_t}=-1\\[2pt] m_j(t) & \text{otherwise}\end{cases}\tag{5}$$
for $j = 1, \ldots, K$, where $\alpha_r$ is the delearning rate for the rival, which is generally smaller than the learning rate $\alpha_c$. By driving the rival seed point away, each cluster will not be shared by two or more seed points. Therefore, the number of clusters can be determined automatically by counting the remaining seed points [14]. RPCCL [16] further improves RPCL by making the rival penalization self-adaptive. According to the positions of the incoming data point, the winner seed point, and the rival seed point, RPCCL determines
$$\alpha_r=\alpha_c\,\frac{\min\left(\|m_r-m_c\|,\ \|x_t-m_c\|\right)}{\|m_r-m_c\|}.\tag{6}$$
Thus, RPCCL only needs one parameter, that is, the learning rate $\alpha_c$. The rival penalization is reduced if $x_t$ is closer to $m_c$ than to $m_r$. This overcomes the drawback of RPCL that an appropriate value of $\alpha_r$ is hard to choose.
Further, rival penalized expectation–maximization (RPEM) makes the clustering components in a density mixture compete with each other, and the rivals are intrinsically penalized with a dynamic control during learning [20]. Thus, the number of clustering components, that is, the number of clusters, can be determined automatically as the redundant densities gradually fade out of the mixture. Ma and Wang [17] used a cost-function approach to solve the convergence problem of RPCL. Based on theoretical analysis, they proposed distance-sensitive RPCL (DSRPCL), which minimizes a specifically designed cost function to make RPCL theoretically sound. Competitive repetition-suppression (CoRe) [18] is inspired by a biological phenomenon. It improves RPCL by allowing multiple winners to exist in each clustering iteration. It uses a gradient-descent strategy to update the position and spread of each seed point in terms of a Gaussian function. For the datasets
that are not linearly separable, stochastic competitive learn-
ing (SCL) [24] and graph-based multiprototype competitive
learning (GMPCL) [25] utilize kNN to construct the neigh-
borhood graph before carrying out competitive learning. SCL
is a stochastic competitive learning model. The seed points try
to occupy the nodes in the network by random walking and
defending their territory from rival seed points at the same
time. Finally, the dominance of each node is determined by
the visiting frequency of the seed points. GMPCL first selects
a portion of data points as the core points, according to their
connectivity in the graph, to produce coarse clusters. Then, it
applies affinity propagation and competitive learning to refine
the coarse clusters on all data points. Moreover, competitive learning has also been integrated with cooperative learning [26], [27]. The winner seed point is assigned a confidence coefficient based on its past winning frequency. A winner seed point with a high confidence coefficient will cooperate more with, and penalize less, the nearby seed points. Finally, the seed points in the same cluster merge together, so that the number of clusters is determined. Its kernel version can also handle nonlinearly separable data. However, none of the above-mentioned methods consider the situation of imbalanced data clustering.
B. Nonlinear Clustering
Some advanced k-means-type clustering meth-
ods [22], [25], [28] can produce nonspherical clusters
by merging subclusters. In contrast, the nonlinear cluster-
ing methods can directly generate clusters with arbitrary