Section IV presents the experimental results with some discussions. Finally, a conclusion is drawn in Section V.
II. OVERVIEW OF RELATED WORK
This section briefly reviews the k-means-type competitive learning methods, the imbalanced classification methods, the nonlinear clustering methods, and the imbalanced data clustering methods.
A. k-Means-Type Competitive Learning
Adaptive k-means is the simplest competitive learning method [8], [23]. Suppose the data point arriving at time $t$ is $x_t$ and there are $K$ seed points $m_1, m_2, \ldots, m_K$ representing the centroids of the $K$ clusters. Accordingly, the values of the $m_j$'s at time $t$ are denoted as $m_1(t), \ldots, m_K(t)$. For simplicity, we will hereinafter use $m_j$ and $m_j(t)$ interchangeably without further distinction. When $x_t$ arrives, the winner seed point is selected by the following indicator function:
$$I_{j,x_t}=\begin{cases}1 & \text{if } j=c=\arg\min_{1\le i\le K}\|m_i-x_t\|^2\\[2pt] 0 & \text{otherwise}\end{cases}\tag{1}$$
where the $c$th seed point, that is, the winner, has the minimum distance to $x_t$. Then, the winner seed point is updated by moving it toward $x_t$, controlled by a small learning rate $\alpha_c$:
$$m_j(t+1)=\begin{cases}m_j(t)+\alpha_c\left(x_t-m_j(t)\right) & \text{if } I_{j,x_t}=1\\[2pt] m_j(t) & \text{otherwise}\end{cases}\tag{2}$$
for $j = 1, \ldots, K$.
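For concreteness, the following is a minimal NumPy sketch of one adaptive k-means step, combining the winner selection (1) with the update rule (2). The function name, array layout, and default learning rate are illustrative assumptions, not from the original paper.

```python
import numpy as np

def adaptive_kmeans_step(x_t, m, alpha_c=0.01):
    """One online step: pick the winner by (1), move it by (2).

    x_t : (d,) incoming data point
    m   : (K, d) seed points m_1, ..., m_K (updated in place)
    """
    # Winner c = arg min_i ||m_i - x_t||^2, as in (1).
    c = int(np.argmin(np.sum((m - x_t) ** 2, axis=1)))
    # Move only the winner toward x_t with learning rate alpha_c, as in (2);
    # all other seed points are left unchanged.
    m[c] += alpha_c * (x_t - m[c])
    return c
```

Iterating this step over a shuffled dataset for several epochs yields the online clustering behavior described above.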
To overcome the dead-unit problem that some seed points may never win, FSCL [13] uses a frequency-weighted distance to determine the winner indicator:
$$I_{j,x_t}=\begin{cases}1 & \text{if } j=c=\arg\min_{1\le i\le K}\gamma_i\|m_i-x_t\|^2\\[2pt] 0 & \text{otherwise}\end{cases}\tag{3}$$
where $\gamma_i = n_i / \sum_{l=1}^{K} n_l$ is the frequency weight and $n_i$ is the cumulative number of winning times of $m_i$. The seed-point update remains the same as in (2). By adopting this frequency weight, a seed point that has rarely won in the past gains more chance to win in the future.
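To illustrate how the frequency weight changes the competition, the sketch below extends the previous step with the FSCL winner rule (3); the winning counts are assumed to be initialized to ones so that the weights are well defined (again, names and defaults are our own).

```python
import numpy as np

def fscl_step(x_t, m, n, alpha_c=0.01):
    """One FSCL step: frequency-weighted winner selection (3),
    followed by the same winner update as in (2).

    n : (K,) cumulative winning counts, initialized to ones
    """
    gamma = n / n.sum()                      # frequency weights gamma_i
    d2 = np.sum((m - x_t) ** 2, axis=1)      # squared distances ||m_i - x_t||^2
    c = int(np.argmin(gamma * d2))           # winner by (3)
    m[c] += alpha_c * (x_t - m[c])           # update the winner as in (2)
    n[c] += 1                                # frequent winners get handicapped
    return c
```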
Furthermore, RPCL [14] improves FSCL by utilizing a rival penalization mechanism so that the number of clusters can be determined automatically. In addition to updating the winner seed point, RPCL updates the rival seed point in the opposite direction. The rival seed point is selected as the second closest one to $x_t$:
$$I_{j,x_t}=\begin{cases}1 & \text{if } j=c=\arg\min_{1\le l\le K}\gamma_l\|m_l-x_t\|^2\\[2pt] -1 & \text{if } j=r=\arg\min_{1\le l\le K,\, l\ne c}\gamma_l\|m_l-x_t\|^2\\[2pt] 0 & \text{otherwise}\end{cases}\tag{4}$$
and the seed points are updated by
$$m_j(t+1)=\begin{cases}m_j(t)+\alpha_c\left(x_t-m_j(t)\right) & \text{if } I_{j,x_t}=1\\[2pt] m_j(t)-\alpha_r\left(x_t-m_j(t)\right) & \text{if } I_{j,x_t}=-1\\[2pt] m_j(t) & \text{otherwise}\end{cases}\tag{5}$$
for $j = 1, \ldots, K$, where $\alpha_r$ is the delearning rate for the rival, which is generally smaller than the learning rate $\alpha_c$. By driving the rival seed point away, each cluster will not be shared by two or more seed points. Therefore, the number of clusters can be determined automatically by counting the remaining seed points [14]. RPCCL [16] further improves RPCL by making the rival penalization self-adaptive. According to the positions of the incoming data point, the winner seed point, and the rival seed point, RPCCL determines
$$\alpha_r=\alpha_c\,\frac{\min\left(\|m_r-m_c\|,\ \|x_t-m_c\|\right)}{\|m_r-m_c\|}.\tag{6}$$
Thus, RPCCL only needs one parameter, that is, the learning rate $\alpha_c$. The rival penalization is reduced if $x_t$ is closer to $m_c$ than to $m_r$. This overcomes the drawback of RPCL that an appropriate value of $\alpha_r$ is hard to choose.
Further, rival penalized expectation–maximization (RPEM) makes the clustering components in a density mixture compete with each other, and the rivals are intrinsically penalized with a dynamic control during learning [20]. Thus, the number of clustering components, that is, the number of clusters, can be determined automatically as the redundant densities gradually fade out of the mixture. Ma and Wang [17] used a cost-function approach to solve the convergence problem of RPCL. Based on theoretical analysis, they proposed distance-sensitive RPCL (DSRPCL), which minimizes a specifically designed cost function to make RPCL theoretically sound. Competitive repetition-suppression (CoRe) [18] is inspired by a biological phenomenon. It improves RPCL by allowing multiple winners to exist in each clustering iteration. It uses a gradient-descent strategy to update the position and spread of each seed point in terms of a Gaussian function. For the datasets
that are not linearly separable, stochastic competitive learn-
ing (SCL) [24] and graph-based multiprototype competitive
learning (GMPCL) [25] utilize kNN to construct the neigh-
borhood graph before carrying out competitive learning. SCL
is a stochastic competitive learning model. The seed points try
to occupy the nodes in the network by random walking and
defending their territory from rival seed points at the same
time. Finally, the dominance of each node is determined by
the visiting frequency of the seed points. GMPCL first selects
a portion of data points as the core points, according to their
connectivity in the graph, to produce coarse clusters. Then, it
applies affinity propagation and competitive learning to refine
the coarse clusters on all data points. Moreover, competitive learning has also been integrated with cooperative learning [26], [27]. The winner seed point is assigned a confidence coefficient based on its past winning frequency. A winner seed point with a high confidence coefficient will cooperate more with, and penalize less, the nearby seed points. Finally, the seed points in the same cluster merge together, so that the number of clusters is determined. Its kernel version can also handle nonlinearly separable data. However, none of the above-mentioned methods consider the situation of imbalanced data clustering.
B. Nonlinear Clustering
Some advanced k-means-type clustering meth-
ods [22], [25], [28] can produce nonspherical clusters
by merging subclusters. In contrast, the nonlinear cluster-
ing methods can directly generate clusters with arbitrary