x and y, we have
$$D(x\|y) = \sum_i x_i \log \frac{x_i}{y_i}. \quad (2)$$
It is easy to note that in most cases $D(x\|y) \neq D(y\|x)$, and that $D(x\|y) + D(y\|z) \geq D(x\|z)$ cannot be guaranteed. So $D$ is not a metric. If we let "dist" be $D$ in Eq. (1), we obtain the objective function of information-theoretic K-means clustering (Info-Kmeans) as follows:
$$O_1:\ \min_{\{c_k\}} \sum_k \sum_{x \in c_k} p_x\, D(x\|m_k), \quad (3)$$
where each instance $x$ is normalized to a discrete distribution, and $m_k = \sum_{x \in c_k} p_x x / \sum_{x \in c_k} p_x$ is the arithmetic mean of the instances assigned to cluster $c_k$. Let $|\cdot|$ denote the sum of the feature values of a vector. Since $|x| = 1$, we have $|m_k| = 1$, $\forall k$.
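To make the objective concrete, here is a minimal NumPy sketch of Eqs. (2) and (3), assuming strictly positive, normalized instance vectors and explicit instance weights $p_x$; the function names (kl_divergence, info_kmeans_objective) are illustrative choices, not part of the original formulation.

```python
import numpy as np

def kl_divergence(x, y):
    """D(x || y) between two discrete distributions, as in Eq. (2).
    Assumes all entries of x and y are strictly positive."""
    return np.sum(x * np.log(x / y))

def info_kmeans_objective(X, p, labels, K):
    """O_1 in Eq. (3): sum_k sum_{x in c_k} p_x D(x || m_k)."""
    total = 0.0
    for k in range(K):
        members, weights = X[labels == k], p[labels == k]
        if members.shape[0] == 0:
            continue  # empty clusters contribute nothing
        # m_k = (sum_{x in c_k} p_x x) / (sum_{x in c_k} p_x), so |m_k| = 1
        m_k = (weights[:, None] * members).sum(axis=0) / weights.sum()
        total += sum(w * kl_divergence(x, m_k) for w, x in zip(weights, members))
    return total

# Example: 4 normalized instances on 3 features, uniform weights, 2 clusters
X = np.array([[.2, .3, .5], [.1, .6, .3], [.4, .4, .2], [.3, .3, .4]])
p = np.full(4, 0.25)
print(info_kmeans_objective(X, p, np.array([0, 0, 1, 1]), K=2))
```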
It has been pointed out that Info-Kmeans actually aims to minimize the loss of mutual information between the instance variable and the feature variable after clustering [3]. This explains why Info-Kmeans belongs to the family of information-theoretic clustering methods.
2.2. The problem of Info-Kmeans
Though it has a clear physical meaning, Info-Kmeans has long been criticized for performing relatively poorly on high-dimensional sparse data [28]. In this section, however, we highlight an implementation challenge of Info-Kmeans. We believe this challenge is one of the major factors that degrade the performance of Info-Kmeans.
Assume that we use Info-Kmeans to cluster a text corpus. To this end, we must compute the KL-divergence between each text vector $x$ and each centroid $m_k$. In practice, we usually let $x = x/|x|$ and then compute $D(x\|m_k)$ by Eq. (2). Note that Eq. (2) implies that all the feature values of $x$ are positive real numbers. Unfortunately, this is not the case for high-dimensional data, which are usually very sparse in their feature space.
To illustrate this, we examine the computation of the KL-divergence in the $i$th dimension. Letting $D_i$ denote $x_i \log(x_i/m_{k,i})$, we have the following four scenarios:
1. Case 1: $x_i > 0$ and $m_{k,i} > 0$. In this case, the computation of $D_i$ is straightforward, and the result can be any real number.
2. Case 2: $x_i = 0$ and $m_{k,i} = 0$. In this case, we can simply omit this feature, or equivalently let $D_i = 0$.
3. Case 3: $x_i = 0$ and $m_{k,i} > 0$. In this case, $\log(x_i/m_{k,i}) = \log 0 = -\infty$, which implies that the direct computation is infeasible. However, by L'Hôpital's rule [1], $\lim_{x \to 0^{+}} x \log(x/a) = 0$ for $a > 0$. So we can let $x := x_i$ and $a := m_{k,i}$, and thus have $D_i = 0$.
4. Case 4: $x_i > 0$ and $m_{k,i} = 0$. In this case, $D_i = +\infty$, which is hard to handle in practice.
We summarize the four cases in Table 1. In general, for Cases 1 and 2, the computation of $D_i$ is logically reasonable. However, the computation of $D_i$ in Case 3 is somewhat problematic; it cannot reveal any difference between $x_i$ and $m_{k,i}$, although $m_{k,i}$ may deviate heavily from zero. Nevertheless, the most difficult case is Case 4. It leads to an infinite $D$ and hinders the instance from being properly assigned. This is particularly troublesome for high-dimensional sparse data, since the centroids of such data typically contain many zero-valued features. We call this problem the "zero-feature dilemma".
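The dilemma is easy to reproduce in code. The sketch below uses a hypothetical helper, kl_term_per_dim, to evaluate the per-dimension term $D_i$ under the four cases of Table 1; a single Case 4 feature is enough to make the whole divergence $D(x\|m_k)$ infinite.

```python
import numpy as np

def kl_term_per_dim(x_i, m_ki):
    """Per-dimension term D_i = x_i log(x_i / m_{k,i}) under the four cases."""
    if x_i > 0 and m_ki > 0:      # Case 1: finite, can be any real number
        return x_i * np.log(x_i / m_ki)
    if x_i == 0 and m_ki == 0:    # Case 2: feature omitted
        return 0.0
    if x_i == 0 and m_ki > 0:     # Case 3: lim_{x -> 0+} x log(x/a) = 0
        return 0.0
    return np.inf                 # Case 4: the zero-feature dilemma

for pair in [(0.2, 0.1), (0.0, 0.0), (0.0, 0.3), (0.2, 0.0)]:
    print(pair, '->', kl_term_per_dim(*pair))
# The last pair yields inf, so D(x || m_k) becomes infinite and the instance
# can no longer be meaningfully compared against that centroid.
```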
Problem definition: Design a new information-theoretic K-means algorithm that avoids the zero-feature dilemma and is particularly suitable for clustering high-dimensional sparse data.
3. The SAIL algorithm
In this section, we propose a new algorithm, called Summation-bAsed Incremental Learning (SAIL), for Info-Kmeans clustering.
3.1. SAIL: theoretical foundation
Let H(x) denote the Shannon entropy of a discrete
distribution x. We first have the following lemma:
Lemma 1.
$$D(x\|y) = -H(x) + H(y) + (x-y)^{T}\,\nabla H(y). \quad (4)$$
Proof. Since $H(y) = -\sum_{i=1}^{d} y_i \log y_i$, it is easy to see that
$$\nabla H(y) = -(\log y_1, \ldots, \log y_d)^{T} - \log e\,(1, \ldots, 1)^{T}.$$
Accordingly, we have
$$-H(x) + H(y) + (x-y)^{T}\nabla H(y) = \underbrace{\sum_{i=1}^{d} x_i \log\frac{x_i}{y_i}}_{(a)} - \log e \underbrace{\sum_{i=1}^{d}(x_i - y_i)}_{(b)}.$$
Since $(a) = D(x\|y)$ and $(b) = 0$ provided that $\sum_{i=1}^{d} x_i = \sum_{i=1}^{d} y_i = 1$, the lemma follows. $\square$
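As a sanity check, Lemma 1 can be verified numerically. The sketch below uses natural logarithms (so $\log e = 1$) and two randomly generated, strictly positive distributions; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(5); x /= x.sum()          # strictly positive distribution
y = rng.random(5); y /= y.sum()          # strictly positive distribution

H = lambda p: -np.sum(p * np.log(p))     # Shannon entropy (natural log)
grad_H = lambda p: -np.log(p) - 1.0      # gradient of H, with log e = 1

lhs = np.sum(x * np.log(x / y))               # D(x || y), Eq. (2)
rhs = -H(x) + H(y) + (x - y) @ grad_H(y)      # right-hand side of Eq. (4)
print(np.isclose(lhs, rhs))                   # expected: True
```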
Based on $D(x\|y)$ in Eq. (4), we now derive SAIL, a new variant of Info-Kmeans. Specifically, we have the following theorem:
Theorem 1. Given $p_{c_k} = \sum_{x \in c_k} p_x$, the objective function of Info-Kmeans $O_1$ in Eq. (3) is equivalent to
$$O_2:\ \min_{\{c_k\}} \sum_k p_{c_k}\, H(m_k). \quad (5)$$
Proof. By Eq. (4), we have $D(x\|m_k) = -H(x) + H(m_k) + (x-m_k)^{T}\nabla H(m_k)$. As a result,
$$\sum_k \sum_{x \in c_k} p_x\, D(x\|m_k) = \underbrace{\sum_k p_{c_k} H(m_k)}_{(a)} - \underbrace{\sum_x p_x H(x)}_{(b)}$$
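Note that term (b) does not depend on how the instances are assigned to clusters, so $O_1$ and $O_2$ differ only by a constant, which is what Theorem 1 asserts. The sketch below is a small numerical check of this decomposition under uniform instance weights and an arbitrary fixed assignment; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, K = 6, 4, 2
X = rng.random((n, d))
X /= X.sum(axis=1, keepdims=True)        # each instance is a distribution
p = np.full(n, 1.0 / n)                  # uniform instance weights p_x
labels = np.arange(n) % K                # an arbitrary fixed assignment

H = lambda v: -np.sum(v * np.log(v))     # Shannon entropy (natural log)

O1 = O2 = 0.0
for k in range(K):
    w, members = p[labels == k], X[labels == k]
    m_k = (w[:, None] * members).sum(axis=0) / w.sum()   # weighted centroid
    O1 += sum(wi * np.sum(xi * np.log(xi / m_k)) for wi, xi in zip(w, members))
    O2 += w.sum() * H(m_k)               # p_{c_k} H(m_k), term (a)

const = np.sum(p * np.array([H(xi) for xi in X]))        # term (b)
print(np.isclose(O1, O2 - const))                        # expected: True
```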
Table 1
Four cases in KL-divergence computation.

Case        i           ii     iii    iv
x_i         > 0         = 0    = 0    > 0
m_{k,i}     > 0         = 0    > 0    = 0
D_i         (−∞, +∞)    = 0    = 0    +∞