rates (the logarithm of |T|) under a given constraint on the average
distortion D, is given by the following function:
$$R(D) = \min_{\{p(t|x):\; E(d(x,t)) \le D\}} I(X;T), \qquad (1)$$
where I(X;T) is the mutual information between X and T, and E(d(x,t)) is the expected distortion induced by p(t|x).
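To make the quantities in (1) concrete, the following is a minimal numerical sketch (an assumed toy source and distortion, not from the original paper), which computes the rate I(X;T) and the average distortion E(d(x,t)) for one candidate encoder:

```python
import numpy as np

# Hypothetical toy problem: |X| = 4 source symbols, |T| = 2 codewords.
p_x = np.array([0.25, 0.25, 0.25, 0.25])      # assumed source marginal p(x)
p_t_given_x = np.array([[0.9, 0.1],           # a candidate stochastic encoder p(t|x)
                        [0.8, 0.2],
                        [0.2, 0.8],
                        [0.1, 0.9]])
d = np.array([[0.0, 1.0],                     # an arbitrary distortion matrix d(x, t)
              [0.0, 1.0],
              [1.0, 0.0],
              [1.0, 0.0]])

p_xt = p_x[:, None] * p_t_given_x             # joint p(x, t) = p(x) p(t|x)
p_t = p_xt.sum(axis=0)                        # marginal p(t)

# Rate: I(X;T) = sum_{x,t} p(x,t) log[ p(x,t) / (p(x) p(t)) ], in nats.
ratio = p_xt / (p_x[:, None] * p_t[None, :])
I_XT = float(np.sum(p_xt * np.log(ratio)))

# Average distortion: E(d(x,t)) = sum_{x,t} p(x,t) d(x,t).
E_d = float(np.sum(p_xt * d))

# R(D) minimizes I(X;T) over all p(t|x) whose E(d) does not exceed D.
print(f"I(X;T) = {I_XT:.4f} nats, E(d) = {E_d:.4f}")
```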
Unlike the rate-distortion approach, the IB method avoids the arbitrary choice of the distortion function d(x,t) (Tishby et al., 1999). The motivation comes from the fact that in many cases, defining the "target" variable Y with respect to X is a much easier task than defining a distortion function. Given the joint probability distribution p(x,y) on variables X and Y, the IB method considers
the following distortion function:

$$d(x,t) = D_{KL}\big(p(y|x)\,\|\,p(y|t)\big), \qquad (2)$$
where $D_{KL}(f \| g)$ is the Kullback–Leibler divergence between distributions f(·) and g(·). Notably, $p(y|t) = \sum_x p(y|x)\, p(x|t)$ is itself a function of p(t|x). Hence, the distortion function d(x,t) is not predetermined here; instead, as the method searches for the best representation p(t|x), it simultaneously searches for the most suitable distortion function d(x,t).
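To illustrate this dependence, one can compute p(y|t) explicitly for a candidate encoder and observe that it is fully determined by p(t|x). The following is a minimal sketch with an assumed toy p(x,y), not from the paper:

```python
import numpy as np

# Hypothetical toy joint distribution p(x, y) with |X| = 4, |Y| = 3.
p_xy = np.array([[0.10, 0.10, 0.05],
                 [0.15, 0.05, 0.05],
                 [0.05, 0.05, 0.15],
                 [0.05, 0.10, 0.10]])
p_x = p_xy.sum(axis=1)                    # marginal p(x)
p_y_given_x = p_xy / p_x[:, None]         # conditional p(y|x)

# A candidate representation p(t|x) with |T| = 2 clusters.
p_t_given_x = np.array([[0.9, 0.1],
                        [0.8, 0.2],
                        [0.2, 0.8],
                        [0.1, 0.9]])
p_t = p_t_given_x.T @ p_x                 # p(t) = sum_x p(t|x) p(x)

# Bayes' rule: p(x|t) = p(t|x) p(x) / p(t); rows index t, columns index x.
p_x_given_t = (p_t_given_x * p_x[:, None]).T / p_t[:, None]

# p(y|t) = sum_x p(y|x) p(x|t): any change to p(t|x) changes p(y|t),
# and hence the KL distortion of Eq. (2) itself.
p_y_given_t = p_x_given_t @ p_y_given_x
print(p_y_given_t)                        # shape (|T|, |Y|); each row sums to 1
```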
In the IB method, the expected distortion E(d(x,t)) can be written as

$$
\begin{aligned}
E(d(x,t)) &= E\left[ D_{KL}\big( p(y|x) \,\|\, p(y|t) \big) \right] \\
&= \sum_{x,t} p(x,t) \sum_{y} p(y|x) \log \frac{p(y|x)}{p(y|t)} \\
&= \sum_{x,t,y} p(x,t,y) \log \frac{p(y|x)}{p(y|t)} \\
&= I(X;Y) - I(T;Y). \qquad (3)
\end{aligned}
$$
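The identity in (3) is easy to verify numerically. The following sketch (with randomly generated toy distributions, not from the paper) computes the expected KL distortion directly and compares it against I(X;Y) − I(T;Y):

```python
import numpy as np

def mutual_information(p_ab):
    """I(A;B) of a joint distribution p(a, b), in nats."""
    p_a = p_ab.sum(axis=1, keepdims=True)
    p_b = p_ab.sum(axis=0, keepdims=True)
    mask = p_ab > 0
    return float(np.sum(p_ab[mask] * np.log((p_ab / (p_a * p_b))[mask])))

rng = np.random.default_rng(0)
p_xy = rng.random((4, 3))                            # random toy joint p(x, y)
p_xy /= p_xy.sum()
p_t_given_x = rng.random((4, 2))                     # random candidate encoder p(t|x)
p_t_given_x /= p_t_given_x.sum(axis=1, keepdims=True)

p_x = p_xy.sum(axis=1)
p_y_given_x = p_xy / p_x[:, None]
p_xt = p_x[:, None] * p_t_given_x                    # p(x, t), using the chain T <- X -> Y
p_ty = p_t_given_x.T @ p_xy                          # p(t, y) = sum_x p(t|x) p(x, y)
p_y_given_t = p_ty / p_ty.sum(axis=1, keepdims=True)

# Left-hand side: E[D_KL(p(y|x) || p(y|t))], averaged over p(x, t).
lhs = sum(p_xt[x, t] * np.sum(p_y_given_x[x] * np.log(p_y_given_x[x] / p_y_given_t[t]))
          for x in range(4) for t in range(2))

# Right-hand side: I(X;Y) - I(T;Y).
rhs = mutual_information(p_xy) - mutual_information(p_ty)
print(np.isclose(lhs, rhs))                          # True
```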
Substituting formula (3) into the rate-distortion function (1), we obtain:

$$R(D) = \min_{\{p(t|x):\; I(X;Y) - I(T;Y) \le D\}} I(X;T). \qquad (4)$$
As I(X;Y) is a constant, the rate-distortion function is usually written as

$$R(D) = \min_{\{p(t|x):\; I(T;Y) \ge D'\}} I(X;T). \qquad (5)$$
The equation shows that the IB method tries to minimize I(X;T) while ensuring that I(T;Y) is no less than a lower bound $D'$. In a sense, this objective function implements a "bottleneck" for the dependency between X and Y through T, i.e., one tries to squeeze the information which X provides about Y through a compact "bottleneck" formed by the compressed representation T. The objective of an IB algorithm is then to minimize $I(X;T) - \alpha I(T;Y)$, a compromise between the two mutual informations. Multiplying this by $-1/\alpha$ turns it into the dual objective of maximizing $I(T;Y) - \alpha^{-1} I(T;X) = I(T;Y) - \beta I(T;X)$, where $\beta = \alpha^{-1}$, and both $\alpha$ and $\beta$ are predefined positive coefficients.
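For concreteness, a candidate encoder can be scored under this dual objective as in the following sketch (mi and ib_objective are our own helper names, not from the paper):

```python
import numpy as np

def mi(p_ab):
    """Mutual information of a joint distribution, in nats."""
    p_a = p_ab.sum(axis=1, keepdims=True)
    p_b = p_ab.sum(axis=0, keepdims=True)
    mask = p_ab > 0
    return float(np.sum(p_ab[mask] * np.log((p_ab / (p_a * p_b))[mask])))

def ib_objective(p_xy, p_t_given_x, beta):
    """Dual IB functional I(T;Y) - beta * I(T;X) for a candidate encoder p(t|x)."""
    p_x = p_xy.sum(axis=1)
    p_tx = (p_x[:, None] * p_t_given_x).T    # joint p(t, x)
    p_ty = p_t_given_x.T @ p_xy              # joint p(t, y)
    return mi(p_ty) - beta * mi(p_tx)

# As beta -> 0 the score depends on I(T;Y) alone; larger beta penalizes
# encoders that retain more information about X than needed.
```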
2.2. The IB algorithms
When $\beta \to 0$, the IB objective reduces to maximizing I(T;Y), the mutual information between T and Y. In this situation, p(t|x)
will approach zero or one almost everywhere. Without any
assumption about the origin of the joint distribution p(x, y), Tishby
et al. showed that the IB problem has an exact optimal solution
(Tishby et al., 1999). However, how to construct optimal or approximate solutions remains an open problem. Several algorithms have been developed for the IB problem (see Slonim (2002) for a detailed review and comparison). Among them, the sequential IB
(sIB) algorithm and the agglomerative (aIB) algorithm have been
used widely (Slonim, 2002).
2.2.1. The aIB algorithm
The aIB algorithm implements a hierarchical clustering process: it starts with the trivial partition in which each element $x \in X$ represents a singleton cluster or component $t \in T$. To minimize the possible loss of mutual information I(T;Y), the aIB algorithm merges, at each step, the pair whose merging locally minimizes the loss of I(T;Y). Let $t_i$ and $t_j$ be two elements of T; the information loss, also called the merger cost, due to the merging of $t_i$ and $t_j$ is then defined as (Slonim, 2002):
$$d(t_i, t_j) = I(T_{\mathrm{before}}; Y) - I(T_{\mathrm{after}}; Y) \ge 0, \qquad (6)$$
where $I(T_{\mathrm{before}}; Y)$ and $I(T_{\mathrm{after}}; Y)$ are the mutual information between T and Y before and after $t_i$ and $t_j$ are merged. This has been further formulated as (Slonim, 2002):
$$d(t_i, t_j) = \big(p(t_i) + p(t_j)\big)\, \bar{d}(t_i, t_j), \qquad (7)$$
where $\bar{d}(t_i, t_j) = JS_{\Pi}\big[p(y|t_i), p(y|t_j)\big] - \beta\, JS_{\Pi}\big[p(x|t_i), p(x|t_j)\big]$, $JS_{\Pi}[p, q]$ is the Jensen–Shannon divergence between distributions p(·) and q(·), and $\Pi = \left\{\frac{p(t_i)}{p(t_i)+p(t_j)},\ \frac{p(t_j)}{p(t_i)+p(t_j)}\right\}$.
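Translating Eqs. (6) and (7) into code, a minimal sketch might look as follows (the names kl, js, and merger_cost are our own, and the cluster priors and conditionals are assumed to be precomputed):

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence D_KL(p || q), in nats."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js(pi, p, q):
    """Jensen-Shannon divergence of p and q with weights pi = (pi_i, pi_j)."""
    mix = pi[0] * p + pi[1] * q
    return pi[0] * kl(p, mix) + pi[1] * kl(q, mix)

def merger_cost(p_ti, p_tj, py_ti, py_tj, px_ti, px_tj, beta):
    """aIB merger cost of Eq. (7): (p(t_i) + p(t_j)) * d_bar(t_i, t_j)."""
    w = p_ti + p_tj
    pi = (p_ti / w, p_tj / w)                           # the weights Pi
    d_bar = js(pi, py_ti, py_tj) - beta * js(pi, px_ti, px_tj)
    return w * d_bar

# At each aIB step, the pair (t_i, t_j) with the smallest merger_cost is
# merged, which locally minimizes the loss of I(T;Y).
```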
The aIB algorithm is the only available IB algorithm that can generate hierarchical results. In terms of accuracy, however, it is not as good as the sequential IB (sIB) algorithm (Slonim, 2002).
2.2.2. The sIB algorithm
The sIB algorithm implements a partitioning-based clustering process. It starts from a random partition of X into the representation T. At each step, one element $x \in X$ is "drawn" from its current cluster t and merged into $t_{\mathrm{new}} = \arg\min_{t \in T} d(\{x\}, t)$, yielding a new partition. When $t \ne t_{\mathrm{new}}$, the mutual information I(T;Y) increases. This procedure continues until no more assignment updates can further improve I(T;Y).
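One sweep of this procedure might be sketched as follows (our own simplified form in the limit β → 0, where the singleton merger cost d({x}, t) of Eq. (7) reduces to a weighted Jensen–Shannon divergence between p(y|{x}) and p(y|t); sib_sweep and _kl are hypothetical names):

```python
import numpy as np

def _kl(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def sib_sweep(p_xy, assign, n_clusters, rng):
    """One sIB sweep (a sketch): draw each x from its cluster and re-insert it
    into the cluster t that minimizes the singleton merger cost d({x}, t)."""
    p_x = p_xy.sum(axis=1)
    changed = False
    for x in rng.permutation(p_xy.shape[0]):
        old_t, assign[x] = assign[x], -1              # "draw" x from its cluster
        py_x = p_xy[x] / p_x[x]                       # p(y | {x})
        costs = np.zeros(n_clusters)                  # merging into an empty cluster is free
        for t in range(n_clusters):
            members = np.flatnonzero(assign == t)
            if members.size == 0:
                continue
            p_t = p_x[members].sum()
            py_t = p_xy[members].sum(axis=0) / p_t    # p(y | t)
            w = p_x[x] + p_t
            pi_x, pi_t = p_x[x] / w, p_t / w
            mix = pi_x * py_x + pi_t * py_t           # Pi-weighted JS midpoint
            costs[t] = w * (pi_x * _kl(py_x, mix) + pi_t * _kl(py_t, mix))
        assign[x] = int(np.argmin(costs))
        changed |= (assign[x] != old_t)
    return changed                                    # sweep until no reassignment occurs
```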
The major weakness of the sIB algorithm is that it is a local optimization procedure. Even though it is proven to converge to a locally stable solution, this solution may not be the globally optimal one.
2.2.3. The Multi-scale Ncuts algorithm
The Normalized-cuts algorithm was proposed by Shi and Malik (2000) and was later extended into a multi-scale framework (Multi-scale Ncuts) (Cour and Benezit, 2005). Multi-scale Ncuts also utilizes neighborhood information. Though the term "neighborhood" is the same as the one used in our work, there are some differences:
Density and Neighborhood: In our work, we focus on density. We assume that data points in the same cluster should have similar density values, so data points that are dense enough should be assigned to the same cluster. To find those data points, we use the neighborhood concept to identify the regions where the data are dense enough. On the other hand, Multi-scale Ncuts emphasizes the neighborhood and shows that the data points (in the spectral domain) within a neighborhood can be compressed into a representative.
Details: In Multi-scale Ncuts, the authors first take the R-neighborhoods $N_i$, $N_j$ of a pair of pixels i, j, and then measure the variance of the affinities between every two pixels $i' \in N_i$, $j' \in N_j$. The affinity variance across a small neighborhood decreases quickly, which implies that the pixels in a neighborhood can be condensed into a representative pixel. In contrast, in our work, we first found