information was embedded in the coefficient matrix to yield a constrained matrix factorization in Liu, Wu, Cai, and Huang (2012).
We can easily observe that most of the above methods do not employ the discriminant information of the data, which plays a critical role in data analysis. However, extracting such discriminant information in an unsupervised setting, as in nonnegative matrix factorization, is very challenging. This central problem is the focus of this work, and the details are presented in the following.
3. The proposed DON approach
In this section, we introduce our approach, i.e., Discriminative Orthogonal Nonnegative matrix factorization (DON). First, we derive the objective function from manifold discriminant learning. Second, we present the optimization framework for solving the objective function and the convergence proof of the multiplicative update algorithm. Finally, we give a brief analysis of the computational complexity of the proposed method. Before presenting our approach, we begin with a short review of NMF.
3.1. A review of NMF
Given $n$ data points with $m$ features, we denote the input data by the matrix $X \in \mathbb{R}^{m \times n}_{+}$. Here, the symbol $\mathbb{R}_{+}$ denotes the set of real matrices with nonnegative entries. This collection of data is expected to be categorized or partitioned into $c$ groups. Nonnegative matrix factorization (NMF) (Lee & Seung, 1999) aims to find two nonnegative matrices $B \in \mathbb{R}^{m \times c}_{+}$ and $V \in \mathbb{R}^{n \times c}_{+}$ whose product approximates the original data matrix as closely as possible, i.e.,

$X \approx B V^{T}, \qquad (1)$

where $B$ is treated as a basis matrix and $V$ is a coefficient matrix. In practical applications, the reduced dimension $c$ is generally much lower than the rank of $X$, i.e., $c \ll m$, $c \ll n$. Each row $v_{i}$ of the coefficient matrix is regarded as the low-dimensional representation of the data point $x_{i}$ under the new basis.
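As an illustrative aside (not part of the original paper), the factorization in Eq. (1) can be sketched with the standard multiplicative updates for the Frobenius-norm objective commonly associated with Lee and Seung; the function name nmf, the iteration count, and the random initialization below are our own choices, assuming NumPy is available.

```python
import numpy as np

def nmf(X, c, n_iter=200, eps=1e-10, seed=0):
    """Approximate X (m x n, nonnegative) by B V^T with B (m x c) and V (n x c),
    using multiplicative updates for the Frobenius-norm objective."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    B = rng.random((m, c))
    V = rng.random((n, c))
    for _ in range(n_iter):
        # B <- B * (X V) / (B V^T V)
        B *= (X @ V) / (B @ V.T @ V + eps)
        # V <- V * (X^T B) / (V B^T B)
        V *= (X.T @ B) / (V @ B.T @ B + eps)
    return B, V

# Toy example: 100 points with 50 features, reduced to c = 5 components.
X = np.random.default_rng(1).random((50, 100))
B, V = nmf(X, c=5)
print(np.linalg.norm(X - B @ V.T) / np.linalg.norm(X))  # relative reconstruction error
```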
3.2. Manifold discriminant learning
In this work, our goal is to obtain a good data representation
that preserves both the local geometrical structure and the global
discriminating information. Therefore, we propose to exploit manifold discriminant learning to achieve this goal. In particular, we explicitly employ local manifold learning to reflect the intrinsic geometry of the data distribution, and global discriminant learning to equip the new data representation with discriminating power. In the following, we describe these two components in turn.
3.2.1. Local manifold learning
In many real-world applications, data points often reside on a much lower-dimensional manifold, which has attracted much attention to manifold learning in recent years (Belkin & Niyogi, 2001). Generally, nearby data points are likely to have similar features and to be categorized or partitioned into the same group. It is often assumed that nearby data points will also be close in the new data space, which is called the ``manifold assumption'' (Belkin et al., 2006). Therefore, if we introduce this idea into nonnegative matrix factorization, i.e., local manifold learning, it is expected that nearby data points within a small neighborhood and their corresponding representations share a common intrinsic geometrical structure.
To capture this intrinsic data structure, we take advantage of
the graph-based manifold method, which is also adopted in other
NMF variants (Cai et al., 2011a, 2011b). The input data points are modeled as a graph with $n$ vertices, where an edge is established if one data point is among the $k$ nearest neighbors of the other. In particular, the weight matrix $W$ should be designed to reflect the local relationship, and there are several weighting schemes, such as binary weights, Gaussian weights, and dot-product weights (Belkin et al., 2006). Here, we adopt the Gaussian weights, i.e.,
$W_{ij} = \begin{cases} e^{-\|x_{i} - x_{j}\|^{2} / \sigma^{2}}, & \text{if } x_{i} \in N_{k}(x_{j}) \text{ or } x_{j} \in N_{k}(x_{i}), \\ 0, & \text{otherwise}, \end{cases} \qquad (2)$
where $\sigma$ is the bandwidth parameter and $N_{k}(x_{i})$ denotes the set of $k$ nearest neighbors of $x_{i}$. The graph Laplacian matrix is $L = D - W$, where $D_{ii} = \sum_{j} W_{ij}$, and $L$ is a discrete approximation to the Laplace–Beltrami operator on the manifold (Belkin & Niyogi, 2001).
Therefore, maximizing the smoothness of the new data representation essentially amounts to minimizing the gap between pairs of data points within a small neighborhood in the lower-dimensional data space (Cai et al., 2011b), i.e.,

$\min_{V} \; \operatorname{Tr}(V^{T} L V), \qquad (3)$

where $\operatorname{Tr}(\cdot)$ is the matrix trace operator. In this way, nearby data points are encouraged to be as close as possible in the new data space.
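For concreteness, a small Python sketch of the graph construction in Eq. (2) and the smoothness term in Eq. (3) could look like the following; it assumes NumPy, and the helper names gaussian_knn_graph and smoothness are illustrative rather than anything defined in the paper.

```python
import numpy as np

def gaussian_knn_graph(X, k=5, sigma=1.0):
    """Weight matrix of Eq. (2): W_ij = exp(-||x_i - x_j||^2 / sigma^2) if x_i is among
    the k nearest neighbors of x_j or vice versa, and 0 otherwise.
    X is m x n with one data point per column, matching the paper's convention."""
    n = X.shape[1]
    sq = np.sum(X ** 2, axis=0)
    D2 = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)   # pairwise squared distances
    np.fill_diagonal(D2, np.inf)                       # a point is not its own neighbor
    knn = np.argsort(D2, axis=1)[:, :k]                # k nearest neighbors of each point
    mask = np.zeros((n, n), dtype=bool)
    mask[np.repeat(np.arange(n), k), knn.ravel()] = True
    mask |= mask.T                                     # symmetric "or" condition of Eq. (2)
    np.fill_diagonal(D2, 0.0)
    return np.where(mask, np.exp(-np.maximum(D2, 0.0) / sigma ** 2), 0.0)

def smoothness(W, V):
    """Tr(V^T L V) with L = D - W, the local-manifold term minimized in Eq. (3)."""
    L = np.diag(W.sum(axis=1)) - W
    return np.trace(V.T @ L @ V)
```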
3.2.2. Global discriminant learning
In order to endow the learned data representation with discriminating power, we attempt to discover the global discriminant information embedded in the data space. Previous studies in Ye et al. (2007) and Yang et al. (2011) have shown that the discriminant relation can be disclosed by introducing a scaled indicator matrix and using the between-class scatter matrix and the total scatter matrix of the data.
To achieve this, we follow Ye et al. (2007) and denote the indicator matrix by $Y \in \{0, 1\}^{n \times c}$. Then, its scaled indicator matrix is expressed by

$F = Y (Y^{T} Y)^{-1/2}. \qquad (4)$
Each column of F is given by
$F_{j} = \big[\, 0, \ldots, 0, \overbrace{1, \ldots, 1}^{n_{j}}, 0, \ldots, 0 \,\big]^{T} \big/ \sqrt{n_{j}}, \qquad (5)$

where $n_{j}$ is the sample size of the $j$-th group $C_{j}$. In our approach, we expect the learned data representation $V$ to characterize the structure of $F$ in order to capture the discriminating ability and yield promising learning results in the low-dimensional data space $\mathbb{R}^{n \times c}_{+}$. Consequently, we use a very small constant $\varepsilon$ to control the degree to which the data representation $V$ approaches $F$, i.e., $\|V - F\|^{2} \leqslant \varepsilon$.
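A brief sketch of how the scaled indicator matrix of Eqs. (4) and (5) could be built from a label vector is given below; it assumes NumPy, and the helper name scaled_indicator is hypothetical and introduced only for illustration.

```python
import numpy as np

def scaled_indicator(labels, c=None):
    """Scaled indicator matrix F of Eqs. (4)-(5): F = Y (Y^T Y)^{-1/2}, where Y is the
    one-hot (n x c) indicator of the group assignments. Column j of F equals
    1/sqrt(n_j) for the n_j samples in group C_j and 0 elsewhere, so F^T F = I_c."""
    labels = np.asarray(labels)
    n = labels.shape[0]
    c = c if c is not None else labels.max() + 1
    Y = np.zeros((n, c))
    Y[np.arange(n), labels] = 1.0
    # Y^T Y is diagonal with the group sizes n_j, so its inverse square root is cheap.
    return Y / np.sqrt(Y.sum(axis=0))

labels = np.array([0, 0, 1, 2, 2, 2])
F = scaled_indicator(labels)
print(np.allclose(F.T @ F, np.eye(3)))   # True: columns of F are orthonormal
```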
Define a centering matrix $H = I_{n} - \frac{1}{n} \mathbf{1}_{n} \mathbf{1}_{n}^{T}$, where $\mathbf{1}_{n}$ is a column vector of all ones and $I_{n}$ is an identity matrix. Intuitively, we want to maximize the inter-distance among groups relative to the intra-distance within individual groups. Let $\hat{X} = X H$ be the centered data matrix; then the between-class scatter matrix $S_{b} = \hat{X} F F^{T} \hat{X}^{T}$ and the total scatter matrix $S_{t} = \hat{X} \hat{X}^{T}$ (Ye et al., 2007) should satisfy the formulation

$\max_{F} \; \operatorname{Tr}\big[(S_{t} + \mu I_{m})^{-1} S_{b}\big], \qquad (6)$
where the parameter $\mu = 10^{-10}$ is utilized to avoid the singularity problem. Since $\operatorname{Tr}(F^{T} H F) = c - 1$ is a constant, (6) is equivalent to (Yang et al., 2011)

$\min_{F} \; \operatorname{Tr}\big[F^{T}\big(H - \hat{X}^{T}(\hat{X}\hat{X}^{T} + \mu I_{m})^{-1}\hat{X}\big)F\big]. \qquad (7)$
As a result, we can obtain a data representation with discriminating power by minimizing the above objective.
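To make the criterion concrete, a possible NumPy sketch of evaluating the term in Eq. (7) for a given data matrix $X$ and indicator $F$ (or a relaxed representation $V$) is shown below; the default value of mu follows the text, while the function name and everything else are our own illustration rather than the authors' implementation.

```python
import numpy as np

def discriminant_objective(X, F, mu=1e-10):
    """Value of the criterion in Eq. (7): Tr[F^T (H - Xc^T (Xc Xc^T + mu I_m)^{-1} Xc) F],
    where H = I_n - (1/n) 1 1^T is the centering matrix and Xc = X H is the centered data.
    X is m x n (features x samples); F is the n x c scaled indicator (or a relaxation V)."""
    m, n = X.shape
    H = np.eye(n) - np.ones((n, n)) / n
    Xc = X @ H                                  # centered data matrix \hat{X}
    St = Xc @ Xc.T + mu * np.eye(m)             # regularized total scatter S_t + mu I_m
    M = H - Xc.T @ np.linalg.solve(St, Xc)      # H - \hat{X}^T (S_t + mu I_m)^{-1} \hat{X}
    return np.trace(F.T @ M @ F)
```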