(a) to illustrate the transformation matrix learned by SPCA. Note that each row of the transformation matrix corresponds to an original feature, while each column corresponds to a dimension of the subspace. For a fixed subspace dimension, a feature with zero loading is not selected. For example, the first four features are not selected on the second dimension but are selected on the remaining subspace dimensions. Likewise, the eighth feature is not selected on the third dimension but is selected on the remaining subspace dimensions, and the sixth feature is not selected on the sixth dimension but is selected on the remaining subspace dimensions. Since the feature loadings across all subspace dimensions are not negligible, the matrix still cannot tell us which features are truly useless as a whole. That is, useless features cannot be jointly excluded by SPCA. Inspired by SPCA, we aim to learn a transformation matrix with row sparsity, as shown in the left subfigure of Fig. 1(a). In this way, the learned transformation matrix tells us that the third and seventh features are useless. This is the reason why we impose the $\ell_{2,1}$-norm on the transformation matrix.
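As a minimal illustration of joint feature selection through row sparsity (the matrix values and the threshold below are hypothetical and not taken from Fig. 1(a)), the row-wise $\ell_{2}$-norms of a transformation matrix directly reveal which features can be discarded as a whole:

```python
import numpy as np

# Hypothetical row-sparse transformation matrix Q (m = 5 features, d = 3 dimensions);
# the second and fourth rows are all zero, so those features carry no weight in any
# subspace dimension and can be jointly excluded.
Q = np.array([[0.8, -0.1,  0.3],
              [0.0,  0.0,  0.0],
              [0.2,  0.9, -0.4],
              [0.0,  0.0,  0.0],
              [0.5,  0.3,  0.7]])

row_norms = np.linalg.norm(Q, axis=1)     # l2-norm of each row (one value per feature)
l21_norm = row_norms.sum()                # ||Q||_{2,1} = sum of the row norms
selected = np.where(row_norms > 1e-8)[0]  # features whose rows are not (numerically) zero

print("row norms:", row_norms)
print("||Q||_{2,1}:", l21_norm)
print("jointly selected features (0-based):", selected)  # -> [0 2 4]
```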
On the other hand, since outliers frequently appear in real-world applications, we impose the $\ell_{2,1}$-norm on the loss term to enhance robustness to outliers. To test the robustness of JSPCA to outliers, we generate 200 points near a straight line together with 20 outliers, and then apply PCA and JSPCA to this data set, respectively. From Fig. 1(b), we can see that PCA is significantly affected by the outliers, whereas JSPCA is affected much less. This is the reason why we impose the $\ell_{2,1}$-norm on the loss term.
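The sensitivity of standard PCA in this toy setting can be sketched as follows (a rough reproduction of the setup described above with hypothetical noise levels and outlier positions; the JSPCA fit is omitted since the method is only formulated below):

```python
import numpy as np

rng = np.random.default_rng(0)

# 200 inliers near the line y = 0.5 * x, plus 20 outliers far below that line.
x = rng.uniform(-5, 5, 200)
inliers = np.column_stack([x, 0.5 * x + rng.normal(0, 0.2, 200)])
outliers = rng.uniform(5, 10, (20, 2)) * np.array([1, -1])
data = np.vstack([inliers, outliers])

def first_pc(points):
    """Leading principal direction of the centered point cloud."""
    centered = points - points.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]

print("direction without outliers:", first_pc(inliers))
print("direction with outliers:   ", first_pc(data))  # noticeably rotated by the outliers
```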
3.2. Objective function of JSPCA
Considering the outliers that appear in data sets and the joint selection of features, we propose the following optimization formulation:
$$
\arg\min_{Q,P} J(Q,P) = \arg\min_{Q,P} \left\| X - PQ^{T}X \right\|_{2,1} + \lambda \left\| Q \right\|_{2,1},
\tag{2}
$$
where the transformation matrix $Q \in \mathbb{R}^{m \times d}$ is first used to project the data matrix $X$ onto a low-dimensional subspace and another transformation matrix $P \in \mathbb{R}^{m \times d}$ is then used to recover the data matrix $X$. Here, we relax the orthogonality constraint on the transformation matrix $Q$, introduce the second transformation matrix $P$, and impose joint $\ell_{2,1}$-norms on both the loss term and the regularization term. In this way, JSPCA has more freedom to learn a low-dimensional subspace that flexibly approximates the high-dimensional data. The loss term $\left\| X - PQ^{T}X \right\|_{2,1}$ is not squared and hence enhances the robustness to outliers. The penalty term $\left\| Q \right\|_{2,1}$ penalizes all $d$ regression coefficients corresponding to a single feature as a whole, so our method is able to jointly select features. Moreover, the regularization term $\left\| Q \right\|_{2,1}$ is convex and can be easily optimized. The regularization parameter $\lambda \geq 0$ balances the loss term and the regularization term.
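A direct way to read Eq. (2) is to evaluate its objective numerically. The sketch below (with arbitrary random data and a hypothetical choice of $\lambda$) simply computes the two $\ell_{2,1}$ terms for given $X$, $P$, and $Q$:

```python
import numpy as np

def l21_norm(M):
    """||M||_{2,1}: the sum of the l2-norms of the rows of M."""
    return np.linalg.norm(M, axis=1).sum()

def jspca_objective(X, P, Q, lam):
    """Objective of Eq. (2): ||X - P Q^T X||_{2,1} + lam * ||Q||_{2,1}."""
    residual = X - P @ Q.T @ X
    return l21_norm(residual) + lam * l21_norm(Q)

# Toy sizes: m = 6 features, n = 50 samples, d = 2 subspace dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 50))
P = rng.normal(size=(6, 2))
Q = rng.normal(size=(6, 2))
print(jspca_objective(X, P, Q, lam=0.1))
```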
Directly solving Eq. (2) is difficult because both the loss term and the regularization term are non-smooth [1]. Applying some standard manipulations to Eq. (2), we have,
$$
\begin{aligned}
& \arg\min_{Q,P}\; \left\| X - PQ^{T}X \right\|_{2,1} + \lambda \left\| Q \right\|_{2,1} \\
= {} & \arg\min_{Q,P}\; 2\operatorname{tr}\!\left( (X - PQ^{T}X)^{T} D_{1} (X - PQ^{T}X) \right) + 2\lambda \operatorname{tr}\!\left( Q^{T} D_{2} Q \right) \\
= {} & \arg\min_{Q,P}\; \operatorname{tr}\!\left( (X - PQ^{T}X)^{T} D_{1}^{\frac{1}{2}} D_{1}^{\frac{1}{2}} (X - PQ^{T}X) \right) + \lambda \operatorname{tr}\!\left( \left( D_{2}^{\frac{1}{2}} Q \right)^{T} \left( D_{2}^{\frac{1}{2}} Q \right) \right) \\
= {} & \arg\min_{Q,P}\; \operatorname{tr}\!\left( \left( D_{1}^{\frac{1}{2}} (X - PQ^{T}X) \right)^{T} \left( D_{1}^{\frac{1}{2}} (X - PQ^{T}X) \right) \right) + \lambda \operatorname{tr}\!\left( \left( D_{2}^{\frac{1}{2}} Q \right)^{T} \left( D_{2}^{\frac{1}{2}} Q \right) \right) \\
= {} & \arg\min_{Q,P}\; \left\| D_{1}^{\frac{1}{2}} (X - PQ^{T}X) \right\|_{F}^{2} + \lambda \left\| D_{2}^{\frac{1}{2}} Q \right\|_{F}^{2},
\end{aligned}
\tag{3}
$$
where the constant factor 2 appearing in the second line is dropped in the subsequent lines, since a positive constant factor does not change the minimizer.
Hence, Eq. (2) becomes,
$$
\arg\min_{Q,P} J(Q,P) = \arg\min_{Q,P} \left\| D_{1}^{\frac{1}{2}} (X - PQ^{T}X) \right\|_{F}^{2} + \lambda \left\| D_{2}^{\frac{1}{2}} Q \right\|_{F}^{2},
\tag{4}
$$
where
$$
D_{1} = \begin{bmatrix}
\frac{1}{2\left\| (X - PQ^{T}X)^{1} \right\|_{2}} & & \\
& \ddots & \\
& & \frac{1}{2\left\| (X - PQ^{T}X)^{m} \right\|_{2}}
\end{bmatrix},
\tag{5}
$$
and
$$
D_{2} = \begin{bmatrix}
\frac{1}{2\left\| Q^{1} \right\|_{2}} & & \\
& \ddots & \\
& & \frac{1}{2\left\| Q^{m} \right\|_{2}}
\end{bmatrix},
\tag{6}
$$
are two $m \times m$ diagonal matrices. Note that $(X - PQ^{T}X)^{i}$ $(i = 1, 2, \ldots, m)$ denotes the $i$-th row of the matrix $X - PQ^{T}X$, and $Q^{i}$ $(i = 1, 2, \ldots, m)$ denotes the $i$-th row of the matrix $Q$. When $\left\| (X - PQ^{T}X)^{i} \right\|_{2} = 0$, we let $D_{1}^{ii} = \frac{1}{2\left\| (X - PQ^{T}X)^{i} \right\|_{2} + \zeta}$ ($\zeta$ is a very small constant). Similarly, when $\left\| Q^{i} \right\|_{2} = 0$, we let $D_{2}^{ii} = \frac{1}{2\left\| Q^{i} \right\|_{2} + \zeta}$. In this way, the smaller $D_{2}^{ii}$ is, the more important the $i$-th feature is. Moreover, if $(X - PQ^{T}X)^{i}$ and $Q^{i}$ are small, the corresponding diagonal entries of $D_{1}$ and $D_{2}$ are large, and thus the minimization of $2\operatorname{tr}\!\left( (X - PQ^{T}X)^{T} D_{1} (X - PQ^{T}X) \right) + 2\lambda \operatorname{tr}\!\left( Q^{T} D_{2} Q \right)$ in Eq. (3) tends to force $(X - PQ^{T}X)^{i}$ and $Q^{i}$ toward very small values. After several iterations, some $\left\| (X - PQ^{T}X)^{i} \right\|_{2}$ and $\left\| Q^{i} \right\|_{2}$ $(i = 1, 2, \ldots, m)$ may be close to zero, and thus we obtain a jointly sparse $Q$ and a small reconstruction loss.
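The reweighting step behind Eqs. (3)-(6) can be checked numerically. The sketch below (with an arbitrary random residual matrix and a hypothetical value of $\zeta$) verifies that $\operatorname{tr}(E^{T} D_{1} E) = \tfrac{1}{2}\left\| E \right\|_{2,1}$ when $D_{1}$ is built from the rows of $E$ as in Eq. (5):

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(size=(6, 50))   # stands for the residual X - P Q^T X
zeta = 1e-12                   # small constant guarding against zero rows, as in the text

row_norms = np.linalg.norm(E, axis=1)
D1 = np.diag(1.0 / (2.0 * row_norms + zeta))   # diagonal weights of Eq. (5)

lhs = np.trace(E.T @ D1 @ E)   # tr(E^T D1 E)
rhs = 0.5 * row_norms.sum()    # (1/2) * ||E||_{2,1}
print(np.isclose(lhs, rhs))    # -> True (up to the tiny effect of zeta)
```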
Next, let $\bar{P} = D_{1}^{\frac{1}{2}} P$ and $\bar{Q} = D_{1}^{-\frac{1}{2}} Q$. Then, the formulation in Eq. (4) can be rewritten as,
$$
\arg\min_{\bar{Q},\bar{P}} \left\| D_{1}^{\frac{1}{2}} X - \bar{P}\bar{Q}^{T} D_{1}^{\frac{1}{2}} X \right\|_{F}^{2} + \lambda \left\| D_{2}^{\frac{1}{2}} D_{1}^{\frac{1}{2}} \bar{Q} \right\|_{F}^{2}.
\tag{7}
$$
In order to reduce the feature redundancy, we impose the orthogonality constraint $\bar{P}^{T}\bar{P} = I_{d \times d}$ on Eq. (7). Then, we have,
for Eq. (7). Then, we have,
()
λ(
¯¯
)= −
¯¯
+
¯
¯¯
=
¯¯ ¯¯
×
8
JQ P D X PQ D X D D Q
PP I
arg min , arg min ,
s. t. ,
QP QP
T
FF
T
dd
,,
11
2
21
2
where $\bar{Q} \in \mathbb{R}^{m \times d}$ is first used to project the weighted data matrix $D_{1}^{\frac{1}{2}} X$ and $\bar{P} \in \mathbb{R}^{m \times d}$ is then used to recover it.
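As a sanity check on the change of variables $\bar{P} = D_{1}^{1/2} P$ and $\bar{Q} = D_{1}^{-1/2} Q$, the sketch below (random data and hypothetical dimensions, with $D_{1}$ and $D_{2}$ fixed to arbitrary positive diagonal matrices rather than computed from Eqs. (5) and (6)) confirms that the objectives of Eq. (4) and Eq. (7) coincide:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, d, lam = 6, 40, 2, 0.5
X = rng.normal(size=(m, n))
P = rng.normal(size=(m, d))
Q = rng.normal(size=(m, d))

# Arbitrary fixed positive diagonal weights (placeholders for Eqs. (5) and (6)).
D1 = np.diag(rng.uniform(0.5, 2.0, m))
D2 = np.diag(rng.uniform(0.5, 2.0, m))
D1_h, D2_h = np.sqrt(D1), np.sqrt(D2)   # D1^(1/2), D2^(1/2)

# Objective of Eq. (4) in the original variables P, Q.
obj4 = np.linalg.norm(D1_h @ (X - P @ Q.T @ X))**2 + lam * np.linalg.norm(D2_h @ Q)**2

# Change of variables and the objective of Eq. (7) in P_bar, Q_bar.
P_bar = D1_h @ P
Q_bar = np.linalg.inv(D1_h) @ Q
obj7 = (np.linalg.norm(D1_h @ X - P_bar @ Q_bar.T @ D1_h @ X)**2
        + lam * np.linalg.norm(D2_h @ D1_h @ Q_bar)**2)

print(np.isclose(obj4, obj7))  # -> True
```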
3.3. The optimal solution
The solution of Eq. (8) is obtained in the following two steps:
Step 1: Given $\bar{P}$, there exists an optimal matrix $\bar{P}_{\perp}$ such that $[\bar{P}, \bar{P}_{\perp}]$ is an $m \times m$ column-orthogonal matrix. Then, the optimization problem in Eq. (8) becomes,
$$
\arg\min_{\bar{Q}} \left\| D_{1}^{\frac{1}{2}} X - \bar{P}\bar{Q}^{T} D_{1}^{\frac{1}{2}} X \right\|_{F}^{2} + \lambda \left\| D_{2}^{\frac{1}{2}} D_{1}^{\frac{1}{2}} \bar{Q} \right\|_{F}^{2}.
\tag{9}
$$
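The role of $\bar{P}_{\perp}$ is that the full orthogonal matrix $[\bar{P}, \bar{P}_{\perp}]$ preserves the Frobenius norm, which underlies the rewriting of the first part of Eq. (9) below. A small numerical check of this invariance (random matrices with hypothetical sizes):

```python
import numpy as np

rng = np.random.default_rng(2)
m, d, n = 6, 2, 40

# Column-orthogonal P_bar and its orthogonal complement P_perp via a QR factorization.
full_q, _ = np.linalg.qr(rng.normal(size=(m, m)))
P_bar, P_perp = full_q[:, :d], full_q[:, d:]

E = rng.normal(size=(m, n))  # stands for D1^(1/2) X - P_bar Q_bar^T D1^(1/2) X

# Multiplying by [P_bar, P_perp]^T leaves the Frobenius norm unchanged and splits
# the loss into a P_bar part and a P_perp part.
lhs = np.linalg.norm(E)**2
rhs = np.linalg.norm(P_bar.T @ E)**2 + np.linalg.norm(P_perp.T @ E)**2
print(np.isclose(lhs, rhs))  # -> True
```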
The first part of Eq. (9) can be rewritten as,