well handle the original data that usually contains noise and corruptions, IRPCA computes a low-rank projection matrix $P$ that recovers the low-rank principal components by solving the following nuclear-norm minimization problem:
$$\min_{P,E}\ \|P\|_{*} + \lambda \|E\|_{1}, \quad \mathrm{s.t.}\ X = PX + E, \tag{1}$$
where $\|\cdot\|_{*}$ is the nuclear norm, $\|\cdot\|_{1}$ is the $L_{1}$-norm, $E = X - PX$ is the sparse error, and $\lambda$ is a positive factor depending on the noise level (Bao et al., 2012; Liu et al., 2013, 2010). Suppose $P^{*}$ is an optimal solution to the problem; then the principal components can be recovered as $P^{*}X$ by embedding the data into the underlying subspace.
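As an illustration of how problem (1) can be solved in practice, the following is a minimal Python/NumPy sketch of an inexact-ALM/ADMM scheme; the function names irpca, shrink, and svt, the auxiliary copy J of $P$, and the step-size schedule are our own choices for exposition, not necessarily the optimizer used in the cited works.

```python
import numpy as np

def shrink(M, tau):
    """Entry-wise soft-thresholding: proximal operator of the L1-norm."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0)

def svt(M, tau):
    """Singular value thresholding: proximal operator of the nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0)) @ Vt

def irpca(X, lam, mu=1.0, rho=1.1, n_iter=200, tol=1e-7):
    """Sketch of problem (1): min ||P||_* + lam * ||E||_1  s.t.  X = PX + E,
    solved with an auxiliary variable J = P and two Lagrange multipliers."""
    n = X.shape[0]
    P = np.zeros((n, n)); J = P.copy(); E = np.zeros_like(X)
    Y1 = np.zeros_like(X); Y2 = np.zeros_like(P)
    I = np.eye(n)
    for _ in range(n_iter):
        # nuclear-norm proximal step on the auxiliary variable
        J = svt(P + Y2 / mu, 1.0 / mu)
        # P-update: regularized least squares from the two quadratic penalties
        P = ((X - E + Y1 / mu) @ X.T + J - Y2 / mu) @ np.linalg.inv(X @ X.T + I)
        # L1 proximal step on the sparse error
        E = shrink(X - P @ X + Y1 / mu, lam / mu)
        # multiplier and penalty updates
        R = X - P @ X - E
        Y1 += mu * R
        Y2 += mu * (P - J)
        mu *= rho
        if max(np.abs(R).max(), np.abs(P - J).max()) < tol:
            break
    return P, E
```

The J- and E-updates are the standard proximal steps of the nuclear norm and the $L_{1}$-norm, while the P-update solves its quadratic subproblem in closed form.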
PCA-L1 improves PCA by replacing the original $L_{2}$-norm with the $L_{1}$-norm in measuring the scatter, so that the method is invariant to rotation and is also robust to noise and outliers. The optimization problem of PCA-L1 is defined as
$$\min_{P_{+}}\ \frac{1}{N}\sum_{i=1}^{N} \left\| P_{+}^{T}\left(x_{i} - \bar{x}\right) \right\|_{1}, \quad \mathrm{s.t.}\ P_{+} \in \mathbb{R}^{n\times d},\ P_{+}^{T}P_{+} = I, \tag{2}$$
where $x_{i}$ is the $i$th sample of $X$, $\bar{x}$ denotes the mean vector of all samples, and $P_{+}^{T}P_{+} = I$ is the orthogonality constraint.
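For concreteness, here is a minimal NumPy sketch of criterion (2); pca_l1_objective and orthonormalize are illustrative names of ours, and the QR retraction is simply one convenient way to produce a feasible $P_{+}$ satisfying $P_{+}^{T}P_{+} = I$.

```python
import numpy as np

def pca_l1_objective(P_plus, X):
    """Criterion (2): the mean L1-norm of the centered samples
    projected onto the d columns of P_plus."""
    x_bar = X.mean(axis=1, keepdims=True)      # mean vector of all samples
    proj = P_plus.T @ (X - x_bar)              # d x N centered projections
    return np.abs(proj).sum(axis=0).mean()     # (1/N) sum_i ||P+^T (x_i - x_bar)||_1

def orthonormalize(P):
    """Retract a candidate onto the constraint set P+^T P+ = I via QR."""
    Q, _ = np.linalg.qr(P)
    return Q

# usage: score a random orthonormal n x d candidate on toy data
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 100))             # n = 20 features, N = 100 samples
P_plus = orthonormalize(rng.standard_normal((20, 3)))
print(pca_l1_objective(P_plus, X))
```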
2.2. Robust LatLRR (rLatLRR) (Zhang & Lin et al., 2014)
rLatLRR is a robust formulation of LatLRR that improves LatLRR by performing the low-rank projection learning and low-rank coefficient coding in the noise-removed clean data space. Specifically, when the given data $X$ are noisy or corrupted, rLatLRR applies the de-noised data $X - E$ as the dictionary instead of the original data, leading to the following problem:
$$\min_{Z,P,E}\ \|Z\|_{*} + \|P\|_{*} + \lambda\|E\|_{1}, \quad \mathrm{s.t.}\ X - E = \left(X - E\right)Z + P\left(X - E\right), \tag{3}$$
where $Z$ is the low-rank coefficient matrix. The above rLatLRR model is clearly non-convex, so a two-stage strategy is adopted. That is, rLatLRR first applies RPCA (Candès et al., 2011; Wright & Peng et al., 2009) to separate $X$ into $X = A^{*} + E^{*}$. rLatLRR then obtains the solution by solving the following relaxed noiseless problem:
$$\min_{Z,P}\ \|Z\|_{*} + \|P\|_{*}, \quad \mathrm{s.t.}\ A^{*} = A^{*}Z + PA^{*}. \tag{4}$$
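A minimal sketch of the first stage follows: the standard inexact-ALM scheme for RPCA that splits $X$ into $A^{*} + E^{*}$. The function names and the default weight $\lambda = 1/\sqrt{\max(n, N)}$ (the choice suggested by Candès et al., 2011) are our assumptions; stage two would then solve the noiseless problem (4) on the recovered $A^{*}$.

```python
import numpy as np

def shrink(M, tau):
    """Entry-wise soft-thresholding (L1 proximal operator)."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0)

def svt(M, tau):
    """Singular value thresholding (nuclear-norm proximal operator)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0)) @ Vt

def rpca(X, lam=None, mu=1.0, rho=1.1, n_iter=200, tol=1e-7):
    """Stage one of rLatLRR: min ||A||_* + lam * ||E||_1  s.t.  X = A + E."""
    if lam is None:
        lam = 1.0 / np.sqrt(max(X.shape))  # default weight from Candes et al. (2011)
    A = np.zeros_like(X); E = np.zeros_like(X); Y = np.zeros_like(X)
    for _ in range(n_iter):
        A = svt(X - E + Y / mu, 1.0 / mu)      # low-rank part A*
        E = shrink(X - A + Y / mu, lam / mu)   # sparse error E*
        R = X - A - E
        Y += mu * R                            # multiplier update
        mu *= rho                              # penalty update
        if np.abs(R).max() < tol:
            break
    return A, E

# Stage two: solve the noiseless problem (4) on A* (e.g., with an ADMM
# scheme analogous to the IRPCA sketch above) to obtain Z and P.
```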
3. Robust auto-weighted low-rank and sparse representation
framework by joint feature embedding
3.1. Formulation
In this section, we mainly present the objective function of our RALSR. Given the data matrix $X \in \mathbb{R}^{n\times N}$, our RALSR aims at improving the data representation power in three ways. First, to encode the extracted features accurately, our new formulation imposes joint truncated-nuclear-norm-based low-rank and $L_{1}$-norm-based sparse constraints on the salient features and the underlying projection $P \in \mathbb{R}^{n\times n}$, inspired by I-LSPFC (Zhang et al., 2016). Second, to ensure that the learnt weights are optimal for data representation and to avoid the tricky issue of determining the optimal number of neighbors or the kernel width, our RALSR seamlessly integrates the low-rank and sparse coding with the adaptive weight learning into a unified framework. Moreover, the adaptive reconstruction weights are learnt by simultaneously minimizing the joint reconstruction errors based on the original data and the salient features $PX$:
$$J\left(W\right) = \|X - XW\|_{F}^{2} + \|PX - PXW\|_{F}^{2} + \|W\|_{1} = \left\| \begin{bmatrix} X \\ PX \end{bmatrix} - \begin{bmatrix} X \\ PX \end{bmatrix} W \right\|_{F}^{2} + \|W\|_{1}, \tag{5}$$
which is mainly added to encode the adaptive reconstruction weights in $W$ by simultaneously minimizing the reconstruction errors based on the given data $X$ and the salient features $PX$ of samples, i.e., $\|X - XW\|_{F}^{2} + \|PX - PXW\|_{F}^{2}$. Thus, the adaptive weights $W_{i,j}$ can measure the contribution of each sample $x_{j}$ to reconstructing $x_{i}$. In other words, $W_{i,j}$ can preserve the locality and reconstructive relationship between samples $x_{i}$ and $x_{j}$ in both the original input space and the projective feature subspace spanned by $P$: the larger $W_{i,j}$ is, the more similar $x_{i}$ and $x_{j}$ are. The $L_{1}$-norm is also imposed on the weight matrix $W$ so that the weights $W_{i,j}$ are sparse, with larger values denoting the neighbors of $x_{i}$ that play an important role in reconstructing $x_{i}$. As a result, the reconstruction error can be potentially minimized, and minimizing $J(W)$ can also enable the learnt projection $P$ to preserve the local neighborhood information of samples adaptively.
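To make the role of (5) concrete, here is a small NumPy sketch that evaluates $J(W)$ in both of its equivalent forms; the function names are ours, and the snippet is purely illustrative rather than part of the RALSR optimizer itself.

```python
import numpy as np

def J(W, X, P):
    """Objective (5): joint reconstruction error on X and on the salient
    features PX, plus an L1 sparsity penalty on the weights W."""
    PX = P @ X
    rec_X = np.linalg.norm(X - X @ W, 'fro') ** 2      # error in the input space
    rec_PX = np.linalg.norm(PX - PX @ W, 'fro') ** 2   # error in the feature space
    return rec_X + rec_PX + np.abs(W).sum()

def J_stacked(W, X, P):
    """Equivalent second form of (5) using the stacked matrix [X; PX]."""
    S = np.vstack([X, P @ X])
    return np.linalg.norm(S - S @ W, 'fro') ** 2 + np.abs(W).sum()
```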
Thus, based on the formulations of IRPCA and PCA-L1, the intermediate problem of RALSR is defined as
$$\min_{P,W,E}\ \|P\|_{r} + \alpha\|PX\|_{1} + \frac{\beta}{2}\left\| \begin{bmatrix} X \\ PX \end{bmatrix} - \begin{bmatrix} X \\ PX \end{bmatrix} W \right\|_{F}^{2} + \|W\|_{1} + \lambda\|E\|_{1}, \quad \mathrm{s.t.}\ X = PX + E, \tag{6}$$
where $\|P\|_{r}$ is the truncated nuclear norm (Hu et al., 2013), defined as the sum of the $\min\left(n, N\right) - r$ smallest singular values, that is, $\|P\|_{r} = \sum_{i=r+1}^{\min\left(n,N\right)} \sigma_{i}\left(P\right)$, where $r$ is a nonnegative integer and $r \le \min\left(n, N\right)$. Note that the motivation for the truncated nuclear norm is detailed in Han, Leung, Huang, and So (2017) and Hu et al. (2013): the values of the largest $r$ nonzero singular values do not affect the rank of the matrix, so the truncated nuclear norm mainly focuses on minimizing the sum of the smallest $\min\left(n, N\right) - r$ singular values. But since the truncated norm $\|P\|_{r}$ is generally non-convex, it is not easy to optimize directly. To address this issue, Hu et al. (2013) proved that the truncated nuclear norm $\|P\|_{r}$ and the standard nuclear norm $\|P\|_{*}$ have the following relationship:
$$\|P\|_{*} - \max_{U_{r}U_{r}^{T}=I,\ V_{r}V_{r}^{T}=I} \mathrm{Tr}\left(U_{r}PV_{r}^{T}\right) = \sum_{i=1}^{\min\left(n,N\right)} \sigma_{i}\left(P\right) - \sum_{i=1}^{r} \sigma_{i}\left(P\right) = \sum_{i=r+1}^{\min\left(n,N\right)} \sigma_{i}\left(P\right) = \|P\|_{r}, \tag{7}$$
$$\max_{U_{r}U_{r}^{T}=I,\ V_{r}V_{r}^{T}=I} \mathrm{Tr}\left(U_{r}PV_{r}^{T}\right) = \sum_{i=1}^{r} \sigma_{i}\left(P\right), \tag{8}$$
where $\|P\|_{*} = \sum_{i=1}^{\min\left(m,n\right)} \sigma_{i}\left(P\right)$ is the nuclear norm and $\sigma_{i}\left(P\right)$ is the $i$th largest singular value of $P$. Suppose that $U\Sigma V^{T}$ is the singular value decomposition of $P$; we have $U = \left(u_{1}, \ldots, u_{m}\right) \in \mathbb{R}^{m\times m}$ and $V = \left(v_{1}, \ldots, v_{n}\right) \in \mathbb{R}^{n\times n}$. Note that $U_{r} = \left(u_{1}, \ldots, u_{r}\right)^{T} \in \mathbb{R}^{r\times m}$ and $V_{r} = \left(v_{1}, \ldots, v_{r}\right)^{T} \in \mathbb{R}^{r\times n}$ contain the leading $r$ singular vectors. For more details about the truncated nuclear norm, we refer the reader to Hu et al. (2013).
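The truncated nuclear norm and the relations (7) and (8) are easy to verify numerically. The following NumPy sketch (with illustrative names and random data) computes $\|P\|_{r}$ from the singular values and checks that the maximum in (7) and (8) is attained by the leading singular vectors $U_{r}$ and $V_{r}$.

```python
import numpy as np

def truncated_nuclear_norm(P, r):
    """||P||_r: the sum of the min(m, n) - r smallest singular values of P."""
    s = np.linalg.svd(P, compute_uv=False)        # singular values, descending
    return s[r:].sum()

rng = np.random.default_rng(0)
P = rng.standard_normal((8, 10))
r = 3
U, s, Vt = np.linalg.svd(P, full_matrices=False)
Ur, Vr = U[:, :r].T, Vt[:r, :]                    # rows = leading singular vectors
# relation (8): Tr(U_r P V_r^T) equals the sum of the r largest singular values
assert np.isclose(np.trace(Ur @ P @ Vr.T), s[:r].sum())
# relation (7): ||P||_* - Tr(U_r P V_r^T) equals ||P||_r
assert np.isclose(s.sum() - np.trace(Ur @ P @ Vr.T), truncated_nuclear_norm(P, r))
```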
Clearly, an underlying projection $P$ can be learnt from problem (6) to extract low-rank and sparse features from data efficiently by embedding the given data via the learnt $P$, and the extracted features will hold the joint low-rank and sparse properties. But note that the above problem still obtains the underlying projection and adaptive weights from the original data, so the results may be rendered inaccurate by any noise contained in it. To handle this issue, we borrow the idea of rLatLRR (Zhang & Lin et al., 2014) and learn the projection $P$ from the noise-removed clean data $X - E$ instead of the original data $X$, which makes the learnt features more salient and robust to noise, where $E$ is the sparse error to be modeled. Note that we also aim at computing the adaptive weights based on the recovered clean data $X - E$ so that the similarity measure can be