to be predicted. Given the whole set of examples and partial label information, we hope to complete $Y_U$.
In the following, we first model multi-label learning as a matrix completion problem, then enhance it with manifold regularization to better exploit the intrinsic manifold structure of the data, and finally describe the optimization process with ADMM.
3.1. Matrix completion for multi-label learning
We first introduce an intermediate instance matrix $(X_L^0, X_U^0)$, such that the observed instance matrix $(X_L, X_U)$ is sampled from $(X_L^0, X_U^0)$ with i.i.d. Gaussian noise: $(X_L, X_U) = (X_L^0 + \epsilon, X_U^0 + \epsilon)$, where $\epsilon_{ij} \sim \mathcal{N}(0, \sigma^2)$. Correspondingly, the intermediate label matrix $(Y_L^0, Y_U^0) \in \mathbb{R}^{t \times n}$ can be represented as a linear combination of the instance matrix under a weight matrix $W \in \mathbb{R}^{t \times d}$: $(Y_L^0, Y_U^0) = W(X_L^0, X_U^0)$.
The joint matrix is denoted by $Z = (Y_L^0, Y_U^0; X_L^0, X_U^0)$. Given the linear projection $W$, the ranks of these two matrices $(X_L^0, X_U^0)$ and $Z$ satisfy (Goldberg et al., 2010)

\[
\mathrm{rank}(Z) \le \mathrm{rank}\big((X_L^0, X_U^0)\big) + 1. \quad (1)
\]
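One way to see the bound: every row of $Y^0 = W X^0$ lies in the row space of $X^0$, and the ''+1'' accommodates the all-ones bias row enforced by the constraint $z_m = \mathbf{1}^\top$ introduced later in this subsection. A small numerical check, given as a sketch in numpy (the dimensions, the true rank $r$, and the variable names are illustrative only):

import numpy as np

rng = np.random.default_rng(0)
d, t, n, r = 20, 5, 100, 4                     # feature dim, label dim, samples, true rank

X0 = rng.standard_normal((d, r)) @ rng.standard_normal((r, n))   # low-rank noiseless features
W = rng.standard_normal((t, d))                                  # linear label map
Y0 = W @ X0                                                      # noiseless soft labels (t x n)

# Joint matrix Z = (Y0; X0), plus the all-ones row implied by the
# constraint z_m = 1^T; that row raises the rank by at most one.
Z = np.vstack([Y0, X0, np.ones((1, n))])

assert np.linalg.matrix_rank(Z) <= np.linalg.matrix_rank(X0) + 1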
Further, we construct an observed matrix $M$ partitioned into four blocks $M_x^L$, $M_x^U$, $M_y^L$, and $M_y^U$ as follows:

\[
M = \begin{pmatrix} Y_L & Y_U \\ X_L & X_U \end{pmatrix} = \begin{pmatrix} M_y^L & M_y^U \\ M_x^L & M_x^U \end{pmatrix}, \quad (2)
\]

where the blocks are $X_L = M_x^L$, $X_U = M_x^U$, $Y_L = M_y^L$, and $Y_U = M_y^U = 0$. Obviously, the matrix $M$ has the same size as $Z$.
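To make the block structure of Eq. (2) concrete, the sketch below (a numpy illustration; the function name and the convention of passing the three observed blocks separately are assumptions, not part of the original formulation) stacks the observed blocks and fills the unknown block $M_y^U$ with zeros:

import numpy as np

def build_observed_matrix(X_L, X_U, Y_L):
    """Assemble M as in Eq. (2) from the observed blocks.

    X_L: d x l labeled features, X_U: d x u unlabeled features,
    Y_L: t x l known labels in {-1, +1}.
    """
    t = Y_L.shape[0]                 # number of labels
    u = X_U.shape[1]                 # number of unlabeled instances
    Y_U = np.zeros((t, u))           # unobserved block M_y^U set to 0
    M_y = np.hstack([Y_L, Y_U])      # top block: labels (t x n)
    M_x = np.hstack([X_L, X_U])      # bottom block: features (d x n)
    return np.vstack([M_y, M_x])     # (t + d) x n, the same size as Z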
According to Eq. (1), the rank of $Z$ should be much smaller than its ambient dimensions. Specifically, the estimation of the unobserved part of $Z$ can be treated as a matrix completion problem, which relies on minimizing the rank of $Z$. However, it is very difficult to directly solve a rank minimization problem. Fortunately, rank minimization can be relaxed to minimization of the nuclear norm of $Z$, that is, $\|Z\|_* = \sum_k \sigma_k(Z)$, where $\sigma_k(Z)$ is the $k$th singular value of $Z$ (Candès & Recht, 2009; Candès & Tao, 2010).
Now, with the consideration of the noise on $(X_L^0, X_U^0)$ and $Y_L^0$, the optimization problem can be formulated as follows (Cabral et al., 2011; Goldberg et al., 2010):

\[
\begin{aligned}
\min_{Z}\;\; & \mu \|Z\|_* + \frac{1}{|\Omega_X|} \sum_{(i,j)\in\Omega_X} c_x(z_{t+i,j}, x_{ij}) + \frac{\lambda}{|\Omega_Y|} \sum_{(i,j)\in\Omega_Y} c_y(z_{ij}, y_{ij}) \\
\text{s.t.}\;\; & z_m = \mathbf{1}^\top,\; m = t + d + 1,
\end{aligned}
\]
where $\Omega_X$ denotes the index set of observed data entries of $(X_L, X_U)$, i.e., $1 \le i \le d$ and $1 \le j \le n$ for any $x_{ij}$, and $\Omega_Y$ denotes the index set of known label entries of $Y_L$. Note that in the transductive setting, since all the instances are observed, $\Omega_X$ actually refers to the indices of rows and columns of the whole data. Here $c_x(z_{t+i,j}, x_{ij})$ is the loss between elements of $(X_L, X_U)$ and $(X_L^0, X_U^0)$, and $c_y(z_{ij}, y_{ij})$ is the loss between the soft labels $Y_L^0$ and the actual labels $Y_L$. In this paper, we choose the squared loss for $c_x(\cdot)$, namely $c_x(z_{t+i,j}, x_{ij}) = \frac{1}{2}(z_{t+i,j} - x_{ij})^2$, as in Goldberg et al. (2010).
For the label noise (from continuous $z_{ij} \in \mathbb{R}$ to $y_{ij} \in \{-1, +1\}$), we choose $c_y(z_{ij}, y_{ij}) = \frac{1}{\alpha}\log(1 + \exp(-\alpha z_{ij} y_{ij}))$, as in Eq. (14), to model the loss (Cabral et al., 2011). This loss between soft and binary labels is slightly different from the definition in Goldberg et al. (2010). In contrast to Goldberg et al. (2010), we use the loss $c_y(\cdot)$ with an additional parameter $\alpha$ because it has been reported to perform better than the sigmoid function (Cabral et al., 2011). There are many possible ways to implement the general mapping from the predicted soft labels $Z_y$ to binary labels. We will state one of the possible solutions in Section 5.
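For concreteness, the following sketch evaluates the objective above for a candidate $Z$ with the squared loss $c_x$ and the $\alpha$-scaled logistic loss $c_y$ just described (a numpy illustration; the boolean-mask encoding of $\Omega_Y$, the default parameter values, and the omission of the bias row are simplifying assumptions, not the authors' implementation):

import numpy as np

def objective(Z, M, omega_y, t, mu=1.0, lam=1.0, alpha=1.0):
    """mu*||Z||_* + averaged feature loss + lambda * averaged label loss.

    Z, M    : (t + d) x n matrices (bias row omitted for brevity)
    omega_y : boolean t x n mask of the observed label entries Omega_Y;
              in the transductive setting every feature entry is observed.
    """
    Z_y, Z_x = Z[:t], Z[t:]
    M_y, M_x = M[:t], M[t:]

    nuclear = np.linalg.norm(Z, 'nuc')            # sum of singular values of Z

    # c_x: squared loss, averaged over all observed feature entries
    loss_x = np.mean(0.5 * (Z_x - M_x) ** 2)

    # c_y: (1/alpha) * log(1 + exp(-alpha * z * y)), averaged over Omega_Y
    z, y = Z_y[omega_y], M_y[omega_y]
    loss_y = np.mean(np.log1p(np.exp(-alpha * z * y))) / alpha

    return mu * nuclear + loss_x + lam * loss_y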
3.2. Manifold regularized matrix completion
Under the setting of semi-supervised learning, a natural assumption is that if two instances $x_i$ and $x_j$ are close in the intrinsic geometry of the data distribution, then the corresponding predicted labels $Z_y^i$ and $Z_y^j$ are also similar to each other. This assumption is usually referred to as the smoothness (or manifold) assumption (Belkin & Niyogi, 2001; Bengio, Delalleau, & Le Roux, 2006; Chapelle, Schölkopf, Zien, et al., 2006; Liu, Xu, Wu, & Wang, 2016).
Suppose the mapping from the instance $x_i$ to its label $Z_y^i$ can be denoted by a function $f(x_i) = Z_y^i$; then the smoothness of $f$ along the geodesics in the compact manifold of the data can be measured as $\|f\|_{\mathcal{M}}^2$.
In reality, the data manifold is usually unknown. A symmetric weight matrix $W$ was defined in Zhu, Ghahramani, Lafferty, et al. (2003),

\[
W_{ij} = \exp\!\left(-\frac{\sum_{m=1}^{d}(x_i^m - x_j^m)^2}{\sigma^2}\right),
\]

which can be used to approximate the smoothness measure $\|f\|_{\mathcal{M}}^2$. We further define $L = D - W$, where $D = \mathrm{diag}(d_i)$ is a diagonal matrix with $d_i = \sum_j W_{ij}$. $L \in \mathbb{R}^{n\times n}$ is called the graph Laplacian matrix (Chung, 1997).
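A minimal sketch of this graph construction in numpy, assuming a fully connected Gaussian-weight graph with a single global bandwidth $\sigma$ (practical implementations often sparsify $W$ with $k$-nearest neighbors):

import numpy as np

def graph_laplacian(X, sigma=1.0):
    """Build W_ij = exp(-sum_m (x_i^m - x_j^m)^2 / sigma^2) and return L = D - W.

    X: d x n data matrix whose columns are instances.
    """
    sq_norms = np.sum(X ** 2, axis=0)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X.T @ X)
    W = np.exp(-np.maximum(sq_dists, 0.0) / sigma ** 2)
    np.fill_diagonal(W, 0.0)               # drop self-loops; L = D - W is unaffected
    D = np.diag(W.sum(axis=1))             # degrees d_i = sum_j W_ij
    return D - W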
Therefore, we hope the predictions for $Y_U^0$ remain consistent with those of their neighbors, which motivates the choice of the quadratic loss (Cai, He, Wu, & Han, 2008; Zhu, 2005; Zhu et al., 2003):

\[
\begin{aligned}
E(Z_y) &= \frac{1}{2}\sum_{i,j} W_{ij}\,\|f(x_i) - f(x_j)\|_2^2 \quad (3)\\
&= \frac{1}{2}\sum_{i,j} W_{ij}\,\|Z_y^i - Z_y^j\|_2^2\\
&= \sum_i d_i (Z_y^i)^2 - \sum_{i,j} W_{ij}\, Z_y^i Z_y^j = Z_y^\top L Z_y,
\end{aligned}
\]
where $\|\cdot\|_2$ denotes the $L_2$ norm. Suppose that $Z_y = \mathrm{argmin}_{Z_y} E(Z_y)$; then we find that $L Z_y = 0$. The property $L Z_y = (D - W)Z_y = 0$ is called the harmonic property; it suggests that the predicted label value of an instance is the average of its neighbors' values,

\[
Z_y^i = \frac{\sum_{j} W_{ij} Z_y^j}{d_i} = \frac{\sum_{j} W_{ij} Z_y^j}{\sum_j W_{ij}}, \quad (4)
\]

which satisfies the manifold assumption mentioned at the beginning of this section.
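The identities in Eqs. (3) and (4) can be checked numerically. The sketch below uses a single label dimension $f \in \mathbb{R}^n$ in place of $Z_y$ and a random symmetric weight matrix (both assumptions made only for illustration); it verifies that the pairwise penalty equals the Laplacian quadratic form and that $Lf = 0$ is exactly the neighbor-averaging property:

import numpy as np

rng = np.random.default_rng(1)
n = 50

A = rng.random((n, n))
W = np.triu(A, 1) + np.triu(A, 1).T          # symmetric weights, zero diagonal
d = W.sum(axis=1)                            # degrees d_i = sum_j W_ij
L = np.diag(d) - W                           # graph Laplacian

f = rng.standard_normal(n)                   # predicted values of one label over n instances

# Eq. (3): the pairwise smoothness penalty equals the quadratic form f^T L f
pairwise = 0.5 * np.sum(W * (f[:, None] - f[None, :]) ** 2)
assert np.isclose(pairwise, f @ L @ f)

# Eq. (4): f - (W f)/d equals (L f)/d elementwise, so L f = 0 holds exactly
# when each prediction is the weighted average of its neighbors
neighbor_avg = (W @ f) / d
assert np.allclose(f - neighbor_avg, (L @ f) / d)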
The matrix form of the energy term in Eq. (3) is $E(Z_y) = \mathrm{Tr}(Z_y^\top L Z_y)$. In this paper, we exploit the geometrical structure of the whole data to regularize the multi-label learning problem and incorporate it as a regularization term, as shown in Eq. (5):

\[
\begin{aligned}
\min_{Z} \quad & \|Z\|_* + \mathrm{Tr}(Z_y^\top L Z_y) \quad (5)\\
\text{s.t.} \quad & C_x(Z_x, M_x) = 0,\\
& C_y(Z_y^L, M_y^L) = 0.
\end{aligned}
\]
This term ensures that the label predictions for each class in $Z_y$ satisfy local consistency, where $Z_y = (Y_L^0, Y_U^0)$.
In Eq. (5), the manifold term appears to be a ''hard'' regularization added to the nuclear-norm loss, but in fact it is not. It is a soft constraint that can be controlled by a parameter introduced by ADMM during optimization. The details will be explained in Section 4.3.