88 G. Wu et al. / Neurocomputing 289 (2018) 86–100
globally by employing Laplacian regularizations. MIFS [12] proposes multi-label informed feature selection by embedding the label space into a low-dimensional space to exploit label correlation. LIFT [15] and LLSF [16] utilize label-specific features to improve
the performance of multi-label learning. In summary, on one hand, many algorithms extending BR use SVM as the base binary classifier and train each classifier individually, because traditional BR decomposes the problem into many independent binary classifiers and learns each one separately, which makes it hard to extend. On the other hand, when extending BR by training many binary classifiers jointly, most of these methods use the least squares loss function, which is better suited to a regression task than to a classification task. Furthermore, there is no unified framework of BR for various loss functions. To tackle the above issues, we propose a unified framework implementing linear Binary Relevance for various loss functions, which is easy to extend. We also analyze the influence of different loss functions to find those suitable for multi-label learning.
Multi-class classification can be viewed as a special case of multi-label learning in which the number of labels per instance is restricted to one. The one-vs-all algorithm [36] can also be viewed as a special case of the BR algorithm with a particular decision function. In one-vs-all multi-class classification, an instance is predicted as the label with the maximum classification score among all the corresponding binary classifiers. In contrast, in BR for multi-label learning, an instance is predicted as the label set containing every label whose corresponding binary classifier outputs a positive prediction.
3. Model
3.1. Notations
For an arbitrary matrix $A$, $A^\top$ denotes its transpose, $a^i$ and $a_j$ denote the $i$th row and $j$th column of $A$, $\|a^i\|_2$ denotes the vector $\ell_2$-norm, $\|A\|_1$ denotes the matrix $\ell_1$-norm and $\|A\|_F$ is the Frobenius norm. For two matrices $A$ and $B$, $A \circ B$ denotes the Hadamard (element-wise) product. For a function $g: \mathbb{R} \to \mathbb{R}$, $g'$ is the corresponding derivative function, and for an arbitrary matrix $A \in \mathbb{R}^{n \times m}$, $g(A): \mathbb{R}^{n \times m} \to \mathbb{R}^{n \times m}$ with $(g(A))_{ij} = g(A_{ij})$. $I$ denotes the matrix of corresponding size where each element equals 1.
Given a training dataset $D = \{x_i, y_i\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^m$ is a real-valued instance vector and $y_i \in \{-1, 1\}^c$ is the label vector of $x_i$, $n$ is the number of samples, $m$ is the feature dimension of an instance and $c$ is the number of class labels. Therefore, $y_{ij}$ denotes the $j$th label of the $i$th instance, and $y_{ij} = 1$ (or $-1$) means the $j$th label is relevant (or irrelevant). Denote the dataset by matrices $D = (X, Y)$, where $X \in \mathbb{R}^{n \times m}$ and $Y \in \{-1, 1\}^{n \times c}$.
The task of multi-label learning is to find a multi-label classifier $H: \mathbb{R}^m \to \{-1, 1\}^c$. In BR [4], $H$ can be decomposed into $c$ independent binary classifiers, one per label, so $H = \{h_1, h_2, \ldots, h_c\}$ and $h_j(x_i)$ denotes the prediction of $y_{ij}$. A multi-label predictor $F: \mathbb{R}^m \to \mathbb{R}^c$ is first learned, and it can also be decomposed into $c$ independent predictors $\{f_1, f_2, \ldots, f_c\}$ on each label, where $f_j(x_i)$ denotes the predicted real value of $y_{ij}$. Then $H$ can be induced from $F$ by a thresholding function. In this paper, for multi-label learning, $h_j(x_i) = [\![\, f_j(x_i) > t(x_i) \,]\!]$, where the thresholding function is $t(x_i) = 0$ for all instances and $[\![\, \pi \,]\!]$ equals 1 when the proposition $\pi$ holds and 0 otherwise. When $f_j(x_i) < 0$ for all $j = 1, \ldots, c$, the predicted label set of $x_i$ is empty. In this case, we choose the so-called T-Criterion [4] to predict the label with the largest predictor value.
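The decision rule above (zero-threshold BR with the T-Criterion fallback for empty predictions) can be sketched in a few lines of numpy; the random data and weights here are purely illustrative:

```python
import numpy as np

def br_predict(X, W):
    """Binary Relevance decision rule: threshold each predictor at 0;
    if no label fires for an instance, the T-Criterion assigns the label
    with the largest predictor value."""
    F = X @ W                        # n x c real-valued predictions f_j(x_i)
    H = np.where(F > 0, 1, -1)       # h_j(x_i) = [[ f_j(x_i) > 0 ]]
    empty = ~(H == 1).any(axis=1)    # instances whose predicted label set is empty
    if empty.any():
        top = F[empty].argmax(axis=1)            # T-Criterion fallback
        H[np.flatnonzero(empty), top] = 1
    return H

# toy example: 2 instances, 3 features, 4 labels
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 3))
W = rng.normal(size=(3, 4))
print(br_predict(X, W))
```

Note that after the fallback every instance is guaranteed at least one relevant label, matching the T-Criterion's purpose.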
Note that in multi-class classification each instance has exactly one class. Thus, the decision function of one-vs-all multi-class classification differs from that of BR for multi-label learning, while the training process is the same: in one-vs-all multi-class classification, an instance $x_i$ is predicted as the class label with the largest predictor value.
In this paper, we focus on BR [4] and one-vs-all multi-class classification [36] where the base binary classifier is a linear model; the linear model can be easily extended to a non-linear kernel model [37] by the kernel trick, which we do not discuss in this paper. Thus, our goal is to find a coefficient matrix $W = [w_1, w_2, \ldots, w_c] \in \mathbb{R}^{m \times c}$, which maps the instance space to the label space.
3.2. Unified model
In the traditional implementation of BR [4] for multi-label learning or one-vs-all for multi-class classification [36], the problem is decomposed into $c$ independent binary classifiers, and each classifier is trained independently.
For the $j$th binary classifier, corresponding to the $j$th label, the model is as follows:
$$\min_{w_j} \; \sum_{i=1}^{n} \mathrm{loss}\big(y_{ij}, f_j(x_i)\big) + \lambda_j R(w_j) \tag{1}$$
where $\mathrm{loss}(y_{ij}, f_j(x_i))$ is a loss function measuring the risk between the true label $y_{ij}$ and the predicted value $f_j(x_i)$, $R(w_j)$ is a regularizer which controls the complexity of the model, and $\lambda_j$ is a tradeoff parameter. In this paper, we mainly discuss five popular loss functions for linear binary classification and set $R(w_j) = \|w_j\|_2^2$. Because the model is linear, we can write $f_j(x_i) = x_i w_j + b_j$, where $b = [b_1, b_2, \ldots, b_c] \in \mathbb{R}^c$ is the bias. For simplicity, the bias $b_j$ can be absorbed into $w_j$ by appending the constant value 1 to each instance $x_i$ as an additional feature.
Thus, the model can be written as
$$\min_{w_j} \; \sum_{i=1}^{n} \mathrm{loss}\big(y_{ij}, x_i w_j\big) + \lambda_j \|w_j\|_2^2 \tag{2}$$
Thus, $w_j$, $j = 1, 2, \ldots, c$, is learned independently to compose $W = [w_1, w_2, \ldots, w_c]$. This is equivalent to solving the following problem:
$$\min_{w_1, w_2, \ldots, w_c} \; \sum_{j=1}^{c} \sum_{i=1}^{n} \mathrm{loss}\big(y_{ij}, x_i w_j\big) + \sum_{j=1}^{c} \lambda_j \|w_j\|_2^2 \tag{3}$$
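To illustrate the per-label decomposition in Eq. (3): with the squared loss, each subproblem is a ridge regression with the closed-form solution $w_j = (X^\top X + \lambda_j I)^{-1} X^\top y_j$, so $W$ can be assembled one column at a time. The sketch below uses synthetic data, and the squared loss is chosen only because it admits a closed form; the framework itself is loss-agnostic:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, c = 20, 5, 3
X = rng.normal(size=(n, m))
Y = np.where(rng.normal(size=(n, c)) > 0, 1.0, -1.0)
lam = np.array([0.1, 1.0, 10.0])   # one tradeoff parameter lambda_j per label

# Eq. (3) with the squared loss: c independent ridge problems, one per label
W = np.column_stack([
    np.linalg.solve(X.T @ X + lam[j] * np.eye(m), X.T @ Y[:, j])
    for j in range(c)
])
print(W.shape)   # (5, 3): W maps the feature space to the label space
```

Each column satisfies the first-order optimality condition of its own subproblem, which is exactly why the joint objective splits into $c$ independent pieces.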
Set $\Lambda = [\lambda_1 e, \lambda_2 e, \ldots, \lambda_c e] \in \mathbb{R}^{m \times c}$, where $e$ is the column vector of corresponding size whose elements all equal 1. Because $\mathrm{loss}(y_{ij}, x_i w_j) \geq 0$ always holds due to the property of the loss function, this problem can be written as
$$\min_{W} \; \|\mathrm{loss}(Y, XW)\|_1 + \|\Lambda^{\frac{1}{2}} \circ W\|_F^2 \tag{4}$$
where $\Lambda^{\frac{1}{2}}$ denotes the element-wise square root of $\Lambda$.
where $(\mathrm{loss}(Y, XW))_{ij} = \mathrm{loss}(y_{ij}, x_i w_j)$. Denote $F(W) = \|\mathrm{loss}(Y, XW)\|_1 + \|\Lambda^{\frac{1}{2}} \circ W\|_F^2$. If the $\lambda_j$ of the binary classifiers are all set equal to a common $\lambda$, this problem becomes
$$\min_{W} \; \|\mathrm{loss}(Y, XW)\|_1 + \lambda \|W\|_F^2 \tag{5}$$
which is the form we mainly extend. We solve problem Eq. (4) using a gradient (or subgradient) descent algorithm in this paper. Next we discuss the gradient (or subgradient) of the objective function w.r.t. $W$.
Generally speaking, the loss function can be written as $\mathrm{loss}(y, xw) = g(y(xw))$, where $g: \mathbb{R} \to \mathbb{R}^{+}$. Thus, $\mathrm{loss}(Y, XW) = g(Y \circ (XW))$.
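As a concrete instance of this margin form, the sketch below uses the logistic loss $g(z) = \log(1 + e^{-z})$ (chosen here purely for illustration; the paper covers several such losses) and confirms that the matrix expression $g(Y \circ (XW))$ matches the elementwise definition:

```python
import numpy as np

def g(z):
    # logistic loss on the margin z = y * f(x); logaddexp is a stable log(1 + e^{-z})
    return np.logaddexp(0.0, -z)

rng = np.random.default_rng(0)
n, m, c = 6, 4, 3
X = rng.normal(size=(n, m))
W = rng.normal(size=(m, c))
Y = np.where(rng.normal(size=(n, c)) > 0, 1.0, -1.0)

L = g(Y * (X @ W))                 # loss(Y, XW) = g(Y o (XW)), an n x c matrix

# matches the elementwise definition loss(y_ij, x_i w_j) at any entry, e.g. (2, 1)
i, j = 2, 1
print(np.isclose(L[i, j], g(Y[i, j] * (X[i] @ W[:, j]))))   # prints True
```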
Proposition 1. The gradient (or subgradient) of the objective function Eq. (4) w.r.t. $W$ is as follows:
$$\nabla_W F(W) = X^\top \big(g'(Z) \circ Y\big) + 2\,\Lambda \circ W \tag{6}$$
where $Z = Y \circ (XW)$.
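Proposition 1 can be sanity-checked numerically before working through the proof. The sketch below again assumes the logistic loss $g(z) = \log(1 + e^{-z})$, with $g'(z) = -1/(1 + e^{z})$, and compares the analytic gradient (loss term plus $2\Lambda \circ W$ from the regularizer) against a central finite difference:

```python
import numpy as np

def g(z):           # logistic loss on the margin
    return np.logaddexp(0.0, -z)

def g_prime(z):     # derivative of g
    return -1.0 / (1.0 + np.exp(z))

def objective(W, X, Y, Lam):
    Z = Y * (X @ W)
    return g(Z).sum() + (Lam * W**2).sum()            # ||loss||_1 + regularizer

def gradient(W, X, Y, Lam):
    Z = Y * (X @ W)
    return X.T @ (g_prime(Z) * Y) + 2.0 * Lam * W     # Proposition 1

rng = np.random.default_rng(0)
n, m, c = 8, 4, 3
X = rng.normal(size=(n, m))
Y = np.where(rng.normal(size=(n, c)) > 0, 1.0, -1.0)
Lam = np.tile(np.array([0.1, 0.5, 1.0]), (m, 1))      # Lambda_{pq} = lambda_q
W = rng.normal(size=(m, c))

# finite-difference check of one entry (p, q) of the gradient
eps, p, q = 1e-6, 1, 2
Wp, Wm = W.copy(), W.copy()
Wp[p, q] += eps
Wm[p, q] -= eps
numeric = (objective(Wp, X, Y, Lam) - objective(Wm, X, Y, Lam)) / (2 * eps)
print(np.isclose(gradient(W, X, Y, Lam)[p, q], numeric, atol=1e-4))  # prints True
```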
Proof. Take the partial derivative of the loss term $\|\mathrm{loss}(Y, XW)\|_1$ w.r.t. $w_{pq}$ according to the chain rule:
$$\frac{\partial \|\mathrm{loss}(Y, XW)\|_1}{\partial w_{pq}} = \frac{\partial \left[ \sum_{i=1}^{n} g\big(y_{iq}(x_i w_q)\big) \right]}{\partial w_{pq}}$$