In this paper, we propose a unified cost-sensitive framework for conducting label propagation and classifier learning simultaneously. Let $F = [f_1, f_2, \ldots, f_N]$ denote the inferred cost-sensitive label matrix, where each $f_i$ is a one-hot vector, i.e., exactly one of its $c$ elements is one and all the others are zero. In our approach, each label vector $f_i$ for $i = 1, 2, \ldots, N$ is estimated in a cost-sensitive way by regressing the current classification results. The joint optimization problem can be solved by minimizing a misclassification loss function of the general form
$$\min_{W, F}\ \mathrm{loss}\{\phi(X, W), F, C\} \tag{1}$$
where $\phi$ classifies the input training data $X$ with the projection matrix $W$. The classification results are then used to evaluate the label matrix $F$ with the cost matrix $C$. The cost-sensitive label information is used in turn to update the classifier $\phi$ with respect to $W$. This process is iterated until the overall misclassification loss is minimized. In this way, both label propagation and classifier learning are embedded in a cost-sensitive framework.
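To make the alternation behind (1) concrete, the following sketch instantiates it with simple, assumed update rules: a ridge-regression $W$-step and an expected-cost one-hot $F$-step for the unlabeled samples. Both rules and all names are illustrative assumptions for intuition only; the paper's actual updates are derived in Sections IV-A through IV-D.

import numpy as np

def fit_cost_sensitive(X, C, y, n_l, n_iters=20, reg=1e-3):
    # X: (D, N) training data whose first n_l columns are labeled.
    # C: (c, c) cost matrix with C[j, k] the cost of predicting k for true class j.
    # y: (n_l,) integer labels of the labeled samples.
    D, N = X.shape
    c = C.shape[0]
    F = np.zeros((c, N))
    F[y, np.arange(n_l)] = 1.0                    # fixed one-hot labels for labeled data
    for _ in range(n_iters):
        # W-step: regularized least squares fit of F onto X, with phi(X, W) = W^T X.
        W = np.linalg.solve(X @ X.T + reg * np.eye(D), X @ F.T)
        scores = W.T @ X                          # (c, N) current classification results
        # F-step: treat normalized scores as rough class posteriors (assumption)
        # and assign each unlabeled sample the class of minimum expected cost.
        p = np.maximum(scores, 0.0) + 1e-12
        p = p / p.sum(axis=0, keepdims=True)
        exp_cost = C.T @ p                        # expected cost of predicting each class
        F[:, n_l:] = 0.0
        F[np.argmin(exp_cost[:, n_l:], axis=0), np.arange(n_l, N)] = 1.0
    return W, F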
To deal with face feature variations, we further propose to conduct cost-sensitive semi-supervised learning in some latent semantic space of face images. The last two key notations in Table I specify the robust high-level features used in our approach. In particular, $B \in \mathbb{R}^{D \times d}$ spans the learned latent semantic space and $S \in \mathbb{R}^{d \times N}$ accommodates the $d$-dimensional latent semantic representations of $X$.
IV. THE UNIFIED COST-SENSITIVE FRAMEWORK
In this section, we elaborate our unified cost-sensitive framework for semi-supervised face recognition. Section IV-A proposes cost-sensitive latent semantic regression for label propagation and learning of the classifier. Section IV-B introduces cost-sensitive regularization to guide the label propagation process. Section IV-C presents the design of the misclassification loss function for cost-sensitive learning in the latent semantic space. Section IV-D describes the iterative algorithm for solving the unified framework. Section IV-E explains the procedure for inference.
A. Cost-sensitive learning in the latent semantic space
Because facial expressions, lighting, and poses vary across face images taken at different times, it is necessary to extract robust feature representations for cost-sensitive face recognition. To address this issue, we adopt matrix factorization to extract high-level features that reflect the inherent structure of the data [34], [36]–[39]. The latent semantic space $B$ and the high-level features $S$ can be jointly learned from
$$L_1(B, S) = \|X - BS\|_F^2 \tag{2}$$
where $\|\cdot\|_F$ denotes the Frobenius norm. We do not include any sparsity constraint in (2) for matrix factorization because face recognition is not commonly considered a compressive sensing problem [6], [8], [35].
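In isolation, (2) has a closed-form minimizer: by the Eckart-Young theorem, the best rank-$d$ factorization of $X$ under the Frobenius norm is given by the truncated SVD. The stand-alone solver below is only for intuition; in the unified framework, $B$ and $S$ are updated jointly with the other terms (Section IV-D).

import numpy as np

def latent_factorize(X, d):
    # Best rank-d factorization X ~ B S in Frobenius norm (Eckart-Young).
    U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
    B = U[:, :d]                      # (D, d): basis spanning the latent semantic space
    S = np.diag(sigma[:d]) @ Vt[:d]   # (d, N): latent representations of the columns of X
    return B, S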
We then use a linear predictive classifier to project $S$ into the label space, i.e., $\phi(X, W) = W^T S(X)$, where $S(X)$ denotes the latent semantic features learned from (2) with input $X$, and adopt least-squares minimization for the loss function. Note that other classifiers $\phi$ and other optimization rules are possible. In our context, linear regression keeps the update in every iteration simple and yet achieves effective results for the unified framework. Thus, we introduce cost-sensitive latent semantic regression as
$$L_2(W, S, F) = \sum_{i=1}^{N} h(i)\, \|W^T s_i - f_i\|_2^2 \tag{3}$$
where $s_i$ denotes the latent semantic representation of sample $x_i$, and $h(i)$, known as the importance function [3], [6]–[8], depicts the importance of sample $x_i$ in the training process. In supervised learning scenarios [3], [40], the importance function is often defined as the total cost of misclassifying sample $x_i$, whose true class label is denoted by $l(x_i)$. In our context of semi-supervised learning, sample $x_i$ can be either labeled or unlabeled. Accordingly, the importance of sample $x_i$ is evaluated as
$$h(i) = \begin{cases} \sum_{j=1}^{c} C_{l(x_i)j}, & \text{if } i \le N_l \\ \tau, & \text{if } i > N_l \end{cases} \tag{4}$$
where the hyper-parameter $\tau$ is set for the unlabeled training data; its value is chosen empirically to stress the importance of unlabeled data in cost-sensitive learning.
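As a concrete reading of (4), and of the $W$-step of (3) with $S$, $F$, and $h$ held fixed, the sketch below evaluates $h(i)$ from the cost matrix and then solves the resulting weighted least-squares problem in closed form. The small ridge term is an assumption added for numerical stability; the paper's joint update of $W$, $S$, and $F$ is given in Section IV-D.

import numpy as np

def importance_weights(C, y, N, tau):
    # h(i) from (4): total misclassification cost for labeled samples (i <= N_l),
    # the hyper-parameter tau for unlabeled ones (i > N_l).
    n_l = len(y)
    h = np.full(N, float(tau))
    h[:n_l] = C[y].sum(axis=1)        # sum_j C_{l(x_i) j} over the cost-matrix row of l(x_i)
    return h

def solve_w(S, F, h, reg=1e-3):
    # Minimizer of (3) over W alone: W = (S diag(h) S^T + reg I)^{-1} S diag(h) F^T.
    d = S.shape[0]
    Sh = S * h                        # scale each column s_i by its importance h(i)
    return np.linalg.solve(Sh @ S.T + reg * np.eye(d), Sh @ F.T)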
Proposition 1: Assume that $x_i \in X$ for $i = 1, 2, \ldots, N$ are conditionally independent of each other given their label classes $l(x_i) = 1, 2, \ldots, c$, whose densities are multivariate Gaussians with a common covariance matrix. Given the label matrix $F = [f_1, f_2, \ldots, f_N]$, minimizing the least squares criterion $\min_W \|W^T S - F\|_2^2$ results in a solution $\hat{W} = [\hat{w}_1, \hat{w}_2, \ldots, \hat{w}_c]$ that projects the latent semantic feature $s_i$ of each sample $x_i$ into the label space with regressed terms proportional to the posterior class probabilities, i.e., $\hat{w}_k^T s_i \propto p(l(x_i) = k \mid x_i)$ for $k = 1, 2, \ldots, c$, and
$$\|\hat{W}^T s_i - f_i\|_2^2 \propto \sum_{j \le c,\, j \ne l(x_i)} p(j \mid x_i)^2 + \left[1 - p(l(x_i) \mid x_i)\right]^2. \tag{5}$$
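Before the proof, a small simulation under the proposition's assumptions (shared-covariance Gaussian class conditionals) illustrates the claim: the least-squares scores $\hat{w}_k^T s_i$ track the true posteriors, so the two agree on the most probable class for almost every sample. All data below are synthetic.

import numpy as np
rng = np.random.default_rng(0)

# Three Gaussian classes with a common covariance matrix.
c, d, n = 3, 5, 2000
means = rng.normal(size=(c, d))
A = rng.normal(size=(d, d))
cov = A @ A.T / d + np.eye(d)
y = rng.integers(0, c, size=n)
S = means[y] + rng.multivariate_normal(np.zeros(d), cov, size=n)   # (n, d) samples

# Least squares on one-hot targets, with a bias column, as in the proposition.
F = np.eye(c)[y]                                                   # (n, c) one-hot labels
Sb = np.hstack([S, np.ones((n, 1))])
W = np.linalg.lstsq(Sb, F, rcond=None)[0]
scores = Sb @ W                                                    # regressed terms w_k^T s_i

# True posteriors from the Gaussian class-conditional densities (equal priors).
icov = np.linalg.inv(cov)
logp = np.stack([-0.5 * np.einsum('nd,de,ne->n', S - m, icov, S - m)
                 for m in means], axis=1)
post = np.exp(logp - logp.max(axis=1, keepdims=True))
post = post / post.sum(axis=1, keepdims=True)

# Fraction of samples where scores and posteriors pick the same class (close to 1).
print(np.mean(scores.argmax(axis=1) == post.argmax(axis=1)))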
Proof: For each label class $k = 1, 2, \ldots, c$, let $g_k^T$ denote the corresponding row of $F$, so that $g_{ki} = 1$ if $l(x_i) = k$ and $g_{ki} = 0$ otherwise, for all $i = 1, 2, \ldots, N$ in the training dataset. The least squares solution can also be obtained by solving

$$\min_{w_k} \|S^T w_k - g_k\|_2^2 \tag{6}$$

for each label classifier individually [41].
Note that the problem in (6) is a two-class regression between class $k$ and a null class containing all samples that do not belong to class $k$, i.e., those with $l(x_i) \ne k$. Suppose the means of the two classes are $m_k$ and $m_0$, respectively. Since all label classes share the same covariance matrix $\Sigma$, the least squares solution of this two-class regression satisfies the following relationship [41]:
$$\hat{w}_k \propto \Sigma^{-1}(m_k - m_0). \tag{7}$$
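Relationship (7) is easy to check numerically: fitting the two-class regression (6) on synthetic shared-covariance Gaussian data yields a weight vector whose direction all but coincides with $\Sigma^{-1}(m_k - m_0)$. The setup below is synthetic and only illustrates this known least-squares/LDA equivalence.

import numpy as np
rng = np.random.default_rng(1)

# Two Gaussian classes (class k and the null class) sharing a covariance matrix.
d, n = 4, 5000
m0, mk = rng.normal(size=d), rng.normal(size=d)
A = rng.normal(size=(d, d))
cov = A @ A.T / d + np.eye(d)
S = np.vstack([rng.multivariate_normal(m0, cov, size=n),
               rng.multivariate_normal(mk, cov, size=n)])
g = np.r_[np.zeros(n), np.ones(n)]                 # indicator targets g_k

# Two-class least squares with a bias column; keep only the weight part.
Sb = np.hstack([S, np.ones((2 * n, 1))])
w = np.linalg.lstsq(Sb, g, rcond=None)[0][:d]

# Cosine similarity with Sigma^{-1}(m_k - m_0) is close to 1.
ref = np.linalg.solve(cov, mk - m0)
print(w @ ref / (np.linalg.norm(w) * np.linalg.norm(ref)))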
On the other hand, we may use a Gaussian Naive Bayes (GNB) classifier to estimate the posterior class probability