2. Related Work
Metric learning. Metric learning aims to learn a similarity (distance) function. Traditional metric learning [36, 33, 12, 38] usually learns a matrix $A$ for a distance metric $\|x_1 - x_2\|_A = \sqrt{(x_1 - x_2)^T A (x_1 - x_2)}$ on given features $x_1, x_2$. Recently, prevailing deep metric learning [7, 17, 24, 30, 25, 22, 34] usually uses neural networks to automatically learn discriminative features $x_1, x_2$, followed by a simple distance metric such as the Euclidean distance $\|x_1 - x_2\|_2$. The most widely used loss functions for deep metric learning are the contrastive loss [1, 3] and the triplet loss [32, 22, 6], both of which impose a Euclidean margin on the features.
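For concreteness, the learned-metric distance above can be evaluated as in the following minimal NumPy sketch; the matrix $A$ here is a random positive semi-definite placeholder standing in for a learned metric, not the output of any particular method from the cited works:

```python
import numpy as np

def metric_distance(x1, x2, A):
    """||x1 - x2||_A = sqrt((x1 - x2)^T A (x1 - x2)) for a PSD matrix A."""
    d = x1 - x2
    return np.sqrt(d @ A @ d)

rng = np.random.default_rng(0)
L = rng.standard_normal((4, 4))
A = L.T @ L                                # A = L^T L is positive semi-definite by construction
x1, x2 = rng.standard_normal(4), rng.standard_normal(4)
print(metric_distance(x1, x2, A))          # learned-metric distance
print(np.linalg.norm(x1 - x2))             # Euclidean distance (A = I) for comparison
```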
Deep face recognition. Deep face recognition is arguably one of the most active research areas in the past few years. [30, 26] address open-set FR using CNNs supervised by the softmax loss, which essentially treats open-set FR as a multi-class classification problem. [25] combines the contrastive loss and the softmax loss to jointly supervise the CNN training, greatly boosting the performance. [22] uses the triplet loss to learn a unified face embedding. Trained on nearly 200 million face images, it achieves the current state-of-the-art FR accuracy. Inspired by linear discriminant analysis, [34] proposes the center loss for CNNs and also obtains promising performance. In general, current well-performing CNNs [28, 15] for FR are mostly built on either the contrastive loss or the triplet loss. One can notice that state-of-the-art FR methods usually adopt ideas (e.g. contrastive loss, triplet loss) from metric learning, showing that open-set FR can be well addressed by discriminative metric learning.
L-Softmax loss [16] also implicitly involves the concept of angles. As a regularization method, it shows great improvement on closed-set classification problems. In contrast, A-Softmax loss is developed to learn a discriminative face embedding. Its explicit connection to the hypersphere manifold makes the learned features particularly suitable for the open-set FR problem, as verified by our experiments. In addition, the angular margin in A-Softmax loss is explicitly imposed and can be quantitatively controlled (e.g., via lower bounds that approximate the desired feature criterion), whereas [16] can only be analyzed qualitatively.
3. Deep Hypersphere Embedding
3.1. Revisiting the Softmax Loss
We revisit the softmax loss by examining its decision criterion. In the binary-class case, the posterior probabilities obtained by the softmax loss are
$$p_1 = \frac{\exp(W_1^T x + b_1)}{\exp(W_1^T x + b_1) + \exp(W_2^T x + b_2)} \qquad (1)$$

$$p_2 = \frac{\exp(W_2^T x + b_2)}{\exp(W_1^T x + b_1) + \exp(W_2^T x + b_2)} \qquad (2)$$
where $x$ is the learned feature vector, and $W_i$ and $b_i$ are the weights and bias of the last fully connected layer corresponding to class $i$, respectively. The predicted label is assigned to class 1 if $p_1 > p_2$ and to class 2 if $p_1 < p_2$. By comparing $p_1$ and $p_2$, it is clear that $W_1^T x + b_1$ and $W_2^T x + b_2$ determine the classification result. The decision boundary is $(W_1 - W_2)x + b_1 - b_2 = 0$. We then rewrite $W_i^T x + b_i$ as $\|W_i\|\|x\|\cos(\theta_i) + b_i$, where $\theta_i$ is the angle between $W_i$ and $x$. Notice that if we normalize the weights and zero the biases ($\|W_i\| = 1$, $b_i = 0$), the posterior probabilities become $p_1 = \|x\|\cos(\theta_1)$ and $p_2 = \|x\|\cos(\theta_2)$. Since $p_1$ and $p_2$ share the same $\|x\|$, the final result depends only on the angles $\theta_1$ and $\theta_2$. The decision boundary also becomes $\cos(\theta_1) - \cos(\theta_2) = 0$ (i.e., the angular bisector of vectors $W_1$ and $W_2$). Although the above analysis is built on the binary-class case, it is trivial to generalize it to the multi-class case.
During training, the modified softmax loss ($\|W_i\| = 1$, $b_i = 0$) encourages features from the $i$-th class to have a smaller angle $\theta_i$ (larger cosine similarity) than the other classes, which makes the angles between $W_i$ and the features a reliable metric for classification.
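This observation can be checked numerically. The following sketch (variable names are ours, not from the paper) verifies that once the class weights are normalized and the biases are zeroed, the softmax scores equal $\|x\|\cos(\theta_j)$, so the predicted class is exactly the one whose weight vector has the smallest angle to $x$:

```python
import numpy as np

rng = np.random.default_rng(1)
d, K = 128, 10                                   # feature dimension, number of classes
W = rng.standard_normal((d, K))                  # columns W_j of the last FC layer
x = rng.standard_normal(d)                       # a learned feature vector

# Normalize the weights (||W_j|| = 1) and zero the biases.
Wn = W / np.linalg.norm(W, axis=0, keepdims=True)

scores = x @ Wn                                  # score_j = ||x|| cos(theta_j)
cos_theta = scores / np.linalg.norm(x)
theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))

# The class with the largest score is exactly the class with the smallest angle.
assert np.argmax(scores) == np.argmin(theta)
print("predicted class:", np.argmax(scores))
```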
To give a formal expression for the modified softmax loss, we first define the input feature $x_i$ and its label $y_i$. The original softmax loss can be written as
$$L = \frac{1}{N}\sum_i L_i = \frac{1}{N}\sum_i -\log\left(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right) \qquad (3)$$
where $f_j$ denotes the $j$-th element ($j \in [1, K]$, $K$ is the class number) of the class score vector $f$, and $N$ is the number of training samples. In CNNs, $f$ is usually the output of a fully connected layer $W$, so $f_j = W_j^T x_i + b_j$ and $f_{y_i} = W_{y_i}^T x_i + b_{y_i}$, where $x_i$, $W_j$, and $W_{y_i}$ are the $i$-th training sample and the $j$-th and $y_i$-th columns of $W$, respectively.
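As a reference point, Eq. (3) with the fully connected scores $f_j = W_j^T x_i + b_j$ can be computed as in the following batch-averaged NumPy sketch (the max-subtraction is the usual numerical-stability trick; variable names are ours):

```python
import numpy as np

def softmax_loss(X, W, b, labels):
    """Original softmax loss, Eq. (3), with f_j = W_j^T x_i + b_j.

    X: (N, d) features, W: (d, K) with columns W_j, b: (K,), labels: (N,) integer classes.
    """
    f = X @ W + b                                        # class score vectors f, shape (N, K)
    f = f - f.max(axis=1, keepdims=True)                 # stability; does not change the loss
    log_probs = f - np.log(np.exp(f).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()
```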
We further reformulate $L_i$ in Eq. (3) as

$$L_i = -\log\left(\frac{e^{W_{y_i}^T x_i + b_{y_i}}}{\sum_j e^{W_j^T x_i + b_j}}\right) = -\log\left(\frac{e^{\|W_{y_i}\|\|x_i\|\cos(\theta_{y_i,i}) + b_{y_i}}}{\sum_j e^{\|W_j\|\|x_i\|\cos(\theta_{j,i}) + b_j}}\right) \qquad (4)$$
in which $\theta_{j,i}$ ($0 \le \theta_{j,i} \le \pi$) is the angle between the vectors $W_j$ and $x_i$. As analyzed above, we first normalize $\|W_j\| = 1, \forall j$ in each iteration and zero the biases. Then we have the modified softmax loss:
$$L_{\text{modified}} = \frac{1}{N}\sum_i -\log\left(\frac{e^{\|x_i\|\cos(\theta_{y_i,i})}}{\sum_j e^{\|x_i\|\cos(\theta_{j,i})}}\right) \qquad (5)$$
Although we can learn features with angular decision boundaries using the modified softmax loss, these features are still not necessarily discriminative. Since we use angles as the distance metric, it is natural to incorporate an angular margin into the learned features in order to enhance their discrimination power. To this end, we propose a novel way to incorporate the angular margin.
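A minimal sketch of the modified softmax loss of Eq. (5), under the same conventions as above (weight columns renormalized to unit norm in each iteration, biases zeroed), so that the logits reduce to $\|x_i\|\cos(\theta_{j,i})$:

```python
import numpy as np

def modified_softmax_loss(X, W, labels):
    """Modified softmax loss, Eq. (5): ||W_j|| = 1 and b_j = 0 for all j."""
    Wn = W / np.linalg.norm(W, axis=0, keepdims=True)    # renormalize columns W_j
    logits = X @ Wn                                      # logit_{i,j} = ||x_i|| cos(theta_{j,i})
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()
```

In an actual CNN this loss would sit on top of the learned features $x_i$ and be optimized jointly with the network; the sketch only illustrates the forward computation.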
3.2. Introducing Angular Margin to Softmax Loss
Instead of designing a new type of loss function and con-
structing a weighted combination with softmax loss (similar