Attention-aware Deep Reinforcement Learning for Video Face Recognition
Yongming Rao 1,2,3, Jiwen Lu 1,2,3,∗, Jie Zhou 1,2,3
1 Department of Automation, Tsinghua University, Beijing, China
2 State Key Lab of Intelligent Technologies and Systems, Beijing, China
3 Tsinghua National Laboratory for Information Science and Technology (TNList), Beijing, China
raoyongming95@gmail.com; {lujiwen,jzhou}@tsinghua.edu.cn
Abstract
In this paper, we propose an attention-aware deep reinforcement learning (ADRL) method for video face recognition, which aims to discard misleading and confounding frames and find the focuses of attention in face videos for person recognition. We formulate the process of finding the attentions of videos as a Markov decision process and train the attention model through a deep reinforcement learning framework without using extra labels. Unlike existing attention models, our method takes information from both the image space and the feature space as input, making better use of face information that is discarded during the feature learning process. Moreover, our approach is attention-aware: it seeks different attentions for the recognition of different pairs of videos. Our approach achieves very competitive video face recognition performance on three widely used video face datasets.
1. Introduction
Video face recognition has attracted great attention in computer vision over the past few years [4, 7, 8, 15, 24, 31, 32, 40, 41, 43]. There are many practical applications for video face recognition, such as access control, video search, and visual surveillance. Compared to still face recognition, videos capture human faces from multiple views, which provides more useful information about a single face. However, video faces usually suffer from uncontrolled variations in pose, illumination, and other factors, which leads to large intra-class distances. Hence, it is desirable to design a model that integrates information across frames and reduces intra-class distances for effective and robust video face recognition.
There have been a variety of studies on how to effectively integrate information across frames for video face representation [6, 18, 21, 28, 43]. These methods exploit video information from all frames, which is usually considered
∗ Corresponding author.
[Figure 1 diagram: components labeled CNN, local recurrent layer, local temporal pooling, frame evaluation network, attention, and verification; stages labeled spatial representation learning, temporal representation learning, and attention-aware reinforcement learning; input from the image space.]
Figure 1. Flow-chart of our proposed method for video face recognition. Our approach takes a pair of face videos as the input and produces temporal-spatial representations for each frame by using multiple stacked modules: a convolutional neural network (CNN), a recurrent layer, and a pooling layer with locality constraints. Then, a hard attention model with a frame evaluation network is trained by the proposed deep reinforcement learning method, which finds the attentions of the video pair for face verification.
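To make the stacked modules in the caption concrete, the following toy NumPy sketch runs per-frame CNN features through a simple recurrent layer and a locally constrained temporal pooling step. This is a hypothetical illustration under stated assumptions, not the paper's implementation: the function names, weight initializations, and pooling window size are all invented for the example.

```python
import numpy as np

def recurrent_layer(frame_feats, W_h=None, W_x=None):
    """Simple recurrent pass over per-frame CNN features (a hypothetical
    stand-in for the paper's recurrent layer). W_h and W_x are assumed
    learned weights; identity-based defaults keep the sketch runnable."""
    T, D = frame_feats.shape
    W_h = np.eye(D) * 0.5 if W_h is None else W_h
    W_x = np.eye(D) if W_x is None else W_x
    h = np.zeros(D)
    outputs = np.empty((T, D))
    for t in range(T):
        h = np.tanh(W_h @ h + W_x @ frame_feats[t])  # recurrent state update
        outputs[t] = h
    return outputs

def local_temporal_pooling(feats, window=3):
    """Average each frame's feature with its temporal neighbors, a simple
    way to impose a locality constraint on the pooling."""
    T, D = feats.shape
    pooled = np.empty_like(feats)
    half = window // 2
    for t in range(T):
        lo, hi = max(0, t - half), min(T, t + half + 1)
        pooled[t] = feats[lo:hi].mean(axis=0)
    return pooled

# Toy example: 5 frames with 4-dimensional CNN features.
rng = np.random.default_rng(0)
cnn_feats = rng.standard_normal((5, 4))
reps = local_temporal_pooling(recurrent_layer(cnn_feats))
print(reps.shape)  # (5, 4): one temporal-spatial representation per frame
```

The locality-constrained pooling smooths each frame's representation with its neighbors, so per-frame outputs reflect short-range temporal context rather than the whole video.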
to be of equal importance. However, some features are misleading and confounding, so low-quality frames may harm recognition performance. To address this, Yang et al. [43] proposed an attention-based method that finds the weights of features using information from the features themselves. However, information about image quality is reduced in the feature learning process [40], so information from the feature space alone is not reliable enough to find the most important parts (the precise focuses of attention) in videos.
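For intuition, feature-space soft attention in the spirit of [43] can be sketched as a learned query scoring each frame feature, followed by softmax-weighted aggregation. This is a hypothetical illustration, not the method of [43]: the query vector q and the function name are assumptions made for the example.

```python
import numpy as np

def soft_attention_aggregate(frame_feats, q):
    """Score each frame feature with a (assumed learned) query vector q,
    then return the softmax-weighted average of the features."""
    scores = frame_feats @ q               # one relevance score per frame
    w = np.exp(scores - scores.max())      # numerically stable softmax
    w /= w.sum()
    return w @ frame_feats                 # weighted aggregate feature

# Toy example: 6 frames with 4-dimensional features.
rng = np.random.default_rng(1)
feats = rng.standard_normal((6, 4))
q = rng.standard_normal(4)
video_rep = soft_attention_aggregate(feats, q)
print(video_rep.shape)  # (4,): a single aggregated video representation
```

When all scores are equal the weights become uniform and the aggregate reduces to the plain frame average, i.e. the equal-importance baseline discussed above; since the scores here come only from the features themselves, the sketch also exhibits the feature-space limitation that motivates using image-space information as well.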
In this work, we propose a new approach by introducing
the Markov decision process (MDP) [3] to remove these
misleading and confounding frames step by step with the
2017 IEEE International Conference on Computer Vision
2380-7504/17 $31.00 © 2017 IEEE
DOI 10.1109/ICCV.2017.424