The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)
Recurrent Attention Model for Pedestrian Attribute Recognition
Xin Zhao,1∗ Liufang Sang,1 Guiguang Ding,1 Jungong Han,2 Na Di,1 Chenggang Yan3
1Beijing National Research Center for Information Science and Technology (BNRist),
School of Software, Tsinghua University, Beijing 100084, China
2School of Computing & Communications, Lancaster University, UK
3Institute of Information and Control, Hangzhou Dianzi University, Hangzhou, China
zhaoxin19@gmail.com, slf12thuss@163.com, dinggg@tsinghua.edu.cn,
jungong.han@northumbria.ac.uk, dn15@mails.tsinghua.edu.cn, cgyan@hdu.edu.cn
Abstract
Pedestrian attribute recognition aims to predict attribute labels
of pedestrians from surveillance images, which is a very challenging
task for computer vision due to poor imaging quality and small
training datasets. It is observed that many semantic pedestrian
attributes to be recognised tend to show spatial locality and
semantic correlations by which they can be grouped, while previous
works mostly ignore this phenomenon. Inspired by the Recurrent
Neural Network (RNN)’s strong capability of learning context
correlations and the Attention Model’s capability of highlighting
regions of interest on a feature map, this paper proposes end-to-end
Recurrent Convolutional (RC) and Recurrent Attention (RA) models,
which are complementary to each other. The RC model mines the
correlations among different attribute groups with a convolutional
LSTM unit, while the RA model takes advantage of intra-group spatial
locality and inter-group attention correlation to improve the
performance of pedestrian attribute recognition. Our RA method
combines Recurrent Learning and the Attention Model to highlight
spatial positions on the feature map and to mine the attention
correlations among different attribute groups, yielding more precise
attention. Extensive empirical evidence shows that our recurrent
model frameworks achieve state-of-the-art results on standard
pedestrian attribute datasets, i.e., PETA and RAP.
Introduction
Pedestrian attributes, e.g., age, haircut, and footwear, are hu-
manly searchable semantic descriptions and can be used as
soft biometrics in visual surveillance applications such
as person re-identification (Layne, Hospedales, and Gong
2012; Liu et al. 2012; Peng et al. 2016), face verification
(Kumar et al. 2009), and human identification (Reid, Nixon,
and Stevenage 2014). Attributes are robust against view-
point changes and diverse viewing conditions compared to
low-level visual features. While attribute recognition has
been profitably tackled from a face recognition perspective,
very few works focus on the whole human body.
∗This research was supported by the National Key R&D Pro-
gram of China (2018YFC0806900) and the National Natural Sci-
ence Foundation of China (No. 61571269). Corresponding author:
Guiguang Ding.
Copyright © 2019, Association for the Advancement of Artificial
Intelligence (www.aaai.org). All rights reserved.
It is inherently challenging to recognise pedestrian at-
tributes from real-world surveillance images, owing to
poor imaging quality and small training datasets. High-
quality imagery and large-scale training data are not
available for pedestrian attributes. For example, the two
largest pedestrian attribute benchmark datasets, PETA
(Deng et al. 2014) and RAP (Li et al. 2016a), contain only
9,500 and 33,268 training images, respectively. Besides,
recognising pedestrian attributes has to cope with poor
image quality, imbalanced labels and complex appearance
variations in surveillance scenes.
Attribute recognition methods include hand-crafted fea-
ture methods, CNN methods and CNN-RNN methods. Early
attribute recognition methods mainly rely on hand-crafted
features such as colour and texture (Layne, Hospedales, and
Gong 2012; Liu et al. 2012; Jaha and Nixon 2014). Re-
cently, deep learning based attribute models have been pro-
posed owing to their capacity to learn more expressive repre-
sentations (Li, Chen, and Huang 2015; Fabbri, Calderara,
and Cucchiara 2017; Liu et al. 2017b), which significantly
improve the performance of pedestrian attribute recogni-
tion. For example, the DeepMAR method (Li, Chen, and
Huang 2015) utilizes prior knowledge of the object topology
for attribute recognition and designs a weighted sigmoid
cross-entropy loss to deal with the data imbalance problem
whilst training the attribute recognition model. Multi-
directional attention modules are applied in an Inception-
based deep model named HydraPlus Network (Liu et al.
2017b) to take visual attention into consideration. CNN-RNN
methods have proved successful in multi-label classification
tasks, where they mine the dependencies among labels (Li et
al. 2017; Liu et al. 2017a). A recurrent encoder-decoder
framework has been introduced into the pedestrian attribute
recognition task (Wang et al. 2017b), which aims to discover
the interdependency and correlation among attributes with a
Long Short-Term Memory (LSTM) model.
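The weighted sigmoid cross-entropy idea mentioned above can be sketched as follows. This is a minimal illustrative variant, not the exact DeepMAR formulation: the exponential weighting by each attribute's positive ratio follows the common practice of penalising mistakes on the rarer label state more heavily, and all names and the `sigma` parameter here are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def weighted_sigmoid_ce(logits, labels, pos_ratio, sigma=1.0):
    """Weighted sigmoid cross-entropy for imbalanced multi-label data.

    logits:    (N, A) raw scores, one column per attribute
    labels:    (N, A) binary ground truth
    pos_ratio: (A,)  fraction of positive samples per attribute
    """
    p = sigmoid(logits)
    eps = 1e-7
    # Rare positives get large weights, frequent positives small ones;
    # the negative-state weight mirrors this.
    w_pos = np.exp((1.0 - pos_ratio) / sigma ** 2)
    w_neg = np.exp(pos_ratio / sigma ** 2)
    w = labels * w_pos + (1 - labels) * w_neg
    ce = -(labels * np.log(p + eps) + (1 - labels) * np.log(1.0 - p + eps))
    return np.mean(w * ce)
```

With a positive ratio of 0.1, missing a rare positive costs roughly e^0.9 times the unweighted loss, while a false positive costs only about e^0.1 times it, which counteracts the label imbalance during training.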
Pedestrian attributes often show semantic or visual spatial
correlations by which they can be grouped. For ex-
ample, BaldHead and LongHair cannot occur on the same
person, yet both relate to the head-shoulders region; they
can therefore be placed in the same group and recognised
together with a specific attention on the head-shoulders
region. Existing methods try to mine the correlations of
attributes separately but ignore the spatial neighborhood
relationship and the semantic similarity of a group