Pedestrian Attribute Recognition via Hierarchical Multi-task
Learning and Relationship Attention
Lian Gao
Beijing Advanced Innovation Center for Big Data and
Brain Computing, School of Computer Science and
Engineering, Beihang University.
Beijing, China
gaolian@buaa.edu.cn
Di Huang∗
Beijing Advanced Innovation Center for Big Data and
Brain Computing, School of Computer Science and
Engineering, Beihang University.
Beijing, China
dhuang@buaa.edu.cn
Yuanfang Guo
School of Computer Science and Engineering, Beihang
University, Beijing, China.
Beijing, China
andyguo@buaa.edu.cn
Yunhong Wang
Beijing Advanced Innovation Center for Big Data and
Brain Computing, School of Computer Science and
Engineering, Beihang University.
Beijing, China
yhwang@buaa.edu.cn
ABSTRACT
Pedestrian Attribute Recognition (PAR) is an important task in
surveillance video analysis. In this paper, we propose a novel end-
to-end hierarchical deep learning approach to PAR. The proposed
network introduces semantic segmentation into PAR and formu-
lates it as a multi-task learning problem, which brings in pixel-level
supervision in feature learning for attribute localization. According
to the spatial properties of local and global attributes, we present a
two-stage learning mechanism to decouple coarse attribute localization
and fine attribute recognition into successive phases within
a single model, which strengthens feature learning. Besides, we design
an attribute relationship attention module to efficiently capture
and emphasize the latent relations among different attributes, further
enhancing the discriminative power of the feature. Extensive
experiments are conducted and very competitive results are reached
on the RAP and PETA databases, indicating the effectiveness and
superiority of the proposed approach.
CCS CONCEPTS
• Computing methodologies → Object recognition.
KEYWORDS
pedestrian attribute recognition, deep learning, multi-task learning,
visual attention
∗ indicates the corresponding author.
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than ACM
must be honored. Abstracting with credit is permitted. To copy otherwise, or republish,
to post on servers or to redistribute to lists, requires prior specific permission and/or a
fee. Request permissions from permissions@acm.org.
MM ’19, October 21–25, 2019, Nice, France
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-6889-6/19/10...$15.00
https://doi.org/10.1145/3343031.3351003
ACM Reference Format:
Lian Gao, Di Huang, Yuanfang Guo, and Yunhong Wang. 2019. Pedestrian
Attribute Recognition via Hierarchical Multi-task Learning and Relation-
ship Attention. In Proceedings of the 27th ACM International Conference on
Multimedia (MM ’19), October 21–25, 2019, Nice, France. ACM, New York,
NY, USA, 9 pages. https://doi.org/10.1145/3343031.3351003
1 INTRODUCTION
Nowadays, video surveillance systems are widely deployed
to meet different security demands in various public and private facilities
and places, including squares, malls, railway stations, airports,
residential buildings, and libraries. Pedestrians are major targets in
surveillance videos, and automatic pedestrian analysis is important
to many applications, such as key person indexing, criminal trajectory
tracking, and abnormal behavior detection, where Pedestrian
Attribute Recognition (PAR) plays a fundamental role. PAR aims
to predict intrinsic characteristics (e.g. “gender”, “age”) as well as
appearance properties (e.g. “clothes style”, “accessory”) of persons
and has received increasing attention in recent years.
PAR is a challenging task with a number of intractable problems.
On the one hand, it has to handle well-known issues common in the
field of computer vision, including changes in ambient illumination,
camera viewpoint, video resolution, person gesture, and external
occlusion. On the other hand, to satisfy diverse requirements, the
number of attributes concerned keeps growing. The
attributes convey rich semantic information at different levels. In
general, local attributes (e.g. “hair style” and “accessory”) are related
to low-level or mid-level appearance features of certain regions,
while global attributes (e.g. “gender”) require a holistic representation
with special areas highlighted (e.g. face, hair, and torso), probably
corresponding to some local attributes. This complexity of attribute
relationships makes PAR even more difficult. Figure 1 shows some
examples of pedestrians and typical attributes.
Early studies on PAR follow the detection pipeline, which first
extracts handcrafted features of candidate regions and then feeds
them into classifiers for prediction, and demonstrate promising results
[2, 12]. Unfortunately, they can only handle single or very few
similar attributes, as the features used are ad hoc and not easy to be
Session 3B: Attention & Saliency