6128 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 28, NO. 12, DECEMBER 2019
Fig. 1. Overview of the proposed network architecture. The network consists of three branches, each incorporating a specific attention
mechanism: parsing attention (PA), label attention (LA), or spatial attention (SA). The network is built on SE-BN-Inception [24],
a lightweight CNN architecture in the SE-Net family [24]. In more detail, the CNN module f_B consists of nine inception blocks [31] and nine SE
blocks [24], with each inception block followed by an SE block. The modules f_L, f_S, and f_P share the same structure: one inception block followed by
an SE block. All three branches are learned jointly, with each branch followed by a loss layer.
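The data flow described in the caption of Fig. 1 can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: `make_stage` is a hypothetical stand-in that records which module was applied instead of running a real CNN, and the pairing of f_P, f_L, and f_S with the PA, LA, and SA branches is assumed from the subscripts.

```python
def make_stage(name):
    """Hypothetical stand-in for a CNN module: it only records its own name."""
    def stage(trace):
        return trace + [name]
    return stage

# Shared trunk f_B (nine inception + SE blocks in the caption of Fig. 1).
f_B = make_stage("f_B")

# One branch-specific module per attention type (pairing assumed from subscripts).
branches = {
    "PA": make_stage("f_P"),
    "LA": make_stage("f_L"),
    "SA": make_stage("f_S"),
}

def forward(image_trace):
    """Run the shared trunk once, then feed its output to every branch."""
    shared = f_B(image_trace)
    return {name: module(shared) for name, module in branches.items()}

outputs = forward(["image"])
for name, trace in outputs.items():
    print(name, " -> ".join(trace))
```

The point of the sketch is that f_B is evaluated once and its output is reused by all three branches; in the actual network each branch output would additionally feed its own loss layer.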
to explore attribute context and correlation. Lin et al. [5]
propose a discriminative CNN embedding for both person
re-identification and attributes recognition, yielding promising
performance in both tasks. Moreover, Liu et al. [6] propose
a multi-directional attention mechanism for fine-grained
pedestrian analysis. In this study, we establish another
attention-based method that differs from that of Liu et al. [6].
Specifically, three attention mechanisms are proposed and
learned concurrently to approach the prediction problem
from different perspectives.
2) Pedestrian Parsing: Early methods for pedestrian parsing
[29], [39]–[41] rely heavily on the training
set and cannot accurately fit object boundaries.
In contrast, Fully Convolutional Networks (FCNs),
a category of network architectures, have proven
effective and efficient for segmentation tasks [42]–[47].
Within this category,
Long et al. [42] first propose a Fully Convolutional Network
(FCN) for pixel-wise prediction, originally
applied to semantic segmentation, which improved on the
state-of-the-art performance of the time by a large margin. More recently,
Zhao et al. [45] propose another network in this category,
the Pyramid Scene Parsing Network (PSPNet), for
scene parsing; it ranked first in the ImageNet Scene
Parsing Challenge 2016. Given the strong performance of FCNs
in segmentation tasks, we
build our pedestrian parsing network
on FCNs. However, the low
resolution of pedestrian images in surveillance scenes
can degrade the performance of FCNs. For
example, Xia et al. [30] employ FCN-32 and FCN-16 [42] for
pedestrian parsing but obtain low performance. As a result, it is
inappropriate to apply these existing frameworks directly to
pedestrian parsing; some adjustments to the
frameworks are needed.
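A rough calculation illustrates why low input resolution hurts stride-32 and stride-16 predictors. The sketch below is only an approximation under stated assumptions: the 128 x 64 crop size is hypothetical (it is not taken from the paper), `coarse_map_size` is an illustrative helper, and padding and other implementation details of the FCN variants in [42] are ignored.

```python
import math

def coarse_map_size(height, width, stride):
    """Approximate size of the coarsest score map for a given output stride."""
    return math.ceil(height / stride), math.ceil(width / stride)

# Hypothetical low-resolution pedestrian crop (an assumption for illustration).
height, width = 128, 64

for name, stride in (("FCN-32", 32), ("FCN-16", 16)):
    rows, cols = coarse_map_size(height, width, stride)
    print(f"{name}: {rows} x {cols} score map before upsampling")
```

Under these assumptions the stride-32 variant predicts from a 4 x 2 grid and the stride-16 variant from an 8 x 4 grid, so each coarse cell must account for a large part of the body, which makes fine boundaries hard to recover.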
3) Attention: Attention models [19]–[27], [48] have attracted
great interest in recent years. Fu et al. [21] propose a recurrent
attention convolutional neural network that detects
discriminative regions for fine-grained image recognition.
Wang et al. [22] propose a residual
attention network constructed by stacking multiple
attention modules, and conduct a variety of experiments
to show the effectiveness of
the proposed network. Li et al. [23] propose an attention
mechanism that combines spatial and temporal attention for
person re-identification. More recently, Hu et al. [24]
propose Squeeze-and-Excitation Networks (SE-Net) for image
classification, in which a channel attention mechanism
recalibrates channel-wise feature responses. With
its superior performance, SE-Net won first place in
ILSVRC 2017. Shen et al. [48] propose sharp attention
networks for person re-identification and achieve promising
performance. Inspired by these works, we establish a new
attention network that is expected to achieve better performance.
Specifically, one of our innovations lies in the adoption
of three different kinds of attention mechanisms, two of
which are newly developed. The three attention mechanisms
are not only incorporated into a unified network and learned
jointly, but also extract the most discriminative features from
different views and complement one another to obtain
better prediction performance.
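To make the channel recalibration of SE-Net [24] concrete, the following is a minimal sketch of a single squeeze-and-excitation step in plain Python. It is not the SE-BN-Inception code used in this work: the toy sizes, the random weights, the reduction ratio r = 2, and the omission of biases are all assumptions chosen for illustration.

```python
import math
import random

def se_block(x, w1, w2):
    """Recalibrate the channels of x, a list of C channels (each H x W).

    w1 holds the (C // r) x C bottleneck weights and w2 the C x (C // r)
    expansion weights. Biases are omitted for brevity.
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    z = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in x]
    # Excitation: FC -> ReLU -> FC -> sigmoid gives one gate in (0, 1) per channel.
    h = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    s = [1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(row, h)))) for row in w2]
    # Scale: multiply every spatial position of channel c by its gate s[c].
    y = [[[v * g for v in row] for row in ch] for ch, g in zip(x, s)]
    return y, s

random.seed(0)
C, H, W, r = 8, 4, 4, 2  # toy sizes, chosen only for illustration
x = [[[random.gauss(0, 1) for _ in range(W)] for _ in range(H)] for _ in range(C)]
w1 = [[random.gauss(0, 0.1) for _ in range(C)] for _ in range(C // r)]
w2 = [[random.gauss(0, 0.1) for _ in range(C // r)] for _ in range(C)]

y, gates = se_block(x, w1, w2)
print("number of gates:", len(gates))
```

The output has the same shape as the input; only the relative weight of each channel changes, which is what "recalibrating channel-wise feature responses" refers to.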
III. OUR APPROACH
A. The Overall Design
The overview of the proposed network architecture for
pedestrian attributes analysis is shown in Fig. 1. The proposed
network architecture is built on
SE-BN-Inception [24], a lightweight CNN architecture in
the SE-Net family. As shown in the overview, the proposed
network architecture uses a parallel structure in which
each of the three branches incorporates a specific
attention mechanism: parsing attention, label attention,
or spatial attention. Since different attention mechanisms
have different perspectives, they are expected to capture
correlated complementary information and discover optimal
per-branch discriminative feature representations. To this end,
we formulate a joint learning scheme with the following
principles: (1) Low-level features are shared by all branches.
As Fig. 1 shows, all three branches
receive common low-level features before performing their
respective CNNs. This shared learning, inspired by multi-task
learning [49], [50], facilitates not only inter-attention
common learning but also knowledge transfer between
different attentions. Moreover, it also helps to reduce the