Pattern Recognition 98 (2020) 107036
Pattern Recognition
journal homepage: www.elsevier.com/locate/patcog
Deep-Person: Learning discriminative deep features for person Re-Identification
Xiang Bai, Mingkun Yang, Tengteng Huang, Zhiyong Dou, Rui Yu, Yongchao Xu∗
School of Electronic Information and Communications, Huazhong University of Science and Technology (HUST), Wuhan 430074, China
Article info
Article history:
Received 2 March 2018
Revised 14 July 2019
Accepted 3 September 2019
Available online 6 September 2019
Keywords:
Person Re-ID
LSTM
Triplet loss
End-to-end
Abstract
Person re-identification (Re-ID) requires discriminative features that focus on the full person to
cope with inaccurate person bounding box detection, background clutter, and occlusion. Many recent
person Re-ID methods attempt to learn such features, describing full-person details, via part-based
feature representation. However, because each part is processed by an independent extractor, the
spatial context between these parts is ignored. In this paper, we propose to apply Long Short-Term
Memory (LSTM) in an end-to-end way to model the pedestrian, seen as a sequence of body parts from
head to foot. Integrating the contextual information strengthens the discriminative ability of the
local features, which align better to the full person. We also leverage the complementary
information between local and global features. Furthermore, we integrate both an identification
task and a ranking task in one network, where a discriminative embedding and a similarity
measurement are learned concurrently. This results in a novel three-branch framework named
Deep-Person, which learns highly discriminative features for person Re-ID. Experimental results
demonstrate that Deep-Person outperforms the state-of-the-art methods by a large margin on three
challenging datasets: Market-1501, CUHK03, and DukeMTMC-reID.
©2019 Elsevier Ltd. All rights reserved.
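As an informal illustration of the part-sequence idea described in the abstract, the following Python sketch contrasts a single globally pooled feature with an ordered, head-to-foot sequence of part features obtained by pooling horizontal stripes of a feature map. The function names, the nested-list feature map layout, and the equal-height stripe partition are assumptions made for this example and are not taken from the paper. The sketch only builds the sequence that a recurrent model such as an LSTM could consume; it does not implement the LSTM or the Deep-Person network itself.

```python
from typing import List

# Hypothetical layout for this sketch: feature map indexed as [channel][row][column].
FeatureMap = List[List[List[float]]]


def global_average_pool(fmap: FeatureMap) -> List[float]:
    """Average each channel over all spatial positions, giving one global vector."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in fmap
    ]


def part_sequence(fmap: FeatureMap, num_parts: int) -> List[List[float]]:
    """Split the rows into `num_parts` equal horizontal stripes, ordered from the
    top of the image (head) to the bottom (foot), and average-pool each stripe."""
    height = len(fmap[0])
    if height % num_parts != 0:
        raise ValueError("height must be divisible by num_parts in this sketch")
    rows_per_part = height // num_parts
    sequence = []
    for p in range(num_parts):
        # Keep only the rows belonging to stripe p, for every channel.
        stripe = [
            channel[p * rows_per_part:(p + 1) * rows_per_part] for channel in fmap
        ]
        sequence.append(global_average_pool(stripe))
    return sequence


# Toy feature map: 1 channel, 4 rows, 2 columns; the upper half differs from the lower half.
toy = [[[1.0, 1.0], [1.0, 1.0], [5.0, 5.0], [5.0, 5.0]]]

global_feature = global_average_pool(toy)  # [3.0]: the spatial layout is averaged away
parts = part_sequence(toy, num_parts=2)    # [[1.0], [5.0]]: ordered head-to-foot parts
```

In this toy input the global feature collapses the difference between the upper and lower halves, whereas the two-element part sequence preserves it in order. This is the kind of local detail that a sequence model could then relate across neighbouring parts.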
1. Introduction
Person re-identification (Re-ID) refers to the task of matching a
specific person across multiple non-overlapping cameras. It has
been receiving increasing attention in the computer vision community
thanks to its various surveillance applications. Despite decades
of study, person Re-ID remains very challenging due to
inaccurate person bounding box detection and large variations in
illumination, pose, background clutter, occlusion, and ambiguity
in visual appearance. Discriminative features focusing mainly on
the full person are essential to cope with these challenges in person
Re-ID.
Most early works in person Re-ID focus on either discriminative
hand-crafted feature representations or robust distance metrics for
similarity measurement. Benefiting from the development of deep
learning and increasingly large-scale datasets [1–3], recent person
Re-ID methods combine feature extraction and distance metric into
an end-to-end deep convolutional neural network (CNN). Nevertheless,
most recent CNN-based methods endeavor either to design
a better feature representation or to develop more robust feature
learning, but rarely both aspects together. Recently, some
semi-supervised and unsupervised methods have been proposed to further
advance this field [4,5], achieving satisfactory performance with
few or even no labels.
∗ Corresponding author.
E-mail addresses: xbai@hust.edu.cn (X. Bai), yangmingkun@hust.edu.cn (M.
Yang), tengtenghuang@hust.edu.cn (T. Huang), zydou@hust.edu.cn (Z. Dou),
yurui.thu@gmail.com (R. Yu), yongchaoxu@hust.edu.cn (Y. Xu).
The CNN-based methods focusing on better feature representations
can be roughly divided into three categories: 1) Global
full-body representation, which is adopted in many methods [3,6].
Global average pooling is widely used for such global feature
extraction; it decreases the granularity of the features and thus
loses local details (see Fig. 1(a)); 2) Local body-part representation,
which has been exploited in many works with various part
partitions. A straightforward partition into predefined rigid body
parts is used in many works [7–10]. This may make the learned
feature focus on some person details. Yet, due to pose variations,
imperfect pedestrian detectors, and occlusion, such a trivial partition
fails to correctly learn features aligned to the full person, leading to
part-based features that are far from robust. Some recent works endeavor
to develop better body partitions with more sophisticated methods
[11–13] or by using extra pose annotation [14,15]. Although these
part-based methods can enrich the generated feature so that it better
describes some person details, they all ignore the contextual information
between the body parts, and thus still fail to align well to the full
person and suffer from occlusion, blurring, and background noise.
In [16], the authors propose to first convert the original person image
into sequential LOMO and Color Names features, then rely on
https://doi.org/10.1016/j.patcog.2019.107036
0031-3203/© 2019 Elsevier Ltd. All rights reserved.