GaitSet: Regarding Gait as a Set for Cross-View Gait Recognition*

Hanqing Chao†, Yiwei He†, Junping Zhang‡, Jianfeng Feng
Fudan University, Shanghai, China
{hqchao16, heyw15, jpzhang, jffeng}@fudan.edu.cn
Abstract
As a unique biometric feature that can be recognized at a distance, gait has broad applications in crime prevention, forensic identification and social security. To portray a gait, existing gait recognition methods utilize either a gait template, in which temporal information is hard to preserve, or a gait sequence, which must maintain unnecessary sequential constraints and thus loses the flexibility of gait recognition. In this paper, we present a novel perspective that regards gait as a set consisting of independent frames. We propose a new network named GaitSet to learn identity information from the set. Based on the set perspective, our method is immune to permutation of frames and can naturally integrate frames from different videos filmed under different scenarios, such as diverse viewing angles and different clothing/carrying conditions. Experiments show that under normal walking conditions, our single-model method achieves an average rank-1 accuracy of 95.0% on the CASIA-B gait dataset and 87.1% on the OU-MVLP gait dataset. These results represent new state-of-the-art recognition accuracy. In various complex scenarios, our model exhibits a significant level of robustness: it achieves accuracies of 87.2% and 70.4% on CASIA-B under bag-carrying and coat-wearing walking conditions, respectively, outperforming the best existing methods by a large margin. The proposed method also achieves satisfactory accuracy with a small number of frames in a test sample, e.g., 82.5% on CASIA-B with only 7 frames. The source code has been released at https://github.com/AbnerHqC/GaitSet.
1 Introduction
Unlike other biometrics such as face, fingerprint and iris, gait is a unique biometric feature that can be recognized at a distance without the subjects' cooperation or any intrusion upon them. Therefore, it has broad applications in crime prevention, forensic identification and social security.
However, gait recognition suffers from exterior factors such as the subject's walking speed, dressing and carrying
* This work is supported in part by National Natural Science Foundation of China (NSFC) (Grant No. 61673118) and in part by Shanghai Pujiang Program (Grant No. 16PJD009).
† H.C. and Y.H. are co-first authors.
‡ Corresponding author.
Copyright © 2019, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Figure 1: From top-left to bottom-right: silhouettes of a complete gait period of a subject in the CASIA-B gait dataset.
condition, and the camera's viewpoint and frame rate. There are two main ways to identify gait in the literature, i.e., regarding gait as an image and regarding gait as a video sequence.
The first category compresses all gait silhouettes into one image, or gait template, for gait recognition (He et al. 2019; Takemura et al. 2018a; Wu et al. 2017; Hu et al. 2013).
Although simple and easy to implement, a gait template loses temporal and fine-grained spatial information. In contrast, the second category, which has emerged in recent years, extracts features directly from the original gait silhouette sequences (Liao et al. 2017; Wolf, Babaee, and Rigoll 2016). However, these methods are vulnerable to exterior factors. Further, deep neural networks such as 3D-CNNs for extracting sequential information are harder to train than those using a single template like the Gait Energy Image (GEI) (Han and Bhanu 2006).
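As a concrete illustration of the template category, a minimal sketch of the Gait Energy Image: the GEI simply averages aligned binary silhouettes over one gait period. The function name and toy frames below are our own, not from the cited papers.

```python
import numpy as np

def gait_energy_image(silhouettes):
    """Average binary silhouettes over one gait period
    (the Gait Energy Image of Han and Bhanu 2006).

    silhouettes: array of shape (T, H, W) with values in {0, 1}.
    Returns an (H, W) float image; brighter pixels are occupied
    in more frames. Temporal order is lost by the averaging.
    """
    silhouettes = np.asarray(silhouettes, dtype=np.float64)
    return silhouettes.mean(axis=0)

# toy example: two 2x2 "frames"
frames = np.array([[[1, 0], [1, 1]],
                   [[1, 1], [0, 1]]])
gei = gait_energy_image(frames)  # [[1.0, 0.5], [0.5, 1.0]]
```

Note that the averaging discards which frame each pixel came from, which is exactly the loss of temporal information discussed above.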
To solve these problems, we present a novel perspective that regards gait as a set of gait silhouettes. As a periodic motion, gait can be represented by a single period. In a silhouette sequence containing one gait period, we observe that the silhouette at each position has a unique appearance, as shown in Fig. 1. Even if these silhouettes are shuffled, it is not difficult to rearrange them into the correct order simply by observing their appearance. Thus, we assume that the appearance of a silhouette contains its position information. Under this assumption, the order information of a gait sequence is unnecessary, and we can directly regard gait as a set to extract temporal information. We propose an end-to-end deep learning model called GaitSet, whose scheme is shown in Fig. 2. The input of our model is a set of gait silhouettes.
First, a CNN is used to extract frame-level features from
each silhouette independently. Second, an operation called
Set Pooling is used to aggregate frame-level features into
a single set-level feature. Since this operation is applied on
high-level feature maps instead of the original silhouettes, it
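The two-stage pipeline can be sketched as follows. This is a minimal illustration of the set perspective, not the paper's actual architecture: the frame-level extractor here is a trivial placeholder for the CNN, and element-wise max stands in for the Set Pooling operation.

```python
import numpy as np

rng = np.random.default_rng(0)

def frame_features(silhouette):
    # placeholder for the frame-level CNN: any extractor applied
    # to each frame independently fits the set perspective
    return silhouette.reshape(-1).astype(np.float64)

def set_pooling(features):
    # a permutation-invariant aggregation over the set of
    # frame-level features; element-wise max is one such choice
    return np.max(np.stack(features), axis=0)

frames = [rng.random((4, 4)) for _ in range(5)]
feats = [frame_features(f) for f in frames]
set_feature = set_pooling(feats)

# shuffling the frames leaves the set-level feature unchanged,
# i.e., the model is immune to permutation of frames
shuffled = [feats[i] for i in rng.permutation(len(feats))]
assert np.allclose(set_pooling(shuffled), set_feature)
```

Because the aggregation is symmetric in its inputs, frames from different videos of the same subject can be pooled into one set-level feature in exactly the same way.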