1934 W. Li et al. / Neurocomputing 275 (2018) 1932–1945
based on spatiotemporal shape variation. It captures the motion
information over time and represents normalized frame differ-
ences over a gait cycle. The dynamic texture descriptors and lo-
cal binary patterns from three orthogonal planes were used to de-
scribe the human gait in a spatiotemporal manner by Kellokumpu
et al. [31] . Kusakunniran et al. [32] used higher-order shape config-
uration based on a differential composition model for cross-speed
gait recognition. Jure Kova
ˇ
c
and Peter Peer [33] investigated the
influence of walking speed variation to different gait recognition
approaches and proposed normalization based on geometric trans-
formations to mitigate the influence in gait recognition. Mansur
et al. [34] proposed to model speed change using a cylindrical
manifold whose azimuth and height correspond to the phase and
the stride, respectively. Huang et al. [35] presented a scheme com-
posed of a speed-invariant gait template (SIGT) and a normalized
hypergraph classifier for cross-speed gait recognition. These methods
used the shape and dynamic information of frames for recognition.
However, extracting the spatiotemporal shape and motion informa-
tion frame by frame is time-consuming, and the extracted features
can be sensitive to noise.
Kusakunniran [36,37] proposed the histogram of
space-time interest point descriptors (HSD) as a gait feature. Cas-
tro et al. [38] proposed the pyramidal Fisher motion (PFM) de-
scriptors by combining densely sampled local features and Fisher
vectors for both single-view and multi-view gait recognition. Un-
like most appearance-based methods that rely on human silhou-
ettes obtained from the foreground-background segmentation, the
HSD-based and the PFM-based methods extract gait features di-
rectly from the raw gait videos.
Recently, deep learning techniques have been employed for gait
recognition [39–42]. Alotaibi and Mahmood [39] proposed a spe-
cialized deep convolutional neural network (CNN) for gait recog-
nition. Their CNN architecture consists of multiple convolutional
and sub-sampling layers, making the gait recognition scheme ro-
bust against certain types of variations. Yan et al. [40] proposed
to use convolutional neural networks (ConvNets) with a multi-task
learning (MTL) model to identify human gait and predict multi-
ple human attributes simultaneously. Castro et al. [41] proposed
to use CNN to learn high-level descriptors from low-level motion
features (i.e. optical flow components) for gait recognition. Zhang
et al. [42] proposed a Siamese neural network (SiaNet) with dis-
tance metric learning for gait recognition. In general, the perfor-
mance of these methods is highly dependent on the training sam-
ples. With sufficient training samples containing rich variances,
these methods can learn effective features automatically, leading
to good recognition performance.
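As a concrete illustration, the distance metric learning in a Siamese network is commonly driven by a contrastive loss of the following form (a standard formulation given here for clarity; the exact loss used in [42] may differ):

```latex
L(x_1, x_2, y) = y\, d^2 + (1 - y)\, \max(0,\, m - d)^2,
\qquad d = \lVert f(x_1) - f(x_2) \rVert_2 ,
```

where $f(\cdot)$ denotes the shared embedding network, $y = 1$ if the two gait samples $x_1$ and $x_2$ come from the same subject and $y = 0$ otherwise, and $m > 0$ is a margin hyperparameter that pushes embeddings of different subjects apart.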
2.2. Multi-View Gait Recognition
Compared with single-view gait recognition, the appear-
ance variation caused by multiple viewing angles poses even
greater challenges for robust gait recognition. Many algorithms
have been proposed to address the viewing angle variation
[11,12,27,43–53]. These methods can be divided into three
types: methods [11,43–45,47,49] based on view transformation
models (VTMs), methods [46,51,52] based on pairwise projection
by canonical correlation analysis (CCA) or nonlinear coupled map-
pings (NCMs) and the others [12,27,38,48,50,53] without learning
specific VTMs or specific pairwise subspaces.
Makihara et al. [11] used frequency-domain features and
view transformation models (VTMs) for multi-view gait recog-
nition. Kusakunniran et al. [43] exploited VTMs based
on optimized GEIs for further performance improvement. Zheng
et al. [44] considered VTMs using the partial least squares on
the GEI, which offered more robust performance against varia-
tions in viewing angle, clothing and object carrying. Muramatsu
et al. [45] proposed a VTM-based approach by using transfor-
mation consistency measures for cross-view recognition. More-
over, Muramatsu et al. [47] proposed multiple quality measures
for VTM-based cross-view gait recognition. The key idea is to as-
sociate the quality measures with how well the test subjects’
gait features are represented by a joint subspace spanned by the
training subjects’ gait features. Still further, Mura-
matsu et al. [49] proposed an arbitrary view transformation model
(AVTM) to match a pair of gait traits from an arbitrary view. These
VTM-based methods achieved high performance in multi-view
gait recognition. However, they all rest on the assumption that
the viewing angles of the gallery and the probe sets are known
a priori, which imposes a strong restriction on gait applications.
Besides, it is burdensome to learn a specific VTM for every pair of
views, and the recognition rate is highly dependent on the density
of view sampling.
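To make the VTM idea concrete, a common formulation (following the singular-value-decomposition construction in [11]; the notation here is illustrative) factorizes the matrix of training gait features across views and subjects, so that the feature of subject $m$ under view $\theta_j$ is approximated by a view-dependent projection of a view-independent subject vector:

```latex
g_{\theta_j}^{m} \approx P_{\theta_j} v^{m},
\qquad
g_{\theta_j}^{m} \approx P_{\theta_j} P_{\theta_i}^{+} \, g_{\theta_i}^{m},
```

where $P_{\theta}$ is the projection matrix learned for view $\theta$ and $P_{\theta_i}^{+}$ denotes its pseudo-inverse; the second relation transforms a gait feature observed under view $\theta_i$ into view $\theta_j$ so that gallery and probe can be matched in a common view.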
Kusakunniran et al. [46] carried out motion co-clustering to par-
tition the most related parts of gaits from different views into
the same group. Inside each group, a linear correlation between
gait information across views is further maximized through CCA.
Xing et al. [51] proposed a complete canonical correlation analy-
sis (C3A) method to deal with multi-view gait recognition. As re-
ported in these papers, methods of this type currently achieve the
highest recognition rates among all the multi-view gait recognition
methods. Ben et al. [52] proposed a nonlinear coupled map-
pings (NCMs) algorithm to match gaits across domains. The
relationships within the training data are modeled as nodes of
a graph in the kernel space, and a constraint is designed to
minimize the difference between the cross-domain gaits of the
same subject. However, these methods also assume that the
viewing angles of the gallery and the probe sets are known
a priori. Besides, they need to learn a projection subspace,
through CCA, C3A or NCMs, for every pair of views, and the
recognition rate is highly dependent on the density of view sam-
pling. All of these factors form a strong barrier to the practical use
of this type of method.
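For reference, the CCA criterion underlying these pairwise projections seeks, for gait features $x$ and $y$ observed under two views, projection directions that maximize the correlation of the projected features (a standard formulation, stated here for clarity):

```latex
(w_x^{*}, w_y^{*}) = \arg\max_{w_x,\, w_y}
\frac{w_x^{\top} \Sigma_{xy} w_y}
{\sqrt{w_x^{\top} \Sigma_{xx} w_x}\; \sqrt{w_y^{\top} \Sigma_{yy} w_y}} ,
```

where $\Sigma_{xy}$ is the cross-covariance of the two views’ features and $\Sigma_{xx}$, $\Sigma_{yy}$ are the within-view covariances; recognition is then performed in the common projected space, which is why a separate subspace must be learned for every pair of views.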
Yu et al. [12] proposed a framework for gait recognition perfor-
mance evaluation, and employed the GEI and the nearest neigh-
bor classifier for multi-view gait recognition. Dupuis et al. [27] and
Choudhury et al. [50] both adopted a two-step hierarchical recog-
nition procedure which, for a probe sample, first predicts its
viewing angle and then finds the match in the predicted
subset of the gallery. These algorithms do not require any prior
knowledge about the probe or the gallery samples, but their
recognition rates are dependent on the prediction accuracy and
the completeness of the gallery subsets. Makihara et al. [48] de-
scribed a method of multi-view discriminant analysis with tensor
representation (MvDATER) for multi-view gait recognition. How-
ever, there must be sufficient training samples for it to learn mul-
tiple view-specific projection matrices. Besides, the tensor repre-
sentation is sensitive to large viewing angle changes. Wu et al.
[53] conducted multi-view gait recognition via similarity learning
by deep CNN. They trained deep networks to recognize the most
discriminative changes of gait patterns by a small group of labeled
multi-view human walking videos.
Despite all the above-mentioned efforts, the achieved multi-
view gait recognition rates are still relatively low. This is largely
due to the fact that the viewing angle variation brings larger intra-
class variances than other types of variation. Furthermore, the
viewing angle variation on top of other variation types adds even
more complications to the problem.
2.3. Cooperative vs. Uncooperative Settings
Most algorithms described above are designed for a cooperative
experimental setting, where the covariate conditions are known as