Different from the above approaches, TVA learns a local feature representation by analyzing the temporal variances of small video cubes.
C. Primary Visual Cortex (V1) and Bio-Inspired Model
Gabor filters have been applied to model simple cells of V1
in many bio-inspired models. The most popular approach
is the HMAX model [27]. Most of these approaches use
Gabor filters and hierarchical feedforward architectures to
extract appearance information to mimic the function of the
ventral pathway in the visual cortex. A neurophysiologically
plausible model based on Gabor filters was proposed in [3]
to model functions of the dorsal pathway. This model has
been successfully applied to action recognition [28]. A spatio-
temporal Laplacian pyramid coding approach was introduced
as a holistic representation by applying a bank of 3D Gabor
filters and max pooling to each level of the Laplacian pyra-
mid [29]. Escobar et al. [30] proposed a bio-inspired feedfor-
ward spiking network to model V1 and MT areas for motion
representation in action recognition. However, this motion-
based approach failed to outperform the above-mentioned
approach based on Gabor filters. Liu et al. [31] used a genetic-programming-based approach to automatically evolve spatio-temporal feature descriptors, such as 3D Gabor filters and wavelet filters, for action recognition.
Slow feature analysis (SFA) [1] extracts slowly-changing
features from rapidly-changing signals. Research shows that
receptive fields learned by SFA have properties similar to those of V1 complex cells [4]. In action recognition, SFA was first applied as a local feature, representing an action by its aggregated changes in speed [32]; this representation was competitive with state-of-the-art methods on simple datasets but generalized poorly to complex ones. Inspired by deep
learning and deep representation, Sun et al. [33] proposed a
two-layer SFA approach to extract features from videos for
action recognition, which was able to handle complex action
recognition tasks. Minh and Wiskott [34] introduced multi-
variate SFA for blind source separation. Theriault et al. [35]
improved scene recognition accuracy using SFA. A probabilistic SFA [36] was proposed to detect changes in facial expression in video sequences. Most of these approaches treat SFA as a conventional dimensionality reduction method; SFA is rarely exploited as a bio-inspired model.
D. Contribution
Based on the studies discussed above, the main contributions of this paper can be summarized as follows.
1) TVA is proposed as a generalization of SFA that uses both slow and fast features. We introduce the use of fast features for motion representation. By mimicking the function of V1 cells, appearance and motion information can be obtained from slow and fast features, respectively.
2) Additional motion features are introduced by extracting
features from optical flows. In this way, slow features
encode velocity information, and fast features encode
acceleration information.
3) By using parts of the fast filters as slow filters and vice versa, the hybrid slim filter is proposed to improve both slow and fast feature extraction.

Fig. 2. A flow chart of TVA for action recognition.
III. TVA FOR ACTION RECOGNITION
In this section, we give details on the proposed method.
A brief framework of TVA for action recognition is shown
in Fig. 2. We first train convolution filters by TVA using
cubes aligned with tracked trajectories. Then convolution and
pooling are performed for local feature extraction. Finally, Fisher vectors are used to obtain the final video representation.
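As a rough illustration of the pipeline in Fig. 2, the toy sketch below runs the three stages on random data; the cube dimensions, the filter count, and the use of mean pooling in place of Fisher vector encoding are simplifying assumptions, not the actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Random stand-ins for trajectory-aligned cubes: 500 cubes,
# each 10 frames of 64-dimensional patches.
cubes = rng.standard_normal((500, 10, 64))

# Stage 1: the filters would be learned by TVA from such cubes
# (see the sketch in Sec. III-A); random filters stand in here.
filters = rng.standard_normal((64, 16))

# Stage 2: convolution (here reduced to a projection) followed by
# temporal max pooling gives one local feature per cube.
local_features = np.abs(cubes @ filters).max(axis=1)  # shape (500, 16)

# Stage 3: aggregate local features into a video-level descriptor;
# mean pooling stands in for the Fisher vector encoding.
video_descriptor = local_features.mean(axis=0)        # shape (16,)
```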
A. Temporal Variance Analysis
Considerable efforts have been made to model temporal
information for feature extraction. SFA [1] extracts slowly-
varying information from quickly-varying input signals by
applying the temporal slowness principle. For example, in
action recognition, it is evident that while pixels in a video
may change markedly, the perception of action might not
change at all. The temporal slowness principle argues that
this unchanging concept can be extracted by capturing slowly-
varying features.
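Concretely, the slowness principle leads to the standard SFA optimization of [1], stated here in minimal form with $\langle \cdot \rangle_t$ denoting temporal averaging and the dot a temporal derivative: find output functions $g_j$ minimizing

$$\min_{g_j}\ \Delta(y_j) := \big\langle \dot{y}_j^{\,2} \big\rangle_t, \qquad y_j = g_j(x(t)),$$

subject to $\langle y_j \rangle_t = 0$ (zero mean), $\langle y_j^2 \rangle_t = 1$ (unit variance), and $\langle y_i y_j \rangle_t = 0$ for $i < j$ (decorrelation).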
However, from the perspective of local features, it is difficult to find a compact high-level semantic representation whose features vary as slowly as we would like. We therefore suggest
that local features need to be represented by both slow- and
fast-varying information. For example, considering a moving
object in a small video cube, the fast-varying information
encodes the dynamic motion pattern, and the slow-varying
information encodes the near static appearance of the object.
Using both fast- and slow-varying information therefore yields a more complete representation.
To this end, we propose the temporal variance
analysis (TVA) for local feature extraction. Considering
a multi-dimensional temporal sequence which consists of
components with different temporal variances, TVA extracts
these components by a linear projection and uses them as the feature representation. Fast features, which are components
with large temporal variances, encode motion information,
while slow features, which are components with small
temporal variances, encode appearance information.
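To make this concrete, the sketch below estimates such a projection with numpy, under the assumption that the temporal variance of a component is measured by the covariance of finite temporal differences normalized by the signal covariance; the precise formulation follows at the end of this subsection, so this is only an illustrative reading.

```python
import numpy as np
from scipy.linalg import eigh

def tva_projections(X, n_slow, n_fast):
    """X: (T, d) temporal sequence; returns slow and fast projection matrices."""
    X = X - X.mean(axis=0)                        # center the sequence
    C = np.cov(X, rowvar=False)                   # signal covariance (assumed full rank)
    D = np.cov(np.diff(X, axis=0), rowvar=False)  # covariance of temporal differences
    # Generalized eigenproblem D w = lambda C w; ascending eigenvalues
    # rank components from smallest to largest temporal variance.
    _, vecs = eigh(D, C)
    w_slow = vecs[:, :n_slow]    # smallest temporal variance -> appearance
    w_fast = vecs[:, -n_fast:]   # largest temporal variance -> motion
    return w_slow, w_fast
```

Applied to a trajectory-aligned cube flattened into a (frames x pixels) matrix, w_slow would play the role of the slow filters and w_fast of the fast filters in the convolution stage.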
In this paper, we denote matrices by upper-case letters and vectors by lower-case letters. The matrix transpose is denoted by a superscript T; for example, U^T means the transpose of matrix U. Mathematically, the proposed TVA is detailed as follows.