
ActionVLAD: Learning spatio-temporal aggregation for action classification
Rohit Girdhar¹∗  Deva Ramanan¹  Abhinav Gupta¹  Josef Sivic²,³∗  Bryan Russell²
¹Robotics Institute, Carnegie Mellon University  ²Adobe Research  ³INRIA
http://rohitgirdhar.github.io/ActionVLAD
Abstract
In this work, we introduce a new video representation for action classification that aggregates local convolutional features across the entire spatio-temporal extent of the video. We do so by integrating state-of-the-art two-stream networks [42] with learnable spatio-temporal feature aggregation [6]. The resulting architecture is end-to-end trainable for whole-video classification. We investigate different strategies for pooling across space and time and for combining signals from the different streams. We find that: (i) it is important to pool jointly across space and time, but (ii) appearance and motion streams are best aggregated into their own separate representations. Finally, we show that our representation outperforms the two-stream base architecture by a large margin (13% relative) and also outperforms other baselines with comparable base architectures on the HMDB51, UCF101, and Charades video classification benchmarks.
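To make the aggregation step concrete, the following is a minimal sketch, in PyTorch, of the kind of learnable spatio-temporal VLAD pooling the abstract refers to: conv features from all frames of a video are softly assigned to a set of learned "action primitive" centers, and their residuals are summed jointly over space and time. The class name, the cluster count K, and the feature dimension D are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionVLADPool(nn.Module):
    """Sketch of NetVLAD-style learnable aggregation over the conv
    features of all frames in a video (assumed hyperparameters)."""
    def __init__(self, feat_dim=512, num_clusters=64):
        super().__init__()
        # learned cluster centers ("action primitives")
        self.centers = nn.Parameter(0.01 * torch.randn(num_clusters, feat_dim))
        # 1x1 conv producing per-location soft-assignment logits
        self.assign = nn.Conv2d(feat_dim, num_clusters, kernel_size=1)

    def forward(self, x):
        # x: (T, D, H, W) conv features from T frames of one video
        T, D, H, W = x.shape
        a = F.softmax(self.assign(x), dim=1)           # (T, K, H, W) soft assignments
        x_flat = x.view(T, D, -1)                      # (T, D, H*W)
        a_flat = a.view(T, a.shape[1], -1)             # (T, K, H*W)
        # sum of assignment-weighted residuals to each center,
        # pooled jointly over space AND time
        vlad = torch.einsum('tkn,tdn->kd', a_flat, x_flat) \
             - a_flat.sum(dim=(0, 2)).unsqueeze(1) * self.centers
        vlad = F.normalize(vlad, dim=1)                # intra-normalization per cluster
        return F.normalize(vlad.flatten(), dim=0)      # (K*D,) video-level descriptor
```

In this sketch the resulting fixed-length descriptor would be fed to a classifier and trained end-to-end with the underlying CNN; the exact normalization and training details follow later in the paper.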
1. Introduction
Human action recognition is one of the fundamental problems in computer vision, with applications ranging from video navigation and movie editing to human-robot collaboration. While there has been great progress in the classification of objects in still images using convolutional neural networks (CNNs) [19, 20, 43, 47], this has not been the case for action recognition. CNN-based representations [15, 51, 58, 59, 63] have not yet significantly outperformed the best hand-engineered descriptors [12, 53]. This is partly due to missing large-scale video datasets similar in size and variety to ImageNet [39]. Current video datasets are still rather small [28, 41, 44], containing only on the order of tens of thousands of videos and a few hundred classes. In addition, those classes may be specific to certain domains, such as sports [44], and the dataset may contain noisy labels [26]. Another key open question is: what is the appropriate spatio-temporal representation for modeling videos?
∗Work done at Adobe Research during RG's summer internship.
Figure 1: How do we represent actions in a video? We propose ActionVLAD, a spatio-temporal aggregation of a set of action primitives over the appearance (RGB) and motion (flow) streams of a video. For example, a basketball shoot may be represented as an aggregation of appearance features corresponding to 'group of players', 'ball' and 'basketball hoop', and motion features corresponding to 'run', 'jump', and 'shoot'. We show examples of primitives our model learns to represent videos in Fig. 6.
Most recent video representations for action recognition are based primarily on two CNN architectures: (1) 3D spatio-temporal convolutions [49, 51], which can potentially learn complicated spatio-temporal dependencies but have so far been hard to scale in terms of recognition performance; and (2) two-stream architectures [42], which decompose the video into motion and appearance streams, train a separate CNN for each stream, and fuse the outputs at the end. While both approaches have seen rapid progress, two-stream architectures have generally outperformed spatio-temporal convolutions because they can easily exploit new ultra-deep architectures [19, 47] and models pre-trained for still-image classification.
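As a point of reference, the late fusion used by standard two-stream baselines can be sketched as follows: each sampled RGB frame and optical-flow stack is scored independently, scores are averaged over time, and the two streams are combined with a fixed weight. This is a sketch of the baseline scheme, not the method proposed here; the function names and the weight value are illustrative assumptions.

```python
import torch

def two_stream_prediction(rgb_net, flow_net, rgb_frames, flow_stacks, w_flow=1.5):
    """Late-fusion sketch for a two-stream baseline: per-frame scores are
    averaged over time within each stream, then the streams are combined
    with an assumed weight w_flow (not the paper's setting)."""
    rgb_scores = torch.stack([rgb_net(f) for f in rgb_frames]).mean(dim=0)
    flow_scores = torch.stack([flow_net(s) for s in flow_stacks]).mean(dim=0)
    return rgb_scores + w_flow * flow_scores
```

Because the scores are pooled only by averaging, any ordering or co-occurrence structure across the video is lost, which motivates the aggregation studied in this paper.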
However, two-stream architectures largely disregard the long-term temporal structure of the video and essentially learn a classifier that operates on individual frames or short blocks of a few (up to 10) frames [42], possibly enforcing consensus of classification scores over different segments of the video [58]. At test time, T (typically 25) uniformly