Semantic Adversarial Network with Multi-scale Pyramid Attention for Video Classification
De Xie¹, Cheng Deng¹∗, Hao Wang¹, Chao Li¹, Dapeng Tao²
¹School of Electronic Engineering, Xidian University, Xi'an 710071, China
²School of Information Science and Engineering, Yunnan University, Kunming 650091, China
{xiede.xd, chdeng.xd, haowang.xidian}@gmail.com, li_chao@stu.xidian.edu.cn, dapeng.tao@gmail.com
Abstract
Two-stream architectures have shown strong performance on video classification tasks. The key idea is to learn spatio-temporal features by fusing convolutional networks spatially and temporally. However, such architectures suffer from several problems. First, they rely on optical flow to model temporal information, which is often expensive to compute and store. Second, they have limited ability to capture fine details and local context information in video data. Third, they lack explicit semantic guidance, which greatly degrades classification performance. In this paper, we propose a new two-stream-based deep framework for video classification that discovers spatial and temporal information only from RGB frames; moreover, a multi-scale pyramid attention (MPA) layer and a semantic adversarial learning (SAL) module are introduced and integrated into the framework. MPA enables the network to capture global and local features and thus generate a comprehensive video representation, while SAL makes this representation gradually approximate the real video semantics in an adversarial manner. Experimental results on two public benchmarks demonstrate that our method achieves state-of-the-art performance on standard video datasets.
Introduction
Video classification is a fundamental task in the computer vision community and serves as an important basis for high-level tasks such as video captioning (Wang et al. 2018), action detection (René and Hager 2017), and video tracking (Li et al. 2018b). Significant progress on video classification has been made with deep learning, owing to the powerful modeling capability of deep convolutional neural networks, which achieve superior performance over hand-crafted-representation-based methods. However, compared with other visual tasks (Li et al. 2018a; Fan et al. 2018; Deng et al. 2018; Yang et al. 2018), video classification must consider not only the static spatial information within each frame but also the dynamic temporal information between frames. Although deep convolutional neural networks can model spatial information well, they have limited ability to capture temporal information from the frame sequence alone. Therefore, how to model spatial and temporal information effectively within a deep learning framework remains a challenging problem.
∗Corresponding author.
Copyright © 2019, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Figure 1: Modeling temporal information with images. (a) Input frames; (b) the optical flows between these frames; (c) the differential images between multiple video frames.
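As a rough illustration of how the differential images in Figure 1(c) can be obtained, the following is a minimal sketch that simply subtracts consecutive RGB frames as a cheap motion cue; the array layout and value range are assumptions for illustration, not the exact preprocessing used in this paper.

```python
import numpy as np

def differential_images(frames):
    """Compute differential images between consecutive RGB frames.

    frames: np.ndarray of shape (T, H, W, 3) with pixel values in [0, 255].
    Returns an array of shape (T - 1, H, W, 3) holding frame differences,
    a lightweight alternative to optical flow for encoding motion cues.
    """
    frames = frames.astype(np.float32)
    return frames[1:] - frames[:-1]
```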
Video classification methods based on deep learning can be divided into three categories. The first category relies on a combination of multiple input modalities that model spatial and temporal information, respectively. The two-stream CNN (Simonyan and Zisserman 2014) is a groundbreaking work in this category: it captures static spatial information and dynamic temporal information with separate streams over multi-modality inputs, usually RGB images and optical flow. Owing to its prominent performance, many state-of-the-art methods can be regarded as variants and improvements of this paradigm. However, it relies heavily on optical flow to model temporal information, which is often expensive to compute and store. To overcome this limitation, the second category places temporal models on top of a 2D CNN, such as LSTMs (Donahue et al. 2015), temporal convolutions (Yue-Hei Ng et al. 2015), and sparse sampling and aggregation (Wang et al. 2016). Methods in this category usually extract features from different frames with a 2D CNN and then capture the relationships between these features using temporal models; a sketch of this pattern is given below. This type of method is more intuitive but lacks the capacity to capture local dynamic information and global context information.
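The snippet below is a minimal sketch of this second category (not the framework proposed in this paper): per-frame features are extracted with a 2D CNN backbone and then aggregated by an LSTM. The ResNet-18 backbone, the hidden size, and the use of the last hidden state for classification are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CNNLSTMClassifier(nn.Module):
    """Per-frame 2D CNN features aggregated by an LSTM (category two)."""

    def __init__(self, num_classes, hidden_size=512):
        super().__init__()
        backbone = models.resnet18(weights=None)   # 2D CNN feature extractor
        backbone.fc = nn.Identity()                # keep the 512-d pooled features
        self.backbone = backbone
        self.lstm = nn.LSTM(512, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, clips):                      # clips: (B, T, 3, H, W)
        b, t = clips.shape[:2]
        feats = self.backbone(clips.flatten(0, 1)) # (B*T, 512) per-frame features
        feats = feats.view(b, t, -1)               # (B, T, 512)
        out, _ = self.lstm(feats)                  # temporal modeling over frames
        return self.fc(out[:, -1])                 # classify from the last state
```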