
Multimodal video classification with stacked
contractive autoencoders
Yanan Liu*, Xiaoqing Feng, Zhiguang Zhou
Zhejiang University of Finance & Economics, Hangzhou, PR China
Article info
Article history:
Received 7 September 2014
Received in revised form 26 November 2014
Accepted 1 January 2015
Keywords:
Multimodal
Video classification
Deep learning
Stacked contractive autoencoder
Abstract
In this paper we propose a multimodal feature learning mechanism based on deep networks (i.e., stacked contractive autoencoders) for video classification. Considering the three modalities in video, i.e., image, audio and text, we first build one Stacked Contractive Autoencoder (SCAE) for each single modality, whose outputs are joined together and fed into another Multimodal Stacked Contractive Autoencoder (MSCAE). The first stage preserves intra-modality semantic relations and the second stage discovers inter-modality semantic correlations. Experiments on a real-world dataset demonstrate that the proposed approach achieves better performance compared with state-of-the-art methods.
© 2015 Elsevier B.V. All rights reserved.
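A minimal sketch of the two-stage pipeline described in the abstract is given below, written in PyTorch. The layer sizes, the choice of a single hidden layer per stage, and the penalty weight `lam` are illustrative assumptions, not the configuration used in the paper.

```python
# Sketch (assumptions: toy feature dimensions, one hidden layer per stage).
import torch
import torch.nn as nn

class ContractiveAE(nn.Module):
    """One contractive autoencoder layer: sigmoid encoder, linear decoder."""
    def __init__(self, n_in, n_hidden):
        super().__init__()
        self.enc = nn.Linear(n_in, n_hidden)
        self.dec = nn.Linear(n_hidden, n_in)

    def forward(self, x):
        h = torch.sigmoid(self.enc(x))
        return h, self.dec(h)

    def loss(self, x, lam=1e-3):
        h, x_rec = self.forward(x)
        recon = ((x_rec - x) ** 2).sum(dim=1).mean()
        # Contractive penalty: squared Frobenius norm of the Jacobian dh/dx.
        # For a sigmoid layer it factorizes as (h*(1-h))^2 times the row-wise
        # squared norms of the encoder weights.
        w_sq = (self.enc.weight ** 2).sum(dim=1)                 # (n_hidden,)
        jac = ((h * (1 - h)) ** 2 * w_sq).sum(dim=1).mean()
        return recon + lam * jac

# Stage 1: one contractive autoencoder per modality (image, audio, text).
img_ae, aud_ae, txt_ae = ContractiveAE(512, 128), ContractiveAE(200, 64), ContractiveAE(300, 64)

# Stage 2: per-modality codes are concatenated and fed into a multimodal
# contractive autoencoder that captures inter-modality correlations.
mm_ae = ContractiveAE(128 + 64 + 64, 96)

def multimodal_code(x_img, x_aud, x_txt):
    h_img, _ = img_ae(x_img)
    h_aud, _ = aud_ae(x_aud)
    h_txt, _ = txt_ae(x_txt)
    joint = torch.cat([h_img, h_aud, h_txt], dim=1)
    h_mm, _ = mm_ae(joint)
    return h_mm   # representation used by a downstream classifier
```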
1. Introduction
With the rapid progress of storage devices, the Internet and social networks, a large amount of video data is being generated. How to index and search these videos effectively is an increasingly active research issue in the multimedia community. To bridge the semantic gap between low-level features and high-level semantics, automatic video annotation and classification have emerged as important techniques for efficient video retrieval [1–3].
Typical approaches to video annotation and classification apply machine learning methods using only image features extracted from the keyframes of video clips. As a matter of fact, video consists of three modalities, namely image, audio and text. Image features of keyframes express only visual aspects, whereas auditory and textual features are equally important for understanding video semantics. A great deal of research has therefore focused on utilizing multimodal features for a better understanding of video semantics [4,5]. Multimodal integration in video can thus compensate for the limitations of learning from any single modality.
There are also many other multimodal learning strategies. One group focuses on multi-modal or cross-modal retrieval methods that learn to map high-dimensional heterogeneous features into a common low-dimensional latent space [6–9]. Another group consists of graph-based models, which generate geometric descriptors from multi-channel or multi-sensor data to improve image or video analysis [10–16]. However, these methods are discriminative and rely on a supervised setting, which requires a large amount of labeled data and leaves abundant unlabeled data unused. Collecting labeled data is time-consuming and labor intensive. Thus, discovering, with only unsupervised learning, representations of data that make it easier to extract useful information when building classifiers has become a major challenge.
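To make the shared-latent-space idea concrete, the following small sketch (an illustration of that family of methods, not this paper's approach) projects two hypothetical modalities into a common low-dimensional space with canonical correlation analysis from scikit-learn; the feature dimensions and latent size are assumptions.

```python
# Illustration only: mapping heterogeneous features into a shared latent space.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
X_img = rng.normal(size=(1000, 512))   # toy image features
X_txt = rng.normal(size=(1000, 300))   # toy text features paired with the images

cca = CCA(n_components=20)
cca.fit(X_img, X_txt)
Z_img, Z_txt = cca.transform(X_img, X_txt)   # both now live in a shared 20-d space
print(Z_img.shape, Z_txt.shape)              # (1000, 20) (1000, 20)
```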
Recently, deep learning methods have attracted tremendous interest from researchers. The breakthrough in deep learning was initiated by Hinton and quickly followed up in the same year [17–19], with many more studies since. A central idea, referred to as greedy layerwise unsupervised pre-training, was to learn a hierarchy of features one level at a time.
* Corresponding author.
E-mail addresses: liuyn@zju.edu.cn (Y. Liu), fxq_snake@163.com (X. Feng), zhouzhiguang@zjucadcg.cn (Z. Zhou).