
Multimodal video classification with stacked
contractive autoencoders
Yanan Liu*, Xiaoqing Feng, Zhiguang Zhou
Zhejiang University of Finance & Economics, Hangzhou, PR China
Article info
Article history:
Received 7 September 2014
Received in revised form 26 November 2014
Accepted 1 January 2015
Keywords:
Multimodal
Video classification
Deep learning
Stacked contractive autoencoder
Abstract
In this paper we propose a multimodal feature learning mechanism based on deep networks (i.e., stacked contractive autoencoders) for video classification. Considering the three modalities in video, i.e., image, audio and text, we first build one Stacked Contractive Autoencoder (SCAE) for each single modality, whose outputs are joined together and fed into another Multimodal Stacked Contractive Autoencoder (MSCAE). The first stage preserves intra-modality semantic relations and the second stage discovers inter-modality semantic correlations. Experiments on a real-world dataset demonstrate that the proposed approach achieves better performance compared with state-of-the-art methods.
© 2015 Elsevier B.V. All rights reserved.
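A minimal sketch of the two-stage pipeline described in the abstract is given below, written in PyTorch. The layer sizes, the choice of a single hidden layer per stage, and the penalty weight `lam` are illustrative assumptions, not the configuration used in the paper.

```python
# Sketch (assumptions: toy feature dimensions, one hidden layer per stage).
import torch
import torch.nn as nn

class ContractiveAE(nn.Module):
    """One contractive autoencoder layer: sigmoid encoder, linear decoder."""
    def __init__(self, n_in, n_hidden):
        super().__init__()
        self.enc = nn.Linear(n_in, n_hidden)
        self.dec = nn.Linear(n_hidden, n_in)

    def forward(self, x):
        h = torch.sigmoid(self.enc(x))
        return h, self.dec(h)

    def loss(self, x, lam=1e-3):
        h, x_rec = self.forward(x)
        recon = ((x_rec - x) ** 2).sum(dim=1).mean()
        # Contractive penalty: squared Frobenius norm of the Jacobian dh/dx.
        # For a sigmoid layer it factorizes as (h*(1-h))^2 times the row-wise
        # squared norms of the encoder weights.
        w_sq = (self.enc.weight ** 2).sum(dim=1)                 # (n_hidden,)
        jac = ((h * (1 - h)) ** 2 * w_sq).sum(dim=1).mean()
        return recon + lam * jac

# Stage 1: one contractive autoencoder per modality (image, audio, text).
img_ae, aud_ae, txt_ae = ContractiveAE(512, 128), ContractiveAE(200, 64), ContractiveAE(300, 64)

# Stage 2: per-modality codes are concatenated and fed into a multimodal
# contractive autoencoder that captures inter-modality correlations.
mm_ae = ContractiveAE(128 + 64 + 64, 96)

def multimodal_code(x_img, x_aud, x_txt):
    h_img, _ = img_ae(x_img)
    h_aud, _ = aud_ae(x_aud)
    h_txt, _ = txt_ae(x_txt)
    joint = torch.cat([h_img, h_aud, h_txt], dim=1)
    h_mm, _ = mm_ae(joint)
    return h_mm   # representation used by a downstream classifier
```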
1. Introduction
With the rapid progress of storage devices, the Internet and social networks, a large amount of video data is being generated. How to index and search these videos effectively is an increasingly active research issue in the multimedia community. To bridge the semantic gap between low-level features and high-level semantics, automatic video annotation and classification have emerged as important techniques for efficient video retrieval [1–3].
Typical approaches to video annotation and classification apply machine learning methods using only image features extracted from the keyframes of video clips. As a matter of fact, video consists of three modalities, namely image, audio and text. Image features of keyframes express only visual aspects, whereas auditory and textual features are equally important for understanding video semantics. A great deal of research has therefore focused on utilizing multimodal features for a better understanding of video semantics [4,5]. Multimodal integration in video can thus compensate for the limitations of learning from any single modality.
There are also many other multimodal learning strategies. One group focuses on multi-modal or cross-modal retrieval methods that learn to map high-dimensional heterogeneous features into a common low-dimensional latent space [6–9]. Another group consists of graph-based models, which generate geometric descriptors from multi-channel or multi-sensor data to improve image or video analysis [10–16]. However, these methods are discriminative and rely on a supervised setting, which requires a large amount of labeled data and leaves abundant unlabeled data unused. Collecting labeled data is time-consuming and labor intensive. Thus, discovering, with only unsupervised learning, representations of data that make it easier to extract useful information when building classifiers has become a major challenge.
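To make the shared-latent-space idea concrete, the following small sketch (an illustration of that family of methods, not this paper's approach) projects two hypothetical modalities into a common low-dimensional space with canonical correlation analysis from scikit-learn; the feature dimensions and latent size are assumptions.

```python
# Illustration only: mapping heterogeneous features into a shared latent space.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
X_img = rng.normal(size=(1000, 512))   # toy image features
X_txt = rng.normal(size=(1000, 300))   # toy text features paired with the images

cca = CCA(n_components=20)
cca.fit(X_img, X_txt)
Z_img, Z_txt = cca.transform(X_img, X_txt)   # both now live in a shared 20-d space
print(Z_img.shape, Z_txt.shape)              # (1000, 20) (1000, 20)
```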
Recently, deep learning methods have attracted tremendous interest from researchers. The breakthrough in deep learning was initiated by Hinton and quickly followed up in the same year [17–19], with many more studies since. A central idea, referred to as greedy layerwise unsupervised pre-training, was to learn a hierarchy of features one level at a time.
* Corresponding author.
E-mail addresses: liuyn@zju.edu.cn (Y. Liu), fxq_snake@163.com (X. Feng), zhouzhiguang@zjucadcg.cn (Z. Zhou).