Semantic Adversarial Network with Multi-scale Pyramid Attention for Video Classification
De Xie¹, Cheng Deng¹∗, Hao Wang¹, Chao Li¹, Dapeng Tao²
¹School of Electronic Engineering, Xidian University, Xi'an 710071, China
²School of Information Science and Engineering, Yunnan University, Kunming 650091, China
{xiede.xd, chdeng.xd, haowang.xidian}@gmail.com, li_chao@stu.xidian.edu.cn, dapeng.tao@gmail.com
Abstract
Two-stream architectures have shown strong performance on video classification tasks. The key idea is to learn spatio-temporal features by fusing convolutional networks spatially and temporally. However, such architectures suffer from several problems. First, they rely on optical flow to model temporal information, which is often expensive to compute and store. Second, they have limited ability to capture fine details and local context information in video data. Third, they lack explicit semantic guidance, which greatly degrades classification performance. In this paper, we propose a new two-stream-based deep framework for video classification that discovers spatial and temporal information only from RGB frames; moreover, a multi-scale pyramid attention (MPA) layer and a semantic adversarial learning (SAL) module are introduced and integrated into the framework. MPA enables the network to capture global and local features and thus generate a comprehensive video representation, while SAL makes this representation gradually approximate the real video semantics in an adversarial manner. Experimental results on two public benchmarks demonstrate that our method achieves state-of-the-art performance on standard video datasets.
Introduction
Video classification is a fundamental task in the computer vision community and serves as an important basis for high-level tasks such as video captioning (Wang et al. 2018), action detection (René and Hager 2017), and video tracking (Li et al. 2018b). Significant progress on video classification has been made with deep learning, owing to the powerful modeling capability of deep convolutional neural networks, which achieve superior performance over hand-crafted-representation-based methods. However, compared with other visual tasks (Li et al. 2018a; Fan et al. 2018; Deng et al. 2018; Yang et al. 2018), video classification must consider not only the static spatial information within each frame but also the dynamic temporal information between frames. Although deep convolutional neural networks can model spatial information well, they have limited ability to capture temporal information from the frame sequence alone. Therefore, how to model spatial and temporal information effectively within a deep learning framework remains a challenging problem.
∗Corresponding author.
Copyright © 2019, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Figure 1: Modeling temporal information with images. (a) Input frames; (b) the optical flows between these frames; (c) the differential images between multiple video frames.
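As a rough illustration of how the differential images in Figure 1(c) can be obtained, the following is a minimal sketch that simply subtracts consecutive RGB frames as a cheap motion cue; the array layout and value range are assumptions for illustration, not the exact preprocessing used in this paper.

```python
import numpy as np

def differential_images(frames):
    """Compute differential images between consecutive RGB frames.

    frames: np.ndarray of shape (T, H, W, 3) with pixel values in [0, 255].
    Returns an array of shape (T - 1, H, W, 3) holding frame differences,
    a lightweight alternative to optical flow for encoding motion cues.
    """
    frames = frames.astype(np.float32)
    return frames[1:] - frames[:-1]
```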
Video classification methods based on deep learning can be divided into three categories. The first category relies on a combination of multiple input modalities that model spatial and temporal information, respectively. The two-stream CNN (Simonyan and Zisserman 2014) is a groundbreaking work in this category: it captures static spatial information and dynamic temporal information with separate streams over multi-modality inputs, usually RGB images and optical flow. Owing to its prominent performance, many state-of-the-art methods can be regarded as variants and improvements of this paradigm. However, it relies heavily on optical flow to model temporal information, which is often expensive to compute and store. To overcome this limitation, the second category places temporal models on top of a 2D CNN, such as LSTMs (Donahue et al. 2015), temporal convolutions (Yue-Hei Ng et al. 2015), and sparse sampling and aggregation (Wang et al. 2016). Methods in this category usually extract features from different frames with a 2D CNN and then capture the relationships between these features using temporal models; a sketch of this pattern is given below. This type of method is more intuitive but lacks the capacity to capture local dynamic information and global context information.
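The snippet below is a minimal sketch of this second category (not the framework proposed in this paper): per-frame features are extracted with a 2D CNN backbone and then aggregated by an LSTM. The ResNet-18 backbone, the hidden size, and the use of the last hidden state for classification are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CNNLSTMClassifier(nn.Module):
    """Per-frame 2D CNN features aggregated by an LSTM (category two)."""

    def __init__(self, num_classes, hidden_size=512):
        super().__init__()
        backbone = models.resnet18(weights=None)   # 2D CNN feature extractor
        backbone.fc = nn.Identity()                # keep the 512-d pooled features
        self.backbone = backbone
        self.lstm = nn.LSTM(512, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, clips):                      # clips: (B, T, 3, H, W)
        b, t = clips.shape[:2]
        feats = self.backbone(clips.flatten(0, 1)) # (B*T, 512) per-frame features
        feats = feats.view(b, t, -1)               # (B, T, 512)
        out, _ = self.lstm(feats)                  # temporal modeling over frames
        return self.fc(out[:, -1])                 # classify from the last state
```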