
ActionVLAD: Learning spatio-temporal aggregation for action classification
Rohit Girdhar¹∗  Deva Ramanan¹  Abhinav Gupta¹  Josef Sivic²,³∗  Bryan Russell²
¹Robotics Institute, Carnegie Mellon University  ²Adobe Research  ³INRIA
http://rohitgirdhar.github.io/ActionVLAD
Abstract
In this work, we introduce a new video representation for action classification that aggregates local convolutional features across the entire spatio-temporal extent of the video. We do so by integrating state-of-the-art two-stream networks [42] with learnable spatio-temporal feature aggregation [6]. The resulting architecture is end-to-end trainable for whole-video classification. We investigate different strategies for pooling across space and time and for combining signals from the different streams. We find that: (i) it is important to pool jointly across space and time, but (ii) appearance and motion streams are best aggregated into their own separate representations. Finally, we show that our representation outperforms the two-stream base architecture by a large margin (13% relative) and also outperforms other baselines with comparable base architectures on the HMDB51, UCF101, and Charades video classification benchmarks.
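To make the aggregation step concrete, the following is a minimal sketch, in PyTorch, of the kind of learnable spatio-temporal VLAD pooling the abstract refers to: conv features from all frames of a video are softly assigned to a set of learned "action primitive" centers, and their residuals are summed jointly over space and time. The class name, the cluster count K, and the feature dimension D are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionVLADPool(nn.Module):
    """Sketch of NetVLAD-style learnable aggregation over the conv
    features of all frames in a video (assumed hyperparameters)."""
    def __init__(self, feat_dim=512, num_clusters=64):
        super().__init__()
        # learned cluster centers ("action primitives")
        self.centers = nn.Parameter(0.01 * torch.randn(num_clusters, feat_dim))
        # 1x1 conv producing per-location soft-assignment logits
        self.assign = nn.Conv2d(feat_dim, num_clusters, kernel_size=1)

    def forward(self, x):
        # x: (T, D, H, W) conv features from T frames of one video
        T, D, H, W = x.shape
        a = F.softmax(self.assign(x), dim=1)           # (T, K, H, W) soft assignments
        x_flat = x.view(T, D, -1)                      # (T, D, H*W)
        a_flat = a.view(T, a.shape[1], -1)             # (T, K, H*W)
        # sum of assignment-weighted residuals to each center,
        # pooled jointly over space AND time
        vlad = torch.einsum('tkn,tdn->kd', a_flat, x_flat) \
             - a_flat.sum(dim=(0, 2)).unsqueeze(1) * self.centers
        vlad = F.normalize(vlad, dim=1)                # intra-normalization per cluster
        return F.normalize(vlad.flatten(), dim=0)      # (K*D,) video-level descriptor
```

In this sketch the resulting fixed-length descriptor would be fed to a classifier and trained end-to-end with the underlying CNN; the exact normalization and training details follow later in the paper.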
1. Introduction
Human action recognition is one of the fundamental problems in computer vision, with applications ranging from video navigation and movie editing to human-robot collaboration. While there has been great progress in the classification of objects in still images using convolutional neural networks (CNNs) [19, 20, 43, 47], this has not been the case for action recognition. CNN-based representations [15, 51, 58, 59, 63] have not yet significantly outperformed the best hand-engineered descriptors [12, 53]. This is partly due to missing large-scale video datasets similar in size and variety to ImageNet [39]. Current video datasets are still rather small [28, 41, 44], containing only on the order of tens of thousands of videos and a few hundred classes. In addition, those classes may be specific to certain domains, such as sports [44], and the dataset may contain noisy labels [26]. Another key open question is: what is the appropriate spatio-temporal representation for modeling videos?
∗Work done at Adobe Research during RG's summer internship.
Figure 1: How do we represent actions in a video? We propose ActionVLAD, a spatio-temporal aggregation of a set of action primitives over the appearance (RGB) and motion (flow) streams of a video. For example, a basketball shoot may be represented as an aggregation of appearance features corresponding to 'group of players', 'ball' and 'basketball hoop', and motion features corresponding to 'run', 'jump', and 'shoot'. We show examples of primitives our model learns to represent videos in Fig. 6.
Most recent video representations for action recognition are based primarily on two CNN architectures: (1) 3D spatio-temporal convolutions [49, 51], which can potentially learn complicated spatio-temporal dependencies but have so far been hard to scale in terms of recognition performance; and (2) two-stream architectures [42], which decompose the video into motion and appearance streams, train a separate CNN for each stream, and fuse the outputs at the end. While both approaches have seen rapid progress, two-stream architectures have generally outperformed spatio-temporal convolutions because they can easily exploit new ultra-deep architectures [19, 47] and models pre-trained for still-image classification.
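As a point of reference, the late fusion used by standard two-stream baselines can be sketched as follows: each sampled RGB frame and optical-flow stack is scored independently, scores are averaged over time, and the two streams are combined with a fixed weight. This is a sketch of the baseline scheme, not the method proposed here; the function names and the weight value are illustrative assumptions.

```python
import torch

def two_stream_prediction(rgb_net, flow_net, rgb_frames, flow_stacks, w_flow=1.5):
    """Late-fusion sketch for a two-stream baseline: per-frame scores are
    averaged over time within each stream, then the streams are combined
    with an assumed weight w_flow (not the paper's setting)."""
    rgb_scores = torch.stack([rgb_net(f) for f in rgb_frames]).mean(dim=0)
    flow_scores = torch.stack([flow_net(s) for s in flow_stacks]).mean(dim=0)
    return rgb_scores + w_flow * flow_scores
```

Because the scores are pooled only by averaging, any ordering or co-occurrence structure across the video is lost, which motivates the aggregation studied in this paper.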
However, two-stream architectures largely disregard the long-term temporal structure of the video and essentially learn a classifier that operates on individual frames or short blocks of a few (up to 10) frames [42], possibly enforcing consensus of classification scores over different segments of the video [58]. At test time, T (typically 25) uniformly