Different from the above approaches, TVA learns a local feature representation by analyzing the temporal variances of small video cubes.
C. Primary Visual Cortex (V1) and Bio-Inspired Model
Gabor filters have been applied to model simple cells of V1
in many bio-inspired models. The most popular approach
is the HMAX model [27]. Most of these approaches use
Gabor filters and hierarchical feedforward architectures to
extract appearance information to mimic the function of the
ventral pathway in the visual cortex. A neurophysiologically
plausible model based on Gabor filters was proposed in [3]
to model functions of the dorsal pathway. This model has
been successfully applied to action recognition [28]. A spatio-
temporal Laplacian pyramid coding approach was introduced
as a holistic representation by applying a bank of 3D Gabor
filters and max pooling to each level of the Laplacian pyra-
mid [29]. Escobar et al. [30] proposed a bio-inspired feedfor-
ward spiking network to model V1 and MT areas for motion
representation in action recognition. However, this motion-
based approach failed to outperform the above-mentioned
approach based on Gabor filters. Liu et al. [31] used a genetic-programming-based approach to automatically evolve spatio-temporal feature descriptors, such as 3D Gabor filters and wavelet filters, for action recognition.
Slow feature analysis (SFA) [1] extracts slowly-changing
features from rapidly-changing signals. Research shows that
receptive fields learned by SFA have properties similar to those of V1 complex cells [4]. In action recognition, SFA was first applied as a local feature, representing an action by its aggregated changes in speed [32]; this representation was competitive with state-of-the-art methods on simple datasets but generalized poorly to complex ones. Inspired by deep
learning and deep representation, Sun et al. [33] proposed a
two-layer SFA approach to extract features from videos for
action recognition, which was able to handle complex action
recognition tasks. Minh and Wiskott [34] introduced multi-
variate SFA for blind source separation. Theriault et al. [35]
improved scene recognition accuracy using SFA. A probabilistic SFA [36] was proposed to detect changes in facial expression in video sequences. Most of these approaches treat SFA as a conventional dimensionality reduction method; SFA is rarely exploited as a bio-inspired model.
D. Contribution
Based on the studies discussed above, the main contributions of this paper can be summarized as follows.
1) TVA is proposed as a generalization of SFA that uses both slow and fast features. We introduce the use of fast features for motion representation. By mimicking the function of V1 cells, appearance and motion information can be obtained from slow and fast features, respectively.
2) Additional motion features are introduced by extracting
features from optical flows. In this way, slow features
encode velocity information, and fast features encode
acceleration information.
3) By using parts of the fast filters as slow filters and vice versa, the hybrid slim filter is proposed to improve both slow and fast feature extraction.

Fig. 2. A flow chart of TVA for action recognition.
III. TVA FOR ACTION RECOGNITION
In this section, we give details on the proposed method.
A brief framework of TVA for action recognition is shown
in Fig. 2. We first train convolution filters by TVA using
cubes aligned with tracked trajectories. Then convolution and
pooling are performed for local feature extraction. Finally, Fisher vectors are used to obtain the final video representation.
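As a rough illustration of the pipeline in Fig. 2, the toy sketch below runs the three stages on random data; the cube dimensions, the filter count, and the use of mean pooling in place of Fisher vector encoding are simplifying assumptions, not the actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Random stand-ins for trajectory-aligned cubes: 500 cubes,
# each 10 frames of 64-dimensional patches.
cubes = rng.standard_normal((500, 10, 64))

# Stage 1: the filters would be learned by TVA from such cubes
# (see the sketch in Sec. III-A); random filters stand in here.
filters = rng.standard_normal((64, 16))

# Stage 2: convolution (here reduced to a projection) followed by
# temporal max pooling gives one local feature per cube.
local_features = np.abs(cubes @ filters).max(axis=1)  # shape (500, 16)

# Stage 3: aggregate local features into a video-level descriptor;
# mean pooling stands in for the Fisher vector encoding.
video_descriptor = local_features.mean(axis=0)        # shape (16,)
```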
A. Temporal Variance Analysis
Considerable efforts have been made to model temporal
information for feature extraction. SFA [1] extracts slowly-
varying information from quickly-varying input signals by
applying the temporal slowness principle. For example, in
action recognition, it is evident that while pixels in a video
may change markedly, the perception of action might not
change at all. The temporal slowness principle argues that
this unchanging concept can be extracted by capturing slowly-
varying features.
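Concretely, the slowness principle leads to the standard SFA optimization of [1], stated here in minimal form with $\langle \cdot \rangle_t$ denoting temporal averaging and the dot a temporal derivative: find output functions $g_j$ minimizing

$$\min_{g_j}\ \Delta(y_j) := \big\langle \dot{y}_j^{\,2} \big\rangle_t, \qquad y_j = g_j(x(t)),$$

subject to $\langle y_j \rangle_t = 0$ (zero mean), $\langle y_j^2 \rangle_t = 1$ (unit variance), and $\langle y_i y_j \rangle_t = 0$ for $i < j$ (decorrelation).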
However, from the perspective of local features, it is difficult to find a compact high-level semantic representation whose features vary as slowly as we would like. We therefore suggest
that local features need to be represented by both slow- and
fast-varying information. For example, considering a moving
object in a small video cube, the fast-varying information
encodes the dynamic motion pattern, and the slow-varying
information encodes the near static appearance of the object.
Using both fast- and slow-varying information therefore yields a more complete representation.
To this end, we propose the temporal variance
analysis (TVA) for local feature extraction. Considering
a multi-dimensional temporal sequence which consists of
components with different temporal variances, TVA extracts
these components by a linear projection and uses them as the feature representation. Fast features, which are components
with large temporal variances, encode motion information,
while slow features, which are components with small
temporal variances, encode appearance information.
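To make this concrete, the sketch below estimates such a projection with numpy, under the assumption that the temporal variance of a component is measured by the covariance of finite temporal differences normalized by the signal covariance; the precise formulation follows at the end of this subsection, so this is only an illustrative reading.

```python
import numpy as np
from scipy.linalg import eigh

def tva_projections(X, n_slow, n_fast):
    """X: (T, d) temporal sequence; returns slow and fast projection matrices."""
    X = X - X.mean(axis=0)                        # center the sequence
    C = np.cov(X, rowvar=False)                   # signal covariance (assumed full rank)
    D = np.cov(np.diff(X, axis=0), rowvar=False)  # covariance of temporal differences
    # Generalized eigenproblem D w = lambda C w; ascending eigenvalues
    # rank components from smallest to largest temporal variance.
    _, vecs = eigh(D, C)
    w_slow = vecs[:, :n_slow]    # smallest temporal variance -> appearance
    w_fast = vecs[:, -n_fast:]   # largest temporal variance -> motion
    return w_slow, w_fast
```

Applied to a trajectory-aligned cube flattened into a (frames x pixels) matrix, w_slow would play the role of the slow filters and w_fast of the fast filters in the convolution stage.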
In this paper, we denote matrices by upper-case letters and vectors by lower-case letters. The matrix transpose is denoted by a superscript T; for example, U^T means the transpose of matrix U. Mathematically, the proposed TVA is detailed as follows.