Stratified Pooling Deep Convolutional Neural Network: A New Perspective on Human Action Recognition


"基于分层池的深度卷积神经网络用于人类动作识别" 本文探讨了在计算机视觉领域中一个关键问题——基于视频的人体动作识别。近年来，深度卷积神经网络（CNN）在处理这一问题上取得了显著的进步，尤其是在HMDB-51和UCF-101等标准数据集上，其表现已经达到了最先进的水平。然而，一个关键的挑战是如何有效地整合视频中的帧级特征，以构建出能捕捉到复杂动作模式的视频级特征。为了应对这个挑战，作者提出了名为分层池化（Stratified Pooling，SP）的深度卷积神经网络（SP-CNN）新方法。这个方法主要分为五个步骤： 1. **预训练CNN微调**：首先，使用已经在大型图像数据集（如ImageNet）上预训练的CNN模型，并针对特定的目标动作识别任务进行微调，以适应新的数据集特性。 2. **帧级特征提取**：对视频中的每一帧应用CNN，提取丰富的特征表示，这些特征通常包含了帧中的物体、形状和纹理等信息。 3. **主成分分析（PCA）**：为了降低特征维度，提高计算效率和防止过拟合，使用PCA方法对提取的帧级特征进行降维处理，保留最重要的特征成分。 4. **分层池化**：这是SP-CNN的核心创新，它不是简单地对帧级特征求平均或最大值，而是采用分层次的策略来合并这些特征。通过这种方式，能够更好地捕捉动作的时间序列信息，同时保持对关键动作特征的敏感度。 5. **支持向量机（SVM）分类**：最后，利用支持向量机作为多类分类器，将得到的视频级特征映射到不同的动作类别，完成动作识别。实验结果证明，SP-CNN在HMDB-51和UCF-101数据集上的性能优于现有的最新技术，显示了分层池化策略的有效性和优越性。这种方法不仅提高了动作识别的准确性，而且展示了深度学习模型在处理视频数据时的潜力，特别是在理解和捕获时间序列信息方面。总结来说，这篇研究提出了一种创新的深度学习框架，即基于分层池化的深度卷积神经网络，它通过优化帧级特征的整合，提升了视频动作识别的性能。这种方法对未来的计算机视觉研究，尤其是视频理解和智能监控等领域，具有重要的参考价值。

13370 Multimed Tools Appl (2017) 76:13367–13382

a compact, global feature for face-image representation. In [19], Jian et al. proposed a novel singular value decomposition method for simultaneous hallucination and recognition of low-resolution faces, in which the singular values are first proved to be effective for representing face images.

As for action recognition, Wang et al. [44] proposed dense trajectory features (DTF), which compute the local descriptors HOF, HOG, MBH and trajectory in 3D volumes aligned with the trajectories and thereby obtain much richer low-level descriptors for representing the video. The HOG descriptor depicts static appearance, the HOF descriptor captures local motion information, and the MBH descriptor captures relative dynamic motion information [48]. As a result, DTF obtained a significant improvement on challenging datasets such as UCF-101 and HMDB-51. Furthermore, to deal with camera motion, Wang et al. [45] proposed improved dense trajectories (IDT), which explicitly estimate camera motion to form a feature descriptor that has shown good results on many action recognition datasets. However, local space-time features are sensitive to noise, which makes the recognition performance unstable. In [13], the recognition performance improves significantly by decomposing visual motion into residual and dominant motions. In [34], Peng proposed a new dense sampling strategy that greatly reduces the number of valid trajectories while preserving discriminative power. Chen et al. [4] proposed a cluster-tree model of improved trajectories that not only obtains good recognition performance but also reduces noisy clusters and alleviates intra-class variation. However, these local features are not robust to clutter motions such as accumulated camera motion and background changes.

Meanwhile, a number of feature coding algorithms have been proposed to transform local descriptors into a video-level feature descriptor, such as sparse coding [5, 30], vector of locally aggregated descriptors (VLAD) [14], naive bag of words (BoW) [8], Fisher vector (FV) [37] and the improved Fisher kernel [38], all of which have achieved good performance. Action recognition based on hand-crafted features achieves good performance, but these features are not optimized for visual representation and lack discriminative capacity when the videos contain background clutter or large intra-class variations.
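As an illustration of feature coding, the following minimal sketch implements the simplest scheme named above, naive bag of words with hard assignment. It assumes a codebook is already available (in practice it is usually learned with k-means. Here it is a hand-written toy). The names `bow_encode` and `squared_distance` are hypothetical and not taken from the cited works.

```python
from typing import List

Vector = List[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def bow_encode(descriptors: List[Vector], codebook: List[Vector]) -> List[float]:
    """Hard-assign each local descriptor to its nearest codeword and
    return the L1-normalized histogram of assignments."""
    if not descriptors or not codebook:
        raise ValueError("descriptors and codebook must be non-empty")
    counts = [0] * len(codebook)
    for d in descriptors:
        nearest = min(
            range(len(codebook)),
            key=lambda i: squared_distance(d, codebook[i]),
        )
        counts[nearest] += 1
    total = len(descriptors)
    return [c / total for c in counts]


# Toy codebook with two codewords (an assumption for illustration only).
codebook = [[0.0, 0.0], [10.0, 10.0]]
# Four local descriptors from one video: three near the first codeword.
descriptors = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0], [0.5, 0.5]]
histogram = bow_encode(descriptors, codebook)
print(histogram)  # [0.75, 0.25]
```

The histogram has one bin per codeword, so videos with different numbers of local descriptors map to vectors of the same length. VLAD and Fisher vectors follow the same assign-and-aggregate pattern but accumulate richer statistics than counts (residuals to the codewords, or gradients with respect to a mixture model).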

In recent years, deep learning models such as deep Boltzmann machines (DBMs) [1, 28], deep belief networks (DBNs) [26, 27], stacked auto-encoders [9], recurrent neural networks (RNNs) [6, 51] and CNNs have been used in computer vision applications. For human action recognition, RNNs and CNNs have led to impressive performance. RNNs with Long Short-Term Memory (LSTM) units have been shown to perform well in image description [6, 49], image classification [33], video description [6, 50] and action recognition [6, 51]. Volodymyr et al. [33] proposed an RNN-based visual attention mechanism that selects a sequence of locations for the image classification task. Xu et al. [49] used attention mechanisms to generate image descriptions. Georgia et al. adapted R*CNN to use a primary region and a secondary region for image-based action recognition [11, 33]. Inspired by these works on visual attention in images, Shikhar et al. [40] used RNNs with LSTM units to extend visual attention to video-based action recognition. The recognition performance depends on whether the model attends to the action-relevant region, but discriminative localization is a challenging problem for video. As a result, the recognition accuracy on HMDB-51 is only 41.3 %, about 30 % lower than that of the state-of-the-art models.

More recently, some impressive results have been achieved using CNNs for human action recognition in videos. Ji et al. [16] trained 3D convolutional operations to extract spatio-temporal features from raw sequence data for action recognition, which may over-fit when there are not enough training samples. The performance is worse than that of the hand-crafted representation [45]. Karpathy et al. [20] trained CNN structures on the Sports-1M dataset. The network consists of thousands of 3D convolutional filters

