[Figure 2 panels: P-frames at t = 1, 4, 7, 10; rows show motion vectors, accumulated motion vectors, residuals, and accumulated residuals.]
Figure 2: Original motion vectors and residuals describe only the change between two frames. Their signal-to-noise ratio is usually very low, making them hard to model. The accumulated motion vectors and residuals consider longer-term differences and show clearer patterns. Assume the I-frame is at t = 0. Motion vectors are plotted in HSV space, where the H channel encodes the direction of motion and the S channel the amplitude. For residuals we plot the absolute values in RGB space. Best viewed in color.
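For concreteness, this rendering can be reproduced in a few lines of OpenCV. The sketch below assumes a dense (H, W, 2) motion-vector field and follows the caption's convention (hue for direction, saturation for amplitude); the function name and array layout are illustrative, not part of our pipeline.

```python
import cv2
import numpy as np

def motion_vectors_to_hsv(flow):
    """Render an (H, W, 2) motion-vector field as in Figure 2:
    hue encodes direction, saturation encodes amplitude."""
    flow = flow.astype(np.float32)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    hsv = np.zeros((*flow.shape[:2], 3), dtype=np.uint8)
    hsv[..., 0] = (ang * 180 / np.pi / 2).astype(np.uint8)  # H: direction (OpenCV hue range is 0-179)
    hsv[..., 1] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)  # S: amplitude
    hsv[..., 2] = 255  # full brightness
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)
```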
A B-frame may be viewed as a special P-frame, where motion vectors are computed bi-directionally and may reference a future frame, as long as the references contain no cycles. Both B- and P-frames capture only what changes in the video, and are easier to compress owing to their smaller dynamic range [28]. See Figure 2 for a visualization of the motion estimates and the residuals. Modeling arbitrary decoding orders is beyond the scope of this paper; we focus on videos encoded using only backward references, namely I- and P-frames.
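To make this frame dependency concrete, recall how a decoder reconstructs a P-frame: for each macroblock it copies the block that the motion vector references in the previous frame and adds the stored residual. The following is a minimal sketch of this motion compensation, assuming integer-pixel vectors and a fixed block size; real codecs are considerably more elaborate.

```python
import numpy as np

def reconstruct_p_frame(reference, motion_vectors, residual, block=16):
    """Simplified motion compensation: recon = warp(reference) + residual.

    reference:      (H, W, 3) previously decoded frame
    motion_vectors: (H // block, W // block, 2) per-macroblock (dy, dx)
                    integer offsets into the reference frame
    residual:       (H, W, 3) correction stored for this P-frame
    """
    h, w, _ = reference.shape
    recon = np.empty_like(reference)
    for by in range(0, h, block):
        for bx in range(0, w, block):
            dy, dx = motion_vectors[by // block, bx // block]
            # Copy the referenced block, clipped to the frame bounds.
            sy = int(np.clip(by + dy, 0, h - block))
            sx = int(np.clip(bx + dx, 0, w - block))
            recon[by:by + block, bx:bx + block] = reference[sy:sy + block, sx:sx + block]
    # The residual corrects whatever motion compensation missed.
    return recon + residual
```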
Features from Compressed Data. Some prior works have utilized signals from compressed video for detection or recognition, but only as non-deep features [15, 36, 38, 47]. To the best of our knowledge, this is the first work to train deep networks on compressed videos. MV-CNN applies distillation to transfer knowledge from an optical flow network to a motion vector network [50]. However, unlike our approach, it does not consider the general setting of representation learning on compressed video: it still needs the entire decompressed video as an RGB stream, and it requires optical flow as additional supervision.
Equipped with this background, we next explore how to utilize the compressed representation, devoid of redundant information, for action recognition.
Figure 3: We trace all motion vectors back to the reference
I-frame and accumulate the residual. Now each P-frame
depends only on the I-frame but not other P-frames.
3. Modeling Compressed Representations
Our goal is to design a computer vision system for action recognition that operates directly on the stored compressed video. Compression is designed solely to optimize the size of the encoding, so the resulting representation has very different statistical and structural properties than the images of a raw video. It is not clear whether successful deep learning techniques can be adapted to compressed representations in a straightforward manner. We thus ask: how can a compressed video be fed into a computer vision system, specifically a deep network?
Feeding I-frames into a deep network is straightforward, since they are simply images. How about P-frames? From Figure 2 we can see that motion vectors, though noisy, roughly resemble optical flow. As modeling optical flow with CNNs has proven effective, it is tempting to do the same for motion vectors. The third row of Figure 2 visualizes the residuals. They roughly give us a motion boundary in addition to a change of appearance, such as a change of lighting conditions. Again, CNNs are well-suited for such patterns. The outputs of the corresponding CNNs for the image, the motion vectors, and the residual have different properties. To combine them, we tried various fusion strategies, including mean pooling, maximum pooling, concatenation, convolution pooling, and bilinear pooling, on both intermediate layers and the final layer, but with limited success.
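For illustration, the final-layer variants reduce to simple tensor operations on the per-stream features. The sketch below uses random stand-ins for the three CNN outputs; all names and dimensions are hypothetical.

```python
import torch

batch, d = 4, 512
# Random stand-ins for the three per-stream CNN outputs (names hypothetical):
f_rgb = torch.randn(batch, d)  # features from the I-frame pixels
f_mv  = torch.randn(batch, d)  # features from the motion vectors
f_res = torch.randn(batch, d)  # features from the residuals

# Final-layer fusion variants mentioned above:
fused_mean   = (f_rgb + f_mv + f_res) / 3                           # mean pooling
fused_max    = torch.stack([f_rgb, f_mv, f_res]).max(dim=0).values  # maximum pooling
fused_concat = torch.cat([f_rgb, f_mv, f_res], dim=1)               # concatenation
# Bilinear pooling (outer product of two streams, flattened):
fused_bilinear = torch.einsum('bi,bj->bij', f_rgb, f_mv).flatten(start_dim=1)
```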
Digging deeper, one can argue that the motion vectors and residuals alone do not contain the full information of a P-frame: a P-frame depends on its reference frame, which may itself be a P-frame, and this chain continues all the way back to the preceding I-frame. Treating each P-frame as an independent observation clearly violates this dependency. A simple strategy to address it is to reuse the features of the reference frame and only update them given the new information. This recurrent definition calls for RNNs to aggregate features along the chain. However, preliminary experiments suggest that this elaborate modeling effort is in vain (see supplementary material for details). The difficulty arises from the long chain of dependencies among P-frames. To mitigate this issue, we devise a novel yet simple back-tracing technique that decouples individual P-frames.
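A minimal sketch of the idea, assuming integer, pixel-level motion vectors (the block-level, sub-pixel vectors stored by real codecs would first need upsampling and rounding): each pixel of a P-frame is traced through its chain of references back to the I-frame, composing motion vectors and summing residuals along the way. All names are illustrative.

```python
import numpy as np

def back_trace(motion_vectors, residuals):
    """Trace every pixel of each P-frame back to the I-frame at t = 0.

    motion_vectors: list over P-frames of (H, W, 2) integer arrays; frame t's
                    vector at (y, x) means the pixel references
                    (y, x) - mv in frame t - 1.
    residuals:      list over P-frames of (H, W, C) arrays.
    Returns per-frame accumulated motion vectors and residuals, each defined
    w.r.t. the I-frame, so no P-frame depends on another P-frame.
    """
    H, W, _ = motion_vectors[0].shape
    grid = np.stack(np.mgrid[0:H, 0:W], axis=-1)   # identity pixel coordinates
    loc = grid.copy()                              # traced location in the I-frame
    res = np.zeros(residuals[0].shape, dtype=np.float32)
    acc_mv, acc_res = [], []
    for mv_t, res_t in zip(motion_vectors, residuals):
        ref = grid - mv_t                          # where each pixel points in t - 1
        ref[..., 0] = ref[..., 0].clip(0, H - 1)
        ref[..., 1] = ref[..., 1].clip(0, W - 1)
        loc = loc[ref[..., 0], ref[..., 1]]        # compose with the trace of t - 1
        res = res[ref[..., 0], ref[..., 1]] + res_t  # carry residuals along the trace
        acc_mv.append(grid - loc)                  # total displacement to the I-frame
        acc_res.append(res)
    return acc_mv, acc_res
```

After this transformation, every P-frame is expressed relative to the I-frame alone, so its accumulated motion vectors and residuals can be processed independently.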