[Figure 2 panels: P-frames at t = 1, 4, 7, 10; rows show motion vectors, accumulated motion vectors, residuals, and accumulated residuals.]
Figure 2: Original motion vectors and residuals describe only the change between two frames. Their signal-to-noise ratio is usually very low, making them hard to model. The accumulated motion vectors and residuals consider longer-term differences and show clearer patterns. Assume the I-frame is at t = 0. Motion vectors are plotted in HSV space, where the H channel encodes the direction of motion and the S channel the amplitude. For residuals we plot the absolute values in RGB space. Best viewed in color.
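For concreteness, this rendering can be reproduced in a few lines of OpenCV. The sketch below assumes a dense (H, W, 2) motion-vector field and follows the caption's convention (hue for direction, saturation for amplitude); the function name and array layout are illustrative, not part of our pipeline.

```python
import cv2
import numpy as np

def motion_vectors_to_hsv(flow):
    """Render an (H, W, 2) motion-vector field as in Figure 2:
    hue encodes direction, saturation encodes amplitude."""
    flow = flow.astype(np.float32)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    hsv = np.zeros((*flow.shape[:2], 3), dtype=np.uint8)
    hsv[..., 0] = (ang * 180 / np.pi / 2).astype(np.uint8)  # H: direction (OpenCV hue range is 0-179)
    hsv[..., 1] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)  # S: amplitude
    hsv[..., 2] = 255  # full brightness
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)
```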
A B-frame may be viewed as a special P-frame, where motion vectors are computed bi-directionally and may reference a future frame, as long as the references contain no cycles. Both B- and P-frames capture only what changes in the video, and are easier to compress owing to their smaller dynamic range [28]. See Figure 2 for a visualization of the motion estimates and the residuals. Modeling arbitrary decoding orders is beyond the scope of this paper; we focus on videos encoded using only backward references, namely I- and P-frames.
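To make this frame dependency concrete, recall how a decoder reconstructs a P-frame: for each macroblock it copies the block that the motion vector references in the previous frame and adds the stored residual. The following is a minimal sketch of this motion compensation, assuming integer-pixel vectors and a fixed block size; real codecs are considerably more elaborate.

```python
import numpy as np

def reconstruct_p_frame(reference, motion_vectors, residual, block=16):
    """Simplified motion compensation: recon = warp(reference) + residual.

    reference:      (H, W, 3) previously decoded frame
    motion_vectors: (H // block, W // block, 2) per-macroblock (dy, dx)
                    integer offsets into the reference frame
    residual:       (H, W, 3) correction stored for this P-frame
    """
    h, w, _ = reference.shape
    recon = np.empty_like(reference)
    for by in range(0, h, block):
        for bx in range(0, w, block):
            dy, dx = motion_vectors[by // block, bx // block]
            # Copy the referenced block, clipped to the frame bounds.
            sy = int(np.clip(by + dy, 0, h - block))
            sx = int(np.clip(bx + dx, 0, w - block))
            recon[by:by + block, bx:bx + block] = reference[sy:sy + block, sx:sx + block]
    # The residual corrects whatever motion compensation missed.
    return recon + residual
```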
Features from Compressed Data. Some prior works have utilized signals from compressed video for detection or recognition, but only as non-deep features [15, 36, 38, 47]. To the best of our knowledge, this is the first work to train deep networks on compressed videos. MV-CNN applies distillation to transfer knowledge from an optical flow network to a motion vector network [50]. However, unlike our approach, it does not consider the general setting of representation learning on compressed video: it still needs the entire decompressed video as an RGB stream, and it requires optical flow as additional supervision.
Equipped with this background, we next explore how to utilize the compressed representation, devoid of redundant information, for action recognition.
Figure 3: We trace all motion vectors back to the reference
I-frame and accumulate the residual. Now each P-frame
depends only on the I-frame but not other P-frames.
3. Modeling Compressed Representations
Our goal is to design a computer vision system for action recognition that operates directly on the stored compressed video. Compression is designed solely to optimize the size of the encoding, so the resulting representation has very different statistical and structural properties than the images of a raw video. It is not clear whether successful deep learning techniques can be adapted to compressed representations in a straightforward manner. We thus ask: how can a compressed video be fed into a computer vision system, specifically a deep network?
Feeding I-frames into a deep network is straightforward, since they are simply images. How about P-frames? From Figure 2 we can see that motion vectors, though noisy, roughly resemble optical flow. As modeling optical flow with CNNs has proven effective, it is tempting to do the same for motion vectors. The third row of Figure 2 visualizes the residuals. They roughly give us a motion boundary in addition to a change of appearance, such as a change of lighting conditions. Again, CNNs are well-suited for such patterns. The outputs of the corresponding CNNs for the image, the motion vectors, and the residual have different properties. To combine them, we tried various fusion strategies, including mean pooling, maximum pooling, concatenation, convolution pooling, and bilinear pooling, on both intermediate layers and the final layer, but with limited success.
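For illustration, the final-layer variants reduce to simple tensor operations on the per-stream features. The sketch below uses random stand-ins for the three CNN outputs; all names and dimensions are hypothetical.

```python
import torch

batch, d = 4, 512
# Random stand-ins for the three per-stream CNN outputs (names hypothetical):
f_rgb = torch.randn(batch, d)  # features from the I-frame pixels
f_mv  = torch.randn(batch, d)  # features from the motion vectors
f_res = torch.randn(batch, d)  # features from the residuals

# Final-layer fusion variants mentioned above:
fused_mean   = (f_rgb + f_mv + f_res) / 3                           # mean pooling
fused_max    = torch.stack([f_rgb, f_mv, f_res]).max(dim=0).values  # maximum pooling
fused_concat = torch.cat([f_rgb, f_mv, f_res], dim=1)               # concatenation
# Bilinear pooling (outer product of two streams, flattened):
fused_bilinear = torch.einsum('bi,bj->bij', f_rgb, f_mv).flatten(start_dim=1)
```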
Digging deeper, one can argue that the motion vectors and residuals alone do not contain the full information of a P-frame: a P-frame depends on its reference frame, which may itself be a P-frame, and this chain continues all the way back to the preceding I-frame. Treating each P-frame as an independent observation clearly violates this dependency. A simple strategy to address it is to reuse the features of the reference frame and only update them given the new information. This recurrent definition calls for RNNs to aggregate features along the chain. However, preliminary experiments suggest that this elaborate modeling effort is in vain (see supplementary material for details). The difficulty arises from the long chain of dependencies among P-frames. To mitigate this issue, we devise a novel yet simple back-tracing technique that decouples individual P-frames.
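A minimal sketch of the idea, assuming integer, pixel-level motion vectors (the block-level, sub-pixel vectors stored by real codecs would first need upsampling and rounding): each pixel of a P-frame is traced through its chain of references back to the I-frame, composing motion vectors and summing residuals along the way. All names are illustrative.

```python
import numpy as np

def back_trace(motion_vectors, residuals):
    """Trace every pixel of each P-frame back to the I-frame at t = 0.

    motion_vectors: list over P-frames of (H, W, 2) integer arrays; frame t's
                    vector at (y, x) means the pixel references
                    (y, x) - mv in frame t - 1.
    residuals:      list over P-frames of (H, W, C) arrays.
    Returns per-frame accumulated motion vectors and residuals, each defined
    w.r.t. the I-frame, so no P-frame depends on another P-frame.
    """
    H, W, _ = motion_vectors[0].shape
    grid = np.stack(np.mgrid[0:H, 0:W], axis=-1)   # identity pixel coordinates
    loc = grid.copy()                              # traced location in the I-frame
    res = np.zeros(residuals[0].shape, dtype=np.float32)
    acc_mv, acc_res = [], []
    for mv_t, res_t in zip(motion_vectors, residuals):
        ref = grid - mv_t                          # where each pixel points in t - 1
        ref[..., 0] = ref[..., 0].clip(0, H - 1)
        ref[..., 1] = ref[..., 1].clip(0, W - 1)
        loc = loc[ref[..., 0], ref[..., 1]]        # compose with the trace of t - 1
        res = res[ref[..., 0], ref[..., 1]] + res_t  # carry residuals along the trace
        acc_mv.append(grid - loc)                  # total displacement to the I-frame
        acc_res.append(res)
    return acc_mv, acc_res
```

After this transformation, every P-frame is expressed relative to the I-frame alone, so its accumulated motion vectors and residuals can be processed independently.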