Multirate Multimodal Video Captioning
Ziwei Yang
School of Computer Science and
Technology, Tianjin University
Tianjin, China
yzwtend@tju.edu.cn
Youjiang Xu
School of Computer Science and
Technology, Tianjin University
Tianjin, China
yjxu@tju.edu.cn
Huiyun Wang
School of Computer Science and
Technology, Tianjin University
Tianjin, China
wanghuiyun@tju.edu.cn
Bo Wang
School of Computer Science and
Technology, Tianjin University
Tianjin, China
bwong@tju.edu.cn
Yahong Han
School of Computer Science and
Technology, Tianjin University
Tianjin, China
yahong@tju.edu.cn
ABSTRACT
Automatically describing videos with natural language is a crucial
challenge of video understanding. Compared to images, videos have a
specific spatial-temporal structure and carry information in multiple
modalities. In this paper, we propose a Multirate Multimodal Approach for
video captioning. Considering that the speed of motion in videos
varies constantly, we utilize a Multirate GRU to capture the temporal
structure of videos. It encodes video frames at different intervals
and is therefore robust to variations in motion speed.
As videos contain cues from different modalities, we design a dedicated
multimodal fusion method. By incorporating visual, motion, and
topic information together, we construct a well-designed video
representation. The video representation is then fed into an RNN-based
language model to generate natural language descriptions. We
evaluate our approach for video captioning on "Microsoft Research -
Video to Text" (MSR-VTT), a large-scale video benchmark for video
understanding. Our approach achieves strong performance in the
2nd MSR Video to Language Challenge.
KEYWORDS
Video Captioning; GRUs; Multimodal; CNN
1 INTRODUCTION
Understanding video content and generating natural language descriptions,
which is known as video captioning [6, 12, 26], is an interesting
attempt to bridge vision with semantics. It requires not only a
comprehensive understanding of video content but also a suitable
language model for generating sentences. It remains a grand challenge for
machines to understand visual content as humans do, and video
captioning may benefit plenty of practical applications, such as
video retrieval, surveillance video analysis, and assistance for the visually impaired.
Previous research on visual captioning [3, 25, 29] mainly focuses
on image content understanding, generating natural language
descriptions for static images. Compared to images, a video
contains not only static visual content (frames) but also temporal
structure and multiple modalities. Consecutive video frames
deliver rich information about scene switching, motions, and
object interactions, which is encoded in the temporal structure.
Meanwhile, a video is also a collection of multiple modalities, such
as static vision, motion, and audio.
Many approaches have recently been proposed to deal with sequential
video inputs and incorporate temporal dependencies. The early work
[23] simply formed the video representation by mean pooling over
frames, which hardly considers the temporal relationships between
video frames. S2VT [22] first used a stacked LSTM framework to
encode consecutive inputs. Later work [14] incorporated video
temporal information with a well-designed hierarchical recurrent
network, which alleviates the drawback of LSTMs in capturing
long-range dependencies. However, it only deals with video clips
sampled at the same interval, whereas motion speed varies both within
a single video and across different videos. In a video clip of short duration,
if people are walking, there is almost no apparent motion; on the
contrary, if people are running, the motion is evident within the same
short time. Therefore, Zhu et al. [30] proposed a Multirate Visual
Recurrent Model (MVRM) to deal with motion speed variance. The
multirate model adaptively encodes frames with different encoding
rates according to their motion speed. In this paper, we use the
Multirate GRU [30] to capture the temporal structure of videos in
our approach.
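
To make the multirate idea concrete, the following is a minimal sketch that
encodes a sequence of frame features at several sampling intervals with
parallel GRUs and combines the resulting states. The PyTorch framing, the
class and parameter names, the fixed rate set, and the concatenation of the
per-rate states are illustrative assumptions, not the exact architecture of
[30] or of our model.

    # Minimal sketch of multirate encoding of frame features (illustrative only).
    import torch
    import torch.nn as nn

    class MultirateGRUEncoder(nn.Module):
        """Encode a frame-feature sequence at several sampling intervals."""

        def __init__(self, feat_dim=2048, hidden_dim=512, rates=(1, 2, 4)):
            super().__init__()
            self.rates = rates
            # One GRU per sampling rate; larger rates see fewer, coarser steps.
            self.grus = nn.ModuleList(
                [nn.GRU(feat_dim, hidden_dim, batch_first=True) for _ in rates]
            )

        def forward(self, frame_feats):
            # frame_feats: (batch, num_frames, feat_dim), e.g. CNN frame features.
            encodings = []
            for rate, gru in zip(self.rates, self.grus):
                subsampled = frame_feats[:, ::rate, :]   # keep every `rate`-th frame
                _, h_n = gru(subsampled)                 # h_n: (1, batch, hidden_dim)
                encodings.append(h_n.squeeze(0))
            # Concatenate the per-rate encodings into one video representation.
            return torch.cat(encodings, dim=-1)          # (batch, hidden_dim * len(rates))

    # Usage: 30 frames of 2048-d features for a batch of 2 videos.
    encoder = MultirateGRUEncoder()
    video = torch.randn(2, 30, 2048)
    print(encoder(video).shape)  # torch.Size([2, 1536])

Sampling the same feature sequence at several rates lets the slow branches
summarize fast motion compactly while the dense branch preserves detail for
slow motion, which is the intuition behind handling motion speed variance.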
Besides visual content, a video always contains multimodal information,
such as audio and topic (category). There has been relatively little
research attempting to incorporate multimodal information for video
captioning. Jin et al. [9] proposed a method to fuse image, video, aural,
speech, and category modality features, which yielded a large performance
improvement on the MSR-VTT benchmark [24]. Features of different
modalities express video content from different aspects and always
complement each other.
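
As a simple illustration of fusing complementary modality features into a
single video representation, the sketch below concatenates visual, motion,
and topic vectors and projects them to a common dimensionality. The feature
dimensions, the late-concatenation scheme, and the linear projection are
illustrative assumptions; the fusion method actually used in our approach is
described in the following sections.

    # Minimal sketch of concatenation-based multimodal fusion (illustrative only).
    import torch
    import torch.nn as nn

    class MultimodalFusion(nn.Module):
        def __init__(self, visual_dim=1536, motion_dim=1024, topic_dim=20, out_dim=512):
            super().__init__()
            self.proj = nn.Linear(visual_dim + motion_dim + topic_dim, out_dim)

        def forward(self, visual, motion, topic):
            # Each input: (batch, dim). Concatenate modalities, then project to the
            # dimensionality expected by the RNN-based language model.
            fused = torch.cat([visual, motion, topic], dim=-1)
            return torch.tanh(self.proj(fused))

    fusion = MultimodalFusion()
    rep = fusion(torch.randn(2, 1536), torch.randn(2, 1024), torch.randn(2, 20))
    print(rep.shape)  # torch.Size([2, 512])
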
In this paper, we propose a Multirate Multimodal Approach for
video captioning. As illustrated in Figure 1, we employ a Multirate
GRU for incorporating video temporal information and a multimodal
fusion method for creating the video representation. We firstly