Object Detection in Videos by High Quality Object Linking
Peng Tang†∗, Chunyu Wang‡, Xinggang Wang†, Wenyu Liu†, Wenjun Zeng‡, Jingdong Wang‡
†School of EIC, Huazhong University of Science and Technology
‡Microsoft Research Asia
{pengtang,xgwang,liuwy}@hust.edu.cn {chnuwa,wezeng,jingdw}@microsoft.com
∗This work was done during an internship at Microsoft Research Asia.
Abstract
Compared with object detection in static images, object detection in videos is more challenging due to degraded image qualities. An effective way to address this problem is to exploit temporal contexts by linking the same object across video to form tubelets and aggregating classification scores in the tubelets. In this paper, we focus on obtaining high quality object linking results for better classification. Unlike previous methods that link objects by checking boxes between neighboring frames, we propose to link objects in the same frame. To achieve this goal, we extend prior methods in the following aspects: (1) a cuboid proposal network that extracts spatio-temporal candidate cuboids which bound the movement of objects; (2) a short tubelet detection network that detects short tubelets in short video segments; (3) a short tubelet linking algorithm that links temporally-overlapping short tubelets to form long tubelets. Experiments on the ImageNet VID dataset show that our method outperforms both the static image detector and the previous state of the art. In particular, our method improves results by 8.8% over the static image detector for fast moving objects.
1. Introduction
Detecting objects in static images [5, 6, 22, 24, 25, 35, 31] has achieved significant progress due to the emergence of deep convolutional neural networks (CNNs) [11, 18, 19, 29]. However, object detection in videos brings additional challenges due to degraded image qualities, e.g. motion blur and video defocus, leading to unstable classifications for the same object across video. Therefore, many research efforts have been devoted to video object detection by exploiting temporal contexts [8, 3, 17, 16, 15, 37, 36], especially after the introduction of the ImageNet video object detection (VID) challenge.
Many previous methods exploit temporal contexts by linking the same object across video to form tubelets and aggregating classification scores in the tubelets [8, 17, 16, 3]. They first use static image detectors to detect objects in each frame, and then link the detected objects across neighboring frames, either according to the spatial overlap between object boxes in different frames [8] or by predicting object movements between neighboring frames [17, 16, 15, 3]. These methods obtain very promising results.
However, the same object changes its location and appearance across neighboring frames due to object motion, which may render the spatial overlap between boxes of the same object in neighboring frames insufficient or the predicted object movements inaccurate. This degrades the quality of object linking, especially for fast moving objects. By contrast, within the same frame, two boxes clearly correspond to the same object if they overlap sufficiently. Inspired by these facts, we propose to link objects in the same frame instead of across neighboring frames for high quality object linking.
In our method, a long video is first divided into temporally-overlapping short video segments. For each short video segment, we extract a set of cuboid proposals, i.e. spatio-temporal candidate cuboids which bound the movement of objects, by extending the region proposal network for static images [25] to a cuboid proposal network for short video segments. The objects across frames lying in the same cuboid are regarded as the same object.
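To make the segment and cuboid construction concrete, the following is a minimal sketch of dividing a video into temporally-overlapping short segments, together with the smallest enclosing cuboid that bounds an object's per-frame boxes within a segment. The segment length, the one-frame overlap, and the helper names are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def split_into_segments(num_frames, seg_len=8, overlap=1):
    """Split a video into short segments of seg_len frames, where
    consecutive segments share `overlap` frames (assumed values)."""
    stride = seg_len - overlap
    segments = []
    start = 0
    while start + seg_len <= num_frames:
        segments.append((start, start + seg_len))  # (start, end), end exclusive
        start += stride
    if not segments or segments[-1][1] < num_frames:
        # Cover any remaining tail frames with one final segment.
        segments.append((max(0, num_frames - seg_len), num_frames))
    return segments

def enclosing_cuboid(frame_boxes):
    """A cuboid that bounds an object's movement: the smallest 2D box
    enclosing the object's per-frame (x1, y1, x2, y2) boxes, shared by
    all frames of the segment. frame_boxes: (K, 4) array."""
    b = np.asarray(frame_boxes, dtype=float)
    return np.array([b[:, 0].min(), b[:, 1].min(), b[:, 2].max(), b[:, 3].max()])

print(split_into_segments(20))  # [(0, 8), (7, 15), (12, 20)]
```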
For each cuboid proposal, we adapt Fast R-CNN [5] to detect short tubelets. More precisely, we compute the precise box locations and classification scores for each frame separately, forming a short tubelet that represents the linked object boxes in the short video segment. We compute the classification score of the tubelet by aggregating the classification scores of the boxes across frames. In addition, to remove spatially redundant short tubelets, we extend the standard non-maximum suppression (NMS) with a tubelet overlap measurement, which prevents the tubelet breaking that may occur with frame-wise NMS. Considering short range temporal contexts via short tubelets benefits detection, see Fig. 1 (b).
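The sketch below illustrates how tubelet scoring and tubelet-level NMS can work. The mean-score aggregation, the mean per-frame IoU as the tubelet overlap measurement, and the 0.3 threshold are assumptions standing in for the paper's exact choices; the point is that a tubelet is kept or suppressed as a whole, so it cannot be broken frame by frame.

```python
import numpy as np

def box_iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def tubelet_overlap(t1, t2):
    """Tubelet overlap as the mean per-frame IoU (an assumed measure).
    t1, t2: (K, 4) arrays, one box per frame of the segment."""
    return float(np.mean([box_iou(a, b) for a, b in zip(t1, t2)]))

def tubelet_nms(tubelets, frame_scores, thresh=0.3):
    """tubelets: list of (K, 4) arrays; frame_scores: list of (K,) arrays.
    Aggregates per-frame scores by their mean (an assumption) and runs
    the standard greedy NMS loop at the tubelet level."""
    scores = np.array([s.mean() for s in frame_scores])
    order = np.argsort(-scores)
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        ious = np.array([tubelet_overlap(tubelets[i], tubelets[j]) for j in rest])
        order = rest[ious <= thresh]
    return keep
```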
Finally, we link the short tubelets with sufficient overlap across temporally-overlapping short video segments. If two short tubelets from adjacent segments overlap sufficiently in their shared frames, they are regarded as the same object and linked into a longer tubelet.
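A minimal sketch of this same-frame linking across two adjacent segments follows, reusing box_iou from the previous sketch. The greedy highest-overlap matching and the 0.5 threshold are assumptions, not taken from the paper; the key property is that overlap is measured between boxes in the shared frames, not between neighboring frames.

```python
import numpy as np

def link_across_segments(tubelets_a, tubelets_b, shared_a, shared_b, thresh=0.5):
    """tubelets_a/b: lists of (K, 4) box arrays from two adjacent segments.
    shared_a/shared_b: indices of the frames the two segments have in
    common (e.g. the last frame of segment A and the first of segment B).
    Returns (i, j) pairs of tubelets linked as the same object."""
    candidates = []
    for i, ta in enumerate(tubelets_a):
        for j, tb in enumerate(tubelets_b):
            # Overlap computed in the same frame(s) shared by both segments.
            iou = np.mean([box_iou(ta[fa], tb[fb])
                           for fa, fb in zip(shared_a, shared_b)])
            if iou >= thresh:
                candidates.append((iou, i, j))
    # Greedily match highest-overlap pairs first, each tubelet used once.
    pairs, used_a, used_b = [], set(), set()
    for iou, i, j in sorted(candidates, reverse=True):
        if i not in used_a and j not in used_b:
            pairs.append((i, j))
            used_a.add(i)
            used_b.add(j)
    return pairs
```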