Video Object Segmentation and Tracking: A Survey
RUI YAO, China University of Mining and Technology
GUOSHENG LIN, Nanyang Technological University
SHIXIONG XIA, JIAQI ZHAO, and YONG ZHOU, China University of Mining and Technology
Object segmentation and object tracking are fundamental research areas in the computer vision community. These two topics are difficult because they must handle common challenges such as occlusion, deformation, motion blur, and scale variation. The former additionally contends with heterogeneous objects, interacting objects, edge ambiguity, and shape complexity, while the latter suffers from difficulties in handling fast motion, out-of-view objects, and real-time processing. Combining the two problems of video object segmentation and tracking (VOST) can overcome their respective difficulties and improve their performance. VOST can be widely applied to many practical applications such as video summarization, high definition video compression, human-computer interaction, and autonomous vehicles. This article aims to provide a comprehensive review of the state-of-the-art methods, classify these methods into different categories, and identify new trends. First, we provide a hierarchical categorization of existing approaches, including unsupervised VOS, semi-supervised VOS, interactive VOS, weakly supervised VOS, and segmentation-based tracking methods. Second, we provide a detailed discussion and overview of the technical characteristics of the different methods. Third, we summarize the characteristics of the related video datasets, and provide a variety of evaluation metrics. Finally, we point out a set of interesting future directions and draw our own conclusions.
Additional Key Words and Phrases: Video object segmentation, object tracking, unsupervised methods,
semi-supervised methods, interactive methods, weakly supervised methods
ACM Reference format:
Rui Yao, Guosheng Lin, Shixiong Xia, Jiaqi Zhao, and Yong Zhou. 2019. Video Object Segmentation and
Tracking: A Survey. 1, 1, Article 1 (January 2019), 39 pages.
DOI: 0000001.0000001
1 INTRODUCTION
The rapid development of intelligent mobile terminals and the Internet has led to an exponential
increase in video data. In order to effectively analyze and use video big data, it is very urgent to automatically segment and track the objects of interest in video. Video object segmentation and tracking are two basic tasks in the field of computer vision. Object segmentation divides the pixels in a video frame into two subsets, the foreground target and the background region, and generates the object segmentation mask; it is a core problem for behavior recognition and video retrieval. Object tracking is used to determine the exact location of the target in the video
This work is supported by the Fundamental Research Funds for the Central Universities (No. 2017XKQY075).
Author's addresses: R. Yao, S. Xia (corresponding author), J. Zhao, and Y. Zhou, School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, 221116, China; emails: {ruiyao, xiasx, jiaqizhao, yzhou}@cumt.edu.cn; G. Lin, School of Computer Science and Engineering, Nanyang Technological University; email: gslin@ntu.edu.sg.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
© 2019 ACM. XXXX-XXXX/2019/1-ART1 $15.00
DOI: 0000001.0000001
, Vol. 1, No. 1, Article 1. Publication date: January 2019.
arXiv:1904.09172v1 [cs.CV] 19 Apr 2019
image and generate the object bounding box, which is a necessary step for intelligent monitoring,
big data video analysis and so on.
The segmentation and tracking problems of video objects seem to be independent, but they
are actually inseparable. That is to say, the solution to one of the problems usually involves
solving another problem implicitly or explicitly. Obviously, by solving the object segmentation
problem, it is easy to get a solution to the object tracking problem. On the one hand, accurate
segmentation results provide reliable object observations for tracking, which can solve problems
such as occlusion, deformation, scaling, etc., and fundamentally avoid tracking failures. Although
not so obvious, the same is true for object tracking problems, which must provide at least a coarse
solution to the problem of object segmentation. On the other hand, accurate object tracking results
can also guide the segmentation algorithm to determine the object position, which reduces the
impact of fast object movement, complex backgrounds, similar objects, etc., and improves object segmentation performance. A large body of research has observed that processing the object segmentation and tracking problems simultaneously can overcome their respective difficulties and improve their performance. The related problems can be divided into two major tasks: video
object segmentation (VOS) and video object tracking (VOT).
The goal of video object segmentation is to segment a particular object instance throughout the entire video sequence, given an object mask for the first frame that is specified manually or obtained automatically; the task has attracted great interest in the computer vision community. Recent VOS algorithms can be organized by their annotations.
The unsupervised and interactive VOS methods denote the two extremes of the degree of user interaction with the method: at one extreme, the former can produce coherent spatio-temporal regions through a bottom-up process without any user input, that is, without any video-specific labels [17, 48, 58, 75, 95, 101]. In contrast, the latter uses a strongly supervised interactive approach that not only requires a pixel-level precise segmentation of the first frame (which is very time consuming for a human to provide), but also needs the human to stay in the loop to correct system errors [13, 23, 104, 114, 175]. Between the two extremes lie semi-supervised VOS approaches, which require a manual annotation to define the foreground object and then automatically segment it in the remaining frames of the sequence [22, 77, 125, 135, 161]. In addition, because video-level labels are convenient to collect, another way to supervise VOS is to produce object masks given video-level tags [154, 203] or natural language expressions [84]. However, as mentioned above, the VOS algorithm implicitly handles the process of tracking. That is, the bottom-up approach uses spatio-temporal motion and appearance similarity to segment the video in a fully automated manner. These methods read multiple or all image frames at once to take full advantage of multi-frame context and to segment a precise object mask. The datasets evaluated by these methods are dominated by short-term videos. Moreover, because these methods iteratively optimize energy functions or fine-tune a deep network, they can be slow.
In contrast to VOS, given a sequence of input images, video object tracking methods utilize a class-specific detector to robustly predict the motion state (location, size, orientation, etc.) of the object in each frame. In general, most VOT methods are especially suitable for processing long-term sequences. Since these methods only need to output the location, orientation, or size of the object, they operate in an online manner for fast processing. For example, tracking-by-detection methods utilize generative [139] and/or discriminative [63, 197] appearance models to accurately estimate the object state. The impressive results of these methods demonstrate accurate and fast tracking. However, most algorithms are limited to generating bounding boxes or ellipses as their output, so when the object undergoes non-rigid or articulated motion, they often suffer from visual drift. To address this problem, part-based tracking methods [198, 199] have been presented, but they still use bounding boxes of parts for object localization. In order to
leverage precise object masks and fast object localization, segmentation-based tracking methods have been developed that combine video object segmentation and tracking [3, 14, 173, 181, 200]. Most of these methods estimate the object results (i.e., bounding boxes and/or object masks) by a combination of bottom-up and top-down algorithms. The contours of deformable objects or articulated motions can be propagated efficiently by these methods.
In the past decade, a large number of video object segmentation and tracking (VOST) studies have been published in the literature. The field of VOST has a wide range of practical applications, including video summarization, high definition (HD) video compression, gesture control, and human-computer interaction. For instance, VOST methods are widely applied to video summarization that exploits visual objects across multiple videos [36], and provide a useful tool that assists video retrieval or web browsing [138]. In the field of video compression, VOST is used in the MPEG-4 video-coding standard to implement content-based features and high coding efficiency [85]. In particular, VOST can encode a video shot as a still background mosaic obtained after compensating for the moving object, by utilizing the content-based representation provided by MPEG-4 [37]. Moreover, VOST can estimate a non-rigid target to achieve accurate tracking localization and mask description, which can identify its motion instructions [180]. These can replace simple human body language, especially various gesture controls.
1.1 Challenges and issues
Many problems in video object segmentation and tracking are very challenging. In general, VOS and VOT share some common challenges, such as background clutter, low resolution, occlusion, deformation, motion blur, scale variation, etc. But there are also specific characteristics determined by their objectives and tasks; for example, objects in VOT can be difficult to follow due to fast motion, out-of-view movement, and real-time processing requirements, while segmentation additionally contends with the effects of heterogeneous objects, interacting objects, edge ambiguity, shape complexity, etc. A more detailed description is given in [126, 184].
To address these problems, tremendous progress has been made in the development of video object segmentation and tracking algorithms. These algorithms mainly differ from each other in how they handle the following issues in visual segmentation and tracking: (i) Which application scenario is suitable for VOST? (ii) Which object representation (i.e., point, superpixel, patch, or object) is adapted to VOS? (iii) Which image features are appropriate for VOST? (iv) How to model the motion of an object in VOST? (v) How to pre-process and post-process in CNN-based VOS methods? (vi) Which datasets are suitable for evaluating VOST, and what are their characteristics? A number of VOST methods have been proposed that attempt to answer these questions for various scenarios. Motivated by this objective, this survey divides video object segmentation and tracking methods into broad categories and provides a comprehensive review of representative approaches. We hope to help readers gain valuable VOST knowledge and choose the most appropriate approach for their specific VOST tasks. In addition, we discuss new trends in video object segmentation and tracking in the community, and hope to provide several interesting ideas for new methods.
1.2 Organization and contributions of this survey
As shown in Fig. 1, we summarize the organization of this survey. To investigate suitable application scenarios for VOST, we group these methods into five main categories: unsupervised VOS, semi-supervised VOS, interactive VOS, weakly supervised VOS, and segmentation-based tracking methods.
The unsupervised VOS algorithms typically rely on certain restrictive assumptions about the application scenario, so they do not require manual annotation of the first frame. According
[Fig. 1. Taxonomy of video object segmentation and tracking: unsupervised VOS methods (background subtraction, point trajectory, over-segmentation, "object-like" segments, CNN based), semi-supervised VOS methods (spatio-temporal graphs, CNN based), interactive VOS methods (graph partitioning, CNN based), weakly supervised VOS methods, and segmentation-based tracking methods (bottom-up based, joint-based).]
to how they discover primary objects using appearance and motion cues, in Sec. 2.1 we categorize them as background subtraction, point trajectory, over-segmentation, "object-like" segments, and convolutional neural network based methods. In Tab. 1, we also summarize object representations (e.g., pixel, superpixel, supervoxel, and patch) and image features. In Sec. 2.2, we describe the semi-supervised VOS methods for modeling appearance representations and temporal connections, and for performing segmentation and tracking jointly. In Tab. 3, we discuss various pre-processing and post-processing strategies of CNN-based VOS methods. In Sec. 2.3, interactive VOS methods are summarized by the type of user interaction and motion cues. In Sec. 2.4, we discuss various kinds of weakly supervised information for video object segmentation. In Sec. 2.5, we group and describe the segmentation-based tracking methods, and explain the advantages and disadvantages of different bottom-up and joint-based frameworks, as shown in Tab. 5 and Tab. 6. In addition, we investigate a number of video datasets for video object segmentation and tracking, and explain the evaluation metrics for pixel-wise mask and bounding box based techniques. Finally, we present several interesting issues for future research in Sec. 4, to help researchers in other related fields explore the possible benefits of VOST techniques.
Although there are surveys on VOS [47, 126] and VOT [103, 185, 201], they are not directly applicable to joint video object segmentation and tracking, unlike our survey. First, Perazzi et al. [126] present a dataset and evaluation metrics for VOS methods, and Erdem et al. [47] propose measures to quantitatively evaluate the performance of VOST methods in 2004. In comparison, we focus not only on summarizing video object segmentation methods, but also on object tracking. Second, Yilmaz et al. [201] and Li et al. [103] discuss generic object tracking algorithms, and Wu et al. [185] evaluate the performance of single object tracking; therefore, they differ from our segmentation-based tracking discussion.
In this survey, we provide a comprehensive review of video object segmentation and tracking, and summarize our contributions as follows: (i) As shown in Fig. 1, a hierarchical categorization of existing approaches to video object segmentation and tracking is provided. We roughly classify the methods into five categories; then, for each category, different methods are further categorized. (ii) We provide a detailed discussion and overview of the technical characteristics of the different methods in unsupervised VOS, semi-supervised VOS, interactive VOS, and segmentation-based tracking. (iii) We summarize the characteristics of the related video datasets, and provide a variety of evaluation metrics.
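The pixel-wise mask and bounding box evaluation metrics mentioned above commonly reduce to the Jaccard index (intersection over union). As an illustration not taken from the paper, and with function names of our own choosing, a minimal NumPy sketch of both variants:

```python
import numpy as np

def mask_iou(a, b):
    """Jaccard index between two boolean segmentation masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 1.0

def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 1.0
```

Benchmarks typically average such per-frame scores over a sequence; the exact aggregation varies by dataset.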
2 MAJOR METHODS
In this section, video object segmentation and tracking methods are grouped into five categories: unsupervised video object segmentation methods, semi-supervised video object segmentation methods, interactive video object segmentation methods, weakly supervised video object segmentation methods, and segmentation-based tracking methods.
2.1 Unsupervised video object segmentation
The unsupervised VOS algorithms do not require any user input; they can automatically discover objects. In general, they assume that the objects to be segmented and tracked have distinctive motion or appear frequently in the image sequence. In the following, we review and discuss five groups of unsupervised methods.
2.1.1 Background subtraction. Early video segmentation methods were primarily geometry based and limited to specific motion backgrounds. The classic background subtraction method models the background appearance of each pixel and treats rapidly changing pixels as foreground. Any significant deviation between the image and the background model represents a moving object. The pixels that make up the changed region are marked for further processing, and a connected-component algorithm is used to estimate the connected region corresponding to the object; hence the name background subtraction. Video object segmentation is thus achieved by constructing a representation of the scene, called the background model, and then finding deviations from the model for each input frame.
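The pipeline just described, maintaining a background model, thresholding deviations, then grouping changed pixels into connected regions, can be sketched as follows. This is an illustrative toy implementation, not the code of any surveyed method; the running-average update, the threshold value, and 4-connectivity are all simplifying assumptions:

```python
import numpy as np
from collections import deque

def update_background(bg, frame, alpha=0.05):
    """Running-average background model: bg <- (1-alpha)*bg + alpha*frame."""
    return (1.0 - alpha) * bg + alpha * frame

def foreground_mask(bg, frame, thresh=25.0):
    """Pixels deviating from the background model beyond `thresh` are foreground."""
    return np.abs(frame - bg) > thresh

def connected_components(mask):
    """Label 4-connected foreground regions; returns a list of pixel lists."""
    h, w = mask.shape
    seen = np.zeros_like(mask, dtype=bool)
    regions = []
    for sy in range(h):
        for sx in range(w):
            if mask[sy, sx] and not seen[sy, sx]:
                q, region = deque([(sy, sx)]), []
                seen[sy, sx] = True
                while q:
                    y, x = q.popleft()
                    region.append((y, x))
                    for ny, nx in ((y-1, x), (y+1, x), (y, x-1), (y, x+1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            q.append((ny, nx))
                regions.append(region)
    return regions
```

Each returned region is a candidate moving object; practical systems add morphological cleanup and shadow handling on top of this skeleton.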
According to the dimension of the motion being modeled, background subtraction methods can be divided into those for stationary backgrounds [44, 61, 150], backgrounds undergoing 2D parametric motion [11, 38, 76, 136], and backgrounds undergoing 3D motion [19, 75, 160].
2.1.1.1 Stationary backgrounds. Background subtraction became popular following the work of Wren et al. [182]. They model the color of each pixel, I(x, y), of a stationary background with a single 3D (Y, U, and V color space) Gaussian, I(x, y) ∼ N(µ(x, y), Σ(x, y)). The model parameters (the mean µ(x, y) and the covariance Σ(x, y)) are learned from the color observations in several consecutive frames. Once the background model is derived, for each pixel (x, y) in the input video frame they calculate the likelihood that its color comes from N(µ(x, y), Σ(x, y)), and pixels that deviate from the background model are marked as foreground pixels. However, Gao et al. [54] show that a single Gaussian is insufficient to model the pixel value while accounting for acquisition noise. Therefore, later work improves the performance of background modeling by using a multimodal statistical model to describe the background color per pixel. For example, Stauffer and Grimson [150] model each pixel as a mixture of Gaussians (MoG) and use an online approximation to update the model. Rather than explicitly modeling the values of all pixels as one particular type of distribution, they model the values of a particular pixel as a mixture of Gaussians. In [44], Elgammal and Davis use nonparametric kernel density estimation to model the background distribution at each pixel.
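The per-pixel Gaussian test described above can be sketched schematically as follows. This is an illustration under simplifying assumptions, not the Wren et al. implementation: it assumes a diagonal covariance (so the likelihood test reduces to a per-channel Mahalanobis distance threshold) rather than the full covariance Σ(x, y) of the original, and the function names are our own:

```python
import numpy as np

def fit_background(frames):
    """Per-pixel mean and variance from a stack of background-only frames
    (shape: T x H x W x C), approximating the Gaussian N(mu, Sigma)
    with a diagonal covariance."""
    mu = frames.mean(axis=0)
    var = frames.var(axis=0) + 1e-6   # floor to avoid zero variance
    return mu, var

def classify(frame, mu, var, thresh=3.0):
    """Mark a pixel as foreground if its Mahalanobis distance (diagonal
    covariance) from the background Gaussian exceeds `thresh`."""
    d2 = ((frame - mu) ** 2 / var).sum(axis=-1)
    return np.sqrt(d2) > thresh
```

The threshold of 3 corresponds to a "3-sigma" deviation test; the multimodal MoG model of Stauffer and Grimson replaces the single (mu, var) pair per pixel with several weighted components updated online.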