Video Object Segmentation and Tracking: A Survey
RUI YAO, China University of Mining and Technology
GUOSHENG LIN, Nanyang Technological University
SHIXIONG XIA, JIAQI ZHAO, and YONG ZHOU, China University of Mining and Technology
Object segmentation and object tracking are fundamental research areas in the computer vision community. These two topics are difficult because they must handle common challenges such as occlusion, deformation, motion blur, and scale variation. The former additionally contends with heterogeneous objects, interacting objects, edge ambiguity, and shape complexity, while the latter suffers from difficulties in handling fast motion, out-of-view objects, and real-time processing. Combining the two problems of video object segmentation and tracking (VOST) can overcome their respective difficulties and improve their performance. VOST can be widely applied to many practical applications such as video summarization, high definition video compression, human-computer interaction, and autonomous vehicles. This article aims to provide a comprehensive review of the state-of-the-art methods, classify these methods into different categories, and identify new trends. First, we provide a hierarchical categorization of existing approaches, including unsupervised VOS, semi-supervised VOS, interactive VOS, weakly supervised VOS, and segmentation-based tracking methods. Second, we provide a detailed discussion and overview of the technical characteristics of the different methods. Third, we summarize the characteristics of the related video datasets, and provide a variety of evaluation metrics. Finally, we point out a set of interesting future directions and draw our own conclusions.
Additional Key Words and Phrases: Video object segmentation, object tracking, unsupervised methods,
semi-supervised methods, interactive methods, weakly supervised methods
ACM Reference format:
Rui Yao, Guosheng Lin, Shixiong Xia, Jiaqi Zhao, and Yong Zhou. 2019. Video Object Segmentation and
Tracking: A Survey. 1, 1, Article 1 (January 2019), 39 pages.
DOI: 0000001.0000001
1 INTRODUCTION
The rapid development of intelligent mobile terminals and the Internet has led to an exponential
increase in video data. In order to effectively analyze and use video big data, it is very urgent to automatically segment and track the objects of interest in video. Video object segmentation and tracking are two basic tasks in the field of computer vision. Object segmentation divides the pixels in a video frame into two subsets, the foreground target and the background region, and generates the object segmentation mask; it is a core problem for behavior recognition and video retrieval. Object tracking is used to determine the exact location of the target in the video
This work is supported by the Fundamental Research Funds for the Central Universities (No. 2017XKQY075).
Author's addresses: R. Yao, S. Xia (corresponding author), J. Zhao, and Y. Zhou, School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, 221116, China; emails: {ruiyao, xiasx, jiaqizhao, yzhou}@cumt.edu.cn; G. Lin, School of Computer Science and Engineering, Nanyang Technological University; email: gslin@ntu.edu.sg.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
© 2019 ACM. XXXX-XXXX/2019/1-ART1 $15.00
DOI: 0000001.0000001
, Vol. 1, No. 1, Article 1. Publication date: January 2019.
arXiv:1904.09172v1 [cs.CV] 19 Apr 2019
image and generate the object bounding box, which is a necessary step for intelligent monitoring,
big data video analysis and so on.
The segmentation and tracking problems of video objects seem to be independent, but they
are actually inseparable. That is to say, the solution to one of the problems usually involves
solving another problem implicitly or explicitly. Obviously, by solving the object segmentation
problem, it is easy to get a solution to the object tracking problem. On the one hand, accurate
segmentation results provide reliable object observations for tracking, which can solve problems
such as occlusion, deformation, scaling, etc., and fundamentally avoid tracking failures. Although
not so obvious, the same is true for object tracking problems, which must provide at least a coarse
solution to the problem of object segmentation. On the other hand, accurate object tracking results
can also guide the segmentation algorithm to determine the object position, which reduces the
impact of fast object movement, complex backgrounds, similar objects, etc., and improves object segmentation performance. A large body of research has observed that processing the object segmentation and tracking problems simultaneously can overcome their respective difficulties and improve their performance. The related problems can be divided into two major tasks: video
object segmentation (VOS) and video object tracking (VOT).
The goal of video object segmentation is to segment a particular object instance throughout the entire video sequence, given an object mask for the first frame that is specified manually or obtained automatically; the task has attracted great interest in the computer vision community. Recent VOS algorithms can be organized by their annotations.
The unsupervised and interactive VOS methods denote the two extremes of the degree of user interaction with the method: at one extreme, the former can produce coherent spatio-temporal regions through a bottom-up process without any user input, that is, without any video-specific labels [17, 48, 58, 75, 95, 101]. In contrast, the latter uses a strongly supervised interactive approach that not only requires a pixel-level precise segmentation of the first frame (which is very time consuming for a human to provide), but also needs the human to stay in the loop to correct system errors [13, 23, 104, 114, 175]. Between the two extremes lie semi-supervised VOS approaches, which require a manual annotation to define the foreground object and then automatically segment it in the remaining frames of the sequence [22, 77, 125, 135, 161]. In addition, because video-level labels are convenient to collect, another way to supervise VOS is to produce object masks given video-level tags [154, 203] or natural language expressions [84]. However, as mentioned above, the VOS algorithm implicitly handles the process of tracking. That is, the bottom-up approach uses spatio-temporal motion and appearance similarity to segment the video in a fully automated manner. These methods read multiple or all image frames at once to take full advantage of multi-frame context and to segment a precise object mask. The datasets evaluated by these methods are dominated by short-term videos. Moreover, because these methods iteratively optimize energy functions or fine-tune a deep network, they can be slow.
In contrast to VOS, given a sequence of input images, video object tracking methods utilize a class-specific detector to robustly predict the motion state (location, size, orientation, etc.) of the object in each frame. In general, most VOT methods are especially suitable for processing long-term sequences. Since these methods only need to output the location, orientation, or size of the object, they operate in an online manner for fast processing. For example, tracking-by-detection methods utilize generative [139] and/or discriminative [63, 197] appearance models to accurately estimate the object state. The impressive results of these methods demonstrate accurate and fast tracking. However, most algorithms are limited to generating bounding boxes or ellipses as their output, so when the object undergoes non-rigid or articulated motion, they often suffer from visual drift. To address this problem, part-based tracking methods [198, 199] have been presented, but they still use bounding boxes of parts for object localization. In order to
leverage precise object masks and fast object localization, segmentation-based tracking methods have been developed that combine video object segmentation and tracking [3, 14, 173, 181, 200]. Most of these methods estimate the object results (i.e., bounding boxes and/or object masks) by a combination of bottom-up and top-down algorithms. The contours of deformable objects or articulated motions can be propagated efficiently by these methods.
In the past decade, a large number of video object segmentation and tracking (VOST) studies have been published in the literature. The field of VOST has a wide range of practical applications, including video summarization, high definition (HD) video compression, gesture control, and human-computer interaction. For instance, VOST methods are widely applied to video summarization that exploits visual objects across multiple videos [36], and provide a useful tool that assists video retrieval or web browsing [138]. In the field of video compression, VOST is used in the MPEG-4 video-coding standard to implement content-based features and high coding efficiency [85]. In particular, VOST can encode a video shot as a still background mosaic obtained after compensating for the moving object, by utilizing the content-based representation provided by MPEG-4 [37]. Moreover, VOST can estimate a non-rigid target to achieve accurate tracking localization and mask description, which can identify its motion instructions [180]. These can replace simple human body language, especially various gesture controls.
1.1 Challenges and issues
Many problems in video object segmentation and tracking are very challenging. In general, VOS and VOT share some common challenges, such as background clutter, low resolution, occlusion, deformation, motion blur, scale variation, etc. But there are also specific characteristics determined by their objectives and tasks; for example, objects in VOT can be difficult to follow due to fast motion, out-of-view movement, and real-time processing requirements, while segmentation additionally contends with the effects of heterogeneous objects, interacting objects, edge ambiguity, shape complexity, etc. A more detailed description is given in [126, 184].
To address these problems, tremendous progress has been made in the development of video object segmentation and tracking algorithms. These algorithms mainly differ from each other in how they handle the following issues in visual segmentation and tracking: (i) Which application scenario is suitable for VOST? (ii) Which object representation (i.e., point, superpixel, patch, or object) is adapted to VOS? (iii) Which image features are appropriate for VOST? (iv) How to model the motion of an object in VOST? (v) How to pre-process and post-process in CNN-based VOS methods? (vi) Which datasets are suitable for evaluating VOST, and what are their characteristics? A number of VOST methods have been proposed that attempt to answer these questions for various scenarios. Motivated by this objective, this survey divides video object segmentation and tracking methods into broad categories and provides a comprehensive review of representative approaches. We hope to help readers gain valuable VOST knowledge and choose the most appropriate approach for their specific VOST tasks. In addition, we discuss new trends in video object segmentation and tracking in the community, and hope to provide several interesting ideas for new methods.
1.2 Organization and contributions of this survey
As shown in Fig. 1, we summarize the organization of this survey. To investigate suitable application scenarios for VOST, we group these methods into five main categories: unsupervised VOS, semi-supervised VOS, interactive VOS, weakly supervised VOS, and segmentation-based tracking methods.
The unsupervised VOS algorithms typically rely on certain restrictive assumptions about the application scenario, so they do not require manual annotation of the first frame. According
[Fig. 1. Taxonomy of video object segmentation and tracking: unsupervised VOS methods (background subtraction, point trajectory, over-segmentation, "object-like" segments, CNN based), semi-supervised VOS methods (spatio-temporal graphs, CNN based), interactive VOS methods (graph partitioning, CNN based), weakly supervised VOS methods, and segmentation-based tracking methods (bottom-up based, joint-based).]
to how they discover primary objects using appearance and motion cues, in Sec. 2.1 we categorize them as background subtraction, point trajectory, over-segmentation, "object-like" segments, and convolutional neural network based methods. In Tab. 1, we also summarize object representations (e.g., pixel, superpixel, supervoxel, and patch) and image features. In Sec. 2.2, we describe the semi-supervised VOS methods for modeling appearance representations and temporal connections, and for performing segmentation and tracking jointly. In Tab. 3, we discuss various pre-processing and post-processing strategies of CNN-based VOS methods. In Sec. 2.3, interactive VOS methods are summarized by the type of user interaction and motion cues. In Sec. 2.4, we discuss various kinds of weakly supervised information for video object segmentation. In Sec. 2.5, we group and describe the segmentation-based tracking methods, and explain the advantages and disadvantages of different bottom-up and joint-based frameworks, as shown in Tab. 5 and Tab. 6. In addition, we investigate a number of video datasets for video object segmentation and tracking, and explain the evaluation metrics for pixel-wise mask and bounding box based techniques. Finally, we present several interesting issues for future research in Sec. 4, to help researchers in other related fields explore the possible benefits of VOST techniques.
Although there are surveys on VOS [47, 126] and VOT [103, 185, 201], they are not directly applicable to joint video object segmentation and tracking, unlike our survey. First, Perazzi et al. [126] present a dataset and evaluation metrics for VOS methods, and Erdem et al. [47] propose measures to quantitatively evaluate the performance of VOST methods in 2004. In comparison, we focus not only on summarizing video object segmentation methods, but also on object tracking. Second, Yilmaz et al. [201] and Li et al. [103] discuss generic object tracking algorithms, and Wu et al. [185] evaluate the performance of single object tracking; therefore, they differ from our segmentation-based tracking discussion.
In this survey, we provide a comprehensive review of video object segmentation and tracking, and summarize our contributions as follows: (i) As shown in Fig. 1, a hierarchical categorization of existing approaches to video object segmentation and tracking is provided. We roughly classify the methods into five categories; then, for each category, different methods are further categorized. (ii) We provide a detailed discussion and overview of the technical characteristics of the different methods in unsupervised VOS, semi-supervised VOS, interactive VOS, and segmentation-based tracking. (iii) We summarize the characteristics of the related video datasets, and provide a variety of evaluation metrics.
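The pixel-wise mask and bounding box evaluation metrics mentioned above commonly reduce to the Jaccard index (intersection over union). As an illustration not taken from the paper, and with function names of our own choosing, a minimal NumPy sketch of both variants:

```python
import numpy as np

def mask_iou(a, b):
    """Jaccard index between two boolean segmentation masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 1.0

def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 1.0
```

Benchmarks typically average such per-frame scores over a sequence; the exact aggregation varies by dataset.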
2 MAJOR METHODS
In this section, video object segmentation and tracking methods are grouped into five categories: unsupervised video object segmentation methods, semi-supervised video object segmentation methods, interactive video object segmentation methods, weakly supervised video object segmentation methods, and segmentation-based tracking methods.
2.1 Unsupervised video object segmentation
The unsupervised VOS algorithms do not require any user input; they can automatically discover objects. In general, they assume that the objects to be segmented and tracked have distinctive motion or appear frequently in the image sequence. In the following, we review and discuss five groups of unsupervised methods.
2.1.1 Background subtraction. Early video segmentation methods were primarily geometry based and limited to specific motion backgrounds. The classic background subtraction method models the background appearance of each pixel and treats rapidly changing pixels as foreground. Any significant deviation between the image and the background model represents a moving object. The pixels that make up the changed region are marked for further processing, and a connected-component algorithm is used to estimate the connected region corresponding to the object; hence the name background subtraction. Video object segmentation is thus achieved by constructing a representation of the scene, called the background model, and then finding deviations from the model for each input frame.
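The pipeline just described, maintaining a background model, thresholding deviations, then grouping changed pixels into connected regions, can be sketched as follows. This is an illustrative toy implementation, not the code of any surveyed method; the running-average update, the threshold value, and 4-connectivity are all simplifying assumptions:

```python
import numpy as np
from collections import deque

def update_background(bg, frame, alpha=0.05):
    """Running-average background model: bg <- (1-alpha)*bg + alpha*frame."""
    return (1.0 - alpha) * bg + alpha * frame

def foreground_mask(bg, frame, thresh=25.0):
    """Pixels deviating from the background model beyond `thresh` are foreground."""
    return np.abs(frame - bg) > thresh

def connected_components(mask):
    """Label 4-connected foreground regions; returns a list of pixel lists."""
    h, w = mask.shape
    seen = np.zeros_like(mask, dtype=bool)
    regions = []
    for sy in range(h):
        for sx in range(w):
            if mask[sy, sx] and not seen[sy, sx]:
                q, region = deque([(sy, sx)]), []
                seen[sy, sx] = True
                while q:
                    y, x = q.popleft()
                    region.append((y, x))
                    for ny, nx in ((y-1, x), (y+1, x), (y, x-1), (y, x+1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            q.append((ny, nx))
                regions.append(region)
    return regions
```

Each returned region is a candidate moving object; practical systems add morphological cleanup and shadow handling on top of this skeleton.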
According to the dimension of the motion being modeled, background subtraction methods can be divided into those for stationary backgrounds [44, 61, 150], backgrounds undergoing 2D parametric motion [11, 38, 76, 136], and backgrounds undergoing 3D motion [19, 75, 160].
2.1.1.1 Stationary backgrounds. Background subtraction became popular following the work of Wren et al. [182]. They model the color of each pixel, I(x, y), of a stationary background with a single 3D (Y, U, and V color space) Gaussian, I(x, y) ∼ N(µ(x, y), Σ(x, y)). The model parameters (the mean µ(x, y) and the covariance Σ(x, y)) are learned from the color observations in several consecutive frames. Once the background model is derived, for each pixel (x, y) in the input video frame they calculate the likelihood that its color comes from N(µ(x, y), Σ(x, y)), and pixels that deviate from the background model are marked as foreground pixels. However, Gao et al. [54] show that a single Gaussian is insufficient to model the pixel value while accounting for acquisition noise. Therefore, later work improves the performance of background modeling by using a multimodal statistical model to describe the background color per pixel. For example, Stauffer and Grimson [150] model each pixel as a mixture of Gaussians (MoG) and use an online approximation to update the model. Rather than explicitly modeling the values of all pixels as one particular type of distribution, they model the values of a particular pixel as a mixture of Gaussians. In [44], Elgammal and Davis use nonparametric kernel density estimation to model the background distribution at each pixel.
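The per-pixel Gaussian test described above can be sketched schematically as follows. This is an illustration under simplifying assumptions, not the Wren et al. implementation: it assumes a diagonal covariance (so the likelihood test reduces to a per-channel Mahalanobis distance threshold) rather than the full covariance Σ(x, y) of the original, and the function names are our own:

```python
import numpy as np

def fit_background(frames):
    """Per-pixel mean and variance from a stack of background-only frames
    (shape: T x H x W x C), approximating the Gaussian N(mu, Sigma)
    with a diagonal covariance."""
    mu = frames.mean(axis=0)
    var = frames.var(axis=0) + 1e-6   # floor to avoid zero variance
    return mu, var

def classify(frame, mu, var, thresh=3.0):
    """Mark a pixel as foreground if its Mahalanobis distance (diagonal
    covariance) from the background Gaussian exceeds `thresh`."""
    d2 = ((frame - mu) ** 2 / var).sum(axis=-1)
    return np.sqrt(d2) > thresh
```

The threshold of 3 corresponds to a "3-sigma" deviation test; the multimodal MoG model of Stauffer and Grimson replaces the single (mu, var) pair per pixel with several weighted components updated online.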