Robust Visual Object Tracking with Two-Stream
Residual Convolutional Networks
Ning Zhang∗, Jingen Liu∗, Ke Wang†, Dan Zeng‡ and Tao Mei§
∗JD AI Research, Mountain View, USA
†Migu Culture & Technology, Beijing, China
‡Shanghai University, Shanghai, China
§JD AI Research, Beijing, China
∗{ning.zhang,jingen.liu}@jd.com, †wangke@migu.cn, ‡dzeng@shu.edu.cn, §tmei@live.com
Abstract—Current deep learning based visual tracking approaches have been very successful, learning target classification and/or estimation models from large amounts of supervised training data in offline mode. However, most of them still fail under more challenging conditions such as dense distractor objects, confusing background, and motion blur. Inspired by the human “visual tracking” capability, which leverages motion cues to distinguish the target from the background, we propose a Two-Stream Residual Convolutional Network (TS-RCN) for visual tracking that exploits both appearance and motion features for model update. Our TS-RCN can be integrated with existing deep learning based visual trackers. To further improve tracking performance, we adopt a “wider” residual network, ResNeXt, as the feature-extraction backbone. To the best of our knowledge, TS-RCN is the first end-to-end trainable two-stream visual tracking system that makes full use of both the appearance and motion features of the target. We have extensively evaluated TS-RCN on the most widely used benchmark datasets, including VOT2018, VOT2019, and GOT-10K. The experimental results demonstrate that our two-stream model greatly outperforms the appearance-based tracker and achieves state-of-the-art performance. The tracking system runs at up to 38.1 FPS.
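As a rough illustration of the two-stream design sketched in the abstract, the following minimal PyTorch snippet pairs a ResNeXt-50 appearance stream over RGB frames with a parallel motion stream over two-channel optical flow, fused by elementwise summation. The fusion scheme, layer wiring, and module names here are illustrative assumptions, not the exact TS-RCN architecture.

```python
# Minimal two-stream feature-extraction sketch (assumptions, not exact TS-RCN).
import torch
import torch.nn as nn
import torchvision.models as models

class TwoStreamFeatures(nn.Module):
    def __init__(self):
        super().__init__()
        appearance = models.resnext50_32x4d(weights=None)  # RGB stream backbone
        motion = models.resnext50_32x4d(weights=None)      # optical-flow stream backbone
        # Adapt the motion stream's first conv to 2-channel flow input (u, v).
        motion.conv1 = nn.Conv2d(2, 64, kernel_size=7, stride=2,
                                 padding=3, bias=False)
        # Keep only the convolutional trunks (drop avgpool and fc).
        self.appearance = nn.Sequential(*list(appearance.children())[:-2])
        self.motion = nn.Sequential(*list(motion.children())[:-2])

    def forward(self, rgb, flow):
        # Elementwise summation is one plausible fusion choice.
        return self.appearance(rgb) + self.motion(flow)

feats = TwoStreamFeatures()(torch.randn(1, 3, 224, 224),
                            torch.randn(1, 2, 224, 224))
print(feats.shape)  # torch.Size([1, 2048, 7, 7])
```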
I. INTRODUCTION
Generic visual object tracking predicts the location of a class-agnostic object in every frame of a video sequence. It is a highly challenging task due to its class-agnostic nature, background distraction, illumination discrepancy, motion blur, and many other factors [1], [2]. In general, a visual tracking system needs to perform two tasks simultaneously: target classification and bounding-box estimation [3]. The former coarsely separates the target object region in the current frame from the background, while the latter estimates the precise bounding box (i.e., the tracker state) of the target object.
Recently, researchers have made great progress in visual object tracking by exploiting the power of deep convolutional networks. Siamese tracking approaches [4] leverage a large amount of supervised data to learn a general region similarity measurement in offline mode, which reduces tracking to searching for the image region most similar to the target template. Due to their lack of background-appearance modeling (e.g., of distractor objects), however, Siamese approaches handle unseen objects and distractors poorly. To address these limitations, Bhat et al. [5] propose a discriminative learning architecture (DiMP) that fully exploits both target and background appearance.
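For concreteness, the Siamese similarity search described above can be written as a cross-correlation between template and search-region features, with the response-map peak giving the most similar location. The backbone-agnostic sketch below uses illustrative feature sizes and is not tied to any specific tracker.

```python
# Siamese similarity search as cross-correlation (illustrative sketch).
import torch
import torch.nn.functional as F

def siamese_response(template_feat, search_feat):
    # template_feat: (C, h, w) features of the target exemplar crop.
    # search_feat:   (C, H, W) features of the larger search region.
    # Cross-correlation == conv2d with the template features as the kernel.
    return F.conv2d(search_feat.unsqueeze(0),
                    template_feat.unsqueeze(0))  # (1, 1, H-h+1, W-w+1)

response = siamese_response(torch.randn(256, 6, 6),
                            torch.randn(256, 22, 22))
peak = response.flatten().argmax()  # index of the most similar location
```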
Fig. 1: Illustration of the limitations of appearance-based visual tracking. The first column shows the tracking results, where the green, blue, and red bounding boxes represent the ground truth, the DiMP tracker, and our TS-RCN tracker, respectively. The second column shows the HSV-color visualization of the optical flow. In all three cases, the optical flow of the target exhibits a different pattern from that of its local background. Row (a) shows dense similar objects (i.e., crabs) as distractors; Row (b) shows confusing background textures as distractors (i.e., the flying drone blends into the background buildings); Row (c) shows a motion-blurred target (i.e., a soccer ball). This figure is best viewed in PDF format.
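The HSV-color flow visualization in Fig. 1 follows the common convention of encoding flow direction as hue and flow magnitude as value. A short OpenCV sketch of this convention is given below; the choice of Farneback flow estimation here is an assumption for illustration only, not necessarily the method used to produce the figure.

```python
# HSV visualization of dense optical flow (hue = direction, value = magnitude).
import cv2
import numpy as np

def flow_to_hsv_image(prev_gray, curr_gray):
    # Dense flow between two grayscale frames (Farneback as an example method).
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    hsv = np.zeros((*prev_gray.shape, 3), dtype=np.uint8)
    hsv[..., 0] = ang * 180 / np.pi / 2          # hue: flow direction
    hsv[..., 1] = 255                            # full saturation
    hsv[..., 2] = cv2.normalize(mag, None, 0, 255,
                                cv2.NORM_MINMAX).astype(np.uint8)
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)
```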
Although DiMP is trained to separate the background from the target, it may still fail when the background becomes more confusing and challenging. As shown in Fig. 1 (a) and (b), DiMP fails to track the targets (blue bounding boxes) due to same-class distractors (i.e., similar crabs) and confusing background texture, respectively. Additionally, tracking can be lost when the target is motion-blurred in the current frame; for example, the soccer ball in Fig. 1 (c) is blurred due to its high speed. All the aforementioned issues can arise in most deep learning based tracking approaches