
Learning Multi-Domain Convolutional Neural Networks for Visual Tracking
Hyeonseob Nam Bohyung Han
Dept. of Computer Science and Engineering, POSTECH, Korea
{namhs09, bhhan}@postech.ac.kr
Abstract
We propose a novel visual tracking algorithm based on
the representations from a discriminatively trained Convo-
lutional Neural Network (CNN). Our algorithm pretrains
a CNN using a large set of videos with tracking ground-
truths to obtain a generic target representation. Our net-
work is composed of shared layers and multiple branches
of domain-specific layers, where domains correspond to in-
dividual training sequences and each branch is responsible
for binary classification to identify the target in each do-
main. We train the network with respect to each domain
iteratively to obtain generic target representations in the
shared layers. When tracking a target in a new sequence,
we construct a new network by combining the shared lay-
ers in the pretrained CNN with a new binary classification
layer, which is updated online. Online tracking is performed
by evaluating the candidate windows randomly sampled
around the previous target state. The proposed algorithm
demonstrates outstanding performance compared with state-of-the-art methods in existing tracking benchmarks.
1. Introduction
Convolutional Neural Networks (CNNs) have recently
been applied to various computer vision tasks such as im-
age classification [27, 5, 34], semantic segmentation [30],
object detection [13], and many others [37, 36]. Such great
success of CNNs is mostly attributed to their outstanding
performance in representing visual data. Visual tracking,
however, has been less affected by these popular trends
since it is difficult to collect a large amount of training data
for video processing applications and training algorithms
specialized for visual tracking are not available yet, while
approaches based on low-level hand-crafted features still
work well in practice [18, 6, 21, 42]. Several recent track-
ing algorithms [20, 39] have addressed the data deficiency
issue by transferring CNNs pretrained on a large-scale classification dataset such as ImageNet [33]. Although these methods may be sufficient to obtain generic feature representations, their effectiveness in terms of tracking is limited
due to the fundamental inconsistency between classification
and tracking problems, i.e., predicting object class labels
versus locating targets of arbitrary classes.
To fully exploit the representation power of CNNs in vi-
sual tracking, it is desirable to train them on large-scale data
specialized for visual tracking, which cover a wide range
of variations in the combination of target and background.
However, it is truly challenging to learn a unified represen-
tation based on the video sequences that have completely
different characteristics. Note that individual sequences in-
volve different types of targets whose class labels, moving
patterns, and appearances are different, and tracking algo-
rithms suffer from sequence-specific challenges including
occlusion, deformation, lighting condition change, motion
blur, etc. Training CNNs is even more difficult since the same kind of object can be considered a target in one sequence and a background object in another. Due to such
variations and inconsistencies across sequences, we believe
that the ordinary learning methods based on the standard
classification task are not appropriate, and another approach
to capture sequence-independent information should be in-
corporated for better representations for tracking.
Motivated by this fact, we propose a novel CNN archi-
tecture, referred to as Multi-Domain Network (MDNet), to
learn the shared representation of targets from multiple an-
notated video sequences for tracking, where each video is
regarded as a separate domain. The proposed network has
separate branches of domain-specific layers for binary clas-
sification at the end of the network, and shares the common
information captured from all sequences in the preceding
layers for generic representation learning. Each domain in
MDNet is trained separately and iteratively while the shared
layers are updated in every iteration. By employing this
strategy, we separate domain-independent information from domain-specific information and learn generic feature representations for visual tracking. Another interesting aspect of our
architecture is that we design the CNN with a small number
of layers compared to the networks for classification tasks
such as AlexNet [27] and VGG nets [5, 34].
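The multi-domain training strategy described above can be sketched in code. The following is a minimal NumPy toy, not the authors' implementation: the single linear shared layer, the layer sizes, the learning rate, and the synthetic per-domain data are all illustrative assumptions standing in for the paper's convolutional and fully connected layers. It only shows the key idea that every iteration handles one domain, updating that domain's binary-classification head together with the shared layer:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class MultiDomainNet:
    """Toy stand-in for MDNet: one shared layer + one binary head per domain."""

    def __init__(self, n_in, n_hidden, n_domains, lr=0.1):
        self.W_shared = rng.normal(0.0, 0.1, (n_in, n_hidden))   # shared layers
        self.heads = [rng.normal(0.0, 0.1, n_hidden)             # domain-specific
                      for _ in range(n_domains)]                 # branches
        self.lr = lr

    def forward(self, X, d):
        H = np.maximum(X @ self.W_shared, 0.0)     # shared ReLU features
        return H, sigmoid(H @ self.heads[d])       # target score for domain d

    def train_step(self, X, y, d):
        """One SGD step on domain d: updates head d AND the shared layer."""
        H, p = self.forward(X, d)
        err = p - y                                # dL/dz for logistic loss
        grad_head = H.T @ err / len(y)
        grad_shared = X.T @ (np.outer(err, self.heads[d]) * (H > 0)) / len(y)
        self.heads[d] -= self.lr * grad_head
        self.W_shared -= self.lr * grad_shared     # shared info from all domains

# Synthetic data: each "domain" is a sequence with its own target/background
# labeling, so the heads disagree while the shared layer must serve them all.
net = MultiDomainNet(n_in=8, n_hidden=4, n_domains=3)
domains = []
for d in range(3):
    X = rng.normal(0.0, 1.0, (40, 8))
    w_true = rng.normal(0.0, 1.0, 8)
    domains.append((X, (X @ w_true > 0).astype(float)))

# Iterate over domains one at a time, as in the paper's strategy of training
# the network with respect to a single domain in each iteration.
for it in range(200):
    d = it % 3
    X, y = domains[d]
    net.train_step(X, y, d)
```

At test time the sketch would mirror the paper's procedure: discard all the heads, attach a fresh randomly initialized head for the new sequence, and fine-tune it online on top of the (now generic) shared features.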
We also propose an effective online tracking framework
based on the representations learned by MDNet. When a
arXiv:1510.07945v2 [cs.CV] 6 Jan 2016