in the video sequences. For one thing, the historical target
appearances usually influence the subsequent ones, so temporal
information is an important clue for predicting the next state of
the target. For another, the appearance of the target changes
gradually during tracking, so the spatial layout within the target
itself is critical for precise localization. Consequently,
spatial-temporal information is widely exploited in object
tracking [5], [7], [32], [37], [41]–[43].
For instance, the CLRST tracker [41] adaptively prunes and
selects candidate particles by exploiting temporal consistency.
DeepTrack [5] represents temporal adaptation through the
update of CNNs in a CNN pool. STT [44] embeds temporal
information into subspace learning and constructs a spatial
appearance context model to support tracking. RTT [37]
employs long-range contextual cues, which are also a type of
spatial-temporal information, in tracking. Re3 [45] incorporates
temporal information into the model and trains a recurrent
tracking network that translates the image embedding into an
output bounding box. TSN [7] establishes a temporal model in
the sense of sparse coding. STRCF [32] incorporates both spatial
and temporal regularization into the Discriminative Correlation
Filter (DCF) framework. CREST [46] introduces spatial and
temporal residual learning into DCF. The IGS tracker [47]
incorporates structural information on local part variations
under a global constraint. Another spatial perspective is to model
the spatial object structure [38] to assist CNN layers. Optical
flow is also a form of spatial-temporal model: it can be
employed directly as features [48], while in [49] a CNN
is trained on the optical flow images. FlowTrack [50]
formulates optical flow estimation as a special layer in a deep
network.
Different from these uses of spatial-temporal information,
we design a spatial structure of the target described
by several LSTM slices and propose temporally related spatial
structures to express the evolutionary process of the target in
a video sequence. The explicit dynamic representation of the
target in a sequence also facilitates video-based tasks. Note that
both our previous work published in ICCV 2017 [7] and this
work estimate the similarity between proposals extracted from
the current frame and the historical target representation.
However, the previous paper employs a tuple learning module
to represent the historical target, whereas this work proposes
an explicit dynamic representation of the historical target,
implemented by a spatial-temporal LSTM network. In other
words, the two works differ mainly in the target representation
and the network construction. The video target representation,
which can perceive subtle spatial and temporal variations of
the target, is deployed in the proposed tracking network, which
is trained in a two-stage manner.
In the remainder of this paper, we first present the
architecture of ST-LSTM in Section III, where the video target
representation is exploited to obtain a precise description of
the future target. Then the tracking framework is explained in
Section IV. Section V reports the experimental results as well
as the implementation details. Finally, Section VI presents
concluding remarks about the proposed methodology.
III. ST-LSTM ARCHITECTURE
Targets may have different behaviors in video sequences:
they may have different appearances and move with different
velocities and accelerations. We need a model that can
understand and learn such target-specific temporal and spatial
properties from limited initial observations.
LSTM networks have obtained impressive results in tasks
like language translation [51] and speech recognition [52].
Inspired by this, we develop a model to explicitly account
for the behavior of the target in a sequence. In particular,
we construct a spatial LSTM from several LSTM cells for each
frame of a sequence. This spatial LSTM learns the spatial
appearance variation of the target itself. However, the naive use
of this spatial LSTM per frame does not capture the interactions
between frames, as it is agnostic to the target behavior in
context frames. We address this by composing a temporal
LSTM architecture based on the spatial LSTMs, as presented
in Fig. 1. The hidden states of the temporal LSTM are expected
to capture time-varying target properties. The intention of the
proposed ST-LSTM network is to predict the future appearance
of the target from the historical appearances, and the
ST-LSTM weights are shared across all the sequences.
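To make this composition concrete, the minimal PyTorch sketch below wires a per-frame spatial scan over S slices into a temporal scan over T frames. It is an illustrative assumption, not the paper's exact implementation: for brevity a single LSTMCell is reused across all spatial slices, whereas the spatial LSTM described later uses separate parameters per slice, and feeding only the last spatial hidden state of each frame into the temporal LSTM is likewise an assumed simplification. The names SpatialTemporalLSTMSketch, slice_dim, and hidden_dim are hypothetical.

```python
import torch
import torch.nn as nn

class SpatialTemporalLSTMSketch(nn.Module):
    """Illustrative spatial-temporal LSTM composition (assumed simplification)."""

    def __init__(self, slice_dim, hidden_dim):
        super().__init__()
        self.spatial_cell = nn.LSTMCell(slice_dim, hidden_dim)    # scans S slices within a frame
        self.temporal_cell = nn.LSTMCell(hidden_dim, hidden_dim)  # scans T frames

    def forward(self, x):
        # x: (T, S, slice_dim) -- T frames, each reconfigured into S spatial slices
        T, S, _ = x.shape
        hidden = self.temporal_cell.hidden_size
        ht, ct = torch.zeros(1, hidden), torch.zeros(1, hidden)   # temporal state
        temporal_states = []
        for t in range(T):
            hs, cs = torch.zeros(1, hidden), torch.zeros(1, hidden)  # spatial state, reset per frame
            for s in range(S):
                hs, cs = self.spatial_cell(x[t, s].unsqueeze(0), (hs, cs))
            # the frame-level spatial summary drives the temporal LSTM
            ht, ct = self.temporal_cell(hs, (ht, ct))
            temporal_states.append(ht)
        return torch.stack(temporal_states)  # (T, 1, hidden_dim)

# usage with assumed sizes: 5 frames, 4 spatial slices of 256-dim features
model = SpatialTemporalLSTMSketch(slice_dim=256, hidden_dim=128)
out = model(torch.randn(5, 4, 256))
```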
First of all, the architecture of the LSTM cells used in the
proposed ST-LSTM is illustrated in Fig. 2, where ⊗ is an
element-wise multiplication. In addition to a hidden cell $hp_{t-1}$,
an LSTM cell includes an input gate $i_t$, input modulation gate
$g_t$, forget gate $f_t$, and output gate $o_t$, as described in (1)∼(4),
where $\sigma$ is the sigmoid function.

$i_t = \sigma(W_{xi} In_t + W_{hi} hp_{t-1} + b_i)$  (1)
$g_t = \tanh(W_{xg} In_t + W_{hg} hp_{t-1} + b_g)$  (2)
$f_t = \sigma(W_{xf} In_t + W_{hf} hp_{t-1} + b_f)$  (3)
$o_t = \sigma(W_{xo} In_t + W_{ho} hp_{t-1} + b_o)$  (4)
The memory cell $Cp_t$ learns to selectively forget or memorize
its previous memory and current input, while the output
gate $o_t$ learns how much of the memory cell to transfer to the
hidden cell. This enables the LSTM to learn complex and long-
term temporal dynamics for prediction tasks. Given the inputs
$In_t$, $hp_{t-1}$, and $Cp_{t-1}$, the LSTM cell updates the hidden cell and
memory cell to $hp_t$ and $Cp_t$ as stated in (5) and (6).

$hp_t = o_t \otimes \tanh(Cp_t)$  (5)
$Cp_t = f_t \otimes Cp_{t-1} + i_t \otimes g_t$  (6)
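As a concrete illustration of (1)∼(6), the short NumPy sketch below performs one LSTM cell update. The helper name lstm_cell_step, the dictionary layout of the weights, and the dimensions are assumptions for illustration only; the equations themselves follow the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(In_t, hp_prev, Cp_prev, W, b):
    """One LSTM cell update following (1)-(6); W holds the eight
    weight matrices W_x*, W_h* and b holds the four bias vectors."""
    i_t = sigmoid(W["xi"] @ In_t + W["hi"] @ hp_prev + b["i"])   # input gate, Eq. (1)
    g_t = np.tanh(W["xg"] @ In_t + W["hg"] @ hp_prev + b["g"])   # input modulation gate, Eq. (2)
    f_t = sigmoid(W["xf"] @ In_t + W["hf"] @ hp_prev + b["f"])   # forget gate, Eq. (3)
    o_t = sigmoid(W["xo"] @ In_t + W["ho"] @ hp_prev + b["o"])   # output gate, Eq. (4)
    Cp_t = f_t * Cp_prev + i_t * g_t                             # memory cell update, Eq. (6)
    hp_t = o_t * np.tanh(Cp_t)                                   # hidden cell update, Eq. (5)
    return hp_t, Cp_t

# usage with assumed sizes: 256-dim input, 128-dim hidden/memory cell
rng = np.random.default_rng(0)
d_in, d_h = 256, 128
W = {k: rng.standard_normal((d_h, d_in if k.startswith("x") else d_h)) * 0.01
     for k in ("xi", "hi", "xg", "hg", "xf", "hf", "xo", "ho")}
b = {k: np.zeros(d_h) for k in ("i", "g", "f", "o")}
hp, Cp = lstm_cell_step(rng.standard_normal(d_in), np.zeros(d_h), np.zeros(d_h), W, b)
```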
ST-LSTM starts by each visual input $I_t$ (the target of the $t$-th
frame from a video) going through a feature transformation
$\Phi_F(\cdot)$ with parameters $F$, generally a CNN (the Feature Net in
our network, as shown in Fig. 1), to produce a fixed-length
vector representation. The outputs of $\Phi_F(\cdot)$ are then passed
into a spatial LSTM unit. We define $X_t = \Phi_F(I_t)$ and
spatially reconfigure the input of the spatial LSTM model as
$X_t = [x^t_1, x^t_2, \ldots, x^t_S]^T$, where $S$ is the depth of a spatial LSTM
unit.
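To make the reconfiguration concrete, the sketch below reshapes a fixed-length Feature Net output into S slices before the spatial LSTM scan. The equal-length reshape, the 1024-dim feature size, and the name reconfigure_spatially are assumptions; the paper's actual slicing of $X_t$ may differ.

```python
import numpy as np

def reconfigure_spatially(X_t, S):
    """Reshape a fixed-length feature vector X_t = Phi_F(I_t)
    into S slices [x^t_1, ..., x^t_S] for the spatial LSTM unit
    (assumed equal-length slicing)."""
    assert X_t.size % S == 0, "feature length must be divisible by S"
    return X_t.reshape(S, -1)   # row i holds slice x^t_{i+1}

# usage with assumed sizes: a 1024-dim Feature Net output split into S = 4 slices
X_t = np.random.default_rng(0).standard_normal(1024)
slices = reconfigure_spatially(X_t, S=4)   # shape (4, 256)
```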
Denote all the parameters in the spatial LSTM module as
$\{W^t_{S_1}, W^t_{S_2}, \ldots, W^t_{S_S}\}$ with $t = 1, 2, \ldots, T$, where $T$ is the depth
of the temporal LSTM module. Each LSTM cell in a spatial
LSTM unit maps an input $x^t_i$ and a previous time step hidden