in the video sequences. For one thing, the historical target
appearances usually influence the subsequent ones, so temporal
information is an important clue for predicting the next state of
the target. For another, the appearance of the target changes
gradually during tracking, so the spatial layout within the target
itself is critical for precise localization. Consequently,
spatial-temporal information is widely exploited in object
tracking [5], [7], [32], [37], [41]–[43].
For instance, the CLRST tracker [41] adaptively prunes and
selects candidate particles by exploiting temporal consistency.
DeepTrack [5] represents temporal adaptation through the
update of CNNs in a CNN pool. STT [44] embeds temporal
information into subspace learning and constructs a spatial
appearance context model to support tracking. RTT [37]
employs long-range contextual cues, which are also a type of
spatial-temporal information, in tracking. Re3 [45] incorporates
temporal information into the model and trains a recurrent
tracking network that translates the image embedding into an
output bounding box. TSN [7] establishes a temporal model in
the sense of sparse coding. STRCF [32] incorporates both spatial
and temporal regularization into the Discriminative Correlation
Filter (DCF) framework. CREST [46] introduces spatial and
temporal residual learning into DCF. The IGS tracker [47]
incorporates structural information on local part variations
under a global constraint. Another spatial perspective is to model
the spatial object structure [38] to assist CNN layers. Optical
flow is also a form of spatial-temporal model: it can be
employed directly as features [48], while in [49] a CNN
is trained on the optical flow images. FlowTrack [50]
formulates optical flow estimation as a special layer in a deep
network.
Different from these uses of spatial-temporal information,
we design a spatial structure of the target described
by several LSTM slices and propose temporally related spatial
structures to express the evolutionary process of the target in
a video sequence. The explicit dynamic representation of the
target in a sequence also facilitates video-based tasks. Note that
both our previous work published in ICCV 2017 [7] and this
work estimate the similarity between proposals extracted from
the current frame and the historical target representation.
However, the previous paper employs a tuple learning module
to represent the historical target, whereas this work proposes
an explicit dynamic representation of the historical target,
implemented by a spatial-temporal LSTM network. In other
words, the two works differ mainly in the target representation
and the network construction. The video target representation,
which can perceive subtle spatial and temporal variations of
the target, is deployed in the proposed tracking network, which
is trained in a two-stage manner.
In the remainder of this paper, we first present the
architecture of ST-LSTM in Section III, where the video target
representation is exploited to obtain a precise description of
the future target. Then the tracking framework is explained in
Section IV. Section V reports the experimental results as well
as the implementation details. Finally, Section VI presents
concluding remarks about the proposed methodology.
III. ST-LSTM ARCHITECTURE
Targets may have different behaviors in video sequences:
they may have different appearances and move with different
velocities and accelerations. We need a model that can
understand and learn such target-specific temporal and spatial
properties from limited initial observations.
LSTM networks have obtained impressive results in tasks
like language translation [51] and speech recognition [52].
Inspired by this, we develop a model to explicitly account
for the behavior of the target in a sequence. In particular,
we construct a spatial LSTM from several LSTM cells for each
frame of a sequence. This spatial LSTM learns the spatial
appearance variation of the target itself. However, the naive use
of this spatial LSTM per frame does not capture the interactions
between frames, as it is agnostic to the target behavior in
context frames. We address this by composing a temporal
LSTM architecture based on the spatial LSTMs, as presented
in Fig. 1. The hidden states of the temporal LSTM are expected
to capture time-varying target properties. The intention of the
proposed ST-LSTM network is to predict the future appearance
of the target from the historical appearances, and the
ST-LSTM weights are shared across all the sequences.
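To make this composition concrete, the minimal PyTorch sketch below wires a per-frame spatial scan over S slices into a temporal scan over T frames. It is an illustrative assumption, not the paper's exact implementation: for brevity a single LSTMCell is reused across all spatial slices, whereas the spatial LSTM described later uses separate parameters per slice, and feeding only the last spatial hidden state of each frame into the temporal LSTM is likewise an assumed simplification. The names SpatialTemporalLSTMSketch, slice_dim, and hidden_dim are hypothetical.

```python
import torch
import torch.nn as nn

class SpatialTemporalLSTMSketch(nn.Module):
    """Illustrative spatial-temporal LSTM composition (assumed simplification)."""

    def __init__(self, slice_dim, hidden_dim):
        super().__init__()
        self.spatial_cell = nn.LSTMCell(slice_dim, hidden_dim)    # scans S slices within a frame
        self.temporal_cell = nn.LSTMCell(hidden_dim, hidden_dim)  # scans T frames

    def forward(self, x):
        # x: (T, S, slice_dim) -- T frames, each reconfigured into S spatial slices
        T, S, _ = x.shape
        hidden = self.temporal_cell.hidden_size
        ht, ct = torch.zeros(1, hidden), torch.zeros(1, hidden)   # temporal state
        temporal_states = []
        for t in range(T):
            hs, cs = torch.zeros(1, hidden), torch.zeros(1, hidden)  # spatial state, reset per frame
            for s in range(S):
                hs, cs = self.spatial_cell(x[t, s].unsqueeze(0), (hs, cs))
            # the frame-level spatial summary drives the temporal LSTM
            ht, ct = self.temporal_cell(hs, (ht, ct))
            temporal_states.append(ht)
        return torch.stack(temporal_states)  # (T, 1, hidden_dim)

# usage with assumed sizes: 5 frames, 4 spatial slices of 256-dim features
model = SpatialTemporalLSTMSketch(slice_dim=256, hidden_dim=128)
out = model(torch.randn(5, 4, 256))
```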
First of all, the architecture of the LSTM cells used in the
proposed ST-LSTM is illustrated in Fig. 2, where ⊗ is an
element-wise multiplication. In addition to a hidden cell $hp_{t-1}$,
an LSTM cell includes an input gate $i_t$, input modulation gate
$g_t$, forget gate $f_t$, and output gate $o_t$, as described in (1)∼(4),
where $\sigma$ is the sigmoid function.

$i_t = \sigma(W_{xi} In_t + W_{hi} hp_{t-1} + b_i)$  (1)
$g_t = \tanh(W_{xg} In_t + W_{hg} hp_{t-1} + b_g)$  (2)
$f_t = \sigma(W_{xf} In_t + W_{hf} hp_{t-1} + b_f)$  (3)
$o_t = \sigma(W_{xo} In_t + W_{ho} hp_{t-1} + b_o)$  (4)
The memory cell $Cp_t$ learns to selectively forget or memorize
its previous memory and current input, while the output
gate $o_t$ learns how much of the memory cell to transfer to the
hidden cell. This enables the LSTM to learn complex and long-
term temporal dynamics for prediction tasks. Given the inputs
$In_t$, $hp_{t-1}$, and $Cp_{t-1}$, the LSTM cell updates the hidden cell and
memory cell to $hp_t$ and $Cp_t$ as stated in (5) and (6).

$hp_t = o_t \otimes \tanh(Cp_t)$  (5)
$Cp_t = f_t \otimes Cp_{t-1} + i_t \otimes g_t$  (6)
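As a concrete illustration of (1)∼(6), the short NumPy sketch below performs one LSTM cell update. The helper name lstm_cell_step, the dictionary layout of the weights, and the dimensions are assumptions for illustration only; the equations themselves follow the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(In_t, hp_prev, Cp_prev, W, b):
    """One LSTM cell update following (1)-(6); W holds the eight
    weight matrices W_x*, W_h* and b holds the four bias vectors."""
    i_t = sigmoid(W["xi"] @ In_t + W["hi"] @ hp_prev + b["i"])   # input gate, Eq. (1)
    g_t = np.tanh(W["xg"] @ In_t + W["hg"] @ hp_prev + b["g"])   # input modulation gate, Eq. (2)
    f_t = sigmoid(W["xf"] @ In_t + W["hf"] @ hp_prev + b["f"])   # forget gate, Eq. (3)
    o_t = sigmoid(W["xo"] @ In_t + W["ho"] @ hp_prev + b["o"])   # output gate, Eq. (4)
    Cp_t = f_t * Cp_prev + i_t * g_t                             # memory cell update, Eq. (6)
    hp_t = o_t * np.tanh(Cp_t)                                   # hidden cell update, Eq. (5)
    return hp_t, Cp_t

# usage with assumed sizes: 256-dim input, 128-dim hidden/memory cell
rng = np.random.default_rng(0)
d_in, d_h = 256, 128
W = {k: rng.standard_normal((d_h, d_in if k.startswith("x") else d_h)) * 0.01
     for k in ("xi", "hi", "xg", "hg", "xf", "hf", "xo", "ho")}
b = {k: np.zeros(d_h) for k in ("i", "g", "f", "o")}
hp, Cp = lstm_cell_step(rng.standard_normal(d_in), np.zeros(d_h), np.zeros(d_h), W, b)
```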
ST-LSTM starts by each visual input $I_t$ (the target of the $t$-th
frame from a video) going through a feature transformation
$\Phi_F(\cdot)$ with parameters $F$, generally a CNN (the Feature Net in
our network, as shown in Fig. 1), to produce a fixed-length
vector representation. The outputs of $\Phi_F(\cdot)$ are then passed
into a spatial LSTM unit. We define $X_t = \Phi_F(I_t)$ and
spatially reconfigure the input of the spatial LSTM model as
$X_t = [x^t_1, x^t_2, \ldots, x^t_S]^T$, where $S$ is the depth of a spatial LSTM
unit.
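To make the reconfiguration concrete, the sketch below reshapes a fixed-length Feature Net output into S slices before the spatial LSTM scan. The equal-length reshape, the 1024-dim feature size, and the name reconfigure_spatially are assumptions; the paper's actual slicing of $X_t$ may differ.

```python
import numpy as np

def reconfigure_spatially(X_t, S):
    """Reshape a fixed-length feature vector X_t = Phi_F(I_t)
    into S slices [x^t_1, ..., x^t_S] for the spatial LSTM unit
    (assumed equal-length slicing)."""
    assert X_t.size % S == 0, "feature length must be divisible by S"
    return X_t.reshape(S, -1)   # row i holds slice x^t_{i+1}

# usage with assumed sizes: a 1024-dim Feature Net output split into S = 4 slices
X_t = np.random.default_rng(0).standard_normal(1024)
slices = reconfigure_spatially(X_t, S=4)   # shape (4, 256)
```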
Denote all the parameters in the spatial LSTM module as
$\{W^t_{S_1}, W^t_{S_2}, \ldots, W^t_{S_S}\}$ with $t = 1, 2, \ldots, T$, where $T$ is the depth
of the temporal LSTM module. Each LSTM cell in a spatial
LSTM unit maps an input $x^t_i$ and a previous time step hidden