As research on LSTMs has progressed, hidden units with varying connections within the memory unit have been proposed. We use the LSTM unit as described in [45] (Figure 2, right), which is a slight simplification of the one described in [10]. Letting $\sigma(x) = (1 + e^{-x})^{-1}$ be the sigmoid non-linearity which squashes real-valued inputs to a $[0, 1]$ range, and letting $\phi(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} = 2\sigma(2x) - 1$ be the hyperbolic tangent non-linearity, similarly squashing its inputs to a $[-1, 1]$ range, the LSTM updates for timestep $t$ given inputs $x_t$, $h_{t-1}$, and $c_{t-1}$ are:
\begin{align*}
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i) \\
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f) \\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o) \\
g_t &= \phi(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \phi(c_t)
\end{align*}
where $\odot$ denotes elementwise multiplication.
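To make the gating concrete, here is a minimal NumPy sketch of one LSTM update implementing the equations above. The parameter names ($W_{xi}$, $b_i$, etc.) mirror the equations; the `params` dictionary, array shapes, and function signature are illustrative assumptions rather than the configuration used in this paper.

```python
# A minimal NumPy sketch of a single LSTM timestep implementing the
# updates above. Parameter names mirror the equations; shapes and the
# `params` dictionary are illustrative assumptions, not the paper's code.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, params):
    """One timestep: returns the new hidden state h_t and memory cell c_t."""
    p = params
    i_t = sigmoid(p["W_xi"] @ x_t + p["W_hi"] @ h_prev + p["b_i"])  # input gate
    f_t = sigmoid(p["W_xf"] @ x_t + p["W_hf"] @ h_prev + p["b_f"])  # forget gate
    o_t = sigmoid(p["W_xo"] @ x_t + p["W_ho"] @ h_prev + p["b_o"])  # output gate
    g_t = np.tanh(p["W_xc"] @ x_t + p["W_hc"] @ h_prev + p["b_c"])  # input modulation
    c_t = f_t * c_prev + i_t * g_t   # memory cell: gated old state plus gated input
    h_t = o_t * np.tanh(c_t)         # hidden state passed to the next timestep
    return h_t, c_t
```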
[Figure 2 appears here: left panel, the basic RNN unit (inputs $x_t$ and $h_{t-1}$, a single $\phi$ nonlinearity, output $z_t = h_t$); right panel, the LSTM unit with its memory cell, input gate, forget gate, output gate, and input modulation gate.]
Figure 2: A diagram of a basic RNN cell (left) and an LSTM memory cell (right) used in this paper (from [45], a slight simplification of the architecture described in [9], which was derived from the LSTM initially proposed in [12]).
In addition to a hidden unit $h_t \in \mathbb{R}^N$, the LSTM includes an input gate $i_t \in \mathbb{R}^N$, forget gate $f_t \in \mathbb{R}^N$, output gate $o_t \in \mathbb{R}^N$, input modulation gate $g_t \in \mathbb{R}^N$, and memory cell $c_t \in \mathbb{R}^N$. The memory cell unit $c_t$ is a sum of two terms: the previous memory cell unit $c_{t-1}$, modulated by $f_t$, and $g_t$, a function of the current input and previous hidden state, modulated by the input gate $i_t$.
Because $i_t$ and $f_t$ are sigmoidal, their values lie within the range $[0, 1]$, and $i_t$ and $f_t$ can be thought of as knobs that the LSTM learns to use to selectively forget its previous memory or consider its current input. Likewise, the output gate $o_t$ learns how much of the memory cell to transfer to the hidden state. These additional cells enable the LSTM to learn extremely complex and long-term temporal dynamics that the RNN is not capable of learning. Additional depth can be added to LSTMs by stacking them on top of each other, using the hidden state of the LSTM in layer $l - 1$ as the input to the LSTM in layer $l$.
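Stacking can be sketched in the same style; the snippet below reuses the hypothetical `lstm_step` function from the sketch above and simply feeds each layer's hidden state to the next layer at the same timestep. The list-based state layout is an assumption for illustration.

```python
# A sketch of stacked LSTMs for additional depth, reusing the hypothetical
# lstm_step sketch above: the hidden state of layer l-1 at timestep t is
# the input to layer l at the same timestep.
def stacked_lstm_step(x_t, h_prev_layers, c_prev_layers, params_layers):
    """States and parameters are given as lists, one entry per layer."""
    h_layers, c_layers = [], []
    layer_input = x_t
    for h_prev, c_prev, params in zip(h_prev_layers, c_prev_layers, params_layers):
        h_t, c_t = lstm_step(layer_input, h_prev, c_prev, params)
        h_layers.append(h_t)
        c_layers.append(c_t)
        layer_input = h_t  # feed this layer's hidden state into the next layer
    return h_layers, c_layers
```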
Recently, LSTMs have achieved impressive results on language tasks such as speech recognition [10] and machine translation [38, 5]. Analogous to CNNs, LSTMs are attractive because they allow end-to-end fine-tuning. For example, [10] eliminates the need for complex multi-step pipelines in speech recognition by training a deep bidirectional LSTM which maps spectrogram inputs to text. Even with no language model or pronunciation dictionary, the model produces convincing text transcriptions. [38] and [5] translate sentences from English to French with a multi-layer LSTM encoder and decoder. Sentences in the source language are mapped to a hidden state using an encoding LSTM, and then a decoding LSTM maps the hidden state to a sequence in the target language. Such an encoder-decoder scheme allows sequences of different lengths to be mapped to each other. Like [10], the sequence-to-sequence architecture for machine translation circumvents the need for language models.
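For illustration only, the encoder-decoder idea can be sketched as follows, again reusing the hypothetical `lstm_step` helper from above. The `readout` function, start token, and fixed maximum length are placeholder assumptions standing in for the learned output layer and stopping mechanism used in practice; this is not the setup of [38] or [5].

```python
# A rough sketch of an encoder-decoder scheme: an encoding LSTM consumes
# the source sequence into a hidden state, and a decoding LSTM unrolls
# from that state to emit a target sequence of possibly different length.
import numpy as np

def encode(source_seq, enc_params, n_hidden):
    h, c = np.zeros(n_hidden), np.zeros(n_hidden)
    for x_t in source_seq:                 # consume the source sentence
        h, c = lstm_step(x_t, h, c, enc_params)
    return h, c                            # fixed-size summary of the source

def decode(h, c, dec_params, readout, start_token, max_len):
    y, outputs = start_token, []
    for _ in range(max_len):               # emit target tokens one at a time
        h, c = lstm_step(y, h, c, dec_params)
        y = readout(h)                     # hypothetical map from hidden state to token
        outputs.append(y)
    return outputs
```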
The advantages of LSTMs for modeling sequential data in vision problems are twofold. First, when integrated with current vision systems, LSTM models are straightforward to fine-tune end-to-end. Second, LSTMs are not confined to fixed-length inputs or outputs, allowing simple modeling of sequential data of varying lengths, such as text or video. We next describe a unified framework to combine LSTMs with deep convolutional networks to create a model which is both spatially and temporally deep.
3. Long-term Recurrent Convolutional Network (LRCN) model
This work proposes a Long-term Recurrent Convolutional Network (LRCN) model combining a deep hierarchical visual feature extractor (such as a CNN) with a model that can learn to recognize and synthesize temporal dynamics for tasks involving sequential data (inputs or outputs), visual, linguistic, or otherwise. Figure 1 depicts the core of our approach. Our LRCN model works by passing each visual input $v_t$ (an image in isolation, or a frame from a video) through a feature transformation $\phi_V(v_t)$ parametrized by $V$ to produce a fixed-length vector representation $\phi_t \in \mathbb{R}^d$. Having computed the feature-space representation of the visual input sequence $\langle \phi_1, \phi_2, \dots, \phi_T \rangle$, the sequence model then takes over.
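The per-timestep feature extraction can be sketched as follows; `extract_visual_features` is a hypothetical placeholder for the CNN transformation $\phi_V$, and the feature dimension $d$ is an assumption, not the network or dimensionality used in this paper.

```python
# Schematic sketch of the LRCN front end: each visual input v_t is mapped
# to a fixed-length feature vector phi_t, and the resulting sequence
# <phi_1, ..., phi_T> is handed to the recurrent sequence model.
import numpy as np

def extract_visual_features(frame, d=4096):
    # Hypothetical stand-in for the CNN transformation phi_V(v_t): any
    # function producing a fixed-length d-dimensional vector works here.
    flat = np.asarray(frame, dtype=np.float64).ravel()
    phi = np.zeros(d)
    phi[:min(d, flat.size)] = flat[:d]
    return phi

def lrcn_features(frames, d=4096):
    """Map a sequence of frames <v_1, ..., v_T> to <phi_1, ..., phi_T>."""
    return [extract_visual_features(v, d) for v in frames]
```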
In its most general form, a sequence model parametrized by $W$ maps an input $x_t$ and a previous timestep hidden state $h_{t-1}$ to an output $z_t$ and updated hidden state $h_t$. Therefore, inference must be run sequentially (i.e., from top to bottom, in the Sequence Learning box of Figure 1), by computing in order: $h_1 = f_W(x_1, h_0) = f_W(x_1, 0)$, then $h_2 = f_W(x_2, h_1)$, etc., up to $h_T$. Some of our models stack multiple LSTMs atop one another as described in Section 2.
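The sequential nature of inference amounts to the loop below; `sequence_step` is a hypothetical stand-in for $f_W$ (it could, for instance, wrap the `lstm_step` sketch from Section 2), and the initial hidden state is taken to be zero.

```python
# Sequential inference through the sequence model: hidden states must be
# computed in order, h_1 = f_W(x_1, 0), then h_2 = f_W(x_2, h_1), up to h_T.
def run_sequence_model(inputs, sequence_step, h0):
    """`sequence_step(x_t, h_prev) -> (z_t, h_t)` is a stand-in for f_W."""
    h = h0                    # h_0 = 0 in the formulation above
    outputs = []
    for x_t in inputs:
        z_t, h = sequence_step(x_t, h)
        outputs.append(z_t)   # z_t later feeds the per-timestep softmax
    return outputs, h
```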
The final step in predicting a distribution $P(y_t)$ at timestep $t$ is to take a softmax over the outputs $z_t$ of the sequential model, producing a distribution over the (in our case, finite and discrete) space $C$ of possible per-timestep