In the experiments, this will be ensured by using the identity function $f_j$: $f_j(x) = x, \; \forall x$, and by setting $w_{jj} = 1.0$. We refer to this as the constant error carrousel (CEC). CEC will be LSTM's central feature (see Section 4).
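The constancy of this error flow is easy to check numerically. The following is a minimal sketch (in Python, not part of the paper) that backpropagates a scalar error signal through a self-connected linear unit with $f_j(x) = x$ and $w_{jj} = 1.0$; at every step the signal is scaled by $f_j'(net_j)\, w_{jj} = 1$, so it neither vanishes nor blows up.

```python
# Minimal sketch of the constant error carrousel (CEC): a linear unit j
# whose only connection is a self-connection with weight w_jj = 1.0.
# The error flowing back through one time step is scaled by f_j'(net_j) * w_jj.

def f_prime(net):
    return 1.0          # derivative of the identity activation f_j(x) = x

w_jj = 1.0              # fixed self-connection weight
error = 0.5             # an arbitrary error signal arriving at unit j

for _ in range(1000):   # propagate the error back through 1000 time steps
    error = f_prime(0.0) * w_jj * error

print(error)            # still 0.5: constant error flow
```

Rerunning the sketch with, say, $w_{jj} = 0.9$ or $w_{jj} = 1.1$ reproduces the vanishing or exploding error flow that motivates the CEC.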
Of course unit $j$ will not only be connected to itself but also to other units. This invokes two obvious, related problems (also inherent in all other gradient-based approaches):
1. Input weight conflict: for simplicity, let us focus on a single additional input weight $w_{ji}$. Assume that the total error can be reduced by switching on unit $j$ in response to a certain input, and keeping it active for a long time (until it helps to compute a desired output). Provided $i$ is non-zero, since the same incoming weight has to be used for both storing certain inputs and ignoring others, $w_{ji}$ will often receive conflicting weight update signals during this time (recall that $j$ is linear): these signals will attempt to make $w_{ji}$ participate in (1) storing the input (by switching on $j$) and (2) protecting the input (by preventing $j$ from being switched off by irrelevant later inputs). This conflict (illustrated by the numerical sketch following this list) makes learning difficult, and calls for a more context-sensitive mechanism for controlling "write operations" through input weights.
2. Output weight conflict: assume $j$ is switched on and currently stores some previous input. For simplicity, let us focus on a single additional outgoing weight $w_{kj}$. The same $w_{kj}$ has to be used for both retrieving $j$'s content at certain times and preventing $j$ from disturbing $k$ at other times. As long as unit $j$ is non-zero, $w_{kj}$ will attract conflicting weight update signals generated during sequence processing: these signals will attempt to make $w_{kj}$ participate in (1) accessing the information stored in $j$ and, at different times, (2) protecting unit $k$ from being perturbed by $j$. For instance, with many tasks there are certain "short time lag errors" that can be reduced in early training stages. However, at later training stages $j$ may suddenly start to cause avoidable errors in situations that already seemed under control, by attempting to participate in reducing more difficult "long time lag errors". Again, this conflict makes learning difficult, and calls for a more context-sensitive mechanism for controlling "read operations" through output weights.
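To make the input weight conflict concrete, here is a toy sketch (in Python, not from the paper; the "store" and "ignore" patterns, the squared error, and the specific numbers are illustrative assumptions). Unit $j$ is linear with $w_{jj} = 1$, so its final activation is $w_{ji}$ times the sum of the inputs it has seen; a pattern that requires storing a relevant input and a pattern that requires ignoring a later irrelevant input then push $w_{ji}$ in opposite directions.

```python
# Toy illustration of the input weight conflict for a single linear unit j
# with fixed self-connection w_jj = 1 and one incoming weight w_ji.
# Final activation: y_j(T) = w_ji * sum_t x(t); error: 0.5 * (y_j(T) - target)^2.

def grad_w_ji(inputs, target, w_ji):
    """Gradient of the squared error at the end of the sequence w.r.t. w_ji."""
    s = sum(inputs)          # a linear unit with w_jj = 1 simply accumulates its input
    y_T = w_ji * s
    return (y_T - target) * s

w_ji = 0.8
# Pattern A ("store"): a relevant input arrives at t = 1; the unit should hold 1.0.
g_store = grad_w_ji([1.0, 0.0, 0.0], target=1.0, w_ji=w_ji)
# Pattern B ("ignore"): the same relevant input arrives, followed by an
# irrelevant input that should NOT change the stored value.
g_ignore = grad_w_ji([1.0, 1.0, 0.0], target=1.0, w_ji=w_ji)

print(g_store, g_ignore)     # -0.2 and +1.2: conflicting update signals for w_ji
```

Pattern A is minimized at $w_{ji} = 1$, pattern B at $w_{ji} = 0.5$, so no single value of $w_{ji}$ satisfies both, and gradient descent keeps receiving opposing signals. The output weight conflict is the mirror image of this situation on the outgoing weight $w_{kj}$.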
Of course, input and output weight conflicts are not specific to long time lags; they occur for short time lags as well. Their effects, however, become particularly pronounced in the long time lag case: as the time lag increases, (1) stored information must be protected against perturbation for longer and longer periods, and, especially in advanced stages of learning, (2) more and more already correct outputs also require protection against perturbation.

Due to the problems above, the naive approach does not work well except in the case of certain simple problems involving local input/output representations and non-repeating input patterns (see Hochreiter 1991 and Silva et al. 1996). The next section shows how to do it right.
4 LONG SHORT-TERM MEMORY
Memory cells and gate units. To construct an architecture that allows for constant error flow through special, self-connected units without the disadvantages of the naive approach, we extend the constant error carrousel CEC embodied by the self-connected, linear unit $j$ from Section 3.2 by introducing additional features. A multiplicative input gate unit is introduced to protect the memory contents stored in $j$ from perturbation by irrelevant inputs. Likewise, a multiplicative output gate unit is introduced which protects other units from perturbation by currently irrelevant memory contents stored in $j$.
The resulting, more complex unit is called a memory cell (see Figure 1). The $j$-th memory cell is denoted $c_j$. Each memory cell is built around a central linear unit with a fixed self-connection (the CEC). In addition to $net_{c_j}$, $c_j$ gets input from a multiplicative unit $out_j$ (the "output gate"), and from another multiplicative unit $in_j$ (the "input gate"). $in_j$'s activation at time $t$ is denoted by $y^{in_j}(t)$, $out_j$'s by $y^{out_j}(t)$. We have
$$y^{out_j}(t) = f_{out_j}(net_{out_j}(t)); \qquad y^{in_j}(t) = f_{in_j}(net_{in_j}(t));$$