simplification of the LSTM provides a network that yields classification accuracies at least as good
as those of the standard LSTM and often substantially better – a result not achieved by the models
proposed in the aforementioned studies.
3 JUST ANOTHER NETWORK
Recurrent neural networks (RNNs) typically create a lossy summary $h_T$ of a sequence. It is lossy
because it maps an arbitrarily long sequence $x_{1:T}$ into a fixed-length vector. As mentioned before,
recent work has shown that this forgetting property of LSTMs is one of the most important (Greff
et al., 2015; Jozefowicz et al., 2015). Hence, we propose a simple transformation of the LSTM
that leaves it with only a forget gate, and since this is Just Another NETwork (JANET), we name it
accordingly. We start from the standard LSTM (Lipton et al., 2015), which, with symbols taking their
standard meaning, is defined as
$$
\begin{aligned}
i_t &= \sigma(U_i h_{t-1} + W_i x_t + b_i) \\
o_t &= \sigma(U_o h_{t-1} + W_o x_t + b_o) \\
f_t &= \sigma(U_f h_{t-1} + W_f x_t + b_f) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(U_c h_{t-1} + W_c x_t + b_c) \\
h_t &= o_t \odot \tanh(c_t).
\end{aligned}
\tag{1}
$$
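For concreteness, a single step of Eq. (1) can be sketched in NumPy as below. This is an illustrative sketch, not a reference implementation: the parameter container `params`, the helper `sigmoid`, and all shapes are our own assumptions, with the weight names mirroring the symbols $U_*$, $W_*$, and $b_*$ above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One step of the standard LSTM of Eq. (1).

    params holds the recurrent (U_*) and input (W_*) weight matrices and the
    biases (b_*) for the input, output, and forget gates and the candidate
    cell; products involving gates are elementwise.
    """
    U_i, W_i, b_i = params["U_i"], params["W_i"], params["b_i"]
    U_o, W_o, b_o = params["U_o"], params["W_o"], params["b_o"]
    U_f, W_f, b_f = params["U_f"], params["W_f"], params["b_f"]
    U_c, W_c, b_c = params["U_c"], params["W_c"], params["b_c"]

    i_t = sigmoid(U_i @ h_prev + W_i @ x_t + b_i)
    o_t = sigmoid(U_o @ h_prev + W_o @ x_t + b_o)
    f_t = sigmoid(U_f @ h_prev + W_f @ x_t + b_f)
    c_t = f_t * c_prev + i_t * np.tanh(U_c @ h_prev + W_c @ x_t + b_c)
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t
```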
To transform the above LSTM into the JANET architecture, the input and output gates are removed. It
seems sensible to have the accumulation and deletion of information be related; therefore we couple
the input and forget modulation as in Greff et al. (2015), which is similar to the leaky unit
implementation (Jaeger, 2002, §8.1). Furthermore, the $\tanh$ activation of $h_t$ shrinks the gradients during
backpropagation, which could exacerbate the vanishing gradient problem, and since the weights
$U_*$ can accommodate values beyond the range $[-1, 1]$, we can remove this unnecessary, potentially
problematic, $\tanh$ nonlinearity. The resulting JANET is given by
$$
\begin{aligned}
f_t &= \sigma(U_f h_{t-1} + W_f x_t + b_f) \\
c_t &= f_t \odot c_{t-1} + (1 - f_t) \odot \tanh(U_c h_{t-1} + W_c x_t + b_c) \\
h_t &= c_t.
\end{aligned}
\tag{2}
$$
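A minimal NumPy sketch of the JANET step in Eq. (2) follows; as before, the function and argument names are illustrative assumptions rather than the authors' code. Only the forget-gate and candidate-cell parameters remain.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def janet_step(x_t, h_prev, c_prev, U_f, W_f, b_f, U_c, W_c, b_c):
    """One step of the JANET update in Eq. (2): a single forget gate couples
    retention (f_t) and accumulation (1 - f_t), and h_t = c_t."""
    f_t = sigmoid(U_f @ h_prev + W_f @ x_t + b_f)
    c_tilde = np.tanh(U_c @ h_prev + W_c @ x_t + b_c)
    c_t = f_t * c_prev + (1.0 - f_t) * c_tilde
    h_t = c_t  # no output gate and no output tanh
    return h_t, c_t
```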
Intuitively, allowing slightly more information to accumulate than the amount forgotten would make
sequence analysis easier. We found this to be true empirically by subtracting a pre-specified value
$\beta$ from the input control component$^2$, as given by
$$
\begin{aligned}
s_t &= U_f h_{t-1} + W_f x_t + b_f \\
\tilde{c}_t &= \tanh(U_c h_{t-1} + W_c x_t + b_c) \\
c_t &= \sigma(s_t) \odot c_{t-1} + \big(1 - \sigma(s_t - \beta)\big) \odot \tilde{c}_t \\
h_t &= c_t.
\end{aligned}
\tag{3}
$$
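The $\beta$-shifted update of Eq. (3) changes only the gating of the candidate cell. A sketch under the same assumptions as the previous snippet (with $\beta$ broadcast as a constant, cf. footnote 2) is:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def janet_step_beta(x_t, h_prev, c_prev, U_f, W_f, b_f, U_c, W_c, b_c, beta=1.0):
    """JANET update with the beta shift of Eq. (3): the input control uses
    sigma(s_t - beta), so slightly more information is accumulated than is
    forgotten when beta > 0."""
    s_t = U_f @ h_prev + W_f @ x_t + b_f
    c_tilde = np.tanh(U_c @ h_prev + W_c @ x_t + b_c)
    c_t = sigmoid(s_t) * c_prev + (1.0 - sigmoid(s_t - beta)) * c_tilde
    h_t = c_t
    return h_t, c_t
```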
We speculate that the value for $\beta$ is dataset dependent; however, we found that setting $\beta = 1$ provides
the best results for the datasets analysed in this study, which have sequence lengths varying from 200
to 784.
If we follow the standard parameter initialization scheme for LSTMs, the JANET quickly encounters
a problem. The standard procedure is to initialize the weights $U_*$ and $W_*$ to be distributed as
$\mathcal{U}\big[-\sqrt{6}/\sqrt{n_l + n_{l+1}},\ \sqrt{6}/\sqrt{n_l + n_{l+1}}\big]$, where $n_l$ is the size of each layer $l$ (He et al., 2015b; Glorot and
Bengio, 2010), and to initialize all biases to zero except for the forget gate bias $b_f$, which is initialized
to one (Jozefowicz et al., 2015). Hence, if the values of both input and hidden layers are zero-centred
over time, $f_t$ will be centred around $\sigma(1) = 0.7311$. In this case, the memory values $c_t$ of the
JANET would not be retained for more than a couple of time steps. This problem is best exemplified
by the MNIST dataset (LeCun, 1998) processed in scanline order (Cooijmans et al., 2016); each
training example contains many consecutive zero-valued subsequences, each of length 10 to 20. In
the best case scenario – a length-10 zero-valued subsequence – the memory values at the end of the
subsequence would be centred around
$$
c_{t+10} = f_t^{10}\, c_t \le 0.7311^{10}\, c_t \le 0.04363\, c_t.
\tag{4}
$$
$^2$ $\beta$ is a constant-valued column vector of the appropriate size.
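Both the initialization range and the decay in Eq. (4) are easy to check numerically; the snippet below is a small illustrative sketch in which the layer sizes are arbitrary assumptions.

```python
import numpy as np

# With zero-centred inputs, the forget gate is centred at sigma(b_f) = sigma(1).
sigma_1 = 1.0 / (1.0 + np.exp(-1.0))   # ~0.7311
print(sigma_1 ** 10)                    # ~0.0436: the Eq. (4) decay over 10 steps

# Glorot/Xavier-style uniform range for weights between layers of size n_l and
# n_{l+1} (sizes chosen arbitrarily here, for illustration only).
n_l, n_lp1 = 128, 128
limit = np.sqrt(6.0) / np.sqrt(n_l + n_lp1)
W = np.random.uniform(-limit, limit, size=(n_lp1, n_l))
```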