Learning Recurrent Neural Networks with Hessian-Free Optimization
James Martens JMARTENS@CS.TORONTO.EDU
Ilya Sutskever ILYA@CS.UTORONTO.CA
University of Toronto, Canada
Abstract
In this work we resolve the long-outstanding
problem of how to effectively train recurrent neu-
ral networks (RNNs) on complex and difficult
sequence modeling problems which may con-
tain long-term data dependencies. Utilizing re-
cent advances in the Hessian-free optimization
approach (Martens, 2010), together with a novel
damping scheme, we successfully train RNNs on
two sets of challenging problems: first, on a col-
lection of pathological synthetic datasets which
are known to be impossible for standard op-
timization approaches (due to their extremely
long-term dependencies), and second, on three
natural and highly complex real-world sequence
datasets where we find that our method sig-
nificantly outperforms the previous state-of-the-
art method for training neural sequence mod-
els: the Long Short-term Memory approach of
Hochreiter and Schmidhuber (1997). Addition-
ally, we offer a new interpretation of the gen-
eralized Gauss-Newton matrix of Schraudolph
(2002) which is used within the HF approach of
Martens.
1. Introduction
A Recurrent Neural Network (RNN) is a neural network
that operates in time. At each timestep, it accepts an in-
put vector, updates its (possibly high-dimensional) hid-
den state via non-linear activation functions, and uses it
to make a prediction of its output. RNNs form a rich
model class because their hidden state can store informa-
tion as high-dimensional distributed representations (as op-
posed to a Hidden Markov Model, whose hidden state is es-
sentially log n-dimensional) and their nonlinear dynamics
can implement rich and powerful computations, allowing
the RNN to perform modeling and prediction tasks for se-
quences with highly complex structure.
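As a concrete reference for the discussion that follows (our own notation, and not necessarily the exact parameterization used later in the paper), a standard formulation of these dynamics is
\[
h_t = \tanh\!\left(W_{hx} x_t + W_{hh} h_{t-1} + b_h\right), \qquad \hat{y}_t = W_{yh} h_t + b_y,
\]
where $x_t$ is the input at timestep $t$, $h_t$ is the hidden state, $\hat{y}_t$ is the prediction, and the weight matrices $W_{hx}$, $W_{hh}$, $W_{yh}$ together with the biases $b_h$, $b_y$ are the learned parameters.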
Appearing in Proceedings of the 28th International Conference on Machine Learning, Bellevue, WA, USA, 2011. Copyright 2011 by the author(s)/owner(s).
Figure 1. The architecture of a recurrent neural network.
Gradient-based training of RNNs might appear straight-
forward because, unlike many rich probabilistic sequence
models (Murphy, 2002), the exact gradients can be cheaply
computed by the Backpropagation Through Time (BPTT)
algorithm (Rumelhart et al., 1986). Unfortunately, gradi-
ent descent and other 1st-order methods completely fail to
properly train RNNs on large families of seemingly sim-
ple yet pathological synthetic problems that separate a tar-
get output from its relevant input by many time steps
(Bengio et al., 1994; Hochreiter and Schmidhuber, 1997).
In fact, 1st-order approaches struggle even when the sep-
aration is only 10 timesteps (Bengio et al., 1994). An un-
fortunate consequence of these failures is that these highly
expressive and potentially very powerful time-series mod-
els are seldom used in practice.
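As a concrete illustration of this gradient computation (a minimal numpy sketch, not the authors' implementation; the tanh/linear/squared-error choices, shapes, and variable names are our own assumptions), BPTT runs the network forward once and then propagates error signals backward through the same unrolled computation:

import numpy as np

def bptt_grads(params, xs, ys):
    """Exact loss gradients for one sequence via Backpropagation Through Time."""
    W_hx, W_hh, W_yh, b_h, b_y = params
    T, h_dim = len(xs), W_hh.shape[0]

    # Forward pass: store the hidden states so the backward pass can reuse them.
    hs, preds = [np.zeros(h_dim)], []
    for t in range(T):
        hs.append(np.tanh(W_hx @ xs[t] + W_hh @ hs[-1] + b_h))
        preds.append(W_yh @ hs[-1] + b_y)

    # Backward pass: accumulate parameter gradients while carrying dL/dh back in time.
    grads = [np.zeros_like(p) for p in params]
    dW_hx, dW_hh, dW_yh, db_h, db_y = grads
    dh_next = np.zeros(h_dim)
    for t in reversed(range(T)):
        dy = preds[t] - ys[t]                 # gradient of 0.5*||pred - y||^2 w.r.t. the output
        dW_yh += np.outer(dy, hs[t + 1])
        db_y += dy
        dh = W_yh.T @ dy + dh_next            # contributions from the output and from the future
        dz = (1.0 - hs[t + 1] ** 2) * dh      # back through the tanh nonlinearity
        dW_hx += np.outer(dz, xs[t])
        dW_hh += np.outer(dz, hs[t])
        db_h += dz
        dh_next = W_hh.T @ dz                 # error signal sent one timestep further back
    return grads

The line dh_next = W_hh.T @ dz is the repeated multiplication responsible for the vanishing/exploding behaviour discussed next, and the cost of the whole computation is linear in the sequence length.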
The extreme difficulty associated with training RNNs is
likely due to the highly volatile relationship between the
parameters and the hidden states. One way that this volatil-
ity manifests itself, which has a direct impact on the per-
formance of gradient-descent, is in the so-called “vanish-
ing/exploding gradients” phenomenon (Bengio et al., 1994;
Hochreiter, 1991), where the error-signals exhibit expo-
nential decay/growth as they are back-propagated through
time. In the case of decay, this leads to the long-term error
signals being effectively lost as they are overwhelmed by
un-decayed short-term signals, and in the case of exponen-
tial growth there is the opposite problem that the short-term
error signals are overwhelmed by the long-term ones.
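To see where this exponential behaviour comes from in the formulation sketched earlier (again our notation, and only a rough bound), apply the chain rule to the hidden-to-hidden Jacobians:
\[
\frac{\partial h_T}{\partial h_t} \;=\; \prod_{k=t+1}^{T} \frac{\partial h_k}{\partial h_{k-1}} \;=\; \prod_{k=t+1}^{T} \mathrm{diag}\!\left(1 - h_k^2\right) W_{hh},
\]
with the factors ordered from $k = T$ down to $k = t+1$. By submultiplicativity of the norm, $\|\partial h_T / \partial h_t\| \le \left(\|W_{hh}\| \, \max_k \|\mathrm{diag}(1 - h_k^2)\|\right)^{T-t}$, so the error signal carried across a separation of $T - t$ timesteps shrinks exponentially when this base is below 1 and can grow exponentially when it is above 1.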
During the 90’s there was intensive research by the ma-
chine learning community into identifying the source of
difficulty in training RNNs as well as proposing meth-
ods to address it. However, none of these methods be-
came widely adopted, and an analysis by Hochreiter and
Schmidhuber (1996) showed that they were often no bet-
ter than random guessing. In an attempt to sidestep the
difficulty of training RNNs on problems exhibiting long-