it contains not only a huge range of dictionary words, but also many character
sequences that would not be included in text corpora traditionally used for
language modelling: for example, foreign words (including characters from non-Latin scripts such as Arabic and Chinese), indented XML tags used to define meta-data, website addresses, and markup used to indicate page formatting such as headings and bullet points. An extract from the Hutter Prize dataset is shown
in Figs. 3 and 4.
The first 96M bytes in the data were evenly split into sequences of 100 bytes
and used to train the network, with the remaining 4M used for validation.
The data contains a total of 205 one-byte Unicode symbols. The total number
of characters is much higher, since many characters (especially those from non-
Latin languages) are defined as multi-symbol sequences. In keeping with the
principle of modelling the smallest meaningful units in the data, the network
predicted a single byte at a time, and therefore had input and output layers of size 205.
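For concreteness, this data preparation could be sketched as follows in Python. This is an illustrative reconstruction rather than the original implementation; the file path and helper names are assumptions.

```python
import numpy as np

SEQ_LEN = 100                    # each training sequence is 100 bytes
TRAIN_BYTES = 96_000_000         # first 96M bytes used for training
VALID_BYTES = 4_000_000          # remaining 4M bytes used for validation

# Load the raw Hutter Prize data (enwik8) as a flat byte array; the path is illustrative.
data = np.fromfile("enwik8", dtype=np.uint8)
train = data[:TRAIN_BYTES]
valid = data[TRAIN_BYTES:TRAIN_BYTES + VALID_BYTES]

# The data contains 205 distinct one-byte symbols; map each observed byte value
# to a dense index so the network can use input and output layers of size 205.
symbols = np.unique(train)                     # ~205 distinct byte values
lut = np.full(256, -1, dtype=np.int64)
lut[symbols] = np.arange(len(symbols))

def to_sequences(raw):
    """Drop the ragged tail and reshape the byte stream into 100-byte rows."""
    n = (len(raw) // SEQ_LEN) * SEQ_LEN
    return lut[raw[:n]].reshape(-1, SEQ_LEN)

train_seqs = to_sequences(train)   # shape (960000, 100), in original order
valid_seqs = to_sequences(valid)
```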
Wikipedia contains long-range regularities, such as the topic of an article, which can span many thousands of words. To make it possible for the network to capture these, its internal state (that is, the output activations $h_t$ of the hidden layers, and the activations $c_t$ of the LSTM cells within the layers) was only reset
every 100 sequences. Furthermore, the order of the sequences was not shuffled
during training, as it usually is for neural networks. The network was therefore
able to access information from up to 10K characters in the past when making
predictions. The error terms were only backpropagated to the start of each 100-byte sequence, meaning that the gradient calculation was approximate. This
form of truncated backpropagation has been considered before for RNN lan-
guage modelling [23], and found to speed up training (by reducing the sequence
length and hence increasing the frequency of stochastic weight updates) without
affecting the network’s ability to learn long-range dependencies.
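A minimal sketch of this stateful, unshuffled training scheme is given below, assuming a PyTorch-style model that returns its output logits together with its LSTM state. The framework and interface are assumptions; `model`, `loss_fn` and `optimiser` are taken to be configured as in the sketch after the next paragraph. The state is carried across consecutive 100-byte sequences but detached at each boundary, so the forward pass retains long-range context while the gradients are truncated.

```python
RESET_EVERY = 100           # reset the LSTM state every 100 sequences (10K bytes)

def train_epoch(model, sequences, loss_fn, optimiser):
    """sequences: LongTensor of shape (num_sequences, 100), kept in original order."""
    state = None
    for i, seq in enumerate(sequences):
        if i % RESET_EVERY == 0:
            state = None                     # forget everything older than 10K bytes

        inputs, targets = seq[:-1], seq[1:]  # predict each byte from its predecessors
        logits, state = model(inputs.unsqueeze(0), state)

        # Detach the state so gradients are truncated at the sequence boundary,
        # while the forward activations still carry long-range context.
        state = tuple(s.detach() for s in state)

        loss = loss_fn(logits.squeeze(0), targets)
        optimiser.zero_grad()
        loss.backward()
        optimiser.step()
```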
A much larger network was used for this data than for the Penn data (reflecting the greater size and complexity of the training set), with seven hidden layers of 700 LSTM cells, giving approximately 21.3M weights. The network was trained with stochastic gradient descent, using a learning rate of 0.0001 and a momentum of 0.9. It took four training epochs to converge. The LSTM derivatives were clipped in the range [−1, 1].
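The stated architecture and optimiser settings could be configured roughly as follows. This is again a PyTorch sketch under assumptions: a plain stacked LSTM with an input embedding stands in for the original network, so the weight count and internal connectivity differ, and clipping the parameter gradients only approximates clipping the derivatives inside the LSTM cells.

```python
import torch
from torch import nn

NUM_SYMBOLS = 205     # one-byte Unicode symbols in the data
HIDDEN_SIZE = 700     # LSTM cells per hidden layer
NUM_LAYERS = 7        # hidden layers

class ByteLSTM(nn.Module):
    """Deep byte-level LSTM language model (a simplified stand-in for the paper's network)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(NUM_SYMBOLS, HIDDEN_SIZE)
        self.lstm = nn.LSTM(HIDDEN_SIZE, HIDDEN_SIZE, NUM_LAYERS, batch_first=True)
        self.out = nn.Linear(HIDDEN_SIZE, NUM_SYMBOLS)

    def forward(self, x, state=None):
        h, state = self.lstm(self.embed(x), state)
        return self.out(h), state

model = ByteLSTM()
optimiser = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

# In the training loop, between loss.backward() and optimiser.step(), the
# gradients can be clipped to [-1, 1]:
#     torch.nn.utils.clip_grad_value_(model.parameters(), 1.0)
# The paper clips the LSTM derivatives inside the cells, so this is only a
# rough approximation of that scheme.
```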
As with the Penn data, we tested the network on the validation data with
and without dynamic evaluation (where the weights are updated as the data
is predicted). As can be seen from Table 2, performance was much better with dynamic evaluation. This is probably because of the long-range coherence of Wikipedia data; for example, certain words are much more frequent in some
articles than others, and being able to adapt to this during evaluation is ad-
vantageous. It may seem surprising that the dynamic results on the validation
set were substantially better than on the training set. However, this is easily explained by two factors: firstly, the network underfit the training data, and secondly, some portions of the data are much more difficult than others (for
example, plain text is harder to predict than XML tags).
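Dynamic evaluation can be sketched as follows: each sequence is scored with the current weights before those weights are updated on that same sequence, so the reported loss never uses information from bytes that have not yet been predicted. Whether the original setup reused the training optimiser and carried the LSTM state throughout evaluation is not stated here; both are assumptions in this sketch.

```python
import math

def dynamic_evaluation(model, sequences, loss_fn, optimiser):
    """Evaluate while adapting: score each sequence with the current weights,
    then update the weights on that same sequence before moving on."""
    total_loss, total_bytes = 0.0, 0
    state = None
    for seq in sequences:
        inputs, targets = seq[:-1], seq[1:]
        logits, state = model(inputs.unsqueeze(0), state)
        state = tuple(s.detach() for s in state)

        loss = loss_fn(logits.squeeze(0), targets)
        total_loss += loss.item() * targets.numel()   # record loss BEFORE updating
        total_bytes += targets.numel()

        optimiser.zero_grad()
        loss.backward()
        optimiser.step()

    # Average negative log-likelihood, converted from nats to bits per byte.
    return total_loss / total_bytes / math.log(2)
```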
To put the results in context, the current winner of the Hutter Prize (a