One Billion Word Benchmark for Measuring Progress in
Statistical Language Modeling
Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants
Google
1600 Amphitheatre Parkway
Mountain View, CA 94043, USA
Phillipp Koehn
University of Edinburgh
10 Crichton Street, Room 4.19
Edinburgh, EH8 9AB, UK
Tony Robinson
Cantab Research Ltd
St Johns Innovation Centre
Cowley Road, Cambridge, CB4 0WS, UK
Abstract
We propose a new benchmark corpus to be used for measuring progress in statistical language modeling. With almost one billion words of training data, we hope this benchmark will be useful to quickly evaluate novel language modeling techniques, and to compare their contribution when combined with other advanced techniques. We show performance of several well-known types of language models, with the best results achieved with a recurrent neural network based language model. The baseline unpruned Kneser-Ney 5-gram model achieves perplexity 67.6. A combination of techniques leads to a 35% reduction in perplexity, or a 10% reduction in cross-entropy (bits), over that baseline.
The benchmark is available as a code.google.com project; besides the scripts needed to rebuild the training/held-out data, it also makes available log-probability values for each word in each of ten held-out data sets, for each of the baseline n-gram models.
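For reference, perplexity and cross-entropy are related by PPL = 2^H, with H measured in bits, so the two reductions quoted above are consistent:

\[
H_{\text{base}} = \log_2 67.6 \approx 6.08\ \text{bits}, \qquad
H_{\text{combined}} = \log_2(0.65 \times 67.6) \approx 5.46\ \text{bits}, \qquad
\frac{6.08 - 5.46}{6.08} \approx 10\%.
\]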
1 Introduction
Statistical language modeling has been applied to a wide range of applications and domains with great success. To name a few, automatic speech recognition, machine translation, spelling correction, touch-screen “soft” keyboards and many natural language processing applications depend on the quality of language models (LMs).
The performance of LMs is determined mostly by several factors: the amount of training data, the quality and match of the training data to the test data, and the choice of modeling technique for estimation from the data. It is widely accepted that the amount of data, and the ability of a given estimation algorithm to accommodate large amounts of training data, are very important in providing a solution that competes successfully with the entrenched n-gram LMs. At the same time, scaling up a novel algorithm to a large amount of data involves a large amount of work, and provides a significant barrier to entry for new modeling techniques. By choosing one billion words as the amount of training data, we hope to strike a balance between the relevance of the benchmark in the world of abundant data, and the ease with which any researcher can evaluate a given modeling approach.
This follows the work of Goodman (2001a), who explored performance of various language modeling techniques when applied to large data sets. One of the key contributions of our work is that the experiments presented in this paper can be reproduced by virtually anybody with an interest in LM, as we use a data set that is freely available on the web.
Another contribution is that we provide strong baseline results with the currently very popular neural network LM (Bengio et al., 2003). This should allow researchers who work on competitive techniques to quickly compare their results to the current state of the art.
The paper is organized as follows: Section 2 describes how the training data was obtained; Section 3 provides a short overview of the language modeling techniques evaluated; finally, Section 4 presents the results obtained, and Section 5 concludes the paper.