Towards End-to-End Speech Recognition
with Recurrent Neural Networks
Alex Graves GRAVES@CS.TORONTO.EDU
Google DeepMind, London, United Kingdom
Navdeep Jaitly NDJAITLY@CS.TORONTO.EDU
Department of Computer Science, University of Toronto, Canada
Abstract
This paper presents a speech recognition sys-
tem that directly transcribes audio data with text,
without requiring an intermediate phonetic repre-
sentation. The system is based on a combination
of the deep bidirectional LSTM recurrent neural
network architecture and the Connectionist Tem-
poral Classification objective function. A mod-
ification to the objective function is introduced
that trains the network to minimise the expec-
tation of an arbitrary transcription loss function.
This allows a direct optimisation of the word er-
ror rate, even in the absence of a lexicon or lan-
guage model. The system achieves a word error
rate of 27.3% on the Wall Street Journal corpus
with no prior linguistic information, 21.9% with
only a lexicon of allowed words, and 8.2% with a
trigram language model. Combining the network
with a baseline system further reduces the error
rate to 6.7%.
1. Introduction
Recent advances in algorithms and computer hardware
have made it possible to train neural networks in an end-
to-end fashion for tasks that previously required signifi-
cant human expertise. For example, convolutional neural
networks are now able to directly classify raw pixels into
high-level concepts such as object categories (Krizhevsky
et al., 2012) and messages on traffic signs (Ciresan et al.,
2011), without using hand-designed feature extraction al-
gorithms. Not only do such networks require less human
effort than traditional approaches, they generally deliver
superior performance. This is particularly true when very
large amounts of training data are available, as the benefits of holistic optimisation tend to outweigh those of prior knowledge.
Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).
While automatic speech recognition has greatly benefited
from the introduction of neural networks (Bourlard & Mor-
gan, 1993; Hinton et al., 2012), the networks are at present
only a single component in a complex pipeline. As with
traditional computer vision, the first stage of the pipeline
is input feature extraction: standard techniques include
mel-scale filterbanks (Davis & Mermelstein, 1980) (with
or without a further transform into Cepstral coefficients)
and speaker normalisation techniques such as vocal tract
length normalisation (Lee & Rose, 1998). Neural networks
are then trained to classify individual frames of acous-
tic data, and their output distributions are reformulated as
emission probabilities for a hidden Markov model (HMM).
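This reformulation is usually done via Bayes' rule: the network's per-frame state posteriors are divided by the state priors (estimated from the training alignments), yielding likelihoods scaled by a frame-independent constant. A minimal sketch of that conversion, with an illustrative function name and toy numbers not taken from the paper:

```python
import numpy as np

def posteriors_to_scaled_likelihoods(posteriors, state_priors, eps=1e-10):
    """Convert per-frame state posteriors P(state | frame) into scaled
    emission likelihoods P(frame | state) by dividing out the state
    priors (Bayes' rule, dropping the frame-independent term P(frame)).

    posteriors:   (T, S) array, one row of posteriors per frame.
    state_priors: (S,) array of state occupancy priors.
    """
    return posteriors / (state_priors + eps)

# Toy example: 2 frames, 3 HMM states.
post = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.6, 0.3]])
priors = np.array([0.5, 0.3, 0.2])
lik = posteriors_to_scaled_likelihoods(post, priors)
```

The scaled likelihoods can then be plugged into the HMM's emission model in place of the Gaussian mixtures of a classical system.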
The objective function used to train the networks is there-
fore substantially different from the true performance mea-
sure (sequence-level transcription accuracy). This is pre-
cisely the sort of inconsistency that end-to-end learning
seeks to avoid. In practice it is a source of frustration to
researchers, who find that a large gain in frame accuracy
can translate to a negligible improvement, or even deterio-
ration in transcription accuracy. An additional problem is
that the frame-level training targets must be inferred from
the alignments determined by the HMM. This leads to an
awkward iterative procedure, where network retraining is
alternated with HMM re-alignments to generate more accu-
rate targets. Full-sequence training methods such as Max-
imum Mutual Information have been used to directly train
HMM-neural network hybrids to maximise the probability
of the correct transcription (Bahl et al., 1986; Jaitly et al.,
2012). However, these techniques are only suitable for retraining a system already trained at the frame level, and they require careful tuning of a large number of hyper-parameters, typically even more than is needed for deep neural networks.
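The alternating realignment procedure described above can be sketched as follows. This is an illustrative skeleton, not the authors' code or any real toolkit's API: `RecognizerStub` and its methods are hypothetical stand-ins for a hybrid HMM-neural-network system.

```python
class RecognizerStub:
    """Hypothetical stand-in for an HMM/network hybrid; it only records
    how often each phase of the training loop runs."""

    def __init__(self):
        self.align_calls = 0
        self.fit_calls = 0

    def forced_align(self, utterance, transcript):
        # A real system would Viterbi-align the transcript to the audio
        # with the current models; here we return dummy per-frame targets.
        self.align_calls += 1
        return ["state"] * len(utterance)

    def fit(self, utterances, alignments):
        # A real system would retrain the network on the new frame targets.
        self.fit_calls += 1

def train_with_realignment(model, utterances, transcripts, n_rounds=3):
    """Alternate HMM forced alignment (to get frame-level targets) with
    network retraining, as in classical hybrid training pipelines."""
    for _ in range(n_rounds):
        alignments = [model.forced_align(u, t)
                      for u, t in zip(utterances, transcripts)]
        model.fit(utterances, alignments)
    return model
```

The end-to-end approach pursued in this paper removes exactly this loop, since no frame-level targets are needed in the first place.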
While the transcriptions used to train speech recognition
systems are lexical, the targets presented to the networks