LISTEN, ATTEND AND SPELL: A NEURAL NETWORK FOR
LARGE VOCABULARY CONVERSATIONAL SPEECH RECOGNITION
William Chan
Carnegie Mellon University
Navdeep Jaitly, Quoc Le, Oriol Vinyals
Google Brain
ABSTRACT
We present Listen, Attend and Spell (LAS), a neural speech recognizer that transcribes speech utterances directly to characters without pronunciation models, HMMs or other components of traditional speech recognizers. In LAS, the neural network architecture subsumes the acoustic, pronunciation and language models, making it not only an end-to-end trained system but an end-to-end model. In contrast to DNN-HMM, CTC and most other models, LAS makes no independence assumptions about the probability distribution of the output character sequences given the acoustic sequence. Our system has two components: a listener and a speller. The listener is a pyramidal recurrent network encoder that accepts filter bank spectra as inputs. The speller is an attention-based recurrent network decoder that emits each character conditioned on all previous characters and the entire acoustic sequence. On a Google voice search task, LAS achieves a WER of 14.1% without a dictionary or an external language model, and 10.3% with language model rescoring over the top 32 beams. In comparison, the state-of-the-art CLDNN-HMM model achieves a WER of 8.0% on the same set.
Index Terms— Recurrent neural network, neural attention, end-to-end speech recognition
1. INTRODUCTION
State-of-the-art speech recognizers of today are complicated systems comprising various components: acoustic models, language models, pronunciation models and text normalization. Each of these components makes assumptions about the underlying probability distributions it models. For example, n-gram language models and Hidden Markov Models (HMMs) make strong Markovian independence assumptions between words/symbols in a sequence. Connectionist Temporal Classification (CTC) and DNN-HMM systems assume that neural networks make independent predictions at different times and use HMMs or language models (which make their own independence assumptions) to introduce dependencies between these predictions over time [1, 2, 3]. End-to-end training of such models attempts to mitigate these problems by training the components jointly [4, 5, 6]. In these models, acoustic models are updated based on a WER proxy, while the pronunciation and language models are rarely updated [7], if at all.
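To make these assumptions concrete, the two factorizations can be contrasted side by side. The notation below is ours and is only a schematic summary of CTC (with blank-collapsing function $B$ over frame-level label sequences $\pi$) versus the LAS objective introduced in Section 2:

    % CTC: frame-level labels are conditionally independent given x;
    % the collapsing function B removes blanks and repetitions.
    P_{\mathrm{CTC}}(y \mid x) = \sum_{\pi \in B^{-1}(y)} \prod_{t=1}^{T} P(\pi_t \mid x)

    % LAS: no conditional-independence assumption; every character is
    % conditioned on the full acoustics and on all previous characters.
    P_{\mathrm{LAS}}(y \mid x) = \prod_{i} P(y_i \mid x, y_{<i})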
In this paper we introduce Listen, Attend and Spell (LAS), a neural network that learns to transcribe an audio signal to a word sequence, one character at a time, without using explicit language models, pronunciation models, HMMs, etc. LAS does not make any independence assumptions about the nature of the probability distribution of the output character sequence given the input acoustic sequence. This method is based on the sequence-to-sequence learning framework with attention [8, 9, 10, 11, 12, 13]. It consists of an encoder Recurrent Neural Network (RNN), named the listener, and a decoder RNN, named the speller. The listener is a pyramidal RNN that converts speech signals into high level features. The speller is an RNN that transduces these higher level features into output utterances by specifying a probability distribution over the next character, given all of the acoustics and the previous characters. At each step the speller uses its internal state to guide an attention mechanism [10, 11, 12] to compute a “context” vector from the high level features of the listener. It then uses this context vector and its internal state both to update its internal state and to predict the next character in the sequence. The entire model is trained jointly, from scratch, by optimizing the probability of the output sequence using a chain rule decomposition. We call this an end-to-end model because all the components of a traditional speech recognizer are integrated into its parameters and optimized together during training, unlike end-to-end training of conventional models, which attempts to adjust acoustic models to work well with the other, fixed components of a speech recognizer.
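As a rough illustration of this listener/speller interaction, the following is a minimal sketch, assuming PyTorch; the class names, layer sizes, number of layers, and the simple dot-product attention are our own illustrative choices, not the paper's exact configuration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Listener(nn.Module):
        """Pyramidal BLSTM encoder: each layer concatenates consecutive frame
        pairs, halving the time resolution (so U <= T in the paper's notation)."""
        def __init__(self, n_mels=40, hidden=256, num_layers=3):
            super().__init__()
            self.layers = nn.ModuleList()
            in_dim = n_mels
            for _ in range(num_layers):
                self.layers.append(
                    nn.LSTM(2 * in_dim, hidden, batch_first=True, bidirectional=True))
                in_dim = 2 * hidden  # bidirectional output size
        def forward(self, x):                      # x: (B, T, n_mels)
            for rnn in self.layers:
                B, T, D = x.shape
                if T % 2:
                    x = x[:, :-1]                  # drop a frame so T is even
                x = x.reshape(B, T // 2, 2 * D)    # stack adjacent frames
                x, _ = rnn(x)
            return x                               # h: (B, U, 2 * hidden)

    class Speller(nn.Module):
        """Attention-based decoder: one character distribution per step,
        conditioned on the previous characters and the whole encoding h."""
        def __init__(self, n_chars, enc_dim=512, hidden=512):
            super().__init__()
            self.hidden = hidden
            self.embed = nn.Embedding(n_chars, hidden)
            self.cell = nn.LSTMCell(hidden + enc_dim, hidden)
            self.out = nn.Linear(hidden + enc_dim, n_chars)
        def forward(self, h, prev_chars):          # teacher forcing
            B, U, E = h.shape
            s = h.new_zeros(B, self.hidden)
            c = h.new_zeros(B, self.hidden)
            context = h.new_zeros(B, E)
            logits = []
            for i in range(prev_chars.size(1)):
                inp = torch.cat([self.embed(prev_chars[:, i]), context], dim=-1)
                s, c = self.cell(inp, (s, c))
                # dot-product attention (assumes enc_dim == decoder hidden size)
                scores = torch.bmm(h, s.unsqueeze(-1)).squeeze(-1)      # (B, U)
                alpha = F.softmax(scores, dim=-1)
                context = torch.bmm(alpha.unsqueeze(1), h).squeeze(1)   # (B, E)
                logits.append(self.out(torch.cat([s, context], dim=-1)))
            return torch.stack(logits, dim=1)       # (B, S, n_chars)

For example, given a batch of filter bank features x and character ids y, h = Listener()(x) followed by Speller(n_chars)(h, y[:, :-1]) produces per-step character logits; the corresponding objective is sketched in Section 2.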
Our model was inspired by [11, 12], which showed how end-to-end recognition could be performed on the TIMIT phone recognition task. We note a recent paper from the same group that describes an application of these ideas to WSJ [14]. Our paper independently explores the challenges associated with the application of these ideas to large scale conversational speech recognition on a Google voice search task. We defer a discussion of the relationship between these and other methods to section 5.
2. MODEL
In this section, we formally describe LAS. Let $x = (x_1, \ldots, x_T)$ be the input sequence of filter bank spectra features and $y = (\langle sos\rangle, y_1, \ldots, y_S, \langle eos\rangle)$, $y_i \in \{a, \ldots, z, 0, \ldots, 9, \langle space\rangle, \langle comma\rangle, \langle period\rangle, \langle apostrophe\rangle, \langle unk\rangle\}$, be the output sequence of characters. Here $\langle sos\rangle$ and $\langle eos\rangle$ are the special start-of-sentence and end-of-sentence tokens, respectively, and $\langle unk\rangle$ marks unknown tokens such as accented characters.
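As a concrete illustration of this output alphabet, the following sketch builds the character vocabulary described above and maps a transcript to token ids; the token spellings and index order are our own convention, not specified by the paper.

    # Illustrative character vocabulary for the LAS output distribution.
    import string

    SPECIALS = ["<sos>", "<eos>", "<space>", "<comma>", "<period>",
                "<apostrophe>", "<unk>"]
    CHARS = list(string.ascii_lowercase) + list(string.digits)
    VOCAB = SPECIALS + CHARS
    CHAR_TO_ID = {c: i for i, c in enumerate(VOCAB)}

    def encode(transcript):
        """Map a transcript to <sos> ... <eos> token ids, one per character."""
        punct = {" ": "<space>", ",": "<comma>", ".": "<period>", "'": "<apostrophe>"}
        ids = [CHAR_TO_ID["<sos>"]]
        for ch in transcript.lower():
            tok = punct.get(ch, ch if ch in CHAR_TO_ID else "<unk>")
            ids.append(CHAR_TO_ID[tok])
        ids.append(CHAR_TO_ID["<eos>"])
        return ids

    # e.g. encode("call mom") -> ids for <sos> c a l l <space> m o m <eos>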
LAS models each character output $y_i$ as a distribution conditioned on the previous characters $y_{<i}$ and the input signal $x$, using the chain rule of probability:

    P(y \mid x) = \prod_i P(y_i \mid x, y_{<i})          (1)
This objective makes the model a discriminative, end-to-end
model, because it directly predicts the conditional probability of
character sequences, given the acoustic signal.
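Under this factorization, training maximizes the per-character log probabilities of the reference transcript with teacher forcing, i.e. it minimizes a per-character cross entropy. A minimal sketch, assuming the illustrative Listener and Speller modules from the introduction and PyTorch's cross entropy:

    import torch
    import torch.nn.functional as F

    def las_loss(listener, speller, x, y):
        # x: (B, T, n_mels) filter bank features; y: (B, S) character ids
        # starting with <sos> and ending with <eos>.
        h = listener(x)                    # high-level features (B, U, D), U <= T
        logits = speller(h, y[:, :-1])     # step i is conditioned on y_<i
        targets = y[:, 1:]                 # predict the next character each step
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               targets.reshape(-1))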
LAS consists of two sub-modules: the listener and the speller. The listener is an acoustic model encoder that performs an operation called Listen. The Listen operation transforms the original signal $x$ into a high level representation $h = (h_1, \ldots, h_U)$ with $U \le T$. The speller is an attention-based character decoder that performs an operation we call AttendAndSpell. The AttendAndSpell operation