EESEN: END-TO-END SPEECH RECOGNITION USING DEEP RNN MODELS AND
WFST-BASED DECODING
Yajie Miao, Mohammad Gowayyed, Florian Metze
Language Technologies Institute, School of Computer Science, Carnegie Mellon University
ABSTRACT
The performance of automatic speech recognition (ASR) has
improved tremendously due to the application of deep neu-
ral networks (DNNs). Despite this progress, building a new
ASR system remains a challenging task, requiring various
resources, multiple training stages and significant expertise.
This paper presents our Eesen framework which drastically
simplifies the existing pipeline to build state-of-the-art ASR
systems. Acoustic modeling in Eesen involves learning a
single recurrent neural network (RNN) predicting context-
independent targets (phonemes or characters). To remove the
need for pre-generated frame labels, we adopt the connection-
ist temporal classification (CTC) objective function to infer
the alignments between speech and label sequences. A dis-
tinctive feature of Eesen is a generalized decoding approach
based on weighted finite-state transducers (WFSTs), which
enables the efficient incorporation of lexicons and language
models into CTC decoding. Experiments show that com-
pared with the standard hybrid DNN systems, Eesen achieves
comparable word error rates (WERs), while at the same time
speeding up decoding significantly.
Index Terms— Recurrent neural network, connectionist
temporal classification, end-to-end ASR
1. INTRODUCTION
Automatic speech recognition (ASR) has traditionally lever-
aged the hidden Markov model/Gaussian mixture model
(HMM/GMM) paradigm for acoustic modeling. HMMs act
to normalize the temporal variability, whereas GMMs com-
pute the emission probabilities of HMM states. In recent
years, the performance of ASR has been improved dramat-
ically by the introduction of deep neural networks (DNNs)
as acoustic models [1, 2, 3]. In the hybrid HMM/DNN
approach, DNNs are used to classify speech frames into clus-
tered context-dependent (CD) states (i.e., senones). On a
variety of ASR tasks, DNN models have shown significant
gains over the GMM models. Despite these advances, build-
ing a state-of-the-art ASR system remains a complicated,
expertise-intensive task. First, acoustic modeling typically
requires various resources such as dictionaries and phonetic
questions. Under certain conditions (e.g., in low-resource lan-
guages), these resources may be unavailable, which restricts
or delays the deployment of ASR. Second, in the hybrid
approach, training of DNNs still relies on GMM models to
obtain (initial) frame-level labels. Building GMM models
normally goes through multiple stages (e.g., CI phone, CD
states, etc.), and every stage involves different feature pro-
cessing techniques (e.g., LDA, fMLLR, etc.). Third, the
development of ASR systems highly relies on ASR experts
to determine the optimal configurations of a multitude of
hyper-parameters, for instance, the number of senones and
Gaussians in the GMM models.
Previous work has made various attempts to reduce the
complexity of ASR. In [4, 5], researchers propose to flat-start
DNNs and thus get rid of GMM models. However, this
GMM-free approach still requires iterative procedures such
as generating forced alignments and decision trees. Mean-
while, another line of work [6, 7, 8, 9, 10] has focused on
end-to-end ASR, i.e., modeling the mapping between speech
and labels (words, phonemes, etc.) directly without any in-
termediate components (e.g., GMMs). In this vein, Graves
et al. [11] introduce the connectionist temporal classification
(CTC) objective function to infer speech-label alignments au-
tomatically. This CTC technique is further investigated in
[6, 7, 8, 12] on large-scale acoustic modeling tasks. Although
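To make the alignment inference concrete, the following minimal Python sketch (our own illustration, not Eesen code; all names are ours) computes the CTC log-likelihood of a label sequence via the standard forward algorithm over a blank-interleaved label sequence:

```python
import math

def ctc_forward(log_probs, labels, blank=0):
    """Return log P(labels | inputs) under CTC.

    log_probs: per-frame lists of log-probabilities over symbols.
    labels: target label sequence without blanks.
    """
    # Interleave blanks: [a, b] -> [_, a, _, b, _]
    ext = [blank]
    for l in labels:
        ext += [l, blank]
    T, S = len(log_probs), len(ext)
    NEG_INF = float("-inf")

    def logadd(a, b):  # log(exp(a) + exp(b)), numerically stable
        if a == NEG_INF:
            return b
        if b == NEG_INF:
            return a
        m = max(a, b)
        return m + math.log(math.exp(a - m) + math.exp(b - m))

    # Initialization: a path may start with a blank or the first label.
    alpha = [NEG_INF] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, T):
        new = [NEG_INF] * S
        for s in range(S):
            a = alpha[s]                     # stay on the same state
            if s > 0:
                a = logadd(a, alpha[s - 1])  # advance one state
            # Skip a blank only between two distinct non-blank labels.
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a = logadd(a, alpha[s - 2])
            new[s] = a + log_probs[t][ext[s]]
        alpha = new

    # Valid paths end in the final label or the trailing blank.
    return logadd(alpha[S - 1], alpha[S - 2] if S > 1 else NEG_INF)
```

For example, with two frames, one non-blank label and uniform per-frame probabilities of 0.5, the three alignments (1,1), (blank,1) and (1,blank) each contribute 0.25, so the total probability is 0.75.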
showing promising results, research on end-to-end ASR faces
two major obstacles. First, it is challenging to incorporate
lexicons and language models into decoding. When decod-
ing CTC-trained models, past work [6, 8, 10] has success-
fully constrained search paths with lexicons. However, effi-
ciently integrating word-level language models remains an
open question [10]. Second, the community lacks a
shared experimental platform for the purpose of benchmark-
ing. End-to-end systems described in the literature differ not
only in their model architectures but also in their decoding
methods. For example, [6] and [8] adopt two distinct ver-
sions of beam search for decoding CTC models. These setup
variations hamper rigorous comparisons not only across end-
to-end systems, but also between the end-to-end and existing
hybrid approaches.
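For reference, the simplest decoding strategy for a CTC-trained model, best-path decoding without any lexicon or language model, picks the most probable symbol at each frame and then collapses repeated symbols and removes blanks. A toy sketch of the collapse step (our own illustration, not Eesen's WFST-based decoder):

```python
def ctc_collapse(frame_ids, blank=0):
    """Collapse a frame-level CTC path into an output label sequence:
    drop repeated symbols, then drop blanks."""
    out, prev = [], None
    for i in frame_ids:
        if i != prev and i != blank:
            out.append(i)
        prev = i
    return out
```

Note that a blank between two identical labels separates them, e.g. the path (1, 1, blank, 1) collapses to (1, 1). It is exactly this many-to-one path-to-label mapping that the WFST-based decoding discussed below must represent alongside the lexicon and language model.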
In this paper, we resolve these issues by presenting and
publicly releasing our Eesen framework. Acoustic model-
ing in Eesen is viewed as a sequence-to-sequence learning
problem. We exploit deep recurrent neural networks (RNNs)
arXiv:1507.08240v1 [cs.CL] 29 Jul 2015