as those possible with deep belief network pre-training when
training a deep autoencoder (the encoder and decoder in their
architecture each had three hidden layers) with a nonlinear
conjugate gradient algorithm. Both [56] and [57] investigate
why training deep feed-forward neural networks can often
be easier with some form of pre-training or a sophisticated
optimizer of the sort used in [58].
Since the time of the early hybrid architectures, the vector
processing capabilities of modern GPUs and the advent of
more effective training algorithms for deep neural nets have
made much more powerful architectures feasible. Much previ-
ous hybrid ANN-HMM work focused on context-independent
or rudimentary context-dependent phone models and small
to mid-vocabulary tasks (with notable exceptions such as
[45]), possibly masking some of the potential advantages of
the ANN-HMM hybrid approach. Additionally, GMM-HMM
training is much easier to parallelize in a computer cluster
setting, which historically gave such systems a significant
advantage in scalability. Also, since speaker and environment
adaptation is generally easier for GMM-HMM systems, the
GMM-HMM approach has been the dominant one in the past
two decades for speech recognition. That being said, if we
consider the wider use of neural networks in acoustic modeling
beyond the hybrid approach, neural network feature extraction
is an important component of many state-of-the-art acoustic
models.
B. Introduction to the DNN-HMM approach
The primary contributions of this work are the develop-
ment of a context-dependent, pre-trained, deep neural network
HMM hybrid acoustic model (CD-DNN-HMM); a description
of our recipe for applying this sort of model to LVSR prob-
lems; and an analysis of our results which show substantial
improvements in recognition accuracy for a difficult LVSR
task over discriminatively-trained pure CD-GMM-HMM sys-
tems. Our work differs from earlier context-dependent ANN-
HMMs [41], [42] in two key respects. First, we used deeper,
more expressive neural network architectures and thus em-
ployed the unsupervised DBN pre-training algorithm to make
sure training would be effective. Second, we used posterior
probabilities of senones (tied triphone HMM states) [48] as
the output of the neural network, instead of the combination of
context-independent phone and context class used previously
in hybrid architectures. This second difference also distin-
guishes our work from earlier uses of DNN-HMM hybrids for
phone recognition [30]–[32], [59]. Note that [59], which also
appears in this issue, is the context-independent version of our
approach and builds the foundation for our work. The work in
this paper focuses on context-dependent DNN-HMMs using
posterior probabilities of senones as network outputs and can
be successfully applied to large vocabulary tasks. Training the
neural network to predict a distribution over senones causes
more bits of information to be present in the neural network
training labels. It also incorporates context-dependence into
the neural network outputs (which, since we are not using a
Tandem approach, lets us use a decoder based on triphone
HMMs), and it may have additional benefits. Our evaluation
was done on LVSR instead of phoneme recognition tasks as
was the case in [30]–[32], [59]. It represents the first large
vocabulary application of a pre-trained, deep neural network
approach. Our results show that our CD-DNN-HMM sys-
tem provides dramatic improvements over a discriminatively
trained CD-GMM-HMM baseline.
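To make the output-layer difference concrete, the following is a
minimal NumPy sketch (not the system described in this paper) of a
forward pass through a feed-forward network whose softmax layer ranges
over a senone inventory rather than over context-independent phones.
The input dimensionality, hidden-layer width, and the NUM_SENONES
constant are illustrative assumptions only.

```python
import numpy as np

# Illustrative sizes only; a real system uses the senone inventory of
# its underlying CD-GMM-HMM, often several thousand tied states.
INPUT_DIM = 429        # e.g., 39-dim features, 11-frame context (assumed)
HIDDEN_DIM = 512       # hidden layer width (assumed)
NUM_SENONES = 9000     # tied triphone HMM states (assumed)

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Randomly initialized weights stand in for pre-trained, fine-tuned ones.
weights = [rng.normal(0, 0.01, (INPUT_DIM, HIDDEN_DIM)),
           rng.normal(0, 0.01, (HIDDEN_DIM, HIDDEN_DIM)),
           rng.normal(0, 0.01, (HIDDEN_DIM, NUM_SENONES))]
biases = [np.zeros(HIDDEN_DIM), np.zeros(HIDDEN_DIM), np.zeros(NUM_SENONES)]

def senone_posteriors(features):
    """Forward pass: sigmoid hidden layers, softmax output over senones."""
    h = features
    for W, b in zip(weights[:-1], biases[:-1]):
        h = sigmoid(h @ W + b)
    return softmax(h @ weights[-1] + biases[-1])

frame = rng.normal(size=INPUT_DIM)
posteriors = senone_posteriors(frame)
print(posteriors.shape, round(posteriors.sum(), 6))  # (9000,) 1.0
```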
The remainder of this paper is organized as follows. In
section II we briefly introduce restricted Boltzmann machines
(RBMs) and deep belief nets, and outline the general pre-
training strategy we use. In section III, we describe the
basic ideas, the key properties, and the training and decoding
strategies of our CD-DNN-HMMs. In section IV we analyze
experimental results on a 65K+ vocabulary business search
dataset collected from the Bing mobile voice search applica-
tion (formerly known as Live Search for mobile [36], [60])
under real usage scenarios. Section V offers conclusions and
directions for future work.
II. DEEP BELIEF NETWORKS
Deep belief networks (DBNs) are probabilistic generative
models with multiple layers of stochastic hidden units above
a single bottom layer of observed variables that represent a
data vector. DBNs have undirected connections between the
top two layers and directed connections to all other layers from
the layer above. There is an efficient unsupervised algorithm,
first described in [24], for learning the connection weights in a
DBN that is equivalent to training each adjacent pair of layers
as a restricted Boltzmann machine (RBM). There is also a
fast, approximate, bottom-up inference algorithm to infer the
states of all hidden units conditioned on a data vector. After this
unsupervised pre-training phase, Hinton et al. [24] used the
up-down algorithm to optimize all of the DBN weights jointly.
During this fine-tuning phase, a supervised objective function
could also be optimized.
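As a rough illustration of the layer-by-layer view above, the sketch
below trains a stack of Bernoulli-Bernoulli RBMs with one-step
contrastive divergence (CD-1), feeding each layer's hidden activations
to the next RBM as its data. The unit type, learning rate, layer sizes,
and toy data are assumptions made for the example, and the up-down
fine-tuning step is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, num_hidden, epochs=5, lr=0.05, batch=64):
    """One-step contrastive divergence (CD-1) for a Bernoulli-Bernoulli RBM."""
    num_visible = data.shape[1]
    W = rng.normal(0, 0.01, (num_visible, num_hidden))
    b_vis = np.zeros(num_visible)
    b_hid = np.zeros(num_hidden)
    for _ in range(epochs):
        for i in range(0, len(data), batch):
            v0 = data[i:i + batch]
            # Positive phase: hidden probabilities given the data.
            h0 = sigmoid(v0 @ W + b_hid)
            h0_sample = (rng.random(h0.shape) < h0).astype(float)
            # Negative phase: one step of alternating Gibbs sampling.
            v1 = sigmoid(h0_sample @ W.T + b_vis)
            h1 = sigmoid(v1 @ W + b_hid)
            # Approximate gradient of the log-likelihood.
            W += lr * (v0.T @ h0 - v1.T @ h1) / len(v0)
            b_vis += lr * (v0 - v1).mean(axis=0)
            b_hid += lr * (h0 - h1).mean(axis=0)
    return W, b_hid

def pretrain_stack(data, layer_sizes):
    """Greedy layer-wise pre-training: each RBM is trained on the hidden
    activations of the RBM below it."""
    weights, inputs = [], data
    for size in layer_sizes:
        W, b_hid = train_rbm(inputs, size)
        weights.append((W, b_hid))
        inputs = sigmoid(inputs @ W + b_hid)
    return weights

# Toy binary data in place of real acoustic features.
toy_data = (rng.random((256, 100)) < 0.3).astype(float)
stack = pretrain_stack(toy_data, [64, 64, 64])
```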
In this work, we use the DBN weights resulting from the
unsupervised pre-training algorithm to initialize the weights of
a deep, but otherwise standard, feed-forward neural network
and then simply use the backpropagation algorithm [61] to
fine-tune the network weights with respect to a supervised
criterion. Pre-training followed by stochastic gradient descent
is our method of choice for training deep neural networks
because it often outperforms random initialization for the
deeper architectures we are interested in training and yields
results that are robust to the choice of initial random seed.
The generative model learned during pre-training helps prevent
overfitting, even when using models with very high capacity,
and can aid in the subsequent optimization of the recognition
weights.
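A minimal sketch of this initialize-then-fine-tune step follows,
assuming NumPy and illustrative layer sizes: stand-ins for the
pre-trained weights become the hidden layers of a feed-forward
classifier, a randomly initialized softmax layer is placed on top, and
a single stochastic gradient descent step on the frame-level cross
entropy performs one step of backpropagation fine-tuning.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Stand-ins for weights produced by unsupervised pre-training
# (in practice these come from the stacked RBMs; shapes are illustrative).
W1, b1 = rng.normal(0, 0.1, (100, 64)), np.zeros(64)
W2, b2 = rng.normal(0, 0.1, (64, 64)), np.zeros(64)
# Randomly initialized output (softmax) layer over the target classes.
num_classes = 10
W3, b3 = rng.normal(0, 0.01, (64, num_classes)), np.zeros(num_classes)

# A toy mini-batch of input frames and frame labels.
x = rng.normal(size=(32, 100))
labels = rng.integers(0, num_classes, size=32)
lr = 0.1

# Forward pass.
h1 = sigmoid(x @ W1 + b1)
h2 = sigmoid(h1 @ W2 + b2)
probs = softmax(h2 @ W3 + b3)

# Backward pass: cross-entropy gradient at the softmax layer, then
# backpropagate through the sigmoid hidden layers.
d3 = probs.copy()
d3[np.arange(len(labels)), labels] -= 1.0
d3 /= len(labels)
d2 = (d3 @ W3.T) * h2 * (1.0 - h2)
d1 = (d2 @ W2.T) * h1 * (1.0 - h1)

# Plain stochastic gradient descent update.
W3 -= lr * (h2.T @ d3)
b3 -= lr * d3.sum(axis=0)
W2 -= lr * (h1.T @ d2)
b2 -= lr * d2.sum(axis=0)
W1 -= lr * (x.T @ d1)
b1 -= lr * d1.sum(axis=0)
```

In this view the pre-training only supplies the starting point; the
supervised fine-tuning itself is an entirely standard backpropagation
procedure applied to an otherwise conventional feed-forward network.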
Although empirical results ultimately are the best reason for
the use of a technique, our motivation for even trying to find
and apply deeper models that might be capable of learning
rich, distributed representations of their input is also based on
formal and informal arguments by other researchers in the
machine learning community. As argued in [62] and [63],
insufficiently deep architectures can require an exponential
blow-up in the number of computational elements needed to
represent certain functions satisfactorily. Thus one primary
motivation for using deeper models such as neural networks