layers) autoencoder with a nonlinear conjugate gradient algo-
rithm. Both [56] and [57] investigate why training deep feed-for-
ward neural networks can often be easier with some form of
pre-training or a sophisticated optimizer of the sort used in [58].
Since the time of the early hybrid architectures, the vector
processing capabilities of modern GPUs and the advent of more
effective training algorithms for deep neural nets have made
much more powerful architectures feasible. Much previous hy-
brid ANN-HMM work focused on context-independent or rudi-
mentary context-dependent phone models and small to mid-vo-
cabulary tasks (with notable exceptions such as [45]), possibly
masking some of the potential advantages of the ANN-HMM
hybrid approach. Additionally, GMM-HMM training is much
easier to parallelize in a computer cluster setting, which his-
torically gave such systems a significant advantage in scala-
bility. Also, since speaker and environment adaptation is gener-
ally easier for GMM-HMM systems, the GMM-HMM approach
has been the dominant one in the past two decades for speech
recognition. That being said, if we consider the wider use of
neural networks in acoustic modeling beyond the hybrid ap-
proach, neural network feature extraction is an important com-
ponent of many state-of-the-art acoustic models.
B. Introduction to the DNN-HMM Approach
The primary contributions of this work are the development
of a context-dependent, pre-trained, deep neural network HMM
hybrid acoustic model (CD-DNN-HMM); a description of our
recipe for applying this sort of model to LVSR problems; and an
analysis of our results which show substantial improvements in
recognition accuracy for a difficult LVSR task over discrimina-
tively-trained pure CD-GMM-HMM systems. Our work differs
from earlier context-dependent ANN-HMMs [42], [41] in two
key respects. First, we used deeper, more expressive neural
network architectures and thus employed the unsupervised
DBN pre-training algorithm to make sure training would be
effective. Second, we used posterior probabilities of senones
(tied triphone HMM states) [48] as the output of the neural
network, instead of the combination of context-independent
phone and context class used previously in hybrid architectures.
This second difference also distinguishes our work from earlier
uses of DNN-HMM hybrids for phone recognition [30]–[32],
[59]. Note that [59], which also appears in this issue, is the
context-independent version of our approach and builds the
foundation for our work. The work in this paper focuses on
context-dependent DNN-HMMs using posterior probabilities
of senones as network outputs and can be successfully applied
to large vocabulary tasks. Training the neural network to predict
a distribution over senones causes more bits of information to
be present in the neural network training labels. It also incor-
porates context-dependence into the neural network outputs
(which, since we are not using a Tandem approach, lets us use a
decoder based on triphone HMMs), and it may have additional
benefits. Our evaluation was done on LVSR instead of phoneme
recognition tasks as was the case in [30]–[32], [59]. It repre-
sents the first large-vocabulary application of a pre-trained,
deep neural network approach. Our results show that our
CD-DNN-HMM system provides dramatic improvements over
a discriminatively trained CD-GMM-HMM baseline.
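To make the use of senone posteriors concrete, the sketch below shows how the network's outputs can be turned into the scaled likelihoods a conventional triphone HMM decoder expects, by dividing each posterior by a senone prior estimated from the frame-level training alignment. This is a minimal numpy illustration; the function and variable names are illustrative only.

import numpy as np

def log_senone_priors(alignment_labels, num_senones, floor=1e-8):
    # Estimate log p(senone) from frame-level counts in the training alignment.
    counts = np.bincount(alignment_labels, minlength=num_senones).astype(float)
    priors = counts / counts.sum()
    return np.log(np.maximum(priors, floor))

def scaled_log_likelihoods(log_posteriors, log_priors):
    # Convert log p(senone | frame), shape (T, S), into log p(frame | senone)
    # plus a constant by subtracting the senone log-priors; the constant
    # log p(frame) is shared by all senones and does not affect decoding.
    return log_posteriors - log_priors[np.newaxis, :]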
The remainder of this paper is organized as follows. In
Section II, we briefly introduce RBMs and deep belief nets, and
outline the general pre-training strategy we use. In Section III,
we describe the basic ideas, the key properties, and the training
and decoding strategies of our CD-DNN-HMMs. In Section IV,
we analyze experimental results on a 65k-vocabulary business
search dataset collected from the Bing mobile voice search
application (formerly known as Live Search for mobile [36],
[60]) under real usage scenarios. Section V offers conclusions
and directions for future work.
II. DEEP BELIEF NETWORKS
Deep belief networks (DBNs) are probabilistic generative
models with multiple layers of stochastic hidden units above
a single bottom layer of observed variables that represent a
data vector. DBNs have undirected connections between the
top two layers and directed connections to all other layers from
the layer above. There is an efficient unsupervised algorithm,
first described in [24], for learning the connection weights in a
DBN that is equivalent to training each adjacent pair of layers
as a restricted Boltzmann machine (RBM). There is also a
fast, approximate, bottom-up inference algorithm to infer the
states of all hidden units conditioned on a data vector. After
the unsupervised pre-training phase, Hinton et al. [24] used the
up-down algorithm to optimize all of the DBN weights jointly.
During this fine-tuning phase, a supervised objective function
could also be optimized.
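A minimal numpy sketch of this greedy layer-wise procedure is given below, assuming binary (Bernoulli-Bernoulli) RBMs trained with one-step contrastive divergence; for real-valued acoustic features the first layer would instead be a Gaussian-Bernoulli RBM, and the function names and hyperparameter values here are illustrative only.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, num_hidden, epochs=10, lr=0.01, batch_size=128, rng=None):
    # One-step contrastive divergence (CD-1) for a Bernoulli-Bernoulli RBM.
    # data: (N, V) matrix of visible vectors with entries in [0, 1].
    rng = rng or np.random.default_rng(0)
    num_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((num_visible, num_hidden))
    b_v = np.zeros(num_visible)   # visible biases
    b_h = np.zeros(num_hidden)    # hidden biases
    for _ in range(epochs):
        perm = rng.permutation(len(data))
        for start in range(0, len(data), batch_size):
            v0 = data[perm[start:start + batch_size]]
            # Positive phase: hidden probabilities and a binary sample.
            p_h0 = sigmoid(v0 @ W + b_h)
            h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
            # Negative phase: one step of Gibbs sampling (reconstruction).
            p_v1 = sigmoid(h0 @ W.T + b_v)
            p_h1 = sigmoid(p_v1 @ W + b_h)
            # CD-1 updates: positive minus negative sufficient statistics.
            W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / len(v0)
            b_v += lr * (v0 - p_v1).mean(axis=0)
            b_h += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_h

def pretrain_stack(data, hidden_layer_sizes):
    # Greedy layer-wise pre-training: train an RBM on the current
    # representation, then feed its hidden probabilities to the next RBM.
    weights, layer_input = [], data
    for num_hidden in hidden_layer_sizes:
        W, b_h = train_rbm(layer_input, num_hidden)
        weights.append((W, b_h))
        layer_input = sigmoid(layer_input @ W + b_h)
    return weights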
In this paper, we use the DBN weights resulting from the un-
supervised pre-training algorithm to initialize the weights of a
deep, but otherwise standard, feed-forward neural network and
then simply use the backpropagation algorithm [61] to fine-tune
the network weights with respect to a supervised criterion. Pre-
training followed by stochastic gradient descent is our method
of choice for training deep neural networks because it often
outperforms random initialization for the deeper architectures
we are interested in training and provides results very robust to
the initial random seed. The generative model learned during
pre-training helps prevent overfitting, even when using models
with very high capacity, and can aid in the subsequent optimiza-
tion of the recognition weights.
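The following illustrative numpy sketch shows this initialization and fine-tuning step: pre-trained weight/bias pairs of the kind produced by the pretrain_stack sketch above become the hidden layers of a feed-forward network, a randomly initialized softmax layer over senones is placed on top, and backpropagation with a cross-entropy criterion adjusts all of the weights. It omits practical details such as learning-rate schedules, momentum, and GPU execution.

import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def init_dnn(pretrained, num_senones, rng=None):
    # Hidden layers come from pre-training; the softmax output layer over
    # senones has no generative counterpart and is initialized randomly.
    rng = rng or np.random.default_rng(0)
    layers = [(W.copy(), b.copy()) for W, b in pretrained]
    top_dim = pretrained[-1][0].shape[1]
    layers.append((0.01 * rng.standard_normal((top_dim, num_senones)),
                   np.zeros(num_senones)))
    return layers

def forward(layers, x):
    # Sigmoid hidden layers, softmax output; returns activations of all layers.
    acts = [x]
    for W, b in layers[:-1]:
        acts.append(1.0 / (1.0 + np.exp(-(acts[-1] @ W + b))))
    W, b = layers[-1]
    acts.append(softmax(acts[-1] @ W + b))
    return acts

def sgd_step(layers, x, labels, lr=0.1):
    # One mini-batch of backpropagation with a cross-entropy criterion;
    # labels are integer senone indices aligned to the rows of x.
    acts = forward(layers, x)
    delta = acts[-1].copy()
    delta[np.arange(len(labels)), labels] -= 1.0   # gradient at the softmax input
    for i in range(len(layers) - 1, -1, -1):
        W, b = layers[i]
        grad_W = acts[i].T @ delta / len(x)
        grad_b = delta.mean(axis=0)
        if i > 0:  # propagate through the sigmoid of the layer below
            delta = (delta @ W.T) * acts[i] * (1.0 - acts[i])
        layers[i] = (W - lr * grad_W, b - lr * grad_b)
    return layers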
Although empirical results ultimately are the best reason for
the use of a technique, our motivation for even trying to find and
apply deeper models that might be capable of learning rich, dis-
tributed representations of their input is also based on formal
and informal arguments by other researchers in the machine
learning community. As argued in [62] and [63], insufficiently
deep architectures can require an exponential blow-up in the
number of computational elements needed to represent certain
functions satisfactorily. Thus, one primary motivation for using
deeper models such as neural networks with many layers is that
they have the potential to be much more representationally ef-
ficient for some problems than shallower models like GMMs.
Furthermore, GMMs as used in speech recognition typically
have a large number of Gaussians with independently parame-
terized means, which may leave those Gaussians highly
localized and the models capable only of local generalization.
In effect, such a GMM would partition the
input space into regions each modeled by a single Gaussian.
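To make this local-generalization argument concrete, the illustrative sketch below shows how a diagonal-covariance GMM implicitly partitions feature space: each frame is assigned to the component with the highest responsibility, so every region is described by a single, locally fitted Gaussian. The names and shapes are hypothetical.

import numpy as np

def gmm_hard_partition(frames, means, log_vars, log_weights):
    # frames:      (T, D) feature vectors
    # means:       (M, D) component means
    # log_vars:    (M, D) per-dimension log variances (diagonal covariances)
    # log_weights: (M,)   log mixture weights
    # Returns a length-T array of component indices, i.e., the implicit
    # partition of the input space induced by the mixture.
    diff = frames[:, None, :] - means[None, :, :]                  # (T, M, D)
    # log N(x | mu_m, diag(var_m)) up to a constant shared by all components
    log_gauss = -0.5 * ((diff ** 2) / np.exp(log_vars)[None]
                        + log_vars[None]).sum(axis=-1)
    return np.argmax(log_weights[None, :] + log_gauss, axis=1)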