Conversational Speech Transcription
Using Context-Dependent Deep Neural Networks
Frank Seide^1, Gang Li^1, and Dong Yu^2
^1 Microsoft Research Asia, Beijing, P.R.C.
^2 Microsoft Research, Redmond, USA
{fseide,ganl,dongyu}@microsoft.com
Abstract
We apply the recently proposed Context-Dependent Deep-Neural-Network HMMs, or CD-DNN-HMMs, to speech-to-text transcription. For single-pass speaker-independent recognition on the RT03S Fisher portion of the phone-call transcription benchmark (Switchboard), the word-error rate is reduced from 27.4%, obtained by discriminatively trained Gaussian-mixture HMMs, to 18.5%, a 33% relative improvement.
CD-DNN-HMMs combine classic artificial-neural-network HMMs with traditional tied-state triphones and deep-belief-network pre-training. They had previously been shown to reduce errors by 16% relative when trained on tens of hours of data using hundreds of tied states. This paper takes CD-DNN-HMMs further and applies them to transcription using over 300 hours of training data, over 9000 tied states, and up to 9 hidden layers, and demonstrates how sparseness can be exploited.
On four less well-matched transcription tasks, we observe
relative error reductions of 22–28%.
Index Terms: speech recognition, deep belief networks, deep
neural networks
1. Introduction
Since the early 1990s, artificial neural networks (ANNs) have been used to model the state emission probabilities of HMM speech recognizers [1]. While traditional Gaussian mixture model (GMM)-HMMs model context dependency through tied context-dependent states (e.g. CART-clustered crossword triphones [2]), ANN-HMMs were never used to do so directly. Instead, networks were often factorized, e.g. into a monophone and a context-dependent part [3], or hierarchically decomposed [4]. It has been commonly assumed that hundreds or thousands of triphone states were just too many to be accurately modeled or trained in a neural network. Only recently did Yu et al. discover that doing so is not only feasible but works very well [5].
Context-dependent deep-neural-network HMMs, or CD-DNN-HMMs [5, 6], apply the classical ANN-HMMs of the 1990s to traditional tied-state triphones directly, exploiting Hinton's deep-belief-network (DBN) pre-training procedure. This was shown to lead to a very promising and possibly disruptive acoustic model, as indicated by a 16% relative recognition error reduction over discriminatively trained GMM-HMMs on a business search task [5, 6], which features short query utterances, tens of hours of training data, and hundreds of tied states.
This paper takes this model a step further and serves several purposes. First, we show that the exact same CD-DNN-HMM can be effectively scaled up in terms of training-data size (from 24 hours to over 300), model complexity (from 761 tied triphone states to over 9000), depth (from 5 to 9 hidden layers), and task (from voice queries to speech-to-text transcription). This is demonstrated on a publicly available benchmark, the Switchboard phone-call transcription task (2000 NIST Hub5 and RT03S sets). We should note here that ANNs have been trained on up to 2000 hours of speech before [7], but with many fewer output units (monophones) and fewer hidden layers.
Second, we advance the CD-DNN-HMMs by introducing weight sparseness and the related learning strategy, and demonstrate that this can reduce recognition error or model size.
Third, we present the statistical view of the multi-layer perceptron (MLP) and DBN and provide empirical evidence for understanding which factors contribute most to the accuracy improvements achieved by the CD-DNN-HMMs.
2. The Context-Dependent
Deep Neural Network HMM
A deep neural network (DNN) is a conventional multi-layer perceptron (MLP, [8]) with many hidden layers, optionally initialized using the DBN pre-training algorithm. In the following, we recap the DNN from a statistical viewpoint and describe its integration with context-dependent HMMs for speech recognition. For a more detailed description, please refer to [6].
2.1. Multi-Layer Perceptron—A Statistical View
An MLP as used in this paper models the posterior probability $P_{s|o}(s|o)$ of a class $s$ given an observation vector $o$, as a stack of $(L+1)$ layers of log-linear models. The first $L$ layers, $\ell = 0 \ldots L-1$, model posterior probabilities of hidden binary vectors $h^\ell$ given input vectors $v^\ell$, while the top layer $L$ models the desired class posterior as

\[
P_{h|v}^{\ell}(h^\ell \,|\, v^\ell) = \prod_{j=1}^{N^\ell} \frac{e^{z_j^\ell(v^\ell) \cdot h_j^\ell}}{e^{z_j^\ell(v^\ell) \cdot 1} + e^{z_j^\ell(v^\ell) \cdot 0}}, \quad 0 \le \ell < L
\]
\[
P_{s|v}^{L}(s \,|\, v^L) = \frac{e^{z_s^L(v^L)}}{\sum_{s'} e^{z_{s'}^L(v^L)}} = \mathrm{softmax}_s\!\left(z^L(v^L)\right)
\]
\[
z^\ell(v^\ell) = (W^\ell)^T v^\ell + a^\ell
\]

with weight matrices $W^\ell$ and bias vectors $a^\ell$, where $h_j^\ell$ and $z_j^\ell(v^\ell)$ are the $j$-th components of $h^\ell$ and $z^\ell(v^\ell)$, respectively.
The precise modeling of $P_{s|o}(s|o)$ requires integration over all possible values of $h^\ell$ across all layers, which is infeasible. An effective practical trick is to replace the marginalization with the "mean-field approximation" [9]. Given observation $o$, we set $v^0 = o$ and choose the conditional expectation $E_{h|v}\{h^\ell \,|\, v^\ell\} = \sigma\!\left(z^\ell(v^\ell)\right)$ as input $v^{\ell+1}$ to the next layer, where $\sigma_j(z) = 1/(1 + e^{-z_j})$.
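The mean-field forward pass can be sketched end to end in NumPy: sigmoid hidden layers feeding a softmax top layer. Layer sizes and random weights here are illustrative assumptions only:

```python
import numpy as np

def sigmoid(z):
    """Mean-field expectation E{h^l | v^l} = sigma(z^l(v^l))."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Numerically stable softmax over the top-layer scores z^L(v^L)."""
    e = np.exp(z - z.max())
    return e / e.sum()

def dnn_posteriors(o, weights, biases):
    """Forward pass: v^0 = o; v^{l+1} = sigmoid((W^l)^T v^l + a^l)
    for the hidden layers; the top layer applies softmax to get P(s|o)."""
    v = o
    for W, a in zip(weights[:-1], biases[:-1]):  # hidden layers 0..L-1
        v = sigmoid(W.T @ v + a)
    W_L, a_L = weights[-1], biases[-1]           # top layer L
    return softmax(W_L.T @ v + a_L)

# Illustrative dimensions (not from the paper): 6-dim input,
# two hidden layers of 5 units, 4 output classes.
rng = np.random.default_rng(1)
dims = [6, 5, 5, 4]
weights = [0.1 * rng.standard_normal((m, n)) for m, n in zip(dims[:-1], dims[1:])]
biases = [np.zeros(n) for n in dims[1:]]
p = dnn_posteriors(rng.standard_normal(6), weights, biases)
assert np.isclose(p.sum(), 1.0) and np.all(p > 0)
```

In the CD-DNN-HMM, the output classes $s$ would be the tied triphone states; for decoding, these posteriors are converted to scaled likelihoods by dividing by the state priors, as in classical hybrid ANN-HMM systems.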
INTERSPEECH 2011, 28–31 August 2011, Florence, Italy. Copyright © 2011 ISCA.