The Deep Tensor Neural Network With Applications
to Large Vocabulary Speech Recognition
Dong Yu, Senior Member, IEEE, Li Deng, Fellow, IEEE, and Frank Seide, Member, IEEE
Abstract—The recently proposed context-dependent deep neural network hidden Markov models (CD-DNN-HMMs) have been shown to be highly promising for large vocabulary speech recognition. In this paper, we develop a more advanced type of DNN, which we call the deep tensor neural network (DTNN). The DTNN extends the conventional DNN by replacing one or more of its layers with a double-projection (DP) layer, in which each input vector is projected into two nonlinear subspaces, and a tensor layer, in which the two subspace projections interact with each other and jointly predict the next layer in the deep architecture. In addition, we describe an approach to map the tensor layers to conventional sigmoid layers so that the former can be treated and trained in a similar way to the latter. With this mapping we can consider a DTNN as a DNN augmented with DP layers, so that not only can the BP learning algorithm of DTNNs be cleanly derived, but new types of DTNNs can also be more easily developed. Evaluation on Switchboard tasks indicates that DTNNs can outperform the already high-performing DNNs, with 4–5% and 3% relative word-error reductions on the 30-hr and 309-hr training sets, respectively.
Index Terms—Automatic speech recognition, CD-DNN-HMM, large vocabulary, tensor deep neural networks.
I. INTRODUCTION
RECENTLY, the context-dependent deep neural network hidden Markov model (CD-DNN-HMM) was developed for large vocabulary speech recognition (LVSR) and has been successfully applied to a variety of large-scale tasks by a number of research groups worldwide [2]–[9]. The CD-DNN-HMM adopts and extends the earlier artificial neural network (ANN) HMM hybrid system framework [10]–[12]. In CD-DNN-HMMs, DNNs—multilayer perceptrons (MLPs) with many hidden layers—replace Gaussian mixture models (GMMs) and directly approximate the emission probabilities of the tied triphone states (also called senones). In the first set of successful experiments, CD-DNN-HMMs were shown to achieve 16% [2], [3] and 33% [4]–[6] relative recognition error reduction over strong, discriminatively trained
CD-GMM-HMMs, respectively, on a large-vocabulary voice
search (VS) task [13] and the Switchboard (SWB) phone-call
transcription task [14]. Subsequent work on Google voice
search and YouTube data [7] and on Broadcast News [8], [9]
confirmed the effectiveness of CD-DNN-HMMs for large
vocabulary speech recognition.
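To make the role of the DNN in this hybrid framework concrete, the following minimal NumPy sketch shows the standard posterior-to-likelihood conversion used by ANN/HMM hybrid systems. The function name and array shapes are illustrative assumptions, not code from the systems cited above.

```python
import numpy as np

def senone_log_likelihoods(log_posteriors, log_priors):
    """Convert DNN senone posteriors into the scaled emission
    likelihoods consumed by the HMM decoder. In the hybrid framework,
    p(x | s) is proportional to p(s | x) / p(s), so in the log domain
    the senone log-prior is simply subtracted from the log-posterior.

    log_posteriors : (T, S) array of log p(s | x_t) from the DNN
    log_priors     : (S,)  array of log p(s), typically estimated
                     from the state-level alignment of training data
    """
    return log_posteriors - log_priors  # broadcast over the T frames
```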
In this work, we extend the DNN to a novel deep tensor neural network (DTNN) in which one or more layers are double-projection (DP) and tensor layers (see Section III for the explanation). The basic idea of the DTNN is motivated by the assumption that the underlying factors affecting the observed acoustic signals of speech, such as the spoken words, the speaker identity, and noise and channel distortion, can be factorized and approximately represented as interactions between two nonlinear subspaces. This type of multi-way interaction was hypothesized and explored in neuroscience as a model for the central nervous system [15], which conceptually characterizes brain function in terms of functional geometries, via metric tensors, in the internal representation spaces of the central nervous system, in both sensorimotor and connected manifolds. In the DTNN, we represent the hidden, underlying factors by projecting the input onto two separate subspaces through a double-projection (DP) layer in the otherwise conventional DNN. We then model the interactions between these two subspaces and the output neurons through a tensor with three-way connections. We propose a novel approach to reduce the tensor layer to a conventional sigmoid layer so that the model can be better understood and the decoding and learning algorithms can be cleanly developed. Based on this reduction, we also introduce alternative types of DTNNs. We empirically compare the conventional DNN and the new DTNN on the MNIST handwritten digit recognition task and the SWB phone-call transcription task [14]. The experimental results demonstrate that the DTNN generally outperforms the conventional DNN.
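To make the DP and tensor layers concrete, the following NumPy sketch gives one plausible reading of the forward computation just described, together with the reduction of the tensor layer to a conventional sigmoid layer. Bias terms are omitted, and all names and shapes here are illustrative assumptions; the exact formulation is given in Section III.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dp_tensor_forward(v, W1, W2, U):
    """Forward pass through a DP layer followed by a tensor layer.

    v  : input vector, shape (d,)
    W1 : projection onto subspace 1, shape (p, d)
    W2 : projection onto subspace 2, shape (q, d)
    U  : third-order weight tensor, shape (p, q, k)
    """
    # Double projection: two nonlinear views of the same input.
    h1 = sigmoid(W1 @ v)  # shape (p,)
    h2 = sigmoid(W2 @ v)  # shape (q,)

    # Tensor layer: each output unit k sums over all (i, j) pairs,
    # z_k = sum_ij U[i, j, k] * h1[i] * h2[j].
    z = np.einsum('i,j,ijk->k', h1, h2, U)
    return sigmoid(z)

def dp_tensor_forward_unfolded(v, W1, W2, U):
    """Equivalent computation after unfolding the tensor layer into a
    conventional sigmoid layer: the Kronecker product of the two
    projections becomes the input to an ordinary weight matrix."""
    p, q, k = U.shape
    h1 = sigmoid(W1 @ v)
    h2 = sigmoid(W2 @ v)
    h = np.kron(h1, h2)        # h[i*q + j] = h1[i] * h2[j], shape (p*q,)
    W = U.reshape(p * q, k).T  # unfolded tensor, shape (k, p*q)
    return sigmoid(W @ h)      # identical output to dp_tensor_forward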
This paper is organized as follows. We briefly review the related work in Section II and introduce the general architecture of the DTNN in Section III, in which the detailed components of the DTNN and the forward computations are also described. Section IV is dedicated to the algorithms we developed in this work for learning the DTNN weight matrices and tensors. The experimental results on the MNIST digit recognition task and the SWB task are presented and analyzed in Section V. We conclude the paper in Section VI.
II. RELATED WORK
In recent years, an extension from matrix to tensor has been proposed to model three-way interactions and to improve the