JOINT TRAINING OF FRONT-END AND BACK-END DEEP NEURAL NETWORKS FOR
ROBUST SPEECH RECOGNITION
Tian Gao (1), Jun Du (1), Li-Rong Dai (1), Chin-Hui Lee (2)
(1) University of Science and Technology of China, Hefei, Anhui, P. R. China
(2) Georgia Institute of Technology, Atlanta, Georgia, USA
gtian09@mail.ustc.edu.cn, {jundu,lrdai}@ustc.edu.cn, chl@ece.gatech.edu
ABSTRACT
Based on the recently proposed speech pre-processing front-end with deep
neural networks (DNNs), we first investigate different feature mappings
directly from noisy speech via DNN for robust speech recognition. Next, we
propose to jointly train a single DNN for both feature mapping and acoustic
modeling. Finally, we show that the word error rate (WER) of the jointly
trained system can be significantly reduced by fusing multiple DNN
pre-processing systems, which implies that features obtained from different
domains of the DNN-enhanced speech signals are strongly complementary.
Tested on the Aurora4 noisy speech recognition task, our best system with
multi-condition training achieves an average WER of 10.3%, a relative
reduction of 16.3% over our previous DNN pre-processing only system with a
WER of 12.3%. To the best of our knowledge, this represents the best
published result on the Aurora4 task without using any adaptation techniques.
Index Terms— robust speech recognition, deep neural
network, feature mapping, joint training, system fusion
1. INTRODUCTION
With the fast development of the mobile internet, speech-enabled
applications using automatic speech recognition (ASR) are becoming
increasingly popular. However, noise robustness remains one of the critical
issues in making ASR systems widely usable in the real world. Historically,
most ASR systems use Mel-frequency cepstral coefficients (MFCCs) and their
derivatives as speech features, and a set of Gaussian mixture continuous
density HMMs (CDHMMs) for modeling basic speech units. Many techniques
[1, 2, 3] have been proposed to address this issue. One category of
techniques is the so-called data-driven approach based on stereo data
[4, 5], which is also the topic of this study.
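As a point of reference for this conventional front-end, the following is a
minimal sketch of MFCC-plus-derivatives feature extraction using the librosa
library; the 13 static coefficients and first/second derivatives are typical
choices, not parameters taken from this paper.

    # Conventional GMM-HMM era front-end: static MFCCs plus derivatives.
    # A generic sketch; settings are typical, not taken from this paper.
    import numpy as np
    import librosa

    y, sr = librosa.load(librosa.ex("trumpet"), sr=16000)  # any mono waveform
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)     # static coefficients
    delta = librosa.feature.delta(mfcc)                    # first derivatives
    delta2 = librosa.feature.delta(mfcc, order=2)          # second derivatives
    features = np.vstack([mfcc, delta, delta2])            # 39-d vector per frame
    print(features.shape)                                  # (39, n_frames)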
The recent breakthrough of deep learning [6, 7], especially the application
of deep neural networks (DNNs) in the ASR area [8, 9, 10], marks a new
milestone: DNN-HMM acoustic modeling has become the state of the art,
replacing GMM-HMM. It is believed that the first several layers of a DNN
play the role of extracting highly nonlinear and discriminative features
which are robust to irrelevant variabilities. This makes DNN-HMM inherently
noise robust to some extent, as verified on the Aurora4 task [11].
(This work was supported by the National Natural Science Foundation of
China under Grant No. 61305002.)
In [12, 13], several front-end techniques were shown to yield further
performance gains on top of the DNN-HMM system for tasks with small
vocabularies or constrained grammars. For large vocabulary tasks, however,
the conventional enhancement approach in [14], effective for GMM-HMM
systems, might even lead to system degradation for DNN-HMM with log
mel-filterbank (LMFB) features under well-matched training-testing
conditions [11].
Meanwhile, the data-driven approaches using stereo data via recurrent
neural networks (RNNs) and DNNs proposed in [15, 16] can improve speech
recognition accuracy on small vocabulary tasks. More recently, masking
techniques [17, 18, 19] were successfully applied to noisy speech
recognition. In [19], an approach combining time-frequency masking with
feature mapping via DNN was claimed to achieve the best results on the
Aurora4 task. Unfortunately, for multi-condition training using DNN-HMM
with LMFB features, this approach still resulted in worse performance,
consistent with the conclusions in [11]. In [20], we proposed a
pre-processing approach using a DNN as a regression model to enhance noisy
speech for robust speech recognition, which was shown to outperform the
masking approach [19].
In this study, we report our recent progress in further improving the ASR
performance of multi-condition training, especially when both additive
noise and convolutional distortion are involved in the test data. First,
instead of extracting acoustic features from the enhanced speech waveform,
a DNN is adopted directly as a highly nonlinear mapping function to
estimate the clean speech features from observed noisy speech, as sketched
below.
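To make this concrete, the following is a minimal sketch of such a DNN
regression front-end in PyTorch. The 24-dimensional LMFB features, 11-frame
context window, sigmoid hidden layers, and mean squared error objective on
stereo (noisy/clean) data are illustrative assumptions, not necessarily the
configuration used in this work.

    # Sketch of a DNN used as a regression model for feature mapping:
    # a context window of noisy frames in, one clean-frame estimate out.
    import torch
    import torch.nn as nn

    N_MELS, CONTEXT = 24, 11  # assumed feature dimension and context width

    class FeatureMappingDNN(nn.Module):
        def __init__(self, n_mels=N_MELS, context=CONTEXT, hidden=2048):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(n_mels * context, hidden), nn.Sigmoid(),
                nn.Linear(hidden, hidden), nn.Sigmoid(),
                nn.Linear(hidden, hidden), nn.Sigmoid(),
                nn.Linear(hidden, n_mels),  # linear output layer for regression
            )

        def forward(self, noisy_window):
            # noisy_window: (batch, n_mels * context) -> (batch, n_mels)
            return self.net(noisy_window)

    model = FeatureMappingDNN()
    mse = nn.MSELoss()  # regression objective on stereo training pairs
    opt = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

    noisy = torch.randn(32, N_MELS * CONTEXT)  # stand-in for noisy LMFB windows
    clean = torch.randn(32, N_MELS)            # stand-in for paired clean frames
    loss = mse(model(noisy), clean)
    opt.zero_grad(); loss.backward(); opt.step()

At test time, the estimated clean features, rather than the noisy
observations, would be passed on to the back-end acoustic model.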
Second, we employ a hybrid DNN architecture to jointly train DNNs for both
feature mapping and acoustic modeling. The proposed joint training allows
error back-propagation to the feature mapping layers and the input of the
hybrid DNN is