A Universal VAD Based on Jointly Trained Deep Neural Networks
Qing Wang¹, Jun Du¹, Xiao Bao¹, Zi-Rui Wang¹, Li-Rong Dai¹, Chin-Hui Lee²
¹University of Science and Technology of China, P. R. China
²Georgia Institute of Technology, USA
{xiaosong,baox,cs211}@mail.ustc.edu.cn, {jundu,lrdai}@ustc.edu.cn, chl@ece.gatech.edu
Abstract
In this paper, we propose a joint training approach to voice activity detection (VAD) to address the issue of performance degradation under unseen noise conditions. Two key techniques are integrated into this deep neural network (DNN) based VAD framework. First, a regression DNN is trained to map noisy speech features to clean speech features, similar to DNN-based speech enhancement. Second, the VAD component that discriminates speech from noise backgrounds is also a DNN, trained with a large amount of diversified noisy data synthesized from a wide range of additive noise types. By stacking the classification DNN on top of the enhancement DNN, the integrated DNN can be jointly trained to perform VAD. The feature mapping DNN serves as a noise normalization module that explicitly generates “clean” features, which are easier for the subsequent classification DNN to recognize correctly. Our experimental results demonstrate that the proposed noise-universal DNN-based VAD algorithm generalizes well to unseen noises, and that the jointly trained DNNs consistently and significantly outperform the conventional classification-based DNN for all noise types and signal-to-noise ratios tested.
Index Terms: voice activity detection, deep neural network,
feature mapping, joint training
1. Introduction
Voice activity detection (VAD) is a fundamental preprocessing module for many speech applications, such as speech coding, speech recognition, speaker recognition, and spoken language identification. In the mobile internet era, most speech-activated devices use a push-to-talk function as a manual VAD mechanism to record speech, implying that high-performance VAD remains an unsolved problem in real-world scenarios, especially in non-stationary or low signal-to-noise ratio (SNR) environments. Research on VAD can be traced back to the late 1950s [1]. Over the past several decades, many approaches have been investigated, and they can be grouped into three broad classes. The first class focused on different acoustic features or metrics, e.g., linear prediction coding (LPC) parameters [2], zero-crossing rate (ZCR) [3], periodicity measures [4], cepstral features [5], formant shape [6], higher-order statistics of the LPC residual [7], long-term spectral divergence (LTSD) [8], and fusion of multiple features [9]. The second class comprises statistical model based VAD algorithms, originating from Ephraim and Malah's work on speech enhancement [10]. In [11], a Gaussian model was adopted for VAD, with a decision-directed approach [12] used to estimate the signal parameters; it achieved better VAD performance than conventional approaches. Later, the statistical model based approaches were improved by using soft decision schemes [13, 14] or other model assumptions, e.g., replacing the Gaussian with Gamma or Laplacian distributions [15, 16]. The third class, often referred to as the supervised learning approach, directly employs classification models to discriminate speech from noise, instead of making model assumptions about the interaction between the speech and noise signals. Classifier designs such as the support vector machine (SVM) [17], the conditional random field (CRF) [18], and non-negative sparse coding [19] have been investigated.
Recently, deep learning techniques [20, 21] have become increasingly popular in many speech areas, e.g., speech recognition [22], speech enhancement [23, 24], and separation [25]. Several representative VAD studies were based on deep neural networks [26, 27, 28] and recurrent neural networks [29]. Under matched noise conditions, these deep learning approaches can indeed significantly improve VAD performance compared with other classification models, but the problem of generalization to unseen noise conditions was not explicitly discussed or addressed in previous work. Inspired by the recent success in handling unseen noises in speech enhancement [24], in this work we first propose a universal DNN-based VAD trained on a large amount of diversified noisy data synthesized with a wide range of additive noises. However, our preliminary experiments show that a classification DNN for VAD with only a two-dimensional output cannot handle the diversified noisy training data well, and its performance saturates quickly when more than two hidden layers are used. Motivated by recent work on noise-robust speech recognition [30, 31, 32], we present a novel feature mapping front-end that uses a regression DNN as a noise normalization module to estimate the clean speech features, making the VAD decision of the subsequent classification DNN easier. Furthermore, the feature mapping DNN can be jointly trained with the conventional classification DNN, namely, joint training of the front-end and back-end DNNs for VAD. Our experiments demonstrate the superiority of the jointly trained DNN for all unseen noise types and levels.
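To make the proposed stacking concrete, the following minimal PyTorch sketch builds a regression front-end and a two-output classification back-end, pre-trains the front-end with an MSE loss on stereo features, and then fine-tunes the stacked network jointly with the VAD cross-entropy loss. The layer sizes, optimizers, learning rates, and the 256-dimensional input are illustrative assumptions for exposition, not the exact configuration used in our experiments.

import torch
import torch.nn as nn

FEAT_DIM = 256  # assumed per-frame MRCG feature dimension (illustrative)

# Front-end: regression DNN mapping noisy features to "clean" features.
mapping_dnn = nn.Sequential(
    nn.Linear(FEAT_DIM, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, FEAT_DIM),
)

# Back-end: classification DNN with a two-dimensional output
# (speech vs. non-speech logits).
vad_dnn = nn.Sequential(
    nn.Linear(FEAT_DIM, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 2),
)

# Stage 1: pre-train the front-end on stereo (noisy, clean) feature pairs,
# as in DNN-based speech enhancement.
mse = nn.MSELoss()
opt_map = torch.optim.Adam(mapping_dnn.parameters(), lr=1e-4)

def pretrain_step(noisy, clean):
    opt_map.zero_grad()
    loss = mse(mapping_dnn(noisy), clean)
    loss.backward()
    opt_map.step()
    return loss.item()

# Stage 2: stack the two DNNs and fine-tune all layers jointly with the
# VAD cross-entropy loss; gradients flow through both networks.
joint_dnn = nn.Sequential(mapping_dnn, vad_dnn)
ce = nn.CrossEntropyLoss()
opt_joint = torch.optim.Adam(joint_dnn.parameters(), lr=1e-5)

def joint_step(noisy, labels):  # labels: 0 = non-speech, 1 = speech
    opt_joint.zero_grad()
    loss = ce(joint_dnn(noisy), labels)
    loss.backward()
    opt_joint.step()
    return loss.item()

The essential point is that, once stacked, the enhancement layers are no longer optimized for feature reconstruction per se, but for the final speech/non-speech decision.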
2. DNN-Based VAD System Overview
The overall flowchart of the VAD system is illustrated in Fig. 1. In the training stage, the acoustic features of both the clean speech and the synthesized noisy speech training data are first extracted. Multi-resolution cochleagram (MRCG) features are adopted, which are well verified for speech recognition [33] and VAD [28] (a rough sketch of their multi-resolution structure is given after this paragraph). Then two DNNs, namely the feature mapping DNN and the classification DNN, are trained. Note that stereo data of clean and noisy speech MRCG features are required to train the feature mapping DNN, while only the noisy speech features are needed for conventional classification DNN training. Finally, a generic DNN can be generated by jointly
training the feature mapping and classification DNNs.
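As a rough illustration of the multi-resolution structure behind MRCG features, the numpy sketch below pools a log power spectrogram into 64 bands as a stand-in for the gammatone cochleagram of the original recipe; the analysis windows (about 20 ms and 200 ms) and the 11×11 / 23×23 mean-smoothing sizes follow the commonly cited MRCG settings, but this is not the exact extractor used in our system.

import numpy as np
from scipy.ndimage import uniform_filter

def band_log_spectrogram(x, frame_len, hop, n_ch=64):
    # Crude stand-in for a 64-channel gammatone cochleagram: a log power
    # spectrogram whose FFT bins are pooled into n_ch bands.
    n_frames = 1 + (len(x) - frame_len) // hop
    win = np.hanning(frame_len)
    frames = np.stack([x[i * hop : i * hop + frame_len] * win
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    bands = np.array_split(power, n_ch, axis=1)
    return np.log(np.stack([b.sum(axis=1) for b in bands], axis=1) + 1e-10)

def mrcg_like(x, sr=16000):
    hop = sr // 100                               # 10 ms frame shift
    cg1 = band_log_spectrogram(x, sr // 50, hop)  # ~20 ms analysis window
    cg2 = band_log_spectrogram(x, sr // 5, hop)   # ~200 ms analysis window
    n = min(len(cg1), len(cg2))
    cg1, cg2 = cg1[:n], cg2[:n]
    cg3 = uniform_filter(cg1, size=11)            # 11x11 mean smoothing
    cg4 = uniform_filter(cg1, size=23)            # 23x23 mean smoothing
    return np.concatenate([cg1, cg2, cg3, cg4], axis=1)  # (n_frames, 256)

For a 16 kHz signal this yields a 256-dimensional feature per frame; a full MRCG implementation uses a gammatone filterbank and often appends delta features.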