A Universal VAD Based on Jointly Trained Deep Neural Networks
Qing Wang¹, Jun Du¹, Xiao Bao¹, Zi-Rui Wang¹, Li-Rong Dai¹, Chin-Hui Lee²
¹University of Science and Technology of China, P. R. China
²Georgia Institute of Technology, USA
{xiaosong,baox,cs211}@mail.ustc.edu.cn, {jundu,lrdai}@ustc.edu.cn, chl@ece.gatech.edu
Abstract
In this paper, we propose a joint training approach to voice activity detection (VAD) to address the issue of performance degradation under unseen noise conditions. Two key techniques are integrated into this deep neural network (DNN) based VAD framework. First, a regression DNN is trained to map noisy speech features to clean speech features, similar to DNN-based speech enhancement. Second, the VAD component that discriminates speech from noise backgrounds is also a DNN, trained with a large amount of diversified noisy data synthesized from a wide range of additive noise types. By stacking the classification DNN on top of the enhancement DNN, the integrated DNN can be jointly trained to perform VAD. The feature mapping DNN serves as a noise normalization module that explicitly generates “clean” features, which are easier for the subsequent classification DNN to recognize correctly. Our experimental results demonstrate that the proposed noise-universal DNN-based VAD algorithm generalizes well to unseen noises, and that the jointly trained DNNs consistently and significantly outperform the conventional classification-based DNN for all noise types and signal-to-noise ratios tested.
Index Terms: voice activity detection, deep neural network,
feature mapping, joint training
1. Introduction
Voice activity detection (VAD) is a fundamental preprocessing module for many speech applications, such as speech coding, speech recognition, speaker recognition, and spoken language identification. In the mobile internet era, most speech-activated devices use a push-to-talk function as a manual VAD mechanism to record speech, implying that high-performance VAD remains an unsolved problem in real-world scenarios, especially in non-stationary or low signal-to-noise ratio (SNR) environments. Research on VAD can be traced back to the late 1950s [1]. Over the past several decades, many approaches have been investigated, and they can be grouped into three broad classes. The first class focused on different acoustic features or metrics, e.g., linear prediction coding (LPC) parameters [2], zero-crossing rate (ZCR) [3], periodicity measures [4], cepstral features [5], formant shape [6], higher-order statistics of the LPC residual [7], long-term spectral divergence (LTSD) [8], and fusion of multiple features [9]. The second class comprises statistical model based VAD algorithms, originating from Ephraim and Malah's work on speech enhancement [10]. In [11], a Gaussian model was adopted for VAD, with a decision-directed approach [12] used to estimate the signal parameters; it achieved better VAD performance than conventional approaches. Later, the statistical model based approaches were improved by using soft decision schemes [13, 14] or other model assumptions, e.g., replacing the Gaussian with Gamma or Laplacian distributions [15, 16]. The third class, often referred to as the supervised learning approach, directly employs classification models to discriminate speech from noise, instead of making model assumptions about the interaction between the speech and noise signals. Classifier designs such as the support vector machine (SVM) [17], the conditional random field (CRF) [18], and non-negative sparse coding [19] have been investigated.
Recently, deep learning techniques [20, 21] have become increasingly popular in many speech areas, e.g., speech recognition [22], speech enhancement [23, 24], and separation [25]. Several representative VAD studies were based on deep neural networks [26, 27, 28] and recurrent neural networks [29]. Under matched noise conditions, these deep learning approaches can indeed significantly improve VAD performance compared with other classification models, but the problem of generalization to unseen noise conditions was not explicitly discussed or addressed in previous work. Inspired by the recent success in handling unseen noises in speech enhancement [24], in this work we first propose a universal DNN-based VAD trained on a large amount of diversified noisy data synthesized with a wide range of additive noises. However, our preliminary experiments show that a classification DNN for VAD with only a two-dimensional output cannot handle the diversified noisy training data well, and its performance saturates quickly when more than two hidden layers are used. Motivated by recent work on noise-robust speech recognition [30, 31, 32], we present a novel feature mapping front-end that uses a regression DNN as a noise normalization module to estimate the clean speech features, making the VAD decision of the subsequent classification DNN easier. Furthermore, the feature mapping DNN can be jointly trained with the conventional classification DNN, namely, joint training of the front-end and back-end DNNs for VAD. Our experiments demonstrate the superiority of the jointly trained DNN for all unseen noise types and levels.
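To make the proposed stacking concrete, the following minimal PyTorch sketch builds a regression front-end and a two-output classification back-end, pre-trains the front-end with an MSE loss on stereo features, and then fine-tunes the stacked network jointly with the VAD cross-entropy loss. The layer sizes, optimizers, learning rates, and the 256-dimensional input are illustrative assumptions for exposition, not the exact configuration used in our experiments.

import torch
import torch.nn as nn

FEAT_DIM = 256  # assumed per-frame MRCG feature dimension (illustrative)

# Front-end: regression DNN mapping noisy features to "clean" features.
mapping_dnn = nn.Sequential(
    nn.Linear(FEAT_DIM, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, FEAT_DIM),
)

# Back-end: classification DNN with a two-dimensional output
# (speech vs. non-speech logits).
vad_dnn = nn.Sequential(
    nn.Linear(FEAT_DIM, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 2),
)

# Stage 1: pre-train the front-end on stereo (noisy, clean) feature pairs,
# as in DNN-based speech enhancement.
mse = nn.MSELoss()
opt_map = torch.optim.Adam(mapping_dnn.parameters(), lr=1e-4)

def pretrain_step(noisy, clean):
    opt_map.zero_grad()
    loss = mse(mapping_dnn(noisy), clean)
    loss.backward()
    opt_map.step()
    return loss.item()

# Stage 2: stack the two DNNs and fine-tune all layers jointly with the
# VAD cross-entropy loss; gradients flow through both networks.
joint_dnn = nn.Sequential(mapping_dnn, vad_dnn)
ce = nn.CrossEntropyLoss()
opt_joint = torch.optim.Adam(joint_dnn.parameters(), lr=1e-5)

def joint_step(noisy, labels):  # labels: 0 = non-speech, 1 = speech
    opt_joint.zero_grad()
    loss = ce(joint_dnn(noisy), labels)
    loss.backward()
    opt_joint.step()
    return loss.item()

The essential point is that, once stacked, the enhancement layers are no longer optimized for feature reconstruction per se, but for the final speech/non-speech decision.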
2. DNN-Based VAD System Overview
The overall flowchart of the VAD system is illustrated in Fig. 1. In the training stage, the acoustic features of both the clean speech and the synthesized noisy speech training data are first extracted. Multi-resolution cochleagram (MRCG) features are adopted, which are well verified for speech recognition [33] and VAD [28] (a rough sketch of their multi-resolution structure is given after this paragraph). Then two DNNs, namely the feature mapping DNN and the classification DNN, are trained. Note that stereo data of clean and noisy speech MRCG features are required to train the feature mapping DNN, while only the noisy speech features are needed for conventional classification DNN training. Finally, a generic DNN can be generated by jointly
training the feature mapping and classification DNNs.
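As a rough illustration of the multi-resolution structure behind MRCG features, the numpy sketch below pools a log power spectrogram into 64 bands as a stand-in for the gammatone cochleagram of the original recipe; the analysis windows (about 20 ms and 200 ms) and the 11×11 / 23×23 mean-smoothing sizes follow the commonly cited MRCG settings, but this is not the exact extractor used in our system.

import numpy as np
from scipy.ndimage import uniform_filter

def band_log_spectrogram(x, frame_len, hop, n_ch=64):
    # Crude stand-in for a 64-channel gammatone cochleagram: a log power
    # spectrogram whose FFT bins are pooled into n_ch bands.
    n_frames = 1 + (len(x) - frame_len) // hop
    win = np.hanning(frame_len)
    frames = np.stack([x[i * hop : i * hop + frame_len] * win
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    bands = np.array_split(power, n_ch, axis=1)
    return np.log(np.stack([b.sum(axis=1) for b in bands], axis=1) + 1e-10)

def mrcg_like(x, sr=16000):
    hop = sr // 100                               # 10 ms frame shift
    cg1 = band_log_spectrogram(x, sr // 50, hop)  # ~20 ms analysis window
    cg2 = band_log_spectrogram(x, sr // 5, hop)   # ~200 ms analysis window
    n = min(len(cg1), len(cg2))
    cg1, cg2 = cg1[:n], cg2[:n]
    cg3 = uniform_filter(cg1, size=11)            # 11x11 mean smoothing
    cg4 = uniform_filter(cg1, size=23)            # 23x23 mean smoothing
    return np.concatenate([cg1, cg2, cg3, cg4], axis=1)  # (n_frames, 256)

For a 16 kHz signal this yields a 256-dimensional feature per frame; a full MRCG implementation uses a gammatone filterbank and often appends delta features.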