Speech Enhancement In Multiple-Noise Conditions using Deep Neural
Networks
Anurag Kumar¹, Dinei Florencio²
¹Carnegie Mellon University, Pittsburgh, PA, USA - 15217
²Microsoft Research, Redmond, WA, USA - 98052
alnu@andrew.cmu.edu, dinei@microsoft.com
Abstract
In this paper we consider the problem of speech enhancement
in real-world conditions where multiple noises can simultaneously
corrupt speech. Most of the current literature on speech
enhancement focuses primarily on the presence of a single noise in
corrupted speech, which is far from real-world environments.
Specifically, we deal with improving speech quality in an office
environment, where multiple stationary as well as non-stationary
noises can be simultaneously present in speech. We propose
several strategies based on Deep Neural Networks (DNN) for
speech enhancement in these scenarios. We also investigate a
DNN training strategy based on psychoacoustic models from
speech coding for the enhancement of noisy speech.
Index Terms: Deep Neural Network, Speech Enhancement,
Multiple Noise Types, Psychoacoustic Models
1. Introduction
Speech Enhancement (SE) is an important research problem in
audio signal processing. The goal is to improve the quality and
intelligibility of speech signals corrupted by noise. Due to its
applications in several areas such as automatic speech recognition,
mobile communication, and hearing aids, it has been an
actively researched topic, and several methods have been proposed
over the past several decades [1] [2].
The simplest method, removing additive noise by subtracting
an estimate of the noise spectrum from the noisy speech spectrum,
was proposed back in 1979 by Boll [3]. The Wiener filtering [4]
based approach was proposed in the same year. The MMSE
estimator [5], which performs non-linear estimation of the short-time
spectral amplitude (STSA) of the speech signal, is another
important work. A superior version of MMSE estimation, referred to
as Log-MMSE, tries to minimize the mean square error in the
log-spectral domain [6]. Other popular classical methods include
signal-subspace based methods [7] [8].
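The spectral subtraction idea in [3] can be illustrated with a minimal sketch (not the exact formulation of any cited paper): the noise magnitude spectrum is estimated from frames assumed to be noise-only and subtracted from each noisy frame, with a spectral floor to avoid negative magnitudes.

```python
# Minimal magnitude spectral-subtraction sketch, assuming the first
# few frames of the utterance contain noise only.
import numpy as np

def spectral_subtraction(noisy_mag, noise_frames=5, floor=0.01):
    """noisy_mag: (frames, bins) magnitude spectrogram of noisy speech."""
    # Estimate the noise spectrum by averaging the leading noise-only frames.
    noise_est = noisy_mag[:noise_frames].mean(axis=0)
    # Subtract the estimate from every frame; clip at a small fraction of
    # the noisy magnitude to avoid negative values ("musical noise" floor).
    enhanced = noisy_mag - noise_est
    return np.maximum(enhanced, floor * noisy_mag)
```

The fixed noise estimate is what makes this method struggle with non-stationary noise, a limitation relevant to the multiple-noise conditions studied here.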
In recent years, deep neural network (DNN) based learning
architectures have been found to be very successful in related
areas such as speech recognition [9–12]. The success of deep
neural networks (DNNs) in automatic speech recognition led
to their investigation for noise suppression for ASR [13] and for
speech enhancement [14] [15] [16] as well.
The central theme in using DNNs for speech enhancement is
that the corruption of speech by noise is a complex process, and a
complex non-linear model like a DNN is well suited for modeling
it [17] [18].
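A common instantiation of this theme, sketched below under our own assumptions rather than taken from any cited work, is a feed-forward network trained by gradient descent to regress clean log-magnitude spectra from noisy ones.

```python
# Minimal sketch of regression-based DNN enhancement: a one-hidden-layer
# network maps noisy spectral features to clean ones, trained with
# mean-squared-error gradient steps. Sizes and names are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(dim_in, hidden, dim_out):
    return {
        "W1": rng.standard_normal((dim_in, hidden)) * 0.01, "b1": np.zeros(hidden),
        "W2": rng.standard_normal((hidden, dim_out)) * 0.01, "b2": np.zeros(dim_out),
    }

def forward(p, x):
    h = np.maximum(0.0, x @ p["W1"] + p["b1"])   # ReLU hidden layer
    return h, h @ p["W2"] + p["b2"]              # linear output (regression)

def sgd_step(p, x, y, lr=1e-3):
    """One gradient step on the MSE between predicted and clean spectra."""
    h, y_hat = forward(p, x)
    err = (y_hat - y) / len(x)
    dh = (err @ p["W2"].T) * (h > 0)             # backprop through ReLU
    p["W2"] -= lr * h.T @ err
    p["b2"] -= lr * err.sum(axis=0)
    p["W1"] -= lr * x.T @ dh
    p["b1"] -= lr * dh.sum(axis=0)
    return float(np.mean((y_hat - y) ** 2))
```

In practice the input would be a context window of several noisy frames, giving the network temporal evidence about the noise; this is the setting the DNN-based works cited above operate in.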
Although there are very few exhaustive works on the utility of
DNNs for speech enhancement, they have shown promising results
and can outperform classical SE methods. A common aspect
of several of these works [14] [18] [16] [19] [15] is evaluation
on matched or seen noise conditions. Matched or seen conditions
imply that the test noise types (e.g., crowd noise) are the same
as those used in training. Unlike classical methods, which are motivated
by signal processing considerations, DNN based methods are data driven,
and matched noise conditions might not be ideal for
evaluating DNNs for speech enhancement. In fact, in several
cases the noise data set used to create the noisy test utterances
is the same as the one used in training. This results in high
similarity between the training and test noises, where it
is not hard to expect that a DNN would outperform other methods.
Thus, a more thorough analysis, even in matched conditions,
needs to be done by using variations of the selected noise types
which have not been used during training.
Unseen or mismatched noise conditions refer to situations
where the model (e.g., a DNN) has not seen the test noise types
during training. For unseen noise conditions and enhancement
using DNNs, [17] is a notable work. It trains the network
on a large variety of noise types and shows that significant
improvements can be achieved in mismatched noise conditions by
exposing the network to a large number of noise types. In [17] the
noise data set used to create the noisy test utterances is disjoint
from that used during training, although some of the test
noise types, such as Car and Exhibition, would be similar to a few
training noise types such as Traffic and Car Noise, and Crowd Noise.
Some post-processing strategies were also used in this work
to obtain further improvements. Although unseen noise conditions
present a relatively difficult scenario compared to the
seen one, they are still far from real-world applications of speech
enhancement. In the real world we expect the model not only to
perform equally well on a large variety of noise types (seen or
unseen) but also on non-stationary noises. More importantly,
speech signals are usually corrupted by multiple noises of different
types in real-world situations, and hence the removal of a single
noise signal, as done in all of the previous works, is restrictive.
In the environments around us, multiple noises occur simultaneously
with speech. These multiple-noise conditions are
clearly much harder and more complex to remove or suppress. To
analyze and study speech enhancement in these complex situations,
we propose to move to an environment-specific paradigm.
In this paper we focus on office-environment noises and propose
different methods based on DNNs for speech enhancement
in the office environment. We collect a large number of office-environment
noises, and in any given utterance several of these
noises can be simultaneously present along with speech (details
of the dataset are in later sections). We also show that the noise-aware
training [20] proposed for noise-robust speech recognition is
helpful for speech enhancement as well in these complex noise
conditions. We specifically propose to use running noise estimate
cues, instead of the stationary noise cues used in [20]. We
also propose and evaluate strategies combining DNN and psychoacoustic models.
Copyright © 2016 ISCA
INTERSPEECH 2016
September 8–12, 2016, San Francisco, USA
http://dx.doi.org/10.21437/Interspeech.2016-883738
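The running noise estimate cues mentioned above can be illustrated with a small sketch (function and parameter names are illustrative assumptions, not the paper's method): instead of appending one stationary per-utterance noise estimate to the DNN input as in [20], the noise spectrum is updated recursively over frames judged to be non-speech, and the current estimate is concatenated to each frame's features.

```python
# Illustrative sketch of noise-aware input features with a running
# noise estimate; the crude energy-based VAD and the smoothing
# constant are assumptions for the sake of the example.
import numpy as np

def noise_aware_features(noisy_mag, alpha=0.9, vad_threshold=1.5):
    """noisy_mag: (frames, bins). Returns (frames, 2*bins) features."""
    noise_est = noisy_mag[0].copy()  # initialise from the first frame
    feats = np.empty((len(noisy_mag), 2 * noisy_mag.shape[1]))
    for t, frame in enumerate(noisy_mag):
        # Update the running estimate only when the frame energy is
        # close to the current noise energy (likely non-speech).
        if frame.sum() < vad_threshold * noise_est.sum():
            noise_est = alpha * noise_est + (1 - alpha) * frame
        # Concatenate the noisy frame with the current noise estimate.
        feats[t] = np.concatenate([frame, noise_est])
    return feats
```

Because the estimate tracks the noise over time rather than being fixed per utterance, such cues can remain informative when several non-stationary noises come and go within one recording.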