Speech Enhancement In Multiple-Noise Conditions using Deep Neural
Networks
Anurag Kumar¹, Dinei Florencio²
¹Carnegie Mellon University, Pittsburgh, PA, USA - 15217
²Microsoft Research, Redmond, WA, USA - 98052
alnu@andrew.cmu.edu, dinei@microsoft.com
Abstract
In this paper we consider the problem of speech enhancement
in real-world conditions where multiple noises can simultaneously
corrupt speech. Most of the current literature on speech
enhancement focuses primarily on the presence of a single noise in
corrupted speech, which is far from real-world environments.
Specifically, we deal with improving speech quality in an office
environment, where multiple stationary as well as non-stationary
noises can be simultaneously present in speech. We propose
several strategies based on Deep Neural Networks (DNN) for
speech enhancement in these scenarios. We also investigate a
DNN training strategy based on psychoacoustic models from
speech coding for the enhancement of noisy speech.
Index Terms: Deep Neural Network, Speech Enhancement,
Multiple Noise Types, Psychoacoustic Models
1. Introduction
Speech Enhancement (SE) is an important research problem in
audio signal processing. The goal is to improve the quality and
intelligibility of speech signals corrupted by noise. Due to its
applications in several areas such as automatic speech recognition,
mobile communication, and hearing aids, it has been an
actively researched topic, and several methods have been proposed
over the past several decades [1] [2].
The simplest method, removing additive noise by subtracting
an estimate of the noise spectrum from the noisy speech spectrum,
was proposed back in 1979 by Boll [3]. The Wiener filtering [4]
based approach was proposed in the same year. The MMSE
estimator [5], which performs non-linear estimation of the short-time
spectral amplitude (STSA) of the speech signal, is another
important work. A superior version of MMSE estimation, referred to
as Log-MMSE, tries to minimize the mean square error in the
log-spectral domain [6]. Other popular classical methods include
signal-subspace based methods [7] [8].
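The spectral subtraction idea in [3] can be illustrated with a minimal sketch (not the exact formulation of any cited paper): the noise magnitude spectrum is estimated from frames assumed to be noise-only and subtracted from each noisy frame, with a spectral floor to avoid negative magnitudes.

```python
# Minimal magnitude spectral-subtraction sketch, assuming the first
# few frames of the utterance contain noise only.
import numpy as np

def spectral_subtraction(noisy_mag, noise_frames=5, floor=0.01):
    """noisy_mag: (frames, bins) magnitude spectrogram of noisy speech."""
    # Estimate the noise spectrum by averaging the leading noise-only frames.
    noise_est = noisy_mag[:noise_frames].mean(axis=0)
    # Subtract the estimate from every frame; clip at a small fraction of
    # the noisy magnitude to avoid negative values ("musical noise" floor).
    enhanced = noisy_mag - noise_est
    return np.maximum(enhanced, floor * noisy_mag)
```

The fixed noise estimate is what makes this method struggle with non-stationary noise, a limitation relevant to the multiple-noise conditions studied here.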
In recent years, deep neural network (DNN) based learning
architectures have been found to be very successful in related
areas such as speech recognition [9–12]. The success of deep
neural networks (DNNs) in automatic speech recognition led
to their investigation for noise suppression for ASR [13] and for
speech enhancement [14] [15] [16] as well.
The central theme in using DNNs for speech enhancement is
that the corruption of speech by noise is a complex process, and a
complex non-linear model like a DNN is well suited for modeling
it [17] [18].
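A common instantiation of this theme, sketched below under our own assumptions rather than taken from any cited work, is a feed-forward network trained by gradient descent to regress clean log-magnitude spectra from noisy ones.

```python
# Minimal sketch of regression-based DNN enhancement: a one-hidden-layer
# network maps noisy spectral features to clean ones, trained with
# mean-squared-error gradient steps. Sizes and names are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(dim_in, hidden, dim_out):
    return {
        "W1": rng.standard_normal((dim_in, hidden)) * 0.01, "b1": np.zeros(hidden),
        "W2": rng.standard_normal((hidden, dim_out)) * 0.01, "b2": np.zeros(dim_out),
    }

def forward(p, x):
    h = np.maximum(0.0, x @ p["W1"] + p["b1"])   # ReLU hidden layer
    return h, h @ p["W2"] + p["b2"]              # linear output (regression)

def sgd_step(p, x, y, lr=1e-3):
    """One gradient step on the MSE between predicted and clean spectra."""
    h, y_hat = forward(p, x)
    err = (y_hat - y) / len(x)
    dh = (err @ p["W2"].T) * (h > 0)             # backprop through ReLU
    p["W2"] -= lr * h.T @ err
    p["b2"] -= lr * err.sum(axis=0)
    p["W1"] -= lr * x.T @ dh
    p["b1"] -= lr * dh.sum(axis=0)
    return float(np.mean((y_hat - y) ** 2))
```

In practice the input would be a context window of several noisy frames, giving the network temporal evidence about the noise; this is the setting the DNN-based works cited above operate in.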
Although there are very few exhaustive works on the utility of
DNNs for speech enhancement, they have shown promising results
and can outperform classical SE methods. A common aspect
of several of these works [14] [18] [16] [19] [15] is evaluation
on matched or seen noise conditions. Matched or seen conditions
imply that the test noise types (e.g., crowd noise) are the same
as those used in training. Unlike classical methods, which are motivated
by signal processing considerations, DNN based methods are data driven,
and matched noise conditions might not be ideal for
evaluating DNNs for speech enhancement. In fact, in several
cases the noise data set used to create the noisy test utterances
is the same as the one used in training. This results in high
similarity between the training and test noises, where it
is not hard to expect that a DNN would outperform other methods.
Thus, a more thorough analysis, even in matched conditions,
needs to be done by using variations of the selected noise types
which have not been used during training.
Unseen or mismatched noise conditions refer to situations
where the model (e.g., a DNN) has not seen the test noise types
during training. For unseen noise conditions and enhancement
using DNNs, [17] is a notable work. It trains the network
on a large variety of noise types and shows that significant
improvements can be achieved in mismatched noise conditions by
exposing the network to a large number of noise types. In [17] the
noise data set used to create the noisy test utterances is disjoint
from that used during training, although some of the test
noise types, such as Car and Exhibition, would be similar to a few
training noise types such as Traffic and Car Noise, and Crowd Noise.
Some post-processing strategies were also used in this work
to obtain further improvements. Although unseen noise conditions
present a relatively difficult scenario compared to the
seen one, they are still far from real-world applications of speech
enhancement. In the real world we expect the model not only to
perform equally well on a large variety of noise types (seen or
unseen) but also on non-stationary noises. More importantly,
speech signals are usually corrupted by multiple noises of different
types in real-world situations, and hence the removal of a single
noise signal, as done in all of the previous works, is restrictive.
In the environments around us, multiple noises occur simultaneously
with speech. These multiple-noise conditions are
clearly much harder and more complex to remove or suppress. To
analyze and study speech enhancement in these complex situations,
we propose to move to an environment-specific paradigm.
In this paper we focus on office-environment noises and propose
different methods based on DNNs for speech enhancement
in the office environment. We collect a large number of office-environment
noises, and in any given utterance several of these
noises can be simultaneously present along with speech (details
of the dataset are in later sections). We also show that the noise-aware
training [20] proposed for noise-robust speech recognition is
helpful for speech enhancement as well in these complex noise
conditions. We specifically propose to use running noise estimate
cues, instead of the stationary noise cues used in [20]. We
also propose and evaluate strategies combining DNN and psychoacoustic models.
Copyright © 2016 ISCA
INTERSPEECH 2016
September 8–12, 2016, San Francisco, USA
http://dx.doi.org/10.21437/Interspeech.2016-883738
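The running noise estimate cues mentioned above can be illustrated with a small sketch (function and parameter names are illustrative assumptions, not the paper's method): instead of appending one stationary per-utterance noise estimate to the DNN input as in [20], the noise spectrum is updated recursively over frames judged to be non-speech, and the current estimate is concatenated to each frame's features.

```python
# Illustrative sketch of noise-aware input features with a running
# noise estimate; the crude energy-based VAD and the smoothing
# constant are assumptions for the sake of the example.
import numpy as np

def noise_aware_features(noisy_mag, alpha=0.9, vad_threshold=1.5):
    """noisy_mag: (frames, bins). Returns (frames, 2*bins) features."""
    noise_est = noisy_mag[0].copy()  # initialise from the first frame
    feats = np.empty((len(noisy_mag), 2 * noisy_mag.shape[1]))
    for t, frame in enumerate(noisy_mag):
        # Update the running estimate only when the frame energy is
        # close to the current noise energy (likely non-speech).
        if frame.sum() < vad_threshold * noise_est.sum():
            noise_est = alpha * noise_est + (1 - alpha) * frame
        # Concatenate the noisy frame with the current noise estimate.
        feats[t] = np.concatenate([frame, noise_est])
    return feats
```

Because the estimate tracks the noise over time rather than being fixed per utterance, such cues can remain informative when several non-stationary noises come and go within one recording.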