Multi-objective Learning and Mask-based Post-processing for Deep Neural Network based Speech Enhancement

Yong Xu¹∗, Jun Du¹, Zhen Huang², Li-Rong Dai¹, Chin-Hui Lee²

¹ National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, China
² School of Electrical and Computer Engineering, Georgia Institute of Technology, USA

xuyong62@mail.ustc.edu.cn, jundu@ustc.edu.cn, chl@ece.gatech.edu

∗ This work was done while Yong Xu was visiting Georgia Tech in 2014-2015.
Abstract
We propose a multi-objective framework to learn both secondary targets not directly related to the intended task of speech enhancement (SE) and the primary target of the clean log-power spectra (LPS) features to be used directly for constructing the enhanced speech signals. In deep neural network (DNN) based SE we introduce an auxiliary structure to learn secondary continuous features, such as mel-frequency cepstral coefficients (MFCCs), and categorical information, such as the ideal binary mask (IBM), and integrate it into the original DNN architecture for joint optimization of all the parameters. This joint estimation scheme imposes additional constraints not available in the direct prediction of LPS, and potentially improves the learning of the primary target. Furthermore, the learned secondary information can be used as a byproduct for other purposes, e.g., the IBM-based post-processing in this work. A series of experiments show that joint LPS and MFCC learning improves the SE performance, and IBM-based post-processing further enhances the listening quality of the reconstructed speech.
Index Terms: speech enhancement, deep neural network, minimum mean square error, multi-objective learning, binary mask
1. Introduction
Classical speech enhancement (SE) approaches, such as spectral subtraction [1], the MMSE-based spectral amplitude estimator [2, 3] and the optimally modified log-MMSE estimator [4, 5], are considered unsupervised techniques and have been studied extensively for several decades. Based on key assumptions about the interaction between speech and noise, tremendous progress has been made with these techniques. However, some issues, such as rapidly changing noise (e.g., machine gun noise [6]) and negative spectrum estimates, still need to be addressed; the basic spectral subtraction sketch below illustrates the latter.
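As a point of reference for the classical methods above, the following is a minimal, illustrative sketch of power spectral subtraction, not the method proposed in this paper. The noise estimate from leading noise-only frames and the flooring constant are assumptions made for the example.

```python
import numpy as np

def spectral_subtraction(noisy_stft, noise_frames=10, floor=1e-3):
    """Basic power spectral subtraction (illustrative sketch only).

    noisy_stft: complex STFT of the noisy signal, shape (freq_bins, frames).
    The noise power is estimated from the first `noise_frames` frames,
    assumed to be noise-only -- an assumption of this example.
    """
    noisy_power = np.abs(noisy_stft) ** 2
    noise_power = noisy_power[:, :noise_frames].mean(axis=1, keepdims=True)

    # The subtraction can drive the estimate negative; flooring is the usual
    # fix, at the cost of residual "musical noise" artifacts.
    clean_power = np.maximum(noisy_power - noise_power, floor * noisy_power)

    # Reuse the noisy phase for reconstruction.
    return np.sqrt(clean_power) * np.exp(1j * np.angle(noisy_stft))
```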
On the other hand, supervised machine learning approaches have also been developed in recent years and have been shown to generate enhanced speech of good quality [7]. Non-negative matrix factorization (NMF) based speech enhancement [7, 8] is one notable example, in which speech and noise basis models are learned separately from training speech and noise databases; the clean speech can then be recovered by decomposing the noisy speech onto these bases, as sketched below. However, speech and noise are assumed to be uncorrelated, which limits the quality of the enhanced speech signals.
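The following is a minimal sketch of the NMF-based scheme described in [7, 8], using plain multiplicative updates for the Frobenius cost; the basis sizes, iteration counts and Wiener-like masking step are illustrative assumptions rather than details taken from those papers.

```python
import numpy as np

def nmf_fit(V, rank, iters=200, eps=1e-8):
    """Learn basis W and activations H for V ~ W @ H (multiplicative updates)."""
    F, T = V.shape
    rng = np.random.default_rng(0)
    W, H = rng.random((F, rank)) + eps, rng.random((rank, T)) + eps
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def nmf_enhance(V_noisy, W_speech, W_noise, iters=200, eps=1e-8):
    """Decompose a noisy magnitude spectrogram on fixed speech/noise bases."""
    W = np.hstack([W_speech, W_noise])
    rng = np.random.default_rng(0)
    H = rng.random((W.shape[1], V_noisy.shape[1])) + eps
    for _ in range(iters):                      # update activations only; bases stay fixed
        H *= (W.T @ V_noisy) / (W.T @ W @ H + eps)
    V_s = W_speech @ H[:W_speech.shape[1]]      # speech part of the reconstruction
    V_n = W_noise @ H[W_speech.shape[1]:]
    return V_s / (V_s + V_n + eps) * V_noisy    # Wiener-like masking of the noisy spectrogram
```

Here `W_speech` and `W_noise` would be learned with `nmf_fit` on clean-speech and noise spectrograms, respectively.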
Following recent successes in deep learning based speech processing [9, 10, 11], we have recently proposed a deep neural network (DNN) based speech enhancement framework [12, 13, 14] in which a DNN is regarded as a regression model to predict the clean log-power spectra (LPS) features [15] from noisy LPS features. The DNN acts as a mapping function that learns the relationship between clean and noisy speech features without imposing explicit assumptions about their interaction. Similar DNN-based speech denoising methods were also proposed in [16, 17]. In [18, 19], DNN-based methods were demonstrated to outperform NMF-based methods in speech separation. In DNN-based speech enhancement, the minimum mean square error (MMSE) between the target features and the predicted features is typically used as the objective function. It is difficult to design a better cost function to directly optimize the DNN model, especially for features that are correlated. In [19] it was shown that other cost functions, such as the Kullback-Leibler divergence [20] or the Itakura-Saito divergence [21], all performed worse than the MMSE.
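To make the baseline concrete, below is a minimal sketch of a regression DNN trained with an MMSE (mean squared error) objective to map noisy LPS features to clean LPS features, in the spirit of [12, 13]. The use of PyTorch, the context window, layer sizes, activation function and optimizer are illustrative assumptions, not the exact configuration of the cited work.

```python
import torch
import torch.nn as nn

# A context window of noisy LPS frames in, a single clean LPS frame out.
N_FREQ, CONTEXT = 257, 7          # illustrative sizes, not taken from the paper

class RegressionDNN(nn.Module):
    def __init__(self, hidden=2048, layers=3):
        super().__init__()
        dims = [N_FREQ * CONTEXT] + [hidden] * layers
        blocks = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            blocks += [nn.Linear(d_in, d_out), nn.Sigmoid()]
        self.body = nn.Sequential(*blocks)
        self.out = nn.Linear(hidden, N_FREQ)   # linear output layer for clean LPS

    def forward(self, noisy_lps):
        return self.out(self.body(noisy_lps))

model = RegressionDNN()
criterion = nn.MSELoss()                       # the MMSE objective
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

def train_step(noisy_batch, clean_batch):
    """One mini-batch update minimizing the MSE between predicted and clean LPS."""
    optimizer.zero_grad()
    loss = criterion(model(noisy_batch), clean_batch)
    loss.backward()
    optimizer.step()
    return loss.item()
```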
In this paper, a multi-objective learning framework is proposed to optimize a joint objective function, encompassing errors not only in the primary clean LPS features but also in secondary targets for continuous features, such as MFCCs, and for categorical information, such as the ideal binary mask (IBM) [22]. This joint optimization of different but related targets can potentially improve the DNN prediction of the primary LPS target, which is then used to reconstruct the enhanced waveform. In the LPS domain, the target values of different frequency bins are predicted independently without any correlation constraint, and knowledge from auditory perception [23] is not easily utilized. In the MFCC domain, by contrast, mel filtering is applied first, so the correlation across frequency channels is captured in the MFCC coefficients. Furthermore, the IBM is a central concept in computational auditory scene analysis (CASA) [23]. The IBM, which encodes whether each time-frequency unit is noise-dominant or speech-dominant, can also improve DNN training, and the estimated IBM can further be used for post-processing. Finally, MFCC and IBM targets can be combined to help predict the target clean LPS features, as in the sketch below.
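The sketch below shows one way such a joint objective could be set up, assuming the output layer is extended with MFCC and IBM branches alongside the primary LPS output. The branch structure, the per-target weights and the cross-entropy treatment of the IBM are assumptions made for the example, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

N_FREQ, N_MFCC = 257, 13          # illustrative dimensions

class MultiObjectiveDNN(nn.Module):
    """Shared hidden layers with three output branches: LPS, MFCC and IBM."""
    def __init__(self, in_dim=N_FREQ * 7, hidden=2048):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.Sigmoid(),
            nn.Linear(hidden, hidden), nn.Sigmoid(),
        )
        self.lps_out = nn.Linear(hidden, N_FREQ)    # primary continuous target
        self.mfcc_out = nn.Linear(hidden, N_MFCC)   # secondary continuous target
        self.ibm_out = nn.Linear(hidden, N_FREQ)    # secondary categorical target (logits)

    def forward(self, x):
        h = self.shared(x)
        return self.lps_out(h), self.mfcc_out(h), self.ibm_out(h)

mse = nn.MSELoss()
bce = nn.BCEWithLogitsLoss()

def multi_objective_loss(outputs, lps_ref, mfcc_ref, ibm_ref, alpha=0.1, beta=0.1):
    """Primary LPS error plus weighted secondary errors; the weights are assumptions."""
    lps_hat, mfcc_hat, ibm_logits = outputs
    return (mse(lps_hat, lps_ref)
            + alpha * mse(mfcc_hat, mfcc_ref)
            + beta * bce(ibm_logits, ibm_ref))
```

All branches share the hidden layers, so gradients from the secondary MFCC and IBM errors also shape the representation used for the primary LPS prediction, which is the constraint the joint optimization is meant to impose.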
In our SE experiments, we find that learning MFCC and/or IBM as secondary tasks improves DNN-based speech enhancement. Furthermore, IBM-based post-processing gives an additional 1.5 dB improvement in segmental signal-to-noise ratio (SSNR) [15].
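For reference, the SSNR figures quoted here follow the usual segmental SNR definition; a minimal sketch of that metric is given below, with the frame length and the per-frame clamping range stated as assumptions rather than values from [15].

```python
import numpy as np

def segmental_snr(clean, enhanced, frame_len=256, min_db=-10.0, max_db=35.0):
    """Average per-frame SNR in dB between a clean and an enhanced waveform.

    Per-frame values are clamped to [min_db, max_db], as is common practice;
    the exact limits and frame length here are assumptions of this sketch.
    """
    n_frames = len(clean) // frame_len
    snrs = []
    for i in range(n_frames):
        s = clean[i * frame_len:(i + 1) * frame_len]
        e = enhanced[i * frame_len:(i + 1) * frame_len]
        noise_energy = np.sum((s - e) ** 2) + 1e-10
        snr = 10.0 * np.log10(np.sum(s ** 2) / noise_energy + 1e-10)
        snrs.append(np.clip(snr, min_db, max_db))
    return float(np.mean(snrs))
```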
2. Multi-objective Learning for DNN-based
Speech Enhancement
In [12, 13], a DNN is adopted as a mapping function to predict the clean LPS features from the noisy LPS features. The relationship between the clean and noisy speech features can be