Two-Stage Multi-Target Joint Learning for Monaural Speech Separation
Shuai Nie¹, Shan Liang¹, Wei Xue¹, Xueliang Zhang², Wenju Liu¹, Like Dong³, Hong Yang³
¹ National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
² College of Computer Science, Inner Mongolia University
³ Electric Power Research Institute of ShanXi Electric Power Company, China State Grid Corp
{shuai.nie, sliang, wxue, lwj}@nlpr.ia.ac.cn, cszxl@imu.edu.cn
Abstract
Recently, supervised speech separation has been extensively studied and has shown considerable promise. Due to the temporal continuity of speech, speech auditory features and separation targets present prominent spectro-temporal structures and strong correlations over the time-frequency (T-F) domain, which can be exploited for speech separation. However, many supervised speech separation methods model each T-F unit independently with only one target and largely ignore this useful information. In this paper, we propose a two-stage multi-target joint learning method that jointly models the related speech separation targets at the frame level. Systematic experiments show that the proposed approach consistently achieves better separation and generalization performance in low signal-to-noise ratio (SNR) conditions.
Index Terms: speech separation, multi-target learning, computational auditory scene analysis (CASA)
1. Introduction
In real-world environments, background interference substantially degrades speech intelligibility and the performance of many applications, such as speech communication and automatic speech recognition (ASR) [1, 7, 12, 18]. To address this issue, speech separation, which aims to extract the target speech signal from a mixture, has been studied for decades. However, effective speech separation in real-world environments remains challenging, especially when the signal-to-noise ratio (SNR) is low and only one microphone is available.
Speech separation can be formulated as a supervised learning problem [12, 24, 26]. Typically, a supervised speech separation system learns a function that maps noisy features extracted from the mixture to certain ideal masks or clean spectra, which can then be used to separate the target speech from the mixture. As a new trend, compared with traditional speech enhancement [13], supervised speech separation has shown substantial promise in challenging acoustic conditions [12, 24, 26].
Supervised speech separation has two main types of training targets, i.e., mask-based targets [23] and spectra-based ones [26]. For the mask-based targets, the algorithm learns the best approximation of an ideal mask computed from the clean and noisy speech, such as the ideal ratio mask (IRM) [14, 25], while for the spectra-based targets, it learns the best approximation of the clean speech spectra, such as the Gammatone frequency power spectrum (GF) [9].
This research was partly supported by the China National Natural Science Foundation (No. 91120303, No. 61273267, No. 90820011, No. 61403370 and No. 61365006).
Figure 1: Left: the GF of clean speech; Right: the IRM computed with the clean speech and white noise (mixed at 0 dB). Axes: gammatone channel (1-64) vs. frame.
Both the IRM and the GF can be used to generate separated speech with improved intelligibility and/or perceptual quality [23]. Intuitively, the IRM and the GF of clean speech present similar spectro-temporal structures, as shown by the example in Fig. 1. In fact, the IRM can be mathematically derived from the GFs of the clean speech and the noise, and is computed as follows:
\mathrm{IRM}(t, f) = \frac{S^{2}(t, f)}{S^{2}(t, f) + N^{2}(t, f)} \quad (1)
where S^2(t, f) and N^2(t, f) are the GFs of the clean speech and the noise in the time-frequency (T-F) unit at channel f and frame t, respectively. Moreover, due to the sparsity of speech in the T-F domain, the GF keeps a relatively invariant harmonic structure in various auditory environments, and the IRM is inherently bounded and less sensitive to estimation errors [15]. These correlations and this complementarity can be exploited for speech separation, but they have been largely ignored in previous works. Therefore, jointly modeling the IRM and the GF in one model is likely to improve the separation performance.
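For concreteness, Eq. (1) can be computed directly from the two GFs. The following is a minimal Python sketch; the function name and the small epsilon added for numerical stability are illustrative and not part of the original formulation.

```python
import numpy as np

def ideal_ratio_mask(gf_clean, gf_noise, eps=1e-12):
    """IRM of Eq. (1), computed from the Gammatone power spectra (GF)
    of clean speech and noise; inputs are (channels, frames) arrays of
    power values, i.e. S^2(t, f) and N^2(t, f)."""
    return gf_clean / (gf_clean + gf_noise + eps)

# Toy usage with random stand-in "GF" values (64 channels, 300 frames).
S2 = np.random.rand(64, 300)   # stands in for S^2(t, f)
N2 = np.random.rand(64, 300)   # stands in for N^2(t, f)
irm = ideal_ratio_mask(S2, N2)
assert np.all((irm >= 0) & (irm <= 1))  # the IRM is inherently bounded in [0, 1]
```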
In this paper, we propose a multi-target deep neural network (DNN) to jointly model the IRM and the GF. Its target is the combination of the IRM and the GF of clean speech. To further improve the separation performance, a two-stage method is explored. In the first stage, the multi-target DNN is trained to learn a function that maps the noisy features to the joint targets for all frequency channels in one frame. Compared with modeling individual T-F units, modeling at the frame level can capture the correlations over the frequency domain of speech.
correlations over the frequency domain in speech. Moreover, to
exploit the spectro-temporal structures in speech auditory fea-
tures and joint targets, we use denoising autoencoders (DAE)
to model them by self-learning, respectively. Then, the learned
DAEs are combined with a linear transformation matrix W
h
to
initialize the multi-target DNN. Finally, according to the differ-
ent errors produced by output nodes, a backpropagation (BP)
algorithm with bias weights is further explored to fine tune the
multi-target DNN. In the second stage, the estimated IRM and
GF are integrated into another DNN to obtain the final separa-
tion result with higher smoothness and perceptual quality.
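To make the first-stage setup concrete, the following Python (PyTorch) sketch shows a frame-level multi-target DNN whose output concatenates the IRM and the GF for all channels of one frame, trained with differently weighted errors on the two target groups in the spirit of the bias-weighted BP described above. The feature dimension, layer sizes, loss weights, and the omission of the DAE- and W_h-based initialization are illustrative assumptions, not the exact configuration used here.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: D-dimensional noisy features per frame and C = 64
# gammatone channels, so the joint target is [IRM (64) ; GF (64)] per frame.
D, C = 246, 64

# A plain feed-forward multi-target DNN. The first stage described above
# initializes the hidden layers from pre-trained DAEs combined through W_h;
# that initialization is omitted in this sketch.
net = nn.Sequential(
    nn.Linear(D, 1024), nn.Sigmoid(),
    nn.Linear(1024, 1024), nn.Sigmoid(),
    nn.Linear(1024, 2 * C),
)

def biased_multi_target_loss(pred, irm_target, gf_target, w_irm=1.0, w_gf=0.5):
    """Weight the errors of the IRM and GF output nodes differently;
    the weights here are illustrative placeholders."""
    irm_pred, gf_pred = pred[:, :C], pred[:, C:]
    mse = nn.functional.mse_loss
    return w_irm * mse(irm_pred, irm_target) + w_gf * mse(gf_pred, gf_target)

# Toy usage: a batch of 32 frames with random targets.
x = torch.randn(32, D)
irm_t, gf_t = torch.rand(32, C), torch.rand(32, C)
loss = biased_multi_target_loss(net(x), irm_t, gf_t)
loss.backward()
```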