SPEECH SEPARATION BASED ON SIGNAL-NOISE-DEPENDENT DEEP NEURAL
NETWORKS FOR ROBUST SPEECH RECOGNITION
Yan-Hui Tu¹, Jun Du¹, Li-Rong Dai¹, Chin-Hui Lee²

¹University of Science and Technology of China, Hefei, Anhui, P.R. China
²Georgia Institute of Technology, Atlanta, Georgia, USA
tuyanhui@mail.ustc.edu.cn, {jundu,lrdai}@ustc.edu.cn, chl@ece.gatech.edu

This work was supported by the National Natural Science Foundation of China under Grant No. 61305002.
ABSTRACT
In this paper, we propose a new signal-noise-dependent (SND) deep
neural network (DNN) framework to further improve the separation
and recognition performance of the recently developed technique for
general DNN-based speech separation. We adopt a divide-and-conquer
strategy to design the proposed SND-DNNs with higher resolutions,
since a single general DNN could not well accommodate all the speaker
mixing variabilities at different signal-to-noise ratio (SNR) levels.
In this study, two kinds of SNR-dependent DNNs,
namely positive and negative DNNs, are trained to cover the mixed
speech signals with positive and negative SNR levels, respectively.
At the separation stage, a first-pass separation using a general DNN
can give an accurate SNR estimate for model selection. Experi-
mental results on the Speech Separation Challenge (SSC) task show
that SND-DNNs could yield significant performance improvements
for both speech separation and recognition over a general DNN. Fur-
thermore, this purely front-end processing method achieves a rela-
tive word error rate reduction of 11.6% over a state-of-the-art recog-
nition system where a complicated joint decoding framework needs
to be implemented in the back-end.
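As a rough illustration of the two-pass scheme summarized above, the following Python sketch shows how a first-pass general DNN output could drive the selection between a positive-SNR and a negative-SNR model. The DNN objects with a predict method, the estimate_snr helper, and the power-spectrum feature assumption are illustrative placeholders, not the actual implementation described in this paper.

import numpy as np

def estimate_snr(mixed_features, target_estimate):
    """Rough SNR estimate (dB), assuming magnitude/power-spectrum features
    and that the interference is roughly the mixture minus the target."""
    p_target = np.sum(target_estimate ** 2)
    p_interf = np.sum((mixed_features - target_estimate) ** 2) + 1e-10
    return 10.0 * np.log10(p_target / p_interf + 1e-10)

def separate_with_snd_dnns(mixed_features, general_dnn,
                           positive_dnn, negative_dnn):
    """Conceptual two-pass SND-DNN separation sketch."""
    # First pass: rough separation with the general (SNR-independent) DNN.
    rough_target = general_dnn.predict(mixed_features)
    # Estimate the mixture SNR from the first-pass output.
    snr_db = estimate_snr(mixed_features, rough_target)
    # Model selection: positive-SNR DNN vs. negative-SNR DNN.
    snd_dnn = positive_dnn if snr_db >= 0.0 else negative_dnn
    # Second pass: separation with the selected SNR-dependent DNN.
    return snd_dnn.predict(mixed_features)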
Index Terms— single-channel speech separation, robust speech
recognition, deep neural networks, semi-supervised mode
1. INTRODUCTION
Speech separation aims at separating the voice of each speaker when
multiple speakers talk simultaneously. It is important for many ap-
plications, for example automatic speech recognition (ASR). While
significant progress has been made in improving the noise robust-
ness of ASR systems, most techniques focus on improving the per-
formance of the back-end recognizer. In this study, we use the sepa-
ration system as a front-end pre-processor for ASR, so the perfor-
mance of the ASR system depends heavily on the quality of acoustic
pre-processing. Separation algorithms can often be classified
into unsupervised and supervised modes. In the former, speaker
identities and the reference speech of each speaker are not available
at the training stage, while in the supervised mode the information
of both the target and the interfering speakers is provided.
One broad class of single-channel speech separation is the so-
called computational auditory scene analysis (CASA) [1], usually
in an unsupervised mode. CASA-based approaches [2]-[6] use
psychoacoustic cues, such as pitch, voice onset/offset, temporal con-
tinuity, harmonic structures, and modulation correlation, to segregate
a voice of interest by masking the interfering sources. For exam-
ple, in [5], pitch and amplitude modulation were adopted to separate
the voiced portions of co-channel speech. In [6], unsupervised clus-
tering was used to separate speech regions into two speaker groups
by maximizing the ratio of between-cluster and within-cluster dis-
tances. Recently, a data-driven approach [7] separates the underly-
ing clean speech segments by matching each mixed speech segment
against a composite training segment.
In the supervised approaches, speech separation is often formu-
lated as an estimation problem based on:
$x_m = x_t + x_i$  (1)
where $x_m$, $x_t$, $x_i$ are speech signals of the mixture, target speaker,
and interfering speaker, respectively. To solve this under-determined
equation, a general strategy is to represent the speakers by two mod-
els, and use a certain criterion to reconstruct the sources given the
single mixture. An early study in [8] adopted a factorial hidden
Markov model (FHMM) to describe a speaker, and the estimated
sources were used to generate a binary mask. To further impose tem-
poral constraints on speech signals for separation, the work in [9] in-
vestigates the phone-level dynamics using HMMs [10]. For FHMM-
based speech separation, 2-D Viterbi algorithms and approximations
have been used to perform the inference [11]. In [12], an FHMM was
adopted to model vocal tract characteristics for detecting pitch to re-
construct speech sources. In [13, 14, 15], Gaussian mixture models
(GMMs) were employed to model the speakers, and the minimum mean
squared error (MMSE) or maximum a posteriori (MAP) estimator was
used to recover the speech signals. The factorial-max vector quanti-
zation model (MAXVQ) was also used to infer the mask signals in
[16]. Other popular approaches include nonnegative matrix factor-
ization (NMF) based models [17].
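As a concrete reading of Eq. (1), the short sketch below forms a single-channel mixture of a target and an interfering utterance at a prescribed SNR; the function name, the NumPy waveform representation, and the equal-length assumption are illustrative choices, not part of any of the cited systems.

import numpy as np

def mix_at_snr(target, interferer, snr_db):
    """Form the mixture x_m = x_t + x_i of Eq. (1), rescaling the
    interferer so that 10*log10(P_t / P_i) equals snr_db.
    Both inputs are assumed to be 1-D waveforms of equal length."""
    p_target = np.mean(target ** 2)
    p_interf = np.mean(interferer ** 2)
    # Gain on the interferer that yields the desired target-to-interferer SNR.
    gain = np.sqrt(p_target / (p_interf * 10.0 ** (snr_db / 10.0)))
    x_i = gain * interferer
    x_m = target + x_i
    return x_m, x_i

Negative snr_db values correspond to mixtures dominated by the interfering speaker, the condition the negative SND-DNN introduced above is intended to cover.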
Recently, speech separation based on deep learning approaches has
become increasingly popular; these approaches can be divided into two
broad classes. One is in a supervised mode, where deep neural networks
(DNNs) or recurrent neural networks (RNNs) [18] are adopted to
separate the mixed speech given the information of the target speak-
er, interfering speaker, and even the signal-to-noise ratio (SNR). The
other one is in a semi-supervised mode where only the information
of the target speaker is provided. Our recent work [19, 20, 21] be-
longs to the latter. In [19, 20], we solve the separation problem in
Eq. (1) by using a DNN to directly model the highly nonlinear relation-
ship among the speech features of the target speaker, the interfering
speaker, and the mixed signals. Its effectiveness has also been verified
for robust speech recognition [21]. As our DNN approach is semi-
supervised, a large amount of training data with different interfering
speakers at different SNRs can be included to address the problem
of unseen information. However, a single general DNN might not