Single-Channel Speech Separation Based on Deep Clustering with Local
Optimization
Taotao Fu^{1,2}, Ge Yu^1, Lili Guo^1, Yan Wang^1, Ji Liang^1
1 Key Laboratory of Space Utilization, Technology and Engineering Center for Space Utilization, Chinese Academy of Sciences
2 University of Chinese Academy of Sciences
Beijing, China
e-mail: futaotao2@foxmail.com
Abstract—Single-channel separation of multi-speaker mixed speech poses many challenges, such as simultaneously modeling the temporal continuity of the speech signal and improving per-frame separation performance. In this paper, a separation method is proposed based on Deep Clustering with local optimization by an improved Non-Negative Matrix Factorization (NMF) combined with Factorial Conditional Random Fields (FCRF). First, separated voices are obtained from a Deep Clustering model that is trained with a Bi-directional Long Short-Term Memory (BLSTM) network and clusters similar features. Then, the separated voices are iteratively optimized locally by the improved NMF with K-means++ and FCRF. The results show that the algorithm improves separation performance, satisfying both the local optimum of the speech signal on each frame and the continuity of the whole speech signal.
Keywords-single-channel speech separation; K-means++;
deep clustering; NMF; FCRF
I. INTRODUCTION
Blind Speech Separation (BSS) is a challenging research subject that has become a popular area of signal processing in recent years. It derives from the "cocktail-party" problem [1]: effectively identifying a particular speaker's voice in a noisy environment. Many algorithms addressing instantaneous BSS problems have been proposed, but current research results show that the problem is far from well resolved, especially single-channel speech separation, in which two or more individual speech signals must be separated from a single-channel signal. Because the problem is underdetermined, single-channel speech separation is much more difficult than separation with multiple input signals. With wide applications in automatic speech recognition, music transcription, etc., single-channel speech separation has become a new research hotspot in speech signal processing.
A variety of methods have been applied to single-channel speech separation, including non-negative matrix factorization (NMF) [2], computational auditory scene analysis (CASA) [3], the factorial hidden Markov model (FHMM) [4], and Deep Clustering (DC) [5].
In NMF, the mixed speech is separated by exploiting the non-negativity of the speech spectrum, but the temporal continuity of the speech signal cannot be well modeled because adjacent frames of the voice signal are assumed to be independent [2]. FHMM is often used to continuously model the speech mixture process [4]. Mysore [6] proposed a non-negative hidden Markov method that models the temporal continuity of speech by combining NMF and HMM, and also proposed a non-negative factorial hidden Markov model to separate the mixed speech signals of two speakers. Li [7] presented an algorithm based on NMF and FCRF that describes both the spectral structure and the temporal continuity of the speech signal. The "cocktail-party" source separation problem has also been addressed in a deep learning framework called deep clustering: Hershey [5] proposed a deep network-based analogue of spectral clustering that achieves a better separation result, in terms of Signal-to-Distortion Ratio (SDR), than NMF combined with FCRF or HMM. Compared with NMF combined with FCRF, however, Deep Clustering may introduce discontinuities because it lacks temporal continuity modeling of the speech signals.
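As a minimal, hypothetical sketch of the NMF approach discussed above (not the exact formulation of [2] or [7]), a non-negative magnitude spectrogram V is factored into spectral bases W and activations H; each frame (each column of H) is estimated with no coupling to its neighbors, which is precisely the frame-independence assumption that limits temporal continuity modeling:

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy non-negative "magnitude spectrogram": 129 frequency bins x 100 frames.
rng = np.random.default_rng(0)
V = rng.random((129, 100))

# Factor V ~= W @ H with a small number of basis spectra.
# The activations in H are estimated per frame, with no coupling between
# adjacent frames -- the independence assumption noted in the text.
model = NMF(n_components=8, init="random", random_state=0, max_iter=500)
W = model.fit_transform(V)   # (129, 8) spectral bases
H = model.components_        # (8, 100) per-frame activations

print(W.shape, H.shape)
print(bool(np.all(W >= 0) and np.all(H >= 0)))  # factors are non-negative
```

The frame count (100), bin count (129), and rank (8) are illustrative choices, not values from the paper.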
In this paper, a new single-channel speech separation method is proposed, based on Deep Clustering with local optimization, to achieve better local separation and reduce speech distortion. The paper is organized as follows. First, the Deep Clustering model for single-channel speech separation is built using BLSTM, and the separated voice signals are obtained by clustering the BLSTM output. Second, the separated voice signals are iteratively optimized locally by the improved NMF with K-means++ clustering and FCRF. Finally, several experiments are performed to validate the proposed method.
II. SPEECH SEPARATION BASED ON DEEP CLUSTERING
A. Deep Clustering
We define the mixed source signal as X_i. Then the Short-Time Fourier Transform (STFT) is applied to obtain the signal spectrum X_{t,f}, where t indexes the
2017 3rd International Conference on Frontiers of Signal Processing
978-1-5386-1038-1/17/$31.00 ©2017 IEEE
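The STFT step that opens Section II.A can be sketched as follows. This is an illustrative example with assumed parameters (an 8 kHz sampling rate, a 256-sample window, and two synthetic tones standing in for the speakers), none of which are specified in the paper:

```python
import numpy as np
from scipy.signal import stft

fs = 8000  # assumed sampling rate (Hz)
t = np.arange(fs) / fs

# Two toy "speakers" and their single-channel mixture.
s1 = np.sin(2 * np.pi * 440 * t)
s2 = np.sin(2 * np.pi * 660 * t)
x = s1 + s2

# STFT: X[f, t] is the complex spectrum at frequency bin f, time frame t.
freqs, frames, X = stft(x, fs=fs, nperseg=256)

# Deep clustering operates on time-frequency bins of the (log-)magnitude
# spectrogram, grouping bins that belong to the same speaker.
log_mag = np.log1p(np.abs(X))
print(X.shape)  # (nperseg // 2 + 1 frequency bins, number of frames)
```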