A Pitch State Dependent Dictionary Design Method
for Single-Channel Speech Separation
Haiyan Guo$^{1,2}$, Zhen Yang$^{1}$, Linghua Zhang$^{1}$, Lei Ye$^{1}$
1. Key Laboratory of Broadband Wireless Communication and Sensor Network Technology (Ministry of Education), Nanjing
University of Posts and Telecommunications, Nanjing, China
2. School of Information Science and Engineering, Southeast University, Nanjing, China
Abstract—In this paper, we propose to design a new pitch
state dependent dictionary to perform single-channel speech
separation. The dictionary is designed in two stages:
sub-dictionary learning and sub-dictionary concatenation. In
sub-dictionary learning, pitch state information is taken into
account to learn a set of discriminative sub-dictionaries for each
speaker in the time domain. To be specific, each sub-dictionary is
generated as a matrix composed of the speaker's training frames
of similar pitch states as columns. Moreover, we utilize a frequent
pattern mining method to further reduce the sub-dictionary size.
In sub-dictionary concatenation, we propose to select an
appropriate weight pair to match the learned sub-dictionaries to
generate a dictionary for separation. Experimental results show
that the proposed method achieves better overall performance
than two dictionary-based methods and a source-filter-based
method that also uses pitch information.
Keywords—speech separation; sparse decomposition;
dictionary; data mining
I. INTRODUCTION
Single-channel speech separation (SCSS) is considered the
most difficult speech separation problem because no
information on the mixing matrix is available. Existing
approaches generally fall into two groups: computational
auditory scene analysis (CASA) [1-3] and model-based methods
[4-14]. CASA tries to achieve human performance based on the
perceptual organization of speech. Model-based methods use
pre-trained models or dictionaries to incorporate prior
information to help separation. The dictionary-based method,
which performs separation by mapping a mixture onto a
dictionary, can be considered a type of model-based method.
Recent work shows that exploiting sparsity can improve the
separation performance of the dictionary-based method [15-18].
In the dictionary-based method, two main problems must be
solved: sub-dictionary learning and sparse decomposition.
Sub-dictionary learning generally learns a single uniform
dictionary to represent all the source frames of a speaker,
e.g. by non-negative matrix factorization (NMF) [15-16, 19-20]
or K-SVD [21]. In this way, mixed frames are separated mainly
on the basis of the speakers' characteristics. However, other
important features, e.g. pitch information, can also help
separation. Therefore, in this paper, we propose a
sub-dictionary learning method that takes pitch state
information into account as well. To be specific, we learn a
set of discriminative sub-dictionaries for each speaker. Each
sub-dictionary is generated as a matrix whose columns are the
speaker's training frames of similar pitch states. A frequent
pattern mining method is then utilized to reduce the
sub-dictionary size for computational convenience.
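As a rough illustration of the grouping idea described above, the following Python sketch collects a speaker's training frames into one sub-dictionary per pitch state. It is only a sketch under stated assumptions: the functions `pitch_state` and `learn_sub_dictionaries`, the equal-width quantization of the pitch frequency, the 80-400 Hz range, and the number of states are all hypothetical choices made for the example, not the definitions used in this paper. The frequent pattern mining step that reduces the sub-dictionary size is not shown.

```python
from collections import defaultdict


def pitch_state(f0_hz, f0_min=80.0, f0_max=400.0, n_states=8):
    """Map a pitch estimate in Hz to a discrete state (assumed quantization).

    State 0 is reserved for unvoiced frames (f0_hz <= 0). States
    1..n_states split [f0_min, f0_max] into equal-width bins.
    """
    if f0_hz <= 0.0:
        return 0
    clipped = min(max(f0_hz, f0_min), f0_max)
    width = (f0_max - f0_min) / n_states
    return min(int((clipped - f0_min) / width), n_states - 1) + 1


def learn_sub_dictionaries(frames, f0_track):
    """Group time-domain frames by pitch state.

    Returns a dict that maps each pitch state to a list of frames;
    each frame plays the role of one column (atom) of that
    sub-dictionary.
    """
    groups = defaultdict(list)
    for frame, f0 in zip(frames, f0_track):
        groups[pitch_state(f0)].append(list(frame))
    return dict(groups)
```

Storing each sub-dictionary as a list of columns keeps the example free of external libraries; in practice the columns would usually be stacked into a matrix.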
Sparse decomposition aims to find the sparsest
representation of the mixture over the dictionary, which is the
union of the learned sub-dictionaries. Each source is estimated
by computing the part that falls into the corresponding
sub-dictionary. Therefore, it is unavoidable that some energy of
a source $\mathbf{s}_i$ is represented over a cross
sub-dictionary $\mathbf{D}_j$, $j \neq i$.
To suppress the cross representation, a discriminative
dictionary learning (DDL) based method is proposed in [18].
It takes the relationship between the sub-dictionaries into
account and optimizes the sub-dictionaries jointly to learn a
structured dictionary. In [20], a discriminative NMF learning
method is proposed that optimizes all basis vectors jointly to
reconstruct both the clean and the mixed signals well. In this
paper, we propose to further suppress cross representation by
concatenating the sub-dictionaries appropriately. Specifically,
we propose to select an appropriate weight pair to match the
learned sub-dictionaries when generating the dictionary for
separation.
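The sparse decomposition step described above can be illustrated with a small Python sketch. The mixture frame is coded over the union of two sub-dictionaries, and each source is estimated from the atoms that belong to its own sub-dictionary. Plain matching pursuit is used here only as a simple stand-in for a sparse solver; it is not claimed to be the solver used in this paper, and the names `matching_pursuit` and `separate` are hypothetical. The weight pair proposed in this paper is not modeled.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def matching_pursuit(atoms, y, n_iter=10, tol=1e-10):
    """Greedy sparse coding of y over a list of unit-norm atoms."""
    coeffs = [0.0] * len(atoms)
    residual = list(y)
    for _ in range(n_iter):
        corr = [dot(a, residual) for a in atoms]
        # Pick the atom most correlated with the current residual.
        k = max(range(len(atoms)), key=lambda j: abs(corr[j]))
        if abs(corr[k]) < tol:
            break
        coeffs[k] += corr[k]
        residual = [r - corr[k] * a for r, a in zip(residual, atoms[k])]
    return coeffs


def separate(atoms_1, atoms_2, y, n_iter=10):
    """Estimate two sources from one mixture frame y.

    atoms_1 and atoms_2 are the sub-dictionaries of speaker 1 and
    speaker 2 (lists of unit-norm atoms). The dictionary is their
    union, and each source estimate is the part of the sparse
    representation that falls into the matching sub-dictionary.
    """
    atoms = atoms_1 + atoms_2
    coeffs = matching_pursuit(atoms, y, n_iter)
    n1, dim = len(atoms_1), len(y)
    s1 = [sum(coeffs[j] * atoms[j][t] for j in range(n1))
          for t in range(dim)]
    s2 = [sum(coeffs[j] * atoms[j][t] for j in range(n1, len(atoms)))
          for t in range(dim)]
    return s1, s2
```

When atoms from the two sub-dictionaries are correlated, part of one speaker's energy is picked up by the other speaker's atoms, which is the cross representation that the methods discussed above try to suppress.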
The rest of this paper is organized as follows. The
dictionary-based SCSS model is introduced in Section II. A pitch
state dependent dictionary design method is proposed in Section
III, which includes two stages: pitch state dependent
sub-dictionary learning and sub-dictionary concatenation.
Experimental results are given in Section IV. Finally,
conclusions are summarized in Section V.
II. DICTIONARY-BASED SCSS MODEL
We consider the linear SCSS problem in which two
underlying speech sources $\mathbf{s}_i$, $i = 1, 2$, need to be
recovered from a single speech mixture
$$\mathbf{y}(t) = \sum_{i=1}^{2} \mathbf{s}_i(t). \qquad (1)$$
Generally, separation is performed frame by frame with the
window length of l and
%
overlap.
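The mixing model in (1) and the frame-by-frame processing can be sketched in a few lines of Python. This is a minimal illustration only: the function names `mix` and `split_into_frames` are hypothetical, and because the overlap percentage is not legible in the text above, the overlap is left as a parameter and the 50% used in the example call is an arbitrary value.

```python
def mix(s1, s2):
    """Eq. (1): the mixture is the sample-wise sum of the two sources."""
    return [a + b for a, b in zip(s1, s2)]


def split_into_frames(signal, frame_len, overlap):
    """Cut a signal into frames of frame_len samples.

    overlap is the fraction of a frame shared with the next one,
    in [0, 1). Trailing samples that do not fill a frame are dropped.
    """
    hop = max(1, int(round(frame_len * (1.0 - overlap))))
    return [signal[start:start + frame_len]
            for start in range(0, len(signal) - frame_len + 1, hop)]


# Example: a toy mixture, frame length 4, illustrative 50% overlap.
y = mix([1, 2, 3, 4, 5, 6], [1, 1, 1, 1, 1, 1])
frames = split_into_frames(y, frame_len=4, overlap=0.5)
```

Each resulting frame would then be separated independently, and the separated frames recombined by overlap-add.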
Traditionally, the dictionary-based method first learns the
sub-dictionaries $\mathbf{D}_i$, $i = 1, 2$, and then generates
the dictionary as
978-1-5090-2860-3/16/$31.00 ©2016 IEEE