A PAIRWISE ALGORITHM FOR PITCH ESTIMATION AND SPEECH SEPARATION USING
DEEP STACKING NETWORK
Hui Zhang¹, Xueliang Zhang¹, Shuai Nie², Guanglai Gao¹, Wenju Liu²
¹ Computer Science Department, Inner Mongolia University, Hohhot, China, 010021
² National Laboratory of Pattern Recognition (NLPR), Institute of Automation, University of Chinese Academy of Sciences, Beijing, China, 100190
alzhu.san@163.com, cszxl@imu.edu.cn, nss90221@gmail.com, csggl@imu.edu.cn, lwj@nlpr.ia.ac.cn
ABSTRACT
Pitch information is an important cue for speech separation.
However, pitch estimation in noisy conditions is itself as
challenging as speech separation. In this paper, we propose a
supervised learning architecture that combines these two
problems concisely. The proposed algorithm is based on the
deep stacking network (DSN), which builds a deep architecture
by stacking simple processing modules. In the training stage,
an ideal binary mask is used as the target. The input vector
includes the outputs of the lower module and frame-level
features consisting of spectral and pitch-based features. In
the testing stage, each module provides an estimated binary
mask, which is employed to re-estimate pitch. The pitch-based
features are then updated and passed to the next module. This
procedure is embedded iteratively in the DSN, and the final
separation results are obtained from its last module.
Systematic evaluations show that the proposed approach
produces high-quality estimated binary masks and outperforms
recent systems in generalization.
Index Terms— Speech separation, Pitch estimation,
Computational auditory scene analysis, Supervised learning
1. INTRODUCTION
In realistic environments, noise usually degrades speech
intelligibility for hearing-impaired listeners and the
performance of automatic speech recognition (ASR) systems.
Speech separation aims to remove noise by separating target
speech from background interference, and is helpful for both
hearing aid wearers and ASR systems [1, 2]. Computational auditory
scene analysis (CASA) is a promising method to solve the
speech separation problem [3].
CASA defines the goal of speech separation as computing
an ideal binary mask (IBM) [4], which is useful for
improving speech intelligibility [5] and the performance
of speech/speaker recognition [6, 7]. The IBM is a
time-frequency (T-F) mask, which can be computed from premixed
target and interference. Specifically, in a T-F unit, if the
signal-to-noise ratio (SNR) is greater than a local SNR
criterion (LC), the corresponding mask element in the IBM is
set to 1 (target-dominant); otherwise, it is set to 0
(interference-dominant).

This research was supported in part by the China National
Natural Science Foundation (No. 61365006, No. 61263037,
No. 61305027, No. 91120303, No. 61273267, No. 61403370, and
No. 90820011).
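The IBM rule above can be illustrated with a minimal NumPy sketch. The function name, array layout, and default LC value here are illustrative assumptions, not part of the paper's implementation:

```python
import numpy as np

def ideal_binary_mask(target_energy, noise_energy, lc_db=0.0):
    """Compute the IBM from premixed target and interference energies.

    target_energy, noise_energy: T-F energy matrices (frequency x time).
    lc_db: local SNR criterion (LC) in dB; 0 dB is a common choice.
    """
    eps = 1e-12  # guard against division by zero in silent units
    local_snr_db = 10.0 * np.log10((target_energy + eps) / (noise_energy + eps))
    # 1 where the unit is target-dominant, 0 where interference-dominant
    return (local_snr_db > lc_db).astype(np.int8)
```

In practice the two energy matrices would come from passing the premixed target and interference through the same T-F analysis (e.g. a gammatone filterbank) used for the mixture.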
When adopting the IBM as the computational goal of CASA, we
can naturally formulate speech separation as a binary
classification problem [5]. From the viewpoint of
classification, feature selection is important. Many features
have been investigated, including pitch-based features [8],
the amplitude modulation spectrum (AMS) [9], relative spectral
transform and perceptual linear prediction (RASTA-PLP),
Mel-frequency cepstral coefficients (MFCC), and Gammatone
frequency cepstral coefficients (GFCC) [10]. Wang et al. [10]
suggest that pitch-based features generalize well in speech
separation.
Pitch-based features are derived from pitch, but extracting
pitch from noisy speech is itself difficult, especially in
low-SNR conditions. On one hand, if the target voice is
separated from the background, the pitch can be obtained
easily; on the other hand, speech separation improves when
pitch estimation is accurate. Since these two tasks can benefit
from each other, speech separation and pitch extraction in
noisy conditions are considered a “chicken-and-egg” problem.
In this paper, we propose a supervised learning system that
deals with this “chicken-and-egg” problem concisely. The main
contributions are as follows:
• Pitch extraction and speech separation are boosted
alternately. (Section 2.1)
• Frame-level features are adopted, consisting of spectral
features and pitch-based features. (Section 2.2)
• A deep stacking network (DSN) is used to implement our
idea of working on the two problems (pitch extraction
and speech separation) alternately. (Section 2.3)
• Systematic evaluations show that the proposed approach
produces high-quality estimated binary masks and
outperforms recent systems in unmatched noisy
conditions. (Section 3)
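The alternation described above can be sketched as a skeleton loop. This is a hypothetical illustration of the control flow only: `modules`, `reestimate_pitch`, and `pitch_features` stand in for the paper's DSN modules, pitch re-estimator, and feature extractor, none of which are specified here:

```python
def separate_with_dsn(spectral_feats, modules, reestimate_pitch,
                      pitch_features, init_pitch_feats):
    """Alternate mask estimation and pitch re-estimation across stacked modules.

    modules: list of callables; each maps (spectral feats, pitch feats,
             previous mask) -> estimated binary mask, mimicking one DSN module.
    """
    pitch_feats = init_pitch_feats
    mask = None
    for module in modules:
        # Each module estimates a binary mask from the current features
        # and the lower module's output.
        mask = module(spectral_feats, pitch_feats, mask)
        # The estimated mask is used to re-estimate pitch, which updates
        # the pitch-based features fed to the next module.
        pitch = reestimate_pitch(mask)
        pitch_feats = pitch_features(pitch)
    return mask  # separation result from the last module
```

The design mirrors the paper's description: each module refines the mask, and the refined mask in turn refines the pitch-based features for the module above it.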