Improved Speaker-Dependent Separation for CHiME-5 Challenge

Jian Wu 1,2*, Yong Xu 3, Shi-Xiong Zhang 3, Lian-Wu Chen 2, Meng Yu 3, Lei Xie 1†, Dong Yu 3

1 School of Computer Science, Northwestern Polytechnical University, Xi'an, China
2 Tencent AI Lab, Shenzhen, China
3 Tencent AI Lab, Bellevue, USA

{jianwu,lxie}@nwpu-aslp.org, {lucayongxu,auszhang,lianwuchen,raymondmyu,dyu}@tencent.com

* Work done during an internship at Tencent AI Lab.
† Corresponding author.
Abstract
This paper summarizes several contributions to improving the speaker-dependent separation system for the CHiME-5 challenge, which addresses multi-channel, highly overlapped conversational speech recognition in a dinner-party scenario with reverberation and non-stationary noise. Specifically, we adopt a speaker-aware training method that uses an i-vector as the target speaker information for multi-talker speech separation. With only one unified separation model for all speakers, we achieve a 10% absolute improvement in word error rate (WER) over the previous baseline of 80.28% on the development set by leveraging our newly proposed data processing techniques and beamforming approach. With our improved back-end acoustic model, we further reduce the WER to 60.15%, which surpasses the result of our submitted CHiME-5 challenge system without applying any fusion techniques.
Index Terms: CHiME-5 challenge, speaker-dependent speech
separation, robust speech recognition, speech enhancement,
beamforming
1. Introduction
With recent progress in front-end audio processing, acoustic modeling and language modeling, automatic speech recognition (ASR) techniques are widely deployed in daily life. However, ASR performance degrades severely in challenging acoustic environments (e.g., overlapped, noisy and reverberated speech), mainly due to complicated acoustic conditions unseen in training. Much previous work on acoustic robustness focused on a single aspect, e.g., speech separation [1, 2, 3, 4], enhancement [5, 6, 7, 8, 9], or dereverberation [10, 11, 12]. Those experiments were conducted on simulated data, which does not reflect real applications. The recently released CHiME-5 challenge [13] provides a large-scale multi-speaker conversational corpus recorded via Microsoft Kinect in real home environments and targets the problem of distant multi-microphone conversational speech recognition. As the recordings are heavily overlapped among multiple speakers and corrupted by reverberation and background noise, the WERs reported on this dataset are fairly high. In this paper, we make several improvements over our previously submitted speaker-dependent system [14], which ranked 3rd under the unconstrained language model (LM) and 5th under the constrained LM on the single-device track.
The difficulties of CHiME-5 are three-fold. First, the natural conversations contain casual content, sometimes interspersed with laughter and coughing. Speaker interference is also common in conversational speech, which degrades recognition performance. Second, the recording hardware, far-field wave propagation and ambient noises cause audio clipping, signal attenuation and noise corruption, respectively. Third, the lack of clean speech for supervised training greatly limits the algorithm design, and external datasets are not allowed under the CHiME-5 rules. Considering these aspects, robust front-end processing for target speaker enhancement is critical for improving ASR performance.

Figure 1: Flow chart of data processing and simulation
Recent studies have made great efforts in multi-channel speech enhancement [7, 8, 9, 15], most of which depend on time-frequency (TF) masks. Deep-learning-based beamforming has become the most popular approach since the CHiME-3 and CHiME-4 challenges [16]. In CHiME-5, however, it is difficult to train a speech enhancement mask estimator and obtain accurate predictions because the oracle clean data required for supervised training is unavailable. On the other hand, recently proposed monaural blind speech separation methods, e.g., DPCL [1] and uPIT [2], are of limited use here: the permutation issue makes speaker tracking necessary, and the number of speakers must be known in advance, which is infeasible in the CHiME-5 challenge. However, since the target speaker ID is given for each utterance, we explored speaker-dependent (SD) separation in [14], and Du et al. [17] used a speaker-dependent system together with a two-stage separation method.
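To make the permutation issue concrete, below is a minimal NumPy sketch of an utterance-level PIT-style loss. The magnitude-spectrum MSE criterion, the variable names and the toy shapes are our illustrative assumptions rather than the exact setup of [2]; the point is that the number of speakers S must be fixed in advance and the loss searches over all S! output-to-speaker permutations.

```python
import itertools
import numpy as np

def upit_mse_loss(estimates, references):
    """Utterance-level PIT: evaluate every output-to-speaker permutation
    and keep the best one.
    estimates, references: (S, T, F) magnitude spectrograms (assumed shapes)."""
    S = estimates.shape[0]
    best = np.inf
    for perm in itertools.permutations(range(S)):
        # MSE between each estimate and the permuted references
        err = np.mean((estimates - references[list(perm)]) ** 2)
        best = min(best, err)
    return best  # the network is trained on the minimum-error permutation

# Toy usage: 2 speakers, 100 frames, 257 frequency bins
est = np.random.rand(2, 100, 257)
ref = np.random.rand(2, 100, 257)
print(upit_mse_loss(est, ref))
```

Because the assignment is only resolved within an utterance, the separated streams still need speaker tracking across utterances, which is the limitation noted above.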
In this paper, we focus on the single-array track (only one reference array is used) and achieve significant improvements with the following contributions. First, we process the data with GWPE [18], CGMM [8, 19] and OMLSA [20] to further remove interference in the non-overlapped segments, which are used as training targets for the SD models; in [14], suffering from low-quality training targets, the system achieved only a 2% absolute WER reduction. Second, inspired by [21, 22, 23], we incorporate i-vectors as auxiliary features that identify the target speaker. With this speaker-aware (SA) training technique, we achieve much better results using only one mask estimation model, as sketched below.
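As an illustration of the speaker-aware input, the minimal sketch below appends the target speaker's i-vector to every frame of the spectral features before the mask estimator. Frame-wise concatenation and the feature dimensions are assumptions made here for illustration, not necessarily the exact fusion used by our system.

```python
import numpy as np

def speaker_aware_features(log_spectra, ivector):
    """Append the target speaker's i-vector to every frame.

    log_spectra: (T, F) log-magnitude features of the mixture.
    ivector:     (D,)  i-vector of the target speaker.
    Returns:     (T, F + D) speaker-aware input for the mask estimator.
    (Frame-wise concatenation is one common choice, assumed here.)"""
    T = log_spectra.shape[0]
    tiled = np.tile(ivector[None, :], (T, 1))   # repeat i-vector for every frame
    return np.concatenate([log_spectra, tiled], axis=-1)

# Toy usage: 200 frames, 257 bins, 100-dim i-vector
feats = speaker_aware_features(np.random.rand(200, 257), np.random.rand(100))
print(feats.shape)  # (200, 357)
```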
Third, we investigate the beamforming performance and observe that, with more accurate speaker masks, the generalized eigenvalue (GEV) [24] beamformer performs better than the minimum variance distortionless response (MVDR) [25] beamformer; a sketch of mask-based beamforming follows.
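For reference, the sketch below shows one common mask-based formulation of the two beamformers: mask-weighted spatial covariance matrices are estimated for speech and noise, the GEV weights are the principal generalized eigenvectors of this matrix pair, and the MVDR weights are derived with the steering vector taken as the principal eigenvector of the speech covariance (one common choice). Regularization values, reference-channel selection and the blind analytic normalization usually paired with GEV are simplified or omitted; shapes and names are our assumptions.

```python
import numpy as np
from scipy.linalg import eigh

def spatial_covariance(Y, mask):
    """Mask-weighted spatial covariance per frequency.
    Y: (F, C, T) multi-channel STFT, mask: (F, T) TF mask in [0, 1]."""
    R = np.einsum('ft,fct,fdt->fcd', mask, Y, Y.conj())
    return R / np.maximum(mask.sum(axis=-1), 1e-6)[:, None, None]

def gev_weights(R_speech, R_noise):
    """GEV / max-SNR beamformer: principal generalized eigenvector of
    (R_speech, R_noise) at each frequency bin."""
    F, C, _ = R_speech.shape
    w = np.zeros((F, C), dtype=complex)
    for f in range(F):
        # eigh solves the generalized problem R_s v = lambda R_n v
        _, vecs = eigh(R_speech[f], R_noise[f] + 1e-6 * np.eye(C))
        w[f] = vecs[:, -1]   # eigenvector with the largest eigenvalue
    return w

def mvdr_weights(R_speech, R_noise):
    """MVDR beamformer with the steering vector taken as the principal
    eigenvector of the speech covariance (one common choice)."""
    F, C, _ = R_speech.shape
    w = np.zeros((F, C), dtype=complex)
    for f in range(F):
        _, vecs = np.linalg.eigh(R_speech[f])
        d = vecs[:, -1]                                  # steering vector
        Rn_inv_d = np.linalg.solve(R_noise[f] + 1e-6 * np.eye(C), d)
        w[f] = Rn_inv_d / (d.conj() @ Rn_inv_d)
    return w

# Toy usage: 257 bins, 4 channels, 100 frames
F_, C_, T_ = 257, 4, 100
Y = np.random.randn(F_, C_, T_) + 1j * np.random.randn(F_, C_, T_)
speech_mask = np.random.rand(F_, T_)
R_s = spatial_covariance(Y, speech_mask)
R_n = spatial_covariance(Y, 1.0 - speech_mask)
enhanced_gev = np.einsum('fc,fct->ft', gev_weights(R_s, R_n).conj(), Y)
enhanced_mvdr = np.einsum('fc,fct->ft', mvdr_weights(R_s, R_n).conj(), Y)
print(enhanced_gev.shape, enhanced_mvdr.shape)  # (257, 100) each
```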
Finally, we report a 10% absolute WER reduction on the development set, and a 20% reduction with our improved acoustic model (AM), which is based on the fac-