Improved Speaker-Dependent Separation for CHiME-5 Challenge

Jian Wu 1,2*, Yong Xu 3, Shi-Xiong Zhang 3, Lian-Wu Chen 2, Meng Yu 3, Lei Xie 1†, Dong Yu 3

1 School of Computer Science, Northwestern Polytechnical University, Xi'an, China
2 Tencent AI Lab, Shenzhen, China
3 Tencent AI Lab, Bellevue, USA

{jianwu,lxie}@nwpu-aslp.org, {lucayongxu,auszhang,lianwuchen,raymondmyu,dyu}@tencent.com

* Work done during an internship at Tencent AI Lab.
† Corresponding author.
Abstract
This paper summarizes several contributions to improving the speaker-dependent separation system for the CHiME-5 challenge, which addresses multi-channel, highly overlapped conversational speech recognition in a dinner-party scenario with reverberation and non-stationary noise. Specifically, we adopt a speaker-aware training method that uses an i-vector as the target speaker information for multi-talker speech separation. With only one unified separation model for all speakers, we achieve a 10% absolute improvement in word error rate (WER) over the previous baseline of 80.28% on the development set by leveraging our newly proposed data processing techniques and beamforming approach. With our improved back-end acoustic model, we further reduce the WER to 60.15%, which surpasses the result of our submitted CHiME-5 challenge system without applying any fusion techniques.
Index Terms: CHiME-5 challenge, speaker-dependent speech
separation, robust speech recognition, speech enhancement,
beamforming
1. Introduction
With recent progress in front-end audio processing, acoustic modeling and language modeling, automatic speech recognition (ASR) techniques are widely deployed in daily life. However, ASR performance degrades severely in challenging acoustic environments (e.g., overlapped, noisy and reverberated speech), mainly due to complicated acoustic conditions unseen in training. Much previous work on acoustic robustness focused on a single aspect, e.g., speech separation [1, 2, 3, 4], enhancement [5, 6, 7, 8, 9], or dereverberation [10, 11, 12]. Those experiments were conducted on simulated data, which does not reflect real applications. The recently released CHiME-5 challenge [13] provides a large-scale multi-speaker conversational corpus recorded via Microsoft Kinect in real home environments and targets the problem of distant multi-microphone conversational speech recognition. As the recordings are heavily overlapped among multiple speakers and corrupted by reverberation and background noise, the WERs reported on this dataset are fairly high. In this paper, we make several improvements over our previously submitted speaker-dependent system [14], which ranked 3rd under the unconstrained language model (LM) and 5th under the constrained LM on the single-device track.
The difficulties of CHiME-5 are three-fold. First, the natural conversations contain casual content, sometimes interspersed with laughter and coughing. Speaker interference is also common in conversational speech, which degrades recognition performance. Second, the recording hardware, far-field wave propagation and ambient noises cause audio clipping, signal attenuation and noise corruption, respectively. Third, the lack of clean speech for supervised training greatly limits the algorithm design, and external datasets are not allowed under the CHiME-5 rules. Considering these aspects, robust front-end processing for target speaker enhancement is critical for improving ASR performance.

Figure 1: Flow chart of data processing and simulation
Recent studies have made great efforts in multi-channel speech enhancement [7, 8, 9, 15], most of which depend on time-frequency (TF) masks. Deep-learning-based beamforming has become the most popular approach since the CHiME-3 and CHiME-4 challenges [16]. In CHiME-5, however, it is difficult to train a speech enhancement mask estimator and obtain accurate predictions because the oracle clean data required for supervised training is unavailable. On the other hand, recently proposed monaural blind speech separation methods, e.g., DPCL [1] and uPIT [2], are of limited use here: the permutation issue makes speaker tracking necessary, and the number of speakers must be known in advance, which is infeasible in the CHiME-5 challenge. However, since the target speaker ID is given for each utterance, we explored speaker-dependent (SD) separation in [14], and Du et al. [17] used a speaker-dependent system together with a two-stage separation method.
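To make the permutation issue concrete, below is a minimal NumPy sketch of an utterance-level PIT-style loss. The magnitude-spectrum MSE criterion, the variable names and the toy shapes are our illustrative assumptions rather than the exact setup of [2]; the point is that the number of speakers S must be fixed in advance and the loss searches over all S! output-to-speaker permutations.

```python
import itertools
import numpy as np

def upit_mse_loss(estimates, references):
    """Utterance-level PIT: evaluate every output-to-speaker permutation
    and keep the best one.
    estimates, references: (S, T, F) magnitude spectrograms (assumed shapes)."""
    S = estimates.shape[0]
    best = np.inf
    for perm in itertools.permutations(range(S)):
        # MSE between each estimate and the permuted references
        err = np.mean((estimates - references[list(perm)]) ** 2)
        best = min(best, err)
    return best  # the network is trained on the minimum-error permutation

# Toy usage: 2 speakers, 100 frames, 257 frequency bins
est = np.random.rand(2, 100, 257)
ref = np.random.rand(2, 100, 257)
print(upit_mse_loss(est, ref))
```

Because the assignment is only resolved within an utterance, the separated streams still need speaker tracking across utterances, which is the limitation noted above.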
In this paper, we focus on the single-array track (only one reference array is used) and achieve significant improvements with the following contributions. First, we process the data with GWPE [18], CGMM [8, 19] and OMLSA [20] to further remove interference in the non-overlapped segments, which are used as training targets for the SD models; in [14], suffering from low-quality training targets, the system achieved only a 2% absolute WER reduction. Second, inspired by [21, 22, 23], we incorporate i-vectors as auxiliary features that identify the target speaker. With this speaker-aware (SA) training technique, we achieve much better results using only one mask estimation model, as sketched below.
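As an illustration of the speaker-aware input, the minimal sketch below appends the target speaker's i-vector to every frame of the spectral features before the mask estimator. Frame-wise concatenation and the feature dimensions are assumptions made here for illustration, not necessarily the exact fusion used by our system.

```python
import numpy as np

def speaker_aware_features(log_spectra, ivector):
    """Append the target speaker's i-vector to every frame.

    log_spectra: (T, F) log-magnitude features of the mixture.
    ivector:     (D,)  i-vector of the target speaker.
    Returns:     (T, F + D) speaker-aware input for the mask estimator.
    (Frame-wise concatenation is one common choice, assumed here.)"""
    T = log_spectra.shape[0]
    tiled = np.tile(ivector[None, :], (T, 1))   # repeat i-vector for every frame
    return np.concatenate([log_spectra, tiled], axis=-1)

# Toy usage: 200 frames, 257 bins, 100-dim i-vector
feats = speaker_aware_features(np.random.rand(200, 257), np.random.rand(100))
print(feats.shape)  # (200, 357)
```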
Third, we investigate the beamforming performance and observe that, with more accurate speaker masks, the generalized eigenvalue (GEV) [24] beamformer performs better than the minimum variance distortionless response (MVDR) [25] beamformer; a sketch of mask-based beamforming follows.
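For reference, the sketch below shows one common mask-based formulation of the two beamformers: mask-weighted spatial covariance matrices are estimated for speech and noise, the GEV weights are the principal generalized eigenvectors of this matrix pair, and the MVDR weights are derived with the steering vector taken as the principal eigenvector of the speech covariance (one common choice). Regularization values, reference-channel selection and the blind analytic normalization usually paired with GEV are simplified or omitted; shapes and names are our assumptions.

```python
import numpy as np
from scipy.linalg import eigh

def spatial_covariance(Y, mask):
    """Mask-weighted spatial covariance per frequency.
    Y: (F, C, T) multi-channel STFT, mask: (F, T) TF mask in [0, 1]."""
    R = np.einsum('ft,fct,fdt->fcd', mask, Y, Y.conj())
    return R / np.maximum(mask.sum(axis=-1), 1e-6)[:, None, None]

def gev_weights(R_speech, R_noise):
    """GEV / max-SNR beamformer: principal generalized eigenvector of
    (R_speech, R_noise) at each frequency bin."""
    F, C, _ = R_speech.shape
    w = np.zeros((F, C), dtype=complex)
    for f in range(F):
        # eigh solves the generalized problem R_s v = lambda R_n v
        _, vecs = eigh(R_speech[f], R_noise[f] + 1e-6 * np.eye(C))
        w[f] = vecs[:, -1]   # eigenvector with the largest eigenvalue
    return w

def mvdr_weights(R_speech, R_noise):
    """MVDR beamformer with the steering vector taken as the principal
    eigenvector of the speech covariance (one common choice)."""
    F, C, _ = R_speech.shape
    w = np.zeros((F, C), dtype=complex)
    for f in range(F):
        _, vecs = np.linalg.eigh(R_speech[f])
        d = vecs[:, -1]                                  # steering vector
        Rn_inv_d = np.linalg.solve(R_noise[f] + 1e-6 * np.eye(C), d)
        w[f] = Rn_inv_d / (d.conj() @ Rn_inv_d)
    return w

# Toy usage: 257 bins, 4 channels, 100 frames
F_, C_, T_ = 257, 4, 100
Y = np.random.randn(F_, C_, T_) + 1j * np.random.randn(F_, C_, T_)
speech_mask = np.random.rand(F_, T_)
R_s = spatial_covariance(Y, speech_mask)
R_n = spatial_covariance(Y, 1.0 - speech_mask)
enhanced_gev = np.einsum('fc,fct->ft', gev_weights(R_s, R_n).conj(), Y)
enhanced_mvdr = np.einsum('fc,fct->ft', mvdr_weights(R_s, R_n).conj(), Y)
print(enhanced_gev.shape, enhanced_mvdr.shape)  # (257, 100) each
```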
Finally, we report a 10% absolute WER reduction on the development set, and a 20% reduction with our improved acoustic model (AM), which is based on the fac-