Two-Stage Multi-Target Joint Learning for Monaural Speech Separation
Shuai Nie¹, Shan Liang¹, Wei Xue¹, Xueliang Zhang², Wenju Liu¹
¹National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
²College of Computer Science, Inner Mongolia University
{shuai.nie, sliang, wxue, lwj}@nlpr.ia.ac.cn, cszxl@imu.edu.cn
Abstract
Recently, supervised speech separation has been extensively studied and shown considerable promise. However, many methods independently model each time-frequency (T-F) unit with only one target and ignore the correlations over the time-frequency domain in speech auditory features and separation targets. Moreover, due to the temporal continuity of speech, speech auditory features and separation targets have prominent spectro-temporal structures, and some targets are highly related. This information can be exploited for speech separation. In this paper, we propose a two-stage multi-target joint learning method to model the correlations in auditory features and separation targets. Systematic experiments show that the proposed approach consistently achieves better separation and generalization performance in low signal-to-noise ratio (SNR) conditions.
Index Terms: Speech Separation, Multi-target Learning, Com-
putational Auditory Scene Analysis (CASA)
1. Introduction
In real-world environments, background interferences substantially degrade speech intelligibility and the performance of many applications, such as speech communication and automatic speech recognition (ASR) [1, 7, 11, 17]. To address this issue, decades of effort have been devoted to speech separation, but separation in real-world environments, especially in low-SNR and monaural conditions, remains a challenging task. As a new trend, compared to traditional speech enhancement [12], supervised speech separation has shown substantial promise in challenging acoustic conditions [11, 23].
Speech separation can be formulated as a supervised learning problem. Typically, a supervised method learns a function that maps noisy features extracted from the mixture to certain ideal masks or clean spectra, which can then be used to separate the target speech from the mixture signal.
Supervised speech separation has two main types of training targets: mask-based and spectra-based approximation targets [22, 25]. Mask-based targets learn the best approximation of an ideal mask computed from the clean and noisy speech, such as the ideal ratio mask (IRM) [13, 24], while spectra-based targets learn the best approximation of the spectrum of clean speech, such as the Gammatone frequency power spectrum (GF) [9]. Both the IRM and the GF can be used to generate separated speech with improved intelligibility and/or perceptual quality [22], and they are highly related. Intuitively, the IRM and the GF of clean speech present similar spectro-temporal structures, as the example in Fig. 1 shows.
This research was partly supported by the National Natural Science Foundation of China (No. 91120303, No. 61273267, No. 90820011, No. 61403370 and No. 61365006).
Figure 1: Left: the GF of clean speech; Right: the IRM at 0 dB. The structures of the GF and the IRM look very similar.
Mathematically, the IRM can be derived from the GFs of clean speech and noise as follows:

$$\mathrm{IRM}(t,f) = \frac{S^2(t,f)}{S^2(t,f) + N^2(t,f)} \qquad (1)$$

where $S^2(t,f)$ and $N^2(t,f)$ are the GFs of clean speech and noise in the T-F unit at channel $f$ and frame $t$, respectively.
Moreover, due to the sparse distribution of speech in the T-F domain, the GF retains a relatively invariant harmonic structure across various auditory environments, and the IRM is inherently bounded and less sensitive to estimation errors [14]. These correlations and this complementarity can be exploited for speech separation, but they have largely been ignored in previous work. Therefore, jointly modeling the IRM and the GF in one model will probably improve separation performance.
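As a rough illustration of Eq. (1), the following minimal NumPy sketch computes the IRM from precomputed gammatone power spectrograms of clean speech and noise and applies an estimated mask to a mixture GF; the gammatone front end, waveform resynthesis, and the small epsilon guarding against silent units are assumptions of this sketch, not details given in the paper.

```python
import numpy as np

def ideal_ratio_mask(S2, N2, eps=1e-10):
    """Eq. (1): IRM from the gammatone power spectra (GF) of clean speech
    (S2) and noise (N2), both shaped (channels, frames). Bounded in [0, 1].
    eps avoids division by zero in silent T-F units (implementation detail)."""
    return S2 / (S2 + N2 + eps)

def apply_mask(mixture_gf, irm):
    """Weight each T-F unit of the mixture GF by the estimated IRM; waveform
    resynthesis from the masked gammatone representation is not shown."""
    return mixture_gf * irm
```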
In this paper, we propose a multi-target deep neural network (DNN) to jointly model the IRM and the GF. Its target is the combination of the IRM and the GF of clean speech. To further improve the separation performance, a two-stage method is proposed. In the first stage, the proposed multi-target DNN is trained to learn a function that maps the noisy features to the joint targets for the whole frame, rather than for each individual T-F unit. Modeling at the frame level can capture the correlations over the T-F domain in speech. Moreover, to exploit the spectro-temporal structures in the speech auditory features and the joint targets, we use denoising autoencoders (DAEs) to model each of them by self-learning. The learned DAEs are combined with a linear transformation matrix $W_h$ to initialize the multi-target DNN. In addition, according to the differences in errors produced by the output nodes, a backpropagation (BP) algorithm with bias weights is further explored to fine-tune the multi-target DNN. In the second stage, the estimated IRM and GF are integrated into another DNN to obtain the final separation with higher smoothness and perceptual quality.
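To make the frame-level joint target concrete, the sketch below concatenates one frame's IRM and clean-speech GF into a single output vector for the first-stage DNN and splits the network output back into its two parts for the second stage; the 64-channel dimensionality and the log compression of the GF are assumptions of this sketch, not details taken from the paper.

```python
import numpy as np

N_CHANNELS = 64  # assumed gammatone channel count per frame

def joint_target(irm_frame, gf_frame, eps=1e-10):
    """Concatenate one frame's IRM and (log-compressed) clean-speech GF into a
    single 2*N_CHANNELS-dimensional training target for the multi-target DNN."""
    return np.concatenate([irm_frame, np.log(gf_frame + eps)])

def split_joint_output(y):
    """Split the first-stage DNN output back into its IRM and GF parts; these
    estimates feed the second-stage DNN that produces the final separation."""
    irm_est = np.clip(y[:N_CHANNELS], 0.0, 1.0)   # IRM is bounded in [0, 1]
    gf_est = np.exp(y[N_CHANNELS:])               # undo the log compression
    return irm_est, gf_est
```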
2. First Stage: Multi-Target Joint Learning
Typically, related tasks share some common information that
can be exploited to improve each other by the multi-task joint