Two-Stage Multi-Target Joint Learning for Monaural Speech Separation
Shuai Nie¹, Shan Liang¹, Wei Xue¹, Xueliang Zhang², Wenju Liu¹, Like Dong³, Hong Yang³
¹ National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
² College of Computer Science, Inner Mongolia University
³ Electric Power Research Institute of ShanXi Electric Power Company, China State Grid Corp
{shuai.nie, sliang, wxue, lwj}@nlpr.ia.ac.cn, cszxl@imu.edu.cn
Abstract
Recently, supervised speech separation has been extensively studied and has shown considerable promise. Due to the temporal continuity of speech, speech auditory features and separation targets present prominent spectro-temporal structures and strong correlations over the time-frequency (T-F) domain, which can be exploited for speech separation. However, many supervised speech separation methods model each T-F unit independently with only one target and largely ignore this useful information. In this paper, we propose a two-stage multi-target joint learning method that jointly models the related speech separation targets at the frame level. Systematic experiments show that the proposed approach consistently achieves better separation and generalization performance in low signal-to-noise ratio (SNR) conditions.
Index Terms: speech separation, multi-target learning, computational auditory scene analysis (CASA)
1. Introduction
In real-world environments, background interference substantially degrades speech intelligibility and the performance of many applications, such as speech communication and automatic speech recognition (ASR) [1, 7, 12, 18]. To address this issue, speech separation, which aims to extract the target speech signal from a mixture, has been studied for decades. However, effective speech separation in real-world environments remains challenging, especially when the signal-to-noise ratio (SNR) is low and only one microphone is available.
Speech separation can be formulated as a supervised learning problem [12, 24, 26]. Typically, a supervised speech separation system learns a function that maps noisy features extracted from the mixture to certain ideal masks or clean spectra, which can then be used to separate the target speech from the mixture. As a new trend, compared with traditional speech enhancement [13], supervised speech separation has shown substantial promise in challenging acoustic conditions [12, 24, 26].
Supervised speech separation has two main types of training targets, i.e., mask-based targets [23] and spectra-based ones [26]. For the mask-based targets, the algorithm learns the best approximation of an ideal mask computed from the clean and noisy speech, such as the ideal ratio mask (IRM) [14, 25], while for the spectra-based targets, it learns the best approximation of the clean speech spectra, such as the Gammatone frequency power spectrum (GF) [9].
This research was partly supported by the China National Natural Science Foundation (No. 91120303, No. 61273267, No. 90820011, No. 61403370 and No. 61365006).
Figure 1: Left: the GF of clean speech; Right: the IRM computed with the clean speech and white noise (mixed at 0 dB). Axes: gammatone channel (1-64) vs. frame.
Both the IRM and the GF can be used to generate separated speech with improved intelligibility and/or perceptual quality [23]. Intuitively, the IRM and the GF of clean speech present similar spectro-temporal structures, as shown by the example in Fig. 1. In fact, the IRM can be mathematically derived from the GFs of the clean speech and the noise, and is computed as follows:
\mathrm{IRM}(t, f) = \frac{S^{2}(t, f)}{S^{2}(t, f) + N^{2}(t, f)} \quad (1)
where S^2(t, f) and N^2(t, f) are the GFs of the clean speech and the noise in the time-frequency (T-F) unit at channel f and frame t, respectively. Moreover, due to the sparsity of speech in the T-F domain, the GF keeps a relatively invariant harmonic structure in various auditory environments, and the IRM is inherently bounded and less sensitive to estimation errors [15]. These correlations and this complementarity can be exploited for speech separation, but they have been largely ignored in previous works. Therefore, jointly modeling the IRM and the GF in one model is likely to improve the separation performance.
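For concreteness, Eq. (1) can be computed directly from the two GFs. The following is a minimal Python sketch; the function name and the small epsilon added for numerical stability are illustrative and not part of the original formulation.

```python
import numpy as np

def ideal_ratio_mask(gf_clean, gf_noise, eps=1e-12):
    """IRM of Eq. (1), computed from the Gammatone power spectra (GF)
    of clean speech and noise; inputs are (channels, frames) arrays of
    power values, i.e. S^2(t, f) and N^2(t, f)."""
    return gf_clean / (gf_clean + gf_noise + eps)

# Toy usage with random stand-in "GF" values (64 channels, 300 frames).
S2 = np.random.rand(64, 300)   # stands in for S^2(t, f)
N2 = np.random.rand(64, 300)   # stands in for N^2(t, f)
irm = ideal_ratio_mask(S2, N2)
assert np.all((irm >= 0) & (irm <= 1))  # the IRM is inherently bounded in [0, 1]
```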
In this paper, we propose a multi-target deep neural network (DNN) to jointly model the IRM and the GF. Its target is the combination of the IRM and the GF of clean speech. To further improve the separation performance, a two-stage method is explored. In the first stage, the multi-target DNN is trained to learn a function that maps the noisy features to the joint targets for all frequency channels in one frame. Compared with modeling individual T-F units, modeling at the frame level can capture the correlations over the frequency domain of speech.
correlations over the frequency domain in speech. Moreover, to
exploit the spectro-temporal structures in speech auditory fea-
tures and joint targets, we use denoising autoencoders (DAE)
to model them by self-learning, respectively. Then, the learned
DAEs are combined with a linear transformation matrix W
h
to
initialize the multi-target DNN. Finally, according to the differ-
ent errors produced by output nodes, a backpropagation (BP)
algorithm with bias weights is further explored to fine tune the
multi-target DNN. In the second stage, the estimated IRM and
GF are integrated into another DNN to obtain the final separa-
tion result with higher smoothness and perceptual quality.
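To make the first-stage setup concrete, the following Python (PyTorch) sketch shows a frame-level multi-target DNN whose output concatenates the IRM and the GF for all channels of one frame, trained with differently weighted errors on the two target groups in the spirit of the bias-weighted BP described above. The feature dimension, layer sizes, loss weights, and the omission of the DAE- and W_h-based initialization are illustrative assumptions, not the exact configuration used here.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: D-dimensional noisy features per frame and C = 64
# gammatone channels, so the joint target is [IRM (64) ; GF (64)] per frame.
D, C = 246, 64

# A plain feed-forward multi-target DNN. The first stage described above
# initializes the hidden layers from pre-trained DAEs combined through W_h;
# that initialization is omitted in this sketch.
net = nn.Sequential(
    nn.Linear(D, 1024), nn.Sigmoid(),
    nn.Linear(1024, 1024), nn.Sigmoid(),
    nn.Linear(1024, 2 * C),
)

def biased_multi_target_loss(pred, irm_target, gf_target, w_irm=1.0, w_gf=0.5):
    """Weight the errors of the IRM and GF output nodes differently;
    the weights here are illustrative placeholders."""
    irm_pred, gf_pred = pred[:, :C], pred[:, C:]
    mse = nn.functional.mse_loss
    return w_irm * mse(irm_pred, irm_target) + w_gf * mse(gf_pred, gf_target)

# Toy usage: a batch of 32 frames with random targets.
x = torch.randn(32, D)
irm_t, gf_t = torch.rand(32, C), torch.rand(32, C)
loss = biased_multi_target_loss(net(x), irm_t, gf_t)
loss.backward()
```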