Multi-objective Learning and Mask-based Post-processing for Deep Neural Network based Speech Enhancement

Yong Xu¹∗, Jun Du¹, Zhen Huang², Li-Rong Dai¹, Chin-Hui Lee²

¹ National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, China
² School of Electrical and Computer Engineering, Georgia Institute of Technology, USA

xuyong62@mail.ustc.edu.cn, jundu@ustc.edu.cn, chl@ece.gatech.edu

∗ This work was done while Yong Xu was visiting Georgia Tech in 2014-2015.
Abstract
We propose a multi-objective framework to learn both secondary targets not directly related to the intended task of speech enhancement (SE) and the primary target of the clean log-power spectra (LPS) features to be used directly for constructing the enhanced speech signals. In deep neural network (DNN) based SE we introduce an auxiliary structure to learn secondary continuous features, such as mel-frequency cepstral coefficients (MFCCs), and categorical information, such as the ideal binary mask (IBM), and integrate it into the original DNN architecture for joint optimization of all the parameters. This joint estimation scheme imposes additional constraints not available in the direct prediction of LPS, and potentially improves the learning of the primary target. Furthermore, the learned secondary information can be used as a byproduct for other purposes, e.g., the IBM-based post-processing in this work. A series of experiments show that joint LPS and MFCC learning improves the SE performance, and IBM-based post-processing further enhances the listening quality of the reconstructed speech.
Index Terms: speech enhancement, deep neural network, minimum mean square error, multi-objective learning, binary mask
1. Introduction
Classical speech enhancement (SE) approaches, such as spectral subtraction [1], the MMSE-based spectral amplitude estimator [2, 3] and the optimally modified log-MMSE estimator [4, 5], are considered unsupervised techniques and have been studied extensively for several decades. Based on key assumptions about the interaction between speech and noise, tremendous progress has been made with these techniques. However, some issues, such as rapidly changing noise (e.g., machine gun noise [6]) and negative spectrum estimates, still need to be addressed; the basic spectral subtraction sketch below illustrates the latter.
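As a point of reference for the classical methods above, the following is a minimal, illustrative sketch of power spectral subtraction, not the method proposed in this paper. The noise estimate from leading noise-only frames and the flooring constant are assumptions made for the example.

```python
import numpy as np

def spectral_subtraction(noisy_stft, noise_frames=10, floor=1e-3):
    """Basic power spectral subtraction (illustrative sketch only).

    noisy_stft: complex STFT of the noisy signal, shape (freq_bins, frames).
    The noise power is estimated from the first `noise_frames` frames,
    assumed to be noise-only -- an assumption of this example.
    """
    noisy_power = np.abs(noisy_stft) ** 2
    noise_power = noisy_power[:, :noise_frames].mean(axis=1, keepdims=True)

    # The subtraction can drive the estimate negative; flooring is the usual
    # fix, at the cost of residual "musical noise" artifacts.
    clean_power = np.maximum(noisy_power - noise_power, floor * noisy_power)

    # Reuse the noisy phase for reconstruction.
    return np.sqrt(clean_power) * np.exp(1j * np.angle(noisy_stft))
```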
On the other hand, supervised machine learning approaches have also been developed in recent years and have been shown to generate enhanced speech of good quality [7]. Non-negative matrix factorization (NMF) based speech enhancement [7, 8] is one notable example, in which speech and noise basis models are learned separately from training speech and noise databases; the clean speech can then be recovered by decomposing the noisy speech onto these bases, as sketched below. However, speech and noise are assumed to be uncorrelated, which limits the quality of the enhanced speech signals.
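The following is a minimal sketch of the NMF-based scheme described in [7, 8], using plain multiplicative updates for the Frobenius cost; the basis sizes, iteration counts and Wiener-like masking step are illustrative assumptions rather than details taken from those papers.

```python
import numpy as np

def nmf_fit(V, rank, iters=200, eps=1e-8):
    """Learn basis W and activations H for V ~ W @ H (multiplicative updates)."""
    F, T = V.shape
    rng = np.random.default_rng(0)
    W, H = rng.random((F, rank)) + eps, rng.random((rank, T)) + eps
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def nmf_enhance(V_noisy, W_speech, W_noise, iters=200, eps=1e-8):
    """Decompose a noisy magnitude spectrogram on fixed speech/noise bases."""
    W = np.hstack([W_speech, W_noise])
    rng = np.random.default_rng(0)
    H = rng.random((W.shape[1], V_noisy.shape[1])) + eps
    for _ in range(iters):                      # update activations only; bases stay fixed
        H *= (W.T @ V_noisy) / (W.T @ W @ H + eps)
    V_s = W_speech @ H[:W_speech.shape[1]]      # speech part of the reconstruction
    V_n = W_noise @ H[W_speech.shape[1]:]
    return V_s / (V_s + V_n + eps) * V_noisy    # Wiener-like masking of the noisy spectrogram
```

Here `W_speech` and `W_noise` would be learned with `nmf_fit` on clean-speech and noise spectrograms, respectively.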
Following recent successes in deep learning based speech processing [9, 10, 11], we have recently proposed a deep neural network (DNN) based speech enhancement framework [12, 13, 14] in which a DNN is regarded as a regression model to predict the clean log-power spectra (LPS) features [15] from noisy LPS features. The DNN acts as a mapping function that learns the relationship between clean and noisy speech features without imposing explicit assumptions about their interaction. Similar DNN-based speech denoising methods were also proposed in [16, 17]. In [18, 19], DNN-based methods were demonstrated to outperform NMF-based methods in speech separation. In DNN-based speech enhancement, the minimum mean square error (MMSE) between the target features and the predicted features is typically used as the objective function. It is difficult to design a better cost function to directly optimize the DNN model, especially for features that are correlated. In [19] it was shown that other cost functions, such as the Kullback-Leibler divergence [20] or the Itakura-Saito divergence [21], all performed worse than the MMSE.
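To make the baseline concrete, below is a minimal sketch of a regression DNN trained with an MMSE (mean squared error) objective to map noisy LPS features to clean LPS features, in the spirit of [12, 13]. The use of PyTorch, the context window, layer sizes, activation function and optimizer are illustrative assumptions, not the exact configuration of the cited work.

```python
import torch
import torch.nn as nn

# A context window of noisy LPS frames in, a single clean LPS frame out.
N_FREQ, CONTEXT = 257, 7          # illustrative sizes, not taken from the paper

class RegressionDNN(nn.Module):
    def __init__(self, hidden=2048, layers=3):
        super().__init__()
        dims = [N_FREQ * CONTEXT] + [hidden] * layers
        blocks = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            blocks += [nn.Linear(d_in, d_out), nn.Sigmoid()]
        self.body = nn.Sequential(*blocks)
        self.out = nn.Linear(hidden, N_FREQ)   # linear output layer for clean LPS

    def forward(self, noisy_lps):
        return self.out(self.body(noisy_lps))

model = RegressionDNN()
criterion = nn.MSELoss()                       # the MMSE objective
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

def train_step(noisy_batch, clean_batch):
    """One mini-batch update minimizing the MSE between predicted and clean LPS."""
    optimizer.zero_grad()
    loss = criterion(model(noisy_batch), clean_batch)
    loss.backward()
    optimizer.step()
    return loss.item()
```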
In this paper, a multi-objective learning framework is proposed to optimize a joint objective function, encompassing errors not only in the primary clean LPS features but also in secondary targets for continuous features, such as MFCCs, and for categorical information, such as the ideal binary mask (IBM) [22]. This joint optimization of different but related targets can potentially improve the DNN prediction of the primary LPS target, which is then used to reconstruct the enhanced waveform. In the LPS domain, the target values of different frequency bins are predicted independently without any correlation constraint, and knowledge from auditory perception [23] is not easily utilized. In the MFCC domain, by contrast, mel filtering is applied first, so the correlation across frequency channels is captured in the MFCC coefficients. Furthermore, the IBM is a central concept in computational auditory scene analysis (CASA) [23]. The IBM, which encodes whether each time-frequency unit is noise-dominant or speech-dominant, can also improve DNN training, and the estimated IBM can further be used for post-processing. Finally, MFCC and IBM targets can be combined to help predict the target clean LPS features, as in the sketch below.
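The sketch below shows one way such a joint objective could be set up, assuming the output layer is extended with MFCC and IBM branches alongside the primary LPS output. The branch structure, the per-target weights and the cross-entropy treatment of the IBM are assumptions made for the example, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

N_FREQ, N_MFCC = 257, 13          # illustrative dimensions

class MultiObjectiveDNN(nn.Module):
    """Shared hidden layers with three output branches: LPS, MFCC and IBM."""
    def __init__(self, in_dim=N_FREQ * 7, hidden=2048):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.Sigmoid(),
            nn.Linear(hidden, hidden), nn.Sigmoid(),
        )
        self.lps_out = nn.Linear(hidden, N_FREQ)    # primary continuous target
        self.mfcc_out = nn.Linear(hidden, N_MFCC)   # secondary continuous target
        self.ibm_out = nn.Linear(hidden, N_FREQ)    # secondary categorical target (logits)

    def forward(self, x):
        h = self.shared(x)
        return self.lps_out(h), self.mfcc_out(h), self.ibm_out(h)

mse = nn.MSELoss()
bce = nn.BCEWithLogitsLoss()

def multi_objective_loss(outputs, lps_ref, mfcc_ref, ibm_ref, alpha=0.1, beta=0.1):
    """Primary LPS error plus weighted secondary errors; the weights are assumptions."""
    lps_hat, mfcc_hat, ibm_logits = outputs
    return (mse(lps_hat, lps_ref)
            + alpha * mse(mfcc_hat, mfcc_ref)
            + beta * bce(ibm_logits, ibm_ref))
```

All branches share the hidden layers, so gradients from the secondary MFCC and IBM errors also shape the representation used for the primary LPS prediction, which is the constraint the joint optimization is meant to impose.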
In our SE experiments, we find that learning MFCC and/or IBM as secondary tasks improves DNN-based speech enhancement. Furthermore, IBM-based post-processing gives an additional 1.5 dB improvement in segmental signal-to-noise ratio (SSNR) [15].
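For reference, the SSNR figures quoted here follow the usual segmental SNR definition; a minimal sketch of that metric is given below, with the frame length and the per-frame clamping range stated as assumptions rather than values from [15].

```python
import numpy as np

def segmental_snr(clean, enhanced, frame_len=256, min_db=-10.0, max_db=35.0):
    """Average per-frame SNR in dB between a clean and an enhanced waveform.

    Per-frame values are clamped to [min_db, max_db], as is common practice;
    the exact limits and frame length here are assumptions of this sketch.
    """
    n_frames = len(clean) // frame_len
    snrs = []
    for i in range(n_frames):
        s = clean[i * frame_len:(i + 1) * frame_len]
        e = enhanced[i * frame_len:(i + 1) * frame_len]
        noise_energy = np.sum((s - e) ** 2) + 1e-10
        snr = 10.0 * np.log10(np.sum(s ** 2) / noise_energy + 1e-10)
        snrs.append(np.clip(snr, min_db, max_db))
    return float(np.mean(snrs))
```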
2. Multi-objective Learning for DNN-based
Speech Enhancement
In [12, 13], a DNN is adopted as a mapping function to predict the clean LPS features from the noisy LPS features. The relationship between the clean and noisy speech features can be