SPEECH SEPARATION BASED ON SIGNAL-NOISE-DEPENDENT DEEP NEURAL
NETWORKS FOR ROBUST SPEECH RECOGNITION
Yan-Hui Tu¹, Jun Du¹, Li-Rong Dai¹, Chin-Hui Lee²

¹University of Science and Technology of China, Hefei, Anhui, P.R. China
²Georgia Institute of Technology, Atlanta, Georgia, USA
tuyanhui@mail.ustc.edu.cn, {jundu,lrdai}@ustc.edu.cn, chl@ece.gatech.edu

This work was supported by the National Natural Science Foundation of China under Grant No. 61305002.
ABSTRACT
In this paper, we propose a new signal-noise-dependent (SND) deep
neural network (DNN) framework to further improve the separation
and recognition performance of the recently developed technique for
general DNN-based speech separation. We adopt a divide-and-conquer
strategy to design the proposed SND-DNNs with higher resolutions,
since a single general DNN could not well accommodate all the speaker
mixing variabilities at different signal-to-noise ratio (SNR) levels.
In this study, two kinds of SNR-dependent DNNs,
namely positive and negative DNNs, are trained to cover the mixed
speech signals with positive and negative SNR levels, respectively.
At the separation stage, a first-pass separation using a general DNN
can give an accurate SNR estimate for model selection. Experi-
mental results on the Speech Separation Challenge (SSC) task show
that SND-DNNs could yield significant performance improvements
for both speech separation and recognition over a general DNN. Fur-
thermore, this purely front-end processing method achieves a rela-
tive word error rate reduction of 11.6% over a state-of-the-art recog-
nition system where a complicated joint decoding framework needs
to be implemented in the back-end.
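As a rough illustration of the two-pass scheme summarized above, the following Python sketch shows how a first-pass general DNN output could drive the selection between a positive-SNR and a negative-SNR model. The DNN objects with a predict method, the estimate_snr helper, and the power-spectrum feature assumption are illustrative placeholders, not the actual implementation described in this paper.

import numpy as np

def estimate_snr(mixed_features, target_estimate):
    """Rough SNR estimate (dB), assuming magnitude/power-spectrum features
    and that the interference is roughly the mixture minus the target."""
    p_target = np.sum(target_estimate ** 2)
    p_interf = np.sum((mixed_features - target_estimate) ** 2) + 1e-10
    return 10.0 * np.log10(p_target / p_interf + 1e-10)

def separate_with_snd_dnns(mixed_features, general_dnn,
                           positive_dnn, negative_dnn):
    """Conceptual two-pass SND-DNN separation sketch."""
    # First pass: rough separation with the general (SNR-independent) DNN.
    rough_target = general_dnn.predict(mixed_features)
    # Estimate the mixture SNR from the first-pass output.
    snr_db = estimate_snr(mixed_features, rough_target)
    # Model selection: positive-SNR DNN vs. negative-SNR DNN.
    snd_dnn = positive_dnn if snr_db >= 0.0 else negative_dnn
    # Second pass: separation with the selected SNR-dependent DNN.
    return snd_dnn.predict(mixed_features)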
Index Terms— single-channel speech separation, robust speech
recognition, deep neural networks, semi-supervised mode
1. INTRODUCTION
Speech separation aims at separating the voice of each speaker when
multiple speakers talk simultaneously. It is important for many ap-
plications, for example automatic speech recognition (ASR). While
significant progress has been made in improving the noise robust-
ness of ASR systems, most techniques focus on improving the per-
formance of the back-end recognizer. In this study, we use the sepa-
ration system as a front-end pre-processor for ASR, so the perfor-
mance of the ASR system depends heavily on the quality of acoustic
pre-processing. Separation algorithms can often be classified
into unsupervised and supervised modes. In the former, speaker
identities and the reference speech of each speaker are not available
at the training stage, while in the supervised mode the information
of both the target and the interfering speakers is provided.
One broad class of single-channel speech separation is the so-
called computational auditory scene analysis (CASA) [1], usually
in an unsupervised mode. CASA-based approaches [2]-[6] use
psychoacoustic cues, such as pitch, voice onset/offset, temporal con-
tinuity, harmonic structures, and modulation correlation, to segregate
a voice of interest by masking the interfering sources. For exam-
ple, in [5], pitch and amplitude modulation were adopted to separate
the voiced portions of co-channel speech. In [6], unsupervised clus-
tering was used to separate speech regions into two speaker groups
by maximizing the ratio of between-cluster and within-cluster dis-
tances. Recently, a data-driven approach [7] separates the underly-
ing clean speech segments by matching each mixed speech segment
against a composite training segment.
In the supervised approaches, speech separation is often formu-
lated as an estimation problem based on:
$x_m = x_t + x_i$  (1)
where $x_m$, $x_t$, $x_i$ are speech signals of the mixture, target speaker,
and interfering speaker, respectively. To solve this under-determined
equation, a general strategy is to represent the speakers by two mod-
els, and use a certain criterion to reconstruct the sources given the
single mixture. An early study in [8] adopted a factorial hidden
Markov model (FHMM) to describe a speaker, and the estimated
sources were used to generate a binary mask. To further impose tem-
poral constraints on speech signals for separation, the work in [9] in-
vestigates the phone-level dynamics using HMMs [10]. For FHMM-
based speech separation, 2-D Viterbi algorithms and approximations
have been used to perform the inference [11]. In [12], an FHMM was
adopted to model vocal tract characteristics for detecting pitch to re-
construct speech sources. In [13, 14, 15], Gaussian mixture models
(GMMs) were employed to model the speakers, and the minimum mean
squared error (MMSE) or maximum a posteriori (MAP) estimator was
used to recover the speech signals. The factorial-max vector quanti-
zation model (MAXVQ) was also used to infer the mask signals in
[16]. Other popular approaches include nonnegative matrix factor-
ization (NMF) based models [17].
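As a concrete reading of Eq. (1), the short sketch below forms a single-channel mixture of a target and an interfering utterance at a prescribed SNR; the function name, the NumPy waveform representation, and the equal-length assumption are illustrative choices, not part of any of the cited systems.

import numpy as np

def mix_at_snr(target, interferer, snr_db):
    """Form the mixture x_m = x_t + x_i of Eq. (1), rescaling the
    interferer so that 10*log10(P_t / P_i) equals snr_db.
    Both inputs are assumed to be 1-D waveforms of equal length."""
    p_target = np.mean(target ** 2)
    p_interf = np.mean(interferer ** 2)
    # Gain on the interferer that yields the desired target-to-interferer SNR.
    gain = np.sqrt(p_target / (p_interf * 10.0 ** (snr_db / 10.0)))
    x_i = gain * interferer
    x_m = target + x_i
    return x_m, x_i

Negative snr_db values correspond to mixtures dominated by the interfering speaker, the condition the negative SND-DNN introduced above is intended to cover.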
Recently, speech separation based on deep learning approaches has
become increasingly popular; these approaches can be divided into two
broad classes. One is in a supervised mode, where deep neural networks
(DNNs) or recurrent neural networks (RNNs) [18] are adopted to
separate the mixed speech given the information of the target speak-
er, interfering speaker, and even the signal-to-noise ratio (SNR). The
other one is in a semi-supervised mode where only the information
of the target speaker is provided. Our recent work [19, 20, 21] be-
longs to the latter. In [19, 20], we solve the separation problem in
Eq. (1) by using a DNN to directly model the highly nonlinear relation-
ship among the speech features of the target speaker, the interfering
speaker, and the mixed signals. Its effectiveness has also been verified
for robust speech recognition [21]. As our DNN approach is semi-
supervised, a large amount of training data with different interfering
speakers at different SNRs can be included to address the problem
of unseen information. However, a single general DNN might not