JOINT TRAINING OF FRONT-END AND BACK-END DEEP NEURAL NETWORKS FOR
ROBUST SPEECH RECOGNITION
Tian Gao (1), Jun Du (1), Li-Rong Dai (1), Chin-Hui Lee (2)
(1) University of Science and Technology of China, Hefei, Anhui, P. R. China
(2) Georgia Institute of Technology, Atlanta, Georgia, USA
gtian09@mail.ustc.edu.cn, {jundu,lrdai}@ustc.edu.cn, chl@ece.gatech.edu
ABSTRACT
Based on the recently proposed speech pre-processing front-end with deep
neural networks (DNNs), we first investigate different feature mappings
directly from noisy speech via DNN for robust speech recognition. Next, we
propose to jointly train a single DNN for both feature mapping and acoustic
modeling. Finally, we show that the word error rate (WER) of the jointly
trained system can be significantly reduced by fusing multiple DNN
pre-processing systems, which implies that features obtained from different
domains of the DNN-enhanced speech signals are strongly complementary.
Tested on the Aurora4 noisy speech recognition task, our best system with
multi-condition training achieves an average WER of 10.3%, a relative
reduction of 16.3% over our previous DNN pre-processing only system with a
WER of 12.3%. To the best of our knowledge, this represents the best
published result on the Aurora4 task without using any adaptation techniques.
Index Terms— robust speech recognition, deep neural
network, feature mapping, joint training, system fusion
1. INTRODUCTION
With the fast development of the mobile internet, speech-enabled
applications using automatic speech recognition (ASR) are becoming
increasingly popular. However, noise robustness remains one of the critical
issues in making ASR systems widely usable in the real world. Historically,
most ASR systems use Mel-frequency cepstral coefficients (MFCCs) and their
derivatives as speech features, and a set of Gaussian mixture continuous
density HMMs (CDHMMs) for modeling basic speech units. Many techniques
[1, 2, 3] have been proposed to address this issue. One category of
techniques is the so-called data-driven approach based on stereo data
[4, 5], which is also the topic of this study.
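As a point of reference for this conventional front-end, the following is a
minimal sketch of MFCC-plus-derivatives feature extraction using the librosa
library; the 13 static coefficients and first/second derivatives are typical
choices, not parameters taken from this paper.

    # Conventional GMM-HMM era front-end: static MFCCs plus derivatives.
    # A generic sketch; settings are typical, not taken from this paper.
    import numpy as np
    import librosa

    y, sr = librosa.load(librosa.ex("trumpet"), sr=16000)  # any mono waveform
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)     # static coefficients
    delta = librosa.feature.delta(mfcc)                    # first derivatives
    delta2 = librosa.feature.delta(mfcc, order=2)          # second derivatives
    features = np.vstack([mfcc, delta, delta2])            # 39-d vector per frame
    print(features.shape)                                  # (39, n_frames)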
The recent breakthrough of deep learning [6, 7], especially the application
of deep neural networks (DNNs) in the ASR area [8, 9, 10], marks a new
milestone: DNN-HMM acoustic modeling has become the state of the art,
replacing GMM-HMM. It is believed that the first several layers of a DNN
play the role of extracting highly nonlinear and discriminative features
which are robust to irrelevant variabilities. This makes DNN-HMM inherently
noise robust to some extent, as verified on the Aurora4 task [11].
(This work was supported by the National Natural Science Foundation of
China under Grant No. 61305002.)
In [12, 13], several front-end techniques were shown to yield further
performance gains on top of the DNN-HMM system for tasks with small
vocabularies or constrained grammars. For large vocabulary tasks, however,
the conventional enhancement approach in [14], effective for GMM-HMM
systems, might even lead to system degradation for DNN-HMM with log
mel-filterbank (LMFB) features under well-matched training-testing
conditions [11].
Meanwhile, the data-driven approaches using stereo data via recurrent
neural networks (RNNs) and DNNs proposed in [15, 16] can improve speech
recognition accuracy on small vocabulary tasks. More recently, masking
techniques [17, 18, 19] were successfully applied to noisy speech
recognition. In [19], an approach combining time-frequency masking with
feature mapping via DNN was claimed to achieve the best results on the
Aurora4 task. Unfortunately, for multi-condition training using DNN-HMM
with LMFB features, this approach still resulted in worse performance,
consistent with the conclusions in [11]. In [20], we proposed a
pre-processing approach using a DNN as a regression model to enhance noisy
speech for robust speech recognition, which was shown to outperform the
masking approach [19].
In this study, we report our recent progress in further improving the ASR
performance of multi-condition training, especially when both additive
noise and convolutional distortion are involved in the test data. First,
instead of extracting acoustic features from the enhanced speech waveform,
a DNN is adopted directly as a highly nonlinear mapping function to
estimate the clean speech features from observed noisy speech, as sketched
below.
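To make this concrete, the following is a minimal sketch of such a DNN
regression front-end in PyTorch. The 24-dimensional LMFB features, 11-frame
context window, sigmoid hidden layers, and mean squared error objective on
stereo (noisy/clean) data are illustrative assumptions, not necessarily the
configuration used in this work.

    # Sketch of a DNN used as a regression model for feature mapping:
    # a context window of noisy frames in, one clean-frame estimate out.
    import torch
    import torch.nn as nn

    N_MELS, CONTEXT = 24, 11  # assumed feature dimension and context width

    class FeatureMappingDNN(nn.Module):
        def __init__(self, n_mels=N_MELS, context=CONTEXT, hidden=2048):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(n_mels * context, hidden), nn.Sigmoid(),
                nn.Linear(hidden, hidden), nn.Sigmoid(),
                nn.Linear(hidden, hidden), nn.Sigmoid(),
                nn.Linear(hidden, n_mels),  # linear output layer for regression
            )

        def forward(self, noisy_window):
            # noisy_window: (batch, n_mels * context) -> (batch, n_mels)
            return self.net(noisy_window)

    model = FeatureMappingDNN()
    mse = nn.MSELoss()  # regression objective on stereo training pairs
    opt = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

    noisy = torch.randn(32, N_MELS * CONTEXT)  # stand-in for noisy LMFB windows
    clean = torch.randn(32, N_MELS)            # stand-in for paired clean frames
    loss = mse(model(noisy), clean)
    opt.zero_grad(); loss.backward(); opt.step()

At test time, the estimated clean features, rather than the noisy
observations, would be passed on to the back-end acoustic model.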
Second, we employ a hybrid DNN architecture to jointly train DNNs for both
feature mapping and acoustic modeling. The proposed joint training allows
error back-propagation to the feature mapping layers and the input of the
hybrid DNN is