Conversational Speech Transcription
Using Context-Dependent Deep Neural Networks
Frank Seide^1, Gang Li^1, and Dong Yu^2
^1 Microsoft Research Asia, Beijing, P.R.C.
^2 Microsoft Research, Redmond, USA
{fseide,ganl,dongyu}@microsoft.com
Abstract
We apply the recently proposed Context-Dependent Deep-Neural-Network HMMs, or CD-DNN-HMMs, to speech-to-text transcription. For single-pass speaker-independent recognition on the RT03S Fisher portion of the phone-call transcription benchmark (Switchboard), the word-error rate is reduced from 27.4%, obtained by discriminatively trained Gaussian-mixture HMMs, to 18.5%, a 33% relative improvement.
CD-DNN-HMMs combine classic artificial-neural-network HMMs with traditional tied-state triphones and deep-belief-network pre-training. They had previously been shown to reduce errors by 16% relative when trained on tens of hours of data using hundreds of tied states. This paper takes CD-DNN-HMMs further and applies them to transcription using over 300 hours of training data, over 9000 tied states, and up to 9 hidden layers, and demonstrates how sparseness can be exploited.
On four less well-matched transcription tasks, we observe
relative error reductions of 22–28%.
Index Terms: speech recognition, deep belief networks, deep
neural networks
1. Introduction
Since the early 1990s, artificial neural networks (ANNs) have been used to model the state emission probabilities of HMM speech recognizers [1]. While traditional Gaussian mixture model (GMM)-HMMs model context dependency through tied context-dependent states (e.g. CART-clustered crossword triphones [2]), ANN-HMMs were never used to do so directly. Instead, networks were often factorized, e.g. into a monophone and a context-dependent part [3], or hierarchically decomposed [4]. It has been commonly assumed that hundreds or thousands of triphone states were just too many to be accurately modeled or trained in a neural network. Only recently did Yu et al. discover that doing so is not only feasible but works very well [5].
Context-dependent deep-neural-network HMMs, or CD-DNN-HMMs [5, 6], apply the classical ANN-HMMs of the 1990s to traditional tied-state triphones directly, exploiting Hinton's deep-belief-network (DBN) pre-training procedure. This was shown to lead to a very promising and possibly disruptive acoustic model, as indicated by a 16% relative recognition error reduction over discriminatively trained GMM-HMMs on a business search task [5, 6], which features short query utterances, tens of hours of training data, and hundreds of tied states.
This paper takes this model a step further and serves several purposes. First, we show that the exact same CD-DNN-HMM can be effectively scaled up in terms of training-data size (from 24 hours to over 300), model complexity (from 761 tied triphone states to over 9000), depth (from 5 to 9 hidden layers), and task (from voice queries to speech-to-text transcription). This is demonstrated on a publicly available benchmark, the Switchboard phone-call transcription task (2000 NIST Hub5 and RT03S sets). We should note here that ANNs have been trained on up to 2000 hours of speech before [7], but with many fewer output units (monophones) and fewer hidden layers.
Second, we advance the CD-DNN-HMMs by introducing weight sparseness and the related learning strategy, and demonstrate that this can reduce recognition error or model size.
Third, we present the statistical view of the multi-layer perceptron (MLP) and DBN and provide empirical evidence for understanding which factors contribute most to the accuracy improvements achieved by the CD-DNN-HMMs.
2. The Context-Dependent
Deep Neural Network HMM
A deep neural network (DNN) is a conventional multi-layer perceptron (MLP, [8]) with many hidden layers, optionally initialized using the DBN pre-training algorithm. In the following, we recap the DNN from a statistical viewpoint and describe its integration with context-dependent HMMs for speech recognition. For a more detailed description, please refer to [6].
2.1. Multi-Layer Perceptron—A Statistical View
An MLP as used in this paper models the posterior probability $P_{s|o}(s|o)$ of a class $s$ given an observation vector $o$, as a stack of $(L+1)$ layers of log-linear models. The first $L$ layers, $\ell = 0 \ldots L-1$, model posterior probabilities of hidden binary vectors $h^\ell$ given input vectors $v^\ell$, while the top layer $L$ models the desired class posterior as

\[
P_{h|v}^{\ell}(h^\ell \,|\, v^\ell) = \prod_{j=1}^{N^\ell} \frac{e^{z_j^\ell(v^\ell) \cdot h_j^\ell}}{e^{z_j^\ell(v^\ell) \cdot 1} + e^{z_j^\ell(v^\ell) \cdot 0}}, \quad 0 \le \ell < L
\]
\[
P_{s|v}^{L}(s \,|\, v^L) = \frac{e^{z_s^L(v^L)}}{\sum_{s'} e^{z_{s'}^L(v^L)}} = \mathrm{softmax}_s\!\left(z^L(v^L)\right)
\]
\[
z^\ell(v^\ell) = (W^\ell)^T v^\ell + a^\ell
\]

with weight matrices $W^\ell$ and bias vectors $a^\ell$, where $h_j^\ell$ and $z_j^\ell(v^\ell)$ are the $j$-th components of $h^\ell$ and $z^\ell(v^\ell)$, respectively.
The precise modeling of $P_{s|o}(s|o)$ requires integration over all possible values of $h^\ell$ across all layers, which is infeasible. An effective practical trick is to replace the marginalization with the "mean-field approximation" [9]. Given observation $o$, we set $v^0 = o$ and choose the conditional expectation $E_{h|v}\{h^\ell \,|\, v^\ell\} = \sigma\!\left(z^\ell(v^\ell)\right)$ as input $v^{\ell+1}$ to the next layer, where $\sigma_j(z) = 1/(1 + e^{-z_j})$.
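The mean-field forward pass can be sketched end to end in NumPy: sigmoid hidden layers feeding a softmax top layer. Layer sizes and random weights here are illustrative assumptions only:

```python
import numpy as np

def sigmoid(z):
    """Mean-field expectation E{h^l | v^l} = sigma(z^l(v^l))."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Numerically stable softmax over the top-layer scores z^L(v^L)."""
    e = np.exp(z - z.max())
    return e / e.sum()

def dnn_posteriors(o, weights, biases):
    """Forward pass: v^0 = o; v^{l+1} = sigmoid((W^l)^T v^l + a^l)
    for the hidden layers; the top layer applies softmax to get P(s|o)."""
    v = o
    for W, a in zip(weights[:-1], biases[:-1]):  # hidden layers 0..L-1
        v = sigmoid(W.T @ v + a)
    W_L, a_L = weights[-1], biases[-1]           # top layer L
    return softmax(W_L.T @ v + a_L)

# Illustrative dimensions (not from the paper): 6-dim input,
# two hidden layers of 5 units, 4 output classes.
rng = np.random.default_rng(1)
dims = [6, 5, 5, 4]
weights = [0.1 * rng.standard_normal((m, n)) for m, n in zip(dims[:-1], dims[1:])]
biases = [np.zeros(n) for n in dims[1:]]
p = dnn_posteriors(rng.standard_normal(6), weights, biases)
assert np.isclose(p.sum(), 1.0) and np.all(p > 0)
```

In the CD-DNN-HMM, the output classes $s$ would be the tied triphone states; for decoding, these posteriors are converted to scaled likelihoods by dividing by the state priors, as in classical hybrid ANN-HMM systems.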
INTERSPEECH 2011, 28–31 August 2011, Florence, Italy. Copyright © 2011 ISCA.