as those possible with deep belief network pre-training when
training a deep autoencoder (the encoder and decoder in their
architecture each had three hidden layers) with a nonlinear
conjugate gradient algorithm. Both [56] and [57] investigate
why training deep feed-forward neural networks can often
be easier with some form of pre-training or a sophisticated
optimizer of the sort used in [58].
Since the time of the early hybrid architectures, the vector
processing capabilities of modern GPUs and the advent of
more effective training algorithms for deep neural nets have
made much more powerful architectures feasible. Much previ-
ous hybrid ANN-HMM work focused on context-independent
or rudimentary context-dependent phone models and small
to mid-vocabulary tasks (with notable exceptions such as
[45]), possibly masking some of the potential advantages of
the ANN-HMM hybrid approach. Additionally, GMM-HMM
training is much easier to parallelize in a computer cluster
setting, which historically gave such systems a significant
advantage in scalability. Also, since speaker and environment
adaptation is generally easier for GMM-HMM systems, the
GMM-HMM approach has been the dominant one in the past
two decades for speech recognition. That being said, if we
consider the wider use of neural networks in acoustic modeling
beyond the hybrid approach, neural network feature extraction
is an important component of many state-of-the-art acoustic
models.
B. Introduction to the DNN-HMM approach
The primary contributions of this work are the develop-
ment of a context-dependent, pre-trained, deep neural network
HMM hybrid acoustic model (CD-DNN-HMM); a description
of our recipe for applying this sort of model to LVSR prob-
lems; and an analysis of our results which show substantial
improvements in recognition accuracy for a difficult LVSR
task over discriminatively-trained pure CD-GMM-HMM sys-
tems. Our work differs from earlier context-dependent ANN-
HMMs [41], [42] in two key respects. First, we used deeper,
more expressive neural network architectures and thus em-
ployed the unsupervised DBN pre-training algorithm to make
sure training would be effective. Second, we used posterior
probabilities of senones (tied triphone HMM states) [48] as
the output of the neural network, instead of the combination of
context-independent phone and context class used previously
in hybrid architectures. This second difference also distin-
guishes our work from earlier uses of DNN-HMM hybrids for
phone recognition [30]–[32], [59]. Note that [59], which also
appears in this issue, is the context-independent version of our
approach and builds the foundation for our work. The work in
this paper focuses on context-dependent DNN-HMMs using
posterior probabilities of senones as network outputs and can
be successfully applied to large vocabulary tasks. Training the
neural network to predict a distribution over senones causes
more bits of information to be present in the neural network
training labels. It also incorporates context-dependence into
the neural network outputs (which, since we are not using a
Tandem approach, lets us use a decoder based on triphone
HMMs), and it may have additional benefits. Our evaluation
was done on LVSR instead of phoneme recognition tasks as
was the case in [30]–[32], [59]. It represents the first large
vocabulary application of a pre-trained, deep neural network
approach. Our results show that our CD-DNN-HMM sys-
tem provides dramatic improvements over a discriminatively
trained CD-GMM-HMM baseline.
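To make the output-layer difference concrete, the following is a
minimal NumPy sketch (not the system described in this paper) of a
forward pass through a feed-forward network whose softmax layer ranges
over a senone inventory rather than over context-independent phones.
The input dimensionality, hidden-layer width, and the NUM_SENONES
constant are illustrative assumptions only.

```python
import numpy as np

# Illustrative sizes only; a real system uses the senone inventory of
# its underlying CD-GMM-HMM, often several thousand tied states.
INPUT_DIM = 429        # e.g., 39-dim features, 11-frame context (assumed)
HIDDEN_DIM = 512       # hidden layer width (assumed)
NUM_SENONES = 9000     # tied triphone HMM states (assumed)

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Randomly initialized weights stand in for pre-trained, fine-tuned ones.
weights = [rng.normal(0, 0.01, (INPUT_DIM, HIDDEN_DIM)),
           rng.normal(0, 0.01, (HIDDEN_DIM, HIDDEN_DIM)),
           rng.normal(0, 0.01, (HIDDEN_DIM, NUM_SENONES))]
biases = [np.zeros(HIDDEN_DIM), np.zeros(HIDDEN_DIM), np.zeros(NUM_SENONES)]

def senone_posteriors(features):
    """Forward pass: sigmoid hidden layers, softmax output over senones."""
    h = features
    for W, b in zip(weights[:-1], biases[:-1]):
        h = sigmoid(h @ W + b)
    return softmax(h @ weights[-1] + biases[-1])

frame = rng.normal(size=INPUT_DIM)
posteriors = senone_posteriors(frame)
print(posteriors.shape, round(posteriors.sum(), 6))  # (9000,) 1.0
```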
The remainder of this paper is organized as follows. In
section II we briefly introduce restricted Boltzmann machines
(RBMs) and deep belief nets, and outline the general pre-
training strategy we use. In section III, we describe the
basic ideas, the key properties, and the training and decoding
strategies of our CD-DNN-HMMs. In section IV we analyze
experimental results on a 65K+ vocabulary business search
dataset collected from the Bing mobile voice search applica-
tion (formerly known as Live Search for mobile [36], [60])
under real usage scenarios. Section V offers conclusions and
directions for future work.
II. DEEP BELIEF NETWORKS
Deep belief networks (DBNs) are probabilistic generative
models with multiple layers of stochastic hidden units above
a single bottom layer of observed variables that represent a
data vector. DBNs have undirected connections between the
top two layers and directed connections to all other layers from
the layer above. There is an efficient unsupervised algorithm,
first described in [24], for learning the connection weights in a
DBN that is equivalent to training each adjacent pair of layers
as a restricted Boltzmann machine (RBM). There is also a
fast, approximate, bottom-up inference algorithm to infer the
states of all hidden units conditioned on a data vector. After this
unsupervised pre-training phase, Hinton et al. [24] used the
up-down algorithm to optimize all of the DBN weights jointly.
During this fine-tuning phase, a supervised objective function
could also be optimized.
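As a rough illustration of the layer-by-layer view above, the sketch
below trains a stack of Bernoulli-Bernoulli RBMs with one-step
contrastive divergence (CD-1), feeding each layer's hidden activations
to the next RBM as its data. The unit type, learning rate, layer sizes,
and toy data are assumptions made for the example, and the up-down
fine-tuning step is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, num_hidden, epochs=5, lr=0.05, batch=64):
    """One-step contrastive divergence (CD-1) for a Bernoulli-Bernoulli RBM."""
    num_visible = data.shape[1]
    W = rng.normal(0, 0.01, (num_visible, num_hidden))
    b_vis = np.zeros(num_visible)
    b_hid = np.zeros(num_hidden)
    for _ in range(epochs):
        for i in range(0, len(data), batch):
            v0 = data[i:i + batch]
            # Positive phase: hidden probabilities given the data.
            h0 = sigmoid(v0 @ W + b_hid)
            h0_sample = (rng.random(h0.shape) < h0).astype(float)
            # Negative phase: one step of alternating Gibbs sampling.
            v1 = sigmoid(h0_sample @ W.T + b_vis)
            h1 = sigmoid(v1 @ W + b_hid)
            # Approximate gradient of the log-likelihood.
            W += lr * (v0.T @ h0 - v1.T @ h1) / len(v0)
            b_vis += lr * (v0 - v1).mean(axis=0)
            b_hid += lr * (h0 - h1).mean(axis=0)
    return W, b_hid

def pretrain_stack(data, layer_sizes):
    """Greedy layer-wise pre-training: each RBM is trained on the hidden
    activations of the RBM below it."""
    weights, inputs = [], data
    for size in layer_sizes:
        W, b_hid = train_rbm(inputs, size)
        weights.append((W, b_hid))
        inputs = sigmoid(inputs @ W + b_hid)
    return weights

# Toy binary data in place of real acoustic features.
toy_data = (rng.random((256, 100)) < 0.3).astype(float)
stack = pretrain_stack(toy_data, [64, 64, 64])
```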
In this work, we use the DBN weights resulting from the
unsupervised pre-training algorithm to initialize the weights of
a deep, but otherwise standard, feed-forward neural network
and then simply use the backpropagation algorithm [61] to
fine-tune the network weights with respect to a supervised
criterion. Pre-training followed by stochastic gradient descent
is our method of choice for training deep neural networks
because it often outperforms random initialization for the
deeper architectures we are interested in training and yields
results that are robust to the choice of initial random seed.
The generative model learned during pre-training helps prevent
overfitting, even when using models with very high capacity,
and can aid in the subsequent optimization of the recognition
weights.
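A minimal sketch of this initialize-then-fine-tune step follows,
assuming NumPy and illustrative layer sizes: stand-ins for the
pre-trained weights become the hidden layers of a feed-forward
classifier, a randomly initialized softmax layer is placed on top, and
a single stochastic gradient descent step on the frame-level cross
entropy performs one step of backpropagation fine-tuning.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Stand-ins for weights produced by unsupervised pre-training
# (in practice these come from the stacked RBMs; shapes are illustrative).
W1, b1 = rng.normal(0, 0.1, (100, 64)), np.zeros(64)
W2, b2 = rng.normal(0, 0.1, (64, 64)), np.zeros(64)
# Randomly initialized output (softmax) layer over the target classes.
num_classes = 10
W3, b3 = rng.normal(0, 0.01, (64, num_classes)), np.zeros(num_classes)

# A toy mini-batch of input frames and frame labels.
x = rng.normal(size=(32, 100))
labels = rng.integers(0, num_classes, size=32)
lr = 0.1

# Forward pass.
h1 = sigmoid(x @ W1 + b1)
h2 = sigmoid(h1 @ W2 + b2)
probs = softmax(h2 @ W3 + b3)

# Backward pass: cross-entropy gradient at the softmax layer, then
# backpropagate through the sigmoid hidden layers.
d3 = probs.copy()
d3[np.arange(len(labels)), labels] -= 1.0
d3 /= len(labels)
d2 = (d3 @ W3.T) * h2 * (1.0 - h2)
d1 = (d2 @ W2.T) * h1 * (1.0 - h1)

# Plain stochastic gradient descent update.
W3 -= lr * (h2.T @ d3)
b3 -= lr * d3.sum(axis=0)
W2 -= lr * (h1.T @ d2)
b2 -= lr * d2.sum(axis=0)
W1 -= lr * (x.T @ d1)
b1 -= lr * d1.sum(axis=0)
```

In this view the pre-training only supplies the starting point; the
supervised fine-tuning itself is an entirely standard backpropagation
procedure applied to an otherwise conventional feed-forward network.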
Although empirical results ultimately are the best reason for
the use of a technique, our motivation for even trying to find
and apply deeper models that might be capable of learning
rich, distributed representations of their input is also based on
formal and informal arguments by other researchers in the
machine learning community. As argued in [62] and [63],
insufficiently deep architectures can require an exponential
blow-up in the number of computational elements needed to
represent certain functions satisfactorily. Thus one primary
motivation for using deeper models such as neural networks