layers) autoencoder with a nonlinear conjugate gradient algo-
rithm. Both [56] and [57] investigate why training deep feed-for-
ward neural networks can often be easier with some form of
pre-training or a sophisticated optimizer of the sort used in [58].
Since the time of the early hybrid architectures, the vector
processing capabilities of modern GPUs and the advent of more
effective training algorithms for deep neural nets have made
much more powerful architectures feasible. Much previous hy-
brid ANN-HMM work focused on context-independent or rudi-
mentary context-dependent phone models and small to mid-vo-
cabulary tasks (with notable exceptions such as [45]), possibly
masking some of the potential advantages of the ANN-HMM
hybrid approach. Additionally, GMM-HMM training is much
easier to parallelize in a computer cluster setting, which his-
torically gave such systems a significant advantage in scala-
bility. Also, since speaker and environment adaptation is gener-
ally easier for GMM-HMM systems, the GMM-HMM approach
has been the dominant one in the past two decades for speech
recognition. That being said, if we consider the wider use of
neural networks in acoustic modeling beyond the hybrid ap-
proach, neural network feature extraction is an important com-
ponent of many state-of-the-art acoustic models.
B. Introduction to the DNN-HMM Approach
The primary contributions of this work are the development
of a context-dependent, pre-trained, deep neural network HMM
hybrid acoustic model (CD-DNN-HMM); a description of our
recipe for applying this sort of model to LVSR problems; and an
analysis of our results which show substantial improvements in
recognition accuracy for a difficult LVSR task over discrimina-
tively-trained pure CD-GMM-HMM systems. Our work differs
from earlier context-dependent ANN-HMMs [42], [41] in two
key respects. First, we used deeper, more expressive neural
network architectures and thus employed the unsupervised
DBN pre-training algorithm to make sure training would be
effective. Second, we used posterior probabilities of senones
(tied triphone HMM states) [48] as the output of the neural
network, instead of the combination of context-independent
phone and context class used previously in hybrid architectures.
This second difference also distinguishes our work from earlier
uses of DNN-HMM hybrids for phone recognition [30]–[32],
[59]. Note that [59], which also appears in this issue, is the
context-independent version of our approach and builds the
foundation for our work. The work in this paper focuses on
context-dependent DNN-HMMs using posterior probabilities
of senones as network outputs and can be successfully applied
to large vocabulary tasks. Training the neural network to predict
a distribution over senones causes more bits of information to
be present in the neural network training labels. It also incor-
porates context-dependence into the neural network outputs
(which, since we are not using a Tandem approach, lets us use a
decoder based on triphone HMMs), and it may have additional
benefits. Our evaluation was done on LVSR instead of phoneme
recognition tasks as was the case in [30]–[32], [59]. It repre-
sents the first large-vocabulary application of a pre-trained,
deep neural network approach. Our results show that our
CD-DNN-HMM system provides dramatic improvements over
a discriminatively trained CD-GMM-HMM baseline.
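To make the use of senone posteriors concrete, the sketch below shows how the network's outputs can be turned into the scaled likelihoods a conventional triphone HMM decoder expects, by dividing each posterior by a senone prior estimated from the frame-level training alignment. This is a minimal numpy illustration; the function and variable names are illustrative only.

import numpy as np

def log_senone_priors(alignment_labels, num_senones, floor=1e-8):
    # Estimate log p(senone) from frame-level counts in the training alignment.
    counts = np.bincount(alignment_labels, minlength=num_senones).astype(float)
    priors = counts / counts.sum()
    return np.log(np.maximum(priors, floor))

def scaled_log_likelihoods(log_posteriors, log_priors):
    # Convert log p(senone | frame), shape (T, S), into log p(frame | senone)
    # plus a constant by subtracting the senone log-priors; the constant
    # log p(frame) is shared by all senones and does not affect decoding.
    return log_posteriors - log_priors[np.newaxis, :]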
The remainder of this paper is organized as follows. In
Section II, we briefly introduce RBMs and deep belief nets, and
outline the general pre-training strategy we use. In Section III,
we describe the basic ideas, the key properties, and the training
and decoding strategies of our CD-DNN-HMMs. In Section IV,
we analyze experimental results on a 65k-vocabulary business
search dataset collected from the Bing mobile voice search
application (formerly known as Live Search for mobile [36],
[60]) under real usage scenarios. Section V offers conclusions
and directions for future work.
II. DEEP BELIEF NETWORKS
Deep belief networks (DBNs) are probabilistic generative
models with multiple layers of stochastic hidden units above
a single bottom layer of observed variables that represent a
data vector. DBNs have undirected connections between the
top two layers and directed connections to all other layers from
the layer above. There is an efficient unsupervised algorithm,
first described in [24], for learning the connection weights in a
DBN that is equivalent to training each adjacent pair of layers
as a restricted Boltzmann machine (RBM). There is also a
fast, approximate, bottom-up inference algorithm to infer the
states of all hidden units conditioned on a data vector. After
the unsupervised pre-training phase, Hinton et al. [24] used the
up-down algorithm to optimize all of the DBN weights jointly.
During this fine-tuning phase, a supervised objective function
could also be optimized.
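A minimal numpy sketch of this greedy layer-wise procedure is given below, assuming binary (Bernoulli-Bernoulli) RBMs trained with one-step contrastive divergence; for real-valued acoustic features the first layer would instead be a Gaussian-Bernoulli RBM, and the function names and hyperparameter values here are illustrative only.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, num_hidden, epochs=10, lr=0.01, batch_size=128, rng=None):
    # One-step contrastive divergence (CD-1) for a Bernoulli-Bernoulli RBM.
    # data: (N, V) matrix of visible vectors with entries in [0, 1].
    rng = rng or np.random.default_rng(0)
    num_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((num_visible, num_hidden))
    b_v = np.zeros(num_visible)   # visible biases
    b_h = np.zeros(num_hidden)    # hidden biases
    for _ in range(epochs):
        perm = rng.permutation(len(data))
        for start in range(0, len(data), batch_size):
            v0 = data[perm[start:start + batch_size]]
            # Positive phase: hidden probabilities and a binary sample.
            p_h0 = sigmoid(v0 @ W + b_h)
            h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
            # Negative phase: one step of Gibbs sampling (reconstruction).
            p_v1 = sigmoid(h0 @ W.T + b_v)
            p_h1 = sigmoid(p_v1 @ W + b_h)
            # CD-1 updates: positive minus negative sufficient statistics.
            W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / len(v0)
            b_v += lr * (v0 - p_v1).mean(axis=0)
            b_h += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_h

def pretrain_stack(data, hidden_layer_sizes):
    # Greedy layer-wise pre-training: train an RBM on the current
    # representation, then feed its hidden probabilities to the next RBM.
    weights, layer_input = [], data
    for num_hidden in hidden_layer_sizes:
        W, b_h = train_rbm(layer_input, num_hidden)
        weights.append((W, b_h))
        layer_input = sigmoid(layer_input @ W + b_h)
    return weights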
In this paper, we use the DBN weights resulting from the un-
supervised pre-training algorithm to initialize the weights of a
deep, but otherwise standard, feed-forward neural network and
then simply use the backpropagation algorithm [61] to fine-tune
the network weights with respect to a supervised criterion. Pre-
training followed by stochastic gradient descent is our method
of choice for training deep neural networks because it often
outperforms random initialization for the deeper architectures
we are interested in training and provides results very robust to
the initial random seed. The generative model learned during
pre-training helps prevent overfitting, even when using models
with very high capacity, and can aid in the subsequent optimiza-
tion of the recognition weights.
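The following illustrative numpy sketch shows this initialization and fine-tuning step: pre-trained weight/bias pairs of the kind produced by the pretrain_stack sketch above become the hidden layers of a feed-forward network, a randomly initialized softmax layer over senones is placed on top, and backpropagation with a cross-entropy criterion adjusts all of the weights. It omits practical details such as learning-rate schedules, momentum, and GPU execution.

import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def init_dnn(pretrained, num_senones, rng=None):
    # Hidden layers come from pre-training; the softmax output layer over
    # senones has no generative counterpart and is initialized randomly.
    rng = rng or np.random.default_rng(0)
    layers = [(W.copy(), b.copy()) for W, b in pretrained]
    top_dim = pretrained[-1][0].shape[1]
    layers.append((0.01 * rng.standard_normal((top_dim, num_senones)),
                   np.zeros(num_senones)))
    return layers

def forward(layers, x):
    # Sigmoid hidden layers, softmax output; returns activations of all layers.
    acts = [x]
    for W, b in layers[:-1]:
        acts.append(1.0 / (1.0 + np.exp(-(acts[-1] @ W + b))))
    W, b = layers[-1]
    acts.append(softmax(acts[-1] @ W + b))
    return acts

def sgd_step(layers, x, labels, lr=0.1):
    # One mini-batch of backpropagation with a cross-entropy criterion;
    # labels are integer senone indices aligned to the rows of x.
    acts = forward(layers, x)
    delta = acts[-1].copy()
    delta[np.arange(len(labels)), labels] -= 1.0   # gradient at the softmax input
    for i in range(len(layers) - 1, -1, -1):
        W, b = layers[i]
        grad_W = acts[i].T @ delta / len(x)
        grad_b = delta.mean(axis=0)
        if i > 0:  # propagate through the sigmoid of the layer below
            delta = (delta @ W.T) * acts[i] * (1.0 - acts[i])
        layers[i] = (W - lr * grad_W, b - lr * grad_b)
    return layers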
Although empirical results ultimately are the best reason for
the use of a technique, our motivation for even trying to find and
apply deeper models that might be capable of learning rich, dis-
tributed representations of their input is also based on formal
and informal arguments by other researchers in the machine
learning community. As argued in [62] and [63], insufficiently
deep architectures can require an exponential blow-up in the
number of computational elements needed to represent certain
functions satisfactorily. Thus, one primary motivation for using
deeper models such as neural networks with many layers is that
they have the potential to be much more representationally ef-
ficient for some problems than shallower models like GMMs.
Furthermore, GMMs as used in speech recognition typically
have a large number of Gaussians with independently parame-
terized means, which may leave those Gaussians highly
localized and the models capable only of local generalization.
In effect, such a GMM would partition the
input space into regions each modeled by a single Gaussian.
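To make this local-generalization argument concrete, the illustrative sketch below shows how a diagonal-covariance GMM implicitly partitions feature space: each frame is assigned to the component with the highest responsibility, so every region is described by a single, locally fitted Gaussian. The names and shapes are hypothetical.

import numpy as np

def gmm_hard_partition(frames, means, log_vars, log_weights):
    # frames:      (T, D) feature vectors
    # means:       (M, D) component means
    # log_vars:    (M, D) per-dimension log variances (diagonal covariances)
    # log_weights: (M,)   log mixture weights
    # Returns a length-T array of component indices, i.e., the implicit
    # partition of the input space induced by the mixture.
    diff = frames[:, None, :] - means[None, :, :]                  # (T, M, D)
    # log N(x | mu_m, diag(var_m)) up to a constant shared by all components
    log_gauss = -0.5 * ((diff ** 2) / np.exp(log_vars)[None]
                        + log_vars[None]).sum(axis=-1)
    return np.argmax(log_weights[None, :] + log_gauss, axis=1)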