SPEECH RECOGNITION WITH DEEP RECURRENT NEURAL NETWORKS
Alex Graves, Abdel-rahman Mohamed and Geoffrey Hinton
Department of Computer Science, University of Toronto
ABSTRACT
Recurrent neural networks (RNNs) are a powerful model for
sequential data. End-to-end training methods such as Connec-
tionist Temporal Classification make it possible to train RNNs
for sequence labelling problems where the input-output align-
ment is unknown. The combination of these methods with
the Long Short-term Memory RNN architecture has proved
particularly fruitful, delivering state-of-the-art results in cur-
sive handwriting recognition. However, RNN performance in
speech recognition has so far been disappointing, with better
results returned by deep feedforward networks. This paper in-
vestigates deep recurrent neural networks, which combine the
multiple levels of representation that have proved so effective
in deep networks with the flexible use of long range context
that empowers RNNs. When trained end-to-end with suit-
able regularisation, we find that deep Long Short-term Mem-
ory RNNs achieve a test set error of 17.7% on the TIMIT
phoneme recognition benchmark, which to our knowledge is
the best recorded score.
Index Terms— recurrent neural networks, deep neural
networks, speech recognition
1. INTRODUCTION
Neural networks have a long history in speech recognition,
usually in combination with hidden Markov models [1, 2].
They have gained attention in recent years with the dramatic
improvements in acoustic modelling yielded by deep feed-
forward networks [3, 4]. Given that speech is an inherently
dynamic process, it seems natural to consider recurrent neu-
ral networks (RNNs) as an alternative model. HMM-RNN
systems [5] have also seen a recent revival [6, 7], but do not
currently perform as well as deep networks.
Instead of combining RNNs with HMMs, it is possible
to train RNNs ‘end-to-end’ for speech recognition [8, 9, 10].
This approach exploits the larger state-space and richer dy-
namics of RNNs compared to HMMs, and avoids the prob-
lem of using potentially incorrect alignments as training tar-
gets. The combination of Long Short-term Memory [11], an
RNN architecture with an improved memory, with end-to-end
training has proved especially effective for cursive handwrit-
ing recognition [12, 13]. However, it has so far made little
impact on speech recognition.
RNNs are inherently deep in time, since their hidden state
is a function of all previous hidden states. The question that
inspired this paper was whether RNNs could also benefit from
depth in space; that is from stacking multiple recurrent hid-
den layers on top of each other, just as feedforward layers are
stacked in conventional deep networks. To answer this ques-
tion we introduce deep Long Short-term Memory RNNs and
assess their potential for speech recognition. We also present
an enhancement to a recently introduced end-to-end learning
method that jointly trains two separate RNNs as acoustic and
linguistic models [10]. Sections 2 and 3 describe the network
architectures and training methods, Section 4 provides exper-
imental results and concluding remarks are given in Section 5.
2. RECURRENT NEURAL NETWORKS
Given an input sequence x = (x_1, ..., x_T), a standard recurrent neural network (RNN) computes the hidden vector sequence h = (h_1, ..., h_T) and output vector sequence y = (y_1, ..., y_T) by iterating the following equations from t = 1 to T:

    h_t = H(W_{xh} x_t + W_{hh} h_{t-1} + b_h)    (1)
    y_t = W_{hy} h_t + b_y                        (2)
where the W terms denote weight matrices (e.g. W_{xh} is the input-hidden weight matrix), the b terms denote bias vectors (e.g. b_h is the hidden bias vector) and H is the hidden layer function.
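As a concrete illustration, the following sketch iterates Eqs. (1) and (2) over an input sequence in NumPy. The choice of tanh for H, the toy dimensions and the random parameters are assumptions made for the example, not details taken from this paper.

import numpy as np

def rnn_forward(x, W_xh, W_hh, W_hy, b_h, b_y):
    """Iterate Eqs. (1)-(2) over an input sequence x of shape (T, n_in)."""
    h_prev = np.zeros(b_h.shape[0])   # h_0 is assumed to be the zero vector
    hs, ys = [], []
    for x_t in x:
        # Eq. (1): h_t = H(W_xh x_t + W_hh h_{t-1} + b_h), with H = tanh here
        h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)
        # Eq. (2): y_t = W_hy h_t + b_y
        ys.append(W_hy @ h_t + b_y)
        hs.append(h_t)
        h_prev = h_t
    return np.stack(hs), np.stack(ys)

# Toy usage with illustrative dimensions
rng = np.random.default_rng(0)
T, n_in, n_h, n_out = 5, 3, 4, 2
h, y = rnn_forward(rng.normal(size=(T, n_in)),
                   rng.normal(size=(n_h, n_in)), rng.normal(size=(n_h, n_h)),
                   rng.normal(size=(n_out, n_h)),
                   np.zeros(n_h), np.zeros(n_out))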
H is usually an elementwise application of a sigmoid function. However, we have found that the Long Short-Term Memory (LSTM) architecture [11], which uses purpose-built memory cells to store information, is better at finding and exploiting long-range context. Fig. 1 illustrates a single LSTM memory cell. For the version of LSTM used in this paper [14], H is implemented by the following composite function:
    i_t = σ(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)        (3)
    f_t = σ(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)        (4)
    c_t = f_t c_{t-1} + i_t tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)    (5)
    o_t = σ(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)            (6)
    h_t = o_t tanh(c_t)                                                (7)
where σ is the logistic sigmoid function, and i, f, o and c are respectively the input gate, forget gate, output gate and cell activation vectors.
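A corresponding NumPy sketch of the composite function in Eqs. (3)-(7) for a single time step is given below. The parameter naming, and the treatment of the peephole weights W_{ci}, W_{cf} and W_{co} as diagonal matrices stored as vectors (so that they act elementwise on the cell vector), are assumptions made for the example.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One step of Eqs. (3)-(7); p maps parameter names to arrays."""
    # Eq. (3): input gate (peephole term uses the previous cell state c_{t-1})
    i_t = sigmoid(p['W_xi'] @ x_t + p['W_hi'] @ h_prev + p['W_ci'] * c_prev + p['b_i'])
    # Eq. (4): forget gate
    f_t = sigmoid(p['W_xf'] @ x_t + p['W_hf'] @ h_prev + p['W_cf'] * c_prev + p['b_f'])
    # Eq. (5): new cell state
    c_t = f_t * c_prev + i_t * np.tanh(p['W_xc'] @ x_t + p['W_hc'] @ h_prev + p['b_c'])
    # Eq. (6): output gate (peephole term uses the new cell state c_t)
    o_t = sigmoid(p['W_xo'] @ x_t + p['W_ho'] @ h_prev + p['W_co'] * c_t + p['b_o'])
    # Eq. (7): hidden output
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

Applying lstm_step from t = 1 to T in place of Eq. (1) yields the LSTM version of the hidden vector sequence h.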