Towards End-to-End Speech Recognition
with Recurrent Neural Networks
Alex Graves GRAVES@CS.TORONTO.EDU
Google DeepMind, London, United Kingdom
Navdeep Jaitly NDJAITLY@CS.TORONTO.EDU
Department of Computer Science, University of Toronto, Canada
Abstract
This paper presents a speech recognition sys-
tem that directly transcribes audio data with text,
without requiring an intermediate phonetic repre-
sentation. The system is based on a combination
of the deep bidirectional LSTM recurrent neural
network architecture and the Connectionist Tem-
poral Classification objective function. A mod-
ification to the objective function is introduced
that trains the network to minimise the expec-
tation of an arbitrary transcription loss function.
This allows a direct optimisation of the word er-
ror rate, even in the absence of a lexicon or lan-
guage model. The system achieves a word error
rate of 27.3% on the Wall Street Journal corpus
with no prior linguistic information, 21.9% with
only a lexicon of allowed words, and 8.2% with a
trigram language model. Combining the network
with a baseline system further reduces the error
rate to 6.7%.
1. Introduction
Recent advances in algorithms and computer hardware
have made it possible to train neural networks in an end-
to-end fashion for tasks that previously required signifi-
cant human expertise. For example, convolutional neural
networks are now able to directly classify raw pixels into
high-level concepts such as object categories (Krizhevsky
et al., 2012) and messages on traffic signs (Ciresan et al.,
2011), without using hand-designed feature extraction al-
gorithms. Not only do such networks require less human
effort than traditional approaches, they generally deliver
superior performance. This is particularly true when very
large amounts of training data are available, as the benefits of holistic optimisation tend to outweigh those of prior knowledge.
Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).
While automatic speech recognition has greatly benefited
from the introduction of neural networks (Bourlard & Mor-
gan, 1993; Hinton et al., 2012), the networks are at present
only a single component in a complex pipeline. As with
traditional computer vision, the first stage of the pipeline
is input feature extraction: standard techniques include
mel-scale filterbanks (Davis & Mermelstein, 1980) (with
or without a further transform into Cepstral coefficients)
and speaker normalisation techniques such as vocal tract
length normalisation (Lee & Rose, 1998). Neural networks
are then trained to classify individual frames of acous-
tic data, and their output distributions are reformulated as
emission probabilities for a hidden Markov model (HMM).
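This reformulation is usually done via Bayes' rule: the network's per-frame state posteriors are divided by the state priors (estimated from the training alignments), yielding likelihoods scaled by a frame-independent constant. A minimal sketch of that conversion, with an illustrative function name and toy numbers not taken from the paper:

```python
import numpy as np

def posteriors_to_scaled_likelihoods(posteriors, state_priors, eps=1e-10):
    """Convert per-frame state posteriors P(state | frame) into scaled
    emission likelihoods P(frame | state) by dividing out the state
    priors (Bayes' rule, dropping the frame-independent term P(frame)).

    posteriors:   (T, S) array, one row of posteriors per frame.
    state_priors: (S,) array of state occupancy priors.
    """
    return posteriors / (state_priors + eps)

# Toy example: 2 frames, 3 HMM states.
post = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.6, 0.3]])
priors = np.array([0.5, 0.3, 0.2])
lik = posteriors_to_scaled_likelihoods(post, priors)
```

The scaled likelihoods can then be plugged into the HMM's emission model in place of the Gaussian mixtures of a classical system.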
The objective function used to train the networks is there-
fore substantially different from the true performance mea-
sure (sequence-level transcription accuracy). This is pre-
cisely the sort of inconsistency that end-to-end learning
seeks to avoid. In practice it is a source of frustration to
researchers, who find that a large gain in frame accuracy
can translate to a negligible improvement, or even deterio-
ration in transcription accuracy. An additional problem is
that the frame-level training targets must be inferred from
the alignments determined by the HMM. This leads to an
awkward iterative procedure, where network retraining is
alternated with HMM re-alignments to generate more accu-
rate targets. Full-sequence training methods such as Max-
imum Mutual Information have been used to directly train
HMM-neural network hybrids to maximise the probability
of the correct transcription (Bahl et al., 1986; Jaitly et al.,
2012). However, these techniques are only suitable for retraining a system already trained at the frame level, and they require careful tuning of a large number of hyper-parameters, typically even more than is needed for deep neural networks.
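The alternating realignment procedure described above can be sketched as follows. This is an illustrative skeleton, not the authors' code or any real toolkit's API: `RecognizerStub` and its methods are hypothetical stand-ins for a hybrid HMM-neural-network system.

```python
class RecognizerStub:
    """Hypothetical stand-in for an HMM/network hybrid; it only records
    how often each phase of the training loop runs."""

    def __init__(self):
        self.align_calls = 0
        self.fit_calls = 0

    def forced_align(self, utterance, transcript):
        # A real system would Viterbi-align the transcript to the audio
        # with the current models; here we return dummy per-frame targets.
        self.align_calls += 1
        return ["state"] * len(utterance)

    def fit(self, utterances, alignments):
        # A real system would retrain the network on the new frame targets.
        self.fit_calls += 1

def train_with_realignment(model, utterances, transcripts, n_rounds=3):
    """Alternate HMM forced alignment (to get frame-level targets) with
    network retraining, as in classical hybrid training pipelines."""
    for _ in range(n_rounds):
        alignments = [model.forced_align(u, t)
                      for u, t in zip(utterances, transcripts)]
        model.fit(utterances, alignments)
    return model
```

The end-to-end approach pursued in this paper removes exactly this loop, since no frame-level targets are needed in the first place.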
While the transcriptions used to train speech recognition
systems are lexical, the targets presented to the networks