LISTEN, ATTEND AND SPELL: A NEURAL NETWORK FOR
LARGE VOCABULARY CONVERSATIONAL SPEECH RECOGNITION
William Chan
Carnegie Mellon University
Navdeep Jaitly, Quoc Le, Oriol Vinyals
Google Brain
ABSTRACT
We present Listen, Attend and Spell (LAS), a neural speech recognizer that transcribes speech utterances directly to characters without pronunciation models, HMMs or other components of traditional speech recognizers. In LAS, the neural network architecture subsumes the acoustic, pronunciation and language models, making it not only an end-to-end trained system but an end-to-end model. In contrast to DNN-HMM, CTC and most other models, LAS makes no independence assumptions about the probability distribution of the output character sequences given the acoustic sequence. Our system has two components: a listener and a speller. The listener is a pyramidal recurrent network encoder that accepts filter bank spectra as inputs. The speller is an attention-based recurrent network decoder that emits each character conditioned on all previous characters and the entire acoustic sequence. On a Google voice search task, LAS achieves a WER of 14.1% without a dictionary or an external language model, and 10.3% with language model rescoring over the top 32 beams. In comparison, the state-of-the-art CLDNN-HMM model achieves a WER of 8.0% on the same set.
Index Terms— Recurrent neural network, neural attention, end-to-end speech recognition
1. INTRODUCTION
State-of-the-art speech recognizers of today are complicated systems comprising various components: acoustic models, language models, pronunciation models and text normalization. Each of these components makes assumptions about the underlying probability distributions it models. For example, n-gram language models and Hidden Markov Models (HMMs) make strong Markovian independence assumptions between words/symbols in a sequence. Connectionist Temporal Classification (CTC) and DNN-HMM systems assume that neural networks make independent predictions at different times and use HMMs or language models (which make their own independence assumptions) to introduce dependencies between these predictions over time [1, 2, 3]. End-to-end training of such models attempts to mitigate these problems by training the components jointly [4, 5, 6]. In these models, acoustic models are updated based on a WER proxy, while the pronunciation and language models are rarely updated [7], if at all.
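To make these assumptions concrete, the two factorizations can be contrasted side by side. The notation below is ours and is only a schematic summary of CTC (with blank-collapsing function $B$ over frame-level label sequences $\pi$) versus the LAS objective introduced in Section 2:

    % CTC: frame-level labels are conditionally independent given x;
    % the collapsing function B removes blanks and repetitions.
    P_{\mathrm{CTC}}(y \mid x) = \sum_{\pi \in B^{-1}(y)} \prod_{t=1}^{T} P(\pi_t \mid x)

    % LAS: no conditional-independence assumption; every character is
    % conditioned on the full acoustics and on all previous characters.
    P_{\mathrm{LAS}}(y \mid x) = \prod_{i} P(y_i \mid x, y_{<i})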
In this paper we introduce Listen, Attend and Spell (LAS), a neural network that learns to transcribe an audio signal to a word sequence, one character at a time, without using explicit language models, pronunciation models, HMMs, etc. LAS does not make any independence assumptions about the nature of the probability distribution of the output character sequence given the input acoustic sequence. This method is based on the sequence-to-sequence learning framework with attention [8, 9, 10, 11, 12, 13]. It consists of an encoder Recurrent Neural Network (RNN), named the listener, and a decoder RNN, named the speller. The listener is a pyramidal RNN that converts speech signals into high level features. The speller is an RNN that transduces these higher level features into output utterances by specifying a probability distribution over the next character, given all of the acoustics and the previous characters. At each step the speller uses its internal state to guide an attention mechanism [10, 11, 12] to compute a “context” vector from the high level features of the listener. It then uses this context vector and its internal state both to update its internal state and to predict the next character in the sequence. The entire model is trained jointly, from scratch, by optimizing the probability of the output sequence using a chain rule decomposition. We call this an end-to-end model because all the components of a traditional speech recognizer are integrated into its parameters and optimized together during training, unlike end-to-end training of conventional models, which attempts to adjust acoustic models to work well with the other, fixed components of a speech recognizer.
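As a rough illustration of this listener/speller interaction, the following is a minimal sketch, assuming PyTorch; the class names, layer sizes, number of layers, and the simple dot-product attention are our own illustrative choices, not the paper's exact configuration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Listener(nn.Module):
        """Pyramidal BLSTM encoder: each layer concatenates consecutive frame
        pairs, halving the time resolution (so U <= T in the paper's notation)."""
        def __init__(self, n_mels=40, hidden=256, num_layers=3):
            super().__init__()
            self.layers = nn.ModuleList()
            in_dim = n_mels
            for _ in range(num_layers):
                self.layers.append(
                    nn.LSTM(2 * in_dim, hidden, batch_first=True, bidirectional=True))
                in_dim = 2 * hidden  # bidirectional output size
        def forward(self, x):                      # x: (B, T, n_mels)
            for rnn in self.layers:
                B, T, D = x.shape
                if T % 2:
                    x = x[:, :-1]                  # drop a frame so T is even
                x = x.reshape(B, T // 2, 2 * D)    # stack adjacent frames
                x, _ = rnn(x)
            return x                               # h: (B, U, 2 * hidden)

    class Speller(nn.Module):
        """Attention-based decoder: one character distribution per step,
        conditioned on the previous characters and the whole encoding h."""
        def __init__(self, n_chars, enc_dim=512, hidden=512):
            super().__init__()
            self.hidden = hidden
            self.embed = nn.Embedding(n_chars, hidden)
            self.cell = nn.LSTMCell(hidden + enc_dim, hidden)
            self.out = nn.Linear(hidden + enc_dim, n_chars)
        def forward(self, h, prev_chars):          # teacher forcing
            B, U, E = h.shape
            s = h.new_zeros(B, self.hidden)
            c = h.new_zeros(B, self.hidden)
            context = h.new_zeros(B, E)
            logits = []
            for i in range(prev_chars.size(1)):
                inp = torch.cat([self.embed(prev_chars[:, i]), context], dim=-1)
                s, c = self.cell(inp, (s, c))
                # dot-product attention (assumes enc_dim == decoder hidden size)
                scores = torch.bmm(h, s.unsqueeze(-1)).squeeze(-1)      # (B, U)
                alpha = F.softmax(scores, dim=-1)
                context = torch.bmm(alpha.unsqueeze(1), h).squeeze(1)   # (B, E)
                logits.append(self.out(torch.cat([s, context], dim=-1)))
            return torch.stack(logits, dim=1)       # (B, S, n_chars)

For example, given a batch of filter bank features x and character ids y, h = Listener()(x) followed by Speller(n_chars)(h, y[:, :-1]) produces per-step character logits; the corresponding objective is sketched in Section 2.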
Our model was inspired by [11, 12], which showed how end-to-end recognition could be performed on the TIMIT phone recognition task. We note a recent paper from the same group that describes an application of these ideas to WSJ [14]. Our paper independently explores the challenges associated with the application of these ideas to large scale conversational speech recognition on a Google voice search task. We defer a discussion of the relationship between these and other methods to section 5.
2. MODEL
In this section, we formally describe LAS. Let $x = (x_1, \ldots, x_T)$ be the input sequence of filter bank spectra features and $y = (\langle sos\rangle, y_1, \ldots, y_S, \langle eos\rangle)$, $y_i \in \{a, \ldots, z, 0, \ldots, 9, \langle space\rangle, \langle comma\rangle, \langle period\rangle, \langle apostrophe\rangle, \langle unk\rangle\}$, be the output sequence of characters. Here $\langle sos\rangle$ and $\langle eos\rangle$ are the special start-of-sentence and end-of-sentence tokens, respectively, and $\langle unk\rangle$ marks unknown tokens such as accented characters.
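As a concrete illustration of this output alphabet, the following sketch builds the character vocabulary described above and maps a transcript to token ids; the token spellings and index order are our own convention, not specified by the paper.

    # Illustrative character vocabulary for the LAS output distribution.
    import string

    SPECIALS = ["<sos>", "<eos>", "<space>", "<comma>", "<period>",
                "<apostrophe>", "<unk>"]
    CHARS = list(string.ascii_lowercase) + list(string.digits)
    VOCAB = SPECIALS + CHARS
    CHAR_TO_ID = {c: i for i, c in enumerate(VOCAB)}

    def encode(transcript):
        """Map a transcript to <sos> ... <eos> token ids, one per character."""
        punct = {" ": "<space>", ",": "<comma>", ".": "<period>", "'": "<apostrophe>"}
        ids = [CHAR_TO_ID["<sos>"]]
        for ch in transcript.lower():
            tok = punct.get(ch, ch if ch in CHAR_TO_ID else "<unk>")
            ids.append(CHAR_TO_ID[tok])
        ids.append(CHAR_TO_ID["<eos>"])
        return ids

    # e.g. encode("call mom") -> ids for <sos> c a l l <space> m o m <eos>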
LAS models each character output $y_i$ as a distribution conditioned on the previous characters $y_{<i}$ and the input signal $x$, using the chain rule of probability:

    P(y \mid x) = \prod_i P(y_i \mid x, y_{<i})          (1)
This objective makes the model a discriminative, end-to-end
model, because it directly predicts the conditional probability of
character sequences, given the acoustic signal.
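Under this factorization, training maximizes the per-character log probabilities of the reference transcript with teacher forcing, i.e. it minimizes a per-character cross entropy. A minimal sketch, assuming the illustrative Listener and Speller modules from the introduction and PyTorch's cross entropy:

    import torch
    import torch.nn.functional as F

    def las_loss(listener, speller, x, y):
        # x: (B, T, n_mels) filter bank features; y: (B, S) character ids
        # starting with <sos> and ending with <eos>.
        h = listener(x)                    # high-level features (B, U, D), U <= T
        logits = speller(h, y[:, :-1])     # step i is conditioned on y_<i
        targets = y[:, 1:]                 # predict the next character each step
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               targets.reshape(-1))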
LAS consists of two sub-modules: the listener and the speller. The listener is an acoustic model encoder that performs an operation called Listen. The Listen operation transforms the original signal $x$ into a high level representation $h = (h_1, \ldots, h_U)$ with $U \le T$. The speller is an attention-based character decoder that performs an operation we call AttendAndSpell. The AttendAndSpell operation