Local Monotonic Attention Mechanism
for End-to-End Speech and Language Processing
Andros Tjandra, Sakriani Sakti, and Satoshi Nakamura
Graduate School of Information Science
Nara Institute of Science and Technology, Japan
{andros.tjandra.ai6, ssakti, s-nakamura}@is.naist.jp
Abstract
Recently, encoder-decoder neural networks have shown impressive performance on many sequence-related tasks. The architecture commonly uses an attentional mechanism which allows the model to learn alignments between the source and the target sequence. Most attentional mechanisms used today are based on a global attention property, which requires computing a weighted summarization of the whole input sequence generated by the encoder states. However, this is computationally expensive and often produces misalignments on longer input sequences. Furthermore, it does not fit the monotonic, left-to-right nature of several tasks, such as automatic speech recognition (ASR) and grapheme-to-phoneme conversion (G2P). In this paper, we propose a novel attention mechanism that has local and monotonic properties. Various ways to control those properties are also explored. Experimental results on ASR, G2P, and machine translation between two languages with similar sentence structures demonstrate that the proposed encoder-decoder model with local monotonic attention achieves significant performance improvements and reduces computational complexity in comparison with a model that uses the standard global attention architecture.
1 Introduction
End-to-end training is a newly emerging approach to sequence-to-sequence mapping tasks that allows the model to directly learn the mapping between variable-length representations of different modalities (i.e., text-to-text sequences (Bahdanau et al., 2014; Sutskever et al., 2014), speech-to-text sequences (Chorowski et al., 2014; Chan et al., 2016), image-to-text sequences (Xu et al., 2015), etc.).
One popular approach to end-to-end mapping tasks between different modalities is based on the encoder-decoder architecture. The earlier version of an encoder-decoder model is built with only two components (Sutskever et al., 2014; Cho et al., 2014b): (1) an encoder that processes the source sequence and encodes it into a fixed-length vector; and (2) a decoder that produces the target sequence based on the information in the fixed-length vector given by the encoder. Both the encoder and decoder are jointly trained to maximize the probability of the correct target sequence given a source sequence. This architecture has been applied in many applications such as machine translation (Sutskever et al., 2014; Cho et al., 2014b), image captioning (Karpathy and Fei-Fei, 2015), and so on.
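In generic notation, this basic formulation can be sketched as follows (the symbols $f_{\mathrm{enc}}$, $f_{\mathrm{dec}}$, $W$, and $b$ stand for arbitrary recurrent cells and an output projection; the exact choices vary between models). Given a source sequence $x = (x_1, \ldots, x_S)$ and a target sequence $y = (y_1, \ldots, y_T)$:
\begin{align*}
  h_s &= f_{\mathrm{enc}}(x_s, h_{s-1}), \qquad c = h_S, \\
  s_t &= f_{\mathrm{dec}}(y_{t-1}, s_{t-1}, c), \\
  P(y_t \mid y_{<t}, x) &= \operatorname{softmax}(W s_t + b),
\end{align*}
so the fixed-length vector $c$ is the only channel through which the decoder sees the source sequence.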
However, such an architecture encounters difficulties, especially in coping with long sequences, because in order to generate the correct target sequence, the decoder depends solely on the last hidden state of the encoder. In other words, the network needs to compress all of the information contained in the source sequence into a single fixed-length vector. Cho et al. (2014a) demonstrated a decrease in the performance of the encoder-decoder model with an increase in the length of the input sentence. Therefore, Bahdanau et al. (2014) introduced an attention mechanism to address these issues. Instead of relying on a fixed-length vector, the decoder is assisted by an attention module that retrieves the related context from the encoder side, depending on the current decoder state.
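For reference, the global (soft) attention of Bahdanau et al. (2014) computes, at every decoder step $t$, a weighted summary over all encoder states; the sketch below uses a generic $\operatorname{Score}$ function, since its exact form differs between variants:
\begin{align*}
  e_{t,s} &= \operatorname{Score}(s_{t-1}, h_s), \\
  \alpha_{t,s} &= \frac{\exp(e_{t,s})}{\sum_{s'=1}^{S} \exp(e_{t,s'})}, \\
  c_t &= \sum_{s=1}^{S} \alpha_{t,s}\, h_s.
\end{align*}
Because the weights are computed over all $S$ encoder states at every output step, the cost grows with the product of the source and target lengths, which is one of the issues the local monotonic attention proposed here aims to reduce.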
Most attention-based encoder-decoder models