for a derivation)
\begin{align}
\alpha_{i,j} &= p_{i,j} \sum_{k=1}^{j} \left( \alpha_{i-1,k} \prod_{l=k}^{j-1} (1 - p_{i,l}) \right) \tag{9} \\
&= p_{i,j} \left( (1 - p_{i,j-1}) \frac{\alpha_{i,j-1}}{p_{i,j-1}} + \alpha_{i-1,j} \right) \tag{10}
\end{align}
We provide a solution to the recurrence relation of eq. (10) which allows computing $\alpha_{i,j}$ for $j \in \{1, \ldots, T\}$ in parallel with cumulative sum and cumulative product operations in appendix C.1. Defining $q_{i,j} = \alpha_{i,j}/p_{i,j}$ gives the following procedure for computing $\alpha_{i,j}$:
\begin{align}
e_{i,j} &= a(s_{i-1}, h_j) \tag{11} \\
p_{i,j} &= \sigma(e_{i,j}) \tag{12} \\
q_{i,j} &= (1 - p_{i,j-1})\, q_{i,j-1} + \alpha_{i-1,j} \tag{13} \\
\alpha_{i,j} &= p_{i,j}\, q_{i,j} \tag{14}
\end{align}
where we define the special cases $q_{i,0} = 0$ and $p_{i,0} = 0$ to maintain equivalence with eq. (9). As in softmax-based attention, the $\alpha_{i,j}$ values produce a weighting over the memory, which is then used to compute the context vector at each timestep as in eq. (3). However, note that $\alpha_i$ may not be a valid probability distribution because $\sum_j \alpha_{i,j} \le 1$. Using $\alpha_i$ as-is, without normalization, effectively associates any additional probability not allocated to memory entries with an additional all-zero memory location. Normalizing $\alpha_i$ so that $\sum_{j=1}^{T} \alpha_{i,j} = 1$ has two issues: First, we cannot perform this normalization at test time and still achieve online decoding, because the normalization depends on $\alpha_{i,j}$ for $j \in \{1, \ldots, T\}$; second, it would result in a mismatch compared to the probability distribution induced by the hard monotonic attention process, which sets $c_i$ to a vector of zeros when $z_{i,j} = 0$ for $j \in \{t_{i-1}, \ldots, T\}$. Note that computing $c_i$ still has quadratic complexity because we must compute $\alpha_{i,j}$ for $j \in \{1, \ldots, T\}$ for each output timestep $i$. However, because we are training directly with respect to the expected value of $c_i$, we will train our decoders using eqs. (11) to (14) and then use the online, linear-time attention process of section 2.2 at test time. Furthermore, if $p_{i,j} \in \{0, 1\}$ these approaches are equivalent, so in order for the model to exhibit similar behavior at training and test time, we need $p_{i,j} \approx 0$ or $p_{i,j} \approx 1$. We address this in section 2.5.
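To make the training-time computation concrete, here is a minimal NumPy sketch of eqs. (11) to (14) for a single output timestep, together with the cumulative-sum/cumulative-product form referred to above (cf. appendix C.1). The function names, the toy sanity check, and the epsilon guarding the division are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def monotonic_attention_step(p_i, alpha_prev):
    """Soft monotonic attention for one output step via eqs. (13)-(14).

    p_i:        (T,) selection probabilities p_{i,j} = sigmoid(e_{i,j})
    alpha_prev: (T,) attention weights alpha_{i-1,j} from the previous output step
    """
    T = p_i.shape[0]
    alpha = np.zeros(T)
    q = 0.0  # special case q_{i,0} = 0
    for j in range(T):
        p_left = p_i[j - 1] if j > 0 else 0.0  # special case p_{i,0} = 0
        q = (1.0 - p_left) * q + alpha_prev[j]  # eq. (13)
        alpha[j] = p_i[j] * q                   # eq. (14)
    return alpha

def monotonic_attention_parallel(p_i, alpha_prev, eps=1e-10):
    """Equivalent computation using cumulative sum/product operations."""
    # Exclusive cumulative product of (1 - p_{i,l}): [1, 1-p_1, (1-p_1)(1-p_2), ...]
    cumprod_1mp = np.cumprod(np.concatenate(([1.0], 1.0 - p_i[:-1])))
    q = cumprod_1mp * np.cumsum(alpha_prev / np.clip(cumprod_1mp, eps, None))
    return p_i * q

# Sanity check on random inputs: both forms agree.
rng = np.random.default_rng(0)
p = sigmoid(rng.normal(size=8))
alpha_prev = np.eye(8)[0]  # e.g. all attention on the first memory entry at step i-1
assert np.allclose(monotonic_attention_step(p, alpha_prev),
                   monotonic_attention_parallel(p, alpha_prev))
```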
2.4. Modified Energy Function
While various "energy functions" $a(\cdot)$ have been proposed, the most common to our knowledge is the one proposed in (Bahdanau et al., 2015):
\begin{equation}
a(s_{i-1}, h_j) = v^\top \tanh(W s_{i-1} + V h_j + b) \tag{15}
\end{equation}
where $W$ and $V$ are weight matrices, $b$ is a bias vector,^1 and $v$ is a weight vector. We make two modifications to eq. (15) for use with our monotonic decoder: First, while the softmax is invariant to offset,^2 the logistic sigmoid is not. As a result, we make the simple modification of adding a scalar variable $r$ after the $\tanh$ function, allowing the model to learn the appropriate offset for the pre-sigmoid activations. Note that eq. (13) tends to exponentially decay attention over the memory because $1 - p_{i,j} \in [0, 1]$; we therefore initialized $r$ to a negative value prior to training so that $1 - p_{i,j}$ tends to be close to 1. Second, the use of the sigmoid nonlinearity in eq. (12) implies that our mechanism is particularly sensitive to the scale of the energy terms $e_{i,j}$, or correspondingly, the scale of the energy vector $v$. We found an effective solution to this issue was to apply weight normalization (Salimans & Kingma, 2016) to $v$, replacing it by $gv/\|v\|$ where $g$ is a scalar parameter. Initializing $g$ to the inverse square root of the attention hidden dimension worked well for all problems we studied. The above produces the energy function
\begin{equation}
a(s_{i-1}, h_j) = g \frac{v^\top}{\|v\|} \tanh(W s_{i-1} + V h_j + b) + r \tag{16}
\end{equation}
The addition of the two scalar parameters $g$ and $r$ prevented the issues described above in all our experiments while incurring a negligible increase in the number of parameters.
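For reference, a minimal NumPy sketch of eq. (16) is given below. The parameter shapes, the initialization of $g$ to the inverse square root of the attention dimension, and a negative initial $r$ follow the description above; the class name, the random initialization scale, and the particular value `r_init = -1.0` are illustrative assumptions.

```python
import numpy as np

class MonotonicEnergy:
    """Sketch of the modified energy function of eq. (16).

    dim_s: decoder state size, dim_h: memory (encoder state) size,
    dim_a: attention hidden dimension. Parameter names mirror eq. (16).
    """

    def __init__(self, dim_s, dim_h, dim_a, r_init=-1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(dim_a, dim_s))
        self.V = rng.normal(scale=0.1, size=(dim_a, dim_h))
        self.b = np.zeros(dim_a)
        self.v = rng.normal(scale=0.1, size=dim_a)
        # Weight-normalization scale g, initialized to 1/sqrt(attention dim).
        self.g = 1.0 / np.sqrt(dim_a)
        # Offset r, initialized to a negative value (r_init is an illustrative choice).
        self.r = r_init

    def __call__(self, s_prev, h):
        """e_{i,j} for one decoder state s_prev (dim_s,) and all memory entries h (T, dim_h)."""
        v_normed = self.g * self.v / np.linalg.norm(self.v)   # g * v / ||v||
        pre = np.tanh(h @ self.V.T + s_prev @ self.W.T + self.b)  # (T, dim_a)
        return pre @ v_normed + self.r                            # (T,)
```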
2.5. Encouraging Discreteness
As mentioned above, in order for our mechanism to exhibit similar behavior when training in expectation and when using the hard monotonic attention process at test time, we require that $p_{i,j} \approx 0$ or $p_{i,j} \approx 1$. A straightforward way to encourage this behavior is to add noise before the sigmoid in eq. (12), as was done e.g. in (Frey, 1997; Salakhutdinov & Hinton, 2009; Foerster et al., 2016). We found that simply adding zero-mean, unit-variance Gaussian noise to the pre-sigmoid activations was sufficient in all of our experiments. This approach is similar to the recently proposed Gumbel-Softmax trick (Jang et al., 2016; Maddison et al., 2016), except we did not find it necessary to anneal the temperature as suggested in (Jang et al., 2016).
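A minimal sketch of this training-time trick follows, under the assumption that fresh standard-normal noise is drawn for every energy term at each training step and that no noise is used at test time (the function name is ours):

```python
import numpy as np

def selection_probabilities(energies, training, rng=None):
    """p_{i,j} = sigmoid(e_{i,j} + noise): zero-mean, unit-variance Gaussian
    noise is added to the pre-sigmoid activations at training time only."""
    if training:
        rng = rng or np.random.default_rng()
        energies = energies + rng.standard_normal(energies.shape)
    return 1.0 / (1.0 + np.exp(-energies))
```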
Note that once we have a model which produces $p_{i,j}$ that are effectively discrete, we can eschew the sampling involved in the process of section 2.2 and instead simply set $z_{i,j} = \mathbb{1}(p_{i,j} > \tau)$, where $\mathbb{1}$ is the indicator function and $\tau$ is a threshold. We used this approach in all of our experiments, setting $\tau = 0.5$. Furthermore, at test time we do not add pre-sigmoid noise, making decoding purely deterministic.
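Concretely, test-time decoding for one output step then amounts to scanning the memory forward from the previously attended index and stopping at the first entry whose selection probability exceeds $\tau$. The helper below (`hard_monotonic_step`, a name we introduce here) is a sketch of this process, not the paper's code:

```python
import numpy as np

def hard_monotonic_step(energies, t_prev, tau=0.5):
    """Deterministic test-time attention for one output step.

    energies: (T,) pre-sigmoid energies e_{i,j} (no noise at test time).
    t_prev:   memory index attended to at the previous output step (t_{i-1}).
    Returns the new index t_i, or None if no entry is selected
    (in which case the context vector c_i is a vector of zeros).
    """
    p = 1.0 / (1.0 + np.exp(-energies))
    for j in range(t_prev, len(p)):
        if p[j] > tau:  # z_{i,j} = 1(p_{i,j} > tau)
            return j
    return None
```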
^1 $b$ is occasionally omitted, but we found it often improves performance and only incurs a modest increase in parameters, so we include it.
^2 That is, $\mathrm{softmax}(e) = \mathrm{softmax}(e + r)$ for any $r \in \mathbb{R}$.