for a derivation)
\begin{align}
\alpha_{i,j} &= p_{i,j} \sum_{k=1}^{j} \left( \alpha_{i-1,k} \prod_{l=k}^{j-1} (1 - p_{i,l}) \right) \tag{9} \\
&= p_{i,j} \left( (1 - p_{i,j-1}) \frac{\alpha_{i,j-1}}{p_{i,j-1}} + \alpha_{i-1,j} \right) \tag{10}
\end{align}
We provide a solution to the recurrence relation of eq. (10) which allows computing $\alpha_{i,j}$ for $j \in \{1, \ldots, T\}$ in parallel with cumulative sum and cumulative product operations in appendix C.1. Defining $q_{i,j} = \alpha_{i,j}/p_{i,j}$ gives the following procedure for computing $\alpha_{i,j}$:
\begin{align}
e_{i,j} &= a(s_{i-1}, h_j) \tag{11} \\
p_{i,j} &= \sigma(e_{i,j}) \tag{12} \\
q_{i,j} &= (1 - p_{i,j-1})\, q_{i,j-1} + \alpha_{i-1,j} \tag{13} \\
\alpha_{i,j} &= p_{i,j}\, q_{i,j} \tag{14}
\end{align}
where we define the special cases $q_{i,0} = 0$ and $p_{i,0} = 0$ to maintain equivalence with eq. (9). As in softmax-based attention, the $\alpha_{i,j}$ values produce a weighting over the memory, which is then used to compute the context vector at each timestep as in eq. (3). However, note that $\alpha_i$ may not be a valid probability distribution because $\sum_j \alpha_{i,j} \le 1$. Using $\alpha_i$ as-is, without normalization, effectively associates any additional probability not allocated to memory entries with an additional all-zero memory location. Normalizing $\alpha_i$ so that $\sum_{j=1}^{T} \alpha_{i,j} = 1$ has two issues: First, we cannot perform this normalization at test time and still achieve online decoding, because the normalization depends on $\alpha_{i,j}$ for $j \in \{1, \ldots, T\}$; second, it would result in a mismatch compared to the probability distribution induced by the hard monotonic attention process, which sets $c_i$ to a vector of zeros when $z_{i,j} = 0$ for $j \in \{t_{i-1}, \ldots, T\}$. Note that computing $c_i$ still has quadratic complexity because we must compute $\alpha_{i,j}$ for $j \in \{1, \ldots, T\}$ for each output timestep $i$. However, because we are training directly with respect to the expected value of $c_i$, we will train our decoders using eqs. (11) to (14) and then use the online, linear-time attention process of section 2.2 at test time. Furthermore, if $p_{i,j} \in \{0, 1\}$ these approaches are equivalent, so in order for the model to exhibit similar behavior at training and test time, we need $p_{i,j} \approx 0$ or $p_{i,j} \approx 1$. We address this in section 2.5.
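To make the training-time computation concrete, here is a minimal NumPy sketch of eqs. (11) to (14) for a single output timestep, together with the cumulative-sum/cumulative-product form referred to above (cf. appendix C.1). The function names, the toy sanity check, and the epsilon guarding the division are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def monotonic_attention_step(p_i, alpha_prev):
    """Soft monotonic attention for one output step via eqs. (13)-(14).

    p_i:        (T,) selection probabilities p_{i,j} = sigmoid(e_{i,j})
    alpha_prev: (T,) attention weights alpha_{i-1,j} from the previous output step
    """
    T = p_i.shape[0]
    alpha = np.zeros(T)
    q = 0.0  # special case q_{i,0} = 0
    for j in range(T):
        p_left = p_i[j - 1] if j > 0 else 0.0  # special case p_{i,0} = 0
        q = (1.0 - p_left) * q + alpha_prev[j]  # eq. (13)
        alpha[j] = p_i[j] * q                   # eq. (14)
    return alpha

def monotonic_attention_parallel(p_i, alpha_prev, eps=1e-10):
    """Equivalent computation using cumulative sum/product operations."""
    # Exclusive cumulative product of (1 - p_{i,l}): [1, 1-p_1, (1-p_1)(1-p_2), ...]
    cumprod_1mp = np.cumprod(np.concatenate(([1.0], 1.0 - p_i[:-1])))
    q = cumprod_1mp * np.cumsum(alpha_prev / np.clip(cumprod_1mp, eps, None))
    return p_i * q

# Sanity check on random inputs: both forms agree.
rng = np.random.default_rng(0)
p = sigmoid(rng.normal(size=8))
alpha_prev = np.eye(8)[0]  # e.g. all attention on the first memory entry at step i-1
assert np.allclose(monotonic_attention_step(p, alpha_prev),
                   monotonic_attention_parallel(p, alpha_prev))
```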
2.4. Modified Energy Function
While various "energy functions" $a(\cdot)$ have been proposed, the most common to our knowledge is the one proposed in (Bahdanau et al., 2015):
\begin{equation}
a(s_{i-1}, h_j) = v^\top \tanh(W s_{i-1} + V h_j + b) \tag{15}
\end{equation}
where $W$ and $V$ are weight matrices, $b$ is a bias vector,^1 and $v$ is a weight vector. We make two modifications to eq. (15) for use with our monotonic decoder: First, while the softmax is invariant to offset,^2 the logistic sigmoid is not. As a result, we make the simple modification of adding a scalar variable $r$ after the $\tanh$ function, allowing the model to learn the appropriate offset for the pre-sigmoid activations. Note that eq. (13) tends to exponentially decay attention over the memory because $1 - p_{i,j} \in [0, 1]$; we therefore initialized $r$ to a negative value prior to training so that $1 - p_{i,j}$ tends to be close to 1. Second, the use of the sigmoid nonlinearity in eq. (12) implies that our mechanism is particularly sensitive to the scale of the energy terms $e_{i,j}$, or correspondingly, the scale of the energy vector $v$. We found an effective solution to this issue was to apply weight normalization (Salimans & Kingma, 2016) to $v$, replacing it by $gv/\|v\|$ where $g$ is a scalar parameter. Initializing $g$ to the inverse square root of the attention hidden dimension worked well for all problems we studied. The above produces the energy function
\begin{equation}
a(s_{i-1}, h_j) = g \frac{v^\top}{\|v\|} \tanh(W s_{i-1} + V h_j + b) + r \tag{16}
\end{equation}
The addition of the two scalar parameters $g$ and $r$ prevented the issues described above in all our experiments while incurring a negligible increase in the number of parameters.
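For reference, a minimal NumPy sketch of eq. (16) is given below. The parameter shapes, the initialization of $g$ to the inverse square root of the attention dimension, and a negative initial $r$ follow the description above; the class name, the random initialization scale, and the particular value `r_init = -1.0` are illustrative assumptions.

```python
import numpy as np

class MonotonicEnergy:
    """Sketch of the modified energy function of eq. (16).

    dim_s: decoder state size, dim_h: memory (encoder state) size,
    dim_a: attention hidden dimension. Parameter names mirror eq. (16).
    """

    def __init__(self, dim_s, dim_h, dim_a, r_init=-1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(dim_a, dim_s))
        self.V = rng.normal(scale=0.1, size=(dim_a, dim_h))
        self.b = np.zeros(dim_a)
        self.v = rng.normal(scale=0.1, size=dim_a)
        # Weight-normalization scale g, initialized to 1/sqrt(attention dim).
        self.g = 1.0 / np.sqrt(dim_a)
        # Offset r, initialized to a negative value (r_init is an illustrative choice).
        self.r = r_init

    def __call__(self, s_prev, h):
        """e_{i,j} for one decoder state s_prev (dim_s,) and all memory entries h (T, dim_h)."""
        v_normed = self.g * self.v / np.linalg.norm(self.v)   # g * v / ||v||
        pre = np.tanh(h @ self.V.T + s_prev @ self.W.T + self.b)  # (T, dim_a)
        return pre @ v_normed + self.r                            # (T,)
```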
2.5. Encouraging Discreteness
As mentioned above, in order for our mechanism to exhibit similar behavior when training in expectation and when using the hard monotonic attention process at test time, we require that $p_{i,j} \approx 0$ or $p_{i,j} \approx 1$. A straightforward way to encourage this behavior is to add noise before the sigmoid in eq. (12), as was done e.g. in (Frey, 1997; Salakhutdinov & Hinton, 2009; Foerster et al., 2016). We found that simply adding zero-mean, unit-variance Gaussian noise to the pre-sigmoid activations was sufficient in all of our experiments. This approach is similar to the recently proposed Gumbel-Softmax trick (Jang et al., 2016; Maddison et al., 2016), except we did not find it necessary to anneal the temperature as suggested in (Jang et al., 2016).
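A minimal sketch of this training-time trick follows, under the assumption that fresh standard-normal noise is drawn for every energy term at each training step and that no noise is used at test time (the function name is ours):

```python
import numpy as np

def selection_probabilities(energies, training, rng=None):
    """p_{i,j} = sigmoid(e_{i,j} + noise): zero-mean, unit-variance Gaussian
    noise is added to the pre-sigmoid activations at training time only."""
    if training:
        rng = rng or np.random.default_rng()
        energies = energies + rng.standard_normal(energies.shape)
    return 1.0 / (1.0 + np.exp(-energies))
```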
Note that once we have a model which produces $p_{i,j}$ that are effectively discrete, we can eschew the sampling involved in the process of section 2.2 and instead simply set $z_{i,j} = \mathbb{1}(p_{i,j} > \tau)$, where $\mathbb{1}$ is the indicator function and $\tau$ is a threshold. We used this approach in all of our experiments, setting $\tau = 0.5$. Furthermore, at test time we do not add pre-sigmoid noise, making decoding purely deterministic.
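Concretely, test-time decoding for one output step then amounts to scanning the memory forward from the previously attended index and stopping at the first entry whose selection probability exceeds $\tau$. The helper below (`hard_monotonic_step`, a name we introduce here) is a sketch of this process, not the paper's code:

```python
import numpy as np

def hard_monotonic_step(energies, t_prev, tau=0.5):
    """Deterministic test-time attention for one output step.

    energies: (T,) pre-sigmoid energies e_{i,j} (no noise at test time).
    t_prev:   memory index attended to at the previous output step (t_{i-1}).
    Returns the new index t_i, or None if no entry is selected
    (in which case the context vector c_i is a vector of zeros).
    """
    p = 1.0 / (1.0 + np.exp(-energies))
    for j in range(t_prev, len(p)):
        if p[j] > tau:  # z_{i,j} = 1(p_{i,j} > tau)
            return j
    return None
```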
^1 $b$ is occasionally omitted, but we found it often improves performance and only incurs a modest increase in parameters, so we include it.
^2 That is, $\mathrm{softmax}(e) = \mathrm{softmax}(e + r)$ for any $r \in \mathbb{R}$.