Local Monotonic Attention Mechanism
for End-to-End Speech and Language Processing
Andros Tjandra, Sakriani Sakti, and Satoshi Nakamura
Graduate School of Information Science
Nara Institute of Science and Technology, Japan
{andros.tjandra.ai6, ssakti, s-nakamura}@is.naist.jp
Abstract
Recently, encoder-decoder neural networks have shown impressive performance on many sequence-related tasks. The architecture commonly uses an attentional mechanism which allows the model to learn alignments between the source and the target sequence. Most attentional mechanisms used today are based on a global attention property, which requires computing a weighted summarization of the whole input sequence generated by the encoder states. However, this is computationally expensive and often produces misalignments on longer input sequences. Furthermore, it does not fit the monotonic, left-to-right nature of several tasks, such as automatic speech recognition (ASR) and grapheme-to-phoneme conversion (G2P). In this paper, we propose a novel attention mechanism that has local and monotonic properties. Various ways to control those properties are also explored. Experimental results on ASR, G2P, and machine translation between two languages with similar sentence structures demonstrate that the proposed encoder-decoder model with local monotonic attention achieves significant performance improvements and reduces computational complexity in comparison with a model that uses the standard global attention architecture.
1 Introduction
End-to-end training is a newly emerging approach to sequence-to-sequence mapping tasks that allows the model to directly learn the mapping between variable-length representations of different modalities (i.e., text-to-text sequences (Bahdanau et al., 2014; Sutskever et al., 2014), speech-to-text sequences (Chorowski et al., 2014; Chan et al., 2016), image-to-text sequences (Xu et al., 2015), etc.).
One popular approach to end-to-end mapping tasks between different modalities is based on the encoder-decoder architecture. The earlier version of an encoder-decoder model is built with only two components (Sutskever et al., 2014; Cho et al., 2014b): (1) an encoder that processes the source sequence and encodes it into a fixed-length vector; and (2) a decoder that produces the target sequence based on the information in the fixed-length vector given by the encoder. Both the encoder and decoder are jointly trained to maximize the probability of the correct target sequence given a source sequence. This architecture has been applied in many applications such as machine translation (Sutskever et al., 2014; Cho et al., 2014b), image captioning (Karpathy and Fei-Fei, 2015), and so on.
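In generic notation, this basic formulation can be sketched as follows (the symbols $f_{\mathrm{enc}}$, $f_{\mathrm{dec}}$, $W$, and $b$ stand for arbitrary recurrent cells and an output projection; the exact choices vary between models). Given a source sequence $x = (x_1, \ldots, x_S)$ and a target sequence $y = (y_1, \ldots, y_T)$:
\begin{align*}
  h_s &= f_{\mathrm{enc}}(x_s, h_{s-1}), \qquad c = h_S, \\
  s_t &= f_{\mathrm{dec}}(y_{t-1}, s_{t-1}, c), \\
  P(y_t \mid y_{<t}, x) &= \operatorname{softmax}(W s_t + b),
\end{align*}
so the fixed-length vector $c$ is the only channel through which the decoder sees the source sequence.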
However, such an architecture encounters difficulties, especially in coping with long sequences, because in order to generate the correct target sequence, the decoder depends solely on the last hidden state of the encoder. In other words, the network needs to compress all of the information contained in the source sequence into a single fixed-length vector. Cho et al. (2014a) demonstrated a decrease in the performance of the encoder-decoder model with an increase in the length of the input sentence. Therefore, Bahdanau et al. (2014) introduced an attention mechanism to address these issues. Instead of relying on a fixed-length vector, the decoder is assisted by an attention module that retrieves the related context from the encoder side, depending on the current decoder state.
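For reference, the global (soft) attention of Bahdanau et al. (2014) computes, at every decoder step $t$, a weighted summary over all encoder states; the sketch below uses a generic $\operatorname{Score}$ function, since its exact form differs between variants:
\begin{align*}
  e_{t,s} &= \operatorname{Score}(s_{t-1}, h_s), \\
  \alpha_{t,s} &= \frac{\exp(e_{t,s})}{\sum_{s'=1}^{S} \exp(e_{t,s'})}, \\
  c_t &= \sum_{s=1}^{S} \alpha_{t,s}\, h_s.
\end{align*}
Because the weights are computed over all $S$ encoder states at every output step, the cost grows with the product of the source and target lengths, which is one of the issues the local monotonic attention proposed here aims to reduce.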
Most attention-based encoder-decoder models