
WEIGHTED TRANSFORMER NETWORK FOR
MACHINE TRANSLATION
Karim Ahmed, Nitish Shirish Keskar & Richard Socher
Salesforce Research
Palo Alto, CA 94103, USA
{karim.ahmed,nkeskar,rsocher}@salesforce.com
ABSTRACT
State-of-the-art results on neural machine translation are often obtained with attentional
sequence-to-sequence models that rely on some form of convolution or recurrence.
Vaswani et al. (2017) propose a new architecture that avoids recurrence and con-
volution completely. Instead, it uses only self-attention and feed-forward layers.
While the proposed architecture achieves state-of-the-art results on several ma-
chine translation tasks, it requires a large number of parameters and training iter-
ations to converge. We propose Weighted Transformer, a Transformer with mod-
ified attention layers, that not only outperforms the baseline network in BLEU
score but also converges 15-40% faster. Specifically, we replace the multi-head
attention by multiple self-attention branches that the model learns to combine dur-
ing the training process. Our model improves the state-of-the-art performance by
0.5 BLEU points on the WMT 2014 English-to-German translation task and by
0.4 on the English-to-French translation task.
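As a high-level sketch of this modification (illustrative only; not the paper's exact parameterization), each attention layer evaluates M self-attention branches and mixes their outputs with learned scalar weights:

    output = \sum_{i=1}^{M} \alpha_i \, \mathrm{branch}_i(x),

where the \alpha_i are trained jointly with the rest of the network (assumed here to be normalized so that \sum_i \alpha_i = 1).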
1 INTRODUCTION
Recurrent neural networks (RNNs), such as long short-term memory networks (LSTMs) (Hochreiter
& Schmidhuber, 1997), form an important building block for many tasks that require modeling of
sequential data. RNNs have been successfully employed for several such tasks including language
modeling (Melis et al., 2017; Merity et al., 2017), speech recognition (Xiong et al., 2017; Graves
et al., 2013), and machine translation (Wu et al., 2016; Bahdanau et al., 2014). RNNs make output
predictions at each time step by computing a hidden state vector h_t based on the current input token
and the previous states. This sequential computation underlies their ability to map arbitrary input-
output sequence pairs. However, because their auto-regressive nature requires previous hidden
states to be computed before the current time step, they cannot benefit from parallelization.
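As an illustrative sketch (written here in the vanilla Elman form; gated cells such as the LSTM share the same sequential dependence), the recurrence is

    h_t = \tanh(W_x x_t + W_h h_{t-1} + b),

where x_t is the input representation at step t. Because h_t depends on h_{t-1}, the hidden states must be computed one step at a time rather than in parallel across the sequence.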
Variants of recurrent networks that use strided convolutions eschew the traditional time-step based
computation (Kaiser & Bengio, 2016; Lei & Zhang, 2017; Bradbury et al., 2016; Gehring et al.,
2016; 2017; Kalchbrenner et al., 2016). However, in these models, dependencies between distant
positions remain difficult to learn (Hochreiter et al., 2001; Hochreiter,
1998). Attention mechanisms, often used in conjunction with recurrent models, have become an in-
tegral part of complex sequential tasks because they facilitate learning of such dependencies (Luong
et al., 2015; Bahdanau et al., 2014; Parikh et al., 2016; Paulus et al., 2017; Kim et al., 2017).
In Vaswani et al. (2017), the authors introduce the Transformer network, a novel architecture that
avoids the recurrence equation and maps the input sequences into hidden states solely using atten-
tion. Specifically, the authors use positional encodings in conjunction with a multi-head attention
mechanism. This allows for increased parallel computation and reduces time to convergence. The
authors report results for neural machine translation showing that the Transformer network achieves
state-of-the-art performance on the WMT 2014 English-to-German and English-to-French tasks
while being orders-of-magnitude faster than prior approaches.
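Concretely, the scaled dot-product attention of Vaswani et al. (2017) computes, for query, key, and value matrices Q, K, and V,

    Attention(Q, K, V) = softmax(QK^T / \sqrt{d_k}) V,

where d_k is the key dimension. Multi-head attention applies this operation several times in parallel with separate learned projections and concatenates the results, so every position can attend to every other position within a single layer.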
Transformer networks still require a large number of parameters to achieve state-of-the-art perfor-
mance. In the case of the newstest2013 English-to-German translation task, the base model required
65M parameters, and the large model required 213M parameters. We propose a variant of the Trans-
former network which we call Weighted Transformer that uses self-attention branches in lieu of