is determined from representations all projected from the same sequence. That sequence
consists of the source embeddings x at the input to the encoder, target embeddings y at the
input to the decoder, or the output of the previous layer in a multi-layer encoder or decoder.
All of these elements – encoder, decoder, and attention network – can be duplicated for a
specific domain (section 5.2).
3.2.3 Increasing Model Depth
Any machine translation encoder or decoder – recurrent, convolutional, self-attention-based
– may be implemented as a multi-layer network. For example, the ‘base’ Transformer as
described in Vaswani et al. (2017) has an encoder and decoder each composed of a stack of
6 identical layers. Each layer carries out its operation (e.g. self-attention) on the output
of the layer before. The output of the final encoder layer is used as input to the decoder,
and the output of the final decoder layer is used to produce the translation output.
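To make the layer-stacking pattern concrete, the following is a minimal PyTorch sketch of a 6-layer encoder with dimensions matching the 'base' Transformer. The class name EncoderStack is illustrative, and the built-in TransformerEncoderLayer stands in for whichever layer type (recurrent, convolutional, or self-attentive) is actually used; the point is only that each layer consumes the previous layer's output.

```python
import torch.nn as nn

class EncoderStack(nn.Module):
    """Illustrative multi-layer encoder: each layer transforms the output of
    the layer before it; the final layer's output is passed to the decoder."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, n_layers=6):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, n_heads, d_ff) for _ in range(n_layers)]
        )

    def forward(self, src_embeddings):
        # src_embeddings: (seq_len, batch, d_model) with the default batch_first=False
        h = src_embeddings
        for layer in self.layers:
            h = layer(h)   # each layer operates on the previous layer's output
        return h           # final-layer output, used as input to the decoder
```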
In principle, multi-layer networks are capable of learning more fine-grained language
representations than single-layer networks, simply because they have more parameters. In
practice, multi-layer networks are susceptible to training difficulties, since gradients of the
objective function must be propagated through more layers.
A common way to improve gradient propagation is to add residual connections around
each layer – that is, the output of a layer f(z) becomes f(z) + z (He et al., 2016). Each
layer in the encoder or decoder subnetwork then has access to the original subnetwork input.
Residual connections have been found to be necessary when training deep recurrent models
(Britz et al., 2017a) and Transformer models (Chen et al., 2018).
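As a minimal sketch of the residual pattern, a skip connection can be expressed as below. The wrapper name ResidualLayer is ours, and real Transformer implementations additionally apply layer normalisation and dropout around the sum.

```python
import torch.nn as nn

class ResidualLayer(nn.Module):
    """Wraps an arbitrary sublayer f so that its output becomes f(z) + z,
    giving the layer direct access to its input and shortening the
    gradient path through a deep stack."""
    def __init__(self, sublayer):
        super().__init__()
        self.sublayer = sublayer

    def forward(self, z):
        return self.sublayer(z) + z  # residual (skip) connection
```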
When such architectural ‘tricks’ are applied, deep models have been shown to outperform
equivalent shallower models for NMT in some settings (Wang et al., 2019). These techniques
effectively move the bottleneck on model size towards constraints such as memory footprint
and training time. However, we do not consider model depth a panacea, and note that
recent work has in some cases found better performance for shallower models, especially for
low-resource translation (Sennrich & Zhang, 2019; Nguyen & Chiang, 2018).
3.3 Training Neural Machine Translation Models
Once an NMT model architecture has been determined as in section 3.2, its parameters must
be adjusted so as to produce a mapping between source sequences x and target sequences y.
NMT model parameters are trained by backpropagation (Rumelhart et al., 1986), typically
using some form of Stochastic Gradient Descent (SGD) optimizer. The gradients are
computed with respect to some objective defined over the training data.
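A single parameter update can be sketched as follows. This assumes, purely for illustration, a model(src, tgt_in) interface that returns per-token logits, where tgt_in is the reference target shifted right and tgt_out is the reference itself; the optimizer may be plain SGD or a variant such as Adam.

```python
import torch.nn.functional as F

def training_step(model, optimizer, src, tgt_in, tgt_out, pad_id):
    """One gradient update on a batch: token-level cross-entropy of the
    reference target against the model's predictions, then backpropagation
    and an optimizer step."""
    optimizer.zero_grad()
    logits = model(src, tgt_in)                    # (batch, tgt_len, vocab)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),       # (batch * tgt_len, vocab)
        tgt_out.reshape(-1),                       # (batch * tgt_len,)
        ignore_index=pad_id,                       # do not train on padding
    )
    loss.backward()                                # backpropagation
    optimizer.step()                               # SGD (or Adam, etc.) update
    return loss.item()
```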
Standard training objectives, such as cross-entropy loss, use both the source sentence
and reference sentence during training. However, during inference only the source sentence
and the prefix of the model’s own hypothesis ŷ are available. An auto-regressive sequence
decoder therefore experiences a discrepancy between conditioning during training and in-
ference (Bengio et al., 2015; Ranzato et al., 2016). The need to improve performance while
avoiding over-exposure to the training data has motivated parameter and objective regu-
larization methods. Here we summarize some of these approaches that form the basis for
domain adaptation techniques discussed later.
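The train/inference discrepancy can be seen by contrasting the training step sketched above, which conditions the decoder on the reference prefix, with greedy decoding, which conditions only on the model’s own hypothesis prefix ŷ. The sketch below uses the same assumed model(src, prefix) interface; the function name and token-id arguments are illustrative.

```python
import torch

@torch.no_grad()
def greedy_decode(model, src, bos_id, eos_id, max_len=100):
    """At inference the decoder only sees its own prefix: each step conditions
    on previously *predicted* tokens rather than the reference."""
    hyp = torch.full((src.size(0), 1), bos_id, dtype=torch.long, device=src.device)
    for _ in range(max_len):
        logits = model(src, hyp)                          # condition on own prefix
        next_tok = logits[:, -1].argmax(dim=-1, keepdim=True)
        hyp = torch.cat([hyp, next_tok], dim=1)
        if (next_tok == eos_id).all():                    # stop once every hypothesis ends
            break
    return hyp
```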