is determined from representations all projected from the same sequence. That sequence
consists of the source embeddings x at the input to the encoder, target embeddings y at the
input to the decoder, or the output of the previous layer in a multi-layer encoder or decoder.
All of these elements – encoder, decoder, and attention network – can be duplicated for a
specific domain (section 5.2).
3.2.3 Increasing Model Depth
Any machine translation encoder or decoder – recurrent, convolutional, self-attention-based
– may be implemented as a multi-layer network. For example, the ‘base’ Transformer as
described in Vaswani et al. (2017) has an encoder and decoder each composed of a stack of
6 identical layers. Each layer carries out its operation (e.g. self-attention) on the output
of the layer before. The output of the final encoder layer is used as input to the decoder,
and the output of the final decoder layer is used to produce the translation output.
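To make the layer-stacking pattern concrete, the following is a minimal PyTorch sketch of a 6-layer encoder with dimensions matching the 'base' Transformer. The class name EncoderStack is illustrative, and the built-in TransformerEncoderLayer stands in for whichever layer type (recurrent, convolutional, or self-attentive) is actually used; the point is only that each layer consumes the previous layer's output.

```python
import torch.nn as nn

class EncoderStack(nn.Module):
    """Illustrative multi-layer encoder: each layer transforms the output of
    the layer before it; the final layer's output is passed to the decoder."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, n_layers=6):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, n_heads, d_ff) for _ in range(n_layers)]
        )

    def forward(self, src_embeddings):
        # src_embeddings: (seq_len, batch, d_model) with the default batch_first=False
        h = src_embeddings
        for layer in self.layers:
            h = layer(h)   # each layer operates on the previous layer's output
        return h           # final-layer output, used as input to the decoder
```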
In principle, multi-layer networks are capable of learning more fine-grained language
representations than single-layer networks, simply because they have more parameters. In
practice, multi-layer networks are susceptible to training difficulties, since gradients of the
objective function must be propagated through more layers.
A common way to improve gradient propagation is to add residual connections around
each layer – that is, the output of a layer f(z) becomes f(z) + z (He et al., 2016). Each
layer in the encoder or decoder subnetwork then has access to the original subnetwork input.
Residual connections have been found to be necessary when training deep recurrent models
(Britz et al., 2017a) and Transformer models (Chen et al., 2018).
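As a minimal sketch of the residual pattern, a skip connection can be expressed as below. The wrapper name ResidualLayer is ours, and real Transformer implementations additionally apply layer normalisation and dropout around the sum.

```python
import torch.nn as nn

class ResidualLayer(nn.Module):
    """Wraps an arbitrary sublayer f so that its output becomes f(z) + z,
    giving the layer direct access to its input and shortening the
    gradient path through a deep stack."""
    def __init__(self, sublayer):
        super().__init__()
        self.sublayer = sublayer

    def forward(self, z):
        return self.sublayer(z) + z  # residual (skip) connection
```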
When such architectural ‘tricks’ are applied, deep models have been shown to outperform
equivalent shallower models for NMT in some settings (Wang et al., 2019). These techniques
effectively move the bottleneck on model size towards constraints such as memory footprint
and training time. However, we do not consider model depth a panacea, and note that
recent work has in some cases found better performance for shallower models, especially for
low-resource translation (Sennrich & Zhang, 2019; Nguyen & Chiang, 2018).
3.3 Training Neural Machine Translation Models
Once an NMT model architecture has been determined as in section 3.2, its parameters must
be adjusted so as to produce a mapping between source sequences x and target sequences y.
NMT model parameters are trained by backpropagation (Rumelhart et al., 1986), typically
using some form of Stochastic Gradient Descent (SGD) optimizer. The gradients are
computed with respect to some objective defined over the training data.
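A single parameter update can be sketched as follows. This assumes, purely for illustration, a model(src, tgt_in) interface that returns per-token logits, where tgt_in is the reference target shifted right and tgt_out is the reference itself; the optimizer may be plain SGD or a variant such as Adam.

```python
import torch.nn.functional as F

def training_step(model, optimizer, src, tgt_in, tgt_out, pad_id):
    """One gradient update on a batch: token-level cross-entropy of the
    reference target against the model's predictions, then backpropagation
    and an optimizer step."""
    optimizer.zero_grad()
    logits = model(src, tgt_in)                    # (batch, tgt_len, vocab)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),       # (batch * tgt_len, vocab)
        tgt_out.reshape(-1),                       # (batch * tgt_len,)
        ignore_index=pad_id,                       # do not train on padding
    )
    loss.backward()                                # backpropagation
    optimizer.step()                               # SGD (or Adam, etc.) update
    return loss.item()
```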
Standard training objectives, such as cross-entropy loss, use both the source sentence
and reference sentence during training. However, during inference only the source sentence
and the prefix of the model’s own hypothesis ŷ are available. An auto-regressive sequence
decoder therefore experiences a discrepancy between conditioning during training and in-
ference (Bengio et al., 2015; Ranzato et al., 2016). The need to improve performance while
avoiding over-exposure to the training data has motivated parameter and objective regu-
larization methods. Here we summarize some of these approaches that form the basis for
domain adaptation techniques discussed later.
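The train/inference discrepancy can be seen by contrasting the training step sketched above, which conditions the decoder on the reference prefix, with greedy decoding, which conditions only on the model’s own hypothesis prefix ŷ. The sketch below uses the same assumed model(src, prefix) interface; the function name and token-id arguments are illustrative.

```python
import torch

@torch.no_grad()
def greedy_decode(model, src, bos_id, eos_id, max_len=100):
    """At inference the decoder only sees its own prefix: each step conditions
    on previously *predicted* tokens rather than the reference."""
    hyp = torch.full((src.size(0), 1), bos_id, dtype=torch.long, device=src.device)
    for _ in range(max_len):
        logits = model(src, hyp)                          # condition on own prefix
        next_tok = logits[:, -1].argmax(dim=-1, keepdim=True)
        hyp = torch.cat([hyp, next_tok], dim=1)
        if (next_tok == eos_id).all():                    # stop once every hypothesis ends
            break
    return hyp
```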