Figure 1. The encoder-decoder framework for our proposed MASS. The token "_" represents the mask symbol [M].
accuracy on multiple language understanding tasks in the
GLUE benchmark (Wang et al., 2018) and SQuAD (Rajpurkar et al., 2016).
There are also some works pre-training the encoder-decoder
model for language generation. Dai & Le (2015); Ramachandran et al. (2016) leverage a language model or
auto-encoder to pre-train the encoder and decoder. Their
improvements, although observed, are limited and not as
general and significant as the pre-training methods (e.g.,
BERT) for language understanding. Zhang & Zong (2016)
designed a sentence reordering task for pre-training, but
only for the encoder part of the encoder-decoder model.
Zoph et al. (2016); Firat et al. (2016) pre-train the model on similar rich-resource language pairs and fine-tune it on the target language pair, which relies on supervised data from other language pairs. Recently, XLM (Lample & Conneau, 2019) pre-trained BERT-like models for both the encoder and the decoder, and achieved the previous state-of-the-art results on unsupervised machine translation. However, the encoder and decoder in XLM are pre-trained separately, and the encoder-decoder attention mechanism cannot be pre-trained, which is sub-optimal for sequence to sequence based language generation tasks.
Different from previous works, our proposed MASS is care-
fully designed to pre-train both the encoder and decoder
jointly using only unlabeled data, and can be applied to
most language generation tasks.
3. MASS
In this section, we first introduce the basic framework of
sequence to sequence learning, and then propose MASS
(MAsked Sequence to Sequence pre-training). We then
discuss the differences between MASS and previous pre-
training methods including the masked language modeling
in BERT and standard language modeling.
3.1. Sequence to Sequence Learning
We denote $(x, y) \in (\mathcal{X}, \mathcal{Y})$ as a sentence pair, where $x = (x_1, x_2, ..., x_m)$ is the source sentence with $m$ tokens, and $y = (y_1, y_2, ..., y_n)$ is the target sentence with $n$ tokens, and $\mathcal{X}$ and $\mathcal{Y}$ are the source and target domains. A sequence to sequence model learns the parameter $\theta$ to estimate the conditional probability $P(y|x; \theta)$, and usually uses log likelihood as the objective function: $L(\theta; (\mathcal{X}, \mathcal{Y})) = \sum_{(x,y) \in (\mathcal{X}, \mathcal{Y})} \log P(y|x; \theta)$. The conditional probability $P(y|x; \theta)$ can be further factorized according to the chain rule: $P(y|x; \theta) = \prod_{t=1}^{n} P(y_t | y_{<t}, x; \theta)$,
where $y_{<t}$ denotes the tokens preceding position $t$.
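As a concrete illustration of this factorization, the following is a minimal sketch of how the teacher-forced log likelihood $\sum_t \log P(y_t | y_{<t}, x; \theta)$ is typically computed. The PyTorch code assumes a hypothetical `model(src, decoder_input)` that returns per-position logits, and the `<bos>` token id is a placeholder of ours.

```python
import torch
import torch.nn.functional as F

def seq2seq_log_likelihood(model, src, tgt, bos_id=1):
    """Compute log P(y | x; theta) = sum_t log P(y_t | y_{<t}, x; theta).

    src: (batch, m) source token ids; tgt: (batch, n) target token ids.
    `model(src, decoder_input)` is assumed to return logits of shape
    (batch, n, vocab) for every target position under teacher forcing.
    """
    # Shift the target right so the prediction at position t only sees y_{<t}.
    bos = torch.full_like(tgt[:, :1], bos_id)
    decoder_input = torch.cat([bos, tgt[:, :-1]], dim=1)

    logits = model(src, decoder_input)                     # (batch, n, vocab)
    log_probs = F.log_softmax(logits, dim=-1)

    # Gather log P(y_t | y_{<t}, x) of the observed tokens and sum over t.
    token_log_probs = log_probs.gather(-1, tgt.unsqueeze(-1)).squeeze(-1)
    return token_log_probs.sum(dim=-1)                     # (batch,)
```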
A major approach to sequence to sequence learning is the
encoder-decoder framework: The encoder reads the source
sequence and generates a set of representations; the decoder
estimates the conditional probability of each target token
given the source representations and its preceding tokens.
An attention mechanism (Bahdanau et al., 2015a) is further introduced between the encoder and decoder to find which source representations to focus on when predicting the current token.
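To make this interface concrete, here is a minimal sketch of a single decoder step with dot-product attention over the encoder representations. The module, its layer choices (a GRU cell), and the hidden sizes are illustrative simplifications of ours rather than the architecture used in this paper; they only show how the decoder attends to the source representations when predicting the current token.

```python
import torch
import torch.nn as nn

class AttentiveDecoderStep(nn.Module):
    """One decoder step with dot-product attention over encoder states.

    Hypothetical minimal module: the interface is the same as in larger
    models -- the decoder attends to the set of source representations
    produced by the encoder when predicting the current target token.
    """
    def __init__(self, hidden_size, vocab_size):
        super().__init__()
        self.rnn = nn.GRUCell(hidden_size, hidden_size)
        self.out = nn.Linear(2 * hidden_size, vocab_size)

    def forward(self, prev_embed, prev_state, enc_states):
        # prev_embed: (batch, hidden)   embedding of the previous target token
        # enc_states: (batch, m, hidden) encoder representations of x
        state = self.rnn(prev_embed, prev_state)

        # Attention: score each source position against the decoder state.
        scores = torch.bmm(enc_states, state.unsqueeze(-1)).squeeze(-1)   # (batch, m)
        weights = torch.softmax(scores, dim=-1)
        context = torch.bmm(weights.unsqueeze(1), enc_states).squeeze(1)  # (batch, hidden)

        # Predict the distribution over the current token from state + context.
        logits = self.out(torch.cat([state, context], dim=-1))
        return logits, state
```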
3.2. Masked Sequence to Sequence Pre-training
We introduce a novel unsupervised prediction task in this
section. Given an unpaired source sentence
$x \in \mathcal{X}$, we denote $x^{\setminus u:v}$ as a modified version of $x$ where its fragment from position $u$ to $v$ is masked, $0 < u < v < m$, and $m$ is the number of tokens of sentence $x$. We denote $k = v - u + 1$ as the number of tokens being masked from position $u$ to $v$. We replace each masked token by a special symbol [M], and the length of the masked sentence is not changed. $x^{u:v}$ denotes the sentence fragment of $x$ from position $u$ to $v$.
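A minimal sketch of this masking step on a list of token ids is given below; the mask id `MASK`, the helper name `mask_fragment`, and the uniform choice of the start position $u$ are illustrative assumptions of ours.

```python
import random

MASK = 0  # assumed id of the special symbol [M]

def mask_fragment(x, k):
    """Return (masked sentence x^{\\u:v}, fragment x^{u:v}, start position u).

    x: list of token ids of length m; k = v - u + 1 consecutive tokens are
    masked. Positions are 1-based, matching the notation in the text.
    """
    m = len(x)
    u = random.randint(1, m - k)       # start position, so that v = u + k - 1 < m
    v = u + k - 1
    # Replace the fragment by [M]; the length of the sentence is unchanged.
    masked = x[:u - 1] + [MASK] * k + x[v:]
    fragment = x[u - 1:v]              # the k tokens the model must predict
    return masked, fragment, u
```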
MASS pre-trains a sequence to sequence model by predicting the sentence fragment $x^{u:v}$, taking the masked sequence $x^{\setminus u:v}$ as input. We also use the log likelihood as the objective function:
$$L(\theta; \mathcal{X}) = \frac{1}{|\mathcal{X}|} \sum_{x \in \mathcal{X}} \log P(x^{u:v} | x^{\setminus u:v}; \theta) = \frac{1}{|\mathcal{X}|} \sum_{x \in \mathcal{X}} \log \prod_{t=u}^{v} P(x^{u:v}_{t} | x^{u:v}_{<t}, x^{\setminus u:v}; \theta). \quad (1)$$
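The following sketch shows how the summand of Equation (1) can be computed for a single sentence, reusing the hypothetical `mask_fragment` helper and `MASK` id from above: the encoder consumes the masked sentence $x^{\setminus u:v}$, the decoder input carries the previous fragment token at positions $u+1, ..., v$ and [M] elsewhere (as in Figure 1), and the loss is taken only over the masked positions $u, ..., v$. The `model(enc_input, dec_input)` interface is again an assumption.

```python
import torch
import torch.nn.functional as F

def mass_loss(model, x, k):
    """Negative log likelihood of the masked fragment for one sentence x.

    Reuses mask_fragment / MASK from the sketch above; `model` is assumed to
    return logits of shape (1, m, vocab) over the decoder positions.
    """
    masked, fragment, u = mask_fragment(x, k)
    v, m = u + k - 1, len(x)

    # Encoder input: the sentence with its fragment x^{u:v} replaced by [M].
    enc_input = torch.tensor([masked])                         # (1, m)

    # Decoder input: x_{t-1} at fragment positions u+1..v, [M] everywhere else,
    # so position t only sees the previous tokens of the fragment it predicts.
    dec = [MASK] * m
    for t in range(u + 1, v + 1):                              # 1-based positions
        dec[t - 1] = x[t - 2]
    dec_input = torch.tensor([dec])                            # (1, m)

    log_probs = F.log_softmax(model(enc_input, dec_input), dim=-1)

    # Sum log P(x_t | x^{u:v}_{<t}, x^{\u:v}) over the masked positions only.
    target = torch.tensor(fragment).unsqueeze(-1)              # (k, 1)
    frag_log_probs = log_probs[0, u - 1:v].gather(-1, target)  # (k, 1)
    return -frag_log_probs.sum()
```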
We show an example in Figure 1, where the input sequence has 8 tokens with the fragment $x_3 x_4 x_5 x_6$ being masked. Note that the model only predicts the masked fragment $x_3 x_4 x_5 x_6$, given $x_3 x_4 x_5$ as the decoder input for positions 4-6, and the decoder takes the special mask symbol [M] as inputs for the other positions (e.g., positions 1-3 and