                              GLUE    CNNDM   SQuAD   SGLUE   EnDe    EnFr    EnRo
⋆ Baseline average            83.28   19.24   80.88   71.36   26.98   39.82   27.65
  Baseline standard deviation  0.235   0.065   0.343   0.416   0.112   0.090   0.108
  No pre-training             66.22   17.60   50.31   53.04   25.86   39.77   24.04

Table 1: Average and standard deviation of scores achieved by our baseline model and
training procedure. For comparison, we also report performance when training on each
task from scratch (i.e. without any pre-training) for the same number of steps used to
fine-tune the baseline model. All scores in this table (and every table in our paper
except Table 14) are reported on the validation sets of each data set.
number of experiments we run. As a cheaper alternative, we train our baseline model 10
times from scratch (i.e. with different random initializations and data set shuffling) and
assume that the variance over these runs of the base model also applies to each experimental
variant. We don’t expect most of the changes we make to have a dramatic effect on the
inter-run variance, so this should provide a reasonable indication of the significance of
different changes. Separately, we also measure the performance of training our model for
2^18 steps (the same number we use for fine-tuning) on all downstream tasks without pre-training.
This gives us an idea of how much pre-training benefits our model in the baseline setting.
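As a rough illustration of this procedure, the sketch below (with placeholder scores, not our actual runs) computes the per-benchmark mean and standard deviation over the 10 baseline runs; these are the statistics reported in the first two rows of Table 1.

```python
# Minimal sketch with made-up numbers; the real values come from 10 independent
# trainings of the baseline model with different seeds and data set shuffling.
import numpy as np

# Hypothetical GLUE scores from 10 baseline runs (placeholders).
glue_runs = np.array([83.1, 83.4, 83.0, 83.5, 83.2, 83.3, 83.6, 83.1, 83.4, 83.2])

mean = glue_runs.mean()
std = glue_runs.std(ddof=1)  # sample standard deviation across runs
print(f"baseline mean = {mean:.2f}, std = {std:.3f}")
```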
When reporting results in the main text, we only report a subset of the scores across all
the benchmarks to conserve space and ease interpretation. For GLUE and SuperGLUE, we
report the average score across all subtasks (as stipulated by the official benchmarks) under
the headings “GLUE” and “SGLUE”. For all translation tasks, we report the BLEU score
(Papineni et al., 2002) as provided by SacreBLEU v1.3.0 (Post, 2018) with “exp” smoothing
and “intl” tokenization. We refer to scores for WMT English to German, English to French,
and English to Romanian as EnDe, EnFr, and EnRo, respectively. For CNN/Daily Mail,
we find the performance of models on the ROUGE-1-F, ROUGE-2-F, and ROUGE-L-F
metrics (Lin, 2004) to be highly correlated so we report the ROUGE-2-F score alone under
the heading “CNNDM”. Similarly, for SQuAD we find the performance of the “exact match”
and “F1” scores to be highly correlated so we report the “exact match” score alone. We
provide every score achieved on every task for all experiments in Table 16, Appendix E.
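For concreteness, the sketch below shows how a translation output could be scored with the settings described above; it assumes the sacrebleu Python API (corpus_bleu) rather than the command-line tool, and the hypothesis and reference sentences are placeholders.

```python
# Minimal sketch (placeholder sentences) of computing BLEU with SacreBLEU
# using "exp" smoothing and "intl" tokenization, as described in the text.
import sacrebleu

hypotheses = ["The house is small.", "He went to school yesterday."]      # model outputs (placeholders)
references = ["The house is small.", "He went to school yesterday too."]  # gold translations (placeholders)

bleu = sacrebleu.corpus_bleu(hypotheses, [references],
                             smooth_method="exp", tokenize="intl")
print(f"BLEU = {bleu.score:.2f}")
```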
Our results tables are all formatted so that each row corresponds to a particular experimental
configuration with columns giving the scores for each benchmark. We will include
the mean performance of the baseline configuration in most tables. Wherever a baseline
configuration appears, we will mark it with a ⋆ (as in the first row of Table 1). We also
will boldface any score that is within two standard deviations of the maximum (best) in a
given experiment.
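To illustrate the boldfacing rule (with a hypothetical helper and placeholder scores, not code from our experiments), a score is bolded when it falls within two baseline standard deviations of the best score for that benchmark:

```python
# Hypothetical helper illustrating the boldfacing criterion: a score is bold
# if it is within two (baseline) standard deviations of the best score.
def is_bold(score, best_score, baseline_std):
    return score >= best_score - 2 * baseline_std

scores = [83.28, 82.50, 83.10]  # placeholder GLUE scores for three configurations
baseline_std = 0.235            # baseline GLUE standard deviation from Table 1
best = max(scores)
print([is_bold(s, best, baseline_std) for s in scores])  # [True, False, True]
```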
Our baseline results are shown in Table 1. Overall, our results are comparable to existing
models of similar size. For example, BERT_BASE achieved an exact match score of 80.8
on SQuAD and an accuracy of 84.4 on MNLI-matched, whereas we achieve 80.88 and
84.24, respectively (see Table 16). Note that we cannot directly compare our baseline to
BERT_BASE because ours is an encoder-decoder model and was pre-trained for roughly
1/4 as many steps. Unsurprisingly, we find that pre-training provides significant gains across
almost all benchmarks. The only exception is WMT English to French, which is a large