Figure 6: The difference between GPT and BERT in their self-attention mechanisms and pre-training objectives.
after GPT and BERT to reveal the recent development of PTMs.
3.1 Transformer
Before Transformer, RNNs had long been the typical neural networks for processing sequential data, especially natural language. Owing to their sequential nature, RNNs read one word at each time step and process it by referring to the hidden states of the previous words. Such a mechanism makes it difficult to exploit the parallel capabilities of high-performance computing devices such as GPUs and TPUs.
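As a minimal sketch of this sequential dependence, consider the vanilla tanh recurrence below (the matrix names and dimensions are illustrative, not tied to any specific RNN variant). The loop cannot be parallelized over time steps, because each hidden state depends on the previous one.

import numpy as np

def rnn_forward(X, W_xh, W_hh, b_h):
    # X: (seq_len, d_in) word vectors. The loop is inherently sequential:
    # the hidden state at step t cannot be computed before step t - 1 finishes.
    h = np.zeros(W_hh.shape[0])
    states = []
    for x_t in X:                      # one word per time step, in order
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        states.append(h)
    return np.stack(states)

# Toy usage: 5 words with 8-dimensional vectors and a 16-dimensional hidden state.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
W_xh, W_hh, b_h = rng.normal(size=(16, 8)), rng.normal(size=(16, 16)), np.zeros(16)
print(rnn_forward(X, W_xh, W_hh, b_h).shape)  # (5, 16)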
Compared with RNNs, Transformer is an encoder-decoder architecture built on a self-attention mechanism, which models the correlations between all words of the input sequence in parallel. Owing to this parallel computation, Transformer can fully exploit advanced computing devices to train large-scale models. In both the encoding and decoding phases, the self-attention mechanism computes representations for all input words. Next, we describe the self-attention mechanism in more detail.
In the encoding phase, for a given word, Transformer computes an attention score against every other word in the input sequence. These attention scores indicate how much each of the other words should contribute to the next representation of the given word. The attention scores are then used as weights to compute a weighted average of the representations of all words. We give an example in Figure 5, where the self-attention mechanism accurately captures the referential relationship between “Jack” and “he”, assigning it the highest attention score. By feeding the weighted average of all word representations into a fully connected network, we obtain the representation of the given word. This procedure essentially aggregates the information of the whole input sequence, and it is applied to all words to generate their representations in parallel. In the decoding phase, the attention mechanism is similar to that of the encoding phase, except that it decodes one representation at a time, from left to right, and each decoding step attends only to the previously decoded results. For more details of Transformer, please refer to its original paper (Vaswani et al., 2017) and the survey paper (Lin et al., 2021).
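To make this procedure concrete, the following is a minimal sketch of scaled dot-product self-attention written in NumPy. The function name, the toy projection matrices, and the optional causal mask (which mimics the left-to-right constraint of the decoding phase) are illustrative assumptions, not taken from the original Transformer implementation.

import numpy as np

def self_attention(X, W_q, W_k, W_v, mask=None):
    # X: (seq_len, d_model) word representations.
    # W_q, W_k, W_v: (d_model, d_k) projection matrices for queries, keys, values.
    # mask: optional (seq_len, seq_len) boolean matrix; True entries are blocked,
    #       e.g., future positions during left-to-right decoding.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])         # attention scores between all word pairs
    if mask is not None:
        scores = np.where(mask, -1e9, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: per-word contribution weights
    return weights @ V                              # weighted average of value vectors

# Toy usage: a sequence of 4 words with 8-dimensional representations.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
causal_mask = np.triu(np.ones((4, 4), dtype=bool), k=1)  # decoder-style mask
print(self_attention(X, W_q, W_k, W_v, mask=causal_mask).shape)  # (4, 8)

In practice, Transformer uses multiple attention heads and wraps each sub-layer with residual connections and layer normalization, but the weighted-average computation above is the core of the mechanism described here.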
Owing to these strengths, Transformer has gradually become the standard neural architecture for natural language understanding and generation. Moreover, it serves as the backbone architecture for the subsequently derived PTMs. Next, we introduce two landmark models that fully opened the door to the era of large-scale self-supervised PTMs: GPT and BERT. In general, GPT is good at natural language generation, while BERT focuses more on natural language understanding.
3.2 GPT
As introduced in Section 2, PTMs typically consist of two phases, the pre-training phase and the fine-tuning phase. Equipped with the Transformer decoder as its backbone^3, GPT applies generative pre-training followed by discriminative fine-tuning.
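As a rough illustration of the generative pre-training objective, the sketch below computes the autoregressive (left-to-right) language-modeling loss in which each position predicts the next token. The random tensors stand in for the outputs of a Transformer decoder, and the function name and shapes are our own illustrative assumptions rather than GPT's actual training code.

import torch
import torch.nn.functional as F

def causal_lm_loss(logits, token_ids):
    # logits:    (batch, seq_len, vocab_size) outputs of a Transformer decoder.
    # token_ids: (batch, seq_len) input token ids.
    # Shift by one so that position t is scored against the token at position t + 1.
    pred = logits[:, :-1, :].reshape(-1, logits.size(-1))
    target = token_ids[:, 1:].reshape(-1)
    return F.cross_entropy(pred, target)

# Toy usage with random "decoder outputs" standing in for a real model.
vocab_size, batch, seq_len = 100, 2, 6
token_ids = torch.randint(vocab_size, (batch, seq_len))
logits = torch.randn(batch, seq_len, vocab_size, requires_grad=True)
loss = causal_lm_loss(logits, token_ids)
loss.backward()  # during pre-training, gradients would update the decoder parameters
print(float(loss))

During fine-tuning, GPT reuses the pre-trained decoder and trains a task-specific output layer on labeled data.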
Conceptually, compared with its PTM predecessors, GPT is the first model to combine the modern Transformer architecture and the self-supervised pre-training objective. Empirically, GPT achieves significant success on almost all NLP tasks, including natural language inference, question answering, commonsense reasoning, semantic similarity and
^3 Since GPT uses autoregressive language modeling, the encoder-decoder attention in the original Transformer decoder is removed.