A Survey of Pretrained Foundation Models: From BERT to ChatGPT

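This survey comprehensively examines the development of Pretrained Foundation Models (PFMs), tracing their evolution from BERT to ChatGPT. The authors come from a range of universities and research institutions, including Michigan State University, Beihang University, and Lehigh University. The article emphasizes the key role of pretraining in large-model applications and how, as a transfer learning paradigm, it has achieved notable results in fields such as computer vision.

PFMs have played a central role in natural language processing (NLP) in recent years and serve as the foundation for downstream tasks across different data modalities. Models such as BERT, GPT-3, MAE, DALL-E, and ChatGPT are pretrained on large-scale data to provide a reasonable parameter initialization for a wide range of downstream applications. This pretraining idea is essential to the use of large models.

BERT (Bidirectional Encoder Representations from Transformers) is a milestone among pretrained models. It introduced a bidirectional Transformer encoder and changed how language models are trained. Through pretraining tasks such as Masked Language Modeling (MLM) and Next Sentence Prediction (NSP), BERT learns the structure of language and contextual relationships. The pretrained parameters can then be fine-tuned for specific downstream tasks such as question answering, text classification, or sentiment analysis.

The GPT (Generative Pre-trained Transformer) series, GPT-3 in particular, further expanded the scale and capability of pretrained models. Unlike BERT, GPT models are trained autoregressively: they learn language patterns by predicting the next token in a sequence. With 175 billion parameters, GPT-3 shows strong zero-shot and few-shot learning ability and can complete many tasks without task-specific training data.

MAE (Masked Autoencoder) targets image data. It masks part of the image and trains the model to reconstruct only the masked part, which reduces computational cost and improves efficiency. The approach performs well on vision tasks.

DALL-E and ChatGPT extend pretraining to generative models. DALL-E combines language and visual information and generates images from text instructions. ChatGPT, from OpenAI, is fine-tuned from GPT-3.5 with reinforcement learning from human feedback and can hold fluent human-machine conversations, showing the potential of pretrained models in interactive applications.

As a transfer learning method, pretraining is already widely used in computer vision, for example by freezing some network layers for feature extraction and fine-tuning the remaining layers for the target task. This reduces training time and improves model performance. The success of pretrained models has also inspired research in other areas such as cross-modal learning, where models link different data types to support more integrated understanding. The survey analyzes the development of pretrained models from BERT to ChatGPT and shows the central position of pretraining in building strong, general-purpose AI systems.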

Boosting Examples: ChatGPT and Bard. As shown in Fig. 5, ChatGPT is fine-tuned from the PFM GPT-3.5 using RLHF. ChatGPT uses a different data collection setup compared to InstructGPT. First, a large dataset of prompts and desired output behaviors is collected and used to fine-tune GPT-3.5 with supervised learning. Second, given the fine-tuned model and a prompt, the model generates several outputs. A labeler scores and ranks these outputs to compose a comparison dataset, which is used to train the reward model. Finally, the fine-tuned model (ChatGPT) is optimized against the reward model using the Proximal Policy Optimization (PPO) [64] RL algorithm.
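The second step, turning a labeler's ranking into training signal for the reward model, can be sketched as follows. This is a minimal illustration, not the procedure confirmed by the text: it assumes the pairwise logistic loss commonly used for reward models, and the names `ranking_to_pairs`, `pairwise_loss`, the output labels, and the scores are all hypothetical.

```python
import math
from itertools import combinations


def ranking_to_pairs(ranked_outputs):
    """Turn a best-to-worst ranking into (preferred, rejected) pairs."""
    # combinations() keeps the input order, so the first item of each
    # pair is always the one ranked higher by the labeler.
    return list(combinations(ranked_outputs, 2))


def pairwise_loss(reward_preferred, reward_rejected):
    """Assumed loss: -log(sigmoid(r_preferred - r_rejected))."""
    diff = reward_preferred - reward_rejected
    return math.log1p(math.exp(-diff))


# Hypothetical labeler ranking of three sampled outputs, best first.
ranking = ["output_a", "output_b", "output_c"]
pairs = ranking_to_pairs(ranking)

# Hypothetical scalar scores produced by a reward model under training.
scores = {"output_a": 1.2, "output_b": 0.3, "output_c": -0.5}
mean_loss = sum(pairwise_loss(scores[w], scores[l]) for w, l in pairs) / len(pairs)
```

The loss shrinks as the reward model scores the preferred output further above the rejected one, which is what pushes its scores to agree with the human ranking.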

Another experimental conversational PFM, Bard, is developed by Google. Bard is based on the LM for Dialogue Applications (LaMDA). LaMDA [65] is built upon the Transformer and is pretrained on 1.56T words of dialog data and web text. Safety and factual grounding are two main challenges for conversational AI; LaMDA addresses them by fine-tuning with high-quality annotated data and by using external knowledge sources to improve model performance.

3.5 Instruction-Aligning Methods

Instruction-aligning methods aim to let the LM follow human intents and generate meaningful outputs. The general approach is fine-tuning the pretrained LM on a high-quality corpus in a supervised manner. To further improve the usefulness and harmlessness of LMs, some works introduce RL into the fine-tuning procedure so that LMs can revise their responses according to human or AI feedback. Both supervised and RL approaches can leverage chain-of-thought [24] style reasoning to improve the human-judged performance and transparency of AI decision-making.

Supervised Fine-Tuning (SFT). SFT is a well-established technique to unlock knowledge and apply it to specific real-world tasks, even unseen ones. The template for SFT is composed of input-output pairs and an instruction [111]. For example, given the instruction “Translate this sentence to Spanish:” and an input “The new office building was built in less than three months.”, we want the LM to generate the target “El nuevo edificio de oficinas se construyó en menos de tres meses.”. Templates are commonly human-made, including unnatural instructions [112] and natural instructions [113, 114], or bootstrapped from a seed corpus [115]. Ethical and social risks of harm from LMs are significant concerns in SFT [116]. LaMDA thus relies on crowdworker-annotated data to provide a safety assessment of any generated LaMDA response in three conversation categories: natural, sensitive, and adversarial. The list of rules also serves safety fine-tuning and evaluation purposes.
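As a rough illustration of the instruction, input, and target template described above, the translation example could be packed into a single training record like this. The function name `build_sft_example`, the `prompt`/`completion` keys, and the newline-separated layout are assumptions made for the sketch; real SFT pipelines use many different layouts.

```python
def build_sft_example(instruction, input_text, target):
    """Format one instruction-tuning example as a prompt/completion record."""
    # Assumed layout: instruction on the first line, input on the second.
    prompt = f"{instruction}\n{input_text}\n"
    return {"prompt": prompt, "completion": target}


example = build_sft_example(
    instruction="Translate this sentence to Spanish:",
    input_text="The new office building was built in less than three months.",
    target="El nuevo edificio de oficinas se construyó en menos de tres meses.",
)

# The text the LM would be trained on is the prompt followed by the target.
training_text = example["prompt"] + example["completion"]
```

Keeping the prompt and the completion separate makes it possible to compute the supervised loss on the target tokens only, which is a common (but not universal) design choice.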

Reinforcement Learning from Feedback. RL has been applied to enhance various models in NLP tasks such as machine translation [117], summarization [18], dialogue generation [118], image captioning [119], question generation [120], text-games [121], and more [122, 123, 124]. RL is a helpful method for optimizing non-differentiable objectives in language generation tasks by treating them as sequential decision-making problems. However, there is a risk of overfitting to metrics that use neural networks, leading to nonsensical samples that score well on the metrics [125]. RL is also used to align LMs with human preferences [126, 127, 128].
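To make the point about non-differentiable objectives concrete, here is a toy policy-gradient sketch. It assumes a one-step "generation" over a hypothetical three-word vocabulary and a 0/1 match reward standing in for a real metric; to stay deterministic it uses the exact expected gradient instead of sampled rollouts, which a practical system could not do.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reward(token):
    """Non-differentiable metric: 1 if the token matches the reference."""
    return 1.0 if token == "cat" else 0.0


vocab = ["cat", "dog", "car"]   # hypothetical vocabulary
logits = [0.0, 0.0, 0.0]        # policy parameters, uniform at the start
learning_rate = 0.5

for _ in range(200):
    probs = softmax(logits)
    expected_reward = sum(p * reward(t) for p, t in zip(probs, vocab))
    # Exact policy gradient for a softmax policy:
    # d E[R] / d logit_i = p_i * (R_i - E[R]).
    grads = [p * (reward(t) - expected_reward) for p, t in zip(probs, vocab)]
    logits = [l + learning_rate * g for l, g in zip(logits, grads)]

final_probs = softmax(logits)
```

The reward function is never differentiated; only the log-probabilities of the policy are, which is why this family of methods fits metrics that cannot be backpropagated through.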

InstructGPT proposes to fine-tune large models with PPO against a trained reward model to align LMs with human preferences [19]; this is the same method, named RLHF, that ChatGPT applies. Specifically, the reward model is trained on comparison data built from human labelers’ manual rankings of outputs. For each output, the reward model or machine labeler calculates a reward, which is used to update the LM using

Footnote (Bard): https://blog.google/technology/ai/bard-google-ai-search-updates/

Table 1: Summary of PFMs in Text. The pretraining tasks include language model (LM), masked LM (MLM), permuted LM (PLM), denoising autoencoder (DAE), knowledge graphs (KG), and knowledge embedding (KE).

Year Conference Model Architecture Embedding Training method Code

2013 NeurIPS Skip-Gram [66] Word2Vec Probabilistic - https://github.com/.../models

2014 EMNLP GloVe [67] Word2Vec Probabilistic - -

2015 NeurIPS LM-LSTM [68] LSTM Probabilistic LM https://github.com/.../GloVe

2016 IJCAI Shared LSTM [69] LSTM Probabilistic LM https://github.com/.../adversarial_text

2017 TACL FastText [70] Word2Vec Probabilistic - https://github.com/.../fastText

2017 NeurIPS CoVe [71] LSTM+Seq2Seq Probabilistic - https://github.com/.../cove

2018 NAACL-HLT ELMO [51] LSTM Contextual LM https://allennlp.org/elmo

2018 NAACL-HLT BERT [13] Transformer Encoder Contextual MLM https://github.com/.../bert

2018 OpenAI GPT [48] Transformer Decoder Autoregressive LM https://github.com/...transformer-lm

2019 ACL ERNIE(THU) Transformer Encoder Contextual MLM https://github.com/.../ERNIE

2019 ACL Transformer-XL [72] Transformer-XL Contextual - https://github.com/.../transformer-xl

2019 ICLR InfoWord [73] Transformer Encoder Contextual MLM -

2019 ICLR StructBERT [74] Transformer Encoder Contextual MLM -

2019 ICLR ALBERT [45] Transformer Encoder Contextual MLM https://github.com/.../ALBERT

2019 ICLR WKLM [75] Transformer Encoder Contextual MLM -

2019 ICML MASS [57] Transformer Contextual MLM(Seq2Seq) https://github.com/.../MASS

2019 EMNLP-IJCNLP KnowBERT [76] Transformer Encoder Contextual MLM https://github.com/.../kb

2019 EMNLP-IJCNLP Unicoder [77] Transformer Encoder Contextual MLM+TLM -

2019 EMNLP-IJCNLP MultiFit [78] QRNN Probabilistic LM https://github.com/.../multifit

2019 EMNLP-IJCNLP SciBERT [79] Transformer Encoder Contextual MLM https://github.com/.../scibert

2019 EMNLP-IJCNLP BERT-PKD [80] Transformer Encoder Contextual MLM https://github.com/...Compression

2019 NeurIPS Xlnet [14] Transformer-XL Encoder Permutation PLM https://github.com/.../xlnet

2019 NeurIPS UNILM [58] LSTM + Transformer Contextual LM + MLM https://github.com/.../unilm

2019 NeurIPS XLM [81] Transformer Encoder Contextual MLM+CLM+TLM https://github.com/.../XLM

2019 OpenAI Blog GPT-2 [49] Transformer Decoder Autoregressive LM https://github.com/.../gpt-2

2019 arXiv RoBERTa [53] Transformer Encoder Contextual MLM https://github.com/.../fairseq

2019 arXiv ERNIE(Baidu) [59] Transformer Encoder Contextual MLM+DLM https://github.com/.../ERNIE

2019 EMC2@NeurIPS Q8BERT [82] Transformer Encoder Contextual MLM https://github.com/.../quantized_bert.py

2019 arXiv DistilBERT [83] Transformer Encoder Contextual MLM https://github.com/.../distillation

2020 ACL fastBERT [84] Transformer Encoder Contextual MLM https://github.com/.../FastBERT

2020 ACL SpanBERT [42] Transformer Encoder Contextual MLM https://github.com/.../SpanBERT

2020 ACL BART [43] Transformer En: Contextual; De: Autoregressive DAE https://github.com/.../transformers

2020 ACL CamemBERT [85] Transformer Encoder Contextual MLM(WWM) https://camembert-model.fr

2020 ACL XLM-R [86] Transformer Encoder Contextual MLM https://github.com/.../XLM

2020 ICLR Reformer [87] Reformer Permutation - https://github.com/.../reformer

2020 ICLR ELECTRA [44] Transformer Encoder Contextual MLM https://github.com/.../electra

2020 AAAI Q-BERT [88] Transformer Encoder Contextual MLM -

2020 AAAI XNLG [89] Transformer Contextual MLM+DAE https://github.com/.../xnlg

2020 AAAI K-BERT [90] Transformer Encoder Contextual MLM https://github.com/.../K-BERT

2020 AAAI ERNIE 2.0 [60] Transformer Encoder Contextual MLM https://github.com/.../ERNIE

2020 NeurIPS GPT-3 [20] Transformer Decoder Autoregressive LM https://github.com/.../gpt-3

2020 NeurIPS MPNet [55] Transformer Encoder Permutation MLM+PLM https://github.com/.../MPNet

2020 NeurIPS ConvBERT [91] Mixed Attention Contextual - https://github.com/.../ConvBert

2020 NeurIPS MiniLM [92] Transformer Encoder Contextual MLM https://github.com/.../minilm

2020 TACL mBART [93] Transformer Contextual DAE https://github.com/.../mbart

2020 COLING CoLAKE [94] Transformer Encoder Contextual MLM+KE https://github.com/.../CoLAKE

2020 LREC FlauBERT [95] Transformer Encoder Contextual MLM https://github.com/.../Flaubert

2020 EMNLP GLM [96] Transformer Encoder Contextual MLM+KG https://github.com/.../GLM

2020 EMNLP (Findings) TinyBERT [97] Transformer Contextual MLM https://github.com/.../TinyBERT

2020 EMNLP (Findings) RobBERT [98] Transformer Encoder Contextual MLM https://github.com/.../RobBERT

2020 EMNLP (Findings) ZEN [62] Transformer Encoder Contextual MLM https://github.com/.../ZEN

2020 EMNLP (Findings) BERT-MK [99] KG-Transformer Encoder Contextual MLM -

2020 RepL4NLP@ACL CompressingBERT [33] Transformer Encoder Contextual MLM(Pruning) https://github.com/.../bert-prune

2020 JMLR T5 [100] Transformer Contextual MLM(Seq2Seq) https://github.com/...transformer

2021 T-ASL BERT-wwm-Chinese [61] Transformer Encoder Contextual MLM https://github.com/...BERT-wwm

2021 EACL PET [101] Transformer Encoder Contextual MLM https://github.com/.../pet

2021 TACL KEPLER [102] Transformer Encoder Contextual MLM+KE https://github.com/.../KEPLER

2021 EMNLP SimCSE [103] Transformer Encoder Contextual MLM+KE https://github.com/.../SimCSE

2021 ICML GLaM [104] Transformer Autoregressive LM -

2021 arXiv XLM-E [105] Transformer Contextual MLM

2021 arXiv T0 [106] Transformer Contextual MLM https://github.com/.../T0

2021 arXiv Gopher [107] Transformer Autoregressive LM -

2022 arXiv MT-NLG [108] Transformer Contextual MLM -

2022 arXiv LaMDA [65] Transformer Decoder Autoregressive LM https://github.com/.../LaMDA

2022 arXiv Chinchilla [109] Transformer Autoregressive LM -

2022 arXiv PaLM [41] Transformer Autoregressive LM https://github.com/.../PaLM

2022 arXiv OPT [110] Transformer Decoder Autoregressive LM https://github.com/.../MetaSeq

PPO. More details are illustrated in Fig. 5. Sparrow [128], developed by DeepMind, also utilizes RLHF, which reduces the risk of unsafe and inappropriate answers. Despite some promising results using RLHF by incorporating fluency, progress in this field is impeded by a lack of publicly available benchmarks and implementation resources, resulting in a perception that RL is a difficult approach for NLP. Therefore, an open-source library named RL4LMs [125] was recently introduced, which consists of building blocks for fine-tuning and evaluating RL algorithms on LM-based generation.
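For readers unfamiliar with PPO, the core of its update is the clipped surrogate objective, sketched below for a single sample. This is the standard textbook form, not code from RL4LMs or from any system named above; `ratio` is the new-to-old policy probability ratio for the sampled output, `advantage` is assumed to be derived from the reward model's score, and `epsilon=0.2` is an illustrative default.

```python
def ppo_clipped_objective(ratio, advantage, epsilon=0.2):
    """Per-sample clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage: the gain is capped once the ratio exceeds 1 + epsilon.
capped_gain = ppo_clipped_objective(ratio=1.5, advantage=1.0)

# Negative advantage: the objective takes the more pessimistic (lower) value
# once the ratio drops below 1 - epsilon.
capped_penalty = ppo_clipped_objective(ratio=0.5, advantage=-1.0)
```

Clipping removes the incentive to move the policy far from the one that generated the samples in a single update, which is one reason PPO is considered comparatively stable for fine-tuning LMs.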

Besides human feedback, one of the latest dialogue agents, Claude, favors Constitutional AI [129], where the reward model is learned via RL from AI Feedback (RLAIF). Both the critiques and the AI feedback are steered by a small set of principles drawn from a ‘constitution’, a short list of principles or instructions, which is the only thing provided by humans in Claude. The AI feedback focuses on controlling the outputs to be less harmful by explaining its objections to dangerous queries.

Chain-of-Thoughts (CoT). CoT is a series of intermediate reasoning steps, which can significantly improve the ability of large LMs to perform complex reasoning [24, 130, 131]. Besides, fine-tuning with CoT yields slightly more harmless models than fine-tuning without CoT [129].

3.6 Summary

The neural probabilistic LM uses a neural network to estimate the parameters of the probabilistic LM, which reduces the size of the model parameters while enlarging the context window. With the help of a neural network, the LM no longer needs continual improvements to the smoothing algorithm to alleviate the performance bottleneck. Since the training target is unsupervised, a corpus with a large amount of data is enough for training. The negative sampling technique in the training process provides a new idea for the follow-up study of the target task in the LM. Furthermore, the neural probabilistic LM promotes the further development of downstream task research because of its good representation capability and training efficiency. After pretrained LMs, especially the BERT model, were proposed, research in language modeling entered a new phase. The bidirectional LM, together with the masked LM and permuted LM objectives it adopts, has successfully modeled the grammatical and semantic information in natural language at a deeper level. ChatGPT is another milestone work in PFMs using RL. The representation ability of PFMs is qualitatively better than that of the neural probabilistic LM. It even exceeds that of humans in some tasks.

4 PFMs for Computer Vision

With the popularity of PFMs in NLP, researchers have been motivated to explore PFMs in CV. The term “pretraining” has not been clearly defined within the realm of deep learning research in CV. The word was first used for convolution-based networks, where the parameters are adjusted on a more general dataset such as ImageNet so that other tasks can start training from a warm initialization and thus converge faster. In contrast to early CNN-based transfer learning techniques that rely on pretraining datasets with supervised signals, our examination of PFMs centers on SSL, which uses human-designed labels, such as those from Jigsaw puzzles or from comparing different patches of images, as pretext tasks. This allows learned representations to be generalized to various downstream tasks, including classification, detection, recognition, segmentation, etc.

However, it is costly to rely heavily on data annotations when the learning tasks become more complicated, making the labeling process more arduous and time-consuming than the actual learning. This is where SSL is urgently needed and how it can further fuel the progress of deep learning methods. To reduce the dependency on data labeling, unlabeled data are trained with self-supervision by matching, contrasting, or generating in SSL.

[Figure 6 diagram; recoverable labels: Big Data in the Wild; Data Augmentation or Self-labeling strategy; Encoder (ConvNet, RNN, ···); Pretext Task; Unlabeled images; Pre-training; Labelling Information; Backbone (ConvNet, RNN, ···); Transferred; Labeled images; Average; Representation; Downstream Task; MLP; Downstream Supervised Learning; Data in the Domain.]

Figure 6: The general pipeline for SSL. The top part represents the pretraining, and the bottom stream obtains transferred parameters from above to learn downstream supervised tasks.

The general pipeline of SSL is shown in Fig. 6. During the pretraining stage, a pretext task is designed for the encoder networks to solve. The artificial labels for this pretext task are automatically generated based on specific attributes of the data, such as image patches from the same origin being labeled as “positive” and those from different origins as “negative”. Then, the encoder networks are trained to solve the pretext task by supervised learning methods. Since shallow layers extract fine-grained details such as edges, angles, and textures, while deeper layers capture task-related high-level features such as semantic information or image contents, encoders learned on pretext tasks can be transferred to downstream supervised tasks. During this stage, the parameters of the backbone are fixed, and only a simple classifier, such as a two-layer Multi-Layer Perceptron (MLP), needs to be learned. Considering the limited workload in the downstream training stage, this learning process is commonly referred to as fine-tuning. In summary, the representations learned during the pretraining stage in SSL can be reused on other downstream tasks and achieve comparable results.
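The automatic labeling described above, where patches from the same image are "positive" and patches from different images are "negative", can be sketched in a few lines. The patch representation as `(image_id, patch_id)` tuples and the function name `make_pair_labels` are hypothetical choices for illustration; no human annotation is involved at any point.

```python
from itertools import combinations


def make_pair_labels(patches):
    """Label every pair of patches: 1 if both come from the same image, else 0.

    `patches` is a list of (image_id, patch_id) tuples.
    """
    labeled = []
    for a, b in combinations(patches, 2):
        label = 1 if a[0] == b[0] else 0
        labeled.append(((a, b), label))
    return labeled


# Two hypothetical unlabeled images, each cut into two patches.
patches = [("img0", 0), ("img0", 1), ("img1", 0), ("img1", 1)]
pairs = make_pair_labels(patches)
positives = [pair for pair, label in pairs if label == 1]
negatives = [pair for pair, label in pairs if label == 0]
```

An encoder trained to predict these labels receives a supervised signal even though the underlying images carry no annotations, which is the sense in which the pretext task is "self-supervised".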

In this section, we introduce different tasks for pretraining PFMs in CV. The PFMs can be trained by specific pretext tasks, frame order, generation, reconstruction, memory bank, sharing, clustering, and so on. We summarize the PFMs proposed in CV in Table 2.

4.1 Learning by Specific Pretext Task

In the early stage of unsupervised learning, the network is trained by designing a special pretext task and predicting the answer to this task. Dosovitskiy et al. [132, 133] pretrain the Exemplar CNN to discriminate the different patches from the unlabelled data. The experiments show that this design learns useful representations that transfer to standard recognition tasks. In the method based on context prediction [134], a handcrafted supervised signal about the position information serves as the label for the pair classification. Inpainting [135] aims to pretrain models by predicting the missing center part. Because inpainting is a semantic-based prediction, another decoder is linked to the context encoder in this manner. Furthermore, the standard pixel-by-pixel reconstruction process of the decoder can be transferred to any other downstream inpainting tasks. Specifically, Colorization [136] is a method that evaluates how colorization as a pretext task can help to learn semantic representation for downstream tasks. It is also known as cross-channel encoding since different image channels serve as input and the output is discriminated. Similarly, Split-Brain Autoencoder [137] also learns representations in a self-supervised way by forcing the network to solve cross-channel prediction tasks. Jigsaw [138] is proposed to pretrain the designed Context-Free Network (CFN) in a self-supervised manner by first designing the Jigsaw puzzle as a pretext task. Completing Damaged Jigsaw Puzzles (CDJP) [139] learns image representation by further complicating the pretext task: the puzzle misses one piece and the other pieces contain incomplete color. Following the idea of designing efficient and effective pretext tasks, Noroozi et al. [140] use counting visual primitives as a special pretext task and outperform previous SOTA models on regular benchmarks. NAT [141] learns representation by aligning the output of the backbone CNN to low-dimensional noise. RotNet [142] is designed to predict different rotations of images.
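As one concrete instance of a pretext task, the rotation-prediction idea can be sketched as a data-generation step. The sketch assumes four rotations in multiples of 90 degrees and represents an image as a plain 2D list; the function names are hypothetical and this is not the RotNet authors' code.

```python
def rotate90(image):
    """Rotate a 2D list (rows of pixels) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def rotation_pretext_samples(image):
    """Return [(rotated_image, label)], labels 0..3 for 0, 90, 180, 270 degrees."""
    samples = []
    current = [list(row) for row in image]
    for label in range(4):
        samples.append((current, label))
        current = rotate90(current)
    return samples


# A tiny hypothetical 2x2 "image".
image = [[1, 2],
         [3, 4]]
samples = rotation_pretext_samples(image)
```

Each unlabeled image yields four training pairs whose labels come for free from the transformation that was applied, and the network is then trained as an ordinary four-way classifier.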






































































































Figure 7: Contrastive Predictive Coding [143]. The input sequence can represent both images and videos.

4.2 Learning by Frame Order

The learning of sequence data such as videos always involves frame processing through time steps. This problem often connects with solving pretext tasks that can help to learn visual temporal representations. Contrastive Predictive Coding (CPC) [143] is the first model to learn data representations by predicting the future in latent space. This model can be fed with data in any modality, like speech, images, text, etc. The components of CPC are shown in Fig. 7 from [143], where $x_t$ represents the input sequence of observations, $z_t$ is a sequence of latent representations after the encoder $g_{enc}$, and $c_t$ is a context latent representation that summarizes all the latent sequence $z_{\le t}$ after an autoregressive model $g_{ar}$. Unlike traditional models, which predict future frames $x_{t+k}$ with a generative model $p_k(x_{t+k} \mid c_t)$, CPC models a "density ratio" $f_k$ to represent the mutual information between the context latent representation $c_t$ and the future frame $x_{t+k}$:

$$f_k(x_{t+k}, c_t) \propto \frac{p(x_{t+k} \mid c_t)}{p(x_{t+k})}. \tag{10}$$

After the encoding of recurrent neural networks, $z_t$ and $c_t$ can both be chosen for the downstream tasks as needed. The encoder and autoregressive model are trained by InfoNCE [143] as follows

$$\mathcal{L} = -\mathbb{E}_{X}\left[\log \frac{f_k(x_{t+k}, c_t)}{\sum_{x_j \in X} f_k(x_j, c_t)}\right], \tag{11}$$

where $X$ denotes the training dataset containing both positive and negative samples. The density ratio $f_k$ can be estimated by optimizing $\mathcal{L}$. CPC v2 [144] revisits and improves CPC by pretraining on unsupervised representations, and its representation generality can be transferred to data-efficient downstream tasks.
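A minimal numeric sketch of the InfoNCE loss in Eq. (11) for one context vector is given below. It assumes the density ratio takes the form $f(x_j, c) = \exp(z_j \cdot c)$, i.e. a plain dot product between the latent of each candidate and the context; CPC itself uses a learned bilinear score, so the missing weight matrix, the function names, and the example vectors are all simplifying assumptions.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def info_nce(context, candidates, positive_index):
    """-log( f(x_pos, c) / sum_j f(x_j, c) ) with f(x_j, c) = exp(z_j . c)."""
    scores = [dot(z, context) for z in candidates]
    # log-sum-exp with the maximum subtracted for numerical stability.
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum - scores[positive_index]


context = [1.0, 0.0]                                   # hypothetical c_t
candidates = [[5.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]     # hypothetical latents z_j

loss_when_aligned = info_nce(context, candidates, positive_index=0)
loss_when_misaligned = info_nce(context, candidates, positive_index=2)
```

The loss is small when the positive sample scores much higher than the negatives and equals the log of the number of candidates when all scores are identical, so minimizing it amounts to classifying the true future sample among the negatives.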

4.3 Learning by Generation

Although many existing applications are popular after the development of the GAN-based approach, the

representation abilities inside the GANs are not entirely exploited due to the absence of a feature encoder.
