2 The Transformers catalog
Note:
For all the models available in Huggingface, I decided to link directly to the page in their documentation, since they do a fantastic job of offering a consistent format and links to everything else you might need, including the original papers. Only a few of the models are not included in Huggingface; for those, I try to include a link to their GitHub repository if available, or to a blog post if not. For all models, I also include a bibliographic reference.
2.1 Features of a Transformer
So hopefully by now you understand what Transformer models are and why they are so popular and impactful. In this section I will introduce a catalog of the most important Transformer models that have been developed to date. I will categorize each model according to the following properties: Pretraining Architecture, Pretraining Task, Compression, Application, Year, and Number of Parameters. Let's briefly define each of them:
2.1.1 Pretraining Architecture
We described the Transformer architecture as being made up of an Encoder and a Decoder, and that is true for the original Transformer. However, advances made since then have revealed that in some cases it is beneficial to use only the encoder, only the decoder, or both.
Encoder Pretraining
These models, which are also called bi-directional or auto-encoding, only use the encoder during pretraining, which is usually accomplished by masking words in the input sentence and training the model to reconstruct them. At each stage during pretraining, attention layers can access all the input words. This family of models is most useful for tasks that require understanding complete sentences, such as sentence classification or extractive question answering.
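To make this concrete, here is a minimal sketch of a pretrained encoder model reconstructing a masked word through the Hugging Face pipeline API; the bert-base-uncased checkpoint and the example sentence are illustrative choices, not the only options.

```python
# Sketch: a pretrained encoder (BERT) filling in a masked token.
# The checkpoint and input sentence are illustrative assumptions.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill_mask("Paris is the [MASK] of France."):
    # Each prediction carries a candidate token and its probability.
    print(f"{pred['token_str']:>12}  {pred['score']:.3f}")
```

Because the encoder attends to every position, the prediction for [MASK] is conditioned on the words both before and after it.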
Decoder Pretraining
Decoder models, often called auto-regressive, use only the decoder during a pretraining that is usually designed so that the model is forced to predict the next word. The attention layers can only access the words positioned before a given word in the sentence. These models are best suited for tasks involving text generation.
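The same idea in code: a minimal sketch of a decoder model generating text left to right, one next-word prediction at a time; the gpt2 checkpoint, the prompt, and the sampling settings are illustrative assumptions.

```python
# Sketch: a pretrained decoder (GPT-2) continuing a prompt token by token.
# Checkpoint, prompt, and sampling parameters are illustrative assumptions.
from transformers import pipeline

generate = pipeline("text-generation", model="gpt2")
out = generate("Decoder models predict the next word, so they",
               max_new_tokens=30, do_sample=True, top_p=0.9)
print(out[0]["generated_text"])
```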
Transformer (Encoder-Decoder) Pretraining
Encoder-decoder models, also called sequence-to-sequence, use both parts of the Transformer architecture. Attention layers of the encoder can access all the words in the input, while those of the decoder can only access the words positioned before a given word in the target sequence. The pretraining can be done using the objectives of encoder or decoder models, but usually involves something a bit more complex. These models are best suited for tasks revolving around generating new sentences conditioned on a given input, such as summarization, translation, or generative question answering.
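For illustration, here is a minimal sketch of an encoder-decoder model applied to summarization; the t5-small checkpoint, the input text, and the length limits are illustrative assumptions.

```python
# Sketch: a pretrained encoder-decoder (T5) summarizing a passage.
# Checkpoint, input text, and length limits are illustrative assumptions.
from transformers import pipeline

summarize = pipeline("summarization", model="t5-small")
article = ("Transformers were introduced in 2017 and quickly became the "
           "dominant architecture in natural language processing, powering "
           "models for translation, summarization, and question answering.")
# The encoder reads the whole article; the decoder generates the summary
# left to right, attending back to the encoder's output at every step.
print(summarize(article, max_length=25, min_length=5)[0]["summary_text"])
```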
2.1.2 Pretraining Task
When training a model we need to define a task for the model to learn from. Some of the typical tasks, such as predicting the next word or learning to reconstruct masked words, were already mentioned above. "Pre-trained Models for Natural Language Processing: A Survey" [10] includes a fairly comprehensive taxonomy of pretraining tasks, all of which can be considered self-supervised:
1. Language Modeling (LM): predict the next token (in the case of unidirectional LM) or the previous and next tokens (in the case of bidirectional LM)
2. Masked Language Modeling (MLM): mask out some tokens from the input sentences and then train the model to predict the masked tokens from the rest of the tokens (see the sketch after this list)
3. Permuted Language Modeling (PLM): the same as LM, but on a random permutation of the input sequence. A permutation is randomly sampled from all possible permutations; some of the tokens are then chosen as the target, and the model is trained to predict them.
4. Denoising Autoencoder (DAE): take a partially corrupted input (e.g., randomly sampling tokens from the input and replacing them with [MASK] elements, randomly deleting tokens from the input, or shuffling sentences in random order) and aim to recover the original, undistorted input.
5. Contrastive Learning (CTL): a score function for text pairs is learned under the assumption that some observed pairs of text are more semantically similar than randomly sampled text. It includes:
• Deep InfoMax (DIM): maximize mutual information between an image representation and local regions of the image;
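As promised in item 2, here is a minimal sketch of how MLM training targets can be constructed, following the 80/10/10 corruption recipe from the original BERT paper; the token ids, vocabulary size, and special-token id are toy assumptions.

```python
# Sketch: building (corrupted_input, labels) pairs for MLM pretraining.
# Vocabulary size and the [MASK] token id are toy assumptions.
import torch

def mask_for_mlm(input_ids: torch.Tensor,
                 vocab_size: int,
                 mask_token_id: int,
                 mask_prob: float = 0.15):
    """Return (corrupted_ids, labels) following the 80/10/10 BERT recipe."""
    labels = input_ids.clone()
    # Choose ~15% of positions as prediction targets.
    target = torch.rand(input_ids.shape) < mask_prob
    labels[~target] = -100  # convention: ignored by cross-entropy loss

    corrupted = input_ids.clone()
    # 80% of target positions become [MASK].
    replace_mask = target & (torch.rand(input_ids.shape) < 0.8)
    corrupted[replace_mask] = mask_token_id
    # 10% become a random token; the remaining 10% are kept unchanged.
    replace_rand = target & ~replace_mask & (torch.rand(input_ids.shape) < 0.5)
    corrupted[replace_rand] = torch.randint(vocab_size,
                                            (int(replace_rand.sum()),))
    return corrupted, labels

ids = torch.randint(5, 1000, (1, 12))  # a pretend-tokenized sentence
x, y = mask_for_mlm(ids, vocab_size=1000, mask_token_id=4)
print("corrupted:", x)
print("labels:   ", y)
```

The model then receives the corrupted ids and is trained to predict the original tokens at the target positions, using the rest of the sentence as context.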