The Transformer Model: Attention Is All You Need


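"The Transformer Model and the Overhaul of Traditional Sequence Transduction Methods"

In deep learning, and in natural language processing (NLP) in particular, "Attention Is All You Need" is a landmark paper by Ashish Vaswani, Noam Shazeer, Niki Parmar, and colleagues at Google Brain and Google Research. The paper introduced the Transformer architecture, a fundamental departure from earlier sequence transduction models. Those models were typically built on complex recurrent neural networks (RNNs) or convolutional neural networks (CNNs) arranged as an encoder and a decoder, with the recurrent variants processing sequences through long short-term memory (LSTM) units or similar recurrent structures. They performed well but had two main limitations. First, recurrent models must process a sequence one time step at a time, which limits parallelization. Second, training is slow as a result, because the forward and backward passes have to step through the sequence in order.

The Transformer drops recurrence and convolution entirely and relies solely on attention, in particular self-attention. Self-attention lets the model take every other element of the input into account while processing each element, which greatly improves its grasp of global context. This simplifies the architecture, removes the sequential dependency between time steps, and substantially increases parallelism, making large-scale training far more efficient.

On machine translation, the Transformer reached 28.4 BLEU on the WMT 2014 English-to-German task, exceeding the best previously reported results. This result showed how effective attention can be for sequence data: the model improved translation quality while also cutting training cost.

The Transformer's success drew wide attention. Its core components, such as multi-head attention and positional encoding, were introduced in the original paper, and later researchers built many variants on top of them, driving further progress in NLP. Today the Transformer is a cornerstone of modern NLP and a standard tool in deep learning, widely used for text classification, text generation, dialogue systems, and other tasks. Its simple structure and strong performance make it a natural choice for sequence modeling problems.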

Figure 1: The Transformer - model architecture.

3.1 Encoder and Decoder Stacks

Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. We employ a residual connection around each of the two sub-layers, followed by layer normalization. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension $d_{\text{model}} = 512$.
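The wrapper LayerNorm(x + Sublayer(x)) can be sketched in a few lines of plain Python. This is only an illustration for a single position vector: the names `layer_norm`, `residual_sublayer`, and `toy_sublayer` are hypothetical, the normalization omits the learned gain and bias that a real layer normalization carries, and the toy sub-layer stands in for attention or the feed-forward network.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and (nearly) unit variance.

    Assumption: no learned gain or bias, to keep the sketch short.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def residual_sublayer(x, sublayer):
    """Compute LayerNorm(x + Sublayer(x)) for one position vector x."""
    y = sublayer(x)
    # The residual sum only works if the sub-layer keeps the dimension,
    # which is why every sub-layer outputs vectors of size d_model.
    assert len(y) == len(x), "sub-layer must preserve the model dimension"
    return layer_norm([a + b for a, b in zip(x, y)])


def toy_sublayer(x):
    # Hypothetical stand-in for self-attention or the feed-forward network.
    return [2.0 * v for v in x]


x = [1.0, 2.0, 3.0, 4.0]  # toy d_model = 4 instead of 512
out = residual_sublayer(x, toy_sublayer)
```

The dimension check inside `residual_sublayer` mirrors the reason the text gives for fixing a single output dimension: the element-wise sum x + Sublayer(x) is only defined when both vectors have the same length.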

Decoder: The decoder is also composed of a stack of N = 6 identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization. We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.
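The masking described for the decoder can be illustrated with a small, standard-library sketch. The helper names `causal_mask` and `masked_softmax` are hypothetical, and the scores are toy values. Here each position may attend to itself and to earlier positions; because the decoder inputs are the outputs shifted by one position, this still means the prediction for position i sees only outputs before i. Implementations commonly set the disallowed scores to negative infinity before the softmax; assigning them zero weight directly, as below, has the same effect.

```python
import math


def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def masked_softmax(scores, mask):
    """Row-wise softmax that gives zero weight to masked-out positions."""
    weights = []
    for row, allowed in zip(scores, mask):
        kept = [s for s, ok in zip(row, allowed) if ok]
        peak = max(kept)  # subtract the row maximum for numerical stability
        exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(row, allowed)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Toy example: four positions with identical raw scores.
scores = [[0.0] * 4 for _ in range(4)]
weights = masked_softmax(scores, causal_mask(4))
```

With identical scores, row i of `weights` spreads its mass evenly over positions 0 through i and puts nothing on later positions, which makes the triangular shape of the mask easy to see.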

3.2 Attention

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
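A minimal sketch of this mapping for a single query follows. The passage above does not fix the compatibility function, so the code assumes a scaled dot product followed by a softmax as one possible choice; the function names (`softmax`, `dot`, `attend`) and the toy vectors are hypothetical.

```python
import math


def softmax(xs):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys, values):
    """Return (output, weights): the weighted sum of values for one query."""
    # Assumed compatibility function: dot product of query and key,
    # scaled by the square root of the query dimension.
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return output, weights


keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
output, weights = attend([4.0, 0.0], keys, values)
```

The query points in the same direction as the first key, so the first value receives the larger weight and dominates the output vector.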

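The encoder consists of N = 6 identical layers. Each layer has two sub-layers: multi-head self-attention and a fully connected feed-forward network.

A residual connection is applied around each of the two sub-layers, and each is followed by a normalization layer.

Residual connection: also known as a skip connection.

Compared with the encoder, the decoder inserts a third sub-layer, which performs multi-head attention over the encoder output.

This masking, together with the fact that the output embeddings are offset by one position, ensures that the prediction for position i can depend only on the known outputs at positions before i.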
