Transformer uses Layer Normalization
The Transformer uses Layer Normalization [1]. In the Transformer encoder, each layer contains two sublayers: the first is multi-head self-attention, and the second is a position-wise feed-forward network. When computing the encoder's self-attention, the queries, keys, and values all come from the output of the previous encoder layer. Each sublayer is wrapped in a residual connection (which requires the sublayer's input and output to have the same shape), immediately followed by layer normalization. As a result, the encoder outputs one representation vector for every position in the input sequence. For a complete code implementation of the Transformer, see the Transformer chapter at d2l.ai. [1][2][3]
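To make the "Add & Norm" step concrete, here is a minimal PyTorch sketch of a residual connection followed by layer normalization, similar in spirit to the `AddNorm` module in d2l.ai; the class and variable names are illustrative assumptions, not the exact d2l.ai code.

```python
import torch
from torch import nn

class AddNorm(nn.Module):
    """Residual connection followed by layer normalization (a sketch)."""
    def __init__(self, normalized_shape, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.ln = nn.LayerNorm(normalized_shape)

    def forward(self, x, sublayer_out):
        # The residual connection requires x and sublayer_out to have
        # the same shape; layer normalization is applied to the sum.
        return self.ln(x + self.dropout(sublayer_out))

# Example: wrap a multi-head self-attention sublayer with Add & Norm.
d_model, num_heads = 512, 8  # illustrative sizes
attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
add_norm = AddNorm(d_model)

x = torch.rand(2, 10, d_model)   # (batch, seq_len, d_model)
attn_out, _ = attn(x, x, x)      # queries, keys, values all come from x
y = add_norm(x, attn_out)        # shape preserved: (2, 10, 512)
print(y.shape)
```

Note that layer normalization normalizes across the feature dimension of each position independently, so it works for variable-length sequences where batch normalization would not.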
#### References
- [1] [第十章(4):Transformer之Layer Normalization与Transformer整体结构](https://blog.csdn.net/lihuanyu520/article/details/127558754)
- [2][3] [transformer及动手学习transformer](https://blog.csdn.net/jiangchao98/article/details/121057288)