Transformer Worked Example

Date: 2024-12-30 17:20:48
### Transformer Model Implementation Tutorial
#### Word Embedding Layer and Positional Encoding
When building a Transformer model, the word embedding layer maps input tokens into a continuous vector space. To keep magnitudes consistent with the positional encodings added afterwards, the embedding weights are multiplied by \( \sqrt{d_{\text{model}}} \)[^1].
```python
import math

import torch.nn as nn

class Embeddings(nn.Module):
    def __init__(self, d_model, vocab_size):
        super().__init__()
        self.lut = nn.Embedding(vocab_size, d_model)  # lookup table
        self.d_model = d_model

    def forward(self, x):
        # scale the embeddings by sqrt(d_model)
        return self.lut(x) * math.sqrt(self.d_model)
```
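As a quick sanity check of the scaling, a minimal sketch (sizes here are hypothetical, chosen so that \( \sqrt{d_{\text{model}}} = 4 \)):

```python
import math

import torch
import torch.nn as nn

# hypothetical sizes for illustration
d_model, vocab_size = 16, 100
lut = nn.Embedding(vocab_size, d_model)   # same lookup table as in the layer above
tokens = torch.tensor([[3, 14, 15]])      # one sequence of three token ids
scaled = lut(tokens) * math.sqrt(d_model)
print(scaled.shape)                       # torch.Size([1, 3, 16])
```

Every entry of the raw lookup is multiplied by the same constant, so the output shape is unchanged: `(batch, seq_len, d_model)`.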
Next comes the positional encoding, which gives the model the ability to handle sequence order:
```python
import numpy as np
import torch

def positional_encoding(positions, d_model):
    # angle rates: 1 / 10000^(2i / d_model) for each dimension index i
    angle_rates = 1 / np.power(10000,
                               (2 * (np.arange(d_model) // 2)) / np.float32(d_model))
    angle_rads = np.arange(positions)[:, np.newaxis] * angle_rates[np.newaxis, :]
    # sine on even dimension indices, cosine on odd ones, concatenated
    sines = np.sin(angle_rads[:, 0::2])
    cosines = np.cos(angle_rads[:, 1::2])
    pos_encoding = np.concatenate([sines, cosines], axis=-1)
    pos_encoding = pos_encoding[np.newaxis, ...]  # add a batch dimension
    return torch.tensor(pos_encoding, dtype=torch.float32)
```
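A numpy-only sketch can sanity-check this concatenated variant: at position 0 the angles are all zero, so the sine half of the vector is 0 and the cosine half is 1 (the sizes below are hypothetical):

```python
import numpy as np

# numpy-only version of the encoding above, without the torch conversion
def positional_encoding_np(positions, d_model):
    angle_rates = 1 / np.power(10000,
                               (2 * (np.arange(d_model) // 2)) / np.float32(d_model))
    angle_rads = np.arange(positions)[:, np.newaxis] * angle_rates[np.newaxis, :]
    sines = np.sin(angle_rads[:, 0::2])
    cosines = np.cos(angle_rads[:, 1::2])
    return np.concatenate([sines, cosines], axis=-1)[np.newaxis, ...]

pe = positional_encoding_np(50, 128)
print(pe.shape)                     # (1, 50, 128)
# position 0: sin(0) = 0 in the first half, cos(0) = 1 in the second half
assert np.allclose(pe[0, 0, :64], 0.0)
assert np.allclose(pe[0, 0, 64:], 1.0)
```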
#### Encoder Structure
The encoder consists of a stack of identical layers, each containing two sub-layers: a multi-head self-attention mechanism and a simple position-wise feed-forward network. A residual connection followed by layer normalization is applied around each of the two sub-layers[^4].
```python
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model: int, num_heads: int, dff: int,
                 dropout_rate: float = 0.1):
        super().__init__()
        self.mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads,
                                         batch_first=True)
        # position-wise feed-forward network: two linear layers with ReLU
        self.ffn = nn.Sequential(
            nn.Linear(d_model, dff),
            nn.ReLU(),
            nn.Linear(dff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout_rate)
        self.dropout2 = nn.Dropout(dropout_rate)

    def forward(self, x, mask=None):
        # sub-layer 1: multi-head self-attention + residual + layer norm
        attn_output, _ = self.mha(x, x, x, key_padding_mask=mask)
        out1 = self.norm1(x + self.dropout1(attn_output))
        # sub-layer 2: position-wise feed-forward + residual + layer norm
        ffn_output = self.ffn(out1)
        out2 = self.norm2(out1 + self.dropout2(ffn_output))
        return out2
```
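The invariant that makes the residual connections well-defined is that each sub-layer preserves the input shape. A minimal sketch using `nn.MultiheadAttention` directly, with hypothetical sizes:

```python
import torch
import torch.nn as nn

# hypothetical sizes for illustration
d_model, num_heads = 16, 4
mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads,
                            batch_first=True)
norm = nn.LayerNorm(d_model)

x = torch.randn(2, 10, d_model)   # (batch, seq_len, d_model)
attn_out, _ = mha(x, x, x)        # attention output keeps the input shape
out = norm(x + attn_out)          # so x + sublayer(x) is well-defined
print(out.shape)                  # torch.Size([2, 10, 16])
```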
#### Teacher Forcing During Training
Using teacher forcing during training speeds up convergence and improves performance: regardless of what the model predicted at the current time step, the next time step always receives the ground-truth token as input[^3].
```python
for epoch in range(EPOCHS):
    for batch in dataset:
        # teacher forcing: the decoder input is the target shifted right,
        # and the labels are the target shifted left
        tar_inp = batch['target'][:, :-1]
        tar_real = batch['target'][:, 1:]
        with tf.GradientTape() as tape:
            predictions = transformer(batch['input'], tar_inp, training=True,
                                      dec_padding_mask=batch['dec_padding_mask'],
                                      look_ahead_mask=batch['look_ahead_mask'])
            loss = loss_function(real=tar_real, pred=predictions)
        gradients = tape.gradient(loss, transformer.trainable_variables)
        optimizer.apply_gradients(zip(gradients, transformer.trainable_variables))
```
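The shift that implements teacher forcing can be illustrated on a toy target sequence (the token ids are hypothetical, with 1 as a start marker and 2 as an end marker):

```python
import numpy as np

# one ground-truth target sequence: <start> 7 8 9 <end>
target = np.array([[1, 7, 8, 9, 2]])

decoder_input = target[:, :-1]   # fed to the decoder at every step
labels = target[:, 1:]           # compared against the predictions

print(decoder_input.tolist())    # [[1, 7, 8, 9]]
print(labels.tolist())           # [[7, 8, 9, 2]]
```

Each decoder position thus sees the true previous token rather than its own (possibly wrong) prediction, which is what allows all positions to be trained in parallel.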