table transformer
Table Transformer is an approach to tabular data built on the Transformer model. It handles missing and noisy features well, which is one reason trying Transformers on tabular data is worthwhile, and it offers several advantages over other tabular-data models. Since the Transformer was originally proposed as a way to model language, comparing tabular data to human language may seem a little odd at first.
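The exact model variant is not spelled out here; as a rough, hypothetical illustration of the idea (assuming a TabTransformer-style setup in which each categorical column is embedded as one token), a minimal sketch might look like the following. All cardinalities and dimensions below are made up for the example.
```python
import torch
import torch.nn as nn


class TabularTransformer(nn.Module):
    """Minimal sketch: treat each categorical column as one token."""

    def __init__(self, cardinalities, d_model=32, n_heads=4, n_layers=2, n_classes=2):
        super().__init__()
        # One embedding table per categorical column (cardinalities are assumed).
        self.embeddings = nn.ModuleList(
            [nn.Embedding(card, d_model) for card in cardinalities])
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.head = nn.Linear(d_model * len(cardinalities), n_classes)

    def forward(self, x_cat):
        # x_cat: (batch, n_columns) integer-encoded categorical features.
        tokens = torch.stack(
            [emb(x_cat[:, i]) for i, emb in enumerate(self.embeddings)], dim=1)
        h = self.encoder(tokens)          # (batch, n_columns, d_model)
        return self.head(h.flatten(1))    # flatten tokens and classify


model = TabularTransformer(cardinalities=[10, 5, 7])
logits = model(torch.randint(0, 5, (8, 3)))   # dummy batch of 8 rows, 3 columns
```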
Related questions
Transformer overall architecture diagram
The Transformer is a deep learning model based on the self-attention mechanism, proposed by Vaswani et al. in the 2017 paper "Attention Is All You Need". It is mainly used for processing sequence data, for example in natural language processing tasks. Its architecture consists of the following key parts:
1. **Input Embedding**: each input element (a word, character, etc.) is mapped to a dense vector via an embedding look-up table.
2. **Positional Encoding**: since the network has no recurrent structure, position information is added to the input vectors to keep the model sensitive to the relative order of elements in the sequence.
3. **Multi-head Self-Attention**: the core of the Transformer. Several attention heads run in parallel, each attending to different aspects of the input. Each layer typically consists of a self-attention block, a feed-forward network, and residual connections.
4. **Layer Normalization**: applied around each sub-layer, normalizing its output to help stabilize training.
5. **Scaled Dot-Product Attention**: computes how relevant each position is to every other position and returns the attention-weighted sum of the values (see the sketch after this list).
6. **Residual Connections**: let information flow directly from one layer to the next, easing optimization and strengthening the network's expressiveness.
7. **Stacking Multiple Layers**: stacking several Transformer layers produces a deep model with greater representational capacity.
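A minimal sketch of the scaled dot-product attention from point 5, with arbitrary toy dimensions (not tied to any particular model size):
```python
import math
import torch


def scaled_dot_product_attention(q, k, v):
    """attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V"""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # pairwise relevance
    weights = scores.softmax(dim=-1)                     # normalize per query
    return weights @ v                                   # weighted sum of values


# Toy example: batch of 2 sequences, 5 tokens, 16 dimensions per head.
q = torch.randn(2, 5, 16)
k = torch.randn(2, 5, 16)
v = torch.randn(2, 5, 16)
out = scaled_dot_product_attention(q, k, v)   # shape (2, 5, 16)
```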
swin transformer_b
### Swin Transformer B model architecture
Swin Transformer B is a particular configuration of the Swin Transformer architecture, usually referring to the Base-scale model. It uses a hierarchical design to build visual feature representations.
#### Hierarchical structure
Swin Transformer splits the input image into non-overlapping patches and linearly embeds them into a high-dimensional space [^1]. For the Swin Transformer B variant:
- the input resolution is 224×224 pixels;
- the patch size is 4×4, so each patch covers 16 pixels and is mapped to one token;
- the embedding dimension is set to 128.
```python
import torch.nn as nn


class PatchEmbed(nn.Module):
    """Image to Patch Embedding: split the image into non-overlapping patches
    and project each patch to a token of dimension `embed_dim`."""

    def __init__(self, img_size=224, patch_size=4, in_chans=3, embed_dim=128):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        self.img_size = img_size
        self.patch_size = patch_size
        self.num_patches = num_patches
        # A conv with kernel = stride = patch_size is equivalent to a linear
        # projection of each non-overlapping patch.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        B, C, H, W = x.shape
        # (B, C, H, W) -> (B, embed_dim, H/ps, W/ps) -> (B, num_patches, embed_dim)
        x = self.proj(x).flatten(2).transpose(1, 2)
        return x
```
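As a quick sanity check of the Swin-B settings above (using the `PatchEmbed` module just defined), a 224×224 RGB image becomes a sequence of 56×56 = 3136 tokens of dimension 128:
```python
import torch

patch_embed = PatchEmbed(img_size=224, patch_size=4, in_chans=3, embed_dim=128)
dummy = torch.randn(1, 3, 224, 224)   # one fake RGB image
tokens = patch_embed(dummy)
print(tokens.shape)                   # torch.Size([1, 3136, 128])
```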
#### Shifted Window Multi-head Self Attention (SW-MSA)
One of the core mechanisms replaces global self-attention with multi-head self-attention computed inside local windows, which lowers the computational complexity while retaining strong modeling capacity [^2]. Concretely, consecutive blocks alternate: one block partitions the feature map into regular windows, and the next applies a shifted window partition so that information can flow between neighboring windows.
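A rough sketch of the regular and shifted window partitioning (a simplified illustration with assumed shapes; the attention-mask construction used by the official implementation is omitted):
```python
import torch


def window_partition(x, window_size):
    """Split a (B, H, W, C) feature map into (num_windows * B, ws, ws, C) windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size, window_size, C)


x = torch.randn(1, 56, 56, 128)                # stage-1 feature map for Swin-B
regular = window_partition(x, window_size=7)   # W-MSA: regular 7x7 windows
shifted = window_partition(
    torch.roll(x, shifts=(-3, -3), dims=(1, 2)), window_size=7)  # SW-MSA shift
print(regular.shape, shifted.shape)            # (64, 7, 7, 128) for both
```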
#### Feed-forward Network (FFN)
Each attention block is followed by a feed-forward network made of two fully connected layers, serving as the non-linear transformation component (the `Mlp` class in the code below).
```python
from typing import Tuple

import torch
import torch.nn as nn
from torch import Tensor
from torch.nn.init import trunc_normal_


def drop_path(x: Tensor,
              drop_prob: float = 0.,
              training: bool = False) -> Tensor:
    """Stochastic depth: randomly zero whole samples of the residual branch."""
    if drop_prob == 0. or not training:
        return x
    keep_prob = 1 - drop_prob
    # One Bernoulli draw per sample, broadcast over all remaining dimensions.
    shape = (x.shape[0], ) + (1, ) * (x.ndim - 1)
    random_tensor = keep_prob + torch.rand(shape, dtype=x.dtype, device=x.device)
    output = x.div(keep_prob) * random_tensor.floor()
    return output


class DropPath(nn.Module):

    def __init__(self, p: float = 0.) -> None:
        super().__init__()
        self.p = p

    def forward(self, x: Tensor) -> Tensor:
        return drop_path(x, self.p, self.training)


class Mlp(nn.Module):
    """Two-layer feed-forward network (the FFN after each attention block)."""

    def __init__(self,
                 in_features: int,
                 hidden_features: int = None,
                 out_features: int = None,
                 act_layer: nn.Module = nn.GELU,
                 drop: float = 0.) -> None:
        super().__init__()
        out_features = out_features or in_features
        hidden_features = hidden_features or in_features
        self.fc1 = nn.Linear(in_features, hidden_features)
        self.act = act_layer()
        self.fc2 = nn.Linear(hidden_features, out_features)
        self.drop = nn.Dropout(drop)

    def forward(self, x: Tensor) -> Tensor:
        x = self.fc1(x)
        x = self.act(x)
        x = self.drop(x)
        x = self.fc2(x)
        x = self.drop(x)
        return x


class WindowAttention(nn.Module):
    """Window-based multi-head self-attention with relative position bias."""

    def __init__(self,
                 dim: int,
                 window_size: Tuple[int, int],
                 num_heads: int,
                 qkv_bias: bool = True,
                 attn_drop: float = 0.,
                 proj_drop: float = 0.) -> None:
        super().__init__()
        self.dim = dim
        self.window_size = window_size
        self.num_heads = num_heads
        head_dim = dim // num_heads
        self.scale = head_dim**-0.5

        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
        self.attn_drop = nn.Dropout(attn_drop)
        self.softmax = nn.Softmax(dim=-1)
        self.proj = nn.Linear(dim, dim)
        self.proj_drop = nn.Dropout(proj_drop)

        # Learnable bias table: one entry per relative (dy, dx) offset and head.
        self.relative_position_bias_table = nn.Parameter(
            torch.zeros((2 * window_size[0] - 1) * (2 * window_size[1] - 1),
                        num_heads))

        # Precompute the pairwise relative position index inside one window.
        coords_h = torch.arange(self.window_size[0])
        coords_w = torch.arange(self.window_size[1])
        coords = torch.stack(torch.meshgrid([coords_h, coords_w]))
        coords_flatten = torch.flatten(coords, 1)
        relative_coords = coords_flatten[:, :, None] - coords_flatten[:, None, :]
        relative_coords = relative_coords.permute(1, 2, 0).contiguous()
        relative_coords[:, :, 0] += self.window_size[0] - 1
        relative_coords[:, :, 1] += self.window_size[1] - 1
        relative_coords[:, :, 0] *= 2 * self.window_size[1] - 1
        relative_position_index = relative_coords.sum(-1)
        self.register_buffer('relative_position_index', relative_position_index)

        trunc_normal_(self.relative_position_bias_table, std=.02)

    def forward(self, x: Tensor, mask: Tensor = None) -> Tensor:
        b_, n, c = x.shape
        # Project to queries, keys and values: (3, B, heads, N, head_dim).
        qkv = self.qkv(x).reshape(b_, n, 3, self.num_heads,
                                  c // self.num_heads).permute(2, 0, 3, 1, 4)
        q, k, v = qkv.unbind(0)

        q = q * self.scale
        attn = q @ k.transpose(-2, -1)

        # Add the relative position bias to the attention logits.
        relative_position_bias = self.relative_position_bias_table[
            self.relative_position_index.view(-1)].view(
                self.window_size[0] * self.window_size[1],
                self.window_size[0] * self.window_size[1], -1)
        relative_position_bias = relative_position_bias.permute(2, 0, 1).contiguous()
        attn = attn + relative_position_bias.unsqueeze(0)

        # Apply the shifted-window mask (for SW-MSA), then normalize.
        if mask is not None:
            nw = mask.shape[0]
            attn = attn.view(b_ // nw, nw, self.num_heads, n,
                             n) + mask.unsqueeze(1).unsqueeze(0)
            attn = attn.view(-1, self.num_heads, n, n)
        attn = self.softmax(attn)
        attn = self.attn_drop(attn)

        x = (attn @ v).transpose(1, 2).reshape(b_, n, c)
        x = self.proj(x)
        x = self.proj_drop(x)
        return x
```
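As a quick usage sketch (shapes chosen to match a Swin-B first stage: 7×7 windows, 128-dim tokens, 4 attention heads; the numbers here are illustrative assumptions):
```python
# Assumes the WindowAttention class defined above.
attn = WindowAttention(dim=128, window_size=(7, 7), num_heads=4)
windows = torch.randn(64, 49, 128)   # 64 windows of 7*7 = 49 tokens each
out = attn(windows)                  # regular W-MSA: no mask needed
print(out.shape)                     # torch.Size([64, 49, 128])
```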