Local-to-Global Self-Attention in Vision Transformers
Vision Transformers (ViT) have shown remarkable performance on a wide range of vision tasks, such as image classification and object detection. However, the self-attention mechanism in ViT has quadratic time and memory complexity with respect to the input sequence length, which limits its application to high-resolution images.
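To make the quadratic cost concrete, here is a minimal PyTorch sketch of single-head self-attention with the query/key/value projections omitted; the shapes and example sequence lengths are illustrative assumptions, not taken from any particular ViT implementation.

```python
import torch

def self_attention(x):
    # x: (batch, N, d) where N is the number of tokens.
    # The attention matrix is (N, N), so cost grows as O(N^2 * d).
    d = x.size(-1)
    attn = torch.softmax(x @ x.transpose(-2, -1) / d ** 0.5, dim=-1)  # (batch, N, N)
    return attn @ x  # (batch, N, d)

# A 224x224 image with 16x16 patches gives N = 196 tokens; at 1024x1024
# it gives N = 4096, and the attention matrix alone holds
# 4096^2 ≈ 16.8M entries per head.
x = torch.randn(1, 196, 64)
print(self_attention(x).shape)  # torch.Size([1, 196, 64])
```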
To address this issue, researchers have proposed a novel technique called Local-to-Global Self-Attention (LGSA), which reduces the computational complexity of the self-attention operation in ViT while maintaining its performance. LGSA divides the input image into local patches and performs self-attention within each patch. Then, it aggregates the information from different patches through a global self-attention mechanism.
The local self-attention operation considers only the interactions among tokens within a patch, which significantly reduces the computational cost. The global self-attention mechanism then captures long-range dependencies among the patches, ensuring that the model still sees contextual information from the entire image. A sketch of this two-stage scheme follows below.
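The following is a minimal PyTorch sketch of one local-to-global block under stated assumptions: the class name `LocalToGlobalAttention`, the choice of mean-pooling one summary token per window, and the residual broadcast of global context back to local tokens are illustrative, not the exact formulation from the LGSA paper.

```python
import torch
import torch.nn as nn

class LocalToGlobalAttention(nn.Module):
    """Sketch of a local-to-global attention block (hypothetical naming).

    Tokens are grouped into non-overlapping windows; attention runs
    inside each window (local), then mean-pooled window summaries
    attend to one another (global) and are broadcast back.
    """
    def __init__(self, dim, window, heads=4):
        super().__init__()
        self.window = window
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        b, n, d = x.shape
        w = self.window
        assert n % w == 0, "sequence length must be divisible by window size"
        # Local stage: attention within each window of w tokens.
        local = x.reshape(b * n // w, w, d)
        local, _ = self.local_attn(local, local, local)
        local = local.reshape(b, n, d)
        # Global stage: one mean-pooled summary token per window
        # attends over all other window summaries.
        summaries = local.reshape(b, n // w, w, d).mean(dim=2)  # (b, n/w, d)
        summaries, _ = self.global_attn(summaries, summaries, summaries)
        # Broadcast the global context back to every token in each window.
        return local + summaries.repeat_interleave(w, dim=1)

x = torch.randn(2, 196, 64)           # e.g. 14x14 = 196 patch tokens
block = LocalToGlobalAttention(64, window=14)
print(block(x).shape)                  # torch.Size([2, 196, 64])
```

With this decomposition, the local stage costs on the order of n·w·d (n/w windows, each with a w×w attention matrix) rather than n²·d, and the global stage operates on only n/w summary tokens, which is where the overall savings come from.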
LGSA has been shown to outperform the standard ViT on image classification benchmarks, including ImageNet and CIFAR-100. Moreover, it can be incorporated into existing ViT architectures with minimal changes.
In summary, LGSA addresses the quadratic complexity of self-attention in ViT, making it practical for high-resolution image recognition tasks.