Local-to-Global Self-Attention in Vision Transformers
Vision Transformers (ViT) have shown remarkable performance on a wide range of vision tasks, such as image classification and object detection. However, the self-attention mechanism in ViT has quadratic time and memory complexity with respect to the input sequence length, which limits its application to high-resolution images.
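To make the quadratic cost concrete, here is a minimal PyTorch sketch of single-head self-attention with the query/key/value projections omitted; the shapes and example sequence lengths are illustrative assumptions, not taken from any particular ViT implementation.

```python
import torch

def self_attention(x):
    # x: (batch, N, d) where N is the number of tokens.
    # The attention matrix is (N, N), so cost grows as O(N^2 * d).
    d = x.size(-1)
    attn = torch.softmax(x @ x.transpose(-2, -1) / d ** 0.5, dim=-1)  # (batch, N, N)
    return attn @ x  # (batch, N, d)

# A 224x224 image with 16x16 patches gives N = 196 tokens; at 1024x1024
# it gives N = 4096, and the attention matrix alone holds
# 4096^2 ≈ 16.8M entries per head.
x = torch.randn(1, 196, 64)
print(self_attention(x).shape)  # torch.Size([1, 196, 64])
```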
To address this issue, researchers have proposed a novel technique called Local-to-Global Self-Attention (LGSA), which reduces the computational complexity of the self-attention operation in ViT while maintaining its performance. LGSA divides the input image into local patches and performs self-attention within each patch. Then, it aggregates the information from different patches through a global self-attention mechanism.
The local self-attention operation considers only the interactions among tokens within a patch, which significantly reduces the computational cost. The global self-attention mechanism then captures long-range dependencies among the patches, ensuring that the model still sees contextual information from the entire image. A sketch of this two-stage scheme follows below.
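The following is a minimal PyTorch sketch of one local-to-global block under stated assumptions: the class name `LocalToGlobalAttention`, the choice of mean-pooling one summary token per window, and the residual broadcast of global context back to local tokens are illustrative, not the exact formulation from the LGSA paper.

```python
import torch
import torch.nn as nn

class LocalToGlobalAttention(nn.Module):
    """Sketch of a local-to-global attention block (hypothetical naming).

    Tokens are grouped into non-overlapping windows; attention runs
    inside each window (local), then mean-pooled window summaries
    attend to one another (global) and are broadcast back.
    """
    def __init__(self, dim, window, heads=4):
        super().__init__()
        self.window = window
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        b, n, d = x.shape
        w = self.window
        assert n % w == 0, "sequence length must be divisible by window size"
        # Local stage: attention within each window of w tokens.
        local = x.reshape(b * n // w, w, d)
        local, _ = self.local_attn(local, local, local)
        local = local.reshape(b, n, d)
        # Global stage: one mean-pooled summary token per window
        # attends over all other window summaries.
        summaries = local.reshape(b, n // w, w, d).mean(dim=2)  # (b, n/w, d)
        summaries, _ = self.global_attn(summaries, summaries, summaries)
        # Broadcast the global context back to every token in each window.
        return local + summaries.repeat_interleave(w, dim=1)

x = torch.randn(2, 196, 64)           # e.g. 14x14 = 196 patch tokens
block = LocalToGlobalAttention(64, window=14)
print(block(x).shape)                  # torch.Size([2, 196, 64])
```

With this decomposition, the local stage costs on the order of n·w·d (n/w windows, each with a w×w attention matrix) rather than n²·d, and the global stage operates on only n/w summary tokens, which is where the overall savings come from.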
LGSA has been shown to outperform the standard ViT on image classification benchmarks, including ImageNet and CIFAR-100. Moreover, it can be incorporated into existing ViT architectures with minimal changes.
In summary, LGSA addresses the quadratic complexity of self-attention in ViT, making it practical for high-resolution image recognition tasks.