3-stage models, whose first-stage patch size is 8. For later stages, the patch sizes are set to 2, which downsizes the feature map resolution by a factor of 2. Following the practice in ResNet, we double the hidden dimension whenever the feature map resolution is halved. We list a few representative model configurations in Table 1. Different attention types a have different choices for the number of global tokens n_g, but they share the same model configurations; thus we do not specify a and n_g in Table 1. We refer readers to the Supplementary for the complete list of model configurations used in this paper.
Size         | Stage1     | Stage2     | Stage3      | Stage4
             | n,p,h,d    | n,p,h,d    | n,p,h,d     | n,p,h,d
Tiny         | 1,4,1,48   | 1,2,3,96   | 9,2,3,192   | 1,2,6,384
Small        | 1,4,3,96   | 2,2,3,192  | 8,2,6,384   | 1,2,12,768
Medium-D     | 1,4,3,96   | 4,2,3,192  | 16,2,6,384  | 1,2,12,768
Medium-W     | 1,4,3,192  | 2,2,6,384  | 8,2,8,512   | 1,2,12,768
Base-D       | 1,4,3,96   | 8,2,3,192  | 24,2,6,384  | 1,2,12,768
Base-W       | 1,4,3,192  | 2,2,6,384  | 8,2,12,768  | 1,2,16,1024
Tiny-3stage  | –          | 2,8,3,96   | 9,2,3,192   | 1,2,6,384
Small-3stage | –          | 2,8,3,192  | 9,2,6,384   | 1,2,12,768
Table 1. Model architectures for multi-scale stacked ViTs. Architecture parameters for each E-ViT stage, E-ViT(a × n/p; h, d): number of attention blocks n, input patch size p, number of heads h, and hidden dimension d. See the meaning of these parameters in Figure 1 (Bottom).
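To make the stage notation in Table 1 concrete, here is a minimal sketch of how such a specification could be written down in code, using the Tiny configuration as an example; the dataclass and its field names are our own illustration, not the released implementation.

```python
from dataclasses import dataclass

@dataclass
class EViTStage:
    """One E-ViT stage, parameterized as in Table 1: E-ViT(a x n/p; h, d)."""
    n: int  # number of attention blocks
    p: int  # input patch size (downsampling factor of this stage)
    h: int  # number of attention heads
    d: int  # hidden dimension

# The 4-stage "Tiny" configuration from Table 1. The attention type a and the
# number of global tokens n_g are chosen separately and are not part of this spec.
TINY = [
    EViTStage(n=1, p=4, h=1, d=48),
    EViTStage(n=1, p=2, h=3, d=96),
    EViTStage(n=9, p=2, h=3, d=192),
    EViTStage(n=1, p=2, h=6, d=384),
]
```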
How to connect global tokens between consecutive stages? The choice varies across stages and tasks. For the tasks in this paper, e.g., classification, object detection, and instance segmentation, we simply discard the global tokens and only reshape the local tokens as the input for the next stage. Under this choice, the global tokens merely serve as an efficient way for distant local tokens to communicate globally, or can be viewed as a form of global memory. These global tokens are useful in vision-language tasks, in which the text tokens serve as the global tokens and are shared across stages.
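A minimal PyTorch-style sketch of this transition between stages is given below; the function name to_next_stage, the [B, n_g + H*W, d] token layout, and the stride-2 convolution standing in for the next stage's patch embedding are our illustrative assumptions, not the exact released implementation.

```python
import torch

def to_next_stage(tokens, num_global, height, width):
    """Drop the global tokens and reshape the local tokens into a 2-D feature map.

    tokens: [B, num_global + height*width, d], output of the current stage.
    Returns a [B, d, height, width] map that the next stage re-patchifies.
    """
    local = tokens[:, num_global:, :]            # discard global tokens
    batch, num_local, dim = local.shape
    assert num_local == height * width
    return local.transpose(1, 2).reshape(batch, dim, height, width)

# Hypothetical usage: stage-2 output of the Tiny model (d=96, 28x28 local tokens,
# one global token), re-patchified with patch size 2 into stage-3 tokens (d=192).
x = torch.randn(2, 1 + 28 * 28, 96)
fmap = to_next_stage(x, num_global=1, height=28, width=28)     # [2, 96, 28, 28]
patch_embed = torch.nn.Conv2d(96, 192, kernel_size=2, stride=2)
stage3_tokens = patch_embed(fmap).flatten(2).transpose(1, 2)   # [2, 196, 192]
```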
Should we use the average-pooled, layer-normalized features or the layer-normalized CLS token feature for image classification? The choice makes no difference for flat models, but the average-pooled feature performs better than the CLS feature for multi-scale models, especially for the multi-scale models with only one attention block in the last stage, as shown in Table 1. We refer readers to the Supplementary for an ablation study.
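For reference, the two classification heads compared above can be sketched as follows; this assumes the last stage outputs tokens of shape [B, n_g + N, d] with the CLS token first, and the exact point at which LayerNorm is applied is our assumption.

```python
import torch.nn as nn

class AvgPoolHead(nn.Module):
    """Classify from the average-pooled, layer-normalized local tokens."""
    def __init__(self, dim, num_classes, num_global=1):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fc = nn.Linear(dim, num_classes)
        self.num_global = num_global

    def forward(self, tokens):                  # tokens: [B, num_global + N, dim]
        local = self.norm(tokens)[:, self.num_global:, :]
        return self.fc(local.mean(dim=1))       # average over the local tokens

class ClsHead(nn.Module):
    """Classify from the layer-normalized CLS token (the first global token)."""
    def __init__(self, dim, num_classes):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, tokens):
        return self.fc(self.norm(tokens)[:, 0, :])
```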
As reported in Table 2, the multi-scale models perform better than the flat models even on low-resolution classification problems, which shows the importance of the multi-scale structure for classification tasks. However, the full self-attention mechanism suffers from quadratic computation/memory complexity w.r.t. the number of tokens, i.e., quartic w.r.t. the feature map resolution, as shown in Table 2. Thus, it is impossible to train 4-stage multi-scale ViTs with full attention using the same setting (batch size and hardware) used for DeiT training.
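As a back-of-the-envelope illustration of this blow-up (our own arithmetic, not numbers from Table 2): with a 224×224 input, a flat /16 model attends over 14×14 = 196 tokens, whereas the first stage of a 4-stage model with patch size 4 attends over 56×56 = 3136 tokens, making the per-block full-attention matrix 256× larger.

```python
# Attention-matrix size (entries per head per block) under full self-attention,
# for a 224x224 input; illustrative arithmetic only.
def num_tokens(img_size, total_stride):
    side = img_size // total_stride
    return side * side

flat_16 = num_tokens(224, 16)    # 14 * 14 = 196 tokens
stage1_p4 = num_tokens(224, 4)   # 56 * 56 = 3136 tokens

print(flat_16 ** 2)                     # 38,416 entries
print(stage1_p4 ** 2)                   # 9,834,496 entries
print(stage1_p4 ** 2 // flat_16 ** 2)   # 256x larger
```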
Model              | #Params (M) | FLOPs (G) | Memory (M) | Top-1 (%)
Ti-DeiT/16 [40]    | 5.7         | 1.3       | 33.4       | 72.2
Ti-E-ViT(full/16)  | 5.7         | 1.3       | 33.4       | 73.2 / 73.1
Ti-Full-1,10,1     | 7.12        | 1.35      | 45.8       | 75.9
Ti-Full-2,9,1      | 6.78        | 1.45      | 60.6       | 75.8
Ti-Full-1,1,9,1    | 6.71        | 2.29      | 155.0      | 76.1
Ti-Full-1,2,8,1    | 6.37        | 2.39      | 170.5      | 75.6
Ti-ViL-1,10,1      | 7.12        | 1.27      | 38.3       | 75.6±0.23
Ti-ViL-2,9,1       | 6.78        | 1.29      | 45.5       | 75.9±0.08
Ti-ViL-1,1,9,1     | 6.71        | 1.33      | 52.7       | 76.2±0.12
Ti-ViL-1,2,8,1     | 6.37        | 1.35      | 60.0       | 76.0±0.10
S-DeiT/16 [40]     | 22.1        | 4.6       | 67.1       | 79.9
S-E-ViT(full/16)   | 22.1        | 4.6       | 67.1       | 80.4 / 80.7
S-Full-1,10,1      | 27.58       | 4.84      | 78.5       | 81.7
S-Full-2,9,1       | 26.25       | 5.05      | 93.8       | 81.7
S-Full-1,1,9,1     | 25.96       | 6.74      | 472.9      | –
S-Full-1,2,8,1     | 24.63       | 6.95      | 488.3      | –
S-ViL-1,10,1       | 27.58       | 4.67      | 73.0       | 81.6
S-ViL-2,9,1        | 26.25       | 4.71      | 81.4       | 81.8
S-ViL-1,1,9,1      | 25.96       | 4.82      | 108.5      | 81.8
S-ViL-1,2,8,1      | 24.63       | 4.86      | 116.8      | 82.0
Table 2. Flat vs. multi-scale models with full self-attention: number of parameters, FLOPs, memory per image (with PyTorch Automatic Mixed Precision enabled), and ImageNet top-1 accuracy at image size 224. "Ti-Full-2,9,1" stands for a tiny-scale 3-stage multi-scale ViT with a = full attention and with 2, 9, and 1 attention blocks in its three stages, respectively. Similarly, small models start with "S-". Since all our multi-scale models use the average-pooled feature from the last stage for classification, we report the Top-1 accuracy of E-ViT(full/16) both with the CLS feature (first) and with the average-pooled feature (second). The multi-scale models consistently outperform the flat models, but the memory usage of full attention quickly blows up as soon as one high-resolution stage is introduced. Vision Longformer ("ViL-") saves FLOPs and memory without a performance drop.
3.2. Vision Longformer: An Efficient Attention
Mechanism
We propose to use the "local attention + global memory" efficient attention mechanism, as illustrated in Figure 2 (Left), to reduce the computational and memory cost when using E-ViT to generate high-resolution feature maps. The 2-D Vision Longformer is an extension of the 1-D Longformer [2] originally developed for NLP tasks. We add n_g global tokens (including the CLS token) that are allowed to attend to all other tokens, serving as global memory. Local tokens, in contrast, are only allowed to attend to the global tokens and to their 2-D neighbors within a local window, which limits the growth of the attention cost to be linear w.r.t. the number of input tokens. In summary, there are four components in this "local