Model      Layers  Hidden size D  MLP size  Heads  Params
ViT-Base     12        768          3072      12     86M
ViT-Large    24       1024          4096      16    307M
ViT-Huge     32       1280          5120      16    632M
Table 1: Details of Vision Transformer model variants.
We also evaluate on the 19-task VTAB classification suite (Zhai et al., 2019b). VTAB evaluates
low-data transfer to diverse tasks, using 1,000 training examples per task. The tasks are divided into
three groups: Natural – tasks like the above, Pets, CIFAR, etc.; Specialized – medical and satellite
imagery; and Structured – tasks that require geometric understanding, like localization.
Model Variants. We base ViT configurations on those used for BERT (Devlin et al., 2019), as
summarized in Table 1. The “Base” and “Large” models are directly adopted from BERT and we
add the larger “Huge” model. In what follows we use brief notation to indicate the model size and
the input patch size: for instance, ViT-L/16 means the “Large” variant with 16×16 input patch size.
Note that the Transformer’s sequence length is inversely proportional to the square of the patch size,
thus models with smaller patch size are computationally more expensive.
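To make the patch-size/sequence-length trade-off concrete, the following Python sketch (ours, not the authors' code) restates the Table 1 configurations and computes the token count for a given image and patch size:

```python
# Minimal sketch; the configuration values restate Table 1, everything else is illustrative.
VIT_CONFIGS = {
    "ViT-B": dict(layers=12, hidden=768,  mlp=3072, heads=12),  # ~86M params
    "ViT-L": dict(layers=24, hidden=1024, mlp=4096, heads=16),  # ~307M params
    "ViT-H": dict(layers=32, hidden=1280, mlp=5120, heads=16),  # ~632M params
}

def seq_length(image_size: int, patch_size: int, with_cls_token: bool = True) -> int:
    """Tokens fed to the Transformer: (image_size / patch_size)^2 patches, plus the [class] token."""
    n = (image_size // patch_size) ** 2
    return n + 1 if with_cls_token else n

# e.g. ViT-L/16 at 224x224: 14*14 = 196 patches -> 197 tokens; halving the patch
# size quadruples the patch count (and self-attention cost grows quadratically in it).
print(seq_length(224, 16), seq_length(224, 8))  # 197 785
```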
For the baseline CNNs, we use ResNet (He et al., 2016), but replace the Batch Normalization
layers (Ioffe & Szegedy, 2015) with Group Normalization (Wu & He, 2018) and use standardized
convolutions (Qiao et al., 2019). These modifications improve transfer (Kolesnikov et al., 2020),
and we denote the modified model “ResNet (BiT)”. For the hybrids, we feed the intermediate feature
maps into ViT with a patch size of one “pixel”. To experiment with different sequence lengths,
we either (i) take the output of stage 4 of a regular ResNet50, or (ii) remove stage 4, place the same
number of layers in stage 3 (keeping the total number of layers), and take the output of this extended
stage 3. Option (ii) results in a 4× longer sequence length and a more expensive ViT model.
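A minimal PyTorch-style sketch of the BiT modifications – Group Normalization in place of Batch Normalization, plus weight-standardized convolutions – is shown below; the layer sizes are placeholders and this is not the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StdConv2d(nn.Conv2d):
    """Conv2d with Weight Standardization (Qiao et al., 2019): the kernel is standardized
    to zero mean / unit variance per output channel at every forward pass."""
    def forward(self, x):
        w = self.weight
        mean = w.mean(dim=(1, 2, 3), keepdim=True)
        var = w.var(dim=(1, 2, 3), keepdim=True, unbiased=False)
        w = (w - mean) / torch.sqrt(var + 1e-5)
        return F.conv2d(x, w, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

# Fragment of a BiT-style block: GroupNorm replaces BatchNorm (channel counts are placeholders).
conv = StdConv2d(64, 128, kernel_size=3, padding=1, bias=False)
norm = nn.GroupNorm(num_groups=32, num_channels=128)
y = norm(conv(torch.randn(1, 64, 56, 56))).relu()
```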
Training & Fine-tuning. We train all models, including ResNets, using Adam (Kingma & Ba,
2015) with
1
=0.9,
2
=0.999, a batch size of 4096 and apply a high weight decay of 0.1, which
we found to be useful for transfer of all models (Appendix D.1 shows that, in contrast to common
practices, Adam works slightly better than SGD for ResNets in our setting). We use a linear learning
rate warmup and decay, see Appendix B.1 for details. For fine-tuning we use SGD with momentum,
batch size 512, for all models, see Appendix B.1.1. For ImageNet results in Table 2, we fine-tuned at
higher resolution: 512 for ViT-L/16 and 518 for ViT-H/14, and also used Polyak & Juditsky (1992)
averaging with a factor of 0.9999 (Ramachandran et al., 2019; Wang et al., 2020b).
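A rough PyTorch-style sketch of this recipe follows; it is not the authors' implementation, and the learning rate, schedule lengths, and model stand-in are placeholders (Appendix B.1 lists the actual values):

```python
import torch

model = torch.nn.Linear(768, 1000)   # placeholder stand-in for a ViT
# Adam with beta1=0.9, beta2=0.999 and weight decay 0.1 (torch.optim.Adam applies
# the decay as an L2 penalty; how the decay is applied is an implementation detail).
opt = torch.optim.Adam(model.parameters(), lr=1e-3,
                       betas=(0.9, 0.999), weight_decay=0.1)

warmup_steps, total_steps = 10_000, 100_000   # placeholder schedule lengths
def lr_scale(step):                            # linear warmup, then linear decay
    if step < warmup_steps:
        return step / warmup_steps
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))
sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_scale)

# Polyak/EMA copy of the weights with factor 0.9999, updated after each step.
ema = {k: v.detach().clone() for k, v in model.state_dict().items()}
def ema_update(decay=0.9999):
    for k, v in model.state_dict().items():
        ema[k].mul_(decay).add_(v.detach(), alpha=1 - decay)
```

In a training loop, opt.step() and sched.step() would run once per batch; the EMA update corresponds to the Polyak averaging used for the high-resolution fine-tuning runs.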
Metrics. We report results on downstream datasets either through few-shot or fine-tuning accuracy.
Fine-tuning accuracies capture the performance of each model after fine-tuning it on the respective
dataset. Few-shot accuracies are obtained by solving a regularized least-squares regression problem
that maps the (frozen) representation of a subset of training images to {−1, 1}^K target vectors. This
formulation allows us to recover the exact solution in closed form. Though we mainly focus on
fine-tuning performance, we sometimes use linear few-shot accuracies for fast on-the-fly evaluation
where fine-tuning would be too costly.
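For concreteness, a minimal NumPy sketch of this closed-form evaluation follows; the function names and the regularization strength lam are our assumptions rather than details from the paper:

```python
import numpy as np

def fewshot_fit(X, labels, num_classes, lam=1.0):
    """Ridge regression from frozen features X (N x D) to {-1, 1}^K targets, solved exactly."""
    N, D = X.shape
    Y = -np.ones((N, num_classes))
    Y[np.arange(N), labels] = 1.0                               # {-1, 1}^K target vectors
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ Y)  # closed-form solution

def fewshot_accuracy(W, X_test, labels_test):
    return ((X_test @ W).argmax(axis=1) == labels_test).mean()
```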
4.2 COMPARISON TO STATE OF THE ART
We first compare our largest models – ViT-H/14 and ViT-L/16 – to state-of-the-art CNNs from
the literature. The first comparison point is Big Transfer (BiT) (Kolesnikov et al., 2020), which
performs supervised transfer learning with large ResNets. The second is Noisy Student (Xie et al.,
2020), which is a large EfficientNet trained using semi-supervised learning on ImageNet and JFT-
300M with the labels removed. Currently, Noisy Student is the state of the art on ImageNet and
BiT-L on the other datasets reported here. All models were trained on TPUv3 hardware, and we
report the number of TPUv3-core-days taken to pre-train each of them, that is, the number of TPU
v3 cores (2 per chip) used for training multiplied by the training time in days.
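As a purely illustrative arithmetic example (the numbers below are hypothetical, not those of any model in Table 2):

```python
chips, training_days = 128, 10            # hypothetical values
tpu_v3_cores = 2 * chips                  # 2 cores per TPUv3 chip
core_days = tpu_v3_cores * training_days  # 256 cores * 10 days = 2560 TPUv3-core-days
```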
Table 2 shows the results. The smaller ViT-L/16 model pre-trained on JFT-300M outperforms BiT-L
(which is pre-trained on the same dataset) on all tasks, while requiring substantially less computa-
tional resources to train. The larger model, ViT-H/14, further improves the performance, especially
on the more challenging datasets – ImageNet, CIFAR-100, and the VTAB suite. Interestingly, this