A Survey on Visual Transformer
Kai Han¹, Yunhe Wang¹*, Hanting Chen¹,², Xinghao Chen¹, Jianyuan Guo¹, Zhenhua Liu¹,², Yehui Tang¹,²,
An Xiao¹, Chunjing Xu¹, Yixing Xu¹, Zhaohui Yang¹,², Yiman Zhang¹, Dacheng Tao³*

¹Noah's Ark Lab, Huawei Technologies    ²Peking University    ³University of Sydney
{kai.han,yunhe.wang,xinghao.chen,jianyuan.guo,xiaoan1,xuchunjing,yixing.xu,zhangyiman1}@huawei.com
{htchen,liu-zh,yhtang,zhaohuiyang}@pku.edu.cn, dacheng.tao@sydney.edu.au
Abstract
Transformer is a type of deep neural network based mainly on the self-attention mechanism, originally applied in the field of natural language processing. Inspired by the strong representation ability of the transformer, researchers have proposed extending it to computer vision tasks. Transformer-based models show competitive and even better performance on various visual benchmarks compared to other network types such as convolutional networks and recurrent networks. In this paper, we provide a literature review of these visual transformer models by categorizing them by task and analyzing their advantages and disadvantages. In particular, the main categories include basic image classification, high-level vision, low-level vision and video processing. Self-attention in computer vision is also briefly revisited, as self-attention is the base component of the transformer. Efficient transformer methods are included for pushing the transformer into real applications. Finally, we discuss further research directions for the visual transformer.
1. Introduction

Deep neural networks have become the fundamental infrastructure of modern artificial intelligence systems. Various network types have been proposed for addressing different tasks. The multi-layer perceptron (MLP), or fully connected (FC) network, is the classical neural network, built by stacking multiple linear layers and nonlinear activations [104, 105]. Convolutional neural networks (CNNs) introduce convolutional layers and pooling layers for processing shift-invariant data such as images [68, 65]. Recurrent neural networks (RNNs) utilize recurrent cells to process sequential or time-series data [106, 49]. Transformer is a newly proposed type of neural network that mainly utilizes the self-attention mechanism [5, 90] to extract intrinsic features [123]. Among these networks, the transformer is a recently invented architecture that shows great potential for extensive artificial intelligence applications.

* Corresponding authors. All authors are in alphabetical order of last name (except the first and the corresponding authors).

Figure 1. Milestones of transformer. The visual transformer models are in red.
The transformer was originally applied to natural language processing (NLP) tasks, where it brought significant improvements [123, 29, 10]. For example, Vaswani et al. [123] first propose the transformer, based solely on attention mechanisms, for machine translation and English constituency parsing tasks. Devlin et al. [29] introduce a new language representation model called BERT, which pre-trains a transformer on unlabeled text by jointly conditioning on both left and right context; BERT obtains state-of-the-art results on eleven NLP tasks at the time of its release. Brown et al. [10] pre-train a gigantic transformer-based model, GPT-3, with 175 billion parameters on 45TB of compressed plaintext data and achieve strong performance on different types of downstream natural language tasks without fine-tuning. These transformer-based models show strong representation capacity and have achieved breakthroughs in the NLP area.
Inspired by the power of the transformer in NLP, researchers have recently extended the transformer to computer vision (CV) tasks. CNNs used to be the fundamental component of vision applications [47, 103], but the transformer is showing its ability as an alternative to CNNs. Chen et al. [18] train a sequence transformer to auto-regressively predict pixels and achieve results competitive with CNNs on the image classification task. ViT is a vision transformer model recently proposed by Dosovitskiy et al. [31], which applies a pure transformer directly to sequences of image patches and attains state-of-the-art performance on multiple image recognition benchmarks.

Table 1. Representative works of visual transformers.

| Subject | Secondary Subject | Method | Keypoints | Publication |
|---|---|---|---|---|
| Image classification | Image classification | iGPT [18] | Pixel prediction self-supervised learning, GPT model | ICML 2020 |
| | | ViT [31] | Image patches, standard transformer | arXiv 2020 |
| High-level vision | Object detection | DETR [14] | Set-based prediction, bipartite matching, transformer | ECCV 2020 |
| | | Deformable DETR [155] | DETR, deformable attention module | arXiv 2020 |
| | | ACT [153] | Adaptive clustering transformer | arXiv 2020 |
| | | UP-DETR [28] | Unsupervised pre-training, random query patch detection | arXiv 2020 |
| | | TSP [117] | New bipartite matching, encoder-only transformer | arXiv 2020 |
| | Segmentation | Max-DeepLab [126] | PQ-style bipartite matching, dual-path transformer | arXiv 2020 |
| | | VisTR [129] | Instance sequence matching, instance sequence segmentation | arXiv 2020 |
| Low-level vision | Image enhancement | IPT [17] | Multi-task, ImageNet pre-training, transformer model | arXiv 2020 |
| | | TTSR [135] | Texture transformer, RefSR | CVPR 2020 |
| | Image generation | Image Transformer [92] | Pixel generation using transformer | ICML 2018 |
| Video processing | Video inpainting | STTN [144] | Spatial-temporal adversarial loss | ECCV 2020 |
| | Video captioning | Masked Transformer [154] | Masking network, event proposal | CVPR 2018 |
| Efficient transformer | Decomposition | ASH [85] | Number of heads, importance estimation | NeurIPS 2019 |
| | Distillation | TinyBERT [62] | Various losses for different modules | EMNLP Findings 2020 |
| | Quantization | FullyQT [97] | Fully quantized transformer | EMNLP Findings 2020 |
| | Architecture design | ConvBERT [61] | Local dependence, dynamic convolution | NeurIPS 2020 |
Apart from basic image classification, the transformer has been utilized to address further computer vision problems such as object detection [14, 155], semantic segmentation, image processing and video understanding. Owing to its excellent performance, more and more transformer-based models are being proposed to improve various visual tasks.
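To make the patch-to-token idea behind ViT concrete, the following sketch (ours, not the implementation of [31]) splits an image into fixed-size patches, flattens them and linearly projects each one into a token embedding; the function name, patch size and random projection weights are illustrative assumptions.

```python
import numpy as np

def image_to_patch_tokens(image, patch_size=16, d_model=512, rng=np.random.default_rng(0)):
    """Split an HxWxC image into non-overlapping patches and linearly project
    each flattened patch to a d_model-dimensional token (illustrative sketch)."""
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0
    # Cut the image into (H/P)*(W/P) patches of shape P x P x C.
    patches = image.reshape(H // patch_size, patch_size,
                            W // patch_size, patch_size, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch_size * patch_size * C)
    # Learnable projection in a real model; random weights here for illustration.
    W_proj = rng.standard_normal((patch_size * patch_size * C, d_model)) * 0.02
    return patches @ W_proj  # shape: (num_patches, d_model)

tokens = image_to_patch_tokens(np.zeros((224, 224, 3)))
print(tokens.shape)  # (196, 512) -- a sequence of 196 tokens fed to a standard transformer
```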
Transformer-based vision models are springing up like mushrooms, which makes it difficult to keep pace with the rate of new progress. Thus, a survey of the existing works is urgent and can be beneficial for the community. In this paper, we focus on providing a comprehensive overview of the recent advances in visual transformers and discuss the potential directions for further improvement. To provide an organization that is convenient for researchers working on different topics, we categorize the transformer models by their application scenarios, as shown in Table 1. In particular, the main subjects include basic image classification, high-level vision, low-level vision and video processing. High-level vision deals with the interpretation and use of what is seen in the image [121], such as object detection, segmentation and lane detection. A number of transformer models have addressed these high-level vision tasks, such as DETR [14] and deformable DETR [155] for object detection and Max-DeepLab [126] for segmentation. Low-level image processing is mainly concerned with extracting descriptions from images (which are usually represented as images themselves) [35]; its typical applications include super-resolution, image denoising and style transfer. Few works [17, 92] in low-level vision use transformers, and more investigation is required. Video processing is an important part of computer vision in addition to image-based tasks. Thanks to the sequential property of video, the transformer can be applied to video naturally [154, 144] and is beginning to show competitive performance on these tasks compared to conventional CNNs or RNNs. Here we give a survey of these transformer-based visual models to keep pace with the progress in this field. The development timeline of the visual transformer is shown in Figure 1, and we believe more and more excellent works will be engraved in the milestones.
The rest of the paper is organized as follows. Section 2 first formulates the self-attention mechanism and the standard transformer. We describe the methods of transformers in NLP in Section 3, as the research experience may be beneficial for vision tasks. Next, Section 4 is the main part of the paper, in which we summarize the visual transformer models for image classification, high-level vision, low-level vision and video tasks. We also briefly revisit the self-attention mechanism for CV and efficient transformer methods, as they are closely related to our main topic. Finally, we give a conclusion and discuss several research directions and challenges.
2. Formulation of Transformer

Transformer [123] was first applied to the machine translation task in natural language processing (NLP). As shown in Fig. 2, it consists of an encoder module and a decoder module with several encoders/decoders of the same architecture. Each encoder is composed of a self-attention layer and a feed-forward neural network, while each decoder is composed of a self-attention layer, an encoder-decoder attention layer and a feed-forward neural network. Before translating sentences with the transformer, each word in the sentence is embedded into a vector with $d_{model} = 512$ dimensions.
Figure 2. Pipeline of vanilla transformer.
2.1. Self-Attention Layer

In the self-attention layer, the input vector is first transformed into three different vectors: the query vector $q$, the key vector $k$ and the value vector $v$, with dimension $d_q = d_k = d_v = d_{model} = 512$. Vectors derived from different inputs are then packed together into three different matrices $Q$, $K$ and $V$. After that, the attention function between different input vectors is calculated with the following steps (as shown in Fig. 3, left):

• Step 1: Compute scores between different input vectors with $S = Q \cdot K^{\top}$;
• Step 2: Normalize the scores for gradient stability with $S_n = S / \sqrt{d_k}$;
• Step 3: Translate the scores into probabilities with the softmax function, $P = \mathrm{softmax}(S_n)$;
• Step 4: Obtain the weighted value matrix with $Z = P \cdot V$.
The process can be unified into a single function:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q \cdot K^{\top}}{\sqrt{d_k}}\right) \cdot V. \qquad (1)$$

The intuition behind Eq. 1 is simple. Step 1 computes scores between pairs of input vectors; each score determines the degree of attention that we put on other words when encoding the word at the current position. Step 2 normalizes the scores for more stable gradients during training, and step 3 converts the scores into probabilities. Finally, each value vector is weighted by its corresponding probability, so that vectors with larger probabilities receive more focus from the following layers.
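The four steps above map directly to a few lines of code. Below is a minimal NumPy sketch of Eq. 1 for a single sequence; the numerically stable softmax helper and variable names are ours, and the shapes follow the notation of this section.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention, Eq. 1: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    S = Q @ K.T                 # Step 1: raw scores between all pairs of positions
    S_n = S / np.sqrt(d_k)      # Step 2: scale for gradient stability
    P = softmax(S_n, axis=-1)   # Step 3: turn scores into probabilities
    return P @ V                # Step 4: probability-weighted sum of value vectors

# Toy example: 4 tokens, d_q = d_k = d_v = d_model = 512.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 512)) for _ in range(3))
print(attention(Q, K, V).shape)  # (4, 512)
```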
The encoder-decoder attention layer in the decoder module is almost the same as the self-attention layer in the encoder module, except that the key matrix $K$ and value matrix $V$ are derived from the encoder module, while the query matrix $Q$ is derived from the previous layer.

Note that the above process is independent of the position of each word, so the self-attention layer lacks the ability to capture the positional information of the words in a sentence. To address this, a positional encoding with dimension $d_{model}$ is added to the original input embedding to obtain the final input vector of the word. Specifically, the position is encoded with the following equations:

$$PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad (2)$$

$$PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad (3)$$

in which $pos$ denotes the position of the word in the sentence, and $i$ represents the current dimension of the positional encoding.
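Eqs. 2 and 3 can be implemented directly. The sketch below (illustrative, with our own variable names) builds the sinusoidal positional encoding matrix that is added element-wise to the word embeddings before the first encoder layer.

```python
import numpy as np

def positional_encoding(seq_len, d_model=512):
    """Sinusoidal positional encoding, Eqs. 2-3: even dimensions use sin, odd use cos."""
    PE = np.zeros((seq_len, d_model))
    pos = np.arange(seq_len)[:, None]                  # word positions 0..seq_len-1
    i = np.arange(d_model // 2)[None, :]               # index of each sin/cos pair
    angles = pos / np.power(10000.0, 2 * i / d_model)  # pos / 10000^(2i/d_model)
    PE[:, 0::2] = np.sin(angles)
    PE[:, 1::2] = np.cos(angles)
    return PE

pe = positional_encoding(seq_len=10)
print(pe.shape)  # (10, 512), one d_model-dimensional encoding per position
```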
Figure 3. (Left) The process of self-attention. (Right) Multi-head
attention. (The image is from [123])
2.2. Multi-Head Attention

Multi-head attention is a mechanism added to the vanilla self-attention layer in order to boost its performance. Note that for a given reference word, we often want to focus on several other words when going through the sentence. A single-head self-attention layer limits the ability to focus on one or more specific positions without simultaneously affecting the attention on other, equally important positions. Multi-head attention addresses this by giving the attention layers different representation subspaces. Specifically, different query, key and value matrices are used for the different heads, and due to random initialization these matrices can project the input vectors into different representation subspaces after training.

In detail, given an input vector and the number of heads $h$, the input vector is first transformed into three different groups of vectors: the query group, the key group and the value group. There are $h$ vectors in each group, with dimension $d_{q'} = d_{k'} = d_{v'} = d_{model}/h = 64$. Vectors derived from different inputs are then packed together into three different groups of matrices $\{Q_i\}_{i=1}^{h}$, $\{K_i\}_{i=1}^{h}$ and $\{V_i\}_{i=1}^{h}$. The process of multi-head attention is then as follows:
$$\mathrm{MultiHead}(Q', K', V') = \mathrm{Concat}(\mathrm{head}_1, \cdots, \mathrm{head}_h)\, W^{o},$$
$$\text{where } \mathrm{head}_i = \mathrm{Attention}(Q_i, K_i, V_i), \qquad (4)$$

where $Q'$ is the concatenation of $\{Q_i\}_{i=1}^{h}$ (and likewise for $K'$ and $V'$), and $W^{o} \in \mathbb{R}^{d_{model} \times d_{model}}$ is the linear projection matrix.
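A minimal sketch of Eq. 4, assuming the attention function from the Section 2.1 sketch is available: the input is projected into h separate query/key/value groups, each head attends independently, and the concatenated heads are projected with W^o. The random projection matrices stand in for learned parameters.

```python
import numpy as np

def multi_head_attention(X, h=8, d_model=512, rng=np.random.default_rng(0)):
    """Multi-head attention (Eq. 4) on a single sequence X of shape (n, d_model)."""
    d_head = d_model // h  # d_q' = d_k' = d_v' = d_model / h = 64
    heads = []
    for _ in range(h):
        # Per-head projections; learned in a real transformer, random here for illustration.
        W_q, W_k, W_v = (rng.standard_normal((d_model, d_head)) * 0.02 for _ in range(3))
        heads.append(attention(X @ W_q, X @ W_k, X @ W_v))  # attention() from the Sec. 2.1 sketch
    W_o = rng.standard_normal((d_model, d_model)) * 0.02     # output projection W^o
    return np.concatenate(heads, axis=-1) @ W_o              # shape: (n, d_model)
```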
Figure 4. Detailed structure of transformer. (The image is
from [123])
2.3. Other Parts in Transformer

Residual connections in the encoder and decoder. As shown in Fig. 4, a residual connection is added to each sub-layer in the encoder and decoder in order to strengthen the flow of information and obtain better performance, followed by layer normalization [4]. The output of the operations mentioned above can be described as:

$$\mathrm{LayerNorm}(X + \mathrm{Attention}(X)). \qquad (5)$$

Note that $X$ is used as the input of the self-attention layer here, since the query, key and value matrices $Q$, $K$ and $V$ are all derived from the same input matrix $X$.
Feed-forward neural network. A feed-forward NN is applied after the self-attention layers in each encoder and decoder. Specifically, the feed-forward NN consists of two linear transformation layers with a ReLU activation function between them, which can be denoted as the following function:

$$\mathrm{FFNN}(X) = W_2\, \sigma(W_1 X), \qquad (6)$$

where $W_1$ and $W_2$ are the parameter matrices of the two linear transformation layers, and $\sigma$ represents the ReLU activation function. The dimensionality of the hidden layer is $d_h = 2048$.
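Putting the sub-layers together, one encoder layer applies Eq. 5 around self-attention and the same residual-plus-normalization pattern around the feed-forward network of Eq. 6. This is an illustrative sketch that reuses the multi_head_attention helper defined above; the simplified layer_norm (without learned scale and shift) and the random weights are our assumptions.

```python
import numpy as np

def layer_norm(X, eps=1e-6):
    # Normalize each token vector to zero mean and unit variance (scale/shift omitted).
    mu = X.mean(axis=-1, keepdims=True)
    sigma = X.std(axis=-1, keepdims=True)
    return (X - mu) / (sigma + eps)

def encoder_layer(X, d_model=512, d_h=2048, rng=np.random.default_rng(0)):
    """One encoder layer: Eq. 5 around self-attention, then the FFN of Eq. 6."""
    # Sub-layer 1: multi-head self-attention with residual connection and LayerNorm (Eq. 5).
    Y = layer_norm(X + multi_head_attention(X, d_model=d_model, rng=rng))
    # Sub-layer 2: position-wise feed-forward network (Eq. 6), hidden size d_h = 2048.
    W1 = rng.standard_normal((d_model, d_h)) * 0.02
    W2 = rng.standard_normal((d_h, d_model)) * 0.02
    ffn = np.maximum(0, Y @ W1) @ W2  # ReLU between the two linear layers
    return layer_norm(Y + ffn)
```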
Final layer in the decoder. The final layer in the decoder aims to turn the stack of vectors back into a word. This is achieved by a linear layer followed by a softmax layer. The linear layer projects the vector into a logits vector with $d_{word}$ dimensions, where $d_{word}$ is the number of words in the vocabulary. A softmax layer is then used to transform the logits vector into probabilities.
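As an illustrative sketch of this last step (the toy vocabulary, random weights and greedy word choice are our own assumptions), the decoder output is projected to vocabulary-sized logits and passed through a softmax:

```python
import numpy as np

def decode_word(x, W_vocab, vocab):
    """Map a d_model-dimensional decoder output to a word via linear layer + softmax."""
    logits = x @ W_vocab                 # shape: (d_word,) -- one logit per vocabulary word
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                 # softmax over the vocabulary
    return vocab[int(np.argmax(probs))]  # pick the most probable word (greedy choice)

vocab = ["<pad>", "hello", "world"]
rng = np.random.default_rng(0)
print(decode_word(rng.standard_normal(512), rng.standard_normal((512, len(vocab))), vocab))
```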
Most of the transformers used in computer vision tasks utilize the encoder module of the original transformer. In short, it can be treated as a new feature selector, different from convolutional neural networks (CNNs) and recurrent neural networks (RNNs). Compared to CNNs, which focus only on local characteristics, the transformer is able to capture long-distance characteristics, which means that global information can easily be derived by the transformer. Compared to RNNs, whose hidden states must be computed sequentially, the transformer is much more efficient, since the output of the self-attention layer and the fully connected layers can be computed in parallel and accelerated easily. Thus, it is meaningful to further study the application of the transformer not only in NLP but also in computer vision.
3. Revisiting Transformers for NLP

Before the advent of the Transformer, recurrent neural networks (e.g., GRU [26] and LSTM [50]) with added attention empowered most state-of-the-art language models. However, in RNNs the information flow needs to be processed sequentially from the previous hidden state to the next one, which precludes acceleration and parallelization during training and thus hinders the potential of RNNs to process longer sequences or build larger models. In 2017, Vaswani et al. [123] propose the Transformer, a novel encoder-decoder architecture built solely on multi-head self-attention mechanisms and feed-forward neural networks, aiming to solve sequence-to-sequence natural language tasks (e.g., machine translation) while acquiring global dependencies with ease. The success of the Transformer demonstrates that leveraging attention mechanisms alone can achieve performance comparable to attentive RNNs. Moreover, the architecture of the Transformer favors massively parallel computing, which enables training on larger datasets and has thus led to the surge of large pre-trained models (PTMs) for natural language processing.
BERT [29] and its variants (e.g., SpanBERT [63], RoBERTa [82]) are a series of PTMs built on the multi-layer Transformer encoder architecture. Two tasks are conducted on the BookCorpus [156] and English Wikipedia datasets at the pre-training stage of BERT: 1) masked language modeling (MLM), which first randomly masks out some tokens in the input and then trains the model to predict them; 2) next sentence prediction, which uses paired sentences as input and predicts whether the second sentence is the original one in the document. After pre-training, BERT can be fine-tuned on a wide range of downstream tasks by adding only one output layer. More specifically, when performing sequence-level tasks (e.g., sentiment analysis), BERT uses the representation of the first token for classification, while for token-level tasks (e.g., named entity recognition), all tokens are fed into the softmax layer for classification. At the time of release, BERT achieves state-of-the-art results on 11 natural language processing tasks, setting up a milestone in pre-trained language models. The Generative Pre-trained Transformer series (e.g., GPT [99], GPT-2 [100]) are another type of pre-trained models, based on the Transformer decoder architecture, which uses masked self-attention mechanisms. The major difference between the GPT series and BERT lies in the way of pre-training. Unlike BERT, the GPT series are unidirectional language models pre-trained by left-to-right (LTR) language modeling. Besides, the sentence separator ([SEP]) and classifier token ([CLS]) are only involved in the fine-tuning stage of GPT, whereas BERT learns these embeddings during pre-training. Because of its unidirectional pre-training strategy, GPT shows superiority in many natural language generation tasks. More recently, a gigantic transformer-based model, GPT-3, with an incredible 175 billion parameters, has been introduced [10]. By pre-training on 45TB of compressed plaintext data, GPT-3 claims the ability to directly process different types of downstream natural language tasks without fine-tuning, achieving strong performance on many NLP datasets, covering both natural language understanding and generation. Besides the aforementioned transformer-based PTMs, many other models have been proposed since the introduction of the Transformer. As this is not the major topic of our survey, we simply list a few representative models in Table 2 for interested readers.
Apart from the PTMs trained on large corpora for general natural language processing tasks, transformer-based models have also been applied in many other NLP-related domains and to multi-modal tasks.

BioNLP Domain. Transformer-based models have outperformed many traditional biomedical methods. BioBERT [69] uses the Transformer architecture for biomedical text mining tasks. SciBERT [7] is developed by training the Transformer on 114M scientific articles covering the biomedical and computer science fields, aiming to execute NLP tasks related to the scientific domain more precisely. Huang et al. [55] propose ClinicalBERT, which utilizes the Transformer to develop and evaluate continuous representations of clinical notes; as a side effect, the attention map of ClinicalBERT can be used to explain predictions and thus discover high-quality connections between different medical contents.

Multi-Modal Tasks. Owing to the success of the Transformer across text-based NLP tasks, many research efforts are committed to exploiting the potential of the Transformer to process multi-modal tasks (e.g., video-text, image-text and audio-text).
Table 2. List of representative language models built on Transformer.

| Models | Architecture | Params | Fine-tuning |
|---|---|---|---|
| GPT [99] | Transformer Dec. | 117M | Yes |
| GPT-2 [100] | Transformer Dec. | 117M∼1542M | No |
| GPT-3 [10] | Transformer Dec. | 125M∼175B | No |
| BERT [29] | Transformer Enc. | 110M∼340M | Yes |
| RoBERTa [82] | Transformer Enc. | 355M | Yes |
| XLNet [136] | Two-Stream Transformer Enc. | ≈ BERT | Yes |
| ELECTRA [27] | Transformer Enc. | 335M | Yes |
| UniLM [30] | Transformer Enc. | 340M | Yes |
| BART [70] | Transformer | 110% of BERT | Yes |
| T5 [101] | Transformer | 220M∼11B | Yes |
| ERNIE (THU) [149] | Transformer Enc. | 114M | Yes |
| KnowBERT [94] | Transformer Enc. | 253M∼523M | Yes |

1. "Transformer" denotes the standard encoder-decoder architecture; "Transformer Enc." and "Transformer Dec." denote the encoder and decoder parts of the standard Transformer, respectively. The decoder uses masked self-attention to prevent attending to future tokens.
2. The data in the table are from [98].
VideoBERT [115] uses a CNN-based module to pre-process videos and obtain representation tokens, based on which a Transformer encoder is trained to learn video-text representations for downstream tasks such as video captioning. VisualBERT [72] and VL-BERT [114] propose single-stream unified Transformers to capture visual elements and image-text relationships for downstream tasks such as visual question answering (VQA) and visual commonsense reasoning (VCR). Moreover, several studies such as SpeechBERT [24] explore the possibility of encoding audio and text pairs with a Transformer encoder to process audio-text tasks like speech question answering (SQA).
The rapid development of transformer-based models for a variety of natural language processing and NLP-related tasks demonstrates their structural superiority and versatility. This empowers the Transformer to become a universal module in many AI fields beyond natural language processing. The following part of this survey focuses on the applications of the Transformer to a wide range of computer vision tasks that have emerged in the past two years.
4. Visual Transformer

In this section, we provide a comprehensive review of transformer-based models in computer vision, including applications in image classification, high-level vision, low-level vision and video processing. We also briefly summarize the applications of the self-attention mechanism and of model compression methods for efficient transformers.