A Survey on Visual Transformer
Kai Han¹, Yunhe Wang¹*, Hanting Chen¹,², Xinghao Chen¹, Jianyuan Guo¹, Zhenhua Liu¹,², Yehui Tang¹,²,
An Xiao¹, Chunjing Xu¹, Yixing Xu¹, Zhaohui Yang¹,², Yiman Zhang¹, Dacheng Tao³*

¹Noah's Ark Lab, Huawei Technologies    ²Peking University    ³University of Sydney
{kai.han,yunhe.wang,xinghao.chen,jianyuan.guo,xiaoan1,xuchunjing,yixing.xu,zhangyiman1}@huawei.com
{htchen,liu-zh,yhtang,zhaohuiyang}@pku.edu.cn, dacheng.tao@sydney.edu.au
Abstract
Transformer is a type of deep neural network based mainly on the self-attention mechanism, originally applied in the field of natural language processing. Inspired by the strong representation ability of the transformer, researchers have proposed extending it to computer vision tasks. Transformer-based models show competitive and even better performance on various visual benchmarks compared to other network types such as convolutional networks and recurrent networks. In this paper, we provide a literature review of these visual transformer models by categorizing them by task and analyzing their advantages and disadvantages. In particular, the main categories include basic image classification, high-level vision, low-level vision and video processing. Self-attention in computer vision is also briefly revisited, as self-attention is the base component of the transformer. Efficient transformer methods are included for pushing the transformer into real applications. Finally, we discuss further research directions for the visual transformer.
1. Introduction

Deep neural networks have become the fundamental infrastructure of modern artificial intelligence systems. Various network types have been proposed for addressing different tasks. The multi-layer perceptron (MLP), or fully connected (FC) network, is the classical neural network, built by stacking multiple linear layers and nonlinear activations [104, 105]. Convolutional neural networks (CNNs) introduce convolutional layers and pooling layers for processing shift-invariant data such as images [68, 65]. Recurrent neural networks (RNNs) utilize recurrent cells to process sequential or time-series data [106, 49]. Transformer is a newly proposed type of neural network that mainly utilizes the self-attention mechanism [5, 90] to extract intrinsic features [123]. Among these networks, the transformer is a recently invented architecture that shows great potential for extensive artificial intelligence applications.

* Corresponding authors. All authors are in alphabetical order of last name (except the first and the corresponding authors).

Figure 1. Milestones of transformer. The visual transformer models are in red.
The transformer was originally applied to natural language processing (NLP) tasks, where it brought significant improvements [123, 29, 10]. For example, Vaswani et al. [123] first propose the transformer, based solely on attention mechanisms, for machine translation and English constituency parsing tasks. Devlin et al. [29] introduce a new language representation model called BERT, which pre-trains a transformer on unlabeled text by jointly conditioning on both left and right context; BERT obtains state-of-the-art results on eleven NLP tasks at the time of its release. Brown et al. [10] pre-train a gigantic transformer-based model, GPT-3, with 175 billion parameters on 45TB of compressed plaintext data and achieve strong performance on different types of downstream natural language tasks without fine-tuning. These transformer-based models show strong representation capacity and have achieved breakthroughs in the NLP area.
Inspired by the power of the transformer in NLP, researchers have recently extended the transformer to computer vision (CV) tasks. CNNs used to be the fundamental component of vision applications [47, 103], but the transformer is showing its ability as an alternative to CNNs. Chen et al. [18] train a sequence transformer to auto-regressively predict pixels and achieve results competitive with CNNs on the image classification task. ViT is a vision transformer model recently proposed by Dosovitskiy et al. [31], which applies a pure transformer directly to sequences of image patches and attains state-of-the-art performance on multiple image recognition benchmarks.

Table 1. Representative works of visual transformers.

| Subject | Secondary Subject | Method | Keypoints | Publication |
|---|---|---|---|---|
| Image classification | Image classification | iGPT [18] | Pixel prediction self-supervised learning, GPT model | ICML 2020 |
| | | ViT [31] | Image patches, standard transformer | arXiv 2020 |
| High-level vision | Object detection | DETR [14] | Set-based prediction, bipartite matching, transformer | ECCV 2020 |
| | | Deformable DETR [155] | DETR, deformable attention module | arXiv 2020 |
| | | ACT [153] | Adaptive clustering transformer | arXiv 2020 |
| | | UP-DETR [28] | Unsupervised pre-training, random query patch detection | arXiv 2020 |
| | | TSP [117] | New bipartite matching, encoder-only transformer | arXiv 2020 |
| | Segmentation | Max-DeepLab [126] | PQ-style bipartite matching, dual-path transformer | arXiv 2020 |
| | | VisTR [129] | Instance sequence matching, instance sequence segmentation | arXiv 2020 |
| Low-level vision | Image enhancement | IPT [17] | Multi-task, ImageNet pre-training, transformer model | arXiv 2020 |
| | | TTSR [135] | Texture transformer, RefSR | CVPR 2020 |
| | Image generation | Image Transformer [92] | Pixel generation using transformer | ICML 2018 |
| Video processing | Video inpainting | STTN [144] | Spatial-temporal adversarial loss | ECCV 2020 |
| | Video captioning | Masked Transformer [154] | Masking network, event proposal | CVPR 2018 |
| Efficient transformer | Decomposition | ASH [85] | Number of heads, importance estimation | NeurIPS 2019 |
| | Distillation | TinyBERT [62] | Various losses for different modules | EMNLP Findings 2020 |
| | Quantization | FullyQT [97] | Fully quantized transformer | EMNLP Findings 2020 |
| | Architecture design | ConvBERT [61] | Local dependence, dynamic convolution | NeurIPS 2020 |
Apart from basic image classification, the transformer has been utilized to address further computer vision problems such as object detection [14, 155], semantic segmentation, image processing and video understanding. Owing to its excellent performance, more and more transformer-based models are being proposed to improve various visual tasks.
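To make the patch-to-token idea behind ViT concrete, the following sketch (ours, not the implementation of [31]) splits an image into fixed-size patches, flattens them and linearly projects each one into a token embedding; the function name, patch size and random projection weights are illustrative assumptions.

```python
import numpy as np

def image_to_patch_tokens(image, patch_size=16, d_model=512, rng=np.random.default_rng(0)):
    """Split an HxWxC image into non-overlapping patches and linearly project
    each flattened patch to a d_model-dimensional token (illustrative sketch)."""
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0
    # Cut the image into (H/P)*(W/P) patches of shape P x P x C.
    patches = image.reshape(H // patch_size, patch_size,
                            W // patch_size, patch_size, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch_size * patch_size * C)
    # Learnable projection in a real model; random weights here for illustration.
    W_proj = rng.standard_normal((patch_size * patch_size * C, d_model)) * 0.02
    return patches @ W_proj  # shape: (num_patches, d_model)

tokens = image_to_patch_tokens(np.zeros((224, 224, 3)))
print(tokens.shape)  # (196, 512) -- a sequence of 196 tokens fed to a standard transformer
```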
Transformer-based vision models are springing up like mushrooms, which makes it difficult to keep pace with the rate of new progress. Thus, a survey of the existing works is urgent and can be beneficial for the community. In this paper, we focus on providing a comprehensive overview of the recent advances in visual transformers and discuss the potential directions for further improvement. To provide an organization that is convenient for researchers working on different topics, we categorize the transformer models by their application scenarios, as shown in Table 1. In particular, the main subjects include basic image classification, high-level vision, low-level vision and video processing. High-level vision deals with the interpretation and use of what is seen in the image [121], such as object detection, segmentation and lane detection. A number of transformer models have addressed these high-level vision tasks, such as DETR [14] and deformable DETR [155] for object detection and Max-DeepLab [126] for segmentation. Low-level image processing is mainly concerned with extracting descriptions from images (which are usually represented as images themselves) [35]; its typical applications include super-resolution, image denoising and style transfer. Few works [17, 92] in low-level vision use transformers, and more investigation is required. Video processing is an important part of computer vision in addition to image-based tasks. Thanks to the sequential property of video, the transformer can be applied to video naturally [154, 144] and is beginning to show competitive performance on these tasks compared to conventional CNNs or RNNs. Here we give a survey of these transformer-based visual models to keep pace with the progress in this field. The development timeline of the visual transformer is shown in Figure 1, and we believe more and more excellent works will be engraved in the milestones.
The rest of the paper is organized as follows. Section 2 first formulates the self-attention mechanism and the standard transformer. We describe the methods of transformers in NLP in Section 3, as the research experience may be beneficial for vision tasks. Next, Section 4 is the main part of the paper, in which we summarize the visual transformer models for image classification, high-level vision, low-level vision and video tasks. We also briefly revisit the self-attention mechanism for CV and efficient transformer methods, as they are closely related to our main topic. Finally, we give a conclusion and discuss several research directions and challenges.
2. Formulation of Transformer

Transformer [123] was first applied to the machine translation task in natural language processing (NLP). As shown in Fig. 2, it consists of an encoder module and a decoder module with several encoders/decoders of the same architecture. Each encoder is composed of a self-attention layer and a feed-forward neural network, while each decoder is composed of a self-attention layer, an encoder-decoder attention layer and a feed-forward neural network. Before translating sentences with the transformer, each word in the sentence is embedded into a vector with $d_{model} = 512$ dimensions.
Figure 2. Pipeline of vanilla transformer.
2.1. Self-Attention Layer

In the self-attention layer, the input vector is first transformed into three different vectors: the query vector $q$, the key vector $k$ and the value vector $v$, with dimension $d_q = d_k = d_v = d_{model} = 512$. Vectors derived from different inputs are then packed together into three different matrices $Q$, $K$ and $V$. After that, the attention function between different input vectors is calculated with the following steps (as shown in Fig. 3, left):

• Step 1: Compute scores between different input vectors with $S = Q \cdot K^{\top}$;
• Step 2: Normalize the scores for gradient stability with $S_n = S / \sqrt{d_k}$;
• Step 3: Translate the scores into probabilities with the softmax function, $P = \mathrm{softmax}(S_n)$;
• Step 4: Obtain the weighted value matrix with $Z = P \cdot V$.
The process can be unified into a single function:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q \cdot K^{\top}}{\sqrt{d_k}}\right) \cdot V. \qquad (1)$$

The intuition behind Eq. 1 is simple. Step 1 computes scores between pairs of input vectors; each score determines the degree of attention that we put on other words when encoding the word at the current position. Step 2 normalizes the scores for more stable gradients during training, and step 3 converts the scores into probabilities. Finally, each value vector is weighted by its corresponding probability, so that vectors with larger probabilities receive more focus from the following layers.
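The four steps above map directly to a few lines of code. Below is a minimal NumPy sketch of Eq. 1 for a single sequence; the numerically stable softmax helper and variable names are ours, and the shapes follow the notation of this section.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention, Eq. 1: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    S = Q @ K.T                 # Step 1: raw scores between all pairs of positions
    S_n = S / np.sqrt(d_k)      # Step 2: scale for gradient stability
    P = softmax(S_n, axis=-1)   # Step 3: turn scores into probabilities
    return P @ V                # Step 4: probability-weighted sum of value vectors

# Toy example: 4 tokens, d_q = d_k = d_v = d_model = 512.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 512)) for _ in range(3))
print(attention(Q, K, V).shape)  # (4, 512)
```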
The encoder-decoder attention layer in the decoder module is almost the same as the self-attention layer in the encoder module, except that the key matrix $K$ and value matrix $V$ are derived from the encoder module, while the query matrix $Q$ is derived from the previous layer.

Note that the above process is independent of the position of each word, so the self-attention layer lacks the ability to capture the positional information of the words in a sentence. To address this, a positional encoding with dimension $d_{model}$ is added to the original input embedding to obtain the final input vector of the word. Specifically, the position is encoded with the following equations:

$$PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad (2)$$

$$PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad (3)$$

in which $pos$ denotes the position of the word in the sentence, and $i$ represents the current dimension of the positional encoding.
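Eqs. 2 and 3 can be implemented directly. The sketch below (illustrative, with our own variable names) builds the sinusoidal positional encoding matrix that is added element-wise to the word embeddings before the first encoder layer.

```python
import numpy as np

def positional_encoding(seq_len, d_model=512):
    """Sinusoidal positional encoding, Eqs. 2-3: even dimensions use sin, odd use cos."""
    PE = np.zeros((seq_len, d_model))
    pos = np.arange(seq_len)[:, None]                  # word positions 0..seq_len-1
    i = np.arange(d_model // 2)[None, :]               # index of each sin/cos pair
    angles = pos / np.power(10000.0, 2 * i / d_model)  # pos / 10000^(2i/d_model)
    PE[:, 0::2] = np.sin(angles)
    PE[:, 1::2] = np.cos(angles)
    return PE

pe = positional_encoding(seq_len=10)
print(pe.shape)  # (10, 512), one d_model-dimensional encoding per position
```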
Figure 3. (Left) The process of self-attention. (Right) Multi-head
attention. (The image is from [123])
2.2. Multi-Head Attention

Multi-head attention is a mechanism added to the vanilla self-attention layer in order to boost its performance. Note that for a given reference word, we often want to focus on several other words when going through the sentence. A single-head self-attention layer limits the ability to focus on one or more specific positions without simultaneously affecting the attention on other, equally important positions. Multi-head attention addresses this by giving the attention layers different representation subspaces. Specifically, different query, key and value matrices are used for the different heads, and due to random initialization these matrices can project the input vectors into different representation subspaces after training.

In detail, given an input vector and the number of heads $h$, the input vector is first transformed into three different groups of vectors: the query group, the key group and the value group. There are $h$ vectors in each group, with dimension $d_{q'} = d_{k'} = d_{v'} = d_{model}/h = 64$. Vectors derived from different inputs are then packed together into three different groups of matrices $\{Q_i\}_{i=1}^{h}$, $\{K_i\}_{i=1}^{h}$ and $\{V_i\}_{i=1}^{h}$. The process of multi-head attention is then as follows:
$$\mathrm{MultiHead}(Q', K', V') = \mathrm{Concat}(\mathrm{head}_1, \cdots, \mathrm{head}_h)\, W^{o},$$
$$\text{where } \mathrm{head}_i = \mathrm{Attention}(Q_i, K_i, V_i), \qquad (4)$$

where $Q'$ is the concatenation of $\{Q_i\}_{i=1}^{h}$ (and likewise for $K'$ and $V'$), and $W^{o} \in \mathbb{R}^{d_{model} \times d_{model}}$ is the linear projection matrix.
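A minimal sketch of Eq. 4, assuming the attention function from the Section 2.1 sketch is available: the input is projected into h separate query/key/value groups, each head attends independently, and the concatenated heads are projected with W^o. The random projection matrices stand in for learned parameters.

```python
import numpy as np

def multi_head_attention(X, h=8, d_model=512, rng=np.random.default_rng(0)):
    """Multi-head attention (Eq. 4) on a single sequence X of shape (n, d_model)."""
    d_head = d_model // h  # d_q' = d_k' = d_v' = d_model / h = 64
    heads = []
    for _ in range(h):
        # Per-head projections; learned in a real transformer, random here for illustration.
        W_q, W_k, W_v = (rng.standard_normal((d_model, d_head)) * 0.02 for _ in range(3))
        heads.append(attention(X @ W_q, X @ W_k, X @ W_v))  # attention() from the Sec. 2.1 sketch
    W_o = rng.standard_normal((d_model, d_model)) * 0.02     # output projection W^o
    return np.concatenate(heads, axis=-1) @ W_o              # shape: (n, d_model)
```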
Figure 4. Detailed structure of transformer. (The image is
from [123])
2.3. Other Parts in Transformer

Residual connections in the encoder and decoder. As shown in Fig. 4, a residual connection is added to each sub-layer in the encoder and decoder in order to strengthen the flow of information and obtain better performance, followed by layer normalization [4]. The output of the operations mentioned above can be described as:

$$\mathrm{LayerNorm}(X + \mathrm{Attention}(X)). \qquad (5)$$

Note that $X$ is used as the input of the self-attention layer here, since the query, key and value matrices $Q$, $K$ and $V$ are all derived from the same input matrix $X$.
Feed-forward neural network. A feed-forward NN is applied after the self-attention layers in each encoder and decoder. Specifically, the feed-forward NN consists of two linear transformation layers with a ReLU activation function between them, which can be denoted as the following function:

$$\mathrm{FFNN}(X) = W_2\, \sigma(W_1 X), \qquad (6)$$

where $W_1$ and $W_2$ are the parameter matrices of the two linear transformation layers, and $\sigma$ represents the ReLU activation function. The dimensionality of the hidden layer is $d_h = 2048$.
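Putting the sub-layers together, one encoder layer applies Eq. 5 around self-attention and the same residual-plus-normalization pattern around the feed-forward network of Eq. 6. This is an illustrative sketch that reuses the multi_head_attention helper defined above; the simplified layer_norm (without learned scale and shift) and the random weights are our assumptions.

```python
import numpy as np

def layer_norm(X, eps=1e-6):
    # Normalize each token vector to zero mean and unit variance (scale/shift omitted).
    mu = X.mean(axis=-1, keepdims=True)
    sigma = X.std(axis=-1, keepdims=True)
    return (X - mu) / (sigma + eps)

def encoder_layer(X, d_model=512, d_h=2048, rng=np.random.default_rng(0)):
    """One encoder layer: Eq. 5 around self-attention, then the FFN of Eq. 6."""
    # Sub-layer 1: multi-head self-attention with residual connection and LayerNorm (Eq. 5).
    Y = layer_norm(X + multi_head_attention(X, d_model=d_model, rng=rng))
    # Sub-layer 2: position-wise feed-forward network (Eq. 6), hidden size d_h = 2048.
    W1 = rng.standard_normal((d_model, d_h)) * 0.02
    W2 = rng.standard_normal((d_h, d_model)) * 0.02
    ffn = np.maximum(0, Y @ W1) @ W2  # ReLU between the two linear layers
    return layer_norm(Y + ffn)
```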
Final layer in the decoder. The final layer in the decoder aims to turn the stack of vectors back into a word. This is achieved by a linear layer followed by a softmax layer. The linear layer projects the vector into a logits vector with $d_{word}$ dimensions, where $d_{word}$ is the number of words in the vocabulary. A softmax layer is then used to transform the logits vector into probabilities.
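As an illustrative sketch of this last step (the toy vocabulary, random weights and greedy word choice are our own assumptions), the decoder output is projected to vocabulary-sized logits and passed through a softmax:

```python
import numpy as np

def decode_word(x, W_vocab, vocab):
    """Map a d_model-dimensional decoder output to a word via linear layer + softmax."""
    logits = x @ W_vocab                 # shape: (d_word,) -- one logit per vocabulary word
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                 # softmax over the vocabulary
    return vocab[int(np.argmax(probs))]  # pick the most probable word (greedy choice)

vocab = ["<pad>", "hello", "world"]
rng = np.random.default_rng(0)
print(decode_word(rng.standard_normal(512), rng.standard_normal((512, len(vocab))), vocab))
```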
Most of the transformers used in computer vision tasks utilize the encoder module of the original transformer. In short, it can be treated as a new feature selector, different from convolutional neural networks (CNNs) and recurrent neural networks (RNNs). Compared to CNNs, which focus only on local characteristics, the transformer is able to capture long-distance characteristics, which means that global information can easily be derived by the transformer. Compared to RNNs, whose hidden states must be computed sequentially, the transformer is much more efficient, since the output of the self-attention layer and the fully connected layers can be computed in parallel and accelerated easily. Thus, it is meaningful to further study the application of the transformer not only in NLP but also in computer vision.
3. Revisiting Transformers for NLP

Before the advent of the Transformer, recurrent neural networks (e.g., GRU [26] and LSTM [50]) with added attention empowered most state-of-the-art language models. However, in RNNs the information flow needs to be processed sequentially from the previous hidden state to the next one, which precludes acceleration and parallelization during training and thus hinders the potential of RNNs to process longer sequences or build larger models. In 2017, Vaswani et al. [123] propose the Transformer, a novel encoder-decoder architecture built solely on multi-head self-attention mechanisms and feed-forward neural networks, aiming to solve sequence-to-sequence natural language tasks (e.g., machine translation) while acquiring global dependencies with ease. The success of the Transformer demonstrates that leveraging attention mechanisms alone can achieve performance comparable to attentive RNNs. Moreover, the architecture of the Transformer favors massively parallel computing, which enables training on larger datasets and has thus led to the surge of large pre-trained models (PTMs) for natural language processing.
BERT [29] and its variants (e.g., SpanBERT [63], RoBERTa [82]) are a series of PTMs built on the multi-layer Transformer encoder architecture. Two tasks are conducted on the BookCorpus [156] and English Wikipedia datasets at the pre-training stage of BERT: 1) masked language modeling (MLM), which first randomly masks out some tokens in the input and then trains the model to predict them; 2) next sentence prediction, which uses paired sentences as input and predicts whether the second sentence is the original one in the document. After pre-training, BERT can be fine-tuned on a wide range of downstream tasks by adding only one output layer. More specifically, when performing sequence-level tasks (e.g., sentiment analysis), BERT uses the representation of the first token for classification, while for token-level tasks (e.g., named entity recognition), all tokens are fed into the softmax layer for classification. At the time of release, BERT achieves state-of-the-art results on 11 natural language processing tasks, setting up a milestone in pre-trained language models. The Generative Pre-trained Transformer series (e.g., GPT [99], GPT-2 [100]) are another type of pre-trained models, based on the Transformer decoder architecture, which uses masked self-attention mechanisms. The major difference between the GPT series and BERT lies in the way of pre-training. Unlike BERT, the GPT series are unidirectional language models pre-trained by left-to-right (LTR) language modeling. Besides, the sentence separator ([SEP]) and classifier token ([CLS]) are only involved in the fine-tuning stage of GPT, whereas BERT learns these embeddings during pre-training. Because of its unidirectional pre-training strategy, GPT shows superiority in many natural language generation tasks. More recently, a gigantic transformer-based model, GPT-3, with an incredible 175 billion parameters, has been introduced [10]. By pre-training on 45TB of compressed plaintext data, GPT-3 claims the ability to directly process different types of downstream natural language tasks without fine-tuning, achieving strong performance on many NLP datasets, covering both natural language understanding and generation. Besides the aforementioned transformer-based PTMs, many other models have been proposed since the introduction of the Transformer. As this is not the major topic of our survey, we simply list a few representative models in Table 2 for interested readers.
Apart from the PTMs trained on large corpora for general natural language processing tasks, transformer-based models have also been applied in many other NLP-related domains and to multi-modal tasks.

BioNLP Domain. Transformer-based models have outperformed many traditional biomedical methods. BioBERT [69] uses the Transformer architecture for biomedical text mining tasks. SciBERT [7] is developed by training the Transformer on 114M scientific articles covering the biomedical and computer science fields, aiming to execute NLP tasks related to the scientific domain more precisely. Huang et al. [55] propose ClinicalBERT, which utilizes the Transformer to develop and evaluate continuous representations of clinical notes; as a side effect, the attention map of ClinicalBERT can be used to explain predictions and thus discover high-quality connections between different medical contents.

Multi-Modal Tasks. Owing to the success of the Transformer across text-based NLP tasks, many research efforts are committed to exploiting the potential of the Transformer to process multi-modal tasks (e.g., video-text, image-text and audio-text).
Table 2. List of representative language models built on Transformer.

| Models | Architecture | Params | Fine-tuning |
|---|---|---|---|
| GPT [99] | Transformer Dec. | 117M | Yes |
| GPT-2 [100] | Transformer Dec. | 117M∼1542M | No |
| GPT-3 [10] | Transformer Dec. | 125M∼175B | No |
| BERT [29] | Transformer Enc. | 110M∼340M | Yes |
| RoBERTa [82] | Transformer Enc. | 355M | Yes |
| XLNet [136] | Two-Stream Transformer Enc. | ≈ BERT | Yes |
| ELECTRA [27] | Transformer Enc. | 335M | Yes |
| UniLM [30] | Transformer Enc. | 340M | Yes |
| BART [70] | Transformer | 110% of BERT | Yes |
| T5 [101] | Transformer | 220M∼11B | Yes |
| ERNIE (THU) [149] | Transformer Enc. | 114M | Yes |
| KnowBERT [94] | Transformer Enc. | 253M∼523M | Yes |

1. "Transformer" denotes the standard encoder-decoder architecture; "Transformer Enc." and "Transformer Dec." denote the encoder and decoder parts of the standard Transformer, respectively. The decoder uses masked self-attention to prevent attending to future tokens.
2. The data in the table are from [98].
VideoBERT [115] uses a CNN-based module to pre-process videos and obtain representation tokens, based on which a Transformer encoder is trained to learn video-text representations for downstream tasks such as video captioning. VisualBERT [72] and VL-BERT [114] propose single-stream unified Transformers to capture visual elements and image-text relationships for downstream tasks such as visual question answering (VQA) and visual commonsense reasoning (VCR). Moreover, several studies such as SpeechBERT [24] explore the possibility of encoding audio and text pairs with a Transformer encoder to process audio-text tasks like speech question answering (SQA).
The rapid development of transformer-based models for a variety of natural language processing and NLP-related tasks demonstrates their structural superiority and versatility. This empowers the Transformer to become a universal module in many AI fields beyond natural language processing. The following part of this survey focuses on the applications of the Transformer to a wide range of computer vision tasks that have emerged in the past two years.
4. Visual Transformer

In this section, we provide a comprehensive review of transformer-based models in computer vision, including applications in image classification, high-level vision, low-level vision and video processing. We also briefly summarize the applications of the self-attention mechanism and of model compression methods for efficient transformers.