SiT: Self-supervised vIsion Transformer
Sara Atito, Member IEEE, Muhammad Awais, and Josef Kittler, Life Member, IEEE
Abstract—
Self-supervised learning methods are gaining increasing traction in computer vision due to their recent success in reducing the gap
with supervised learning. In natural language processing (NLP), self-supervised learning and transformers are already the methods of
choice. The recent literature suggests that transformers are becoming increasingly popular in computer vision as well. So far,
vision transformers have been shown to work well when pretrained either with large scale supervised data [1] or with some kind of
co-supervision, e.g., in the form of a teacher network. These supervised pretrained vision transformers achieve very good results on
downstream tasks with minimal changes [1], [2], [3]. In this work we investigate the merits of self-supervised learning for pretraining
image/vision transformers and subsequently using them for downstream classification tasks. We propose Self-supervised vision Transformers
(SiT) and discuss several self-supervised training mechanisms to obtain a pretext model. The architectural flexibility of SiT allows us to
use it as an autoencoder and to work with multiple self-supervised tasks seamlessly. We show that a pretrained SiT can be finetuned for a
downstream classification task on small scale datasets consisting of a few thousand images rather than several million. The proposed
approach is evaluated on standard datasets using common protocols. The results demonstrate the strength of transformers and
their suitability for self-supervised learning. We outperform existing self-supervised learning methods by a large margin. We also
observe that SiT performs well in few-shot learning, and show that it learns useful representations by simply training a linear
classifier on top of the features learned by SiT. The pretraining, finetuning, and evaluation code will be made available at:
https://github.com/Sara-Ahmed/SiT.
Index Terms—Vision Transformer, Self-supervised Learning, Discriminative Learning, Image Classification, Transformer-based
Autoencoders.
1 INTRODUCTION
Recent trends, particularly in NLP, have shown that self-supervised
pretraining can significantly improve the performance
of downstream tasks [4], [5]. Similar trends
have been observed in speech recognition [6] and computer
vision applications [7], [8], [9], [10]. Self-supervised pretraining,
particularly in conjunction with transformers [11],
as exemplified by BERT [4], [5], is the approach of choice for
natural language processing (NLP). The success of self-supervised
learning comes at the cost of massive datasets
and huge capacity models, e.g., NLP transformers
are trained on hundreds of billions of words using
models with several billion parameters [5]. The recent
success of Transformers in image classification [1] generated
a lot of interest in the computer vision community. However,
the pretraining of vision transformers has mainly been studied for
very large scale supervised learning, e.g., datasets
consisting of hundreds of millions of labelled samples [1].
Very recently, vision transformers have been shown to perform
well on ImageNet without external data [2]; however,
they require distillation approaches and guidance from their CNN
counterparts. In short, pretraining on large scale supervised
datasets is the norm in computer vision for training
deep neural networks to obtain better performance.
However, manual annotation of training data is quite expensive,
despite advances in crowdsourcing innovations.
To address this limitation, self-supervised learning
methods [7], [9], [10], [12], [13], [14] have been proposed
to construct semantically meaningful image representations
from unlabelled data.
• Centre for Vision, Speech and Signal Processing (CVSSP), University of
Surrey, Guildford, United Kingdom
• {s.a.ahmed,m.a.rana,j.kittler}@surrey.ac.uk
Self-supervised methods can roughly be categorised into
generative and discriminative approaches. Generative
approaches [15], [16], [17] learn to model the distribution of the
data. However, data modelling is generally computationally
expensive and may not be necessary for representation
learning in all scenarios. On the other hand, discriminative
approaches, typically implemented in a contrastive learning
framework [8], [18], [19], [20] or using pretext tasks [21],
[22], [23], demonstrate the ability to obtain better generalised
representations with modest computational requirements.
The primary focus of contrastive learning is to learn
image embeddings that are invariant to different augmented
views of the same image while being discriminative among
different images. Despite the impressive results achieved
by contrastive learning methods, they often disregard the
learning of contextual representations, for which alterna-
tive pretext tasks, such as reconstruction-based approaches,
might be better suited. In recent years, a stream of novel
pretext tasks has been proposed in the literature, including
inpainting patches [24], colourisation [21], [25], [26], relative
patch location [15], solving jigsaw puzzles [27], [28], cross-
channel prediction [29], predicting noise [30], predicting
image rotations [22], spotting artefacts [23], etc.
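To make the contrastive objective concrete, the following is a minimal sketch of an
NT-Xent-style contrastive loss in PyTorch, in the spirit of [8], [18]; it is an
illustrative assumption rather than the training objective proposed in this paper, and
the function name and temperature value are our own.

    import torch
    import torch.nn.functional as F

    def nt_xent_loss(z1, z2, temperature=0.5):
        # z1, z2: (N, D) embeddings of two augmented views of the same N images
        z1 = F.normalize(z1, dim=1)
        z2 = F.normalize(z2, dim=1)
        z = torch.cat([z1, z2], dim=0)            # (2N, D)
        sim = z @ z.t() / temperature             # (2N, 2N) scaled cosine similarities
        sim.fill_diagonal_(float('-inf'))         # an embedding is never its own negative
        n = z1.size(0)
        # the positive for sample i is its other augmented view, at index i + n (or i - n)
        targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
        return F.cross_entropy(sim, targets)

    # usage: loss = nt_xent_loss(encoder(view1), encoder(view2))

Each embedding is pulled towards its paired augmented view and pushed away from the
remaining 2N - 2 embeddings in the batch, which is what makes the representation invariant
to augmentation while remaining discriminative among different images.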
In this work, we introduce a simple framework for
self-supervised learning that leverages the advantages of both
contrastive learning and pretext approaches. The main
contributions and findings of this study are summarised as
follows:
• We propose Self-supervised vision Transformer (SiT),
a novel method for self-supervised learning of visual
representations.