SiT: Self-supervised vIsion Transformer
Sara Atito, Member IEEE, Muhammad Awais, and Josef Kittler, Life Member, IEEE
Abstract—
Self-supervised learning methods are gaining increasing traction in computer vision due to their recent success in reducing the gap
with supervised learning. In natural language processing (NLP), self-supervised learning and transformers are already the methods of
choice. The recent literature suggests that transformers are becoming increasingly popular in computer vision as well. So far,
vision transformers have been shown to work well when pretrained either with large scale supervised data [1] or with some kind of
co-supervision, e.g., in the form of a teacher network. These supervised pretrained vision transformers achieve very good results on
downstream tasks with minimal changes [1], [2], [3]. In this work we investigate the merits of self-supervised learning for pretraining
image/vision transformers and subsequently using them for downstream classification tasks. We propose Self-supervised vision Transformers
(SiT) and discuss several self-supervised training mechanisms to obtain a pretext model. The architectural flexibility of SiT allows us to
use it as an autoencoder and to work with multiple self-supervised tasks seamlessly. We show that a pretrained SiT can be finetuned for a
downstream classification task on small scale datasets consisting of a few thousand images rather than several million. The proposed
approach is evaluated on standard datasets using common protocols. The results demonstrate the strength of transformers and
their suitability for self-supervised learning. We outperform existing self-supervised learning methods by a large margin. We also
observe that SiT performs well in few-shot learning, and show that it learns useful representations by simply training a linear
classifier on top of the features learned by SiT. The pretraining, finetuning, and evaluation code will be made available at:
https://github.com/Sara-Ahmed/SiT.
Index Terms—Vision Transformer, Self-supervised Learning, Discriminative Learning, Image Classification, Transformer-based
Autoencoders.
1 INTRODUCTION
Recent trends, particularly in NLP, have shown that self-supervised
pretraining can significantly improve the performance
of downstream tasks [4], [5]. Similar trends
have been observed in speech recognition [6] and computer
vision applications [7], [8], [9], [10]. Self-supervised pretraining,
particularly in conjunction with transformers [11],
as exemplified by BERT [4], [5], is the approach of choice for
natural language processing (NLP). The success of self-supervised
learning comes at the cost of massive datasets
and huge capacity models, e.g., NLP transformers
are trained on hundreds of billions of words using
models with several billion parameters [5]. The recent
success of Transformers in image classification [1] generated
a lot of interest in the computer vision community. However,
the pretraining of vision transformers has mainly been studied for
very large scale supervised learning, e.g., datasets
consisting of hundreds of millions of labelled samples [1].
Very recently, vision transformers have been shown to perform
well on ImageNet without external data [2]; however,
they require distillation approaches and guidance from their CNN
counterparts. In short, pretraining on large scale supervised
datasets is the norm in computer vision for training
deep neural networks to obtain better performance.
However, manual annotation of training data is quite expensive,
despite advances in crowdsourcing innovations.
To address this limitation, self-supervised learning
methods [7], [9], [10], [12], [13], [14] have been proposed
to construct semantically meaningful image representations
from unlabelled data.
• Centre for Vision, Speech and Signal Processing (CVSSP), University of
Surrey, Guildford, United Kingdom
• {s.a.ahmed,m.a.rana,j.kittler}@surrey.ac.uk
Self-supervised methods can roughly be categorised into
generative and discriminative approaches. Generative
approaches [15], [16], [17] learn to model the distribution of the
data. However, data modelling is generally computationally
expensive and may not be necessary for representation
learning in all scenarios. On the other hand, discriminative
approaches, typically implemented in a contrastive learning
framework [8], [18], [19], [20] or using pretext tasks [21],
[22], [23], demonstrate the ability to obtain better generalised
representations with modest computational requirements.
The primary focus of contrastive learning is to learn
image embeddings that are invariant to different augmented
views of the same image while being discriminative among
different images. Despite the impressive results achieved
by contrastive learning methods, they often disregard the
learning of contextual representations, for which alterna-
tive pretext tasks, such as reconstruction-based approaches,
might be better suited. In recent years, a stream of novel
pretext tasks has been proposed in the literature, including
inpainting patches [24], colourisation [21], [25], [26], relative
patch location [15], solving jigsaw puzzles [27], [28], cross-
channel prediction [29], predicting noise [30], predicting
image rotations [22], spotting artefacts [23], etc.
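To make the contrastive objective concrete, the following is a minimal sketch of an
NT-Xent-style contrastive loss in PyTorch, in the spirit of [8], [18]; it is an
illustrative assumption rather than the training objective proposed in this paper, and
the function name and temperature value are our own.

    import torch
    import torch.nn.functional as F

    def nt_xent_loss(z1, z2, temperature=0.5):
        # z1, z2: (N, D) embeddings of two augmented views of the same N images
        z1 = F.normalize(z1, dim=1)
        z2 = F.normalize(z2, dim=1)
        z = torch.cat([z1, z2], dim=0)            # (2N, D)
        sim = z @ z.t() / temperature             # (2N, 2N) scaled cosine similarities
        sim.fill_diagonal_(float('-inf'))         # an embedding is never its own negative
        n = z1.size(0)
        # the positive for sample i is its other augmented view, at index i + n (or i - n)
        targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
        return F.cross_entropy(sim, targets)

    # usage: loss = nt_xent_loss(encoder(view1), encoder(view2))

Each embedding is pulled towards its paired augmented view and pushed away from the
remaining 2N - 2 embeddings in the batch, which is what makes the representation invariant
to augmentation while remaining discriminative among different images.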
In this work, we introduce a simple framework for
self-supervised learning that leverages the advantages of both
contrastive learning and pretext approaches. The main
contributions and findings of this study are summarised as
follows:
• We propose Self-supervised vision Transformer (SiT),
a novel method for self-supervised learning of visual
representations.