A SURVEY ON CONTRASTIVE SELF-SUPERVISED LEARNING
Ashish Jaiswal
The University of Texas at Arlington
Arlington, TX 76019
ashish.jaiswal@mavs.uta.edu
Ashwin Ramesh Babu
The University of Texas at Arlington
Arlington, TX 76019
ashwin.rameshbabu@mavs.uta.edu
Mohammad Zaki Zadeh
The University of Texas at Arlington
Arlington, TX 76019
mohammad.zakizadehgharie@mavs.uta.edu
Debapriya Banerjee
The University of Texas at Arlington
Arlington, TX 76019
debapriya.banerjee2@mavs.uta.edu
Fillia Makedon
The University of Texas at Arlington
Arlington, TX 76019
makedon@uta.edu
ABSTRACT
Self-supervised learning has gained popularity because of its ability to avoid the cost of annotating
large-scale datasets. It is capable of adopting self-defined pseudo labels as supervision and use the
learned representations for several downstream tasks. Specifically, contrastive learning has recently
become a dominant component in self-supervised learning methods for computer vision, natural
language processing (NLP), and other domains. It aims at embedding augmented versions of the
same sample close to each other while trying to push away embeddings from different samples. This
paper provides an extensive review of self-supervised methods that follow the contrastive approach.
The work explains commonly used pretext tasks in a contrastive learning setup, followed by different
architectures that have been proposed so far. Next, we have a performance comparison of different
methods for multiple downstream tasks such as image classification, object detection, and action
recognition. Finally, we conclude with the limitations of the current methods and the need for further
techniques and future directions to make substantial progress.
Keywords contrastive learning · self-supervised learning · discriminative learning · image/video classification · object detection · unsupervised learning · transfer learning
1 Introduction
The advancements in deep learning have elevated it to become one of the core components in most intelligent systems in
existence. The ability to learn rich patterns from the abundance of data available today has made deep neural networks
(DNNs) a compelling approach in the majority of computer vision (CV) tasks such as image classification, object
detection, image segmentation, activity recognition as well as natural language processing (NLP) tasks such as sentence
classification, language modeling, machine translation, etc. However, the supervised approach to learning features from labeled data has almost reached saturation due to the intense labor required to manually annotate millions of data samples. This is because most modern (supervised) computer vision systems try to learn some form of
image representations by finding a pattern between the data points and their respective annotations in large datasets.
Works such as GRAD-CAM [1] have proposed techniques that provide visual explanations for decisions made by a model to make them more transparent and explainable.
Traditional supervised learning approaches heavily rely on the amount of annotated training data available. Even
though there’s a plethora of data available out there, the lack of annotations has pushed researchers to find alternative
approaches that can leverage this unlabeled data. This is where self-supervised methods play a vital role in fueling the progress of deep learning without the need for expensive annotations, by learning feature representations where the data itself provides supervision.
Figure 1: Basic intuition behind the contrastive learning paradigm: pull the original and augmented images closer together and push the original and negative images apart
Supervised learning not only depends on expensive annotations but also suffers from issues such as generalization
error, spurious correlations, and adversarial attacks [2]. Recently, self-supervised learning methods have integrated
both generative and contrastive approaches that have been able to utilize unlabeled data to learn the underlying
representations. A popular approach has been to propose various pretext tasks that help in learning features using
pseudo-labels. Tasks such as image-inpainting, colorizing greyscale images, jigsaw puzzles, super-resolution, video
frame prediction, audio-visual correspondence, etc. have proven to be effective for learning good representations.
Figure 2: Contrastive learning pipeline for self-supervised training
Generative models gained popularity after the introduction of Generative Adversarial Networks (GANs) [3] in
2014. The work later became the foundation for many successful architectures such as CycleGAN [4], StyleGAN [5],
PixelRNN [6], Text2Image [7], DiscoGAN [8], etc. These methods inspired more researchers to switch to training deep
learning models with unlabeled data in a self-supervised setup. Despite their success, researchers started realizing some of the complications in GAN-based approaches. They are harder to train for two main reasons: (a) non-convergence, where the model parameters oscillate widely and rarely converge, and (b) the discriminator becoming so successful that the generator fails to create realistic fakes, at which point learning cannot continue. Also, proper synchronization is required between the generator and the discriminator to prevent the discriminator from converging and the generator from diverging.
Figure 3: Top-1 classification accuracy of different contrastive learning methods against a baseline supervised method on ImageNet
Unlike generative models, contrastive learning (CL) is a discriminative approach that aims to group similar samples closer together and diverse samples far from each other, as shown in figure 1. To achieve this, a similarity metric is used to measure how close two embeddings are. In particular, for computer vision tasks, a contrastive loss is evaluated based on the feature representations of the images extracted from an encoder network. For instance, one sample is taken from the training dataset and a transformed version of it is obtained by applying appropriate data augmentation techniques. During training, as shown in figure 2, the augmented version of the original sample is treated as a positive sample, and the rest of the samples in the batch/dataset (depending on the method) are treated as negative samples. The model is then trained to differentiate positive samples from negative ones with the help of some pretext task (explained in section 2). In doing so, the model learns quality representations of the samples, which are later used for transferring knowledge to downstream tasks. This idea is supported by an interesting experiment conducted by Epstein [9] in 2016, where he asked his students to draw a dollar bill with and without looking at the bill. The results show that the brain does not require complete information about a visual piece to differentiate one object from another; a rough representation of an image is enough to do so.
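To ground the contrastive objective described above, the following is a minimal sketch of an InfoNCE-style contrastive loss of the kind popularized by methods such as SimCLR; the function name, temperature value, and choice of cosine similarity as the similarity metric are illustrative assumptions rather than the exact formulation of any single method.

```python
# Minimal sketch of an InfoNCE-style contrastive loss. Row i of `anchors`
# and `positives` holds embeddings of two augmented views of the same sample;
# every other row in the batch implicitly serves as a negative.
import torch
import torch.nn.functional as F

def info_nce_loss(anchors: torch.Tensor, positives: torch.Tensor,
                  temperature: float = 0.5) -> torch.Tensor:
    # Cosine similarity: L2-normalize, then take pairwise dot products.
    anchors = F.normalize(anchors, dim=1)
    positives = F.normalize(positives, dim=1)
    logits = anchors @ positives.t() / temperature  # (N, N) similarity matrix
    # Diagonal entries are the positive pairs; off-diagonals act as negatives.
    targets = torch.arange(anchors.size(0), device=anchors.device)
    return F.cross_entropy(logits, targets)
```

Minimizing this loss pushes each anchor's similarity to its own positive above its similarity to every other sample in the batch, which is exactly the grouping behavior illustrated in figure 1.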
Most of the earlier works in this area combined some form of instance-level classification approach [10][11][12] with contrastive learning and were successful to some extent. However, recent methods such as SwAV [13], MoCo [14], and SimCLR [15] with modified approaches have produced results comparable to the state-of-the-art supervised method on
the ImageNet [16] dataset, as shown in figure 3. Similarly, PIRL [17], Selfie [18], and [19] are some papers that reflect the effectiveness of the pretext tasks being used and how they boost the performance of their models.
2 Pretext Tasks
Pretext tasks are self-supervised tasks that act as an important strategy to learn representations of the data using pseudo
labels. These pseudo labels are generated automatically based on the attributes found in the data. The learned model
from the pretext task can be used for any downstream tasks such as classification, segmentation, detection, etc. in
computer vision. Furthermore, these tasks can be applied to any kind of data such as image, video, speech, signals,
and so on. For a pretext task in contrastive learning, the original image acts as an anchor, its augmented (transformed)
version acts as a positive sample, and the rest of the images in the batch or in the training data act as negative samples.
Most of the commonly used pretext tasks are divided into four main categories: color transformation, geometric
transformation, context-based tasks, and cross-modal based tasks. These pretext tasks have been used in various
scenarios based on the problem intended to be solved.
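Before turning to the individual categories, here is a minimal sketch of how the anchor/positive/negative roles above are typically realized in code; the specific transforms and parameter values are assumptions for illustration, not a prescription from the surveyed methods.

```python
# Sketch: produce two independently augmented views of one image. Each view
# acts as the positive for the other (with the image itself as the anchor),
# while the remaining images in the batch serve as negatives.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
    transforms.ToTensor(),
])

class TwoViews:
    """Callable wrapper returning two independent augmentations of an image."""
    def __init__(self, transform):
        self.transform = transform

    def __call__(self, img):
        return self.transform(img), self.transform(img)
```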
2.1 Color Transformation
Figure 4: Color transformation as a pretext task [15]. (a) Original (b) Gaussian noise (c) Gaussian blur (d) Color distortion (jitter)
Color transformation involves basic adjustments of the color levels in an image, such as blurring, color distortion, or conversion to grayscale. Figure 4 shows an example of color transformation applied to a sample image from the ImageNet dataset [15]. Through this pretext task, the network learns to recognize similar images regardless of their colors.
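As a hedged illustration, a color-based augmentation pipeline of this kind could be written with torchvision as follows; the probabilities and jitter strengths are assumptions loosely in the spirit of SimCLR-style recipes [15], not values taken from the survey.

```python
# Illustrative color transformations: color jitter, grayscale conversion,
# and Gaussian blur, each applied stochastically.
from torchvision import transforms

color_transform = transforms.Compose([
    transforms.RandomApply(
        [transforms.ColorJitter(brightness=0.8, contrast=0.8,
                                saturation=0.8, hue=0.2)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    # Kernel size must be odd; the sigma range controls blur strength.
    transforms.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0)),
])
```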
2.2 Geometric Transformation
A geometric transformation is a spatial transformation in which the geometry of the image is modified without altering its actual pixel information. Such transformations include scaling, random cropping, flipping (horizontally, vertically), etc., as represented in figure 5, through which global-to-local view prediction is achieved. Here the original image is considered the global view and the transformed version is considered the local view. Chen et al. [15] performed such transformations to learn features during the pretext task.
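By way of illustration, a geometric counterpart to the color pipeline above might look like the following sketch; the crop scale, flip probability, and rotation range are assumptions.

```python
# Illustrative geometric transformations: random cropping yields local views
# of the original (global) image; flips and rotations alter geometry only.
from torchvision import transforms

geometric_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=90),  # random angle in [-90, 90]
])
```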
2.3 Context-Based
2.3.1 Jigsaw Puzzle
Traditionally, solving jigsaw puzzles has been a prominent task in learning features from an image in an unsupervised
way. It involves identifying the correct position of the scrambled patches in an image by training an encoder (figure 6).
Figure 5: Geometric transformation as a pretext task [15]. (a) Original (b) Crop and resize (c) Rotate (90°, 180°, 270°) (d) Crop, resize, flip
In terms of contrastive learning, the original image is the anchor, and an augmented image formed by scrambling the
patches in the original image acts as a positive sample. The rest of the images in the dataset/batch are considered to be
negative samples [17].
Figure 6: Solving a jigsaw puzzle used as a pretext task to learn representations. (a) Original image (b) Reshuffled image. The original image is the anchor and the reshuffled image is the positive sample.
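The scrambled positive can be generated along the lines of the following minimal sketch; the grid size and tensor layout are assumptions, and practical jigsaw pretext tasks often work with a fixed set of patch permutations rather than a fully random one.

```python
# Sketch: split an image tensor into a grid of patches and permute them to
# create the jigsaw-scrambled positive sample described above.
import torch

def jigsaw_scramble(img: torch.Tensor, grid: int = 3) -> torch.Tensor:
    """img: (C, H, W) tensor with H and W divisible by `grid`."""
    c, h, w = img.shape
    ph, pw = h // grid, w // grid
    # Cut the image into grid x grid patches.
    patches = img.unfold(1, ph, ph).unfold(2, pw, pw)  # (C, grid, grid, ph, pw)
    patches = patches.reshape(c, grid * grid, ph, pw)
    patches = patches[:, torch.randperm(grid * grid)]  # shuffle patch order
    # Stitch the shuffled patches back into a full image.
    patches = patches.reshape(c, grid, grid, ph, pw)
    return patches.permute(0, 1, 3, 2, 4).reshape(c, h, w)
```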
2.3.2 Frame Order Based
This approach applies to data that extends through time. An ideal application would be in the case of sensor data or
a sequence of image frames (video). A video contains a sequence of semantically related frames. This implies that
frames that are nearby with respect to time are closely related and the ones that are far away are less likely to be related.
Intuitively, the motivation for such an approach is that solving a pretext task of recovering the temporal coherence of a video allows the model to learn useful visual representations. Here, a video whose frames have been shuffled acts as a positive sample, while all other videos in the batch/dataset act as negative samples.
Similarly, other possible approaches include randomly sampling two clips of the same length from a longer video or
applying spatial augmentation to each video clip. The goal is to use a contrastive loss to train the model such that clips taken from the same video are pulled closer together whereas clips from different videos are pushed apart in the embedding space. In the work proposed by Qian et al. [20], the framework contrasts the similarity between two positive samples against that of negative samples. The positive pairs are two augmented clips from the same video. As a result, it separates
all encoded videos into non-overlapping regions such that an augmentation used in the training perturbs an encoded
video only within a small region in the representation space.
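A minimal sketch of this clip-sampling idea follows; the clip length, frame count, and function name are illustrative assumptions and not the actual implementation of [20].

```python
# Sketch: sample two fixed-length clips from the same video to form a
# positive pair; clips drawn from other videos in the batch act as negatives.
import random

def sample_positive_clips(num_frames: int, clip_len: int = 16):
    """Return frame indices for two random clips from one video."""
    starts = [random.randint(0, num_frames - clip_len) for _ in range(2)]
    return [list(range(s, s + clip_len)) for s in starts]

clip_a, clip_b = sample_positive_clips(num_frames=300)  # a positive pair
```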