iBOT
builds on DINO and combines its objective with a masked image modeling
objective applied directly in latent space. Here, the reconstruction target is not the image
pixels but the same patches embedded through the teacher network.
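As a rough illustration, the sketch below shows one way such a masked latent-distillation loss can be written; the function and head names are ours, and details such as iBOT's learnable [MASK] token and teacher centering are omitted.

```python
import torch
import torch.nn.functional as F

def masked_latent_distillation_loss(student, teacher, head_s, head_t,
                                    patches, mask, tau_s=0.1, tau_t=0.04):
    """iBOT-style objective (sketch): reconstruct teacher patch embeddings,
    not pixels, at the masked positions.

    patches: (B, N, D) patch tokens; mask: (B, N) bool, True = masked.
    student/teacher map patch tokens to patch tokens; heads map to prototype scores.
    """
    with torch.no_grad():                        # teacher embeds the unmasked view
        targets = F.softmax(head_t(teacher(patches)) / tau_t, dim=-1)
    # Simplification: real iBOT swaps masked patches for a learnable [MASK] token.
    corrupted = patches.masked_fill(mask.unsqueeze(-1), 0.0)
    log_preds = F.log_softmax(head_s(student(corrupted)) / tau_s, dim=-1)
    ce = -(targets * log_preds).sum(dim=-1)      # per-patch cross-entropy, (B, N)
    return ce[mask].mean()                       # only masked positions contribute
```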
DINOv2
further builds on iBOT and significantly improves its performance in both
linear and k-NN evaluations by refining the training recipe and the architecture, and by
introducing additional regularizers such as KoLeo [Sablayrolles et al., 2018]. In addition,
DINOv2 curates a larger pretraining dataset consisting of 142 million images (further
discussion in Section 2.7).
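The KoLeo regularizer is worth unpacking: it is a nearest-neighbor estimate of differential entropy that penalizes small nearest-neighbor distances, encouraging the features within a batch to spread out uniformly. A minimal sketch, with our own naming, assuming a batch of feature vectors:

```python
import torch
import torch.nn.functional as F

def koleo_loss(x, eps=1e-8):
    """KoLeo regularizer (sketch): -1/n * sum_i log(min_{j != i} ||x_i - x_j||).

    Penalizing small nearest-neighbor distances pushes batch features toward
    a uniform spread on the hypersphere. x: (B, D) feature vectors.
    """
    x = F.normalize(x, dim=-1)             # applied to l2-normalized features
    dists = torch.cdist(x, x)              # (B, B) pairwise distances
    dists.fill_diagonal_(float("inf"))     # exclude self-distances
    nn_dist = dists.min(dim=1).values      # distance to each point's nearest neighbor
    return -torch.log(nn_dist + eps).mean()
```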
Many other methods belong to this self-distillation family. MoCo is another popular
method based on building a dictionary look-up, which was shown in some cases to
surpass supervised learning on segmentation and object detection benchmarks [He et al.,
2020a]. Originally, the momentum encoder was introduced alongside a queue in
contrastive learning [He et al., 2020a], which extends the results of Dosovitskiy et al. [2014].
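A minimal sketch of this dictionary look-up, in the spirit of MoCo's InfoNCE loss (queue management and exact hyperparameters are omitted; the names are ours): the query must match its positive key against a queue of negatives.

```python
import torch
import torch.nn.functional as F

def dictionary_lookup_loss(q, k_pos, queue, tau=0.07):
    """MoCo-style InfoNCE (sketch): classify the positive key among queued negatives.

    q:      (B, D) queries from the student (query) encoder
    k_pos:  (B, D) positive keys from the momentum (key) encoder
    queue:  (K, D) dictionary of keys encoded from past batches
    """
    q = F.normalize(q, dim=-1)
    k_pos = F.normalize(k_pos, dim=-1)
    queue = F.normalize(queue, dim=-1)
    l_pos = (q * k_pos).sum(dim=-1, keepdim=True)   # (B, 1) positive logits
    l_neg = q @ queue.t()                           # (B, K) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    # The positive key sits at index 0 of every row of logits.
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)
```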
MoCo’s moving average uses a relatively large momentum, with a default value of $\xi = 0.999$. This higher momentum value works much better than a smaller value of, say, $\xi = 0.9$.
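Concretely, the teacher (key) encoder is an exponential moving average of the student (query) encoder. A minimal sketch, assuming the two networks share the same architecture:

```python
import torch

@torch.no_grad()
def momentum_update(student, teacher, xi=0.999):
    """EMA update used by MoCo-style methods:
    theta_teacher <- xi * theta_teacher + (1 - xi) * theta_student.

    A large xi (e.g., 0.999) makes the teacher evolve slowly and smoothly,
    yielding stable targets; a small xi (e.g., 0.9) changes the teacher too
    quickly and degrades performance.
    """
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(xi).add_(p_s, alpha=1.0 - xi)
```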
When SimCLR introduced the use of a projector and stronger data augmentations, MoCoV2
[Chen et al., 2020d] followed suit with stronger data augmentations and a projector head
to boost performance. In a similar spirit, ISD [Tejankar et al., 2021] compares a query
distribution to anchors from the student distribution using a KL divergence, which relaxes
the binary distinction between positive and negative samples. MSF [Koohpayegani et al., 2021]
compares a query’s nearest-neighbor representation to the student target’s representation
and then minimizes the $\ell_2$ distance between them with renormalization (akin to cosine
similarity maximization; see the identity below). Another approach, SSCD, extends the
contrastive objective to the task of copy detection, outperforming dedicated copy detection
models and other contrastive methods [Pizzi et al., 2022].
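The equivalence invoked for MSF follows from a one-line identity: after renormalization, the squared $\ell_2$ distance is an affine function of cosine similarity, so minimizing one maximizes the other:
\[
\Bigl\lVert \tfrac{u}{\lVert u \rVert_2} - \tfrac{v}{\lVert v \rVert_2} \Bigr\rVert_2^2
= 2 - 2\,\frac{\langle u, v \rangle}{\lVert u \rVert_2 \, \lVert v \rVert_2}
= 2 - 2\cos(u, v).
\]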
Aside from the widespread use of the contrastive objective, many more methods employ
similar running-average updates as part of their training mechanism. Examples include
self-distillation [Hinton et al., 2015, Furlanello et al., 2018], the Deep Q-Network in
reinforcement learning [Mnih et al., 2013], Mean Teacher in semi-supervised learning
[Tarvainen and Valpola, 2017], and even model averaging in supervised and generative
modeling [Jean et al., 2014].
2.4 The Canonical Correlation Analysis Family:
VICReg/BarlowTwins/SWAV/W-MSE
The SSL canonical correlation analysis family originates with the canonical correlation
framework (CCA) [Hotelling, 1992]. The high-level goal of CCA is to infer the relationship
between two variables by analyzing their cross-covariance matrices. Specifically, let
$X \in \mathbb{R}^{D}$ and $Y \in \mathbb{R}^{D}$. The CCA framework seeks two transformations $U = f_x(X)$ and