iBOT
builds on DINO and combines its objective with a masked image modeling
objective applied directly in latent space. Here, the reconstruction target is not the image
pixels but the same patches embedded through the teacher network.
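As a rough illustration, the sketch below shows one way such a masked latent-distillation loss can be written; the function and head names are ours, and details such as iBOT's learnable [MASK] token and teacher centering are omitted.

```python
import torch
import torch.nn.functional as F

def masked_latent_distillation_loss(student, teacher, head_s, head_t,
                                    patches, mask, tau_s=0.1, tau_t=0.04):
    """iBOT-style objective (sketch): reconstruct teacher patch embeddings,
    not pixels, at the masked positions.

    patches: (B, N, D) patch tokens; mask: (B, N) bool, True = masked.
    student/teacher map patch tokens to patch tokens; heads map to prototype scores.
    """
    with torch.no_grad():                        # teacher embeds the unmasked view
        targets = F.softmax(head_t(teacher(patches)) / tau_t, dim=-1)
    # Simplification: real iBOT swaps masked patches for a learnable [MASK] token.
    corrupted = patches.masked_fill(mask.unsqueeze(-1), 0.0)
    log_preds = F.log_softmax(head_s(student(corrupted)) / tau_s, dim=-1)
    ce = -(targets * log_preds).sum(dim=-1)      # per-patch cross-entropy, (B, N)
    return ce[mask].mean()                       # only masked positions contribute
```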
DINOv2
further builds on iBOT and significantly improves its performance in both
linear and k-NN evaluations by refining the training recipe and the architecture, and by
introducing additional regularizers such as KoLeo [Sablayrolles et al., 2018]. In addition,
DINOv2 curates a larger pretraining dataset consisting of 142 million images (further
discussion in Section 2.7).
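The KoLeo regularizer is worth unpacking: it is a nearest-neighbor estimate of differential entropy that penalizes small nearest-neighbor distances, encouraging the features within a batch to spread out uniformly. A minimal sketch, with our own naming, assuming a batch of feature vectors:

```python
import torch
import torch.nn.functional as F

def koleo_loss(x, eps=1e-8):
    """KoLeo regularizer (sketch): -1/n * sum_i log(min_{j != i} ||x_i - x_j||).

    Penalizing small nearest-neighbor distances pushes batch features toward
    a uniform spread on the hypersphere. x: (B, D) feature vectors.
    """
    x = F.normalize(x, dim=-1)             # applied to l2-normalized features
    dists = torch.cdist(x, x)              # (B, B) pairwise distances
    dists.fill_diagonal_(float("inf"))     # exclude self-distances
    nn_dist = dists.min(dim=1).values      # distance to each point's nearest neighbor
    return -torch.log(nn_dist + eps).mean()
```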
Many other methods belong to this self-distillation family. MoCo is another popular
method based on building a dictionary look-up, which was shown in some cases to
surpass supervised learning on segmentation and object detection benchmarks [He et al.,
2020a]. Originally, the momentum encoder was introduced alongside a queue in
contrastive learning [He et al., 2020a], which extends the results of Dosovitskiy et al. [2014].
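A minimal sketch of this dictionary look-up, in the spirit of MoCo's InfoNCE loss (queue management and exact hyperparameters are omitted; the names are ours): the query must match its positive key against a queue of negatives.

```python
import torch
import torch.nn.functional as F

def dictionary_lookup_loss(q, k_pos, queue, tau=0.07):
    """MoCo-style InfoNCE (sketch): classify the positive key among queued negatives.

    q:      (B, D) queries from the student (query) encoder
    k_pos:  (B, D) positive keys from the momentum (key) encoder
    queue:  (K, D) dictionary of keys encoded from past batches
    """
    q = F.normalize(q, dim=-1)
    k_pos = F.normalize(k_pos, dim=-1)
    queue = F.normalize(queue, dim=-1)
    l_pos = (q * k_pos).sum(dim=-1, keepdim=True)   # (B, 1) positive logits
    l_neg = q @ queue.t()                           # (B, K) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    # The positive key sits at index 0 of every row of logits.
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)
```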
MoCo’s moving average uses a relatively large momentum, with a default value of $\xi = 0.999$. This higher momentum value works much better than a smaller value of, say, $\xi = 0.9$.
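Concretely, the teacher (key) encoder is an exponential moving average of the student (query) encoder. A minimal sketch, assuming the two networks share the same architecture:

```python
import torch

@torch.no_grad()
def momentum_update(student, teacher, xi=0.999):
    """EMA update used by MoCo-style methods:
    theta_teacher <- xi * theta_teacher + (1 - xi) * theta_student.

    A large xi (e.g., 0.999) makes the teacher evolve slowly and smoothly,
    yielding stable targets; a small xi (e.g., 0.9) changes the teacher too
    quickly and degrades performance.
    """
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(xi).add_(p_s, alpha=1.0 - xi)
```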
When SimCLR introduced the use of a projector and stronger data augmentations, MoCoV2
[Chen et al., 2020d] followed suit with stronger data augmentations and a projector head
to boost performance. In a similar spirit, ISD [Tejankar et al., 2021] compares a query
distribution to anchors from the student distribution using a KL divergence, which relaxes
the binary distinction between positive and negative samples. MSF [Koohpayegani et al., 2021]
compares a query’s nearest-neighbor representation to the student target’s representation
and then minimizes the $\ell_2$ distance between them with renormalization (akin to cosine
similarity maximization; see the identity below). Another approach, SSCD, extends the
contrastive objective to the task of copy detection, outperforming dedicated copy detection
models and other contrastive methods [Pizzi et al., 2022].
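The equivalence invoked for MSF follows from a one-line identity: after renormalization, the squared $\ell_2$ distance is an affine function of cosine similarity, so minimizing one maximizes the other:
\[
\Bigl\lVert \tfrac{u}{\lVert u \rVert_2} - \tfrac{v}{\lVert v \rVert_2} \Bigr\rVert_2^2
= 2 - 2\,\frac{\langle u, v \rangle}{\lVert u \rVert_2 \, \lVert v \rVert_2}
= 2 - 2\cos(u, v).
\]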
Aside from the widespread use of the contrastive objective, many more methods employ
similar running-average updates as part of their training mechanism. Examples include
self-distillation [Hinton et al., 2015, Furlanello et al., 2018], the Deep Q-Network in
reinforcement learning [Mnih et al., 2013], Mean Teacher in semi-supervised learning
[Tarvainen and Valpola, 2017], and even model averaging in supervised and generative
modeling [Jean et al., 2014].
2.4 The Canonical Correlation Analysis Family:
VICReg/BarlowTwins/SWAV/W-MSE
The SSL canonical correlation analysis family originates with the canonical correlation
framework (CCA) [Hotelling, 1992]. The high-level goal of CCA is to infer the relationship
between two variables by analyzing their cross-covariance matrices. Specifically, let
$X \in \mathbb{R}^{D}$ and $Y \in \mathbb{R}^{D}$. The CCA framework seeks two transformations $U = f_x(X)$ and