
Note that for each anchor i, there is 1 positive pair and 2N − 2 negative pairs. The denominator has
a total of 2N − 1 terms (the positive and negatives).
3.2.2 Supervised Contrastive Losses
For supervised learning, the contrastive loss in Eq. 1 is incapable of handling the case where, due to
the presence of labels, more than one sample is known to belong to the same class. Generalization
to an arbitrary number of positives, though, leads to a choice between multiple possible functions.
Eqs. 2 and 3 present the two most straightforward ways to generalize Eq. 1 to incorporate supervi-
sion.
$$
\mathcal{L}^{sup}_{out} \;=\; \sum_{i \in I} \mathcal{L}^{sup}_{out,i} \;=\; \sum_{i \in I} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp\left(z_i \cdot z_p / \tau\right)}{\sum_{a \in A(i)} \exp\left(z_i \cdot z_a / \tau\right)} \qquad (2)
$$
$$
\mathcal{L}^{sup}_{in} \;=\; \sum_{i \in I} \mathcal{L}^{sup}_{in,i} \;=\; \sum_{i \in I} -\log\left\{ \frac{1}{|P(i)|} \sum_{p \in P(i)} \frac{\exp\left(z_i \cdot z_p / \tau\right)}{\sum_{a \in A(i)} \exp\left(z_i \cdot z_a / \tau\right)} \right\} \qquad (3)
$$
Here, $P(i) \equiv \{p \in A(i) : \tilde{y}_p = \tilde{y}_i\}$ is the set of indices of all positives in the multiviewed batch
distinct from $i$, and $|P(i)|$ is its cardinality. In Eq. 2, the summation over positives is located outside
of the log ($\mathcal{L}^{sup}_{out}$) while in Eq. 3, the summation is located inside of the log ($\mathcal{L}^{sup}_{in}$). Both losses have
the following desirable properties:
• Generalization to an arbitrary number of positives. The major structural change of Eqs. 2
and 3 over Eq. 1 is that now, for any anchor, all positives in a multiviewed batch (i.e., the
augmentation-based sample as well as any of the remaining samples with the same label) con-
tribute to the numerator. For randomly-generated batches whose size is large with respect to the
number of classes, multiple additional terms will be present (on average, N/C, where C is the
number of classes). The supervised losses encourage the encoder to give closely aligned represen-
tations to all entries from the same class, resulting in a more robust clustering of the representation
space than that generated from Eq. 1, as is supported by our experiments in Sec. 4.
• Contrastive power increases with more negatives. Eqs. 2 and 3 both preserve the summation
over negatives in the contrastive denominator of Eq. 1. This form is largely motivated by noise
contrastive estimation and N-pair losses [13, 45], wherein the ability to discriminate between
signal and noise (negatives) is improved by adding more examples of negatives. This property is
important for representation learning via self-supervised contrastive learning, with many papers
showing increased performance with an increasing number of negatives [18, 15, 48, 3].
• Intrinsic ability to perform hard positive/negative mining. When used with normalized rep-
resentations, the loss in Eq. 1 induces a gradient structure that gives rise to implicit hard posi-
tive/negative mining. The gradient contributions from hard positives/negatives (i.e., ones against
which continuing to contrast the anchor greatly benefits the encoder) are large while those for easy
positives/negatives (i.e., ones against which continuing to contrast the anchor only weakly benefits
the encoder) are small. Furthermore, for hard positives, the effect increases (asymptotically) as
the number of negatives does. Eqs. 2 and 3 both preserve this useful property and generalize it
to all positives. This implicit property allows the contrastive loss to sidestep the need for explicit
hard mining, which is a delicate but critical part of many losses, such as triplet loss [42]. We note
that this implicit property applies to both supervised and self-supervised contrastive losses, but our
derivation is the first to clearly show this property. We provide a full derivation of this property
from the loss gradient in the Supplementary material.
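To make the difference between the two formulations concrete, the sketch below computes both losses for a multiviewed batch. It is a minimal illustration rather than the authors' released implementation: the function name supcon_losses and the reduction over anchors (a mean instead of the sum in Eqs. 2 and 3, which only rescales the objective) are our choices, and `z` is assumed to already hold L2-normalized projections, with `labels` giving the class of each row.

```python
# Minimal PyTorch sketch of Eqs. 2 and 3 (illustrative, not the reference code).
# `z`: (2N, d) L2-normalized projections for the multiviewed batch.
# `labels`: (2N,) integer class ids, duplicated across the two views of each sample.
import torch

def supcon_losses(z: torch.Tensor, labels: torch.Tensor, tau: float):
    """Return (L_out, L_in), each averaged over anchors."""
    n = z.shape[0]                                    # 2N samples in the multiviewed batch
    sim = z @ z.T / tau                               # pairwise z_i . z_a / tau
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))   # exclude a = i, i.e. restrict to A(i)

    # log softmax over A(i): log [ exp(z_i.z_p/tau) / sum_{a in A(i)} exp(z_i.z_a/tau) ]
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)

    # P(i): same label as the anchor, excluding the anchor itself.
    pos_mask = (labels.view(-1, 1) == labels.view(1, -1)) & ~self_mask
    num_pos = pos_mask.sum(dim=1).clamp(min=1)        # |P(i)|; >= 1 since the other view is a positive

    # Eq. 2: average the log-probabilities over positives (summation outside the log).
    loss_out = -(log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1) / num_pos).mean()

    # Eq. 3: average the probabilities over positives, then take the log (summation inside the log).
    prob = log_prob.exp()
    loss_in = -torch.log(prob.masked_fill(~pos_mask, 0.0).sum(dim=1) / num_pos).mean()
    return loss_out, loss_in
```

In use, `z` would be produced by projecting and normalizing both augmented views of each image in the batch and repeating the labels to match; `loss_out` then corresponds to the $\mathcal{L}^{sup}_{out}$ objective studied below.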
Loss                         Top-1
$\mathcal{L}^{sup}_{out}$    78.7%
$\mathcal{L}^{sup}_{in}$     67.4%

Table 1: ImageNet Top-1 classification accuracy for supervised contrastive losses on ResNet-50 for a batch size of 6144.
The two loss formulations are not, however, equivalent. Because log is a concave function, Jensen's Inequality [23] implies that $\mathcal{L}^{sup}_{in} \le \mathcal{L}^{sup}_{out}$.
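Spelled out per anchor (a short check of this bound rather than a new result; here $x_p$ denotes the softmax term $\exp(z_i \cdot z_p/\tau)\,/\sum_{a \in A(i)} \exp(z_i \cdot z_a/\tau)$ shared by both losses):
$$
\mathcal{L}^{sup}_{in,i} \;=\; -\log\!\Big(\frac{1}{|P(i)|}\sum_{p \in P(i)} x_p\Big) \;\le\; \frac{-1}{|P(i)|}\sum_{p \in P(i)} \log x_p \;=\; \mathcal{L}^{sup}_{out,i},
$$
and summing over $i \in I$ gives the stated inequality.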
One would thus expect $\mathcal{L}^{sup}_{out}$ to be the superior supervised loss function (since it upper-bounds $\mathcal{L}^{sup}_{in}$). This conclusion is also supported analytically. Table 1 compares the ImageNet [7] top-1 classification accuracy using $\mathcal{L}^{sup}_{out}$ and $\mathcal{L}^{sup}_{in}$ for different batch sizes ($N$) on the ResNet-50 [17] architecture. The $\mathcal{L}^{sup}_{out}$ supervised loss achieves significantly higher performance than $\mathcal{L}^{sup}_{in}$. We conjecture that this is due to the gradient of $\mathcal{L}^{sup}_{in}$ having