CrossCLR: Cross-modal Contrastive Learning For Multi-modal Video Representations

Mohammadreza Zolfaghari¹*, Yi Zhu², Peter Gehler², Thomas Brox²
¹ University of Freiburg  ² Amazon

* Work done during an internship at Amazon Tübingen.

Abstract

Contrastive learning allows us to flexibly define powerful losses by contrasting positive samples against a set of negative samples. Recently, this principle has also been applied to learn cross-modal embeddings for video and text, yet without exploiting its full potential. In particular, previous losses do not take the intra-modality similarities into account, which leads to inefficient embeddings, since the same content is mapped to multiple points in the embedding space. With CrossCLR, we present a contrastive loss that fixes this issue. Moreover, we define sets of highly related samples in terms of their input embeddings and exclude them from the negative samples to avoid problems with false negatives. We show that these principles consistently improve the quality of the learned embeddings. The joint embeddings learned with CrossCLR extend the state of the art in video-text retrieval on the Youcook2 and LSMDC datasets and in video captioning on the Youcook2 dataset. We also demonstrate the generality of the concept by learning improved joint embeddings for other pairs of modalities.

1. Introduction

Cross-modal tasks, especially those connecting video and text, extend the impact and applicability of computer vision. They enable video retrieval based on text queries [11, 10, 24], image and video captioning [12], and visual feature learning that exploits text-based metadata [26, 28, 57, 41]. Connecting data sources that are not directly comparable creates new challenges that do not exist in vision-only learning. In this paper, we consider cross-modal contrastive learning and introduce a loss that relates the data more efficiently than a direct adoption of losses designed for vision-only data.

Contrastive learning is based on the definition of positive and negative samples relative to an anchor, which yields a flexible principle: pull the anchor and the positive sample close together in the embedding space, and push the anchor away from many negative samples. Numerous implementations of this principle have been proposed: max-margin loss [14], triplet loss [46, 47, 18], and InfoNCE [44].

Figure 1. When learning a joint embedding of two modalities A and B, existing contrastive losses, such as (a) MaxMargin [11, 24] and (b) CLIP [34], ignore the possibility of false negatives and therefore push semantically related concepts apart. (c) The proposed CrossCLR loss identifies influential samples (large circles/boxes), removes them from the set of negatives, and increases their weight within the mini-batch. In addition, CrossCLR adds intra-modality links to the loss. (Figure legend: positive samples, negative samples, false negatives.)

Typically, positive samples are defined as synthetic, spatial [3, 33], or temporal [33] variations of an instance. Instance discrimination has also been applied to cross-modal tasks, where the positive sample (or a set of positive samples) is drawn from the same temporal window (MIL-NCE [26], AVSA [30], CM-ACC [25]).

In this paper, we investigate two issues of existing cross-modal contrastive learning techniques. 1. Cross-modal losses only ensure that the features of the two modalities map to nearby points in the joint embedding; they lack an explicit measure that also keeps similar features of the same modality close in the joint embedding. Previous works implicitly assume that, via transitivity, the inter-modality similarities also preserve the intra-modality similarities. However, this has not been shown to hold. If similar features of the same modality are mapped to distant points in the joint embedding, the embedding lacks semantic meaning and therefore generalizes poorly. 2. The focus of previous cross-modal contrastive learning was limited to ensuring that features of different modalities map to nearby points in the joint embedding, without considering the similarities within each modality.

Since much previous work focused on unsupervised feature learning rather than learning a joint embedding, it did not assume that the input embeddings are meaningful, i.e., that semantically related concepts start out close in feature space; preserving input similarities was therefore pointless there. In this work, we do assume that the input embeddings of each modality (e.g., ImageNet-pretrained features) already cover some semantics, and we aim to exploit these semantics across modalities for the joint embedding.

The key element of multi-modal contrastive losses is the definition of positive pairs, whereas the negatives are drawn randomly from the whole distribution. This does not reflect the effect of what we call influential samples: samples that are similar to many other samples and therefore have a large influence on the shape of the embedding. Marking influential samples as negatives can push apart samples that are actually closely related.

To address the first issue, we propose a contrastive loss that enforces the joint embedding to respect the similarities of samples in their original feature spaces. Moreover, to address the second issue, we define influential samples as samples with high connectivity within the dataset and remove them from the set of negatives. We also introduce a connectivity-based loss weighting. We show that these three measures improve cross-modal embeddings for video-text retrieval and video captioning. While this paper focuses on video and text as modalities, we show that the positive effect of the proposed cross-modal loss generalizes to other pairs of modalities.

2. Related Work
2.1. Sample Selection in Contrastive Learning

In contrast to recent works [3, 16, 6, 5, 13, 2], our work addresses multi-modal contrastive learning. We propose intra-modality and inter-modality loss objectives to ensure that samples with similar content stay close in the joint embedding space, regardless of their modality. Sample selection plays an important role in contrastive learning. Inspired by Khosla et al. [20], Zhang et al. [38] proposed an adaptive self-supervised learning technique that mines nearest positive samples without using label information. Han et al. [15] introduced a co-training approach for video representation learning that mines hard positives from complementary views of the data. On the negative-sampling side, recent works explored informative (hard) negatives to enable better and faster contrastive learning [19, 36]. Our work also focuses on selecting better negatives, but instead of mining hard negatives we introduce the concept of influential samples. We define these as samples that are strongly connected to other samples and, hence, more likely to cause semantic collisions. We exclude them from the set of negatives and assign them a higher weight in the loss.

2.2. Multi-modal Representation Learning

Video data usually comprises multiple modalities, such as raw RGB, motion, audio, text, detected objects, or scene labels. Using several of these modalities together supports a better understanding of video content [26, 11]. Recently, transformer-based models for cross-modal representation learning have become popular [11, 12, 34]. VideoBERT [41] adapted the BERT design and applied it to quantized video representations. Miech et al. [28] introduced the HowTo100M dataset, and follow-up works exploited its noisy pairings [26, 57] to pretrain video-language models. MIL-NCE [26] proposed to consider multiple positive samples drawn from a temporally close neighborhood to deal with misaligned video narrations. While the above works focus on learning representations from scratch on very large datasets, another line of research learns a joint embedding given pretrained expert models that compute the input features [24, 11, 27]. Miech et al. [27] computed text-to-video similarity over features extracted from pretrained experts, where the overall similarity is a weighted sum of each individual expert's similarity score. CE [24] proposed an expert model and a collaborative gating mechanism to combine experts. A recent extension, MMT [11], extracts expert features from 7 different domains and employs a temporally aware modality-aggregation mechanism; it is trained with a bidirectional max-margin ranking loss. Our work belongs to this second group: we assume that the input embeddings are pretrained. CrossCLR overcomes the limitations of max-margin ranking and contrastive learning losses by introducing and exploiting influential samples and by enforcing consistency within each modality. It avoids pushing apart semantically similar representations in the joint embedding space.

3. CrossCLR

In this section, we first define the cross-modal learning task and highlight the problems of regular contrastive learning when learning cross-modal embeddings. We then introduce modifications of the contrastive loss that ensure intra-modality alignment and avoid semantic collisions.

Cross-modal alignment. Cross-modal alignment aims to learn two encoders f_x(·) and f_y(·) that map the embeddings x and y of two modalities A and B to z_x = f_x(x) and z_y = f_y(y), such that z_x and z_y are close to each other in the learned embedding space if they refer to similar content and far apart otherwise. That is, it aims to learn a similarity function between samples of A and B. Given pairs of samples (x_i, y_i) that describe similar content, e.g., x_i and y_i could be the embeddings of a video clip and of the corresponding text description, a successfully learned cross-modal similarity should generalize to unseen pairs, even though the input modalities A and B typically show large variation when depicting similar content. A and B can be arbitrary modalities. In this paper, we assume that samples of A are feature embeddings derived from video clips and samples of B are feature embeddings of a sentence. However, we also show an experimental study where A and B are different modalities derived from video clips.

Treating each given pair (x_i, y_i) as the positive pair and all pairs (x_i, y_j) with j ≠ i in the mini-batch M as negatives yields the standard contrastive objective

L = −E_{i∈M} [ log ( exp(f_x(x_i)^T f_y(y_i)) / ( exp(f_x(x_i)^T f_y(y_i)) + Σ_{j≠i} exp(f_x(x_i)^T f_y(y_j)) ) ) ],   (1)

which is minimized with respect to the parameters of the encoders f_x and f_y.

One weakness of this formulation is the ad-hoc definition of negative samples. We simply assumed that all combinations (x_i, y_j) with j ≠ i have dissimilar content. However, this is not always the case.
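To make the ad-hoc choice of negatives concrete, here is a minimal numpy sketch of the objective in Eq. (1). This is an illustration, not the authors' implementation; the batch size, embedding dimension, and temperature are arbitrary values:

```python
import numpy as np

def info_nce_cross_modal(zx, zy, tau=0.07):
    """Standard cross-modal contrastive loss over a mini-batch (cf. Eq. (1)).

    zx, zy: (N, d) L2-normalized embeddings of modalities A and B.
    Row i of zx and row i of zy form the only positive pair; every
    (i, j) with j != i is treated as a negative, regardless of how
    semantically similar the contents actually are.
    """
    logits = zx @ zy.T / tau  # (N, N) scaled similarity matrix
    # row-wise log-softmax; the diagonal entries are the positive pairs
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
z /= np.linalg.norm(z, axis=1, keepdims=True)
aligned = info_nce_cross_modal(z, z)         # perfectly matched pairs: low loss
shuffled = info_nce_cross_modal(z, z[::-1])  # mismatched pairs: high loss
```

Note that the loss only inspects the index j ≠ i, never the content of sample j, which is exactly the weakness discussed in the following example.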
For instance, the two sentences "someone drinks a cup of tea" and "she talks on the phone while drinking tea", aligned with two different video clips i and j, have largely overlapping content. The loss objective in equation (1) is only minimized if f_x(x_i) and f_y(y_j) are mapped to different points in the embedding, despite their high semantic similarity. Moreover, nothing in the above loss enforces that f_y(y_i) and f_y(y_j) are mapped to proximal points in the joint embedding space, even when y_i and y_j are similar in their original embedding space, as in the given example.

3.1. Inter-Modality and Intra-Modality Alignment

Let us first approach the missing alignment of the input modalities in the joint embedding. If we assume, as in this paper, that the input modalities have already passed an encoder network that places semantically related inputs nearby in the original feature embedding, we should preserve this proximity also in the joint embedding space. Therefore, f_x(·) and f_y(·) should be optimized not only to map x_i and y_i of a pair (x_i, y_i) to a proximate location in the joint embedding (inter-modality), but similar samples x_i and x_j from the same modality should also be mapped in close proximity (intra-modality). The inter-modality and intra-modality negative sets for sample x_i are defined as N_i^E = {y_j | ∀ y_j ∈ M, j ≠ i} and N_i^R = {x_j | ∀ x_j ∈ M, j ≠ i}; for y_i they are defined analogously with the roles of x and y exchanged.

The learning objective is therefore based on four contrastive components, namely A-to-B, A-to-A, B-to-A, and B-to-B alignments, as shown in Fig. 2 (right):

L(x_i) = −log [ δ(x_i, y_i) / ( δ(x_i, y_i) + Σ_{y_j∈N_i^E} δ(x_i, y_j) + λ Σ_{x_j∈N_i^R} δ(x_i, x_j) ) ],   (2)

L(y_i) = −log [ δ(y_i, x_i) / ( δ(y_i, x_i) + Σ_{x_j, j≠i} δ(y_i, x_j) + λ Σ_{y_j, j≠i} δ(y_i, y_j) ) ],   (3)

where δ(x_i, y_j) = exp(f_x(x_i)^T f_y(y_j)/τ) = exp(z_{x_i}^T z_{y_j}/τ). In both equations, the second term in the denominator sums over the inter-modality negative pairs and the third term over the intra-modality negative pairs. λ is a hyper-parameter that controls the intra-modality alignment. We apply ℓ2-normalisation to the input feature embeddings before computing the inner product [40, 48, 34]; in this case, the inner product is equivalent to cosine similarity. While the nominator is symmetric, the denominator is not. Hence, we add up the two losses for each sample pair (x_i, y_i), one for each modality.

Figure 2. CrossCLR loss. We first find the influential samples, i.e., samples that are similar to a large number of other samples. We emphasize these samples in the loss. Moreover, we prune them from the set of negatives, since we want to prevent the loss from wrongly pushing them apart while they share semantics with other samples. (Figure legend: positive, negative, and ignored pairs; inter-modality and intra-modality links between modality A (video) and modality B (text) in the joint embedding space.)

Pretrained feature experts. We encode the feature representation of a video with off-the-shelf feature extractors [11, 27, 29, 24], e.g., for action, appearance, and objects. Each expert is a model pretrained for a particular task; see Section 4.2 for details. Given a video v, we sample m frames and feed them into an expert model to obtain per-frame expert features x = [e_1, ..., e_m]. For text data, we use the BERT-Base uncased model as feature extractor. To align the embeddings of modality A to the embeddings of modality B, we follow the two-stream hierarchy of COOT [12], as shown in Fig. 4. It consists of a local transformer for clip-level embeddings and a global transformer for video-level embeddings. Given frame/word-level expert features, we obtain clip/sentence-level embeddings with the local transformer. These local embeddings serve as input to the global transformer, which yields the final video/paragraph representation.

Contrastive learning. Pairs of samples (x_i, y_i) provide the information for contrastive learning. In particular, each such sample i can be regarded as a positive pair, while all pairs (x_i, y_j) with j ≠ i are regarded as negative pairs. Precisely, for each sample x_i in the mini-batch M, the positive set P_i and the negative set N_i are defined as P_i = {y_i} and N_i = {y_j | ∀ y_j ∈ M, j ≠ i}. The corresponding contrastive loss over the mini-batch M, with positive scores exp(f_x(x_i)^T f_y(y_i)), is given in Eq. (1).

Figure 3. Example of influential text samples in the Youcook2 dataset (darker green means more vital); An: anchor sample, FN: false negative, N: negative sample. The shown captions are "cut tomato and put it in a bowl", "cut tomatoes and mix with the herbs", "cook some pieces of bacon in a pan", "boil the snails in water", "chop small head of cabbage and put it in a bowl", and "wash the cabbage".
3.2. Avoiding Semantic Collision

The second issue we identified with regular contrastive learning is the contrasting of false negative samples that actually have strong semantic overlap. The common assumption in contrastive learning is that a large enough number of negative samples helps to learn better representations, because in each training batch the model contrasts more semantic representatives. However, Arora et al. [1] showed that this assumption does not always hold. With a large number of negative samples, the chance of observing negative samples with high semantic overlap increases. As shown in Figure 3, both samples "cut tomato and put it in a bowl" and "cut tomatoes and mix with the herbs" are considered negative samples in standard contrastive learning methods. By contrasting these undesirable negative pairs, the network is encouraged to discard their common features in the learned embedding, which is, in fact, the common semantic content, e.g., similar objects and actions in two videos; see also Figure 3. We call this issue semantic collision; it is known as "class collision" in Arora et al. [1, 39] and as "sampling bias" in Chuang et al. [7] in a different context. When there is a large number of negative samples, frequent semantic collisions can prevent the model from learning useful representations.

We argue that it is important to reduce the effect of semantic collisions and to remove false negatives from the set of negative samples. This is non-trivial, since we have no direct information about which samples are false negatives. To this end, we introduce the concept of influential samples and propose two components based on them: negative set pruning and loss weighting.

Influential samples. We assume that samples which are strongly connected to other samples are more likely to share semantics with them and, consequently, more likely to cause semantic collisions. We use a queuing technique to store data samples, which allows us to compute more reliable semantic similarity scores. Moreover, recent works [49, 6, 16] have shown that a large number of negative samples is critical for contrastive representation learning. The queue size can be much larger than the mini-batch size, and we update the queue gradually by replacing the oldest mini-batch with the current one. Hence, given the set of M samples M = {x_n}, n = 1, ..., M, in the queue Q_x, we define an influential sample as a sample x_i that is strongly connected to many other samples in Q_x, as measured by the connectivity C(x_i) of Eq. (4).

Loss weighting. Moreover, we suggest using the connectivity to emphasize samples with a large aggregated connectivity over those with a low connectivity. Samples with very low connectivity can be regarded as outliers of the dataset. They are too sparse to positively influence the shape of the embedding, so we reduce their influence on representation learning. At the same time, we increase the weight of influential samples, since the cross-modal information of these samples should have a large impact on the shape of the embedding. In particular, for each sample and modality we introduce a weight

w(x_i) = exp(C(x_i)/κ),   (5)

where κ is a hyperparameter. While we defined the connectivity, the influential samples, and the weights for modality A, the same applies to modality B.

The final CrossCLR loss is L = (L_x + L_y)/2 with

L_x = −E_{i∈M} [ w(x_i) log ( δ(x_i, y_i) / ( δ(x_i, y_i) + Σ_{y_k∈N̂_x^E} δ(x_i, y_k) + λ Σ_{x_k∈N̂_x^R} δ(x_i, x_k) ) ) ],   (6)

L_y = −E_{i∈M} [ w(y_i) log ( δ(y_i, x_i) / ( δ(y_i, x_i) + Σ_{x_k∈N̂_y^E} δ(y_i, x_k) + λ Σ_{y_k∈N̂_y^R} δ(y_i, y_k) ) ) ],   (7)

where N̂^E and N̂^R denote the pruned inter- and intra-modality negative sets obtained by the negative set pruning step. The complete procedure is summarized in Algorithm 1.

Algorithm 1: Learning algorithm of CrossCLR.

Input: batch size N, queue Q, constants τ, κ, λ, and γ; networks f_x and f_y for modality A and modality B.
Define δ(x_i, y_j) = exp(f_x(x_i)^T f_y(y_j)/τ) = exp(z_{x_i}^T z_{y_j}/τ).
for each sampled mini-batch {x_i, y_i}, i = 1, ..., N:
    enqueue {x_i, y_i} into Q and dequeue the oldest keys
    for all k ∈ {1, ..., |Q|}:
        c_{x_k} = (1/|Q|) Σ_{x_j∈Q} x_k^T x_j / (||x_k|| · ||x_j||)
        c_{y_k} = (1/|Q|) Σ_{y_j∈Q} y_k^T y_j / (||y_k|| · ||y_j||)
        if c_{x_k} < γ then: N̂_x^R ← N̂_x^R ∪ {x_k}
        if c_{y_k} < γ then: N̂_y^R ← N̂_y^R ∪ {y_k}
    end for
    for all i ∈ {1, ..., N}:
        α_{x_i} = (1/|Q|) Σ_{x_j∈Q} x_i^T x_j / (||x_i|| · ||x_j||)
        α_{y_i} = (1/|Q|) Σ_{y_j∈Q} y_i^T y_j / (||y_i|| · ||y_j||)
        if α_{x_i} < γ then: N̂_x^E ← N̂_x^E ∪ {y_i}
        if α_{y_i} < γ then: N̂_y^E ← N̂_y^E ∪ {x_i}
        w_{x_i} = exp(α_{x_i}/κ);  w_{y_i} = exp(α_{y_i}/κ)
        L_{x_i} = −w_{x_i} log [ δ(x_i, y_i) / ( δ(x_i, y_i) + Σ_{y_k∈N̂_x^E} δ(x_i, y_k) + λ Σ_{x_k∈N̂_x^R} δ(x_i, x_k) ) ]
        L_{y_i} = −w_{y_i} log [ δ(y_i, x_i) / ( δ(y_i, x_i) + Σ_{x_k∈N̂_y^E} δ(y_i, x_k) + λ Σ_{y_k∈N̂_y^R} δ(y_i, y_k) ) ]
    end for
    update the networks f_x and f_y to minimize L = (L_x + L_y)/2
end for
return encoders f_x(·) and f_y(·)

4. Experiments

4.1. Datasets and Metrics

We conducted experiments on the LSMDC [37] and Youcook2 [55] datasets.

LSMDC [37] contains 118,081 short video clips extracted from 202 movies. Each clip is annotated with a
caption, extracted from either the movie script or the audio description. The test set is composed of 1000 videos from movies not present in the training set.

Youcook2 [55] contains 2000 videos with a total number of 14k clips. The dataset is collected from YouTube and covers 89 types of recipes. There are ∼9.6k clips for training and ∼3.2k clips for validation. For each clip, there is a manually annotated textual description.

Figure 4. Architecture. The model consists of two branches: one for modality A (e.g., video) and one for modality B (e.g., text). Each modality is represented by features from a pretrained expert, which we keep frozen (visual encoder and BERT). These embeddings are fed into a transformer, which maps the input features into a joint embedding space. For video and text, we use a two-level hierarchy of transformers, where the loss is applied at the clip/sentence level and at the video/paragraph level. The second stage takes the features from the joint embedding of the first transformer as input. (Example captions in the figure: "Jane folds her arms", "Her daughter approaches Daniel, who leans against the wall.")

Evaluation protocol. We evaluate the learned embeddings on the modality-to-modality retrieval task in terms of Recall@K, median rank (MdR), and mean rank (MnR). Given a query, its K nearest neighbours are retrieved from the database. The retrieval is considered successful if the correct sample is among the K nearest neighbours.

4.2. Expert Features

We encode the content of a video with pretrained models trained for different semantic tasks, namely appearance, scene, action, object, and HowTo100M [26]. We extract per-frame features. Specifically, we use a ResNeSt269 model [53] pretrained on ImageNet to extract appearance information, and a DenseNet161 model [54] pretrained on Places365 to extract scene information.
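The retrieval metrics of the evaluation protocol above (Recall@K, MdR, MnR) can be computed from a query-by-database similarity matrix. The following numpy sketch is illustrative, not the authors' evaluation code; it assumes the ground-truth match of query i is database item i, as for paired clips and sentences:

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Recall@K, median rank (MdR), and mean rank (MnR) for a
    (num_queries, num_items) similarity matrix `sim`, where the
    ground-truth match of query i is database item i.
    Ranks are 1-based: rank 1 means the match was retrieved first.
    """
    order = np.argsort(-sim, axis=1)  # best match first in every row
    # position of the ground-truth index within each sorted row
    ranks = np.where(order == np.arange(len(sim))[:, None])[1] + 1
    metrics = {f"R@{k}": float(np.mean(ranks <= k)) for k in ks}
    metrics["MdR"] = float(np.median(ranks))
    metrics["MnR"] = float(np.mean(ranks))
    return metrics

# toy example: queries 0 and 2 rank their match first, query 1 second
sim = np.array([[0.9, 0.1, 0.0],
                [0.8, 0.5, 0.1],
                [0.2, 0.1, 0.7]])
m = retrieval_metrics(sim, ks=(1, 2))
```

For this toy matrix the ranks are [1, 2, 1], so R@1 = 2/3, R@2 = 1.0, and MdR = 1.0.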
In terms of action information, we adopt an R(2+1)D model [43] with a ResNet152 backbone pretrained on IG65M and extract the final globally pooled feature. For object features, we use a Faster-RCNN model [35] with a ResNet50-FPN backbone. For the Youcook2 experiments, we use the HowTo100M features provided by [12]. For the LSMDC results in Table 1, we used action and appearance features as input to the model. For the SOTA comparisons in Table 4, we utilized all expert features, including appearance, action, scene, and object features.

Connectivity. For a sample x_i that is strongly connected to many other samples in the queue Q_x, we measure the connectivity C(x_i) by the aggregated similarity of the features:

C(x_i) = (1/M) Σ_{j=1}^{M} x_i^T x_j / (||x_i|| · ||x_j||).   (4)

The larger the connectivity, the more influential the data sample. Influential samples tend to lie in the centers of semantic clusters or to establish links between clusters, as shown in Figure 3. We use the per-sample connectivity for pruning and for weighting.

Negative set pruning. Influential samples I_x can be identified by thresholding the connectivity of the samples. Given the sample set Q_x and a threshold γ, we define I_x = {x_i | C(x_i) > γ, ∀ x_i ∈ Q_x}. To reduce the false-negative effect in contrastive learning, we remove all influential samples from the negative sets (in each modality). Hence, we redefine the inter- and intra-modality negative sets as N̂_x^E = {y_j | ∀ (x_j, y_j) ∈ M, x_j ∉ I_x} and N̂_x^R = {x_j | ∀ x_j ∈ Q_x, x_j ∉ I_x}. This is also illustrated in Figure 2 (right).

Table 1. Comparison among contrastive learning losses. Video-text retrieval results (R@1 / R@5 / R@10, higher is better) with different contrastive learning losses on the Youcook2 and LSMDC datasets. CrossCLR shows consistently higher retrieval scores than previous losses.

Youcook2:
Loss               | Text => Video                     | Video => Text
MaxMargin [11, 24] | 15.0±0.34 / 37.0±0.38 / 49.1±0.51 | 14.2±0.14 / 35.3±0.02 / 47.2±0.08
MIL-NCE [26]       | 18.0±0.23 / 41.9±0.53 / 53.9±1.06 | 16.4±0.51 / 41.5±0.24 / 54.1±0.54
CLIP [34]          | 17.8±0.40 / 42.1±0.78 / 54.4±0.66 | 17.0±0.54 / 42.0±0.48 / 55.0±0.58
DCL [7]            | 17.9±0.88 / 41.5±0.59 / 54.1±0.79 | 16.8±0.39 / 42.0±0.37 / 55.3±0.62
NT-Xent [4]        | 17.5±0.44 / 42.4±0.27 / 55.0±0.77 | 17.3±0.58 / 41.6±0.89 / 54.6±1.0
CrossCLR           | 19.5±0.49 / 45.9±0.55 / 58.3±0.76 | 18.5±0.32 / 44.8±0.82 / 57.9±0.77

LSMDC:
Loss               | Text => Video      | Video => Text
MaxMargin [11, 24] | 8.2 / 22.7 / 31.8  | 9.1 / 23.3 / 31.8
MIL-NCE [26]       | 8.9 / 24.2 / 32.5  | 10.4 / 25.4 / 34.9
CLIP [34]          | 9.7 / 24.1 / 32.6  | 9.5 / 23.8 / 32.5
DCL [7]            | 9.0 / 24.9 / 33.2  | 8.6 / 23.4 / 32.2
NT-Xent [4]        | 9.3 / 23.6 / 32.5  | 10.0 / 25.1 / 33.4
CrossCLR           | 10.9 / 26.2 / 34.7 | 12.0 / 26.1 / 35.3

Table 2. Ablation study on CrossCLR loss.
We quantify the individual contributions of the CrossCLR components: proximity weighting (PW), intra-modality alignment (IM), and negative pruning (NP) (reported: avg and std over 5 runs).
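The connectivity of Eq. (4), the threshold-based negative set pruning, and the weighting of Eq. (5) fit in a few lines of numpy. This is an illustrative sketch, not the authors' code; the toy queue and the values γ = 0.5 and κ = 1.0 are made up for the example:

```python
import numpy as np

def connectivity(queue):
    """Eq. (4): C(x_i) is the mean cosine similarity of x_i to all
    samples currently stored in the queue Q of one modality."""
    z = queue / np.linalg.norm(queue, axis=1, keepdims=True)
    return (z @ z.T).mean(axis=1)

def prune_and_weight(queue, gamma=0.5, kappa=1.0):
    """Negative set pruning and loss weighting of Sec. 3.2.

    Samples with connectivity above gamma are influential: they are
    removed from the negative set. Every sample receives the weight
    w(x_i) = exp(C(x_i)/kappa) of Eq. (5). gamma and kappa are
    hyperparameters; the defaults here are illustrative only.
    Returns (boolean mask of samples kept as negatives, weights).
    """
    c = connectivity(queue)
    keep_as_negative = c <= gamma  # influential samples are pruned
    weights = np.exp(c / kappa)
    return keep_as_negative, weights

# toy queue: three near-duplicate samples and one orthogonal outlier
queue = np.array([[1.0, 0.0],
                  [0.99, 0.1],
                  [1.0, 0.05],
                  [0.0, 1.0]])
keep, w = prune_and_weight(queue)
```

With this toy queue, the three clustered samples have high connectivity, so they are pruned from the negative set and up-weighted in the loss, while the orthogonal outlier is kept as a negative and down-weighted.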