ActBERT: Global-Local Video-Text Representation Learning

"ActBERT: Learning Global-Local Video-Text Representations" 这篇论文"ActBERT: Learning Global-Local Video-Text Representations"是发表在CVPR(计算机视觉与模式识别会议)上的一篇研究,作者Linchao Zhu和Yi Yang来自百度研究院和悉尼科技大学的ReLER实验室。该论文主要探讨了如何利用未标注的数据进行视频-文本的自监督学习,以构建全局-局部的视频-文本表示。 ActBERT的核心思想在于通过挖掘全局动作信息来分析文本和局部区域对象之间的相互作用。这使得模型能够从配对的视频序列和文本描述中揭示出详细的视觉和文本关系模型。全局视角提供了对人类整体行为的理解,而局部视角则关注于精细的物体识别。 为了整合这三种信息源(全局动作、局部区域对象和语言描述),论文提出了一种名为Tangled Transformer block (TNT)的结构。TNT块能够编码并处理这些来源中的信息,通过从上下文信息中精巧地提取线索,发现全局-局部对应关系。这种设计强化了联合视频-文本表示,使其既能够理解微小的物体细节,又能够捕捉到全局的人类意图。 在下游的视频-语言任务中,如文本-视频剪辑检索、视频字幕生成和视频问答等,ActBERT的泛化能力得到了验证。这些任务通常要求模型能够理解视频内容,并与提供的文本信息精确匹配。通过在这些任务上的实验,作者证明了ActBERT的有效性和广泛的应用潜力。 ActBERT的创新之处在于它提供了一种新颖的方法,将视频中的动态行为和静态物体与文本描述相结合,形成统一的表示,这对于视频理解和多模态信息处理具有重要意义。此外,该模型的自监督学习方法允许在大规模无标注数据集上进行训练,降低了对大量人工注释的依赖,从而推动了视频理解技术的发展。

(3) Using the program code below as a reference, complete the two tasks described in the code comments.

import re

"""
ref below holds the first 6 references of the 2020 CVPR best paper, obtained by
saving the PDF directly as a plain text file and extracting the reference section.
Using this portion of the paper's text, apply regular expressions and string
processing to sort and print these 6 references in the following two ways:
a. sorted by reference title
b. sorted by publication year
"""
ref = """[1] Panos Achlioptas, Olga Diamanti, Ioannis Mitliagkas, and Leonidas Guibas. Learning representations and generative models for 3D point clouds. In Proc. ICML, 2018
[2] Pulkit Agrawal, Joao Carreira, and Jitendra Malik. Learning to see by moving. In Proc. ICCV, 2015
[3] Peter N. Belhumeur, David J. Kriegman, and Alan L. Yuille. The bas-relief ambiguity. IJCV, 1999
[4] Christoph Bregler, Aaron Hertzmann, and Henning Biermann. Recovering non-rigid 3D shape from image streams. In Proc. CVPR, 2000
[5] Angel X. Chang, Thomas Funkhouser, Leonidas Guibas. Shapenet: An information-rich 3d model reposi-tory. arXiv preprint arXiv:1512.03012, 2015
[6] Ching-Hang Chen, Ambrish Tyagi, Amit Agrawal, Dy-lan Drover, Rohith MV, Stefan Stojanov, and James M. Rehg. Unsupervised 3d pose estimation with geometric self-supervision. In Proc. CVPR, 2019"""

ref_str = re.sub(r'\[([0-9]{1})\]', r'$[\1]', ref)        # insert a '$' separator before each [n]
print(ref_str)                                            # scaffolding code
ref_str_2 = re.sub(r'([a-zA-Z]{2})\.', r'\1.#', ref_str)  # insert a '#' separator after each period that follows two letters
print(ref_str_2)                                          # scaffolding code
ref_str2 = ref_str_2.replace("\n", "")
ref_list = ref_str2.split("$")
print(ref_list)                                           # scaffolding code

[Hint: sorting can be done with the built-in function sorted(), whose signature is
sorted(iterable, /, *, key=None, reverse=False);
note the purpose of the '/' and '*' markers in the parameter list.]
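A minimal sketch of one possible completion of tasks a and b, built directly on the scaffolding above. It assumes the first element of ref_list is empty (the '$' is inserted before "[1]"), that the second '#'-separated field of each entry is the title, and that the publication year is the trailing four-digit number; the variable names entries, by_title, and by_year are illustrative.

entries = [e.strip() for e in ref_list if e.strip()]   # drop the empty leading element

# a. sort by reference title: after the '#' markers, field 1 of each entry is the title
by_title = sorted(entries, key=lambda e: e.split('#')[1].strip().lower())
print("Sorted by title:")
for e in by_title:
    print(e)

# b. sort by publication year: the year is the 4-digit number at the end of each entry
by_year = sorted(entries, key=lambda e: int(re.search(r'(\d{4})\s*$', e).group(1)))
print("Sorted by year:")
for e in by_year:
    print(e)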


The human visual cortex is biased towards shape components, while CNNs produce texture-biased features. This may explain why CNN performance degrades significantly in low-labeled-data scenarios. In this paper, we propose a frequency re-calibration U-Net (FRCU-Net) for medical image segmentation. Representing an object in terms of frequency may reduce the effect of texture bias, resulting in better generalization in a low-data regime. To do so, we apply the Laplacian pyramid in the bottleneck layer of the U-shaped structure. The Laplacian pyramid represents the object proposal in different frequency domains, where the high frequencies carry the texture information and the lower frequencies relate more to shape. Adaptively re-calibrating these frequency representations can produce a more discriminative representation of the object of interest. To this end, we first use a channel-wise attention mechanism to capture the relationship between the channels of the feature maps at each level of the frequency pyramid. Second, the extracted features of the pyramid levels are combined through a non-linear function based on their impact on the final segmentation output. The proposed FRCU-Net is evaluated on five datasets (ISIC 2017, ISIC 2018, PH2, a lung segmentation dataset, and the SegPC 2021 challenge dataset) and compared to existing alternatives, achieving state-of-the-art results.

Please describe in detail the technical points in this passage and how they are implemented.
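As a rough illustration of the frequency re-calibration idea described above, here is a minimal, hypothetical PyTorch sketch: a small Laplacian pyramid is built from the bottleneck feature map, each frequency band is re-weighted by a squeeze-and-excitation style channel attention, and the bands are recombined with learned, softmax-normalized weights. The module and parameter names (FreqRecalibration, ChannelAttention, channels, levels) are assumptions for illustration, not the paper's code, and the SE-style attention and softmax fusion stand in for whatever exact formulation the authors use.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel re-weighting (assumed variant)."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))              # global average pool -> channel weights
        return x * w[:, :, None, None]

class FreqRecalibration(nn.Module):
    """Laplacian-pyramid bottleneck: attend to each frequency band, then fuse."""
    def __init__(self, channels, levels=3):
        super().__init__()
        self.levels = levels
        self.att = nn.ModuleList(ChannelAttention(channels) for _ in range(levels))
        self.mix = nn.Parameter(torch.zeros(levels))  # learned fusion weights

    def forward(self, x):
        # build the Laplacian pyramid: high-frequency bands plus a low-frequency residual
        bands, cur = [], x
        for _ in range(self.levels - 1):
            down = F.avg_pool2d(cur, 2)
            up = F.interpolate(down, size=cur.shape[-2:], mode='bilinear', align_corners=False)
            bands.append(cur - up)                    # detail (texture-like) band at this scale
            cur = down
        bands.append(cur)                             # coarsest (shape-like) band

        # re-calibrate each band with channel attention, upsample back, and fuse
        weights = torch.softmax(self.mix, dim=0)
        out = 0
        for k, band in enumerate(bands):
            band = self.att[k](band)
            band = F.interpolate(band, size=x.shape[-2:], mode='bilinear', align_corners=False)
            out = out + weights[k] * band
        return out

# toy usage on a bottleneck feature map of shape (batch, channels, H, W)
m = FreqRecalibration(channels=64, levels=3)
y = m(torch.randn(2, 64, 32, 32))
print(y.shape)   # torch.Size([2, 64, 32, 32])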
