Video Search Reranking: Multimodal Fusion for Higher Search Precision


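"Multimodal Fusion for Video Search Reranking: a method for improving retrieval effectiveness." The paper is by Shikui Wei, Yao Zhao, Zhenfeng Zhu, and Nan Liu, and appears in IEEE Transactions on Knowledge and Data Engineering (TKDE) under manuscript number TKDE-2007-09-0467. Analysis of large-scale search engine logs shows that users tend to look only at the first few search results, so improving accuracy in the top-ranked portion of the result list is critical for a search engine. Existing methods for boosting video search performance either pay little attention to the top of the result list or run into difficulties in practical applications. To address this, the authors propose a flexible and effective video search reranking method called CR-Reranking, which uses a cross-reference (CR) strategy to fuse multimodal cues and improve retrieval precision.

CR-Reranking first uses the features of each modality separately to rerank the initially returned results, working at the cluster level. Each modality (such as visual, audio, or text) evaluates and orders the video shots independently, capturing relevance along a different dimension. The ranked clusters from the different modalities are then used cooperatively to infer the shots most relevant to the query.

The experimental results show that CR-Reranking significantly improves video search quality, especially on the top-ranked results. The work contributes to video search and also suggests a direction for other information retrieval systems: combine multiple sources of evidence with knowledge of user behavior to improve retrieval accuracy.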

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, TKDE-2007-09-0467

Multimodal Fusion for Video Search Reranking

Shikui Wei, Yao Zhao, Zhenfeng Zhu, Nan Liu

Abstract—Analysis on click-through data from a very large search engine log shows that users are usually interested in the top-ranked portion of returned search results. Therefore, it is crucial for search engines to achieve high accuracy on the top-ranked documents. While many methods exist for boosting video search performance, they either pay less attention to the above factor or encounter difficulties in practical applications. In this paper, we present a flexible and effective reranking method, called CR-Reranking, to improve the retrieval effectiveness. To offer high accuracy on the top-ranked results, CR-Reranking employs a cross-reference (CR) strategy to fuse multimodal cues. Specifically, multimodal features are first utilized separately to rerank the initial returned results at the cluster level, and then all the ranked clusters from different modalities are cooperatively used to infer the shots with high relevance. Experimental results show that the search quality, especially on the top-ranked results, is improved significantly.

Index Terms—Clustering, Image/video retrieval, Multimedia databases
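The cross-reference idea described in the abstract can be illustrated with a minimal sketch. It assumes that the initial results have already been clustered in each feature space and that each cluster has already been given a rank (0 = most relevant). The function name `cross_reference_rerank`, the modality names `color` and `texture`, and the combination rule (sum of cluster ranks, ties broken by initial position) are assumptions chosen for illustration and are not taken from the paper.

```python
from typing import Dict, List, Tuple


def cross_reference_rerank(
    initial_ranking: List[str],
    cluster_rank_by_modality: Dict[str, Dict[str, int]],
) -> List[str]:
    """Reorder shots using cluster-level ranks from several modalities.

    initial_ranking: shot ids in the order returned by the initial search.
    cluster_rank_by_modality: for each modality, a map from shot id to the
        rank (0 = most relevant) of the cluster that the shot belongs to.
    """
    initial_pos = {shot: i for i, shot in enumerate(initial_ranking)}

    def sort_key(shot: str) -> Tuple[int, int]:
        # Assumed rule: a shot placed in highly ranked clusters by every
        # modality gets a small combined value and moves toward the top.
        combined = sum(r[shot] for r in cluster_rank_by_modality.values())
        # Ties keep the order of the initial search.
        return (combined, initial_pos[shot])

    return sorted(initial_ranking, key=sort_key)


# Hypothetical data: six shots, two modalities, three clusters per modality.
initial = ["s1", "s2", "s3", "s4", "s5", "s6"]
cluster_ranks = {
    "color": {"s1": 1, "s2": 0, "s3": 2, "s4": 0, "s5": 1, "s6": 2},
    "texture": {"s1": 0, "s2": 1, "s3": 2, "s4": 0, "s5": 2, "s6": 1},
}
reranked = cross_reference_rerank(initial, cluster_ranks)
print(reranked)  # ['s4', 's1', 's2', 's5', 's6', 's3']
```

In this toy input, shot `s4` falls into the top cluster of both modalities and therefore moves from fourth place to first, which is the kind of agreement between modalities that the cross-reference strategy relies on.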


1 INTRODUCTION

As an emerging research field, content-based video retrieval (CBVR) has attracted a great deal of attention in recent years. While various retrieval models have been developed to improve video search quality, most of them implement the search procedure by implicitly or explicitly measuring the similarity between the query and database shots in some low-level feature spaces [1]. However, such similarity is not usually consistent with human perception due to the limitations of current image/video understanding techniques. That is, a semantic gap exists between the low-level features and high-level semantics. For example, although a scene with red flags and a scene with red buildings share similar color features, they have completely different semantic meanings. The semantic gap will enlarge linearly with the increase of dataset size since a larger dataset means more confusion, which thereby leads to rapid deterioration of search performance. A performance comparison [2] between the TRECVID’05 and TRECVID’06 evaluations on all three search types, i.e., automatic, manual, and interactive, also reveals this. Consequently, it is more attainable for low-level features to reliably distinguish different shots in a relatively small collection, which is the basis of the proposed reranking scheme.

If we consider that the final aim of search engines is to meet users’ information needs, it is reasonable to take user satisfaction and user behavior into account when designing a search engine. According to the analysis in [3], users rarely have the patience to go through the entire result list. Instead, they usually check only the top-ranked documents. Analysis of click-through data from a very large web search engine log also reflects this preference [3], [4]. Therefore, it is more crucial to offer high accuracy on the top-ranked documents than to improve search performance over the entire result list [5].
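The difference between accuracy at the top of the list and accuracy over the whole list can be made concrete with precision at k, a standard retrieval measure. The sketch below is only an illustration: the shot ids, the relevance set, and the two rankings are hypothetical, and the text above does not say which measure the authors use.

```python
from typing import Sequence, Set


def precision_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the first k results that are relevant."""
    top = ranked_ids[:k]
    return sum(1 for shot in top if shot in relevant) / k


# Two hypothetical rankings of the same six shots; three shots are relevant.
relevant_shots = {"s1", "s2", "s3"}
ranking_a = ["s9", "s8", "s1", "s7", "s2", "s3"]  # relevant shots scattered
ranking_b = ["s1", "s2", "s3", "s9", "s8", "s7"]  # relevant shots on top

for name, ranking in (("A", ranking_a), ("B", ranking_b)):
    p3 = precision_at_k(ranking, relevant_shots, 3)
    p6 = precision_at_k(ranking, relevant_shots, 6)
    print(f"ranking {name}: P@3 = {p3:.2f}, P@6 = {p6:.2f}")
```

Both rankings contain the same relevant shots, so precision over the full list is identical, yet only ranking B serves a user who inspects just the first three results. Reordering alone can therefore improve what the user sees without retrieving any new shots.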

1.1 Related Work

Many methods have been proposed for improving the retrieval performance of video search engines. The earlier work [6], [7], [8], [9], [10], [11], which is based on the relevance feedback (RF) strategy, focuses mainly on the refinement of the initial search results in an interactive fashion. However, RF-based methods require users’ labeling for updating the query model, which is usually time-consuming and even impractical in some search scenarios. In contrast, pseudo-relevance feedback (PRF) based methods assume that the top-ranked documents are relevant and use them to automatically refine the search process [12]. For instance, the co-retrieval algorithm [13] treats the top-ranked results as positive examples and others as negative ones. Using these noisy training samples, a re-trained retrieval model is then built via an AdaBoost-based ensemble learning method. Although both RF- and PRF-based methods have achieved precision improvement on the entire result list by returning more relevant shots, no mechanism guarantees that these relevant shots will be top positioned.
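The pseudo-labeling step behind PRF can be sketched in a few lines. The example below is a simplified illustration and not the co-retrieval algorithm of [13]: a nearest-centroid scorer stands in for the AdaBoost-based ensemble, and the function names, the feature vectors, and the cutoff `top_k` are all hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def pseudo_label(ranked_ids: Sequence[str], top_k: int) -> List[Tuple[str, int]]:
    """Label the top_k ranked shots as positive (1) and the rest as negative (0)."""
    return [(sid, 1 if i < top_k else 0) for i, sid in enumerate(ranked_ids)]


def centroid(vectors: List[List[float]]) -> List[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def prf_rescore(
    ranked_ids: Sequence[str], features: Dict[str, List[float]], top_k: int
) -> List[str]:
    """Re-rank shots with a model fit on noisy pseudo-labels.

    Stand-in model (an assumption): score each shot by how much closer it is
    to the pseudo-positive centroid than to the pseudo-negative centroid.
    Requires 0 < top_k < len(ranked_ids) so that both classes are non-empty.
    """
    labels = pseudo_label(ranked_ids, top_k)
    pos = centroid([features[s] for s, y in labels if y == 1])
    neg = centroid([features[s] for s, y in labels if y == 0])
    score = {
        s: sq_dist(features[s], neg) - sq_dist(features[s], pos)
        for s in ranked_ids
    }
    # sorted() is stable, so equal scores keep their initial order.
    return sorted(ranked_ids, key=lambda s: -score[s])


# Hypothetical initial ranking and 2-D feature vectors.
ranked = ["a", "b", "c", "d", "e"]
feats = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [0.8, 0.2],
    "e": [0.1, 0.9],
}
print(prf_rescore(ranked, feats, top_k=2))  # ['a', 'b', 'd', 'e', 'c']
```

Shot `d` resembles the pseudo-positive shots and moves up, even though it was itself labeled negative. That label noise is why the training samples are called noisy, and nothing in this scheme explicitly targets the very top positions of the list.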

Recently, the metasearch strategy [14], [15], which was originally put forward in the field of information retrieval, has been imported to CBVR for improving video retrieval effectiveness. The key idea of metasearch is that multiple

————————————————

• S.K. Wei is with the Institute of Information Science, Beijing Jiaotong University, Beijing 100044, China. E-mail: shkwei@gmail.com.

• Y. Zhao is with the Institute of Information Science, Beijing Jiaotong University, Beijing 100044, China. E-mail: yzhao@bjtu.edu.cn.

• Z.F. Zhu is with the Institute of Information Science, Beijing Jiaotong University, Beijing 100044, China. E-mail: zhfzhu@bjtu.edu.cn.

• N. Liu is with the Institute of Information Science, Beijing Jiaotong University, Beijing 100044, China. E-mail: 05112073@bjtu.edu.cn.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

