IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, TKDE-2007-09-0467
Multimodal Fusion for Video Search Reranking
Shikui Wei, Yao Zhao, Zhenfeng Zhu, Nan Liu
Abstract—Analysis on click-through data from a very large search engine log shows that users are usually interested in the top-
ranked portion of returned search results. Therefore, it is crucial for search engines to achieve high accuracy on the top-ranked
documents. While many methods exist for boosting video search performance, they either pay little attention to this factor
or encounter difficulties in practical applications. In this paper, we present a flexible and effective reranking method, called
CR-Reranking, to improve retrieval effectiveness. To offer high accuracy on the top-ranked results, CR-Reranking employs a
cross-reference (CR) strategy to fuse multimodal cues. Specifically, multimodal features are first utilized separately to rerank the
initial returned results at the cluster level, and then all the ranked clusters from different modalities are cooperatively used to
infer the shots with high relevance. Experimental results show that the search quality, especially on the top-ranked results, is
improved significantly.
Index Terms—Clustering, Image/video retrieval, Multimedia databases
1 INTRODUCTION
As an emerging research field, content-based video
retrieval (CBVR) has attracted a great deal of atten-
tion in recent years. While various retrieval models
have been developed to improve video search quality,
most of them implement the search procedure by implicitly
or explicitly measuring the similarity between the query
and database shots in some low-level feature spaces [1].
However, such similarity is often inconsistent with
human perception because of the limitations of current
image/video understanding techniques. That is, a semantic
gap exists between the low-level features and high-level
semantics. For example, although a scene with red
flags and a scene with red buildings share similar color
features, they have completely different semantic meanings.
The semantic gap will widen linearly as the dataset
size increases, since a larger dataset means more
confusion, which leads to rapid deterioration of
search performance. A performance comparison [2] between the
TRECVID’05 and TRECVID’06 evaluations on all three
search types, i.e., automatic, manual, and interactive, also
shows this. Consequently, low-level features can more
reliably distinguish different shots in a
relatively small collection, which is the basis of the proposed
reranking scheme.
If we consider that the final aim of search engines is to
meet users’ information needs, it is reasonable to take
user satisfaction and user behavior into account when
designing a search engine. According to the analysis in
[3], users rarely have the patience to go through the entire result
list. Instead, they usually check the top-ranked docu-
ments. Analysis on click-through data from a very large
web search engine log also reflects this preference [3],
[4]. Therefore, it is more crucial to offer high accuracy on
the top-ranked documents than to improve overall
search performance across the entire result list [5].
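The distinction between accuracy at the top of the list and accuracy over the whole list can be made concrete with precision at a cutoff k. The following is a minimal sketch, not part of the proposed method; the function name precision_at_k and the two toy rankings are illustrative assumptions. Both rankings contain the same number of relevant shots, so they are indistinguishable over the full list, yet they differ sharply in the first three positions.

```python
def precision_at_k(relevance, k):
    """Fraction of the first k results that are relevant.

    relevance: list of 0/1 flags in rank order (1 = relevant shot).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(relevance[:k]) / k


# Two hypothetical rankings of the same ten shots (illustrative data only).
# Each contains exactly four relevant shots.
ranking_a = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]  # relevant shots concentrated at the top
ranking_b = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]  # relevant shots spread through the list

p_a_full = precision_at_k(ranking_a, 10)  # 0.4
p_b_full = precision_at_k(ranking_b, 10)  # 0.4
p_a_top3 = precision_at_k(ranking_a, 3)   # 1.0
p_b_top3 = precision_at_k(ranking_b, 3)   # one relevant shot in the top three

print(p_a_full, p_b_full, p_a_top3, p_b_top3)
```

A user who inspects only the first few results would judge the first ranking far better, although a whole-list measure at k = 10 cannot separate the two.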
1.1 Related Work
Many methods have been proposed for improving the
retrieval performance of video search engines. The earlier
work [6], [7], [8], [9], [10], [11], which is based on the
relevance feedback (RF) strategy, focuses mainly on
refining the initial search results in an interactive
fashion. However, RF-based methods require users to label
results in order to update the query model, which is usually
time-consuming and even impractical in some search
scenarios. In contrast, pseudo-relevance feedback (PRF) based
methods assume that the top-ranked documents are rele-
vant and use them to automatically refine the search
process [12]. For instance, the co-retrieval algorithm [13]
treats the top-ranked results as positive examples and
others as negative ones. Using these noisy training
samples, a retrained retrieval model is then built via an
AdaBoost-based ensemble learning method. Although both
RF- and PRF-based methods have improved precision
on the entire result list by returning more
relevant shots, no mechanism guarantees that these
relevant shots will be ranked at the top.
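The pseudo-labeling idea behind PRF can be sketched as follows. This is a simplified illustration under stated assumptions, not the co-retrieval algorithm of [13]: a centroid-difference scorer stands in for the AdaBoost-based ensemble, and the function names, feature vectors, and cutoff k are all hypothetical.

```python
def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def prf_rerank(features, initial_order, k):
    """Rerank shots using pseudo-labels taken from the initial ranking.

    features: dict mapping shot id -> feature vector (hypothetical data).
    initial_order: list of shot ids, best first.
    k: number of top-ranked shots assumed to be relevant.
    """
    if not 0 < k < len(initial_order):
        raise ValueError("k must leave at least one shot in each pseudo-class")
    # PRF assumption: the top k shots are positives, all others negatives.
    pseudo_pos = [features[s] for s in initial_order[:k]]
    pseudo_neg = [features[s] for s in initial_order[k:]]
    # Stand-in model: score each shot along the direction from the
    # negative centroid to the positive centroid.
    direction = [p - n for p, n in zip(centroid(pseudo_pos), centroid(pseudo_neg))]
    scores = {s: dot(features[s], direction) for s in initial_order}
    return sorted(initial_order, key=lambda s: scores[s], reverse=True)


# Toy 2-D features: s4 resembles the top-ranked shots but was initially ranked fourth,
# so it lands among the noisy pseudo-negatives.
features = {
    "s1": [0.9, 0.1],
    "s2": [0.8, 0.2],
    "s3": [0.2, 0.9],
    "s4": [0.85, 0.15],
    "s5": [0.1, 0.9],
}
initial_order = ["s1", "s2", "s3", "s4", "s5"]
reranked = prf_rerank(features, initial_order, k=2)
print(reranked)  # ['s1', 's4', 's2', 's3', 's5']
```

In this toy case the shot s4 moves up even though it was pseudo-labeled as negative. Nothing in the procedure, however, enforces that truly relevant shots end up in the first positions, which is the limitation noted above.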
Recently, the metasearch strategy [14], [15], which was
originally proposed in the field of information retrieval,
has been adopted in CBVR to improve video retrieval
effectiveness. The key idea of metasearch is that multiple
————————————————
• S.K. Wei is with the Institute of Information Science, Beijing Jiaotong University, Beijing 100044, China. E-mail: shkwei@gmail.com.
• Y. Zhao is with the Institute of Information Science, Beijing Jiaotong University, Beijing 100044, China. E-mail: yzhao@bjtu.edu.cn.
• Z.F. Zhu is with the Institute of Information Science, Beijing Jiaotong University, Beijing 100044, China. E-mail: zhfzhu@bjtu.edu.cn.
• N. Liu is with the Institute of Information Science, Beijing Jiaotong University, Beijing 100044, China. E-mail: 05112073@bjtu.edu.cn.
Digital Object Identifier 10.1109/TKDE.2009.145 1041-4347/$25.00 © 2009 IEEE
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.