
Object retrieval with large vocabularies and fast spatial matching

James Philbin¹, Ondřej Chum¹, Michael Isard², Josef Sivic¹ and Andrew Zisserman¹

¹ Department of Engineering Science, University of Oxford
² Microsoft Research, Silicon Valley

{james,ondra,josef,az}@robots.ox.ac.uk   misard@microsoft.com
Abstract
In this paper, we present a large-scale object retrieval system. The user supplies a query object by selecting a region of a query image, and the system returns a ranked list of images that contain the same object, retrieved from a large corpus. We demonstrate the scalability and performance of our system on a dataset of over 1 million images crawled from the photo-sharing site, Flickr [3], using Oxford landmarks as queries.
Building an image-feature vocabulary is a major time and performance bottleneck, due to the size of our dataset. To address this problem we compare different scalable methods for building a vocabulary and introduce a novel quantization method based on randomized trees which we show outperforms the current state-of-the-art on an extensive ground-truth. Our experiments show that the quantization has a major effect on retrieval quality. To further improve query performance, we add an efficient spatial verification stage to re-rank the results returned from our bag-of-words model and show that this consistently improves search quality, though by less of a margin when the visual vocabulary is large.
We view this work as a promising step towards much larger, “web-scale” image corpora.
1. Object retrieval from a large corpus
We are motivated by the problem of retrieving, from a large corpus of images, the subset of images that contain a query object.¹ In practice, no algorithm will be able to make a perfect binary determination of whether or not an image lies in the query subset, and in fact even human judges may disagree on this due to occlusion, distortion, etc. We therefore address the slightly different problem of ranking each image in the corpus to determine the likelihood that it contains the query object, and aim to return to the user some prefix of this ranked list, in descending rank order.

¹ The query object is specified by a user selecting part of a query image, so it is really a “query region”; however, we will refer to it as an object to avoid overloading the term region.
A naive and inefficient solution to this task would be to formulate a ranking function and apply it to every image in the dataset before returning a ranked list. This is very computationally expensive for large corpora, and the standard method in text retrieval [4, 8] is to use a bag of words model, efficiently implemented as an inverted file data-structure. This acts as an initial “filtering” step, greatly reducing the number of documents that need to be considered.
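As a concrete illustration of this filtering step, the sketch below shows a minimal inverted file in Python. It is not the authors' implementation; the class name and the integer word and document identifiers are hypothetical, and real systems store postings lists more compactly.

```python
from collections import defaultdict

# Minimal inverted-file sketch: each (visual) word id maps to the set of
# documents containing it, so a query only touches documents that share at
# least one word with it, instead of scoring the whole corpus.
class InvertedIndex:
    def __init__(self):
        self.postings = defaultdict(set)   # word id -> set of doc ids

    def add_document(self, doc_id, word_ids):
        for w in set(word_ids):
            self.postings[w].add(doc_id)

    def candidates(self, query_word_ids):
        # Union of postings lists: the "filtered" set of documents that
        # share at least one word with the query.
        docs = set()
        for w in set(query_word_ids):
            docs |= self.postings.get(w, set())
        return docs

# Usage sketch with hypothetical data:
index = InvertedIndex()
index.add_document(0, [3, 17, 42])
index.add_document(1, [5, 17])
print(index.candidates([17, 99]))   # -> {0, 1}
```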
Recent work in object based image retrieval [20, 24] has
mimicked simple text-retrieval systems using the analogy
of “visual words.” Images are scanned for “salient” regions
and a high-dimensional descriptor is computed for each re-
gion. These descriptors are then quantized or clustered into
a vocabulary of visual words, and each salient region is
mapped to the visual word closest to it under this cluster-
ing. An image is then represented as a bag of visual words,
and these are entered into an index for later querying and
retrieval. Typically, no spatial information about the image-
location of the visual words is used in the filtering stage.
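For concreteness, the following sketch shows this generic quantization and bag-of-visual-words step, using k-means from scikit-learn purely as a stand-in clusterer. It is not the randomized-tree quantization introduced in this paper, and the descriptor dimensionality, vocabulary size, and random data are arbitrary example values.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(train_descriptors, vocab_size=100, seed=0):
    # Cluster training descriptors into `vocab_size` visual words
    # (illustrative stand-in for whatever vocabulary builder is used).
    kmeans = KMeans(n_clusters=vocab_size, random_state=seed, n_init=1)
    kmeans.fit(train_descriptors)
    return kmeans

def bag_of_words(descriptors, vocab):
    # Assign each region descriptor to its nearest cluster centre and
    # accumulate a histogram of visual-word counts for the image.
    words = vocab.predict(descriptors)
    return np.bincount(words, minlength=vocab.n_clusters)

# Usage sketch: random vectors stand in for 128-D SIFT-like descriptors.
train = np.random.rand(5000, 128).astype(np.float32)
vocab = build_vocabulary(train, vocab_size=100)
image_descriptors = np.random.rand(300, 128).astype(np.float32)
hist = bag_of_words(image_descriptors, vocab)   # one bag-of-words vector
```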
Despite the analogy between “visual words” and words in text documents, the trade-offs in ranking images and web pages are somewhat different. An image query is generated from an example image region and typically contains many more words than a text query. The words are “noisier”, however: in the web search case the user deliberately attempts to choose words that are relevant to the query, whereas in the image-retrieval case the choice of words is abstracted away by the system, and cannot be understood or guided by the user. Consequently, while web-search engines usually treat every query as a conjunction, object-retrieval systems typically include in the filtered set images that contain only, for example, 90% of the query words.
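To illustrate this softer conjunction, the fragment below (building on the inverted-file sketch above) keeps a candidate image only if it contains at least a given fraction of the distinct query words; the 0.9 threshold is simply an example value, not one taken from this paper.

```python
from collections import Counter

def soft_conjunction(index, query_word_ids, min_fraction=0.9):
    # Count, for each candidate document, how many distinct query words it
    # contains, then keep those covering at least `min_fraction` of them.
    query_words = set(query_word_ids)
    hits = Counter()
    for w in query_words:
        for doc_id in index.postings.get(w, set()):
            hits[doc_id] += 1
    threshold = min_fraction * len(query_words)
    return {doc_id for doc_id, n in hits.items() if n >= threshold}
```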
The biggest difference, however, is that the visual words in an image-retrieval query encode vastly more spatial structure than a text query. A user who types a three-word text query may in general be searching for documents containing those three words in any order, at any positions in the document. A visual query, however, since it is selected from a sample image, automatically and inescapably in-