
Object retrieval with large vocabularies and fast spatial matching

James Philbin¹, Ondřej Chum¹, Michael Isard², Josef Sivic¹ and Andrew Zisserman¹

¹ Department of Engineering Science, University of Oxford
² Microsoft Research, Silicon Valley

{james,ondra,josef,az}@robots.ox.ac.uk   misard@microsoft.com
Abstract
In this paper, we present a large-scale object retrieval system. The user supplies a query object by selecting a region of a query image, and the system returns a ranked list of images that contain the same object, retrieved from a large corpus. We demonstrate the scalability and performance of our system on a dataset of over 1 million images crawled from the photo-sharing site, Flickr [3], using Oxford landmarks as queries.
Building an image-feature vocabulary is a major time and performance bottleneck, due to the size of our dataset. To address this problem we compare different scalable methods for building a vocabulary and introduce a novel quantization method based on randomized trees which we show outperforms the current state-of-the-art on an extensive ground-truth. Our experiments show that the quantization has a major effect on retrieval quality. To further improve query performance, we add an efficient spatial verification stage to re-rank the results returned from our bag-of-words model and show that this consistently improves search quality, though by less of a margin when the visual vocabulary is large.
We view this work as a promising step towards much larger, “web-scale” image corpora.
1. Object retrieval from a large corpus
We are motivated by the problem of retrieving, from a large corpus of images, the subset of images that contain a query object.¹ In practice, no algorithm will be able to make a perfect binary determination of whether or not an image lies in the query subset, and in fact even human judges may disagree on this due to occlusion, distortion, etc. We therefore address the slightly different problem of ranking each image in the corpus to determine the likelihood that it contains the query object, and aim to return to the user some prefix of this ranked list, in descending rank order.

¹ The query object is specified by a user selecting part of a query image, so it is really a “query region”; however, we will refer to it as an object to avoid overloading the term region.
A naive and inefficient solution to this task would be to formulate a ranking function and apply it to every image in the dataset before returning a ranked list. This is very computationally expensive for large corpora, and the standard method in text retrieval [4, 8] is to use a bag of words model, efficiently implemented as an inverted file data-structure. This acts as an initial “filtering” step, greatly reducing the number of documents that need to be considered.
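As a concrete illustration of this filtering step, the sketch below shows a minimal inverted file in Python. It is not the authors' implementation; the class name and the integer word and document identifiers are hypothetical, and real systems store postings lists more compactly.

```python
from collections import defaultdict

# Minimal inverted-file sketch: each (visual) word id maps to the set of
# documents containing it, so a query only touches documents that share at
# least one word with it, instead of scoring the whole corpus.
class InvertedIndex:
    def __init__(self):
        self.postings = defaultdict(set)   # word id -> set of doc ids

    def add_document(self, doc_id, word_ids):
        for w in set(word_ids):
            self.postings[w].add(doc_id)

    def candidates(self, query_word_ids):
        # Union of postings lists: the "filtered" set of documents that
        # share at least one word with the query.
        docs = set()
        for w in set(query_word_ids):
            docs |= self.postings.get(w, set())
        return docs

# Usage sketch with hypothetical data:
index = InvertedIndex()
index.add_document(0, [3, 17, 42])
index.add_document(1, [5, 17])
print(index.candidates([17, 99]))   # -> {0, 1}
```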
Recent work in object based image retrieval [20, 24] has
mimicked simple text-retrieval systems using the analogy
of “visual words.” Images are scanned for “salient” regions
and a high-dimensional descriptor is computed for each re-
gion. These descriptors are then quantized or clustered into
a vocabulary of visual words, and each salient region is
mapped to the visual word closest to it under this cluster-
ing. An image is then represented as a bag of visual words,
and these are entered into an index for later querying and
retrieval. Typically, no spatial information about the image-
location of the visual words is used in the filtering stage.
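For concreteness, the following sketch shows this generic quantization and bag-of-visual-words step, using k-means from scikit-learn purely as a stand-in clusterer. It is not the randomized-tree quantization introduced in this paper, and the descriptor dimensionality, vocabulary size, and random data are arbitrary example values.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(train_descriptors, vocab_size=100, seed=0):
    # Cluster training descriptors into `vocab_size` visual words
    # (illustrative stand-in for whatever vocabulary builder is used).
    kmeans = KMeans(n_clusters=vocab_size, random_state=seed, n_init=1)
    kmeans.fit(train_descriptors)
    return kmeans

def bag_of_words(descriptors, vocab):
    # Assign each region descriptor to its nearest cluster centre and
    # accumulate a histogram of visual-word counts for the image.
    words = vocab.predict(descriptors)
    return np.bincount(words, minlength=vocab.n_clusters)

# Usage sketch: random vectors stand in for 128-D SIFT-like descriptors.
train = np.random.rand(5000, 128).astype(np.float32)
vocab = build_vocabulary(train, vocab_size=100)
image_descriptors = np.random.rand(300, 128).astype(np.float32)
hist = bag_of_words(image_descriptors, vocab)   # one bag-of-words vector
```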
Despite the analogy between “visual words” and words in text documents, the trade-offs in ranking images and web pages are somewhat different. An image query is generated from an example image region and typically contains many more words than a text query. The words are “noisier”, however: in the web search case the user deliberately attempts to choose words that are relevant to the query, whereas in the image-retrieval case the choice of words is abstracted away by the system, and cannot be understood or guided by the user. Consequently, while web-search engines usually treat every query as a conjunction, object-retrieval systems typically include in the filtered set images that contain only, for example, 90% of the query words.
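To illustrate this softer conjunction, the fragment below (building on the inverted-file sketch above) keeps a candidate image only if it contains at least a given fraction of the distinct query words; the 0.9 threshold is simply an example value, not one taken from this paper.

```python
from collections import Counter

def soft_conjunction(index, query_word_ids, min_fraction=0.9):
    # Count, for each candidate document, how many distinct query words it
    # contains, then keep those covering at least `min_fraction` of them.
    query_words = set(query_word_ids)
    hits = Counter()
    for w in query_words:
        for doc_id in index.postings.get(w, set()):
            hits[doc_id] += 1
    threshold = min_fraction * len(query_words)
    return {doc_id for doc_id, n in hits.items() if n >= threshold}
```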
The biggest difference, however, is that the visual words in an image-retrieval query encode vastly more spatial structure than a text query. A user who types a three-word text query may in general be searching for documents containing those three words in any order, at any positions in the document. A visual query, however, since it is selected from a sample image, automatically and inescapably in-