Image Classification and Retrieval are ONE
Lingxi Xie (1), Richang Hong (2), Bo Zhang (3), and Qi Tian (4)
(1,3) LITS, TNLIST, Dept. of Computer Sci&Tech, Tsinghua University, Beijing 100084, China
(2) School of Computer and Information, Hefei University of Technology, Hefei 230009, China
(4) Department of Computer Science, University of Texas at San Antonio, TX 78249, USA
(1) 198808xc@gmail.com, (2) hongrc@hfut.edu.cn, (3) dcszb@mail.tsinghua.edu.cn, (4) qitian@cs.utsa.edu
ABSTRACT
In this paper, we demonstrate that the essentials of image classification and retrieval are the same, since both tasks can be tackled by measuring the similarity between images. To this end, we propose ONE (Online Nearest-neighbor Estimation), a unified algorithm for both image classification and retrieval. ONE is surprisingly simple, involving only manual object definition, regional description, and nearest-neighbor search. We take advantage of PCA and PQ approximation as well as GPU parallelization to scale our algorithm up to large-scale image search. Experimental results verify that ONE achieves state-of-the-art accuracy on a wide range of image classification and retrieval benchmarks.
Categories and Subject Descriptors
I.4.10 [Image Processing and Computer Vision]: Image Representation—Statistical; I.4.7 [Image Processing and Computer Vision]: Feature Measurement—Feature representation
General Terms
Algorithms, Experiments, Performance
Keywords
Image Classification, Image Retrieval, ONE, CNN
1. INTRODUCTION
Past decades have witnessed an impressive bloom of multimedia applications based on image understanding. For example, the number of categories in image classification has grown from a few to tens of thousands [13], and deep Convolutional Neural Networks (CNN) have proven effective for large-scale learning [25]. Meanwhile, image retrieval has evolved from toy programs to commercial search engines indexing billions of images, and new user intentions such as fine-grained concept search [62] have been proposed and realized in this research field.
Figure 1: An image retrieval example illustrating the intuition of ONE (best viewed in color PDF). On a query image, it is possible to find a number of semantic objects (here: natural scene, mountain, terrace). Searching for nearest neighbors with a single object might not capture the exact query intention, but fusing the per-object results yields satisfying results. A yellow circle with the word TP indicates a true-positive image. Images are collected from the Holiday dataset [20].
Both image classification and retrieval receive a query image at a time. Classification aims at determining the class or category of the query, for which a number of training samples are provided and an extra training process is often required. For retrieval, the goal is to rank a large number of candidates according to their relevance to the query, and the candidates are treated as independent units, i.e., without explicit relationships between them. Both image classification and retrieval can be tackled by the Bag-of-Visual-Words (BoVW) model. However, the ways of performing classification [10][26] and retrieval [46][38] are, most often, very different. Although all the above algorithms start from extracting patch or regional descriptors, the subsequent modules, including feature encoding, indexing/training, and online querying, are almost entirely different.
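To make the shared essence of the two tasks concrete, the following minimal sketch (an illustrative simplification, not the pipeline proposed in this paper; it assumes each image is represented by a single L2-normalized feature vector, and the function and variable names are hypothetical) shows how both classification and retrieval reduce to nearest-neighbor estimation over one similarity measure.

import numpy as np

def similarities(query, database):
    # Cosine similarity between one query descriptor and each candidate
    # descriptor (rows of `database`); all vectors are assumed L2-normalized.
    return database @ query

def retrieve(query, database):
    # Retrieval: rank all candidate images by similarity to the query,
    # most similar first.
    return np.argsort(similarities(query, database))[::-1]

def classify(query, train_feats, train_labels, k=5):
    # Classification: majority vote over the labels of the k nearest
    # training images, using the same similarity measure as retrieval.
    nearest = np.argsort(similarities(query, train_feats))[::-1][:k]
    votes = np.bincount(np.asarray(train_labels)[nearest])
    return int(np.argmax(votes))

Under this view, the only difference between the two tasks is what is done with the ranked list: retrieval returns it directly, while classification aggregates the labels of its top entries.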
In this paper, we suggest using only ONE (Online Nearest-neighbor Estimation) algorithm for both image classification and retrieval. This is achieved by computing the similarity between the query and each category or candidate image. Inspired by [4], we detect multiple object proposals on the query and each indexed image, and extract high-quality features on each object to provide a better image description. On