Contents lists available at ScienceDirect
Neurocomputing
journal homepage: www.elsevier.com/locate/neucom
Separable vocabulary and feature fusion for image retrieval based on sparse representation
Yanhong Wang a,b, Yigang Cen a,b,⁎, Ruizhen Zhao a,b, Yi Cen c, Shaohai Hu a,b, Viacheslav Voronin d, Hengyou Wang e
a Institute of Information Science, Beijing Jiaotong University, Beijing 100044, China
b Key Laboratory of Advanced Information Science and Network Technology of Beijing, Beijing 100044, China
c School of Information Engineering, Minzu University of China, Beijing 100081, China
d Department of Radio-electronic Systems, Don State Technical University, Shakhty 346500, Russia
e School of Science, Beijing University of Civil Engineering and Architecture, Beijing 100044, China
ARTICLE INFO
Keywords:
Separable vocabulary
Sparse representation
Feature fusion
Image retrieval
ABSTRACT
Visual vocabulary is the core of the Bag-of-visual-words (BOW) model in image retrieval. To ensure retrieval accuracy, traditional methods usually adopt a large vocabulary. However, a large vocabulary leads to low recall. To improve recall, vocabularies of medium size have been proposed, but they in turn lower the accuracy. To address these two problems, we propose a new method for image retrieval based on feature fusion and sparse representation over a separable vocabulary. Firstly, a large vocabulary is generated on the training dataset. Secondly, this vocabulary is separated into a number of medium-sized vocabularies. Thirdly, for a given query image, we adopt sparse representation to select one vocabulary for retrieval. In the proposed method, the large vocabulary guarantees a relatively high accuracy, while the medium-sized vocabularies are responsible for high recall. In addition, to reduce quantization error and further improve recall, a sparse representation scheme is used for visual-word quantization, and both local features and global features are fused. Our proposed method is evaluated on two benchmark datasets, i.e., Coil20 and Holidays. Experiments show that our proposed method achieves good performance.
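The three steps of the proposed pipeline (generate a large vocabulary, split it into medium-sized vocabularies, select one by sparse representation for a query) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the large vocabulary is given as a matrix of visual words, and it approximates the sparse-representation selection with a 1-sparse (nearest-word) code, choosing the sub-vocabulary with the smallest reconstruction residual. All function names are hypothetical.

```python
import numpy as np

def split_vocabulary(vocab, n_parts):
    """Split a large vocabulary (n_words x dim) into medium-sized parts."""
    return np.array_split(vocab, n_parts, axis=0)

def select_vocabulary(query_desc, sub_vocabs):
    """Pick the sub-vocabulary whose visual words best reconstruct the
    query descriptors under a 1-sparse code (nearest-word residual)."""
    best_idx, best_err = 0, np.inf
    for i, words in enumerate(sub_vocabs):
        # squared distance from each descriptor to every word in this part
        d2 = ((query_desc[:, None, :] - words[None, :, :]) ** 2).sum(-1)
        # residual = distance to the nearest word, summed over descriptors
        err = d2.min(axis=1).sum()
        if err < best_err:
            best_idx, best_err = i, err
    return best_idx
```

Retrieval would then proceed against the index built over the selected sub-vocabulary only, which is what keeps the per-query cost at the medium-vocabulary level.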
1. Introduction
In recent years, content-based image retrieval (CBIR) has become a very active research topic in computer vision and multimedia. Although the field has developed rapidly, researchers have not yet standardized the various image retrieval systems [1], and image retrieval remains a challenging problem: retrieval often fails due to occlusion, distortion, corruption and different lighting conditions.
Image retrieval means that, for a given query image, we will retrieve
all similar images from the database. Similar images are defined as images that contain the same objects or scene viewed under different imaging conditions [2]. In the past years, the BOW model [3,4] has achieved great success in the image retrieval area. This model is inspired by
the text retrieval system [3–5]. It contains four major steps: (1) local features are extracted from each image, such as the SIFT descriptor [6], rootSIFT descriptor [7] and SURF descriptor [8]; (2) each local descriptor is quantized to a visual word according to a vocabulary pre-trained by an unsupervised clustering approach; (3) each image is represented by a frequency histogram of visual words; (4) retrieval results are returned according to the similarities between the query image and the images in the dataset.
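Steps (2)–(4) above can be sketched compactly. This is a generic illustration of the BOW pipeline, not code from the paper: it assumes local descriptors and a pre-trained vocabulary are given as NumPy arrays, uses hard nearest-word quantization, and ranks images by cosine similarity between L2-normalised histograms.

```python
import numpy as np

def quantize(descriptors, vocab):
    """Step (2): assign each local descriptor to its nearest visual word."""
    d2 = ((descriptors[:, None, :] - vocab[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

def bow_histogram(descriptors, vocab):
    """Step (3): represent an image as an L2-normalised word-frequency histogram."""
    words = quantize(descriptors, vocab)
    hist = np.bincount(words, minlength=len(vocab)).astype(float)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist

def rank_images(query_hist, db_hists):
    """Step (4): rank database images by cosine similarity to the query."""
    sims = db_hists @ query_hist  # histograms are unit-norm, so this is cosine
    return np.argsort(-sims)
```

In practice step (1) would produce the descriptors via SIFT/rootSIFT/SURF, and the vocabulary would come from k-means clustering over descriptors of a training set; the hard `argmin` quantization in step (2) is exactly the source of the quantization error that soft or sparse assignment schemes aim to reduce.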
Vocabulary plays a very important role in the BOW model. For a large number of local features, a large visual vocabulary needs to be trained to ensure retrieval accuracy. But a large visual vocabulary leads to low recall and other issues [9,10]. To improve recall, previous works offer two main types of solutions. Firstly, the size of the vocabulary is changed. For example, in [2], Jegou et al. proposed to use a vocabulary of medium size to improve recall. However, this leads to low accuracy [10]. In [11,12], the authors represented images with the vector of locally aggregated descriptors (VLAD), which can be viewed as a simplification of the Fisher vector (FV) [13] representation. Moreover, the VLAD method only requires a small vocabulary in the retrieval process. Secondly, multiple-vocabulary-based strategies are used. The vocabularies are usually generated from an independent training dataset. In [14], the authors proposed a Bayes merging approach to down-weight the indexed features in the
intersection set. In [15], instead of computing the multiple vocabul-
http://dx.doi.org/10.1016/j.neucom.2016.08.106
Received 27 February 2016; Received in revised form 17 July 2016; Accepted 8 August 2016
⁎ Corresponding author at: Institute of Information Science, Beijing Jiaotong University, Beijing 100044, China.
E-mail address: ygcen@bjtu.edu.cn (Y. Cen).
Neurocomputing 236 (2017) 14–22
Available online 17 November 2016
0925-2312/ © 2016 Elsevier B.V. All rights reserved.