Multi-Index Fusion via Similarity Matrix Pooling
for Image Retrieval
Xin Chen*, Jun Wu*, Shaoyan Sun†, Qi Tian‡
*Tongji University, Shanghai, China
†University of Science and Technology of China, Hefei, China
‡University of Texas at San Antonio, San Antonio, TX 78249
{1410452, wujun}@tongji.edu.cn, sunshy@mail.ustc.edu.cn, qi.tian@utsa.edu
Abstract—Different kinds of features hold distinct merits, making them complementary to each other. Inspired by this idea, an index-level multiple-feature fusion scheme via similarity matrix pooling is proposed in this paper. We first compute the similarity matrix of each index, and then a novel pooling scheme is applied to these similarity matrices to update the original indices. Compared with existing fusion schemes, the proposed scheme performs feature fusion at the index level, which saves memory and reduces computational complexity. Moreover, the proposed scheme treats different kinds of features adaptively according to their importance, thus improving retrieval accuracy. The proposed approach is evaluated on two public datasets, where it significantly outperforms the baseline methods in retrieval accuracy with low memory consumption and computational complexity.
I. INTRODUCTION
With the explosive growth of visual data in recent years, finding useful information in massive visual collections has become an urgent need, and content-based image retrieval (CBIR) is an effective way to meet it. Typically, a CBIR system represents an image as a fixed-dimension vector and measures the similarity between two images by computing the Euclidean or cosine distance between their vectors. The vector may be a holistic global feature vector or a sparse histogram vector constructed from local features. Different types of features have different representative power, resulting in different performance of CBIR systems.
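The two similarity measures mentioned above can be sketched as follows; the feature vectors here are toy illustrations, not actual image descriptors:

```python
import math

def euclidean(a, b):
    # Euclidean distance between two feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    # Cosine similarity: dot product over the product of the norms
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

q = [1.0, 0.0, 2.0]   # query feature vector (toy example)
d = [1.0, 1.0, 2.0]   # database feature vector (toy example)
print(euclidean(q, d))          # 1.0
print(round(cosine_similarity(q, d), 4))
```

A smaller Euclidean distance or a larger cosine similarity indicates a closer match between the two images.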
Early CBIR systems usually used global holistic low-dimensional feature vectors and dealt with small-scale datasets, requiring relatively low memory consumption and computational complexity, so developing efficient index schemes was not necessary. With the invention of the Scale-Invariant Feature Transform (SIFT) [1], image representations became much more complex and the scale of image datasets also grew, so traditional methods for similarity measurement began to show their limitations.
Inspired by successful text retrieval systems, the inverted index structure and the Bag-of-Visual-Words (BoVW) model were introduced into CBIR systems for efficient image retrieval [2]. In this framework, local features extracted from an image are quantized into different visual words of a pre-trained codebook (bag). The quantized features are then weighted with Term Frequency-Inverse Document Frequency (TF-IDF) [3], generating a histogram vector used for image representation. The similarity between a query image and a dataset image is measured by counting the co-occurrences of the same visual words in the pair. The database is organized by visual words: each visual word points to a list of image entries, and each entry records the identity (ID) of an image and its TF-IDF weight. In the online query stage, we only need to traverse the lists of those visual words that appear in the query image. In this way, both the memory consumption for storing the index and the computational complexity of online retrieval are greatly reduced. These advantages made this framework the mainstream of content-based image retrieval for a decade. A considerable number of works focus on further improving the retrieval accuracy and efficiency of this framework [19], [20], [21], [22], [23], [24], [25].
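The inverted-index scoring just described can be sketched as follows; the quantized database, image IDs, and weighting details are toy illustrations, not the implementation used in this paper:

```python
import math
from collections import defaultdict

# Toy database: image ID -> list of visual-word IDs (already quantized)
database = {
    "img_a": [3, 3, 7, 9],
    "img_b": [3, 5],
    "img_c": [7, 9, 9],
}
n_images = len(database)

# Build the inverted index: visual word -> list of (image ID, TF-IDF weight)
index = defaultdict(list)
for img_id, words in database.items():
    for w in set(words):
        tf = words.count(w) / len(words)                    # term frequency
        df = sum(1 for ws in database.values() if w in ws)  # document frequency
        idf = math.log(n_images / df)                       # inverse document frequency
        index[w].append((img_id, tf * idf))

def query(words):
    # Traverse only the posting lists of visual words present in the query
    scores = defaultdict(float)
    for w in set(words):
        for img_id, weight in index.get(w, []):
            scores[img_id] += weight   # accumulate weighted co-occurrence evidence
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(query([3, 9]))   # ranked list; img_a shares both words with the query
```

Because only the posting lists of the query's visual words are visited, query cost scales with the number of matching entries rather than with the full database size.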
The computer vision field has witnessed revolutionary changes brought by deep neural networks. In particular, deep convolutional neural networks (CNNs) have pushed a considerable number of vision tasks to new state-of-the-art performance [11], [26], [27], [28]. The powerful discriminative ability of CNN features has been widely explored in the task of image retrieval. Babenko et al. [4] proposed to use the activations of one layer of a CNN as the image representation for image retrieval. Hariharan et al. [10] exploited spatial information by aggregating the feature maps at the same location of a specific convolutional layer into a hypercolumn vector, which was used in the task of object segmentation. Furthermore, Ng et al. [6] proposed to aggregate these hypercolumn vectors into one vector with the Vector of Locally Aggregated Descriptors (VLAD) [17] as the image representation for image retrieval. A multi-scale orderless pooling scheme [5] was proposed to aggregate CNN features from images at multiple scales with VLAD. These schemes significantly boost image retrieval accuracy.
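A minimal sketch of the VLAD aggregation used in these schemes is shown below; the descriptors and the two-word codebook are random toy data, not the authors' setup:

```python
import math
import random

def vlad(descriptors, centroids):
    # VLAD: for each codebook centroid, sum the residuals (descriptor - centroid)
    # of the descriptors assigned to it, then concatenate and L2-normalize.
    d = len(centroids[0])
    residual_sums = [[0.0] * d for _ in centroids]
    for x in descriptors:
        # Assign the descriptor to its nearest centroid (squared Euclidean distance)
        k = min(range(len(centroids)),
                key=lambda c: sum((x[i] - centroids[c][i]) ** 2 for i in range(d)))
        for i in range(d):
            residual_sums[k][i] += x[i] - centroids[k][i]
    v = [val for row in residual_sums for val in row]   # concatenate residual sums
    norm = math.sqrt(sum(val * val for val in v)) or 1.0
    return [val / norm for val in v]                    # L2 normalization

random.seed(0)
# Toy local descriptors (e.g. hypercolumn vectors) and a 2-word codebook
descs = [[random.random() for _ in range(4)] for _ in range(10)]
codebook = [[0.2] * 4, [0.8] * 4]
rep = vlad(descs, codebook)
print(len(rep))   # 2 centroids x 4 dims = 8-dimensional representation
```

The resulting fixed-length vector can then be compared with Euclidean or cosine distance like any global feature.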
Local features such as SIFT have powerful representative ability for the detailed information of images but lack holistic ability. In the BoVW model, sufficient details are kept in the index, but the global spatial information is not preserved. Many schemes have been proposed to aggregate spatial information into the index [19], [20], [21]. Although these schemes do help to boost the retrieval performance, they only integrate partial low-order spatial information, which is insufficient. In contrast, global features such as fully-connected-layer CNN features are good at collecting global information of an image with
IEEE ICC 2017 SAC Symposium Big Data Networking Track
978-1-4673-8999-0/17/$31.00 ©2017 IEEE