Permission to make digital or hard copies of part or all of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for commercial advantage and that copies bear this notice and the full citation on the
first page. Copyrights for components of this work owned by others than ACM must be
honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on
servers, or to redistribute to lists, requires prior specific permission and/or a fee.
Request permissions from permissions@acm.org.
VRCAI 2014, November 30 – December 02, 2014, Shenzhen, China.
Copyright © ACM 978-1-4503-3254-5/14/11 $15.00
http://dx.doi.org/10.1145/2670473.2670510
Visual Saliency Based Bag of Phrases for Image Retrieval
Lijuan Duan∗
College of Computer Science, Beijing University of Technology
Wei Ma†
College of Computer Science, Beijing University of Technology
Jun Miao‡
Key Laboratory of Intelligent Information Processing of CAS, Institute of Computing Technology, CAS, Beijing 100190, China
Xuan Zhang§
College of Computer Science, Beijing University of Technology
Abstract
This paper presents a saliency-based bag-of-phrases (Saliency-BoP for short) method for image retrieval. It combines saliency detection with visual phrase construction to extract bag-of-phrase features. To achieve this, the method first detects salient regions in images. Then, it constructs visual phrases from word pairs that occur within the same salient region. Finally, it extracts the bag of visual phrases from the top K salient regions to describe images. Experimental results on the Corel 1K and Microsoft Research Cambridge image databases demonstrate that the Saliency-BoP method outperforms related methods such as Bag-of-Words (BoW) and Saliency-BoW.
CR Categories: I.4.7 [Image Processing and Computer Vision]:
Feature Measurement—Feature representation;
Keywords: Image retrieval, saliency, bag-of-phrases
1 Introduction
Image retrieval has improved greatly in recent years. Many features, from low-level to high semantic-level ones, have been applied to image retrieval. Precision and speed have advanced considerably, yet no critical breakthrough has been achieved. The key obstacle is that there is still no way to recognize the real meaning of an image. In other words, image retrieval that works the way a human searches for objects in a picture will only become possible once the true meaning of the image can be recognized. This paper proposes a new image descriptor based on human visual saliency to represent images effectively.
The bag-of-words (BoW) model [Csurka et al. 2004] is currently very popular in image retrieval and image classification. It represents images by sets of features. The main idea of BoW comprises three major steps: 1) extract image features; 2) construct the codebook; 3) obtain an image descriptor by mapping features onto the codebook. Images are then ranked by calculating similarities between the query and the images in the database. Philbin
∗e-mail: ljduan@bjut.edu.cn
†e-mail: mawei@bjut.edu.cn
‡e-mail: jmiao@ict.ac.cn
§e-mail: zhangxuan2011@emails.bjut.edu.cn
[Philbin et al. 2007] was the first to apply the BoW model to large-scale image retrieval, and the method performed well. The BoW model carries no context information; it merely counts word frequencies. Zhang [Zhang et al. 2011b] proposed a new method of constructing visual phrases [Jiang et al. 2012; Zhang et al. 2011a] that involve the relative positions of visual words for image retrieval. The method outperforms BoW because context is incorporated. Shabany [Shabany et al. 2013] proposed a global similarity method that combined a manifold-based approach with the BoW model. These methods perform well in image retrieval and image classification.
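The three BoW steps and the subsequent similarity ranking can be sketched as follows. This is a minimal illustration, not the paper's implementation: random vectors stand in for real local descriptors (e.g. SIFT), a plain k-means builds the codebook, and the function names (build_codebook, bow_histogram) are made up for this sketch.

```python
import numpy as np

def build_codebook(descriptors, k, iters=20, seed=0):
    """Step 2: cluster descriptors into k visual words (plain k-means)."""
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), k, replace=False)]
    for _ in range(iters):
        # assign each descriptor to its nearest center
        d = np.linalg.norm(descriptors[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = descriptors[labels == j].mean(axis=0)
    return centers

def bow_histogram(descriptors, centers):
    """Step 3: map descriptors to their nearest words, count frequencies."""
    d = np.linalg.norm(descriptors[:, None] - centers[None], axis=2)
    words = d.argmin(axis=1)
    hist = np.bincount(words, minlength=len(centers)).astype(float)
    return hist / hist.sum()  # L1-normalised word frequencies

def rank(query_hist, db_hists):
    """Rank database images by cosine similarity to the query."""
    sims = [float(np.dot(query_hist, h) /
                  (np.linalg.norm(query_hist) * np.linalg.norm(h)))
            for h in db_hists]
    return sorted(range(len(sims)), key=lambda i: -sims[i])

rng = np.random.default_rng(1)
feats = [rng.normal(size=(200, 8)) for _ in range(4)]  # step 1 stand-in
codebook = build_codebook(np.vstack(feats), k=16)
hists = [bow_histogram(f, codebook) for f in feats]
order = rank(hists[0], hists)  # the query image ranks itself first
```

Saliency-BoP keeps this pipeline but replaces the word histogram with a phrase histogram computed over salient regions.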
We consider the problem from another perspective: we construct visual phrases that capture contextual information within salient regions. The effectiveness of this idea is validated by the experiments presented later.
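A simple way to read "word pairs from the same salient region" is sketched below. This is a hypothetical illustration under the assumption that a phrase is an unordered pair of distinct visual words co-occurring in one region; the paper's exact pairing rule is given in Section 2.

```python
from itertools import combinations
from collections import Counter

def phrases_in_region(word_ids):
    """All unordered pairs of distinct visual words in one salient region."""
    return list(combinations(sorted(set(word_ids)), 2))

def bag_of_phrases(regions):
    """Phrase histogram over the top-K salient regions of an image."""
    bag = Counter()
    for words in regions:
        bag.update(phrases_in_region(words))
    return bag

# two salient regions, each holding quantised visual-word ids
regions = [[3, 7, 7, 12], [3, 12]]
bag = bag_of_phrases(regions)  # the pair (3, 12) occurs in both regions
```

The resulting Counter plays the role of the BoW histogram, but each bin now encodes a co-occurrence rather than a single word.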
In general, when features are extracted over an entire image, noise is usually introduced. Background information noticeably affects the representation of images and can therefore harm retrieval. Researchers introduced the visual attention model to alleviate this problem. It locates salient regions of scenes by simulating the automatic selective attention mechanism of humans. This paper demonstrates that constructing visual phrases with contextual information in salient regions can reduce such noise. Itti [Itti et al. 1998] first proposed the visual attention model. According to this theory, when humans observe a picture, their fixations dwell on different regions for different time intervals and in no fixed order. These phenomena are considered to reflect differences in visual attention across the image. A saliency map, which simulates true human fixations, can be obtained from a visual attention model. Many researchers have proposed methods to calculate saliency maps from different aspects, with good performance. For example, Duan et al. [Duan et al. 2011] proposed visual saliency detection by spatially weighted dissimilarity, in which the saliency map is calculated using dissimilarities between blocks weighted by their positions. So far, saliency models have been applied in many fields, such as sparse coding [Kanan and Cottrell 2010] and image segmentation [Achanta et al. 2008].
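The block-dissimilarity idea can be illustrated with a toy computation. This is only loosely inspired by the spatially weighted dissimilarity principle and is not the algorithm of Duan et al.: a block is scored as salient when its appearance differs strongly from other blocks, with spatially nearer blocks weighted more heavily.

```python
import numpy as np

def block_saliency(img, bs=4, sigma=8.0):
    """Toy saliency per block: mean feature dissimilarity to all other
    blocks, weighted to emphasise spatially close competitors."""
    h, w = img.shape
    feats, pos = [], []
    for i in range(0, h - bs + 1, bs):
        for j in range(0, w - bs + 1, bs):
            feats.append(img[i:i + bs, j:j + bs].ravel())
            pos.append((i, j))
    feats = np.array(feats, dtype=float)
    pos = np.array(pos, dtype=float)
    fd = np.linalg.norm(feats[:, None] - feats[None], axis=2)  # feature dist
    sd = np.linalg.norm(pos[:, None] - pos[None], axis=2)      # spatial dist
    wgt = np.exp(-sd / sigma)              # nearby blocks weigh more
    sal = (fd * wgt).sum(axis=1) / wgt.sum(axis=1)
    return sal.reshape((h // bs, w // bs))

img = np.zeros((16, 16))
img[4:8, 4:8] = 1.0          # a bright patch on a dark background
sal = block_saliency(img)    # the bright block scores highest
```

Thresholding such a map, or taking its strongest peaks, yields the top-K salient regions inside which Saliency-BoP forms its phrases.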
The main innovation of this paper is to introduce visual attention to image retrieval and, on this basis, to construct visual phrases in salient regions according to certain rules. We call this method bag of phrases based on visual saliency (Saliency-BoP). Experiments on the Corel 1K and Microsoft Research Cambridge image databases [Ulusoy and Bishop 2005] show that our method outperforms the BoW model.
The rest of the paper is organized as follows. Section 2 introduces our method in detail. Experimental results and discussion are presented in Section 3. Finally, conclusions are drawn in Section 4.