There are many famous keypoint detectors and descriptors [29–33,59,60], such as the Harris keypoint detector, SIFT, SURF, PCA-SIFT and ORB (Oriented FAST and Rotated BRIEF), among which SIFT is the most popular local feature representation. It can be used to perform reliable matching between different views of an object or scene [29]. In order to perform as well as SIFT with lower computational complexity, SURF [32] or ORB [59] can be considered as efficient alternatives to SIFT. Recently, bag-of-visual-words (BOW) models and their variants have been reported in the literature and used for object-based image retrieval, object recognition and scene categorization [34–41]. In [34], Sivic and Zisserman proposed the bag-of-visual-words (BOW) model, which in essence borrows techniques from text retrieval. In the BOW model, local features are extracted from an image by using SIFT, SURF or other keypoint detectors, and are then mapped into a set of visual words. Finally, an image is represented as a histogram of visual word occurrences. This is the so-called standard BOW baseline, which can be considered one of the state-of-the-art methods. However, the visual words usually come from a clustering step that imposes a heavy computational burden. Besides, visual words have two major limitations: the lack of any explicit semantic meaning, and the ambiguity of the visual words themselves. Indeed, improving the visual vocabulary and incorporating spatial information and semantic attributes can reduce these limitations and can also improve the performance of BOW models [35–41].
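To make the BOW pipeline concrete, the following is a minimal sketch, assuming ORB keypoints (via OpenCV) and a k-means vocabulary (via scikit-learn); the vocabulary size and library choices are illustrative assumptions, not details taken from [34].

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(image_paths, n_words=200):
    """Cluster local descriptors from training images into visual words.
    n_words is an illustrative choice, not a value from the cited work."""
    orb = cv2.ORB_create()
    descriptors = []
    for path in image_paths:
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, des = orb.detectAndCompute(img, None)
        if des is not None:
            descriptors.append(des.astype(np.float32))
    return KMeans(n_clusters=n_words, n_init=4).fit(np.vstack(descriptors))

def bow_histogram(image_path, vocabulary):
    """Represent an image as a normalized histogram of visual-word counts."""
    orb = cv2.ORB_create()
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    _, des = orb.detectAndCompute(img, None)
    words = vocabulary.predict(des.astype(np.float32))
    hist, _ = np.histogram(words, bins=np.arange(vocabulary.n_clusters + 1))
    return hist / max(hist.sum(), 1)
```

Two such histograms can then be compared with, e.g., histogram intersection or cosine similarity for retrieval.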
There are extensive studies of feature extraction and image representation within the image retrieval and object recognition frameworks. However, developing a computational visual-attention model within the CBIR framework needs further study.
3. Gray level co-occurrence matrix (GLCM)
Before discussing the proposed computational visual-attention model in more detail, a brief introduction to the gray level co-occurrence matrix (GLCM) is given, since our saliency model involves Haralick's gray level co-occurrence matrix [16].
The co-occurrence matrix is the most famous statistical approach in textural image processing. In 1973, Haralick put forward the gray level co-occurrence matrix and extracted a set of 14 features to describe texture images, such as energy, inverse difference moment, contrast, entropy and so on [16]. It remains popular today by virtue of its good performance. The value of a gray image at coordinates $(x, y)$ is denoted as $f(x, y) = w$, $w \in \{0, 1, \ldots, 255\}$. In order to conveniently define the co-occurrence matrix, the pixel position at coordinates $(x, y)$ is denoted as $P$, where $P = (x, y)$. Let there be two pixel positions $P_1 = (x_1, y_1)$ and $P_2 = (x_2, y_2)$ whose pixel values are $f(P_1) = w$ and $f(P_2) = \hat{w}$. If the two values $w$ and $\hat{w}$ co-occur at two pixel positions related by the displacement $d$, the cell entry $(w, \hat{w})$ of the co-occurrence matrix $\mathrm{GLCM}(w, \hat{w}; d)$ can be defined as follows:

$$\mathrm{GLCM}(w, \hat{w}; d) = \Pr\{\, f(P_1) = w \wedge f(P_2) = \hat{w} \mid |P_1 - P_2| = d \,\} \quad (1)$$
where $\wedge$ denotes the logical AND operation. In the GLCM algorithm, energy, entropy, contrast and inverse difference moment are often utilized to describe image features [16], but their discrimination power is not sufficient to achieve satisfactory image retrieval performance, especially on large-scale datasets [21]. If all cell entries of the co-occurrence matrix are used to describe image features, the vector dimension would be very high, and this does not always increase retrieval accuracy.
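As a concrete reading of Eq. (1), the sketch below accumulates pixel pairs for a single horizontal displacement d and normalizes the counts into joint probabilities; the 256-level range and the horizontal-offset choice are illustrative assumptions.

```python
import numpy as np

def glcm(image, d=1, levels=256):
    """Normalized co-occurrence matrix for a horizontal offset d (cf. Eq. (1)).
    image: 2-D uint8 array; the offset direction is an illustrative choice."""
    counts = np.zeros((levels, levels), dtype=np.float64)
    left, right = image[:, :-d], image[:, d:]           # pixel pairs with |P1 - P2| = d
    np.add.at(counts, (left.ravel(), right.ravel()), 1)
    return counts / counts.sum()                        # joint probabilities

def energy(glcm_matrix):
    """Haralick energy (angular second moment): sum of squared GLCM entries.
    It is maximal when the analyzed region is perfectly homogeneous."""
    return float(np.sum(glcm_matrix ** 2))
```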
However, some features extracted from the GLCM have a definite physical meaning in texture image analysis; for instance, energy is a measure of the textural uniformity of an image. When the image under consideration is homogeneous, energy reaches its maximum [43]. Conspicuity areas can be considered as those areas that exhibit significant visual differences and are not homogeneous. Inspired by the above observations, the energy feature of the GLCM is used as the inhibition term in the saliency map detection stage, instead of the local maxima normalization operator of Itti's model [5].
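One plausible reading of this inhibition scheme is sketched below, reusing glcm and energy from the sketch above: a local GLCM-energy value is computed per window and used to multiplicatively suppress homogeneous regions of a feature map. The window size and the (1 - energy) weighting are our own assumptions, not settings specified by the authors.

```python
def inhibit_by_energy(feature_map, gray_image, win=16):
    """Weight each window of a feature map by (1 - local GLCM energy),
    so homogeneous (non-salient) areas are inhibited. win is an assumption."""
    out = feature_map.astype(np.float64).copy()
    h, w = gray_image.shape
    for y in range(0, h - win + 1, win):
        for x in range(0, w - win + 1, win):
            patch = gray_image[y:y + win, x:x + win]
            e = energy(glcm(patch))                 # close to 1 for uniform patches
            out[y:y + win, x:x + win] *= (1.0 - e)  # suppress uniform areas
    return out
```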
4. The saliency structures model and descriptor
Human visual attention consists of a pre-attentive and an attentive stage according to Treisman's feature integration theory [4]. In the pre-attentive stage, only "pop-out" features are detected, whereas in the attentive stage, relationships between various features are found and grouped [4,14]. In this paper, a saliency structure model is proposed for content-based image retrieval according to Treisman's feature integration theory [4] and Julesz's texton theory [49,50]. In feature extraction and image representation, the orientation-selective mechanism derived from the work of Hubel and Wiesel is used in our model [1]. Color, intensity and orientation are considered the primary visual features and are commonly used in many saliency models [4,5]. In order to detect "pop-out" features, a novel visual cue, namely color volume, together with edge information, is introduced into our saliency model and used to detect salient regions.
It is crucially important to emphasize that the saliency structure model can be considered an improved version of the micro-structures model, obtained by combining a bottom-up component of visual attention with the orientation-selective mechanism. The saliency structures are defined as bar-shaped structures according to the orientation-selective mechanism, using oriented Gabor filters, whereas micro-structures are defined as the collection of certain underlying colors [3]. The basic principle of the proposed descriptor is to generate three-tuple histograms from the bar-shaped structures and oriented Gabor filters in a specific way, whereas the micro-structure descriptor adopts a probability statistics method to describe features.
The flow diagram of the proposed saliency model within the CBIR framework is illustrated in Fig. 2.
In the proposed saliency model within the CBIR framework, we mainly focus on: (1) the construction of the saliency structure model and (2) image representation. The construction of the saliency structure model mainly consists of three stages: (a) extraction of the primary visual features, (b) saliency map detection and (c) the combination of bar-shaped structures and oriented Gabor filters for saliency structure detection.
4.1. Extraction of the primary visual features
The human visual system is highly sensitive to color, orientation and intensity information [5]. In many visual saliency models, color is implemented as R-G (red-green) and B-Y (blue-yellow) channels, inspired by the color-opponent neurons of the V1 cortex [5,13]. The average of the three color channels is usually used as intensity. Orientation is often implemented as a convolution with oriented Gabor filters.
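As a hedged illustration of the orientation channel, the sketch below convolves an intensity image with a small bank of oriented Gabor kernels using OpenCV; the four orientations and the kernel parameters are illustrative assumptions, not the authors' settings.

```python
import cv2
import numpy as np

def orientation_maps(gray, n_orientations=4, ksize=15):
    """One Gabor response map per orientation (0, 45, 90, 135 degrees here).
    Kernel parameters are illustrative, not taken from the paper."""
    maps = []
    for k in range(n_orientations):
        theta = k * np.pi / n_orientations
        kernel = cv2.getGaborKernel((ksize, ksize), sigma=4.0, theta=theta,
                                    lambd=10.0, gamma=0.5, psi=0)
        maps.append(cv2.filter2D(gray.astype(np.float32), -1, kernel))
    return maps
```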
It is well known that the HSV color space mimics human color perception well. In order to extract the primary visual features for image representation and to simplify manipulation, the quantization of visual features needs to be implemented in the HSV color space. For example, the task of color quantization is to select and assign a limited set of colors to represent a given color image with maximum fidelity [44]. Color quantization techniques are fully described in many digital image processing books and will not be detailed here.
In order to obtain the color map, the H, S and V color channels are uniformly quantized into 6, 3 and 3 bins, respectively, so that in total $6 \times 3 \times 3 = 54$ color combinations are obtained. $M_C(x, y)$ denotes the color combinations or color map, as $M_C(x, y) = w$, $w \in \{0, 1, \ldots, N_C - 1\}$, where $N_C = 54$ in this paper.
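A minimal sketch of this 6/3/3 quantization, assuming OpenCV's HSV value ranges (H in [0, 180), S and V in [0, 256)); the intensity map described next can be obtained analogously by quantizing the V channel alone into 16 bins.

```python
import cv2
import numpy as np

def color_map(bgr_image):
    """Quantize H, S, V into 6, 3 and 3 uniform bins and combine them into
    a single color-map index M_C(x, y) in {0, ..., 53}."""
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    h = hsv[..., 0].astype(np.int32)        # OpenCV: H in [0, 180)
    s = hsv[..., 1].astype(np.int32)        # S in [0, 256)
    v = hsv[..., 2].astype(np.int32)        # V in [0, 256)
    h_bin = np.minimum(h * 6 // 180, 5)
    s_bin = s * 3 // 256
    v_bin = v * 3 // 256
    return h_bin * 9 + s_bin * 3 + v_bin    # 6 * 3 * 3 = 54 codes
```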
Intensity information is given by the V color channel. After uniform quantization, we can obtain the intensity map $M_I(x, y)$, as $M_I(x, y) = s$, $s \in \{0, 1, \ldots, N_I - 1\}$, where $N_I = 16$. Since the computational