There are many famous keypoint detectors and descriptors [29–33,59,60], such as the Harris keypoint detector, SIFT, SURF, PCA-SIFT and ORB (Oriented FAST and Rotated BRIEF), among which SIFT is the most popular local feature representation. It can be used to perform reliable matching between different views of an object or scene [29]. In order to perform as well as SIFT with lower computational complexity, SURF [32] or ORB [59] can be considered as efficient alternatives to SIFT. Recently, bag-of-visual-words (BOW) models and their variants have been reported in the literature and used for object-based image retrieval, object recognition and scene categorization [34–41]. In [34], Sivic and Zisserman proposed the bag-of-visual-words (BOW) model, which in essence borrows techniques from text retrieval. In the BOW model, local features are extracted from an image by using SIFT, SURF or other keypoint detectors, and are then mapped into a set of visual words. Finally, an image is represented as a histogram of visual word occurrences. This is the so-called standard BOW baseline, which can be considered one of the state-of-the-art methods. However, the visual words usually come from a clustering step that imposes a heavy computational burden. Besides, visual words have two major limitations: the lack of any explicit semantic meaning, and the ambiguity of the visual words themselves. Indeed, improving the visual vocabulary and incorporating spatial information and semantic attributes can reduce these limitations and can also improve the performance of BOW models [35–41].
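To make the BOW pipeline concrete, the following is a minimal sketch, assuming ORB keypoints (via OpenCV) and a k-means vocabulary (via scikit-learn); the vocabulary size and library choices are illustrative assumptions, not details taken from [34].

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(image_paths, n_words=200):
    """Cluster local descriptors from training images into visual words.
    n_words is an illustrative choice, not a value from the cited work."""
    orb = cv2.ORB_create()
    descriptors = []
    for path in image_paths:
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, des = orb.detectAndCompute(img, None)
        if des is not None:
            descriptors.append(des.astype(np.float32))
    return KMeans(n_clusters=n_words, n_init=4).fit(np.vstack(descriptors))

def bow_histogram(image_path, vocabulary):
    """Represent an image as a normalized histogram of visual-word counts."""
    orb = cv2.ORB_create()
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    _, des = orb.detectAndCompute(img, None)
    words = vocabulary.predict(des.astype(np.float32))
    hist, _ = np.histogram(words, bins=np.arange(vocabulary.n_clusters + 1))
    return hist / max(hist.sum(), 1)
```

Two such histograms can then be compared with, e.g., histogram intersection or cosine similarity for retrieval.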
There are extensive studies of feature extraction and image representation within the image retrieval and object recognition frameworks. However, developing a computational visual-attention model within the CBIR framework needs further study.
3. Gray level co-occurrence matrix (GLCM)
Before discussing the proposed computational visual-attention model in more detail, a brief introduction to the gray level co-occurrence matrix (GLCM) is given, since our saliency model involves Haralick's gray level co-occurrence matrix [16].
The co-occurrence matrix is the most famous statistical approach in textural image processing. In 1973, Haralick put forward the gray level co-occurrence matrix and extracted a set of 14 features to describe texture images, such as energy, inverse difference moment, contrast, entropy and so on [16]. It remains popular today by virtue of its good performance. The value of a gray image at coordinates $(x, y)$ is denoted as $f(x, y) = w$, $w \in \{0, 1, \ldots, 255\}$. In order to conveniently define the co-occurrence matrix, the pixel position at coordinates $(x, y)$ is denoted as $P$, where $P = (x, y)$. Let there be two pixel positions $P_1 = (x_1, y_1)$ and $P_2 = (x_2, y_2)$ whose pixel values are $f(P_1) = w$ and $f(P_2) = \hat{w}$. If the two values $w$ and $\hat{w}$ co-occur at two pixel positions related by the displacement $d$, the cell entry $(w, \hat{w})$ of the co-occurrence matrix $\mathrm{GLCM}(w, \hat{w}; d)$ can be defined as follows:

$$\mathrm{GLCM}(w, \hat{w}; d) = \Pr\{\, f(P_1) = w \wedge f(P_2) = \hat{w} \mid |P_1 - P_2| = d \,\} \quad (1)$$
where $\wedge$ denotes the logical AND operation. In the GLCM algorithm, energy, entropy, contrast and inverse difference moment are often utilized to describe image features [16], but their discrimination power is not sufficient to achieve satisfactory image retrieval performance, especially on large-scale datasets [21]. If all cell entries of the co-occurrence matrix are used to describe image features, the vector dimension would be very high, and this does not always increase retrieval accuracy.
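As a concrete reading of Eq. (1), the sketch below accumulates pixel pairs for a single horizontal displacement d and normalizes the counts into joint probabilities; the 256-level range and the horizontal-offset choice are illustrative assumptions.

```python
import numpy as np

def glcm(image, d=1, levels=256):
    """Normalized co-occurrence matrix for a horizontal offset d (cf. Eq. (1)).
    image: 2-D uint8 array; the offset direction is an illustrative choice."""
    counts = np.zeros((levels, levels), dtype=np.float64)
    left, right = image[:, :-d], image[:, d:]           # pixel pairs with |P1 - P2| = d
    np.add.at(counts, (left.ravel(), right.ravel()), 1)
    return counts / counts.sum()                        # joint probabilities

def energy(glcm_matrix):
    """Haralick energy (angular second moment): sum of squared GLCM entries.
    It is maximal when the analyzed region is perfectly homogeneous."""
    return float(np.sum(glcm_matrix ** 2))
```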
However, some features extracted from the GLCM have a definite physical meaning in texture image analysis; for instance, energy is a measure of the textural uniformity of an image. When the image under consideration is homogeneous, energy reaches its maximum [43]. Conspicuity areas can be considered as those areas that exhibit significant visual differences and are not homogeneous. Inspired by the above observations, the energy feature of the GLCM is used as the inhibition term in the saliency map detection stage, instead of the local maxima normalization operator of Itti's model [5].
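One plausible reading of this inhibition scheme is sketched below, reusing glcm and energy from the sketch above: a local GLCM-energy value is computed per window and used to multiplicatively suppress homogeneous regions of a feature map. The window size and the (1 - energy) weighting are our own assumptions, not settings specified by the authors.

```python
def inhibit_by_energy(feature_map, gray_image, win=16):
    """Weight each window of a feature map by (1 - local GLCM energy),
    so homogeneous (non-salient) areas are inhibited. win is an assumption."""
    out = feature_map.astype(np.float64).copy()
    h, w = gray_image.shape
    for y in range(0, h - win + 1, win):
        for x in range(0, w - win + 1, win):
            patch = gray_image[y:y + win, x:x + win]
            e = energy(glcm(patch))                 # close to 1 for uniform patches
            out[y:y + win, x:x + win] *= (1.0 - e)  # suppress uniform areas
    return out
```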
4. The saliency structures model and descriptor
Human visual attention consists of a pre-attentive and an attentive stage according to Treisman's feature integration theory [4]. In the pre-attentive stage, only "pop-out" features are detected, whereas in the attentive stage, relationships between various features are found and grouped [4,14]. In this paper, a saliency structure model is proposed for content-based image retrieval according to Treisman's feature integration theory [4] and Julesz's texton theory [49,50]. In feature extraction and image representation, the orientation-selective mechanism derived from the work of Hubel and Wiesel is used in our model [1]. Color, intensity and orientation are considered the primary visual features and are commonly used in many saliency models [4,5]. In order to detect "pop-out" features, a novel visual cue, namely color volume, together with edge information, is introduced into our saliency model and used to detect salient regions.
It is crucially important to emphasize that the saliency structure model can be considered an improved version of the micro-structures model, obtained by combining a bottom-up component of visual attention with the orientation-selective mechanism. The saliency structures are defined as bar-shaped structures according to the orientation-selective mechanism, using oriented Gabor filters, whereas micro-structures are defined as the collection of certain underlying colors [3]. The basic principle of the proposed descriptor is to generate three-tuple histograms from the bar-shaped structures and oriented Gabor filters in a specific way, whereas the micro-structure descriptor adopts a probability statistics method to describe features.
The flow diagram of the proposed saliency model within the CBIR framework is illustrated in Fig. 2.
In the proposed saliency model within the CBIR framework, we mainly focus on: (1) the construction of the saliency structure model and (2) image representation. The construction of the saliency structure model mainly consists of three stages: (a) extraction of the primary visual features, (b) saliency map detection and (c) the combination of bar-shaped structures and oriented Gabor filters for saliency structure detection.
4.1. Extraction of the primary visual features
The human visual system is highly sensitive to color, orientation and intensity information [5]. In many visual saliency models, color is implemented as R-G (red-green) and B-Y (blue-yellow) channels, inspired by the color-opponent neurons of the V1 cortex [5,13]. The average of the three color channels is usually used as intensity. Orientation is often implemented as a convolution with oriented Gabor filters.
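As a hedged illustration of the orientation channel, the sketch below convolves an intensity image with a small bank of oriented Gabor kernels using OpenCV; the four orientations and the kernel parameters are illustrative assumptions, not the authors' settings.

```python
import cv2
import numpy as np

def orientation_maps(gray, n_orientations=4, ksize=15):
    """One Gabor response map per orientation (0, 45, 90, 135 degrees here).
    Kernel parameters are illustrative, not taken from the paper."""
    maps = []
    for k in range(n_orientations):
        theta = k * np.pi / n_orientations
        kernel = cv2.getGaborKernel((ksize, ksize), sigma=4.0, theta=theta,
                                    lambd=10.0, gamma=0.5, psi=0)
        maps.append(cv2.filter2D(gray.astype(np.float32), -1, kernel))
    return maps
```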
It is well known that the HSV color space mimics human color perception well. In order to extract the primary visual features for image representation and to simplify manipulation, the quantization of visual features needs to be implemented in the HSV color space. For example, the task of color quantization is to select and assign a limited set of colors to represent a given color image with maximum fidelity [44]. Color quantization techniques are fully described in many digital image processing books and will not be detailed here.
In order to obtain the color map, the H, S and V color channels are uniformly quantized into 6, 3 and 3 bins, respectively, so that in total $6 \times 3 \times 3 = 54$ color combinations are obtained. $M_C(x, y)$ denotes the color combinations or color map, as $M_C(x, y) = w$, $w \in \{0, 1, \ldots, N_C - 1\}$, where $N_C = 54$ in this paper.
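A minimal sketch of this 6/3/3 quantization, assuming OpenCV's HSV value ranges (H in [0, 180), S and V in [0, 256)); the intensity map described next can be obtained analogously by quantizing the V channel alone into 16 bins.

```python
import cv2
import numpy as np

def color_map(bgr_image):
    """Quantize H, S, V into 6, 3 and 3 uniform bins and combine them into
    a single color-map index M_C(x, y) in {0, ..., 53}."""
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    h = hsv[..., 0].astype(np.int32)        # OpenCV: H in [0, 180)
    s = hsv[..., 1].astype(np.int32)        # S in [0, 256)
    v = hsv[..., 2].astype(np.int32)        # V in [0, 256)
    h_bin = np.minimum(h * 6 // 180, 5)
    s_bin = s * 3 // 256
    v_bin = v * 3 // 256
    return h_bin * 9 + s_bin * 3 + v_bin    # 6 * 3 * 3 = 54 codes
```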
Intensity information is given by the V color channel. After uniform quantization, we can obtain the intensity map $M_I(x, y)$, as $M_I(x, y) = s$, $s \in \{0, 1, \ldots, N_I - 1\}$, where $N_I = 16$. Since the computational