1552 IEEE SIGNAL PROCESSING LETTERS, VOL. 22, NO. 10, OCTOBER 2015
Visual Saliency Detection With Free Energy Theory
Ke Gu, Student Member, IEEE, Guangtao Zhai, Member, IEEE,WeisiLin, Senior Member, IEEE,
Xiaokang Yang, Senior Member, IEEE, and Wenjun Zhang, Fellow, IEEE
Abstract—Visual saliency can be thought of as a product of human brain activity. Most existing models are built upon local features, global features, or both. Lately, the so-called free energy principle has unified several brain theories within one framework, and tells, through a psychological measure, which parts of a visual stimulus easily surprise human viewers. We believe that this “surprise” should be highly related to visual saliency, and thereby introduce a novel computational Free Energy inspired Saliency detection technique (FES). Our method computes the local entropy of the gap between an input image signal and its predicted counterpart, which is reconstructed from the input with a semi-parametric model. Experimental results show that our algorithm predicts human fixation points accurately and is superior to classical and state-of-the-art competitors.
Index Terms—Bi-lateral filtering, free energy, linear autoregres-
sive (AR) model, saliency detection, semi-parametric model.
I. INTRODUCTION
SALIENCY detection is an active and important research topic in both the image processing and computer vision communities. In many applications of graphics, design, and human-computer interaction, we are strongly concerned with where human beings look in a scene, i.e., where the saliency spots are located. Visual saliency can promote the study of quality assessment [1], [2], object recognition [3], [4], and computer graphics [5]. Hence an efficient and effective computational model is eagerly required to detect salient areas in an encountered scene.
Several hundred saliency detection models have been proposed during the past 25 years [6], and this number is expected to keep growing quickly. Existing methods are divided into two types according to distinct attentional mechanisms: 1) top-down, task-dependent methods; 2) bottom-up, stimulus-driven methods. Because top-down approaches require prior knowledge about the visual content, bottom-up approaches that only use information from the visual signal itself have been broadly and deeply researched.
Manuscript received February 03, 2015; revised March 06, 2015; accepted March 13, 2015. Date of publication March 18, 2015; date of current version March 24, 2015. This work was supported in part by the National Science Foundation of China under Grants 61025005, 61371146, 61221001, and 61390514, by the Foundation for the Author of National Excellent Doctoral Dissertation of China under Grant 201339, and by the Shanghai Municipal Commission of Economy and Informatization under Grant 140310. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Zhengguo Li.
K. Gu, G. Zhai, X. Yang, and W. Zhang are with the Institute of Image Communication and Information Processing, Shanghai Key Laboratory of Digital Media Processing and Transmissions, Shanghai Jiao Tong University, Shanghai 200240, China (e-mail: guke.doctor@gmail.com; zhaiguangtao@sjtu.edu.cn; xkyang@sjtu.edu.cn; zhangwenjun@sjtu.edu.cn).
W. Lin is with the School of Computer Engineering, Nanyang Technological University, Singapore 639798 (e-mail: wslin@ntu.edu.sg).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/LSP.2015.2413944
In this letter we concentrate on bottom-up methods. Many techniques in this class are designed to seek locations with maximum local saliency and employ biologically motivated local features [7]–[10]. These features, which mainly consist of intensity, edge, texture, color, and orientation, are inspired by neural responses in the lateral geniculate nucleus and the V1 cortex. The benchmark Itti model [7] provides a general architecture for detecting visual saliency. This model works by first subsampling an input image into a Gaussian pyramid, decomposing each pyramid level into separate channels for color, intensity, and orientation, and then summing and normalizing the maps in each channel across scales to yield the final saliency map.
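The pyramid-plus-center-surround pipeline of the Itti model can be sketched roughly as follows. This is a deliberately simplified stand-in (intensity channel only, a plain range normalization, and arbitrarily chosen center-surround level pairs), not the full model of [7]:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def itti_like_saliency(gray, n_levels=7, cs_pairs=((2, 5), (2, 6), (3, 6))):
    """Simplified Itti-style saliency on a single intensity channel."""
    # Gaussian pyramid: repeatedly blur and downsample by 2.
    pyr = [gray.astype(np.float64)]
    for _ in range(n_levels - 1):
        pyr.append(gaussian_filter(pyr[-1], sigma=1.0)[::2, ::2])

    h, w = pyr[2].shape  # common resolution for all feature maps
    sal = np.zeros((h, w))
    for c, s in cs_pairs:
        # Resample center (fine) and surround (coarse) levels to a common size.
        center = zoom(pyr[c], (h / pyr[c].shape[0], w / pyr[c].shape[1]), order=1)
        surround = zoom(pyr[s], (h / pyr[s].shape[0], w / pyr[s].shape[1]), order=1)
        fmap = np.abs(center - surround)       # center-surround difference
        rng = fmap.max() - fmap.min()
        if rng > 0:                            # simple per-map normalization
            fmap = (fmap - fmap.min()) / rng
        sal += fmap                            # sum across scales
    return sal / len(cs_pairs)
```

The across-scale summation at one common resolution mirrors the "summing and normalizing maps across scales" step described above, with the color and orientation channels omitted for brevity.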
Some other relevant algorithms depend on global features [11]–[15]. These techniques mainly attempt to find regions of a visual signal that exhibit unique frequencies in transform domains. This lets such algorithms quickly and precisely detect visual “pop-outs” on the basis of global considerations, and thus locate likely salient objects. The classical spectral residual (SR) model [11] was established upon the finding that more high-frequency than low-frequency information is stored in the residual of the Fourier amplitude spectrum, and this residual spectrum is used to constitute the saliency map.
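A minimal version of the SR idea can be written in a few lines: take the residual between the log amplitude spectrum and its local average, keep the original phase, and invert the transform. The filter sizes and smoothing parameter below are illustrative choices, not the exact settings of [11]:

```python
import numpy as np
from scipy.ndimage import uniform_filter, gaussian_filter

def spectral_residual_saliency(gray, sigma=3.0):
    """Spectral-residual-style saliency map for a grayscale image."""
    f = np.fft.fft2(gray.astype(np.float64))
    log_amp = np.log(np.abs(f) + 1e-12)        # log amplitude spectrum
    phase = np.angle(f)                        # phase spectrum, kept as-is
    # Residual: log spectrum minus its local (3x3) average.
    residual = log_amp - uniform_filter(log_amp, size=3)
    # Back to the image domain; squared magnitude, then smoothing.
    sal = np.abs(np.fft.ifft2(np.exp(residual + 1j * phase))) ** 2
    return gaussian_filter(sal, sigma)
```

Because the residual suppresses the smooth, statistically redundant part of the spectrum, the reconstructed image concentrates energy at globally unusual regions, i.e., the "pop-outs" mentioned above.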
Recently, the adoption of only local or only global features was found to be somewhat limited. Thus, an increasing number of recent studies have been devoted to incorporating both types of features for saliency detection [16]–[20]. Most of them were developed based on complementary strategies, thereby gaining substantially higher performance. In [18], the authors took into account local and global image patch rarities (LG) as two complementary processes to design their saliency detection model. In [19], the content-aware saliency detection (CAS) model combines four basic principles of human visual attention, i.e., local low-level considerations, global considerations, visual organization rules, and high-level factors.
It is human viewers who decide visual saliency, and thus the most valid technique should closely approximate the response of the human brain to visual stimuli. Friston has lately unified several brain theories within the free-energy framework, which indicates that the brain's inference process always attempts to infer the meaningful part of a visual stimulus by removing the uncertainty [21]. It is natural that there exists a gap between the real scene and the brain's prediction, because the internal generative model cannot be universal. It is this gap that surprises human viewers, and thus attracts much more human attention. Therefore, we hypothesize that this gap (i.e., the “surprise”) highly correlates with visual saliency. Based on this postulation, this letter designs a new computational Free Energy inspired Saliency detection model (FES). Our work computes the local entropy of the gap between an image and its predicted version, reconstructed from the input by a semi-parametric model, which fuses the parametric autoregressive (AR)
1070-9908 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
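The core computation sketched in the abstract, local entropy of the residual between an image and its predicted version, might look roughly as follows. Note that the authors' semi-parametric AR/bilateral predictor is replaced here by a plain Gaussian-blur predictor, so this is only an illustrative stand-in for the "surprise" map, not the FES model itself; the patch size and bin count are arbitrary:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def local_entropy_saliency(gray, patch=16, bins=32):
    """Local entropy of a prediction residual (illustrative stand-in for FES)."""
    gray = gray.astype(np.float64)
    predicted = gaussian_filter(gray, sigma=2.0)  # stand-in for the predictor
    residual = gray - predicted                   # the "surprise" gap
    h, w = gray.shape
    sal = np.zeros((h // patch, w // patch))
    for i in range(sal.shape[0]):
        for j in range(sal.shape[1]):
            block = residual[i * patch:(i + 1) * patch,
                             j * patch:(j + 1) * patch]
            hist, _ = np.histogram(block, bins=bins)
            p = hist / hist.sum()
            p = p[p > 0]
            sal[i, j] = -np.sum(p * np.log2(p))   # Shannon entropy per patch
    return sal
```

Patches where the predictor fails badly yield broad residual histograms and hence high entropy, which is exactly the intuition behind equating the prediction gap with saliency.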