Learning to Detect A Salient Object

Tie Liu (1), Jian Sun (2), Nan-Ning Zheng (1), Xiaoou Tang (2), Heung-Yeung Shum (2)
(1) Xi’an Jiaotong University, Xi’an, P.R. China
(2) Microsoft Research Asia, Beijing, P.R. China
Abstract

We study visual attention by detecting a salient object in an input image. We formulate salient object detection as an image segmentation problem, where we separate the salient object from the image background. We propose a set of novel features, including multi-scale contrast, center-surround histogram, and color spatial distribution, to describe a salient object locally, regionally, and globally. A Conditional Random Field is learned to effectively combine these features for salient object detection. We also constructed a large image database containing tens of thousands of carefully labeled images by multiple users. To our knowledge, it is the first large image database for quantitative evaluation of visual attention algorithms. We validate our approach on this image database, which is publicly available with this paper.
1. Introduction
“Everyone knows what attention is...”
—William James, 1890
The human brain and visual system pay more attention to some parts of an image. Visual attention has been studied by researchers in physiology, psychology, neural systems, and computer vision for a long time. There are many applications for visual attention, for example, automatic image cropping [23], adaptive image display on small devices [4], image/video compression, advertising design [7], and image collection browsing. Recent studies [18, 22, 26] demonstrated that visual attention helps object recognition, tracking, and detection as well.
Figure 1. Salient map. From top to bottom: input image, salient map computed by Itti’s algorithm (http://www.saliencytoolbox.net), and salient map computed by our approach.

Most existing visual attention approaches are based on the bottom-up computational framework [3, 6, 8, 9, 10, 11, 19, 25] because visual attention is in general unconsciously driven by low-level stimuli in the scene such as intensity, contrast, and motion. These approaches consist of the following three steps. The first step is feature extraction, in which multiple low-level visual features, such as intensity, color, orientation, texture, and motion, are extracted from the image at multiple scales. The second step is saliency computation. The saliency is computed by a center-surround operation [10], self-information [3], or a graph-based random walk [6] using multiple features. After normalization and linear/non-linear combination, a master map [24] or a salient map [11] is computed to represent the saliency of each image pixel. Last, a few key locations on the saliency map are identified by winner-take-all, inhibition-of-return, or other non-linear operations. While these approaches have worked well in finding a few fixation locations in both synthetic and natural images, they have not been able to accurately detect where visual attention should be.
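The three-step bottom-up pipeline above can be illustrated with a minimal, intensity-only center-surround sketch. This is a simplified, hypothetical rendition rather than Itti’s actual implementation: the pyramid uses 2×2 average pooling in place of Gaussian filtering, a single feature channel stands in for the full feature set, and the function names are ours.

```python
import numpy as np

def build_pyramid(img, levels):
    """Build an image pyramid by repeated 2x2 average pooling
    (a crude stand-in for Gaussian blurring and subsampling)."""
    pyr = [img.astype(float)]
    for _ in range(levels - 1):
        h, w = pyr[-1].shape
        h2, w2 = (h // 2) * 2, (w // 2) * 2
        down = pyr[-1][:h2, :w2].reshape(h2 // 2, 2, w2 // 2, 2).mean(axis=(1, 3))
        pyr.append(down)
    return pyr

def upsample_to(img, shape):
    """Nearest-neighbour upsample back to a reference resolution."""
    ys = (np.arange(shape[0]) * img.shape[0] // shape[0]).clip(0, img.shape[0] - 1)
    xs = (np.arange(shape[1]) * img.shape[1] // shape[1]).clip(0, img.shape[1] - 1)
    return img[np.ix_(ys, xs)]

def center_surround_saliency(intensity, center_levels=(0, 1), delta=2):
    """Center-surround differences |center - surround| between a fine
    pyramid level (center) and a coarser one (surround), summed across
    scales and normalized to [0, 1] as a single saliency map."""
    pyr = build_pyramid(intensity, max(center_levels) + delta + 1)
    sal = np.zeros_like(pyr[0])
    for c in center_levels:
        center = upsample_to(pyr[c], pyr[0].shape)
        surround = upsample_to(pyr[c + delta], pyr[0].shape)
        sal += np.abs(center - surround)
    rng = sal.max() - sal.min()
    return (sal - sal.min()) / rng if rng > 0 else sal

# A bright square on a dark background: the response concentrates on
# high-contrast boundaries, mirroring the behaviour criticized in the text.
img = np.zeros((64, 64))
img[22:38, 22:38] = 1.0
saliency = center_surround_saliency(img)
```

Note that on this toy input the largest responses sit where coarse surround blocks straddle the square’s boundary, which is exactly the boundary-and-texture bias discussed next.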
For instance, the middle row in Figure 1 shows three salient maps computed using Itti’s algorithm [10]. Notice that the saliency concentrates on several small local regions with high-contrast structures, e.g., the background grid in (a), the shadow in (b), and the foreground boundary in (c). Although the leaf in (a) commands much attention, the saliency for the leaf is low. Therefore, these salient maps computed from low-level features are not a good indication of where a user’s attention is while perusing these images.
In this paper, we incorporate the high-level concept of a salient object into the process of visual attention computation. In Figure 1, the leaf, car, and woman attract the most visual attention in each respective image. We call them salient objects, or foreground objects that we are familiar with. As can