Salient Region Detection Based on Binocular Vision
Zhong LIU Weihai CHEN Yuhua ZOU Xingming WU
School of Automation Science and Electrical Engineering
BeiHang University
Beijing 100191, China.
lzpro@126.com, whchenbuaa@126.com, chenyusiyuan@126.com
Abstract—Selective visual attention is a kind of mechanism of the
primate visual system for rapidly focusing on attractive objects
or regions in the visual environment. Numerous visual attention
models have been developed and optimized over the past decades.
Most existing models concentrate on static monocular
images, and little attention has been devoted to stereo depth
information, which is an important aspect of human perception.
A region-based binocular saliency detection approach
considering depth information is proposed in this paper. The
difference between the left and right images is used to compute a
disparity map and a coarse saliency map. The hue, saturation, and
intensity (HSI) color space is adopted, and the mean-shift algorithm
is used for image segmentation. This study shows that the proposed
region-based saliency computation method can effectively detect
salient regions and that, because of its simplicity, it is more
suitable for real-time applications such as obstacle detection and
visual navigation.
Keywords- saliency; visual attention; binocular; segmentation
I. INTRODUCTION
Selective visual attention is one of the most important and
effective mechanisms of the primate visual system. It can be
considered a biological process of selecting, from a large
amount of visual information, the most valuable portion to
operate on. This remarkable function lets primates rapidly direct
their gaze to things of interest, such as fire, light, food,
and other attractive regions. Since saliency is a crucial factor in
human visual tasks, it has long been a research topic of great
interest studied by researchers in physiology, psychology, and
neural systems. Although a great deal of effort has been
made, the underlying neural mechanisms of visual saliency
remain unclear. Some evidence does shed light on the approximate
process of visual attention. Visual information
proceeds along two parallel pathways, a dorsal stream
and a ventral stream. The former is related to focusing
attention on regions or objects in a scene. The latter is
responsible for identification and recognition tasks. Biological
visual selection is usually divided into two complementary
mechanisms. One is fast, pre-attentive, bottom-up visual
attention. The other is slower, top-down visual attention, which
is task-dependent. In this paper, the rapid, saliency-driven,
bottom-up attention is considered.
Over the past decades, numerous visual attention
computational models have been proposed and many different
algorithms have been developed. These algorithms can be
broadly classified as biologically based, purely
computational, or a combination of the two. Most existing bottom-up
attention models construct a saliency map to reflect the salience
of each key region in a scene. The model of Itti et al. [1] is
derived from a biologically plausible architecture which is
based on a neurobiological framework introduced by Koch and
Ullman [2]. Itti’s model computes saliency maps for features of
luminance, color, and orientation at different scales using the
feature integration theory. The various scales are then used to
perform center-surround operations [3] using a Difference of
Gaussians (DoG) approach. Then, the center-surround maps are
blended to produce conspicuity maps, one for each feature
(intensity, color, and orientation). Finally,
these conspicuity maps are blended into a saliency map. Because
of its clear biological basis, Itti’s model has been widely
applied in fields such as image compression, object
detection, and image segmentation.
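
The center-surround operation mentioned above can be illustrated with a minimal single-scale sketch. This is not Itti's implementation, which works across the levels of Gaussian pyramids. It only shows the underlying idea: a finely blurred "center" image is compared with a coarsely blurred "surround" image. The function names and the values of sigma_center and sigma_surround are assumptions chosen for illustration, and the image is assumed to be a 2-D list of floats.

```python
import math


def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian kernel truncated at about three sigma."""
    radius = max(1, math.ceil(3 * sigma))
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur(image, sigma):
    """Separable Gaussian blur of a 2-D list of floats (borders are clamped)."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    height, width = len(image), len(image[0])

    def clamp(i, n):
        return min(max(i, 0), n - 1)

    # Horizontal pass.
    rows = [[sum(kernel[k + radius] * image[y][clamp(x + k, width)]
                 for k in range(-radius, radius + 1))
             for x in range(width)]
            for y in range(height)]
    # Vertical pass over the horizontally blurred rows.
    return [[sum(kernel[k + radius] * rows[clamp(y + k, height)][x]
                 for k in range(-radius, radius + 1))
             for x in range(width)]
            for y in range(height)]


def center_surround(image, sigma_center=1.0, sigma_surround=4.0):
    """Absolute Difference of Gaussians: |fine blur - coarse blur| per pixel.

    The two sigma values are illustrative assumptions, not values from [1].
    """
    center = blur(image, sigma_center)
    surround = blur(image, sigma_surround)
    return [[abs(c - s) for c, s in zip(row_c, row_s)]
            for row_c, row_s in zip(center, surround)]
```

Under these assumptions, an isolated bright spot on a uniform background gives the strongest response at the spot itself, while a perfectly uniform image gives a response of approximately zero everywhere.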
Achanta et al. [4] propose a purely computational model
which computes local multiscale color and luminance feature
contrast to generate a saliency map. Ma and Zhang [5] propose
an alternative local contrast-based model, not based on any
biological model, that obtains a saliency map by summing the
differences between each image pixel and its surrounding pixels
in a small neighborhood. Bruce and Tsotsos [6], [7] use
Shannon’s self-information measure to compute visual
saliency, based on an information maximization theory
that represents a biologically plausible model of saliency
detection. Harel’s model [8] is graph-based, computing saliency
from distance-weighted multi-scale feature dissimilarity maps.
Guo and Zhang [9] introduce a model using the phase spectrum
of the quaternion Fourier transform (PQFT). Each pixel of the
image is represented by a quaternion that consists of color,
intensity and motion features. Bian and Zhang [15] use spectral
whitening (SW) as a normalization procedure, which represents
salient features and localized motion. This approach effectively
suppresses redundant background information and ego-motion,
which reflects a principle of the human visual system.
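
Two of the ideas surveyed above are simple enough to sketch in a few lines: local contrast in a small neighborhood, as in the description of Ma and Zhang's model, and Shannon self-information, as in the description of Bruce and Tsotsos's model. The sketch below is a loose illustration under strong simplifying assumptions, not either published algorithm. In particular, the self-information here is estimated from a histogram of quantized pixel intensities, whereas the original work operates on richer local features. The function names, the radius, the number of bins, and the assumption that intensities lie in [0, 1] are all hypothetical choices.

```python
import math
from collections import Counter


def local_contrast_saliency(image, radius=1):
    """Sum of absolute differences between each pixel and its neighbors.

    Neighbors are the pixels inside a (2*radius + 1)^2 window, clipped at
    the image borders. `image` is a 2-D list of floats.
    """
    height, width = len(image), len(image[0])
    saliency = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = 0.0
            for ny in range(max(0, y - radius), min(height, y + radius + 1)):
                for nx in range(max(0, x - radius), min(width, x + radius + 1)):
                    total += abs(image[y][x] - image[ny][nx])
            saliency[y][x] = total
    return saliency


def self_information_saliency(image, bins=8):
    """Per-pixel self-information -log2 p(bin) of the quantized intensity.

    p is estimated from the image's own intensity histogram, so rare
    intensities receive high saliency. Intensities are assumed in [0, 1].
    """
    def to_bin(value):
        return min(int(value * bins), bins - 1)

    counts = Counter(to_bin(v) for row in image for v in row)
    n_pixels = sum(counts.values())
    return [[-math.log2(counts[to_bin(v)] / n_pixels) for v in row]
            for row in image]
```

In both functions a pixel that differs from its surroundings, or whose value is rare in the image, scores higher than a pixel in a uniform background, which is the common intuition behind these bottom-up measures.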
Most current visual attention models process monocular
images. The procedures proposed by these models are mostly
computationally expensive, as the corresponding processes carried
out in the brain are highly complex. The majority of
previous research on visual saliency detection has focused on
computing the saliency of each pixel, which yields results with low
resolution and poorly defined borders and is expensive to compute.
To simplify the computation and sharpen the
salient region boundaries, a region-based approach is proposed
This work is supported by the National Natural Science Foundation of China
under Grant Nos. 61075075 and 61175108, and the National High Technology Research
and Development Program of China under Grant No. 2011AA040902.
978-1-4577-2119-9/12/$26.00 © 2011 IEEE