A Model of Saliency-Based Visual Attention
for Rapid Scene Analysis
Laurent Itti, Christof Koch, and Ernst Niebur
Abstract—A visual attention system, inspired by the behavior and the
neuronal architecture of the early primate visual system, is presented.
Multiscale image features are combined into a single topographical
saliency map. A dynamical neural network then selects attended
locations in order of decreasing saliency. The system breaks down the
complex problem of scene understanding by rapidly selecting, in a
computationally efficient manner, conspicuous locations to be analyzed
in detail.
Index Terms—Visual attention, scene analysis, feature extraction,
target detection, visual search.
1 INTRODUCTION
PRIMATES have a remarkable ability to interpret complex scenes in real time, despite the limited speed of the neuronal hardware available for such tasks. Intermediate and higher visual processes appear to select a subset of the available sensory information before further processing [1], most likely to reduce the complexity of scene analysis [2]. This selection appears to be implemented in the form of a spatially circumscribed region of the visual field, the so-called "focus of attention," which scans the scene both in a rapid, bottom-up, saliency-driven, and task-independent manner as well as in a slower, top-down, volition-controlled, and task-dependent manner [2].
Models of attention include "dynamic routing" models, in which information from only a small region of the visual field can progress through the cortical visual hierarchy. The attended region is selected through dynamic modifications of cortical connectivity or through the establishment of specific temporal patterns of activity, under both top-down (task-dependent) and bottom-up (scene-dependent) control [3], [2], [1].
The model used here (Fig. 1) builds on a second biologically-plausible architecture, proposed by Koch and Ullman [4] and at the basis of several models [5], [6]. It is related to the so-called "feature integration theory," explaining human visual search strategies [7]. Visual input is first decomposed into a set of topographic feature maps. Different spatial locations then compete for saliency within each map, such that only locations which locally stand out from their surround can persist. All feature maps feed, in a purely bottom-up manner, into a master "saliency map," which topographically codes for local conspicuity over the entire visual scene. In primates, such a map is believed to be located in the posterior parietal cortex [8] as well as in the various visual maps in the pulvinar nuclei of the thalamus [9]. The model's saliency map is endowed with internal dynamics which generate attentional shifts. This model consequently represents a complete account of bottom-up saliency and does not require any top-down guidance to shift attention. This framework provides a massively parallel method for the fast selection of a small number of interesting image locations to be analyzed by more complex and time-consuming object-recognition processes. Extending this approach in "guided-search," feedback from higher cortical areas (e.g., knowledge about targets to be found) was used to weight the importance of different features [10], such that only those with high weights could reach higher processing levels.
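The bottom-up selection scheme can be sketched informally as follows. This is only a minimal illustration, not the dynamical neural network of the model: the per-map normalization, the argmax read-out, and the fixed suppression radius standing in for inhibition of return are all assumptions of this sketch.

    import numpy as np

    def select_salient_locations(feature_maps, n_shifts=5, inhibition_radius=20):
        # Combine feature maps (2D arrays of equal shape) into one master
        # saliency map and return n_shifts locations visited in order of
        # decreasing saliency.  The winner-take-all / inhibition-of-return
        # dynamics of the full model are only approximated here by an argmax
        # followed by local suppression.
        saliency = np.zeros_like(feature_maps[0], dtype=float)
        for m in feature_maps:
            m = m.astype(float)
            rng = m.max() - m.min()
            if rng > 0:
                saliency += (m - m.min()) / rng   # normalize each map to [0, 1]

        h, w = saliency.shape
        yy, xx = np.mgrid[0:h, 0:w]
        fixations = []
        for _ in range(n_shifts):
            y, x = np.unravel_index(np.argmax(saliency), saliency.shape)
            fixations.append((y, x))
            # Suppress a disk around the attended location so that attention
            # shifts to the next most salient location.
            saliency[(yy - y) ** 2 + (xx - x) ** 2 <= inhibition_radius ** 2] = 0.0
        return fixations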
2 MODEL
Input is provided in the form of static color images, usually digitized at 640 × 480 resolution. Nine spatial scales are created using dyadic Gaussian pyramids [11], which progressively low-pass filter and subsample the input image, yielding horizontal and vertical image-reduction factors ranging from 1:1 (scale zero) to 1:256 (scale eight) in eight octaves.
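A minimal sketch of such a dyadic pyramid, assuming a separable 5-tap binomial kernel as the low-pass filter and scipy for the convolution (the paper does not prescribe a particular kernel or implementation):

    import numpy as np
    from scipy.ndimage import convolve

    def gaussian_pyramid(image, levels=9):
        # Dyadic Gaussian pyramid: level 0 is the input image; each further
        # level is low-pass filtered and subsampled 2:1, so level 8 is
        # reduced by a factor of 256 in each dimension.
        k = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0   # separable binomial kernel
        pyramid = [image.astype(float)]
        for _ in range(1, levels):
            prev = pyramid[-1]
            blurred = convolve(prev, k[None, :], mode='nearest')     # horizontal pass
            blurred = convolve(blurred, k[:, None], mode='nearest')  # vertical pass
            pyramid.append(blurred[::2, ::2])                        # 2:1 subsampling
        return pyramid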
Each feature is computed by a set of linear "center-surround" operations akin to visual receptive fields (Fig. 1): Typical visual neurons are most sensitive in a small region of the visual space (the center), while stimuli presented in a broader, weaker antagonistic region concentric with the center (the surround) inhibit the neuronal response. Such an architecture, sensitive to local spatial discontinuities, is particularly well-suited to detecting locations which stand out from their surround and is a general computational principle in the retina, lateral geniculate nucleus, and primary visual cortex [12]. Center-surround is implemented in the model as the difference between fine and coarse scales: The center is a pixel at scale c ∈ {2, 3, 4}, and the surround is the corresponding pixel at scale s = c + δ, with δ ∈ {3, 4}. The across-scale difference between two maps, denoted "⊖" below, is obtained by interpolation to the finer scale and point-by-point subtraction. Using several scales not only for c but also for δ = s − c yields truly multiscale feature extraction, by including different size ratios between the center and surround regions (contrary to previously used fixed ratios [5]).
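A sketch of the across-scale difference, assuming the interpolation to the finer scale is done bilinearly with scipy.ndimage.zoom and that the resulting contrast is taken as an absolute difference (both are implementation choices of this sketch, not prescriptions of the paper); the pyramid argument is a list of maps such as the one built above:

    import numpy as np
    from scipy.ndimage import zoom

    def center_surround(pyramid, c, delta):
        # Across-scale difference between center scale c and surround scale
        # s = c + delta: interpolate the coarser (surround) map up to the
        # resolution of the finer (center) map and subtract point by point.
        center = pyramid[c]
        surround = pyramid[c + delta]
        factors = (center.shape[0] / surround.shape[0],
                   center.shape[1] / surround.shape[1])
        surround_up = zoom(surround, factors, order=1)   # bilinear interpolation
        # Guard against one-pixel size mismatches from repeated subsampling.
        surround_up = surround_up[:center.shape[0], :center.shape[1]]
        return np.abs(center - surround_up)

    # Six center-surround maps per feature: c in {2, 3, 4}, delta in {3, 4}.
    # maps = [center_surround(pyr, c, d) for c in (2, 3, 4) for d in (3, 4)]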
2.1 Extraction of Early Visual Features
With r, g, and b being the red, green, and blue channels of the input image, an intensity image I is obtained as I = (r + g + b)/3. I is
Fig. 1. General architecture of the model.
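Tying the above together, the intensity channel of Section 2.1 can be computed directly from the color planes and fed into the pyramid and center-surround sketches; gaussian_pyramid and center_surround are the hypothetical helpers introduced above, not names from the paper:

    import numpy as np

    def intensity_channel(rgb):
        # I = (r + g + b) / 3 from an H x W x 3 color image.
        r = rgb[..., 0].astype(float)
        g = rgb[..., 1].astype(float)
        b = rgb[..., 2].astype(float)
        return (r + g + b) / 3.0

    # Example use with the sketches above:
    # I = intensity_channel(image)            # e.g., a 640 x 480 color image
    # pyr = gaussian_pyramid(I, levels=9)     # scales 0..8
    # intensity_contrast = [center_surround(pyr, c, d)
    #                       for c in (2, 3, 4) for d in (3, 4)]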