A Combined Model for Scan Path in Pedestrian
Searching
Lijuan Duan, Zeming Zhao, Wei Ma*, Jili Gu,
Zhen Yang
College of Computer Science and Technology
Beijing University of Technology, China
{ljduan, mawei, yangzhen}@bjut.edu.cn
{zhaozeming, gujili}@emails.bjut.edu.cn
Yuanhua Qiao
College of Applied Science
Beijing University of Technology, China
qiaoyuanhua@bjut.edu.cn
Abstract—Target searching, i.e., rapidly locating target objects in images or videos, has attracted much attention in computer vision. A comprehensive understanding of the factors that influence human visual searching is essential for designing target searching algorithms for computer vision systems. In this paper, we propose a combined model that generates scan paths for computer vision systems to follow when searching for targets in images. The model explores and integrates three factors that influence human visual searching: top-down target information, spatial context, and bottom-up visual saliency. The effectiveness of the combined model is evaluated by comparing the generated scan paths with the fixation sequences of human observers locating targets in the same images. The same evaluation strategy is used to learn the optimal weighting coefficients of the factors through linear search. Meanwhile, the performance of each individual factor and of their arbitrary combinations is examined. Extensive experiments show that top-down target information is the most important factor influencing the accuracy of target searching, whereas the effect of bottom-up visual saliency is limited. Any combination of the three factors performs better than each single factor alone. The scan paths generated by the full combined model are the most similar to the human fixation sequences.
Keywords—visual attention; bottom-up visual saliency;
top-down target information; spatial context
I. INTRODUCTION
Human visual attention, one of the most important mechanisms in biological vision systems [1], [3], [4], guides us to rapidly locate targets of a specific kind in images. A comprehensive understanding of the factors that influence human visual searching is essential for designing computer vision systems. In this paper, we explore three factors, bottom-up visual saliency, top-down target information and spatial context, which influence how human vision systems search for targets (pedestrians) in images. These factors have been experimentally evaluated, separately or jointly, in the literature [5], [6], [7]. This paper presents a combined model that integrates the three factors with optimal weights to guide target searching for computer vision systems. The weights are learned by linear search [2]. The performance of the combined model in generating scan paths is evaluated by comparison with human scan paths.
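To make the combination concrete, the following minimal sketch (in Python, not from the paper) illustrates one plausible way to fuse three factor maps with weights constrained to sum to one, and to pick the weights by a coarse linear (grid) search scored against human scan paths. The function and parameter names, the fusion rule, and the scoring interface score_fn are all assumptions for illustration.

```python
import itertools
import numpy as np

def combine_maps(saliency, target, context, w):
    # Weighted fusion of the three factor maps (hypothetical form;
    # the paper's exact fusion rule may differ).
    guide = w[0] * saliency + w[1] * target + w[2] * context
    return guide / (guide.max() + 1e-8)  # normalize roughly to [0, 1]

def linear_search_weights(factor_maps, human_paths, score_fn, step=0.1):
    # Coarse linear (grid) search over weight triples summing to 1,
    # scored by similarity between generated and human scan paths.
    best_w, best_score = None, -np.inf
    for w1, w2 in itertools.product(np.arange(0.0, 1.0 + step, step), repeat=2):
        w3 = 1.0 - w1 - w2
        if w3 < -1e-9:
            continue
        w = (w1, w2, max(w3, 0.0))
        score = np.mean([score_fn(combine_maps(s, t, c, w), path)
                         for (s, t, c), path in zip(factor_maps, human_paths)])
        if score > best_score:
            best_w, best_score = w, score
    return best_w, best_score
```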
Psychological studies show that, at each moment, humans are attracted to salient parts of images [6], [8], [9]. Bottom-up saliency cues are therefore considered helpful for guiding computational visual searching, as experimentally demonstrated by Itti et al. [10]. On the other hand, during visual searching, humans fixate not only on the target but also on regions or objects whose shapes resemble the target [11], [12]. For example, when searching for a pedestrian, objects with a rectangular shape, or with a circle on top, attract attention. Spatial context provides rich cues to target positions for human vision [13], [14], [15], and it is widely used in object detection [14] and recognition [16].
Based on the above facts, this paper experimentally explores each factor and presents a method that combines them for efficient target searching in images. The proposed method is described in Section II, and the experimental evaluation follows in Section III.
This research is partially sponsored by the Natural Science Foundation of China (Nos. 61003105, 61175115 and 61370113), the Importation and Development of High-Caliber Talents Project of Beijing Municipal Institutions (CIT&TCD201304035), the Jing-Hua Talents Project of Beijing University of Technology (2014-JH-L06), the Ri-Xin Talents Project of Beijing University of Technology (2014-RX-L06), and the International Communication Ability Development Plan for Young Teachers of Beijing University of Technology (No. 2014-16).
Fig. 1. The workflow of scan path generation. The saliency map and target map are computed from the input image. The searching guide map is obtained by combining these maps with the spatial context map. At each round of fixation choosing, a winner-take-all (WTA) strategy is applied to select the next fixation.
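The fixation-choosing step of this workflow can be sketched as follows (Python, not from the paper): a winner-take-all pick of the global maximum of the guide map, repeated for a fixed number of fixations. Suppressing the visited neighborhood before the next round, and the suppression radius, are assumptions added here for illustration; the paper's exact rule may differ.

```python
import numpy as np

def generate_scan_path(guide_map, n_fixations=5, suppress_radius=30):
    # Winner-take-all (WTA) fixation choosing: repeatedly pick the global
    # maximum of the guide map. Zeroing the visited neighborhood afterwards
    # is an assumed inhibition-of-return-style step, not taken from the paper.
    g = guide_map.astype(float).copy()
    h, w = g.shape
    ys, xs = np.mgrid[0:h, 0:w]
    path = []
    for _ in range(n_fixations):
        y, x = np.unravel_index(np.argmax(g), g.shape)  # WTA pick
        path.append((int(x), int(y)))
        g[(ys - y) ** 2 + (xs - x) ** 2 <= suppress_radius ** 2] = 0.0
    return path
```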