Learning to predict where human gaze is using quaternion DCT based
regional saliency detection
Ting Li*, Yi Xu, Chongyang Zhang
Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University, Shanghai, China
Shanghai Key Laboratory of Digital Media Processing and Transmissions, Shanghai, China
ABSTRACT
Many current visual attention approaches use semantic features to capture human gaze accurately. However, these
approaches demand high computational cost and can hardly be applied in everyday use. Recently, some quaternion-based
saliency detection models, such as PQFT (phase spectrum of the Quaternion Fourier Transform) and QDCT (Quaternion
Discrete Cosine Transform), have been proposed to meet the real-time requirements of human gaze tracking tasks. However,
these methods apply PQFT and QDCT globally to locate jump edges of the input, so they can hardly detect object
boundaries accurately. To address this problem, we improve the QDCT-based saliency detection model by introducing a
superpixel-wise regional saliency detection mechanism. The local smoothness of the saliency value distribution is
emphasized to distinguish background noise from salient regions. We propose a measure called saliency confidence that
separates patches belonging to the salient object from those of the background by deciding whether image patches belong
to the same region: when an image patch belongs to a region consisting of other salient patches, this patch should be
salient as well. We therefore use the saliency confidence map to derive background and foreground weights for optimizing
the saliency map obtained by QDCT. The optimization is carried out with the least squares method and unifies local and
global saliency by combining QDCT with a measure of similarity between image superpixels. We evaluate our model on
four commonly used datasets (Toronto, MIT, OSIE and ASD) using standard precision-recall (PR) curves, mean absolute
error (MAE) and area under the curve (AUC). In comparison with most state-of-the-art models, our approach achieves
higher consistency with human perception without any training, locates human gaze accurately even against cluttered
backgrounds, and achieves a better compromise between speed and accuracy.
Keywords: saliency detection, superpixels, quaternion transform, optimization model
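The least-squares refinement described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the foreground/background weights and the pairwise smoothness weights below are toy values standing in for the QDCT-derived saliency confidence and superpixel similarity, and the closed-form linear system is one standard way to solve such an objective.

```python
import numpy as np

def refine_saliency(w_fg, w_bg, W):
    """Least-squares refinement of per-superpixel saliency values.

    Minimizes  sum_i w_bg[i]*s_i**2 + w_fg[i]*(s_i - 1)**2
             + sum_{i,j} W[i, j]*(s_i - s_j)**2
    (background term pulls toward 0, foreground term toward 1,
    smoothness term makes similar superpixels agree).
    Setting the gradient to zero gives a linear system.
    """
    # Graph Laplacian of the symmetric pairwise smoothness weights
    L = np.diag(W.sum(axis=1)) - W
    A = np.diag(w_bg + w_fg) + 2.0 * L
    # Right-hand side comes from the foreground (s_i - 1)^2 term
    return np.linalg.solve(A, w_fg)

# Toy example: 4 superpixels on a chain, the last two strongly foreground
w_fg = np.array([0.1, 0.1, 0.9, 0.9])
w_bg = 1.0 - w_fg
W = np.array([[0., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.]])
s = refine_saliency(w_fg, w_bg, W)
```

Because the system matrix is diagonally dominant with non-positive off-diagonal entries, the refined values stay in [0, 1] and increase smoothly from the background superpixels toward the foreground ones.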
1. INTRODUCTION
In the last two decades, saliency detection has been extensively studied in the fields of artificial intelligence, computer
vision and video analysis, providing visual attention cues for the solution of ill-posed problems. Many aspects attract
human attention, including bottom-up, top-down and knowledge-driven visual cues. In early efforts, many models were
motivated by the neural selective attention model of Koch & Ullman (1985). Itti et al.9 proposed a biologically inspired
model that uses multiple low-level features such as color, intensity and orientation at multiple scales together with a
center-surround mechanism. After a saliency map is computed for each feature channel, the maps are normalized and
combined into a master saliency map using a winner-take-all strategy. However, this model suffers from high
computational complexity and over-parameterization; moreover, it does not in fact match human saccades according to
eye-tracking data. To achieve higher consistency with the human visual system, other saliency detection models6,10,11
based on machine learning were proposed, for example the well-known Judd's model6, which uses a set of low-level,
mid-level and high-level image features and requires training with a linear support vector machine. This kind of
approach demands high computational cost due to the training step. However, human vision can effortlessly judge the
importance of image regions and locate a salient object without any training, even in a totally strange environment or
cluttered scene.
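The Itti-style pipeline described above (multi-scale features, center-surround differences, normalization, combination) can be sketched roughly as follows. This is a simplified illustration, not Itti et al.'s implementation: a separable box filter stands in for the Gaussian pyramid, a plain average stands in for the winner-take-all combination, and only the intensity channel is shown.

```python
import numpy as np

def box_blur(img, k):
    # Crude smoothing at "scale" k: separable-style box filter of radius k,
    # standing in for one level of a Gaussian pyramid
    pad = np.pad(img, k, mode='edge')
    out = np.zeros_like(img, dtype=float)
    h, w = img.shape
    for dy in range(-k, k + 1):
        for dx in range(-k, k + 1):
            out += pad[k + dy:k + dy + h, k + dx:k + dx + w]
    return out / (2 * k + 1) ** 2

def center_surround_saliency(intensity, scales=((1, 4), (1, 8), (2, 8))):
    """Center-surround feature maps, normalized and combined."""
    maps = []
    for c, s in scales:
        # Center-surround difference: fine scale minus coarse scale
        cs = np.abs(box_blur(intensity, c) - box_blur(intensity, s))
        rng = cs.max() - cs.min()
        if rng > 0:                      # normalize each feature map to [0, 1]
            cs = (cs - cs.min()) / rng
        maps.append(cs)
    return np.mean(maps, axis=0)         # combine into a master saliency map

# Toy input: a bright square on a dark background
img = np.zeros((32, 32))
img[12:20, 12:20] = 1.0
sal = center_surround_saliency(img)
```

On this toy input, the saliency map responds around the square (where center and surround disagree) and stays near zero in the uniform background, which is the behavior the center-surround mechanism is designed to produce.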
*E-mail: tina_ww@sjtu.edu.cn;
Proc. of SPIE Vol. 9217 92171K-1