Human detection based on pyramidal statistics of oriented filtering
and online learned scene geometrical model
Min Li*, Qi Hu, Yu Wang, Weishan Dong
IBM China Research Laboratory, Diamond A, Zhongguancun Software Park, Haidian District, Beijing 100193, China
article info
Article history:
Received 24 November 2011
Received in revised form 15 April 2012
Accepted 30 August 2012
Communicated by Liang Wang
Available online 26 September 2012
Keywords:
Object detection
Human detection
Pyramidal statistics of oriented filtering
Online-learned geometrical model
abstract
We study the problem of robust human detection. In this paper, a new descriptor, Pyramidal Statistics
of Oriented Filtering (PSOF), is proposed for human shape representation. Unlike traditional single-scale
gradient-based methods, the PSOF descriptor uses a Gabor filter bank to obtain multi-scale pixel-level
orientation information and represents object shape by locally normalized pyramidal statistics of these
Gabor responses, which makes it robust to image noise and blur. In addition, to exclude detection
outliers that violate perspective projection in image sequences, a geometrical model is learned online to
describe the relationship between an object’s average height and its foot-point coordinate.
Experimental results on both static images and video sequences show that the PSOF detector
performs much better than one of the state-of-the-art detectors.
© 2012 Elsevier B.V. All rights reserved.
1. Introduction
Human detection has drawn much attention in the computer vision
community during the last decade [19,7,15,14,28,3,8,29,22], because
humans are among the most important objects in many applications,
such as visual surveillance, intelligent transportation systems, HCI
(human-computer interaction) and robotics. However, human detection
still faces many challenges, including a wide range of human
poses, image blur and low-contrast imaging conditions.
Effective feature representation plays a key role in human
detection. Early efforts on human detection focused on Haar wavelet
features [19,15,27,28]. A Haar feature computes the gray-level difference
between adjacent regions at different scales, and can effectively describe
structures such as human eyes, nose, and lips. Viola and Jones [27]
improved the computational efficiency of Haar features with a
technique called the integral image, which can calculate a single Haar
feature at any scale with a constant computational cost. Efficient
feature computation and a cascade classifier structure made
Viola’s detector highly successful in face detection [27].
However, the classification performance of Haar features is poor
in real surveillance scenes because of large human-pose changes
and illumination variations. Since human shape is illumination-invariant
and distinctive compared with background structures,
recent work mainly focuses on shape-based human detection.
Dalal et al. [3] proposed a feature set called HOG (histograms
of oriented gradients), which uses locally normalized
histograms of gradients to represent shape information. Experimental
results show that the HOG descriptor provides much better
classification performance than Haar features for human detection
in complex scenes [3]. Zhu et al. [30] extended the HOG descriptor
and used a cascade classifier structure to increase detection
speed. Li et al. [9,11] further studied the performance of the HOG
descriptor in head–shoulder based human detection in crowded
scenes and found that HOG works much more effectively than the
SIFT (scale-invariant feature transform) [13] descriptor and
Haar features. In [29], a set of edgelet (a short segment of line or
curve) features is proposed to represent human shape and
exhibits good detection performance in crowded scenes.
Sabzmeydani and Mori [22] proposed a set of shapelet features
(mid-level features) generated from low-level gradient information
using AdaBoost for human detection. In [25], a pedestrian
detection method based on the covariance matrix descriptor [24] is
proposed and shows better performance on the INRIA dataset [3]
than the HOG descriptor, but an experimental study conducted by
Paisitkriangkrai et al. [18] shows that the covariance matrix
descriptor is slightly inferior to the HOG descriptor on the
DaimlerChrysler pedestrian dataset created in [16].
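To make the constant-cost claim about the integral image concrete, the following is a minimal sketch in Python with NumPy. It is not taken from [27] or from this paper; the function names (`integral_image`, `rect_sum`, `haar_two_rect`) and the choice of a horizontal two-rectangle feature are assumptions made purely for illustration. Once the summed-area table is built, the sum over any rectangle needs only four table lookups, regardless of the rectangle's size.

```python
import numpy as np


def integral_image(img):
    """Summed-area table, padded with a leading zero row and column.

    ii[r, c] holds the sum of img[:r, :c], so lookups at the image
    border need no special handling.
    """
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.astype(np.int64).cumsum(axis=0).cumsum(axis=1)
    return ii


def rect_sum(ii, top, left, height, width):
    """Sum of img[top:top+height, left:left+width] from four lookups."""
    bottom, right = top + height, left + width
    return int(ii[bottom, right] - ii[top, right]
               - ii[bottom, left] + ii[top, left])


def haar_two_rect(ii, top, left, height, width):
    """Illustrative two-rectangle Haar-like feature (hypothetical helper).

    Returns the sum over the left half of the window minus the sum over
    the right half. The width is assumed to be even.
    """
    half = width // 2
    left_sum = rect_sum(ii, top, left, height, half)
    right_sum = rect_sum(ii, top, left + half, height, half)
    return left_sum - right_sum
```

Because `rect_sum` never loops over pixels, evaluating the same feature on a window twice as large costs the same number of operations, which is the property the text attributes to the integral image.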
This paper aims to propose a human detection method that
not only has excellent detection performance under good imaging
conditions, as shown in Fig. 1(a), but also works well under poor
imaging conditions, such as blur and low contrast with heavy
noise, as shown in Fig. 1(b) and (c). Since features based on gradients
or edges are often sensitive to image noise or blur, we propose a
Neurocomputing
0925-2312/$ - see front matter © 2012 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.neucom.2012.08.025
* Corresponding author.
E-mail addresses: minliml@cn.ibm.com, ziwenwilliamson@gmail.com (M. Li),
huqihq@cn.ibm.com (Q. Hu), yuwangbj@cn.ibm.com (Y. Wang),
dongweis@cn.ibm.com (W. Dong).
Neurocomputing 101 (2013) 338–346