Histograms of Oriented Gradients for Human Detection
Navneet Dalal and Bill Triggs
INRIA Rhône-Alps, 655 avenue de l’Europe, Montbonnot 38334, France
{Navneet.Dalal,Bill.Triggs}@inrialpes.fr, http://lear.inrialpes.fr
Abstract
We study the question of feature sets for robust visual ob-
ject recognition, adopting linear SVM based human detec-
tion as a test case. After reviewing existing edge and gra-
dient based descriptors, we show experimentally that grids
of Histograms of Oriented Gradient (HOG) descriptors sig-
nificantly outperform existing feature sets for human detec-
tion. We study the influence of each stage of the computation
on performance, concluding that fine-scale gradients, fine
orientation binning, relatively coarse spatial binning, and
high-quality local contrast normalization in overlapping de-
scriptor blocks are all important for good results. The new
approach gives near-perfect separation on the original MIT
pedestrian database, so we introduce a more challenging
dataset containing over 1800 annotated human images with
a large range of pose variations and backgrounds.
1 Introduction
Detecting humans in images is a challenging task owing
to their variable appearance and the wide range of poses that
they can adopt. The first need is a robust feature set that
allows the human form to be discriminated cleanly, even in
cluttered backgrounds under difficult illumination. We study
the issue of feature sets for human detection, showing that lo-
cally normalized Histogram of Oriented Gradient (HOG) de-
scriptors provide excellent performance relative to other ex-
isting feature sets including wavelets [17,22]. The proposed
descriptors are reminiscent of edge orientation histograms
[4,5], SIFT descriptors [12] and shape contexts [1], but they
are computed on a dense grid of uniformly spaced cells and
they use overlapping local contrast normalizations for im-
proved performance. We make a detailed study of the effects
of various implementation choices on detector performance,
taking “pedestrian detection” (the detection of mostly visible
people in more or less upright poses) as a test case. For sim-
plicity and speed, we use a linear SVM as the baseline classifier
throughout the study. The new detectors give essentially per-
fect results on the MIT pedestrian test set [18,17], so we have
created a more challenging set containing over 1800 pedes-
trian images with a large range of poses and backgrounds.
Ongoing work suggests that our feature set performs equally
well for other shape-based object classes.
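The descriptor structure outlined above (gradient orientation histograms over a dense grid of cells, contrast-normalized in overlapping blocks) can be illustrated with a minimal pure-Python sketch. This is not the reference implementation: the function names, the centered-difference gradient, the 8-pixel cells, the 9 unsigned orientation bins, the 2x2-cell blocks with a one-cell stride, and the L2 normalization are all assumptions chosen for illustration, and refinements such as vote interpolation are omitted.

```python
import math

def cell_histograms(img, cell=8, bins=9):
    """Magnitude-weighted orientation histograms on a dense grid of cells.

    img is a list of rows of grey values. cell and bins are illustrative
    defaults (assumptions), not values taken from the text above.
    """
    h, w = len(img), len(img[0])
    ny, nx = h // cell, w // cell
    hists = [[[0.0] * bins for _ in range(nx)] for _ in range(ny)]
    for y in range(ny * cell):
        for x in range(nx * cell):
            # Centered differences, clamped at the image border (assumption).
            gx = img[y][min(x + 1, w - 1)] - img[y][max(x - 1, 0)]
            gy = img[min(y + 1, h - 1)][x] - img[max(y - 1, 0)][x]
            mag = math.hypot(gx, gy)
            # Unsigned orientation folded into [0, pi).
            ang = math.atan2(gy, gx) % math.pi
            b = min(int(ang / math.pi * bins), bins - 1)
            hists[y // cell][x // cell][b] += mag
    return hists

def hog_descriptor(img, cell=8, bins=9, block=2, eps=1e-6):
    """Concatenate L2-normalized blocks of block x block cells.

    Blocks advance one cell at a time, so neighbouring blocks overlap and
    each interior cell contributes to several normalized blocks.
    """
    hists = cell_histograms(img, cell, bins)
    ny, nx = len(hists), len(hists[0])
    out = []
    for by in range(ny - block + 1):
        for bx in range(nx - block + 1):
            v = [val
                 for dy in range(block)
                 for dx in range(block)
                 for val in hists[by + dy][bx + dx]]
            norm = math.sqrt(sum(t * t for t in v) + eps * eps)
            out.extend(t / norm for t in v)
    return out
```

Under these assumed settings, a 16x16 image yields a single block of 2x2 cells and hence a 36-dimensional vector. The resulting vector is the kind of feature that would be passed to a linear SVM.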
We briefly discuss previous work on human detection in
§2, give an overview of our method in §3, describe our data
sets in §4 and give a detailed description and experimental
evaluation of each stage of the process in §5–6. The main
conclusions are summarized in §7.
2 Previous Work
There is an extensive literature on object detection, but
here we mention just a few relevant papers on human detec-
tion [18,17,22,16,20]. See [6] for a survey. Papageorgiou et
al [18] describe a pedestrian detector based on a polynomial
SVM using rectified Haar wavelets as input descriptors, with
a parts (subwindow) based variant in [17]. Depoortere et al
give an optimized version of this [2]. Gavrila & Philomin
[8] take a more direct approach, extracting edge images and
matching them to a set of learned exemplars using chamfer
distance. This has been used in a practical real-time pedes-
trian detection system [7]. Viola et al [22] build an efficient
moving person detector, using AdaBoost to train a chain of
progressively more complex region rejection rules based on
Haar-like wavelets and space-time differences. Ronfard et
al [19] build an articulated body detector by incorporating
SVM based limb classifiers over 1st and 2nd order Gaussian
filters in a dynamic programming framework similar to those
of Felzenszwalb & Huttenlocher [3] and Ioffe & Forsyth
[9]. Mikolajczyk et al [16] use combinations of orientation-
position histograms with binary-thresholded gradient magni-
tudes to build a parts based method containing detectors for
faces, heads, and front and side profiles of upper and lower
body parts. In contrast, our detector uses a simpler archi-
tecture with a single detection window, but appears to give
significantly higher performance on pedestrian images.
3 Overview of the Method
This section gives an overview of our feature extraction
chain, which is summarized in fig. 1. Implementation details
are postponed until §6. The method is based on evaluating
well-normalized local histograms of image gradient orienta-
tions in a dense grid. Similar features have seen increasing
use over the past decade [4,5,12,15]. The basic idea is that
local object appearance and shape can often be characterized
rather well by the distribution of local intensity gradients or