fitting error and associated model parameters can be
learned from examples [9].
2.2.2 Discriminative Models
In contrast to the generative models, discriminative
models approximate the Bayesian maximum-a-posteriori
decision by learning the parameters of a discriminant
function (decision boundary) between the pedestrian and
nonpedestrian classes from training examples. We will
discuss the merits and drawbacks of several feature
representations and continue with a review of classifier
architectures and techniques to break down the complexity
of the pedestrian class.
Features. Local filters operating on pixel intensities are a
frequently used feature set [59]. Nonadaptive Haar wavelet
features have been popularized by Papageorgiou and
Poggio [53] and adapted by many others [48], [64], [74].
This overcomplete feature dictionary represents local in-
tensity differences at various locations, scales, and orienta-
tions. Their simplicity and fast evaluation using integral
images [41], [74] contributed to the popularity of Haar
wavelet features. However, the many-ti mes redundant
representation, due to overlapping spatial shifts, requires
mechanisms to select the most appropriate subset of features
out of the vast amount of possible features. Initially, this
selection was manually designed for the pedestrian class, by
incorporating prior knowledge about the geometric config-
uration of the human body [48], [53], [64]. Later, automatic
feature selection procedures, e.g., variants of AdaBoost [18],
were employed to select the most discriminative feature
subset [74].
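The fast evaluation mentioned above rests on the integral image (summed-area table), which reduces any rectangular sum, and hence any two-rectangle Haar-like feature, to a handful of array lookups. The following is a minimal sketch of this mechanism; the function names are illustrative, not taken from the cited works:

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero-padded first row/column,
    so that ii[r, c] = sum of img[:r, :c]."""
    ii = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return np.pad(ii, ((1, 0), (1, 0)))

def box_sum(ii, top, left, h, w):
    """Sum over the h x w box with top-left corner (top, left),
    obtained from four lookups into the integral image."""
    return (ii[top + h, left + w] - ii[top, left + w]
            - ii[top + h, left] + ii[top, left])

def haar_horizontal(ii, top, left, h, w):
    """Two-rectangle Haar-like feature: left half minus right half."""
    half = w // 2
    return (box_sum(ii, top, left, h, half)
            - box_sum(ii, top, left + half, h, half))
```

Once the table is built in a single pass, every feature evaluation costs a constant number of lookups regardless of rectangle size, which is what makes exhaustive evaluation of the overcomplete dictionary tractable.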
The automatic extraction of a subset of nonadaptive
features can be regarded as optimizing the features for the
classification task. Likewise, the particular configuration of
spatial features has been included in the actual optimiza-
tion itself, yielding feature sets that adapt to the under-
lying data set during training. Such features are referred to
as local receptive fields [19], [23], [49], [68], [75], in
reference to neural structures in the human visual cortex
[24]. Recent studies have empirically demonstrated the
superiority of adaptive local receptive field features over
nonadaptive Haar wavelet features with regard to pedes-
trian classification [49], [68].
Another class of local intensity-based features is code-
book feature patches, extracted around interesting points in
the image [1], [39], [40], [61]. A codebook of distinctive
object feature patches, along with their geometric relations, is
learned from training data; clustering in the space of feature
patches then yields a compact representation
of the underlying pedestrian class. Based on this represen-
tation, feature vectors have been extracted including
information about the presence and geometric relation of
codebook patches [1], [39], [40], [61].
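The core of such approaches is the clustering step that compresses many observed patches into a small set of codebook entries, followed by assigning new patches to their nearest entry. A minimal sketch, using a few hand-rolled k-means iterations on flattened patch vectors (the function names and a simple occurrence histogram are assumptions of this sketch, not the exact formulations of the cited works):

```python
import numpy as np

def build_codebook(patches, k, iters=10, seed=0):
    """Cluster flattened feature patches with a few rounds of k-means;
    the cluster centres serve as codebook entries."""
    rng = np.random.default_rng(seed)
    centres = patches[rng.choice(len(patches), k, replace=False)]
    for _ in range(iters):
        # assign each patch to its nearest centre
        d = np.linalg.norm(patches[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centres[j] = patches[labels == j].mean(axis=0)
    return centres

def occurrence_vector(patches, centres):
    """Histogram of codebook activations over one image's patches."""
    d = np.linalg.norm(patches[:, None, :] - centres[None, :, :], axis=2)
    return np.bincount(d.argmin(axis=1), minlength=len(centres))
```

The cited methods additionally store the geometric relation of each activation to the object centre, so that matched codebook entries can vote for object hypotheses rather than merely being counted.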
Others have focused on discontinuities in the image
brightness function in terms of models of local edge
structure. Well-normalized image gradient orientation histo-
grams, computed over local image blocks, have become
popular in both dense [11], [62], [63], [80], [83] (HOG,
histograms of oriented gradients) and sparse representations
[42] (SIFT, scale-invariant feature transform), where sparse-
ness arises from preprocessing with an interest-point
detector. Initially, dense gradient orientation histograms
were computed using local image blocks at a single fixed
scale [11], [62] to limit the dimensionality of the feature vector
and computational costs. Extensions to variable-sized blocks
have been presented in [63], [80], [83]. Results indicate a
performance improvement over the original HOG approach.
Recently, local spatial variation and correlation of gradient-
based features have been encoded using covariance matrix
descriptors which increase robustness toward illumination
changes [71].
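The essence of these gradient-histogram descriptors can be sketched in a few lines: compute image gradients, accumulate magnitude-weighted orientation votes per local cell, and normalize. The sketch below simplifies HOG by normalizing each cell on its own, whereas the original scheme normalizes overlapping blocks of cells; cell size and bin count follow common defaults, not a specific cited configuration:

```python
import numpy as np

def hog_descriptor(img, cell=8, bins=9, eps=1e-6):
    """Simplified HOG: magnitude-weighted orientation histograms
    over non-overlapping cells, L2-normalised per cell."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)      # unsigned orientation
    h, w = img.shape
    feats = []
    for y in range(0, h - cell + 1, cell):
        for x in range(0, w - cell + 1, cell):
            a = ang[y:y + cell, x:x + cell].ravel()
            m = mag[y:y + cell, x:x + cell].ravel()
            hist, _ = np.histogram(a, bins=bins, range=(0, np.pi), weights=m)
            feats.append(hist / (np.linalg.norm(hist) + eps))
    return np.concatenate(feats)
```

The block normalization omitted here is what gives HOG much of its robustness to local illumination and contrast changes; the variable-sized-block extensions cited above vary the cell and block geometry while keeping this same histogram core.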
Yet others have designed local shape filters that
explicitly incorporate the spatial configuration of salient
edge-like structures. Multiscale features based on horizon-
tal and vertical co-occurrence groups of dominant gradient
orientation have been introduced by Mikolajczyk et al. [45].
Manually designed sets of edgelets, representing local line
or curve segments, have been proposed to capture edge
structure [76]. An extension to these predefined edgelet
features has recently been introduced with regard to
adapting the local edgelet features to the underlying image
data [60]. So-called shapelet features are assembled from
low-level oriented gradient responses using AdaBoost, to
yield more discriminative local features. Again, variants of
AdaBoost are frequently used to select the most discrimi-
native subset of features.
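The recurring AdaBoost-based selection works because each boosting round fits a weak learner, typically a single-feature threshold "stump", to reweighted training data, so the sequence of chosen stumps doubles as a greedy ranking of features. A minimal sketch of discrete AdaBoost in this role, with labels in {-1, +1} (an illustrative implementation, not the exact variant of any cited work):

```python
import numpy as np

def adaboost_select(X, y, rounds=5):
    """Each round picks the feature (column of X) whose best threshold
    stump minimises the weighted error, then reweights the samples."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)
    chosen = []
    for _ in range(rounds):
        best = None
        for j in range(d):
            for thr in np.unique(X[:, j]):
                for sign in (1, -1):
                    pred = sign * np.where(X[:, j] >= thr, 1, -1)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, thr, sign)
        err, j, thr, sign = best
        err = min(max(err, 1e-10), 1 - 1e-10)       # guard the log
        alpha = 0.5 * np.log((1 - err) / err)       # stump weight
        pred = sign * np.where(X[:, j] >= thr, 1, -1)
        w *= np.exp(-alpha * y * pred)              # upweight mistakes
        w /= w.sum()
        chosen.append((j, thr, sign, alpha))
    return chosen
```

Because misclassified samples gain weight after each round, later rounds are forced toward features that complement the ones already selected, rather than redundant near-duplicates from the overcomplete dictionary.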
As an extension to spatial features, spatiotemporal
features have been proposed to capture human motion
[12], [15], [65], [74], especially gait [27], [38], [56], [75]. For
example, Haar wavelets and local shape filters have been
extended to the temporal domain by incorporating intensity
differences over time [65], [74]. Local receptive field features
have been generalized to spatiotemporal receptive fields
[27], [75]. HOGs have been extended to histograms of
differential optical flow [12]. Several papers compared the
performance of otherwise identical spatial and spatiotem-
poral features [12], [74] and reported superior performance
of the latter at the drawback of requiring temporally aligned
training samples.
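In the simplest of these extensions, a spatial filter response is differenced across consecutive frames so that the feature reacts to motion rather than static appearance. A toy sketch in that spirit, using a two-rectangle spatial response (this exact construction is an assumption for illustration, not the formulation of the cited works):

```python
import numpy as np

def temporal_haar(frames, top, left, h, w):
    """Spatial left-minus-right rectangle response, differenced
    across consecutive frames to capture intensity change over time."""
    def spatial(img):
        half = w // 2
        left_sum = img[top:top + h, left:left + half].sum()
        right_sum = img[top:top + h, left + half:left + w].sum()
        return left_sum - right_sum
    resp = np.array([spatial(f) for f in frames])
    return np.diff(resp)   # response change between consecutive frames
```

The requirement for temporally aligned training samples noted above follows directly from such constructions: the same image region must correspond to the same body part across the frames being differenced.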
Classifier architectures. Discriminative classification
techniques aim at determining an optimal decision bound-
ary between pattern classes in a feature space. Feed-forward
multilayer neural networks [33] implement linear discrimi-
nant functions in the feature space in which input patterns
have been mapped nonlinearly, e.g., by using the pre-
viously described feature sets. Optimality of the decision
boundary is assessed by minimizing an error criterion with
respect to the network parameters, e.g., mean squared error
[33]. In the context of pedestrian detection, multilayer
neural networks have been applied particularly in conjunc-
tion with adaptive local receptive field features as non-
linearities in the hidden network layer [19], [23], [49], [68],
[75]. This architecture unifies feature extraction and
classification within a single model.
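The unification can be seen in the forward pass itself: the hidden layer applies a nonlinearity to local image patches (the receptive fields), and the output layer forms a discriminant over the resulting hidden activations. A toy forward pass along these lines, where the weight shapes and the tanh nonlinearity are assumptions of this sketch rather than the architectures of the cited works:

```python
import numpy as np

def forward_lrf(img, rf_weights, out_weights, rf=5, stride=5):
    """Toy network with local receptive fields: each group of hidden
    units sees one rf x rf patch of the input (weights shared across
    positions), followed by a linear output unit."""
    hidden = []
    h, w = img.shape
    for y in range(0, h - rf + 1, stride):
        for x in range(0, w - rf + 1, stride):
            patch = img[y:y + rf, x:x + rf].ravel()
            hidden.append(np.tanh(rf_weights @ patch))  # nonlinear hidden layer
    return out_weights @ np.concatenate(hidden)         # linear discriminant
```

Training such a model by backpropagation adjusts `rf_weights` and `out_weights` jointly, which is precisely why the learned receptive fields act as features optimized for the classification task rather than a fixed, hand-designed dictionary.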
Support Vector Machines (SVMs) [73] have evolved as a
powerful tool to solve pattern classification problems. In
contrast to neural networks, SVMs do not minimize some
artificial error metric but maximize the margin of a linear
decision boundary (hyperplane) to achieve maximum
separation between the object classes. Regarding pedestrian
classification, linear SVM classifiers have been used in
2182 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 31, NO. 12, DECEMBER 2009