“analytic” features. This removes the need to perform a costly
projection step when computing dense feature maps.
Significant variations in shape and appearance, such as
those caused by extreme viewpoint changes, are not well
captured by a 2D deformable model. Aspect graphs [31] are
a classical formalism for capturing significant changes that
are due to viewpoint variation. Mixture models provide a
simpler alternative approach. For example, it is common to
use multiple templates to encode frontal and side views of
faces and cars [36]. Mixture models have been used to
capture other aspects of appearance variation as well, such
as when there are multiple natural subclasses in an object
category [5].
Matching a deformable model to an image is a difficult
optimization problem. Local search methods require initi-
alization near the correct solution [2], [7], [43]. To guarantee
a globally optimal match, more aggressive search is needed.
One popular approach for part-based models is to restrict
part locations to a small set of possible locations returned by
an interest point detector [1], [18], [42]. Tree (and star)
structured pictorial structure models [9], [15], [19] allow for
the use of dynamic programming and generalized distance
transforms to efficiently search over all possible object
configurations in an image, without restricting the possible
locations for each part. We use these techniques for
matching our models to images.
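To make the dynamic-programming idea concrete, here is a hedged sketch of the 1D generalized distance transform that underlies matching for star-structured models. This naive version is O(n²); the actual algorithm used in such systems runs in linear time, but the quantity computed is the same: the best part score at each root-relative position, discounted by a quadratic deformation cost. The function name and the scalar `cost` parameter are illustrative, not from the paper.

```python
def distance_transform_1d(scores, cost=1.0):
    """Naive generalized distance transform (illustrative O(n^2) sketch).

    For each output position p, returns
        max over q of scores[q] - cost * (p - q)**2,
    i.e. the best placement q for a part, penalized quadratically
    for landing away from its ideal position p.
    """
    n = len(scores)
    out = []
    for p in range(n):
        out.append(max(scores[q] - cost * (p - q) ** 2 for q in range(n)))
    return out
```

In a star model, the root score at each location is the root filter response plus one distance-transformed part score per part, so the search over all part placements avoids enumerating configurations explicitly.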
Part-based deformable models are parameterized by the
appearance of each part and a geometric model capturing
spatial relationships among parts. For generative models,
one can learn model parameters using maximum likelihood
estimation. In a fully supervised setting, training images are
labeled with part locations and models can often be learned
using simple methods [9], [15]. In a weakly supervised
setting, training images may not specify locations of parts.
In this case, one can simultaneously estimate part locations
and learn model parameters with EM [2], [18], [42].
Discriminative training methods select model parameters
so as to minimize the mistakes of a detection algorithm on a
set of training images. Such approaches directly optimize
the decision boundary between positive and negative
examples. We believe that this is one reason for the success
of simple models trained with discriminative methods, such
as the Viola-Jones [41] and Dalal-Triggs [10] detectors. It has
been more difficult to train part-based models discrimina-
tively, though strategies exist [4], [23], [32], [34].
Latent SVMs are related to hidden CRFs [32]. However,
in a latent SVM, we maximize over latent part locations as
opposed to marginalizing over them, and we use a hinge
loss rather than log loss in training. This leads to an efficient
coordinate-descent style algorithm for training, as well as a
data-mining algorithm that allows for learning with very
large data sets. A latent SVM can be viewed as a type of
energy-based model [27].
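To make the contrast with hidden CRFs concrete, the latent SVM scores an example by maximizing over latent values and trains with a hinge loss. As a notational sketch (with weights β, joint feature map Φ, and latent value set Z(x), following the usual latent SVM formulation):

```latex
f_\beta(x) = \max_{z \in Z(x)} \beta \cdot \Phi(x, z)

L(\beta) = \frac{1}{2}\|\beta\|^2 \;+\; C \sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i f_\beta(x_i)\bigr)
```

A hidden CRF would instead marginalize (sum) over z and minimize a log loss; replacing the sum with a max and the log loss with a hinge is what yields the coordinate-descent training described above.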
A latent SVM is equivalent to the MI-SVM formulation of
multiple instance learning (MIL) in [3], but we find the
latent variable formulation more natural for the problems
we are interested in.¹ A different MIL framework was
previously used for training object detectors with weakly
labeled data in [40].
Our method for data-mining hard examples during
training is related to working set methods for SVMs (e.g.,
[25]). The approach described here requires relatively few
passes through the complete set of training examples and is
particularly well suited for training with very large data
sets, where only a fraction of the examples can fit in RAM.
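As a hedged sketch of this style of data mining, the loop below alternates between training on a bounded cache of examples and refreshing the cache by shrinking away easy negatives and scanning the full negative set for margin violators. The interfaces `train_svm` and `score`, the cache size, and the fixed round count are assumptions for illustration, not the paper's exact procedure.

```python
def train_with_hard_negatives(train_svm, score, positives, negatives,
                              cache_size=1000, rounds=5):
    """Hard-negative mining sketch with a bounded in-memory cache.

    train_svm(pos, neg) returns a trained model; score(model, x) is
    its decision value. A negative x is "hard" when score(model, x)
    exceeds -1, i.e. it violates the SVM margin for label -1.
    """
    cache = negatives[:cache_size]
    model = None
    for _ in range(rounds):
        model = train_svm(positives, cache)
        # Shrink: drop easy negatives that score well below the margin.
        cache = [x for x in cache if score(model, x) > -1]
        # Grow: one pass over the full negative set for margin violators.
        for x in negatives:
            if len(cache) >= cache_size:
                break
            if score(model, x) > -1 and x not in cache:
                cache.append(x)
    return model
```

Only the cache must fit in memory; the full negative set is touched once per round, which is why few passes over the data suffice.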
The use of context for object detection and recognition
has received increasing attention in recent years. Some
methods (e.g., [39]) use low-level holistic image features to
define likely object hypotheses. The method in [22] uses a
coarse but semantically rich representation of a scene,
including its 3D geometry, estimated using a variety of
techniques. Here, we define the context of an image using
the results of running a variety of object detectors in the
image. The idea is related to [33] where a CRF was used to
capture co-occurrences of objects, although we use a very
different approach to capture this information.
A preliminary version of our system was described in
[17]. The system described here differs from the one in [17]
in several ways: we introduce mixture models; we optimize
the true latent SVM objective function using stochastic
gradient descent, whereas in [17] we used an SVM package
to optimize a heuristic approximation of the objective; we
use new features that are both lower dimensional and more
informative; and we now postprocess detections via
bounding box prediction and context rescoring.
3 MODELS
All of our models involve linear filters that are applied to
dense feature maps. A feature map is an array whose entries
are d-dimensional feature vectors computed from a dense
grid of locations in an image. Intuitively, each feature vector
describes a local image patch. In practice, we use a variation
of the HOG features from [10], but the framework described
here is independent of the specific choice of features.
A filter is a rectangular template defined by an array of
d-dimensional weight vectors. The response, or score, of a
filter F at a position (x, y) in a feature map G is the "dot
product" of the filter and a subwindow of the feature map
with top-left corner at (x, y):

$$\sum_{x', y'} F[x', y'] \cdot G[x + x', y + y'].$$
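As a minimal sketch, the filter response can be computed directly from this definition. Here a feature map is represented as nested lists of d-dimensional feature vectors; this representation is illustrative (a real implementation would use dense arrays).

```python
def filter_score(F, G, x, y):
    """Score of filter F at position (x, y) in feature map G.

    F and G are 2D grids of d-dimensional feature vectors
    (lists of lists of lists). Returns the sum over (x', y') of
    the dot product F[x'][y'] . G[x + x'][y + y'].
    """
    h, w = len(F), len(F[0])
    total = 0.0
    for xp in range(h):
        for yp in range(w):
            f = F[xp][yp]
            g = G[x + xp][y + yp]
            total += sum(fi * gi for fi, gi in zip(f, g))
    return total
```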
We would like to define a score at different positions and
scales in an image. This is done using a feature pyramid
which specifies a feature map for a finite number of scales
in a fixed range. In practice, we compute feature pyramids
by computing a standard image pyramid via repeated
smoothing and subsampling, and then computing a feature
map from each level of the image pyramid. Fig. 3 illustrates
the construction.
The scale sampling in a feature pyramid is determined by a
parameter λ defining the number of levels in an octave. That
is, λ is the number of levels we need to go down in the
pyramid to get to a feature map computed at twice the
resolution of another one. In practice, we have used λ = 5 in
training and λ = 10 at test time. Fine sampling of scale space is
important for obtaining high performance with our models.
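A short sketch of the arithmetic: with λ levels per octave, going down λ levels doubles the resolution, so consecutive levels differ by a factor of 2^(1/λ). The function below lists the resulting scale factors; the number of octaves is an illustrative parameter, not a value from the paper.

```python
def pyramid_scales(num_octaves=3, levels_per_octave=5):
    """Scale factors for a feature pyramid.

    levels_per_octave is the paper's lambda: lambda levels span one
    octave, so adjacent levels are separated by a factor of
    2 ** (1 / lambda). Scale 1.0 is the original image resolution.
    """
    step = 2 ** (1.0 / levels_per_octave)
    n = num_octaves * levels_per_octave
    return [step ** -i for i in range(n + 1)]
```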
The system in [10] uses a single filter to define an object
model. That system detects objects by computing the score
of the filter at each position and scale of a HOG feature
pyramid and thresholding the scores.
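The single-filter detection scheme can be sketched as an exhaustive scan with a threshold. The `score_at` callback stands in for the filter response at a pyramid position (e.g. the "dot product" defined above); its name and the level-size representation are assumptions for this sketch.

```python
def detect_single_filter(score_at, levels, threshold):
    """Single-filter detection sketch: score every placement, keep
    those at or above a threshold.

    score_at(level, x, y) -> filter response at that position;
    levels[l] = (h, w), the grid of valid filter placements in
    pyramid level l. Returns (level, x, y, score) tuples.
    """
    detections = []
    for l, (h, w) in enumerate(levels):
        for x in range(h):
            for y in range(w):
                s = score_at(l, x, y)
                if s >= threshold:
                    detections.append((l, x, y, s))
    return detections
```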
1630 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 32, NO. 9, SEPTEMBER 2010
1. We defined a latent SVM in [17] before realizing the relationship to
MI-SVM.