decreases, which has the same effect as increasing the
amount of occlusion.
The method proposed in [18] tries to overcome these
limitations by considering the image gradients in contrast to
the image contours. It relies on the dot product as a
similarity measure between the template gradients and
those in the image. Unfortunately, this measure rapidly
declines with the distance to the object location or when the
object appearance is even slightly distorted. As a result, the
similarity measure must be evaluated densely and with
many templates to handle appearance variations, making
the method computationally costly. Using image pyramids
provides some speed improvements; however, fine but
important structures tend to be lost if one does not carefully
sample the scale space.
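The dot-product similarity described above can be illustrated with a minimal sketch. This is our own illustration, not the exact formulation of [18] (which may normalize and accumulate differently); the function and parameter names are hypothetical:

```python
import numpy as np

def gradient_similarity(template_grads, image_grads):
    """Mean dot product between corresponding normalized gradients.

    template_grads, image_grads: arrays of shape (N, 2) holding the
    (gx, gy) gradient vectors at N template locations and at the
    corresponding image locations. Normalizing each vector makes the
    score depend on orientation only, so it peaks when the template
    and image orientations align and drops quickly otherwise.
    """
    t = template_grads / (np.linalg.norm(template_grads, axis=1, keepdims=True) + 1e-8)
    i = image_grads / (np.linalg.norm(image_grads, axis=1, keepdims=True) + 1e-8)
    return float(np.mean(np.sum(t * i, axis=1)))
```

Because the score falls off sharply when the sampled image gradients no longer coincide with the object, such a measure has to be evaluated at a dense set of locations, which is the cost the text points out.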
Contrary to the above-mentioned methods, there are also
approaches addressing the general visual recognition problem: They are based on statistical learning and aim at detecting object categories rather than a priori known object
instances. While they are better at category generalization,
they are usually much slower during learning and runtime,
which makes them unsuitable for online applications.
For example, Amit et al. [19] proposed a coarse-to-fine
approach that spreads gradient orientations in local
neighborhoods. The amount of spreading is learned for
each object part in an initial stage. While this approach—
used for license plate reading—achieves high recognition
rates, it is not real-time capable.
Histograms of Oriented Gradients (HOG) [1] is another
related and very popular method. It statistically describes
the distribution of intensity gradients in localized portions
of the image. The descriptor is computed on a dense grid at
uniform intervals and uses overlapping local histogram
normalization for better performance. It has proven to give
reliable results but tends to be slow due to its
computational complexity.
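The core of this descriptor, orientation histograms accumulated over a dense grid of cells, can be sketched as follows. This is a simplified illustration: it omits the overlapping block normalization mentioned above, and the cell size and bin count are arbitrary choices, not the values from [1]:

```python
import numpy as np

def hog_cell_histograms(image, cell=8, bins=9):
    """Gradient-orientation histograms over a dense grid of cells.

    Each pixel votes into the orientation bin of its (unsigned)
    gradient direction, weighted by gradient magnitude. Full HOG
    would additionally normalize these histograms over overlapping
    blocks of cells.
    """
    gy, gx = np.gradient(image.astype(float))   # axis 0 = rows (y)
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)     # unsigned orientation in [0, pi)
    bin_idx = np.minimum((ang / np.pi * bins).astype(int), bins - 1)
    h, w = image.shape
    ch, cw = h // cell, w // cell
    hist = np.zeros((ch, cw, bins))
    for y in range(ch * cell):
        for x in range(cw * cell):
            hist[y // cell, x // cell, bin_idx[y, x]] += mag[y, x]
    return hist
```

The per-pixel gradient computation and dense accumulation over every cell are exactly what makes the method expensive at runtime.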
Ferrari et al. [4] provided a learning-based method that
recognizes objects via a Hough-style voting scheme with a
nonrigid shape matcher on object boundaries of a binary
edge image. The approach applies statistical methods to
learn the model from few images that are only constrained
within a bounding box around the object. While giving very
good classification results, the approach is neither appro-
priate for object tracking in real time due to its expensive
computation nor is it precise enough to return the accurate
pose of the object. Additionally, it is sensitive to the results of
the binary edge detector, an issue that we discussed before.
Kalal et al. [20] very recently developed an online
learning-based approach. They showed how a classifier
can be trained online in real time, with a training set
generated automatically. However, as we will see in the
experiments, this approach is only suitable for smooth
background transitions and not appropriate to detect
known objects over unknown backgrounds.
In contrast to the above-mentioned learning-based
methods, there are also approaches that are specifically trained
on different viewpoints. As with our template-based
approach, they can detect objects under different poses,
but typically require a large amount of training data and a
long offline training phase. For example, in [5], [21], [22],
one or several classifiers are trained to detect faces or cars
under various views.
More recent approaches for 3D object detection are
related to object class recognition. Stark et al. [23] rely on 3D
CAD models and generate a training set by rendering them
from different viewpoints. Liebelt and Schmid [24] combine
a geometric shape and pose prior with natural images. Su
et al. [25] use a dense, multiview representation of the
viewing sphere combined with a part-based probabilistic
representation. While these approaches are able to general-
ize to the object class, they are not real-time capable and
require expensive training.
Among the related works that also take depth data into
account, most approaches are related to
pedestrian detection [26], [27], [28], [29]. They use three
kinds of cues: image intensity, depth, and motion (optical
flow). The most recent approach of Enzweiler et al. [26]
builds part-based models of pedestrians in order to handle
occlusions caused by other objects and not only self-
occlusions modeled in other approaches [27], [29]. Besides
pedestrian detection, there has been an approach to object
classification, pose estimation, and reconstruction intro-
duced by Sun et al. [30]. The training data set is composed
of depth and image intensities, while the object classes
are detected using a modified Hough transform. While
quite effective in real applications, these approaches still
require exhaustive training using large training data sets.
This is usually prohibitive in robotic applications, where
the robot has to explore an unknown environment and
learn new objects online.
As mentioned in the introduction, we recently proposed
a method to detect textureless 3D object instances from
different viewpoints based on templates [7]. Each object is
represented as a set of templates, relying on local dominant
gradient orientations to build a representation of the input
images and the templates. Extracting the dominant
orientations is useful to tolerate small translations and
deformations. It is fast to perform and, most of the time,
discriminative enough to avoid generating too many false
positive detections.
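The idea of keeping a dominant quantized gradient orientation per small neighborhood can be sketched as follows. This is only a rough illustration of the representation used in [7]; the bin count, region size, and magnitude threshold here are assumptions, not the values from that paper:

```python
import numpy as np

def dominant_orientations(gx, gy, n_bins=8, region=3, mag_thresh=1.0):
    """Dominant quantized gradient orientation per region x region cell.

    Orientations are quantized into n_bins (ignoring gradient sign),
    and within each cell the most frequent quantized orientation
    among sufficiently strong gradients is retained. Cells with no
    strong gradient get -1. Pooling over a neighborhood is what makes
    the representation tolerant to small translations and deformations.
    """
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)              # unsigned orientation
    q = np.minimum((ang / np.pi * n_bins).astype(int), n_bins - 1)
    h, w = mag.shape
    out = -np.ones((h // region, w // region), dtype=int)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            m = mag[i*region:(i+1)*region, j*region:(j+1)*region]
            o = q[i*region:(i+1)*region, j*region:(j+1)*region]
            strong = m > mag_thresh
            if strong.any():
                vals, counts = np.unique(o[strong], return_counts=True)
                out[i, j] = vals[np.argmax(counts)]
    return out
```

The failure mode discussed next follows directly from this pooling: a strong clutter gradient inside a cell can win the vote and replace the object's orientation.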
However, we noticed that this approach degrades
significantly when the gradient orientations are disturbed
by stronger gradients of different orientations coming from
background clutter in the input images. In practice, this
often happens in the neighborhood of the silhouette of an
object, which is unfortunate as the silhouette is a very
important cue, especially for textureless objects. The
method we propose in this paper does not suffer from
this problem while running at the same speed. Addition-
ally, we show how to extend our approach to handle 3D
surface normals at the same time if a dense depth sensor
like the Kinect is available. As we will see, this increases
the robustness significantly.
3 PROPOSED APPROACH
In this section, we describe our template representation and
show how a new representation of the input image can be
built and used to parse the image to quickly find objects. We
will start by deriving our similarity measure, emphasizing
the contribution of each aspect of it. We also show how we
implement our approach to efficiently use modern proces-
sor architectures. Additionally, we demonstrate how to
878 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 34, NO. 5, MAY 2012