
Object Recognition from Local Scale-Invariant Features
David G. Lowe
Computer Science Department
University of British Columbia
Vancouver, B.C., V6T 1Z4, Canada
lowe@cs.ubc.ca
Proc. of the International Conference on Computer Vision, Corfu (Sept. 1999)

Abstract
An object recognition system has been developed that uses a new class of local image features. The features are invariant to image scaling, translation, and rotation, and partially invariant to illumination changes and affine or 3D projection.
These features share similar properties with neurons in in-
ferior temporal cortex that are used for object recognition
in primate vision. Features are efficiently detected through
a staged filtering approach that identifies stable points in
scale space. Image keys are created that allow for local ge-
ometric deformations by representing blurred image gradi-
ents in multiple orientation planes and at multiple scales.
The keys are used as input to a nearest-neighbor indexing
method that identifies candidate object matches. Final verification of each match is achieved by finding a low-residual
least-squares solution for the unknown model parameters.
Experimental results show that robust object recognition
can be achieved in cluttered partially-occluded images with
a computation time of under 2 seconds.
1. Introduction
Object recognition in cluttered real-world scenes requires
local image features that are unaffected by nearby clutter or
partial occlusion. The features must be at least partially in-
variant to illumination,3D projective transforms, and com-
mon object variations. On the other hand, the features must
also be sufficiently distinctive to identify specific objects
among many alternatives. The difficulty of the object recognition problem is due in large part to the lack of success in
finding such image features. However, recent research on
the use of dense local features (e.g., Schmid & Mohr [19])
has shown that efficient recognition can often be achieved
by using local image descriptors sampled at a large number
of repeatable locations.
This paper presents a new method for image feature gen-
eration called the Scale Invariant Feature Transform (SIFT).
This approach transforms an image into a large collection
of local feature vectors, each of which is invariant to image
translation, scaling, and rotation, and partially invariant to
illumination changes and affine or 3D projection. Previous
approaches to local feature generation lacked invariance to
scale and were more sensitive to projective distortion and
illumination change. The SIFT features share a number of
properties in common with the responses of neurons in infe-
rior temporal (IT) cortex in primate vision. This paper also
describes improved approaches to indexing and model ver-
ification.
The scale-invariant features are efficiently identified by
using a staged filtering approach. The first stage identifies
key locations in scale space by looking for locations that
are maxima or minima of a difference-of-Gaussian function.
Each point is used to generate a feature vector that describes
the local image region sampled relative to its scale-space co-
ordinate frame. The features achieve partial invariance to
local variations, such as affine or 3D projections, by blur-
ring image gradient locations. This approach is based on a
model of the behavior of complex cells in the cerebral cor-
tex of mammalian vision. The resulting feature vectors are
called SIFT keys. In the current implementation, each im-
age generates on the order of 1000 SIFT keys, a process that
requires less than 1 second of computation time.
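The first filtering stage described above can be sketched as follows. This is a minimal illustration of finding extrema of a difference-of-Gaussian function in scale space, not the paper's implementation; the parameter values (`sigma`, `k`, `n_levels`) and the brute-force neighborhood scan are assumptions made for clarity.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_extrema(image, sigma=1.6, k=2 ** 0.5, n_levels=4):
    """Locate maxima/minima of a difference-of-Gaussian stack (a sketch).

    Returns (y, x, level) triples where a pixel is a strict extremum
    of its 3x3x3 neighborhood across space and scale.
    """
    # Gaussian-blurred copies of the image at increasing scale.
    blurred = [gaussian_filter(image.astype(float), sigma * k ** i)
               for i in range(n_levels)]
    # Difference-of-Gaussian layers: adjacent blur levels subtracted.
    dog = [blurred[i + 1] - blurred[i] for i in range(n_levels - 1)]

    keys = []
    for level in range(1, len(dog) - 1):
        d = dog[level]
        for y in range(1, d.shape[0] - 1):
            for x in range(1, d.shape[1] - 1):
                # 3x3x3 neighborhood spanning the adjacent scales.
                patch = np.stack([dog[level + dl][y - 1:y + 2, x - 1:x + 2]
                                  for dl in (-1, 0, 1)])
                neighbors = np.delete(patch.ravel(), 13)  # drop center
                v = d[y, x]
                if v > neighbors.max() or v < neighbors.min():
                    keys.append((y, x, level))
    return keys
```

Requiring a strict extremum means flat regions produce no keys, while corners and blobs that persist across neighboring scales do.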
The SIFT keys derived from an image are used in a nearest-neighbor approach to indexing to identify candidate object models. Collections of keys that agree on a potential model pose are first identified through a Hough transform hash table, and then through a least-squares fit to a final
estimate of model parameters. When at least 3 keys agree
on the model parameters with low residual, there is strong
evidence for the presence of the object. Since there may be
dozens of SIFT keys in the image of a typical object, it is
possible to have substantial levels of occlusion in the image
and yet retain high levels of reliability.
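The nearest-neighbor indexing step can be sketched with a standard k-d tree. The tree here is a stand-in of my own choosing for the paper's indexing structure, and the `match_keys` interface is an illustrative assumption.

```python
import numpy as np
from scipy.spatial import cKDTree

def match_keys(model_keys, image_keys):
    """Nearest-neighbor candidate matching of SIFT key vectors (a sketch).

    model_keys, image_keys: (N, D) arrays of descriptor vectors.
    Returns (image_index, model_index) pairs, one candidate match
    per image key.
    """
    tree = cKDTree(model_keys)          # index the model's key vectors
    _, idx = tree.query(image_keys, k=1)  # nearest model key for each
    return [(i, int(j)) for i, j in enumerate(idx)]
```

Each image key votes for the model key it matches; the Hough-transform stage described above then clusters these votes by the pose they imply.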
The current object models are represented as 2D loca-
tions of SIFT keys that can undergo affine projection. Suf-
ficient variation in feature location is allowed to recognize
perspective projection of planar shapes at up to a 60 degree
rotation away from the camera or to allow up to a 20 degree
rotation of a 3D object.
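The final verification step, solving for the unknown affine model parameters with low residual, amounts to a linear least-squares problem. The sketch below fits the six parameters of a 2D affine transform from matched key locations; the formulation is a standard one and the variable names are my own, not the paper's.

```python
import numpy as np

def fit_affine(model_pts, image_pts):
    """Least-squares affine pose from matched key locations (a sketch).

    Solves image = A @ model + t for the six affine parameters.
    At least 3 non-collinear matches are required, matching the
    3-key agreement threshold noted in the text.
    """
    model_pts = np.asarray(model_pts, float)
    image_pts = np.asarray(image_pts, float)
    n = len(model_pts)
    # Each match contributes two rows of M p = b,
    # with p = (a11, a12, a21, a22, tx, ty).
    M = np.zeros((2 * n, 6))
    b = image_pts.reshape(-1)
    M[0::2, 0:2] = model_pts   # x equations
    M[0::2, 4] = 1.0
    M[1::2, 2:4] = model_pts   # y equations
    M[1::2, 5] = 1.0
    p, *_ = np.linalg.lstsq(M, b, rcond=None)
    return p[:4].reshape(2, 2), p[4:]
```

A match hypothesis is accepted when the residual of this fit is small, which is what "agree on the model parameters with low residual" means operationally.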