Matching Local Self-Similarities across Images and Videos
Eli Shechtman Michal Irani
Dept. of Computer Science and Applied Math
The Weizmann Institute of Science
76100 Rehovot, Israel
Abstract
We present an approach for measuring similarity between visual entities (images or videos) based on matching internal self-similarities. What is correlated across images (or across video sequences) is the internal layout of local self-similarities (up to some distortions), even though the patterns generating those local self-similarities are quite different in each of the images/videos. These internal self-similarities are efficiently captured by a compact local “self-similarity descriptor”, measured densely throughout the image/video, at multiple scales, while accounting for local and global geometric distortions. This gives rise to matching capabilities for complex visual data, including detection of objects in real cluttered images using only rough hand-sketches, handling of textured objects with no clear boundaries, and detection of complex actions in cluttered video data with no prior learning. We compare our measure to commonly used image-based and video-based similarity measures, and demonstrate its applicability to object detection, retrieval, and action detection.
1. Introduction
Determining similarity between visual data is necessary in many computer vision tasks, including object detection and recognition, action recognition, texture classification, data retrieval, tracking, image alignment, etc. Methods for performing these tasks are usually based on representing an image using some global or local image properties, and comparing them using some similarity measure.
The relevant representations and the corresponding similarity measures can vary significantly. Images are often represented using dense photometric pixel-based properties, or by compact region descriptors (features) often used with interest point detectors. Dense properties include raw pixel intensity or color values (of the entire image, of small patches [25, 3], or fragments [22]), texture filters [15], or other filter responses [18]. Common compact region descriptors include distribution-based descriptors (e.g., SIFT [13]), differential descriptors (e.g., local derivatives [12]), shape-based descriptors using extracted edges (e.g., Shape Context [1]), and others. For a comprehensive comparison of many region descriptors for image matching, see [16].
Figure 1. These images of the same object (a heart) do NOT share common image properties (colors, textures, edges), but DO share a similar geometric layout of local internal self-similarities.
Although these representations and their corresponding measures vary significantly, they all share the same basic assumption – that there exists a common underlying visual property (i.e., pixel colors, intensities, edges, gradients or other filter responses) which is shared by the two images (or sequences), and can therefore be extracted and compared across images/sequences. This assumption, however, may be too restrictive, as illustrated in Fig. 1. There is no obvious image property shared between those images. Nevertheless, we can clearly see that these are instances of the same object (a heart). What makes these images similar is the fact that their local intensity patterns (in each image) are repeated in nearby image locations in a similar relative geometric layout. In other words, the local internal layouts of self-similarities are shared by these images, even though the patterns generating those self-similarities are not shared by those images. The notion of self-similarity in video sequences is even stronger than in images. For example, people wear the same clothes in consecutive frames and backgrounds tend to change gradually, resulting in strong self-similar patterns in local space-time video regions.
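To make this notion concrete, the following minimal sketch (in Python/NumPy; the patch and region radii and the SSD-to-similarity mapping are illustrative assumptions, not values specified here) computes a local self-similarity "correlation" surface for one pixel by comparing the small patch around it with every patch in its surrounding region:

```python
import numpy as np

def correlation_surface(img, y, x, patch_radius=2, region_radius=20):
    """Local self-similarity surface around pixel (y, x): compare the small
    patch centered at (y, x) with every patch in its surrounding region via
    sum-of-squared-differences (SSD), then map SSD to a similarity value.
    High values mark locations where the local pattern repeats.
    Assumes (y, x) is far enough from the image border."""
    img = img.astype(np.float64)
    p = img[y - patch_radius:y + patch_radius + 1,
            x - patch_radius:x + patch_radius + 1]
    size = 2 * region_radius + 1
    surface = np.zeros((size, size))
    for dy in range(-region_radius, region_radius + 1):
        for dx in range(-region_radius, region_radius + 1):
            q = img[y + dy - patch_radius:y + dy + patch_radius + 1,
                    x + dx - patch_radius:x + dx + patch_radius + 1]
            ssd = np.sum((p - q) ** 2)
            # The normalization constant (a fixed 2500.0 here) is an
            # illustrative stand-in for a local noise/contrast estimate.
            surface[dy + region_radius, dx + region_radius] = np.exp(-ssd / 2500.0)
    return surface
```

Around corresponding points in images like those of Fig. 1, such surfaces tend to have similar shapes even though the underlying colors and textures differ; this is the property exploited next.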
In this paper we present a “local self-similarity descriptor” which captures internal geometric layouts of local self-similarities within images/videos, while accounting for small local affine deformations. It captures self-similarity of color, edges, repetitive patterns (e.g., the right image in Fig. 1) and complex textures in a single unified way. A textured region in one image can be matched with a uniformly colored region in the other image as long as they have a similar spatial layout. These self-similarity descriptors are estimated on a dense grid of points in image/video data, at multiple scales. A good match between a pair of images (or a pair of video sequences) corresponds to finding a matching ensemble of such descriptors – with similar descriptor values at similar relative geometric positions, up to small non-rigid deformations. This allows matching a wide vari-
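Building on the correlation_surface sketch above, the following illustrative sketch turns each such surface into a compact descriptor by max-pooling it into log-polar bins (so that small local shifts and affine-like distortions only move values within a bin) and evaluates it on a dense grid; the bin counts, grid step, and normalization below are assumptions of this sketch rather than parameters given in the text, and multiple scales can be handled by repeating the computation on a Gaussian pyramid of the input:

```python
import numpy as np

def self_similarity_descriptor(surface, n_angles=20, n_radii=4):
    """Max-pool a correlation surface into a log-polar grid of
    n_angles x n_radii bins. Taking the maximum inside each bin makes the
    descriptor tolerant to small local shifts and affine-like distortions.
    The bin counts and log-radial spacing are illustrative choices."""
    size = surface.shape[0]
    c = size // 2
    ys, xs = np.mgrid[0:size, 0:size]
    r = np.hypot(ys - c, xs - c)
    theta = np.arctan2(ys - c, xs - c) % (2 * np.pi)
    radial_edges = np.logspace(0.0, np.log10(c), n_radii + 1)  # 1 .. region radius
    desc = np.zeros(n_angles * n_radii)
    for a in range(n_angles):
        ang_mask = ((theta >= 2 * np.pi * a / n_angles) &
                    (theta < 2 * np.pi * (a + 1) / n_angles))
        for k in range(n_radii):
            mask = ang_mask & (r >= radial_edges[k]) & (r < radial_edges[k + 1])
            if mask.any():
                desc[a * n_radii + k] = surface[mask].max()
    # Linearly stretch to [0, 1] to reduce sensitivity to contrast differences.
    lo, hi = desc.min(), desc.max()
    return (desc - lo) / (hi - lo) if hi > lo else desc

def dense_descriptors(img, step=5, patch_radius=2, region_radius=20):
    """Descriptors on a dense grid (every `step` pixels), away from the
    image border; repeat on a Gaussian pyramid for multiple scales (not shown)."""
    margin = region_radius + patch_radius
    descs = {}
    for y in range(margin, img.shape[0] - margin, step):
        for x in range(margin, img.shape[1] - margin, step):
            descs[(y, x)] = self_similarity_descriptor(
                correlation_surface(img, y, x, patch_radius, region_radius))
    return descs
```

Two images (or sequences) are then compared by searching for an ensemble of such descriptors with similar values in a similar relative geometric arrangement.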