Spectral Hashing
Yair Weiss^{1,3}
^3 School of Computer Science,
Hebrew University,
91904 Jerusalem, Israel
yweiss@cs.huji.ac.il
Antonio Torralba^1
^1 CSAIL, MIT,
32 Vassar St.,
Cambridge, MA 02139
torralba@csail.mit.edu
Rob Fergus^2
^2 Courant Institute, NYU,
715 Broadway,
New York, NY 10003
fergus@cs.nyu.edu
Abstract
Semantic hashing [1] seeks compact binary codes of data-points so that the Hamming distance between codewords correlates with semantic similarity. In this paper, we show that the problem of finding a best code for a given dataset is closely related to the problem of graph partitioning and can be shown to be NP-hard. By relaxing the original problem, we obtain a spectral method whose solutions are simply a subset of thresholded eigenvectors of the graph Laplacian. By utilizing recent results on the convergence of graph Laplacian eigenvectors to the Laplace-Beltrami eigenfunctions of manifolds, we show how to efficiently calculate the code of a novel data-point. Taken together, both learning the code and applying it to a novel point are extremely simple. Our experiments show that our codes outperform the state of the art.
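To make the relaxed problem concrete, the following toy sketch builds a Gaussian affinity graph over a small dataset, takes the lowest non-trivial eigenvectors of its graph Laplacian, and thresholds them at zero to obtain binary codes. The names X, sigma, and n_bits, and the dense brute-force eigensolver, are illustrative assumptions; unlike the out-of-sample extension summarized above, this version codes only the training points.

# A minimal sketch of the spectral relaxation described in the abstract.
# Assumed names (X, sigma, n_bits); not the authors' reference code.
import numpy as np
from scipy.spatial.distance import cdist
from scipy.linalg import eigh

def spectral_codes(X, n_bits=4, sigma=1.0):
    """Binary codes from thresholded graph-Laplacian eigenvectors."""
    W = np.exp(-cdist(X, X, "sqeuclidean") / (2 * sigma**2))  # affinities
    D = np.diag(W.sum(axis=1))
    L = D - W                                   # unnormalized graph Laplacian
    # eigh returns eigenvalues in ascending order; skip the trivial
    # constant eigenvector (eigenvalue 0) and keep the next n_bits.
    vals, vecs = eigh(L)
    return (vecs[:, 1:n_bits + 1] > 0).astype(np.uint8)

codes = spectral_codes(np.random.randn(100, 8), n_bits=4)
print(codes.shape)  # (100, 4): one 4-bit code per training point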
1 Introduction
With the advent of the Internet, it is now possible to use huge training sets to address
challenging tasks in machine learning. As a motivating example, consider the recent work
of Torralba et al., who collected a dataset of 80 million images from the Internet [2, 3]. They
then used this weakly labeled dataset to perform scene categorization. To categorize a novel
image, they simply searched for similar images in the dataset and used the labels of these
retrieved images to predict the label of the novel image. A similar approach was used in [4]
for scene completion.
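As a toy illustration of this label-transfer scheme (the variable names and the brute-force linear scan below are assumptions for exposition, not the pipeline of [2, 3]), classifying a novel item reduces to a majority vote over its nearest neighbors:

# Hypothetical nearest-neighbor label transfer; brute-force for clarity.
import numpy as np
from collections import Counter

def knn_label(query, data, labels, k=5):
    dists = np.linalg.norm(data - query, axis=1)  # Euclidean distances
    nearest = np.argsort(dists)[:k]               # indices of k closest items
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

data = np.array([[0.0], [0.1], [1.0], [1.1]])
labels = np.array([0, 0, 1, 1])
print(knn_label(np.array([0.05]), data, labels, k=3))  # -> 0

The linear scan above is exactly the bottleneck that motivates this paper: with millions of items, both the memory footprint and the per-query cost of such a search become prohibitive.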
Although conceptually simple, actually carrying out such methods requires highly efficient
ways of (1) storing millions of images in memory and (2) quickly finding similar images to
a target image.
Semantic hashing, introduced by Salakhutdinov and Hinton [5], is a clever way of addressing
both of these challenges. In semantic hashing, each item in the database is represented by a
compact binary code. The code is constructed so that similar items will have similar binary
codewords and there is a simple feedforward network that can calculate the binary code for
a novel input. Retrieving similar neighbors is then done simply by retrieving all items with
codes within a small Hamming distance of the code for the query. This kind of retrieval can
be amazingly fast: millions of queries per second on standard computers. The key for this
method to work is to learn a good code for the dataset. We need a code that (1) is easily
computed for a novel input, (2) requires a small number of bits to code the full dataset, and
(3) maps similar items to similar binary codewords.
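As an illustrative sketch only (the code layout and the brute-force scan below are assumptions; practical systems instead enumerate all codes within the radius and probe a hash table), retrieval with binary codes reduces to popcounts of XORed codewords:

# Hamming-ball retrieval over small integer codes; names are hypothetical.
def hamming_neighbors(query_code, db_codes, radius=2):
    """Return database indices whose code is within `radius` bits."""
    return [i for i, c in enumerate(db_codes)
            if bin(query_code ^ c).count("1") <= radius]

db = [0b1010, 0b1011, 0b0101, 0b1110]
print(hamming_neighbors(0b1010, db, radius=1))  # -> [0, 1, 3]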
To simplify the problem, we will assume that the items have already been embedded in
a Euclidean space, say $\mathbb{R}^d$, in which Euclidean distance correlates with the desired simi-
larity. The problem of finding such a Euclidean embedding has been addressed in a large