Fitted Spectral Hashing
Yu Wang 1,2, Sheng Tang 1, Yalin Zhang 1,2, JinTao Li 1, DanYi Chen 3
1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
2 Graduate University of Chinese Academy of Sciences, Beijing 100190, China
3 School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, Beijing 100081, China
{wangyu, ts, zhangyalin, jtli}@ict.ac.cn, rainachen1216@163.com
ABSTRACT
Spectral hashing (SpH) is an efficient and simple binary hashing method, which assumes that data are sampled from a multidimensional uniform distribution. However, this assumption is too restrictive in practice. In this paper we propose an improved method, Fitted Spectral Hashing, to relax this distribution assumption. Our work is based on the fact that one-dimensional data of any distribution can be mapped to a uniform distribution without changing the local neighbor relations among data items. We have found that this mapping on each PCA direction follows a regular pattern and can be fitted well by an S-shaped curve, the Sigmoid function; with more parameters, a Fourier function also fits the data well. Based on the Sigmoid and Fourier functions, we thus propose two binary hashing methods. Experiments show that our methods are efficient and outperform state-of-the-art methods.
Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Retrieval
models
General Terms
Algorithms, Experimentation
Keywords
Spectral hashing, Sigmoid function, Fourier function
1. INTRODUCTION
Similarity search is an essential problem in the fields of machine learning, computer vision and information retrieval. However, with increasing amounts of data, similarity search faces the following challenges: efficiently storing millions of items in memory and quickly finding items similar to a query item. Recent work [1] shows that binary hashing methods are a powerful way to address these challenges:
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
MM’13, October 21–25, 2013, Barcelona, Spain.
Copyright 2013 ACM 978-1-4503-2404-5/13/10 ...$15.00.
http://dx.doi.org/10.1145/2502081.2502169.
• The highly compressed binary codes can be loaded into main memory efficiently;
• Searching for similar items can be extremely fast with Hamming distances calculated by the bit XOR operation: an ordinary PC today can perform millions of Hamming distance computations in just a few milliseconds.
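The XOR-based distance computation described above can be sketched in a few lines of Python; the helper names here are illustrative, not part of any cited method:

```python
def hamming(a: int, b: int) -> int:
    """Hamming distance between two binary codes stored as integers:
    XOR marks the differing bits, and we count the set bits."""
    return bin(a ^ b).count("1")

def nearest_code(query: int, database: list) -> int:
    """Index of the database code closest to the query in Hamming
    distance; a linear scan of cheap bit operations."""
    return min(range(len(database)), key=lambda i: hamming(query, database[i]))
```

On modern CPUs the popcount step maps to a single instruction, which is what makes scanning millions of codes per millisecond feasible.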
The basic idea of binary hashing methods is to formulate projections from items to binary codes so as to approximately preserve a given similarity function of interest [2]. "Good" binary codes should meet the entropy-maximizing criterion. According to information theory [3], the maximal entropy of a source alphabet is attained with a uniform probability distribution. If the entropy of the binary codes over a data set is small, the data are mapped to only a small number of codes, rendering the codes inefficient.
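The criterion can be made concrete with a small sketch: the empirical entropy of the assigned codewords reaches its maximum of b bits (for b-bit codes) exactly when all 2^b codes are used equally often. The function name below is illustrative:

```python
from collections import Counter
from math import log2

def code_entropy(codes):
    """Empirical entropy (in bits) of a collection of codewords:
    -sum over codes of p(code) * log2(p(code))."""
    counts = Counter(codes)
    n = len(codes)
    return -sum((c / n) * log2(c / n) for c in counts.values())

# Uniform use of the four 2-bit codes attains the maximum of 2 bits;
# a skewed assignment wastes most of the code space.
uniform = ["00", "01", "10", "11"] * 25
skewed = ["00"] * 97 + ["01", "10", "11"]
```

Here `code_entropy(uniform)` equals 2.0 bits, while the skewed assignment falls far below that bound.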
However, many state-of-the-art methods do not meet this criterion. One of the best-known binary hashing methods is locality sensitive hashing (E2LSH), which calculates binary codes by projecting data onto random vectors with random thresholds; as shown in [4], the Hamming distance between binary codes asymptotically approaches the Euclidean distance between data items. The kernelized version (KLSH) [5] extends E2LSH to generic normalized kernel functions. Rather than using random vectors, researchers have also pursued machine learning approaches, e.g. the restricted Boltzmann machine (RBM) [6] and Boosting [7], to accelerate document and image retrieval.
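A minimal sketch of the random-projection-with-random-threshold scheme, assuming the data are rows of a NumPy array; this is an illustrative toy, not the full E2LSH implementation with multiple hash tables:

```python
import numpy as np

def e2lsh_codes(X, n_bits, seed=0):
    """Toy E2LSH-style binary codes: project each item onto random
    vectors and threshold each projection at a random cut point."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(size=(d, n_bits))    # random projection vectors
    proj = X @ W                        # one column per bit
    # random thresholds drawn within the range of each projected column
    t = rng.uniform(proj.min(axis=0), proj.max(axis=0))
    return (proj > t).astype(np.uint8)
```

Because both the directions and the thresholds are random, nothing forces the resulting codes toward a uniform distribution, which is precisely the weakness the entropy-maximizing criterion exposes.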
When data are uniformly distributed in a hyper-rectangle, spectral hashing (SpH) [8], derived from the spectral graph partitioning problem, meets the entropy-maximizing criterion. Bits can be calculated efficiently by the eigenfunctions of the weighted Laplacian defined on R^1. This simple method outperforms the above methods. However, the assumption of SpH is too restrictive in practice. Like SpH,
Self-Taught Hashing (STH) [1] is also related to spectral graph partitioning, but uses the ratio-cut to address the entropy-maximizing criterion and applies a support vector machine (SVM) to yield hash codes for out-of-sample objects.
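The out-of-sample step amounts to training one binary classifier per bit on the in-sample codes. STH uses an SVM per bit; the sketch below substitutes a dependency-free perceptron-style update in its place, and all names are illustrative:

```python
import numpy as np

def fit_bit_classifiers(X, B, epochs=20, lr=0.1):
    """One linear classifier per bit, in the spirit of STH's
    out-of-sample extension (STH trains an SVM per bit; a batch
    perceptron-style update stands in here)."""
    n, d = X.shape
    Xb = np.hstack([X, np.ones((n, 1))])   # append a bias feature
    Y = 2.0 * B - 1.0                      # {0,1} bits -> {-1,+1} labels
    W = np.zeros((d + 1, B.shape[1]))      # one weight column per bit
    for _ in range(epochs):
        pred = np.sign(Xb @ W)
        mistakes = (pred != Y)             # update only on misclassified items
        W += lr * Xb.T @ (Y * mistakes)
    return W

def hash_out_of_sample(X, W):
    """Binary codes for unseen items via the learned per-bit classifiers."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return (Xb @ W > 0).astype(np.uint8)
```

The per-bit classifiers are what make STH applicable to any data distribution, but training one classifier per bit is also the source of its high computational cost noted below.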
STH works with any data distribution, but suffers from high computational cost. The binarized dimensionality reduction technique Latent Semantic Indexing (LSI) [9] and its improved version Laplacian Co-Hashing (LCH) [10] are efficient ways to obtain binary codes of documents. By setting the threshold to the median value of the left singular vectors of