CLSH: Cluster-based Locality-Sensitive Hashing
Xiangyang Xu, Tongwei Ren, Gangshan Wu*
State Key Laboratory for Novel Software Technology
Nanjing University, Nanjing, China
xiangyang.xu@smail.nju.edu.cn, {rentw, gswu}@nju.edu.cn
ABSTRACT
Locality-sensitive hashing (LSH) usually consumes large
memory in similarity search, which limits its scalability for
large scale applications. In this paper, we propose a novel
cluster-based locality-sensitive hashing (CLSH) approach,
which extends the conventional LSH framework and aims at
indexing and searching large scale high-dimensional dataset-
s. We first utilize a clustering algorithm to partition the
raw feature dataset into clusters, and map these clusters
to a distributed cluster. Then, LSH meth od is applied to
construct the index for each cluster, and we present two
criteria to choose the cluster(s) for further detailed search
in order to improve the search quality. This proposed
framework comes with following properties. Firstly, CLSH
can cope with large scale feature dataset. Secondly, the
generated clusters can guide the feature dataset automatical
mappings to a distributed cluster. After that, the search
time can be reduced a lot by searching on multiple comput-
ing nodes. Experiments show that the proposed approach
outperforms the existing approaches in terms of efficiency
and scalability.
Categories and Subject Descriptors
H.3.1 [Content Analysis and Indexing]: Indexing meth-
ods; H.3.3 [Information Search and Retrieval]: Clus-
tering, Search process
General Terms
Algorithm, Experimentation, Performance
Keywords
Approximate Nearest Neighbor search, clustering, Locality-
Sensitive Hashing, distributed cluster, high-dimensional in-
dexing
1. INTRODUCTION
Nearest neighbor search, also kn own as similarity search,
plays an important role in multimedia applications, such as
information retrieval, data analysis and object recognition
[2, 8, 7, 14]. Tree-based search meth ods, including k-
d tree and R-tree, usually have good performance for
nearest neighbor search on low-dimensional features, but
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
ICIMCS’14, July 10–12, 2014, Xiamen, Fujian, China.
Copyright 2014 ACM 978-1-4503-2810-4/14/07 ...$15.00.
they inevitably suffer enormous negative in fluence on search
performance when the feature dimensionality increases.
To improve the search performance on high-dimensional
features, Approximate Nearest Neighbor (ANN) search was
proposed, which aims at balancing search result quality and
response time by only providing the approximate results in
nearest neighbor search [1]. It has only small difference in
search results to the exact nearest neighbor search when the
user’s quality notion is accurately captured, which is good
enough for most practical applications [3]. In ANN search,
hashing based methods is dominant for their insensitivity to
feature dimensionality, in which Locality-Sensitive Hashing
(LSH) [5, 3] is one kind of the pioneering hashing based
ANN search and widely used methods. Recently, various
extensional works of LSH were presented, such as multi-
probe LSH [11], query-adaptive LSH [6], etc., and amounts
of learning based hashing methods were proposed, including
sp ectral hashing (SH) [17], semi-supervised hashing (SSH)
[16], weakly-supervised hashing [12] and kernelized LSH
(KLSH) [9], etc. These methods substantially reduce the
searching time while preserving t he comparable search qual-
ity. H owever, they are all designed in the centralized settings
and their abilities are limited by the memory capacity
of single compu ting node. Therefore, their scalability is
severely limited by the scale of feature dataset and the
ability of the single computing node.
To overcome the above problem, we propose a novel
cluster-based locality-sensitive hashing (CLSH) framework,
which aims to extend LSH method for indexing and search-
ing large scale high-dimensional feature datasets. Here,
“cluster-based” h as two meanings. In one respect, our
approach ap plies clustering on the raw feature dataset,
and constructs the index for each cluster. In the other
respect, our approach is carried out on a distributed cluster
which comprises of multiple computing no des. Figure 1
shows the overview of our app roach. First, we utilize a
clustering algorithm to partition the raw feature dataset
into clusters. Then, the feature dataset are automatically
mapped to different computing nodes with the guide of
these clusters. After this, for each cluster, LSH method
is applied to construct the index. In the nearest neighbor
search phrase, one cluster or several clusters are selected for
further detailed search to obtain retrieval results and the
search time will be red uced a lot.
Compared to the primary LSH and other hashing based
ANN methods, our approach has the following advantages:
• CLSH can cope with larger scale feature dataset by
applying LSH meth od on each cluster instead of the
whole feature dataset;