SES-LSH: Shuffle-Efficient Locality Sensitive Hashing for Distributed Similarity
Search
Dongsheng Li, Wanxin Zhang, Siqi Shen, Yiming Zhang
National Lab for Parallel and Distributed Processing, College of Computer
National University of Defense Technology, Changsha, Hunan
Email: dsli@nudt.edu.cn, kevinzwx1992@gmail.com, shensiqi@nudt.edu.cn, ymzhang@nudt.edu.cn
Abstract—Locality Sensitive Hashing (LSH) is a
similarity search technique for many web services, such as
content-based retrieval services for images and videos. Due to
its popularity, much research effort has been devoted to
improving the search quality, and the indexing and query
performance of LSH. However, most existing variants of LSH
can only run on single node, which limits their applicability to
large-scale data. In this paper, we
a Shuffle-Efficient
Similarity Search scheme based on LSH, which can be
efficiently executed in distributed environments, to serve a
massive amount of data. In SES-LSH, a shuffle efficient
indexing scheme is proposed to reduce the data shuffle when
constructing hash tables, and a location-aware querying scheme
is proposed to improve the query performance. We have
implemented a prototype of SES-LSH based on Spark, and
optimizations have been utilized to improve the fine-
grained hash table operations of distributed LSH. Extensive
experiments using large-scale real-world datasets show that
SES-LSH is remarkably more efficient than existing methods.
Keywords-Locality Sensitive Hashing; shuffle; location-aware
querying; Similarity Search.
I. INTRODUCTION
Similarity search [1] has been playing an increasingly
important role in many web services, such as content-based
retrieval services for moving objects [2], images [3], tweets [4]
and other feature-rich data. The basic but essential task in
similarity search is “nearest neighbor search” problem: given
a query object (e.g., an image), how to find the most nearby or
similar objects among all objects. To perform similarity
search efficiently, many solutions use tree-based indexing
techniques to retrieve accurate results, such as R-tree [5], K-
D tree [6], SR-tree [7] and cover-tree [8], which perform well
in low-dimension space. However, feature-rich data are
typically represented as h
igh-dimensional feature vectors,
those work [5-8] suffer from the “curse of dimensionality”,
thus perform poorly when the number of dimensions of data
is large (e.g., > 10) [1].
Locality Sensitive Hashing (LSH) [9] is one of the most
widely used similarity search methods for querying high-
dimensional data. LSH uses hash functions which cause
similar objects have hig
her probabilities of colliding in the
same hash buckets, whereas dissimilar objects will locate
differently with high chances.
Many researchers have developed variants of LSH to
improve its search quality [10-12], indexing structure [13-14],
and query strategy [15-17]. However, most of existing LSH
variants can only run on single node instead of multiple nodes.
Thus, most of them cannot deal with large-
exceed the processing power of a single node.
To support large-scale LSH, we first design a
straightforward distributed version of LSH, called SLSH,
which can run in parallel on multiple nodes. However, such a
simple distributed LSH variant faces three overhead problems
which significantly affect the performance of distributed LSH.
(
i) Shuffle overhead. When constructing hash tables, data
objects with the same hash values will be shuffled across
server nodes which causes heavy network and disk I/O
overhead. (ii) Query broadcast overhead. When retrieving
similar objects, the query object needs to be broadcast to many
nodes to ensure obtaining all similar results. (iii) Hash table
operation overhead. In distributed environments, the hash
table update, deletion, and query in LSH is costly.
To make distributed LSH more efficient,
implement a Shuffle-Efficient Similarity Search scheme
based on LSH (SES-LSH). In SES-LSH, a shuffle efficient
indexing scheme is designed to reduce data shuffle when
constructing hash tables. Based on the indexing scheme, a
location aware querying scheme is proposed to reduce the
query broadcast overhead and reduce the query time.
We have implemented a prototy
-LSH based on
Spark [18, 19]. Through extensive experiments using large-
scale real-world dataset, SES-LSH can handle much larger
dataset than the existing method (Spark-Hash [20]), and we
show that SES-LSH can be remarkably more efficient than
Spark-Hash [20] and SLSH.
The main contributions of the paper are listed as follows:
x We propose LSH on Spark (SLSH), which extends the
capability of LSH to index and query large-scale data on
distributed nodes.
x Based on SLSH, we propose SES-LSH, including a
shuffle efficient indexing and location aware querying
scheme, which improves the performance of distributed
LSH notably. The source code of SES-LSH is released
on GitHub [21].
x We perform extensive experiments with the large-scale
real-world dataset, which demonstrate
and efficiency of the proposed methods.
II. P
RELIMINARIES AND RELATED WORK
A. Locality Sensitive Hashing
LSH uses a set of locality sensitive hash functions which
maps from d-dimensional real number space
to another
2017 IEEE 24th International Conference on Web Services
978-1-5386-0752-7/17 $31.00 © 2017 IEEE
DOI 10.1109/ICWS.2017.99
822