SES-LSH: Distributed Locality Sensitive Hashing for Efficient Similarity Search on Large-Scale Data


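SES-LSH (Shuffle-Efficient Locality Sensitive Hashing for Distributed Similarity Search) is a distributed hashing algorithm for similarity search over large-scale data, proposed by Dongsheng Li, Wanxin Zhang, Siqi Shen, and Yiming Zhang of the National Lab for Parallel and Distributed Processing at the National University of Defense Technology. The work builds on Locality Sensitive Hashing (LSH), a technique widely used in web services such as content-based retrieval of images and videos, and valued for its efficiency and query performance. Most existing LSH variants, however, run only on a single node, which limits their usefulness for large-scale data. SES-LSH is designed to address this limitation. Its core contributions are a shuffle-efficient indexing scheme, which reduces the data movement and replication incurred when constructing hash tables, and a location-aware querying scheme, which takes the data distribution and the query into account when locating and processing queries, further improving query performance. In a distributed environment, SES-LSH can draw on the compute and storage resources of multiple machines to improve search quality and response time on massive datasets. This matters for services that depend on large-scale data processing, such as social media recommendation, advertisement matching, and data analysis for Internet of Things devices. By reducing data transfer overhead and improving overall computational efficiency, SES-LSH offers a practical solution for similarity search in distributed settings. In summary, SES-LSH adapts LSH for efficient use in distributed environments, with a design intended to preserve performance and scalability on large-scale data.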

SES-LSH: Shuffle-Efficient Locality Sensitive Hashing for Distributed Similarity Search

Dongsheng Li, Wanxin Zhang, Siqi Shen, Yiming Zhang

National Lab for Parallel and Distributed Processing, College of Computer

National University of Defense Technology, Changsha, Hunan

Email: dsli@nudt.edu.cn, kevinzwx1992@gmail.com, shensiqi@nudt.edu.cn, ymzhang@nudt.edu.cn

Abstract—Locality Sensitive Hashing (LSH) is a widely used similarity search technique for many web services, such as content-based retrieval services for images and videos. Due to its popularity, much research effort has been devoted to improving the search quality and the indexing and query performance of LSH. However, most existing variants of LSH can only run on a single node, which limits their applicability to large-scale data. In this paper, we present a Shuffle-Efficient Similarity Search scheme based on LSH (SES-LSH), which can be efficiently executed in distributed environments to serve a massive amount of data. In SES-LSH, a shuffle-efficient indexing scheme is proposed to reduce the data shuffle when constructing hash tables, and a location-aware querying scheme is proposed to improve the query performance. We have implemented a prototype of SES-LSH based on Spark, and several optimizations have been utilized to improve the fine-grained hash table operations of distributed LSH. Extensive experiments using large-scale real-world datasets show that SES-LSH is remarkably more efficient than existing methods.

Keywords-Locality Sensitive Hashing; shuffle; location-aware querying; Similarity Search.

I. INTRODUCTION

Similarity search [1] plays an increasingly important role in many web services, such as content-based retrieval services for moving objects [2], images [3], tweets [4], and other feature-rich data. The basic but essential task in similarity search is the "nearest neighbor search" problem: given a query object (e.g., an image), find the nearest or most similar objects among all objects. To perform similarity search efficiently, many solutions use tree-based indexing techniques to retrieve accurate results, such as R-tree [5], K-D tree [6], SR-tree [7], and cover-tree [8], which perform well in low-dimensional spaces. However, feature-rich data are typically represented as high-dimensional feature vectors. These methods [5-8] suffer from the "curse of dimensionality" and thus perform poorly when the number of dimensions of the data is large (e.g., > 10) [1].
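The nearest neighbor search problem stated above can be made concrete with an exhaustive-scan baseline. The sketch below is illustrative only and is not taken from the paper: the function names `euclidean` and `nearest_neighbors`, the use of Euclidean distance, and the sample points are all assumptions. It returns exact results but compares the query against every object, which is the cost that index structures and LSH aim to avoid.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbors(query, objects, k=1):
    """Return the indices of the k objects closest to `query` by exhaustive scan."""
    ranked = sorted(range(len(objects)), key=lambda i: euclidean(query, objects[i]))
    return ranked[:k]


# Hypothetical toy data: four 2-dimensional feature vectors.
objects = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.9, 1.2)]
result = nearest_neighbors((1.0, 1.1), objects, k=2)
print(result)  # indices of the two closest objects
```

Each query costs one distance computation per stored object, so the scan grows linearly with both the dataset size and the dimensionality.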

Locality Sensitive Hashing (LSH) [9] is one of the most widely used similarity search methods for querying high-dimensional data. LSH uses hash functions under which similar objects have a higher probability of colliding in the same hash bucket, whereas dissimilar objects are likely to fall into different buckets.
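As a rough illustration of the collision property described above, the following sketch builds a single hash table with a random-projection hash family for Euclidean distance, h(v) = floor((a·v + b) / w). This is one common LSH family and is not necessarily the one used in SES-LSH. The class name `EuclideanLSH` and the parameters `num_hashes`, `bucket_width`, and `seed` are hypothetical.

```python
import math
import random


class EuclideanLSH:
    """Single LSH table keyed by several random-projection hashes (illustrative)."""

    def __init__(self, dim, num_hashes=4, bucket_width=4.0, seed=0):
        rng = random.Random(seed)
        self.w = bucket_width
        # One Gaussian projection vector and one random offset per hash function.
        self.a = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_hashes)]
        self.b = [rng.uniform(0.0, bucket_width) for _ in range(num_hashes)]
        self.table = {}

    def key(self, v):
        """Concatenate the hash values into one bucket key."""
        return tuple(
            math.floor((sum(ai * vi for ai, vi in zip(a, v)) + b) / self.w)
            for a, b in zip(self.a, self.b)
        )

    def insert(self, obj_id, v):
        self.table.setdefault(self.key(v), []).append(obj_id)

    def candidates(self, q):
        """Objects sharing the query's bucket; exact distances are checked afterwards."""
        return self.table.get(self.key(q), [])


lsh = EuclideanLSH(dim=3, seed=1)
lsh.insert("a", [1.0, 2.0, 3.0])
print(lsh.candidates([1.0, 2.0, 3.0]))  # an identical vector always lands in the same bucket
```

Nearby vectors collide only with some probability, so practical schemes typically build several such tables with independent hash functions and take the union of the candidate sets.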

Many researchers have developed variants of LSH to improve its search quality [10-12], indexing structure [13-14], and query strategy [15-17]. However, most existing LSH variants can only run on a single node rather than on multiple nodes. Thus, most of them cannot deal with large-scale data that exceed the processing power of a single node.

To support large-scale LSH, we first design a straightforward distributed version of LSH, called SLSH, which can run in parallel on multiple nodes. However, such a simple distributed LSH variant faces three overhead problems that significantly affect its performance. (i) Shuffle overhead. When constructing hash tables, data objects with the same hash values are shuffled across server nodes, which causes heavy network and disk I/O overhead. (ii) Query broadcast overhead. When retrieving similar objects, the query object needs to be broadcast to many nodes to ensure that all similar results are obtained. (iii) Hash table operation overhead. In distributed environments, hash table updates, deletions, and queries in LSH are costly.
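To give a feel for the shuffle overhead in (i), the toy simulation below counts how many objects would have to change nodes if every hash bucket were gathered on a single owner node. It is a sketch under stated assumptions and does not reproduce SLSH or SES-LSH: the function `count_shuffled`, the rule that a bucket is owned by node `bucket_key % num_nodes`, and the sample placement are all hypothetical.

```python
def count_shuffled(placements, bucket_of, num_nodes):
    """Count objects that must move to another node during index construction.

    placements: dict mapping object id -> node that currently stores the object.
    bucket_of:  dict mapping object id -> integer bucket key from the LSH function.
    Assumption: each bucket is owned by node (bucket_key % num_nodes).
    """
    moved = 0
    for obj_id, node in placements.items():
        owner = bucket_of[obj_id] % num_nodes
        if owner != node:
            moved += 1  # the object crosses the network to reach its bucket's owner
    return moved


# Hypothetical example: six objects spread round-robin over three nodes.
placements = {0: 0, 1: 1, 2: 2, 3: 0, 4: 1, 5: 2}
bucket_of = {0: 0, 1: 0, 2: 2, 3: 1, 4: 1, 5: 0}
print(count_shuffled(placements, bucket_of, num_nodes=3))
```

In this example half of the objects move, and the count is incurred once per hash table, so building many tables multiplies the network and disk traffic.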

To make distributed LSH more efficient, we design and implement a Shuffle-Efficient Similarity Search scheme based on LSH (SES-LSH). In SES-LSH, a shuffle-efficient indexing scheme is designed to reduce data shuffle when constructing hash tables. Based on the indexing scheme, a location-aware querying scheme is proposed to reduce the query broadcast overhead and the query time.

We have implemented a prototype of SES-LSH based on Spark [18, 19]. Extensive experiments using a large-scale real-world dataset show that SES-LSH can handle much larger datasets than the existing method (Spark-Hash [20]) and can be remarkably more efficient than both Spark-Hash [20] and SLSH.

The main contributions of the paper are as follows:

• We propose LSH on Spark (SLSH), which extends the capability of LSH to index and query large-scale data on distributed nodes.

• Based on SLSH, we propose SES-LSH, which includes a shuffle-efficient indexing scheme and a location-aware querying scheme and notably improves the performance of distributed LSH. The source code of SES-LSH is released on GitHub [21].

• We perform extensive experiments with a large-scale real-world dataset, which demonstrate the effectiveness and efficiency of the proposed methods.

II. PRELIMINARIES AND RELATED WORK

A. Locality Sensitive Hashing

LSH uses a set of locality sensitive hash functions that map from the d-dimensional real number space R^d to another

2017 IEEE 24th International Conference on Web Services

DOI 10.1109/ICWS.2017.99
