Cluster-based Locality-Sensitive Hashing: Indexing and Searching Large-Scale High-Dimensional Data


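Cluster-based Locality-Sensitive Hashing (CLSH) is an algorithm for indexing and searching large-scale high-dimensional datasets, proposed by Xiangyang Xu, Tongwei Ren and Gangshan Wu of Nanjing University. It addresses a common problem of conventional Locality-Sensitive Hashing (LSH): high memory consumption limits its scalability on large-scale data. Conventional LSH relies on hash functions designed to capture the local similarity of data points, and uses them to map data from the high-dimensional space into a low-dimensional hash space. As the dataset grows, the index can occupy too much memory, which hurts efficiency and performance. CLSH tackles this problem in the following steps:

1. Data preprocessing: a clustering algorithm partitions the raw feature dataset into many small groups of related points, splitting the large dataset into parts that are easier to manage.
2. Distributed processing: each data cluster is mapped to a distributed computing cluster, for example several machines or a distributed storage system. This relieves the load on any single node and improves the scalability and fault tolerance of the whole system.
3. Hash index construction: within each cluster, LSH is applied to build an index for quickly locating potentially similar data points. This step exploits the locality-sensitive property of LSH, namely that similar data points collide under the hash function with high probability.
4. Search strategy optimization: to improve search quality, two selection criteria decide which clusters are searched in detail. They may be based on a specific similarity threshold, or on the hash-table size of each cluster and the quality of the query results.

The advantages of the CLSH framework include:

- Scalability: through clustering and distributed processing, CLSH copes with large-scale datasets and reduces the memory burden.
- Automatic mapping: the generated clusters drive the automatic mapping of the dataset, which makes processing inside each cluster more efficient and reduces the time spent on global search.
- Efficient search: the per-cluster hash index quickly finds potentially similar items.

CLSH is an improved LSH method that keeps search performance while handling large high-dimensional data. It is relevant to real-time search and analysis of large-scale data in areas such as recommender systems, image retrieval and social network analysis.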
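For reference, the locality-sensitive property mentioned in the summary is usually formalized as shown here. This is the standard textbook definition, not a formula quoted from the CLSH paper; the symbols r, c, p_1, p_2, k and L are introduced only for illustration.

```latex
% A hash family \mathcal{H} is (r, cr, p_1, p_2)-sensitive for a distance d,
% with c > 1 and p_1 > p_2, if for all points p and q:
\Pr_{h \in \mathcal{H}}\bigl[h(p) = h(q)\bigr] \ge p_1 \quad \text{if } d(p, q) \le r,
\qquad
\Pr_{h \in \mathcal{H}}\bigl[h(p) = h(q)\bigr] \le p_2 \quad \text{if } d(p, q) \ge cr.

% With k concatenated hash functions per table and L independent tables,
% two points whose single-function collision probability is p share a bucket
% in at least one table with probability
P_{\text{collide}} = 1 - \bigl(1 - p^{k}\bigr)^{L}.
```

In this standard scheme each of the L tables holds a reference to every indexed point, so the index grows roughly as n times L for n points. That growth is the kind of memory pressure the summary describes.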

CLSH: Cluster-based Locality-Sensitive Hashing

Xiangyang Xu, Tongwei Ren, Gangshan Wu*

State Key Laboratory for Novel Software Technology

Nanjing University, Nanjing, China

xiangyang.xu@smail.nju.edu.cn, {rentw, gswu}@nju.edu.cn

ABSTRACT

Locality-sensitive hashing (LSH) usually consumes a large amount of memory in similarity search, which limits its scalability for large-scale applications. In this paper, we propose a novel cluster-based locality-sensitive hashing (CLSH) approach, which extends the conventional LSH framework and aims at indexing and searching large-scale high-dimensional datasets. We first utilize a clustering algorithm to partition the raw feature dataset into clusters, and map these clusters to a distributed cluster. Then, the LSH method is applied to construct the index for each cluster, and we present two criteria to choose the cluster(s) for further detailed search in order to improve the search quality. The proposed framework has the following properties. Firstly, CLSH can cope with large-scale feature datasets. Secondly, the generated clusters can guide the automatic mapping of the feature dataset to a distributed cluster. Finally, the search time can be greatly reduced by searching on multiple computing nodes. Experiments show that the proposed approach outperforms the existing approaches in terms of efficiency and scalability.

Categories and Subject Descriptors

H.3.1 [Content Analysis and Indexing]: Indexing methods; H.3.3 [Information Search and Retrieval]: Clustering, Search process

General Terms

Algorithm, Experimentation, Performance

Keywords

Approximate Nearest Neighbor search, clustering, Locality-Sensitive Hashing, distributed cluster, high-dimensional indexing

1. INTRODUCTION

Nearest neighbor search, also known as similarity search, plays an important role in multimedia applications, such as information retrieval, data analysis and object recognition [2, 8, 7, 14]. Tree-based search methods, including k-d tree and R-tree, usually perform well for nearest neighbor search on low-dimensional features, but their search performance inevitably degrades severely as the feature dimensionality increases.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

ICIMCS’14, July 10–12, 2014, Xiamen, Fujian, China.

To improve the search performance on high-dimensional features, Approximate Nearest Neighbor (ANN) search was proposed, which aims at balancing search result quality and response time by providing only approximate results in nearest neighbor search [1]. Its results differ only slightly from those of exact nearest neighbor search when the user’s quality notion is accurately captured, which is good enough for most practical applications [3]. In ANN search, hashing-based methods are dominant because of their insensitivity to feature dimensionality, and Locality-Sensitive Hashing (LSH) [5, 3] is one of the pioneering and most widely used of them. Recently, various extensions of LSH have been presented, such as multi-probe LSH [11] and query-adaptive LSH [6], and a number of learning-based hashing methods have been proposed, including spectral hashing (SH) [17], semi-supervised hashing (SSH) [16], weakly-supervised hashing [12] and kernelized LSH (KLSH) [9]. These methods substantially reduce the search time while preserving comparable search quality. However, they are all designed for centralized settings and are limited by the memory capacity of a single computing node. Therefore, their scalability is severely limited by the scale of the feature dataset and the capability of the single computing node.
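To make the memory argument concrete, here is a minimal single-node LSH sketch in Python. It assumes sign random projections (random hyperplanes) as the hash family, which this excerpt does not specify; the class name `HyperplaneLSH` and the parameters `n_bits` and `n_tables` are hypothetical and chosen only for illustration.

```python
import numpy as np
from collections import defaultdict


class HyperplaneLSH:
    """Toy in-memory LSH index using sign random projections (an assumed hash family)."""

    def __init__(self, dim, n_bits=8, n_tables=4, seed=0):
        rng = np.random.default_rng(seed)
        # One (n_bits x dim) matrix of random hyperplanes per hash table.
        self.planes = rng.standard_normal((n_tables, n_bits, dim))
        self.tables = [defaultdict(list) for _ in range(n_tables)]
        self.data = None

    def _keys(self, x):
        # Each bit records on which side of a hyperplane x falls.
        bits = (self.planes @ x) > 0  # shape: (n_tables, n_bits)
        return [row.tobytes() for row in bits]

    def build(self, data):
        self.data = np.asarray(data, dtype=float)
        for i, x in enumerate(self.data):
            for table, key in zip(self.tables, self._keys(x)):
                table[key].append(i)

    def query(self, q, k=1):
        q = np.asarray(q, dtype=float)
        # Candidates are the points sharing a bucket with q in any table.
        cand = set()
        for table, key in zip(self.tables, self._keys(q)):
            cand.update(table.get(key, []))
        cand = np.fromiter(cand, dtype=int)
        if cand.size == 0:
            return []
        # Rank only the candidates by exact Euclidean distance.
        dists = np.linalg.norm(self.data[cand] - q, axis=1)
        return cand[np.argsort(dists)[:k]].tolist()

    def n_entries(self):
        # Every point is stored once per table: n_points * n_tables entries.
        return sum(len(bucket) for t in self.tables for bucket in t.values())
```

In this sketch the number of stored entries equals the number of points times the number of tables, and all of it lives in one process, which illustrates why a centralized index is bounded by the memory of a single node.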

To overcome the above problem, we propose a novel cluster-based locality-sensitive hashing (CLSH) framework, which aims to extend the LSH method to indexing and searching large-scale high-dimensional feature datasets. Here, “cluster-based” has two meanings. On the one hand, our approach applies clustering to the raw feature dataset and constructs an index for each cluster. On the other hand, our approach is carried out on a distributed cluster which comprises multiple computing nodes. Figure 1 shows the overview of our approach. First, we utilize a clustering algorithm to partition the raw feature dataset into clusters. Then, the feature dataset is automatically mapped to different computing nodes under the guidance of these clusters. After this, for each cluster, the LSH method is applied to construct the index. In the nearest neighbor search phase, one or several clusters are selected for further detailed search to obtain the retrieval results, which greatly reduces the search time.
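The pipeline described in this paragraph can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: k-means stands in for the unspecified clustering algorithm, sign random projections stand in for the hash family, each `ClusterIndex` object plays the role of one computing node inside a single process, and probing the `n_probe` nearest centroids is a placeholder for the paper's two selection criteria, which this excerpt does not describe. All names are hypothetical.

```python
import numpy as np
from collections import defaultdict


def assign(data, centroids):
    """Index of the nearest centroid for every row of data."""
    d = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)


def kmeans(data, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means (assumed clustering step); returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = assign(data, centroids)
        for c in range(k):
            members = data[labels == c]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[c] = members.mean(axis=0)
    return centroids, assign(data, centroids)


class ClusterIndex:
    """LSH tables over one cluster; stands in for one computing node."""

    def __init__(self, ids, points, n_bits, n_tables, rng):
        self.ids, self.points = ids, points
        self.planes = rng.standard_normal((n_tables, n_bits, points.shape[1]))
        self.tables = [defaultdict(list) for _ in range(n_tables)]
        for local, x in enumerate(points):
            for table, key in zip(self.tables, self._keys(x)):
                table[key].append(local)

    def _keys(self, x):
        return [row.tobytes() for row in (self.planes @ x) > 0]

    def candidates(self, q):
        local = set()
        for table, key in zip(self.tables, self._keys(q)):
            local.update(table.get(key, []))
        return sorted(local)


class CLSH:
    def __init__(self, data, n_clusters=4, n_bits=6, n_tables=4, seed=0):
        self.data = np.asarray(data, dtype=float)
        rng = np.random.default_rng(seed)
        self.centroids, labels = kmeans(self.data, n_clusters, seed=seed)
        self.nodes = {}
        for c in range(n_clusters):
            ids = np.flatnonzero(labels == c)
            if ids.size:  # skip clusters that ended up empty
                self.nodes[c] = ClusterIndex(ids, self.data[ids], n_bits, n_tables, rng)

    def query(self, q, k=1, n_probe=1):
        q = np.asarray(q, dtype=float)
        # Assumed selection rule: probe the n_probe clusters with the nearest centroids.
        order = np.argsort(np.linalg.norm(self.centroids - q, axis=1)).tolist()
        probe = [c for c in order if c in self.nodes][:n_probe]
        found = []
        for c in probe:
            node = self.nodes[c]
            for local in node.candidates(q):
                dist = float(np.linalg.norm(node.points[local] - q))
                found.append((dist, int(node.ids[local])))
        return [i for _, i in sorted(found)[:k]]
```

Each node indexes only its own cluster, so the per-node table size scales with the cluster size rather than with the whole dataset. In a real deployment the loop over `probe` would be dispatched to separate machines; here it runs sequentially to keep the sketch self-contained.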

Compared to the original LSH and other hashing-based ANN methods, our approach has the following advantages:

• CLSH can cope with larger-scale feature datasets by applying the LSH method to each cluster instead of to the whole feature dataset;
