SK-LSH: An Efficient Index Structure for Optimizing Approximate Nearest Neighbor Search

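Updated 2024-08-26 · 616 KB PDF

"SK-LSH: An Effective Index Structure for Approximate Nearest Neighbor Search"

Approximate Nearest Neighbor (ANN) search in high-dimensional spaces has become a fundamental paradigm in many applications, such as image recognition, recommender systems, and machine learning. As data volumes grow, finding approximate nearest neighbors in a dataset quickly and effectively becomes critical. Locality Sensitive Hashing (LSH) is currently regarded as one of the most promising solutions, because it handles high-dimensional data efficiently and reduces search complexity. Traditional LSH methods, however, have a significant drawback: accessing candidate objects requires a large number of random I/O operations. To guarantee the quality of the returned results, enough objects must be verified, which incurs a high I/O cost and lowers search efficiency. To address this problem, the authors propose a new method, Sorting Keys-LSH (SK-LSH).

The core idea of SK-LSH is to reduce the number of page accesses by arranging candidate objects locally. The authors first define a new measure for the distance between "sorting keys". These sorting keys are generated from the features of the objects and are used to determine how similar two objects are. With a carefully designed ordering, candidate objects are arranged according to their hash values so that similar objects are more likely to be stored together, which reduces the number of pages that must be accessed during verification. Index construction and query processing are organized so that stored candidates can be read sequentially at query time rather than randomly, which reduces random accesses to disk or memory and improves I/O efficiency. SK-LSH also controls the error rate, so that high-quality approximate nearest neighbor results are still returned while the I/O cost drops.

Experimental results show that, compared with existing LSH methods, SK-LSH significantly reduces I/O overhead while maintaining search accuracy, and the advantage is more pronounced on large-scale, high-dimensional datasets. The method is of practical value for real-time applications that must process large amounts of data efficiently, such as real-time recommender systems and large-scale image search.

In summary, SK-LSH is an index structure that improves the hash collision strategy of LSH and the ordering of candidate objects, addressing the I/O inefficiency of traditional LSH on high-dimensional data. It improves search efficiency without sacrificing search quality.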

3.1 Locality Sensitive Hashing

To solve the c-ANN problem, Indyk and Motwani proposed the idea of LSH [8], which is formally defined as follows.

Definition 1. (Locality Sensitive Hashing) Given a distance $R$, an approximate ratio $c$ and two probability values $P_1$ and $P_2$, a hash function $h : \mathbb{R}^d \to \mathbb{Z}$ is called $(R, c, P_1, P_2)$-sensitive if it satisfies the following conditions simultaneously for any two points $p_1, p_2 \in D$:

• If $\|p_1, p_2\| \le R$, then $\Pr[h(p_1) = h(p_2)] \ge P_1$;

• If $\|p_1, p_2\| \ge cR$, then $\Pr[h(p_1) = h(p_2)] \le P_2$.

To make sense, both $c > 1$ and $P_1 \ge P_2$ hold. In addition, a compound LSH function is denoted as $G = (h_1, \ldots, h_m)$, where $h_1, \ldots, h_m$ are randomly selected LSH functions. Specifically, for $\forall p \in D$, $K = G(p) = (h_1(p), \ldots, h_m(p))$ is defined as the compound hash key of point $p$ under $G$.

According to Definition 1, LSH ensures that a close pair collides with each other with a high probability ($P_1$) and a far pair with a low probability ($P_2$). This property of LSH is also called distance-preserving.

The LSH function commonly used in Euclidean space, which was proposed by Datar et al. [2], is shown as follows:

$$ h(p) = \left\lfloor \frac{a \cdot p + bW}{W} \right\rfloor \tag{1} $$

Here, $a$ is a random vector with each dimension independently chosen from a Gaussian distribution and $p$ is an arbitrary point in $D$. $b$ is a real number uniformly drawn from the range $[0, 1]$. $W$ is also a real number, representing the width of the LSH function. For two points $p_1, p_2$ and an LSH function $h$, if $\|p_1, p_2\| = r$, the probability of $h(p_1) = h(p_2)$ can be computed as follows [2].

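$$ p(r, W) = \Pr[h(p_1) = h(p_2)] = \int_0^W \frac{1}{r} f_2\!\left(\frac{t}{r}\right)\left(1 - \frac{t}{W}\right) dt = 2\,\mathrm{norm}(W/r) - 1 - \frac{2}{\sqrt{2\pi}\,W/r}\left(1 - e^{-W^2/(2r^2)}\right) \tag{2} $$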

Here, $f_2(x) = \frac{2}{\sqrt{2\pi}} e^{-x^2/2}$ and $\mathrm{norm}(\cdot)$ represents the cumulative distribution function of a random variable following the Gaussian distribution. According to Equation 2, the probability $p(r, W)$ decreases monotonically when $r$ increases but grows monotonically when $W$ rises.
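The monotonic behaviour of the collision probability can be checked numerically. The sketch below assumes the standard closed form from Datar et al., $2\,\mathrm{norm}(W/r) - 1 - \frac{2}{\sqrt{2\pi}\,W/r}(1 - e^{-W^2/(2r^2)})$, and takes $\mathrm{norm}$ to be the standard Gaussian CDF. The function names `norm_cdf` and `collision_probability` are illustrative and do not come from the paper.

```python
import math


def norm_cdf(x: float) -> float:
    """CDF of the standard Gaussian distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def collision_probability(r: float, w: float) -> float:
    """Closed form of the collision probability for distance r > 0, width w > 0."""
    ratio = w / r
    tail = 2.0 / (math.sqrt(2.0 * math.pi) * ratio)
    return 2.0 * norm_cdf(ratio) - 1.0 - tail * (1.0 - math.exp(-ratio * ratio / 2.0))


# Fixed width, growing distance: the probability should fall.
by_distance = [collision_probability(r, 4.0) for r in (0.5, 1.0, 2.0, 4.0)]

# Fixed distance, growing width: the probability should rise.
by_width = [collision_probability(1.0, w) for w in (1.0, 2.0, 4.0, 8.0)]
```

Because the expression depends on $r$ and $W$ only through the ratio $W/r$, doubling both leaves the probability unchanged.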

Due to the distance-preserving property of LSH, it is reasonable to use the hash values to estimate the distance between two points. Therefore, if two points have similar hash values, it is believed that they are close to each other with certain confidence. Based on this idea, several approaches have been proposed for c-ANN [2, 4, 5, 12, 19]. However, a single hash function such as Equation 1 performs poorly at filtering irrelevant points, since many pairs that are distant from each other may still share the same hash value. In other words, numerous false positives may be returned. To remove the irrelevant points (i.e., false positives), a compound LSH function $G = (h_1, h_2, \ldots, h_m)$ is employed to improve the distinguishing capacity. Note that each element of a compound LSH function, $h_i$, is randomly selected as defined in Equation 1. Only points sharing all the $m$ hash values with the query point are taken into account as candidate points, as suggested by the basic LSH [2]. However, c-ANN search algorithms should ensure that data points having similar compound hash keys to the query point are taken into account as candidates. Hence, a distance measure over compound hash keys is required. In the following, we propose a novel measure to evaluate the distance between a pair of compound hash keys.
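A compound hash key can be sketched in a few lines of Python. The code assumes that Equation 1 has the form $h(p) = \lfloor (a \cdot p + bW)/W \rfloor$ with $a$ Gaussian and $b$ uniform on $[0, 1]$. The helper names `make_lsh` and `make_compound_lsh`, the dimension, the width and the seed are illustrative choices, not values from the paper.

```python
import math
import random


def make_lsh(dim, width, rng):
    """Draw one hash function of the assumed form floor((a.p + b*W) / W)."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # Gaussian projection vector
    b = rng.uniform(0.0, 1.0)                      # offset in units of the width

    def h(p):
        dot = sum(ai * pi for ai, pi in zip(a, p))
        return math.floor((dot + b * width) / width)

    return h


def make_compound_lsh(dim, m, width, rng):
    """Build G = (h_1, ..., h_m) from m independently drawn hash functions."""
    hs = [make_lsh(dim, width, rng) for _ in range(m)]

    def G(p):
        # The compound hash key is the tuple of the m individual hash values.
        return tuple(h(p) for h in hs)

    return G


rng = random.Random(0)  # fixed seed so the sketch is repeatable
G = make_compound_lsh(dim=4, m=3, width=4.0, rng=rng)
key = G([0.1, 0.2, 0.3, 0.4])
```

Two points become candidates for each other under the basic scheme only when their tuples agree in every position, which is why a larger $m$ filters more aggressively.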

3.2 Distance over Compound Hash Keys

Given a compound LSH function $G$ and two points $p_1, p_2 \in D$, we have compound hash keys $K_1 = G(p_1)$ and $K_2 = G(p_2)$, where both $K_1$ and $K_2$ are tuples containing $m$ hash values. Let $k_{1,i}$ (resp. $k_{2,i}$) be the $i$-th element of $K_1$ (resp. $K_2$), that is, $k_{1,i} = h_i(p_1)$ and $k_{2,i} = h_i(p_2)$.

Definition 2. (Prefix of a Compound Hash Key) Given a point $p \in D$ and its compound hash key $K = G(p) = (k_1, k_2, \ldots, k_m)$, the $l$-length prefix of $K$, denoted as $\mathrm{pref}(K, l)$, consists of the first $l$ elements of $K$, where $1 \le l \le m$, and is formally defined as follows.

$$ \mathrm{pref}(K, l) = (k_1, k_2, \ldots, k_l) \tag{3} $$

In particular, we denote $\mathrm{pref}(K, 0)$ as $K_\emptyset$, which is an empty hash key.

Here, we are inspired by the prefix of a character string and treat a compound hash key as a string of elements. Therefore, its prefix is the substring constituted by its first several elements. For example, for a compound hash key $K = (1, 2, 3, 4)$, $\mathrm{pref}(K, 3) = (1, 2, 3)$, $\mathrm{pref}(K, 2) = (1, 2)$ and $\mathrm{pref}(K, 0) = K_\emptyset$.
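The prefix operation maps directly onto tuple slicing. A minimal sketch, assuming compound hash keys are represented as Python tuples and the empty key as the empty tuple (a representation chosen here for illustration):

```python
def pref(key, l):
    """Return the l-length prefix of a compound hash key (a tuple)."""
    if not 0 <= l <= len(key):
        raise ValueError("prefix length must lie between 0 and m")
    return tuple(key[:l])


K = (1, 2, 3, 4)
prefixes = [pref(K, l) for l in range(len(K) + 1)]
```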

Definition 3. (Non-prefix Length of Compound Hash Keys) Given two compound hash keys $K_1 = (k_{1,1}, k_{1,2}, \ldots, k_{1,m})$ and $K_2 = (k_{2,1}, k_{2,2}, \ldots, k_{2,m})$, if $\mathrm{pref}(K_1, l) = \mathrm{pref}(K_2, l)$ and $\mathrm{pref}(K_1, l + 1) \ne \mathrm{pref}(K_2, l + 1)$, where $0 \le l < m$, then the non-prefix length between $K_1$ and $K_2$, denoted as $\mathrm{KL}(K_1, K_2)$, is formally defined as follows:

$$ \mathrm{KL}(K_1, K_2) = m - l \tag{4} $$

If $\mathrm{pref}(K_1, m) = \mathrm{pref}(K_2, m)$, then $\mathrm{KL}(K_1, K_2) = 0$.

A smaller non-prefix length between two compound hash keys indicates that they share a longer common prefix. For instance, given two compound hash keys $K_1 = (1, 2, 3, 4)$ and $K_2 = (1, 2, 3, 5)$, $\mathrm{KL}(K_1, K_2) = 1$ since $\mathrm{pref}(K_1, 3) = \mathrm{pref}(K_2, 3)$ and $\mathrm{pref}(K_1, 4) \ne \mathrm{pref}(K_2, 4)$.
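The non-prefix length amounts to counting the common prefix and subtracting it from $m$. The following sketch assumes keys are equal-length tuples; the function name `non_prefix_length` is illustrative.

```python
def non_prefix_length(k1, k2):
    """KL(K1, K2) = m - l, where l is the length of the longest common prefix."""
    if len(k1) != len(k2):
        raise ValueError("compound hash keys must have the same length m")
    m = len(k1)
    l = 0
    # Advance while the two keys still agree element by element.
    while l < m and k1[l] == k2[l]:
        l += 1
    return m - l


kl_example = non_prefix_length((1, 2, 3, 4), (1, 2, 3, 5))
```

Identical keys share the full $m$-length prefix, so the loop runs to the end and the result is 0, matching the special case in the definition.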

Definition 4. ($(l + 1)$-th Element Distance of Compound Hash Keys) Given two compound hash keys $K_1 = (k_{1,1}, k_{1,2}, \ldots, k_{1,m})$ and $K_2 = (k_{2,1}, k_{2,2}, \ldots, k_{2,m})$, if $\mathrm{KL}(K_1, K_2) = m - l$, where $0 \le l < m$, then the $(l + 1)$-th element distance between $K_1$ and $K_2$ is defined as the absolute value of the distance between their $(l + 1)$-th elements as follows:

$$ \mathrm{KD}(K_1, K_2) = |k_{1,l+1} - k_{2,l+1}| \tag{5} $$

If $l = m$, we denote $\mathrm{KD}(K_1, K_2)$ as 0 by default.
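The element distance looks only at the first position where the two keys differ. A minimal sketch under the same tuple representation follows; the name `element_distance` and the key `(1, 2, 3, 9)` are hypothetical and were picked for illustration, not taken from the paper.

```python
def element_distance(k1, k2):
    """KD(K1, K2): absolute difference at the first position where the keys differ."""
    if len(k1) != len(k2):
        raise ValueError("compound hash keys must have the same length m")
    for x, y in zip(k1, k2):
        if x != y:
            return abs(x - y)
    return 0  # identical keys (l = m): the distance is 0 by default


base = (1, 2, 3, 4)
near = element_distance(base, (1, 2, 3, 5))
far = element_distance(base, (1, 2, 3, 9))  # hypothetical key, same common prefix
```

Both comparison keys share a 3-length prefix with `base`, so they have the same non-prefix length, and only the element distance tells them apart.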

Though the notion of non-prefix length can be used to measure the distance between two compound hash keys, we employ the $(l + 1)$-th element distance to further distinguish two compound hash keys. Let us consider the following three compound hash keys, $K_1 = (1, 2, 3, 4)$, $K_2 = (1, 2, 3, 5)$ and

