Improving the Accuracy of Nearest Neighbor Search in High-Dimensional Space with Query-Aware Locality-Sensitive Hashing


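Query-Aware Locality-Sensitive Hashing (QALSH) is a well-known indexing scheme for the c-approximate nearest neighbor (c-ANN) search problem in high-dimensional Euclidean space. Traditionally, LSH functions are constructed in a query-oblivious manner: data objects are partitioned into buckets before any query arrives. This predefined partition has a drawback: objects close to the query may end up in a different bucket from it, which hurts both query efficiency and accuracy. QALSH removes this limitation by making the partition query-aware. When a query is processed, the position and semantic information of the query object is taken into account and the bucket partition is adjusted dynamically, which reduces the unfavorable bucket assignments caused by where the query happens to fall and raises the chance of an accurate match. Compared with traditional query-oblivious LSH schemes, such as the external-memory methods C2LSH and LSB-Forest, QALSH adapts better to practical scenarios, particularly those that combine spatial and semantic information.

The core idea of QALSH is to construct a set of hash functions that are sensitive to the specific query, so that objects within the same or a similar distance range have a higher probability of falling into the same hash bucket. This is typically achieved with multiple rounds of hashing and multiple hash functions, where each round maps the data points to a lower-dimensional space while preserving locality sensitivity. To implement QALSH, the researchers propose a hybrid strategy that combines locality-sensitive hashing with data-structure optimizations such as random projection and multi-level indexing. These techniques reduce data dimensionality and storage overhead, and quickly locate candidate approximate nearest neighbors at query time. In this way QALSH remains space-efficient while significantly improving query performance, especially on large-scale datasets and in real-time applications.

In summary, Query-Aware Locality-Sensitive Hashing is an indexing technique that improves the accuracy and efficiency of approximate nearest neighbor search by taking the query into account. This matters for location-based services, image recognition, and recommender systems, because it identifies the data points most relevant to the query more accurately, which improves user experience and lowers computational cost.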

where $c > 1$ and $p_1 > p_2$. For ease of reference, $p_1$ and $p_2$ are called positively-colliding probability and negatively-colliding probability, respectively.

A query-oblivious LSH family is an LSH family $H = \{h : \mathbb{R}^d \to \mathbb{Z}\}$ where each hash function $h$ exploits query-oblivious bucket partition, i.e., buckets in the hash table of $h$ are statically determined before any query arrives. Normally, for a query-oblivious LSH function $h$, two objects $o$ and $q$ collide under $h$ means $h(o) = h(q)$, where $h(o)$ identifies the bucket of $o$. A typical query-oblivious LSH function is formally defined as follows [2].

$$h_{\vec{a},b}(o) = \left\lfloor \frac{\vec{a} \cdot \vec{o} + b}{w} \right\rfloor, \tag{1}$$

where $\vec{o}$ is a $d$-dimensional Euclidean vector representing object $o$, and $\vec{a}$ is a $d$-dimensional random vector with each entry drawn independently from the standard normal distribution $N(0, 1)$. $w$ is the pre-specified bucket width, and $b$ is a real number uniformly drawn from $[0, w)$.
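As a minimal sketch of the query-oblivious function described above, the following Python code draws one pair $(\vec{a}, b)$ and returns $h(o) = \lfloor (\vec{a} \cdot \vec{o} + b) / w \rfloor$. The helper name `make_oblivious_hash`, the seed, the dimension, and the toy vectors are illustrative assumptions, not part of the source.

```python
import math
import random

def make_oblivious_hash(d, w, rng):
    """Draw one (a, b) pair and return h(o) = floor((a . o + b) / w)."""
    a = [rng.gauss(0.0, 1.0) for _ in range(d)]  # entries drawn from N(0, 1)
    b = rng.random() * w                         # offset drawn from [0, w)

    def h(o):
        dot = sum(ai * oi for ai, oi in zip(a, o))
        return math.floor((dot + b) / w)         # integer bucket id

    return h

rng = random.Random(0)                  # fixed seed, for illustration only
h = make_oblivious_hash(d=4, w=4.0, rng=rng)

o = [0.1, 0.2, 0.3, 0.4]                # hypothetical data object
q = [0.1, 0.2, 0.3, 0.5]                # hypothetical query
# Bucket boundaries are fixed by (a, b, w) before q is seen, so even a very
# close pair may straddle a boundary and fail to collide.
print(h(o), h(q), h(o) == h(q))
```

Because the boundaries are fixed in advance, whether two nearby objects share a bucket depends on where they happen to fall relative to those boundaries.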

For two objects $o_1$ and $o_2$, and a uniformly randomly chosen hash function $h_{\vec{a},b}$, let $s = \|o_1, o_2\|$, and then their collision probability is computed as follows [2]:

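$$\xi(s) = \Pr_{\vec{a},b}\left[h_{\vec{a},b}(o_1) = h_{\vec{a},b}(o_2)\right] = \int_0^w \frac{1}{s} f_2\!\left(\frac{t}{s}\right)\left(1 - \frac{t}{w}\right) dt \tag{2}$$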

where $f_2(x) = \frac{2}{\sqrt{2\pi}} e^{-x^2/2}$. For a fixed $w$, $\xi(s)$ decreases monotonically as $s$ increases. With $\xi_1 = \xi(r)$ and $\xi_2 = \xi(cr)$, the family of hash functions $h_{\vec{a},b}$ is $(r, cr, \xi_1, \xi_2)$-sensitive.
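The monotonic behaviour of $\xi(s)$ can be checked numerically. The sketch below approximates the integral $\int_0^w \frac{1}{s} f_2(t/s)(1 - t/w)\,dt$ with a midpoint rule, where $f_2$ is the density of $|X|$ for $X \sim N(0, 1)$. The step count, the bucket width, and the sample distances are illustrative assumptions.

```python
import math

def f2(x):
    """Density of |X| for X ~ N(0, 1), evaluated at x >= 0."""
    return 2.0 / math.sqrt(2.0 * math.pi) * math.exp(-x * x / 2.0)

def xi(s, w, steps=20000):
    """Midpoint-rule approximation of the collision probability integral."""
    dt = w / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * dt
        total += (1.0 / s) * f2(t / s) * (1.0 - t / w)
    return total * dt

w = 4.0                                  # illustrative bucket width
distances = (0.5, 1.0, 2.0, 4.0)         # illustrative values of s
values = [xi(s, w) for s in distances]
for s, v in zip(distances, values):
    print(f"s = {s:3.1f}  xi(s) = {v:.4f}")
```

The printed values shrink as $s$ grows, which is the property that makes the family locality-sensitive.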

Specifically, if we set $r = 1$ and $cr = c$, we have Lemma 1 as follows [2]:

Lemma 1. The query-oblivious LSH family identified by Equation 1 is $(1, c, \xi_1, \xi_2)$-sensitive, where $\xi_1 = \xi(1)$ and $\xi_2 = \xi(c)$.

3. QUERY-AWARE LSH FAMILY

In this section we first introduce the concept of query-aware LSH functions. Then we make a computational comparison of positively- and negatively-colliding probabilities between query-oblivious and query-aware LSH families. Finally, we show that the query-aware LSH family is able to support virtual rehashing in a simple and quick manner.

3.1 $(1, c, p_1, p_2)$-sensitive LSH Family

Constructing LSH functions in a query-aware manner consists of two steps: random projection and query-aware bucket partition. Formally, a query-aware hash function $h_{\vec{a}}(o) : \mathbb{R}^d \to \mathbb{R}$ maps a $d$-dimensional object $\vec{o}$ to a number along the real line identified by a random vector $\vec{a}$, whose entries are drawn independently from $N(0, 1)$. For a fixed $\vec{a}$, the corresponding hash function $h_{\vec{a}}(o)$ is defined as follows:

$$h_{\vec{a}}(o) = \vec{a} \cdot \vec{o} \tag{3}$$

For all the data objects, their projections along the random line $\vec{a}$ are computed in the pre-processing step. When a query object $q$ arrives, we obtain the query projection by computing $h_{\vec{a}}(q)$. Then, we use the query as the “anchor” to locate the anchor bucket with width $w$ (defined by $h_{\vec{a}}(\cdot)$), i.e., the interval $[h_{\vec{a}}(q) - \frac{w}{2}, h_{\vec{a}}(q) + \frac{w}{2}]$. If the projection of an object $o$ (i.e., $h_{\vec{a}}(o)$) falls in the anchor bucket with width $w$, i.e., $|h_{\vec{a}}(o) - h_{\vec{a}}(q)| \le \frac{w}{2}$, we say $o$ collides with $q$ under $h_{\vec{a}}$.
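The two steps, random projection in pre-processing and query-anchored bucket lookup at query time, can be sketched as follows. The function names `make_projection` and `anchor_bucket_hits`, the seed, and the toy data are hypothetical; a linear scan stands in for whatever index an actual implementation would use over the projections.

```python
import random

def make_projection(d, rng):
    """Draw one random vector a ~ N(0, 1)^d and return h_a(o) = a . o."""
    a = [rng.gauss(0.0, 1.0) for _ in range(d)]

    def h(o):
        return sum(ai * oi for ai, oi in zip(a, o))

    return h

def anchor_bucket_hits(projections, hq, w):
    """Indices of objects whose projection lies in [hq - w/2, hq + w/2]."""
    return [i for i, p in enumerate(projections) if abs(p - hq) <= w / 2.0]

rng = random.Random(1)                       # fixed seed, illustration only
h = make_projection(d=3, rng=rng)

# Pre-processing: project every data object once, before any query arrives.
data = [[0.0, 0.0, 0.0], [0.1, 0.0, -0.1], [5.0, 5.0, 5.0]]  # toy objects
projections = [h(o) for o in data]

# Query time: the bucket is centered on the query's own projection.
q = [0.0, 0.1, 0.0]
hits = anchor_bucket_hits(projections, h(q), w=1.0)
print(hits)
```

Since the interval is centered on $h_{\vec{a}}(q)$, an object whose projection is close to the query's can never be cut off by a pre-drawn bucket boundary.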

We now show that the family of hash functions $h_{\vec{a}}(o)$ coupled with query-aware bucket partition is locality-sensitive. In this sense, each $h_{\vec{a}}(o)$ in the family is said to be a query-aware LSH function. For objects $o$ and $q$, let $s = \|o, q\|$. Due to the stability of the standard normal distribution $N(0, 1)$, we have that $(\vec{a} \cdot \vec{o} - \vec{a} \cdot \vec{q})$ is distributed as $sX$, where $X$ is a random variable drawn from $N(0, 1)$ [2]. Let $\varphi(x)$ be the probability density function (PDF) of $N(0, 1)$, i.e., $\varphi(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$. The collision probability between $o$ and $q$ under $h_{\vec{a}}$ is computed as follows:

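$$\begin{aligned} p(s) &= \Pr_{\vec{a}}\left[\,|h_{\vec{a}}(o) - h_{\vec{a}}(q)| \le \frac{w}{2}\right] = \Pr\left[\,|sX| \le \frac{w}{2}\right] \\ &= \Pr\left[-\frac{w}{2s} \le X \le \frac{w}{2s}\right] = \int_{-\frac{w}{2s}}^{\frac{w}{2s}} \varphi(x)\, dx \end{aligned} \tag{4}$$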
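A quick Monte Carlo experiment can make Equation 4 concrete. The sketch below repeatedly draws $\vec{a}$, tests whether $|\vec{a} \cdot \vec{o} - \vec{a} \cdot \vec{q}| \le w/2$, and compares the hit rate with the integral, which equals $\operatorname{erf}\!\big(w / (2\sqrt{2}\,s)\big)$ by the standard identity $\Pr[|X| \le z] = \operatorname{erf}(z/\sqrt{2})$. That identity is a textbook fact rather than something stated in the source, and the vectors, bucket width, trial count, and seed are illustrative assumptions.

```python
import math
import random

def p_exact(s, w):
    """Integral of the N(0, 1) density over [-w/(2s), w/(2s)], via erf."""
    return math.erf(w / (2.0 * s * math.sqrt(2.0)))

def p_monte_carlo(o, q, w, trials, rng):
    """Fraction of random projections under which o collides with q."""
    hits = 0
    for _ in range(trials):
        a = [rng.gauss(0.0, 1.0) for _ in range(len(o))]
        diff = sum(ai * (oi - qi) for ai, oi, qi in zip(a, o, q))
        if abs(diff) <= w / 2.0:
            hits += 1
    return hits / trials

o = [3.0, 4.0, 0.0]                      # toy object
q = [0.0, 0.0, 0.0]                      # toy query
s = math.dist(o, q)                      # Euclidean distance, here 5.0
w = 4.0                                  # illustrative bucket width

exact = p_exact(s, w)
estimate = p_monte_carlo(o, q, w, trials=20000, rng=random.Random(42))
print(f"exact = {exact:.4f}  estimate = {estimate:.4f}")
```

The estimate depends only on the distance $s$ and the width $w$, not on the individual coordinates, which reflects the stability property used in the derivation.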

Accordingly, we have Lemma 2 as follows:

Lemma 2. The query-aware hash family of all the hash functions $h_{\vec{a}}(o)$ that are identified by Equation 3 and coupled with query-aware bucket partition is $(1, c, p_1, p_2)$-sensitive, where $p_1 = p(1)$ and $p_2 = p(c)$.

Proof. Referring to Equation 4, a simple calculation shows that $p(s) = 1 - 2\,\mathrm{norm}(-\frac{w}{2s})$, where $\mathrm{norm}(x) = \int_{-\infty}^{x} \varphi(t)\, dt$. Note that $\mathrm{norm}(x)$ is simply the cumulative distribution function (CDF) of $N(0, 1)$, which increases monotonically as $x$ increases. For a fixed $w$, $\mathrm{norm}(-\frac{w}{2s})$ increases monotonically as $s$ increases, and hence $p(s)$ decreases monotonically as $s$ increases. Therefore, according to Definition 1, the query-aware hash family identified by Equation 3 is $(1, c, p_1, p_2)$-sensitive, where $p_1 = p(1)$ and $p_2 = p(c)$, respectively.
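The "simple calculation" in the proof can be spelled out in one short chain. The only ingredient beyond Equation 4 is the symmetry of the standard normal CDF, $\mathrm{norm}(x) = 1 - \mathrm{norm}(-x)$, which is a standard fact and is assumed here rather than quoted from the source.

```latex
\begin{aligned}
p(s) &= \int_{-w/(2s)}^{w/(2s)} \varphi(x)\,dx
      = \mathrm{norm}\!\left(\tfrac{w}{2s}\right) - \mathrm{norm}\!\left(-\tfrac{w}{2s}\right) \\
     &= \left[1 - \mathrm{norm}\!\left(-\tfrac{w}{2s}\right)\right] - \mathrm{norm}\!\left(-\tfrac{w}{2s}\right)
      && \text{(symmetry of the CDF)} \\
     &= 1 - 2\,\mathrm{norm}\!\left(-\tfrac{w}{2s}\right).
\end{aligned}
```

As $s$ grows, $-\frac{w}{2s}$ moves toward $0$ from below, so $\mathrm{norm}(-\frac{w}{2s})$ rises toward $\frac{1}{2}$ and $p(s)$ falls toward $0$.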

3.2 Comparison of Colliding Probabilities

The effectiveness of an $(r, cr, p_1, p_2)$-sensitive hash family depends on the difference between the positively-colliding probability and negatively-colliding probability, i.e., $(p_1 - p_2)$, since the difference measures the degree that positively-colliding data objects of a query $q$ can be discriminated from negatively-colliding ones. We now show that the novel query-aware hash family leads to larger $(p_1 - p_2)$ under typical settings of bucket width $w$. For query-aware LSH family, from the proof of Lemma 2, we have $p_1 = 1 - 2\,\mathrm{norm}(-\frac{w}{2})$ and $p_2 = 1 - 2\,\mathrm{norm}(-\frac{w}{2c})$. For query-oblivious LSH family, we have $\xi_1 = 1 - 2\,\mathrm{norm}(-w) - \frac{2}{\sqrt{2\pi}\,w}\left(1 - e^{-(w^2/2)}\right)$ and $\xi_2 = 1 - 2\,\mathrm{norm}(-w/c) - \frac{2}{\sqrt{2\pi}\,w/c}\left(1 - e^{-(w^2/2c^2)}\right)$ [2].
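The four closed-form probabilities can be evaluated directly. The sketch below implements $\mathrm{norm}$ with `math.erf` and prints $p_1$, $p_2$, $\xi_1$, $\xi_2$ and the two differences. The approximation ratio $c = 2$ and the two bucket widths are illustrative choices, and the function names are hypothetical.

```python
import math

def norm(x):
    """CDF of the standard normal distribution N(0, 1)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def p(s, w):
    """Query-aware collision probability at distance s."""
    return 1.0 - 2.0 * norm(-w / (2.0 * s))

def xi(s, w):
    """Query-oblivious collision probability at distance s (closed form)."""
    r = w / s
    tail = 2.0 / (math.sqrt(2.0 * math.pi) * r) * (1.0 - math.exp(-r * r / 2.0))
    return 1.0 - 2.0 * norm(-r) - tail

c = 2.0                                  # illustrative approximation ratio
for w in (1.0, 4.0):                     # illustrative bucket widths
    p1, p2 = p(1.0, w), p(c, w)
    xi1, xi2 = xi(1.0, w), xi(c, w)
    print(f"w = {w}: p1 = {p1:.4f}, p2 = {p2:.4f}, p1 - p2 = {p1 - p2:.4f}")
    print(f"         xi1 = {xi1:.4f}, xi2 = {xi2:.4f}, xi1 - xi2 = {xi1 - xi2:.4f}")
```

Under these settings the gap $p_1 - p_2$ comes out larger than $\xi_1 - \xi_2$ for both widths, in line with the claim that the query-aware family discriminates better.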

Bucket width $w$ is a critical parameter of an LSH function. While E2LSH and LSB-Forest manually set $w = 4.0$, C2LSH manually sets $w = 1.0$. For $w$ in the range $[0, 10]$, starting from 0.5 and with a step of 0.5, we show the variations of the colliding probabilities $p_1$, $p_2$, $\xi_1$, and $\xi_2$ for two different $c$ values in Figure 3. We find that all the colliding probabilities monotonically increase as $w$ increases, and get very close to 1 as $w$ gets close to 10. In addition, $p_1$ and $p_2$ are consistently larger than $\xi_1$ and $\xi_2$, respectively. Thus, we also show the two differences $(p_1 - p_2)$ and $(\xi_1 - \xi_2)$ with respect to $w$ in Figure 4. We have two interesting observations: (1) $(p_1 - p_2)$ is larger than $(\xi_1 - \xi_2)$ under typical bucket widths, namely $w = 4.0$ and $w = 1.0$. (2) Both $(p_1 - p_2)$ and $(\xi_1 - \xi_2)$ tend to have maximum values in the $w$ range $[0, 10]$. Observation (1) indicates that our novel query-aware LSH family can be used to improve the performance of query-oblivious LSH schemes such as C2LSH by leveraging a larger $(p_1 - p_2)$. Observation (2) inspires us to automatically set
