$$\mathrm{MSE}(q) = \mathbb{E}_X\!\left[ d\big(q(x), x\big)^2 \right] = \int d\big(q(x), x\big)^2 \, p(x) \, dx , \qquad (3)$$
where $d(x, y) = \|x - y\|$ is the Euclidean distance between $x$ and $y$, and where $p(x)$ is the probability distribution function corresponding to the random variable $X$. For an arbitrary probability distribution function, Equation 3 is numerically computed using Monte-Carlo sampling, as the average of $\|q(x) - x\|^2$ over a large set of samples.
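As a concrete illustration, here is a minimal sketch of this Monte-Carlo estimate in Python/NumPy, assuming the codebook is stored as a $k \times D$ array and vectors are assigned to their nearest centroid (the helper names are illustrative, not from the paper):

```python
import numpy as np

def quantize(X, codebook):
    """Map each row of X to the index of its nearest centroid."""
    # Pairwise squared Euclidean distances between samples and centroids.
    d2 = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def mse_monte_carlo(X, codebook):
    """Estimate MSE(q) (Equation 3) as the average of ||q(x) - x||^2."""
    idx = quantize(X, codebook)
    return ((codebook[idx] - X) ** 2).sum(axis=1).mean()
```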
In order for the quantizer to be optimal, it has to
satisfy two properties known as the Lloyd optimality
conditions. First, a vector x must be quantized to its
nearest codebook centroid, in terms of the Euclidean
distance:
$$q(x) = \arg\min_{c_i \in C} d(x, c_i) . \qquad (4)$$
As a result, the cells are delimited by hyperplanes.
The second Lloyd condition is that the reconstruction
value must be the expectation of the vectors lying in the
Voronoi cell:
$$c_i = \mathbb{E}_X\big[ x \mid i \big] = \int_{V_i} p(x) \, x \, dx . \qquad (5)$$
The Lloyd quantizer, which corresponds to the k-
means clustering algorithm, finds a near-optimal code-
book by iteratively assigning the vectors of a training
set to centroids and re-estimating these centroids from
the assigned vectors. In the following, we assume that
the two Lloyd conditions hold, as we learn the quantizer
using k-means. Note, however, that k-means only finds a local optimum in terms of quantization error.
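For reference, here is a minimal sketch of this iteration, reusing the quantize helper above (random initialization from the training set is an assumption, not prescribed here):

```python
def kmeans(X, k, n_iter=25, seed=0):
    """Lloyd's algorithm: alternate the two optimality conditions."""
    rng = np.random.default_rng(seed)
    codebook = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        idx = quantize(X, codebook)   # condition 1: assign to nearest centroid
        for i in range(k):            # condition 2: centroid = mean of its cell
            members = X[idx == i]
            if len(members) > 0:
                codebook[i] = members.mean(axis=0)
    return codebook
```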
Another quantity that will be used in the following is the mean squared distortion $\xi(q, c_i)$ obtained when reconstructing a vector of a cell $V_i$ by the corresponding centroid $c_i$. Denoting by $p_i = \mathbb{P}\big(q(x) = c_i\big)$ the probability that a vector is assigned to the centroid $c_i$, it is computed as
$$\xi(q, c_i) = \frac{1}{p_i} \int_{V_i} d\big(x, q(x)\big)^2 \, p(x) \, dx . \qquad (6)$$
Note that the MSE can be obtained from these quantities as
$$\mathrm{MSE}(q) = \sum_{i \in I} p_i \, \xi(q, c_i) . \qquad (7)$$
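In the same sketch, the empirical counterparts of Equations 6 and 7 can be computed from a sample set, and the decomposition checked against the direct Monte-Carlo estimate:

```python
def cell_distortions(X, codebook):
    """Empirical p_i and xi(q, c_i) for each Voronoi cell."""
    idx = quantize(X, codebook)
    p = np.array([np.mean(idx == i) for i in range(len(codebook))])
    xi = np.array([
        ((X[idx == i] - codebook[i]) ** 2).sum(axis=1).mean() if p[i] > 0 else 0.0
        for i in range(len(codebook))
    ])
    return p, xi

# Equation 7: the weighted per-cell distortions sum to the overall MSE.
# p, xi = cell_distortions(X, codebook)
# assert np.isclose((p * xi).sum(), mse_monte_carlo(X, codebook))
```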
The memory cost of storing the index value, without any further processing (entropy coding), is $\lceil \log_2 k \rceil$ bits.
Therefore, it is convenient to use a power of two for
k, as the code produced by the quantizer is stored in a
binary memory.
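For example, with $k = 256$ centroids, the index of a vector is stored in exactly $\lceil \log_2 256 \rceil = 8$ bits, i.e., one byte.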
B. Product quantizers
Let us consider a 128-dimensional vector, for example the SIFT descriptor [23]. A quantizer producing 64-bit codes, i.e., “only” 0.5 bit per component, contains $k = 2^{64}$ centroids. Therefore, it is impossible to use Lloyd's algorithm or even HKM, as the number of samples required and the complexity of learning the quantizer are several times $k$. It is even impossible to store the $D \times k$ floating point values representing the $k$ centroids.
Product quantization is an efficient solution to address these issues. It is a common technique in source coding, which makes it possible to choose the number of components to be quantized jointly (for instance, groups of 24 components can be quantized using the powerful Leech lattice). The input vector $x$ is split into $m$ distinct subvectors $u_j$, $1 \leq j \leq m$, of dimension $D^* = D/m$, where $D$ is a multiple of $m$. The subvectors are quantized separately using $m$ distinct quantizers. A given vector $x$ is therefore mapped as follows:
$$\underbrace{x_1, \ldots, x_{D^*}}_{u_1(x)}, \ldots, \underbrace{x_{D-D^*+1}, \ldots, x_D}_{u_m(x)} \;\rightarrow\; q_1\big(u_1(x)\big), \ldots, q_m\big(u_m(x)\big), \qquad (8)$$
where $q_j$ is a low-complexity quantizer associated with the $j^{\mathrm{th}}$ subvector. With the subquantizer $q_j$ we associate the index set $I_j$, the codebook $C_j$ and the corresponding reproduction values $c_{j,i}$.
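Continuing the NumPy sketch, this mapping can be written as follows (the list-of-codebooks layout is an illustrative assumption):

```python
def pq_encode(X, codebooks):
    """Encode vectors with a product quantizer (Equation 8).

    codebooks: list of m arrays, one per subquantizer, each of
    shape (k_star, D_star). Returns an (n, m) array of indices.
    """
    m, d_star = len(codebooks), codebooks[0].shape[1]
    codes = np.empty((len(X), m), dtype=np.int64)
    for j in range(m):
        sub = X[:, j * d_star:(j + 1) * d_star]    # u_j(x)
        codes[:, j] = quantize(sub, codebooks[j])  # q_j(u_j(x))
    return codes

def pq_decode(codes, codebooks):
    """Reconstruct by concatenating the selected sub-centroids."""
    return np.hstack([codebooks[j][codes[:, j]] for j in range(len(codebooks))])
```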
A reproduction value of the product quantizer is identified by an element of the product index set $I = I_1 \times \ldots \times I_m$. The codebook is therefore defined as the Cartesian product
$$C = C_1 \times \ldots \times C_m , \qquad (9)$$
and a centroid of this set is the concatenation of centroids of the $m$ subquantizers. From now on, we assume that all subquantizers have the same finite number $k^*$ of reproduction values. In that case, the total number of centroids is given by
$$k = (k^*)^m . \qquad (10)$$
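For instance, for the 64-bit codes above, one may take $m = 8$ subquantizers with $k^* = 256$ centroids each: this yields $k = 256^8 = 2^{64}$ centroids, while only $8 \times 256$ sub-centroids of dimension 16 have to be learned and stored.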
Note that in the extremal case where $m = D$, the components of a vector $x$ are all quantized separately. Then the product quantizer turns out to be a scalar quantizer, where the quantization function associated with each component may be different.
The strength of a product quantizer is to produce a large set of centroids from several small sets of centroids: those associated with the subquantizers. When learning the subquantizers using Lloyd's algorithm, a limited number of vectors is used, but the codebook is, to some extent, still adapted to the distribution of the data it represents. The complexity of learning the quantizer is $m$ times the complexity of performing k-means clustering with $k^*$ centroids of dimension $D^*$.
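Concretely, learning amounts to $m$ independent k-means runs, one per block of components, as in this sketch built on the helpers above:

```python
def pq_train(X, m, k_star, n_iter=25):
    """Learn m subquantizer codebooks, one k-means per subvector block."""
    d_star = X.shape[1] // m  # assumes D is a multiple of m
    return [
        kmeans(X[:, j * d_star:(j + 1) * d_star], k_star, n_iter=n_iter)
        for j in range(m)
    ]

# Usage sketch: 64-bit codes for 128-dimensional vectors.
# codebooks = pq_train(X_train, m=8, k_star=256)
# codes = pq_encode(X_base, codebooks)   # one byte per subquantizer
```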