$$\mathrm{MSE}(q) = \mathbb{E}_X\!\left[ d\big(q(x), x\big)^2 \right] = \int d\big(q(x), x\big)^2 \, p(x) \, dx , \qquad (3)$$
where $d(x, y) = \|x - y\|$ is the Euclidean distance between $x$ and $y$, and where $p(x)$ is the probability distribution function corresponding to the random variable $X$. For an arbitrary probability distribution function, Equation 3 is numerically computed using Monte-Carlo sampling, as the average of $\|q(x) - x\|^2$ over a large set of samples.
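As a concrete illustration, here is a minimal sketch of this Monte-Carlo estimate in Python/NumPy, assuming the codebook is stored as a $k \times D$ array and vectors are assigned to their nearest centroid (the helper names are illustrative, not from the paper):

```python
import numpy as np

def quantize(X, codebook):
    """Map each row of X to the index of its nearest centroid."""
    # Pairwise squared Euclidean distances between samples and centroids.
    d2 = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def mse_monte_carlo(X, codebook):
    """Estimate MSE(q) (Equation 3) as the average of ||q(x) - x||^2."""
    idx = quantize(X, codebook)
    return ((codebook[idx] - X) ** 2).sum(axis=1).mean()
```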
In order for the quantizer to be optimal, it has to
satisfy two properties known as the Lloyd optimality
conditions. First, a vector x must be quantized to its
nearest codebook centroid, in terms of the Euclidean
distance:
$$q(x) = \arg\min_{c_i \in C} d(x, c_i) . \qquad (4)$$
As a result, the cells are delimited by hyperplanes.
The second Lloyd condition is that the reconstruction
value must be the expectation of the vectors lying in the
Voronoi cell:
$$c_i = \mathbb{E}_X\big[ x \mid i \big] = \int_{V_i} p(x) \, x \, dx . \qquad (5)$$
The Lloyd quantizer, which corresponds to the k-
means clustering algorithm, finds a near-optimal code-
book by iteratively assigning the vectors of a training
set to centroids and re-estimating these centroids from
the assigned vectors. In the following, we assume that
the two Lloyd conditions hold, as we learn the quantizer
using k-means. Note, however, that k-means only finds a local optimum in terms of quantization error.
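For reference, here is a minimal sketch of this iteration, reusing the quantize helper above (random initialization from the training set is an assumption, not prescribed here):

```python
def kmeans(X, k, n_iter=25, seed=0):
    """Lloyd's algorithm: alternate the two optimality conditions."""
    rng = np.random.default_rng(seed)
    codebook = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        idx = quantize(X, codebook)   # condition 1: assign to nearest centroid
        for i in range(k):            # condition 2: centroid = mean of its cell
            members = X[idx == i]
            if len(members) > 0:
                codebook[i] = members.mean(axis=0)
    return codebook
```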
Another quantity that will be used in the following is the mean squared distortion $\xi(q, c_i)$ obtained when reconstructing a vector of a cell $V_i$ by the corresponding centroid $c_i$. Denoting by $p_i = \mathbb{P}\big(q(x) = c_i\big)$ the probability that a vector is assigned to the centroid $c_i$, it is computed as
$$\xi(q, c_i) = \frac{1}{p_i} \int_{V_i} d\big(x, q(x)\big)^2 \, p(x) \, dx . \qquad (6)$$
Note that the MSE can be obtained from these quantities as
$$\mathrm{MSE}(q) = \sum_{i \in I} p_i \, \xi(q, c_i) . \qquad (7)$$
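In the same sketch, the empirical counterparts of Equations 6 and 7 can be computed from a sample set, and the decomposition checked against the direct Monte-Carlo estimate:

```python
def cell_distortions(X, codebook):
    """Empirical p_i and xi(q, c_i) for each Voronoi cell."""
    idx = quantize(X, codebook)
    p = np.array([np.mean(idx == i) for i in range(len(codebook))])
    xi = np.array([
        ((X[idx == i] - codebook[i]) ** 2).sum(axis=1).mean() if p[i] > 0 else 0.0
        for i in range(len(codebook))
    ])
    return p, xi

# Equation 7: the weighted per-cell distortions sum to the overall MSE.
# p, xi = cell_distortions(X, codebook)
# assert np.isclose((p * xi).sum(), mse_monte_carlo(X, codebook))
```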
The memory cost of storing the index value, without any further processing (entropy coding), is $\lceil \log_2 k \rceil$ bits.
Therefore, it is convenient to use a power of two for
k, as the code produced by the quantizer is stored in a
binary memory.
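For example, with $k = 256$ centroids, the index of a vector is stored in exactly $\lceil \log_2 256 \rceil = 8$ bits, i.e., one byte.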
B. Product quantizers
Let us consider a 128-dimensional vector, for example the SIFT descriptor [23]. A quantizer producing 64-bit codes, i.e., “only” 0.5 bit per component, contains $k = 2^{64}$ centroids. Therefore, it is impossible to use Lloyd's algorithm or even HKM, as the number of samples required and the complexity of learning the quantizer are several times $k$. It is even impossible to store the $D \times k$ floating point values representing the $k$ centroids.
Product quantization is an efficient solution to address these issues. It is a common technique in source coding, which makes it possible to choose the number of components to be quantized jointly (for instance, groups of 24 components can be quantized using the powerful Leech lattice). The input vector $x$ is split into $m$ distinct subvectors $u_j$, $1 \leq j \leq m$, of dimension $D^* = D/m$, where $D$ is a multiple of $m$. The subvectors are quantized separately using $m$ distinct quantizers. A given vector $x$ is therefore mapped as follows:
$$\underbrace{x_1, \ldots, x_{D^*}}_{u_1(x)}, \ldots, \underbrace{x_{D-D^*+1}, \ldots, x_D}_{u_m(x)} \;\rightarrow\; q_1\big(u_1(x)\big), \ldots, q_m\big(u_m(x)\big), \qquad (8)$$
where $q_j$ is a low-complexity quantizer associated with the $j^{\mathrm{th}}$ subvector. With the subquantizer $q_j$ we associate the index set $I_j$, the codebook $C_j$ and the corresponding reproduction values $c_{j,i}$.
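Continuing the NumPy sketch, this mapping can be written as follows (the list-of-codebooks layout is an illustrative assumption):

```python
def pq_encode(X, codebooks):
    """Encode vectors with a product quantizer (Equation 8).

    codebooks: list of m arrays, one per subquantizer, each of
    shape (k_star, D_star). Returns an (n, m) array of indices.
    """
    m, d_star = len(codebooks), codebooks[0].shape[1]
    codes = np.empty((len(X), m), dtype=np.int64)
    for j in range(m):
        sub = X[:, j * d_star:(j + 1) * d_star]    # u_j(x)
        codes[:, j] = quantize(sub, codebooks[j])  # q_j(u_j(x))
    return codes

def pq_decode(codes, codebooks):
    """Reconstruct by concatenating the selected sub-centroids."""
    return np.hstack([codebooks[j][codes[:, j]] for j in range(len(codebooks))])
```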
A reproduction value of the product quantizer is identified by an element of the product index set $I = I_1 \times \ldots \times I_m$. The codebook is therefore defined as the Cartesian product
$$C = C_1 \times \ldots \times C_m , \qquad (9)$$
and a centroid of this set is the concatenation of centroids of the $m$ subquantizers. From now on, we assume that all subquantizers have the same finite number $k^*$ of reproduction values. In that case, the total number of centroids is given by
$$k = (k^*)^m . \qquad (10)$$
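For instance, for the 64-bit codes above, one may take $m = 8$ subquantizers with $k^* = 256$ centroids each: this yields $k = 256^8 = 2^{64}$ centroids, while only $8 \times 256$ sub-centroids of dimension 16 have to be learned and stored.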
Note that in the extremal case where $m = D$, the components of a vector $x$ are all quantized separately. Then the product quantizer turns out to be a scalar quantizer, where the quantization function associated with each component may be different.
The strength of a product quantizer is to produce a large set of centroids from several small sets of centroids: those associated with the subquantizers. When learning the subquantizers using Lloyd's algorithm, a limited number of vectors is used, but the codebook is, to some extent, still adapted to the distribution of the data it represents. The complexity of learning the quantizer is $m$ times the complexity of performing k-means clustering with $k^*$ centroids of dimension $D^*$.
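Concretely, learning amounts to $m$ independent k-means runs, one per block of components, as in this sketch built on the helpers above:

```python
def pq_train(X, m, k_star, n_iter=25):
    """Learn m subquantizer codebooks, one k-means per subvector block."""
    d_star = X.shape[1] // m  # assumes D is a multiple of m
    return [
        kmeans(X[:, j * d_star:(j + 1) * d_star], k_star, n_iter=n_iter)
        for j in range(m)
    ]

# Usage sketch: 64-bit codes for 128-dimensional vectors.
# codebooks = pq_train(X_train, m=8, k_star=256)
# codes = pq_encode(X_base, codebooks)   # one byte per subquantizer
```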