3.1 Locality Sensitive Hashing
To solve the c-ANN problem, Indyk and Motwani pro-
posed the idea of LSH [8], which is formally defined as fol-
lows.
Definition 1. (Locality Sensitive Hashing) Given a
distance R, an approximate ratio c and two probability val-
ues P
1
and P
2
, a hash function h : R
d
→ Z is called
(R, c, P
1
, P
2
)-sensitive if it satisfies the following condition-
s simultaneously for any two points p
1
, p
2
∈ D:
• If k p
1
, p
2
k≤ R, then P r[h(p
1
) = h(p
2
)] ≥ P
1
;
• If k p
1
, p
2
k≥ cR, then P r[h(p
1
) = h(p
2
)] ≤ P
2
;
To make sense, both c > 1 and P
1
≥ P
2
hold. In ad-
dition, a compound LSH function is denoted as G =
(h
1
, . . . , h
m
), where h
1
, . . . , h
m
are randomly selected LSH
functions. Specifically, for ∀p ∈ D, K = G(p) = (h
1
(p),
. . . , h
m
(p)) is defined as the compound hash key of point
p under G.
According to Definition 1, LSH ensures that a close pair
collides with each other with a high probability (P
1
) and a
far pair with a low probability (P
2
). This property of LSH
is also called distance-preserving.
The LSH function commonly used in Euclidean Space,
which was proposed by Datar [2], is shown as the following:
h(p) = b
a · p + bW
W
c (1)
Here, a is a random vector with each dimension indepen-
dently chosen from Guassian distribution and p is an ar-
bitrary point in D. b is a real number uniformly drawn
from the range [0,1]. W is also a real number represent-
ing the width of the LSH function. For two points p
1
, p
2
and an LSH function h, if k p
1
, p
2
k= r, the probability of
h(p
1
) = h(p
2
) can be computed as follows [2].
p(r, W ) = P r[h(p
1
) = h(p
2
)]
=
R
W
0
1
r
f
2
(
t
r
)(1 −
t
W
)dt
= 2norm(W/r) − 1 −
2
√
2π
r
W
(1 − e
−
W
2
2r
2
)
(2)
Here, f
2
(x) =
2
√
2π
e
−
x
2
2
and norm(·) represents the cumu-
lative distribution function of a random variable following
Gaussian Distribution. According to Equation 2, the proba-
bility p(r, W) decreases monotonically when r increases but
grows monotonically when W rises.
Due to the distance-preserving property of LSH, it is ra-
tional to use the hash values to estimate the distance be-
tween two points. Therefore, if two points have similar hash
values, it is believed that they are close to each other with
certain confidence. Based on this idea, several approaches
have been proposed for c-ANN [2, 4, 5, 12, 19]. However,
it is obvious that Equation 1 exhibits poor performance in
filtering irrelevant points, as many pairs, which are distant
from each other, may share the same hash value under a sin-
gle hash function as Equation 1. In other words, numerous
false positives may be returned. To remove the irrelevan-
t points (i.e. false positives), a compound LSH function
G = (h
1
, h
2
, . . . , h
m
) is employed so as to improve the dis-
tinguishing capacity. Note that each element of a compound
LSH function, h
i
, is randomly selected as defined in Equa-
tion 1. Only points sharing all the m hash values with the
query point are taken into account as candidate points, as
suggested by the basic LSH [2]. However, c-ANN search al-
gorithms should ensure that data points having similar com-
pound hash keys to the query point are taken into account as
candidates. Hence, a distance measure over compound hash
keys is required. In the following, we propose a novel mea-
sure to evaluate the distance between a pair of compound
hash keys.
3.2 Distance over Compound Hash Keys
Given a compound LSH function G and two points p
1
, p
2
∈ D, we have compound hash keys K
1
= G(p
1
) and K
2
=
G(p
2
), where both K
1
and K
2
are tuples containing m hash
values. Let k
1,i
(resp. k
2,i
) be the i-th element of K
1
(resp.
K
2
), that means k
1,i
= h
i
(p
1
) and k
2,i
= h
i
(p
2
).
Definition 2. (Prefix of a Compound Hash Key)
Given a point p ∈ D and its compound hash key K =
G(p) = (k
1
, k
2
, . . . , k
m
). The l-length prefix of K, de-
noted as pref (K, l), consisting of the first l elements of K
where 1 ≤ l ≤ m, is formally defined as follows.
pref(K, l) = (k
1
, k
2
, . . . , k
l
) (3)
Particularly, we denote pref(K, 0) as K
∅
, which is actually
an empty hash key.
Here, we are inspired by the prefix of a character string
and treat a compound hash key as a string of elements.
Therefore, its prefix is the substring constituted by its first
several elements. For example, for a compound hash key
K = (1, 2, 3, 4), pref (K, 3) = (1, 2, 3), pref(K, 2) = (1, 2)
and pref (K, 0) = K
∅
.
Definition 3. (Non-prefix Length of Compound Hash
Keys) Given two compound hash keys K
1
= (k
1,1
, k
1,2
, . . . , k
1,m
)
and K
2
= (k
2,1
, k
2,2
, . . . , k
2,m
), if pref(K
1
, l) = pref(K
2
, l)
and pref (K
1
, l + 1) 6= pref(K
2
, l + 1), where 0 ≤ l < m,
then the non-prefix length between K
1
and K
2
, denoted as
KL(K
1
, K
2
), is formally defined as follows:
KL(K
1
, K
2
) = m − l (4)
If pref(K
1
, m)=pref(K
2
, m), then KL(K
1
, K
2
) = 0.
A smaller non-prefix distance between two compound hash
keys indicates that they share a longer common prefix with
each other. For instance, given two compound hash keys
K
1
= (1, 2, 3, 4) and K
2
= (1, 2, 3, 5), KL(K
1
, K
2
) = 1 since
pref(K
1
, 3) = pref(K
2
, 3) and pref (K
1
, 4) 6= pref(K
2
, 4).
Definition 4. ((l + 1)-th Element Distance of Com-
pound Hash Keys) Given two compound hash keys K
1
=
(k
1,1
, k
1,2
, . . . , k
1,m
) and K
2
= (k
2,1
, k
2,2
, . . . , k
2,m
), if
KL(K
1
, K
2
) = m − l, where 0 ≤ l < m, then the (l + 1)-th
element distance between K
1
and K
2
is defined as the ab-
solute value of the distance between their (l + 1)-th elements
as follows:
KD(K
1
, K
2
) = |k
1,l+1
− k
2,l+1
| (5)
If l = m, we denote KD(K
1
, K
2
) as 0 by default.
Though the notion of non-prefix length can be used to
measure the distance between two compound hash keys, we
employ the (l + 1)-th element distance to further distinguish
two compound hash keys. Let us consider the following three
compound hash keys, K
1
= (1, 2, 3, 4), K
2
= (1, 2, 3, 5) and
747