where c > 1 and p
1
> p
2
. For ease of reference, p
1
and
p
2
are called positively-colliding probability and negatively-
colliding probability, respectively.
A query-oblivious LSH family is an LSH family H =
{h : R
d
→ Z} where each hash function h exploits query-
oblivious bucket partition, i.e., buckets in the hash table of
h are statically determined before any query arrives. Nor-
mally, for a query-oblivious LSH function h, two objects o
and q collide under h means h(o) = h(q), where h(o) identi-
fies the bucket of o. A typical query-oblivious LSH function
is formally defined as follows [2].
h
~a,b
(o) =
~a · ~o + b
w
, (1)
where ~o is a d-dimensional Euclidean vector representing
object o, ~a is a d-dimensional random vector with each en-
try drawn independently from standard normal distribution
N (0, 1). w is the pre-specified bucket width, and b is a real
number uniformly drawn from [0, w).
For two objects o
1
and o
2
, and a uniformly randomly cho-
sen hash function h
~a,b
, let s = ko
1
, o
2
k, and then their col-
lision probability is computed as follows [2]:
ξ(s) = P r
~a,b
[h
~a,b
(o
1
) = h
~a,b
(o
2
)]
=
R
w
0
1
s
f
2
(
t
s
)(1 −
t
w
) dt
(2)
where f
2
(x) =
2
√
2π
e
−
x
2
2
. For a fixed w, ξ(s) decreases
monotonically as s increases. With ξ
1
= ξ(r) and ξ
2
= ξ(cr),
the family of hash functions h
~a,b
is (r, cr, ξ
1
, ξ
2
)-sensitive.
Specifically, if we set r = 1 and cr = c, we have Lemma 1 as
follows [2] :
Lemma 1. The query-oblivious LSH family identified by
Equation 1 is (1, c, ξ
1
, ξ
2
)-sensitive, where ξ
1
= ξ(1) and
ξ
2
= ξ(c).
3. QUERY-AWARE LSH FAMILY
In this section we first introduce the concept of query-
aware LSH functions. Then we make a computational com-
parison of positively- and negatively-colliding probabilities
between query-oblivious and query-aware LSH families. Fi-
nally, we show that query-aware LSH family is able to sup-
port virtual rehashing in a simple and quick manner.
3.1 (1, c, p
1
, p
2
)-sensitive LSH Family
Constructing LSH functions in a query-aware manner con-
sists of two steps: random projection and query-aware bucket
partition. Formally, a query-aware hash function h
~a
(o) :
R
d
→ R maps a d-dimensional object ~o to a number along
the real line identified by a random vector ~a, whose entries
are drawn independently from N (0, 1). For a fixed ~a, the
corresponding hash function h
~a
(o) is defined as follows:
h
~a
(o) = ~a · ~o (3)
For all the data objects, their projections along the ran-
dom line ~a are computed in the pre-processing step. When
a query object q arrives, we obtain the query projection by
computing h
~a
(q). Then, we use the query as the “anchor”
to locate the anchor bucket with width w (defined by h
~a
(·)),
i.e., the interval [h
~a
(q) −
w
2
, h
~a
(q) +
w
2
]. If the projection
of an object o (i.e., h
~a
(o)), falls in the anchor bucket with
width w, i.e., |h
~a
(o) − h
~a
(q)| ≤
w
2
, we say o collides with q
under h
~a
.
We now show that the family of hash functions h
~a
(o) cou-
pled with query-aware bucket partition is locality-sensitive.
In this sense, each h
~a
(o) in the family is said to be a query-
aware LSH function. For objects o and q, let s = ko, qk.
Due to the stability of standard normal distribution N (0, 1),
we have that (~a · ~o − ~a · ~q) is distributed as sX, where
X is a random variable drawn from N (0, 1) [2]. Let ϕ(x)
be the probability density function (PDF) of N (0, 1), i.e.,
ϕ(x) =
1
√
2π
e
−
x
2
2
. The collision probability between o and
q under h
~a
is computed as follows:
p(s) = P r
~a
[|h
~a
(o) − h
~a
(q)| ≤
w
2
] = P r[|sX| ≤
w
2
]
= Pr[−
w
2s
≤ X ≤
w
2s
] =
R
w
2s
−
w
2s
ϕ(x) dx
(4)
Accordingly, we have Lemma 2 as follows:
Lemma 2. The query-aware hash family of all the hash
functions h
~a
(o) that are identified by Equation 3 and coupled
with query-aware bucket partition is (1, c, p
1
, p
2
)-sensitive,
where p
1
= p(1) and p
2
= p(c).
Proof. Referring to Equation 4 , a simple calculation
shows that p(s) = 1 − 2norm(−
w
2s
), where norm(x) =
R
x
−∞
ϕ(t) dt. Note that norm(x) is simply the cumulative
distribution function (CDF) of N (0, 1), which increases mono-
tonically as x increases. For a fixed w, norm(−
w
2s
) in-
creases monotonically as s increases, and hence p(s) de-
creases monotonically as s increases. Therefore, according
to Definition 1, the query-aware hash family identified by
Equation 3, is (1, c, p
1
, p
2
)-sensitive, where p
1
= p(1) and
p
2
= p(c), respectively.
3.2 Comparison of Colliding Probabilities
The effectiveness of an (r, cr, p
1
, p
2
)-sensitive hash family
depends on the difference between the positively-colliding
probability and negatively-colliding probability, i.e., (p
1
−
p
2
), since the difference measures the degree that positively-
colliding data objects of a query q can be discriminated
from negatively-colliding ones. We now show that the novel
query-aware hash family leads to larger (p
1
−p
2
) under typi-
cal settings of bucket width w. For query-aware LSH family,
from the proof of Lemma 2, we have p
1
= 1 − 2norm(−
w
2
)
and p
2
= 1 − 2norm(−
w
2c
). For query-oblivious LSH fam-
ily, we have ξ
1
= 1 − 2norm(−w) −
2
√
2πw
(1 − e
−(w
2
/2)
) and
ξ
2
= 1 − 2norm(−w/c) −
2
√
2πw/c
(1 − e
−(w
2
/2c
2
)
) [2].
Bucket width w is a critical parameter of an LSH function.
While E2LSH and LSB-Forest manually set w = 4.0, C2LSH
manually sets w = 1.0. For w in the range [0, 10], starting
from 0.5 and with a step of 0.5, we show the variations of the
colliding probabilities p
1
, p
2
, ξ
1
, and ξ
2
for two different c
values in Figure 3. We find that all the colliding probabilities
monotonically increase as w increases, and get very close to 1
as w gets close to 10. In addition, p
1
and p
2
are consistently
larger than ξ
1
and ξ
2
, respectively. Thus, we also show
the two differences (p
1
− p
2
) and (ξ
1
− ξ
2
) with respect to
w in Figure 4. We have two interesting observations: (1)
(p
1
−p
2
) is larger than (ξ
1
−ξ
2
) under typical bucket widths,
namely w = 4.0 and w = 1.0. (2) Both (p
1
− p
2
) and
(ξ
1
−ξ
2
) tend to have maximum values in the w range [0, 10].
Observation (1) indicates that our novel query-aware LSH
family can be used to improve the performance of query-
oblivious LSH schemes such as C2LSH by leveraging a larger
(p
1
− p
2
). Observation (2) inspires us to automatically set
3