Asymptotically, then, the p erformance is the same as the original scheme. However, since
1
k
m
n
1
1
m
kn
;
the probability of a false p ositive is actually always slightly higher with this division. Since the
dierence is small, this approach maybe still be useful for implementation reasons; for example,
dividing the bits among the hash functions maymake parallelization of array accesses easier.
Suppose we are given
m
and
n
and we wish to optimize for the number of hash functions. There
are two comp eting forces: using more hash functions gives us more chances to nd a 0 bit for an
element that is not a member of
S
, but using fewer hash functions increases the fraction of 0 bits
in the array. The optimal number of hash functions that minimizes
f
as a function of
k
is easily
found by taking the derivative. More conveniently,notethat
f
equals exp(
k
ln(1
e
kn=m
)). Let
g
=
k
ln(1
e
kn=m
). Minimizing the false positiverate
f
is equivalent to minimizing
g
with respect
to
k
. Wend
dg
dk
=ln
1
e
kn
m
+
kn
m
e
kn
m
1
e
kn
m
:
It is easy to check that the derivativeis0when
k
=ln2
(
m=n
); further eorts reveal that this is a
global minimum. Alternatively,using
p
=e
kn=m
,wend
g
=
m
n
ln(
p
)ln(1
p
)
;
from which symmetry reveals that the minimum value for
g
occurs when
p
=1
=
2, or equivalently
k
=ln2
(
m=n
). In this case the false p ositiverate
f
is (1
=
2)
k
=(0
:
6185)
m=n
:
In practice, of course,
k
must be an integer, and smaller
k
might be preferred since they reduce the amount of computation
necessary.
2.2 Hashing vs. Blo om lters
Another natural way to represent a set is to use hashing. Each item of the set can be hashed into
(log
n
) bits, and a (sorted) list of hash values then represents the set. This approach yields very
small error probabilities. For example, using 2 log
2
n
bits p er set element, the probability that two
distinct elements obtain the same hash value is 1
=n
2
. Hence the probabilitythatany elementnot
in the set matches some hash value in the set is at most
n=n
2
=1
=n
by the standard union b ound.
Bloom lters can be interpreted as a natural generalization of hashing that allows more interest-
ing tradeos b etween the numb er of bits used per set element and the probability of false positives.
(Indeed, a Bloom lter with just one hash function is equivalent to hashing.) Blo om lters yield a
constant false p ositive probabilityeven if a constantnumber of bits are used per set element. For
example, when
m
=8
n
, the false positive probabilityis just over 0
:
02. For most theoretical analyses,
this tradeo is not interesting; using hashing yields an asymptotically vanishing probability of error
with only (log
n
) bits per element. Bloom lters have therefore received little attention in the
theoretical community. In contrast, for practical applications the price of a constant false positive
probabilitymaywell b e worthwhile to reduce the necessary space.
2.3 Standard Bloom lter tricks
The simple structure of Bloom lters makes certain operations very easy to implement. For example,
suppose one has two Blo om lters representing sets
S
1
and
S
2
with the same number of bits and
using the same number of hash functions. Then a Bloom lter that represents the union of two sets
can be obtained by taking the OR of the two bit vectors of the original Blo om lters.
Another nice feature is that Bloom lters can easily b e halved in size. Supp ose the size of the
lter is a power of 2. If one wants to half the size of the lter, just OR the rst and second halves
together. When hashing, the high order bit can b e masked.