Bloom Filter: Network Applications and Probabilistic Data Structures

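This document is a survey of the network applications of Bloom filters, written by Andrei Broder and Michael Mitzenmacher. It describes the resurgence of this data structure in large-scale network applications such as shared web caches, query routing, and replica location, and it discusses modern variants and the mathematics behind them.

A Bloom filter is a space-efficient data structure introduced by Burton Bloom in 1970; spell checking was among its early uses. Its main purpose is to answer membership queries approximately, that is, to decide whether an element probably belongs to a set. It maps each element to positions in a fixed-size bit array using several independent hash functions, which gives a compact representation of the set. The representation saves a great deal of space but allows false positives: an element that is not in the set may be reported as present. Early on, Bloom filters were used mainly in database applications, but over the past few decades they have become increasingly important in networking. In a shared web cache, a Bloom filter can quickly check whether a page is probably cached elsewhere and so avoid unnecessary network requests. In query routing, it can predict whether a target server is likely to hold the requested data, which reduces network traffic. In replica location, it helps identify where data may be stored, which speeds up retrieval.

Modern variants include compressed Bloom filters, which reduce the size of a filter sent over the network, and counting Bloom filters, which support deletions from dynamic sets; the cuckoo filter is a related structure that also supports deletion. The mathematical foundations are probability theory and hash function design, which together allow the false positive rate to be analyzed and minimized for a given amount of space. The survey lists many recent applications of Bloom filters, both to show readers how widely this old data structure is used and to encourage new applications. Because they are efficient and simple, Bloom filters have become an important tool in data science, network engineering, and distributed systems.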

Asymptotically, then, the performance is the same as the original scheme. However, since

$$\left(1 - \frac{k}{m}\right)^{n} \le \left(1 - \frac{1}{m}\right)^{kn},$$

the probability of a false positive is actually always slightly higher with this division. Since the difference is small, this approach may still be useful for implementation reasons; for example, dividing the bits among the hash functions may make parallelization of array accesses easier.
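To get a feel for how small the difference is, the two schemes can be compared numerically. The sketch below is illustrative only: it assumes $m$ bits, $n$ inserted elements and $k$ hash functions, reads "this division" as giving each hash function its own disjoint block of $m/k$ bits, and uses the usual approximation that a false positive occurs when all $k$ probed bits are 1. The function names and the sample values are hypothetical.

```python
def zero_fraction_standard(m: int, n: int, k: int) -> float:
    """Probability that a given bit is still 0 when every hash ranges over all m bits."""
    return (1 - 1 / m) ** (k * n)


def zero_fraction_partitioned(m: int, n: int, k: int) -> float:
    """Same probability when each hash owns a disjoint block of m/k bits (assumed layout)."""
    return (1 - k / m) ** n


def false_positive(zero_fraction: float, k: int) -> float:
    """Approximate false positive rate: all k probed bits are 1."""
    return (1 - zero_fraction) ** k


# Illustrative parameters (not from the source); k divides m.
m, n, k = 8000, 1000, 5
std = false_positive(zero_fraction_standard(m, n, k), k)
part = false_positive(zero_fraction_partitioned(m, n, k), k)
print(f"standard:    {std:.6f}")
print(f"partitioned: {part:.6f}")
```

For these values the partitioned scheme leaves slightly fewer 0 bits, so its false positive rate is slightly higher, in line with the inequality.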

Suppose we are given $m$ and $n$ and we wish to optimize for the number of hash functions. There are two competing forces: using more hash functions gives us more chances to find a 0 bit for an element that is not a member of $S$, but using fewer hash functions increases the fraction of 0 bits in the array. The optimal number of hash functions that minimizes $f$ as a function of $k$ is easily found by taking the derivative. More conveniently, note that $f$ equals $\exp(k \ln(1 - e^{-kn/m}))$. Let $g = k \ln(1 - e^{-kn/m})$. Minimizing the false positive rate $f$ is equivalent to minimizing $g$ with respect to $k$. We find

$$\frac{dg}{dk} = \ln\left(1 - e^{-kn/m}\right) + \frac{kn}{m} \cdot \frac{e^{-kn/m}}{1 - e^{-kn/m}}.$$

It is easy to check that the derivative is 0 when $k = \ln 2 \cdot (m/n)$; further efforts reveal that this is a global minimum. Alternatively, using $p = e^{-kn/m}$, we find

$$g = -\frac{m}{n} \ln(p) \ln(1 - p),$$

from which symmetry reveals that the minimum value for $g$ occurs when $p = 1/2$, or equivalently $k = \ln 2 \cdot (m/n)$. In this case the false positive rate $f$ is $(1/2)^k = (0.6185)^{m/n}$. In practice, of course, $k$ must be an integer, and a smaller $k$ might be preferred since it reduces the amount of computation necessary.
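The optimization above can be checked numerically. The following is a minimal sketch, assuming the approximation $f = (1 - e^{-kn/m})^k$ for the false positive rate; the function names and the choice $m/n = 8$ are illustrative and not part of the source.

```python
import math


def false_positive_rate(m: int, n: int, k: int) -> float:
    """Approximate false positive rate f = (1 - e^{-kn/m})^k."""
    return (1 - math.exp(-k * n / m)) ** k


def optimal_k(m: int, n: int) -> float:
    """Real-valued minimizer k = ln 2 * (m/n)."""
    return math.log(2) * m / n


def best_integer_k(m: int, n: int) -> int:
    """Pick whichever neighbouring integer of the real optimum gives the lower rate."""
    k_star = optimal_k(m, n)
    candidates = {max(1, math.floor(k_star)), max(1, math.ceil(k_star))}
    return min(candidates, key=lambda k: false_positive_rate(m, n, k))


# Illustrative sizes: 8 bits per element.
m, n = 8000, 1000
k_star = optimal_k(m, n)
k_int = best_integer_k(m, n)
print(f"real optimum k  = {k_star:.3f}")
print(f"integer choice  = {k_int}")
print(f"rate at integer = {false_positive_rate(m, n, k_int):.5f}")
print(f"(1/2)^k*        = {0.5 ** k_star:.5f}")
```

Comparing the rates at the two neighbouring integers also shows how flat the minimum is, which is why a smaller $k$ can be an acceptable trade for less hashing work.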

2.2 Hashing vs. Bloom filters

Another natural way to represent a set is to use hashing. Each item of the set can be hashed into $\Theta(\log n)$ bits, and a (sorted) list of hash values then represents the set. This approach yields very small error probabilities. For example, using $2 \log_2 n$ bits per set element, the probability that two distinct elements obtain the same hash value is $1/n^2$. Hence the probability that any element not in the set matches some hash value in the set is at most $n/n^2 = 1/n$ by the standard union bound.
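The hashing representation can be sketched in a few lines. This is only an illustration under stated assumptions: SHA-256 truncated to $2\lceil\log_2 n\rceil$ bits stands in for an ideal hash function, membership is tested by binary search over the sorted list, and the names `h`, `build`, and `might_contain` are hypothetical.

```python
import bisect
import hashlib
import math


def h(item: str, bits: int) -> int:
    """Truncate SHA-256 to its top `bits` bits (a stand-in for an ideal hash)."""
    digest = hashlib.sha256(item.encode()).digest()
    return int.from_bytes(digest, "big") >> (256 - bits)


def build(items: list[str]) -> tuple[list[int], int]:
    """Represent a non-empty set as a sorted list of hash values of about 2*log2(n) bits."""
    n = len(items)
    bits = max(1, 2 * math.ceil(math.log2(n)))
    return sorted({h(x, bits) for x in items}), bits


def might_contain(table: list[int], bits: int, item: str) -> bool:
    """True if the item's hash value appears in the list (false positives possible)."""
    v = h(item, bits)
    i = bisect.bisect_left(table, v)
    return i < len(table) and table[i] == v


items = [f"item-{i}" for i in range(1000)]
table, bits = build(items)
print(bits, might_contain(table, bits, "item-7"))
```

With $n = 1000$ the sketch spends 20 bits per element, so the union bound puts the chance that a non-member matches at roughly $n/2^{20}$, about one in a thousand.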

Bloom filters can be interpreted as a natural generalization of hashing that allows more interesting tradeoffs between the number of bits used per set element and the probability of false positives. (Indeed, a Bloom filter with just one hash function is equivalent to hashing.) Bloom filters yield a constant false positive probability even if a constant number of bits are used per set element. For example, when $m = 8n$, the false positive probability is just over 0.02. For most theoretical analyses, this tradeoff is not interesting; using hashing yields an asymptotically vanishing probability of error with only $\Theta(\log n)$ bits per element. Bloom filters have therefore received little attention in the theoretical community. In contrast, for practical applications the price of a constant false positive probability may well be worthwhile to reduce the necessary space.

2.3 Standard Bloom filter tricks

The simple structure of Bloom filters makes certain operations very easy to implement. For example, suppose one has two Bloom filters representing sets $S_1$ and $S_2$ with the same number of bits and using the same hash functions. Then a Bloom filter that represents the union of the two sets can be obtained by taking the OR of the two bit vectors of the original Bloom filters.
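The union trick is easy to see in code. The sketch below is a minimal illustration, not a production filter: the `BloomFilter` class is hypothetical, a Python integer serves as the bit vector, and the $k$ hash functions are simulated by salting SHA-256 with the index of the hash function.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter with m bits and k salted SHA-256 hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.m, self.k = m, k
        self.bits = 0  # Python int used as an m-bit vector

    def _positions(self, item: str):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

    def union(self, other: "BloomFilter") -> "BloomFilter":
        """OR the bit vectors; both filters must share size and hash functions."""
        if (self.m, self.k) != (other.m, other.k):
            raise ValueError("filters must use the same m and k")
        out = BloomFilter(self.m, self.k)
        out.bits = self.bits | other.bits
        return out


a, b = BloomFilter(256, 3), BloomFilter(256, 3)
for word in ("apple", "banana"):
    a.add(word)
b.add("cherry")
u = a.union(b)
print("apple" in u, "cherry" in u)
```

The OR of the two vectors is bit-for-bit the filter one would get by inserting every element of both sets into a fresh filter, which is why no rehashing is needed.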

Another nice feature is that Bloom filters can easily be halved in size. Suppose the size of the filter is a power of 2. If one wants to halve the size of the filter, just OR the first and second halves together. When hashing, the high order bit can be masked.
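Halving can be sketched the same way. The code below is illustrative only: it assumes the filter size $m$ is a power of 2, that positions come from salted SHA-256 values masked to $\log_2 m$ bits, and it uses a Python integer as the bit vector; all function names are hypothetical.

```python
import hashlib


def positions(item: str, m: int, k: int) -> list[int]:
    """k bit positions in [0, m); m must be a power of 2 so masking works."""
    out = []
    for i in range(k):
        digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
        out.append(int.from_bytes(digest[:8], "big") & (m - 1))
    return out


def build(items: list[str], m: int, k: int) -> int:
    """Return the bit vector (as an int) of a filter holding `items`."""
    bits = 0
    for item in items:
        for pos in positions(item, m, k):
            bits |= 1 << pos
    return bits


def halve(bits: int, m: int) -> tuple[int, int]:
    """OR the low and high halves together; returns (new_bits, new_m)."""
    half = m // 2
    low = bits & ((1 << half) - 1)
    high = bits >> half
    return low | high, half


def contains(bits: int, m: int, k: int, item: str) -> bool:
    return all((bits >> pos) & 1 for pos in positions(item, m, k))


items = ["alpha", "beta", "gamma", "delta"]
big = build(items, 1024, 4)
small, small_m = halve(big, 1024)
print(small_m, all(contains(small, small_m, 4, x) for x in items))
```

Masking with `small_m - 1` drops exactly the high order bit of each old position, so the halved vector equals the filter that would have been built at the smaller size from the start, at the cost of a higher false positive rate.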
