哈希搜索与素描：信息时代的新搜索技术

需积分: 9 58 浏览量更新于2024-07-27 1 收藏 889KB PDF 举报

"Hashing, Searching, and Sketching in the Information Age: A Comprehensive Study on Exact and Fuzzy Search Algorithms" 在当今的信息时代，信息检索的方式变得前所未有的多样。搜索需求可以是精确的，即输入查询需精确匹配目标对象（例如数据库查询），也可以是模糊的，如图像搜索、新闻搜索和相似文档查找，这使得搜索问题的复杂性显著增加。Hashing作为一种简单而有效的工具，利用随机哈希函数将数据项映射到桶（或称槽）中，形象地说就是“把球扔进篮子”。本书深入研究了基于哈希和sketching（数据压缩表示）的各种搜索算法，并探讨了这些方法在实现精确和模糊搜索中的应用及潜在局限。对于精确搜索，章节会讨论如何通过变种的“球与篮子”过程设计出空间效率高的哈希表维护方法。例如，布隆过滤器(Bloom Filters)就是一种常见的哈希技术，它能在有限的空间内判断一个元素是否在一个集合中，但允许一定概率的误判，这对于大规模数据预查非常有效。模糊搜索方面，将聚焦于一种特殊的哈希技术——局部敏感哈希(LSH)，它能够在保持线性空间复杂度的同时，提高搜索性能，尤其是在近似匹配场景下。LSH通过构造具有特定相似度敏感性的哈希函数，将相似的数据映射到相近的桶中，从而加速搜索过程。这种思想也被应用到kd树(k-dimensional tree)等数据结构中，进一步优化了查询效率。此外，书中还揭示了一些这些方法的理论极限，通过展示性能下限来揭示它们在某些情况下的局限性和最优实践。这些下界分析有助于我们理解在特定资源限制下，哪些哈希和sketching策略能提供最高效和可行的解决方案。 Rina Panigrahy的博士论文探讨了哈希、搜索和sketching在信息时代的交叉应用，不仅介绍了实际操作的算法，还涵盖了理论基础和性能评估，为理解和优化现代信息检索提供了深入的洞察。

CHAPTER 1. INTRODUCTION

space that preserve essential properties of the original objects – such a resulting bitmap is

often called a ‘sketch’ of the object. Sketch similarity can be used to determine likeness

of the original objects they represent. In many cases, given a database of n objects, it is

possible to produce a sketch of size O(log n) for each object that is suﬃcient to estimate

their similarity with a high degree of conﬁdence. Since sketches often preserve some of

the information in the original object, in some applications they may be used to perform

approximate computations without storing the objects explicitly.

Bloom ﬁlter: A Bloom ﬁlter is a space eﬃcient succinct sketch of a set that can be used

to answer set membership, set containment, or set intersection queries with a high degree of

conﬁdence. It is an inexact representation of a set that allows for a small probability of false

positives when queried; that is, it can sometimes say that an element is in the set when it is

not. In return, a Bloom ﬁlter oﬀers very compact storage: less than 10 bits per element are

required for a 1% false positive probability, independent of the size or number of elements

in the set. Since bloom ﬁlters are compact, they can be stored closer to the processer and

can be used for a quick membership check before resorting to secondary memory access

and only if required. There has recently been a surge in the popularity of Bloom ﬁlters and

its variants, especially in networking [16].

Locality Sensitive Hashing: Sketching and hashing can also be used for similarity

search in a collection of high-dimensional objects; the idea is to use locality-sensitive hash

functions that tend to hash similar objects to the same hash value (bucket) and dissimilar

objects to diﬀerent hash values. Contrast this with sketching where similar objects map to

the similar bit maps.

Streaming: Hashing and sketching are also commonly used techniques in a stream

computation where a large collection of objects arrive in sequence and a certain quantity,

say the most frequent element, may need to be computed without storing all the objects

explicitly; the idea is to use minimal space while scanning the stream.

In this thesis, we study algorithms for diﬀerent kinds of search, hashing and sketching,

and also the fundamental limits of what can be realized using some of these approaches.

For exact search, we show how variants of balls-and-bins processes can be used to derive

space eﬃcient methods for maintaining hash tables, resulting in high space utilization. We

will also see some new results on balls-and-bins processes when bins are chosen according

to a given underlying graph structure and study the distribution of balls into bins in terms of

graph properties. Balls-and-bins processes also result in more compact version of Bloom

ﬁlters for approximate set operations.

We will also study the similarity search problem by viewing it as a nearest neighbor

search in a high dimensional space. We will see a variant of locality sensitive hashing

that uses linear space by searching for points in the neighborhood of the query point. This

method is also applicable to the kd-tree data structure and results in an improved rate of

success in ﬁnding an approximate nearest neighbor. We will also see tight lower bounds

on the performance of the locality sensitive hashing technique for nearest neighbor search.

In the area of streaming algorithms, we show tight lower bounds on the space required for

ﬁnding frequent elements in a stream. We will conclude with a sketching algorithm for

similarity search on hierarchical data represented as trees.

Speciﬁcally we obtain the following results:

1. Exact Search Using Hashing: An important objective in hashing is to reduce the

number of collisions that result in many items landing in one bucket thus increasing

the search time. A time space tradeoﬀ is obtained by using larger hash tables that

increase space overhead but result in fewer collisions. For exact search, we show

how hash tables can be maintained with 83% space utilization, while requiring only

a few (2) memory accesses per hash lookup.

2. Balls-and-bins in graphs: Balls-and-bins processes are central to the analysis of

hashing. It is well known that if n balls are randomly inserted into n bins, then

with high probability, the bin with maximum load contains (1 + o(1)) log n/ log log n

balls. Azar, Broder, Karlin, and Upfal [2] showed that instead of choosing one bin,

if d ≥ 2 bins are chosen at random and the ball inserted into the least loaded of the

d bins, the maximum load reduces drastically to log log n/ logd + O(1). We study

the two-choice balls-and-bins process when balls are not allowed to choose any two

random bins, but only bins that are connected by an edge in an underlying graph. We

show that for n balls and n bins, if the graph is almost regular with degree n

, where ǫ

is not too small, the previous bounds on the maximum load continue to hold. Further

this does not hold for non-regular graphs even if the minimum degree is high.

CHAPTER 1. INTRODUCTION

3. Counting Bloom Filters based on Multiple Choice Hashing: A counting Bloom ﬁlter

(CBF) generalizes a Bloom ﬁlter data structure so as to allow membership queries on

a set that can be changing dynamically via insertions and deletions. As with a Bloom

ﬁlter, a CBF obtains space savings by allowing false positives. Because of its use of

counters, however, CBFs can in practice be prohibitively expensive in terms of space

usage. Alternative constructions with the same functionality have been suggested,

but they appear much more complex, and have not been tested in practice. We pro-

vide a simple hashing-based alternative that oﬀers the same functionality as a CBF,

but with much better space usage, generally reducing the space required by a factor

of two or more for very practical settings.

4. Hashing for Nearest-Neighbor Search: We show a variant of locality-sensitive hash-

ing for approximate nearest neighbor search in a Euclidean space that can compute

the c-approximate nearest neighbor can be in time

O(n

) and near linear space where

ρ ≈ 2.06/c. A c-approximate nearest neighbor is a neighbor that is within distance

at most c times the distance to the nearest neighbor.

5. Lower bounds on Nearest-Neighbor Search using Hashing: Locality sensitive hash-

ing is a hash based method for nearest neighbor search that maps points in the space

to a discrete space where nearby points out likely to get hashed to the same value

and far oﬀ points are likely to get hashed to diﬀerent values. More precisely, given

parameter m that denotes an upper bound on the probability that two points at most r

apart hash to the same bucket, and g a lower bound on the probability that two points

more than cr apart hash to the same bucket, it is known [69] that such a hash function

can ﬁnd a c-approximate nearest neighbor in

O(d + n

) time using a data structure

of size

O(n

1+ρ

) where ρ = log(1/m)/log(1/g). We show lower bounds on the best

possible values for ρ in l

and l

norm; for the l

norm we show that it is impossible

to achieve ρ ≤

and for the l

norm we show a bound of

6. Nearest-neighbor Search using kd-Trees: Kd-trees are one of the most commonly

used methods for nearest neighbor search. For low dimensions this data structure can

be used for answering nearest neighbor queries in logarithmic time and linear space.

However the performance seems to degrade as a number of dimensions becomes

larger than two. We show that a simple modiﬁcation to the search algorithm on a kd-

tree can be used to ﬁnd the nearest neighbor in high d-dimensions. The modiﬁcation

consists of simply perturbing the query point before traversing the tree, and repeating

this for a few iterations. For a certain database of random points, we show that kd-tree

succeeds in ﬁnding a c-approximate nearest neighbor only with probability e

−Ω(d/c)

where as the modiﬁed algorithm gives a much higher rate of success.

7. Space lower bounds on ﬁnding frequent element in a stream: It is well known that

the space complexity of ﬁnding the frequent element in a stream of n elements is

Ω(n) in the worst case. However in practice, often the element frequencies follow a

well known distribution such as the zipﬁan or a normal distribution. We show, for

any given distribution, tight lower bounds on the space complexity in terms of the

moments of the distribution. Speciﬁcally we show a space lower bound of Ω(F

∞

)

for ﬁnding the most frequent element in a stream.

8 Sketching Algorithms for Trees: We study sketching algorithms for computing sim-

ilarities between hierarchical data. Speciﬁcally, we look at data objects that can be

represented using leaf-labeled trees denoting a set of elements at the leaves orga-

nized in a hierarchy. We measure distance between such trees using the best possible

super-imposition of the trees that minimizes the number of mismatched leaf labels.

Our distance measure is also equivalent to an Earth Mover’s Distance(EMD) mea-

sure since a leaf-labeled tree of height one can be viewed as a set. This deﬁnition can

be recursively extended to multi-level trees. We propose algorithms to sketch trees

and analyze them in the context of locality-sensitive hashing (LSH) where the prob-

ability of two sketches matching is high when two trees are similar and low when

the two trees are far under the given distance measure. Speciﬁcally, we compute

sketches of sets by propagating min-hash computations up the tree. Furthermore, we

show that while propagating one min-hash per level gives poor results, propagating

two min-hashes produces good sketches.

剩余153页未读，继续阅读

独孤过儿

粉丝: 586
资源: 38

哈希搜索与素描：信息时代的新搜索技术

最新资源