没有合适的资源?快使用搜索试试~ 我知道了~
首页哈希搜索与素描:信息时代的新搜索技术
哈希搜索与素描:信息时代的新搜索技术
需积分: 9 20 下载量 58 浏览量
更新于2024-07-27
1
收藏 889KB PDF 举报
"Hashing, Searching, and Sketching in the Information Age: A Comprehensive Study on Exact and Fuzzy Search Algorithms"
在当今的信息时代,信息检索的方式变得前所未有的多样。搜索需求可以是精确的,即输入查询需精确匹配目标对象(例如数据库查询),也可以是模糊的,如图像搜索、新闻搜索和相似文档查找,这使得搜索问题的复杂性显著增加。Hashing作为一种简单而有效的工具,利用随机哈希函数将数据项映射到桶(或称槽)中,形象地说就是“把球扔进篮子”。本书深入研究了基于哈希和sketching(数据压缩表示)的各种搜索算法,并探讨了这些方法在实现精确和模糊搜索中的应用及潜在局限。
对于精确搜索,章节会讨论如何通过变种的“球与篮子”过程设计出空间效率高的哈希表维护方法。例如,布隆过滤器(Bloom Filters)就是一种常见的哈希技术,它能在有限的空间内判断一个元素是否在一个集合中,但允许一定概率的误判,这对于大规模数据预查非常有效。
模糊搜索方面,将聚焦于一种特殊的哈希技术——局部敏感哈希(LSH),它能够在保持线性空间复杂度的同时,提高搜索性能,尤其是在近似匹配场景下。LSH通过构造具有特定相似度敏感性的哈希函数,将相似的数据映射到相近的桶中,从而加速搜索过程。这种思想也被应用到kd树(k-dimensional tree)等数据结构中,进一步优化了查询效率。
此外,书中还揭示了一些这些方法的理论极限,通过展示性能下限来揭示它们在某些情况下的局限性和最优实践。这些下界分析有助于我们理解在特定资源限制下,哪些哈希和sketching策略能提供最高效和可行的解决方案。
Rina Panigrahy的博士论文探讨了哈希、搜索和sketching在信息时代的交叉应用,不仅介绍了实际操作的算法,还涵盖了理论基础和性能评估,为理解和优化现代信息检索提供了深入的洞察。
4
CHAPTER 1. INTRODUCTION
space that preserve essential properties of the original objects – such a resulting bitmap is
often called a ‘sketch’ of the object. Sketch similarity can be used to determine likeness
of the original objects they represent. In many cases, given a database of n objects, it is
possible to produce a sketch of size O(log n) for each object that is sufficient to estimate
their similarity with a high degree of confidence. Since sketches often preserve some of
the information in the original object, in some applications they may be used to perform
approximate computations without storing the objects explicitly.
Bloom filter: A Bloom filter is a space efficient succinct sketch of a set that can be used
to answer set membership, set containment, or set intersection queries with a high degree of
confidence. It is an inexact representation of a set that allows for a small probability of false
positives when queried; that is, it can sometimes say that an element is in the set when it is
not. In return, a Bloom filter offers very compact storage: less than 10 bits per element are
required for a 1% false positive probability, independent of the size or number of elements
in the set. Since bloom filters are compact, they can be stored closer to the processer and
can be used for a quick membership check before resorting to secondary memory access
and only if required. There has recently been a surge in the popularity of Bloom filters and
its variants, especially in networking [16].
Locality Sensitive Hashing: Sketching and hashing can also be used for similarity
search in a collection of high-dimensional objects; the idea is to use locality-sensitive hash
functions that tend to hash similar objects to the same hash value (bucket) and dissimilar
objects to different hash values. Contrast this with sketching where similar objects map to
the similar bit maps.
Streaming: Hashing and sketching are also commonly used techniques in a stream
computation where a large collection of objects arrive in sequence and a certain quantity,
say the most frequent element, may need to be computed without storing all the objects
explicitly; the idea is to use minimal space while scanning the stream.
In this thesis, we study algorithms for different kinds of search, hashing and sketching,
and also the fundamental limits of what can be realized using some of these approaches.
For exact search, we show how variants of balls-and-bins processes can be used to derive
space efficient methods for maintaining hash tables, resulting in high space utilization. We
will also see some new results on balls-and-bins processes when bins are chosen according
5
to a given underlying graph structure and study the distribution of balls into bins in terms of
graph properties. Balls-and-bins processes also result in more compact version of Bloom
filters for approximate set operations.
We will also study the similarity search problem by viewing it as a nearest neighbor
search in a high dimensional space. We will see a variant of locality sensitive hashing
that uses linear space by searching for points in the neighborhood of the query point. This
method is also applicable to the kd-tree data structure and results in an improved rate of
success in finding an approximate nearest neighbor. We will also see tight lower bounds
on the performance of the locality sensitive hashing technique for nearest neighbor search.
In the area of streaming algorithms, we show tight lower bounds on the space required for
finding frequent elements in a stream. We will conclude with a sketching algorithm for
similarity search on hierarchical data represented as trees.
Specifically we obtain the following results:
1. Exact Search Using Hashing: An important objective in hashing is to reduce the
number of collisions that result in many items landing in one bucket thus increasing
the search time. A time space tradeoff is obtained by using larger hash tables that
increase space overhead but result in fewer collisions. For exact search, we show
how hash tables can be maintained with 83% space utilization, while requiring only
a few (2) memory accesses per hash lookup.
2. Balls-and-bins in graphs: Balls-and-bins processes are central to the analysis of
hashing. It is well known that if n balls are randomly inserted into n bins, then
with high probability, the bin with maximum load contains (1 + o(1)) log n/ log log n
balls. Azar, Broder, Karlin, and Upfal [2] showed that instead of choosing one bin,
if d ≥ 2 bins are chosen at random and the ball inserted into the least loaded of the
d bins, the maximum load reduces drastically to log log n/ logd + O(1). We study
the two-choice balls-and-bins process when balls are not allowed to choose any two
random bins, but only bins that are connected by an edge in an underlying graph. We
show that for n balls and n bins, if the graph is almost regular with degree n
ǫ
, where ǫ
is not too small, the previous bounds on the maximum load continue to hold. Further
this does not hold for non-regular graphs even if the minimum degree is high.
6
CHAPTER 1. INTRODUCTION
3. Counting Bloom Filters based on Multiple Choice Hashing: A counting Bloom filter
(CBF) generalizes a Bloom filter data structure so as to allow membership queries on
a set that can be changing dynamically via insertions and deletions. As with a Bloom
filter, a CBF obtains space savings by allowing false positives. Because of its use of
counters, however, CBFs can in practice be prohibitively expensive in terms of space
usage. Alternative constructions with the same functionality have been suggested,
but they appear much more complex, and have not been tested in practice. We pro-
vide a simple hashing-based alternative that offers the same functionality as a CBF,
but with much better space usage, generally reducing the space required by a factor
of two or more for very practical settings.
4. Hashing for Nearest-Neighbor Search: We show a variant of locality-sensitive hash-
ing for approximate nearest neighbor search in a Euclidean space that can compute
the c-approximate nearest neighbor can be in time
˜
O(n
ρ
) and near linear space where
ρ ≈ 2.06/c. A c-approximate nearest neighbor is a neighbor that is within distance
at most c times the distance to the nearest neighbor.
5. Lower bounds on Nearest-Neighbor Search using Hashing: Locality sensitive hash-
ing is a hash based method for nearest neighbor search that maps points in the space
to a discrete space where nearby points out likely to get hashed to the same value
and far off points are likely to get hashed to different values. More precisely, given
parameter m that denotes an upper bound on the probability that two points at most r
apart hash to the same bucket, and g a lower bound on the probability that two points
more than cr apart hash to the same bucket, it is known [69] that such a hash function
can find a c-approximate nearest neighbor in
˜
O(d + n
ρ
) time using a data structure
of size
˜
O(n
1+ρ
) where ρ = log(1/m)/log(1/g). We show lower bounds on the best
possible values for ρ in l
1
and l
2
norm; for the l
1
norm we show that it is impossible
to achieve ρ ≤
1
2c
and for the l
2
norm we show a bound of
1
2c
2
6. Nearest-neighbor Search using kd-Trees: Kd-trees are one of the most commonly
used methods for nearest neighbor search. For low dimensions this data structure can
be used for answering nearest neighbor queries in logarithmic time and linear space.
However the performance seems to degrade as a number of dimensions becomes
7
larger than two. We show that a simple modification to the search algorithm on a kd-
tree can be used to find the nearest neighbor in high d-dimensions. The modification
consists of simply perturbing the query point before traversing the tree, and repeating
this for a few iterations. For a certain database of random points, we show that kd-tree
succeeds in finding a c-approximate nearest neighbor only with probability e
−Ω(d/c)
,
where as the modified algorithm gives a much higher rate of success.
7. Space lower bounds on finding frequent element in a stream: It is well known that
the space complexity of finding the frequent element in a stream of n elements is
Ω(n) in the worst case. However in practice, often the element frequencies follow a
well known distribution such as the zipfian or a normal distribution. We show, for
any given distribution, tight lower bounds on the space complexity in terms of the
moments of the distribution. Specifically we show a space lower bound of Ω(F
2
/F
2
∞
)
for finding the most frequent element in a stream.
8 Sketching Algorithms for Trees: We study sketching algorithms for computing sim-
ilarities between hierarchical data. Specifically, we look at data objects that can be
represented using leaf-labeled trees denoting a set of elements at the leaves orga-
nized in a hierarchy. We measure distance between such trees using the best possible
super-imposition of the trees that minimizes the number of mismatched leaf labels.
Our distance measure is also equivalent to an Earth Mover’s Distance(EMD) mea-
sure since a leaf-labeled tree of height one can be viewed as a set. This definition can
be recursively extended to multi-level trees. We propose algorithms to sketch trees
and analyze them in the context of locality-sensitive hashing (LSH) where the prob-
ability of two sketches matching is high when two trees are similar and low when
the two trees are far under the given distance measure. Specifically, we compute
sketches of sets by propagating min-hash computations up the tree. Furthermore, we
show that while propagating one min-hash per level gives poor results, propagating
two min-hashes produces good sketches.
Chapter 2
Efficient Hashing with Lookups in two
Memory Accesses
2.1 Introduction
Hashing is a simple and effective method for exact search. An important objective in hash-
ing is to reduce the number of collisions since they cause many items to land in one bucket
thus increasing the search time. A time space tradeoff is obtained by using larger hash
tables that increase space overhead but result in fewer collisions. We show how hash tables
can be used with high space utilization and few memory accesses per hash lookup.
The study of hashing is closely related to the analysis of balls and bin. One of the classi-
cal results in this area is that, asymptotically, if n balls are thrown into n bins independently
and randomly then the largest bin has (1 + o(1)) ln n/ lnln n balls, with high probability.
Azar et. al. [2] showed that instead of using a single hash function, if we randomly hash a
ball into two bins and place it in the smaller of the two, then this dramatically lowers the
maximum load on bins. This leads to the concept of two-way hash functions (figure 2.1)
where the largest bucket contains O(log log n) balls. The hash look up will now search in
both the buckets an item hashes to. The intuition behind this exponential improvement in
the maximum load is as follows: if p
i
denotes the fraction of bins with load at least i, then
observe that at any step the probability that a ball ends up in a bin with load at least i (to
create a bin of at least i + 1) is p
2
i
as this can happen only if both bins have height at least
8
剩余153页未读,继续阅读
独孤过儿
- 粉丝: 586
- 资源: 38
上传资源 快速赚钱
- 我的内容管理 展开
- 我的资源 快来上传第一个资源
- 我的收益 登录查看自己的收益
- 我的积分 登录查看自己的积分
- 我的C币 登录后查看C币余额
- 我的收藏
- 我的下载
- 下载帮助
最新资源
- 磁性吸附笔筒设计创新,行业文档精选
- Java Swing实现的俄罗斯方块游戏代码分享
- 骨折生长的二维与三维模型比较分析
- 水彩花卉与羽毛无缝背景矢量素材
- 设计一种高效的袋料分离装置
- 探索4.20图包.zip的奥秘
- RabbitMQ 3.7.x延时消息交换插件安装与操作指南
- 解决NLTK下载停用词失败的问题
- 多系统平台的并行处理技术研究
- Jekyll项目实战:网页设计作业的入门练习
- discord.js v13按钮分页包实现教程与应用
- SpringBoot与Uniapp结合开发短视频APP实战教程
- Tensorflow学习笔记深度解析:人工智能实践指南
- 无服务器部署管理器:防止错误部署AWS帐户
- 医疗图标矢量素材合集:扁平风格16图标(PNG/EPS/PSD)
- 人工智能基础课程汇报PPT模板下载
安全验证
文档复制为VIP权益,开通VIP直接复制
信息提交成功