Constructing and Evaluating Simple, Space-Efficient Minimal Perfect Hash Functions


"这篇论文是关于简单且空间高效的最小完美哈希函数的，由Fabiano C. Botelho、Rasmus Pagh和Nivio Ziviani撰写，发表于2007年的WADS（Workshop on Algorithms and Data Structures）会议。论文主要探讨了如何构建和评估一种特殊类型的哈希函数，即完美哈希函数（Perfect Hash Function, PHF），其目的是在给定键集S的情况下，将S中的所有键映射到唯一的、非重复的值。" 在计算机科学领域，哈希函数是数据结构和算法设计中的关键组成部分，它们能够快速地将任意大小的输入（如字符串或数字）转换为固定大小的输出（通常是一个整数）。完美哈希函数（Perfect Hash Function, PHF）是一种特殊的哈希函数，它确保对特定集合的所有键进行哈希时，每个键都会被映射到一个唯一的位置，不会出现冲突。这在需要高效查找和存储无重复元素的数据结构中非常有用，例如在关联数组和数据库索引中。论文指出，构建一个最小完美哈希函数所需的存储空间大约是1.44n^2/m位，其中n是键集S的大小，m是哈希表的大小。然而，该论文提出了一种新的算法，能够在n=m约为1.23n的情况下，构建和评估具有以下特性的PHF： 1. **常数时间评估**：对于已构建的完美哈希函数，执行哈希操作的时间复杂度为O(1)，这意味着无论输入大小如何，执行哈希计算都非常快。 2. **线性时间构建与评估**：新算法的构建和评估过程都在线性时间内完成，即时间复杂度为O(n)，这显著提高了效率，尤其是在处理大量数据时。 3. **接近理论最小空间需求**：所需存储空间仅比信息理论的最小值大一个因子2。这意味着在保证高效性能的同时，尽可能减少了内存占用。据作者所知，这是首次有算法同时满足以上三个条件。以往文献中满足第三条件的算法要么需要指数时间来构建和评估，要么依赖于近似最优的解决方案。这篇论文对计算机科学，特别是数据结构和算法设计领域的贡献在于提供了一种既简单又高效的最小完美哈希函数构造方法，能够在实际应用中实现快速查找和存储，同时保持较低的内存需求。这对于处理大规模数据集和优化内存敏感的应用程序至关重要。

2.1 Theoretical Results

Fredman and Komlós [9] proved that at least n log e + log log u − O(log n) bits are required to represent a MPHF (in the worst case over all sets of size n), provided that u ≥ n^α for some α > 2. Logarithms are in base 2. Note that the two last terms are negligible under the assumption log u ≪ n. In general, for m > n the space required to represent a PHF is around (1 + (m/n − 1) ln(1 − n/m)) n log e bits. A simpler proof of this was later given by Radhakrishnan [18].
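As a rough numerical illustration of these bounds, the sketch below evaluates the leading terms per key, ignoring the terms in u and the O(log n) term. The function name `phf_bits_per_key` and the parameter `load` (meaning n/m) are introduced here for illustration and do not come from the paper.

```python
import math

# Leading term of the MPHF bound: log2(e) bits per key (about 1.44).
LOG2_E = math.log2(math.e)


def phf_bits_per_key(load: float) -> float:
    """Approximate PHF space per key for load = n/m, with 0 < load < 1.

    Evaluates (1 + (m/n - 1) * ln(1 - n/m)) * log2(e).
    """
    if not 0.0 < load < 1.0:
        raise ValueError("load must satisfy 0 < n/m < 1")
    ratio = 1.0 / load  # m/n
    return (1.0 + (ratio - 1.0) * math.log(1.0 - load)) * LOG2_E


print(f"MPHF (m = n):     {LOG2_E:.3f} bits/key")
print(f"PHF  (m = 1.23n): {phf_bits_per_key(1 / 1.23):.3f} bits/key")
print(f"PHF  (m = 2n):    {phf_bits_per_key(0.5):.3f} bits/key")
```

As m/n approaches 1 the PHF expression tends to log e, recovering the MPHF bound, and it decreases as the table is allowed to grow relative to the key set.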

Mehlhorn [15] showed that the Fredman-Komlós bound is almost tight by providing an algorithm that constructs a MPHF that can be represented with at most n log e + log log u + O(log n) bits. However, his algorithm is far from practical because its construction and evaluation time is exponential in n.

Schmidt and Siegel [19] proposed the first algorithm for constructing a MPHF with constant evaluation time and description size O(n + log log u) bits. Their algorithm, as well as all other algorithms we will consider, is for the Word RAM model of computation [10]. In this model an element of the universe U fits into one machine word, and arithmetic operations and memory accesses have unit cost. From a practical point of view, the algorithm of Schmidt and Siegel is not attractive. The scheme is complicated to implement and the constant of the space bound is large: for a set of n keys, at least 29n bits are used, which means a space usage similar in practice to the best schemes using O(n log n) bits. Though it seems that [19] aims to describe its algorithmic ideas in the clearest possible way, not trying to optimize the constant, it appears hard to improve the space usage significantly.

More recently, Hagerup and Tholey [11] have come up with the best theoretical result we know of. The MPHF obtained can be evaluated in O(1) time and stored in n log e + log log u + O(n(log log n)^2 / log n + log log log u) bits. The construction time is O(n + log log u) using O(n) words of space. Again, the terms involving u are negligible. In spite of its theoretical importance, the Hagerup and Tholey [11] algorithm is also not practical, as it emphasizes asymptotic space complexity only. (It is also very complicated to implement, but we will not go into that.) For n < 2^150 the scheme is not well-defined, as it relies on splitting the key set into buckets of size n̂ ≤ log n/(21 log log n). If we fix this by letting the bucket size be at least 1, then buckets of size one will be used for n < 2^300, which means that the space usage will be at least (3 log log n + log 7) n bits. For a set of a billion keys, this is more than 17 bits per element. Thus, the Hagerup-Tholey MPHF is not space efficient in practical situations. While we believe that their algorithm has been optimized for simplicity of exposition, rather than constant factors, it seems difficult to significantly reduce the space usage based on their approach.
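The figures quoted for the Hagerup-Tholey scheme can be checked with a few lines of arithmetic. The sketch below is only a check of the two expressions stated above, with base-2 logarithms; the function names are introduced here for illustration.

```python
import math


def bucket_size_bound(log2_n: float) -> float:
    """Upper bound log n / (21 log log n) on the bucket size, given log2(n)."""
    return log2_n / (21.0 * math.log2(log2_n))


def size_one_bucket_bits_per_key(n: float) -> float:
    """Per-key space 3 log log n + log 7 when buckets of size one are used."""
    return 3.0 * math.log2(math.log2(n)) + math.log2(7)


# At n = 2^150 the bound is still below 1, so no valid bucket size exists.
print(f"bound at n = 2^150: {bucket_size_bound(150):.3f}")
# At n = 2^300 the bound is between 1 and 2, so buckets hold a single key.
print(f"bound at n = 2^300: {bucket_size_bound(300):.3f}")
# One billion keys: a little over 17 bits per key.
print(f"bits/key at n = 10^9: {size_one_bucket_bits_per_key(1e9):.2f}")
```

For comparison, the leading term of the lower bound, log e, is about 1.44 bits per key, so at this size the scheme uses roughly twelve times the minimum.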

2.2 Practical Results

We now describe some of the main "practical" results that our work is based on. They are characterized by simplicity and (provably) low constant factors.
