2 Overview of Techniques
Viewing buckets as bins and items as balls, we can look at the hashing process as if m balls are
being assigned to n bins. For each ball two bins are chosen at random. If the bins are imagined
to be the vertices of a graph, the two bins for a ball can be represented by an edge. This gives
us a random graph G on n vertices containing m edges. By making this graph directed, we could
use the direction of an edge to indicate the choice of the bin among the two for placing the ball.
The direction of each edge is chosen online by a certain procedure. The load of a vertex (bucket) is
equal to its in-degree. For each edge (item) insertion, the two-way hash algorithm directs the edge
towards the vertex with the lower in-degree. During the hash process, say U is one of the vertices
a ball gets hashed to. Observe that if VU is a directed edge, and if the load on V is significantly
lower, we could perform a move from U to V, thus freeing up a position in U. Essentially, in terms
of load, the new ball could be added to either U or V, whichever has a lower load. This principle
can be generalized to the case where there is a directed path from V to U, and results in
performing moves and flipping the directions along all the edges on the path. If there is a directed
sub-tree rooted at U, with all edges leading to the root, we could choose the least loaded vertex in
this tree to incur the load of the new ball. With this understanding, we will say that W is a child
of X if XW is a directed edge. So, our hash insert algorithm looks as follows.
• Compute the two bins U1 and U2 that the new item to be inserted hashes to.
• Explore vertices that can be reached from U1 or U2 by traversing along directed edges in the
reverse direction.
• Among such vertices, find one, V, with low load that can be reached, say from U1.
• Add the new item to U1 and perform moves along the path from U1 to V so that only the
load on V increases by one.
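The steps above can be sketched in code. The following Python is an illustrative sketch under our own assumptions, not the paper's implementation: the class name, the two hash choices, and the bookkeeping structures are all hypothetical, and the backward search explores every reverse edge rather than the truncated two-child search analyzed later.

```python
class TwoChoiceTable:
    """Illustrative sketch of two-choice hashing with moves (hypothetical).

    Each bin is a vertex; each stored item is a directed edge pointing to
    the bin that currently holds it, so a bin's load is its in-degree.
    """

    def __init__(self, n, depth):
        self.n = n                           # number of bins (vertices)
        self.depth = depth                   # backward-search depth limit
        self.bins = [[] for _ in range(n)]   # items stored in each bin
        self.alt = {}                        # item -> its unused alternate bin

    def _choices(self, item):
        # two hypothetical hash functions; any independent pair would do
        return hash(('h1', item)) % self.n, hash(('h2', item)) % self.n

    def insert(self, item):
        u1, u2 = self._choices(item)
        # Backward BFS: from bin x we can step to alt[i] for each item i
        # in x, since moving i there (flipping its edge) frees a slot in x.
        parent = {u1: None, u2: None}        # bin -> (previous bin, item moved)
        frontier = [u1, u2] if u1 != u2 else [u1]
        best = min(frontier, key=lambda b: len(self.bins[b]))
        for _ in range(self.depth):
            nxt = []
            for x in frontier:
                for i in self.bins[x]:
                    w = self.alt[i]
                    if w not in parent:
                        parent[w] = (x, i)
                        nxt.append(w)
                        if len(self.bins[w]) < len(self.bins[best]):
                            best = w
            frontier = nxt
        # Perform moves along the path from the least-loaded vertex found
        # back toward u1/u2, flipping edge directions; only `best` gains load.
        v = best
        while parent[v] is not None:
            x, i = parent[v]
            self.bins[x].remove(i)           # item i leaves bin x ...
            self.bins[v].append(i)           # ... and lands in its alternate v
            self.alt[i] = x                  # its unused choice is now x
            v = x
        # v is now u1 or u2 and has just shed one item (or was least loaded)
        self.bins[v].append(item)
        self.alt[item] = u2 if v == u1 else u1
```

Note the net effect matches the last bullet: every bin on the path loses one item and gains one, so only the load of the vertex chosen by the search increases by one.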
Let s = 2m/n denote the average degree of the undirected random graph G. Note that the
same graph G can be viewed as a directed or an undirected graph. Throughout the paper G refers
to the undirected version unless stated otherwise or clear from the context. Throughout
the paper we will assume that s is a constant. It turns out that the success of our algorithm in
maintaining low maximum load depends on the absence of dense subgraphs in this random graph.
We show that such dense subgraphs are absent when s < 3.35, giving an algorithm that works with
bucket size at most 2 and requires at most log log n + O(1) moves for inserts with high probability
(section 3). Note that the bound of 3.35 for s may not be tight, but it is provably no more than 3.72. We
then analyze the trade-off between the number of moves during inserts and the maximum bucket size using
the technique of witness trees [5] [16] [2], making significant adaptations to our problem (section
4).
3 Constant Maximum Bucket Size
In this section we show that for s < 3.35, by performing at most log log n + O(1) moves, we can
ensure that with high probability no bucket gets more than 2 items.
For an insert, we search backwards from a given node in BFS order, traversing directed edges in
reverse direction, looking for a node with load at most one. To simplify the analysis, we assume
that during the backward search, the algorithm visits only 2 children for each node even if more
may be present. We will show that by searching to a depth of log log n + O(1), with high probability,
we find a node with load at most one. First, we show that if the backward search is allowed to