Real-Time Parallel Hashing on the GPU
Dan A. Alcantara   Andrei Sharf   Fatemeh Abbasinejad   Shubhabrata Sengupta   John D. Owens   Nina Amenta
University of California, Davis
Michael Mitzenmacher
Harvard University
Figure 1: Overview of our construction for a voxelized Lucy model, colored by mapping x, y, and z coordinates to red, green, and blue
respectively (far left). The 3.5 million voxels (left) are input as 32-bit keys and placed into buckets of ≤ 512 items, averaging 409 each
(center). Each bucket then builds a cuckoo hash with three sub-tables and stores them in a larger structure with 5 million entries (right).
Close-ups follow the progress of a single bucket, showing the keys allocated to it (center; the bucket is linear and wraps around left to right)
and each of its completed cuckoo sub-tables (right). Finding any key requires checking only three possible locations.
Abstract
We demonstrate an efficient data-parallel algorithm for building
large hash tables of millions of elements in real-time. We consider
two parallel algorithms for the construction: a classical sparse per-
fect hashing approach, and cuckoo hashing, which packs elements
densely by allowing an element to be stored in one of multiple pos-
sible locations. Our construction is a hybrid approach that uses both
algorithms. We measure the construction time, access time, and
memory usage of our implementations and demonstrate real-time
performance on large datasets: for 5 million key-value pairs, we
construct a hash table in 35.7 ms using 1.42 times as much mem-
ory as the input data itself, and we can access all the elements in
that hash table in 15.3 ms. For comparison, sorting the same data
requires 36.6 ms, but accessing all the elements via binary search
requires 79.5 ms. Furthermore, we show how our hashing methods
can be applied to two graphics applications: 3D surface intersection
for moving data and geometric hashing for image matching.
Keywords: GPU computing, hash tables, cuckoo hashing, parallel
hash tables, parallel data structures
1 Introduction
The advent of programmable graphics hardware allows highly par-
allel graphics processors (GPUs) to compute and use data repre-
sentations that diverge from the traditional list of triangles. For
instance, researchers have recently demonstrated efficient parallel
constructions for hierarchical spatial data structures such as k-d
trees [Zhou et al. 2008b] and octrees [DeCoro and Tatarchuk 2007;
Sun et al. 2008; Zhou et al. 2008a]. In general, the problem of defining
parallel-friendly data structures that can be efficiently created,
updated, and accessed is a significant research challenge [Lefohn
et al. 2006]. The toolbox of efficient data structures and their as-
sociated algorithms on scalar architectures like the CPU remains
significantly larger than on parallel architectures like the GPU.
In this paper we concentrate on the problem of implementing a
parallel-friendly data structure that allows efficient random access
of millions of elements and can be both constructed and accessed at
interactive rates. Such a data structure has numerous applications
in computer graphics, centered on applications that need to store
a sparse set of items in a dense representation. On the CPU, the
most common data structure for such a task is a hash table. How-
ever, the usual serial algorithms for building and accessing hash
tables—such as chaining, in which collisions are resolved by stor-
ing a linked list of items per bucket—do not translate naturally to
the highly parallel environment of the GPU, for three reasons:
Synchronization Algorithms for populating a traditional hash ta-
ble tend to involve sequential operations. Chaining, for in-
stance, requires multiple items to be added to each linked list,
which would require serialization of access to the list structure
on the GPU.
Variable work per access The number of probes required to look
up an item in typical sequential hash tables varies per query,
e.g., chaining requires traversing the linked lists, which vary
in length. This would lead to inefficiency on the GPU, where
the SIMD cores force all threads to wait for the worst-case
number of probes.
Sparse storage A hash table by nature exhibits little locality in ei-
ther construction or access, so caching and computational hi-
erarchies have little ability to improve performance.
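The contrast between the second point above and the fixed-probe lookup of Figure 1 can be sketched in a few lines. The following is an illustrative CPU sketch, not the paper's GPU implementation: it compares the data-dependent probe count of chaining with the fixed three probes of a cuckoo hash split across three sub-tables. The hash functions, table size, and keys are hypothetical choices made for the example.

```python
# Hypothetical sketch: chaining (variable probes) vs. three-sub-table
# cuckoo hashing (always exactly three probes). Not the paper's GPU code.

SIZE = 11  # one small prime-sized table per structure, illustration only

def slots(key):
    # Three simple (hypothetical) hash functions, one per cuckoo sub-table.
    return [(key * c) % SIZE for c in (1, 7, 13)]

def chained_lookup(buckets, key):
    # Chaining: traverse a per-bucket list; the probe count varies per
    # query, so SIMD threads must wait for the worst case in their group.
    probes = 0
    for item in buckets[key % SIZE]:
        probes += 1
        if item == key:
            return True, probes
    return False, probes

def cuckoo_insert(tables, key, max_evictions=100):
    # Classic cuckoo insertion: evict a resident key when all slots are full.
    for _ in range(max_evictions):
        for t, s in zip(tables, slots(key)):
            if t[s] is None:
                t[s] = key
                return True
        s0 = slots(key)[0]
        key, tables[0][s0] = tables[0][s0], key  # evict and retry
    return False  # in practice: rebuild with new hash functions

def cuckoo_lookup(tables, key):
    # Exactly three probes, one per sub-table: uniform work per thread.
    return any(t[s] == key for t, s in zip(tables, slots(key)))

keys = [5, 23, 42, 8, 99]
buckets = [[] for _ in range(SIZE)]
for k in keys:
    buckets[k % SIZE].append(k)
tables = [[None] * SIZE for _ in range(3)]
assert all(cuckoo_insert(tables, k) for k in keys)
assert cuckoo_lookup(tables, 42) and not cuckoo_lookup(tables, 7)
```

The key property is that `cuckoo_lookup` performs the same amount of work for every query, present or absent, which is what makes the scheme attractive for SIMD execution.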
While common sequential hash table constructions such as chain-
ing have expected constant look-up time, the lookup time for some
item in the table is Ω(lg lg n) with high probability. The influen-
tial work of Lefebvre and Hoppe [2006], among the first to use the
GPU to access a hash table, addressed the issue of variable lookup
time by using a perfect hash table. In this paper we define a perfect
hash table to be one in which an item can be accessed in worst-