Real-Time Parallel Hashing on the GPU
Dan A. Alcantara   Andrei Sharf   Fatemeh Abbasinejad   Shubhabrata Sengupta   John D. Owens   Nina Amenta
University of California, Davis
Michael Mitzenmacher
Harvard University
Figure 1: Overview of our construction for a voxelized Lucy model, colored by mapping x, y, and z coordinates to red, green, and blue
respectively (far left). The 3.5 million voxels (left) are input as 32-bit keys and placed into buckets of ≤ 512 items, averaging 409 each
(center). Each bucket then builds a cuckoo hash with three sub-tables and stores them in a larger structure with 5 million entries (right).
Close-ups follow the progress of a single bucket, showing the keys allocated to it (center; the bucket is linear and wraps around left to right)
and each of its completed cuckoo sub-tables (right). Finding any key requires checking only three possible locations.
Abstract
We demonstrate an efficient data-parallel algorithm for building
large hash tables of millions of elements in real-time. We consider
two parallel algorithms for the construction: a classical sparse per-
fect hashing approach, and cuckoo hashing, which packs elements
densely by allowing an element to be stored in one of multiple pos-
sible locations. Our construction is a hybrid approach that uses both
algorithms. We measure the construction time, access time, and
memory usage of our implementations and demonstrate real-time
performance on large datasets: for 5 million key-value pairs, we
construct a hash table in 35.7 ms using 1.42 times as much mem-
ory as the input data itself, and we can access all the elements in
that hash table in 15.3 ms. For comparison, sorting the same data
requires 36.6 ms, but accessing all the elements via binary search
requires 79.5 ms. Furthermore, we show how our hashing methods
can be applied to two graphics applications: 3D surface intersection
for moving data and geometric hashing for image matching.
Keywords: GPU computing, hash tables, cuckoo hashing, parallel
hash tables, parallel data structures
1 Introduction
The advent of programmable graphics hardware allows highly par-
allel graphics processors (GPUs) to compute and use data repre-
sentations that diverge from the traditional list of triangles. For
instance, researchers have recently demonstrated efficient parallel
constructions for hierarchical spatial data structures such as k-d
trees [Zhou et al. 2008b] and octrees [DeCoro and Tatarchuk 2007;
Sun et al. 2008; Zhou et al. 2008a]. In general, the problem of defining
parallel-friendly data structures that can be efficiently created,
updated, and accessed is a significant research challenge [Lefohn
et al. 2006]. The toolbox of efficient data structures and their as-
sociated algorithms on scalar architectures like the CPU remains
significantly larger than on parallel architectures like the GPU.
In this paper we concentrate on the problem of implementing a
parallel-friendly data structure that allows efficient random access
of millions of elements and can be both constructed and accessed at
interactive rates. Such a data structure has numerous applications
in computer graphics, centered on applications that need to store
a sparse set of items in a dense representation. On the CPU, the
most common data structure for such a task is a hash table. How-
ever, the usual serial algorithms for building and accessing hash
tables—such as chaining, in which collisions are resolved by stor-
ing a linked list of items per bucket—do not translate naturally to
the highly parallel environment of the GPU, for three reasons:
Synchronization Algorithms for populating a traditional hash ta-
ble tend to involve sequential operations. Chaining, for in-
stance, requires multiple items to be added to each linked list,
which would require serialization of access to the list structure
on the GPU.
Variable work per access The number of probes required to look
up an item in typical sequential hash tables varies per query,
e.g., chaining requires traversing the linked lists, which vary
in length. This would lead to inefficiency on the GPU, where
the SIMD cores force all threads to wait for the worst-case
number of probes.
Sparse storage A hash table by nature exhibits little locality in ei-
ther construction or access, so caching and computational hi-
erarchies have little ability to improve performance.
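The contrast between the second point above and the fixed-probe lookup of Figure 1 can be sketched in a few lines. The following is an illustrative CPU sketch, not the paper's GPU implementation: it compares the data-dependent probe count of chaining with the fixed three probes of a cuckoo hash split across three sub-tables. The hash functions, table size, and keys are hypothetical choices made for the example.

```python
# Hypothetical sketch: chaining (variable probes) vs. three-sub-table
# cuckoo hashing (always exactly three probes). Not the paper's GPU code.

SIZE = 11  # one small prime-sized table per structure, illustration only

def slots(key):
    # Three simple (hypothetical) hash functions, one per cuckoo sub-table.
    return [(key * c) % SIZE for c in (1, 7, 13)]

def chained_lookup(buckets, key):
    # Chaining: traverse a per-bucket list; the probe count varies per
    # query, so SIMD threads must wait for the worst case in their group.
    probes = 0
    for item in buckets[key % SIZE]:
        probes += 1
        if item == key:
            return True, probes
    return False, probes

def cuckoo_insert(tables, key, max_evictions=100):
    # Classic cuckoo insertion: evict a resident key when all slots are full.
    for _ in range(max_evictions):
        for t, s in zip(tables, slots(key)):
            if t[s] is None:
                t[s] = key
                return True
        s0 = slots(key)[0]
        key, tables[0][s0] = tables[0][s0], key  # evict and retry
    return False  # in practice: rebuild with new hash functions

def cuckoo_lookup(tables, key):
    # Exactly three probes, one per sub-table: uniform work per thread.
    return any(t[s] == key for t, s in zip(tables, slots(key)))

keys = [5, 23, 42, 8, 99]
buckets = [[] for _ in range(SIZE)]
for k in keys:
    buckets[k % SIZE].append(k)
tables = [[None] * SIZE for _ in range(3)]
assert all(cuckoo_insert(tables, k) for k in keys)
assert cuckoo_lookup(tables, 42) and not cuckoo_lookup(tables, 7)
```

The key property is that `cuckoo_lookup` performs the same amount of work for every query, present or absent, which is what makes the scheme attractive for SIMD execution.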
While common sequential hash table constructions such as chain-
ing have expected constant look-up time, the lookup time for some
item in the table is Ω(lg lg n) with high probability. The influen-
tial work of Lefebvre and Hoppe [2006], among the first to use the
GPU to access a hash table, addressed the issue of variable lookup
time by using a perfect hash table. In this paper we define a perfect
hash table to be one in which an item can be accessed in worst-