D. M. Hughes, I. S. Lim, M. W. Jones, A. Knoll & B. Spencer / In-Kernel Stream Compaction 3
our compaction technique with the compaction method in
the Thrust library.
Nobari et al. [NLKB11] used the scan-scatter method of Horn
[Hor05] to accelerate the generation of random graphs from
databases. Hissoiny et al. [HOBD11] used stream compaction
to speed up Monte Carlo dosimetric computations for
radiotherapy; specifically, they compacted the computations
on photons that ran longer than others. Rather than leaving
threads idle, computation on each photon is limited to a
user-defined constant, after which the stream is compacted
to remove completed items.
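The iterate-then-compact pattern described above can be sketched on the host as follows (a minimal C sketch, not the authors' code; `max_steps` plays the role of the user-defined constant, and all names are illustrative):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative sketch: each item is advanced by at most max_steps per
 * round, then the stream is compacted to drop completed items. */

/* Advance each item by up to max_steps; an item is finished when it
 * reaches zero remaining work. */
static void advance(int *work, size_t n, int max_steps) {
    for (size_t i = 0; i < n; ++i)
        work[i] -= (work[i] < max_steps) ? work[i] : max_steps;
}

/* Compact: keep only items with remaining work, preserving order.
 * Returns the new stream length. */
static size_t compact(int *work, size_t n) {
    size_t m = 0;
    for (size_t i = 0; i < n; ++i)
        if (work[i] > 0) work[m++] = work[i];
    return m;
}

/* Run rounds of (advance, compact) until the stream is empty;
 * returns the number of rounds taken. */
static int simulate(int *work, size_t n, int max_steps) {
    int rounds = 0;
    while (n > 0) {
        advance(work, n, max_steps);
        n = compact(work, n);
        ++rounds;
    }
    return rounds;
}
```

On the GPU the same idea keeps warps occupied: finished photons are removed each round, so only active items are scheduled in the next round.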
Schwarz and Seidel [SS10] used compaction during voxelization
of surfaces and solids. The work is notable for employing a
multiple-kernel pipeline (to alleviate under-utilization of the
GPU), where compaction is used to ensure a good ordering of
triangles ready for further processing. Tang et al. [TMLT11]
employed in-kernel stream compaction. In their method a
fixed amount of space is reserved, and each block writes to
its own private part of the total array. A second pass then
compacts the private arrays; the required prefix sum can be
executed as part of the second kernel. van Antwerpen [vA11]
also incorporates compaction within the kernel, but does not
guarantee order preservation.
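The two-pass scheme can be illustrated with a sequential C sketch (an assumption-laden illustration, not the code of Tang et al.; `BLOCK_CAP` is a hypothetical per-block reservation and the non-zero predicate is chosen for illustration):

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_CAP 8   /* fixed space reserved per block (assumption) */

/* Pass 1: a block scans its inputs and appends the valid ones
 * (predicate: non-zero) to its private slice; returns the block's
 * valid count. */
static size_t pass1_block(const int *in, size_t n, int *priv) {
    size_t c = 0;
    for (size_t i = 0; i < n && c < BLOCK_CAP; ++i)
        if (in[i] != 0) priv[c++] = in[i];
    return c;
}

/* Pass 2: an exclusive prefix sum over per-block counts gives each
 * block's output offset; the private slices are then copied into a
 * contiguous result. Returns the total number of valid elements. */
static size_t pass2_compact(const int *priv, const size_t *counts,
                            size_t nblocks, int *out) {
    size_t off = 0;
    for (size_t b = 0; b < nblocks; ++b) {
        for (size_t i = 0; i < counts[b]; ++i)
            out[off + i] = priv[b * BLOCK_CAP + i];
        off += counts[b];
    }
    return off;
}
```

Because blocks write in order and slices are gathered in block order, this variant preserves the input ordering.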
Billeter et al. [BOA09] suggested an approach that makes
use of the popcount bit counter and masking on a bit-array
to reduce the workload by a factor of 32. However, they did
not completely implement this algorithm, and thus no results
were reported. In contrast, our InK-Compact method
makes use of new functionality that allows each thread in a
warp to know the predicates of all threads in the warp. More
importantly, our approach performs the compaction in the
same kernel that outputs the stream, i.e., we complete the
compaction before leaving the kernel. This ensures that no
memory needs to be written or cleared for invalid elements.
Finally, a novel use of new synchronization functions makes
InK-Compact a simple and optimized compaction method.
At the time of writing, we are unaware of any further developments
of Billeter et al.'s research, nor of their implementation
(Chag::PP) [BOA09].
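The bit-array idea can be sketched sequentially: packing predicates 32 per word lets a single popcount replace 32 additions in the prefix sum. The following is a minimal C illustration of the principle, not Billeter et al.'s algorithm; a portable popcount stands in for the GPU instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable popcount (clears the lowest set bit each iteration). */
static unsigned popcount32(uint32_t x) {
    unsigned c = 0;
    while (x) { x &= x - 1; ++c; }
    return c;
}

/* Exclusive prefix sum over per-word predicate counts: offsets[w] is
 * the number of valid elements in words 0..w-1, so N inputs need only
 * N/32 popcounts. Returns the total valid count. */
static unsigned word_offsets(const uint32_t *bits, size_t nwords,
                             unsigned *offsets) {
    unsigned sum = 0;
    for (size_t w = 0; w < nwords; ++w) {
        offsets[w] = sum;
        sum += popcount32(bits[w]);
    }
    return sum;
}
```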
For isosurface rendering, one approach is extraction via
marching cubes [LC87] and rasterization of the resulting
mesh. Wilhelms and Van Gelder [WVG92] employed a min-
max octree for skipping empty cells, accelerating the ex-
traction process. This approach was improved with sev-
eral extensions, including view-dependent culling [LH98].
Sramek [Sra94] demonstrated direct ray casting of isosur-
faces using a distance field to accelerate via ray jumping.
Parker et al. [PPL∗99] achieved interactive isosurface rendering
from large volume data using a parallel ray tracer on
a shared-memory supercomputer, employing a hierarchical
grid acceleration structure. Similar implementations exploit-
ing SIMD arithmetic and packet traversal achieved interac-
tive performance on single desktops and workstations, using
min-max kd-trees [WFM∗05] and octrees [KWH09]. On the
GPU, Hadwiger et al. [HSS∗05] employed a multi-pass rasterization
pipeline and an efficient secant solver for isosurface
ray casting. Hughes and Lim [HL09] employed an
optimized min-max kd-tree traversal in CUDA and achieved
real-time rendering rates. The work also raised the issue of
keeping the acceleration structure simple and relying more on
ray stepping and texture caching. Gobbetti et al. [GMIG08]
generate view- and isovalue-dependent cuts of an octree
out-of-core, then traverse the cut octree directly within a
single-pass GPU shader. They achieve interactive frame rates
for reduced gigavoxel data.
2. In-Kernel Stream Compaction
Modern GPGPU applications make use of compute languages
(for example, CUDA) that significantly simplify programming
for massively parallel systems. Code executes in
parallel within kernels. Each kernel is divided into blocks
of warps, where each block is automatically (and independently)
scheduled by the hardware to run on one of the
many multi-processor cores. A warp is defined as a group
of threads (typically 32) that operate at the same time on
the hardware, i.e., they are implicitly synchronized at each
instruction. For this work we assume each thread has
one input (e.g. a pixel, ray, or data element), performs an
action, and produces an output. We define a kernel K to have
B blocks, where each block has T threads.
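The index arithmetic implied by this model can be made concrete (a minimal C sketch under the stated assumptions of B blocks, T threads per block, and 32-thread warps; function names are illustrative):

```c
#include <assert.h>

#define WARP_SIZE 32

/* A thread's unique global index in [0, B*T), given its block index,
 * its thread index within the block, and the block size T. */
static int global_index(int block, int thread, int T) {
    return block * T + thread;
}

/* The warp a thread belongs to within its block, and its lane
 * (position) within that warp. */
static int warp_of(int thread) { return thread / WARP_SIZE; }
static int lane_of(int thread) { return thread % WARP_SIZE; }
```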
Stream compaction is the process of producing (in parallel)
an output array Y[0...M−1], after an operation on
inputs X[0...N−1], of which only M elements are valid. In
ray tracing, for example, there will be M valid rays which
hit geometry, and only these M valid rays will need to be
shaded. We typically define valid elements as those that pass
a predicate test. For each valid element, the main challenge
is determining its offset in the output array in relation to
the other valid elements. In other words, an offset into the
array is needed for each thread, equal to the number of
prior threads with a valid element.
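A sequential reference implementation makes the definition precise (a C sketch, with a non-zero predicate chosen for illustration):

```c
#include <assert.h>
#include <stddef.h>

/* Sequential reference for stream compaction: each valid input
 * (predicate: non-zero) is written at an offset equal to the number
 * of prior valid elements, preserving order. Returns M. */
static size_t compact_ref(const int *x, size_t n, int *y) {
    size_t m = 0;              /* = count of prior valid elements */
    for (size_t i = 0; i < n; ++i)
        if (x[i] != 0) y[m++] = x[i];
    return m;
}
```

The parallel problem is exactly to compute `m` for every element concurrently, without this serial dependency.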
Conceptually, our InK-Compact method consists of three
steps: computation of the thread offset t(u) within its warp,
the warp offset w(u) within its block, and the block offset
b(u) within its kernel. Our approach to the per-warp prefix is
the same as that discussed by Billeter et al. [BOA09] and Harris
[Hwu11]. Unlike Harris [Hwu11], however, we use bit-decomposition
and balloting to achieve the intra-warp scan, rather than
a shared-memory scan. Finally, our main original contribution
is computing the block offset through the use of block-sections,
while maintaining the input-output data ordering,
and without leaving the operating kernel.
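These three offsets can be emulated on the host as follows (a C sketch: the ballot result is modelled as a 32-bit predicate mask and a portable popcount stands in for the GPU popc instruction; on the device these would be warp-vote intrinsics):

```c
#include <assert.h>
#include <stdint.h>

/* Portable popcount standing in for the GPU popc instruction. */
static unsigned popc(uint32_t x) {
    unsigned c = 0;
    while (x) { x &= x - 1; ++c; }
    return c;
}

/* Thread offset within its warp: the number of valid lanes below
 * `lane` (assumed < 32), i.e. the popcount of the ballot mask
 * restricted to the lower lanes. */
static unsigned thread_offset(uint32_t ballot, unsigned lane) {
    uint32_t below = (lane == 0) ? 0u : (ballot & ((1u << lane) - 1u));
    return popc(below);
}

/* Warp offset within its block: exclusive prefix sum of the valid
 * counts of the warps preceding warp w. */
static unsigned warp_offset(const uint32_t *ballots, unsigned w) {
    unsigned sum = 0;
    for (unsigned i = 0; i < w; ++i) sum += popc(ballots[i]);
    return sum;
}

/* A valid element's final output index is then
 * block_offset + warp_offset + thread_offset, where the block offset
 * is obtained analogously from per-block totals. */
```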
2.1. Thread Offset
Within a block, currently 32 threads are grouped together to
form a warp. The threads in a warp run in lock-step with
one another, and special warp-vote functions are available to
submitted to COMPUTER GRAPHICS Forum (4/2013).