D. M. Hughes, I. S. Lim, M. W. Jones, A. Knoll & B. Spencer / In-Kernel Stream Compaction 3
our compaction technique with the compaction method in
the Thrust library.
Nobari et al. [NLKB11] used the scan-scatter method of Horn
[Hor05] to accelerate the generation of random graphs from
databases. Hissoiny et al. [HOBD11] used stream compaction
to speed up Monte Carlo dosimetric computations for
radiotherapy; specifically, they compacted the computations
on photons that ran longer than others. Rather than leaving
threads idle, computation on each photon is limited to a
user-defined constant, after which the stream is compacted
to remove completed items.
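The iterate-then-compact pattern described above can be sketched on the host as follows (a minimal C sketch, not the authors' code; `max_steps` plays the role of the user-defined constant, and all names are illustrative):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative sketch: each item is advanced by at most max_steps per
 * round, then the stream is compacted to drop completed items. */

/* Advance each item by up to max_steps; an item is finished when it
 * reaches zero remaining work. */
static void advance(int *work, size_t n, int max_steps) {
    for (size_t i = 0; i < n; ++i)
        work[i] -= (work[i] < max_steps) ? work[i] : max_steps;
}

/* Compact: keep only items with remaining work, preserving order.
 * Returns the new stream length. */
static size_t compact(int *work, size_t n) {
    size_t m = 0;
    for (size_t i = 0; i < n; ++i)
        if (work[i] > 0) work[m++] = work[i];
    return m;
}

/* Run rounds of (advance, compact) until the stream is empty;
 * returns the number of rounds taken. */
static int simulate(int *work, size_t n, int max_steps) {
    int rounds = 0;
    while (n > 0) {
        advance(work, n, max_steps);
        n = compact(work, n);
        ++rounds;
    }
    return rounds;
}
```

On the GPU the same idea keeps warps occupied: finished photons are removed each round, so only active items are scheduled in the next round.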
Schwarz and Seidel [SS10] used compaction during voxelization
of surfaces and solids. The work is notable for employing a
multiple-kernel pipeline (to alleviate under-utilization of the
GPU), where compaction is used to ensure a good ordering of
triangles ready for further processing. Tang et al. [TMLT11]
employed in-kernel stream compaction. In their method a
fixed amount of space is reserved, and each block writes to
its own private part of the total array. A second pass then
compacts the private arrays; the required prefix sum can be
executed as part of the second kernel. van Antwerpen [vA11]
also incorporates compaction within the kernel, but does not
guarantee order preservation.
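The two-pass scheme can be illustrated with a sequential C sketch (an assumption-laden illustration, not the code of Tang et al.; `BLOCK_CAP` is a hypothetical per-block reservation and the non-zero predicate is chosen for illustration):

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_CAP 8   /* fixed space reserved per block (assumption) */

/* Pass 1: a block scans its inputs and appends the valid ones
 * (predicate: non-zero) to its private slice; returns the block's
 * valid count. */
static size_t pass1_block(const int *in, size_t n, int *priv) {
    size_t c = 0;
    for (size_t i = 0; i < n && c < BLOCK_CAP; ++i)
        if (in[i] != 0) priv[c++] = in[i];
    return c;
}

/* Pass 2: an exclusive prefix sum over per-block counts gives each
 * block's output offset; the private slices are then copied into a
 * contiguous result. Returns the total number of valid elements. */
static size_t pass2_compact(const int *priv, const size_t *counts,
                            size_t nblocks, int *out) {
    size_t off = 0;
    for (size_t b = 0; b < nblocks; ++b) {
        for (size_t i = 0; i < counts[b]; ++i)
            out[off + i] = priv[b * BLOCK_CAP + i];
        off += counts[b];
    }
    return off;
}
```

Because blocks write in order and slices are gathered in block order, this variant preserves the input ordering.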
Billeter et al. [BOA09] suggested an approach that makes
use of the popcount bit counter and masking on a bit-array
to reduce the workload by a factor of 32. However, they did
not completely implement this algorithm, and thus no results
were reported. In contrast, our InK-Compact method
makes use of new functionality that allows each thread in a
warp to know the predicates of all threads in the warp. More
importantly, our approach performs the compaction in the
same kernel that outputs the stream, i.e., we complete the
compaction before leaving the kernel. This ensures that no
memory needs to be written or cleared for invalid elements.
Finally, a novel use of new synchronization functions makes
InK-Compact a simple and optimized compaction method.
At the time of writing, we are unaware of any further developments
of Billeter et al.'s research, nor of their implementation
(Chag::PP) [BOA09].
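The bit-array idea can be sketched sequentially: packing predicates 32 per word lets a single popcount replace 32 additions in the prefix sum. The following is a minimal C illustration of the principle, not Billeter et al.'s algorithm; a portable popcount stands in for the GPU instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable popcount (clears the lowest set bit each iteration). */
static unsigned popcount32(uint32_t x) {
    unsigned c = 0;
    while (x) { x &= x - 1; ++c; }
    return c;
}

/* Exclusive prefix sum over per-word predicate counts: offsets[w] is
 * the number of valid elements in words 0..w-1, so N inputs need only
 * N/32 popcounts. Returns the total valid count. */
static unsigned word_offsets(const uint32_t *bits, size_t nwords,
                             unsigned *offsets) {
    unsigned sum = 0;
    for (size_t w = 0; w < nwords; ++w) {
        offsets[w] = sum;
        sum += popcount32(bits[w]);
    }
    return sum;
}
```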
For isosurface rendering, one approach is extraction via
marching cubes [LC87] and rasterization of the resulting
mesh. Wilhelms and Van Gelder [WVG92] employed a min-
max octree for skipping empty cells, accelerating the ex-
traction process. This approach was improved with sev-
eral extensions, including view-dependent culling [LH98].
Sramek [Sra94] demonstrated direct ray casting of isosur-
faces using a distance field to accelerate via ray jumping.
Parker et al. [PPL∗99] achieved interactive isosurface rendering
from large volume data using a parallel ray tracer on
a shared-memory supercomputer, employing a hierarchical
grid acceleration structure. Similar implementations exploit-
ing SIMD arithmetic and packet traversal achieved interac-
tive performance on single desktops and workstations, using
min-max kd-trees [WFM∗05] and octrees [KWH09]. On the
GPU, Hadwiger et al. [HSS∗05] employed a multi-pass rasterization
pipeline and an efficient secant solver for isosurface
ray casting. Hughes and Lim [HL09] employed an
optimized min-max kd-tree traversal in CUDA and achieved
real-time rendering rates. The work also raised the issue of
keeping the acceleration structure simple and relying more on
ray stepping and texture caching. Gobbetti et al. [GMIG08]
generate view- and isovalue-dependent cuts of an octree
out-of-core, then traverse the cut octree directly within a
single-pass GPU shader. They achieve interactive frame rates
for reduced gigavoxel data.
2. In-Kernel Stream Compaction
Modern GPGPU applications make use of compute languages
(for example, CUDA) that significantly simplify programming
for massively parallel systems. Code executes in
parallel within kernels. Each kernel is divided into blocks
of warps, where each block is automatically (and independently)
scheduled by the hardware to run on one of the
many multi-processor cores. A warp is defined as a group
of threads (typically 32) that operate at the same time on
the hardware, i.e., they are implicitly synchronized at each
instruction. For this work we assume each thread has
one input (e.g. a pixel, ray, or data element), performs an
action, and produces an output. We define a kernel K to have
B blocks, where each block has T threads.
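The index arithmetic implied by this model can be made concrete (a minimal C sketch under the stated assumptions of B blocks, T threads per block, and 32-thread warps; function names are illustrative):

```c
#include <assert.h>

#define WARP_SIZE 32

/* A thread's unique global index in [0, B*T), given its block index,
 * its thread index within the block, and the block size T. */
static int global_index(int block, int thread, int T) {
    return block * T + thread;
}

/* The warp a thread belongs to within its block, and its lane
 * (position) within that warp. */
static int warp_of(int thread) { return thread / WARP_SIZE; }
static int lane_of(int thread) { return thread % WARP_SIZE; }
```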
Stream compaction is the process of producing (in parallel)
an output array Y[0...M−1], after an operation on
inputs X[0...N−1], of which only M elements are valid. In
ray tracing, for example, there will be M valid rays which
hit geometry, and only these M valid rays will need to be
shaded. We typically define valid elements as those that pass
a predicate test. For each valid element, the main challenge
is determining its offset in the output array in relation to
the other valid elements. In other words, an offset into the
array is needed for each thread, equal to the number of
prior threads with a valid element.
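A sequential reference implementation makes the definition precise (a C sketch, with a non-zero predicate chosen for illustration):

```c
#include <assert.h>
#include <stddef.h>

/* Sequential reference for stream compaction: each valid input
 * (predicate: non-zero) is written at an offset equal to the number
 * of prior valid elements, preserving order. Returns M. */
static size_t compact_ref(const int *x, size_t n, int *y) {
    size_t m = 0;              /* = count of prior valid elements */
    for (size_t i = 0; i < n; ++i)
        if (x[i] != 0) y[m++] = x[i];
    return m;
}
```

The parallel problem is exactly to compute `m` for every element concurrently, without this serial dependency.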
Conceptually, our InK-Compact method consists of three
steps: computation of the thread offset t(u) within its warp,
the warp offset w(u) within its block, and the block offset
b(u) within its kernel. Our approach to the per-warp prefix is
the same as that discussed by Billeter et al. [BOA09] and Harris
[Hwu11]. Unlike Harris [Hwu11], however, we use bit-decomposition
and balloting to achieve the intra-warp scan, rather than
a shared-memory scan. Finally, our main original contribution
is computing the block offset through the use of block-sections,
while maintaining the input-output data ordering,
and without leaving the operating kernel.
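These three offsets can be emulated on the host as follows (a C sketch: the ballot result is modelled as a 32-bit predicate mask and a portable popcount stands in for the GPU popc instruction; on the device these would be warp-vote intrinsics):

```c
#include <assert.h>
#include <stdint.h>

/* Portable popcount standing in for the GPU popc instruction. */
static unsigned popc(uint32_t x) {
    unsigned c = 0;
    while (x) { x &= x - 1; ++c; }
    return c;
}

/* Thread offset within its warp: the number of valid lanes below
 * `lane` (assumed < 32), i.e. the popcount of the ballot mask
 * restricted to the lower lanes. */
static unsigned thread_offset(uint32_t ballot, unsigned lane) {
    uint32_t below = (lane == 0) ? 0u : (ballot & ((1u << lane) - 1u));
    return popc(below);
}

/* Warp offset within its block: exclusive prefix sum of the valid
 * counts of the warps preceding warp w. */
static unsigned warp_offset(const uint32_t *ballots, unsigned w) {
    unsigned sum = 0;
    for (unsigned i = 0; i < w; ++i) sum += popc(ballots[i]);
    return sum;
}

/* A valid element's final output index is then
 * block_offset + warp_offset + thread_offset, where the block offset
 * is obtained analogously from per-block totals. */
```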
2.1. Thread Offset
Within a block, currently 32 threads are grouped together to
form a warp. The threads in a warp run in lock-step with
one another, and special warp-vote functions are available to
submitted to COMPUTER GRAPHICS Forum (4/2013).