Onesweep: A Faster Least-Significant-Digit Radix Sort for GPUs
CUB library has increased from r = 16 (eight digit-binning
iterations) to r = 128 (five digit-binning iterations) for large
32-bit sorting problems [12].
In contrast, comparison-based sorting methods have
super-linear O(nlogn) asymptotic work complexity.
Furthermore, practicable strategies for reducing the number
passes through global memory have been elusive. For
example, state-of-the-art GPU merge sorting
implementations perform binary (radix-2) merging [5,8]. As
such, a merge sort of 16M 32-bit keys will require ~10
passes through global memory (i.e., one pass to bootstrap
32K sublists followed by nine binary merging passes),
whereas our Onesweep LSD radix sort only requires five
(i.e., one histogram pass followed by four binning
iterations).
Prefix sum. An exclusive prefix sum across a list of
numbers produces a corresponding list in which the i
th
output is the sum of the first i-1 inputs. It is a useful
construct for allocation-like behavior in parallel settings
(partitioning, binning, compaction, queuing, etc.), where the
offset for the i
th
thread to write its output is the sum of the
number of output items being produced by the prior i-1
threads.
Prefix sum plays two roles in digit binning. The first is
determining the starting location of each bin in the output
buffer such that no space is wasted. For example, an
exclusive prefix sum across the radix r = 8 histogram
<8,6,7,5,3,0,9,2> of digit counts produces the
corresponding list of offsets <0,8,14,21,26,29,29,38>
indicating the locations of the 0s, 1s, 2s, etc. bins in the
output buffer.
Prefix sum is also used to compute the bin-relative
offsets for scattering keys into their destination bins.
Specifically, a prefix sum across a list of binary flags
indicating which keys contain a given digit will produce a
corresponding list of scatter offsets within that digit’s output
bin. Given the list of keys <17,8,24,5>, for example, the
flag list <0,1,1,0> corresponds to the presence of a 0s digit
in the least-significant 3-bit digit-place. A prefix sum of
these flags produces the scatter offsets <0,0,1,2> for
relocating the flagged keys relative to the start of the 0s bin.
Thus, each digit-binning iteration entails computing r prefix
flag sums, one for each bin.
Figure 1. Three-kernel reduce-then-scan parallelization among G thread blocks (~3n global data movement) [12]
Figure 2. Single-pass adaptive look-back prefix scan among G thread blocks (~2n global data movement) [12]
reduce
reduce
x
0
x
b-1
x
b
…
B
0
…
reduce reduce
reduce
…
reduce
…
B
1
reduce reduce
reduce
…
reduce
…
B
2
reduce reduce
reduce
…
reduce
…
B
G-1
reduce reduce
reduce
…
reduce
…
B
1
…
scan
…
…
reduce
…
B
2
…
scan scan
scan
…
…
reduce
x
0
x
b-1
x
b
…
B
0
…
scan scan
scan
…
…
y
0
y
b-1
y
b
B
0
reduce
…
B
G-1
…
scan scan
scan
…
…
scan
…
upsweep
pass
downsweep
pass
root scan
scan scan