124 L. WAN ET AL.
Step 3: Evenly assign the picked block pairs to k threads.
Step 4: For all P
i
where 1 6 i 6 k do in parallel
Thread P
i
performs the search routine of Horowitz and Sahni’s two-list algorithm.
2.3. The optimal parallel merging algorithm
This section introduces the optimal parallel merging algorithm without memory conflicts for the
exclusive read exclusive write parallel random access machine proposed by Akl et al. [24]. Here,
for the sake of consistency, we describe the merging algorithm with a slight modification. Given two
sorted vectors U D Œu
1
, u
2
, , u
m
and V D Œv
1
, v
2
, , v
m
, according to the following steps, U
and V can be merged without memory conflicts into a new vector of length 2m.
Step 1: Use k threads to divide U and V each into k subvectors ŒU
1
, U
2
, , U
k
and
ŒV
1
, V
2
, , V
k
in parallel,
j
U
i
j
C
j
V
i
j
D 2m=k,for1 6 i 6 k, and all elements in U
i
and V
i
are smaller than those of all elements in U
iC1
and V
iC1
,for1 6 i 6 k 1,where
1 6 k 6 2m and k is a power of 2.
Step 2: Use thread P
i
to merge U
i
and V
i
,for1 6 i 6 k.
Step 1 can be efficiently implemented using the selection algorithm presented in [24]. The
total time required for Step 1 is O.log k log 2m/; Step 2 requires each thread to merge at
most 2m=k elements; therefore, the total time complexity of this parallel merging algorithm is
O.2m=k C log k log 2m/.
3. THE PROPOSED GPU IMPLEMENTATION
In this section, firstly, we briefly introduce the CUDA architecture. Then, on the basis of the parallel
two-list algorithm of Li et al. [17], we describe how to effectively implement the three stages of the
algorithm on a GPU.
3.1. The CUDA architecture
This section gives a brief description of the CUDA architecture, which is based on the single
instruction multiple threads mode of programming, as shown in Figure 1.
Parallel code executed on the GPU (the so-called device) is interleaved with serial code executed
on the CPU (the so-called host). A thread block is a batch of threads that can cooperate by sharing
Figure 1. The CUDA architecture.
Copyright © 2014 John Wiley & Sons, Ltd. Concurrency Computat.: Pract. Exper. 2015; 27:119–145
DOI: 10.1002/cpe