GPU-Accelerated Parallel Two-List Algorithm: An Efficient Solution to the Subset-Sum Problem


124 L. WAN ET AL.

Step 3: Evenly assign the picked block pairs to k threads.

Step 4: For all $P_i$, where $1 \le i \le k$, do in parallel:

Thread $P_i$ performs the search routine of Horowitz and Sahni's two-list algorithm.
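The search routine named in Step 4 can be illustrated with a minimal sequential sketch. The excerpt does not show how block pairs are generated or picked, so the code below runs the classic two-pointer search over two whole lists rather than over one thread's assigned blocks. The function names `subset_sums` and `two_list_search` and the sample weights are hypothetical, chosen only for this illustration.

```python
def subset_sums(items):
    """Return the sums of all subsets of items (2**len(items) values)."""
    sums = [0]
    for w in items:
        sums += [s + w for s in sums]
    return sums


def two_list_search(a_sums, b_sums, target):
    """Two-pointer search in the style of Horowitz and Sahni.

    a_sums must be sorted ascending and b_sums sorted descending.
    Returns a pair (x, y) with x + y == target, or None if no pair exists.
    """
    i, j = 0, 0
    while i < len(a_sums) and j < len(b_sums):
        s = a_sums[i] + b_sums[j]
        if s == target:
            return a_sums[i], b_sums[j]
        if s < target:
            i += 1  # sum too small: advance to a larger element of A
        else:
            j += 1  # sum too large: advance to a smaller element of B
    return None


# Hypothetical instance: split the weights into two halves, one list per half.
weights = [3, 34, 4, 12, 5, 2]
half = len(weights) // 2
list_a = sorted(subset_sums(weights[:half]))
list_b = sorted(subset_sums(weights[half:]), reverse=True)
result = two_list_search(list_a, list_b, 9)
```

In a parallel version along the lines of Steps 3 and 4, each thread would presumably run the same loop restricted to its assigned block pairs; that partitioning is not reproduced here.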

2.3. The optimal parallel merging algorithm

This section introduces the optimal parallel merging algorithm without memory conflicts for the exclusive-read exclusive-write (EREW) parallel random access machine (PRAM) proposed by Akl et al. [24]. Here, for the sake of consistency, we describe the merging algorithm with a slight modification. Given two sorted vectors $U = [u_1, u_2, \ldots, u_m]$ and $V = [v_1, v_2, \ldots, v_m]$, the following steps merge U and V without memory conflicts into a new vector of length 2m.

Step 1: Use k threads to divide U and V each into k subvectors $[U_1, U_2, \ldots, U_k]$ and $[V_1, V_2, \ldots, V_k]$ in parallel, where $|U_i| + |V_i| = 2m/k$ for $1 \le i \le k$, and all elements in $U_i$ and $V_i$ are smaller than all elements in $U_{i+1}$ and $V_{i+1}$ for $1 \le i \le k-1$. Here $1 \le k \le 2m$ and k is a power of 2.

Step 2: Use thread $P_i$ to merge $U_i$ and $V_i$, for $1 \le i \le k$.
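The two steps above can be sketched in Python as follows. This is not the selection algorithm of [24]: as a stand-in, the partition in Step 1 is computed with a co-ranking binary search (the "merge path" idea), which also yields pairs whose combined size is the same for every thread. The names `co_rank` and `parallel_merge` are hypothetical. Python threads do not run CPU-bound work in parallel, so the sketch shows only the decomposition, not a speed-up.

```python
from concurrent.futures import ThreadPoolExecutor
from heapq import merge  # sequential merge of sorted inputs


def co_rank(r, u, v):
    """Return i such that u[:i] and v[:r - i] together form the first r
    elements of the merged output (ties place v's element first)."""
    lo = max(0, r - len(v))
    hi = min(r, len(u))
    while lo < hi:
        i = (lo + hi) // 2
        j = r - i
        if u[i] < v[j - 1]:
            lo = i + 1  # u[i] precedes v[j-1]: the prefix needs more of u
        else:
            hi = i
    return lo


def parallel_merge(u, v, k):
    """Merge sorted lists u and v using k independent sub-merges."""
    total = len(u) + len(v)
    # Output ranks at which the merged vector is cut into k chunks.
    cuts = [(t * total) // k for t in range(k + 1)]
    with ThreadPoolExecutor(max_workers=k) as pool:
        # Step 1: locate the boundaries of the subvector pairs (U_i, V_i).
        splits = list(pool.map(lambda r: co_rank(r, u, v), cuts))

        def merge_pair(t):
            i0, i1 = splits[t], splits[t + 1]
            j0, j1 = cuts[t] - i0, cuts[t + 1] - i1
            # Step 2: each worker merges only its own pair.
            return list(merge(u[i0:i1], v[j0:j1]))

        chunks = list(pool.map(merge_pair, range(k)))
    return [x for chunk in chunks for x in chunk]


merged = parallel_merge([1, 3, 5, 7, 9, 11, 13, 15],
                        [2, 4, 6, 8, 10, 12, 14, 16], 4)
```

Because every worker writes a disjoint slice of the output and reads a disjoint slice of each input, the sub-merges need no synchronization, which is the property the EREW model requires.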

Step 1 can be efficiently implemented using the selection algorithm presented in [24]. The total time required for Step 1 is $O(\log k \times \log 2m)$; Step 2 requires each thread to merge at most $2m/k$ elements; therefore, the total time complexity of this parallel merging algorithm is $O(2m/k + \log k \times \log 2m)$.
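As a rough illustration of how the two terms of this bound trade off, the following worked arithmetic assumes base-2 logarithms and ignores constant factors. Both are assumptions made here, and the sizes $m = 2^{20}$ and $k = 2^{10}$ are hypothetical.

```latex
\[
\frac{2m}{k} = \frac{2^{21}}{2^{10}} = 2048,
\qquad
\log_2 k \times \log_2 2m = 10 \times 21 = 210 .
\]
% At the other extreme, k = 2m = 2^{21}:
\[
\frac{2m}{k} = 1,
\qquad
\log_2 k \times \log_2 2m = 21 \times 21 = 441 .
\]
```

Under these assumptions the per-thread merge dominates for moderate k, while the partitioning term dominates once k approaches 2m.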

3. THE PROPOSED GPU IMPLEMENTATION

In this section, we first briefly introduce the CUDA architecture. Then, on the basis of the parallel two-list algorithm of Li et al. [17], we describe how to effectively implement the three stages of the algorithm on a GPU.

3.1. The CUDA architecture

This section gives a brief description of the CUDA architecture, which is based on the single-instruction, multiple-thread (SIMT) programming model, as shown in Figure 1.

Parallel code executed on the GPU (the so-called device) is interleaved with serial code executed

on the CPU (the so-called host). A thread block is a batch of threads that can cooperate by sharing

Figure 1. The CUDA architecture.

DOI: 10.1002/cpe
