Improving Direction-Optimizing Breadth-First Search Performance on CPU-GPU Heterogeneous Platforms


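This article summarizes a paper on optimizing direction-optimizing Breadth-First Search (BFS) for CPU-GPU heterogeneous platforms. As the parallel computing power of graphics processing units (GPUs) has grown in recent years, many graph traversal and analysis algorithms have sought to exploit it. The authors are Dan Zou, Yong Dou and Qiang Wang of the National Laboratory for Parallel and Distributed Processing at the National University of Defense Technology, together with Jinbo Xu and Baofeng Li of the university's College of Computer.

A BFS that runs every level with one fixed implementation cannot make full use of a multi-core CPU and a GPU working together. To overcome this limitation, the paper dynamically chooses, for each BFS level, the best of three implementations:

1. Sequential top-down execution on the CPU: when the vertex frontier is small, sequential execution on the CPU remains efficient.
2. Parallel top-down execution on the CPU: for larger vertex frontiers, processing is spread across multiple cores.
3. Cooperative bottom-up execution on the CPU and GPU: the bottom-up algorithm, which the paper targets at large vertex frontiers, is divided between the two processors.

By adapting to the runtime variability of vertex frontiers, this hybrid approach provides the best performance for each BFS level while avoiding poor worst-case performance. Compared with the highest published performance for shared memory systems, the implementation achieves speedups of 1.37x to 1.44x.

Beyond speeding up BFS, the work shows how an algorithm can be tuned dynamically on a heterogeneous platform, which is useful for processing large-scale graph data. The approach may also inform the optimization of other memory-bound irregular algorithms, and it offers practical guidance for CPU-GPU cooperative programming and the parallelization of graph algorithms.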

Direction-Optimizing Breadth-First Search on CPU-GPU Heterogeneous Platforms

Dan Zou, Yong Dou, Qiang Wang

National Laboratory for Parallel and Distributed Processing

National University of Defense Technology

Changsha, China

Email: zoudan@nudt.edu.cn

Jinbo Xu, Baofeng Li

College of Computer

National University of Defense Technology

Changsha, China

Abstract—Breadth-First Search (BFS) is a basis for many graph traversal and analysis algorithms. In this paper, we present a direction-optimizing BFS implementation on CPU-GPU heterogeneous platforms that fully exploits the computing power of both the multi-core CPU and the GPU. For each level of the BFS algorithm, we dynamically choose the best implementation from: a sequential top-down execution on the CPU, a parallel top-down execution on the CPU, and a cooperative bottom-up execution on the CPU and GPU. By adapting to the runtime variability of vertex frontiers, this hybrid approach provides the best performance for the exploration of each BFS level while avoiding poor worst-case performance. Our implementation demonstrates speedups of 1.37x to 1.44x compared to the highest published performance for shared memory systems.
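The per-level selection described in the abstract can be pictured as a small dispatcher. The Python sketch below is illustrative only: this excerpt does not state the paper's switching criteria, so the function name `choose_strategy` and the two threshold constants are hypothetical, and the rule "small frontier means sequential, large frontier means bottom-up" is an assumption drawn from the surrounding description.

```python
from enum import Enum


class Strategy(Enum):
    """The three per-level implementations named in the abstract."""
    SEQ_TOP_DOWN = "sequential top-down on CPU"
    PAR_TOP_DOWN = "parallel top-down on CPU"
    COOP_BOTTOM_UP = "cooperative bottom-up on CPU and GPU"


# Hypothetical thresholds (fractions of the vertex count); not from the paper.
SEQ_MAX_FRACTION = 0.001
PAR_MAX_FRACTION = 0.05


def choose_strategy(frontier_size: int, num_vertices: int) -> Strategy:
    """Pick an implementation for one BFS level from the frontier size."""
    if num_vertices <= 0:
        raise ValueError("num_vertices must be positive")
    fraction = frontier_size / num_vertices
    if fraction <= SEQ_MAX_FRACTION:
        # Tiny frontier: thread start-up and synchronization cost more
        # than the work itself, so stay sequential.
        return Strategy.SEQ_TOP_DOWN
    if fraction <= PAR_MAX_FRACTION:
        # Medium frontier: enough work to occupy several CPU cores.
        return Strategy.PAR_TOP_DOWN
    # Large frontier: most unvisited vertices have a parent in the
    # frontier, which is the case the bottom-up direction targets.
    return Strategy.COOP_BOTTOM_UP
```

A real implementation would call such a function once per level, before exploring the frontier, and would tune the thresholds for the target machine and graph.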

Keywords—Breadth-First Search; heterogeneous platform; GPU

I. INTRODUCTION

In recent years, heterogeneous parallel architectures have become an important trend in high-performance scientific computing. Such an architecture usually consists of two types of processing elements: the multi-core CPU and the algorithm accelerator. The accelerator participates in the computation as a coprocessor under the control of the CPU.

Compared with the multi-core CPU, an algorithm accelerator such as the GPU has superior computing power and memory bandwidth. The GPU can execute many more threads than the CPU and has succeeded in accelerating a large number of scientific and engineering computing algorithms. The GPU is well suited to compute-intensive regular algorithms but is suboptimal for data-intensive irregular algorithms: substantial scattered data accesses and execution path divergence serialize the GPU threads, degrading the actual performance.

Breadth-First Search (BFS) is a typical data-intensive irregular algorithm. As the basis for many graph traversal and analysis algorithms, BFS is widely used to evaluate how well computing systems handle data-intensive applications, and it has become the core algorithm of many benchmarks, such as Parboil [1], Rodinia [2] and Graph500 [3]. BFS has data-driven computations dictated by the irregularity of the graph, leading to fine-grained random memory accesses with poor spatial and temporal locality. In addition, BFS explores the structure of the graph while performing relatively little computation, so its execution time is dominated by memory access time. Therefore, BFS usually obtains suboptimal performance on conventional cache-based processors, even those with high peak performance.
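To make the memory-access pattern concrete, here is a minimal sequential sketch of a conventional top-down, level-synchronous BFS over a graph in compressed sparse row (CSR) form. The CSR layout and the names `bfs_top_down`, `row_offsets` and `col_indices` are assumptions for illustration; this excerpt does not say which data structure the paper uses.

```python
def bfs_top_down(row_offsets, col_indices, source):
    """Level-synchronous top-down BFS; returns the parent of each vertex.

    row_offsets[v]..row_offsets[v + 1] delimits the neighbors of v inside
    col_indices (CSR layout). Unreached vertices keep parent -1.
    """
    num_vertices = len(row_offsets) - 1
    parent = [-1] * num_vertices
    parent[source] = source
    frontier = [source]
    while frontier:
        next_frontier = []
        for u in frontier:
            for e in range(row_offsets[u], row_offsets[u + 1]):
                v = col_indices[e]
                # parent[v] is indexed by a value read from the graph itself,
                # so consecutive accesses land at unrelated addresses.
                if parent[v] == -1:
                    parent[v] = u
                    next_frontier.append(v)
        frontier = next_frontier
    return parent


# Small undirected example graph: edges 0-1, 0-2, 1-3, 2-3, 3-4.
ROW_OFFSETS = [0, 2, 4, 6, 9, 10]
COL_INDICES = [1, 2, 0, 3, 0, 3, 1, 2, 4, 3]
PARENTS = bfs_top_down(ROW_OFFSETS, COL_INDICES, source=0)
```

Each inner iteration does one comparison and at most one store, but it reads `parent[v]` at a position determined by the graph's edges. That is the low-computation, poor-locality behavior the paragraph above describes.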

A series of efficient parallel BFS algorithms have been presented and implemented on GPU and multi-core CPU platforms. For GPU platforms, Harish presents the first GPU-based BFS algorithm [4]. Hong proposes a scan-based BFS implementation that improves memory access efficiency through memory coalescing and variable virtual warps. On this basis, Hong deploys the BFS algorithm on a CPU-GPU heterogeneous platform by mapping the exploration of each BFS level to the CPU or the GPU according to the degree of parallelism [5]. Merrill puts forward a prefix-sum based method to improve load balancing across massive numbers of GPU threads, and presents a multi-GPU BFS implementation [6]. For multi-core CPU platforms, Agarwal proposes a scalable BFS implementation based on a bitmap approach and a data partitioning algorithm [7]. Xia reduces contention on the shared queue by using multiple queues [8]. Chhugani reduces the overhead of accessing the visitation status of graph vertices by eliminating atomic operations [9]. Beamer designs a direction-optimizing BFS algorithm, which reduces memory accesses by combining the top-down and bottom-up algorithms and outperforms the other related works [10].
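The bottom-up direction can be sketched in a few lines. The following sequential Python version is only an illustration of the general idea, not the code of [10] or of this paper. It assumes an undirected graph stored in CSR form, and the names `bottom_up_step` and `bfs_bottom_up` are hypothetical.

```python
def bottom_up_step(row_offsets, col_indices, parent, in_frontier):
    """One bottom-up level: every unvisited vertex looks for a frontier parent.

    Assumes an undirected graph, so a vertex's neighbor list also contains
    every vertex that could reach it. in_frontier is a boolean list marking
    the current frontier; the return value marks the next frontier.
    """
    num_vertices = len(row_offsets) - 1
    next_frontier = [False] * num_vertices
    for v in range(num_vertices):
        if parent[v] != -1:
            continue  # already visited
        for e in range(row_offsets[v], row_offsets[v + 1]):
            u = col_indices[e]
            if in_frontier[u]:
                parent[v] = u
                next_frontier[v] = True
                break  # first frontier neighbor found: skip remaining edges
    return next_frontier


def bfs_bottom_up(row_offsets, col_indices, source):
    """Run bottom-up steps from source until the frontier is empty."""
    num_vertices = len(row_offsets) - 1
    parent = [-1] * num_vertices
    parent[source] = source
    in_frontier = [False] * num_vertices
    in_frontier[source] = True
    while any(in_frontier):
        in_frontier = bottom_up_step(row_offsets, col_indices,
                                     parent, in_frontier)
    return parent
```

The early `break` is where the savings come from: when the frontier is large, most unvisited vertices find a parent after checking only a few edges. When the frontier is small, the scan over all unvisited vertices is mostly wasted, which is why a direction-optimizing BFS switches between the two directions.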

In this paper, we present a direction-optimizing BFS implementation on CPU-GPU heterogeneous platforms. First, we deploy the bottom-up BFS algorithm to the GPU to improve the exploration performance of large vertex frontiers. Then we design a CPU-GPU collaborative bottom-up BFS algorithm to fully exploit the computing power of the heterogeneous platform. Finally, we present the direction-optimizing BFS implementation on CPU-GPU heterogeneous platforms, which provides the best performance for the exploration of each BFS level. Our contributions are listed as follows:

• We deploy the bottom-up BFS algorithm to the GPU. By coalescing the memory access and balancing the workload in the thread warp, our implementation
