Direction-Optimizing Breadth-First Search
on CPU-GPU heterogeneous platforms
Dan Zou, Yong Dou, Qiang Wang
National Laboratory for Parallel and Distribution Processing
National University of Defense Technology
Changsha, China
Email: zoudan@nudt.edu.cn
Jinbo Xu, Baofeng Li
College of Computer
National University of Defense Technology
Changsha, China
Abstract—Breadth-First Search (BFS) is a basis for many
graph traversal and analysis algorithms. In this paper, we
present a direction-optimizing BFS implementation on CPU-GPU
heterogeneous platforms to fully exploit the computing power of
both the multi-core CPU and GPU. For each level of the BFS
algorithm, we dynamically choose the best implementation from:
a sequential top-down execution on CPU, a parallel top-down
execution on CPU, and a cooperative bottom-up execution on
CPU and GPU. By adapting to the runtime variability of vertex
frontiers, such a hybrid approach provides the best performance
for the exploration of each BFS level while avoiding poor worst-
case performance. Our implementation demonstrates speedups of
1.37 to 1.44 compared to the highest published performance for
shared memory systems.
Keywords—Breadth-First Search; heterogeneous platform;
GPU
I. INTRODUCTION
In recent years, heterogeneous parallel architecture has
become an important development trend of high-performance
scientific computing architecture, which usually consists of two
types of processing elements: the multi-core CPU and the
algorithm accelerator. The algorithm accelerator involves in the
calculation as a coprocessor under the control of CPU.
Compared with the multi-core CPU, the algorithm
accelerator such as GPU has superior computing power and
memory bandwidth. The GPU is capable to execute much more
threads than the CPU and has been succeeded in accelerating a
large number of scientific and engineering computing
algorithms. The GPU is suitable for compute-intensive regular
algorithms, but is suboptimal for data-intensive irregular
algorithms. The substantial discrete data access and execution
path divergence serialize the GPU threads, thus degrading the
actual performance.
Breadth-First Search (BFS) is one of the typical data-
intensive irregular algorithms. As the basis for many graph
traversal and analysis algorithms, the BFS algorithm is widely
used to evaluate the processing capability for data-intensive
applications of the computing systems, and has become the
core algorithm of many benchmarks, such as Parboil [1],
Rodinia [2] and Graph500 [3]. BFS algorithm has data-driven
computations dictated by the irregularity of the graphs, leading
to fine-grained random memory accesses with poor spatial and
temporal locality. In addition, BFS algorithm tends to explore
the structure of the graph while performing a relatively small
amount of computations, leading to execution times dominated
by memory access time. Therefore, BFS algorithm usually
obtains suboptimal performance on conventional cache-based
processors with high peak performance.
A series of efficient parallel algorithms have been presented
and implemented on GPU and multi-core CPU platforms. For
the GPU platforms, Harish presents the first GPU based BFS
algorithm [4]. Hong proposes a scan-based BFS
implementation, improving the efficiency of memory access by
memory coalescing and variable virtual warp. On this basis,
Hong deploys the BFS algorithm on the CPU-GPU
heterogeneous platform by mapping the exploration task of
each BFS level to CPU or GPU, according to the degree of
parallelism [5]. Merrill puts forward a prefix-sum based
method to improve the load balancing of massive GPU threads,
and presents a multi-GPU based BFS implementation [6]. For
the multi-core CPU platforms, Agarwal proposes a scalable
BFS implementation based on a bitmap approach and data
partitioning algorithm [7]. Xia reduces the race condition of the
shared queue by using multiple queues [8]. Chhugani reduces
the access overhead of the visitation status of graph vertices by
eliminating the atomic operations [9]. Beamer designs a
direction-optimizing BFS algorithm, which reduces the
memory access by combing the top-down and bottom-up
algorithms and obtains a superior performance to other related
works [10].
In this paper, we present a direction-optimizing BFS
implementation on CPU-GPU heterogeneous platforms. First
we deploy the bottom-up BFS algorithm to the GPU to
improve the exploration performance of large vertex frontiers.
Then we design a CPU-GPU collaborative bottom-up BFS
algorithm to fully exploit the computing power of the
heterogeneous platform. Finally we present the direction-
optimizing BFS implementation on CPU-GPU heterogeneous
platforms to provide the best performance for the exploration
of each BFS level. Our contributions are list as follows:
We deploy the bottom-up BFS algorithm to the GPU.
By coalescing the memory access and balancing the
workload in the thread warp, our implementation