Computer Physics Communications 182 (2011) 266–269
Hybrid CUDA, OpenMP, and MPI parallel programming on multicore GPU clusters ✩
Chao-Tung Yang ∗, Chih-Lin Huang, Cheng-Fang Lin
Department of Computer Science, Tunghai University, Taichung City, 40704, Taiwan
Article history:
Received 1 March 2010
Received in revised form 18 June 2010
Accepted 25 June 2010
Available online 16 July 2010
Keywords:
CUDA
GPU
MPI
OpenMP
Hybrid
Parallel programming
Nowadays, NVIDIA’s CUDA is a general-purpose scalable parallel programming model for writing highly
parallel applications. It provides several key abstractions – a hierarchy of thread blocks, shared memory,
and barrier synchronization. This model has proven quite successful at programming multithreaded
many-core GPUs and scales transparently to hundreds of cores: scientists throughout industry and
academia are already using CUDA to achieve dramatic speedups on production and research codes. In this
paper, we propose a parallel programming approach using hybrid CUDA, OpenMP, and MPI programming,
which partitions loop iterations according to the number of C1060 GPU nodes in a GPU cluster consisting
of one C1060 and one S1070. Loop iterations assigned to one MPI process are processed in parallel by
CUDA, run by the processor cores in the same computational node.
© 2010 Elsevier B.V. All rights reserved.
1. Introduction
Nowadays, NVIDIA’s CUDA [1] is a general-purpose scalable
parallel programming model for writing highly parallel applications.
It provides several key abstractions – a hierarchy of thread
blocks, shared memory, and barrier synchronization. This model
has proven quite successful at programming multithreaded many-core
GPUs and scales transparently to hundreds of cores: scientists
throughout industry and academia are already using CUDA [1] to
achieve dramatic speedups on production and research codes.
This paper proposes a solution that not only simplifies the use
of hardware acceleration in conventional general-purpose applications,
but also keeps the application code portable. In this paper,
we propose a parallel programming approach using hybrid CUDA,
OpenMP, and MPI [3] programming, which partitions loop iterations
according to the performance weighting of the multicore [4] nodes in
a cluster. Because iterations assigned to one MPI process are processed
in parallel by OpenMP threads run by the processor cores in
the same computational node, the number of loop iterations allocated
to one computational node at each scheduling step depends
on the number of processor cores in that node.
In this paper, we propose a general approach that uses perfor-
mance functions to estimate performance weights for each node.
To verify the proposed approach, a cluster with hybrid CUDA was
built in our implementation. Empirical results show that in the
hybrid CUDA cluster environment, the proposed approach improved
performance over all previous schemes.

✩ This work is supported in part by the National Science Council, Taiwan, under
grant Nos. NSC 98-2220-E-029-004- and NSC 99-2220-E-029-004-.
∗ Corresponding author. Tel.: +886 4 23590415; fax: +886 4 23591567.
E-mail address: ctyang@thu.edu.tw (C.-T. Yang).
The rest of this paper is organized as follows. In Section 2,
we introduce several typical and well-known parallel programming
schemes. In Section 3, we define our model and describe our ap-
proach. Our system configuration is then specified in Section 4,
and experimental results for three types of application programs
are presented. Concluding remarks and future work are given in
Section 5.
2. Background review
2.1. CUDA programming
CUDA (an acronym for Compute Unified Device Architecture) is
a parallel computing [2] architecture developed by NVIDIA. CUDA
is the computing engine in NVIDIA graphics processing units or
GPUs, accessible to software developers through industry-standard
programming languages. The CUDA architecture supports a
range of computational interfaces, including OpenGL [9] and
DirectCompute. CUDA’s parallel programming model is designed to
overcome the challenge of scaling to many cores while maintaining
a low learning curve for programmers familiar with standard
programming languages such as C. At its core are three key
abstractions – a hierarchy of thread groups, shared memories, and
barrier synchronization – that are simply exposed to the programmer
as a minimal set of language extensions.
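All three abstractions appear in even a minimal kernel. The following sketch is ours, not the authors' code: a hypothetical kernel `blockSum` computes one partial sum per thread block, using the thread hierarchy (block and thread indices), per-block shared memory, and barrier synchronization with `__syncthreads()`.

```cuda
// Sketch of CUDA's three key abstractions (assumed block size 256).
__global__ void blockSum(const float *in, float *out, int n)
{
    __shared__ float cache[256];            // shared memory: one copy per block
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;  // thread hierarchy: grid -> block -> thread

    cache[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                        // barrier: all loads into shared memory done

    // Tree reduction within the block; each step halves the active threads.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            cache[tid] += cache[tid + stride];
        __syncthreads();                    // barrier before the next step reads results
    }
    if (tid == 0)
        out[blockIdx.x] = cache[0];         // one partial sum per block
}
```

A host would launch it as `blockSum<<<numBlocks, 256>>>(d_in, d_out, n);` and then combine the per-block partial sums; these abstractions are the only language extensions the example relies on.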
doi:10.1016/j.cpc.2010.06.035