processing unit. In March 2010, NVIDIA Corporation introduced the
Fermi architecture; the Fermi-based GF100 has 4 graphics processing
clusters (GPCs), 16 SMs, and 512 cores. In Fermi, each SM has 32
cores, a 12 KB L1 cache, and dual-warp scheduling. In May 2012,
NVIDIA introduced the Kepler architecture, in which each SM contains
192 scalar processors (SPs) and 32 special function units (SFUs). In
addition, each SM contains 64 KB of shared memory, through which the
threads in a block share data and communicate. Shared memory is
addressed explicitly by the programmer, and in the absence of bank
conflicts its access speed is close to that of registers. Each SM
also contains a register file, from which registers are allocated to
threads during execution. A graphics processing cluster (GPC) is
composed of 2 SMs, which share the L1 and texture caches; the four
GPCs share the L2 cache, and all SMs share the global memory [21].
NVIDIA launched the Maxwell architecture in 2014. This architecture
provides substantial application performance improvements over prior
architectures through larger dedicated shared memory, shared memory
atomics, and more active thread blocks per SM. NVIDIA launched the
Pascal architecture in 2016; its Tesla P100 accelerator is built on
the Pascal GP100 GPU. GP100 is composed of an array of graphics
processing clusters (GPCs), each containing ten SMs. Each SM has 64
CUDA cores and four texture units, so with 60 SMs, GP100 has a total
of 3840 single-precision CUDA cores and 240 texture units. Tesla P100
also features NVIDIA's high-speed NVLink interface, which provides
GPU-to-GPU data transfers at up to 160 GB/s of bidirectional
bandwidth, 5 times the bandwidth of PCIe Gen 3 x16.
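For reference, these per-device architectural parameters (SM count, shared memory size, register file size) can be queried at run time with the standard CUDA runtime call cudaGetDeviceProperties; a minimal sketch:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   /* query device 0 */
    printf("Device: %s\n", prop.name);
    printf("SMs: %d\n", prop.multiProcessorCount);
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("Registers per block: %d\n", prop.regsPerBlock);
    return 0;
}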
3.2. CPU–GPU hybrid parallel programming
With the rapid development of multicore technology, the number of
cores per CPU has been increasing. CPUs with 4, 6, 8, or more cores
have entered the general computing environment, rapidly improving
parallel processing power. A heterogeneous computing environment can
thus be built from a GPU and a multicore CPU.
In CUDA, the GPU is a device without process control capability and
is controlled by the CPU. Data are transferred from host memory to
the global memory of the GPU, and the CPU then invokes the GPU
computation by launching a kernel function [23].
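A minimal sketch of this control flow, with an illustrative kernel of our own (the names scale and run, and the launch configuration, are assumptions, not the paper's code):

#include <cuda_runtime.h>

/* Illustrative kernel: each thread scales one array element. */
__global__ void scale(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

void run(float *h, int n) {
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);  /* host -> GPU global memory */
    scale<<<(n + 255) / 256, 256>>>(d, n);                        /* CPU launches the kernel */
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);  /* results back to the host */
    cudaFree(d);
}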
OpenMP provides simple and easy-to-use multi-threaded parallel
computing on multicore CPUs [8]. A heterogeneous programming model
can be established by combining OpenMP and CUDA for a CPU–GPU
heterogeneous computing environment: OpenMP dedicates one thread to
controlling the GPU, while the remaining threads share the workload
among the other CPU cores. Fig. 1 shows the CPU–GPU heterogeneous
parallel computing model.
Initially, the data are divided into two sets, assigned to the CPU
and the GPU respectively. Then, two groups of threads are created in
the OpenMP parallel section: a single thread dedicated to controlling
the GPU, and the other threads, which undertake the CPU workload on
the remaining CPU cores [29].
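A minimal OpenMP sketch of this two-group thread layout follows; the two worker functions are placeholders standing in for the GPU-control code and the CPU share of the workload, not the paper's implementation:

#include <omp.h>
#include <stdio.h>

/* Placeholder workers: in a real code these would launch the CUDA
 * kernel and compute the CPU part of the workload, respectively. */
static void gpu_part(void)          { printf("thread 0: drive the GPU\n"); }
static void cpu_part(int id, int n) { printf("thread %d of %d: CPU share\n", id, n); }

int main(void) {
    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0)
            gpu_part();   /* one dedicated thread controls the GPU */
        else
            cpu_part(omp_get_thread_num(), omp_get_num_threads());
    }   /* implicit barrier: both partial results are ready here */
    return 0;
}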
4. Sparse matrix partitioning for CPU–GPU parallel computing
4.1. The distribution function of sparse matrices
Let A be a sparse matrix, where N is the number of rows and M is the
number of columns of A. Define a distribution function (DF)
$f : \Omega_A \rightarrow B$, where the domain
$\Omega_A = \{R_1, R_2, \ldots, R_M\}$ is a collection of row vector
sets (RVS), with $R_i$ containing exactly the rows of A that have $i$
non-zero elements, and the range $B = \{b_1, b_2, \ldots, b_M\}$
gives, as $b_i$, the number of rows of A with $i$ non-zero elements.
Then $\Omega_A$ and $B$ satisfy the following properties:
$$f(R_i) = b_i, \quad R_i \in \Omega_A, \; b_i \in B,$$
$$A = \bigcup_{i=1}^{M} R_i, \quad R_i \cap R_j = \emptyset, \; i \neq j,$$
$$\sum_{i=1}^{M} b_i = N. \tag{1}$$
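In implementation terms, evaluating the DF amounts to building a histogram of per-row non-zero counts. A minimal sketch, assuming A is held in CSR form with an (N+1)-entry row pointer row_ptr (our naming, not the paper's):

/* b[i] = number of rows with exactly i non-zeros (i = 0..M);
 * rows with no non-zeros are tallied in b[0]. */
void distribution(const int *row_ptr, int N, int M, int *b) {
    for (int i = 0; i <= M; i++) b[i] = 0;
    for (int r = 0; r < N; r++) {
        int nnz = row_ptr[r + 1] - row_ptr[r];  /* non-zeros in row r */
        b[nnz]++;
    }
}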
4.2. The hybrid format for sparse matrices
HYB performs better when a matrix has a small number of non-zero
elements per row and most rows have nearly the same number of
non-zero elements, with only a few irregular rows containing many
more. The matrix is split into two parts, ELL (or DIA) and COO, such
that the majority of rows, which have nearly equal lengths, are
stored in ELL (or in DIA for a quasi-diagonal matrix), and the few
irregular rows with many more non-zero elements are stored in COO.
The coordinate (COO) format is a particularly simple storage scheme
built from tuples of (row, column, value): the arrays row, column,
and value store the row indices, column indices, and values of the
non-zero elements of the matrix, respectively. For an N-by-M matrix
with a maximum of K non-zeros per row, the ELL format stores the
non-zero values in a dense N-by-K data array, where rows with fewer
than K non-zeros are zero-padded. Similarly, the corresponding column
indices are stored in a dense N-by-K index array, again with a
sentinel value used for padding. The DIA format consists of two
arrays: an N-by-K data array that stores the non-zero values of the
diagonals, and an offset array that stores the offset of each stored
diagonal with respect to the main diagonal.
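For concreteness, the three storage schemes can be declared along the following lines (an illustrative sketch; the field names are ours, not from a particular library):

typedef struct {   /* COO: one (row, column, value) tuple per non-zero */
    int *row, *col;
    float *val;
    int nnz;
} coo_t;

typedef struct {   /* ELL: dense N-by-K value and column-index arrays */
    float *data;   /* N*K values, rows shorter than K zero-padded */
    int *index;    /* N*K column indices, padded with a sentinel */
    int N, K;
} ell_t;

typedef struct {   /* DIA: N-by-K values plus one offset per diagonal */
    float *data;
    int *offset;   /* offset of each diagonal from the main diagonal */
    int N, K;
} dia_t;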
HYB is a hybrid format of COO and ELL (or DIA). Given a threshold K,
the part of each row exceeding K non-zeros is extracted and stored in
COO, while the rest is stored in ELL (or DIA), in order to minimize
zero-padding. A sparse matrix is thus divided into two parts, COO and
ELL (or DIA), by the threshold K. Let us consider the following
example:
$$A = \begin{pmatrix} 3 & 0 & 0 & 0 \\ 0 & 1 & 4 & 0 \\ 6 & 0 & 2 & 8 \\ 0 & 5 & 0 & 7 \end{pmatrix}.$$
Assume that $K = 2$. Then, we have
$$\text{COO}: \; row = \begin{pmatrix} 3 \end{pmatrix}, \quad column = \begin{pmatrix} 4 \end{pmatrix}, \quad value = \begin{pmatrix} 8 \end{pmatrix};$$
$$\text{ELL}: \; data = \begin{pmatrix} 3 & 0 \\ 1 & 4 \\ 6 & 2 \\ 5 & 7 \end{pmatrix}, \quad index = \begin{pmatrix} 1 & \ast \\ 2 & 3 \\ 1 & 3 \\ 2 & 4 \end{pmatrix};$$
or
$$\text{COO}: \; row = \begin{pmatrix} 3 & 4 \end{pmatrix}, \quad column = \begin{pmatrix} 1 & 2 \end{pmatrix}, \quad value = \begin{pmatrix} 6 & 5 \end{pmatrix};$$
$$\text{DIA}: \; data = \begin{pmatrix} 3 & 0 \\ 1 & 4 \\ 2 & 8 \\ 7 & 0 \end{pmatrix}, \quad offset = \begin{pmatrix} 0 & 1 \end{pmatrix}.$$
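The threshold split itself is straightforward to implement. Below is a minimal sketch assuming the input is in CSR form, using 0-based indices and -1 as the ELL padding sentinel; all names are ours, not the paper's:

/* Split CSR (row_ptr, col, val) into ELL (first K non-zeros per row,
 * zero-padded, row-major) and COO (everything beyond K). */
void hyb_split(const int *row_ptr, const int *col, const float *val,
               int N, int K,
               float *ell_data, int *ell_index,   /* N*K, pre-allocated */
               int *coo_row, int *coo_col, float *coo_val, int *coo_nnz) {
    *coo_nnz = 0;
    for (int r = 0; r < N; r++) {
        int len = row_ptr[r + 1] - row_ptr[r];
        for (int k = 0; k < K; k++) {             /* first K entries go to ELL */
            int pad = (k >= len);
            ell_data[r * K + k]  = pad ? 0.0f : val[row_ptr[r] + k];
            ell_index[r * K + k] = pad ? -1   : col[row_ptr[r] + k];
        }
        for (int k = K; k < len; k++) {           /* overflow goes to COO */
            coo_row[*coo_nnz] = r;
            coo_col[*coo_nnz] = col[row_ptr[r] + k];
            coo_val[*coo_nnz] = val[row_ptr[r] + k];
            (*coo_nnz)++;
        }
    }
}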
While the ELL format is well suited to vector architectures, its
efficiency degrades rapidly when the number of non-zeros per row
varies. DIA is suitable for the compression and storage of diagonal
matrices. If the non-zero elements of a sparse matrix are not
concentrated on the diagonals but are dispersed over a wide area, the more