processing unit. In March 2010, NVIDIA Corporation introduced the
Fermi architecture; the Fermi-based GF100 has 4 graphics processing
clusters (GPCs), 16 SMs, and 512 cores. In Fermi, each SM has 32
cores, a 12 KB L1 cache, and dual-warp scheduling. In May 2012,
NVIDIA introduced the Kepler architecture, in which each SM contains
192 scalar processors (SPs) and 32 special function units (SFUs). In
addition, each SM contains 64 KB of shared memory, through which the
threads in a block share data and communicate. Shared memory is
addressed explicitly by the programmer, and in the absence of bank
conflicts its access speed is close to that of registers. Each SM
also contains a register file, from which registers are allocated to
threads during execution. A graphics processing cluster (GPC) is
composed of 2 SMs, which share the L1 and texture caches; the four
GPCs share the L2 cache, and all SMs share the global memory [21].
NVIDIA launched the Maxwell architecture in 2014. This architecture
provides substantial application performance improvements over prior
architectures through larger dedicated shared memory, shared memory
atomics, and more active thread blocks per SM. NVIDIA launched the
Pascal architecture in 2016; its Tesla P100 accelerator is built on
the Pascal GP100 GPU. GP100 is composed of an array of graphics
processing clusters (GPCs), each containing ten SMs. Each SM has 64
CUDA cores and four texture units, so with 60 SMs, GP100 has a total
of 3840 single-precision CUDA cores and 240 texture units. Tesla P100
also features NVIDIA's high-speed NVLink interface, which provides
GPU-to-GPU data transfers at up to 160 GB/s of bidirectional
bandwidth, 5 times the bandwidth of PCIe Gen 3 x16.
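For reference, these per-device architectural parameters (SM count, shared memory size, register file size) can be queried at run time with the standard CUDA runtime call cudaGetDeviceProperties; a minimal sketch:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   /* query device 0 */
    printf("Device: %s\n", prop.name);
    printf("SMs: %d\n", prop.multiProcessorCount);
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("Registers per block: %d\n", prop.regsPerBlock);
    return 0;
}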
3.2. CPU–GPU hybrid parallel programming
With the rapid development of multicore technology, the number of
cores per CPU has been increasing. CPUs with 4, 6, 8, or more cores
have entered the general computing environment, rapidly improving
parallel processing power. A heterogeneous computing environment can
thus be built from a GPU and a multicore CPU.
In CUDA, the GPU is a device without process control capability and
is controlled by the CPU. Data are transferred from host memory to
the global memory of the GPU, and the CPU then invokes the GPU
computation by launching a kernel function [23].
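A minimal sketch of this control flow, with an illustrative kernel of our own (the names scale and run, and the launch configuration, are assumptions, not the paper's code):

#include <cuda_runtime.h>

/* Illustrative kernel: each thread scales one array element. */
__global__ void scale(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

void run(float *h, int n) {
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);  /* host -> GPU global memory */
    scale<<<(n + 255) / 256, 256>>>(d, n);                        /* CPU launches the kernel */
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);  /* results back to the host */
    cudaFree(d);
}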
OpenMP provides simple and easy-to-use multi-threaded parallel
computing on multicore CPUs [8]. A heterogeneous programming model
can be established by combining OpenMP and CUDA for a CPU–GPU
heterogeneous computing environment: OpenMP dedicates one thread to
controlling the GPU, while the remaining threads share the workload
among the other CPU cores. Fig. 1 shows the CPU–GPU heterogeneous
parallel computing model.
Initially, the data are divided into two sets, assigned to the CPU
and the GPU respectively. Then, two groups of threads are created in
the OpenMP parallel section: a single thread dedicated to controlling
the GPU, and the other threads, which undertake the CPU workload on
the remaining CPU cores [29].
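A minimal OpenMP sketch of this two-group thread layout follows; the two worker functions are placeholders standing in for the GPU-control code and the CPU share of the workload, not the paper's implementation:

#include <omp.h>
#include <stdio.h>

/* Placeholder workers: in a real code these would launch the CUDA
 * kernel and compute the CPU part of the workload, respectively. */
static void gpu_part(void)          { printf("thread 0: drive the GPU\n"); }
static void cpu_part(int id, int n) { printf("thread %d of %d: CPU share\n", id, n); }

int main(void) {
    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0)
            gpu_part();   /* one dedicated thread controls the GPU */
        else
            cpu_part(omp_get_thread_num(), omp_get_num_threads());
    }   /* implicit barrier: both partial results are ready here */
    return 0;
}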
4. Sparse matrix partitioning for CPU–GPU parallel computing
4.1. The distribution function of sparse matrices
Let A be a sparse matrix, where N is the number of rows and M is the
number of columns of A. Define a distribution function (DF)
$f : \Omega_A \rightarrow B$, where the domain
$\Omega_A = \{R_1, R_2, \ldots, R_M\}$ is a collection of row vector
sets (RVS), with $R_i$ containing exactly the rows of A that have $i$
non-zero elements, and the range $B = \{b_1, b_2, \ldots, b_M\}$
gives, as $b_i$, the number of rows of A with $i$ non-zero elements.
Then $\Omega_A$ and $B$ satisfy the following properties:
$$f(R_i) = b_i, \quad R_i \in \Omega_A, \; b_i \in B,$$
$$A = \bigcup_{i=1}^{M} R_i, \quad R_i \cap R_j = \emptyset, \; i \neq j,$$
$$\sum_{i=1}^{M} b_i = N. \tag{1}$$
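In implementation terms, evaluating the DF amounts to building a histogram of per-row non-zero counts. A minimal sketch, assuming A is held in CSR form with an (N+1)-entry row pointer row_ptr (our naming, not the paper's):

/* b[i] = number of rows with exactly i non-zeros (i = 0..M);
 * rows with no non-zeros are tallied in b[0]. */
void distribution(const int *row_ptr, int N, int M, int *b) {
    for (int i = 0; i <= M; i++) b[i] = 0;
    for (int r = 0; r < N; r++) {
        int nnz = row_ptr[r + 1] - row_ptr[r];  /* non-zeros in row r */
        b[nnz]++;
    }
}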
4.2. The hybrid format for sparse matrices
HYB performs better when a matrix has a small number of non-zero
elements per row and most rows have nearly the same number of
non-zero elements, with only a few irregular rows containing many
more. The matrix is split into two parts, ELL (or DIA) and COO, such
that the majority of rows, which have nearly equal lengths, are
stored in ELL (or in DIA for a quasi-diagonal matrix), and the few
irregular rows with many more non-zero elements are stored in COO.
The coordinate (COO) format is a particularly simple storage scheme
built from tuples of (row, column, value): the arrays row, column,
and value store the row indices, column indices, and values of the
non-zero elements of the matrix, respectively. For an N-by-M matrix
with a maximum of K non-zeros per row, the ELL format stores the
non-zero values in a dense N-by-K data array, where rows with fewer
than K non-zeros are zero-padded. Similarly, the corresponding column
indices are stored in a dense N-by-K index array, again with a
sentinel value used for padding. The DIA format consists of two
arrays: an N-by-K data array that stores the non-zero values of the
diagonals, and an offset array that stores the offset of each stored
diagonal with respect to the main diagonal.
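For concreteness, the three storage schemes can be declared along the following lines (an illustrative sketch; the field names are ours, not from a particular library):

typedef struct {   /* COO: one (row, column, value) tuple per non-zero */
    int *row, *col;
    float *val;
    int nnz;
} coo_t;

typedef struct {   /* ELL: dense N-by-K value and column-index arrays */
    float *data;   /* N*K values, rows shorter than K zero-padded */
    int *index;    /* N*K column indices, padded with a sentinel */
    int N, K;
} ell_t;

typedef struct {   /* DIA: N-by-K values plus one offset per diagonal */
    float *data;
    int *offset;   /* offset of each diagonal from the main diagonal */
    int N, K;
} dia_t;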
HYB is a hybrid format of COO and ELL (or DIA). Given a threshold K,
the part of each row exceeding K non-zeros is extracted and stored in
COO, while the rest is stored in ELL (or DIA), in order to minimize
zero-padding. A sparse matrix is thus divided into two parts, COO and
ELL (or DIA), by the threshold K. Let us consider the following
example:
$$A = \begin{pmatrix} 3 & 0 & 0 & 0 \\ 0 & 1 & 4 & 0 \\ 6 & 0 & 2 & 8 \\ 0 & 5 & 0 & 7 \end{pmatrix}.$$
Assume that $K = 2$. Then, we have
$$\text{COO}: \; row = \begin{pmatrix} 3 \end{pmatrix}, \quad column = \begin{pmatrix} 4 \end{pmatrix}, \quad value = \begin{pmatrix} 8 \end{pmatrix};$$
$$\text{ELL}: \; data = \begin{pmatrix} 3 & 0 \\ 1 & 4 \\ 6 & 2 \\ 5 & 7 \end{pmatrix}, \quad index = \begin{pmatrix} 1 & \ast \\ 2 & 3 \\ 1 & 3 \\ 2 & 4 \end{pmatrix};$$
or
$$\text{COO}: \; row = \begin{pmatrix} 3 & 4 \end{pmatrix}, \quad column = \begin{pmatrix} 1 & 2 \end{pmatrix}, \quad value = \begin{pmatrix} 6 & 5 \end{pmatrix};$$
$$\text{DIA}: \; data = \begin{pmatrix} 3 & 0 \\ 1 & 4 \\ 2 & 8 \\ 7 & 0 \end{pmatrix}, \quad offset = \begin{pmatrix} 0 & 1 \end{pmatrix}.$$
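The threshold split itself is straightforward to implement. Below is a minimal sketch assuming the input is in CSR form, using 0-based indices and -1 as the ELL padding sentinel; all names are ours, not the paper's:

/* Split CSR (row_ptr, col, val) into ELL (first K non-zeros per row,
 * zero-padded, row-major) and COO (everything beyond K). */
void hyb_split(const int *row_ptr, const int *col, const float *val,
               int N, int K,
               float *ell_data, int *ell_index,   /* N*K, pre-allocated */
               int *coo_row, int *coo_col, float *coo_val, int *coo_nnz) {
    *coo_nnz = 0;
    for (int r = 0; r < N; r++) {
        int len = row_ptr[r + 1] - row_ptr[r];
        for (int k = 0; k < K; k++) {             /* first K entries go to ELL */
            int pad = (k >= len);
            ell_data[r * K + k]  = pad ? 0.0f : val[row_ptr[r] + k];
            ell_index[r * K + k] = pad ? -1   : col[row_ptr[r] + k];
        }
        for (int k = K; k < len; k++) {           /* overflow goes to COO */
            coo_row[*coo_nnz] = r;
            coo_col[*coo_nnz] = col[row_ptr[r] + k];
            coo_val[*coo_nnz] = val[row_ptr[r] + k];
            (*coo_nnz)++;
        }
    }
}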
While the ELL format is well suited to vector architectures, its
efficiency degrades rapidly when the number of non-zeros per row
varies. DIA is suitable for the compression and storage of diagonal
matrices. If the non-zero elements of a sparse matrix are not
concentrated on the diagonals but are dispersed over a wide area, the more