processes. Because the approximate solutions obtained by direct methods are closer
to the exact solutions, the convergence of the iterative solution of the block-tridiagonal
system of linear equations can be accelerated. Some direct methods perform well on
small-scale equations, and the sub-equations can be solved in parallel. We present an
improved algorithm that solves the sub-equations with thread blocks on the GPU and
keeps the intermediate data in shared memory, which significantly reduces memory-access
latency. As a result, the computational complexity of the hybrid method is not
increased, while its convergence speed is accelerated.
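To illustrate this mapping, the following is a minimal CUDA sketch of the shared-memory pattern, not the paper's actual implementation: each thread block solves one tridiagonal sub-system with a Thomas sweep whose intermediate data live entirely in shared memory. The kernel name, the fixed sub-system size SUB_N, and the flattened coefficient layout are illustrative assumptions.

// Sketch only: one thread block per tridiagonal sub-system, with all
// intermediate data kept in shared memory. SUB_N, the array layout, and
// the kernel name are assumptions, not the paper's actual code.
#define SUB_N 64  // size of each sub-system (assumed)

__global__ void solveSubsystems(const double *a,  // sub-diagonals (a[0] unused)
                                const double *b,  // main diagonals
                                const double *c,  // super-diagonals (c[SUB_N-1] unused)
                                const double *d,  // right-hand sides
                                double *x)        // solutions
{
    __shared__ double sa[SUB_N], sb[SUB_N], sc[SUB_N], sd[SUB_N];
    const int base = blockIdx.x * SUB_N;  // offset of this block's sub-system

    // Stage the coefficients cooperatively into shared memory.
    for (int i = threadIdx.x; i < SUB_N; i += blockDim.x) {
        sa[i] = a[base + i];
        sb[i] = b[base + i];
        sc[i] = c[base + i];
        sd[i] = d[base + i];
    }
    __syncthreads();

    // The Thomas sweep is inherently serial, so one thread per block runs
    // it; every read and write now hits low-latency shared memory.
    if (threadIdx.x == 0) {
        for (int i = 1; i < SUB_N; ++i) {            // forward elimination
            double w = sa[i] / sb[i - 1];
            sb[i] -= w * sc[i - 1];
            sd[i] -= w * sd[i - 1];
        }
        sd[SUB_N - 1] /= sb[SUB_N - 1];              // back substitution
        for (int i = SUB_N - 2; i >= 0; --i)
            sd[i] = (sd[i] - sc[i] * sd[i + 1]) / sb[i];
    }
    __syncthreads();

    // Write the sub-solution back to global memory.
    for (int i = threadIdx.x; i < SUB_N; i += blockDim.x)
        x[base + i] = sd[i];
}

// Launch with one block per sub-system, e.g.
//   solveSubsystems<<<numSub, 64>>>(a, b, c, d, x);

Serializing the sweep inside a block may look wasteful, but with many independent sub-systems the GPU stays saturated at the block level, and keeping the sweeps in shared memory avoids the global-memory latency described above.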
According to our experiments on ten test cases, the performance improvement from
our algorithm is substantial. The average number of iterations is reduced by 283.15
compared with CG and by 18.34 % compared with BiCGSTAB of the PARALUTION
library, and our method outperforms the commonly used iterative and direct methods:
the performance of solving the test cases on the GPU is improved by 26.98, 11.52,
and 9.25 % using our method compared with CG, BiCGSTAB of the PARALUTION
library, and cuSolverSP of CUDA, respectively.
The remainder of the paper is organized as follows. In Sect. 2, we review related
research on solving block-tridiagonal systems of linear equations. In Sect. 3, we present
an introduction to CUDA. In Sect. 4, we develop our method for solving
block-tridiagonal systems. In Sect. 4.6, we describe the parallel implementation of our
method on the GPU. In Sect. 5, we present the performance comparison results of
our extensive experiments. In Sect. 6, we conclude the paper.
2 Related work
Block-tridiagonal matrices have a central block diagonal and two adjacent block
off-diagonals located at a distance m from the main diagonal, where m is the size of
each block. There has been considerable work on developing solution algorithms for
block-tridiagonal matrices.
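For concreteness, a block-tridiagonal matrix with n diagonal blocks of size m × m has the standard form (the block names below are generic notation, not taken from this paper)

A = \begin{pmatrix}
B_1 & C_1    &         &         &         \\
A_2 & B_2    & C_2     &         &         \\
    & \ddots & \ddots  & \ddots  &         \\
    &        & A_{n-1} & B_{n-1} & C_{n-1} \\
    &        &         & A_n     & B_n
\end{pmatrix},

where each A_i, B_i, and C_i is an m × m block; viewed entrywise, the nonzero off-diagonal bands lie at distance m from the main diagonal, as stated above.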
For a block-tridiagonal matrix A, it is possible to obtain an exact inverse (direct solu-
tion) with no fill-in using the well-known Thomas serial algorithm [2], which is easily
generalized to block sizes m > 1. While this is the fastest algorithm on a serial
computer, it is not parallelizable, since each solution step in the algorithm depends
on the preceding one. Many authors have considered efficient parallel block solvers
for scalar-block (m = 1) matrices based on cyclic reduction [3]. Cyclic reduction
was first described by Heller [4] for block-tridiagonal systems, although an efficient
parallel code was not provided. The BCYCLIC code [5] fills a software gap among
the available codes for solving tridiagonal systems with large (m ≫ 1), dense blocks.
Reference [6] analyzed four well-known parallel tridiagonal system solvers: cyclic
reduction [7,8], recursive doubling [9], Bondeli's divide-and-conquer algorithm [10],
and Wang's partition method [11]. Cyclic reduction and recursive doubling focus on
a fine grain of parallelism, where each processor computes only one equation of the
system; a serial sketch of cyclic reduction is given at the end of this section.
Reference [11] developed a partition method with a
coarser grain of parallelism. ScaLAPACK [12] provided another method for efficiently
solving dense block-tridiagonal systems. Block factorization and solution based on
ScaLAPACK are currently implemented in the SIESTA [13] MHD equilibrium code.
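As promised above, here is a hedged serial sketch of scalar (m = 1) cyclic reduction; the function name, the in-place coefficient layout, and the restriction to n = 2^k − 1 unknowns are illustrative assumptions, and the parallel and block variants cited above are more involved. Each reduction step eliminates, in every second equation, the unknowns at distance h, halving the system until a single equation remains; the updates within a step are independent, which is exactly the fine grain of parallelism noted above.

// Serial scalar cyclic reduction for n = 2^k - 1 unknowns (sketch only;
// names and layout are assumptions). a = sub-diagonal (a[0] = 0),
// b = main diagonal, c = super-diagonal (c[n-1] = 0), d = right-hand side.
// The coefficient arrays are overwritten during the reduction.
void cyclicReduction(double *a, double *b, double *c, double *d,
                     double *x, int n)
{
    // Forward reduction: at stride h, every equation i = 2h-1, 4h-1, ...
    // absorbs its neighbors at distance h. These updates are mutually
    // independent and could each be assigned to one processor.
    for (int h = 1; 2 * h <= n; h *= 2) {
        for (int i = 2 * h - 1; i < n; i += 2 * h) {
            double alpha = a[i] / b[i - h];
            double beta  = c[i] / b[i + h];
            b[i] -= alpha * c[i - h] + beta * a[i + h];
            d[i] -= alpha * d[i - h] + beta * d[i + h];
            a[i] = -alpha * a[i - h];
            c[i] = -beta  * c[i + h];
        }
    }
    // Back substitution: solve the middle equation first, then fill in
    // the remaining unknowns level by level (out-of-range neighbors are
    // zero by construction).
    for (int h = (n + 1) / 2; h >= 1; h /= 2) {
        for (int i = h - 1; i < n; i += 2 * h) {
            double xl = (i - h >= 0) ? x[i - h] : 0.0;
            double xr = (i + h < n) ? x[i + h] : 0.0;
            x[i] = (d[i] - a[i] * xl - c[i] * xr) / b[i];
        }
    }
}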