processes. Because the approximate solutions obtained by direct methods are closer
to the exact solutions, the convergence of the iterative solution of the block-tridiagonal
system of linear equations can be accelerated. Some direct methods perform well on
small-scale equations, and the sub-equations can be solved in parallel. We present an
improved algorithm that solves the sub-equations with thread blocks on the GPU and
keeps the intermediate data in shared memory, which significantly reduces memory-access
latency. As a result, the computational complexity of the hybrid method is not
increased, while its convergence speed is accelerated.
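To illustrate this mapping, the following is a minimal CUDA sketch of the shared-memory pattern, not the paper's actual implementation: each thread block solves one tridiagonal sub-system with a Thomas sweep whose intermediate data live entirely in shared memory. The kernel name, the fixed sub-system size SUB_N, and the flattened coefficient layout are illustrative assumptions.

// Sketch only: one thread block per tridiagonal sub-system, with all
// intermediate data kept in shared memory. SUB_N, the array layout, and
// the kernel name are assumptions, not the paper's actual code.
#define SUB_N 64  // size of each sub-system (assumed)

__global__ void solveSubsystems(const double *a,  // sub-diagonals (a[0] unused)
                                const double *b,  // main diagonals
                                const double *c,  // super-diagonals (c[SUB_N-1] unused)
                                const double *d,  // right-hand sides
                                double *x)        // solutions
{
    __shared__ double sa[SUB_N], sb[SUB_N], sc[SUB_N], sd[SUB_N];
    const int base = blockIdx.x * SUB_N;  // offset of this block's sub-system

    // Stage the coefficients cooperatively into shared memory.
    for (int i = threadIdx.x; i < SUB_N; i += blockDim.x) {
        sa[i] = a[base + i];
        sb[i] = b[base + i];
        sc[i] = c[base + i];
        sd[i] = d[base + i];
    }
    __syncthreads();

    // The Thomas sweep is inherently serial, so one thread per block runs
    // it; every read and write now hits low-latency shared memory.
    if (threadIdx.x == 0) {
        for (int i = 1; i < SUB_N; ++i) {            // forward elimination
            double w = sa[i] / sb[i - 1];
            sb[i] -= w * sc[i - 1];
            sd[i] -= w * sd[i - 1];
        }
        sd[SUB_N - 1] /= sb[SUB_N - 1];              // back substitution
        for (int i = SUB_N - 2; i >= 0; --i)
            sd[i] = (sd[i] - sc[i] * sd[i + 1]) / sb[i];
    }
    __syncthreads();

    // Write the sub-solution back to global memory.
    for (int i = threadIdx.x; i < SUB_N; i += blockDim.x)
        x[base + i] = sd[i];
}

// Launch with one block per sub-system, e.g.
//   solveSubsystems<<<numSub, 64>>>(a, b, c, d, x);

Serializing the sweep inside a block may look wasteful, but with many independent sub-systems the GPU stays saturated at the block level, and keeping the sweeps in shared memory avoids the global-memory latency described above.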
According to our experiments on ten test cases, the performance improvement from
our algorithm is substantial. The average number of iterations is reduced by 283.15
compared with CG and by 18.34 % compared with BiCGSTAB of the PARALUTION
library, and our method outperforms the commonly used iterative and direct methods:
the performance of solving the test cases on the GPU is improved by 26.98, 11.52,
and 9.25 % using our method compared with CG, BiCGSTAB of the PARALUTION
library, and cuSolverSP of CUDA, respectively.
The remainder of the paper is organized as follows. In Sect. 2, we review related
research on solving block-tridiagonal systems of linear equations. In Sect. 3, we present
an introduction to CUDA. In Sect. 4, we develop our method for solving
block-tridiagonal systems. In Sect. 4.6, we describe the parallel implementation of our
method on the GPU. In Sect. 5, we present the performance comparison results of
our extensive experiments. In Sect. 6, we conclude the paper.
2 Related work
Block-tridiagonal matrices have a central block diagonal and two adjacent block
off-diagonals located at a distance m from the main diagonal, where m is the size of
each block. There has been considerable work on developing solution algorithms for
block-tridiagonal matrices.
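For concreteness, a block-tridiagonal matrix with n diagonal blocks of size m × m has the standard form (the block names below are generic notation, not taken from this paper)

A = \begin{pmatrix}
B_1 & C_1    &         &         &         \\
A_2 & B_2    & C_2     &         &         \\
    & \ddots & \ddots  & \ddots  &         \\
    &        & A_{n-1} & B_{n-1} & C_{n-1} \\
    &        &         & A_n     & B_n
\end{pmatrix},

where each A_i, B_i, and C_i is an m × m block; viewed entrywise, the nonzero off-diagonal bands lie at distance m from the main diagonal, as stated above.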
For a block-tridiagonal matrix A, it is possible to obtain an exact inverse (direct solu-
tion) with no fill-in using the well-known Thomas serial algorithm [2], which is easily
generalized to block sizes m > 1. While this is the fastest algorithm on a serial
computer, it is not parallelizable, since each solution step in the algorithm depends
on the preceding one. Many authors have considered efficient parallel block solvers
for scalar-block (m = 1) matrices based on cyclic reduction [3]. Cyclic reduction
was first described by Heller [4] for block-tridiagonal systems, although an efficient
parallel code was not provided. The BCYCLIC code [5] fills a software gap among
the available codes for solving tridiagonal systems with large (m ≫ 1), dense blocks.
Reference [6] analyzed four well-known parallel tridiagonal system solvers: cyclic
reduction [7,8], recursive doubling [9], Bondeli's divide-and-conquer algorithm [10],
and Wang's partition method [11]. Cyclic reduction and recursive doubling focus on
a fine grain of parallelism, where each processor computes only one equation of the
system; a serial sketch of cyclic reduction is given at the end of this section.
Reference [11] developed a partition method with a
coarser grain of parallelism. ScaLAPACK [12] provided another method for efficiently
solving dense block-tridiagonal systems. Block factorization and solution based on
ScaLAPACK are currently implemented in the SIESTA [13] MHD equilibrium code.
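As promised above, here is a hedged serial sketch of scalar (m = 1) cyclic reduction; the function name, the in-place coefficient layout, and the restriction to n = 2^k − 1 unknowns are illustrative assumptions, and the parallel and block variants cited above are more involved. Each reduction step eliminates, in every second equation, the unknowns at distance h, halving the system until a single equation remains; the updates within a step are independent, which is exactly the fine grain of parallelism noted above.

// Serial scalar cyclic reduction for n = 2^k - 1 unknowns (sketch only;
// names and layout are assumptions). a = sub-diagonal (a[0] = 0),
// b = main diagonal, c = super-diagonal (c[n-1] = 0), d = right-hand side.
// The coefficient arrays are overwritten during the reduction.
void cyclicReduction(double *a, double *b, double *c, double *d,
                     double *x, int n)
{
    // Forward reduction: at stride h, every equation i = 2h-1, 4h-1, ...
    // absorbs its neighbors at distance h. These updates are mutually
    // independent and could each be assigned to one processor.
    for (int h = 1; 2 * h <= n; h *= 2) {
        for (int i = 2 * h - 1; i < n; i += 2 * h) {
            double alpha = a[i] / b[i - h];
            double beta  = c[i] / b[i + h];
            b[i] -= alpha * c[i - h] + beta * a[i + h];
            d[i] -= alpha * d[i - h] + beta * d[i + h];
            a[i] = -alpha * a[i - h];
            c[i] = -beta  * c[i + h];
        }
    }
    // Back substitution: solve the middle equation first, then fill in
    // the remaining unknowns level by level (out-of-range neighbors are
    // zero by construction).
    for (int h = (n + 1) / 2; h >= 1; h /= 2) {
        for (int i = h - 1; i < n; i += 2 * h) {
            double xl = (i - h >= 0) ? x[i - h] : 0.0;
            double xr = (i + h < n) ? x[i + h] : 0.0;
            x[i] = (d[i] - a[i] * xl - c[i] * xr) / b[i];
        }
    }
}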