*Corresponding Author.
Supported by the National Natural Science Foundation of China (No. 61572025),
and the Hunan Science and Technology Innovation Plan Project (No. 2018XK2102)
Efficient Large-scale 1D FFT Vectorization on Multi-core Vector Accelerator
Zhong Liu, Xi Tian, Xiaowen Chen*, Yuanwu Lei, Man Liao
College of Computer, National University of Defense Technology, Changsha 410073, China
zhongliu@nudt.edu.cn, simon86tx@163.com, xwchen@nudt.edu.cn, yuanwulei@nudt.edu.cn, 13607436045@139.com
Abstract—The Matrix2 Accelerator is a high-performance multi-core vector processor for high-density computing that supports fused multiply-add (FMA) instructions. We propose an efficient large-scale 1D FFT vectorization method tailored to the architectural characteristics of Matrix2. (1) An FFT vectorization method based on FMA instructions is proposed to accelerate FFT computation. By transforming the operation flow of the FFT butterfly, the independent multiplication and addition operations of the traditional FFT computation are combined into a smaller number of fused multiply-add operations. This reduces the real floating-point operation count of the radix-2 butterfly from 10 multiplication/addition operations to 6 FMA operations, and that of the radix-4 butterfly from 34 multiplication/addition operations to 24 FMA operations. (2) An FFT vectorization method based on the matrix Fourier algorithm is designed, which converts a 1D FFT into a 2D FFT computed in three steps: column FFTs, multiplication of the column FFT results by a factor matrix, and row FFTs. All three steps are vectorized. (3) A factor-matrix data layout and updating method is proposed that greatly reduces the memory capacity required for the factor matrix. By combining the column FFT computation with the factor-matrix multiplication, it avoids repeated data transfers between the vector array memory and the global cache, significantly improving the computational efficiency of the FFT. (4) A double-buffering DMA mechanism is adopted to smooth data transmission across the multi-level storage hierarchy, overlapping transmission time with computation time to reduce the total computation time. Experimental results on Matrix2 show that the proposed vectorization method improves the computational efficiency of large-scale 1D FFT by an average factor of 5.56.
Keywords—Large-scale Fast Fourier Transform; Fused
Multiply-Add; Vectorization; Vector Processor.
I. INTRODUCTION
The Discrete Fourier Transform (DFT) is one of the most important algorithms in scientific computing, especially in signal processing systems such as radar signal processing, underwater acoustic signal processing, spectrum analysis, video and image algorithms, and speech recognition. These applications impose strict real-time requirements: the higher the DFT computational efficiency, the better the real-time performance of the signal processing.
How to improve DFT computational efficiency has long been a research hotspot. The Fast Fourier Transform (FFT) algorithm proposed by Cooley and Tukey [1] significantly reduces the computational complexity of the DFT, from O(N^2) to O(N log2 N) for an N-point transform. Linzer [2], Goedecker [3], and Karner [4] studied how to use Fused Multiply-Add (FMA) instructions to reduce the number of multiplication and addition operations, in order to optimize FFT performance on processors that provide FMA instructions. Liu [5] proposed a vectorization method using FMA instructions to accelerate FFT computation on a vector processor with an FMA structure.
These methods mainly reduce the number of multiplication and addition operations in DFT computation, and can greatly improve performance when data are supplied in time. However, more and more applications require large-scale, even very large-scale, FFT computation. Usually, on-chip memory cannot accommodate all the data of a large-scale FFT, so multiple data transfers between the processor's storage levels are needed to complete the computation. The time spent transferring data from external storage can then exceed the computation time, due to the well-known "memory wall" problem. Moreover, the performance of large-scale FFTs depends not only on the peak performance of the processor, but even more on the data layout and migration strategy. Therefore, the computational performance of large-scale FFTs must be improved in combination with the architectural characteristics and hardware features of the processor.
For example, Sharon [6] proposed a large-scale FFT optimization method for the multi-core structure of IBM's Cell processor, and Popovici [7] implemented a double-buffering mechanism to overlap data movement from memory with computation on multi-core Intel and AMD CPUs. He Tao [8] proposed a large-scale FFT optimization method for the GPU architecture. FFTW [9] is a widely used FFT math library on general-purpose CPU platforms; it is highly portable and searches for the optimal FFT implementation according to the processor's architectural characteristics. Voronenko [11] proposed a general method, based on directed acyclic graphs, for converting linear transforms such as the discrete Fourier transform and the discrete cosine transform into FMA algorithms. Loberiras [12] proposed an optimization technique for small-size FFTs on GPUs; limited by the size of texture memory, its performance on large FFTs is not ideal. Li [13] proposed MPFFT, an adaptive FFT optimization framework for GPUs.
Matrix2 is a high-performance multi-core vector processor developed by us for high-density computing [10]. As shown in Fig. 1, Matrix2 adopts a Very Long Instruction Word (VLIW) architecture comprising a Scalar Processing Unit (SPU) and a Vector Processing Unit (VPU). The SPU is responsible for scalar computing and flow-control tasks, and the VPU is
2019 IEEE 21st International Conference on High Performance Computing and Communications; IEEE 17th
International Conference on Smart City; IEEE 5th International Conference on Data Science and Systems
978-1-7281-2058-4/19/$31.00 ©2019 IEEE
DOI 10.1109/HPCC/SmartCity/DSS.2019.00078