*Corresponding Author.
Supported by the National Natural Science Foundation of China (No. 61572025),
and the Hunan Science and Technology Innovation Plan Project (No. 2018XK2102)
Efficient Large-scale 1D FFT Vectorization on Multi-core Vector Accelerator
Zhong Liu, Xi Tian, Xiaowen Chen*, Yuanwu Lei, Man Liao
College of Computer, National University of Defense Technology, Changsha 410073, China
zhongliu@nudt.edu.cn, simon86tx@163.com, xwchen@nudt.edu.cn, yuanwulei@nudt.edu.cn, 13607436045@139.com
Abstract—The Matrix2 Accelerator is a high-performance multi-core vector processor for high-density computing that supports fused multiply-add (FMA) instructions. We propose an efficient large-scale 1D FFT vectorization method tailored to the architectural characteristics of Matrix2. (1) An FFT vectorization method based on FMA instructions is proposed to accelerate FFT computation. By transforming the operation flow of the FFT butterfly, the independent multiplication and addition operations of the traditional FFT computation are combined into a smaller number of fused multiply-add operations. This reduces the real floating-point operation count of the radix-2 butterfly from 10 multiplication/addition operations to 6 FMA operations, and that of the radix-4 butterfly from 34 multiplication/addition operations to 24 FMA operations. (2) An FFT vectorization method based on the matrix Fourier algorithm is designed, which converts a 1D FFT into a 2D FFT computed in three steps: column FFTs, multiplication of the column FFT results by a factor matrix, and row FFTs. All three steps are vectorized. (3) A factor-matrix data layout and updating method is proposed that greatly reduces the memory capacity required for the factor matrix. By combining the column FFT computation with the factor-matrix multiplication, it avoids repeated data transfers between the vector array memory and the global cache, significantly improving the computational efficiency of the FFT. (4) A double-buffering DMA mechanism is adopted to smooth data transmission across the multi-level storage hierarchy, overlapping transmission time with computation time to reduce the total computation time. Experimental results on Matrix2 show that the proposed vectorization method improves the computational efficiency of large-scale 1D FFT by an average factor of 5.56.
Keywords—Large-scale Fast Fourier Transform; Fused
Multiply-Add; Vectorization; Vector Processor.
I. INTRODUCTION
The Discrete Fourier Transform (DFT) is one of the most important algorithms in scientific computing, especially in signal processing systems such as radar signal processing, underwater acoustic signal processing, spectrum analysis, video and image algorithms, and speech recognition. These applications impose strict real-time requirements: the higher the DFT computational efficiency, the better the real-time performance of the signal processing.
How to improve DFT computational efficiency has long been a research hotspot. The Fast Fourier Transform (FFT) algorithm proposed by Cooley and Tukey [1] significantly reduces the computational complexity of the DFT, from O(N^2) to O(N log2 N) for an N-point transform. Linzer [2], Goedecker [3], and Karner [4] studied how to use Fused Multiply-Add (FMA) instructions to reduce the number of multiplication and addition operations, in order to optimize FFT performance on processors that provide FMA instructions. Liu [5] proposed a vectorization method using FMA instructions to accelerate FFT computation on a vector processor with an FMA structure.
These methods mainly reduce the number of multiplication and addition operations in DFT computation, and can greatly improve performance when data are supplied in time. However, more and more applications require large-scale, even very large-scale, FFT computation. Usually, on-chip memory cannot accommodate all the data of a large-scale FFT, so multiple data transfers between the processor's storage levels are needed to complete the computation. The time spent transferring data from external storage can then exceed the computation time, due to the well-known "memory wall" problem. Moreover, the performance of large-scale FFTs depends not only on the peak performance of the processor, but even more on the data layout and migration strategy. Therefore, the computational performance of large-scale FFTs must be improved in combination with the architectural characteristics and hardware features of the processor.
For example, Sharon [6] proposed a large-scale FFT optimization method for the multi-core structure of IBM's Cell processor, and Popovici [7] implemented a double-buffering mechanism to overlap data movement from memory with computation on multi-core Intel and AMD CPUs. He Tao [8] proposed a large-scale FFT optimization method for the GPU architecture. FFTW [9] is a widely used FFT math library on general-purpose CPU platforms; it is highly portable and searches for the optimal FFT implementation according to the processor's architectural characteristics. Voronenko [11] proposed a general method, based on directed acyclic graphs, for converting linear transforms such as the discrete Fourier transform and the discrete cosine transform into FMA algorithms. Loberiras [12] proposed an optimization technique for small-size FFTs on GPUs; limited by the size of texture memory, its performance on large FFTs is not ideal. Li [13] proposed MPFFT, an adaptive FFT optimization framework for GPUs.
Matrix2 is a high-performance multi-core vector processor developed by us for high-density computing [10]. As shown in Fig. 1, Matrix2 adopts a Very Long Instruction Word (VLIW) architecture comprising a Scalar Processing Unit (SPU) and a Vector Processing Unit (VPU). The SPU is responsible for scalar computing and flow-control tasks, and the VPU is
2019 IEEE 21st International Conference on High Performance Computing and Communications; IEEE 17th
International Conference on Smart City; IEEE 5th International Conference on Data Science and Systems
978-1-7281-2058-4/19/$31.00 ©2019 IEEE
DOI 10.1109/HPCC/SmartCity/DSS.2019.00078