*Jun Wu is the corresponding author (e-mail: wujun@tongji.edu.cn).
A High-performance Systolic Array Accelerator Dedicated for CNN
Jing Shen, Haoqi Ren, Zhifeng Zhang, Jun Wu*, Wenqi Pan and Zhenyu Jiang
College of Electronics and Information Engineering, Tongji University,
Shanghai 201804, China
e-mail: {1732983, renhaoqi, zhangzf, wujun, 1810866, 1832925}@tongji.edu.cn
Abstract—The rapid development of artificial intelligence has
made the convolutional neural network (CNN) increasingly important.
Traditional CPU-based computing architectures cannot meet the
requirements of practical applications, so the development of a new
hardware computing platform for CNNs has become urgent. This paper
proposes a systolic array accelerator dedicated to CNNs. Because a
CNN model consists mainly of large numbers of simple, regular
operations, we optimize the convolution calculation module with a
systolic multiply-accumulate (MAC) array. We design three kinds of
convolution calculation mappings, which can handle convolutions of
different sizes. Our accelerator efficiently reuses the data in the
local storage area, which reduces data movement and improves
computing performance. To balance storage bandwidth against
computation speed, each convolutional layer is subdivided into
fine-grained tasks whose execution masks the latency of external
memory accesses. The accelerator also supports Winograd convolution
for 3×3 weight kernels. The heterogeneous system consisting of the
accelerator and the self-developed digital signal processor SWIFT
(SWIFT DSP) is verified on the FPGA platform. The experimental
results show that our accelerator outperforms traditional
accelerators under the same conditions.
Keywords-CNN; DNN; parallel computer architecture;
accelerator
I. INTRODUCTION
With the widespread use of deep learning in many fields,
its underlying model, the convolutional neural network (CNN),
has also received increasing attention. As networks grow deeper,
however, the scale of data movement keeps expanding and the
computational workload grows rapidly. Traditional processors were
not designed for the workloads of neural networks, which now
underlie many applications. Therefore, new architectures are
required to meet the growing computing demands.
Accelerators built for specific CNNs (such as AlexNet or
ZynqNet) are not general-purpose and are therefore unsuitable
for mass production. Since the TPU is not for sale, only GPUs
and FPGAs are available as training platforms for neural
networks, and both suffer from high power consumption and high
cost. Because the GPU is not a chip designed specifically for AI,
only a small part of its functionality suits deep learning, and
AI is unlikely to become its primary design target in the future.
FPGAs carry flexibility that common DNN functions do not need,
which makes them less efficient in both power and area for deep
neural network (DNN) calculations. The market demand and
application prospects for dedicated DNN accelerators are
therefore very broad.
NVIDIA proposed an AI chip architecture, the NVIDIA Deep
Learning Accelerator (NVDLA). NVDLA is a free and open
architecture that promotes a standard way to design deep
learning inference accelerators. With its modular design,
NVDLA is scalable, highly configurable, and built to simplify
integration and portability. It supports a wide range of
performance levels and readily scales from smaller,
cost-sensitive Internet of Things (IoT) devices to larger,
performance-oriented ones. Its convolution module uses three
parallel convolution calculation mapping modes (DC, WG and IMG)
to optimize the convolution pipeline and to reuse the local
storage area efficiently, which reduces the number of data
movements and increases the degree of parallelism along
multiple dimensions of the calculation.
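For reference, such mapping modes all start from the same direct-convolution loop nest, which the hardware then unrolls and tiles across its multiplier array. The minimal Python sketch below is our own illustration of the baseline that a DC-style mapping parallelizes; it is not taken from the NVDLA implementation.

```python
import numpy as np

def direct_conv(x, w, stride=1):
    """Baseline direct-convolution (DC) loop nest.
    x: input feature map [C_in, H, W]; w: weights [C_out, C_in, K, K].
    A DC-style mapping unrolls the C_in*K*K reduction for one output
    pixel across the multiplier array and runs output channels in
    parallel lanes."""
    c_out, c_in, k, _ = w.shape
    _, h, wd = x.shape
    oh = (h - k) // stride + 1
    ow = (wd - k) // stride + 1
    y = np.zeros((c_out, oh, ow))
    for co in range(c_out):          # parallel across MAC lanes in hardware
        for oy in range(oh):
            for ox in range(ow):
                # One output pixel = C_in * K * K multiply-accumulates;
                # the K x K input window is reused across output channels.
                win = x[:, oy*stride:oy*stride + k, ox*stride:ox*stride + k]
                y[co, oy, ox] = np.sum(win * w[co])
    return y
```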
Google proposed the second-generation tensor processing
unit (TPU) in 2017 [1]. Compared with a GPU, the TPU uses
low-precision (8-bit) arithmetic to reduce the number of
transistors needed per operation. The reduced precision has
little effect on the accuracy of deep learning models, but it
significantly lowers power consumption and speeds up
computation. The TPU also uses a systolic array to optimize
matrix multiplication and convolution operations and to reduce
I/O operations. In addition, the TPU uses a larger on-chip
memory to reduce accesses to DRAM for greater performance. The
key idea of the systolic array is that data flows through the
array of processing elements, which reduces memory accesses.
Another advantage of the systolic array is that, with the same
number of multipliers, the accelerator requires much less
memory bandwidth.
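To make the data-flow idea concrete, the following Python sketch simulates a small weight-stationary systolic array computing a matrix product: each processing element (PE) holds one weight, activations stream in from the left one skewed diagonal per cycle, and partial sums flow downward. The array organization and skewing here are a generic textbook arrangement, assumed for illustration rather than taken from the TPU design.

```python
import numpy as np

def systolic_matmul(A, B):
    """Cycle-by-cycle sketch of a weight-stationary systolic array.
    PE (i, j) holds weight B[i, j]; activations from A enter row i
    with a skew of i cycles and move right; partial sums move down
    and exit at the bottom of each column. Computes A @ B without
    any PE ever re-reading an operand from memory."""
    m, n = A.shape
    k = B.shape[1]
    a_reg = np.zeros((n, k))          # activation register of each PE
    p_reg = np.zeros((n, k))          # partial-sum register of each PE
    out = np.zeros((m, k))
    for t in range(m + n + k):        # enough cycles to fill and drain
        new_a = np.zeros_like(a_reg)
        new_p = np.zeros_like(p_reg)
        for i in range(n):
            for j in range(k):
                if j == 0:            # skewed injection at the left edge
                    r = t - i
                    a_in = A[r, i] if 0 <= r < m else 0.0
                else:                 # activation from the left neighbour
                    a_in = a_reg[i, j - 1]
                p_in = p_reg[i - 1, j] if i > 0 else 0.0
                new_a[i, j] = a_in
                new_p[i, j] = p_in + a_in * B[i, j]
        a_reg, p_reg = new_a, new_p
        for j in range(k):            # results drain from the bottom row
            r = t - (n - 1) - j
            if 0 <= r < m:
                out[r, j] = p_reg[n - 1, j]
    return out

A, B = np.random.rand(5, 4), np.random.rand(4, 3)
assert np.allclose(systolic_matmul(A, B), A @ B)
```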
In 2017, SenseTime proposed an FPGA-based fast
Winograd algorithm [2]. The key idea of the Winograd algorithm
is to trade multiplications for cheaper additions.
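As a concrete example of that trade, the 1-D Winograd transform F(2,3) produces two outputs of a 3-tap convolution with four multiplications instead of the six needed by the direct method (a 1.5× reduction; nesting it gives F(2×2, 3×3), which needs 16 instead of 36). The sketch below is the standard textbook formulation, not the exact datapath of the FPGA design in [2].

```python
def winograd_f23(d, g):
    """Winograd F(2,3): two outputs of a 3-tap filter over a 4-sample
    input tile with only 4 multiplications (direct method needs 6).
    The filter-side factors with /2 are computed once per kernel and
    reused for every tile, so they are not counted as per-tile work."""
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return [m1 + m2 + m3, m2 - m3 - m4]

# Check against the direct method: y[i] = sum_k d[i + k] * g[k]
d, g = [1.0, 2.0, 3.0, 4.0], [0.5, 1.0, -1.0]
direct = [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]
assert winograd_f23(d, g) == direct   # both give [-0.5, 0.0]
```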
In this paper, we propose a high-performance systolic
array accelerator (HSA) dedicated to convolutional networks.
To address the limited bandwidth of embedded applications while
also supporting the Winograd optimization, we combine the
systolic array (the TPU solution) and the broadcast array (the
NVDLA solution), regrouping the multipliers into a 4×4
multiply-accumulate (MAC) cell array. Each MAC cell contains 32
8-bit multipliers and outputs only one partial sum. The
accelerator supports 8-bit and 16-bit fixed-point