*Jun Wu is the corresponding author (e-mail: wujun@tongji.edu.cn).
A High-performance Systolic Array Accelerator Dedicated for CNN
Jing Shen, Haoqi Ren, Zhifeng Zhang, Jun Wu*, Wenqi Pan and Zhenyu Jiang
College of Electronics and Information Engineering, Tongji University,
Shanghai 201804, China
e-mail: {1732983, renhaoqi, zhangzf, wujun, 1810866, 1832925}@tongji.edu.cn
Abstract—The rapid development of artificial intelligence has
made the convolutional neural network (CNN) increasingly important.
Traditional CPU-based computing architectures cannot meet the
requirements of practical applications, so the development of a new
hardware computing platform for CNNs has become urgent. This paper
proposes a systolic array accelerator dedicated to CNNs. Because a
CNN model consists mainly of large numbers of simple, regular
operations, we optimize the convolution calculation module with a
systolic multiply-accumulate (MAC) array. We design three kinds of
convolution calculation mappings, which can handle convolutions of
different sizes. Our accelerator efficiently reuses the data in the
local storage area, which reduces data movement and improves
computing performance. To balance storage bandwidth against
computation speed, each convolutional layer is subdivided into
fine-grained tasks whose execution masks the latency of external
memory accesses. The accelerator also supports Winograd convolution
for 3×3 weight kernels. The heterogeneous system consisting of the
accelerator and the self-developed digital signal processor SWIFT
(SWIFT DSP) is verified on the FPGA platform. The experimental
results show that our accelerator outperforms traditional
accelerators under the same conditions.
Keywords-CNN; DNN; parallel computer architecture;
accelerator
I. INTRODUCTION
With the widespread use of deep learning in many fields,
its underlying model, the convolutional neural network (CNN),
has also received increasing attention. As networks grow deeper,
however, the scale of data movement keeps expanding and the
computational workload grows rapidly. Traditional processors were
not designed for the workloads of neural networks, which now
underlie many applications. Therefore, new architectures are
required to meet the growing computing demands.
Accelerators built for specific CNNs (such as AlexNet or
ZynqNet) are not general-purpose and are therefore unsuitable
for mass production. Since the TPU is not for sale, only GPUs
and FPGAs are available as training platforms for neural
networks, and both suffer from high power consumption and high
cost. Because the GPU is not a chip designed specifically for AI,
only a small part of its functionality suits deep learning, and
AI is unlikely to become its primary design target in the future.
FPGAs carry flexibility that common DNN functions do not need,
which makes them less efficient in both power and area for deep
neural network (DNN) calculations. The market demand and
application prospects for dedicated DNN accelerators are
therefore very broad.
NVIDIA proposed an AI chip architecture, the NVIDIA Deep
Learning Accelerator (NVDLA). NVDLA is a free and open
architecture that promotes a standard way to design deep
learning inference accelerators. With its modular design,
NVDLA is scalable, highly configurable, and built to simplify
integration and portability. It supports a wide range of
performance levels and readily scales from smaller,
cost-sensitive Internet of Things (IoT) devices to larger,
performance-oriented ones. Its convolution module uses three
parallel convolution calculation mapping modes (DC, WG and IMG)
to optimize the convolution pipeline and to reuse the local
storage area efficiently, which reduces the number of data
movements and increases the degree of parallelism along
multiple dimensions of the calculation.
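For reference, such mapping modes all start from the same direct-convolution loop nest, which the hardware then unrolls and tiles across its multiplier array. The minimal Python sketch below is our own illustration of the baseline that a DC-style mapping parallelizes; it is not taken from the NVDLA implementation.

```python
import numpy as np

def direct_conv(x, w, stride=1):
    """Baseline direct-convolution (DC) loop nest.
    x: input feature map [C_in, H, W]; w: weights [C_out, C_in, K, K].
    A DC-style mapping unrolls the C_in*K*K reduction for one output
    pixel across the multiplier array and runs output channels in
    parallel lanes."""
    c_out, c_in, k, _ = w.shape
    _, h, wd = x.shape
    oh = (h - k) // stride + 1
    ow = (wd - k) // stride + 1
    y = np.zeros((c_out, oh, ow))
    for co in range(c_out):          # parallel across MAC lanes in hardware
        for oy in range(oh):
            for ox in range(ow):
                # One output pixel = C_in * K * K multiply-accumulates;
                # the K x K input window is reused across output channels.
                win = x[:, oy*stride:oy*stride + k, ox*stride:ox*stride + k]
                y[co, oy, ox] = np.sum(win * w[co])
    return y
```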
Google proposed the second-generation tensor processing
unit (TPU) in 2017 [1]. Compared with a GPU, the TPU uses
low-precision (8-bit) arithmetic to reduce the number of
transistors needed per operation. The reduced precision has
little effect on the accuracy of deep learning models, but it
significantly lowers power consumption and speeds up
computation. The TPU also uses a systolic array to optimize
matrix multiplication and convolution operations and to reduce
I/O operations. In addition, the TPU uses a larger on-chip
memory to reduce accesses to DRAM for greater performance. The
key idea of the systolic array is that data flows through the
array of processing elements, which reduces memory accesses.
Another advantage of the systolic array is that, with the same
number of multipliers, the accelerator requires much less
memory bandwidth.
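To make the data-flow idea concrete, the following Python sketch simulates a small weight-stationary systolic array computing a matrix product: each processing element (PE) holds one weight, activations stream in from the left one skewed diagonal per cycle, and partial sums flow downward. The array organization and skewing here are a generic textbook arrangement, assumed for illustration rather than taken from the TPU design.

```python
import numpy as np

def systolic_matmul(A, B):
    """Cycle-by-cycle sketch of a weight-stationary systolic array.
    PE (i, j) holds weight B[i, j]; activations from A enter row i
    with a skew of i cycles and move right; partial sums move down
    and exit at the bottom of each column. Computes A @ B without
    any PE ever re-reading an operand from memory."""
    m, n = A.shape
    k = B.shape[1]
    a_reg = np.zeros((n, k))          # activation register of each PE
    p_reg = np.zeros((n, k))          # partial-sum register of each PE
    out = np.zeros((m, k))
    for t in range(m + n + k):        # enough cycles to fill and drain
        new_a = np.zeros_like(a_reg)
        new_p = np.zeros_like(p_reg)
        for i in range(n):
            for j in range(k):
                if j == 0:            # skewed injection at the left edge
                    r = t - i
                    a_in = A[r, i] if 0 <= r < m else 0.0
                else:                 # activation from the left neighbour
                    a_in = a_reg[i, j - 1]
                p_in = p_reg[i - 1, j] if i > 0 else 0.0
                new_a[i, j] = a_in
                new_p[i, j] = p_in + a_in * B[i, j]
        a_reg, p_reg = new_a, new_p
        for j in range(k):            # results drain from the bottom row
            r = t - (n - 1) - j
            if 0 <= r < m:
                out[r, j] = p_reg[n - 1, j]
    return out

A, B = np.random.rand(5, 4), np.random.rand(4, 3)
assert np.allclose(systolic_matmul(A, B), A @ B)
```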
In 2017, SenseTime proposed an FPGA-based fast
Winograd algorithm [2]. The key idea of the Winograd algorithm
is to trade multiplications for cheaper additions.
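As a concrete example of that trade, the 1-D Winograd transform F(2,3) produces two outputs of a 3-tap convolution with four multiplications instead of the six needed by the direct method (a 1.5× reduction; nesting it gives F(2×2, 3×3), which needs 16 instead of 36). The sketch below is the standard textbook formulation, not the exact datapath of the FPGA design in [2].

```python
def winograd_f23(d, g):
    """Winograd F(2,3): two outputs of a 3-tap filter over a 4-sample
    input tile with only 4 multiplications (direct method needs 6).
    The filter-side factors with /2 are computed once per kernel and
    reused for every tile, so they are not counted as per-tile work."""
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return [m1 + m2 + m3, m2 - m3 - m4]

# Check against the direct method: y[i] = sum_k d[i + k] * g[k]
d, g = [1.0, 2.0, 3.0, 4.0], [0.5, 1.0, -1.0]
direct = [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]
assert winograd_f23(d, g) == direct   # both give [-0.5, 0.0]
```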
In this paper, we propose a high-performance systolic
array accelerator (HSA) dedicated to convolutional networks.
To address the limited bandwidth of embedded applications while
also supporting the Winograd optimization, we combine the
systolic array (the TPU solution) and the broadcast array (the
NVDLA solution), regrouping the multipliers into a 4×4
multiply-accumulate (MAC) cell array. Each MAC cell contains 32
8-bit multipliers and outputs only one partial sum. The
accelerator supports 8-bit and 16-bit fixed-point