A Framework for Generating High Throughput CNN
Implementations on FPGAs
Hanqing Zeng
University of Southern California
Ming Hsieh Department of Electrical Engineering
zengh@usc.edu
Ren Chen
University of Southern California
Ming Hsieh Department of Electrical Engineering
renchen@usc.edu
Chi Zhang
University of Southern California
Department of Computer Science
zhan527@usc.edu
Viktor Prasanna
University of Southern California
Ming Hsieh Department of Electrical Engineering
prasanna@usc.edu
ABSTRACT
We propose a framework to generate highly efficient accelerators for inference on FPGAs. Our framework consists of multiple algorithmic optimizations for computation complexity and communication volume reduction, a mapping methodology for efficient resource utilization, and a tool for automatic Verilog generation. The algorithmic optimizations improve the throughput of frequency domain convolution so as to satisfy a given set of hardware constraints. While the Overlap-and-Add (OaA) technique is well known, it performs "wasted" computation at the edges. We propose a novel Concatenate-and-Pad (CaP) technique, which improves on OaA significantly by reducing the "wasted" computation on the padded pixels. The proposed CaP, used in conjunction with OaA, enables us to choose a fixed FFT size at design time and achieve low computation complexity for layers with various image sizes and kernel window sizes. We also develop a novel frequency domain loop tiling technique to further boost the throughput by improving data reuse. Our mapping methodology optimizes the architecture for the target device through fast design space exploration. We quantitatively categorize FPGAs by capturing their DSP resources, on-chip memory size and external memory bandwidth into a device coefficient. We identify the optimal architectural parameters based on the tradeoff between computation and communication cost. Our framework includes a tool to automatically generate fully synthesizable Verilog. We demonstrate the framework by generating high throughput accelerators for state-of-the-art CNN models on the Intel HARP heterogeneous platform. Using our framework, we achieve throughput of 780.6 GOPS, 669.1 GOPS and 552.1 GOPS for AlexNet, VGG16 and FCN-16s respectively. These correspond to 6.8× (AlexNet) and 4.9× (VGG16) improvement compared with the state-of-the-art implementations.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
FPGA’2018, February 25–27, 2018, Monterey, CA, USA
© 2018 Association for Computing Machinery.
ACM ISBN 978-1-4503-5614-5/18/02... $15.00
https://doi.org/10.1145/3174243.3174265
KEYWORDS
Convolutional Neural Networks; Algorithmic Optimization; Hardware Mapping; Software-Hardware Co-design; FPGA;
ACM Reference Format:
Hanqing Zeng, Ren Chen, Chi Zhang, and Viktor Prasanna. 2018. A Frame-
work for Generating High Throughput CNN Implementations on FPGAs. In
FPGA’18: 2018 ACM/SIGDA International Symposium on Field-Programmable
Gate Arrays, February 25–27, 2018, Monterey, CA, USA. ACM, New York, NY,
USA, 10 pages. https://doi.org/10.1145/3174243.3174265
1 INTRODUCTION
Convolutional Neural Networks (CNNs) are one of the most influential innovations in machine learning and computer vision [9, 15, 16]. With the proliferation of deep learning models, the complexity and diversity of state-of-the-art CNNs have increased significantly.
Several challenges exist in accelerating CNNs on FPGAs:
• Computation complexity: Convolution layers of CNNs perform computationally expensive operations.
• Hardware efficiency: Efficiently accelerating various convolution layers is hard, due to the large variation of CNN model parameters across layers. The problems to be addressed are:
  – Reconfiguration: Hardware runtime reconfiguration can potentially meet the diverse computational requirements of various layers. However, time and resource overhead are incurred to support this flexibility in hardware.
  – Wasted computation: Using fixed hardware for acceleration avoids reconfiguration overhead. However, a significant amount of computation can be wasted due to padding.
  – Data reuse: Given an on-chip memory of limited size, the accelerator needs to efficiently reuse on-chip data so as to reduce the communication volume to external memory.
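The computation complexity challenge above is what frequency domain convolution targets: an FFT, a pointwise product and an inverse FFT replace the per-pixel multiply-accumulates of spatial convolution. A minimal sketch of this equivalence follows; it is illustrative only (the function name is ours, and the paper's accelerator works on tiles rather than whole images):

```python
import numpy as np

def fft_conv2d(image, kernel):
    """'Valid' 2D convolution computed in the frequency domain.

    The kernel is zero-padded to the image size, both are transformed,
    multiplied pointwise, and transformed back. The circular convolution
    this produces equals linear convolution on the wrap-free region.
    """
    H, W = image.shape
    kh, kw = kernel.shape
    full = np.fft.irfft2(
        np.fft.rfft2(image) * np.fft.rfft2(kernel, s=(H, W)),
        s=(H, W))
    # keep only indices where circular wrap-around cannot occur
    return full[kh - 1:, kw - 1:]
```

For a kh × kw kernel, spatial convolution costs O(kh · kw) operations per output pixel, while the FFT-based form amortizes O(log) work per pixel, which is where the complexity reduction comes from.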
Motivated by the above challenges, we propose a framework to generate high throughput accelerators for diverse CNN models. The inputs of the framework are the CNN model parameters (image size, kernel filter window size, number of input and output feature maps) and the FPGA device metadata (DSP resources, on-chip memory size and external bandwidth). The output is the automatically generated architecture on the target device, specified in Verilog. To address the computation complexity challenge, our framework alleviates the computation burden of spatial convolution by frequency domain convolution. To address the hardware utilization challenge,