PipeCNN: An OpenCL-Based Open-Source FPGA
Accelerator for Convolution Neural Networks
Dong Wang, Ke Xu and Diankun Jiang
Institute of Information Science
Beijing Jiaotong University
Beijing 100044, China
Email: {wangdong, 17112071, 16125141}@bjtu.edu.cn
Abstract—Convolutional neural networks (CNNs) have been
employed in many applications, such as image classification, video
analysis and speech recognition. Being compute-intensive, CNNs
are widely accelerated by GPUs, at the cost of high power dissipation.
Recently, studies have explored FPGAs as CNN accelerators
because of their reconfigurability and their advantage in energy
efficiency over GPUs, especially now that OpenCL-based high-level
synthesis tools provide fast verification and implementation flows.
In this paper, we demonstrate PipeCNN, an efficient FPGA
accelerator that can be implemented on a variety of FPGA
platforms with reconfigurable performance and cost. The PipeCNN
project is openly accessible, and can thus be used either by
researchers as a generic framework to explore new hardware
architectures or by teachers as an off-the-shelf design example
for academic courses related to FPGAs.
I. INTRODUCTION
Convolutional neural network (CNN) [1], [2], as an emerging
deep learning architecture, has received significant attention
in various applications, such as video surveillance, image
search, speech recognition, and robot vision. Currently,
GPUs are widely adopted as hardware accelerators for training
deep neural networks. Yet they are generally energy-inefficient
for embedded applications. FPGAs, which provide massive
processing elements, reconfigurable interconnections and lower
power dissipation, are naturally suited to implementing neural
network circuits. Moreover, FPGAs can also flexibly support
reduced data precision at the circuit level, which lowers the
memory footprint and bandwidth requirements, resulting in
better energy efficiency than GPUs.
Studies such as [4], [5] have reported efficient CNN accelerators
on embedded FPGA platforms. However, the traditional
register-transfer-level (RTL) design flows adopted in these
studies require deep background knowledge in digital circuit
design and great effort in writing complex RTL code, along
with time-consuming simulation and compilation, before
one can actually run an accelerator on hardware. Given the rapid
development of deep learning, these unfriendly features of the
RTL-based design flow hinder domain experts from utilizing
FPGAs to explore new architectures for neural network
accelerators.
This work was supported by NNSF of China Grants No. 61574013 and 61532005.

[Figure: deeply pipelined OpenCL kernels (MemRD, Conv, Pooling, LRN, MemWR), labeled as NDRange and single-threaded kernels, connected by channel pipes and to global memory.]

Fig. 1. The top-level architecture of PipeCNN.

High-Level Synthesis (HLS) tools, which enable automatic
compilation from high-level programs (C/C++) to low-level
RTL specifications, have become increasingly popular in both
academia and industry. Compared with the traditional
methodology, HLS tools offer a faster hardware development
cycle and software-friendly programming interfaces that can
be easily integrated with user applications [3].
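To give a flavor of the HLS programming model, the following C-level sketch (illustrative only; the function name and shape are assumptions, not taken from PipeCNN's source) shows the multiply-accumulate loop at the heart of a convolution kernel, the kind of loop an OpenCL HLS compiler can unroll and pipeline into RTL:

```c
#include <stddef.h>

/* Multiply-accumulate loop as it would appear inside an OpenCL
 * kernel body. An HLS compiler can unroll this loop and pipeline
 * it into a chain of DSP blocks, with no hand-written RTL.
 * (Illustrative sketch; not PipeCNN's actual code.) */
float mac_loop(const float *window, const float *weights, size_t taps)
{
    float acc = 0.0f;
    for (size_t i = 0; i < taps; ++i)  /* candidate for #pragma unroll */
        acc += window[i] * weights[i];
    return acc;
}
```

In an RTL flow, the same computation would require explicitly described registers, control logic and timing; in the HLS flow, the compiler derives them from this loop.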
In this paper, we introduce PipeCNN, an efficient OpenCL-based
CNN accelerator for FPGAs. A set of configurable
OpenCL kernels is designed to accelerate a wide range of
neural network models. Throughput and memory bandwidth
optimization schemes are also presented and discussed. All
the design files are openly accessible and can be downloaded
from [6].
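As one way such kernels can be made configurable (a hypothetical sketch; the name VEC_SIZE and this exact scheme are assumptions, not necessarily PipeCNN's actual parameters), a compile-time vectorization factor lets the same source trade FPGA resources for throughput:

```c
#include <stddef.h>

/* Hypothetical compile-time knob: how many multiply-accumulate
 * units the synthesized kernel instantiates per clock cycle.
 * Larger values consume more DSP blocks but raise throughput. */
#ifndef VEC_SIZE
#define VEC_SIZE 4
#endif

/* Dot product processed VEC_SIZE elements at a time, mirroring
 * how a parameterized OpenCL kernel scales between small and
 * large FPGAs. len is assumed to be a multiple of VEC_SIZE. */
float dot_vectorized(const float *a, const float *b, size_t len)
{
    float acc = 0.0f;
    for (size_t i = 0; i < len; i += VEC_SIZE) {
        float partial = 0.0f;
        for (size_t v = 0; v < VEC_SIZE; ++v)  /* fully unrolled in HLS */
            partial += a[i + v] * b[i + v];
        acc += partial;
    }
    return acc;
}
```

Recompiling with a different VEC_SIZE changes the synthesized area and throughput without touching the algorithm, which is the kind of reconfigurable performance/cost trade-off the accelerator targets.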
In the final demo, PipeCNN was implemented and evaluated
on three different FPGA platforms: Cyclone-V SEA5 SoC,
Stratix-V GXA7 and Arria-10 AX115. CNN-based image
classification applications were accelerated by PipeCNN.
The processing speed and power consumption were measured
and demonstrated at runtime, showing scalable performance
and cost that can meet different application requirements
and resource constraints.
II. ARCHITECTURE DESIGN AND OPTIMIZATION
A. Accelerator Architecture
As shown in Fig. 1, PipeCNN consists of a group of
OpenCL kernels that are cascaded using Altera's OpenCL
Channels extension. Two data mover kernels, namely MemRD