FINN: A Framework for Fast, Scalable Binarized Neural
Network Inference
Yaman Umuroglu*†, Nicholas J. Fraser*‡, Giulio Gambardella*, Michaela Blott*, Philip Leong‡, Magnus Jahre† and Kees Vissers*
*Xilinx Research Labs; †Norwegian University of Science and Technology; ‡University of Sydney
yamanu@idi.ntnu.no
To appear in the 25th International Symposium on Field-Programmable Gate Arrays, February 2017.
ABSTRACT
Research has shown that convolutional neural networks con-
tain significant redundancy, and high classification accuracy
can be obtained even when weights and activations are re-
duced from floating point to binary values. In this paper,
we present Finn, a framework for building fast and flexible
FPGA accelerators using a heterogeneous streaming
architecture. By utilizing a novel set of optimizations
that enable efficient mapping of binarized neural networks
to hardware, we implement fully connected, convolutional
and pooling layers, with per-layer compute resources being
tailored to user-provided throughput requirements. On a
ZC706 embedded FPGA platform drawing less than 25 W
total system power, we demonstrate up to 12.3 million image
classifications per second with 0.31 µs latency on the MNIST
dataset with 95.8% accuracy, and 21906 image classifications
per second with 283 µs latency on the CIFAR-10 and SVHN
datasets with 80.1% and 94.9% accuracy, respectively. To
the best of our knowledge, ours are the fastest classification
rates reported to date on these benchmarks.
1. INTRODUCTION
Convolutional Neural Networks (CNNs) have dramatically
improved in recent years, their performance now exceeding
that of other visual recognition algorithms [14], and even sur-
passing human accuracy on certain problems [23, 28]. They
are likely to play an important role in enabling ubiquitous
machine vision and intelligence on all kinds of devices, but a
significant computational challenge remains. Modern CNNs
may contain millions of floating-point parameters and require
billions of floating-point operations to recognize a single im-
age. Furthermore, these requirements tend to increase as re-
searchers explore deeper networks. For instance, AlexNet [14]
(the winning entry in the ImageNet Large Scale Visual Recognition
Challenge (ILSVRC) [22] in 2012) required 244 MB of
parameters and 1.4 billion floating-point operations (GFLOP)
per image, while VGG-16 [24] from ILSVRC 2014 required
552 MB of parameters and 30.8 GFLOP per image.
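These figures assume 32-bit floating-point parameters. As a rough back-of-the-envelope illustration (our arithmetic, not a result from this paper), reducing each weight to a single bit, as the binarized networks discussed below do, shrinks the parameter footprint by roughly 32×:
\[
\frac{244\,\mathrm{MB}}{4\,\mathrm{B/weight}} \approx 61\,\mathrm{M\ weights} \;\Rightarrow\; \approx 7.6\,\mathrm{MB\ at\ 1\ bit/weight},
\qquad
\frac{552\,\mathrm{MB}}{4\,\mathrm{B/weight}} \approx 138\,\mathrm{M\ weights} \;\Rightarrow\; \approx 17\,\mathrm{MB\ at\ 1\ bit/weight}.
\]
Footprints of this size begin to fit in the on-chip memory of modern FPGAs, which motivates the approach described next.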
While the vast majority of CNN implementations use
floating-point parameters, a growing body of research demon-
strates that this approach incorporates significant redundancy.
Recently, it has been shown [5, 26, 21, 12, 31] that neu-
ral networks can classify accurately using one- or two-bit
quantization for weights and activations. Such a combina-
tion of low-precision arithmetic and a small memory footprint
presents a unique opportunity for fast and energy-efficient
image classification using Field-Programmable Gate Arrays
(FPGAs). FPGAs have much higher theoretical peak per-
formance for binary operations compared to floating point,
while the small memory footprint removes the off-chip mem-
ory bottleneck by keeping parameters on-chip, even for large
networks. Binarized Neural Networks (BNNs), proposed by
Courbariaux et al. [5], are particularly appealing since they
can be implemented almost entirely with binary operations,
with the potential to attain performance in the teraoperations
per second (TOPS) range on FPGAs.
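To make the "almost entirely binary operations" claim concrete, below is a minimal sketch of the standard XNOR-popcount formulation of a binarized dot product with a thresholded activation. The bit encoding (1 for +1, 0 for -1), the helper names binary_dot and binary_neuron, and the use of a compiler popcount builtin are illustrative choices on our part, not Finn's actual HLS implementation.

#include <cstdint>
#include <cstddef>

// Signed dot product of two {-1,+1} vectors packed one element per bit
// (bit 1 encodes +1, bit 0 encodes -1). Assumes nbits is a multiple of 64
// so that no padding bits pollute the popcount.
int binary_dot(const uint64_t *w, const uint64_t *a, std::size_t nwords, int nbits) {
    int matches = 0;
    for (std::size_t i = 0; i < nwords; ++i) {
        // XNOR marks positions where weight and activation agree; count them.
        matches += __builtin_popcountll(~(w[i] ^ a[i]));
    }
    // Each agreement contributes +1 to the dot product, each disagreement -1.
    return 2 * matches - nbits;
}

// A binarized neuron: compare the dot product against a precomputed threshold
// (which can absorb bias and batch normalization) to produce a 1-bit output.
bool binary_neuron(const uint64_t *w, const uint64_t *a,
                   std::size_t nwords, int nbits, int threshold) {
    return binary_dot(w, a, nwords, nbits) >= threshold;
}

No multipliers or floating-point units are involved, which is why FPGA lookup tables can sustain very high binary operation counts.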
In this work, we propose Finn, a framework for build-
ing scalable and fast BNN inference accelerators on FPGAs.
Finn-generated accelerators can perform millions of classi-
fications per second with sub-microsecond latency, thereby
making them ideal for supporting real-time embedded appli-
cations such as augmented reality, autonomous driving and
robotics. Compute resources can be scaled to meet a given
classification rate requirement. We demonstrate Finn’s capa-
bilities with a series of prototypes for classifying the MNIST,
SVHN and CIFAR-10 benchmark datasets. Our classification
rate results surpass the best previously published results by
over 48× for MNIST, 2.2× for CIFAR-10 and 8× for SVHN.
To the best of our knowledge, this is the fastest reported
neural network inference implementation on these datasets.
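The classification-rate scaling mentioned above follows the usual behavior of a streaming pipeline: throughput is limited by the slowest stage. In the notation below (ours, not the paper's),
\[
\mathrm{FPS} \approx \frac{f_{\mathrm{clk}}}{\max_{l}\, II_{l}},
\]
where $f_{\mathrm{clk}}$ is the accelerator clock frequency and $II_l$ is the number of clock cycles layer $l$ needs to process one image; allocating more compute resources to the slowest layers reduces $\max_l II_l$ and thus raises the achievable classification rate.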
The novel contributions are:
• Quantification of peak performance for BNNs on FPGAs using a roofline model (the roofline relation is sketched after this list).
• A set of novel optimizations for mapping BNNs onto FPGAs more efficiently.
• A BNN architecture and accelerator construction tool, permitting customization of throughput.
• A range of prototypes that demonstrate the potential of BNNs on an off-the-shelf FPGA platform.
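For reference, the roofline model behind the first contribution bounds attainable performance by both peak compute and memory traffic (notation ours):
\[
P_{\mathrm{attainable}} = \min\left(P_{\mathrm{peak}},\; I \cdot B_{\mathrm{mem}}\right),
\]
where $P_{\mathrm{peak}}$ is the device's peak compute throughput (here, binary operations per second), $B_{\mathrm{mem}}$ is the off-chip memory bandwidth, and $I$ is the arithmetic intensity in operations per byte transferred; Section 3 applies this bound to BNNs on FPGAs.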
The rest of this paper is organized as follows: Section 2
provides background on CNNs, BNNs, and their hardware
implementations. Section 3 discusses BNN accuracy and
peak performance on FPGAs. Section 4 describes Finn’s