Laius: An 8-bit Fixed-point CNN Hardware Inference Engine
Zhisheng Li*¹, Lei Wang¹, Shasha Guo¹, Yu Deng¹, Qiang Dou¹, Haifang Zhou¹, Wenyuan Lu²
¹School of Computer Science, National University of Defense Technology, Changsha, China
²Xi'an Satellite Monitoring and Control Center, Xi'an, China
*lizhsh_123@163.com
Abstract—Convolutional Neural Network (CNN) is one of the most effective neural network models for many classification tasks, such as voice recognition, computer vision and biological information processing. Unfortunately, CNN computation is both memory-intensive and computation-intensive, which poses a huge challenge to the design of hardware accelerators. A large number of hardware accelerators for CNN inference have been designed by industry and academia. Most of these engines are based on 32-bit floating-point matrix multiplication, where the data precision is over-provisioned for the inference job and the hardware cost is too high.
In this paper, an 8-bit fixed-point LeNet inference engine (Laius) is designed and implemented on FPGA. To reduce FPGA resource consumption, we propose a methodology to find the optimal bit length for weights and biases in LeNet, which results in using 8-bit fixed point for most of the computation and 16-bit fixed point for the rest. A PE (Processing Element) design is proposed, and pipelining and PE tiling techniques are used to improve the performance of the inference engine. Through theoretical analysis, we conclude that DSPs are the most critical FPGA resource and should be used carefully during the design process. We implement the inference engine on a Xilinx 485t FPGA. Experimental results show that the designed LeNet inference engine achieves 44.9 GOPS throughput with 8-bit fixed-point operation after pipelining. Moreover, with only 1% loss of accuracy, the 8-bit fixed-point engine reduces latency by 31.43%, LUT consumption by 87.01%, BRAM consumption by 66.50%, DSP consumption by 65.11%, and power by 47.95% compared to a 32-bit fixed-point inference engine with the same structure.
Index Terms—CNN accelerator, FPGA, LeNet, Inference, Implementation
I. INTRODUCTION
Convolutional Neural Network (CNN) is one of the most important models for fitting complex and non-linear data, with wide applications ranging from voice recognition and computer vision to biological information processing. Compared to many traditional algorithms, CNNs often achieve superior performance and accuracy. LeNet, a representative CNN, was proposed for recognizing hand-written digits. Due to its attractive feature extraction ability, LeNet has been widely used in many real-world applications.
Despite these advantages, LeNet, like nearly all CNNs, is computation-intensive and memory-intensive. As the complexity of real-world applications increases, these limitations become more serious. An enormous amount of computation has to be finished with limited memory, which poses a huge challenge to existing CNN hardware accelerator architectures.
GPU, ASIC and FPGA are the three main hardware platforms for CNN acceleration. The advantages of FPGA are its short development cycle and high flexibility, and much effort has been devoted to making full use of it. Work [9] presents a roofline model that quantitatively models bandwidth, resources and throughput. In work [1] [2], the authors use the roofline model as a guide to achieve a better trade-off among bandwidth, throughput and resource utilization. In work [3] [8] [16], the authors propose different computing units to improve computational efficiency and throughput. In work [12] [13], a hierarchical approach is proposed and implemented, making resource allocation more fine-grained and improving resource utilization. In addition, several studies show that low-precision training and inference can improve many aspects of performance with tolerable accuracy loss. In work [4] [6], the authors propose a low-precision training method and a data compression method to reduce bandwidth pressure and power consumption, respectively. Unfortunately, the existing work above targets only the convolutional layers of CNNs, not the mapping of the entire network.
In this paper, Laius, a LeNet hardware inference engine with low-precision (8-bit fixed-point) operation, is designed and implemented to explore the trade-off between accuracy and resource consumption. The input of Laius is the pixels of a given picture. The data and weights are processed layer by layer, and the results are written to BRAM. The layers of the engine are arranged in the same order as the layers of LeNet. To improve performance, PE tiling and weight splitting are leveraged to exploit computational parallelism. Using the derivation of mathematical models and ping-pong optimization, we implement a four-stage pipeline to improve throughput.
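To make the fixed-point arithmetic concrete, the following is a minimal NumPy sketch of signed 8-bit quantization and an integer multiply-accumulate of the kind a PE performs. The function names and the choice of 5 fractional bits are illustrative assumptions, not the engine's actual parameters; the methodology for choosing the optimal bit length is discussed later in the paper.

```python
import numpy as np

def quantize(x, frac_bits, word_bits=8):
    """Quantize a float array to signed fixed point with `frac_bits`
    fractional bits, saturating to the `word_bits` two's-complement range."""
    lo = -(1 << (word_bits - 1))          # e.g. -128 for 8 bits
    hi = (1 << (word_bits - 1)) - 1       # e.g. +127 for 8 bits
    q = np.round(x * (1 << frac_bits))    # scale by 2^frac_bits and round
    return np.clip(q, lo, hi).astype(np.int32)

# 8-bit weights and activations, both with 5 fractional bits (assumed format).
w = quantize(np.array([0.25, -0.5, 0.125]), frac_bits=5)
a = quantize(np.array([1.0, 0.75, -0.5]), frac_bits=5)

# Integer MAC: each product carries 2*frac_bits fractional bits,
# so the accumulator is rescaled by 2^(5+5) to recover the real value.
acc = int(np.dot(w, a))
result = acc / (1 << 10)
print(result)  # -0.1875, matching the float dot product exactly here
```

Because the accumulator holds products at double the fractional precision, a wider (e.g. 16-bit or more) register is needed for accumulation even when operands are 8-bit, which is one reason the paper keeps some computation at 16-bit fixed point.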
The main contributions of this paper are:
1) It designs and implements Laius, an inference engine for LeNet with 8-bit fixed-point precision. With only 1% loss of accuracy, this 8-bit fixed-point engine largely reduces hardware resource consumption compared to a 32-bit fixed-point engine of the same structure.
2) With the help of our in-house fixed-point CNN training
framework and the one-to-one inference engine simulator,