Laius: An 8-bit Fixed-point CNN Hardware Inference Engine
Zhisheng Li*¹, Lei Wang¹, Shasha Guo¹, Yu Deng¹, Qiang Dou¹, Haifang Zhou¹, Wenyuan Lu²
¹School of Computer Science, National University of Defense Technology, Changsha, China
²Xi'an Satellite Monitoring and Control Center, Xi'an, China
*lizhsh_123@163.com
Abstract—Convolutional Neural Network (CNN) is one of the most effective neural network models for many classification tasks, such as voice recognition, computer vision and biological information processing. Unfortunately, CNN computation is both memory-intensive and computation-intensive, which poses a huge challenge to the design of hardware accelerators. A large number of hardware accelerators for CNN inference have been designed by industry and academia. Most of these engines are based on 32-bit floating-point matrix multiplication, where the data precision is over-provisioned for the inference job and the hardware cost is too high.
In this paper, an 8-bit fixed-point LeNet inference engine (Laius) is designed and implemented on FPGA. To reduce FPGA resource consumption, we propose a methodology to find the optimal bit length for weights and biases in LeNet, which results in using 8-bit fixed point for most of the computation and 16-bit fixed point for the rest. A PE (Processing Element) design is proposed, and pipelining and PE tiling techniques are used to improve the performance of the inference engine. Through theoretical analysis, we conclude that DSPs are the most critical FPGA resource and should be used carefully during the design process. We implement the inference engine on a Xilinx 485t FPGA. Experimental results show that the designed LeNet inference engine achieves 44.9 GOPS throughput with 8-bit fixed-point operation after pipelining. Moreover, with only 1% loss of accuracy, the 8-bit fixed-point engine reduces latency by 31.43%, LUT consumption by 87.01%, BRAM consumption by 66.50%, DSP consumption by 65.11%, and power by 47.95% compared to a 32-bit fixed-point inference engine with the same structure.
Index Terms—CNN accelerator, FPGA, LeNet, Inference, Implementation
I. INTRODUCTION
Convolutional Neural Network (CNN) is one of the most important models for fitting complex and non-linear data, with wide applications ranging from voice recognition and computer vision to biological information processing. Compared to many traditional algorithms, CNNs often achieve superior performance and accuracy. LeNet, a representative CNN, was proposed for recognizing hand-written digits. Due to its attractive feature extraction ability, LeNet has been widely used in many real-world applications.
Despite these advantages, LeNet, like nearly all CNNs, is computation-intensive and memory-intensive. As the complexity of real-world applications increases, these limitations become more serious. An enormous amount of computation has to be finished with limited memory, which poses a huge challenge to existing CNN hardware accelerator architectures.
GPU, ASIC and FPGA are the three main hardware platforms for CNN acceleration. The advantages of FPGA are its short development cycle and high flexibility, and much effort has been devoted to making full use of it. Work [9] presents a roofline model that quantitatively models bandwidth, resources and throughput. In work [1] [2], the authors use the roofline model as a guide to achieve a better trade-off among bandwidth, throughput and resource utilization. In work [3] [8] [16], the authors propose different computing units to improve computational efficiency and throughput. In work [12] [13], a hierarchical approach is proposed and implemented, making resource allocation more fine-grained and improving resource utilization. In addition, several studies show that low-precision training and inference can improve many aspects of performance with tolerable accuracy loss. In work [4] [6], the authors propose a low-precision training method and a data compression method to reduce bandwidth pressure and power consumption, respectively. Unfortunately, the existing work above targets only the convolutional layers of CNNs, not the mapping of the entire network.
In this paper, Laius, a LeNet hardware inference engine with low-precision (8-bit fixed-point) operation, is designed and implemented to explore the trade-off between accuracy and resource consumption. The input of Laius is the pixels of a given picture. The data and weights are processed layer by layer, and the results are written to BRAM. The layers of the engine are arranged in the same order as the layers of LeNet. To improve performance, PE tiling and weight splitting are leveraged to exploit computational parallelism. Using the derivation of mathematical models and ping-pong optimization, we implement a four-stage pipeline to improve throughput.
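To make the fixed-point arithmetic concrete, the following is a minimal NumPy sketch of signed 8-bit quantization and an integer multiply-accumulate of the kind a PE performs. The function names and the choice of 5 fractional bits are illustrative assumptions, not the engine's actual parameters; the methodology for choosing the optimal bit length is discussed later in the paper.

```python
import numpy as np

def quantize(x, frac_bits, word_bits=8):
    """Quantize a float array to signed fixed point with `frac_bits`
    fractional bits, saturating to the `word_bits` two's-complement range."""
    lo = -(1 << (word_bits - 1))          # e.g. -128 for 8 bits
    hi = (1 << (word_bits - 1)) - 1       # e.g. +127 for 8 bits
    q = np.round(x * (1 << frac_bits))    # scale by 2^frac_bits and round
    return np.clip(q, lo, hi).astype(np.int32)

# 8-bit weights and activations, both with 5 fractional bits (assumed format).
w = quantize(np.array([0.25, -0.5, 0.125]), frac_bits=5)
a = quantize(np.array([1.0, 0.75, -0.5]), frac_bits=5)

# Integer MAC: each product carries 2*frac_bits fractional bits,
# so the accumulator is rescaled by 2^(5+5) to recover the real value.
acc = int(np.dot(w, a))
result = acc / (1 << 10)
print(result)  # -0.1875, matching the float dot product exactly here
```

Because the accumulator holds products at double the fractional precision, a wider (e.g. 16-bit or more) register is needed for accumulation even when operands are 8-bit, which is one reason the paper keeps some computation at 16-bit fixed point.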
The main contributions of this paper are:
1) It designs and implements Laius, an inference engine for LeNet with 8-bit fixed-point precision. With only 1% loss of accuracy, this 8-bit fixed-point engine largely reduces hardware resource consumption compared to a 32-bit fixed-point engine of the same structure.
2) With the help of our in-house fixed-point CNN training
framework and the one-to-one inference engine simulator,