A Framework for Generating High Throughput CNN
Implementations on FPGAs
Hanqing Zeng
University of Southern California
Ming Hsieh Department of Electrical Engineering
zengh@usc.edu
Ren Chen
University of Southern California
Ming Hsieh Department of Electrical Engineering
renchen@usc.edu
Chi Zhang
University of Southern California
Department of Computer Science
zhan527@usc.edu
Viktor Prasanna
University of Southern California
Ming Hsieh Department of Electrical Engineering
prasanna@usc.edu
ABSTRACT
We propose a framework to generate highly efficient accelerators for inference on FPGAs. Our framework consists of multiple algorithmic optimizations for computation complexity and communication volume reduction, a mapping methodology for efficient resource utilization, and a tool for automatic Verilog generation. The algorithmic optimizations improve the throughput of frequency domain convolution so as to satisfy a given set of hardware constraints. While the Overlap-and-Add (OaA) technique is well known, it performs "wasted" computation at the edges. We propose a novel Concatenate-and-Pad (CaP) technique, which improves on OaA significantly by reducing the "wasted" computation on the padded pixels. The proposed CaP, used in conjunction with OaA, enables us to choose a fixed FFT size at design time and achieve low computation complexity for layers with various image sizes and kernel window sizes. We also develop a novel frequency domain loop tiling technique to further boost the throughput by improving data reuse. Our mapping methodology optimizes the architecture for the target device through fast design space exploration. We quantitatively categorize FPGAs by capturing their DSP resources, on-chip memory size and external memory bandwidth into a device coefficient. We identify the optimal architectural parameters based on the tradeoff between computation and communication cost. Our framework includes a tool to automatically generate fully synthesizable Verilog. We demonstrate the framework by generating high throughput accelerators for state-of-the-art CNN models on the Intel HARP heterogeneous platform. Using our framework, we achieve throughput of 780.6 GOPS, 669.1 GOPS and 552.1 GOPS for AlexNet, VGG16 and FCN-16s respectively. These correspond to 6.8× (AlexNet) and 4.9× (VGG16) improvement compared with the state-of-the-art implementations.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
FPGA’2018, February 25–27, 2018, Monterey, CA, USA
© 2018 Association for Computing Machinery.
ACM ISBN 978-1-4503-5614-5/18/02... $15.00
https://doi.org/10.1145/3174243.3174265
KEYWORDS
Convolutional Neural Networks; Algorithmic Optimization; Hardware Mapping; Software-Hardware Co-design; FPGA;
ACM Reference Format:
Hanqing Zeng, Ren Chen, Chi Zhang, and Viktor Prasanna. 2018. A Frame-
work for Generating High Throughput CNN Implementations on FPGAs. In
FPGA’18: 2018 ACM/SIGDA International Symposium on Field-Programmable
Gate Arrays, February 25–27, 2018, Monterey, CA, USA. ACM, New York, NY,
USA, 10 pages. https://doi.org/10.1145/3174243.3174265
1 INTRODUCTION
Convolutional Neural Networks (CNNs) are one of the most influential innovations in machine learning and computer vision [9, 15, 16]. With the proliferation of deep learning models, the complexity and diversity of state-of-the-art CNNs have increased significantly.
Several challenges exist in accelerating CNNs on FPGAs:
• Computation complexity: Convolution layers of CNNs perform computationally expensive operations.
• Hardware efficiency: Efficiently accelerating various convolution layers is hard, due to the large variation of CNN model parameters across layers. The problems to be addressed are:
  – Reconfiguration: Hardware runtime reconfiguration can potentially meet the diverse computational requirements of various layers. However, time and resource overhead are incurred to support this flexibility in hardware.
  – Wasted computation: Using fixed hardware for acceleration avoids reconfiguration overhead. However, a significant amount of computation can be wasted due to padding.
  – Data reuse: Given an on-chip memory of limited size, the accelerator needs to efficiently reuse on-chip data so as to reduce the communication volume to external memory.
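The computation complexity challenge above is what frequency domain convolution targets: an FFT, a pointwise product and an inverse FFT replace the per-pixel multiply-accumulates of spatial convolution. A minimal sketch of this equivalence follows; it is illustrative only (the function name is ours, and the paper's accelerator works on tiles rather than whole images):

```python
import numpy as np

def fft_conv2d(image, kernel):
    """'Valid' 2D convolution computed in the frequency domain.

    The kernel is zero-padded to the image size, both are transformed,
    multiplied pointwise, and transformed back. The circular convolution
    this produces equals linear convolution on the wrap-free region.
    """
    H, W = image.shape
    kh, kw = kernel.shape
    full = np.fft.irfft2(
        np.fft.rfft2(image) * np.fft.rfft2(kernel, s=(H, W)),
        s=(H, W))
    # keep only indices where circular wrap-around cannot occur
    return full[kh - 1:, kw - 1:]
```

For a kh × kw kernel, spatial convolution costs O(kh · kw) operations per output pixel, while the FFT-based form amortizes O(log) work per pixel, which is where the complexity reduction comes from.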
Motivated by the above challenges, we propose a framework to generate high throughput accelerators for diverse CNN models. The inputs of the framework are the CNN model parameters (image size, kernel filter window size, number of input and output feature maps) and the FPGA device metadata (DSP resources, on-chip memory size and external bandwidth). The output is the automatically generated architecture on the target device, specified in Verilog. To address the computation complexity challenge, our framework alleviates the computation burden of spatial convolution by frequency domain convolution. To address the hardware utilization challenge,