rate of neighbouring engines. In this manner, the overall architecture is tailored to the particular
network. With emphasis placed on BNNs, the computation engines differ from conventional CNN
hardware designs and are optimised for the efficient mapping of binarised layers, including dedicated
hardware for binarised convolutions, max pooling and batch normalisation [40]. Finn expresses
binarised convolutions as matrix-vector operations followed by thresholding. To this end, the
integral block of the architecture is the Matrix-Vector-Threshold Unit (MVTU), which is optimised
to perform the majority of the core binarised operations. In terms of scheduling, Finn's approach lies
closer to fpgaConvNet's synchronous dataflow scheme and farther from DeepBurning's dynamic
dataflow, with static schedules generated at compile time. Finally, in contrast to fpgaConvNet and
DeepBurning, and similarly to Haddoc2, all the binarised weights are required to be stored on-chip,
with external memory transfers restricted to the input and output of the network, which imposes
a hard limit on the size of networks that can be addressed.
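To make this lowering concrete, the sketch below is an illustrative reconstruction, not Finn's actual implementation: it computes one output pixel of a binarised convolution as a matrix-vector product over bipolar {-1, +1} values followed by per-channel thresholding, with the thresholds assumed to fold in batch normalisation. In hardware, the dot products would typically be realised as XNOR-popcount operations rather than the integer arithmetic used here for clarity.

```python
import numpy as np

def binarised_conv_as_matvec(weights, window, thresholds):
    # weights   : (num_output_channels, window_size) in {-1, +1}
    # window    : (window_size,) flattened bipolar input patch
    # thresholds: (num_output_channels,) assumed to absorb batch normalisation
    accum = weights @ window                        # matrix-vector product
    return (accum >= thresholds).astype(np.int8)    # per-channel thresholding

# Toy usage: 4 output channels over a flattened 3x3x2 window (18 values).
rng = np.random.default_rng(0)
W = rng.choice([-1, 1], size=(4, 18))
x = rng.choice([-1, 1], size=18)
t = np.zeros(4)                                     # placeholder thresholds
print(binarised_conv_as_matvec(W, x, t))
```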
Single computation engines:
This design approach favours flexibility over customisation. Such
an architecture comprises a single computation engine, typically in the form of a systolic array of
processing elements or a matrix multiplication unit, that executes the CNN layers sequentially. The
control of the hardware and the scheduling of operations are performed by software (Fig. 2). This
design paradigm consists of a fixed architectural template which can be scaled based on the input
CNN and the available FPGA resources. With this scheme, each CNN corresponds to a different
sequence of microinstructions that are executable by the hardware. By taking this approach to
the extreme, the architecture can be configured and scaled based only on the resources of the
target FPGA without targeting a specific CNN and, as a result, after a single compilation, the same
bitstream can target many CNNs without the overhead of bitstream-level reconfiguration. Despite
the flexibility gains, inefficiencies are introduced due to control mechanisms that resemble those
of a processor [27]. Moreover, the one-size-fits-all approach can lead to high variability in the
achieved performance across CNNs with different workload characteristics.
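As a rough illustration of this paradigm, the following sketch uses a made-up, highly simplified instruction format: the fixed engine is modelled as a matrix-multiply primitive plus simple post-processing, and two different networks are executed as two different instruction streams on the same "hardware", without any reconfiguration.

```python
import numpy as np

def run_on_engine(program, x):
    # Fixed engine modelled as a matrix-multiply primitive; layers are
    # executed sequentially, one instruction at a time, under software control.
    for instr in program:
        if instr["op"] == "matmul":        # CONV/FC lowered to a matrix product
            x = instr["weights"] @ x
        elif instr["op"] == "relu":
            x = np.maximum(x, 0.0)
        elif instr["op"] == "pool":        # 1D max pooling, for illustration only
            x = x.reshape(-1, instr["size"]).max(axis=1)
    return x

# Two different "CNNs" reuse the same engine without reconfiguration.
rng = np.random.default_rng(0)
net_a = [{"op": "matmul", "weights": rng.standard_normal((8, 16))},
         {"op": "relu"},
         {"op": "pool", "size": 2}]
net_b = [{"op": "matmul", "weights": rng.standard_normal((4, 16))},
         {"op": "relu"}]
x = rng.standard_normal(16)
print(run_on_engine(net_a, x).shape)   # (4,)
print(run_on_engine(net_b, x).shape)   # (4,)
```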
1) Angel-Eye: The design principle behind the Angel-Eye framework is based on having a single
flexible computation engine which can be programmed and controlled by software. The main
computational component is an array of Processing Elements (PEs), with each PE containing a bank
of convolvers, an adder tree and an optional pooling path. The input feature maps of a CONV layer
are shared across all PEs and each PE processes its inputs with a different set of kernels in order
to produce independent output feature maps. Within a PE, the inputs are parallelised across the
convolvers, followed by the adder tree that combines partial results to produce the output. Overall,
Angel-Eye's and AutoCodeGen's hardware for CONV layers follow the same strategy, organising
convolvers into groups and tunably unrolling with respect to the input and output feature maps.
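The dataflow of such a PE array can be sketched as follows. The code is an illustrative software model under assumed dimensions, not Angel-Eye's actual design: input feature maps are broadcast to every PE, each PE owns the kernels of one output feature map and runs one convolver per input feature map, and an adder tree reduces the partial results.

```python
import numpy as np

def conv2d_valid(fmap, kernel):
    # Naive 2D convolution (valid padding), standing in for one convolver.
    kh, kw = kernel.shape
    oh, ow = fmap.shape[0] - kh + 1, fmap.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(fmap[i:i+kh, j:j+kw] * kernel)
    return out

def pe_array(in_fmaps, kernels):
    # in_fmaps: (C_in, H, W); kernels: (C_out, C_in, K, K)
    outputs = []
    for pe_kernels in kernels:                        # one PE per output feature map
        partials = [conv2d_valid(f, k)                # convolver bank, unrolled
                    for f, k in zip(in_fmaps, pe_kernels)]  # over input feature maps
        outputs.append(np.sum(partials, axis=0))      # adder tree
    return np.stack(outputs)                          # (C_out, H-K+1, W-K+1)

x = np.random.default_rng(0).standard_normal((3, 8, 8))    # 3 input feature maps
w = np.random.default_rng(1).standard_normal((4, 3, 3, 3)) # 4 output feature maps
print(pe_array(x, w).shape)  # (4, 6, 6)
```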
The framework’s compiler translates the input CNN to a sequence of instructions from Angel-
Eye’s custom instruction set and the computation engine executes the instructions. This process
corresponds to the sequential execution of the layers in a time-sharing manner. With different
CNNs mapped to different instruction sequences, the architecture can be reused to execute various
models without recompilation or reconfiguration. In this respect, the hardware design is configured
and scaled based only on the available resources of the target device and hence is CNN-independent.
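A minimal sketch of this compile step is shown below. The LOAD/CALC/SAVE opcodes and address fields are hypothetical placeholders rather than Angel-Eye's actual instruction set; they only illustrate how each CNN is lowered to a per-layer instruction sequence for a fixed, CNN-independent engine.

```python
def compile_cnn(layers):
    # Translate a layer-by-layer CNN description into a flat instruction
    # sequence. Opcodes and address fields are hypothetical, for illustration.
    program = []
    for i, layer in enumerate(layers):
        program.append(("LOAD", i, layer["weights_addr"]))  # fetch weights from external memory
        program.append(("CALC", i, layer["type"]))          # execute the layer on the engine
        program.append(("SAVE", i, layer["ofmap_addr"]))    # write output feature maps back
    return program

vgg_like = [{"type": "conv3x3", "weights_addr": 0x0000, "ofmap_addr": 0x8000},
            {"type": "conv3x3", "weights_addr": 0x2000, "ofmap_addr": 0x9000},
            {"type": "fc",      "weights_addr": 0x4000, "ofmap_addr": 0xA000}]
for instruction in compile_cnn(vgg_like):
    print(instruction)
```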
2) ALAMO: In contrast to Angel-Eye, ALAMO customises the generated computation engine to
the input CNN. The architecture comprises hardware blocks for POOL, ReLU and NORM layers,
together with a 2D array of compute units which is shared between CONV and FC layers. In CONV
layers, the array exploits the parallelism within one input feature map and across multiple output
feature maps. At each time instant, each row of the array is responsible for one output feature map,
with its columns processing different windows of the same input feature map and combining their