Efficient Processing for Deep Learning: Tutorial and Survey

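"This paper is the survey 'Efficient Processing of Deep Neural Networks: A Tutorial and Survey' by Vivienne Sze et al. It gives a comprehensive introduction to the background theory of deep learning and examines how to process deep neural networks efficiently, aiming to improve energy efficiency and throughput without sacrificing accuracy or increasing hardware cost, which is critical to the wide deployment of deep learning in AI systems."

Deep learning is now widely used in AI fields such as computer vision, speech recognition, and robotics. Although deep neural networks (DNNs) achieve state-of-the-art accuracy on many AI tasks, their high computational complexity is a challenge. Techniques that process DNNs efficiently, improving energy efficiency and throughput while preserving accuracy and controlling hardware cost, have therefore become a research focus.

The paper first gives a basic overview of deep neural networks, including their structure, training process, and optimization strategies. The authors then discuss the platforms and architectures that support DNN execution, such as GPUs, TPUs, and custom hardware accelerators, which aim to speed up neural network computation and reduce power consumption.

The paper also highlights recent key advances in efficient DNN processing, which reduce computation cost mainly through hardware optimization, model compression, quantization, and sparsification. Hardware optimization involves designing more efficient processing units, such as tensor processing units, dedicated to the compute-intensive parts of neural networks. Model compression includes methods such as weight pruning and knowledge distillation, which shrink the model while maintaining performance. Quantization converts floating-point operations to integer operations to increase speed and reduce memory requirements. Sparsification lets the network keep only its most important connections, further reducing computation and storage.

In addition, the paper explores strategies such as mixed-precision training, dynamic scheduling, and energy-efficiency optimization, which adjust the allocation of compute resources to the needs of the task and the available hardware. The paper may close with an outlook on future trends and challenges, including more efficient hardware designs, adaptive computation techniques, and application-specific optimization. This survey is a valuable resource for understanding how the computational efficiency of deep learning can be improved: it provides a technical overview and analyzes current solutions and future directions, making it a useful reference for researchers and engineers.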

the output. In these networks, some intermediate operations generate values that are stored internally to the network and used as inputs to other operations in conjunction with the processing of a later input. In this article, we will focus on feed-forward networks as, to date, little attention has been given to hardware acceleration specifically of recurrent networks.

DNNs can be composed of fully-connected (FC) layers, also referred to as multi-layer perceptrons, as shown in the leftmost layer of Fig. 2(d). In a fully-connected layer, every output activation is a weighted sum of all input activations (i.e., all outputs are connected to all inputs). This requires a significant amount of storage and computation. Thankfully, in many applications, we can remove some connections between the activations by setting the weights to zero without affecting accuracy. This results in a sparsely-connected layer, as illustrated in the rightmost layer of Fig. 2(d). We can also make the computation more efficient by limiting the number of weights that contribute to an output. This sort of structured sparsity can arise if each output is only a function of a fixed-size window of inputs. Even further efficiency can be gained if the same set of weights is used in the calculation of every output. This weight sharing can significantly reduce the storage requirements for weights.

An extremely popular windowed, weight-shared network arises by structuring the computation as a convolution, as shown in Fig. 6(a), where the output is computed using only a small neighborhood of activations for the weighted sum (i.e., the filter has a limited receptive field, and all weights beyond a certain distance from the input are set to zero), and where the same set of weights is shared for every output (i.e., the filter is space invariant). This form of structured sparsity is orthogonal to the sparsity that occurs from network pruning as described in Section VII-B2. Accordingly, a convolutional neural network (CNN) is a popular form of DNN [35].
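To make the storage argument concrete, the following sketch counts weights for a fully-connected layer, a windowed layer without sharing, and a windowed layer with weight sharing. The sizes (a 32×32 input plane and a 3×3 window, stride 1, no padding) are hypothetical values chosen only for illustration; they do not come from the paper.

```python
# Weight counts for three ways of connecting one input plane to one output plane.
# All sizes below are illustrative assumptions, not values from the paper.

def fc_weights(num_inputs: int, num_outputs: int) -> int:
    """Fully connected: every output has its own weight for every input."""
    return num_inputs * num_outputs

def windowed_weights(num_outputs: int, window_size: int) -> int:
    """Windowed, no sharing: every output has its own weights for its window."""
    return num_outputs * window_size

def shared_weights(window_size: int) -> int:
    """Windowed with sharing: one set of window weights reused by every output."""
    return window_size

H = 32             # assumed input plane width/height
R = 3              # assumed window (filter) width/height
E = H - R + 1      # output plane width/height for stride 1, no padding

num_inputs = H * H
num_outputs = E * E
window_size = R * R

fc = fc_weights(num_inputs, num_outputs)
windowed = windowed_weights(num_outputs, window_size)
shared = shared_weights(window_size)

print(f"fully connected : {fc} weights")
print(f"windowed        : {windowed} weights")
print(f"weight shared   : {shared} weights")
```

Under these assumptions the count drops from hundreds of thousands of weights to a few thousand with windowing, and to just the window itself with sharing.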

1) Convolutional Neural Networks (CNNs): CNNs are composed of multiple convolutional layers (CONV), as shown in Fig. 7, where each layer generates a higher-level abstraction of the input data, called a feature map (fmap), that preserves essential yet unique information. Modern CNNs are able to achieve superior performance by employing a very deep hierarchy of layers. CNNs, also known as ConvNets, are widely used in a variety of applications including image understanding [ ], speech recognition [ ], game play [ ], robotics [ ], etc. This paper will focus on their use in image processing, specifically for the task of image classification [ ].

Each of the CONV layers in the CNN is primarily composed of high-dimensional convolutions as shown in Fig. 6(b). In this computation there is a set of 2-D input feature maps (ifmaps), each of which is called a channel. Each channel is convolved with a distinct 2-D filter from the stack of filters, one for each channel. The results of the convolution at each point are summed across all the channels. In addition, a 1-D bias can be added to the filtering results, but some recent networks [ ] remove its usage from part of the layers. The result of this computation is one channel of output feature map (ofmap). Additional stacks of 2-D filters can be used on the same input to create additional output channels. Finally, multiple stacks of input feature maps may be processed together as a batch to potentially improve reuse of the filter weights.

[Fig. 6. Dimensionality of convolutions. (a) 2-D convolution in traditional image processing; labels: filter (weights), input fmap, element-wise multiplication, partial sum (psum) accumulation, an output activation, output fmap. (b) High-dimensional convolutions in CNNs; labels: input fmaps, filters, output fmaps.]

[Fig. 7. Convolutional Neural Networks. Labels: modern deep CNN with 5 to 1000 layers; CONV layers producing low-level, mid-level, and high-level features; 1 to 3 fully connected layers producing class scores. Each CONV layer: convolution, non-linearity, and optionally normalization and pooling. Each fully connected layer: fully connected computation followed by non-linearity.]

Given the shape parameters in Table I, the computation of a CONV layer is defined as

$$
\begin{aligned}
O[z][u][x][y] &= B[u] + \sum_{k=0}^{C-1} \sum_{i=0}^{R-1} \sum_{j=0}^{R-1} I[z][k][Ux+i][Uy+j] \times W[u][k][i][j], \\
& 0 \le z < N,\quad 0 \le u < M,\quad 0 \le x, y < E,\quad E = (H - R + U)/U.
\end{aligned}
\tag{1}
$$

O, I, W and B are the matrices of the ofmaps, ifmaps, filters and biases, respectively. U is a given stride size. Fig. 6(b) shows a visualization of this computation (ignoring biases).
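As a minimal sketch of Eq. (1), the loop nest below computes a CONV layer directly from the definition, using nested Python lists for the tensors. The function name conv_layer and the tiny example values are assumptions made for illustration; the index names (z, u, x, y, k, i, j) and shape parameters (N, M, C, H, R, E, U) follow Eq. (1) and Table I. It is written for clarity, not speed.

```python
def conv_layer(I, W, B, U=1):
    """Naive CONV layer following Eq. (1).

    I: ifmaps,  shape [N][C][H][H]
    W: filters, shape [M][C][R][R]
    B: biases,  shape [M]
    U: stride
    Returns O: ofmaps, shape [N][M][E][E] with E = (H - R + U) / U.
    """
    N, C, H = len(I), len(I[0]), len(I[0][0])
    M, R = len(W), len(W[0][0])
    if (H - R) % U != 0:
        raise ValueError("stride must evenly divide H - R in this sketch")
    E = (H - R + U) // U

    O = [[[[0] * E for _ in range(E)] for _ in range(M)] for _ in range(N)]
    for z in range(N):                  # batch
        for u in range(M):              # output channel (one 3-D filter each)
            for x in range(E):          # output row
                for y in range(E):      # output column
                    acc = B[u]
                    for k in range(C):          # input channel
                        for i in range(R):      # filter row
                            for j in range(R):  # filter column
                                acc += I[z][k][U * x + i][U * y + j] * W[u][k][i][j]
                    O[z][u][x][y] = acc
    return O

# Tiny example: N = 1, C = 1, H = 3, M = 1, R = 2, U = 1, so E = 2.
ifmaps = [[[[1, 2, 3],
            [4, 5, 6],
            [7, 8, 9]]]]
filters = [[[[1, 0],
             [0, 1]]]]
biases = [0]
ofmaps = conv_layer(ifmaps, filters, biases, U=1)
print(ofmaps)  # [[[[6, 8], [12, 14]]]]
```

The seven nested loops make the cost visible: each layer performs N × M × E × E × C × R × R multiply-accumulate operations, which is the workload that efficient DNN hardware targets.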

To align the terminology of CNNs with the generic DNN,
• filters are composed of weights (i.e., synapses)
• input images are composed of pixels (i.e., input neurons to first layer)
• input and output feature maps (ifmaps, ofmaps) are composed of activations (i.e., input and output neurons)

TABLE I. SHAPE PARAMETERS OF A CONV/FC LAYER.

Shape Parameter   Description
N                 batch size of 3-D fmaps
M                 # of 3-D filters / # of ofmap channels
C                 # of ifmap/filter channels
H                 ifmap plane width/height
R                 filter plane width/height (= H in FC)
E                 ofmap plane width/height (= 1 in FC)

[Fig. 8. Various forms of non-linear activation functions (Figure adopted from Caffe Tutorial [43]). Traditional non-linear activation functions: sigmoid, hyperbolic tangent. Modern non-linear activation functions: rectified linear unit (ReLU), leaky ReLU, exponential LU. α = small const. (e.g. 0.1).]

From five [ ] to even more than a thousand [ ] CONV layers are commonly used in recent CNN models. A small number, e.g., 1 to 3, of fully-connected (FC) layers are typically applied after the CONV layers for classification purposes. An FC layer also applies filters on the ifmaps as in the CONV layers, but the filters are of the same size as the ifmaps. Therefore, it does not have the weight sharing property of CONV layers. Eq. (1) still holds for the computation of FC layers with a few additional constraints on the shape parameters: H = R, E = 1, and U = 1.
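With H = R, E = 1, and U = 1, the output indices x and y in Eq. (1) can only be zero, so the CONV sum collapses to one weighted sum per output channel. The sketch below illustrates this under that reading; the function names fc_layer and flatten and the example values are hypothetical. It also checks that the result equals a plain dot product over the flattened ifmap, which is the usual view of an FC layer.

```python
def fc_layer(I, W, B):
    """FC layer as the special case of Eq. (1) with H = R, E = 1, U = 1.

    I: ifmaps,  shape [N][C][H][H]
    W: filters, shape [M][C][H][H]  (each filter is as large as the ifmap)
    B: biases,  shape [M]
    Returns O of shape [N][M] (the E x E = 1 x 1 plane is dropped).
    """
    N, C, H = len(I), len(I[0]), len(I[0][0])
    M = len(W)
    O = []
    for z in range(N):
        row = []
        for u in range(M):
            acc = B[u]
            for k in range(C):
                for i in range(H):
                    for j in range(H):
                        acc += I[z][k][i][j] * W[u][k][i][j]
            row.append(acc)
        O.append(row)
    return O

def flatten(tensor3d):
    """Flatten a [C][H][H] nested list into a 1-D list."""
    return [v for plane in tensor3d for r in plane for v in r]

# Example: N = 1, C = 1, H = R = 2, M = 2.
ifmaps = [[[[1, 2],
            [3, 4]]]]
filters = [[[[1, 0],
             [0, 1]]],
           [[[1, 1],
             [1, 1]]]]
biases = [0, 1]

out = fc_layer(ifmaps, filters, biases)

# Same result as a matrix-vector product over the flattened input.
x = flatten(ifmaps[0])
as_dot = [biases[u] + sum(a * b for a, b in zip(x, flatten(filters[u])))
          for u in range(len(filters))]
print(out, as_dot)  # [[5, 11]] [5, 11]
```

Because every output has its own full-size filter, the layer stores M × C × H × H weights and none of them are reused across output positions.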

In addition to CONV and FC layers, various optional layers can be found in a DNN such as the non-linearity (NON), pooling (POOL), and normalization (NORM). Each of these layers can be configured as discussed next.

2) Non-Linearity: A non-linear activation function is typically applied after each convolution or fully connected computation. Various non-linear functions are used to introduce non-linearity into the DNN as shown in Fig. 8. These include conventional non-linear functions such as sigmoid or hyperbolic tangent as well as the rectified linear unit (ReLU) [ ], which has become popular in recent years due to its simplicity and its ability to enable fast training. Variations of ReLU, such as leaky ReLU [ ], parametric ReLU [ ], and exponential LU [ ] have also been explored for improved accuracy. Finally, a non-linearity called maxout, which takes the max value of two intersecting linear functions, has been shown to be effective in speech recognition tasks [41, 42].
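The activation functions named above can be written as scalar functions using their commonly used definitions, as in the sketch below. The default α = 0.1 follows the "small const." note in Fig. 8; the maxout helper is a simplified scalar illustration (the max of two lines given as hypothetical (slope, intercept) pairs), not the formulation used in the cited speech work.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x: float) -> float:
    return math.tanh(x)

def relu(x: float) -> float:
    return max(0.0, x)

def leaky_relu(x: float, alpha: float = 0.1) -> float:
    # For 0 < alpha < 1 this equals x when x >= 0 and alpha * x otherwise.
    return max(alpha * x, x)

def elu(x: float, alpha: float = 0.1) -> float:
    # Exponential LU: identity for x >= 0, saturating exponential below 0.
    return x if x >= 0 else alpha * (math.exp(x) - 1.0)

def maxout(x: float, line_a=(1.0, 0.0), line_b=(-1.0, 0.0)) -> float:
    # Max of two linear functions; each line is a (slope, intercept) pair.
    (wa, ba), (wb, bb) = line_a, line_b
    return max(wa * x + ba, wb * x + bb)

for v in (-2.0, 0.0, 3.0):
    print(v, sigmoid(v), tanh(v), relu(v), leaky_relu(v), elu(v), maxout(v))
```

ReLU and its variants need only a comparison and at most one multiply, which is part of why they are cheaper to evaluate than sigmoid or hyperbolic tangent.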

3) Pooling: Pooling enables the network to be robust and invariant to small shifts and distortions and is applied to each channel separately. It can be configured based on the size of its receptive field (e.g., 2×2) and the type of pooling (e.g., max or average), as shown in Fig. 9. Typically the pooling occurs on non-overlapping blocks (i.e., the stride is equal to the size of the pooling). Usually a stride of greater than one is used such that there is a reduction in the dimension of the representation (i.e., feature map).

[Fig. 9. Various forms of pooling (Figure adopted from Caffe Tutorial [ ]). Example of 2×2 pooling with stride 2 on a 4×4 input with rows (9 3 5 3), (10 32 2 2), (1 3 21 9), (2 6 11 7). Max pooling output rows: (32 5), (6 21). Average pooling output rows as shown: (18 3), (3 12).]
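The sketch below applies 2×2 pooling with stride 2 to the 4×4 input from Fig. 9. The function name pool2d is an assumption for illustration. The max-pooling result matches the figure; for average pooling, three of the four values match, while the top-left block (9, 3, 10, 32) averages to 13.5 rather than the 18 that appears in the extracted figure text, which may reflect a discrepancy in the figure or in its extraction.

```python
def pool2d(fmap, size=2, stride=2, mode="max"):
    """Pool one channel (a 2-D list) with a size x size window."""
    if mode not in ("max", "avg"):
        raise ValueError("mode must be 'max' or 'avg'")
    rows, cols = len(fmap), len(fmap[0])
    out = []
    for r in range(0, rows - size + 1, stride):
        out_row = []
        for c in range(0, cols - size + 1, stride):
            window = [fmap[r + i][c + j] for i in range(size) for j in range(size)]
            if mode == "max":
                out_row.append(max(window))
            else:
                out_row.append(sum(window) / len(window))
        out.append(out_row)
    return out

# Input values taken from Fig. 9.
fmap = [[9, 3, 5, 3],
        [10, 32, 2, 2],
        [1, 3, 21, 9],
        [2, 6, 11, 7]]

max_out = pool2d(fmap, mode="max")
avg_out = pool2d(fmap, mode="avg")
print(max_out)  # [[32, 5], [6, 21]]
print(avg_out)  # [[13.5, 3.0], [3.0, 12.0]]
```

With the stride equal to the window size, the blocks do not overlap and each spatial dimension is halved, from 4×4 to 2×2.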

4) Normalization: Controlling the input distribution across layers can help to significantly speed up training and improve accuracy. Accordingly, the distribution of the layer input activations (σ, µ) is normalized such that it has a zero mean and a unit standard deviation. In batch normalization, the normalized value is further scaled and shifted, as shown in Eq. (2), where the parameters (γ, β) are learned from training [ ]. ε is a small constant to avoid numerical problems. Prior to this, local response normalization [ ] was used, which was inspired by lateral inhibition in neurobiology where excited neurons (i.e., high-value activations) should subdue their neighbors (i.e., low-value activations); however, batch normalization is now considered standard practice in the design of CNNs.

$$
y = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}\,\gamma + \beta \tag{2}
$$
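A minimal sketch of Eq. (2) for a single activation across a batch follows. It assumes µ and σ² are the mean and variance computed over the batch values passed in, and uses fixed γ and β for illustration, whereas in practice they are learned during training. The function name batch_norm and the default ε = 1e-5 are illustrative assumptions.

```python
import math

def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a list of activations, then scale by gamma and shift by beta."""
    n = len(xs)
    mu = sum(xs) / n                           # batch mean
    var = sum((x - mu) ** 2 for x in xs) / n   # batch variance (sigma squared)
    return [(x - mu) / math.sqrt(var + eps) * gamma + beta for x in xs]

batch = [1.0, 2.0, 3.0, 4.0]

normalized = batch_norm(batch)                       # zero mean, about unit std
rescaled = batch_norm(batch, gamma=2.0, beta=3.0)    # mean 3, std about 2

def mean(vs):
    return sum(vs) / len(vs)

def std(vs):
    m = mean(vs)
    return math.sqrt(sum((v - m) ** 2 for v in vs) / len(vs))

print(mean(normalized), std(normalized))
print(mean(rescaled), std(rescaled))
```

The ε term keeps the division well defined when a batch has near-zero variance, at the cost of making the normalized standard deviation very slightly less than one.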

A. Popular DNN Models

Many DNN models have been developed over the past two decades. Each of these models has a different "network architecture" in terms of number of layers, filter shapes (i.e., filter size, number of channels and filters), layer types, and connections between layers. Understanding these variations and trends is important for incorporating the right flexibility in any efficient DNN engine.

Although the first popular DNN, LeNet [ ], was published in the 1990s, it wasn't until 2012 that AlexNet [ ] was used in the ImageNet Challenge [ ]. We will give an overview of various popular DNNs that competed in and/or won the ImageNet Challenge [ ] as shown in Fig. 5; most of these models are publicly available for download with pre-trained weights, and they are summarized in Table II. Two top-5 error results are reported. In the first row, the accuracy is boosted by using multiple crops from the image and an ensemble of multiple trained models (i.e., the DNN needs to be run several times); these are the results used to compete in the ImageNet Challenge. The second row reports the accuracy if only a single crop is used (i.e., the DNN is run only once), which is more consistent with what would be deployed in real applications.

LeNet [9] was one of the first CNN approaches introduced in 1989. It was designed for the task of digit classification in grayscale images of size 28×28. The most well known
