Efficient Deep Learning in Practice: Model Compression, Optimization, and Hardware Acceleration

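This survey paper, "Efficient Deep Learning: A Survey on Making Deep Learning Models Smaller, Faster, and Better", was written by Gaurav Menghani, a researcher at Google Research. It examines how to optimize deep learning models so that they keep high quality while using fewer parameters, running with lower latency, and needing fewer training resources. The paper explains why efficiency matters in deep learning and surveys five core areas of efficiency in detail, spanning modeling techniques, infrastructure, and hardware optimization. It also provides an experiment-based guide and code examples to help practitioners optimize model training and deployment. It is presented as the first comprehensive survey of the efficient deep learning landscape, covering model efficiency from modeling techniques to hardware support. The author hopes the survey will serve as a useful reference for the deep learning community and encourage the development of more efficient learning algorithms and systems.

The paper opens by noting that deep learning has achieved breakthroughs in many fields, but that growing model complexity has brought significant increases in model size, inference latency, and training cost. This has drawn attention to model efficiency, because improving quality alone does not meet the needs of real applications. The paper then examines the efficiency problem in depth, arguing that designing and optimizing a deep learning model requires weighing its impact on compute, memory, and time together. The core content is organized into five parts:

1. **Modeling techniques**: lightweight model architectures such as MobileNet, ShuffleNet, and EfficientNet, which use structural innovations such as depthwise separable convolutions and channel shuffle to balance model size against quality.

2. **Quantization and compression**: quantization of weights and activations, such as binarization and low-precision representations, and model compression methods such as knowledge distillation, which can reduce model size while maintaining or improving quality.

3. **Optimization algorithms**: momentum-based optimization, adaptive learning-rate strategies (such as Adam), second-order methods, and how to choose an optimizer that improves training speed and results.

4. **Infrastructure**: distributed training, data-parallel and model-parallel strategies, and hardware acceleration with GPUs and TPUs, all of which are key to efficient training and inference.

5. **Hardware optimization**: tailoring models to specific hardware, such as optimizing for edge devices, and how emerging hardware supports more efficient deep learning computation.

The paper also provides an experimental guide and code to help developers put these optimization strategies into practice, including training configuration, hyperparameter tuning, and performance benchmarking. The survey is a useful reference for deep learning practitioners: it summarizes current research on efficient deep learning, points to directions for future work, and stresses that the pursuit of model quality must not neglect efficiency, so that solutions become smaller, faster, and better.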

8 Gaurav Menghani

… the minimum weight value ($x_{min}$) in that matrix to 0, and the maximum value ($x_{max}$) to $2^b - 1$ (where $b$ is the number of bits of precision, and $b < 32$). Then we can linearly interpolate all values between them to an integer value in $[0, 2^b - 1]$ (Figure 5). Thus, we are able to map each floating-point value to a fixed-point value, where the latter requires fewer bits than the floating-point representation. This process can also be done for signed $b$-bit fixed-point integers, where the output values will be in the range $[-2^{b-1}, 2^{b-1} - 1]$. One reasonable value of $b$ is 8, since this would lead to a $32/8 = 4\times$ reduction in space, and also because of the near-universal support for the uint8_t and int8_t datatypes.

During inference, we go in the reverse direction and recover a lossy estimate of the original floating-point value (dequantization) using just $x_{min}$ and $x_{max}$. This estimate is lossy since we lost $32 - b$ bits of information when we did the rounding (another way to look at it is that a range of floating-point values maps to the same quantized value).

Fig. 5. Quantizing floating-point continuous values to discrete fixed-point values. The continuous values are clamped to the range $[x_{min}, x_{max}]$, and are mapped to discrete values in $[0, 2^b - 1]$ (in the above figure, $b = 3$, hence the quantized values are in the range [0, 7]).
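As a rough illustration of the mapping described above, the following Python sketch quantizes a few values with $b = 3$, the setting used in Figure 5. The function names, the sample values, and the range $[-1, 1]$ are assumptions chosen for illustration; they do not come from the paper.

```python
def quantize_levels(values, x_min, x_max, b):
    """Clamp each value to [x_min, x_max] and map it to an integer in [0, 2**b - 1]."""
    num_steps = 2 ** b - 1                 # 7 steps for b = 3
    step = (x_max - x_min) / num_steps     # width of one quantization step
    codes = []
    for v in values:
        v = min(max(v, x_min), x_max)      # clamp out-of-range values
        codes.append(round((v - x_min) / step))
    return codes


def dequantize_levels(codes, x_min, x_max, b):
    """Recover a lossy floating-point estimate from the integer codes."""
    step = (x_max - x_min) / (2 ** b - 1)
    return [x_min + c * step for c in codes]


values = [-1.0, -0.9, 0.1, 0.2, 1.0, 2.5]
codes = quantize_levels(values, x_min=-1.0, x_max=1.0, b=3)
estimates = dequantize_levels(codes, x_min=-1.0, x_max=1.0, b=3)
print(codes)   # [0, 0, 4, 4, 7, 7]
```

Here -1.0 and -0.9 collapse to the same code, as do 0.1 and 0.2, which is the information loss the text describes; 2.5 is clamped to $x_{max}$ before it is mapped.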

[82, 90] formalize the quantization scheme with the following two constraints:

• The quantization scheme should be linear (affine transformation), so that the precision bits are linearly distributed.

• 0 should map exactly to a fixed-point value $x_q$, such that dequantizing $x_q$ gives us exactly 0.0. This is an implementation constraint, since 0 is also used for padding to signify missing elements in tensors, and if dequantizing $x_q$ leads to a non-zero value, then it might be interpreted incorrectly as a valid element at that index.

The second constraint described above requires that 0 be a part of the quantization range, which in turn requires updating $x_{min}$ and $x_{max}$, followed by clamping $x$ to lie in $[x_{min}, x_{max}]$. Following this, we can quantize $x$ by constructing a piece-wise linear transformation as follows:

$$\text{quantize}(x) = x_q = \text{round}\left(\frac{x}{s}\right) + z \tag{1}$$

$s$ is the floating-point scale value (it can be thought of as the inverse of the slope, which can be computed using $x_{min}$, $x_{max}$ and the range of the fixed-point values). $z$ is an integer zero-point value, which is the quantized value that is assigned to $x = 0$. This is the terminology followed in literature [82, 90] (Algorithm 2).
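A small worked example may help make Equation (1) concrete. The numbers are illustrative assumptions (an unsigned 8-bit target, $x_{min} = -1$, $x_{max} = 3$), with $s$ and $z$ computed as in Algorithm 2 and the lowest fixed-point value $x_{q_{min}} = 0$:

```latex
\begin{aligned}
s &= \frac{x_{max} - x_{min}}{2^b - 1} = \frac{3 - (-1)}{255} \approx 0.015686 \\
z &= \text{round}\left(x_{q_{min}} - \frac{x_{min}}{s}\right) = \text{round}(0 + 63.75) = 64 \\
\text{quantize}(-1) &= \text{round}(-63.75) + 64 = 0 \\
\text{quantize}(0) &= \text{round}(0) + 64 = 64 = z \\
\text{quantize}(1.5) &= \text{round}(95.625) + 64 = 160 \\
\text{quantize}(3) &= \text{round}(191.25) + 64 = 255
\end{aligned}
```

The endpoints land on 0 and 255, and $x = 0$ lands exactly on the zero-point, as the second constraint requires.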

, Vol. 1, No. 1, Article . Publication date: June 2021.

Efficient Deep Learning: A Survey on Making Deep Learning Models Smaller, Faster, and Better 9

The dequantization step constructs $\hat{x}$, which is a lossy estimate of $x$, since we lose precision when quantizing to a lower number of bits. We can compute it as follows:

$$\text{dequantize}(x_q) = \hat{x} = s(x_q - z) \tag{2}$$

Since $s$ is in floating-point, $\hat{x}$ is also a floating-point value (Algorithm 3). Note that the quantization and dequantization steps can be performed for signed integers too by appropriately changing the value $x_{q_{min}}$ (which is the lowest fixed-point value in $b$ bits) in Algorithm 2.
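To illustrate the remark about signed integers, the Python sketch below computes the scale and zero-point for the same floating-point range under an unsigned and a signed 8-bit target. The helper name `scale_and_zero_point` and the example range are assumptions; the only difference between the two cases is the lowest fixed-point value $x_{q_{min}}$ (0 versus $-2^{b-1}$).

```python
def scale_and_zero_point(x_min, x_max, b, signed=False):
    """Return (s, z) following the formulas of Algorithm 2.

    x_q_min is the lowest representable fixed-point value:
    0 for unsigned b-bit integers, -2**(b-1) for signed ones.
    Assumes the range is not degenerate (x_max > x_min after including 0).
    """
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)   # keep 0 inside the range
    x_q_min = -(2 ** (b - 1)) if signed else 0
    s = (x_max - x_min) / (2 ** b - 1)
    z = round(x_q_min - x_min / s)
    return s, z


s_u, z_u = scale_and_zero_point(-1.0, 3.0, b=8, signed=False)
s_s, z_s = scale_and_zero_point(-1.0, 3.0, b=8, signed=True)
print(z_u, z_s)   # 64 -64
```

The scale is identical in both cases; only the zero-point shifts, here by 128, so the same floating-point range is covered by either [0, 255] or [-128, 127].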

Algorithm 2: Quantizing a given weight matrix $X$

Data: Floating-point tensor to compress $X$, number of precision bits $b$ for the fixed-point representation.

Result: Quantized tensor $X_q$

1 $x_{min}, x_{max} \leftarrow \min(X, 0), \max(X, 0)$;

2 $X \leftarrow \text{clamp}(X, x_{min}, x_{max})$;

3 $s \leftarrow \frac{x_{max} - x_{min}}{2^b - 1}$;

4 $z \leftarrow \text{round}\left(x_{q_{min}} - \frac{x_{min}}{s}\right)$;

5 $X_q \leftarrow \text{round}\left(\frac{X}{s}\right) + z$;

6 return $X_q$;

Algorithm 3: Dequantizing a given fixed-point weight matrix $X_q$

Data: Fixed-point matrix to dequantize $X_q$, along with the scale $s$ and zero-point $z$ values which were calculated during quantization.

Result: Dequantized floating-point weight matrix $\hat{X}$

1 $\hat{X} \leftarrow s(X_q - z)$;

2 return $\hat{X}$;
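A minimal pure-Python sketch of Algorithms 2 and 3 follows, assuming unsigned $b$-bit outputs ($x_{q_{min}} = 0$) and a weight matrix flattened into a list. The function names are hypothetical, Python's built-in `round` (which rounds ties to even) stands in for the unspecified rounding mode, and the final clamp of the codes is an extra safety step that is not part of Algorithm 2.

```python
def quantize(weights, b=8):
    """Algorithm 2 for a flat list of floats; returns (codes, s, z)."""
    x_min = min(min(weights), 0.0)          # step 1: make sure 0 is in range
    x_max = max(max(weights), 0.0)
    if x_max == x_min:                      # all-zero input: nothing to scale
        return [0] * len(weights), 1.0, 0
    clamped = [min(max(w, x_min), x_max) for w in weights]   # step 2
    s = (x_max - x_min) / (2 ** b - 1)      # step 3: scale
    z = round(0 - x_min / s)                # step 4: zero-point with x_q_min = 0
    q_max = 2 ** b - 1
    # Step 5, plus a clamp (assumption, not in Algorithm 2) to guard against
    # a rounding tie pushing a code one past q_max.
    codes = [min(max(round(w / s) + z, 0), q_max) for w in clamped]
    return codes, s, z


def dequantize(codes, s, z):
    """Algorithm 3: lossy floating-point estimate of the original weights."""
    return [s * (q - z) for q in codes]


weights = [-1.0, -0.25, 0.0, 0.5, 3.0]
codes, s, z = quantize(weights, b=8)
restored = dequantize(codes, s, z)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes)   # [0, 48, 64, 96, 255]
```

In this run 0.0 maps to the zero-point 64 and dequantizes back to exactly 0.0, while the other weights are recovered to within one scale step $s$.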

We can utilize the above two algorithms for quantizing and dequantizing the model's weight matrices. Quantizing a pre-trained model's weights to reduce its size is termed post-training quantization in literature [ ]. This might be sufficient for the purpose of reducing the model size when there is sufficient representational capacity in the model.
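The size reduction from post-training quantization can be illustrated with Python's standard `array` module, which stores typed values compactly. This is only a sketch under assumed inputs: the weights are made-up numbers, `quantize_uint8` is a hypothetical helper following Algorithm 2 with $b = 8$, and the comparison ignores the few bytes needed to store $s$ and $z$.

```python
from array import array


def quantize_uint8(weights):
    """Quantize floats to unsigned 8-bit codes; returns (codes, s, z)."""
    x_min = min(min(weights), 0.0)
    x_max = max(max(weights), 0.0)
    s = (x_max - x_min) / 255 or 1.0        # fall back to 1.0 for all-zero input
    z = round(-x_min / s)
    codes = [min(max(round(w / s) + z, 0), 255) for w in weights]
    return codes, s, z


# Made-up "pre-trained" weights in [-2.0, 1.98], stored as 32-bit floats.
weights = [((i * 37) % 200 - 100) / 50.0 for i in range(1000)]
float_weights = array("f", weights)          # 4 bytes per weight
codes, s, z = quantize_uint8(weights)
quantized_weights = array("B", codes)        # 1 byte per weight

float_bytes = len(float_weights) * float_weights.itemsize
quantized_bytes = len(quantized_weights) * quantized_weights.itemsize
print(float_bytes, quantized_bytes)          # 4000 1000
```

The quantized copy takes a quarter of the space, matching the $32/8 = 4\times$ reduction expected for $b = 8$.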

There are other works in literature [127] that demonstrate slightly different variants of quantization. XNOR-Net [127], Binarized Neural Networks [ ] and others use $b = 1$, and thus have weight matrices with just two possible values (encoded as 0 or 1), and the quantization function there is simply the $\text{sign}(x)$ function (assuming the weights are symmetrically distributed around 0).

The promise of such extreme quantization approaches is the theoretical $32\times$ reduction in model size without much quality loss. Some of the works claim improvements on larger networks like AlexNet [ ], VGG [141], Inception [146], etc., which might already be more amenable to compression. A more informative task would be to demonstrate extreme quantization on smaller networks like the MobileNet family [133]. Additionally, binary quantization (and other quantization schemes like ternary [ ], bit-shift based networks [127], etc.) promise latency-efficient implementations of standard operations, where multiplications and divisions are replaced by cheaper operations like addition, subtraction, etc. These claims need to be verified because even if these lead to a theoretical reduction in FLOPs, the implementations still need support from the underlying hardware. A fair comparison would be against standard quantization with $b = 8$, where the multiplications and divisions also become cheaper, and are supported efficiently by the hardware via SIMD instructions which allow for low-level data parallelism (for example, on x86 via the SSE instruction set, on ARM via the Neon [108] intrinsics, and even on specialized DSPs like the Qualcomm Hexagon [19]).
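To illustrate how binary quantization can replace multiplications with cheaper operations, the Python sketch below binarizes two vectors with the sign function and computes their dot product using XNOR and a bit count. This is a generic illustration of the idea, not the implementation from XNOR-Net or any other cited work (which also use scaling factors); the helper names, the sample vectors, and the convention of treating 0 as positive are assumptions.

```python
def binarize(values):
    """Quantize with b = 1 via sign(x): bit 1 encodes +1, bit 0 encodes -1."""
    return [1 if v >= 0 else 0 for v in values]


def pack_bits(bits):
    """Pack a list of 0/1 bits into a single Python integer."""
    word = 0
    for bit in bits:
        word = (word << 1) | bit
    return word


def binary_dot(a_bits, b_bits):
    """Dot product of two +1/-1 vectors using XNOR and a population count."""
    n = len(a_bits)
    mask = (1 << n) - 1
    xnor = ~(pack_bits(a_bits) ^ pack_bits(b_bits)) & mask   # 1 where signs agree
    agreements = bin(xnor).count("1")
    return 2 * agreements - n          # agreements minus disagreements


def to_signs(bits):
    """Decode bits back to +1/-1 for a multiplication-based reference."""
    return [1 if bit else -1 for bit in bits]


a = [0.7, -1.2, 0.3, -0.4, 0.9]
b = [0.5, 0.9, 0.1, -2.0, 1.5]
a_bits, b_bits = binarize(a), binarize(b)
reference = sum(x * y for x, y in zip(to_signs(a_bits), to_signs(b_bits)))
print(binary_dot(a_bits, b_bits), reference)   # 3 3
```

Whether such bitwise formulations are actually faster depends on hardware support, which is the caveat raised in the paragraph above.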

Activation Quantization: To be able to get latency improvements with quantized networks, the math operations have to be done in fixed-point representations too. This means all intermediate …
