A basic requirement of our quantization scheme is that it permits efficient implementation of all arithmetic using only integer arithmetic operations on the quantized values (we eschew implementations requiring lookup tables because these tend to perform poorly compared to pure arithmetic on SIMD hardware). This is equivalent to requiring that the quantization scheme be an affine mapping of integers q to real numbers r, i.e. of the form

r = S(q - Z)    (1)
for some constants S and Z. Equation (1) is our quantization scheme and the constants S and Z are our quantization parameters. Our quantization scheme uses a single set of quantization parameters for all values within each activations array and within each weights array; separate arrays use separate quantization parameters.
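As a concrete illustration of the affine mapping of Equation (1), the two directions of the mapping can be sketched as below. The helper names, the round-to-nearest choice, and the clamping to the 8-bit range are our own illustrative choices, not part of the scheme itself.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Dequantize: recover the real value represented by q, per Equation (1).
float dequantize(std::uint8_t q, float S, std::uint8_t Z) {
  return S * (static_cast<int>(q) - static_cast<int>(Z));
}

// Quantize: invert Equation (1), rounding to nearest and clamping to the
// 8-bit range. (Illustrative helper; the paper does not prescribe this.)
std::uint8_t quantize(float r, float S, std::uint8_t Z) {
  int q = static_cast<int>(std::lround(r / S)) + static_cast<int>(Z);
  return static_cast<std::uint8_t>(std::min(255, std::max(0, q)));
}
```

Note that with this mapping the real value r = 0 always quantizes exactly to q = Z, a property the zero-point discussion below relies on.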
For 8-bit quantization, q is quantized as an 8-bit integer (for B-bit quantization, q is quantized as a B-bit integer). Some arrays, typically bias vectors, are quantized as 32-bit integers; see Section 2.4.
The constant S (for “scale”) is an arbitrary positive real number. It is typically represented in software as a floating-point quantity, like the real values r. Section 2.2 describes methods for avoiding the representation of such floating-point quantities in the inference workload.
The constant Z (for “zero-point”) is of the same type as quantized values q, and is in fact the quantized value q corresponding to the real value 0. This allows us to automatically meet the requirement that the real value r = 0 be exactly representable by a quantized value. The motivation for this requirement is that efficient implementation of neural network operators often requires zero-padding of arrays around boundaries.
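The zero-padding point can be made concrete as follows: because r = S(q - Z), padding a quantized array with the value Z is exactly equivalent to zero-padding the underlying real array. The helper below is our own illustrative sketch.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Pad a row of quantized values with the zero-point Z on both sides.
// Since r = S(q - Z), each padded entry represents exactly r = 0.
std::vector<std::uint8_t> PadWithZeroPoint(const std::vector<std::uint8_t>& row,
                                           std::uint8_t Z, int pad) {
  std::vector<std::uint8_t> out(row.size() + 2 * pad, Z);
  std::copy(row.begin(), row.end(), out.begin() + pad);
  return out;
}
```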
Our discussion so far is summarized in the following quantized buffer data structure³, with one instance of such a buffer existing for each activations array and weights array in a neural network. We use C++ syntax because it allows the unambiguous conveyance of types.
template<typename QType>  // e.g. QType=uint8
struct QuantizedBuffer {
  vector<QType> q;  // the quantized values
  float S;          // the scale
  QType Z;          // the zero-point
};
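As a self-contained usage sketch, the struct above can be exercised as follows (std::vector is spelled out here, and the ValueAt helper, which simply applies Equation (1) entry-wise, is our own):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

template <typename QType>  // e.g. QType=uint8
struct QuantizedBuffer {
  std::vector<QType> q;  // the quantized values
  float S;               // the scale
  QType Z;               // the zero-point
};

// Entry-wise application of Equation (1): r = S(q - Z).
float ValueAt(const QuantizedBuffer<std::uint8_t>& buf, std::size_t i) {
  return buf.S * (static_cast<int>(buf.q[i]) - static_cast<int>(buf.Z));
}
```

For example, with S = 0.5 and Z = 10, the stored value 10 represents exactly r = 0, and 14 represents r = 2.0.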
2.2. Integer-arithmetic-only matrix multiplication
We now turn to the question of how to perform inference using only integer arithmetic, i.e. how to use Equation (1) to translate real-numbers computation into quantized-values computation, and how the latter can be designed to involve only integer arithmetic even though the scale values S are not integers.

³ The actual data structures in the TensorFlow Lite [5] Converter are QuantizationParams and Array in this header file. As we discuss in the next subsection, this data structure, which still contains a floating-point quantity, does not appear in the actual quantized on-device inference code.
Consider the multiplication of two square N × N matrices of real numbers, r_1 and r_2, with their product represented by r_3 = r_1 r_2. We denote the entries of each of these matrices r_\alpha (\alpha = 1, 2 or 3) as r_\alpha^{(i,j)} for 1 \le i, j \le N, and the quantization parameters with which they are quantized as (S_\alpha, Z_\alpha). We denote the quantized entries by q_\alpha^{(i,j)}.
Equation (1) then becomes:

r_\alpha^{(i,j)} = S_\alpha (q_\alpha^{(i,j)} - Z_\alpha).    (2)
From the definition of matrix multiplication, we have

S_3 (q_3^{(i,k)} - Z_3) = \sum_{j=1}^{N} S_1 (q_1^{(i,j)} - Z_1) S_2 (q_2^{(j,k)} - Z_2),    (3)
which can be rewritten as

q_3^{(i,k)} = Z_3 + M \sum_{j=1}^{N} (q_1^{(i,j)} - Z_1)(q_2^{(j,k)} - Z_2),    (4)
where the multiplier M is defined as

M := \frac{S_1 S_2}{S_3}.    (5)
In Equation (4), the only non-integer is the multiplier M. As a constant depending only on the quantization scales S_1, S_2, S_3, it can be computed offline. We empirically find it to always be in the interval (0, 1), and can therefore express it in the normalized form

M = 2^{-n} M_0    (6)
where M_0 is in the interval [0.5, 1) and n is a non-negative integer. The normalized multiplier M_0 now lends itself well to being expressed as a fixed-point multiplier (e.g. int16 or int32 depending on hardware capability). For example, if int32 is used, the integer representing M_0 is the int32 value nearest to 2^{31} M_0. Since M_0 \ge 0.5, this value is always at least 2^{30} and will therefore always have at least 30 bits of relative accuracy. Multiplication by M_0 can thus be implemented as a fixed-point multiplication⁴. Meanwhile, multiplication by 2^{-n} can be implemented with an efficient bit-shift, albeit one that needs to have correct round-to-nearest behavior, an issue that we return to in Appendix B.
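The decomposition M = 2^{-n} M_0 and the subsequent integer-only multiply can be sketched as below. This mirrors the approach described above, though the helper names and the use of std::frexp are our own, and the rounded right shift here uses simple round-half-up for brevity, whereas Appendix B treats round-to-nearest in detail.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Decompose M in (0, 1) as M = 2^{-n} * M0 with M0 in [0.5, 1), and
// represent M0 as the nearest int32 to 2^31 * M0.
void QuantizeMultiplier(double M, std::int32_t* M0_fixed, int* n) {
  assert(M > 0.0 && M < 1.0);
  int exponent = 0;
  const double M0 = std::frexp(M, &exponent);  // M = M0 * 2^exponent
  *n = -exponent;                              // so M = 2^{-n} * M0, n >= 0
  std::int64_t fixed = std::llround(M0 * (1ll << 31));
  if (fixed == (1ll << 31)) {  // M0 rounded up to 1.0: renormalize
    fixed /= 2;
    --*n;
  }
  *M0_fixed = static_cast<std::int32_t>(fixed);
}

// Multiply an int32 accumulator by M = 2^{-n} * M0 using only integer
// arithmetic: a rounded high multiply by M0_fixed / 2^31, then a rounded
// right shift by n (round half up; see Appendix B for exact rounding).
std::int32_t MultiplyByM(std::int32_t acc, std::int32_t M0_fixed, int n) {
  std::int64_t prod = static_cast<std::int64_t>(acc) * M0_fixed;
  std::int32_t high = static_cast<std::int32_t>((prod + (1ll << 30)) >> 31);
  if (n == 0) return high;
  return (high + (std::int32_t{1} << (n - 1))) >> n;
}
```

For instance, M = 0.25 decomposes as n = 1, M_0 = 0.5, so M0_fixed = 2^30, and applying it to an accumulator of 100 yields 25, as expected.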
2.3. Efficient handling of zero-points
In order to efficiently implement the evaluation of Equation (4) without having to perform 2N^3 subtractions and

⁴ The computation discussed in this section is implemented in TensorFlow Lite [5] reference code for a fully-connected layer.