Abstract
We present an overview of techniques for quantizing convolutional neural networks for inference with integer weights and activations.
1. Per-channel quantization of weights and per-layer quantization of activations to 8 bits of precision post-training produces classification accuracies within 2% of floating-point networks for a wide variety of CNN architectures (Section 3.1); a sketch of this scheme appears after this list.
2. Model sizes can be reduced by a factor of 4 by quantizing weights to 8 bits, even when 8-bit arithmetic is not supported. This can be achieved with simple post-training quantization of weights (Section 3.1); see the converter sketch after this list.
3. We benchmark latencies of quantized networks on CPUs and DSPs, observing speedups of 2x-3x for quantized implementations over floating point on CPUs. Speedups of up to 10x are observed on specialized processors with fixed-point SIMD capabilities, such as Qualcomm QDSPs with HVX (Section 6).
4. Quantization-aware training can provide further improvements, reducing the gap to floating point to 1% at 8-bit precision. It also allows reducing the precision of weights to 4 bits, with accuracy losses ranging from 2% to 10%; smaller networks see larger drops (Section 3.2).
5. We introduce tools in TensorFlow and TensorFlow Lite for quantizing convolutional networks (Section 3).
6. We review best practices for quantization-aware training to obtain high accuracy with quantized weights and activations (Section 4).
7. We recommend that per-channel quantization of weights and per-layer quantization of activations be the preferred quantization scheme for hardware acceleration and kernel optimization. We also propose that future processors and hardware accelerators for optimized inference support precisions of 4, 8, and 16 bits (Section 7).
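To make item 1 concrete, the sketch below shows affine (asymmetric) per-channel quantization of a weight tensor in NumPy. It is a minimal illustration, not the tooling of Section 3: the function names are ours, and we assume output channels sit on the last axis (TensorFlow's HWIO convolution layout).

```python
import numpy as np

def quantize_per_channel(w, num_bits=8):
    # Affine per-channel quantization: each output channel (last axis)
    # gets its own scale and zero-point from that channel's min/max range.
    qmin, qmax = 0, 2 ** num_bits - 1
    flat = w.reshape(-1, w.shape[-1])          # (elements, channels)
    w_min = np.minimum(flat.min(axis=0), 0.0)  # force the range to include 0
    w_max = np.maximum(flat.max(axis=0), 0.0)  # so zero is exactly representable
    scale = np.where(w_max > w_min, (w_max - w_min) / (qmax - qmin), 1.0)
    zero_point = np.round(qmin - w_min / scale).astype(np.int32)
    q = np.clip(np.round(w / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    # Map integer codes back to approximate float values.
    return (q.astype(np.float32) - zero_point) * scale

# Usage: quantize a 3x3 convolution kernel with 64 output channels.
w = np.random.randn(3, 3, 32, 64).astype(np.float32)
q, scale, zp = quantize_per_channel(w)
print("max reconstruction error:", np.abs(dequantize(q, scale, zp) - w).max())
```

Per-layer quantization of activations uses the same affine mapping with a single scale and zero-point for the whole tensor, with ranges estimated from calibration data rather than read off the weights.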
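For items 2 and 5, TensorFlow Lite exposes weight-only ("dynamic range") quantization through its converter: weights are stored as 8-bit integers for a roughly 4x smaller model and dequantized on the fly, so the runtime needs no 8-bit arithmetic support. The snippet below is a sketch of that workflow; the SavedModel path is hypothetical, and the exact converter options have evolved across TensorFlow versions.

```python
import tensorflow as tf

# Hypothetical path to a trained float model exported as a SavedModel.
converter = tf.lite.TFLiteConverter.from_saved_model("/path/to/saved_model")
# Optimize.DEFAULT with no representative dataset yields dynamic-range
# quantization: 8-bit weights, float activations.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("model_quant.tflite", "wb") as f:
    f.write(tflite_model)
```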
1 Introduction
Deep networks are increasingly used for applications at the edge. Edge devices typically have lower compute capabilities and are constrained in memory and power consumption. Reducing the communication required to transfer models to the device also saves power and relaxes network connectivity requirements. There is therefore a pressing need for techniques that optimize models for smaller size, faster inference, and lower power consumption.
There is extensive research on this topic, and several approaches have been considered. One approach is to build efficient models from the ground up [1], [2], [3]. Another is to reduce model size by applying quantization, pruning, and compression techniques [4], [5], [6]. Faster inference has been achieved with efficient kernels for reduced-precision computation, such as GEMMLOWP [7], Intel MKL-DNN [8], ARM CMSIS [9], Qualcomm SNPE [10], and Nvidia TensorRT [11], and with custom hardware for fast inference [12], [13], [14].