Abstract
We present an overview of techniques for quantizing convolutional neural networks for inference with integer weights and activations.
1. Per-channel quantization of weights and per-layer quantization of activations to 8 bits of precision post-training produces classification accuracies within 2% of floating-point networks for a wide variety of CNN architectures (Section 3.1); a sketch of this scheme appears after this list.
2. Model sizes can be reduced by a factor of 4 by quantizing weights to 8 bits, even when 8-bit arithmetic is not supported. This can be achieved with simple post-training quantization of weights (Section 3.1); see the converter sketch after this list.
3. We benchmark latencies of quantized networks on CPUs and DSPs, observing speedups of 2x-3x for quantized implementations over floating point on CPUs. Speedups of up to 10x are observed on specialized processors with fixed-point SIMD capabilities, such as Qualcomm QDSPs with HVX (Section 6).
4. Quantization-aware training can provide further improvements, reducing the gap to floating point to 1% at 8-bit precision. It also allows reducing the precision of weights to 4 bits, with accuracy losses ranging from 2% to 10%; smaller networks see larger drops (Section 3.2).
5. We introduce tools in TensorFlow and TensorFlow Lite for quantizing convolutional networks (Section 3).
6. We review best practices for quantization-aware training to obtain high accuracy with quantized weights and activations (Section 4).
7. We recommend that per-channel quantization of weights and per-layer quantization of activations be the preferred quantization scheme for hardware acceleration and kernel optimization. We also propose that future processors and hardware accelerators for optimized inference support precisions of 4, 8, and 16 bits (Section 7).
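To make item 1 concrete, the sketch below shows affine (asymmetric) per-channel quantization of a weight tensor in NumPy. It is a minimal illustration, not the tooling of Section 3: the function names are ours, and we assume output channels sit on the last axis (TensorFlow's HWIO convolution layout).

```python
import numpy as np

def quantize_per_channel(w, num_bits=8):
    # Affine per-channel quantization: each output channel (last axis)
    # gets its own scale and zero-point from that channel's min/max range.
    qmin, qmax = 0, 2 ** num_bits - 1
    flat = w.reshape(-1, w.shape[-1])          # (elements, channels)
    w_min = np.minimum(flat.min(axis=0), 0.0)  # force the range to include 0
    w_max = np.maximum(flat.max(axis=0), 0.0)  # so zero is exactly representable
    scale = np.where(w_max > w_min, (w_max - w_min) / (qmax - qmin), 1.0)
    zero_point = np.round(qmin - w_min / scale).astype(np.int32)
    q = np.clip(np.round(w / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    # Map integer codes back to approximate float values.
    return (q.astype(np.float32) - zero_point) * scale

# Usage: quantize a 3x3 convolution kernel with 64 output channels.
w = np.random.randn(3, 3, 32, 64).astype(np.float32)
q, scale, zp = quantize_per_channel(w)
print("max reconstruction error:", np.abs(dequantize(q, scale, zp) - w).max())
```

Per-layer quantization of activations uses the same affine mapping with a single scale and zero-point for the whole tensor, with ranges estimated from calibration data rather than read off the weights.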
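For items 2 and 5, TensorFlow Lite exposes weight-only ("dynamic range") quantization through its converter: weights are stored as 8-bit integers for a roughly 4x smaller model and dequantized on the fly, so the runtime needs no 8-bit arithmetic support. The snippet below is a sketch of that workflow; the SavedModel path is hypothetical, and the exact converter options have evolved across TensorFlow versions.

```python
import tensorflow as tf

# Hypothetical path to a trained float model exported as a SavedModel.
converter = tf.lite.TFLiteConverter.from_saved_model("/path/to/saved_model")
# Optimize.DEFAULT with no representative dataset yields dynamic-range
# quantization: 8-bit weights, float activations.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("model_quant.tflite", "wb") as f:
    f.write(tflite_model)
```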
1 Introduction
Deep networks are increasingly used for applications at the edge. Edge devices typically have lower compute capabilities and are constrained in memory and power consumption. Reducing the communication required to transfer models to the device also saves power and relaxes network connectivity requirements. There is therefore a pressing need for techniques that optimize models for smaller size, faster inference, and lower power consumption.
There is extensive research on this topic, and several approaches have been considered. One approach is to build efficient models from the ground up [1], [2], [3]. Another is to reduce model size by applying quantization, pruning, and compression techniques [4], [5], [6]. Faster inference has been achieved with efficient kernels for reduced-precision computation, such as GEMMLOWP [7], Intel MKL-DNN [8], ARM CMSIS [9], Qualcomm SNPE [10], and Nvidia TensorRT [11], and with custom hardware for fast inference [12], [13], [14].