Quantization
WARNING
Quantization is experimental and subject to change.
Introduction to Quantization
Quantization refers to techniques for performing computations and storing tensors at lower bitwidths than floating point
precision. A quantized model executes some or all of the operations on tensors with integers rather than floating point
values. This allows for a more compact model representation and the use of high performance vectorized operations on
many hardware platforms. PyTorch supports INT8 quantization, which, compared to typical FP32 models, allows for a 4x
reduction in model size and a 4x reduction in memory bandwidth requirements. Hardware support for INT8
computations is typically 2 to 4 times faster than FP32 compute. Quantization is primarily a technique to speed up
inference, and only the forward pass is supported for quantized operators.
PyTorch supports multiple approaches to quantizing a deep learning model. In most cases the model is trained in FP32
and then the model is converted to INT8. In addition, PyTorch also supports quantization aware training, which models
quantization errors in both the forward and backward passes using fake-quantization modules. Note that during
quantization aware training the entire computation is carried out in floating point. At the end of quantization aware training,
PyTorch provides conversion functions to convert the trained model into lower precision.
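As a minimal sketch (not the definitive API usage), a post training static quantization workflow in eager mode might look like the following. The ToyModel class, its layer sizes, and the random calibration data are hypothetical and chosen only for illustration, and steps such as module fusion are omitted:

import torch
import torch.nn as nn
import torch.quantization

# Hypothetical toy model; QuantStub/DeQuantStub mark where tensors enter and
# leave the quantized part of the network.
class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.quant = torch.quantization.QuantStub()
        self.conv = nn.Conv2d(1, 1, 1)
        self.relu = nn.ReLU()
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.conv(x))
        return self.dequant(x)

model_fp32 = ToyModel().eval()

# Post training static quantization: attach a qconfig, insert observers,
# calibrate with representative data, then convert to an INT8 model.
model_fp32.qconfig = torch.quantization.get_default_qconfig('fbgemm')
prepared = torch.quantization.prepare(model_fp32)
prepared(torch.randn(4, 1, 8, 8))  # calibration pass with representative data
model_int8 = torch.quantization.convert(prepared)

# For quantization aware training, prepare_qat would insert fake-quantization
# modules instead of observers, and training then continues in floating point:
# model_fp32.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
# qat_model = torch.quantization.prepare_qat(model_fp32.train())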
At a lower level, PyTorch provides a way to represent quantized tensors and perform operations with them. They can be used
to directly construct models that perform all or part of the computation in lower precision. Higher-level APIs are provided
that incorporate typical workflows of converting an FP32 model to lower precision with minimal accuracy loss.
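For instance, a quantized tensor can be created and inspected directly; the scale and zero point values below are arbitrary and chosen only for illustration:

import torch

x = torch.randn(2, 3)

# Quantize with a single scale and zero point shared by the whole tensor.
q = torch.quantize_per_tensor(x, scale=0.1, zero_point=10, dtype=torch.quint8)

q.q_scale()        # quantization scale
q.q_zero_point()   # quantization zero point
q.int_repr()       # the underlying integer (uint8) values
q.dequantize()     # convert back to a floating point tensor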
Today, PyTorch supports the following backends for running quantized operators efficiently:
- x86 CPUs with AVX2 support or higher (without AVX2, some operations have inefficient implementations)
- ARM CPUs (typically found in mobile/embedded devices)
The corresponding implementation is chosen automatically based on the PyTorch build mode.
NOTE
PyTorch 1.3 doesn't provide quantized operator implementations on CUDA yet - this is a direction of future work. Move the
model to the CPU in order to test the quantized functionality.
Quantization-aware training (through FakeQuantize) supports both CPU and CUDA.
NOTE
When preparing a quantized model, it is necessary to ensure that the qconfig and the engine used for quantized computations
match the backend on which the model will be executed. Quantization currently supports two backends: fbgemm (for use on
x86, https://github.com/pytorch/FBGEMM) and qnnpack (for use on ARM, via the QNNPACK library,
https://github.com/pytorch/QNNPACK). For example, if you are interested in quantizing a model to run on ARM, it is
recommended to set the qconfig by calling:
qconfig = torch.quantization.get_default_qconfig('qnnpack')
for post training quantization and
qconfig = torch.quantization.get_default_qat_qconfig('qnnpack')
for quantization aware training.
In addition, the torch.backends.quantized.engine parameter should be set to match the backend. To use qnnpack for
inference, set the backend to qnnpack as follows:
torch.backends.quantized.engine = 'qnnpack'
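Putting this together, a sketch of preparing a model for ARM might look like the following, assuming model_fp32 is a float model such as the ToyModel sketched earlier and that the qnnpack engine is available in your build:

import torch
import torch.quantization

# The engine and the qconfig must refer to the same backend.
torch.backends.quantized.engine = 'qnnpack'
model_fp32.qconfig = torch.quantization.get_default_qconfig('qnnpack')

prepared = torch.quantization.prepare(model_fp32.eval())
# ... feed calibration batches through `prepared` here ...
model_int8 = torch.quantization.convert(prepared)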
Quantized Tensors
PyTorch supports both per tensor and per channel asymmetric linear quantization. Per tensor means that all the values
within the tensor are scaled the same way. Per channel means that the values in the tensor are scaled and offset by a
different value for each slice along a given dimension, typically the channel dimension of the tensor (effectively, the scale
and offset become vectors).
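As a small illustration, a per channel quantized tensor can be constructed with a vector of scales and zero points, one pair per slice along the chosen axis (axis=0 here, chosen arbitrarily):

import torch

x = torch.randn(2, 3)

# One (scale, zero_point) pair per slice along axis 0, i.e. per row here.
scales = torch.tensor([0.1, 0.05])
zero_points = torch.tensor([0, 5])
q = torch.quantize_per_channel(x, scales, zero_points, axis=0, dtype=torch.quint8)

q.q_per_channel_scales()       # vector of scales
q.q_per_channel_zero_points()  # vector of zero points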