Quantization
WARNING
Quantization is experimental and subject to change.
Introduction to Quantization
Quantization refers to techniques for performing computations and storing tensors at lower bitwidths than floating point
precision. A quantized model executes some or all of the operations on tensors with integers rather than floating point
values. This allows for a more compact model representation and the use of high performance vectorized operations on
many hardware platforms. PyTorch supports INT8 quantization, which, compared to typical FP32 models, allows for a 4x
reduction in model size and a 4x reduction in memory bandwidth requirements. Hardware support for INT8
computations is typically 2 to 4 times faster than FP32 compute. Quantization is primarily a technique to speed up
inference, and only the forward pass is supported for quantized operators.
PyTorch supports multiple approaches to quantizing a deep learning model. In most cases the model is trained in FP32
and then the model is converted to INT8. In addition, PyTorch also supports quantization aware training, which models
quantization errors in both the forward and backward passes using fake-quantization modules. Note that during
quantization aware training the entire computation is carried out in floating point. At the end of quantization aware training,
PyTorch provides conversion functions to convert the trained model into lower precision.
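As a minimal sketch (not the definitive API usage), a post training static quantization workflow in eager mode might look like the following. The ToyModel class, its layer sizes, and the random calibration data are hypothetical and chosen only for illustration, and steps such as module fusion are omitted:

import torch
import torch.nn as nn
import torch.quantization

# Hypothetical toy model; QuantStub/DeQuantStub mark where tensors enter and
# leave the quantized part of the network.
class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.quant = torch.quantization.QuantStub()
        self.conv = nn.Conv2d(1, 1, 1)
        self.relu = nn.ReLU()
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.conv(x))
        return self.dequant(x)

model_fp32 = ToyModel().eval()

# Post training static quantization: attach a qconfig, insert observers,
# calibrate with representative data, then convert to an INT8 model.
model_fp32.qconfig = torch.quantization.get_default_qconfig('fbgemm')
prepared = torch.quantization.prepare(model_fp32)
prepared(torch.randn(4, 1, 8, 8))  # calibration pass with representative data
model_int8 = torch.quantization.convert(prepared)

# For quantization aware training, prepare_qat would insert fake-quantization
# modules instead of observers, and training then continues in floating point:
# model_fp32.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
# qat_model = torch.quantization.prepare_qat(model_fp32.train())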
At a lower level, PyTorch provides a way to represent quantized tensors and perform operations with them. They can be used
to directly construct models that perform all or part of the computation in lower precision. Higher-level APIs are provided
that incorporate typical workflows of converting an FP32 model to lower precision with minimal accuracy loss.
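For instance, a quantized tensor can be created and inspected directly; the scale and zero point values below are arbitrary and chosen only for illustration:

import torch

x = torch.randn(2, 3)

# Quantize with a single scale and zero point shared by the whole tensor.
q = torch.quantize_per_tensor(x, scale=0.1, zero_point=10, dtype=torch.quint8)

q.q_scale()        # quantization scale
q.q_zero_point()   # quantization zero point
q.int_repr()       # the underlying integer (uint8) values
q.dequantize()     # convert back to a floating point tensor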
Today, PyTorch supports the following backends for running quantized operators efficiently:
- x86 CPUs with AVX2 support or higher (without AVX2, some operations have inefficient implementations)
- ARM CPUs (typically found in mobile/embedded devices)
The corresponding implementation is chosen automatically based on the PyTorch build mode.
NOTE
PyTorch 1.3 doesn't provide quantized operator implementations on CUDA yet - this is a direction of future work. Move the
model to the CPU in order to test the quantized functionality.
Quantization-aware training (through FakeQuantize) supports both CPU and CUDA.
NOTE
When preparing a quantized model, it is necessary to ensure that the qconfig and the engine used for quantized computations
match the backend on which the model will be executed. Quantization currently supports two backends: fbgemm (for use on
x86, https://github.com/pytorch/FBGEMM) and qnnpack (for use on ARM, via the QNNPACK library,
https://github.com/pytorch/QNNPACK). For example, if you are interested in quantizing a model to run on ARM, it is
recommended to set the qconfig by calling:
qconfig = torch.quantization.get_default_qconfig('qnnpack')
for post training quantization and
qconfig = torch.quantization.get_default_qat_qconfig('qnnpack')
for quantization aware training.
In addition, the torch.backends.quantized.engine parameter should be set to match the backend. To use qnnpack for
inference, set the backend to qnnpack as follows:
torch.backends.quantized.engine = 'qnnpack'
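Putting this together, a sketch of preparing a model for ARM might look like the following, assuming model_fp32 is a float model such as the ToyModel sketched earlier and that the qnnpack engine is available in your build:

import torch
import torch.quantization

# The engine and the qconfig must refer to the same backend.
torch.backends.quantized.engine = 'qnnpack'
model_fp32.qconfig = torch.quantization.get_default_qconfig('qnnpack')

prepared = torch.quantization.prepare(model_fp32.eval())
# ... feed calibration batches through `prepared` here ...
model_int8 = torch.quantization.convert(prepared)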
Quantized Tensors
PyTorch supports both per tensor and per channel asymmetric linear quantization. Per tensor means that all the values
within the tensor are scaled the same way. Per channel means that the values in the tensor are scaled and offset by a
different value for each slice along a given dimension, typically the channel dimension of the tensor (effectively, the scale
and offset become vectors).
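As a small illustration, a per channel quantized tensor can be constructed with a vector of scales and zero points, one pair per slice along the chosen axis (axis=0 here, chosen arbitrarily):

import torch

x = torch.randn(2, 3)

# One (scale, zero_point) pair per slice along axis 0, i.e. per row here.
scales = torch.tensor([0.1, 0.05])
zero_points = torch.tensor([0, 5])
q = torch.quantize_per_channel(x, scales, zero_points, axis=0, dtype=torch.quint8)

q.q_per_channel_scales()       # vector of scales
q.q_per_channel_zero_points()  # vector of zero points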