Channel Balancing for Accurate Quantization of Winograd Convolutions

Vladimir Chikin, Huawei Noah's Ark Lab, vladimir.chikin@huawei.com
Vladimir Kryzhanovskiy, Huawei Noah's Ark Lab, kryzhanovskiy.vladimir@huawei.com

Abstract

It is well known that Winograd convolution algorithms speed up the widely used small-size convolutions. However, the problem of quantization of Winograd convolutions is challenging – while quantization of slower Winograd algorithms does not cause problems, quantization of faster Winograd algorithms often leads to a significant drop in model quality. We introduce a novel class of Winograd algorithms that balances the filter and input channels in the Winograd domain. Unlike traditional Winograd convolutions, the proposed convolution balances the ranges of input channels on the forward pass by scaling the input tensor using special balancing coefficients (the filter channels are balanced offline). As a result of balancing, the inputs and filters of the Winograd convolution are much easier to quantize. Thus, the proposed technique allows us to obtain models with quantized Winograd convolutions whose quality is significantly higher than that of models with traditional quantized Winograd convolutions. Moreover, we propose a special direct algorithm for calculating the balancing coefficients, which does not require additional model training. This algorithm makes it easy to obtain post-training quantized balanced Winograd convolutions – one only needs to feed a few data samples to the model, without training, to calibrate the special parameters. In addition, it is possible to initialize the balancing coefficients using this algorithm and further train them as trainable variables during Winograd quantization-aware training for a greater quality improvement.

1. Introduction

Lightweight architectural designs of Convolutional Neural Networks (CNNs), together with quantization, have paved the way for the deployment of demanding computer vision applications on mobile devices. In parallel, alternative formulations of the convolution operation, such as FFT or Winograd, have been adapted for use in CNNs, allowing further speedups. Winograd convolutions [28] (also known as the Toom-Cook algorithm [7, 27]) are the fastest known algorithm for spatially small convolutions, but their use in a quantized context is often accompanied by a quality drop due to significant numeric errors. The significant speed and power consumption advantage of quantized Winograd convolutions motivates research in this direction, see Tab. 1. As we can see, the quantized Winograd convolution is 2.3x (50/21.5) faster than the float16 direct convolution. In practice, speedups of up to 4x (ARM CPUs) can be achieved [20].

Implementation | FP16, ms | INT8, ms
Direct conv    | 50       | 30
Winograd conv  | 25       | 21.5

Table 1. Inference time for ResNet-18 on a 224 x 224 x 3 input, BOLT inference framework [11], ARM CPU: Kirin 990.

Winograd convolutions. Lavin et al. [15] generalized Winograd's minimal filtering algorithm [28] to filters with k x k kernels, which are widely used in modern CNNs. This generalization is called the Winograd 2D-convolution. Its main idea is based on the fact that the minimal filtering algorithm for computing m outputs of a k-tap FIR filter requires

µ(F(m, k)) = m + k − 1    (1)

multiplications, instead of the mk multiplications required by the direct 1D-convolution. Further in the paper we use the short form F(m, k) to denote F(m x m, k x k) – a 2D-convolution with kernel size k x k producing m x m outputs. The parameter m is called the tile size.
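To make Eq. (1) concrete, below is a minimal numpy sketch, added here for illustration only (it is not code from the paper; the function names are hypothetical). It implements the classical 1D minimal filtering algorithm F(2, 3): m = 2 outputs of a 3-tap FIR filter are computed with m + k − 1 = 4 multiplications instead of the mk = 6 required by the direct computation.

```python
import numpy as np

def direct_fir_f23(d, g):
    # Direct computation of 2 outputs of a 3-tap FIR filter: 2 * 3 = 6 multiplications.
    return np.array([d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
                     d[1] * g[0] + d[2] * g[1] + d[3] * g[2]])

def winograd_f23(d, g):
    # Minimal filtering algorithm F(2, 3): only m + k - 1 = 4 multiplications.
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return np.array([m1 + m2 + m3, m2 - m3 - m4])

d = np.random.randn(4)   # input tile of size m + k - 1 = 4
g = np.random.randn(3)   # 3-tap filter
assert np.allclose(direct_fir_f23(d, g), winograd_f23(d, g))
```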
In this paper, we consider the most popular case of Winograd convolutions – a 3 x 3 kernel with stride 1. Traditionally, the Winograd convolution algorithm is written in the following matrix form:

Y = A^T ((B^T X B) ⊙ (G W G^T)) A,    (2)

where tensor indices are omitted for simplicity. Different Winograd transformation matrices A, B, G are used for different Winograd tile sizes m. The proposed method generalizes to any m (including complex Winograd [22]). In this paper, we use the traditional Winograd transformation matrices from [15] for the F(4, 3) and F(6, 3) Winograd algorithms; their definitions are given in Appendix A.

Figure 1. The scheme of the balanced quantized Winograd convolution and the ratio of computational complexities of its stages, for a 3 x 3 kernel and stride 1. (a) The balanced quantized Winograd algorithm; relative complexities are indicated in the bottom-right corner. We add a new, cheap operation to the preprocessing stage: channel balancing of the Winograd-domain inputs and filters (highlighted in red). The weight transformation is done offline and is therefore not counted in the complexity. ⌊·⌉ is the quantization operator defined by Eq. (3). (b) The ratio of the stage complexities as a function of the number of filters and channels (C = F). For large C and F, the multiplication in the Winograd domain is the heaviest stage, so it is important to speed this stage up via quantization (integer arithmetic). A theoretical evaluation of the balancing overhead, as well as measured inference times of our implementation of quantized Winograd convolutions (with and without balancing), can be found in Appendix C. The balancing overhead is small and shrinks quickly as C and F grow. The relative complexities in scheme (a) are given for F = C = 32.

Figure 1a shows the stages of the Winograd algorithm. Note that the standard Winograd convolution does not include the balancing and quantization/dequantization stages, which are also shown in the figure. First, a sliding window of size a x a with stride m moves over the feature map and cuts out a sub-tensor X of the corresponding size from the feature-map tensor (see the notation in the figure), where a = m + k − 1. Then the c-th tile Xc ∈ R^(a x a) is mapped to the Winograd domain using the matrix B: Xc → Vc, c = 1, 2, ..., C. The weights go through the same procedure (W → U), but this is done offline before inference and is therefore not counted in the computational complexity. After these steps, the mapped input V is convolved with the mapped weights U. Finally, the tensor M obtained in the previous step is transformed back using the matrix A. This procedure produces m x m output pixels in parallel and is more efficient than the direct convolution.

Winograd computation bottleneck. The Winograd transformation matrices A and B have a simple structure, so they can be hard-coded in the inference code to further improve performance. To estimate the relative complexity of the algorithm stages, we count the numbers of additions and multiplications, assuming that direct matrix multiplication is used (rather than hard-coding). Although this estimate may be far from reality, since it strongly depends on the type of inference engine (CPU, GPU, or DNN accelerator), let us treat it as an approximation (see the percentages in the bottom-right corner of Fig. 1a). As we can see, the multiplication in the Winograd domain (producing the matrix M) is the most computationally expensive stage: it accounts for 52.8% of all computations (calculated for C = F = 32). Moreover, as the number of filters F and channels C grows, this share tends to 100% (see Fig. 1b). It is therefore highly desirable to speed up this stage via quantization.

Key problem. As a rule, the smaller the quality drop from quantization of a Winograd convolution F(m, 3), the lower the speedup of the Winograd convolution. Small Winograd tile sizes (m = 2) can usually be quantized without quality degradation, whereas large tile sizes (m >= 4) are hard to quantize (see Tab. 3 or Tab. 4). Many studies are devoted to this problem; nevertheless, the quality drop often remains unacceptable. We identified the reason why the quality drop under post-training quantization (PTQ) cannot be compensated by quantization-aware training (QAT): a significant imbalance of the data ranges in the Winograd domain (see Sec. 4). We propose a technique that solves this problem: the ranges of the Winograd-domain input channels are equalized on the forward pass by scaling the input tensor with special balancing coefficients (the filter channels are balanced offline). As shown in Fig. 2a and Appendix C, balancing is a very cheap operation, yet it reduces the accuracy drop considerably, e.g., by a factor of 1.8 for ResNet-18 (ImageNet, F(6, 3), 8 bits, Tab. 4).

Work | Quant | Type | Offset | BN fusing | Scale types | Weights & Activations | Winograd
[17], IEEE-2020 | PTQ | Dynamic | Yes | No | scalar | both | F(2, 3)
[11], ICLR-2019 | PTQ | Dynamic | No | Yes | scalar | both | F(2, 3)
[22], ARM, 2019 | PTQ | Dynamic | Yes | No | tile | weights only | F(2, 3), F(4, 3), complex Winograd
[10], MLSyS-2020 | QAT | Static | No | No | scalar | both | F(2, 3), F(4, 3), F(6, 3), trainable transforms
[6], BOLT inference framework | PTQ | Dynamic, Static | No | Yes | tile, tile for act. | both | F(4, 3)

Table 2. Summary of the main well-known works and frameworks that propose methods for quantization of Winograd convolutions.

Our contribution. The contributions of this work can be summarized as follows:

1. We introduce a novel method for balancing the channel ranges of the inputs and filters of quantized Winograd convolutions. It is characterized by:
(a) Lightweight: a small computational overhead;
(b) Compatibility: compatible with any existing techniques for quantization of Winograd convolutions and works well for both PTQ and QAT (Ω is a trainable variable in the latter case);
(c) Universality: does not depend on the type of Winograd algorithm.

2. Our experiments on super-resolution (SR) and image classification tasks, conducted for various bitwidths (b = 4, 6, 8), tile sizes m (m = 4, 6), quantization types (dynamic/static) and scale types (scalar/tile), show that the proposed balancing technique significantly improves the quality of Winograd quantization.

Our technique is not a panacea, but it substantially extends the application area of quantized Winograd convolutions. We believe that it will have its rightful place in the standard quantization pipeline as an additional technique to further improve the quality of Winograd quantization.
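The following sketch (again an illustrative numpy example added by the editor, not the paper's code) instantiates Eq. (2) for F(2, 3), using the standard transformation matrices B, G, A from Lavin et al. [15]; the paper itself works with F(4, 3) and F(6, 3), whose matrices are listed in its Appendix A. A 4 x 4 input tile and a 3 x 3 filter are mapped to the Winograd domain, multiplied element-wise, and mapped back, reproducing the direct 2 x 2 "valid" correlation computed by a convolutional layer.

```python
import numpy as np

# Winograd transformation matrices for F(2x2, 3x3) from Lavin & Gray [15].
B_T = np.array([[1, 0, -1, 0],
                [0, 1,  1, 0],
                [0, -1, 1, 0],
                [0, 1,  0, -1]], dtype=np.float64)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=np.float64)

def winograd_f2x2_3x3(X, W):
    # Eq. (2): Y = A^T [ (B^T X B) elementwise* (G W G^T) ] A
    V = B_T @ X @ B_T.T          # input tile mapped to the Winograd domain (4x4)
    U = G @ W @ G.T              # filter mapped to the Winograd domain (offline in practice)
    M = U * V                    # element-wise multiplication in the Winograd domain
    return A_T @ M @ A_T.T       # inverse transform: 2x2 output tile

def direct_corr_valid(X, W):
    # Direct 3x3 "valid" cross-correlation producing a 2x2 output (as in conv layers).
    Y = np.zeros((2, 2))
    for i in range(2):
        for j in range(2):
            Y[i, j] = np.sum(X[i:i + 3, j:j + 3] * W)
    return Y

X = np.random.randn(4, 4)   # one a x a input tile, a = m + k - 1 = 4
W = np.random.randn(3, 3)   # one 3x3 filter
assert np.allclose(winograd_f2x2_3x3(X, W), direct_corr_valid(X, W))
```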
2. Related works

Quantization is a powerful tool for compressing and accelerating neural networks by using low-precision numbers. Quantization without training, also called Post-Training Quantization (PTQ), is a difficult but highly demanded task, since such quantization does not require complex calculations, has a high execution speed, and can be performed efficiently on mobile devices. Various PTQ methods are actively proposed by the research community [1, 18, 23–25]. The cross-layer equalization and quantization bias correction techniques from [24] are effective methods that do not require any data and can help preserve the quality of the quantized model. The cross-layer equalization procedure from [24] is based on channel balancing of the weights of subsequent convolutions. The idea of factorizing layer channels to improve the quality of subsequent quantization is also investigated in [21]. In these works, the channel balancing technique is applied only to the weight tensors of traditional direct convolutions. AdaRound [23] is one of the popular and effective methods for PTQ, which adaptively adjusts the rounding function for different weight components of the model layers. While being the quickest approach, PTQ usually leads to a decrease in the quality of the quantized network.

Quantization-Aware Training (QAT) is a popular way to obtain a quantized model of high quality [4, 5, 8, 9, 13, 29]. In many studies on QAT, the authors reach quantized quality close to the original. QAT employs stochastic gradient descent with quantized weights and activations during training. This class of quantization methods has a few drawbacks: these methods require the full training dataset (which is usually private and whose distribution can be restricted) and computationally expensive training (the fine-tuning time of a quantized model can reach the full-precision training time), which is sometimes not available. A popular solution to the problem of non-differentiability of rounding functions during training of quantized models is to estimate the gradients of such functions using straight-through estimators [3].

The Winograd algorithm for acceleration of CNNs was first applied by Lavin et al. in [15]. Since then, a lot of research has focused on improving Winograd convolutions, in particular on quantization of Winograd convolutions. The most popular works on quantization of Winograd convolutions are summarized in Tab. 2. In [19], Winograd convolutions are extended to the Residue Number System (RNS), which enables the use of bigger Winograd transformation tiles. Work [16] proposes a method for post-training quantization of Winograd convolutions, which uses a special optimization of quantization parameters to increase the final quality. There are also inference frameworks that support quantized Winograd convolutions, for example, BOLT [6], LANCE [17], or Intel Caffe [11]. Typically, inference frameworks that support quantized Winograd convolutions perform post-training quantization automatically. More complex and efficient Winograd algorithms are more sensitive to quantization, so many frameworks [11, 17] do not use the most efficient algorithms, since quantization would lead to a significant accuracy drop. Moreover, many frameworks implement dynamic quantization, in which the quantization scales of the Winograd-domain inputs are computed online, because it provides better quality, even though it is less efficient than static quantization, for which the input quantization scales are estimated in advance.

Many methods for quantization of Winograd convolutions are based on PTQ because of its convenience. However, several works [2, 10] address QAT of Winograd convolutions. In [10], in addition to QAT, Neural Architecture Search (NAS) is used to select different tile sizes m for different Winograd convolutions when preparing a model with quantized Winograd convolutions. Furthermore, various additional techniques can be used to improve the quality of quantized Winograd convolutions, for example, tile scales or affine quantization schemes [14]. Nevertheless, the quality drop often remains unacceptable, especially for the more efficient Winograd algorithms (m >= 4) or for quantization with scalar scales, which is the main quantization format in some existing frameworks [11, 17]. The balancing technique proposed in this paper significantly improves the quality of quantized Winograd convolutions, making it possible to use more efficient Winograd algorithms and quantization formats without degrading model quality.
3. Preliminaries

For brevity, we refer the reader to the whitepaper [25] for SOTA quantization algorithms and terminology. In our paper, we use symmetric uniform quantization [14]:

˜Z = ⌊Z · s⌉ = Clip(Round(Z · s), −B, +B),    (3)

where the tilde denotes that ˜Z is an integer tensor with values in [−B, B] (the same notation as in Fig. 1a), B = 2^(b−1) − 1, b is the quantization bitwidth, and s = B / max |Z| is the quantization scale (we discuss it below). There are more precise quantization types that could further improve the results of the paper, for example, affine quantization with an offset [14]. We choose Eq. (3) for simplicity. Let us now focus on the quantization scales sv and su that correspond to the Winograd-domain inputs V and weights U. As shown in Fig. 1a, they are used in two quantization stages (ˆV → ˜V and ˆU → ˜U) and one dequantization stage (˜M → M). In our work, we consider two types of quantization scales:

1. Scalar: sv, su ∈ R are scalars, used for quantization of the tensors V and U correspondingly (tensor sizes are given in Fig. 1a). This case is the most difficult for quantization (see Tab. 3), but it was considered in many previous studies (see Tab. 2).

2. Tile: sv, su ∈ R^(a x a) are square matrices. Each pixel Vij ∈ R^C in the tensor V ∈ R^(C x a x a) has its own quantization scale (sv)ij. This means that we quantize pixels independently. Such an approach is much more precise, since the dynamic range of pixels varies significantly. For simplicity, the same pixels in filters share the same quantization scale: (su)ij is used for quantization of the tensor Uij ∈ R^(F x C).

From the quality perspective, the scalar scale sv ∈ R is much worse than the tile scale sv ∈ R^(a x a) (see Tab. 3), but in terms of the number of operations they are equal (see step 3 in Fig. 1a): both sv ⊙ ˆVc and sv · ˆVc require a^2 operations, ˆVc ∈ R^(a x a). Furthermore, this paper covers two quantization types related to the quantization scale sv:

1. Dynamic: sv is recalculated for every input V at the inference stage dynamically:

(sv)nij = B / max |Vnij|,    (4)

where n denotes the n-th processing iteration, Vnij ∈ R^C. The approximate relative cost of this makes up a quarter of the relative cost of the quantization stage in Fig. 1b. In practice, the dynamic approach is rarely used because it is hard to implement efficiently; however, it is much more precise (see Tab. 3 and Tab. 4).

2. Static: sv is estimated using a set of inputs V obtained from the training or the calibration dataset. There are different methods to estimate sv (see [25]). For simplicity, we use the simplest approach:

(sv)ij = (1/N) Σn (sv)nij,    (5)

where n denotes the n-th element of the calibration set, and N denotes the number of elements in this set.

Now we can write the formula of the Quantized Winograd convolution (QW):

Yf = A^T [ ( Σc ⌊B^T Xc B · sv⌉ ⊙ ⌊G Wfc G^T · su⌉ ) / (su sv) ] A.    (6)

For simplicity, Eq. (6) is written for the scalar scales. The computation in the numerator (˜M in Fig. 1a) is done using integer arithmetic. Using INT8 data speeds up the computations by 1.4x to 2.2x (strongly depending on the implementation) by increasing the number of elements processed at the same time, improving the efficiency of cache utilization, and reducing the number of read/write operations to RAM.
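To tie Eqs. (3) and (6) together, here is a minimal numpy sketch of a quantized Winograd convolution over one spatial tile with scalar scales. It is an illustration under simplified assumptions, not the paper's implementation: F(2, 3) is used instead of F(4, 3)/F(6, 3) only to keep the matrices short, and the function names are hypothetical. The Winograd-domain input V and weights U are quantized symmetrically per Eq. (3), accumulated as integers over channels, dequantized by su · sv, and transformed back with A.

```python
import numpy as np

# F(2x2, 3x3) transformation matrices (Lavin & Gray [15]), as in the earlier sketch.
B_T = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]], dtype=np.float64)
G = np.array([[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]], dtype=np.float64)
A_T = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=np.float64)

def fp_winograd(X, W):
    # Full-precision Winograd F(2, 3): X is C x 4 x 4, W is F x C x 3 x 3 -> Y is F x 2 x 2.
    V = np.einsum('ij,cjk,lk->cil', B_T, X, B_T)       # B^T X_c B
    U = np.einsum('ij,fcjk,lk->fcil', G, W, G)         # G W_fc G^T (offline in practice)
    M = np.einsum('fcij,cij->fij', U, V)               # sum over channels of element-wise products
    return np.einsum('ij,fjk,lk->fil', A_T, M, A_T)    # A^T M_f A

def quantized_winograd(X, W, bits=8):
    # Eq. (6) with scalar scales: quantize V and U per Eq. (3), accumulate as integers, dequantize.
    B = 2 ** (bits - 1) - 1
    V = np.einsum('ij,cjk,lk->cil', B_T, X, B_T)
    U = np.einsum('ij,fcjk,lk->fcil', G, W, G)
    s_v = B / np.abs(V).max()                          # scalar input scale, cf. Eq. (4)
    s_u = B / np.abs(U).max()                          # scalar weight scale, computed offline
    V_q = np.clip(np.round(V * s_v), -B, B)            # Eq. (3): integer tensor
    U_q = np.clip(np.round(U * s_u), -B, B)
    M_int = np.einsum('fcij,cij->fij', U_q, V_q)       # integer multiply-accumulate
    M = M_int / (s_u * s_v)                            # dequantization
    return np.einsum('ij,fjk,lk->fil', A_T, M, A_T)

X = np.random.randn(32, 4, 4)        # one a x a tile per input channel
W = np.random.randn(32, 32, 3, 3)    # F x C x k x k filters
Y_fp, Y_q = fp_winograd(X, W), quantized_winograd(X, W, bits=8)
print('relative INT8 quantization error:',
      np.linalg.norm(Y_q - Y_fp) / np.linalg.norm(Y_fp))
```

On well-behaved random data this error stays small; the paper's observation is that real feature maps have strongly imbalanced channel ranges in the Winograd domain, which is what makes scalar-scale quantization of large-tile Winograd algorithms difficult.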
Figure 2. The channel balancing procedure for quantized Winograd convolutions. (a) The idea of channel balancing. Before balancing: the heterogeneity of the channel ranges in the tensors V and U leads to an increased quantization error. After balancing: multiplying an element of V by Ωi and dividing the corresponding element of U by the same value Ωi does not change the norm of M, but makes it possible to equalize the channel ranges and thereby significantly improve the quality of quantization. (b) ResNet-20 with Winograd convolutions on the CIFAR-10 dataset: the expected ratio of the standard deviations of the channel ranges of the inputs and filters of all Winograd convolutions before and after balancing. In the vast majority of cases, channel balancing significantly reduces the standard deviation of the channel ranges of the inputs V and weights U. For the F(6, 3) algorithm, there are a few cases of a small increase in the variance of the dynamic ranges of the weights U (see the ratios for layers 0, 1, 7 and 15), but overall the positive effect of equalization remains strong.

4. Balancing Winograd convolution channels

The following rule holds for any method of calculating convolutions, and in particular for Winograd convolutions: the more balanced the channel ranges of the inputs and weights of a convolution, the better its quantization. We can divide the channels of the Winograd-domain input by some balancing coefficients Ω and multiply the corresponding channels of the Winograd-domain filter by the same balancing coefficients Ω (the rectangles denote the data ranges in Fig. 2a). As a result, the transformed full-precision Winograd convolution (hereinafter the "FP Winograd") is equivalent to the original Winograd convolution. Moreover, if the balancing coefficients are chosen correctly, the new Winograd convolution is easier to quantize, since the channel ranges of the Winograd-domain inputs and filters become more balanced (Fig. 2a: small rectangles are stretched and large ones are squeezed, so that all channels end up with equalized ranges). By analogy with Eq. (6), the Balanced Quantized Winograd (BQW) convolution can be written as:

Yf = A^T [ Σc ( (B^T Xc B) / Ωc ) ⊙ ( (G Wfc G^T) ⊙ Ωc ) ] A,    (7)

where, for readability, the quantization scales su and sv are omitted (but implied). Here it is clear that balancing of the FP Winograd does not change the output Yf, since the variable Ω cancels out. Note that division and multiplication by Ω ∈ R^(C x a x a) are element-wise operations (Hadamard products). The equations in Fig. 1a show the formulas of the computation stages from Eq. (7). Let us first look at the tensor sizes: the tensor V ∈ R^(C x a x a) and the tensor Ω have the same size, but the tensor U ∈ R^(F x C x a x a) is four-dimensional. Therefore, to balance the weights U, each Uf must be multiplied element-wise by Ω, i.e., F times. However, this is not a problem, since this mapping U → ˆU and the quantization ˆU → ˜U are performed offline. Figure 1b shows that the relative complexity of balancing is small to begin with (1.2%, Fig. 1a) and decreases quickly as C and F grow. Moreover, in the case of static quantization, the scale sv can be fused into the Ω tensor, and the balancing overhead becomes zero (see Sec. 6). A proper choice of the balancing parameters Ω can significantly reduce the variance of the channel ranges of the Winograd-domain filters and inputs, and thereby preserve high model quality during quantization of the Winograd convolution. For post-training quantization of Winograd convolutions, we propose a special direct algorithm for computing the balancing tensors that does not require additional model training; see Sec. 5 for details.
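Since this excerpt ends before Sec. 5, the paper's direct algorithm for computing Ω is not available here; the sketch below therefore uses a simple geometric-mean heuristic (an assumption of the editor, not the paper's method) for a balancing tensor Ω ∈ R^(C x a x a), only to illustrate the two properties of Eq. (7) stated above: the full-precision output is unchanged because Ω cancels, and with more balanced channel ranges the quantization error of the scalar-scale INT8 Winograd convolution drops. All data, scales, and function names are synthetic.

```python
import numpy as np

# F(2x2, 3x3) transforms (Lavin & Gray [15]), as in the previous sketches.
B_T = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]], dtype=np.float64)
G = np.array([[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]], dtype=np.float64)
A_T = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=np.float64)

def winograd_output(V, U):
    M = np.einsum('fcij,ncij->nfij', U, V)             # multiplication in the Winograd domain
    return np.einsum('ij,nfjk,lk->nfil', A_T, M, A_T)  # A^T M_f A for every tile and filter

def fake_quant(Z, bits=8):
    # Scalar-scale symmetric quantize-dequantize, Eq. (3).
    B = 2 ** (bits - 1) - 1
    s = B / np.abs(Z).max()
    return np.clip(np.round(Z * s), -B, B) / s

# Synthetic calibration tiles with badly mismatched channel ranges
# (activations are large exactly where weights are small).
N, C, F = 16, 32, 32
rng = np.random.default_rng(0)
act_scale = np.logspace(-1.5, 1.5, C)
X = rng.standard_normal((N, C, 4, 4)) * act_scale[None, :, None, None]
W = rng.standard_normal((F, C, 3, 3)) / act_scale[None, :, None, None]

V = np.einsum('ij,ncjk,lk->ncil', B_T, X, B_T)         # N x C x 4 x 4 Winograd-domain inputs
U = np.einsum('ij,fcjk,lk->fcil', G, W, G)             # F x C x 4 x 4 Winograd-domain filters
Y_fp = winograd_output(V, U)

# Hypothetical balancing tensor Omega in R^(C x a x a): geometric-mean equalization of the
# per-(channel, pixel) ranges of V (over calibration tiles) and U (over filters).
# The paper's own direct algorithm for Omega (its Sec. 5) is not part of this excerpt.
v_rng = np.abs(V).max(axis=0) + 1e-12                  # C x 4 x 4
u_rng = np.abs(U).max(axis=0) + 1e-12                  # C x 4 x 4
Omega = np.sqrt(v_rng / u_rng)

V_bal, U_bal = V / Omega, U * Omega                    # Eq. (7): Omega cancels in full precision
assert np.allclose(winograd_output(V_bal, U_bal), Y_fp)

err_plain = np.linalg.norm(winograd_output(fake_quant(V), fake_quant(U)) - Y_fp)
err_bal = np.linalg.norm(winograd_output(fake_quant(V_bal), fake_quant(U_bal)) - Y_fp)
print(f'INT8 error without balancing: {err_plain:.4f}')
print(f'INT8 error with balancing:    {err_bal:.4f}')  # typically much smaller on this synthetic toy
```

A heuristic of this kind loosely mirrors the cross-layer equalization idea from [24] mentioned in Sec. 2, applied here inside the Winograd domain; in the QAT setting described in the abstract, Ω would instead be initialized by the paper's direct algorithm and then trained as a variable.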