deemed less critical, based on their scores. This targeted reduction aims to maintain a robust
pruning ratio while preserving the model’s accuracy. The strategy of dimensional redistribution,
as proposed by Yang et al. [29], may be integrated into the pruning process, further refining the
model’s performance. Intriguingly, studies have shown that a model, post-pruning, can occasionally
surpass the original in performance, indicating the potential of pruning to not only simplify but also
to enhance the functionality of the model [30].
3 Methodology

3.1 Quantization

3.1.1 Basic Concept
The overarching objective of quantization is to reduce the precision of model parameters (θ) and
intermediate activation maps to a lower precision format, such as 8-bit integers, while minimizing the
impact on the model’s generalization performance. The initial step in this process involves defining
a quantization function capable of mapping weights and activations to a discrete set of values. A
commonly utilized function for this purpose is delineated as follows:
\[
Q(r) = \mathrm{Int}(r/S) - Z, \tag{1}
\]
where Q represents the quantization mapping function, r denotes a real-valued input (e.g., a weight
or an activation), S is a scaling factor, and Z is an integer zero point. This mechanism, known as
uniform quantization, yields equidistantly spaced quantized values; non-uniform quantization
strategies also exist. Moreover, the original real value r can be approximated from its quantized
counterpart Q(r) through a process known as dequantization:
\[
\tilde{r} = S\,(Q(r) + Z), \tag{2}
\]
where the approximation r̃ may differ from r due to rounding errors inherent in quantization.
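To make Eqs. (1) and (2) concrete, the following is a minimal NumPy sketch of uniform quantization
and dequantization for a given scale S and zero point Z. The function names, the clipping to an
unsigned b-bit range, and the example values are illustrative assumptions rather than details taken
from the cited works.

```python
import numpy as np

def quantize(r, S, Z, b=8):
    """Uniform quantization, Eq. (1): Q(r) = Int(r / S) - Z."""
    q = np.round(r / S) - Z
    # In practice the result is also clipped to the b-bit integer range;
    # the unsigned range [0, 2^b - 1] is assumed here.
    return np.clip(q, 0, 2 ** b - 1).astype(np.int32)

def dequantize(q, S, Z):
    """Dequantization, Eq. (2): r~ = S * (Q(r) + Z)."""
    return S * (q.astype(np.float32) + Z)

# Round-trip: r~ matches r only up to rounding (and clipping) error.
r = np.array([0.03, 0.41, 0.87, 1.20], dtype=np.float32)
S, Z = 1.25 / 255, 0  # example scale and zero point
r_tilde = dequantize(quantize(r, S, Z), S, Z)
```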
A critical aspect of quantization is determining the optimal scaling factor S, which effectively
partitions real values r into discrete segments:
\[
S = \frac{\beta - \alpha}{2^{b} - 1}, \tag{3}
\]
with [α, β] representing the clipping range and b denoting the bit width of quantization. The selection
of the clipping range [α, β], a process termed as calibration, is pivotal. A straightforward method
involves employing the minimum and maximum of the inputs as the clipping range, i.e.,
α = r_min and β = r_max, corresponding to an asymmetric quantization scheme where, in general,
−α ≠ β. Alternatively, a symmetric quantization approach, where −α = β = max(|r_max|, |r_min|),
can be employed. In such cases, the quantization function in Eq. 1 can be simplified by setting Z = 0.
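The two calibration choices described above can be sketched as follows. Deriving the zero point as
Z = round(α / S), so that α maps to the lowest quantization level, is an assumed convention that is
consistent with Eqs. (1)-(3) but is not prescribed by the text.

```python
import numpy as np

def calibrate_asymmetric(r, b=8):
    """Min/max calibration: alpha = r_min, beta = r_max (in general -alpha != beta)."""
    alpha, beta = float(r.min()), float(r.max())
    S = (beta - alpha) / (2 ** b - 1)  # Eq. (3)
    Z = int(round(alpha / S))          # assumed convention: alpha maps to quantization level 0
    return S, Z

def calibrate_symmetric(r, b=8):
    """Symmetric calibration: -alpha = beta = max(|r_max|, |r_min|), hence Z = 0."""
    beta = float(np.abs(r).max())
    S = (2 * beta) / (2 ** b - 1)      # Eq. (3) with alpha = -beta
    return S, 0
```

Under the symmetric scheme the quantized values are centered around zero, so the earlier quantize
sketch would clip to the signed b-bit range rather than to [0, 2^b - 1].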
3.1.2 Post Training Quantization
Post Training Quantization (PTQ) streamlines the quantization process by adjusting weights directly,
without necessitating further fine-tuning. This efficiency, however, may lead to notable accuracy
declines due to the inherent precision loss of quantization. Liu et al. [31] observed substantial accu-
racy reductions when applying quantization to LayerNorm and Softmax layers within Transformer
architectures. Lin et al. [32] attributed these discrepancies to the highly polarized distributions of
LayerNorm activation values and of attention map values. Specifically, significant inter-channel
variability within LayerNorm layer inputs (as illustrated on the left side of Figure 1) induces considerable
quantization errors when employing layer-wise quantization approaches. Moreover, a predominance
of small-value distributions in attention maps—with only sparse outliers approaching a value of
1—further exacerbates performance declines under uniform quantization strategies. Addressing these
challenges, Lin et al. [32] introduced a novel quantization approach employing Powers-of-Two Scale
for LayerNorm and Log-Int-Softmax for Softmax layers, aiming to mitigate the adverse effects of
traditional quantization methods.
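As a rough illustration of why a log-domain (non-uniform) quantizer suits attention maps whose
values are mostly small with only sparse outliers near 1, the sketch below quantizes softmax outputs
on a log2 grid. It is a simplified illustration of the general idea only, not the Log-Int-Softmax or
Powers-of-Two Scale procedures of Lin et al. [32], and all names in it are hypothetical.

```python
import numpy as np

def log2_quantize(p, b=4):
    """Quantize softmax probabilities p in (0, 1] as q = round(-log2(p)).

    Small probabilities receive fine-grained levels, while the sparse
    values near 1 map to q = 0.
    """
    eps = 1e-12
    q = np.round(-np.log2(np.clip(p, eps, 1.0)))
    return np.clip(q, 0, 2 ** b - 1).astype(np.int32)

def log2_dequantize(q):
    """Approximate reconstruction: p~ = 2^(-q)."""
    return 2.0 ** (-q.astype(np.float32))

# Example: one row of a post-softmax attention map.
attn = np.array([0.001, 0.004, 0.020, 0.975], dtype=np.float32)
attn_hat = log2_dequantize(log2_quantize(attn))
```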