deemed less critical, based on their scores. This targeted reduction aims to maintain a robust
pruning ratio while preserving the model’s accuracy. The strategy of dimensional redistribution,
as proposed by Yang et al. [29], may be integrated into the pruning process, further refining the
model’s performance. Intriguingly, studies have shown that a model, post-pruning, can occasionally
surpass the original in performance, indicating the potential of pruning to not only simplify but also
to enhance the functionality of the model [30].
3 Methodology

3.1 Quantization

3.1.1 Basic Concept
The overarching objective of quantization is to reduce the precision of model parameters (θ) and
intermediate activation maps to a lower precision format, such as 8-bit integers, while minimizing the
impact on the model’s generalization performance. The initial step in this process involves defining
a quantization function capable of mapping weights and activations to a discrete set of values. A
commonly utilized function for this purpose is delineated as follows:
\[
Q(r) = \mathrm{Int}(r/S) - Z, \tag{1}
\]
where Q represents the quantization mapping function, r denotes a real-valued input (e.g., a weight
or an activation), S is a scaling factor, and Z is an integer zero point. This mechanism, known as
uniform quantization, yields equidistantly spaced quantized values; non-uniform quantization
strategies also exist. Moreover, the original real value r can be approximated from its quantized
counterpart Q(r) through a process known as dequantization:
\[
\tilde{r} = S\,(Q(r) + Z), \tag{2}
\]
where the approximation r̃ may differ from r due to rounding errors inherent in quantization.
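To make Eqs. (1) and (2) concrete, the following is a minimal NumPy sketch of uniform quantization
and dequantization for a given scale S and zero point Z. The function names, the clipping to an
unsigned b-bit range, and the example values are illustrative assumptions rather than details taken
from the cited works.

```python
import numpy as np

def quantize(r, S, Z, b=8):
    """Uniform quantization, Eq. (1): Q(r) = Int(r / S) - Z."""
    q = np.round(r / S) - Z
    # In practice the result is also clipped to the b-bit integer range;
    # the unsigned range [0, 2^b - 1] is assumed here.
    return np.clip(q, 0, 2 ** b - 1).astype(np.int32)

def dequantize(q, S, Z):
    """Dequantization, Eq. (2): r~ = S * (Q(r) + Z)."""
    return S * (q.astype(np.float32) + Z)

# Round-trip: r~ matches r only up to rounding (and clipping) error.
r = np.array([0.03, 0.41, 0.87, 1.20], dtype=np.float32)
S, Z = 1.25 / 255, 0  # example scale and zero point
r_tilde = dequantize(quantize(r, S, Z), S, Z)
```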
A critical aspect of quantization is determining the optimal scaling factor S, which effectively
partitions real values r into discrete segments:
\[
S = \frac{\beta - \alpha}{2^{b} - 1}, \tag{3}
\]
with [α, β] representing the clipping range and b denoting the bit width of quantization. The selection
of the clipping range [α, β], a process termed as calibration, is pivotal. A straightforward method
involves employing the minimum and maximum of the inputs as the clipping range, i.e.,
α = r_min and β = r_max, corresponding to an asymmetric quantization scheme where, in general,
−α ≠ β. Alternatively, a symmetric quantization approach, where −α = β = max(|r_max|, |r_min|),
can be employed. In such cases, the quantization function in Eq. 1 can be simplified by setting Z = 0.
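The two calibration choices described above can be sketched as follows. Deriving the zero point as
Z = round(α / S), so that α maps to the lowest quantization level, is an assumed convention that is
consistent with Eqs. (1)-(3) but is not prescribed by the text.

```python
import numpy as np

def calibrate_asymmetric(r, b=8):
    """Min/max calibration: alpha = r_min, beta = r_max (in general -alpha != beta)."""
    alpha, beta = float(r.min()), float(r.max())
    S = (beta - alpha) / (2 ** b - 1)  # Eq. (3)
    Z = int(round(alpha / S))          # assumed convention: alpha maps to quantization level 0
    return S, Z

def calibrate_symmetric(r, b=8):
    """Symmetric calibration: -alpha = beta = max(|r_max|, |r_min|), hence Z = 0."""
    beta = float(np.abs(r).max())
    S = (2 * beta) / (2 ** b - 1)      # Eq. (3) with alpha = -beta
    return S, 0
```

Under the symmetric scheme the quantized values are centered around zero, so the earlier quantize
sketch would clip to the signed b-bit range rather than to [0, 2^b - 1].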
3.1.2 Post Training Quantization
Post Training Quantization (PTQ) streamlines the quantization process by adjusting weights directly,
without necessitating further fine-tuning. This efficiency, however, may lead to notable accuracy
declines due to the inherent precision loss of quantization. Liu et al. [31] observed substantial accu-
racy reductions when applying quantization to LayerNorm and Softmax layers within Transformer
architectures. Lin et al. [32] attributed these discrepancies to the highly polarized distributions of
LayerNorm activation values and of attention map values. Specifically, significant inter-channel
variability within LayerNorm layer inputs (as illustrated on the left side of Figure 1) induces considerable
quantization errors when employing layer-wise quantization approaches. Moreover, a predominance
of small-value distributions in attention maps—with only sparse outliers approaching a value of
1—further exacerbates performance declines under uniform quantization strategies. Addressing these
challenges, Lin et al. [32] introduced a novel quantization approach employing Powers-of-Two Scale
for LayerNorm and Log-Int-Softmax for Softmax layers, aiming to mitigate the adverse effects of
traditional quantization methods.
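As a rough illustration of why a log-domain (non-uniform) quantizer suits attention maps whose
values are mostly small with only sparse outliers near 1, the sketch below quantizes softmax outputs
on a log2 grid. It is a simplified illustration of the general idea only, not the Log-Int-Softmax or
Powers-of-Two Scale procedures of Lin et al. [32], and all names in it are hypothetical.

```python
import numpy as np

def log2_quantize(p, b=4):
    """Quantize softmax probabilities p in (0, 1] as q = round(-log2(p)).

    Small probabilities receive fine-grained levels, while the sparse
    values near 1 map to q = 0.
    """
    eps = 1e-12
    q = np.round(-np.log2(np.clip(p, eps, 1.0)))
    return np.clip(q, 0, 2 ** b - 1).astype(np.int32)

def log2_dequantize(q):
    """Approximate reconstruction: p~ = 2^(-q)."""
    return 2.0 ** (-q.astype(np.float32))

# Example: one row of a post-softmax attention map.
attn = np.array([0.001, 0.004, 0.020, 0.975], dtype=np.float32)
attn_hat = log2_dequantize(log2_quantize(attn))
```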