decreasing the precision of the network on a per-layer
basis. Weights and activations are quantized to the
lowest bit-width possible without information loss according to our heuristic, and a certain degree of sparsity is induced, while the bit-width is kept high enough for further learning steps to succeed. This results in a network with advantages in terms of model size and in training and inference time. By training AlexNet and ResNet20 on the CIFAR10/100 datasets, we demonstrate, on the basis of an analytical model of the computational cost, that AdaPT is competitive in accuracy with a float32 baseline and yields a non-trivial reduction of computational cost (speedup). Compared to MuPPET, AdaPT also has certain intrinsic methodological advantages. After AdaPT training,
the model is fully quantized and sparsified to a cer-
tain degree s.t., unlike the case with MuPPET, which
outputs a float32 model, AdaPT carries over its ad-
vantages to the inference phase as well.
2 Background
2.1 Quantization
Numerical representation describes how numbers are
stored in memory (illustrated by fig. 1) and how
arithmetic operations on those numbers are con-
ducted. Commonly available on consumer hardware
are floating-point and integer representations, while
fixed-point or block-floating-point representations are
used in high-performance ASICs or FPGAs. The nu-
merical precision used by a given numerical represen-
tation refers to the number of bits allocated for the
representation of a single number, e.g. a real num-
ber stored in float32 refers to floating-point repre-
sentation in 32-bit precision. With these definitions
of numerical representation and precision in mind,
most generally speaking, quantization is the concept
of running a computation or parts of a computation
at reduced numerical precision or a different numeri-
cal representation with the intent of reducing compu-
tational costs and memory consumption. Quantized
execution of a computation, however, can lead to the introduction of an error, either through the machine epsilon $\epsilon_{\text{mach}}$ of the quantized representation being too large (underflow) to accurately depict the resulting real values, or through the representable range being too small to store the result (overflow).
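As a concrete illustration of these two error sources (a minimal sketch of our own using NumPy, not tied to any particular quantization scheme discussed later), casting float32 values to float16 exhibits both effects:

```python
import numpy as np

# Machine epsilon of the quantized representation: float16 cannot resolve
# an increment of 1e-4 on top of 1.0, so the addition is lost
# ("underflow" in the sense used above).
x = np.float32(1.0) + np.float32(1e-4)
print(np.float16(x) == np.float16(1.0))  # True: increment is below eps_mach of float16

# Representable range: float16 only covers magnitudes up to ~65504,
# so a larger float32 value cannot be stored (overflow).
y = np.float32(1e5)
print(np.float16(y))                     # inf
```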
Floating-Point Quantization The value v of a
floating point number is given by $v = \frac{s}{b^{p-1}} \times b^{e}$,
where s is the significand (mantissa), p is the pre-
cision (number of digits in s), b is the base and
e is the exponent [45]. Hence quantization using
floating-point representation can be achieved by re-
ducing the number of bits available for mantissa and
exponent, e.g. switching from a float32 to a float16
representation, and is offered out of the box by com-
mon machine learning frameworks for post-training
quantization [46, 47].
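To make the formula concrete, the following minimal Python sketch (the helper fp_value is hypothetical, introduced here purely for illustration) evaluates $v = \frac{s}{b^{p-1}} \times b^{e}$ for a base-10 example and shows how reducing the precision $p$ coarsens the representable values:

```python
def fp_value(s: int, p: int, b: int, e: int) -> float:
    """Value of a floating-point number, v = s / b**(p - 1) * b**e,
    where s is the integer significand with p digits in base b."""
    return s / b ** (p - 1) * b ** e

# Base-10 example: significand 12345 (p = 5 digits) and exponent -3
# encode the real value 1.2345e-3.
print(fp_value(s=12345, p=5, b=10, e=-3))  # ~0.0012345

# Reducing the precision to p = 3 keeps only 3 significand digits,
# so the same value can only be represented more coarsely.
print(fp_value(s=123, p=3, b=10, e=-3))    # ~0.00123
```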
Integer Quantization Integer representation is
available for post-training quantization and QAT
(int8, int16 due to availability on consumer hard-
ware) in common machine learning frameworks [46,
47]. Quantized training, however, is not supported, since integer-quantized activations are not meaningfully differentiable, making standard backpropagation inapplicable [33]. Special cases of integer
quantization are 1-bit and 2-bit quantization, which
are often referred to as binary and ternary quantiza-
tion in the literature.
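For illustration, the following NumPy sketch implements uniform affine int8 quantization with a scale and zero-point; this is a generic textbook scheme and not necessarily the exact implementation used by the frameworks cited above:

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Uniform affine int8 quantization: map real values to 8-bit integers
    via a scale and a zero-point (details differ between frameworks)."""
    qmin, qmax = -128, 127
    scale = float(x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - float(x.min()) / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Approximate reconstruction of the original real values."""
    return scale * (q.astype(np.float32) - zero_point)

x = np.random.randn(8).astype(np.float32)
q, scale, zp = quantize_int8(x)
print(np.abs(dequantize(q, scale, zp) - x).max())  # worst-case quantization error
```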
Block-Floating-Point Quantization Block-
floating-point represents each number as a pair
of a $WL$-bit (word length) signed integer $x$ and a scale factor $s$ s.t. the value $v$ is represented as $v = x \times b^{-s}$ with base $b = 2$ or $b = 10$. The scaling
factor s is shared across multiple variables (blocks),
hence the name block-floating point, and is typically
determined s.t. the modulus of the largest element is $\in [\frac{1}{b}, 1]$ [48]. Block-floating-point arithmetic is used
in cases where variables cannot be expressed with
sufficient accuracy on native fixed-point hardware.
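The following NumPy sketch illustrates this representation; the helper names (bfp_quantize, bfp_dequantize) and the exact normalization choice are our assumptions for illustration, not taken from [48]:

```python
import numpy as np

def bfp_quantize(block: np.ndarray, wl: int = 8, b: int = 2):
    """Block-floating-point sketch: each element is stored as a WL-bit
    signed integer x, with one scale factor s shared by the whole block,
    so that every value is reconstructed as v = x * b**(-s)."""
    max_int = 2 ** (wl - 1) - 1
    max_abs = float(np.abs(block).max())
    # Shared scale chosen so the largest element nearly fills the WL-bit
    # integer range -- the integer analogue of normalizing the block so the
    # largest modulus falls into [1/b, 1].
    s = int(np.floor(np.log(max_int / max_abs) / np.log(b))) if max_abs > 0 else 0
    x = np.clip(np.round(block * float(b) ** s), -max_int - 1, max_int).astype(np.int32)
    return x, s

def bfp_dequantize(x: np.ndarray, s: int, b: int = 2) -> np.ndarray:
    return x.astype(np.float32) * float(b) ** (-s)

block = np.array([0.7, -3.2, 1.5, 0.01], dtype=np.float32)
x, s = bfp_quantize(block)
print(x, s)                  # integer mantissas and the shared scale
print(bfp_dequantize(x, s))  # approximate reconstruction of the block
```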
Fixed-Point Quantization Fixed-point numbers
have a fixed number of decimal digits assigned and
hence every computation must be framed s.t. the