A basic requirement of our quantization scheme is that it permits efficient implementation of all arithmetic using only integer arithmetic operations on the quantized values (we eschew implementations requiring lookup tables because these tend to perform poorly compared to pure arithmetic on SIMD hardware). This is equivalent to requiring that the quantization scheme be an affine mapping of integers q to real numbers r, i.e. of the form

r = S(q - Z)    (1)
for some constants S and Z. Equation (1) is our quantization scheme and the constants S and Z are our quantization parameters. Our quantization scheme uses a single set of quantization parameters for all values within each activations array and within each weights array; separate arrays use separate quantization parameters.
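As a concrete illustration of the affine mapping of Equation (1), the two directions of the mapping can be sketched as below. The helper names, the round-to-nearest choice, and the clamping to the 8-bit range are our own illustrative choices, not part of the scheme itself.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Dequantize: recover the real value represented by q, per Equation (1).
float dequantize(std::uint8_t q, float S, std::uint8_t Z) {
  return S * (static_cast<int>(q) - static_cast<int>(Z));
}

// Quantize: invert Equation (1), rounding to nearest and clamping to the
// 8-bit range. (Illustrative helper; the paper does not prescribe this.)
std::uint8_t quantize(float r, float S, std::uint8_t Z) {
  int q = static_cast<int>(std::lround(r / S)) + static_cast<int>(Z);
  return static_cast<std::uint8_t>(std::min(255, std::max(0, q)));
}
```

Note that with this mapping the real value r = 0 always quantizes exactly to q = Z, a property the zero-point discussion below relies on.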
For 8-bit quantization, q is quantized as an 8-bit integer (for B-bit quantization, q is quantized as a B-bit integer). Some arrays, typically bias vectors, are quantized as 32-bit integers; see Section 2.4.
The constant S (for “scale”) is an arbitrary positive real number. It is typically represented in software as a floating-point quantity, like the real values r. Section 2.2 describes methods for avoiding the representation of such floating-point quantities in the inference workload.
The constant Z (for “zero-point”) is of the same type as quantized values q, and is in fact the quantized value q corresponding to the real value 0. This allows us to automatically meet the requirement that the real value r = 0 be exactly representable by a quantized value. The motivation for this requirement is that efficient implementation of neural network operators often requires zero-padding of arrays around boundaries.
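The zero-padding point can be made concrete as follows: because r = S(q - Z), padding a quantized array with the value Z is exactly equivalent to zero-padding the underlying real array. The helper below is our own illustrative sketch.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Pad a row of quantized values with the zero-point Z on both sides.
// Since r = S(q - Z), each padded entry represents exactly r = 0.
std::vector<std::uint8_t> PadWithZeroPoint(const std::vector<std::uint8_t>& row,
                                           std::uint8_t Z, int pad) {
  std::vector<std::uint8_t> out(row.size() + 2 * pad, Z);
  std::copy(row.begin(), row.end(), out.begin() + pad);
  return out;
}
```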
Our discussion so far is summarized in the following quantized buffer data structure³, with one instance of such a buffer existing for each activations array and weights array in a neural network. We use C++ syntax because it allows the unambiguous conveyance of types.
template<typename QType>  // e.g. QType=uint8
struct QuantizedBuffer {
  vector<QType> q;  // the quantized values
  float S;          // the scale
  QType Z;          // the zero-point
};
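As a self-contained usage sketch, the struct above can be exercised as follows (std::vector is spelled out here, and the ValueAt helper, which simply applies Equation (1) entry-wise, is our own):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

template <typename QType>  // e.g. QType=uint8
struct QuantizedBuffer {
  std::vector<QType> q;  // the quantized values
  float S;               // the scale
  QType Z;               // the zero-point
};

// Entry-wise application of Equation (1): r = S(q - Z).
float ValueAt(const QuantizedBuffer<std::uint8_t>& buf, std::size_t i) {
  return buf.S * (static_cast<int>(buf.q[i]) - static_cast<int>(buf.Z));
}
```

For example, with S = 0.5 and Z = 10, the stored value 10 represents exactly r = 0, and 14 represents r = 2.0.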
2.2. Integer-arithmetic-only matrix multiplication
We now turn to the question of how to perform inference using only integer arithmetic, i.e. how to use Equation (1) to translate real-numbers computation into quantized-values computation, and how the latter can be designed to involve only integer arithmetic even though the scale values S are not integers.

³ The actual data structures in the TensorFlow Lite [5] Converter are QuantizationParams and Array in this header file. As we discuss in the next subsection, this data structure, which still contains a floating-point quantity, does not appear in the actual quantized on-device inference code.
Consider the multiplication of two square N × N matrices of real numbers, r_1 and r_2, with their product represented by r_3 = r_1 r_2. We denote the entries of each of these matrices r_\alpha (\alpha = 1, 2 or 3) as r_\alpha^{(i,j)} for 1 \le i, j \le N, and the quantization parameters with which they are quantized as (S_\alpha, Z_\alpha). We denote the quantized entries by q_\alpha^{(i,j)}.
Equation (1) then becomes:

r_\alpha^{(i,j)} = S_\alpha (q_\alpha^{(i,j)} - Z_\alpha).    (2)
From the definition of matrix multiplication, we have

S_3 (q_3^{(i,k)} - Z_3) = \sum_{j=1}^{N} S_1 (q_1^{(i,j)} - Z_1) S_2 (q_2^{(j,k)} - Z_2),    (3)
which can be rewritten as

q_3^{(i,k)} = Z_3 + M \sum_{j=1}^{N} (q_1^{(i,j)} - Z_1)(q_2^{(j,k)} - Z_2),    (4)
where the multiplier M is defined as

M := \frac{S_1 S_2}{S_3}.    (5)
In Equation (4), the only non-integer is the multiplier M. As a constant depending only on the quantization scales S_1, S_2, S_3, it can be computed offline. We empirically find it to always be in the interval (0, 1), and can therefore express it in the normalized form

M = 2^{-n} M_0    (6)
where M_0 is in the interval [0.5, 1) and n is a non-negative integer. The normalized multiplier M_0 now lends itself well to being expressed as a fixed-point multiplier (e.g. int16 or int32 depending on hardware capability). For example, if int32 is used, the integer representing M_0 is the int32 value nearest to 2^{31} M_0. Since M_0 \ge 0.5, this value is always at least 2^{30} and will therefore always have at least 30 bits of relative accuracy. Multiplication by M_0 can thus be implemented as a fixed-point multiplication⁴. Meanwhile, multiplication by 2^{-n} can be implemented with an efficient bit-shift, albeit one that needs to have correct round-to-nearest behavior, an issue that we return to in Appendix B.
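The decomposition M = 2^{-n} M_0 and the subsequent integer-only multiply can be sketched as below. This mirrors the approach described above, though the helper names and the use of std::frexp are our own, and the rounded right shift here uses simple round-half-up for brevity, whereas Appendix B treats round-to-nearest in detail.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Decompose M in (0, 1) as M = 2^{-n} * M0 with M0 in [0.5, 1), and
// represent M0 as the nearest int32 to 2^31 * M0.
void QuantizeMultiplier(double M, std::int32_t* M0_fixed, int* n) {
  assert(M > 0.0 && M < 1.0);
  int exponent = 0;
  const double M0 = std::frexp(M, &exponent);  // M = M0 * 2^exponent
  *n = -exponent;                              // so M = 2^{-n} * M0, n >= 0
  std::int64_t fixed = std::llround(M0 * (1ll << 31));
  if (fixed == (1ll << 31)) {  // M0 rounded up to 1.0: renormalize
    fixed /= 2;
    --*n;
  }
  *M0_fixed = static_cast<std::int32_t>(fixed);
}

// Multiply an int32 accumulator by M = 2^{-n} * M0 using only integer
// arithmetic: a rounded high multiply by M0_fixed / 2^31, then a rounded
// right shift by n (round half up; see Appendix B for exact rounding).
std::int32_t MultiplyByM(std::int32_t acc, std::int32_t M0_fixed, int n) {
  std::int64_t prod = static_cast<std::int64_t>(acc) * M0_fixed;
  std::int32_t high = static_cast<std::int32_t>((prod + (1ll << 30)) >> 31);
  if (n == 0) return high;
  return (high + (std::int32_t{1} << (n - 1))) >> n;
}
```

For instance, M = 0.25 decomposes as n = 1, M_0 = 0.5, so M0_fixed = 2^30, and applying it to an accumulator of 100 yields 25, as expected.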
2.3. Efficient handling of zero-points
In order to efficiently implement the evaluation of Equation (4) without having to perform 2N^3 subtractions and

⁴ The computation discussed in this section is implemented in TensorFlow Lite [5] reference code for a fully-connected layer.