Towards Effective Low-bitwidth Convolutional Neural Networks

Bohan Zhuang¹,², Chunhua Shen¹,²,∗, Mingkui Tan³, Lingqiao Liu¹, Ian Reid¹,²
¹The University of Adelaide, Australia   ²Australian Centre for Robotic Vision   ³South China University of Technology, China
{bohan.zhuang,chunhua.shen,lingqiao.liu,ian.reid}@adelaide.edu.au, mingkuitan@scut.edu.cn
∗C. Shen is the corresponding author.
Abstract
This paper tackles the problem of training a deep con-
volutional neural network with both low-precision weights
and low-bitwidth activations. Optimizing a low-precision
network is very challenging since the training process can
easily get trapped in a poor local minimum, which results in
substantial accuracy loss. To mitigate this problem, we pro-
pose three simple-yet-effective approaches to improve the
network training. First, we propose to use a two-stage
optimization strategy to progressively find good local min-
ima. Specifically, we propose to first optimize a net with
quantized weights and then quantized activations. This is
in contrast to the traditional methods which optimize them
simultaneously. Second, following a similar spirit of the
first method, we propose another progressive optimization
approach which progressively decreases the bit-width from
high-precision to low-precision during the course of train-
ing. Third, we adopt a novel learning scheme to jointly train
a full-precision model alongside the low-precision one. By
doing so, the full-precision model provides hints to guide
the low-precision model training. Extensive experiments
on various datasets (i.e., CIFAR-100 and ImageNet) show
the effectiveness of the proposed methods. Notably, using
our methods to train a 4-bit precision network leads to no
performance decrease in comparison with its full-precision
counterpart on standard network architectures (i.e., AlexNet
and ResNet-50).
1. Introduction
The state-of-the-art deep neural networks [9, 17, 26] usu-
ally involve millions of parameters and need billions of
FLOPs during computation. Such memory and computational
costs can be unaffordable for mobile hardware devices, especially
when deploying deep neural networks on chips.
To improve computational and memory efficiency, various
solutions have been proposed, including pruning network
weights [7, 8], low-rank approximation of weights [16, 34],
and training a low-bit-precision network [4, 36–38].
In this work, we follow the idea of training a low-precision
network and our focus is to improve the training process
of such a network. Note that in the literature, many works
adopt this idea but only attempt to quantize the weights of
a network while keeping the activations in 32-bit floating
point [4, 19, 36, 38]. Although this treatment leads to a
smaller performance decrease compared with the full-precision
counterpart, it still requires substantial computational
resources to handle the full-precision activations.
Thus, our work targets the problem of training a network with
both low-bit quantized weights and activations.
The solutions proposed in this paper contain three com-
ponents. They can be applied independently or jointly. The
first method is to adopt a two-stage training process. In the
first stage, only the weights of the network are quantized. After
obtaining a sufficiently good solution in the first stage, the
activations of the network are additionally quantized to low
precision and the network is trained again. Essentially, this
progressive approach first solves a related sub-problem, i.e.,
training a network with only low-bit weights, and the solution
of the sub-problem provides a good initialization for our
target problem.
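To make the schedule concrete, the following is a minimal sketch of how such a two-stage procedure could be organized; the DoReFa-style quantizer, the QuantConv2d layer, the epoch budgets and the train() helper are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of the two-stage schedule, assuming a DoReFa-style uniform
# quantizer with a straight-through estimator (illustrative, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

def quantize(x, k):
    """Uniformly quantize a tensor in [0, 1] to k bits; the gradient
    passes through unchanged (straight-through estimator)."""
    n = float(2 ** k - 1)
    xq = torch.round(x * n) / n
    return x + (xq - x).detach()

class QuantConv2d(nn.Conv2d):
    """Convolution whose weights (and, optionally, inputs) are quantized on the fly."""
    def __init__(self, *args, w_bits=4, a_bits=4, quantize_acts=False, **kwargs):
        super().__init__(*args, **kwargs)
        self.w_bits, self.a_bits = w_bits, a_bits
        self.quantize_acts = quantize_acts  # stage 1: False, stage 2: True

    def forward(self, x):
        # map weights to [0, 1], quantize, then map back to [-1, 1]
        w = torch.tanh(self.weight)
        w = w / (2 * w.abs().max()) + 0.5
        w = 2 * quantize(w, self.w_bits) - 1
        if self.quantize_acts:
            x = quantize(torch.clamp(x, 0, 1), self.a_bits)
        return F.conv2d(x, w, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

def train(model, loader, epochs):
    """Placeholder for an ordinary SGD training loop."""
    ...

def two_stage_training(model, loader):
    # Stage 1: quantized weights, full-precision activations.
    for m in model.modules():
        if isinstance(m, QuantConv2d):
            m.quantize_acts = False
    train(model, loader, epochs=30)
    # Stage 2: switch activation quantization on and fine-tune the stage-1 solution.
    for m in model.modules():
        if isinstance(m, QuantConv2d):
            m.quantize_acts = True
    train(model, loader, epochs=30)
```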
Following a similar idea, we propose our second method, which
performs progressive training on the bit-width aspect of the
network. Specifically, we incrementally train a series of
networks with the quantization bit-width (precision) gradually
decreased from full precision to the target precision.
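This precision schedule can be sketched in the same vein, reusing the hypothetical QuantConv2d and train() helpers from the sketch above; the concrete bit-width sequence and per-stage epoch budget are illustrative assumptions.

```python
# Sketch of the progressive bit-width schedule; the model is assumed to start
# from a full-precision pretrained state, and every stage warm-starts from the
# weights learned at the previous (higher) precision.
def progressive_precision_training(model, loader, schedule=(16, 8, 4)):
    for bits in schedule:
        for m in model.modules():
            if isinstance(m, QuantConv2d):
                m.w_bits = m.a_bits = bits
                m.quantize_acts = True
        train(model, loader, epochs=10)
```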
The third method is inspired by the recent progress of mutual
learning [35] and information distillation [1, 11, 22, 24, 32]. The
basic idea of those works is to train a target network along-
side another guidance network. For example, the works in
[1, 11, 22, 24, 32] propose to train a small student network
to mimic a deeper or wider teacher network. They add an
additional regularizer that minimizes the difference between
the student’s and the teacher’s posterior probabilities [11] or
intermediate feature representations [1, 24].
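As an illustration, a posterior-matching regularizer in the spirit of [11] can be written as a weighted combination of the usual cross-entropy loss and a KL term between the softened teacher and student outputs; the temperature T and mixing weight alpha below are illustrative choices, not values taken from these works.

```python
# Sketch of a guidance regularizer in the spirit of the posterior matching
# in [11]; T (temperature) and alpha (mixing weight) are illustrative.
import torch.nn.functional as F

def guided_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(teacher_logits / T, dim=1),
                  reduction="batchmean") * (T * T)
    return (1 - alpha) * ce + alpha * kd
```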
It is observed that by using the guidance of the teacher model,
better performance can be obtained with the student model than directly