AdderNet: Do We Really Need Multiplications in Deep Learning?
Hanting Chen1,2∗, Yunhe Wang2∗, Chunjing Xu2†, Boxin Shi3,4, Chao Xu1, Qi Tian2, Chang Xu5
1 Key Lab of Machine Perception (MOE), Dept. of Machine Intelligence, Peking University.
2 Noah’s Ark Lab, Huawei Technologies.
3 NELVT, Dept. of CS, Peking University.
4 Peng Cheng Laboratory.
5 School of Computer Science, Faculty of Engineering, The University of Sydney.
{htchen, shiboxin}@pku.edu.cn, xuchao@cis.pku.edu.cn, c.xu@sydney.edu.au
{yunhe.wang, xuchunjing, tian.qi1}@huawei.com
∗ Equal contribution. † Corresponding author.
Abstract
Compared with cheap addition operations, multiplications are of much higher computational complexity. The widely-used convolutions in deep neural networks are exactly cross-correlations that measure the similarity between input features and convolution filters, which involves massive multiplications between floating-point values. In this paper, we present adder networks (AdderNets) to trade these massive multiplications in deep neural networks, especially convolutional neural networks (CNNs), for much cheaper additions to reduce computation costs. In AdderNets, we take the ℓ1-norm distance between filters and the input feature as the output response. The influence of this new similarity measure on the optimization of neural networks is thoroughly analyzed. To achieve better performance, we develop a special back-propagation approach for AdderNets by investigating the full-precision gradient. We then propose an adaptive learning rate strategy to enhance the training procedure of AdderNets according to the magnitude of each neuron’s gradient. As a result, the proposed AdderNets can achieve 74.9% Top-1 accuracy and 91.7% Top-5 accuracy using ResNet-50 on the ImageNet dataset without any multiplication in the convolutional layers. The code is publicly available at: https://github.com/huawei-noah/AdderNet.
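To make the idea concrete, the following minimal sketch (PyTorch-style Python; the name adder2d_naive is illustrative and this is not the authors’ optimized kernel) computes an adder-layer output as the negative ℓ1 distance between each input patch and each filter, so the response is produced using only subtractions, absolute values, and additions:

import torch
import torch.nn.functional as F

def adder2d_naive(x, weight, stride=1, padding=0):
    # x: (N, C_in, H, W) input features; weight: (C_out, C_in, K, K) filters.
    # Each output value is the negative l1 distance between one input patch and
    # one filter, so activations and weights are never multiplied together.
    n, c_in, h, w = x.shape
    c_out, _, k, _ = weight.shape
    patches = F.unfold(x, k, padding=padding, stride=stride)   # (N, C_in*K*K, L)
    patches = patches.transpose(1, 2)                          # (N, L, C_in*K*K)
    filters = weight.view(c_out, -1)                           # (C_out, C_in*K*K)
    # Broadcast to (N, L, C_out, C_in*K*K); reduce with abs + sum only.
    out = -(patches.unsqueeze(2) - filters.view(1, 1, c_out, -1)).abs().sum(dim=-1)
    h_out = (h + 2 * padding - k) // stride + 1
    w_out = (w + 2 * padding - k) // stride + 1
    return out.transpose(1, 2).reshape(n, c_out, h_out, w_out)

Such a layer keeps the same filter shapes and output dimensions as a standard convolution, which is what allows it to serve as a drop-in replacement in architectures like ResNet-50.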
1. Introduction
Given the advent of Graphics Processing Units (GPUs), deep convolutional neural networks (CNNs) with billions of floating-point multiplications could receive speed-ups and make important strides in a large variety of computer vision tasks, e.g. image classification [26, 17], object detection [23], segmentation [19], and human face verification [32]. However, the high power consumption of these high-end GPU cards (e.g. 250W+ for a GeForce RTX 2080 Ti) has blocked modern deep learning systems from being deployed on mobile devices, e.g. smart phones, cameras, and watches. Existing GPU cards are far from svelte and cannot be easily mounted on mobile devices. Though the GPU itself only takes up a small part of the card, much other supporting hardware is needed, e.g. memory chips, power circuitry, voltage regulators, and other controller chips. It is therefore necessary to study efficient deep neural networks that can run with affordable computation resources on mobile devices.
Addition, subtraction, multiplication and division are the four most basic operations in mathematics. It is widely known that multiplication is slower than addition, but most of the computations in deep neural networks are multiplications between float-valued weights and float-valued activations during the forward inference. There are thus many papers on how to trade multiplications for additions to speed up deep learning. The seminal work [5] proposed BinaryConnect to force the network weights to be binary (e.g. -1 or 1), so that many multiply-accumulate operations can be replaced by simple accumulations. After that, Hubara et al. [15] proposed BNNs, which binarized not only the weights but also the activations of convolutional neural networks at run-time. Moreover, Rastegari et al. [22] introduced scale factors to approximate convolutions using binary operations
and outperformed [5, 15] by large margins. Zhou et al. [39]
utilized low bit-width gradients to accelerate the training of binarized networks. Cai et al. [4] proposed a half-wave Gaussian quantizer for forward approximation, which achieved performance much closer to that of full-precision networks.
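As a rough illustration of this line of work (a sketch of the BinaryConnect-style idea, not the exact procedure of [5]; the name binary_mac is hypothetical), binarizing the weights collapses each multiply-accumulate into a sign-controlled accumulation:

import numpy as np

def binary_mac(x, w_real):
    # Binarize real-valued weights to {-1, +1} via their sign.
    w_bin = np.where(w_real >= 0, 1.0, -1.0)
    acc = 0.0
    for xi, wb in zip(x, w_bin):
        # x_i * w_i becomes either +x_i or -x_i: accumulation only, no multiplication.
        if wb > 0:
            acc += xi
        else:
            acc -= xi
    return acc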
Though binarizing filters of deep neural networks sig-
nificantly reduces the computation cost, the original recog-
nition accuracy often cannot be preserved. In addition,
the training procedure of binary networks is not stable and
usually requires a slower convergence speed with a small