A Survey of Model Compression and Acceleration
for Deep Neural Networks
Yu Cheng, Duo Wang, Pan Zhou, Member, IEEE, and Tao Zhang, Senior Member, IEEE
Yu Cheng is a Researcher with Microsoft AI & Research, One Microsoft Way, Redmond, WA 98052, USA.
Duo Wang and Tao Zhang are with the Department of Automation, Tsinghua University, Beijing 100084, China.
Pan Zhou is with the School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan 430074, China.
Abstract—Deep convolutional neural networks (CNNs) have
recently achieved great success in many visual recognition tasks.
However, existing deep neural network models are computation-
ally expensive and memory intensive, hindering their deployment
in devices with low memory resources or in applications with
strict latency requirements. Therefore, a natural thought is to
perform model compression and acceleration in deep networks
without significantly decreasing the model performance. During
the past few years, tremendous progress has been made in
this area. In this paper, we survey the recent advanced techniques developed for compacting and accelerating CNN models.
These techniques are roughly categorized into four schemes:
parameter pruning and sharing, low-rank factorization, trans-
ferred/compact convolutional filters, and knowledge distillation.
Methods of parameter pruning and sharing are described first, and the other techniques are introduced afterwards. For each scheme, we provide insightful analysis regarding the performance, related applications, advantages, and drawbacks. We then go through some very recent successful methods, for example, dynamic capacity networks and stochastic depth networks. After that, we survey the evaluation metrics, the main datasets used for evaluating model performance, and recent benchmarking efforts. Finally, we conclude the paper and discuss remaining challenges and possible directions on this topic.
Index Terms—Deep Learning, Convolutional Neural Networks, Model Compression and Acceleration.
I. INTRODUCTION
In recent years, deep neural networks have received much attention, been applied to a wide range of applications, and achieved dramatic accuracy improvements in many tasks.
These works rely on deep networks with millions or even
billions of parameters, and the availability of GPUs with
very high computation capability plays a key role in their
success. For example, the work by Krizhevsky et al. [1]
achieved breakthrough results in the 2012 ImageNet Challenge
using a network containing 60 million parameters with five
convolutional layers and three fully-connected layers. It usually takes two to three days to train the whole model on the ImageNet dataset with an NVIDIA K40 machine. Another
example is face verification on the Labeled Faces in the Wild (LFW) dataset, where the top results were obtained with networks containing hundreds of millions of parameters, using a mix of convolutional, locally-connected, and fully-connected layers [2], [3]. It is also very time-consuming to train such a model
to get reasonable performance. In architectures that rely only
on fully-connected layers, the number of parameters can grow
to billions [4].
As larger neural networks with more layers and nodes
are considered, reducing their storage and computational cost
becomes critical, especially for some real-time applications
such as online learning and incremental learning. In addi-
tion, recent years witnessed significant progress in virtual
reality, augmented reality, and smart wearable devices, cre-
ating unprecedented opportunities for researchers to tackle
fundamental challenges in deploying deep learning systems to
portable devices with limited resources (e.g. memory, CPU,
energy, bandwidth). Efficient deep learning methods can have
significant impacts on distributed systems, embedded devices,
and FPGAs for artificial intelligence. For example, ResNet-50 [5], with 50 convolutional layers, needs over 95MB of memory for storage and over 3.8 billion floating-point multiplications when processing a single image. After discarding some redundant weights, the network still works as usual but saves more than 75% of the parameters and 50% of the computation time. For devices such as cell phones and FPGAs with only a few megabytes of resources, compacting the models deployed on them is equally important.
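As a quick sanity check on the storage figure, the short Python sketch below reproduces it from the commonly cited parameter count of roughly 25.6 million for ResNet-50 (the exact count is an assumption here), stored as 32-bit floats:

    # Back-of-the-envelope storage estimate for ResNet-50.
    # Assumes ~25.6M parameters (commonly cited figure) held as 32-bit floats.
    num_params = 25.6e6
    bytes_per_param = 4
    storage_mb = num_params * bytes_per_param / 1e6
    print(f"Approximate storage: {storage_mb:.0f} MB")  # ~102 MB, i.e. over 95MB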
Achieving these goals calls for joint solutions from many
disciplines, including but not limited to machine learning, op-
timization, computer architecture, data compression, indexing,
and hardware design. In this paper, we review recent works
on compressing and accelerating deep neural networks, which have attracted a great deal of attention from the deep learning community and have achieved substantial progress in recent years.
We classify these approaches into four categories: pa-
rameter pruning and sharing, low-rank factorization, trans-
ferred/compact convolutional filters, and knowledge distil-
lation. The parameter pruning and sharing based methods
explore the redundancy in the model parameters and try to
remove the redundant and uncritical ones. Low-rank factor-
ization based techniques use matrix/tensor decomposition to
estimate the informative parameters of the deep CNNs. The
approaches based on transferred/compact convolutional filters design specially structured convolutional filters to reduce the parameter space and save storage and computation. The knowledge
distillation methods learn a distilled model and train a more
compact neural network to reproduce the output of a larger
network.
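To make the pruning idea concrete, here is a minimal NumPy sketch (an illustration of simple magnitude-based pruning, not any specific published method) that zeroes out the smallest-magnitude weights of a hypothetical dense layer:

    import numpy as np

    def magnitude_prune(weights, sparsity):
        """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
        threshold = np.quantile(np.abs(weights), sparsity)
        mask = np.abs(weights) >= threshold
        return weights * mask, mask

    rng = np.random.default_rng(0)
    W = rng.normal(size=(256, 512))            # hypothetical dense-layer weights
    W_pruned, mask = magnitude_prune(W, 0.75)  # remove 75% of the weights
    print(f"Remaining parameters: {mask.mean():.0%}")

In practice, the pruned network is typically fine-tuned afterwards to recover any lost accuracy.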
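The low-rank idea can be sketched just as briefly: a truncated SVD splits a dense weight matrix into two thin factors, trading a small approximation error for a large reduction in parameters (the layer shape and retained rank below are arbitrary assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(size=(256, 512))    # hypothetical dense-layer weights
    rank = 32                          # retained rank (an assumption)

    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]         # (256, rank): columns scaled by singular values
    B = Vt[:rank, :]                   # (rank, 512)

    # The product Wx is approximated by A(Bx); compare parameter counts.
    print(f"Compression: {W.size / (A.size + B.size):.1f}x")
    print(f"Relative error: {np.linalg.norm(W - A @ B) / np.linalg.norm(W):.2f}")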
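For compact convolutional filters, a simple parameter count already shows the effect: replacing a standard convolution with a depthwise-separable one, as in MobileNet-style compact designs, shrinks the filter parameter space by roughly an order of magnitude (the channel and kernel sizes below are assumed for illustration):

    # Parameter count: standard 3x3 convolution vs. depthwise-separable convolution.
    c_in, c_out, k = 256, 256, 3                # assumed layer configuration
    standard = c_in * c_out * k * k             # full 3x3 filters
    separable = c_in * k * k + c_in * c_out     # depthwise 3x3 + pointwise 1x1
    print(f"Standard:  {standard:,} parameters")
    print(f"Separable: {separable:,} parameters ({standard / separable:.1f}x fewer)")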
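Finally, the core of knowledge distillation is a loss that pushes the student toward the teacher's temperature-softened output distribution, in the style of Hinton et al.; below is a minimal NumPy sketch with placeholder logits:

    import numpy as np

    def softmax(z, T=1.0):
        z = z / T
        z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    def distillation_loss(student_logits, teacher_logits, T=4.0):
        """Cross-entropy between softened teacher and student distributions."""
        p_t = softmax(teacher_logits, T)
        p_s = softmax(student_logits, T)
        return -np.sum(p_t * np.log(p_s + 1e-12), axis=-1).mean()

    rng = np.random.default_rng(0)
    teacher = rng.normal(size=(8, 10))   # placeholder logits: batch of 8, 10 classes
    student = rng.normal(size=(8, 10))
    print(f"Distillation loss: {distillation_loss(student, teacher):.3f}")

In full training, this term is usually combined with the standard cross-entropy on the true labels.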
In Table I, we briefly summarize these four types of
methods. Generally, the parameter pruning and sharing, low-rank factorization, and knowledge distillation approaches can