A Survey of Model Compression and Acceleration
for Deep Neural Networks
Yu Cheng, Duo Wang, Pan Zhou, Member, IEEE, and Tao Zhang, Senior Member, IEEE
Yu Cheng is a Researcher with Microsoft AI & Research, One Microsoft Way, Redmond, WA 98052, USA.
Duo Wang and Tao Zhang are with the Department of Automation, Tsinghua University, Beijing 100084, China.
Pan Zhou is with the School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan 430074, China.
Abstract—Deep convolutional neural networks (CNNs) have
recently achieved great success in many visual recognition tasks.
However, existing deep neural network models are computation-
ally expensive and memory intensive, hindering their deployment
in devices with low memory resources or in applications with
strict latency requirements. Therefore, a natural thought is to
perform model compression and acceleration in deep networks
without significantly decreasing the model performance. During
the past few years, tremendous progress has been made in
this area. In this paper, we survey the recent advanced techniques developed for compacting and accelerating CNN models.
These techniques are roughly categorized into four schemes:
parameter pruning and sharing, low-rank factorization, trans-
ferred/compact convolutional filters, and knowledge distillation.
Methods of parameter pruning and sharing are described first, and the other techniques are introduced afterwards. For each scheme, we provide insightful analysis regarding the performance, related applications, advantages, and drawbacks. We then go through some very recent successful methods, for example, dynamic capacity networks and stochastic depth networks. After that, we survey the evaluation metrics, the main datasets used for evaluating model performance, and recent benchmarking efforts. Finally, we conclude the paper and discuss remaining challenges and possible directions on this topic.
Index Terms—Deep Learning, Convolutional Neural Networks, Model Compression and Acceleration.
I. INTRODUCTION
In recent years, deep neural networks have received much attention, been applied to a wide range of applications, and achieved dramatic accuracy improvements in many tasks.
These works rely on deep networks with millions or even
billions of parameters, and the availability of GPUs with
very high computation capability plays a key role in their
success. For example, the work by Krizhevsky et al. [1]
achieved breakthrough results in the 2012 ImageNet Challenge
using a network containing 60 million parameters with five
convolutional layers and three fully-connected layers. It usually takes two to three days to train the whole model on the ImageNet dataset with an NVIDIA K40 machine. Another
example is face verification on the Labeled Faces in the Wild (LFW) dataset, where the top results were obtained with networks containing hundreds of millions of parameters, using a mix of convolutional, locally-connected, and fully-connected layers [2], [3]. It is also very time-consuming to train such a model
to get reasonable performance. In architectures that rely only
on fully-connected layers, the number of parameters can grow
to billions [4].
As larger neural networks with more layers and nodes
are considered, reducing their storage and computational cost
becomes critical, especially for some real-time applications
such as online learning and incremental learning. In addi-
tion, recent years witnessed significant progress in virtual
reality, augmented reality, and smart wearable devices, cre-
ating unprecedented opportunities for researchers to tackle
fundamental challenges in deploying deep learning systems to
portable devices with limited resources (e.g. memory, CPU,
energy, bandwidth). Efficient deep learning methods can have
significant impacts on distributed systems, embedded devices,
and FPGAs for artificial intelligence. For example, ResNet-50 [5], with 50 convolutional layers, needs over 95MB of memory for storage and over 3.8 billion floating-point multiplications when processing a single image. After discarding some redundant weights, the network still works as usual but saves more than 75% of the parameters and 50% of the computation time. For devices such as cell phones and FPGAs with only a few megabytes of resources, compacting the models deployed on them is equally important.
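As a quick sanity check on the storage figure, the short Python sketch below reproduces it from the commonly cited parameter count of roughly 25.6 million for ResNet-50 (the exact count is an assumption here), stored as 32-bit floats:

    # Back-of-the-envelope storage estimate for ResNet-50.
    # Assumes ~25.6M parameters (commonly cited figure) held as 32-bit floats.
    num_params = 25.6e6
    bytes_per_param = 4
    storage_mb = num_params * bytes_per_param / 1e6
    print(f"Approximate storage: {storage_mb:.0f} MB")  # ~102 MB, i.e. over 95MB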
Achieving these goals calls for joint solutions from many
disciplines, including but not limited to machine learning, op-
timization, computer architecture, data compression, indexing,
and hardware design. In this paper, we review recent works
on compressing and accelerating deep neural networks, which have attracted a great deal of attention from the deep learning community and have achieved substantial progress in recent years.
We classify these approaches into four categories: pa-
rameter pruning and sharing, low-rank factorization, trans-
ferred/compact convolutional filters, and knowledge distil-
lation. The parameter pruning and sharing based methods
explore the redundancy in the model parameters and try to
remove the redundant and uncritical ones. Low-rank factor-
ization based techniques use matrix/tensor decomposition to
estimate the informative parameters of the deep CNNs. The
approaches based on transferred/compact convolutional filters design specially structured convolutional filters to reduce the parameter space and save storage and computation. The knowledge
distillation methods learn a distilled model and train a more
compact neural network to reproduce the output of a larger
network.
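To make the pruning idea concrete, here is a minimal NumPy sketch (an illustration of simple magnitude-based pruning, not any specific published method) that zeroes out the smallest-magnitude weights of a hypothetical dense layer:

    import numpy as np

    def magnitude_prune(weights, sparsity):
        """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
        threshold = np.quantile(np.abs(weights), sparsity)
        mask = np.abs(weights) >= threshold
        return weights * mask, mask

    rng = np.random.default_rng(0)
    W = rng.normal(size=(256, 512))            # hypothetical dense-layer weights
    W_pruned, mask = magnitude_prune(W, 0.75)  # remove 75% of the weights
    print(f"Remaining parameters: {mask.mean():.0%}")

In practice, the pruned network is typically fine-tuned afterwards to recover any lost accuracy.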
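The low-rank idea can be sketched just as briefly: a truncated SVD splits a dense weight matrix into two thin factors, trading a small approximation error for a large reduction in parameters (the layer shape and retained rank below are arbitrary assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(size=(256, 512))    # hypothetical dense-layer weights
    rank = 32                          # retained rank (an assumption)

    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]         # (256, rank): columns scaled by singular values
    B = Vt[:rank, :]                   # (rank, 512)

    # The product Wx is approximated by A(Bx); compare parameter counts.
    print(f"Compression: {W.size / (A.size + B.size):.1f}x")
    print(f"Relative error: {np.linalg.norm(W - A @ B) / np.linalg.norm(W):.2f}")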
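For compact convolutional filters, a simple parameter count already shows the effect: replacing a standard convolution with a depthwise-separable one, as in MobileNet-style compact designs, shrinks the filter parameter space by roughly an order of magnitude (the channel and kernel sizes below are assumed for illustration):

    # Parameter count: standard 3x3 convolution vs. depthwise-separable convolution.
    c_in, c_out, k = 256, 256, 3                # assumed layer configuration
    standard = c_in * c_out * k * k             # full 3x3 filters
    separable = c_in * k * k + c_in * c_out     # depthwise 3x3 + pointwise 1x1
    print(f"Standard:  {standard:,} parameters")
    print(f"Separable: {separable:,} parameters ({standard / separable:.1f}x fewer)")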
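Finally, the core of knowledge distillation is a loss that pushes the student toward the teacher's temperature-softened output distribution, in the style of Hinton et al.; below is a minimal NumPy sketch with placeholder logits:

    import numpy as np

    def softmax(z, T=1.0):
        z = z / T
        z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    def distillation_loss(student_logits, teacher_logits, T=4.0):
        """Cross-entropy between softened teacher and student distributions."""
        p_t = softmax(teacher_logits, T)
        p_s = softmax(student_logits, T)
        return -np.sum(p_t * np.log(p_s + 1e-12), axis=-1).mean()

    rng = np.random.default_rng(0)
    teacher = rng.normal(size=(8, 10))   # placeholder logits: batch of 8, 10 classes
    student = rng.normal(size=(8, 10))
    print(f"Distillation loss: {distillation_loss(student, teacher):.3f}")

In full training, this term is usually combined with the standard cross-entropy on the true labels.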
In Table I, we briefly summarize these four types of
methods. Generally, the parameter pruning and sharing, low-rank factorization, and knowledge distillation approaches can