cloud AI. Future data center workloads will increasingly be populated with AI applications, such as Google
Cloud Machine Learning and Amazon Rekognition. The cost of maintaining such large-scale data
centers is tremendous. Smaller DNN models reduce the computation these workloads require and take
less energy to run, which lowers the electricity bill and the total cost of ownership (TCO) of
running a data center with deep learning workloads.
A byproduct of model compression is that it removes redundancy during training and
prevents overfitting. The compression algorithm automatically selects the optimal set of parameters
as well as their precision, and it additionally regularizes the network by preventing it from fitting
the noise in the training data.
Motivation for Specialized Hardware:
Though model compression reduces the total number of operations deep learning algorithms require,
the irregular computation pattern it introduces hinders efficient acceleration on general-purpose
processors. This irregularity limits the benefits of model compression: we achieved only a 3×
energy efficiency improvement on these machines. The potential saving is much larger: 1–2 orders
of magnitude come from model compression itself, and another two orders of magnitude come from
moving the weights from DRAM into SRAM. The compressed model is small enough to fit
in about 10MB of SRAM (verified with AlexNet, VGG-16, Inception-V3, and ResNet-50, as discussed in
Chapter 4) rather than having to be stored in larger-capacity DRAM.
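To make the DRAM-to-SRAM factor concrete, here is a back-of-the-envelope check using representative
45nm per-access energy figures (in the spirit of Horowitz's ISSCC 2014 survey; the exact numbers are
an assumption, not taken from this chapter). A 32-bit read costs roughly 640 pJ from off-chip DRAM
but only about 5 pJ from a small on-chip SRAM, so

    E_DRAM / E_SRAM ≈ 640 pJ / 5 pJ ≈ 128 ≈ 10^2,

which is where the two orders of magnitude cited above come from.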
Why is there such a big gap between the theoretical and the actual efficiency improvement? The
first reason is an inefficient data path. First, running a compressed model requires traversing a
sparse tensor, which has poor locality on general-purpose processors. Second, model compression
adds a level of indirection to the weights, which requires dedicated buffers for fast access. Third,
the bit width of an aggressively compressed model is not byte aligned, which results in serialization
and de-serialization overhead on general-purpose processors.
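All three issues show up in even the simplest sparse kernel. Below is a minimal sketch of a sparse
matrix-vector product over a standard CSR (compressed sparse row) layout, not any particular format
from this thesis; the function and variable names are illustrative.

    /* Minimal sketch: sparse matrix-vector multiply over a CSR-encoded
       weight matrix. col_idx is the level of indirection noted above,
       and the x[col_idx[j]] loads are the scattered, cache-unfriendly
       accesses with poor locality on a general-purpose processor. */
    #include <stddef.h>

    void spmv_csr(size_t n_rows, const size_t *row_ptr, const int *col_idx,
                  const float *vals, const float *x, float *y)
    {
        for (size_t i = 0; i < n_rows; i++) {
            float acc = 0.0f;
            for (size_t j = row_ptr[i]; j < row_ptr[i + 1]; j++)
                acc += vals[j] * x[col_idx[j]];  /* indirect, irregular read */
            y[i] = acc;
        }
    }

    /* Non-byte-aligned weights: 4-bit values packed two per byte force
       shift/mask work on every access, the (de-)serialization overhead
       noted above. */
    static inline unsigned unpack4(const unsigned char *packed, size_t i)
    {
        unsigned char b = packed[i >> 1];
        return (i & 1) ? (unsigned)(b >> 4) : (unsigned)(b & 0x0Fu);
    }

A dedicated accelerator can hold row_ptr and col_idx in small local buffers and unpack weights in
hardware, whereas a CPU pays for each of these steps with extra instructions and cache misses.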
The second reason for the gap is inefficient control flow. Out-of-order CPUs have complicated
front ends that attempt to extract parallelism from the workload speculatively; this has a costly
consequence (flushing the pipeline) whenever a speculation is wrong. For deep learning workloads,
however, the computation pattern is known to the processor ahead of time. Neither branch prediction
nor caching is needed, and the execution is deterministic, not speculative. Such speculative units
are therefore wasteful for these workloads.
There are alternatives, but they are not perfect. SIMD units amortize instruction overhead
across multiple pieces of data, and SIMT units hide memory latency with a pool of threads. Both
architectures prefer workloads that execute in lockstep, in parallel. However, model compression
leads to irregular computation patterns that are hard to parallelize, causing divergence problems
on these architectures, as the sketch below illustrates.
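As a rough illustration, suppose each lane of a 32-wide SIMD/SIMT group processes one row of the
CSR matrix from the earlier sketch in lockstep. The function below is a hypothetical helper, not
from the text; it counts the lane-iterations wasted because every lane must wait for the group's
longest row.

    /* Hypothetical measure of lockstep waste: lane k of a 32-wide group
       processes row (first_row + k). All lanes iterate until the longest
       row finishes, so useful work is 'total' nonzeros while the group
       pays for 32 * 'longest' iterations. */
    #include <stddef.h>

    size_t lockstep_waste(const size_t *row_ptr, size_t first_row)
    {
        size_t longest = 0, total = 0;
        for (size_t lane = 0; lane < 32; lane++) {
            size_t len = row_ptr[first_row + lane + 1] - row_ptr[first_row + lane];
            total += len;
            if (len > longest)
                longest = len;
        }
        return 32 * longest - total;  /* idle lane-iterations */
    }

Because pruning leaves rows with very different nonzero counts, this waste can dominate, which is
one way the irregularity erodes the theoretical speedup.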
While previously proposed DNN accelerators [19–21] can efficiently handle dense, uncompressed
DNN models, they are unable to handle aggressively compressed models because the computation
patterns differ. There is an enormous waste of computation and memory bandwidth for