ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA

Song Han^{1,2}, Junlong Kang^2, Huizi Mao^{1,2}, Yiming Hu^{2,3}, Xin Li^2, Yubin Li^2, Dongliang Xie^2, Hong Luo^2, Song Yao^2, Yu Wang^{2,3}, Huazhong Yang^3 and William J. Dally^{1,4}

^1 Stanford University, ^2 DeePhi Tech, ^3 Tsinghua University, ^4 NVIDIA
^1 {songhan,dally}@stanford.edu, ^2 song.yao@deephi.tech, ^3 yu-wang@mail.tsinghua.edu.cn
ABSTRACT
Long Short-Term Memory (LSTM) is widely used in speech recognition. In order to achieve higher prediction accuracy, machine learning scientists have built increasingly larger models. Such large models are both computation- and memory-intensive. Deploying such bulky models results in high power consumption and a high total cost of ownership (TCO) for a data center.
To speed up the prediction and make it energy efficient, we first propose a load-balance-aware pruning method that can compress the LSTM model size by 20× (10× from pruning and 2× from quantization) with negligible loss of prediction accuracy; the load balancing also ensures high hardware utilization. Next, we propose a scheduler that encodes and partitions the compressed model across multiple processing elements (PEs) for parallelism and schedules the complicated LSTM data flow. Finally, we design a hardware architecture named ESE that works directly on the sparse LSTM model.
Implemented on a Xilinx XCKU060 FPGA running at 200 MHz, ESE achieves 282 GOPS working directly on the sparse LSTM network, corresponding to 2.52 TOPS on the dense one, and processes a full LSTM for speech recognition with a power dissipation of 41 W. Evaluated on the LSTM speech recognition benchmark, ESE is 43× and 3× faster than the Core i7-5930K CPU and Pascal Titan X GPU implementations, and it achieves 40× and 11.5× higher energy efficiency than the CPU and GPU, respectively.
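To make the load-balance-aware pruning idea concrete, the sketch below (NumPy, with an interleaved row-to-PE assignment and illustrative names that are assumptions rather than the paper's implementation) prunes each PE's share of a weight matrix to the same target density, so no PE becomes a straggler when the sparse matrix is processed in parallel:

```python
import numpy as np

def load_balance_aware_prune(weights, density=0.1, num_pes=32):
    """Keep the same number of nonzeros in every PE's row group
    (illustrative sketch; parameters and row assignment are assumed)."""
    pruned = weights.copy()
    for pe in range(num_pes):
        rows = np.arange(pe, pruned.shape[0], num_pes)   # interleaved rows for this PE
        block = pruned[rows]                             # fancy indexing returns a copy
        keep = max(1, int(round(block.size * density)))  # nonzeros this PE keeps
        threshold = np.sort(np.abs(block), axis=None)[-keep]
        block[np.abs(block) < threshold] = 0.0           # drop smallest-magnitude weights
        pruned[rows] = block
    return pruned

# Every PE group ends up with (roughly) the same nonzero count.
W = np.random.randn(1024, 512).astype(np.float32)
W_sparse = load_balance_aware_prune(W, density=0.1, num_pes=32)
print("overall density:", np.count_nonzero(W_sparse) / W_sparse.size)
```

Because every PE keeps the same number of nonzeros, the per-PE work is balanced, which is the high hardware utilization the pruning method is designed to preserve.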
Keywords
Deep Learning; Speech Recognition; Model Compression;
Hardware Acceleration; Software-Hardware Co-Design; FPGA
1. INTRODUCTION
Deep neural networks are widely used for speech recognition [6, 13]. Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) are two popular types of recurrent neural networks (RNNs) used for speech recognition. In this work, we evaluate the more complex of the two: the LSTM [14].
Figure 1: Proposed efficient DNN deployment flow: model compression + accelerated inference.
Figure 2: ESE optimizes LSTM computation across the algorithm, software and hardware stack: load-balance-aware pruning and weight-sharing compression at the algorithm level (20× smaller, similar accuracy), scheduling and compiling into a relative-indexed blocked CSC format at the software level, and sparsity- and load-balance-aware FPGA acceleration at the hardware level (3× speedup, 11.5× lower energy).
A similar methodology can easily be applied to other types of recurrent neural networks.
Despite its high prediction accuracy, LSTM is hard to deploy because of its high computational complexity and large memory footprint, which lead to high power consumption. A memory reference consumes more than two orders of magnitude more energy than an ALU operation, so we focus on optimizing the memory footprint.
To reduce the memory footprint, we design a novel method to optimize across the algorithm, software and hardware stack: we first optimize the algorithm by compressing the LSTM model to 5% of its original size (10% density and 2× narrower weights) while retaining similar accuracy; then we develop a software mapping strategy to represent the compressed model in a hardware-friendly way; finally we design specialized hardware that works directly on the compressed LSTM model.
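As a rough illustration of the software mapping step, the following sketch encodes a pruned weight matrix column by column with relative (gap) row indices, in the spirit of the relative-indexed CSC format mentioned in Fig. 2; the 4-bit index width and the zero-padding convention here are assumptions for illustration, not the exact on-chip layout:

```python
import numpy as np

def encode_relative_csc(weights, index_bits=4):
    """Column-major (CSC-style) encoding with relative row indices.
    A gap wider than 2**index_bits - 1 is bridged with zero-valued
    padding entries so every stored gap fits in `index_bits` bits.
    (Field widths and padding convention are assumptions.)"""
    max_gap = (1 << index_bits) - 1
    gaps, values, col_ptr = [], [], [0]
    for col in weights.T:                       # walk columns
        last = -1
        for row in np.nonzero(col)[0]:
            gap = row - last - 1                # zeros skipped since last nonzero
            while gap > max_gap:                # insert dummy zero to bridge a long gap
                gaps.append(max_gap)
                values.append(0.0)
                gap -= max_gap + 1
            gaps.append(int(gap))
            values.append(float(col[row]))
            last = row
        col_ptr.append(len(values))             # where the next column starts
    return (np.array(gaps, dtype=np.uint8),
            np.array(values, dtype=np.float32),
            np.array(col_ptr, dtype=np.int32))

# Example: a small diagonal matrix; every gap fits into 4 bits.
gaps, values, col_ptr = encode_relative_csc(np.diag([1.0, 2.0, 3.0, 4.0]))
print(gaps, values, col_ptr)
```

Storing gaps instead of absolute row indices keeps the index field narrow, which is what makes the compressed model hardware-friendly to stream through the PEs.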
The proposed flow for efficient deep learning inference is illustrated in Fig. 1. It shows a new paradigm, from Training => Inference to Training => Compression => Accelerated Inference, which has the advantages of faster inference and higher energy efficiency.