ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA

Song Han^{1,2}, Junlong Kang^2, Huizi Mao^{1,2}, Yiming Hu^{2,3}, Xin Li^2, Yubin Li^2, Dongliang Xie^2, Hong Luo^2, Song Yao^2, Yu Wang^{2,3}, Huazhong Yang^3 and William J. Dally^{1,4}

^1 Stanford University, ^2 DeePhi Tech, ^3 Tsinghua University, ^4 NVIDIA
^1 {songhan,dally}@stanford.edu, ^2 song.yao@deephi.tech, ^3 yu-wang@mail.tsinghua.edu.cn
ABSTRACT
Long Short-Term Memory (LSTM) is widely used in speech recognition. In order to achieve higher prediction accuracy, machine learning scientists have built increasingly larger models. Such large models are both computation- and memory-intensive. Deploying such bulky models results in high power consumption and a high total cost of ownership (TCO) for a data center.
To speed up the prediction and make it energy efficient, we first propose a load-balance-aware pruning method that can compress the LSTM model size by 20× (10× from pruning and 2× from quantization) with negligible loss of prediction accuracy; the load balancing also ensures high hardware utilization. Next, we propose a scheduler that encodes and partitions the compressed model across multiple processing elements (PEs) for parallelism and schedules the complicated LSTM data flow. Finally, we design a hardware architecture named ESE that works directly on the sparse LSTM model.
Implemented on a Xilinx XCKU060 FPGA running at 200 MHz, ESE achieves 282 GOPS working directly on the sparse LSTM network, corresponding to 2.52 TOPS on the dense one, and processes a full LSTM for speech recognition with a power dissipation of 41 W. Evaluated on the LSTM speech recognition benchmark, ESE is 43× and 3× faster than the Core i7-5930K CPU and Pascal Titan X GPU implementations, and it achieves 40× and 11.5× higher energy efficiency than the CPU and GPU, respectively.
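To make the load-balance-aware pruning idea concrete, the sketch below (NumPy, with an interleaved row-to-PE assignment and illustrative names that are assumptions rather than the paper's implementation) prunes each PE's share of a weight matrix to the same target density, so no PE becomes a straggler when the sparse matrix is processed in parallel:

```python
import numpy as np

def load_balance_aware_prune(weights, density=0.1, num_pes=32):
    """Keep the same number of nonzeros in every PE's row group
    (illustrative sketch; parameters and row assignment are assumed)."""
    pruned = weights.copy()
    for pe in range(num_pes):
        rows = np.arange(pe, pruned.shape[0], num_pes)   # interleaved rows for this PE
        block = pruned[rows]                             # fancy indexing returns a copy
        keep = max(1, int(round(block.size * density)))  # nonzeros this PE keeps
        threshold = np.sort(np.abs(block), axis=None)[-keep]
        block[np.abs(block) < threshold] = 0.0           # drop smallest-magnitude weights
        pruned[rows] = block
    return pruned

# Every PE group ends up with (roughly) the same nonzero count.
W = np.random.randn(1024, 512).astype(np.float32)
W_sparse = load_balance_aware_prune(W, density=0.1, num_pes=32)
print("overall density:", np.count_nonzero(W_sparse) / W_sparse.size)
```

Because every PE keeps the same number of nonzeros, the per-PE work is balanced, which is the high hardware utilization the pruning method is designed to preserve.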
Keywords
Deep Learning; Speech Recognition; Model Compression;
Hardware Acceleration; Software-Hardware Co-Design; FPGA
1. INTRODUCTION
Deep neural networks are widely used for speech recognition [6, 13]. Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) are two popular types of recurrent neural networks (RNNs) used for speech recognition. In this work, we evaluate the more complex of the two: the LSTM [14].
Figure 1: Proposed efficient DNN deployment flow: model compression + accelerated inference.
Figure 2: ESE optimizes LSTM computation across the algorithm, software and hardware stack: load-balance-aware pruning and weight-sharing compression at the algorithm level (20× smaller, similar accuracy), scheduling and compiling into a relative-indexed blocked CSC format at the software level, and sparsity- and load-balance-aware FPGA acceleration at the hardware level (3× speedup, 11.5× lower energy).
A similar methodology can easily be applied to other types of recurrent neural networks.
Despite its high prediction accuracy, LSTM is hard to deploy because of its high computational complexity and large memory footprint, which lead to high power consumption. A memory reference consumes more than two orders of magnitude more energy than an ALU operation, so we focus on optimizing the memory footprint.
To reduce the memory footprint, we design a novel method to optimize across the algorithm, software and hardware stack: we first optimize the algorithm by compressing the LSTM model to 5% of its original size (10% density and 2× narrower weights) while retaining similar accuracy; then we develop a software mapping strategy to represent the compressed model in a hardware-friendly way; finally we design specialized hardware that works directly on the compressed LSTM model.
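As a rough illustration of the software mapping step, the following sketch encodes a pruned weight matrix column by column with relative (gap) row indices, in the spirit of the relative-indexed CSC format mentioned in Fig. 2; the 4-bit index width and the zero-padding convention here are assumptions for illustration, not the exact on-chip layout:

```python
import numpy as np

def encode_relative_csc(weights, index_bits=4):
    """Column-major (CSC-style) encoding with relative row indices.
    A gap wider than 2**index_bits - 1 is bridged with zero-valued
    padding entries so every stored gap fits in `index_bits` bits.
    (Field widths and padding convention are assumptions.)"""
    max_gap = (1 << index_bits) - 1
    gaps, values, col_ptr = [], [], [0]
    for col in weights.T:                       # walk columns
        last = -1
        for row in np.nonzero(col)[0]:
            gap = row - last - 1                # zeros skipped since last nonzero
            while gap > max_gap:                # insert dummy zero to bridge a long gap
                gaps.append(max_gap)
                values.append(0.0)
                gap -= max_gap + 1
            gaps.append(int(gap))
            values.append(float(col[row]))
            last = row
        col_ptr.append(len(values))             # where the next column starts
    return (np.array(gaps, dtype=np.uint8),
            np.array(values, dtype=np.float32),
            np.array(col_ptr, dtype=np.int32))

# Example: a small diagonal matrix; every gap fits into 4 bits.
gaps, values, col_ptr = encode_relative_csc(np.diag([1.0, 2.0, 3.0, 4.0]))
print(gaps, values, col_ptr)
```

Storing gaps instead of absolute row indices keeps the index field narrow, which is what makes the compressed model hardware-friendly to stream through the PEs.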
The proposed flow for efficient deep learning inference is illustrated in Fig. 1. It shows a new paradigm, from Training => Inference to Training => Compression => Accelerated Inference, which has the advantages of faster inference and higher energy efficiency.