Enabling Embedded Inference Engine
with the ARM Compute Library: A Case
Study
Dawei Sun, Shaoshan Liu*, and Jean-Luc Gaudiot
If you need to enable deep learning on low-cost embedded SoCs, should you port an existing deep learning framework
or should you build one from scratch? In this paper, we seek to answer this question by sharing our practical
experience of building an embedded inference engine with the ARM Compute Library (ACL). The results show that,
contrary to conventional wisdom, for simple models it takes much less development time to build an inference
engine from scratch than to port an existing framework. In addition, by utilizing ACL, we managed to build
an inference engine that outperforms TensorFlow by 25%. Our conclusion is that embedded devices will most
likely run very simple deep learning models for inference, and with well-developed building blocks such as ACL,
building the engine from scratch can yield better performance and lower development time.
Enabling Inference on Embedded Devices
We were building an Internet-of-Things product with inference capabilities on our bare-metal ARM SoC, code-named
Zuluko (the Zuluko SoC contains four ARMv7 cores running at 1 GHz and 512 MB of RAM). At its peak it
consumes about 3 W of power and costs only about four dollars. Everything was progressing smoothly until we had
to enable high-performance inference on it. An easy option was to port an existing deep learning
platform, so we chose TensorFlow [1], which delivered the best performance on ARM/Linux platforms in our study.
We thought this would be an easy task, but it took us days just to port all of TensorFlow's dependencies before we could
even run the platform itself. Eventually, after a week of intensive effort, we managed to run TensorFlow on
Zuluko. This experience made us wonder whether it is worthwhile to build an inference engine from scratch rather
than port an existing platform. The question has two implications: first, without basic building blocks such as a
convolution operator, it would be very hard to build an inference engine from scratch; second, an engine
built from scratch might not outperform a well-tested deep learning framework. We examine both concerns in the
following sections.
Building an Inference Engine with the ARM Compute Library
Recently, ARM announced its Compute Library [2], a comprehensive collection of software functions implemented
for the ARM Cortex-A family of CPUs and the ARM Mali family of GPUs. Specifically, it provides the
basic building blocks of convolutional neural networks, including Activation, Convolution, Fully Connected,
Locally Connected, Normalization, Pooling, and Softmax layers. These are exactly what we needed to build an inference
engine. We went ahead and attempted to build SqueezeNet [3] from these building blocks.
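To give a sense of how these building blocks are used, the sketch below configures a single 3 × 3 convolution followed by a ReLU activation with ACL's NEON runtime functions. The shapes (a 224 × 224 × 3 input, 64 filters at stride 2, roughly SqueezeNet's first layer) are illustrative, and the exact configure() signatures vary slightly between ACL releases, so treat this as a minimal sketch under those assumptions rather than version-exact code.

#include "arm_compute/core/Types.h"
#include "arm_compute/runtime/NEON/NEFunctions.h"
#include "arm_compute/runtime/Tensor.h"

using namespace arm_compute;

int main()
{
    // Tensors for a single 3 x 3 convolution (64 filters, stride 2) plus ReLU.
    Tensor input, weights, biases, conv_out, relu_out;

    // Shapes follow ACL's (width, height, channels[, batches]) convention.
    input.allocator()->init(TensorInfo(TensorShape(224U, 224U, 3U), 1, DataType::F32));
    weights.allocator()->init(TensorInfo(TensorShape(3U, 3U, 3U, 64U), 1, DataType::F32));
    biases.allocator()->init(TensorInfo(TensorShape(64U), 1, DataType::F32));
    conv_out.allocator()->init(TensorInfo(TensorShape(111U, 111U, 64U), 1, DataType::F32));
    relu_out.allocator()->init(TensorInfo(TensorShape(111U, 111U, 64U), 1, DataType::F32));

    // Configure the layers: stride 2, no padding, followed by a ReLU activation.
    NEConvolutionLayer conv;
    conv.configure(&input, &weights, &biases, &conv_out, PadStrideInfo(2, 2, 0, 0));

    NEActivationLayer relu;
    relu.configure(&conv_out, &relu_out,
                   ActivationLayerInfo(ActivationLayerInfo::ActivationFunction::RELU));

    // Allocate backing memory; in a real engine the input and trained weights
    // would be filled here before running the layers.
    for(Tensor *t : { &input, &weights, &biases, &conv_out, &relu_out })
        t->allocator()->allocate();

    conv.run();
    relu.run();
    return 0;
}

Each layer is a function object that is configured once with its input and output tensors and then executed repeatedly with run(), which is the pattern we followed throughout our engine.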
To construct SqueezeNet, we started by building the fire module proposed in [3]. As shown in Figure 1, SqueezeNet
uses a 1 × 1 convolution kernel to reduce the input size of the 3 × 3 convolution layer while maintaining similar
inference accuracy, and then uses an expand stage to keep the output dimensions of the module unchanged. This
squeeze-and-expand structure is the fire module, the core of SqueezeNet. We implemented the fire module with the
ACL core operators, and our implementation eliminates the extra memory copies that a separate concatenation
operation would otherwise require.
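Sketched below is one way such a fire module can be assembled from the ACL NEON operators: a 1 × 1 squeeze convolution, then parallel 1 × 1 and 3 × 3 expand convolutions, each followed by ReLU, with both expand branches writing directly into channel slices of a shared output tensor through ACL's SubTensor views so that no explicit concatenation pass is needed. The FireModule struct, its parameter names, and the in-place activations are our illustrative assumptions rather than ACL requirements, and configure() signatures may differ between ACL versions.

#include <memory>
#include "arm_compute/core/Types.h"
#include "arm_compute/runtime/NEON/NEFunctions.h"
#include "arm_compute/runtime/SubTensor.h"
#include "arm_compute/runtime/Tensor.h"

using namespace arm_compute;

// Sketch of one SqueezeNet fire module built from ACL NEON functions.
struct FireModule
{
    NEConvolutionLayer squeeze, expand1x1, expand3x3;
    NEActivationLayer  squeeze_relu, expand1x1_relu, expand3x3_relu;
    std::unique_ptr<SubTensor> out1x1, out3x3;   // channel-wise views into the output

    // All tensors are assumed to be initialized (shape/type), allocated, and the
    // weight/bias tensors filled with trained parameters by the caller.
    void configure(Tensor *input,
                   Tensor *sq_w, Tensor *sq_b, Tensor *sq_out,
                   Tensor *e1_w, Tensor *e1_b,
                   Tensor *e3_w, Tensor *e3_b,
                   Tensor *output)
    {
        const TensorShape out_shape = output->info()->tensor_shape();
        const size_t half = out_shape.z() / 2;  // assume equal 1x1 and 3x3 expand filter counts
        const TensorShape slice(out_shape.x(), out_shape.y(), half);

        // Sub-tensor views over the output; writing into them replaces concatenation.
        out1x1.reset(new SubTensor(output, slice, Coordinates(0, 0, 0)));
        out3x3.reset(new SubTensor(output, slice, Coordinates(0, 0, static_cast<int>(half))));

        const ActivationLayerInfo relu(ActivationLayerInfo::ActivationFunction::RELU);

        // Squeeze: 1x1 convolution reduces the channel count fed to the expand stage.
        squeeze.configure(input, sq_w, sq_b, sq_out, PadStrideInfo(1, 1, 0, 0));
        squeeze_relu.configure(sq_out, sq_out, relu);  // in-place ReLU (assumed supported)

        // Expand branches: 1x1 (no padding) and 3x3 (padding 1) keep the same spatial
        // size, so their outputs tile the output tensor along the channel dimension.
        expand1x1.configure(sq_out, e1_w, e1_b, out1x1.get(), PadStrideInfo(1, 1, 0, 0));
        expand1x1_relu.configure(out1x1.get(), out1x1.get(), relu);

        expand3x3.configure(sq_out, e3_w, e3_b, out3x3.get(), PadStrideInfo(1, 1, 1, 1));
        expand3x3_relu.configure(out3x3.get(), out3x3.get(), relu);
    }

    void run()
    {
        squeeze.run();   squeeze_relu.run();
        expand1x1.run(); expand1x1_relu.run();
        expand3x3.run(); expand3x3_relu.run();
    }
};

Because the two expand outputs already occupy adjacent channel ranges of the output tensor, the module produces a concatenated result without moving any data, which is the memory-copy saving described above.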