Enabling Embedded Inference Engine
with the ARM Compute Library: A Case
Study
Dawei Sun, Shaoshan Liu*, and Jean-Luc Gaudiot
If you need to enable deep learning on low-cost embedded SoCs, should you port an existing deep learning framework
or should you build one from scratch? In this paper, we seek to answer this question by sharing our practical
experience of building an embedded inference engine with the ARM Compute Library (ACL). The results show that,
contrary to conventional wisdom, for simple models it takes much less development time to build an inference
engine from scratch than to port an existing framework. In addition, by utilizing ACL, we managed to build
an inference engine that outperforms TensorFlow by 25%. Our conclusion is that embedded devices will most
likely run very simple deep learning models for inference, and with well-developed building blocks such as ACL,
building the engine from scratch can yield better performance and lower development time.
Enabling Inference on Embedded Devices
We were building an Internet-of-Things product with inference capabilities on our bare-metal ARM SoC, code-named
Zuluko (the Zuluko SoC contains four ARMv7 cores running at 1 GHz and 512 MB of RAM). At its peak it
consumes about 3 W of power and costs only about four dollars. Everything was progressing smoothly until we had
to enable high-performance inference on it. An easy option was to port an existing deep learning
platform, so we chose TensorFlow [1], which delivered the best performance on ARM/Linux platforms in our study.
We thought this would be an easy task, but it took us days just to port all of TensorFlow's dependencies before we could
even run the platform itself. Eventually, after a week of intensive effort, we managed to run TensorFlow on
Zuluko. This experience made us wonder whether it is worthwhile to build an inference engine from scratch rather
than port an existing platform. The question has two implications: first, without basic building blocks such as a
convolution operator, it would be very hard to build an inference engine from scratch; second, an engine
built from scratch might not outperform a well-tested deep learning framework. We examine both concerns in the
following sections.
Building an Inference Engine with the ARM Compute Library
Recently, ARM announced its Compute Library [2], a comprehensive collection of software functions implemented
for the ARM Cortex-A family of CPUs and the ARM Mali family of GPUs. Specifically, it provides the
basic building blocks of convolutional neural networks, including Activation, Convolution, Fully Connected,
Locally Connected, Normalization, Pooling, and Softmax layers. These are exactly what we needed to build an inference
engine. We went ahead and attempted to build SqueezeNet [3] from these building blocks.
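To give a sense of how these building blocks are used, the sketch below configures a single 3 × 3 convolution followed by a ReLU activation with ACL's NEON runtime functions. The shapes (a 224 × 224 × 3 input, 64 filters at stride 2, roughly SqueezeNet's first layer) are illustrative, and the exact configure() signatures vary slightly between ACL releases, so treat this as a minimal sketch under those assumptions rather than version-exact code.

#include "arm_compute/core/Types.h"
#include "arm_compute/runtime/NEON/NEFunctions.h"
#include "arm_compute/runtime/Tensor.h"

using namespace arm_compute;

int main()
{
    // Tensors for a single 3 x 3 convolution (64 filters, stride 2) plus ReLU.
    Tensor input, weights, biases, conv_out, relu_out;

    // Shapes follow ACL's (width, height, channels[, batches]) convention.
    input.allocator()->init(TensorInfo(TensorShape(224U, 224U, 3U), 1, DataType::F32));
    weights.allocator()->init(TensorInfo(TensorShape(3U, 3U, 3U, 64U), 1, DataType::F32));
    biases.allocator()->init(TensorInfo(TensorShape(64U), 1, DataType::F32));
    conv_out.allocator()->init(TensorInfo(TensorShape(111U, 111U, 64U), 1, DataType::F32));
    relu_out.allocator()->init(TensorInfo(TensorShape(111U, 111U, 64U), 1, DataType::F32));

    // Configure the layers: stride 2, no padding, followed by a ReLU activation.
    NEConvolutionLayer conv;
    conv.configure(&input, &weights, &biases, &conv_out, PadStrideInfo(2, 2, 0, 0));

    NEActivationLayer relu;
    relu.configure(&conv_out, &relu_out,
                   ActivationLayerInfo(ActivationLayerInfo::ActivationFunction::RELU));

    // Allocate backing memory; in a real engine the input and trained weights
    // would be filled here before running the layers.
    for(Tensor *t : { &input, &weights, &biases, &conv_out, &relu_out })
        t->allocator()->allocate();

    conv.run();
    relu.run();
    return 0;
}

Each layer is a function object that is configured once with its input and output tensors and then executed repeatedly with run(), which is the pattern we followed throughout our engine.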
To construct SqueezeNet, we started by building the fire module proposed in [3]. As shown in Figure 1, SqueezeNet
uses a 1 × 1 convolution kernel to reduce the input size of the 3 × 3 convolution layer while maintaining similar
inference accuracy, and then uses an expand stage to keep the output dimensions of the module unchanged. This
squeeze-and-expand structure is the fire module, the core of SqueezeNet. We implemented the fire module with the
ACL core operators, and our implementation eliminates the extra memory copies that a separate concatenation
operation would otherwise require.
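Sketched below is one way such a fire module can be assembled from the ACL NEON operators: a 1 × 1 squeeze convolution, then parallel 1 × 1 and 3 × 3 expand convolutions, each followed by ReLU, with both expand branches writing directly into channel slices of a shared output tensor through ACL's SubTensor views so that no explicit concatenation pass is needed. The FireModule struct, its parameter names, and the in-place activations are our illustrative assumptions rather than ACL requirements, and configure() signatures may differ between ACL versions.

#include <memory>
#include "arm_compute/core/Types.h"
#include "arm_compute/runtime/NEON/NEFunctions.h"
#include "arm_compute/runtime/SubTensor.h"
#include "arm_compute/runtime/Tensor.h"

using namespace arm_compute;

// Sketch of one SqueezeNet fire module built from ACL NEON functions.
struct FireModule
{
    NEConvolutionLayer squeeze, expand1x1, expand3x3;
    NEActivationLayer  squeeze_relu, expand1x1_relu, expand3x3_relu;
    std::unique_ptr<SubTensor> out1x1, out3x3;   // channel-wise views into the output

    // All tensors are assumed to be initialized (shape/type), allocated, and the
    // weight/bias tensors filled with trained parameters by the caller.
    void configure(Tensor *input,
                   Tensor *sq_w, Tensor *sq_b, Tensor *sq_out,
                   Tensor *e1_w, Tensor *e1_b,
                   Tensor *e3_w, Tensor *e3_b,
                   Tensor *output)
    {
        const TensorShape out_shape = output->info()->tensor_shape();
        const size_t half = out_shape.z() / 2;  // assume equal 1x1 and 3x3 expand filter counts
        const TensorShape slice(out_shape.x(), out_shape.y(), half);

        // Sub-tensor views over the output; writing into them replaces concatenation.
        out1x1.reset(new SubTensor(output, slice, Coordinates(0, 0, 0)));
        out3x3.reset(new SubTensor(output, slice, Coordinates(0, 0, static_cast<int>(half))));

        const ActivationLayerInfo relu(ActivationLayerInfo::ActivationFunction::RELU);

        // Squeeze: 1x1 convolution reduces the channel count fed to the expand stage.
        squeeze.configure(input, sq_w, sq_b, sq_out, PadStrideInfo(1, 1, 0, 0));
        squeeze_relu.configure(sq_out, sq_out, relu);  // in-place ReLU (assumed supported)

        // Expand branches: 1x1 (no padding) and 3x3 (padding 1) keep the same spatial
        // size, so their outputs tile the output tensor along the channel dimension.
        expand1x1.configure(sq_out, e1_w, e1_b, out1x1.get(), PadStrideInfo(1, 1, 0, 0));
        expand1x1_relu.configure(out1x1.get(), out1x1.get(), relu);

        expand3x3.configure(sq_out, e3_w, e3_b, out3x3.get(), PadStrideInfo(1, 1, 1, 1));
        expand3x3_relu.configure(out3x3.get(), out3x3.get(), relu);
    }

    void run()
    {
        squeeze.run();   squeeze_relu.run();
        expand1x1.run(); expand1x1_relu.run();
        expand3x3.run(); expand3x3_relu.run();
    }
};

Because the two expand outputs already occupy adjacent channel ranges of the output tensor, the module produces a concatenated result without moving any data, which is the memory-copy saving described above.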