Applying the ARM SVE Instruction Set to Machine Learning in Practice


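This document presents worked examples of the ARM Scalable Vector Extension (SVE) in machine learning applications. It was written by Dan Andrei Iliescu and Francesco Petrogalli in November 2017. It discusses how to use SVE to vectorize the core computational kernels of machine learning, with particular attention to efficient vectorization of general matrix multiplication (GEMM) and low-precision matrix multiplication (GEMM lowp).

SVE is a vector extension for the AArch64 execution mode, designed for the A64 instruction set architecture. Its goal is to offer more data parallelism to high-performance computing, machine learning, and data-center applications. Its key feature is Vector Length Agnostic (VLA) programming: software can be written without knowing the vector length that a particular piece of hardware implements, which makes the code portable across implementations.

Compared with traditional single instruction, multiple data (SIMD) instruction sets, SVE permits wider vectors. The vector length is chosen by each hardware implementation and can be any multiple of 128 bits, from 128 bits up to 2048 bits.

SVE is programmed from C and C++ through the ARM C Language Extensions (ACLE) for SVE. The ACLE provides SVE intrinsic functions, predefined functions that correspond closely to SVE instructions, so that programmers can write vectorized code without resorting to assembly. The document uses these intrinsics to illustrate vectorization in the following cases:

1. Matrix multiplication: the core operation of many machine learning algorithms, which SVE can speed up through vectorization.
2. Simple vectorization: basic techniques such as processing array elements in parallel.
3. Unrolled vectorization: loop unrolling on top of vectorization, which reduces control-flow overhead and exposes more data parallelism.
4. Dot product: another operation that matters in machine learning, for which SVE handles long vectors efficiently.

The document also mentions tools for SVE development that help developers debug, optimize, and validate SVE code. It closes with acknowledgements, trademark notices, and references. Through these examples, readers can see how SVE improves compute performance in machine learning workloads, especially on large data sets.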

White paper

Listing 2.1: VLA vectorization example using the SVE ACLE.

 1  // Scalar version.
 2  void add_arrays(double *dst, double *src, double c, const int N) {
 3    for (int i = 0; i < N; i++)
 4      dst[i] = src[i] + c;
 5  }
 6
 7  // Vector version
 8  void vla_add_arrays(double *dst, double *src, double c, const int N) {
 9    svfloat64_t vc = svdup_f64(c);
10    for (int i = 0; i < N; i += svcntd()) {
11      svbool_t Pg = svwhilelt_b64(i, N);
12      svfloat64_t vsrc = svld1(Pg, &src[i]);
13      svfloat64_t vdst = svadd_x(Pg, vsrc, vc);
14      svst1(Pg, &dst[i], vdst);
15    }
16  }

First, the constant c is replicated into all the lanes of a vector vc with the svdup_f64 function (line 9). Note that although we are using the short form of the ACLE, the _f64 part of the name is required: because of standard C scalar promotion, the names of functions that take only scalar arguments cannot be shortened (see section 4.2 of [2] for a detailed explanation).

Next comes the header of the vector loop (line 10). The number of lanes that one iteration of the vector loop can process is unknown at compile time, so the induction variable i has to be incremented at run time by the result of the svcntd() function, which returns the number of 64-bit (double-word) lanes in an SVE vector, referred to as VL.D hereafter.
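
As a rough illustration of the VL.D arithmetic, the following portable C sketch computes how many lanes a vector of a given width holds and how many trips the vector loop makes. It does not use SVE at all; the helper names lanes_per_vector and vla_iterations are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of lanes of elem_bits each in a vector of
 * vl_bits. For elem_bits == 64 this is the value svcntd() would report on
 * hardware whose SVE vector length is vl_bits. */
size_t lanes_per_vector(size_t vl_bits, size_t elem_bits) {
  /* SVE vector lengths are multiples of 128 bits, from 128 to 2048. */
  assert(vl_bits >= 128 && vl_bits <= 2048 && vl_bits % 128 == 0);
  assert(elem_bits == 8 || elem_bits == 16 || elem_bits == 32 ||
         elem_bits == 64);
  return vl_bits / elem_bits;
}

/* Hypothetical helper: number of iterations of a loop that advances by
 * `lanes` elements per trip, i.e. ceil(n / lanes). */
size_t vla_iterations(size_t n, size_t lanes) {
  assert(lanes > 0);
  return (n + lanes - 1) / lanes;
}
```

With a 256-bit vector, lanes_per_vector(256, 64) is 4, so a loop over N = 7 doubles takes vla_iterations(7, 4) = 2 trips. With a 512-bit vector the same source runs in a single trip, which is the point of writing the loop header in terms of svcntd().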

In the body of the loop, the predicate Pg is set with the svwhilelt_b64 function (line 11). This function builds a predicate by testing the inequality i < N for every value of the induction variable covered by the current iteration of the vector loop, and associating each result with the corresponding lane of the vector register. At iteration i, it computes j < N for j = i, i+1, ..., i+VL.D-1. The 64-bit-lane view of the predicate is selected by the _b64 part of the intrinsic name, which cannot be dropped from the name because of C scalar promotion.
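
The lane-by-lane test described above can be modelled in plain C. This is only a scalar sketch of the behaviour, not the ACLE intrinsic: the function name model_whilelt and the use of a bool array to stand in for the predicate register are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Scalar model of the predicate described in the text: lane k is active
 * when i + k < n. `lanes` stands in for VL.D. Writes one bool per lane
 * into pg and returns the number of active lanes. */
size_t model_whilelt(int i, int n, size_t lanes, bool *pg) {
  assert(pg != NULL);
  size_t active = 0;
  for (size_t k = 0; k < lanes; k++) {
    /* Widen before adding so that i + k cannot overflow int. */
    pg[k] = ((long long)i + (long long)k < (long long)n);
    if (pg[k]) active++;
  }
  return active;
}
```

For lanes = 4 and n = 7, calling the model with i = 0 marks all four lanes active, while i = 4 marks lanes 0 to 2 active (elements 4, 5, 6) and leaves the last lane inactive.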

On a 256-bit implementation, with N = 7, the predicate Pg would look as follows in the second iteration of the loop:

      MSB                             LSB
Pg = [00000000 00000001 00000001 00000001]
         7        6        5        4       64-bit lanes index 'i'

Using the predicate Pg removes the need for a separate loop to handle the remainder elements that would not fill a whole vector. The predicate values over all the loop iterations for the 256-bit example where N = 7 are shown in figure 1.
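
To make the "no remainder loop" point concrete, here is a minimal sketch that compiles on any C compiler. When the compiler targets SVE (the ACLE feature macro __ARM_FEATURE_SVE is defined) it uses the same intrinsics as the vector version of Listing 2.1; otherwise it falls back to a scalar model that assumes a fixed 4-lane (256-bit) vector. The function name vla_add_arrays_portable and the 4-lane fallback are assumptions of this example, not part of the white paper.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

/* Computes dst[i] = src[i] + c for i in [0, n) with a single predicated
 * loop and no separate tail loop. */
void vla_add_arrays_portable(double *dst, const double *src, double c, int n) {
  assert(dst != NULL && src != NULL && n >= 0);
#if defined(__ARM_FEATURE_SVE)
  /* Same structure as the vector version in the listing. */
  svfloat64_t vc = svdup_f64(c);
  for (int i = 0; i < n; i += (int)svcntd()) {
    svbool_t pg = svwhilelt_b64(i, n);      /* lanes with i + k < n */
    svfloat64_t vsrc = svld1(pg, &src[i]);  /* inactive lanes not loaded */
    svst1(pg, &dst[i], svadd_x(pg, vsrc, vc));
  }
#else
  /* Assumption for this sketch: model a 256-bit vector (4 double lanes). */
  enum { LANES = 4 };
  for (int i = 0; i < n; i += LANES) {
    for (int k = 0; k < LANES; k++) {
      int active = (i + k < n);  /* lane k of the modelled predicate */
      if (active) {
        dst[i + k] = src[i + k] + c;  /* inactive lanes are left untouched */
      }
    }
  }
#endif
}
```

In both branches the last trip simply runs with some lanes switched off, so elements at index n and beyond are never read or written. With n = 7 and four lanes, the second trip updates elements 4 to 6 and skips the lane that would map to index 7.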

