SIMD prefixes. The 128-bit data processing instructions in AVX cover floating-point and integer data movement
primitives.
Additional enhancements to the 128-bit data processing primitives in AVX include 16 new instructions with the
following capabilities:
• Non-unit-strided fetching of SIMD data. AVX provides several flexible SIMD floating-point data fetching
primitives:
— broadcast of a single data element into a 128-bit destination,
— masked move primitives to load or store SIMD data elements conditionally (a short intrinsics sketch follows this list).
• Intra-register manipulation of SIMD data elements. AVX provides several flexible SIMD floating-point data
manipulation primitives:
— permute primitives to facilitate efficient manipulation of floating-point data elements in 128-bit SIMD
registers.
• Branch handling. AVX provides several primitives to enable handling of branches in SIMD programming:
— new variable blend instructions support a four-operand syntax with non-destructive source operands.
Branching conditions that depend on floating-point or integer data can benefit from Intel AVX. This is
more flexible than the non-VEX-encoded instruction syntax, which uses the XMM0 register as an implied
mask for blend selection. While variable blend with the implied-XMM0 syntax is supported in SSE4 using
SIMD prefix encoding, the VEX-encoded 128-bit variable blend instructions support only the more flexible
four-operand syntax.
— Packed TEST instructions for floating-point data.
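The sketch below is illustrative rather than part of the instruction set reference. It assumes a C compiler that exposes the Intel intrinsics in <immintrin.h> and a build with AVX enabled (for example, -mavx); the function and variable names are hypothetical. It shows how the broadcast, masked move, and four-operand variable blend primitives described above are typically reached from C.

#include <immintrin.h>

/* dst[i] = src[i] * (*scale) for every i where cond[i] > 0;
 * the other elements of dst are left unchanged. */
void scaled_masked_copy(float *dst, const float *src, const int *cond,
                        const float *scale)
{
    /* VBROADCASTSS: replicate one scalar into all four elements. */
    __m128 s = _mm_broadcast_ss(scale);

    /* Build a mask whose elements are all-ones where cond[i] > 0. */
    __m128i c    = _mm_loadu_si128((const __m128i *)cond);
    __m128i mask = _mm_cmpgt_epi32(c, _mm_setzero_si128());

    /* VMASKMOVPS (load form): fetch only the selected elements. */
    __m128 v = _mm_maskload_ps(src, mask);

    /* VMASKMOVPS (store form): write back only the selected elements. */
    _mm_maskstore_ps(dst, mask, _mm_mul_ps(v, s));
}

/* Branch-free select: element-wise minimum of a and b, using the
 * four-operand VBLENDVPS instead of a data-dependent branch. */
__m128 select_min(__m128 a, __m128 b)
{
    __m128 lt = _mm_cmplt_ps(a, b);   /* per-element condition        */
    return _mm_blendv_ps(b, a, lt);   /* pick a[i] where lt[i] is set */
}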
1.5.5 AVX2 and 256-bit Vector Integer Processing
AVX2 promotes the vast majority of the 128-bit integer SIMD instruction sets to operate with 256-bit wide YMM
registers. AVX2 instructions are encoded using the VEX prefix and require the same operating system support as AVX.
Generally, most of the promoted 256-bit vector integer instructions follow the 128-bit lane operation, similar to the
promoted 256-bit floating-point SIMD instructions in AVX.
The new functionality in AVX2 generally falls into the following categories:
• Fetching non-contiguous data elements from memory using vector-index memory addressing. These “gather”
instructions introduce a new memory-addressing form, consisting of a base register and multiple indices
specified by a vector register (either XMM or YMM). Element sizes of 32 and 64 bits are supported, for both
floating-point and integer data types (a short intrinsics sketch follows this list).
• Cross-lane functionality is provided by several new broadcast and permute instructions. Some of the 256-bit
vector integer instructions promoted from the legacy SSE instruction sets also exhibit cross-lane behavior,
e.g., the VPMOVZX/VPMOVSX family.
• AVX2 complements the AVX instructions that are typed for floating-point operation with an equivalent set
operating on 32-bit and 64-bit integer data elements.
• Vector shift instructions with a per-element shift count. Element sizes of 32 and 64 bits are supported.
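The short sketch below is likewise illustrative only; it assumes a C compiler exposing the AVX2 intrinsics in <immintrin.h> and an AVX2-enabled build (for example, -mavx2), with hypothetical function names. It shows the gather form of memory addressing and the per-element variable shift.

#include <immintrin.h>

/* VPGATHERDD: out[i] = table[idx[i]] for eight 32-bit elements.
 * The base register is 'table', the vector index is 'idx', and the
 * scale factor is sizeof(int) = 4. */
__m256i gather8(const int *table, __m256i idx)
{
    return _mm256_i32gather_epi32(table, idx, 4);
}

/* VPSLLVD: per-element left shift, out[i] = a[i] << count[i]. */
__m256i shift_each(__m256i a, __m256i count)
{
    return _mm256_sllv_epi32(a, count);
}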
1.6 GENERAL PURPOSE INSTRUCTION SET ENHANCEMENTS
Enhancements in the general-purpose instruction set consist of several categories:
• A rich collection of instructions to manipulate integer data at bit-granularity. Most of the bit-manipulation
instructions employ VEX-prefix encoding to support three-operand syntax with non-destructive source
operands. Two of the bit-manipulation instructions (LZCNT, TZCNT) are not encoded using VEX. The VEX-
encoded bit-manipulation instructions include: ANDN, BEXTR, BLSI, BLSMSK, BLSR, BZHI, PEXT, PDEP, SARX,
SHLX, SHRX, and RORX.
• The enhanced integer multiply instruction (MULX), in conjunction with some of the bit-manipulation instructions,
allows software to accelerate arithmetic on large integers (wider than 128 bits); a short intrinsics sketch follows this list.
• The INVPCID instruction targets system software that manages processor context IDs.
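The sketch below is illustrative rather than part of the reference. It assumes an x86-64 C compiler exposing the BMI1/BMI2 intrinsics in <immintrin.h> and a build with -mbmi -mbmi2; the function names are hypothetical. It shows MULX reached through _mulx_u64, which is useful in large-integer arithmetic because it does not modify the arithmetic flags, and PEXT/PDEP for bit-field extraction and deposit.

#include <immintrin.h>
#include <stdint.h>

/* 64x64 -> 128-bit multiply via MULX; returns the low half and stores
 * the high half through 'hi'. Because MULX leaves the flags untouched,
 * the compiler can interleave it with an ADC carry chain when building
 * multiplications wider than 128 bits. */
uint64_t mul_64x64(uint64_t a, uint64_t b, uint64_t *hi)
{
    unsigned long long high;
    uint64_t lo = (uint64_t)_mulx_u64(a, b, &high);
    *hi = high;
    return lo;
}

/* PEXT gathers the even-indexed bits of x into the low 32 bits of the
 * result; PDEP then scatters those bits back out to the odd positions. */
uint64_t move_even_bits_to_odd(uint64_t x)
{
    uint64_t even = _pext_u64(x, 0x5555555555555555ULL);
    return _pdep_u64(even, 0xAAAAAAAAAAAAAAAAULL);
}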