Streaming Store Instruction Optimization on the Intel Xeon Phi Coprocessor


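"Streaming store instructions on the Intel Xeon Phi coprocessor are a way to optimize memory-bandwidth use in high-performance computing. In a 2012 technical presentation, Intel introduced new streaming store instructions, VMOVNRNGOAPS and VMOVNRNGOAPD, intended to improve the performance of vector-aligned, unmasked stores. They are aimed mainly at streaming kernels, where they avoid wasting memory bandwidth on reading the old contents of a cache line that is about to be overwritten completely. Starting with Composer XE 2013 Update 1, the compiler generates VMOVNRNGO instructions for streaming stores by default in certain cases. Users can give the compiler hints about when to generate these instructions, and can disable the feature with the option -opt-streaming-stores never."

On the Intel Xeon Phi coprocessor, streaming stores are an important performance tool. These instructions handle vector-aligned, unmasked stores, which are common in streaming workloads such as large-scale parallel computation or data processing. With an ordinary store, even when the store overwrites an entire cache line, the processor usually reads the line's old contents from memory first and only then writes the new data, which costs extra memory bandwidth.

VMOVNRNGOAPS and VMOVNRNGOAPD are the two streaming store instructions Intel introduced. They let the processor write the data without first reading the contents being replaced, which avoids that unnecessary bandwidth cost. This helps high-performance computing workloads that depend on memory bandwidth, especially streaming kernels dominated by contiguous stores.

Starting with Composer XE 2013 Update 1, the Intel compiler generates these streaming store instructions by default in certain cases. This makes it easier to write efficient code and reduces the need for manual tuning. To suit different applications, users can pass the option -opt-streaming-stores never to stop the compiler from generating these instructions, for example to keep program behavior as expected or to stay compatible with other optimization strategies.

Streaming store instructions are a key feature for memory-access efficiency on the Intel Xeon Phi coprocessor: by removing unnecessary memory reads, they improve the use of memory bandwidth. For developers, understanding how to use these instructions and the related compiler options is an important step in optimizing applications for the Intel Xeon Phi platform.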
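The bandwidth argument above can be made concrete with a back-of-the-envelope model. The sketch below is only an illustration under simplifying assumptions: a copy kernel `b[i] = a[i]` over whole 64-byte cache lines, where an ordinary store reads each destination line before overwriting it and a streaming store does not. The helper name `copy_traffic_bytes` is hypothetical and does not come from the slides or from any Intel library.

```c
#include <stddef.h>

/* Cache-line size assumed by the slides for a full vector store. */
enum { CACHE_LINE_BYTES = 64 };

/* Hypothetical helper: estimated memory traffic, in bytes, for copying
 * `lines` full cache lines from a[] to b[].
 * streaming == 0: each destination line is read before it is overwritten.
 * streaming != 0: the destination line is written without being read.
 * This is a simplified model; it ignores caching effects and prefetching. */
size_t copy_traffic_bytes(size_t lines, int streaming)
{
    size_t read_src  = lines * CACHE_LINE_BYTES;                 /* load a[]        */
    size_t write_dst = lines * CACHE_LINE_BYTES;                 /* store b[]       */
    size_t read_dst  = streaming ? 0 : lines * CACHE_LINE_BYTES; /* old b[] read in */
    return read_src + read_dst + write_dst;
}
```

Under this model an ordinary copy moves three lines of traffic per line copied and a streaming copy moves two, so the streaming version needs roughly one third less memory bandwidth for the same work.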


Heuristics for streaming stores

Compiler generates streaming store instructions only when:

• Compiler is able to vectorize the loop and generate an aligned unit-strided vector unmasked store:
  – If the store accesses in the loop are aligned properly, user can convey alignment information using pragmas/clauses
    – Ex: Use #pragma vector aligned OR !DEC$ vector aligned before loop to convey alignment of all memory refs inside loop including the stores
  – In some cases, even when there is no pragma to align the store-access, the compiler may align the store-access at runtime using a dynamic peel-loop based on its own heuristics
  – Based on alignment analysis, compiler could prove that the store accesses are aligned (at 64 bytes)
  – Store has to be aligned and be writing to a full cache line (vstore – 64 bytes, no masks)
  – Note that it is the responsibility of the user to align the data appropriately at allocation time using align clauses, aligned_malloc, “-align array64byte” option on Fortran, etc.

• Vector-stores are classified as nontemporal using one of:
  – User has specified a nontemporal pragma on the loop to mark the vector-stores as streaming
    – #pragma vector nontemporal (in C/C++) OR !DEC$ vector nontemporal (in F) before loop to mark aligned stores
    – Or communicate nontemporal-property of store using “#pragma vector nontemporal A” where “A[i] = …” is the store inside the loop
  – User has specified the compiler option “-opt-streaming-stores always” to force marking ALL aligned vector-stores as nontemporal
    – Has the implicit effect of adding the nontemporal pragma to all loops that are vectorized by the compiler in the compilation scope
    – Using this option on KNC has few negative consequences since the data remains in the L2 cache (just not in the L1 cache) – so this option can be used if most aligned vector-stores are nontemporal
    – Using this option on Xeon for cases where some accesses are temporal can cause significant performance losses since the streaming-store instructions on Xeon bypass the cache altogether
  – Fully automatic heuristic that will kick in when the loop has a constant large trip-count (known to the compiler)
    – Compiler will also generate a memory-fence after the loop in this case
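The dynamic peel-loop mentioned in the alignment conditions above can be pictured with a hand-written equivalent. This is a sketch of the general peeling idea, not the code the Intel compiler actually emits: the function names `peel_count` and `fill_with_peel` are hypothetical, and the sketch assumes a 4-byte `float` and a destination pointer that is at least float-aligned.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64 /* bytes covered by one full, unmasked vector store */

/* Hypothetical helper: number of leading elements to handle one at a time
 * so that dst + result sits on a 64-byte boundary (never more than n). */
size_t peel_count(const float *dst, size_t n)
{
    uintptr_t misalign = (uintptr_t)dst % CACHE_LINE;
    size_t peel = misalign ? (CACHE_LINE - misalign) / sizeof(float) : 0;
    return peel < n ? peel : n;
}

/* Fill dst[0..n) with value, split the way a peeled loop is split. */
void fill_with_peel(float *dst, float value, size_t n)
{
    size_t per_line = CACHE_LINE / sizeof(float);
    size_t peel = peel_count(dst, n);
    size_t main_end = peel + (n - peel) / per_line * per_line;
    size_t i = 0;

    for (; i < peel; ++i)      /* peel loop: unaligned head, scalar stores     */
        dst[i] = value;
    for (; i < main_end; ++i)  /* main loop: starts aligned, whole cache lines */
        dst[i] = value;
    for (; i < n; ++i)         /* remainder loop: partial last cache line      */
        dst[i] = value;
}
```

Only the middle loop writes aligned, full cache lines, so only that loop meets the slide's conditions for streaming stores; the head and tail cover partial lines.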

On KNC, the compiler generates streaming stores if the conditions listed above are satisfied.

Study the output of -vec-report6 to check whether the store is aligned and whether streaming stores are generated.
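Putting the conditions together, a candidate loop might look like the sketch below. It is a minimal illustration, assuming C11 `aligned_alloc` for the 64-byte allocation; the function names `alloc_aligned_floats` and `scale_stream` are hypothetical. The two pragmas are the ones named in the slides, guarded here so that other compilers simply skip them. Whether streaming stores are actually generated still depends on the compiler, the target, and the heuristics listed above.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64 /* bytes; alignment required for a full-line store */

/* Hypothetical helper: allocate n floats on a 64-byte boundary.
 * C11 aligned_alloc wants a size that is a multiple of the alignment,
 * so the byte count is rounded up. Returns NULL on failure. */
float *alloc_aligned_floats(size_t n)
{
    size_t bytes = n * sizeof(float);
    size_t rounded = (bytes + CACHE_LINE - 1) / CACHE_LINE * CACHE_LINE;
    return aligned_alloc(CACHE_LINE, rounded);
}

/* dst[i] = scale * src[i]. dst is written once and not read back here,
 * which is the access pattern streaming stores are meant for.
 * Both pointers must be 64-byte aligned, because "vector aligned"
 * applies to every memory reference in the loop, not only the store. */
void scale_stream(float *restrict dst, const float *restrict src,
                  float scale, size_t n)
{
#if defined(__INTEL_COMPILER)
#pragma vector aligned      /* all references in the loop are aligned */
#pragma vector nontemporal  /* mark the aligned stores as streaming   */
#endif
    for (size_t i = 0; i < n; ++i)
        dst[i] = scale * src[i];
}
```

When building such a file with the Intel compiler, adding -vec-report6 is the way the slides suggest to confirm that the store was recognized as aligned and that a streaming store was generated for it.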
