Accelerating Quicksort with AVX-512 on Intel Knights Landing


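The paper "Fast Sorting Algorithms using AVX-512 on Intel Knights Landing" is dated April 24, 2017. Its author, Berenger Bramas, is affiliated with the Max Planck Computing and Data Facility (MPCDF). The paper examines how to implement a fast sorting algorithm with the AVX-512 instruction set on the Intel Knights Landing (Intel KNL) processor.

AVX-512 is a wide vector (SIMD) instruction set. The author proposes a two-part hybrid sorting strategy. For small arrays, it uses a branch-free bitonic sort, which avoids the performance cost of conditional branches. For larger arrays, it uses an AVX-512 vectorized partition function, the key component of the well-known quicksort algorithm. Vectorization makes the partition step more efficient because the new instructions operate on many values at once.

The author emphasizes that the algorithm sorts in place and is easy to implement, thanks to the new instructions that AVX-512 provides. The paper also shows how to adapt the sorting algorithm to AVX-512 features for both integers and double-precision floating-point numbers. In performance tests on the Intel KNL platform, the approach is about 4 times faster than the GNU C++ standard library sort, for integers and for floating-point numbers, across different array sizes.

Keywords: quicksort, sorting algorithms, vectorization, AVX-512, Intel Knights Landing (KNL).

Beyond improving sorting performance, the work shows how modern processor features can drive the optimization of compute-intensive tasks. It is relevant to fields that depend on high-performance sorting, such as database servers and image processing, because it offers a way to improve performance on real hardware.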


Load/set/store. As in previous instruction sets, AVX-512 has instructions to load a contiguous block of values from main memory into a SIMD-vector (load), to fill a SIMD-vector with a given value (set), and to move a SIMD-vector back into memory (store).

Store some. A new operation saves only selected values from a SIMD-vector into memory (vpcompressd/vcompresspd). The values are saved contiguously. This is a major improvement because, without this instruction, several operations are needed to obtain the same result. For example, to save some values from a SIMD-vector v at address p in memory, one possibility is to load the current values from p into a SIMD-vector v', permute the values in v to move the values to store to the beginning, merge v and v', and finally save the resulting vector. The pseudo-code in Figure 4 describes the results obtained with this store-some instruction.

    void _mm512_mask_compressstoreu_epi32(
            int* ptr,
            __mmask16 msk,
            __m512i values) {
        offset = 0;
        for (idx from 0 to 15) {
            if (msk AND shift(1, idx)) {
                ptr[offset] = values[idx];
                offset += 1;
            }
        }
    }

    Store some (integer)

    void _mm512_mask_compressstoreu_pd(
            double* ptr,
            __mmask8 msk,
            __m512d values) {
        offset = 0;
        for (idx from 0 to 7) {
            if (msk AND shift(1, idx)) {
                ptr[offset] = values[idx];
                offset += 1;
            }
        }
    }

    Store some (double)

FIGURE 4: AVX-512 store-some behavior for integer and double floating-point vectors.
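As a rough illustration of why contiguous storing is useful for a partition step, the following portable C++ sketch models a 16-lane integer vector with `std::array` and emulates store-some with a scalar loop. The names `Vec16`, `compress_store`, and `partition_block` are hypothetical and exist only for this example. Unlike an in-place partition, the sketch writes to two separate output buffers, so it should be read as an illustration under these assumptions and not as the paper's partition routine.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar stand-in for a SIMD-vector of sixteen 32-bit integers.
using Vec16 = std::array<std::int32_t, 16>;

// Emulates store-some: lanes whose mask bit is set are written contiguously
// to dst. Returns the number of values written.
std::size_t compress_store(std::int32_t* dst, std::uint16_t mask, const Vec16& values) {
    std::size_t offset = 0;
    for (std::size_t idx = 0; idx < values.size(); ++idx) {
        if ((mask >> idx) & 1u) {
            dst[offset++] = values[idx];
        }
    }
    return offset;
}

// Splits one vector around pivot: values <= pivot go to low, the others to high.
// Returns the number of values written to low.
std::size_t partition_block(const Vec16& values, std::int32_t pivot,
                            std::int32_t* low, std::int32_t* high) {
    // Build the comparison mask, one bit per lane.
    std::uint16_t mask = 0;
    for (std::size_t idx = 0; idx < values.size(); ++idx) {
        if (values[idx] <= pivot) {
            mask = static_cast<std::uint16_t>(mask | (1u << idx));
        }
    }
    // Two store-some operations: one with the mask, one with its complement.
    const std::size_t n_low = compress_store(low, mask, values);
    const std::size_t n_high =
        compress_store(high, static_cast<std::uint16_t>(~mask), values);
    assert(n_low + n_high == values.size());
    return n_low;
}
```

With hardware support, each call to `compress_store` would correspond to a single store-some instruction, and the relative order of the selected values is preserved in both outputs.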

Vector permutation. Permuting the values inside a vector has been possible since AVX/AVX2 using permutevar8x32 (instruction vperm(d,ps)). This instruction reorders the values inside a SIMD-vector using a second integer array that contains the permutation indexes. AVX-512 also has this instruction for its own SIMD-vectors, and we summarize its behavior in Figure 5.

    __m512i _mm512_permutexvar_epi32(
            __m512i permIdxs,
            __m512i values) {
        __m512i res;
        for (idx from 0 to 15)
            res[idx] = values[permIdxs[idx]];
        return res;
    }

    Permute (integer)

    __m512d _mm512_permutexvar_pd(
            __m512i permIdxs,
            __m512d values) {
        __m512d res;
        for (idx from 0 to 7)
            res[idx] = values[permIdxs[idx]];
        return res;
    }

    Permute (double)

FIGURE 5: AVX-512 permute behavior for integer and double floating-point vectors.

Min/Max. The min/max operations (vpmaxsd/vpminsd/vmaxpd/vminpd) return a SIMD-vector where each value corresponds to the minimum/maximum value of the two input vectors at the same position (they do not return a single scalar as the global minimum/maximum among all the values). Such instructions exist in SSE/SSE2/AVX too.
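The element-wise nature of min/max can be modeled in portable scalar C++ as below. This is only a sketch: `Vec8d`, `vec_min`, `vec_max`, and `compare_exchange` are hypothetical names, and the compare-exchange helper is included as an assumed example of how such operations are commonly combined in sorting networks.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical scalar stand-in for a SIMD-vector of eight doubles.
using Vec8d = std::array<double, 8>;

// Lane-wise minimum: res[i] = min(a[i], b[i]).
Vec8d vec_min(const Vec8d& a, const Vec8d& b) {
    Vec8d res{};
    for (std::size_t i = 0; i < res.size(); ++i) {
        res[i] = std::min(a[i], b[i]);
    }
    return res;
}

// Lane-wise maximum: res[i] = max(a[i], b[i]).
Vec8d vec_max(const Vec8d& a, const Vec8d& b) {
    Vec8d res{};
    for (std::size_t i = 0; i < res.size(); ++i) {
        res[i] = std::max(a[i], b[i]);
    }
    return res;
}

// Branch-free compare-exchange of two vectors: afterwards a[i] <= b[i]
// in every lane (assuming the inputs contain no NaN).
void compare_exchange(Vec8d& a, Vec8d& b) {
    const Vec8d lo = vec_min(a, b);
    const Vec8d hi = vec_max(a, b);
    a = lo;
    b = hi;
    for (std::size_t i = 0; i < a.size(); ++i) {
        assert(a[i] <= b[i]);
    }
}
```

Note that neither helper reduces a vector to one scalar; each output lane depends only on the two input lanes at the same position.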

Comparison. In AVX-512, the value returned by a test/comparison (vpcmpd/vcmppd) is a mask (integer) and not a SIMD-vector of integers as it was in SSE/AVX. Therefore, it is easy to modify and work directly on the mask with arithmetic and binary operations for scalar integers. The behavior of the comparison is shown in Figure 6, where the mask is filled with bits from the comparison results. AVX-512 provides several instructions that use this type of mask, such as the conditional selection.
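A minimal sketch of a mask-returning comparison, written in portable scalar C++, may help make this concrete. The names `Vec16`, `compare_le_mask`, `count_selected`, and `lane_selected` are hypothetical; the point is only that the result is an ordinary integer that scalar code can negate, count, or test bit by bit.

```cpp
#include <array>
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar stand-in for a SIMD-vector of sixteen 32-bit integers.
using Vec16 = std::array<std::int32_t, 16>;

// Emulates a lane-wise "less than or equal" comparison:
// bit idx of the result is set when a[idx] <= b[idx].
std::uint16_t compare_le_mask(const Vec16& a, const Vec16& b) {
    std::uint16_t mask = 0;
    for (std::size_t idx = 0; idx < a.size(); ++idx) {
        if (a[idx] <= b[idx]) {
            mask = static_cast<std::uint16_t>(mask | (1u << idx));
        }
    }
    return mask;
}

// Because the mask is a plain integer, counting the selected lanes
// is a scalar population count.
std::size_t count_selected(std::uint16_t mask) {
    return std::bitset<16>(mask).count();
}

// Tests a single lane of the mask.
bool lane_selected(std::uint16_t mask, std::size_t idx) {
    assert(idx < 16);
    return ((mask >> idx) & 1u) != 0;
}
```

For example, the complement of a mask selects exactly the lanes where the comparison failed, so one comparison can serve both sides of a split.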

Conditional selection. Among the mask-based instructions, the mask move (vmovdqa32/vmovapd) selects values between two vectors using a mask. The behavior is shown in Figure 7, where a value is taken from the second vector where the mask is false and from the first vector otherwise. Achieving the same result in previous instruction sets required several operations rather than one dedicated instruction.
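The selection can be sketched in portable scalar C++ as follows. The helper names `select_lanes` and `cap_at` are hypothetical, and the parameter order `(mask, if_set, if_clear)` is this sketch's own choice for readability; it is not meant to mirror the operand order of any particular instruction.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar stand-in for a SIMD-vector of sixteen 32-bit integers.
using Vec16 = std::array<std::int32_t, 16>;

// Lane idx of the result comes from if_set when bit idx of mask is 1,
// otherwise from if_clear.
Vec16 select_lanes(std::uint16_t mask, const Vec16& if_set, const Vec16& if_clear) {
    Vec16 res{};
    for (std::size_t idx = 0; idx < res.size(); ++idx) {
        res[idx] = ((mask >> idx) & 1u) ? if_set[idx] : if_clear[idx];
    }
    return res;
}

// Caps every lane at limit without a per-lane branch on the data path:
// first compare to build a mask, then select between two vectors.
Vec16 cap_at(const Vec16& values, std::int32_t limit) {
    Vec16 limits{};
    limits.fill(limit);
    std::uint16_t mask = 0;
    for (std::size_t idx = 0; idx < values.size(); ++idx) {
        if (values[idx] > limit) {
            mask = static_cast<std::uint16_t>(mask | (1u << idx));
        }
    }
    const Vec16 res = select_lanes(mask, limits, values);
    for (std::size_t idx = 0; idx < res.size(); ++idx) {
        assert(res[idx] <= limit);
    }
    return res;
}
```

The compare-then-select pattern in `cap_at` is the scalar analogue of chaining a mask-producing comparison with a mask move.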

2.3. Vectorized Sorting Algorithms

The literature on sorting and vectorized sorting implementations is immense. Therefore, we only cite here some of the studies that we consider most connected to our work.

The sorting technique from [16] tries to remove branches and improve the branch prediction of a scalar sort, and the authors show a speedup by a factor of 2 against the STL (the implementation of the STL at that time was
