4 B. Bramas
Load/set/store. As in previous instruction sets, AVX-
512 has instructions to load a contiguous block of values
from main memory and to transform it into a SIMD-
vector (load), fill a SIMD-vector with a given value (set)
and move back a SIMD-vector into memory (store).
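As an illustration only, the three operations can be
modeled in portable scalar C. The type vec16i and the
functions vec_load, vec_set1 and vec_store are
hypothetical names introduced for this sketch; on real
hardware they would roughly correspond to intrinsics
such as _mm512_loadu_si512, _mm512_set1_epi32 and
_mm512_storeu_si512.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar stand-in for a 512-bit vector of 16 x int32. */
typedef struct { int32_t lane[16]; } vec16i;

/* load: copy 16 contiguous values from memory into a vector. */
static vec16i vec_load(const int32_t *src) {
    vec16i v;
    memcpy(v.lane, src, sizeof v.lane);
    return v;
}

/* set: fill every lane of the vector with the same value. */
static vec16i vec_set1(int32_t value) {
    vec16i v;
    for (int i = 0; i < 16; ++i) {
        v.lane[i] = value;
    }
    return v;
}

/* store: write the 16 lanes back to contiguous memory. */
static void vec_store(int32_t *dst, vec16i v) {
    memcpy(dst, v.lane, sizeof v.lane);
}
```

A round trip vec_store(out, vec_load(in)) leaves out
equal to in, which is the property the model is meant
to capture.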
Store some. A new operation saves only selected
values from a SIMD-vector into memory
(vpcompressd/vcompresspd). The selected values are
saved contiguously. This is a major improvement because,
without this instruction, several operations are needed
to obtain the same result. For example, to save some
values from a SIMD-vector v at address p in memory, one
possibility is to load the current values from p into a
SIMD-vector v', permute the values in v to move the
values to store to the beginning, merge v and v', and
finally save the resulting vector. The pseudo-code in
Figure 4 describes the results obtained with this
store-some instruction.
    void _mm512_mask_compressstoreu_epi32(
            int* ptr,
            __mmask16 msk,
            __m512i values) {
        offset = 0;
        for (idx from 0 to 15) {
            if (msk AND shift(1, idx)) {
                ptr[offset] = values[idx];
                offset += 1;
            }
        }
    }
Store some (integer)
    void _mm512_mask_compressstoreu_pd(
            double* ptr,
            __mmask8 msk,
            __m512d values) {
        offset = 0;
        for (idx from 0 to 7) {
            if (msk AND shift(1, idx)) {
                ptr[offset] = values[idx];
                offset += 1;
            }
        }
    }
Store some (double)
FIGURE 4: AVX-512 store-some behavior for an integer
vector and a double-precision floating-point vector.
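To make the comparison with the multi-step alternative
concrete, the following scalar C sketch models both
approaches and can be used to check that they leave
memory in the same state. The names compress_store and
store_some_multistep are hypothetical, and the loops
only emulate what the vector instructions would do;
this is not the intrinsic implementation.

```c
#include <stdint.h>

enum { LANES = 16 };

/* Scalar model of the store-some behavior: selected lanes are
 * written contiguously at ptr. Returns the number of stored values. */
static int compress_store(int32_t *ptr, uint16_t msk,
                          const int32_t values[LANES]) {
    int offset = 0;
    for (int idx = 0; idx < LANES; ++idx) {
        if (msk & (1u << idx)) {
            ptr[offset++] = values[idx];
        }
    }
    return offset;
}

/* Scalar model of the multi-step alternative. It assumes that all
 * LANES elements at ptr are readable and writable. */
static void store_some_multistep(int32_t *ptr, uint16_t msk,
                                 const int32_t v[LANES]) {
    int32_t v_prime[LANES];
    int32_t permuted[LANES] = {0};
    int count = 0;

    /* 1. Load the current memory content into v'. */
    for (int i = 0; i < LANES; ++i) {
        v_prime[i] = ptr[i];
    }
    /* 2. Permute v so that the values to store come first. */
    for (int idx = 0; idx < LANES; ++idx) {
        if (msk & (1u << idx)) {
            permuted[count++] = v[idx];
        }
    }
    /* 3. Merge: the first count lanes come from the permuted v,
     *    the remaining lanes keep the old content v'.
     * 4. Store the full merged vector back to memory. */
    for (int i = 0; i < LANES; ++i) {
        ptr[i] = (i < count) ? permuted[i] : v_prime[i];
    }
}
```

With values[i] = 10 * i and the mask 0x8421 (bits 0, 5,
10 and 15 set), both functions write 0, 50, 100, 150 to
the first four positions and leave the rest of the
buffer unchanged.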
Vector permutation. Permuting the values inside
a vector has been possible since AVX/AVX2 using
permutevar8x32 (instruction vperm(d,ps)). This
instruction reorders the values inside a SIMD-vector
using a second integer vector that contains the
permutation indexes. AVX-512 provides the same
instruction for its own SIMD-vectors, and we summarize
its behavior in Figure 5.
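As a usage sketch of the permutation, the scalar C model
below reorders 16 integers according to an index vector.
The function name permute is hypothetical. Masking the
index with LANES - 1 mirrors the fact that the 32-bit
AVX-512 permutation only uses the low 4 bits of each
index; the output buffer is assumed not to alias the
input.

```c
#include <stdint.h>

enum { LANES = 16 };

/* Scalar model of a full-width permutation:
 * res[idx] receives the value found at position perm_idxs[idx].
 * res must not overlap values. */
static void permute(int32_t res[LANES],
                    const int32_t perm_idxs[LANES],
                    const int32_t values[LANES]) {
    for (int idx = 0; idx < LANES; ++idx) {
        /* Keep only the low 4 bits so the index stays in 0..15. */
        res[idx] = values[perm_idxs[idx] & (LANES - 1)];
    }
}
```

For instance, filling perm_idxs with 15, 14, ..., 0
reverses the vector, a pattern that appears frequently
in vectorized sorting networks.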
    __m512i _mm512_permutexvar_epi32(
            __m512i permIdxs,
            __m512i values) {
        __m512i res;
        for (idx from 0 to 15)
            res[idx] = values[permIdxs[idx]];
        return res;
    }
Permute (integer)
    __m512d _mm512_permutexvar_pd(
            __m512i permIdxs,
            __m512d values) {
        __m512d res;
        for (idx from 0 to 7)
            res[idx] = values[permIdxs[idx]];
        return res;
    }
Permute (double)
FIGURE 5: AVX-512 permute behavior for an integer
vector and a double-precision floating-point vector.
Min/Max. The min/max operations
(vpmaxsd/vpminsd/vmaxpd/vminpd) return a SIMD-vector
where each value corresponds to the minimum/maximum
value of the two input vectors at the same position
(they do not return a single scalar as the global
minimum/maximum among all the values). Such
instructions exist in SSE/SSE2/AVX too.
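The element-wise nature of these operations can be
sketched in scalar C as follows. The names vec_min and
vec_max are hypothetical; applying both to the same pair
of inputs gives a lane-by-lane compare-and-exchange.

```c
#include <stdint.h>

enum { LANES = 16 };

/* Element-wise minimum: one result per lane, not a global minimum. */
static void vec_min(int32_t res[LANES],
                    const int32_t a[LANES], const int32_t b[LANES]) {
    for (int idx = 0; idx < LANES; ++idx) {
        res[idx] = (a[idx] < b[idx]) ? a[idx] : b[idx];
    }
}

/* Element-wise maximum: one result per lane, not a global maximum. */
static void vec_max(int32_t res[LANES],
                    const int32_t a[LANES], const int32_t b[LANES]) {
    for (int idx = 0; idx < LANES; ++idx) {
        res[idx] = (a[idx] > b[idx]) ? a[idx] : b[idx];
    }
}
```

After lo = vec_min(a, b) and hi = vec_max(a, b), every
lane satisfies lo[idx] <= hi[idx], and each lane pair
still holds the same two values as the inputs.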
Comparison. In AVX-512, the value returned by a
test/comparison (vpcmpd/vcmppd) is a mask (integer)
and not a SIMD-vector of integers as it was in
SSE/AVX. Therefore, it is easy to modify and work
directly on the mask with arithmetic and binary
operations for scalar integers. The behavior of the
comparison is shown in Figure 6, where the mask is
filled with bits from the comparison results. AVX-512
provides several instructions that use this type of mask,
such as the conditional selection.
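Because the mask is a plain integer, it can be handled
with ordinary scalar code. The sketch below models a
less-than comparison that produces a 16-bit mask and then
counts the set bits; cmp_lt_mask and mask_count are
hypothetical names, and the loops are a scalar emulation
rather than the instruction itself.

```c
#include <stdint.h>

enum { LANES = 16 };

/* Scalar model of a comparison returning a mask:
 * bit idx is set when a[idx] < b[idx]. */
static uint16_t cmp_lt_mask(const int32_t a[LANES],
                            const int32_t b[LANES]) {
    uint16_t msk = 0;
    for (int idx = 0; idx < LANES; ++idx) {
        if (a[idx] < b[idx]) {
            msk |= (uint16_t)(1u << idx);
        }
    }
    return msk;
}

/* Count the lanes selected by the mask with scalar bit operations. */
static int mask_count(uint16_t msk) {
    int count = 0;
    while (msk) {
        count += msk & 1u;
        msk >>= 1;
    }
    return count;
}
```

Comparing the values 0..15 against a vector filled with
5 gives the mask 0x001F (five lanes selected), and its
bitwise complement 0xFFE0 selects the remaining eleven
lanes without a second comparison.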
Conditional selection. Among the mask-based
instructions, the mask move (vmovdqa32/vmovapd) selects
values between two vectors using a mask. The behavior is
shown in Figure 7, where a value is taken from the
second vector where the mask is true and from the first
vector otherwise. In previous instruction sets, achieving
the same result required several operations rather than
one dedicated instruction.
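A scalar C model of this selection is sketched below.
The function name mask_select is hypothetical; the
argument order (first vector, mask, second vector) is
chosen to follow that of the _mm512_mask_mov_epi32
intrinsic, in which a lane is copied from the second
vector when its mask bit is set.

```c
#include <stdint.h>

enum { LANES = 16 };

/* Scalar model of a mask move: for each lane, take the value from
 * `values` when the mask bit is set, and from `sources` otherwise. */
static void mask_select(int32_t res[LANES],
                        const int32_t sources[LANES],
                        uint16_t msk,
                        const int32_t values[LANES]) {
    for (int idx = 0; idx < LANES; ++idx) {
        res[idx] = (msk & (1u << idx)) ? values[idx] : sources[idx];
    }
}
```

With sources filled with 0, values filled with 1 and the
mask 0x00F0, only lanes 4 to 7 of the result are 1.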
2.3. Vectorized Sorting Algorithms
The literature on sorting and vectorized sorting
implementations is immense. Therefore, we only
cite here some of the studies that we consider most
connected to our work.
The sorting technique from [16] tries to remove
branches and improve the branch prediction of a scalar
sort, and the authors show a speedup by a factor of 2
against the
STL (the implementation of the STL at that time was