AVX-512
AVX-512 are 512-bit extensions to the 256-bit Advanced Vector Extensions SIMD instructions for the x86 instruction set architecture, proposed by Intel in July 2013 and first implemented in the 2016 Intel Xeon Phi x200, and later in a number of AMD and other Intel CPUs. AVX-512 consists of multiple extensions that may be implemented independently. This policy is a departure from the historical requirement of implementing the entire instruction block. Only the core extension AVX-512F is required in all AVX-512 implementations.
Besides widening most 256-bit instructions, the extensions introduce various new operations, such as new data conversions, scatter operations, and permutations. The number of AVX registers is increased from 16 to 32, and eight new "mask registers" are added, which allow for variable selection and blending of the results of instructions. In CPUs with the vector length extension, which is included in most AVX-512-capable processors, these instructions may also be used on the 128-bit and 256-bit vector sizes.
AVX-512 is not the first 512-bit SIMD instruction set that Intel has introduced in processors: the earlier 512-bit SIMD instructions used in the first generation Xeon Phi coprocessors, derived from Intel's Larrabee project, are similar but not binary compatible and only partially source compatible.
The successor to AVX-512 is AVX10, announced in July 2023. AVX10 simplifies detection of supported instructions by introducing a version of the instruction set, where each subsequent version includes all instructions from the previous one. In the initial revisions of the AVX10 specification, the support for 512-bit vectors was made optional, which would allow Intel to support it in their E-cores. In later revisions, Intel made 512-bit vectors mandatory, with the intention to support 512-bit vectors both in P- and E-cores. The initial version 1 of AVX10 does not add new instructions compared to AVX-512, and for processors supporting 512-bit vectors it is equivalent to AVX-512. Later AVX10 versions will introduce new features.
Instruction set
The AVX-512 instruction set consists of several separate sets, each having its own unique CPUID feature bit. However, they are typically grouped by the processor generation that implements them.
F, CD, ER, PF: introduced with Xeon Phi x200 and Xeon Scalable, with the last two being specific to Knights Landing & Knights Mill.
- AVX-512 Foundation (F) – expands most 32-bit and 64-bit based AVX instructions with the EVEX coding scheme to support 512-bit registers, operation masks, parameter broadcasting, and embedded rounding and exception control, implemented by Knights Landing and Skylake Xeon
- AVX-512 Conflict Detection Instructions (CD) – efficient conflict detection to allow more loops to be vectorized, implemented by Knights Landing and Skylake X
- AVX-512 Exponential and Reciprocal Instructions (ER) – exponential and reciprocal operations designed to help implement transcendental operations, implemented by Knights Landing
- AVX-512 Prefetch Instructions (PF) – new prefetch capabilities, implemented by Knights Landing
- AVX-512 Vector Neural Network Instructions Word variable precision (4VNNIW) – vector instructions for deep learning, enhanced word, variable precision.
- AVX-512 Fused Multiply Accumulation Packed Single precision (4FMAPS) – vector instructions for deep learning, floating point, single precision.
- AVX-512 Vector Length Extensions (VL) – extends most AVX-512 operations to also operate on XMM and YMM registers
- AVX-512 Doubleword and Quadword Instructions (DQ) – adds new 32-bit and 64-bit AVX-512 instructions
- AVX-512 Byte and Word Instructions (BW) – extends AVX-512 to cover 8-bit and 16-bit integer operations
- AVX-512 Integer Fused Multiply Add (IFMA) – fused multiply add of integers using 52-bit precision.
- AVX-512 Vector Bit Manipulation Instructions (VBMI) – adds vector byte permutation instructions which were not present in AVX-512BW.
- AVX-512 Vector Neural Network Instructions (VNNI) – vector instructions for deep learning.
VBMI2, BITALG: introduced with Ice Lake.
- AVX-512 Vector Bit Manipulation Instructions 2 (VBMI2) – byte/word load, store and concatenation with shift.
- AVX-512 Bit Algorithms (BITALG) – byte/word bit manipulation instructions expanding VPOPCNTDQ.
- AVX-512 Vector Pair Intersection to a Pair of Mask Registers (VP2INTERSECT).
- GFNI, VPCLMULQDQ and VAES are not AVX-512 features per se. Together with AVX-512, they enable EVEX-encoded versions of the GFNI, PCLMULQDQ and AES instructions.
- AVX-512 Bit Manipulation Instructions – includes Bit Matrix Multiply and Bit Reversal operations.
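Because each subset has its own CPUID feature bit, software has to test for every subset it intends to use. The following is a minimal sketch in Python of decoding some of those bits from a raw register value. It assumes the value was already read from CPUID leaf 7, sub-leaf 0, register EBX; the helper names and the sample value are hypothetical, and a real detection routine would also have to confirm that the operating system saves the ZMM and opmask state (via XGETBV), which is omitted here.

```python
# Bit positions in CPUID leaf 7, sub-leaf 0, register EBX.
# Only the subsets reported in EBX are covered; others live in ECX and EDX.
AVX512_EBX_BITS = {
    "F": 16,
    "DQ": 17,
    "IFMA": 21,
    "PF": 26,
    "ER": 27,
    "CD": 28,
    "BW": 30,
    "VL": 31,
}


def decode_avx512_ebx(ebx):
    """Return the set of AVX-512 subsets flagged in a raw EBX value."""
    return {name for name, bit in AVX512_EBX_BITS.items() if (ebx >> bit) & 1}


def usable_subsets(ebx):
    """Hypothetical policy helper: ignore every subset unless F is present,
    since AVX-512F is the only subset required by all implementations."""
    found = decode_avx512_ebx(ebx)
    return found if "F" in found else set()


# Made-up EBX value with F, DQ, CD, BW and VL set (illustration only).
sample_ebx = (1 << 16) | (1 << 17) | (1 << 28) | (1 << 30) | (1 << 31)
print(sorted(usable_subsets(sample_ebx)))
```

Keeping the bit table separate from the policy function makes it easy to extend the sketch to the subsets reported in the other output registers.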
Encoding and features
Compared to VEX, EVEX adds the following benefits:
- Expanded register encoding allowing 32 512-bit registers.
- Adds 8 new opmask registers for masking most AVX-512 instructions.
- Adds a new scalar memory mode that automatically performs a broadcast.
- Adds room for explicit rounding control in each instruction.
- Adds a new compressed displacement memory addressing mode.
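The compressed displacement mode can be illustrated with a little arithmetic: the one-byte displacement stored in the instruction is multiplied by a factor N that depends on the memory operand (for example, 64 for a full 512-bit load, or 4 when a single 32-bit element is broadcast). The sketch below, with a hypothetical function name, shows how an assembler might choose between the one-byte and four-byte forms. It deliberately ignores the no-displacement form and is not taken from any real assembler.

```python
def encode_displacement(disp, n):
    """Pick a displacement encoding for an EVEX memory operand.

    disp: requested byte offset from the base address.
    n:    scaling factor implied by the instruction and operand size.
    Returns ("disp8", stored_byte) when the offset is a multiple of n and
    the scaled value fits in a signed byte, otherwise ("disp32", disp).
    """
    if disp % n == 0 and -128 <= disp // n <= 127:
        return ("disp8", disp // n)
    return ("disp32", disp)


# Offsets for a full 512-bit (64-byte) memory operand, so n = 64.
for offset in (256, 100, 8128, 8192, -8192):
    print(offset, encode_displacement(offset, 64))
```

With N = 64 the one-byte form reaches offsets from -8192 to 8128 in steps of 64, which covers the common case of stepping through an array of whole vectors.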
SIMD modes
The AVX-512 instructions are designed to mix with 128/256-bit AVX/AVX2 instructions without a performance penalty. In addition, the AVX-512VL extension allows the use of AVX-512 instructions on the 128/256-bit registers XMM/YMM, so most SSE and AVX/AVX2 instructions have new AVX-512 versions encoded with the EVEX prefix, which allow access to new features such as opmasks and the additional registers. Unlike AVX-256, the new instructions do not have new mnemonics but share a namespace with AVX, making the distinction between VEX and EVEX encoded versions of an instruction ambiguous in the source code. Since AVX-512F only works on 32- and 64-bit values, SSE and AVX/AVX2 instructions that operate on bytes or words are available only with the AVX-512BW extension.
| Name | Extension sets | Registers | Types |
| Legacy SSE | SSE–SSE4.2 | xmm0–xmm15 | single floats; from SSE2: bytes, words, doublewords, quadwords and double floats |
| AVX-128 (VEX) | AVX, AVX2 | xmm0–xmm15 | bytes, words, doublewords, quadwords, single floats and double floats |
| AVX-256 (VEX) | AVX, AVX2 | ymm0–ymm15 | single floats and double floats; from AVX2: bytes, words, doublewords, quadwords |
| AVX-128 (EVEX) | AVX-512VL | xmm0–xmm31 | doublewords, quadwords, single floats and double floats; with AVX512BW: bytes and words; with AVX512-FP16: half floats |
| AVX-256 (EVEX), AVX10/256 | AVX-512VL | ymm0–ymm31 | doublewords, quadwords, single floats and double floats; with AVX512BW: bytes and words; with AVX512-FP16: half floats |
| AVX-512 (EVEX), AVX10/512 | AVX-512F | zmm0–zmm31 | doublewords, quadwords, single floats and double floats; with AVX512BW: bytes and words; with AVX512-FP16: half floats |
Extended registers
The width of the SIMD register file is increased from 256 bits to 512 bits, and expanded from 16 to a total of 32 registers ZMM0–ZMM31. These registers can be addressed as 256-bit YMM registers from AVX extensions and 128-bit XMM registers from Streaming SIMD Extensions, and legacy AVX and SSE instructions can be extended to operate on the 16 additional registers XMM16–XMM31 and YMM16–YMM31 when using EVEX encoded form.
Opmask registers
AVX-512 vector instructions may indicate an opmask register to control which values are written to the destination. The instruction encoding supports the values 0–7 for this field; however, only the opmask registers k1–k7, corresponding to the values 1–7, can be used as the mask, whereas the value 0 is reserved for indicating that no opmask register is used; that is, a hardcoded constant is used to indicate unmasked operations. The special opmask register k0 is still a functioning, valid register: it can be used in opmask register manipulation instructions or as the destination opmask register. A flag controls the opmask behavior, which can either be "zero", which zeros everything not selected by the mask, or "merge", which leaves everything not selected untouched. The merge behavior is identical to that of the blend instructions.
The opmask registers are normally 16 bits wide, but can be up to 64 bits with the AVX-512BW extension. How many of the bits are actually used, though, depends on the vector type of the instruction being masked. For 32-bit single floats or doublewords, 16 bits are used to mask the 16 elements in a 512-bit register. For double floats and quadwords, at most 8 mask bits are used.
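The difference between the "zero" and "merge" behaviors, and the relation between element width and the number of mask bits used, can be modeled in a few lines of Python. This is a scalar sketch of the semantics, not real vector code: the function names are hypothetical, each list element stands for one vector lane, and integer wraparound is ignored.

```python
def masked_add(dst, a, b, mask, zeroing):
    """Emulate a masked lane-wise add.

    Bit i of mask selects lane i. Unselected lanes are set to 0 when
    zeroing is True, or keep the old value from dst when it is False.
    """
    out = []
    for i, (x, y) in enumerate(zip(a, b)):
        if (mask >> i) & 1:
            out.append(x + y)
        elif zeroing:
            out.append(0)
        else:
            out.append(dst[i])
    return out


def mask_bits_used(vector_bits, element_bits):
    """Number of opmask bits an instruction consumes: one per element."""
    return vector_bits // element_bits


# Sixteen 32-bit lanes of a 512-bit register, with lanes 4..7 selected.
old = [-1] * 16
a = list(range(16))
b = [100] * 16
print(masked_add(old, a, b, 0x00F0, zeroing=False))  # merge masking
print(masked_add(old, a, b, 0x00F0, zeroing=True))   # zero masking
print(mask_bits_used(512, 32), mask_bits_used(512, 64), mask_bits_used(512, 8))
```

In the merge case the destination acts as a third input, which is why merge masking behaves like a blend of the new result with the old register contents.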
The opmask register is the reason why several bitwise instructions which naturally have no element widths had them added in AVX-512. For instance, bitwise AND, OR or 128-bit shuffle now exist in both double-word and quad-word variants with the only difference being in the final masking.
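To see why a bitwise AND needs an element width once masking exists, consider the following Python model, which treats a 512-bit register as a single integer. The function name and parameters are hypothetical; the point is only that the same mask value selects different bits depending on whether lanes are 32 or 64 bits wide, while the unmasked results are identical.

```python
def masked_and(dst, a, b, mask, element_bits, vector_bits=512):
    """Merge-masked bitwise AND at a given element granularity.

    Registers are modeled as Python integers of vector_bits bits.
    Bit i of mask selects lane i; unselected lanes are copied from dst.
    """
    lane_mask = (1 << element_bits) - 1
    result = 0
    for i in range(vector_bits // element_bits):
        shift = i * element_bits
        if (mask >> i) & 1:
            lane = (a >> shift) & (b >> shift) & lane_mask
        else:
            lane = (dst >> shift) & lane_mask
        result |= lane << shift
    return result


ALL_ONES = (1 << 512) - 1

# The same mask (only bit 0 set) covers 32 bits in the doubleword variant
# and 64 bits in the quadword variant.
as_dwords = masked_and(0, ALL_ONES, ALL_ONES, 0b1, element_bits=32)
as_qwords = masked_and(0, ALL_ONES, ALL_ONES, 0b1, element_bits=64)
print(hex(as_dwords), hex(as_qwords))

# With every lane selected, the two variants agree bit for bit.
pattern = int.from_bytes(bytes(range(64)), "little")
full_d = masked_and(0, pattern, ALL_ONES, 0xFFFF, element_bits=32)
full_q = masked_and(0, pattern, ALL_ONES, 0xFF, element_bits=64)
print(full_d == full_q)
```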
New opmask instructions
The opmask registers have a new mini extension of instructions operating directly on them. Unlike the rest of the AVX-512 instructions, these instructions are all VEX encoded. The initial opmask instructions are all 16-bit versions. With AVX-512DQ 8-bit versions were added to better match the needs of masking 8 64-bit values, and with AVX-512BW 32-bit and 64-bit versions were added so they can mask up to 64 8-bit values. The instructions KORTEST and KTEST can be used to set the x86 flags based on mask registers, so that they may be used together with non-SIMD x86 branch and conditional instructions.
| Instruction | Extension set | Description |
| KAND | F | Bitwise logical AND Masks |
| KANDN | F | Bitwise logical AND NOT Masks |
| KMOV | F | Move from and to Mask Registers or General Purpose Registers |
| KUNPCK | F | Unpack for Mask Registers |
| KNOT | F | NOT Mask Register |
| KOR | F | Bitwise logical OR Masks |
| KORTEST | F | OR Masks And Set Flags |
| KSHIFTL | F | Shift Left Mask Registers |
| KSHIFTR | F | Shift Right Mask Registers |
| KXNOR | F | Bitwise logical XNOR Masks |
| KXOR | F | Bitwise logical XOR Masks |
| KADD | BW/DQ | Add Two Masks |
| KTEST | BW/DQ | Bitwise comparison and set flags |
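As an illustration of how KORTEST connects mask registers to ordinary branches, the sketch below models its flag results in Python: the zero flag is set when the OR of the two masks is all zeros, and the carry flag is set when it is all ones. The function name and the dictionary return type are hypothetical, and the remaining flags are left out.

```python
def kortest(k_a, k_b, width=16):
    """Model the ZF and CF results of KORTEST for masks of the given width.

    width is 16 for the original form, 8 for the AVX-512DQ form, and
    32 or 64 for the AVX-512BW forms.
    """
    full = (1 << width) - 1
    combined = (k_a | k_b) & full
    return {"ZF": combined == 0, "CF": combined == full}


# Testing a mask against itself answers two common loop questions:
# "did no lane match?" (ZF set) and "did every lane match?" (CF set).
no_lanes = kortest(0x0000, 0x0000)
all_lanes = kortest(0xFFFF, 0xFFFF)
some_lanes = kortest(0x0010, 0x0010)
print(no_lanes, all_lanes, some_lanes)
```

A vectorized search loop could, for example, compare a block of elements into a mask register, apply KORTEST to that mask, and use a conditional jump on the zero flag to skip blocks with no match.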