List of x86 SIMD instructions
The x86 instruction set has been extended several times with SIMD instruction set extensions. These extensions, starting with the MMX instruction set extension introduced with the Pentium MMX in 1997, typically define sets of wide registers and instructions that subdivide these registers into fixed-size lanes and perform a computation for each lane in parallel.
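As an illustration of the lane model described above, the following is a minimal sketch that treats a Python integer as a 64-bit vector register and adds two such registers lane by lane. The function name `packed_add` and its parameters are hypothetical and chosen for this example only. The wraparound behaviour modelled here is that of a plain (non-saturating) packed integer add.

```python
def packed_add(a: int, b: int, lane_bits: int, reg_bits: int = 64) -> int:
    """Add two registers lane by lane, wrapping around within each lane."""
    lane_mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, reg_bits, lane_bits):
        lane_a = (a >> shift) & lane_mask
        lane_b = (b >> shift) & lane_mask
        # Carries never cross a lane boundary: each lane wraps independently.
        result |= ((lane_a + lane_b) & lane_mask) << shift
    return result


# Four 16-bit lanes in a 64-bit register; the lane holding 0xFFFF wraps to 0.
a = 0x0001_FFFF_0003_0004
b = 0x0001_0001_0001_0001
print(hex(packed_add(a, b, lane_bits=16)))
```

With `lane_bits=64` the same function degenerates to an ordinary 64-bit addition, where the carry out of the 0xFFFF field propagates into the field above it.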
Summary of SIMD extensions
The main SIMD instruction set extensions that have been introduced for x86 are:
| SIMD instruction set extension | Year | Description | Added in |
| MMX | 1997 | A set of 57 integer SIMD instructions acting on 64-bit vectors, mostly providing 8/16/32-bit lane-width operations. Repurposed the old x87 FPU register-file as a bank of eight 64-bit vector registers, referred to as MM0..MM7 when used for MMX instructions. | AMD K6, Intel Pentium II, Rise mP6, IDT WinChip C6, Transmeta Crusoe, DM&P Vortex86MX |
| SSE | 1999 | "Katmai New Instructions" - introduced a set of 70 new instructions. Most but not all of these instructions provide scalar and vector operations on 32-bit floating-point values in 128-bit SIMD vector registers. SSE introduced a new set of eight vector registers XMM0..XMM7, each 128 bits, and a status/control register MXCSR. This set of eight vector registers would later be extended to 16 registers with the introduction of x86-64. | Intel Pentium III, AMD Athlon XP, VIA C3 "Nehemiah", Transmeta Efficeon |
| SSE2 | 2000 | Extended SSE with 144 new instructions - mainly additional instructions to work on scalars and vectors of 64-bit floating-point values, as well as 128-bit-vector forms of most of the MMX integer instructions. | Intel Pentium 4, Intel Pentium M, AMD Athlon 64, Transmeta Efficeon, VIA C7 |
| SSE3 | 2004 | "Prescott New Instructions": added a set of 13 new instructions, mostly horizontal add/subtract operations. | Intel Pentium 4 "Prescott", Transmeta Efficeon 8800, AMD Athlon 64 "Venice", VIA C7, Intel Core "Yonah" |
| SSSE3 | 2006 | Added a set of 32 new instructions to extend MMX and SSE, including a byte-shuffle instruction. | Intel Core 2 "Conroe"/"Merom", VIA Nano 2000, Intel Atom "Bonnell", AMD "Bobcat", AMD FX "Bulldozer" |
| SSE4a | 2007 | AMD-only extension that added a set of 4 instructions, including bitfield insert/extract and scalar non-temporal store instructions. | AMD K10 |
| SSE4.1 | 2007 | Added a set of 47 instructions, including variants of integer min/max, widening integer conversions, vector lane insert/extract, and dot-product instructions. | Intel Core 2 "Penryn", VIA Nano 3000, AMD FX "Bulldozer", AMD "Jaguar", Intel Atom "Silvermont", Zhaoxin ZX-A |
| SSE4.2 | 2008 | Added a set of 7 instructions, mostly pertaining to string processing. | Intel Core i7 "Nehalem", AMD FX "Bulldozer", AMD "Jaguar", Intel Atom "Silvermont", VIA Nano QuadCore C4000, Zhaoxin ZX-C "ZhangJiang" |
| AVX | 2011 | Extended the XMM0..XMM15 vector registers to 256-bit registers, referred to as YMM0..YMM15 when used as full 256-bit registers. Added three-operand variants of most of the SSE1-4 vector instructions, as well as 256-bit vector variants of most of the SSE1-4 vector instructions acting on 32/64-bit floating-point values. These new instruction variants are all encoded with the new VEX prefix. | Intel Core i7 "Sandy Bridge", AMD FX "Bulldozer", AMD "Jaguar", VIA Nano QuadCore C4000, Zhaoxin ZX-C "ZhangJiang", Intel Atom "Gracemont" |
| FMA3 | 2013 | Added three-operand floating-point fused-multiply add operations, scalar and vector variants. | Intel Core i7 "Haswell", AMD FX "Piledriver", Intel Atom "Gracemont", Zhaoxin KH-40000 "YongFeng" |
| AVX2 | 2013 | Added 256-bit vector variants of most of the MMX/SSE1-4 vector integer instructions. Also adds vector gather instructions. | Intel Core i7 "Haswell", AMD FX "Excavator", VIA Nano QuadCore C4000, Intel Atom "Gracemont", Zhaoxin KH-40000 "YongFeng" |
| AVX-512 | 2016 | Extended the YMM0..YMM15 vector registers to a set of 32 registers, each 512 bits wide, referred to as ZMM0..ZMM31 when used as 512-bit registers. Also added eight opmask registers K0..K7. Added 512-bit versions of most of the MMX/SSE/AVX vector instructions, as well as a substantial number of additional instructions. These are mostly encoded with the new EVEX prefix. Added the ability to perform per-vector-lane masking of the operation of most of its vector instructions, by using the opmask registers. Also added embedded rounding controls for floating-point instructions and a scalar-to-vector broadcast function for most instructions that can accept memory operands. | |
| AMX | 2023 | Added a set of eight new tile registers, referred to as TMM0..TMM7. Each of these tile registers has a size of 8192 bits. Also added a 64-byte tile configuration register TILECFG, and instructions to perform matrix multiplication on the tile registers with various data formats. | |
| AVX10.1 | 2024 | Reformulation of AVX-512 that includes most of the optional AVX-512 subsets as baseline functionality, and switches feature enumeration from the flag-based scheme of AVX-512 to a version-based scheme. No new instructions are added. | Intel Xeon 6 "Granite Rapids" |
| AVX10.2 | | Adds instructions to convert to/from FP8 datatypes, perform arithmetic on BF16 numbers, saturating conversions from floating-point to integer, IEEE754-compliant min/max, and a few other instructions. | |
MMX instructions and extended variants thereof
These instructions are, unless otherwise noted, available in the following forms:
- MMX: 64-bit vectors, operating on mm0..mm7 registers
- SSE2: 128-bit vectors, operating on xmm0..xmm15 registers
- AVX: 128-bit vectors, operating on xmm0..xmm15 registers, with a new three-operand encoding enabled by the new VEX prefix.
- AVX2: 256-bit vectors, operating on ymm0..ymm15 registers
- AVX-512: 512-bit vectors, operating on zmm0..zmm31 registers. AVX-512 also introduces opmasks, allowing the operation of most instructions to be masked on a per-lane basis by an opmask register. AVX-512 also adds broadcast functionality for many of its instructions - this is used with memory source arguments to replicate a single value to all lanes of a vector calculation. The tables below provide indications of whether opmasks and broadcasts are supported for each instruction, and if so, what lane-widths they are using.
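The opmask and broadcast behaviour described in the last item can be sketched as follows. This is an illustrative model only: vectors are Python lists, the opmask is an integer with one bit per lane, and the names `masked_add` and `broadcast` are hypothetical. The `zeroing` flag models the zeroing form of masking that AVX-512 offers alongside the merging form; the text above does not describe that distinction, so treat it as an added assumption.

```python
def broadcast(value, lanes):
    """Replicate a single value (e.g. loaded from memory) to every lane."""
    return [value] * lanes


def masked_add(dst, a, b, mask, zeroing=False):
    """Lane-wise a + b under an opmask.

    Lanes whose mask bit is 1 receive the sum. Lanes whose mask bit is 0
    keep the old destination value (merging) or become 0 (zeroing).
    """
    out = []
    for i, (d, x, y) in enumerate(zip(dst, a, b)):
        if (mask >> i) & 1:
            out.append(x + y)
        else:
            out.append(0 if zeroing else d)
    return out


old = [9, 9, 9, 9]
vec = [1, 2, 3, 4]
# Only lanes 0 and 2 are enabled by the mask 0b0101.
print(masked_add(old, vec, broadcast(10, 4), mask=0b0101))
```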
Where a mnemonic is written with an optional leading V, the instruction exists in forms with and without it: the form with the leading V is used for the VEX/EVEX-prefixed instruction variants introduced by AVX/AVX2/AVX-512, while the form without the leading V is used for legacy MMX/SSE encodings without a VEX/EVEX prefix.
SSE instructions and extended variants thereof
Regularly-encoded floating-point SSE/SSE2 instructions, and AVX/AVX-512 extended variants thereof
For the instructions in the table below, the following considerations apply unless otherwise noted:
- Packed instructions are available at all vector lengths
- FP32 variants of instructions are introduced as part of SSE. FP64 variants of instructions are introduced as part of SSE2.
- The AVX-512 variants of the FP32 and FP64 instructions are introduced as part of the AVX512F subset.
- For AVX-512 variants of the instructions, opmasks and broadcasts are available with a width of 32 bits for FP32 operations and 64 bits for FP64 operations.
Integer SSE2/4 instructions with 66h prefix, and AVX/AVX-512 extended variants thereof
These instructions do not have any MMX forms, and do not support any encodings without a prefix.
Most of these instructions have extended variants available in VEX-encoded and EVEX-encoded forms:
- The VEX-encoded forms are available under AVX/AVX2. Under AVX, they are available only with a vector length of 128 bits - under AVX2, they are also made available with a vector length of 256 bits.
- The EVEX-encoded forms are available under AVX-512 - the specific AVX-512 subset needed for each instruction is listed along with the instruction.
Other SSE/2/3/4 SIMD instructions, and AVX/AVX-512 extended variants thereof
SSE SIMD instructions that do not fit into any of the preceding groups. Many of these instructions have AVX/AVX-512 extended forms - unless otherwise indicated these extended forms support 128/256-bit operation under AVX and 128/256/512-bit operation under AVX-512.
AVX/AVX2 instructions, and AVX-512 extended variants thereof
This covers instructions/opcodes that are new to AVX and AVX2.
AVX and AVX2 also include extended VEX-encoded forms of a large number of MMX/SSE instructions - please see tables above.
Some of the AVX/AVX2 instructions also exist in extended EVEX-encoded forms under AVX-512 as well.
Other VEX-encoded SIMD instructions
SIMD instruction set extensions that use the VEX prefix and are not considered part of baseline AVX/AVX2/AVX-512, FMA3/4 or AMX.
Integer, opmask and cryptographic instructions that use the VEX prefix are not included.
FMA3 and FMA4 instructions
Floating-point fused multiply-add instructions are introduced in x86 as two instruction set extensions, "FMA3" and "FMA4", both of which build on top of AVX to provide a set of scalar/vector instructions using the xmm/ymm/zmm vector registers. FMA3 defines a set of 3-operand fused-multiply-add instructions that take three input operands and write their result back to the first of them. FMA4 defines a set of 4-operand fused-multiply-add instructions that take four operands: a destination operand and three source operands.
FMA3 is supported on Intel CPUs starting with Haswell, on AMD CPUs starting with Piledriver, and on Zhaoxin CPUs starting with YongFeng. FMA4 was only supported on AMD Family 15h (Bulldozer) CPUs and has been abandoned from AMD Zen onwards. The FMA3/FMA4 extensions are not considered to be an intrinsic part of AVX or AVX2, although all Intel and AMD processors that support AVX2 also support FMA3. FMA3 instructions are, however, AVX-512 foundation instructions.
The FMA3 and FMA4 instruction sets both define a set of 10 fused-multiply-add operations, all available in FP32 and FP64 variants. For each of these variants, FMA3 defines three operand orderings while FMA4 defines two.
FMA3 encoding
FMA3 instructions are encoded with the VEX or EVEX prefixes, in the form VEX.66.0F38 xy /r or EVEX.66.0F38 xy /r. The VEX.W/EVEX.W bit selects floating-point format. The opcode byte xy consists of two nibbles, where the top nibble x selects operand ordering and the bottom nibble y selects which one of the 10 fused-multiply-add operations to perform. At the assembly language level, the operand ordering is specified in the mnemonic of the instruction:
- vfmadd132sd xmm1,xmm2,xmm3 will perform xmm1 ← (xmm1*xmm3)+xmm2
- vfmadd213sd xmm1,xmm2,xmm3 will perform xmm1 ← (xmm2*xmm1)+xmm3
- vfmadd231sd xmm1,xmm2,xmm3 will perform xmm1 ← (xmm2*xmm3)+xmm1
For all FMA3 variants, the first two arguments must be xmm/ymm/zmm vector register arguments, while the last argument may be either a vector register or memory argument. Under AVX-512 and AVX10, the EVEX-encoded variants support EVEX-prefix-encoded broadcast, opmasks and rounding-controls.
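The meaning of the three-digit ordering in an FMA3 mnemonic can be sketched as below: the digits name which operands are multiplied (first two digits) and which is added (third digit), and the result always goes to operand 1. The function `fma3` is a hypothetical helper for this example. It models operand selection only; Python's `x * y + z` rounds twice, whereas a real fused multiply-add rounds once.

```python
def fma3(order: str, op1: float, op2: float, op3: float) -> float:
    """Return the value written to operand 1 for ordering "132", "213" or "231"."""
    if order not in ("132", "213", "231"):
        raise ValueError(f"unknown FMA3 operand ordering: {order}")
    operands = {"1": op1, "2": op2, "3": op3}
    # First two digits select the multiplicands, the third selects the addend.
    x, y, z = (operands[digit] for digit in order)
    return x * y + z


# With op1=2, op2=3, op3=5:
print(fma3("132", 2.0, 3.0, 5.0))  # (op1*op3)+op2
print(fma3("213", 2.0, 3.0, 5.0))  # (op2*op1)+op3
print(fma3("231", 2.0, 3.0, 5.0))  # (op2*op3)+op1
```

Having all three orderings lets a compiler choose which of the three inputs is overwritten by the result, since the destination is always the first operand.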
The AVX512-FP16 extension, introduced in Sapphire Rapids, adds FP16 variants of the FMA3 instructions; these all take the form EVEX.66.MAP6.W0 xy /r, with the opcode byte working in the same way as for the FP32/FP64 variants. The AVX10.2 extension, published in 2024, similarly adds BF16 variants of the packed FMA3 instructions; these all take the form EVEX.NP.MAP6.W0 xy /r, with the opcode byte again working similarly to the FP32/FP64 variants.
FMA4 encoding
FMA4 instructions are encoded with the VEX prefix, in the form VEX.66.0F3A xx /r ib. The opcode byte xx uses its bottom bit to select floating-point format and the remaining bits to select one of the 10 fused-multiply-add operations to perform.
For FMA4, operand ordering is controlled by the VEX.W bit. If VEX.W=0, then the third operand is the r/m operand specified by the instruction's ModR/M byte and the fourth operand is a register operand, specified by bits 7:4 of the ib part of the instruction. If VEX.W=1, then these two operands are swapped. For example:
- vfmaddsd xmm1,xmm2,[mem],xmm3 will perform xmm1 ← (xmm2*[mem])+xmm3 and require a W=0 encoding.
- vfmaddsd xmm1,xmm2,xmm3,[mem] will perform xmm1 ← (xmm2*xmm3)+[mem] and require a W=1 encoding.
- vfmaddsd xmm1,xmm2,xmm3,xmm4 will perform xmm1 ← (xmm2*xmm3)+xmm4 and can be encoded with either W=0 or W=1.
Opcode table
The 10 fused-multiply-add operations and the 122 instruction variants they give rise to are given by the following table, with FMA4 instructions marked with * and FMA3 instructions unmarked:
| Basic operation | Opcode byte | FP32 instructions | FP64 instructions | FP16 instructions | BF16 instructions |
| Packed alternating multiply-add/subtract | 96 | VFMADDSUB132PS | VFMADDSUB132PD | VFMADDSUB132PH | |
| Packed alternating multiply-add/subtract | A6 | VFMADDSUB213PS | VFMADDSUB213PD | VFMADDSUB213PH | |
| Packed alternating multiply-add/subtract | B6 | VFMADDSUB231PS | VFMADDSUB231PD | VFMADDSUB231PH | |
| Packed alternating multiply-add/subtract | 5C (FP32), 5D (FP64) | VFMADDSUBPS* | VFMADDSUBPD* | | |
| Packed alternating multiply-subtract/add | 97 | VFMSUBADD132PS | VFMSUBADD132PD | VFMSUBADD132PH | |
| Packed alternating multiply-subtract/add | A7 | VFMSUBADD213PS | VFMSUBADD213PD | VFMSUBADD213PH | |
| Packed alternating multiply-subtract/add | B7 | VFMSUBADD231PS | VFMSUBADD231PD | VFMSUBADD231PH | |
| Packed alternating multiply-subtract/add | 5E (FP32), 5F (FP64) | VFMSUBADDPS* | VFMSUBADDPD* | | |
| Packed multiply-add, (A*B)+C | 98 | VFMADD132PS | VFMADD132PD | VFMADD132PH | VFMADD132BF16 |
| Packed multiply-add, (A*B)+C | A8 | VFMADD213PS | VFMADD213PD | VFMADD213PH | VFMADD213BF16 |
| Packed multiply-add, (A*B)+C | B8 | VFMADD231PS | VFMADD231PD | VFMADD231PH | VFMADD231BF16 |
| Packed multiply-add, (A*B)+C | 68 (FP32), 69 (FP64) | VFMADDPS* | VFMADDPD* | | |
| Scalar multiply-add, (A*B)+C | 99 | VFMADD132SS | VFMADD132SD | VFMADD132SH | |
| Scalar multiply-add, (A*B)+C | A9 | VFMADD213SS | VFMADD213SD | VFMADD213SH | |
| Scalar multiply-add, (A*B)+C | B9 | VFMADD231SS | VFMADD231SD | VFMADD231SH | |
| Scalar multiply-add, (A*B)+C | 6A (FP32), 6B (FP64) | VFMADDSS* | VFMADDSD* | | |
| Packed multiply-subtract, (A*B)-C | 9A | VFMSUB132PS | VFMSUB132PD | VFMSUB132PH | VFMSUB132BF16 |
| Packed multiply-subtract, (A*B)-C | AA | VFMSUB213PS | VFMSUB213PD | VFMSUB213PH | VFMSUB213BF16 |
| Packed multiply-subtract, (A*B)-C | BA | VFMSUB231PS | VFMSUB231PD | VFMSUB231PH | VFMSUB231BF16 |
| Packed multiply-subtract, (A*B)-C | 6C (FP32), 6D (FP64) | VFMSUBPS* | VFMSUBPD* | | |
| Scalar multiply-subtract, (A*B)-C | 9B | VFMSUB132SS | VFMSUB132SD | VFMSUB132SH | |
| Scalar multiply-subtract, (A*B)-C | AB | VFMSUB213SS | VFMSUB213SD | VFMSUB213SH | |
| Scalar multiply-subtract, (A*B)-C | BB | VFMSUB231SS | VFMSUB231SD | VFMSUB231SH | |
| Scalar multiply-subtract, (A*B)-C | 6E (FP32), 6F (FP64) | VFMSUBSS* | VFMSUBSD* | | |
| Packed negative-multiply-add, -(A*B)+C | 9C | VFNMADD132PS | VFNMADD132PD | VFNMADD132PH | VFNMADD132BF16 |
| Packed negative-multiply-add, -(A*B)+C | AC | VFNMADD213PS | VFNMADD213PD | VFNMADD213PH | VFNMADD213BF16 |
| Packed negative-multiply-add, -(A*B)+C | BC | VFNMADD231PS | VFNMADD231PD | VFNMADD231PH | VFNMADD231BF16 |
| Packed negative-multiply-add, -(A*B)+C | 78 (FP32), 79 (FP64) | VFNMADDPS* | VFNMADDPD* | | |
| Scalar negative-multiply-add, -(A*B)+C | 9D | VFNMADD132SS | VFNMADD132SD | VFNMADD132SH | |
| Scalar negative-multiply-add, -(A*B)+C | AD | VFNMADD213SS | VFNMADD213SD | VFNMADD213SH | |
| Scalar negative-multiply-add, -(A*B)+C | BD | VFNMADD231SS | VFNMADD231SD | VFNMADD231SH | |
| Scalar negative-multiply-add, -(A*B)+C | 7A (FP32), 7B (FP64) | VFNMADDSS* | VFNMADDSD* | | |
| Packed negative-multiply-subtract, -(A*B)-C | 9E | VFNMSUB132PS | VFNMSUB132PD | VFNMSUB132PH | VFNMSUB132BF16 |
| Packed negative-multiply-subtract, -(A*B)-C | AE | VFNMSUB213PS | VFNMSUB213PD | VFNMSUB213PH | VFNMSUB213BF16 |
| Packed negative-multiply-subtract, -(A*B)-C | BE | VFNMSUB231PS | VFNMSUB231PD | VFNMSUB231PH | VFNMSUB231BF16 |
| Packed negative-multiply-subtract, -(A*B)-C | 7C (FP32), 7D (FP64) | VFNMSUBPS* | VFNMSUBPD* | | |
| Scalar negative-multiply-subtract, -(A*B)-C | 9F | VFNMSUB132SS | VFNMSUB132SD | VFNMSUB132SH | |
| Scalar negative-multiply-subtract, -(A*B)-C | AF | VFNMSUB213SS | VFNMSUB213SD | VFNMSUB213SH | |
| Scalar negative-multiply-subtract, -(A*B)-C | BF | VFNMSUB231SS | VFNMSUB231SD | VFNMSUB231SH | |
| Scalar negative-multiply-subtract, -(A*B)-C | 7E (FP32), 7F (FP64) | VFNMSUBSS* | VFNMSUBSD* | | |
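The regular structure of the FMA3 opcode bytes in the table can be sketched as a small decoder. This is an illustrative model, not an encoder or disassembler: the names `ORDERINGS`, `OPERATIONS` and `fma3_mnemonic` are hypothetical, only the FP32/FP64 forms are covered, and it assumes that W=0 selects FP32 and W=1 selects FP64.

```python
# Top nibble of the opcode byte selects the operand ordering.
ORDERINGS = {0x9: "132", 0xA: "213", 0xB: "231"}

# Bottom nibble selects the operation and whether it is packed (P) or scalar (S).
OPERATIONS = {
    0x6: ("VFMADDSUB", "P"), 0x7: ("VFMSUBADD", "P"),
    0x8: ("VFMADD", "P"),    0x9: ("VFMADD", "S"),
    0xA: ("VFMSUB", "P"),    0xB: ("VFMSUB", "S"),
    0xC: ("VFNMADD", "P"),   0xD: ("VFNMADD", "S"),
    0xE: ("VFNMSUB", "P"),   0xF: ("VFNMSUB", "S"),
}


def fma3_mnemonic(opcode: int, w: int) -> str:
    """Build the FP32/FP64 FMA3 mnemonic for an opcode byte and W bit."""
    order = ORDERINGS[opcode >> 4]
    base, kind = OPERATIONS[opcode & 0xF]
    fmt = "D" if w else "S"  # assumption: W=0 is FP32, W=1 is FP64
    return f"{base}{order}{kind}{fmt}"


print(fma3_mnemonic(0xA9, 1))
print(fma3_mnemonic(0x9D, 0))
```

Opcode bytes outside the 96..9F, A6..AF and B6..BF ranges raise `KeyError` in this sketch, since they are not FMA3 operations.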
AVX-512
AVX-512, introduced in 2014, adds 512-bit wide vector registers and doubles their count to 32; the new registers are thus named zmm0 through zmm31. It adds eight mask registers, named k0 through k7, which may be used to restrict operations to specific parts of a vector register. Unlike previous instruction set extensions, AVX-512 is implemented in several groups; only the foundation extension is mandatory. Most of the added instructions may also be used with the 256- and 128-bit registers.
AVX-512 foundation, byte/word and doubleword/quadword instructions (F, BW and DQ subsets)
This covers instructions that are new to AVX-512's F, BW and DQ subsets.
These AVX-512 subsets also include extended EVEX-encoded forms of a large number of MMX/SSE/AVX instructions - please see tables above.
Regularly-encoded floating-point instructions
These instructions all follow a given pattern where:
- EVEX.W is used to specify floating-point format
- The bottom opcode bit is used to select between packed and scalar operation
- For a given operation, all the scalar/packed variants belong to the same AVX-512 subset.
- The instructions all support result masking by opmask registers. They also all support broadcast of memory operands for packed variants.
- If AVX512VL is supported, then all vector widths are supported for packed variants.
Opmask instructions
AVX-512 introduces, in addition to 512-bit vectors, a set of eight opmask registers, named k0,k1,k2...k7. These registers are 64 bits wide in implementations that support AVX512BW and 16 bits wide otherwise. They are mainly used to enable/disable operation on a per-lane basis for most of the AVX-512 vector instructions. They are usually set with vector-compare instructions or instructions that otherwise produce a 1-bit per-lane result as a natural part of their operation; however, AVX-512 also defines a set of 55 new instructions to assist manual manipulation of the opmask registers.
These instructions are, for the most part, defined in groups of four, where the instructions in a group are 8-bit, 16-bit, 32-bit and 64-bit variants of the same basic operation. The opmask instructions are all encoded with the VEX prefix.
In general, the 16-bit variants of the instructions are introduced by AVX512F, the 8-bit variants by the AVX512DQ extension, and the 32/64-bit variants by the AVX512BW extension.
Most of the instructions follow a very regular encoding pattern where the four instructions in a group have identical encodings except for the VEX.pp and VEX.W fields:
| Instruction description | Basic opcode | 8-bit instructions encoded with VEX.66.W0 | 16-bit instructions encoded with VEX.NP.W0 | 32-bit instructions encoded with VEX.66.W1 | 64-bit instructions encoded with VEX.NP.W1 |
| Bitwise AND between two opmask-registers | VEX.L1.0F 41 /r | KANDB k,k,k | KANDW k,k,k | KANDD k,k,k | KANDQ k,k,k |
| Bitwise AND-NOT between two opmask-registers | VEX.L1.0F 42 /r | KANDNB k,k,k | KANDNW k,k,k | KANDND k,k,k | KANDNQ k,k,k |
| Bitwise NOT of opmask-register | VEX.L0.0F 44 /r | KNOTB k,k | KNOTW k,k | KNOTD k,k | KNOTQ k,k |
| Bitwise OR of two opmask-registers | VEX.L1.0F 45 /r | KORB k,k,k | KORW k,k,k | KORD k,k,k | KORQ k,k,k |
| Bitwise XNOR of two opmask-registers | VEX.L1.0F 46 /r | KXNORB k,k,k | KXNORW k,k,k | KXNORD k,k,k | KXNORQ k,k,k |
| Bitwise XOR of two opmask-registers | VEX.L1.0F 47 /r | KXORB k,k,k | KXORW k,k,k | KXORD k,k,k | KXORQ k,k,k |
| Integer addition of two opmask-registers | VEX.L1.0F 4A /r | KADDB k,k,k | KADDW k,k,k | KADDD k,k,k | KADDQ k,k,k |
| Load opmask-register from memory or opmask-register | VEX.L0.0F 90 /r | KMOVB k,k/m8 | KMOVW k,k/m16 | KMOVD k,k/m32 | KMOVQ k,k/m64 |
| Store opmask-register to memory | VEX.L0.0F 91 /r | KMOVB m8,k | KMOVW m16,k | KMOVD m32,k | KMOVQ m64,k |
| Load opmask-register from general-purpose register | VEX.L0.0F 92 /r | KMOVB k,r32 | KMOVW k,r32 | | |
| Store opmask-register to general-purpose register with zero-extension | VEX.L0.0F 93 /r | KMOVB r32,k | KMOVW r32,k | | |
| Bitwise OR-and-test. Performs bitwise-OR between two opmask-registers and sets flags accordingly. | VEX.L0.0F 98 /r | KORTESTB k,k | KORTESTW k,k | KORTESTD k,k | KORTESTQ k,k |
| Bitwise test. Performs bitwise-AND and ANDNOT between two opmask-registers and sets flags accordingly. | VEX.L0.0F 99 /r | KTESTB k,k | KTESTW k,k | KTESTD k,k | KTESTQ k,k |
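The way the VEX.pp and VEX.W fields select the width within a group, as laid out in the column headings of the table above, can be sketched as a lookup. The names `WIDTH_BY_FIELDS`, `GENERAL_SUBSET` and `opmask_variant` are hypothetical. The subset mapping encodes only the general rule stated in the text (16-bit in AVX512F, 8-bit in AVX512DQ, 32/64-bit in AVX512BW), which individual instructions may deviate from.

```python
# (VEX.pp, VEX.W) -> (mnemonic suffix, width in bits), per the table headings.
WIDTH_BY_FIELDS = {
    ("66", 0): ("B", 8),
    ("NP", 0): ("W", 16),
    ("66", 1): ("D", 32),
    ("NP", 1): ("Q", 64),
}

# General rule only; not guaranteed for every instruction group.
GENERAL_SUBSET = {8: "AVX512DQ", 16: "AVX512F", 32: "AVX512BW", 64: "AVX512BW"}


def opmask_variant(base: str, pp: str, w: int):
    """Return (mnemonic, width, usual subset) for a regular opmask group."""
    suffix, bits = WIDTH_BY_FIELDS[(pp, w)]
    return base + suffix, bits, GENERAL_SUBSET[bits]


print(opmask_variant("KAND", "66", 0))
print(opmask_variant("KXOR", "NP", 1))
```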
Not all of the opmask instructions fit the pattern above - the remaining ones are:
| Instruction description | Instruction mnemonics | Opcode | Operation/result width | AVX-512 subset |
| Opmask-register shift right immediate with zero-fill | KSHIFTRB k,k,imm8 | VEX.L0.66.0F3A.W0 30 /r /ib | 8 | DQ |
| Opmask-register shift right immediate with zero-fill | KSHIFTRW k,k,imm8 | VEX.L0.66.0F3A.W1 30 /r /ib | 16 | F |
| Opmask-register shift right immediate with zero-fill | KSHIFTRD k,k,imm8 | VEX.L0.66.0F3A.W0 31 /r /ib | 32 | BW |
| Opmask-register shift right immediate with zero-fill | KSHIFTRQ k,k,imm8 | VEX.L0.66.0F3A.W1 31 /r /ib | 64 | BW |
| Opmask-register shift left immediate | KSHIFTLB k,k,imm8 | VEX.L0.66.0F3A.W0 32 /r /ib | 8 | DQ |
| Opmask-register shift left immediate | KSHIFTLW k,k,imm8 | VEX.L0.66.0F3A.W1 32 /r /ib | 16 | F |
| Opmask-register shift left immediate | KSHIFTLD k,k,imm8 | VEX.L0.66.0F3A.W0 33 /r /ib | 32 | BW |
| Opmask-register shift left immediate | KSHIFTLQ k,k,imm8 | VEX.L0.66.0F3A.W1 33 /r /ib | 64 | BW |
| 32/64-bit move between general-purpose registers and opmask registers | KMOVD k,r32 | VEX.L0.F2.0F.W0 92 /r | 32 | BW |
| 32/64-bit move between general-purpose registers and opmask registers | KMOVQ k,r64 | VEX.L0.F2.0F.W1 92 /r | 64 | BW |
| 32/64-bit move between general-purpose registers and opmask registers | KMOVD r32,k | VEX.L0.F2.0F.W0 93 /r | 32 | BW |
| 32/64-bit move between general-purpose registers and opmask registers | KMOVQ r64,k | VEX.L0.F2.0F.W1 93 /r | 64 | BW |
| Concatenate two 8-bit opmasks into a 16-bit opmask | KUNPCKBW k,k,k | VEX.L1.66.0F.W0 4B /r | 16 | F |
| Concatenate two 16-bit opmasks into a 32-bit opmask | KUNPCKWD k,k,k | VEX.L1.0F.W0 4B /r | 32 | BW |
| Concatenate two 32-bit opmasks into a 64-bit opmask | KUNPCKDQ k,k,k | VEX.L1.0F.W1 4B /r | 64 | BW |
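Two of the operations in the table above, concatenation and shift with zero-fill, can be sketched on plain integers as follows. The function names mirror the mnemonics but are hypothetical Python helpers. The choice of which source supplies the high half in `kunpckbw` (the first source operand) is stated here as an assumption about operand order rather than something the table specifies.

```python
def kunpckbw(src1: int, src2: int) -> int:
    """Concatenate two 8-bit opmasks into a 16-bit opmask.

    Assumption: src1 supplies bits 15:8 and src2 supplies bits 7:0.
    """
    return ((src1 & 0xFF) << 8) | (src2 & 0xFF)


def kshiftr(src: int, count: int, width: int) -> int:
    """Shift a width-bit opmask right by count, filling with zeros."""
    value = src & ((1 << width) - 1)  # only the low `width` bits take part
    return value >> count             # counts >= width naturally yield 0


def kshiftl(src: int, count: int, width: int) -> int:
    """Shift a width-bit opmask left by count, discarding bits shifted out."""
    mask = (1 << width) - 1
    return ((src & mask) << count) & mask


print(hex(kunpckbw(0xAB, 0xCD)))
print(bin(kshiftr(0b1011_0000, 4, width=8)))
```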
Compare, test, blend, opmask-convert instructions
Vector-register instructions that use opmasks in ways other than just as a result writeback mask.
AMX
Intel AMX adds eight new tile registers, tmm0-tmm7, each holding a matrix, with a maximum capacity of 16 rows of 64 bytes per tile register. It also adds a TILECFG register to configure the sizes of the actual matrices held in each of the eight tile registers, and a set of instructions to perform matrix multiplications on these registers.
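The tile capacity figures follow from simple arithmetic, sketched below. The constant and function names are hypothetical; the element sizes used in the example (1 byte and 4 bytes) are illustrative choices rather than a list of the data formats AMX supports.

```python
MAX_ROWS = 16           # maximum rows per tile register
MAX_BYTES_PER_ROW = 64  # maximum bytes per row


def tile_capacity_bits(rows: int = MAX_ROWS,
                       bytes_per_row: int = MAX_BYTES_PER_ROW) -> int:
    """Total size in bits of a tile with the given shape."""
    return rows * bytes_per_row * 8


def elements_per_row(element_bytes: int,
                     bytes_per_row: int = MAX_BYTES_PER_ROW) -> int:
    """How many elements of a given size fit in one tile row."""
    return bytes_per_row // element_bytes


print(tile_capacity_bits())  # maximum-size tile: 16 rows of 64 bytes
print(elements_per_row(1))   # 1-byte elements per full row
print(elements_per_row(4))   # 4-byte elements per full row
```

A tile configured with fewer rows or fewer bytes per row than the maximum holds a correspondingly smaller matrix.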