X87


x87 is a floating-point-related subset of the x86 architecture instruction set. It originated as an extension of the 8086 instruction set in the form of optional floating-point coprocessors that work in tandem with corresponding x86 CPUs. These microchips have names ending in "87". The x87 is also known as the NPX (numeric processor extension). Like other extensions to the basic instruction set, x87 instructions are not strictly needed to construct working programs, but they provide hardware and microcode implementations of common numerical tasks, allowing these tasks to be performed much faster than corresponding machine code routines can. The x87 instruction set includes instructions for basic floating-point operations such as addition, subtraction and comparison, as well as for more complex numerical operations, such as the computation of the tangent function and its inverse.
Most x86 processors since the Intel 80486 have had these x87 instructions implemented in the main CPU, but the term is sometimes still used to refer to that part of the instruction set. Before x87 instructions were standard in PCs, compilers or programmers had to use rather slow library calls to perform floating-point operations, a method that is still common in embedded systems.

Description

The x87 registers form an eight-level deep non-strict stack structure ranging from ST(0) to ST(7) with registers that can be directly accessed by either operand, using an offset relative to the top, as well as pushed and popped.
There are instructions to push, calculate, and pop values on top of this stack; unary operations then implicitly address the topmost ST(0), while binary operations implicitly address ST(0) and ST(1). The non-strict stack model also allows binary operations to use ST(0) together with a direct memory operand or with an explicitly specified stack register, ST(x), in a role similar to a traditional accumulator. This can also be reversed on an instruction-by-instruction basis with ST(0) as the unmodified operand and ST(x) as the destination. Furthermore, the contents of ST(0) can be exchanged with another stack register using an instruction called FXCH ST(x).
These properties make the x87 stack usable as seven freely addressable registers plus a dedicated accumulator. This is especially applicable on superscalar x86 processors, where these exchange instructions are optimized down to a zero clock penalty by using one of the integer paths for FXCH ST(x) in parallel with the FPU instruction. Despite being natural and convenient for human assembly language programmers, some compiler writers have found it complicated to construct automatic code generators that schedule x87 code effectively. Such a stack-based interface can potentially minimize the need to save scratch variables in function calls compared with a register-based interface.
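The addressing model described above can be illustrated with a toy simulation. The following Python sketch is not x87 code: it models register values only, with no tag word, status flags or hardware exception behaviour. The class name X87Stack and its method names are hypothetical, chosen to echo the mnemonics FLD, FSTP, FXCH, FADD and FADDP.

```python
class X87Stack:
    """Toy model of the eight-register x87 stack (values only)."""

    DEPTH = 8

    def __init__(self):
        self.regs = [None] * self.DEPTH  # None marks an empty register
        self.top = 0                     # physical index currently named ST(0)

    def _phys(self, i):
        # ST(i) is addressed relative to the current top of stack.
        return (self.top + i) % self.DEPTH

    def st(self, i=0):
        value = self.regs[self._phys(i)]
        if value is None:
            raise IndexError(f"ST({i}) is empty")
        return value

    def fld(self, value):
        # Push: the top pointer moves down, the new value becomes ST(0).
        new_top = (self.top - 1) % self.DEPTH
        if self.regs[new_top] is not None:
            raise OverflowError("all eight registers are in use")
        self.top = new_top
        self.regs[new_top] = float(value)

    def fstp(self):
        # Pop: return ST(0) and mark its register empty.
        value = self.st(0)
        self.regs[self.top] = None
        self.top = (self.top + 1) % self.DEPTH
        return value

    def fxch(self, i=1):
        # Exchange ST(0) with ST(i); both must hold values.
        self.st(0), self.st(i)
        a, b = self._phys(0), self._phys(i)
        self.regs[a], self.regs[b] = self.regs[b], self.regs[a]

    def fadd(self, i=1, dest_is_st0=True):
        # ST(0) + ST(i), written to ST(0) or, in the reversed form, to ST(i).
        total = self.st(0) + self.st(i)
        self.regs[self._phys(0 if dest_is_st0 else i)] = total

    def faddp(self):
        # ST(1) = ST(1) + ST(0), then pop, so the sum becomes the new ST(0).
        total = self.st(0) + self.st(1)
        self.regs[self._phys(1)] = total
        self.fstp()


stack = X87Stack()
for v in (1.5, 2.0, 10.0):
    stack.fld(v)          # ST(0)=10.0, ST(1)=2.0, ST(2)=1.5
stack.fadd(2)             # ST(0) = 10.0 + 1.5
stack.fxch(1)             # ST(0)=2.0, ST(1)=11.5
stack.faddp()             # ST(0) = 13.5, ST(1) = 1.5
result = stack.fstp()
print(result)
```

In this model ST(0) plays the accumulator role while any other register can be named directly, which is the "non-strict" aspect of the stack.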
The x87 provides single-precision, double-precision and 80-bit double-extended precision binary floating-point arithmetic as per the IEEE 754-1985 standard. By default, the x87 processors all use 80-bit double-extended precision internally. A given sequence of arithmetic operations may thus behave slightly differently compared to a strict single-precision or double-precision IEEE 754 FPU. This may sometimes be problematic for semi-numerical calculations written to assume double precision for correct operation. To avoid such problems, the x87 can be configured using a special configuration/status register to automatically round to single or double precision after each operation. Since the introduction of SSE2, the x87 instructions are not as essential as they once were, but remain important as a high-precision scalar unit for numerical calculations sensitive to round-off error and requiring the 64-bit mantissa precision and extended range available in the 80-bit format.
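The effect of the working precision can be sketched without x87 hardware by rounding exact rational results to a chosen significand width after every operation. The helper round_to_significand below is a hypothetical illustration, not an emulation of the FPU: it ignores exponent range, denormals and exceptions. The widths 53 and 64 bits correspond to the significands of double and double-extended precision.

```python
from fractions import Fraction


def round_to_significand(value, bits):
    """Round to the nearest value with a `bits`-bit significand (ties to even)."""
    value = Fraction(value)
    if value == 0:
        return value
    magnitude = abs(value)
    # First guess at the exponent, then normalise so that
    # 2**(bits - 1) <= scaled < 2**bits.
    exponent = (magnitude.numerator.bit_length()
                - magnitude.denominator.bit_length() - bits)
    scaled = magnitude / Fraction(2) ** exponent
    while scaled >= 2 ** bits:
        scaled /= 2
        exponent += 1
    while scaled < 2 ** (bits - 1):
        scaled *= 2
        exponent -= 1
    # round() on a Fraction rounds halves to the even integer.
    rounded = round(scaled) * Fraction(2) ** exponent
    return rounded if value > 0 else -rounded


def sum_with_precision(terms, bits):
    """Add terms left to right, rounding after every addition."""
    total = Fraction(0)
    for term in terms:
        total = round_to_significand(total + term, bits)
    return total


terms = [2 ** 53, 1, -(2 ** 53)]
double_result = sum_with_precision(terms, 53)    # 2**53 + 1 is not representable
extended_result = sum_with_precision(terms, 64)  # 2**53 + 1 is held exactly
print(double_result, extended_result)            # prints "0 1"
```

The same three additions give 0 when every intermediate is rounded to 53 bits and 1 when 64 bits are kept, which is the kind of difference that code written for strict double precision may not expect.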

Performance

The following table lists clock cycle counts for examples of typical x87 FPU instructions.
The A…B notation (minimum to maximum) covers timing variations dependent on transient pipeline status and the arithmetic precision chosen; it also includes variations due to numerical cases. The L → H notation depicts values corresponding to the lowest (L) and the highest (H) maximal clock frequencies that were available.
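| x87 implementation | FADD | FMUL | FDIV | FXCH | FCOM | FSQRT | FPTAN | FPATAN | Max clock (MHz) | Peak FMUL (millions/s) | FMUL rel. 5 MHz 8087 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 8087 | 70…100 | 90…145 | 193…203 | 10…15 | 40…50 | 180…186 | 30…540 | 250…800 | 5 → 10 | 0.034…0.055 → 0.100…0.111 | 1 → 2× as fast |
| 80287 | 70…100 | 90…145 | 193…203 | 10…15 | 40…50 | 180…186 | 30…540 | 250…800 | 6 → 12 | 0.041…0.066 → 0.083…0.133 | 1.2 → 2.4× |
| 80387 | 23…34 | 29…57 | 88…91 | 18 | 24 | 122…129 | 191…497 | 314…487 | 16 → 33 | 0.280…0.552 → 0.580…1.1 | ~10 → 20× |
| 80486 | 8…20 | 16 | 73 | 4 | 4 | 83…87 | 200…273 | 218…303 | 16 → 50 | 1.0 → 3.1 | ~18 → 56× |
| Cyrix 6x86, Cyrix MII | 4…7 | 4…6 | 24…34 | 2 | 4 | 59…60 | 117…129 | 97…161 | 66 → 300 | 11…16 → 50…75 | ~320 → 1400× |
| AMD K6 | 2 | 2 | 21…41 | 2 | 3 | 21…41 | ? | ? | 166 → 550 | 83 → 275 | ~1500 → 5000× |
| Pentium / Pentium MMX | 1…3 | 1…3 | 39 | 1 | 1…4 | 70 | 17…173 | 19…134 | 60 → 300 | 20…60 → 100…300 | ~1100 → 5400× |
| Pentium Pro | 1…3 | 2…5 | 16…56 | 1 | 1 | 28…68 | ? | ? | 150 → 200 | 30…75 → 40…100 | ~1400 → 1800× |
| Pentium II / III | 1…3 | 2…5 | 17…38 | 1 | 1 | 27…50 | ? | ? | 233 → 1400 | 47…116 → 280…700 | ~2100 → 13000× |
| Athlon | 1…4 | 1…4 | 13…24 | 1 | 1…2 | 16…35 | ? | ? | 500 → 2330 | 125…500 → 580…2330 | ~9000 → 42000× |
| Athlon 64 | | | | | | | | | 1000 → 3200 | 250…1000 → 800…3200 | ~18000 → 58000× |
| Pentium 4 | 1…5 | 2…7 | 20…43 | multiple cycles | 1 | 20…43 | ? | ? | 1300 → 3800 | 186…650 → 543…1900 | ~11000 → 34000× |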
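The Peak FMUL figures appear to follow from dividing the clock frequency by the FMUL cycle count. This relationship is an inference from the numbers in the table, not something the table states. A minimal Python sketch under that assumption (the function name peak_range is hypothetical) reproduces the 80486 and Pentium 4 entries:

```python
def peak_range(clock_mhz, min_cycles, max_cycles):
    """Return (slowest, fastest) operations per second, in millions.

    Assumes one instruction completes every `cycles` clock periods,
    so rate = clock frequency / cycle count.
    """
    return clock_mhz / max_cycles, clock_mhz / min_cycles


# 80486: FMUL takes 16 cycles; clocks ranged from 16 to 50 MHz.
i486_low = peak_range(16, 16, 16)
i486_high = peak_range(50, 16, 16)

# Pentium 4: FMUL takes 2…7 cycles; clocks ranged from 1300 to 3800 MHz.
p4_low = peak_range(1300, 2, 7)
p4_high = peak_range(3800, 2, 7)

print(i486_low[0], i486_high[0])                # 1.0 3.125
print(round(p4_low[0]), round(p4_low[1]))       # 186 650
print(round(p4_high[0]), round(p4_high[1]))     # 543 1900
```

The shortest cycle count gives the upper end of each range and the longest gives the lower end, which is why the table lists a range for each clock frequency.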

Manufacturers

Companies that have designed or manufactured floating-point units compatible with the Intel 8087 or later models include AMD, Chips and Technologies, Cyrix, Fujitsu, Harris Semiconductor, IBM, IDT, IIT, LC Technology, National Semiconductor, NexGen, Rise Technology, ST Microelectronics, Texas Instruments, Transmeta, ULSI, VIA, Weitek, and Xtend.

Architectural generations

8087

The 8087 was the first math coprocessor for 16-bit processors designed by Intel. It was released in 1980 to be paired with the Intel 8088 or 8086 microprocessors.

80C187

Although the original 1982 datasheet for the 80188 and 80186 seems to mention specific math coprocessors, both chips were actually paired with an 8087.
However, in 1987, in order to work with the refreshed CMOS-based Intel 80C186 CPU, Intel introduced the 80C187 math coprocessor. The 80C187's interface to the main processor is the same as that of the 8087, but its core is essentially that of an 80387SX, and it is thus fully IEEE 754-compliant and capable of executing all the 80387's extra instructions.

80287

The 80287, released in 1982, is the math coprocessor for the Intel 80286 series of microprocessors. Intel's models included variants with specified upper frequency limits ranging from 6 up to 12 MHz. The NMOS versions were available at 6, 8 and 10 MHz. The 10 MHz Intel 80287-10 Numerics Coprocessor was offered in quantities of 100. Boxed versions of the 80287, 80287-8, and 80287-10 were available for $212, $326, and $374, respectively. A boxed version of the 80C287A was available for $457.
Other 287 models with 387-like performance are the Intel 80C287, built using CHMOS III, and the AMD 80EC287 manufactured in AMD's CMOS process, using only fully static gates.
Later followed the i80287XL, which has the 387SX microarchitecture with a 287 pinout, and the i80287XLT, a special version intended for laptops, as well as other variants. The i80287XL contains an internal 3/2 multiplier, so that motherboards that ran the coprocessor at 2/3 CPU speed could instead run the FPU at the same speed as the CPU. Both the 80287XL and 80287XLT offered 50% better performance, 83% less power consumption, and additional instructions.
The 80287 also works with the 80386 microprocessor and was the only coprocessor available for the 80386 until the introduction of the 80387 in 1987. The 80387 is strongly preferred for its higher performance and more capable instruction set.

80387

The 80387 is the first Intel coprocessor to be fully compliant with the IEEE 754-1985 standard. Released in 1987, two years after the 386 chip, the i387 is much faster than Intel's previous 8087/80287 coprocessors and has improved trigonometric functions. The 8087 and 80287's FPTAN and FPATAN instructions are limited to an argument in the range ±π/4, and the 8087 and 80287 have no direct instructions for the SIN and COS functions.
The 80387 was made available for USD $500 in quantities of 100. Shortly afterwards, it was made available through Intel's Personal Computer Enhancement Operation for a retail market price of USD $795. The 25 MHz version was available in the retail channel for USD $1395. A 33 MHz version of the 387DX was also available, with a performance of 3.4 megawhetstones per second. Boxed versions of the 16-, 20-, 25-, and 33-MHz 387DX math coprocessor were available for USD $570, $647, $814, and $994, respectively.
The Intel M387 math coprocessor met the MIL-STD-883 Rev. C standard. Testing of this device included temperature cycling between -55 and 125 °C, hermetic sealing, and extended burn-in. This military version operates at 16 MHz and was available in 68-lead PGA and quad flatpack packages. The PGA version was available for USD $1155 in quantities of 100.
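The ±π/4 argument limit on the 8087 and 80287 implies that software has to reduce larger arguments into that range before evaluating a tangent. The Python sketch below shows one common way to do this with standard trigonometric identities; it is an illustration of the idea, not the routine any particular library or compiler used. The names tan_core (a stand-in for the range-limited hardware operation) and tan_reduced are hypothetical.

```python
import math

QUARTER_PI = math.pi / 4
HALF_PI = math.pi / 2


def tan_core(r):
    """Stand-in for a tangent primitive that only accepts |r| <= pi/4."""
    if abs(r) > QUARTER_PI + 1e-12:  # small tolerance for rounding error
        raise ValueError("argument outside the supported range")
    return math.tan(r)


def tan_reduced(x):
    """Tangent of any finite x, using only the range-limited primitive."""
    # Write x = k * (pi/2) + r with |r| <= pi/4.
    k = round(x / HALF_PI)
    r = x - k * HALF_PI
    t = tan_core(r)
    if k % 2 == 0:
        return t              # tan(r + k*pi/2) = tan(r) for even k
    if t == 0.0:
        return math.inf       # the reduced argument landed on a pole
    return -1.0 / t           # tan(r + pi/2) = -1 / tan(r) for odd k


for x in (0.3, 1.0, 2.0, -2.5, 10.0):
    print(x, tan_reduced(x), math.tan(x))
```

For very large arguments the subtraction that forms r loses accuracy, so this simple reduction is only suitable for moderate values of x.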
Without a coprocessor, the 386 normally performs floating-point arithmetic through software routines, implemented at runtime through a software exception handler. When a math coprocessor is paired with the 386, the coprocessor performs the floating-point arithmetic in hardware, returning results much faster than a software library call.
The i387 is compatible only with the standard i386 chip, which has a 32-bit processor bus. The later cost-reduced i386SX, which has a narrower 16-bit data bus, cannot interface with the i387's 32-bit bus. The i386SX requires its own coprocessor, the 80387SX, which is compatible with the SX's narrower 16-bit data bus. The 387SX coprocessor was also offered in a low-power version.
In addition, to pair with the i386SL used in laptops, Intel released the i387SL. Marketed as "Intel387 SL Mobile Math CoProcessor", it included power-management features which allowed it to run without significantly reducing battery life. There are two battery-saving power-down features. The first one stops the coprocessor's clock when the CPU goes into "stop clock" mode; the 387SL consumes about 25 microamperes when its clock is stopped. The second one operates automatically when the CPU is running, putting the 387SL into "idle mode" when it is not executing an instruction. When active, the 387SL typically consumes 30 percent less battery power than the 387SX. In idle mode, it consumes 4 mA, a 96 percent power reduction compared to the active mode. It works in the range of 16 to 25 MHz and does not require BIOS or hardware reconfiguration. It was initially available for USD $189.