Extended precision
Extended precision refers to floating-point number formats that provide greater precision than the basic floating-point formats. Extended-precision formats support a basic format by minimizing roundoff and overflow errors in intermediate values of expressions on the base format. In contrast to extended precision, arbitrary-precision arithmetic refers to implementations of much larger numeric types using special software.
Extended-precision implementations
There is a long history of extended floating-point formats reaching back nearly to the middle of the last century. Various manufacturers have used different formats for extended precision for different machines. In many cases the format of the extended precision is not quite the same as a scale-up of the ordinary single- and double-precision formats it is meant to extend. In a few cases the implementation was merely a software-based change in the floating-point data format, but in most cases extended precision was implemented in hardware, either built into the central processor itself or, more often, built into the hardware of an optional, attached processor called a "floating-point unit" or "floating-point processor", accessible to the CPU as a fast input/output device.
IBM extended-precision formats
The IBM 1130, sold in 1965, offered two floating-point formats: a 32-bit "standard precision" format and a 40-bit "extended precision" format. The standard-precision format contains a 24-bit two's complement significand, while the extended-precision format uses a 32-bit two's complement significand. The latter format makes full use of the CPU's 32-bit integer operations. The characteristic in both formats is an 8-bit field containing the power of two biased by 128. Floating-point arithmetic operations are performed by software, and double precision is not supported at all. The extended format occupies three 16-bit words, with the extra space simply ignored.
The IBM System/360 supports a 32-bit "short" floating-point format and a 64-bit "long" floating-point format. The 360/85 and follow-on System/370 add support for a 128-bit "extended" format. These formats are still supported in the current design, where they are now called the "hexadecimal floating-point" formats.
Microsoft MBF extended-precision format
The Microsoft BASIC port for the 6502 CPU, found in adaptations such as Commodore BASIC, AppleSoft BASIC, KIM-1 BASIC and MicroTAN BASIC, has supported an extended 40-bit variant of the Microsoft Binary Format (MBF) floating-point format since 1977.
IEEE 754 extended-precision formats
The IEEE 754 floating-point standard recommends that implementations provide extended-precision formats. The standard specifies the minimum requirements for an extended format but does not specify an encoding; the encoding is the implementor's choice.
The IA-32, x86-64, and Itanium processors support what is by far the most influential format based on this standard, the Intel 80-bit "double extended" format, described in the next section.
The Motorola 6888x math coprocessors and the Motorola 68040 and 68060 processors also support a 64-bit significand extended-precision format, which is stored padded to 96 bits. The follow-on ColdFire processors do not support this 96-bit extended-precision format.
The FPA10 math coprocessor for early ARM processors also supports a 64-bit significand extended-precision format, but without correct rounding.
The x87 and Motorola 68881 80-bit formats meet the requirements of the IEEE 754-1985 double extended format, as does the IEEE 754 128-bit binary format.
x86 extended-precision format
The x86 extended-precision format is an 80-bit format first implemented in the Intel 8087 math coprocessor and is supported by all processors based on the x86 design that incorporate a floating-point unit.
The Intel 8087 was the first x86 device to support floating-point arithmetic in hardware. It was designed to support a 32-bit "single-precision" format and a 64-bit "double-precision" format for encoding and interchanging floating-point numbers. The extended format was designed not to store data at higher precision, but rather to allow temporary double results to be computed more reliably and accurately by minimizing overflow and roundoff errors in intermediate calculations. All the floating-point registers in the 8087 hold this format; the 8087 automatically converts numbers to this format when loading registers from memory and converts results back to the more conventional formats when storing the registers back into memory. To enable intermediate subexpression results to be saved in extended-precision scratch variables and carried across programming language statements, and to allow interrupted calculations to resume where they left off, it provides instructions that transfer values between these internal registers and memory without performing any conversion. This gives programs access to the extended format for calculations, which also revives the issue of the accuracy of functions of such numbers, but at a higher precision.
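The benefit of wider intermediates can be illustrated with a small model. The sketch below is not x87 code: it uses exact rationals and a hypothetical helper, `round_to_bits`, that rounds to a given significand width with round-half-to-even and an unbounded exponent range. It assumes 53 bits stands in for double precision and 64 bits for the extended format, and compares rounding after every addition at 53 bits with rounding at 64 bits and converting to 53 bits only at the end.

```python
from fractions import Fraction


def round_to_bits(x: Fraction, p: int) -> Fraction:
    """Round x to the nearest value with a p-bit significand (ties to even).

    Illustrative model only: the exponent range is treated as unbounded.
    """
    if x == 0:
        return Fraction(0)
    sign = -1 if x < 0 else 1
    a = abs(x)
    # Find e with 2**e <= a < 2**(e + 1).
    e = a.numerator.bit_length() - a.denominator.bit_length()
    if Fraction(2) ** e > a:
        e -= 1
    # Scale so that the significand becomes an integer of about p bits.
    unit = Fraction(2) ** (e - p + 1)
    n = round(a / unit)  # round() on a Fraction rounds halves to even
    return sign * n * unit


def accumulate(terms, p):
    """Add the terms left to right, rounding to p bits after each addition."""
    total = Fraction(0)
    for t in terms:
        total = round_to_bits(total + t, p)
    return total


tiny = Fraction(1, 2**53)
terms = [Fraction(1), tiny, tiny]

double_only = accumulate(terms, 53)                        # every step at 53 bits
via_extended = round_to_bits(accumulate(terms, 64), 53)    # 64-bit steps, one final rounding
exact = sum(terms)

print(double_only == exact)   # False: both tiny terms are rounded away
print(via_extended == exact)  # True: the wider intermediate keeps them
```

In this model the 53-bit path loses both small terms to rounding, while the 64-bit intermediate holds them until the single final conversion.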
The floating-point units on all subsequent x86 processors have supported this format. As a result, software can be developed which takes advantage of the higher precision provided by this format. William Kahan, a primary designer of the x87 arithmetic and of the initial IEEE 754 standard proposal, notes on the development of the x87 floating point: "An extended format as wide as we dared was included to serve the same support role as the 13 decimal internal format serves in Hewlett-Packard's 10 decimal calculators." Moreover, Kahan notes that 64 bits was the widest significand across which carry propagation could be done without increasing the cycle time on the 8087, and that the x87 extended precision was designed to be extensible to higher precision in future processors.
This 80-bit format uses one bit for the sign of the significand, 15 bits for the exponent field and 64 bits for the significand. The exponent field is biased by 16383, meaning that 16383 has to be subtracted from the value in the exponent field to compute the actual power of 2. An exponent field value of 32767 is reserved so as to enable the representation of special states such as infinity and Not a Number. If the exponent field is zero, the value is a subnormal number and the exponent of 2 is −16382.
In describing the encoding, "s" is the value of the sign bit, "e" is the value of the exponent field interpreted as a positive integer, and "m" is the significand interpreted as a positive binary number, where the binary point is located between bits 63 and 62. The "m" field is the combination of the integer and fraction parts of the significand.
In contrast to the single- and double-precision formats, this format does not utilize an implicit / hidden bit. Rather, bit 63 contains the integer part of the significand and bits 62–0 hold the fractional part. Bit 63 will be 1 on all normalized numbers. There were several advantages to this design when the 8087 was being developed:
- Calculations can be completed a little faster if all bits of the significand are present in the register.
- A 64-bit significand provides sufficient precision to avoid loss of precision when the results are converted back to double-precision format in the vast majority of cases.
- This format provides a mechanism for indicating precision loss due to underflow which can be carried through further operations. For example, a product of several terms can generate an intermediate result that is subnormal, and therefore involves precision loss, even though the product of all of the terms can be represented as a normalized number. The 80287 could complete such a calculation and indicate the loss of precision by returning an "unnormal" result. Processors since the 80387 no longer generate unnormal values and do not support unnormal inputs to operations. They will generate a subnormal if an underflow occurs but will generate a normalized result if subsequent operations on the subnormal can be normalized.
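Because the integer bit is stored explicitly, the kind of value an 80-bit pattern encodes can be read off from the exponent field and bit 63. The following is a minimal sketch, assuming the pattern is given as a Python integer; the function name `classify_x87` and the returned labels are illustrative choices rather than anything defined by the hardware, and the "pseudo-denormal" label for a zero exponent field with bit 63 set is a term taken from Intel's documentation, not from the text above.

```python
def classify_x87(bits: int) -> str:
    """Classify an 80-bit extended-precision pattern given as an integer.

    Layout: bit 79 is the sign, bits 78-64 the exponent field,
    bit 63 the explicit integer bit, bits 62-0 the fraction.
    """
    if not 0 <= bits < (1 << 80):
        raise ValueError("expected an 80-bit pattern")
    exp_field = (bits >> 64) & 0x7FFF
    significand = bits & ((1 << 64) - 1)
    integer_bit = significand >> 63

    if exp_field == 0x7FFF:
        # Reserved for special states such as infinity and Not a Number.
        return "special"
    if exp_field == 0:
        if significand == 0:
            return "zero"
        # Bit 63 clear is the ordinary subnormal case.
        return "pseudo-denormal" if integer_bit else "subnormal"
    # Non-zero exponent field: bit 63 is 1 on all normalized numbers.
    # A clear bit 63 marks an unnormal, which processors since the
    # 80387 no longer generate or accept.
    return "normal" if integer_bit else "unnormal"


print(classify_x87(0x3FFF8000000000000000))  # normal (the value 1)
print(classify_x87(0x00000000000000000001))  # subnormal
print(classify_x87(0x3FFF0000000000000001))  # unnormal
```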
Examples
0000 0000 0000 0000 0001₁₆ = 2^−16382 × 2^−63 = 2^−16445
≈ 3.64519953188247460252841 × 10^−4951
0000 7fff ffff ffff ffff₁₆ = 2^−16382 × (1 − 2^−63)
≈ 3.36210314311209350589816 × 10^−4932
0001 8000 0000 0000 0000₁₆ = 2^−16382
≈ 3.36210314311209350626268 × 10^−4932
7ffe ffff ffff ffff ffff₁₆ = 2^16384 × (1 − 2^−64)
≈ 1.18973149535723176502126 × 10^4932
3ffe ffff ffff ffff ffff₁₆ = 1 − 2^−64
≈ 0.99999999999999999994579
3fff 8000 0000 0000 0000₁₆ = 1
3fff 8000 0000 0000 0001₁₆ = 1 + 2^−63
≈ 1.00000000000000000010842
4000 8000 0000 0000 0000₁₆ = 2
c000 8000 0000 0000 0000₁₆ = −2
0000 0000 0000 0000 0000₁₆ = 0
8000 0000 0000 0000 0000₁₆ = −0
3ffd aaaa aaaa aaaa aaab₁₆ ≈ 0.33333333333333333334237
4000 c90f daa2 2168 c235₁₆ ≈ 3.14159265358979323851281
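The bit patterns above can be checked mechanically. The sketch below decodes a hexadecimal pattern into an exact rational, using the field layout and bias described earlier (sign in bit 79, 15-bit exponent field biased by 16383, 64-bit significand with the binary point between bits 63 and 62). The function name `decode_x87` is a hypothetical helper for illustration. It rejects the reserved exponent value 32767, and because a `Fraction` has no signed zero, −0 decodes to plain 0.

```python
from fractions import Fraction

BIAS = 16383


def decode_x87(hex_text: str) -> Fraction:
    """Decode an 80-bit extended-precision pattern (20 hex digits,
    spaces allowed) into its exact value as a Fraction."""
    bits = int(hex_text.replace(" ", ""), 16)
    if not 0 <= bits < (1 << 80):
        raise ValueError("expected an 80-bit pattern")
    sign = bits >> 79
    exp_field = (bits >> 64) & 0x7FFF
    significand = bits & ((1 << 64) - 1)
    if exp_field == 0x7FFF:
        raise ValueError("reserved encoding (infinity or NaN)")
    # A zero exponent field uses the same power of 2 as exponent field 1.
    exponent = (1 if exp_field == 0 else exp_field) - BIAS
    # The binary point sits between bits 63 and 62, so divide by 2**63.
    value = Fraction(significand, 1 << 63) * Fraction(2) ** exponent
    return -value if sign else value


print(decode_x87("3fff 8000 0000 0000 0000"))         # 1
print(decode_x87("c000 8000 0000 0000 0000"))         # -2
print(float(decode_x87("4000 c90f daa2 2168 c235")))  # close to pi
```

Working with exact rationals avoids any dependence on whether the host platform's own floating-point types can hold a 64-bit significand.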