Quadruple-precision floating-point format
In computing, quadruple precision is a binary floating-point–based computer number format that occupies 16 bytes (128 bits), with a precision at least twice that of the 53-bit double-precision format.
This 128-bit quadruple precision is designed not only for applications requiring results in higher than double precision but also, as a primary function, to allow double-precision results to be computed more reliably and accurately by minimising overflow and round-off errors in intermediate calculations and scratch variables. William Kahan, primary architect of the original IEEE 754 floating-point standard, noted, "For now the 10-byte Extended format is a tolerable compromise between the value of extra-precise arithmetic and the price of implementing it to run fast; very soon two more bytes of precision will become tolerable, and ultimately a 16-byte format... That kind of gradual evolution towards wider precision was already in view when IEEE Standard 754 for Floating-Point Arithmetic was framed."
In IEEE 754-2008 the 128-bit base-2 format is officially referred to as binary128.
IEEE 754 quadruple-precision binary floating-point format: binary128
The IEEE 754 standard specifies a binary128 as having:
- Sign bit: 1 bit
- Exponent width: 15 bits
- Significand precision: 113 bits
This gives from 33 to 36 significant decimal digits of precision. If a decimal string with at most 33 significant digits is converted to the IEEE 754 quadruple-precision format, giving a normal number, and then converted back to a decimal string with the same number of digits, the final result should match the original string. If an IEEE 754 quadruple-precision number is converted to a decimal string with at least 36 significant digits, and then converted back to quadruple-precision representation, the final result must match the original number.
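The two digit counts can be checked with a short calculation. The sketch below assumes the commonly used round-trip bounds for a binary format with p bits of precision, floor((p − 1) · log10 2) and ceil(1 + p · log10 2); the variable names are illustrative only.

```python
import math

P = 113  # significand precision of binary128, in bits

# Largest digit count n for which every n-digit decimal string survives
# decimal -> binary128 -> decimal (assumed bound: floor((P - 1) * log10(2))).
digits_decimal_roundtrip = math.floor((P - 1) * math.log10(2))

# Smallest digit count for which binary128 -> decimal -> binary128 is exact
# (assumed bound: ceil(1 + P * log10(2))).
digits_binary_roundtrip = math.ceil(1 + P * math.log10(2))

print(digits_decimal_roundtrip, digits_binary_roundtrip)  # 33 36
```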
The format is written with an implicit lead bit with value 1 unless the exponent is stored with all zeros. Thus only 112 bits of the significand appear in the memory format, but the total precision is 113 bits for normal values; subnormals have gracefully degrading precision down to 1 bit for the smallest non-zero value. From most significant to least significant, the bits are laid out as 1 sign bit, 15 exponent bits, and 112 significand bits.
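As a rough illustration of this layout, the following sketch splits a 128-bit pattern (held in a Python integer) into its three fields and reassembles it. The function names are hypothetical helpers, not part of any library, and the sketch assumes the fields are ordered sign, exponent, significand from the most significant bit down.

```python
EXP_BITS = 15    # width of the stored exponent field
FRAC_BITS = 112  # explicitly stored significand bits


def unpack_binary128(bits: int) -> tuple[int, int, int]:
    """Split a 128-bit pattern into (sign, stored exponent, stored significand)."""
    if not 0 <= bits < (1 << 128):
        raise ValueError("expected a 128-bit unsigned integer")
    sign = bits >> (EXP_BITS + FRAC_BITS)
    exponent = (bits >> FRAC_BITS) & ((1 << EXP_BITS) - 1)
    fraction = bits & ((1 << FRAC_BITS) - 1)
    return sign, exponent, fraction


def pack_binary128(sign: int, exponent: int, fraction: int) -> int:
    """Reassemble the three fields into a 128-bit pattern."""
    return (sign << (EXP_BITS + FRAC_BITS)) | (exponent << FRAC_BITS) | fraction


pattern = 0xC000 << 112  # top 16 bits are C000, the remaining 112 bits are zero
fields = unpack_binary128(pattern)
print(fields == (1, 0x4000, 0))           # True
print(pack_binary128(*fields) == pattern)  # True
```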
Exponent encoding
The quadruple-precision binary floating-point exponent is encoded using an offset binary representation, with the zero offset being 16383; this is also known as exponent bias in the IEEE 754 standard.
- Emin = 0001₁₆ − 3FFF₁₆ = −16382
- Emax = 7FFE₁₆ − 3FFF₁₆ = 16383
- Exponent bias = 3FFF₁₆ = 16383
The stored exponents 0000₁₆ and 7FFF₁₆ are interpreted specially.
The minimum strictly positive (subnormal) value is 2⁻¹⁶⁴⁹⁴ ≈ 10⁻⁴⁹⁶⁵ and has a precision of only one bit. The minimum positive normal value is 2⁻¹⁶³⁸² ≈ 3.3621 × 10⁻⁴⁹³² and has a precision of 113 bits, so its unit in the last place is 2⁻¹⁶⁴⁹⁴ as well. The maximum representable value is 2¹⁶³⁸⁴ − 2¹⁶²⁷¹ ≈ 1.1897 × 10⁴⁹³².
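The encoding rules above can be sketched as a small decoder that returns the exact value of a bit pattern as a rational number. This is a minimal illustration using Python's standard fractions module; binary128_value is a hypothetical helper name, and infinities and NaN are returned as ordinary Python floats purely for convenience.

```python
import math
from fractions import Fraction

BIAS = 0x3FFF    # exponent bias, 16383
FRAC_BITS = 112  # explicitly stored significand bits


def binary128_value(bits: int):
    """Return the exact value of a binary128 bit pattern as a Fraction.

    Infinities and NaN come back as Python floats (an illustrative choice).
    """
    sign = -1 if bits >> 127 else 1
    exponent = (bits >> FRAC_BITS) & 0x7FFF
    fraction = bits & ((1 << FRAC_BITS) - 1)
    if exponent == 0x7FFF:            # all ones: infinity or NaN
        return math.nan if fraction else sign * math.inf
    if exponent == 0:                 # all zeros: zero or subnormal
        significand = fraction        # no implicit leading 1
        e = 1 - BIAS                  # exponent is fixed at Emin
    else:                             # normal number
        significand = (1 << FRAC_BITS) | fraction
        e = exponent - BIAS
    return sign * Fraction(significand) * Fraction(2) ** (e - FRAC_BITS)


smallest_subnormal = binary128_value(1)
smallest_normal = binary128_value(1 << FRAC_BITS)
largest_finite = binary128_value((0x7FFE << FRAC_BITS) | ((1 << FRAC_BITS) - 1))

print(smallest_subnormal == Fraction(1, 2**16494))  # True
print(smallest_normal == Fraction(1, 2**16382))     # True
print(largest_finite == 2**16384 - 2**16271)        # True
```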
Quadruple precision examples
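One such bit pattern, that of 1/3, can be reproduced with exact rational arithmetic. The sketch below rounds a positive rational to 113 bits using round-to-nearest, ties-to-even; round_to_binary128 is a hypothetical helper that assumes its argument lies in the normal range and performs no overflow or underflow handling.

```python
from fractions import Fraction

P = 113       # significand precision in bits
BIAS = 16383  # exponent bias


def round_to_binary128(x: Fraction) -> tuple[int, Fraction]:
    """Round a positive rational in the normal range to binary128.

    Returns (bit pattern, exactly rounded value). Rounding is to nearest,
    ties to even, which is what round() does for a Fraction.
    """
    # Find e such that 2**e <= x < 2**(e + 1).
    e = x.numerator.bit_length() - x.denominator.bit_length()
    if Fraction(2) ** e > x:
        e -= 1
    ulp = Fraction(2) ** (e - (P - 1))
    significand = round(x / ulp)   # integer in [2**112, 2**113]
    if significand == 1 << P:      # rounding carried into the next binade
        significand >>= 1
        e += 1
        ulp *= 2
    bits = ((e + BIAS) << (P - 1)) | (significand - (1 << (P - 1)))
    return bits, significand * ulp


third = Fraction(1, 3)
bits, rounded = round_to_binary128(third)

print(bits == int("3ffd" + "5" * 28, 16))  # True
print(rounded < third)                     # True: 1/3 is rounded down
```

The discarded part is exactly one third of a unit in the last place, which is why the nearest representable value lies below 1/3.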
These examples are given in bit representation, in hexadecimal, of the floating-point value. This includes the sign, exponent, and significand.
By default, 1/3 rounds down like double precision, because of the odd number of bits in the significand. Thus, the bits beyond the rounding point are 0101... which is less than 1/2 of a unit in the last place.
Double-double arithmetic
A common software technique to implement nearly quadruple precision using pairs of double-precision values is sometimes called double-double arithmetic. Using pairs of IEEE double-precision values with 53-bit significands, double-double arithmetic provides operations on numbers with significands of at least 2 × 53 = 106 bits, only slightly less precise than the 113-bit significand of IEEE binary128 quadruple precision. The range of a double-double remains essentially the same as that of the double-precision format because the exponent still has 11 bits, significantly fewer than the 15 exponent bits of IEEE quadruple precision.
In particular, a double-double/quadruple-precision value q in the double-double technique is represented implicitly as a sum of two double-precision values x and y, each of which supplies half of q's significand.
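A minimal sketch of the idea follows, built on the classic error-free transformation usually called TwoSum. The addition routine shown is a simple, non-rigorous variant chosen for brevity (an assumption of this sketch, not the method of any particular library), and it relies on Python floats being IEEE doubles with round-to-nearest.

```python
from fractions import Fraction


def two_sum(a: float, b: float) -> tuple[float, float]:
    """Return (s, err) with s = fl(a + b) and a + b = s + err exactly."""
    s = a + b
    b_virtual = s - a
    err = (a - (s - b_virtual)) + (b - b_virtual)
    return s, err


def quick_two_sum(a: float, b: float) -> tuple[float, float]:
    """Same as two_sum, but valid only when |a| >= |b|."""
    s = a + b
    err = b - (s - a)
    return s, err


def dd_add(x: tuple[float, float], y: tuple[float, float]) -> tuple[float, float]:
    """Add two double-double values given as (high, low) pairs."""
    s, e = two_sum(x[0], y[0])  # exact sum of the high parts
    e += x[1] + y[1]            # fold in the low parts (rounded)
    return quick_two_sum(s, e)  # renormalise into (high, low)


# 1 + 2**-80 is not representable in a single double, but is as a pair.
hi, lo = dd_add((1.0, 0.0), (2.0**-80, 0.0))
print(1.0 + 2.0**-80 == 1.0)                                  # True
print(Fraction(hi) + Fraction(lo) == 1 + Fraction(1, 2**80))  # True
```

Every step is an ordinary double-precision operation, so the whole computation can run on hardware that supports only double precision.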
Note that double-double arithmetic has the following special characteristics:
- As the magnitude of the value decreases, the amount of extra precision also decreases. Therefore, the range over which full precision is available is narrower than the normalized range of double precision. The smallest number with full precision is 1000...0₂ (106 zeros) × 2⁻¹⁰⁷⁴, or 1.000...0₂ (106 zeros) × 2⁻⁹⁶⁸. Numbers whose magnitude is smaller than 2⁻¹⁰²¹ will not have additional precision compared with double precision.
- The actual number of bits of precision can vary. In general, the magnitude of the low-order part of the number is no greater than a half ULP of the high-order part. If the low-order part is less than half ULP of the high-order part, significant bits are implied between the significand of the high-order and low-order numbers. Certain algorithms that rely on having a fixed number of bits in the significand can fail when using 128-bit long double numbers.
- For the same reason, it is possible to represent values like 1 + 2⁻¹⁰⁷⁴, which is the smallest representable number greater than 1.
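These characteristics can be observed directly with exact rational arithmetic. In the sketch below, significant_bits is a hypothetical helper that counts the bits between the leading and trailing 1 of the exact sum of a pair; math.ulp requires Python 3.9 or later.

```python
import math
from fractions import Fraction


def significant_bits(hi: float, lo: float) -> int:
    """Bits spanned from the leading 1 to the trailing 1 of hi + lo."""
    n = abs((Fraction(hi) + Fraction(lo)).numerator)
    if n == 0:
        return 0
    trailing_zeros = (n & -n).bit_length() - 1
    return n.bit_length() - trailing_zeros


tiny = math.ldexp(1.0, -1074)  # smallest positive (subnormal) double

# The pair (1.0, 2**-1074) represents 1 + 2**-1074, which a plain double cannot.
print(1.0 + tiny == 1.0)            # True
print(significant_bits(1.0, tiny))  # 1075

# Extra precision shrinks with magnitude: below 2**-1000 the low part cannot
# reach further down than 2**-1074, so this pair spans only 75 bits.
print(significant_bits(math.ldexp(1.0, -1000), tiny))  # 75

# At 2**-1022 the spacing of doubles is already 2**-1074, so a low part of
# at most half an ULP can only be zero.
print(math.ulp(math.ldexp(1.0, -1022)) == tiny)  # True
```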
A similar technique can be used to produce double-quad arithmetic, in which a value is represented as a sum of two quadruple-precision values. It can provide operations on significands of at least 226 bits.
Implementations
Quadruple precision is often implemented in software by a variety of techniques, since direct hardware support for quadruple precision is less common. One can use general arbitrary-precision arithmetic libraries to obtain quadruple precision, but specialized quadruple-precision implementations may achieve higher performance.
Computer-language support
A separate question is the extent to which quadruple-precision types are directly incorporated into computer programming languages.
Quadruple precision is specified in Fortran by real(real128) (using the constant real128 from the module iso_fortran_env, introduced in Fortran 2008), or as real(selected_real_kind(33, 4931)), or in a non-standard way as REAL*16.
For the C programming language, ISO/IEC TS 18661-3 specifies _Float128 as the type implementing the IEEE 754 quadruple-precision format. Alternatively, in C/C++ with a few systems and compilers, quadruple precision may be specified by the long double type, but this is not required by the language, nor is it common.
As of C++23, the C++ language defines a <stdfloat> header that contains fixed-width floating-point types. Implementations of these are optional, but if supported, std::float128_t corresponds to quadruple precision.
On x86 and x86-64, the most common C/C++ compilers implement long double as either 80-bit extended precision or simply as being synonymous with double precision, rather than as quadruple precision. The procedure call standard for the ARM 64-bit architecture specifies that long double corresponds to the IEEE 754 quadruple-precision format. On a few other architectures, some C/C++ compilers implement long double as quadruple precision, e.g. gcc on PowerPC and SPARC, or the Sun Studio compilers on SPARC. Even if long double is not quadruple precision, however, some C/C++ compilers provide a nonstandard quadruple-precision type as an extension. For example, gcc provides a quadruple-precision type called __float128 for x86, x86-64 and Itanium CPUs, and on PowerPC as IEEE 128-bit floating-point using the -mfloat128-hardware or -mfloat128 options; and some versions of Intel's C/C++ compiler for x86 and x86-64 supply a nonstandard quadruple-precision type called _Quad.
Zig provides support for it with its f128 type.
Google's work-in-progress language Carbon provides support for it with the type called f128.
As of 2024, Rust is working on adding a new f128 type for IEEE quadruple-precision 128-bit floats.
Libraries and toolboxes
- The GCC quad-precision math library, libquadmath, provides __float128 and __complex128 operations.
- The Boost multiprecision library Boost.Multiprecision provides a unified cross-platform C++ interface for __float128 and _Quad types, and includes a custom implementation of the standard math library.
- The Multiprecision Computing Toolbox for MATLAB allows quadruple-precision computations in MATLAB. It includes basic arithmetic functionality as well as numerical methods, dense and sparse linear algebra.
- The DoubleFloats package provides support for double-double computations for the Julia programming language.
- The doubledouble.py library enables double-double computations in Python.
- Mathematica supports IEEE quad-precision numbers: 128-bit floating-point values, and 256-bit complex values.