IEEE 754


The IEEE Standard for Floating-Point Arithmetic is a technical standard for floating-point arithmetic originally established in 1985 by the Institute of Electrical and Electronics Engineers. The standard addressed many problems found in the diverse floating-point implementations that made them difficult to use reliably and portably. Many hardware floating-point units use the IEEE 754 standard.
The standard defines:
  • arithmetic formats: sets of binary and decimal floating-point data, which consist of finite numbers, infinities, and special "not a number" values
  • interchange formats: encodings that may be used to exchange floating-point data in an efficient and compact form
  • rounding rules: properties to be satisfied when rounding numbers during arithmetic and conversions
  • operations: arithmetic and other operations on arithmetic formats
  • exception handling: indications of exceptional conditions
IEEE 754-2008, published in August 2008, includes nearly all of the original IEEE 754-1985 standard, plus the IEEE 854-1987 standard. The current version, IEEE 754-2019, was published in July 2019. It is a minor revision of the previous version, incorporating mainly clarifications, defect fixes and new recommended operations.

History

The need for a floating-point standard arose from chaos in the business and scientific computing industry in the 1960s and 1970s. IBM used a hexadecimal floating-point format with 7 bits always used for the exponent regardless of precision. CDC and Cray computers used ones' complement representation, which admits a value of +0 and −0. CDC 60-bit computers did not have full 60-bit adders, so integer arithmetic was limited to 48 bits of precision from the floating-point unit. Exception processing from divide-by-zero was different on different computers. Moving data between systems and even repeating the same calculations on different systems was often difficult.
The first IEEE standard for floating-point arithmetic, IEEE 754-1985, was published in 1985. It covered only binary floating-point arithmetic.
A new version, IEEE 754-2008, was published in August 2008, following a seven-year revision process, chaired by Dan Zuras and edited by Mike Cowlishaw. It replaced both IEEE 754-1985 and IEEE 854-1987 standards. The binary formats in the original standard are included in this new standard along with three new basic formats, one binary and two decimal. To conform to the current standard, an implementation must implement at least one of the basic formats as both an arithmetic format and an interchange format.
The international standard ISO/IEC/IEEE 60559:2011 has been approved for adoption through ISO/IEC JTC 1/SC 25 under the ISO/IEEE PSDO Agreement and published.
The current version, IEEE 754-2019 published in July 2019, is derived from and replaces IEEE 754-2008, following a revision process started in September 2015, chaired by David G. Hough and edited by Mike Cowlishaw. It incorporates mainly clarifications and defect fixes, but also includes some new recommended operations.
The international standard ISO/IEC 60559:2020 has been approved for adoption through ISO/IEC JTC 1/SC 25 and published.
The next projected revision of the standard is in 2029.

Formats

An IEEE 754 format is a "set of representations of numerical values and symbols". A format may also include how the set is encoded.
A floating-point format is specified by
  • a base b, which is either 2 or 10 in IEEE 754;
  • a precision p;
  • an exponent range from emin to emax, with emin = 1 − emax, or equivalently emin = −(emax − 1), for all IEEE 754 formats.
A format comprises
  • Finite numbers, which can be described by three integers: s = a sign (zero or one), c = a significand having no more than p digits when written in base b (that is, an integer from 0 through b^p − 1), and q = an exponent such that emin ≤ q + p − 1 ≤ emax. The numerical value of such a finite number is (−1)^s × c × b^q. Moreover, there are two zero values, called signed zeros: the sign bit specifies whether a zero is +0 or −0.
  • Two infinities: +∞ and −∞.
  • Two kinds of NaN (not a number): a quiet NaN (qNaN) and a signaling NaN (sNaN).
For example, if b = 10, p = 7, and emax = 96, then emin = −95, the significand satisfies 0 ≤ c ≤ 9999999, and the exponent satisfies −101 ≤ q ≤ 90. Consequently, the smallest non-zero positive number that can be represented is 1×10^−101, and the largest is 9999999×10^90 (that is, 9.999999×10^96), so the full range of numbers is −9.999999×10^96 through 9.999999×10^96. The numbers −b^(1−emax) and b^(1−emax) (here, −1×10^−95 and 1×10^−95) are the smallest normal numbers in magnitude; non-zero numbers between these smallest numbers are called subnormal numbers.
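The worked example above can be checked mechanically. The following is a minimal sketch in Python, assuming only the definitions given in this section; the names format_summary and value are hypothetical helpers introduced for illustration and are not part of the standard.

```python
from fractions import Fraction

def format_summary(b, p, emax):
    """Derive the limits of a format from its parameters (b, p, emax)."""
    emin = 1 - emax
    c_max = b**p - 1            # largest significand with at most p digits
    q_min = emin - (p - 1)      # from emin <= q + p - 1
    q_max = emax - (p - 1)      # from q + p - 1 <= emax
    return {
        "emin": emin,
        "c_max": c_max,
        "q_range": (q_min, q_max),
        "smallest_positive": Fraction(b) ** q_min,       # c = 1, q = q_min
        "largest_finite": c_max * Fraction(b) ** q_max,  # c = c_max, q = q_max
        "smallest_normal": Fraction(b) ** emin,
    }

def value(s, c, q, b=10):
    """Numerical value (-1)^s * c * b^q of a finite number."""
    return (-1) ** s * c * Fraction(b) ** q

info = format_summary(10, 7, 96)
print(info["emin"], info["q_range"])   # -95 (-101, 90)
```

Exact rational arithmetic (Fraction) is used so that the limits are not themselves rounded by the host's floating-point type.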

Representation and encoding in memory

Some numbers may have several possible floating-point representations. For instance, if b = 10, and p = 7, then −12.345 can be represented by −12345×10^−3, −123450×10^−4, and −1234500×10^−5. However, for most operations, such as arithmetic operations, the result does not depend on the representation of the inputs.
For the decimal formats, any representation is valid, and the set of these representations is called a cohort. When a result can have several representations, the standard specifies which member of the cohort is chosen.
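The idea of a cohort can be illustrated with Python's decimal module. This is only a stand-in: the module follows the General Decimal Arithmetic specification, which uses a similar sign, coefficient and exponent model, and it is not assumed here to be an implementation of an IEEE 754 decimal interchange format.

```python
from decimal import Decimal

# Three members of the cohort of -12.345, built from (sign, digits, exponent).
cohort = [
    Decimal((1, (1, 2, 3, 4, 5), -3)),
    Decimal((1, (1, 2, 3, 4, 5, 0), -4)),
    Decimal((1, (1, 2, 3, 4, 5, 0, 0), -5)),
]

# All members compare equal as numbers ...
all_equal = all(d == cohort[0] for d in cohort)
# ... but each one keeps its own exponent.
exponents = [d.as_tuple().exponent for d in cohort]

# For an exact sum, this module keeps the smaller of the operand exponents,
# which selects one particular member of the result's cohort.
total = Decimal("1.20") + Decimal("1.3")
print(all_equal, exponents, total)   # True [-3, -4, -5] 2.50
```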
For the binary formats, the representation is made unique by choosing the smallest representable exponent allowing the value to be represented exactly. Further, the exponent is not represented directly, but a bias is added so that the smallest representable exponent is represented as 1, with 0 used for subnormal numbers. For numbers with an exponent in the normal range, the leading bit of the significand will always be 1. Consequently, a leading 1 can be implied rather than explicitly present in the memory encoding, and under the standard the explicitly represented part of the significand will lie between 0 and 1. This rule is called leading bit convention, implicit bit convention, or hidden bit convention. This rule allows the binary format to have an extra bit of precision. The leading bit convention cannot be used for the subnormal numbers as they have an exponent outside the normal exponent range and scale by the smallest represented exponent as used for the smallest normal numbers.
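The bias and the implicit leading bit can be seen by taking a binary32 encoding apart. The sketch below assumes the usual binary32 field widths (1 sign bit, 8 exponent bits, 23 stored significand bits, bias 127); decode_binary32 is a hypothetical helper name, and Python's struct module is used only to obtain the bit pattern.

```python
import struct
from fractions import Fraction

def decode_binary32(x):
    """Split a float (rounded to binary32) into fields and rebuild its value."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    biased = (bits >> 23) & 0xFF     # 8-bit biased exponent field
    fraction = bits & 0x7FFFFF       # 23 explicitly stored significand bits
    if biased == 0xFF:
        raise ValueError("infinity or NaN")
    if biased == 0:
        # Subnormal (or zero): no implicit bit, scaled by the smallest exponent.
        significand = Fraction(fraction, 2**23)
        exponent = 1 - 127
    else:
        # Normal: implicit leading 1, exponent = field value minus bias.
        significand = 1 + Fraction(fraction, 2**23)
        exponent = biased - 127
    value = (-1) ** sign * significand * Fraction(2) ** exponent
    return sign, biased, fraction, value

print(decode_binary32(-0.15625))   # (1, 124, 2097152, Fraction(-5, 32))
```

Note that the smallest normal exponent is stored as 1 and that the subnormal branch reuses that same exponent (1 − 127) without the implicit bit, as described above.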
Due to the possibility of multiple encodings, a NaN may carry other information: a sign bit and a payload, which is intended for diagnostic information indicating the source of the NaN.
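As an illustration, the sign and payload of a NaN can be read from a binary64 bit pattern using integer operations alone. This sketch makes an assumption that is not stated in the text above: that the most significant bit of the stored significand field distinguishes quiet (1) from signaling (0) NaNs, as is common for binary interchange encodings. The function name is hypothetical.

```python
def describe_nan_binary64(bits):
    """Return (sign, kind, payload) for a 64-bit NaN pattern, else None."""
    sign = bits >> 63
    biased = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)
    if biased != 0x7FF or fraction == 0:
        return None                   # finite number or infinity
    # Assumption: top fraction bit set means quiet, clear means signaling.
    quiet = (fraction >> 51) & 1
    payload = fraction & ((1 << 51) - 1)
    return sign, "quiet" if quiet else "signaling", payload

print(describe_nan_binary64(0x7FF8000000000123))   # (0, 'quiet', 291)
```

Working on the integer pattern avoids relying on how a particular platform propagates NaN payloads through floating-point operations.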

Basic and interchange formats

The standard defines five basic formats that are named for their numeric base and the number of bits used in their interchange encoding. There are three binary floating-point basic formats and two decimal floating-point basic formats. The binary32 and binary64 formats are the single and double formats of IEEE 754-1985 respectively. A conforming implementation must fully implement at least one of the basic formats.
The standard also defines interchange formats, which generalize these basic formats. For the binary formats, the leading bit convention is required. The following table summarizes some of the possible interchange formats.
In the table above, integer values are exact, whereas values in decimal notation are rounded values. The minimum exponents listed are for normal numbers; the special subnormal number representation allows even smaller numbers to be represented with some loss of precision. For example, the smallest positive number that can be represented in binary64 is 2^−1074; contributions to the −1074 figure include the emin value −1022 and all but one of the 53 significand bits (−1022 − 52 = −1074).
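The 2^−1074 figure can be confirmed on a typical system. The sketch below assumes that Python's float is binary64, which holds on essentially all current platforms but is not guaranteed by the language.

```python
import math
import struct
import sys

# Smallest positive binary64 value: every bit zero except the last fraction bit.
(tiny,) = struct.unpack("<d", struct.pack("<Q", 1))

smallest_normal = sys.float_info.min   # 2**-1022 when float is binary64
assert tiny == math.ldexp(1.0, -1074)
# emin = -1022, lowered by the 52 fraction bits below the leading position.
assert tiny == smallest_normal * 2.0**-52
# Halving it is an exact tie between 0 and tiny; round-to-nearest-even gives 0.
assert tiny / 2 == 0.0
print(tiny)   # 5e-324
```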
Decimal digits is the precision of the format expressed in terms of an equivalent number of decimal digits. It is computed as digits × log10(base). For example, binary128 has approximately the same precision as a 34-digit decimal number.
log10 MAXVAL is a measure of the range of the encoding. Its integer part is the largest exponent shown on the output of a value in scientific notation with one leading digit in the significand before the decimal point.
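Both quantities are easy to compute for the binary basic formats. The sketch below assumes the usual parameters (p, emax) of binary32, binary64 and binary128 and takes MAXVAL to be the largest finite value, (b − b^(1−p)) × b^emax; the function names are hypothetical. The logarithm is split into two terms so that the binary128 case does not overflow the host's floating-point type.

```python
import math

# (precision p in bits, emax) for the three binary basic formats.
BINARY_FORMATS = {
    "binary32": (24, 127),
    "binary64": (53, 1023),
    "binary128": (113, 16383),
}

def decimal_digits(p, base=2):
    """Precision expressed as an equivalent number of decimal digits."""
    return p * math.log10(base)

def log10_maxval(p, emax, base=2):
    """log10 of the largest finite value (base - base**(1 - p)) * base**emax."""
    return emax * math.log10(base) + math.log10(base - float(base) ** (1 - p))

for name, (p, emax) in BINARY_FORMATS.items():
    digits = decimal_digits(p)
    rng = log10_maxval(p, emax)
    print(f"{name}: {digits:.2f} digits, log10 MAXVAL = {rng:.2f}")
```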
The binary32 and binary64 formats are two of the most common formats used today. The figure below shows the absolute precision for both formats over a range of values. This figure can be used to select an appropriate format given the expected value of a number and the required precision.
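An example of a layout for 32-bit floating point is, from the most significant bit down, 1 sign bit, 8 biased exponent bits, and 23 stored significand bits. The 64-bit layout is similar, with 1 sign bit, 11 exponent bits, and 52 stored significand bits.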
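The way interchange formats generalize the basic binary formats can be sketched as a function of the total width k in bits. The rule used for k ≥ 128 (p = k − round(4 × log2 k) + 13 and emax = 2^(k−p−1) − 1) is quoted from memory of the 2008 revision and should be treated as an assumption to verify against the standard; the function name is hypothetical.

```python
import math

# Parameters of the narrow binary interchange formats, listed explicitly
# because the closed-form rule below is only assumed to hold from 128 bits up.
SMALL = {16: (11, 15), 32: (24, 127), 64: (53, 1023)}

def binary_interchange(k):
    """Return (p, emax, bias, exponent_bits) for a k-bit binary format."""
    if k in SMALL:
        p, emax = SMALL[k]
    elif k >= 128 and k % 32 == 0:
        p = k - round(4 * math.log2(k)) + 13
        emax = 2 ** (k - p - 1) - 1
    else:
        raise ValueError("k must be 16, 32, 64, or a multiple of 32 >= 128")
    # k = 1 sign bit + exponent bits + (p - 1) stored significand bits.
    exponent_bits = k - p
    bias = emax
    return p, emax, bias, exponent_bits

for k in (32, 64, 128, 256):
    print(k, binary_interchange(k))
```

For k = 128 this gives p = 113 and emax = 16383, matching the binary128 basic format.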

Extended and extendable precision formats

The standard specifies optional extended and extendable precision formats, which provide greater precision than the basic formats. An extended precision format extends a basic format by using more precision and more exponent range. An extendable precision format allows the user to specify the precision and exponent range. An implementation may use whatever internal representation it chooses for such formats; all that needs to be defined are its parameters. These parameters uniquely describe the set of finite numbers that it can represent.
The standard recommends that language standards provide a method of specifying p and emax for each supported base b. The standard recommends that language standards and implementations support an extended format which has a greater precision than the largest basic format supported for each radix b. For an extended format with a precision between two basic formats, the exponent range must be at least as great as that of the next wider basic format. So, for instance, a binary extended format with 64 bits of precision must have an emax of at least 16383. The x87 80-bit extended format meets this requirement.
The original IEEE 754-1985 standard also had the concept of extended formats, but without any mandatory relation between emin and emax. For example, the Motorola 68881 80-bit format, where emin = − emax, was a conforming extended format, but it became non-conforming in the 2008 revision.
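The two conditions discussed in this section (emin = 1 − emax, and an exponent range at least that of the next wider basic format) can be written as a small check. This is a sketch under stated assumptions: the class and function names are hypothetical, and the numeric emin and emax values for the x87 and Motorola 68881 formats are assumed from their 15-bit exponent fields rather than taken from the text above, which only gives the relation emin = −emax for the 68881.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExtendedFormat:
    name: str
    p: int      # precision in bits
    emin: int
    emax: int

def meets_2008_rules(fmt, next_wider_emax):
    """Check emin = 1 - emax and that emax reaches the next wider basic format."""
    return fmt.emin == 1 - fmt.emax and fmt.emax >= next_wider_emax

BINARY128_EMAX = 16383   # binary128 is the next basic format wider than binary64

# Assumed parameters for the two 80-bit formats.
x87 = ExtendedFormat("x87 80-bit", p=64, emin=-16382, emax=16383)
m68881 = ExtendedFormat("Motorola 68881 80-bit", p=64, emin=-16383, emax=16383)

for fmt in (x87, m68881):
    print(fmt.name, meets_2008_rules(fmt, BINARY128_EMAX))
```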