Floating-point arithmetic


In computing, floating-point arithmetic is arithmetic on subsets of real numbers formed by a significand (a signed sequence of a fixed number of digits in some base) multiplied by an integer power of that base.
Numbers of this form are called floating-point numbers.
For example, the number 2469/200 is a floating-point number in base ten with five digits: 2469/200 = 12.345 = 12345 × 10^-3.
However, 7716/625 = 12.3456 is not a floating-point number in base ten with five digits—it needs six digits.
The nearest floating-point number with only five digits is 12.346.
And 1/3 = 0.3333… is not a floating-point number in base ten with any finite number of digits.
In practice, most floating-point systems use base two, though base ten is also common.
Floating-point arithmetic operations, such as addition and division, approximate the corresponding real number arithmetic operations by rounding any result that is not a floating-point number itself to a nearby floating-point number.
For example, in a floating-point arithmetic with five base-ten digits, the sum 12.345 + 1.0001 = 13.3451 might be rounded to 13.345.
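The effect of such rounding can be reproduced with Python's standard decimal module, which implements base-ten floating-point arithmetic with a configurable number of significant digits (a minimal sketch; the five-digit context mirrors the example above):

    from decimal import Decimal, getcontext

    getcontext().prec = 5          # five significant base-ten digits
    x = Decimal("12.345")
    y = Decimal("1.0001")
    print(x + y)                   # 13.345 (the exact sum 13.3451 is rounded)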
The term floating point refers to the fact that the number's radix point can "float" anywhere to the left, right, or between the significant digits of the number. This position is indicated by the exponent, so floating point can be considered a form of scientific notation.
A floating-point system can be used to represent, with a fixed number of digits, numbers of very different orders of magnitude — such as the number of meters between galaxies or between protons in an atom. For this reason, floating-point arithmetic is often used in applications that must handle very small and very large real numbers quickly. The result of this dynamic range is that the numbers that can be represented are not uniformly spaced; the difference between two consecutive representable numbers varies with their exponent.
Over the years, a variety of floating-point representations have been used in computers. In 1985, the IEEE 754 Standard for Floating-Point Arithmetic was established, and since the 1990s, the most commonly encountered representations are those defined by the IEEE.
The speed of floating-point operations, commonly measured in terms of FLOPS, is an important characteristic of a computer system, especially for applications that involve intensive mathematical calculations.
Floating-point numbers can be computed using software implementations or hardware implementations. Floating-point units (FPUs) are specially designed to carry out operations on floating-point numbers and are part of most computer systems. When FPUs are not available, software implementations can be used instead.

Overview

Floating-point numbers

A number representation specifies some way of encoding a number, usually as a string of digits.
There are several mechanisms by which strings of digits can represent numbers. In standard mathematical notation, the digit string can be of any length, and the location of the radix point is indicated by placing an explicit "point" character there. If the radix point is not specified, then the string implicitly represents an integer and the unstated radix point would be off the right-hand end of the string, next to the least significant digit. In fixed-point systems, a position in the string is specified for the radix point. So a fixed-point scheme might use a string of 8 decimal digits with the decimal point in the middle, whereby "00012345" would represent 0001.2345.
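As a minimal sketch of the fixed-point convention just described, the 8-digit string with an implied decimal point in the middle can be interpreted as follows (the string and scaling are taken from the example above):

    s = "00012345"
    value = int(s) / 10**4         # implied radix point four digits from the right
    print(value)                   # 1.2345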
In scientific notation, the given number is scaled by a power of 10, so that it lies within a specific range—typically between 1 and 10, with the radix point appearing immediately after the first digit. As a power of ten, the scaling factor is then indicated separately at the end of the number. For example, the orbital period of Jupiter's moon Io is 152,853.5047 seconds, a value that would be represented in standard-form scientific notation as 1.528535047 × 10^5 seconds.
Floating-point representation is similar in concept to scientific notation. Logically, a floating-point number consists of:
  • A signed digit string of a given length in a given radix. This digit string is referred to as the significand, mantissa, or coefficient. The length of the significand determines the precision to which numbers can be represented. The radix point position is assumed always to be somewhere within the significand—often just after or just before the most significant digit, or to the right of the rightmost digit. This article generally follows the convention that the radix point is set just after the most significant digit.
  • A signed integer exponent, which modifies the magnitude of the number.
To derive the value of the floating-point number, the significand is multiplied by the base raised to the power of the exponent, equivalent to shifting the radix point from its implied position by a number of places equal to the value of the exponent—to the right if the exponent is positive or to the left if the exponent is negative.
Using base-10 as an example, the number 152,853.5047, which has ten decimal digits of precision, is represented as the significand 1,528,535,047 together with 5 as the exponent. To determine the actual value, a decimal point is placed after the first digit of the significand and the result is multiplied by 10^5 to give 1.528535047 × 10^5, or 152,853.5047. In storing such a number, the base (10) need not be stored, since it will be the same for the entire range of supported numbers, and can thus be inferred.
Symbolically, this final value is:

s / b^(p-1) × b^e,

where s is the significand (ignoring any implied decimal point), p is the precision (the number of digits in the significand), b is the base, and e is the exponent.
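This relation can be written out directly in code; the sketch below (with a hypothetical helper name fp_value) evaluates the significand and exponent of the ten-digit decimal example above:

    def fp_value(significand, exponent, base=10, precision=10):
        # Value of a floating-point number: s / b**(p - 1) * b**e
        return significand / base**(precision - 1) * base**exponent

    print(fp_value(1_528_535_047, 5))   # approximately 152853.5047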
Historically, several number bases have been used for representing floating-point numbers, with base two being the most common, followed by base ten, and other less common varieties, such as base sixteen, base eight, base four, base three, and even base 256 and base 65,536.
A floating-point number is a rational number, because it can be represented as one integer divided by another; for example 1.45 × 10^3 is (145/100) × 1000 or 145,000/100. The base determines the fractions that can be represented; for instance, 1/5 cannot be represented exactly as a floating-point number using a binary base, but 1/5 can be represented exactly using a decimal base (0.2, or 2 × 10^-1). However, 1/3 cannot be represented exactly by either binary (0.010101...) or decimal (0.333...), but in base 3, it is trivial (0.1 or 1 × 3^-1). The occasions on which infinite expansions occur depend on the base and its prime factors.
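This base dependence can be observed in Python, where the built-in float is binary while the standard decimal module works in base ten (a minimal sketch):

    from decimal import Decimal
    from fractions import Fraction

    # 1/5 has no finite binary expansion, so the binary double is only close:
    print(Fraction(0.2))       # 3602879701896397/18014398509481984
    # In a base-ten format the same value is exact:
    print(Decimal("0.2"))      # 0.2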
The way in which the significand and exponent are stored in a computer is implementation-dependent. The common IEEE formats are described in detail later and elsewhere, but as an example, in the binary single-precision floating-point representation, p = 24, and so the significand is a string of 24 bits. For instance, the number π's first 33 bits are:

11001001 00001111 11011010 10100010 0
In this binary expansion, let us denote the positions from 0 (leftmost bit, or most significant bit) to 32 (rightmost bit). The 24-bit significand will stop at position 23, the final 0 of the third group above. The next bit, at position 24, is called the round bit or rounding bit. It is used to round the 33-bit approximation to the nearest 24-bit number. This bit, which is 1 in this example, is added to the integer formed by the leftmost 24 bits, yielding:

11001001 00001111 11011011
When this is stored in memory using the IEEE 754 encoding, this becomes the significand. The significand is assumed to have a binary point to the right of the leftmost bit. So, the binary representation of π is calculated from left-to-right as follows:
(bit_0 × 2^0 + bit_1 × 2^-1 + ... + bit_n × 2^-n + ... + bit_(p-1) × 2^-(p-1)) × 2^e
= (1 × 2^0 + 1 × 2^-1 + 0 × 2^-2 + 0 × 2^-3 + 1 × 2^-4 + ... + 1 × 2^-23) × 2^1
≈ 1.5707964 × 2
≈ 3.1415928

where p is the precision (24 in this example), n is the position of the bit of the significand from the left (starting at 0 and finishing at 23 here) and e is the exponent (1 in this example).
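The same left-to-right evaluation can be checked with a few lines of Python (a sketch; the bit string is the rounded 24-bit significand from the example above and the exponent is 1):

    # 24-bit significand of pi after rounding (from the example above)
    bits = "110010010000111111011011"
    e = 1                                      # exponent
    value = sum(int(b) * 2**-n for n, b in enumerate(bits)) * 2**e
    print(value)                               # 3.1415927410125732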
It can be required that the most significant digit of the significand of a non-zero number be non-zero. This process is called normalization. For binary formats, this non-zero digit is necessarily 1. Therefore, it does not need to be represented in memory, allowing the format to have one more bit of precision. This rule is variously called the leading bit convention, the implicit bit convention, the hidden bit convention, or the assumed bit convention.
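The hidden bit can be made visible by unpacking the stored bit pattern of an IEEE 754 single-precision number and re-adding the implicit leading 1; the following sketch uses only Python's standard struct module:

    import struct

    # Bit pattern of pi as an IEEE 754 single-precision (32-bit) number
    (bit_pattern,) = struct.unpack(">I", struct.pack(">f", 3.141592653589793))
    fraction = bit_pattern & 0x7FFFFF          # 23 stored fraction bits
    exponent = (bit_pattern >> 23) & 0xFF      # biased exponent
    significand = 1 + fraction / 2**23         # implicit leading 1 restored
    print(significand * 2**(exponent - 127))   # 3.1415927410125732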

Alternatives to floating-point numbers

The floating-point representation is by far the most common way of representing an approximation of real numbers in computers. However, there are alternatives:
  • Fixed-point representation uses integer hardware operations controlled by a software implementation of a specific convention about the location of the binary or decimal point, for example, 6 bits or digits from the right. The hardware to manipulate these representations is less costly than floating point, and it can be used to perform normal integer operations, too. Binary fixed point is usually used in special-purpose applications on embedded processors that can only do integer arithmetic, but decimal fixed point is common in commercial applications.
  • Logarithmic number systems represent a real number by the logarithm of its absolute value and a sign bit. The value distribution is similar to floating point, but the value-to-representation curve is smooth. Conversely to floating-point arithmetic, in a logarithmic number system multiplication, division and exponentiation are simple to implement, but addition and subtraction are complex. The level-index arithmetic of Charles Clenshaw, Frank Olver and Peter Turner is a scheme based on a generalized logarithm representation.
  • Tapered floating-point representation, used in Unum formats, including Posit.
  • Some simple rational numbers cannot be represented exactly in binary floating point, no matter what the precision is. Using a different radix allows one to represent some of them, but the possibilities remain limited. Software packages that perform rational arithmetic represent numbers as fractions with integral numerator and denominator, and can therefore represent any rational number exactly (see the sketch after this list). Such packages generally need to use "bignum" arithmetic for the individual integers.
  • Interval arithmetic allows one to represent numbers as intervals and obtain guaranteed bounds on results. It is generally based on other arithmetics, in particular floating point.
  • Computer algebra systems such as Mathematica, Maxima, and Maple can often handle irrational numbers like π or √3 in a completely "formal" way, without dealing with a specific encoding of the significand. Such a program can evaluate expressions like "sin(3π)" exactly, because it is programmed to process the underlying mathematics directly, instead of using approximate values for each intermediate calculation.
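As noted in the rational-arithmetic item above, exact representation of rationals is available in many environments; a minimal sketch using Python's standard fractions module:

    from fractions import Fraction

    # 1/3 has no finite binary or decimal expansion, but as a fraction it is exact
    x = Fraction(1, 3)
    print(x + x + x)                           # 1
    print(Fraction(1, 10) + Fraction(2, 10))   # 3/10, with no rounding error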