Minifloat
In computing, minifloats are floating-point values represented with very few bits. This reduced precision makes them ill-suited for general-purpose numerical calculations, but they are useful for special purposes such as:
- Computer graphics, where human perception of color and light levels has low precision. The 16-bit half-precision format is very popular.
- Machine learning, which can be relatively insensitive to numeric precision. 16-bit, 8-bit, and even 4-bit floats are increasingly being used.
Depending on context, minifloat may mean any size less than 32 bits, any size less than or equal to 16 bits, or any size less than 16 bits. The term microfloat may mean any size less than or equal to 8 bits.
Notation
This page uses the notation (S.E.M.B) to describe a minifloat:
- S is the length of the sign field.
- E is the length of the exponent field.
- M is the length of the mantissa (significand) field.
- B is the exponent bias. This value is usually chosen so that normalized numbers have representable reciprocals.
With IEEE-style exponent coding (the all-ones exponent reserved for infinities and NaNs, the all-zeros exponent for subnormals), the (S.E.M.B) notation can be converted to a (β, p, L, U) format as (2, M + 1, 1 − B, 2^E − B − 2), where β is the radix, p the precision, and L and U the smallest and largest normal exponents.
A common notation used in the field of machine learning is FPn EeMm, where the lowercase letters n, e, and m are replaced by numbers. For example, FP8 E4M3 is the same as (1.4.3.7).
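As a worked instance of the conversion above: the (1.4.3.7) format (FP8 E4M3) maps to (β, p, L, U) = (2, 3 + 1, 1 − 7, 2^4 − 7 − 2) = (2, 4, −6, 7), i.e. four bits of precision and normal exponents running from −6 to 7, as the example tables below confirm.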
Usage
Many situations that call for floating-point numbers do not actually require much precision. This is typical of high-dynamic-range graphics and image processing. It is also typical of larger neural networks, a property that has been exploited since the 2020s to allow increasingly large language models to be trained and deployed. The most general-purpose of these formats is fp16 in IEEE 754-2008, called "half-precision". The bfloat16 format consists of the first 16 bits of a standard single-precision number; it was often used in image processing and machine learning before hardware support was added for other formats.
Graphics
The Radeon R300 and R420 GPUs used an "fp24" floating-point format; "Full Precision" in Direct3D 9.0 is a proprietary 24-bit floating-point format. Microsoft's D3D9 graphics API initially supported both FP24 and FP32 as "Full Precision", as well as FP16 as "Partial Precision", for vertex and pixel shader calculations performed by the graphics hardware.

In 2016, Khronos defined 10-bit and 11-bit unsigned formats for use with Vulkan. These can be converted from positive half-precision values by truncating the sign bit and the trailing mantissa digits.
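The truncating conversion amounts to a one-line bit manipulation per format. The following is a minimal sketch, assuming a non-negative, finite binary16 input and round-toward-zero behavior; the function names are illustrative, not part of any API:

```cpp
#include <cstdint>

// binary16 is laid out as 1 sign + 5 exponent + 10 mantissa bits.
// The Vulkan unsigned formats keep the 5 exponent bits and the top
// 6 (11-bit format) or 5 (10-bit format) mantissa bits.

uint16_t half_to_ufloat11(uint16_t half_bits) {
    return (half_bits & 0x7FFF) >> 4;  // drop sign, truncate 4 mantissa bits
}

uint16_t half_to_ufloat10(uint16_t half_bits) {
    return (half_bits & 0x7FFF) >> 5;  // drop sign, truncate 5 mantissa bits
}
```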
Microcontroller
Minifloats are also commonly used in embedded devices such as microcontrollers, where floating-point arithmetic must be emulated in software. To speed up the computation, the mantissa typically occupies exactly half of the bits, so that the register boundary automatically separates the parts without shifting.
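As a sketch of this layout (the type and field names are illustrative, not from any standard): a 16-bit minifloat whose 8-bit mantissa fills the low byte lets an 8-bit microcontroller address each part directly.

```cpp
#include <cstdint>

// Illustrative 16-bit minifloat: the mantissa occupies exactly half the
// bits, so each field sits on a byte boundary and can be read or written
// without any shifting or masking.
struct Minifloat16 {
    uint8_t mantissa;  // low byte: 8 mantissa bits
    uint8_t sign_exp;  // high byte: 1 sign bit + 7 exponent bits
};
```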
Machine learning
In 2022, Nvidia and others announced support for "fp8" formats. The FP8 E5M2 format can be converted from half-precision by truncating the trailing mantissa digits, and it supports special values such as NaN and infinity. They also announced a format without infinity and with only two representations for NaN, FP8 E4M3: after all, special values are unnecessary in the inference of neural networks. These formats have been made into an industry standard called OCP-FP8. Further compression such as FP4 E2M1 has also proven fruitful. All values of FP4 E2M1 (exponent bias 1) are shown below:
| | .000 | .001 | .010 | .011 | .100 | .101 | .110 | .111 |
| 0... | 0 | 0.5 | 1 | 1.5 | 2 | 3 | 4 | 6 |
| 1... | −0 | −0.5 | −1 | −1.5 | −2 | −3 | −4 | −6 |
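As an illustration, here is a minimal sketch (the function name is ours, not from any standard) that decodes all 16 FP4 E2M1 bit patterns, reproducing the table above:

```cpp
#include <cmath>
#include <cstdio>

// Decode one FP4 E2M1 value (bit layout: s ee m, exponent bias 1,
// no infinities or NaNs).
float decode_fp4_e2m1(unsigned bits) {
    unsigned s = (bits >> 3) & 1;  // sign bit
    unsigned e = (bits >> 1) & 3;  // 2 exponent bits
    unsigned m = bits & 1;         // 1 mantissa bit
    float mag = (e == 0)
        ? m * 0.5f                                  // subnormal: 0.m × 2^0
        : std::ldexp(1.0f + m * 0.5f, int(e) - 1);  // normal: 1.m × 2^(e−1)
    return s ? -mag : mag;
}

int main() {
    for (unsigned bits = 0; bits < 16; ++bits)
        std::printf("%2u -> %g\n", bits, decode_fp4_e2m1(bits));
}
```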
Since 2023, IEEE SA Working Group P3109 has been working on a standard for minifloats optimized for machine learning, systematizing current practice. Interim Report version 3.0 defines a family of many formats under the systematic name "binaryKpP", where K is the total bit length, P is the number of mantissa bits, s/u indicates whether a sign bit is present, and e/f indicates whether infinity is included. By convention, s and e may be omitted. To save space for more numbers, there is no negative zero, and there is only one representation for NaN; for signed formats, the NaN can thus use the bit pattern that would otherwise have been negative zero. For example, the FP4-E2M1 format can be approximated as the following in P3109:
| | .000 | .001 | .010 | .011 | .100 | .101 | .110 | .111 |
| 0... | 0 | 0.5 | 1 | 1.5 | 2 | 3 | 4 | 6 |
| 1... | NaN | −0.5 | −1 | −1.5 | −2 | −3 | −4 | −6 |
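The only difference from the IEEE-style table above is that the would-be −0 bit pattern now encodes the single NaN, which a sketch of a decoder (reusing the hypothetical decode_fp4_e2m1 from above) makes explicit:

```cpp
#include <cmath>

float decode_fp4_e2m1(unsigned bits);  // from the sketch above

// P3109-style variant: the pattern that would encode −0 is instead
// the format's only NaN; every other pattern decodes as before.
float decode_fp4_p3109(unsigned bits) {
    return bits == 0b1000 ? NAN : decode_fp4_e2m1(bits);
}
```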
A downside of very small minifloats is that they have very little representable dynamic range. To mitigate this problem, the machine learning industry has invented "microscaling formats", a kind of block floating point. In an MX format, a group of 32 minifloats shares an additional scaling factor represented by an "E8M0" minifloat. MX has been defined for FP8-E5M2, FP8-E4M3, FP6-E3M2, FP6-E2M3, and FP4-E2M1.
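A minimal sketch of decoding one MX block, assuming the usual reading of E8M0 as a pure power of two 2^(x − 127) and reusing the hypothetical FP4 decoder from above:

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

float decode_fp4_e2m1(unsigned bits);  // from the sketch above

// One MX block: 32 FP4 E2M1 elements sharing a single E8M0 scale.
// E8M0 stores only an 8-bit exponent x, representing 2^(x − 127).
std::vector<float> decode_mx_block(uint8_t scale_e8m0,
                                   const unsigned (&elements)[32]) {
    float scale = std::ldexp(1.0f, int(scale_e8m0) - 127);
    std::vector<float> values;
    for (unsigned e : elements)
        values.push_back(scale * decode_fp4_e2m1(e));
    return values;
}
```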
Examples
8-bit (1.4.3)
A minifloat in 1 byte with 1 sign bit, 4 exponent bits, and 3 significand bits is demonstrated here. The exponent bias is defined as 7 to center the values around 1, matching other IEEE 754 floats, so the actual multiplier for an exponent field value x is 2^(x−7). All IEEE 754 principles remain valid, and this form is quite common for instruction.

Zero is represented as a zero exponent with a zero mantissa. A zero exponent means zero is a subnormal number with a leading "0." prefix, and with a zero mantissa all bits after the binary point are zero, so this value is interpreted as 0. Floating-point numbers use a signed zero, so −0 is also available and is equal to +0:
0 0000 000 = 0
1 0000 000 = −0
For the lowest exponent (all zeros), the significand is extended with "0." and the stored exponent is treated as 1, the same as for the smallest normalized numbers:
0 0000 001 = 0.001₂ × 2^(1−7) = 0.125 × 2^−6 = 0.001953125
...
0 0000 111 = 0.111₂ × 2^(1−7) = 0.875 × 2^−6 = 0.013671875
For all other exponents, the significand is extended with "1.":
0 0001 000 = 1.000₂ × 2^(1−7) = 1 × 2^−6 = 0.015625
0 0001 001 = 1.001₂ × 2^(1−7) = 1.125 × 2^−6 = 0.017578125
...
0 0111 000 = 1.000₂ × 2^(7−7) = 1 × 2^0 = 1
0 0111 001 = 1.001₂ × 2^(7−7) = 1.125 × 2^0 = 1.125
...
0 1110 000 = 1.000₂ × 2^(14−7) = 1.000 × 2^7 = 128
0 1110 001 = 1.001₂ × 2^(14−7) = 1.125 × 2^7 = 144
...
0 1110 110 = 1.110₂ × 2^(14−7) = 1.750 × 2^7 = 224
0 1110 111 = 1.111₂ × 2^(14−7) = 1.875 × 2^7 = 240
Infinity values have the highest exponent with the mantissa set to zero, and may be positive or negative:
0 1111 000 = +infinity
1 1111 000 = −infinity
NaN values have the highest exponent, with the mantissa non-zero.
s 1111 mmm = NaN
This is a chart of all possible values for this example 8-bit float:
| | … 000 | … 001 | … 010 | … 011 | … 100 | … 101 | … 110 | … 111 |
| 0 0000 … | 0 | 0.001953125 | 0.00390625 | 0.005859375 | 0.0078125 | 0.009765625 | 0.01171875 | 0.013671875 |
| 0 0001 … | 0.015625 | 0.017578125 | 0.01953125 | 0.021484375 | 0.0234375 | 0.025390625 | 0.02734375 | 0.029296875 |
| 0 0010 … | 0.03125 | 0.03515625 | 0.0390625 | 0.04296875 | 0.046875 | 0.05078125 | 0.0546875 | 0.05859375 |
| 0 0011 … | 0.0625 | 0.0703125 | 0.078125 | 0.0859375 | 0.09375 | 0.1015625 | 0.109375 | 0.1171875 |
| 0 0100 … | 0.125 | 0.140625 | 0.15625 | 0.171875 | 0.1875 | 0.203125 | 0.21875 | 0.234375 |
| 0 0101 … | 0.25 | 0.28125 | 0.3125 | 0.34375 | 0.375 | 0.40625 | 0.4375 | 0.46875 |
| 0 0110 … | 0.5 | 0.5625 | 0.625 | 0.6875 | 0.75 | 0.8125 | 0.875 | 0.9375 |
| 0 0111 … | 1 | 1.125 | 1.25 | 1.375 | 1.5 | 1.625 | 1.75 | 1.875 |
| 0 1000 … | 2 | 2.25 | 2.5 | 2.75 | 3 | 3.25 | 3.5 | 3.75 |
| 0 1001 … | 4 | 4.5 | 5 | 5.5 | 6 | 6.5 | 7 | 7.5 |
| 0 1010 … | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
| 0 1011 … | 16 | 18 | 20 | 22 | 24 | 26 | 28 | 30 |
| 0 1100 … | 32 | 36 | 40 | 44 | 48 | 52 | 56 | 60 |
| 0 1101 … | 64 | 72 | 80 | 88 | 96 | 104 | 112 | 120 |
| 0 1110 … | 128 | 144 | 160 | 176 | 192 | 208 | 224 | 240 |
| 0 1111 … | Inf | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 0000 … | −0 | −0.001953125 | −0.00390625 | −0.005859375 | −0.0078125 | −0.009765625 | −0.01171875 | −0.013671875 |
| 1 0001 … | −0.015625 | −0.017578125 | −0.01953125 | −0.021484375 | −0.0234375 | −0.025390625 | −0.02734375 | −0.029296875 |
| 1 0010 … | −0.03125 | −0.03515625 | −0.0390625 | −0.04296875 | −0.046875 | −0.05078125 | −0.0546875 | −0.05859375 |
| 1 0011 … | −0.0625 | −0.0703125 | −0.078125 | −0.0859375 | −0.09375 | −0.1015625 | −0.109375 | −0.1171875 |
| 1 0100 … | −0.125 | −0.140625 | −0.15625 | −0.171875 | −0.1875 | −0.203125 | −0.21875 | −0.234375 |
| 1 0101 … | −0.25 | −0.28125 | −0.3125 | −0.34375 | −0.375 | −0.40625 | −0.4375 | −0.46875 |
| 1 0110 … | −0.5 | −0.5625 | −0.625 | −0.6875 | −0.75 | −0.8125 | −0.875 | −0.9375 |
| 1 0111 … | −1 | −1.125 | −1.25 | −1.375 | −1.5 | −1.625 | −1.75 | −1.875 |
| 1 1000 … | −2 | −2.25 | −2.5 | −2.75 | −3 | −3.25 | −3.5 | −3.75 |
| 1 1001 … | −4 | −4.5 | −5 | −5.5 | −6 | −6.5 | −7 | −7.5 |
| 1 1010 … | −8 | −9 | −10 | −11 | −12 | −13 | −14 | −15 |
| 1 1011 … | −16 | −18 | −20 | −22 | −24 | −26 | −28 | −30 |
| 1 1100 … | −32 | −36 | −40 | −44 | −48 | −52 | −56 | −60 |
| 1 1101 … | −64 | −72 | −80 | −88 | −96 | −104 | −112 | −120 |
| 1 1110 … | −128 | −144 | −160 | −176 | −192 | −208 | −224 | −240 |
| 1 1111 … | −Inf | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
There are only 242 different non-NaN values, because 14 of the bit patterns represent NaNs.
To convert to or from 8-bit floats, programming languages usually require libraries or custom functions, since this format is not standardized. For example, in C++:
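A minimal sketch of such a decoder for the (1.4.3) format above, assuming bias 7 and the IEEE 754-style special values described earlier; the function name is illustrative:

```cpp
#include <cmath>
#include <cstdint>

// Decode one (1.4.3) minifloat with bias 7 (bit layout: s eeee mmm).
float decode_minifloat143(uint8_t bits) {
    int sign = (bits >> 7) & 1;
    int exp  = (bits >> 3) & 0xF;  // 4 exponent bits
    int man  = bits & 0x7;         // 3 mantissa bits

    float value;
    if (exp == 0)                  // subnormal: 0.mmm × 2^(1−7)
        value = std::ldexp(man / 8.0f, 1 - 7);
    else if (exp == 0xF)           // highest exponent: Inf or NaN
        value = (man == 0) ? INFINITY : NAN;
    else                           // normal: 1.mmm × 2^(exp−7)
        value = std::ldexp(1.0f + man / 8.0f, exp - 7);
    return sign ? -value : value;
}
```

Encoding in the other direction additionally requires rounding and overflow handling, which is where implementations typically differ.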