TensorFloat-32


TensorFloat-32 (TF32) is a numeric floating-point format designed for the Tensor Cores in certain Nvidia GPUs. It was first implemented in the Ampere architecture. TensorFloat-32 combines the 8-bit exponent of IEEE single precision with the 10-bit mantissa of half precision, for a total of 19 bits per number. It is comparable to the bfloat16 format, which uses an 8-bit exponent and a 7-bit mantissa.

Format

The binary format is: 1 sign bit, 8 exponent bits, and 10 explicitly stored mantissa bits. The 19-significant-bit format fits within a double word (32 bits), and while it lacks precision compared with a full 32-bit IEEE 754 floating-point number, it provides much faster computation, up to 8 times on an A100.
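The reduction from FP32 to TF32 precision can be illustrated by masking off the low mantissa bits of an FP32 bit pattern. The sketch below simply truncates; the rounding mode used by actual Tensor Core hardware may differ (e.g. round-to-nearest), so this is an illustration of the bit layout rather than a bit-exact model.

```python
import struct

def tf32_round(x: float) -> float:
    """Truncate an FP32 value to TF32 precision by zeroing the low
    13 mantissa bits, keeping 1 sign + 8 exponent + 10 mantissa bits.
    (Real hardware may round to nearest rather than truncate.)"""
    bits = struct.unpack('<I', struct.pack('<f', x))[0]  # FP32 bit pattern
    bits &= 0xFFFFE000  # clear the 13 least-significant mantissa bits
    return struct.unpack('<f', struct.pack('<I', bits))[0]

# 2**-12 is below the resolution of a 10-bit mantissa at 1.0, so it is dropped
print(tf32_round(1.0 + 2**-12))   # 1.0
# 2**-10 is exactly one TF32 ulp at 1.0, so it survives
print(tf32_round(1.0 + 2**-10))
```

The mask 0xFFFFE000 preserves the sign bit, the full 8-bit exponent, and the top 10 of FP32's 23 mantissa bits, which is why TF32 covers the same numeric range as FP32 while matching the precision of half precision.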
Because it occupies the same storage as FP32, TF32 is not a distinct storage format but a specification for reduced-precision FP32 multiply–accumulate operations: FP32 inputs are rounded to TF32, multiplied to produce a 21-bit product, and summed into a standard FP32 accumulator.
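The multiply–accumulate scheme described above can be sketched as follows. Python floats are double precision, so this models the data flow (round inputs, multiply, accumulate at higher precision) rather than bit-exact FP32 accumulation; the truncation helper is a simplification of the hardware's conversion step.

```python
import struct

def tf32_round(x: float) -> float:
    # Keep the sign, 8 exponent bits, and top 10 mantissa bits of FP32
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    return struct.unpack('<f', struct.pack('<I', bits & 0xFFFFE000))[0]

def tf32_dot(a, b) -> float:
    """Model a TF32 dot product: each input is rounded to TF32
    precision, then products are summed in a wider accumulator
    (FP32 in hardware; Python's float64 here)."""
    acc = 0.0
    for x, y in zip(a, b):
        acc += tf32_round(x) * tf32_round(y)
    return acc

# Low-order input bits are discarded before the multiplication,
# but the accumulation itself keeps full precision
print(tf32_dot([1.0 + 2**-20], [1.0]))  # 1.0
```

This separation explains why TF32 matrix multiplication loses precision only in the inputs: the accumulation of partial products happens at full FP32 precision, which limits error growth over long dot products.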