Unum (number format)
Unums are a family of number formats and arithmetic for implementing real numbers on a computer, proposed by John L. Gustafson in 2015. They are designed as an alternative to the ubiquitous IEEE 754 floating-point standard. The latest version is known as posits.
Type I Unum
The first version of unums, formally known as Type I unum, was introduced in Gustafson's book The End of Error as a superset of the IEEE-754 floating-point format. The defining features of the Type I unum format are:
- a variable-width storage format for both the significand and exponent, and
- a u-bit, which determines whether the unum corresponds to an exact number, or an interval between consecutive exact unums. In this way, the unums cover the entire extended real number line.
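The role of the u-bit can be illustrated with a small C sketch. It is a toy model only: the `toy_range` type and both function names are hypothetical, the exact value and the gap to the next exact value are passed in as plain doubles rather than decoded from a real Type I unum bit layout, and only non-negative values are handled.

```c
#include <stdbool.h>

/* Hypothetical illustration type; NOT the Type I unum bit layout. */
typedef struct {
    double lo, hi;  /* endpoints of the represented set */
    bool open;      /* true: endpoints excluded (open interval) */
} toy_range;

/* exact: a representable non-negative value
   ulp:   gap to the next representable value
   ubit:  false = exactly `exact`,
          true  = strictly between `exact` and `exact + ulp` */
toy_range toy_unum_range(double exact, double ulp, bool ubit)
{
    toy_range r;
    r.lo = exact;
    r.hi = ubit ? exact + ulp : exact;
    r.open = ubit;
    return r;
}

/* Does the set described by r contain x? */
bool toy_range_contains(toy_range r, double x)
{
    return r.open ? (r.lo < x && x < r.hi) : (x == r.lo);
}
```

Under these assumptions, every value between two neighbouring exact values falls in exactly one of the two cases: it is either one of the exact values (u-bit clear) or inside the open interval between them (u-bit set).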
William M. Kahan and Gustafson debated unums at the Arith23 conference.
Type II Unum
Type II Unums were introduced in 2016 as a redesign of Unums that broke IEEE-754 compatibility. In addition to the sign bit and the interval bit mentioned earlier, the Type II Unum uses a bit to indicate inversion. These three operations make it possible, starting from a finite set of points between one and infinity, to quantify the entire projective line except for four points: the two exceptions, 0 and ∞, and then 1 and −1. This set of points is chosen arbitrarily, and arithmetic operations involving them are not performed logically but rather by using a lookup table. The size of such a table becomes prohibitive for an encoding format spanning multiple bytes. This challenge necessitated the development of the Type III Unum, known as the posit, discussed below.
Posit (Type III Unum)
In February 2017, Gustafson officially introduced Type III unums, which consist of posits for fixed floating-point-like values and valids for interval arithmetic. In March 2022, a standard was ratified and published by the Posit Working Group.
Posits are a hardware-friendly version of unum in which the difficulties that the original Type I unum faced because of its variable size are resolved. Compared to IEEE 754 floats of similar size, posits offer a bigger dynamic range and more fraction bits for values with magnitude near 1, and Gustafson claims that they offer better accuracy. Studies confirm that for some applications, posits with quire outperform floats in accuracy. Posits have superior accuracy in the range near one, where most computations occur. This makes them attractive for deep learning, where the current trend is to minimize the number of bits used. Using fewer bits can also speed up other applications by reducing network and memory bandwidth and power requirements.
The format of an n-bit posit is given a label of "posit" followed by the decimal digits of n and consists of four sequential fields:
- sign: 1 bit, representing an unsigned integer s
- regime: at least 2 bits and up to (n − 1), representing an integer r as described below
- exponent: up to 2 bits, as available after the regime, representing an unsigned integer e
- fraction: all remaining bits available after exponent, representing a non-negative real dyadic rational f less than 1
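The regime field is a run of k identical bits, terminated by a bit of the opposite value if any bits remain; r is −k if the run consists of 0s and k − 1 if it consists of 1s. Exponent and fraction fields that are truncated or absent are implicitly extended with zeros: an absent exponent is treated as 00₂, a one-bit exponent E1 is treated as E10₂, and an absent fraction is treated as 0. Negative numbers are encoded as 2's complements.
The two encodings in which all non-sign bits are 0 have special interpretations:
- If the sign bit is 1, the posit value is NaR ("not a real")
- If the sign bit is 0, the posit value is 0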
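The field rules above can be turned into a small decoder. The following C sketch is an illustration, not a reference implementation: the function names are hypothetical, it assumes the 2-bit exponent of the 2022 standard, and it assumes that a non-special positive encoding has the value (1 + f) × 2^(4r + e), with negative encodings handled by taking the two's complement first.

```c
#include <math.h>    /* NAN */
#include <stdint.h>

/* 2^e for small integer e, computed without libm (exact in double). */
static double pow2i(int e)
{
    double v = 1.0;
    while (e > 0) { v *= 2.0; e--; }
    while (e < 0) { v *= 0.5; e++; }
    return v;
}

/* Decode an n-bit posit (3 <= n <= 32, 2 exponent bits) held in the
   low n bits of `bits`. NaR is returned as NAN. */
double posit_es2_to_double(uint32_t bits, int n)
{
    uint32_t mask = (n == 32) ? 0xFFFFFFFFu : ((1u << n) - 1u);
    uint32_t sign_bit = 1u << (n - 1);
    bits &= mask;

    if (bits == 0) return 0.0;          /* all bits 0: zero */
    if (bits == sign_bit) return NAN;   /* sign 1, rest 0: NaR */

    int negative = (bits & sign_bit) != 0;
    if (negative) bits = (~bits + 1u) & mask;   /* two's complement */

    /* Regime: run of k identical bits after the sign bit. */
    int pos = n - 2;
    int first = (int)((bits >> pos) & 1u);
    int k = 0;
    while (pos >= 0 && (int)((bits >> pos) & 1u) == first) { k++; pos--; }
    int r = first ? k - 1 : -k;
    pos--;                              /* skip the terminating bit, if any */

    /* Exponent: up to 2 bits; missing low bits are treated as 0. */
    int e = 0;
    for (int i = 0; i < 2; i++) {
        e <<= 1;
        if (pos >= 0) { e |= (int)((bits >> pos) & 1u); pos--; }
    }

    /* Fraction: the remaining m bits, read as frac / 2^m. */
    int m = pos + 1;
    double f = 0.0;
    if (m > 0) f = (double)(bits & ((1u << m) - 1u)) * pow2i(-m);

    double v = (1.0 + f) * pow2i(4 * r + e);
    return negative ? -v : v;
}
```

Under these assumptions, the 8-bit pattern 0x40 decodes to 1, 0x44 to 1.5, 0x7F to 2^24 (the largest posit8 value), and 0x01 to 2^−24 (the smallest positive one).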
Quire
For each positn type of precision n, the standard defines a corresponding "quire" type quiren of precision 16 × n, used to accumulate exact sums of products of those posits without rounding or overflow in dot products for vectors of up to 2^31 or more elements. The quire format is a two's complement signed integer, interpreted as a multiple of units of magnitude 2^(16 − 8n), except for the special value with a leading sign bit of 1 and all other bits equal to 0, which represents NaR. Quires are based on the work of Ulrich W. Kulisch and Willard L. Miranker.
Valid
Valids are described as a Type III Unum mode that bounds results in a given range.
Implementations
Several software and hardware solutions implement posits. The first complete parameterized posit arithmetic hardware generator was proposed in 2018.
Unum implementations have been explored in Julia and MATLAB. A C++ version with support for any posit size combined with any number of exponent bits is available. SoftPosit, a fast implementation in C provided by the NGA research team and based on Berkeley SoftFloat, adds to the available software implementations.
| Project / author | Type | Precisions | Quire Support? | Speed | Testing | Notes |
| VividSparks | World's first FPGA GPGPU | 32 | | ~3.2 TPOPS | Exhaustive. No known bugs. | RacEr GP-GPU has 512 cores |
| A*STAR | C library based on Berkeley SoftFloat; C++ wrapper to override operators; Python wrapper using SWIG of SoftPosit | 8, 16, 32 published and complete | | ~60 to 110 MPOPS on x86 core | 8: Exhaustive; 16: Exhaustive except FMA, quire; 32: Exhaustive test is still in progress. No known bugs. | license. Fastest and most comprehensive C library for posits presently. Designed for plug-in comparison of IEEE floats and posits. |
| A*STAR | Mathematica notebook | All | | < 80 KPOPS | Exhaustive for low precisions. No known bugs. | Original definition and prototype. Most complete environment for comparing IEEE floats and posits. Many examples of use, including linear solvers |
| A*STAR | JavaScript widget | Convert decimal to posit 6, 8, 16, 32; generate tables 2–17 with es 1–4. | | | Fully tested | Table generator and conversion |
| Stillwater Supercomputing, Inc | C++ template library; C library; Python wrapper; Golang library | Arbitrary precision posit, float, valid, Unum type 1, Unum type 2 | Arbitrary quire configurations with programmable capacity | posit<4,0> 1 GPOPS; posit<8,0> 130 MPOPS; posit<16,1> 115 MPOPS; posit<32,2> 105 MPOPS; posit<64,3> 50 MPOPS; posit<128,4> 1 MPOPS; posit<256,5> 800 KPOPS | Complete validation suite for arbitrary posits. Randoms for large posit configs. Uses induction to prove nbits+1 is correct. No known bugs | MIT license. Fully integrated with C/C++ types and automatic conversions. Supports full C++ math library. Runtime integrations: MTL4/MTL5, Eigen, Trilinos, HPR-BLAS. Application integrations: G+SMO, FDBB, FEniCS, ODEintV2, TVM.ai. Hardware accelerator integration. |
| Chung Shin Yee | Python library | All | | ~20 MPOPS | Extensive; no known bugs | |
| David Thien | SoftPosit bindings for Racket | All | | | | |
| Bill Zorn | SoftPosit bindings for Python | All | | ~20–45 MPOPS on 4.9 GHz Skylake core | | |
| Diego Coelho | Octave implementation | All | | | Limited testing; no known bugs | |
| Isaac Yonemoto | Julia library | All <32, all ES | | | No known bugs. Division bugs | Leverages Julia's templated mathematics standard library, can natively do matrix and tensor operations, complex numbers, FFT, DiffEQ. Support for valids |
| Isaac Yonemoto | Julia and C/C++ library | 8, 16, 32, all ES | | | Known bug in 32-bit multiplication | Used by LLNL in shock studies |
| Milan Klöwer | Julia library | Based on SoftPosit; 8-bit, 16-bit, 24-bit, 32-bit | | Similar to A*STAR "SoftPosit" | Yes: Posit, Posit, Posit. Other formats lack full functionality | Issues and suggestions on GitHub. This project was developed because SigmoidNumbers and FastSigmoid by Isaac Yonemoto are not currently maintained. Supports basic linear algebra functions in Julia |
| Ken Mercado | Python library | All | | < 20 MPOPS | | Easy-to-use interface. Neural net example. Comprehensive functions support. |
| Federico Rossi, Emanuele Ruffaldi | C++ library | 4 to 64; "Template version is 2 to 63 bits" | | | A few basic tests | 4 levels of operations working with posits. Special support for NaN types |
| Clément Guérin | C++ library | | | | Bugs found; status of fixes unknown | Supports + – × ÷ √ reciprocal, negate, compare |
| Isaac Yonemoto | Julia and Verilog | 8, 16, 32, ES=0 | | | Comprehensively tested for 8-bit, no known bugs | Intended for deep learning applications. Addition, subtraction and multiplication only. A proof of concept matrix multiplier has been built, but is off-spec in its precision |
| Lombiq Technologies | C# with Hastlayer for hardware generation | 8, 16, 32 | | 10 MPOPS | | Requires Microsoft .NET APIs |
| Jeff Johnson, Facebook | SystemVerilog | | | | Limited | Does not strictly conform to posit spec. Supports +, -, /, *. Implements both logarithmic posit and normal, "linear" posits. License: CC-BY-NC 4.0 at present |
| Tokyo Tech | FPGA | 16, 32, extendable | | "2 GHz", not translated to MPOPS | Known rounding bugs | Yet to be open-source |
| Manish Kumar Jaiswal | Verilog HDL for Posit Arithmetic | Any precision. Able to generate any combination of word-size and exponent-size | | Speed of design is based on the underlying hardware platform | Exhaustive tests for 8-bit posit. Multi-million random tests are performed for up to 32-bit posit with various ES combinations | Supports the round-to-nearest rounding method. |
| Vinay Saxena, Research and Technology Centre, Robert Bosch, India and Farhad Merchant, RWTH Aachen University | Verilog generator for VLSI, FPGA | All | | Similar to floats of same bit size | N=8 - ES=2; N=7,8,9,10,11,12 Selective combinations for - ES=1; N=16 | To be used in commercial products. To the best of our knowledge. |
| Posit-enabled RISC-V core | BSV Implementation | 32-bit posit with and | | | Verified against SoftPosit for and tested with several applications for and. No known bugs. | First complete posit-capable RISC-V core. Supports dynamic switching between and. |
| David Mallasén | Open-Source Posit RISC-V Core with Quire Capability | Posit<32,2> with 512-bit quire | | Speed of design is based on the underlying hardware platform | Functionality testing of each posit instruction. | Application-level posit-capable RISC-V core based on CVA6 that can execute all posit instructions, including the quire fused operations. PERCIVAL is the first work that integrates the complete posit ISA and quire in hardware. It allows the native execution of posit instructions as well as the standard floating-point ones simultaneously. |
| Chris Lomont | Single file C#, MIT licensed | Any size | | | Extensive; no known bugs | Ops: arithmetic, comparisons, sqrt, sin, cos, tan, acos, asin, atan, pow, exp, log |
| REX Computing | FPGA version of the "Neo" VLIW processor with posit numeric unit | 32 | | ~1.2 GPOPS | Extensive; no known bugs | No divide or square root. First full processor design to replace floats with posits. |
| Calligo Tech | | | | 500 MHz * 8 Cores | Exhaustive tests completed for 32 bits and 64 bits with Quire support completed. Applications tested and being made available for seamless adoption www.calligotech.com | Fully integrated with C/C++ types and automatic conversions. Supports full C++ math library. Runtime integrations: GNU Utils, OpenBLAS, CBLAS. Application integrations: in progress. Compiler support extended: C/C++, G++, GFortran & LLVM. |
| Jianyu Chen | Specific-purpose FPGA | 32 | | 16–64 GPOPS | Only one known case tested | Does 128-by-128 matrix-matrix multiplication using quire. |
| Raul Murillo | Python library | 8, 16, 32 | | | | A DNN framework using posits |
| Jaap Aarts | Pure Go library | | | 80 MPOPS for div32/2 and similar linear functions. Much higher for truncate and much lower for exp. | Fuzzing against C softposit with a lot of iterations for 16/1 and 32/2. Explicitly testing edge cases found. | Where ES is constant, the code is generated. The generator should be able to generate code for all sizes and for ES below the size. However, the configurations not included in the library by default are not tested, fuzzed, or supported. For some operations on 32/ES, mixing and matching ES is possible. However, this is not tested. |
SoftPosit
SoftPosit is a software implementation of posits based on Berkeley SoftFloat. It allows software comparison between posits and floats. It currently supports:
- Add
- Subtract
- Multiply
- Divide
- Fused-multiply-add
- Fused-dot-product
- Square root
- Convert posit to signed and unsigned integer
- Convert signed and unsigned integer to posit
- Convert posit to another posit size
- Less than, equal, and less than or equal comparisons
- Round to nearest integer
Helper functions
- Convert double to posit
- Convert posit to double
- Cast unsigned integer to posit
Examples
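Add with posit8_t
#include <stdio.h>
#include "softposit.h"

int main(void) {
    // Convert double to posit (operand values are illustrative)
    posit8_t pA = convertDoubleToP8(1.5);
    posit8_t pB = convertDoubleToP8(0.25);
    // Add the two posits
    posit8_t pZ = p8_add(pA, pB);
    // To check answer by converting it to double
    printf("dZ: %.15f\n", convertP8ToDouble(pZ));
    return 0;
}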
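Fused dot product with quire16_t
#include <stdio.h>
#include "softposit.h"

int main(void) {
    // Convert double to posit (example operand values)
    posit16_t pA = convertDoubleToP16(1.02783203125);
    posit16_t pB = convertDoubleToP16(0.987060546875);
    posit16_t pC = convertDoubleToP16(0.4998779296875);
    posit16_t pD = convertDoubleToP16(0.8797607421875);
    quire16_t qZ;
    // Set quire to 0
    qZ = q16_clr(qZ);
    // Accumulate products without roundings
    qZ = q16_fdp_add(qZ, pA, pB);
    qZ = q16_fdp_add(qZ, pC, pD);
    // Convert back to posit
    posit16_t pZ = q16_to_p16(qZ);
    // To check answer
    double dZ = convertP16ToDouble(pZ);
    printf("dZ: %.15f\n", dZ);
    return 0;
}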
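The benefit of accumulating products without rounding can also be shown without SoftPosit. The sketch below is a toy analogue of an 8-bit quire, not the SoftPosit API and not the standard's quire type: the names `toy_quire8`, `to_fixed24`, `toy_fdp_add` and `toy_quire8_to_double` are hypothetical, NaR is not modelled, and the code assumes a compiler that provides the `__int128` extension (GCC or Clang on 64-bit targets). It also assumes that every finite posit8 value is an exact multiple of 2^−24 with magnitude at most 2^24, so that each product is an exact multiple of 2^−48.

```c
#include <stdint.h>

/* Toy accumulator: a 128-bit two's complement integer that counts
   units of 2^-48 (hypothetical type, requires __int128). */
typedef __int128 toy_quire8;

/* Scale x to an integer number of 2^-24 units. Exact only when x is a
   multiple of 2^-24 with |x| <= 2^24 (assumed by this sketch). */
static int64_t to_fixed24(double x)
{
    return (int64_t)(x * 16777216.0);   /* x * 2^24 */
}

/* q += a * b, with the product and the sum computed exactly. */
toy_quire8 toy_fdp_add(toy_quire8 q, double a, double b)
{
    return q + (toy_quire8)to_fixed24(a) * to_fixed24(b);
}

/* Round the accumulated sum once, at the very end. */
double toy_quire8_to_double(toy_quire8 q)
{
    return (double)q / 281474976710656.0;   /* divide by 2^48 */
}
```

Starting from zero and accumulating 2^24 × 2^24, then 2^−24 × 2^−24, then −2^24 × 2^24 leaves exactly one unit (2^−48) in the accumulator. The same three products summed in double arithmetic give 0, because the small term is lost as soon as it is added to 2^48.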
Critique
William M. Kahan, the principal architect of IEEE 754-1985, criticizes Type I unums on the following grounds:
- The description of unums sidesteps using calculus for solving physics problems.
- Unums can be expensive in terms of time and power consumption.
- Each computation in unum space is likely to change the bit length of the structure. This requires either unpacking them into a fixed-size space, or data allocation, deallocation, and garbage collection during unum operations, similar to the issues for dealing with variable-length records in mass storage.
- Unums provide only two kinds of numerical exception, quiet and signaling NaN.
- Unum computation may deliver overly loose bounds from the selection of an algebraically correct but numerically unstable algorithm.
- The benefits of unum over short precision floating point for problems requiring low precision are not obvious.
- Solving differential equations and evaluating integrals with unums guarantees correct answers but may not be as fast as methods that usually work.