Range coding


Range coding is an entropy coding method defined by G. Nigel N. Martin in a 1979 paper, which effectively rediscovered the FIFO arithmetic code first introduced by Richard Clark Pasco in 1976. Given a stream of symbols and their probabilities, a range coder produces a space-efficient stream of bits to represent these symbols and, given the stream and the probabilities, a range decoder reverses the process.
Range coding is very similar to arithmetic coding, except that coding is done with digits in any base, instead of with bits, and so it is faster when using larger bases at small cost in compression efficiency. After the expiration of the first arithmetic coding patent, range coding appeared to clearly be free of patent encumbrances. This particularly drove interest in the technique in the open source community. Since that time, patents on various well-known arithmetic coding techniques have also expired.

How range coding works

Range coding conceptually encodes all the symbols of the message into one number, unlike Huffman coding which assigns each symbol a bit-pattern and concatenates all the bit-patterns together. Thus range coding can achieve greater compression ratios than the one-bit-per-symbol lower bound on Huffman coding and it does not suffer the inefficiencies that Huffman does when dealing with probabilities that are not an exact power of two.
The central concept behind range coding is this: given a large-enough range of integers, and a probability estimation for the symbols, the initial range can easily be divided into sub-ranges whose sizes are proportional to the probability of the symbol they represent. Each symbol of the message can then be encoded in turn, by reducing the current range down to just that sub-range which corresponds to the next symbol to be encoded. The decoder must have the same probability estimation the encoder used, which can either be sent in advance, derived from already transferred data or be part of the compressor and decompressor.
When all symbols have been encoded, merely identifying the sub-range is enough to communicate the entire message. A single integer is actually sufficient to identify the sub-range, and it may not even be necessary to transmit the entire integer; if there is a sequence of digits such that every integer beginning with that prefix falls within the sub-range, then the prefix alone is all that's needed to identify the sub-range and thus transmit the message.

Example

Suppose we want to encode the message "AABA", where is the end-of-message symbol. For this example it is assumed that the decoder knows that we intend to encode exactly five symbols in the base 10 number system using the probability distribution. The encoder breaks down the range , which allows the encoder to emit two of the left-most digits of low, and readjust the interval to = 88800, 100000), that is, low = 88800 and range = [100000 - low.
The decoder will be following the same steps so it will know when it needs to do this to keep in sync.

// this goes just before the end of Encode above
if

Base 10 was used in this example, but a real implementation would just use binary, with the full range of the native integer data type. Instead of 10000 and 1000 you would likely use hexadecimal constants such as 0x1000000 and 0x10000. Instead of emitting a digit at a time you would emit a byte at a time and use a byte-shift operation instead of multiplying by 10.
Decoding uses exactly the same algorithm with the addition of keeping track of the current code value consisting of the digits read from the compressor. Instead of emitting the top digit of low you just throw it away, but you also shift out the top digit of code and shift in a new digit read from the compressor. Use AppendDigit below instead of EmitDigit.

int code = 0;
int low = 0;
int range = 1;
void InitializeDecoder
void AppendDigit
void Decode // Decode is same as Encode with EmitDigit replaced by AppendDigit

In order to determine which probability intervals to apply, the decoder needs to look at the current value of code within the interval low, low+range) and decide which symbol this represents.

void Run
int GetValue

For the AABA example above, this would return [a value
in the range 0 to 9. Values 0 through 5 would represent A, 6 and 7 would represent B, and 8 and 9 would represent .

Relationship with arithmetic coding

Arithmetic coding is the same as range coding, but with the integers taken as being the numerators of fractions. These fractions have an implicit, common denominator, such that all the fractions fall in the range 0,1). Accordingly, the resulting arithmetic code is interpreted as beginning with an implicit "0". As these are just different interpretations of the same coding methods, and as the resulting arithmetic and range codes are identical, each [arithmetic coder is its corresponding range encoder, and vice versa. In other words, arithmetic coding and range coding are just two, slightly different ways of understanding the same thing.
In practice, though, so-called range encoders tend to be implemented pretty much as described in Martin's paper, while arithmetic coders more generally tend not to be called range encoders. An often noted feature of such range encoders is the tendency to perform renormalization a byte at a time, rather than one bit at a time. In other words, range encoders tend to use bytes as coding digits, rather than bits. While this does reduce the amount of compression that can be achieved by a very small amount, it is faster than when performing renormalization for each bit.