LZMA
The Lempel–Ziv–Markov chain algorithm is an algorithm used to perform lossless data compression. It has been developed since 1998 by Igor Pavlov who is the developer of 7-Zip. It has been used in the 7z format of the 7-Zip archiver since 2001. This algorithm uses a dictionary compression scheme somewhat similar to the LZ77 algorithm published by Abraham Lempel and Jacob Ziv in 1977 and features a high compression ratio and a variable compression-dictionary size, while still maintaining decompression speed similar to other commonly used compression algorithms.
LZMA2 is a simple container format that allows multi-threaded compression and decompression using multiple separate LZMA streams. It is used in both in the 7z and the xz format.
Overview
LZMA uses a dictionary compression algorithm, whose output is then encoded with a range encoder, using a complex model to make a probability prediction of each bit. The dictionary compressor finds matches using sophisticated dictionary data structures, and produces a stream of literal symbols and phrase references, which is encoded one bit at a time by the range encoder: many encodings are possible, and a dynamic programming algorithm is used to select an optimal one under certain approximations.Prior to LZMA, most encoder models were purely byte-based. The main innovation of LZMA is that instead of a generic byte-based model, LZMA's model uses contexts specific to the bitfields in each representation of a literal or phrase: this is nearly as simple as a generic byte-based model, but gives much better compression because it avoids mixing unrelated bits together in the same context. Furthermore, compared to classic dictionary compression, the dictionary sizes can be and usually are much larger, taking advantage of the large amount of memory available on modern systems.
Compressed format overview
In LZMA compression, the compressed stream is a stream of bits, encoded using an adaptive binary range coder. The stream is divided into packets, each packet describing either a single byte, or an LZ77 sequence with its length and distance implicitly or explicitly encoded. Each part of each packet is modeled with independent contexts, so the probability predictions for each bit are correlated with the values of that bit in previous packets of the same type.Both the lzip and the LZMA SDK documentation describe this stream format.
There are 7 types of packets:
| Packed code | Packet name | Packet description |
| 0 + byteCode | LIT | A single byte encoded using an adaptive binary range coder. |
| 1+0 + len + dist | MATCH | A typical LZ77 sequence describing sequence length and distance. |
| 1+1+0+0 | SHORTREP | A one-byte LZ77 sequence. Distance is equal to the last used LZ77 distance. |
| 1+1+0+1 + len | LONGREP | An LZ77 sequence. Distance is equal to the last used LZ77 distance. |
| 1+1+1+0 + len | LONGREP | An LZ77 sequence. Distance is equal to the second last used LZ77 distance. |
| 1+1+1+1+0 + len | LONGREP | An LZ77 sequence. Distance is equal to the third last used LZ77 distance. |
| 1+1+1+1+1 + len | LONGREP | An LZ77 sequence. Distance is equal to the fourth last used LZ77 distance. |
LONGREP refers to LONGREP packets, *REP refers to both LONGREP and SHORTREP, and *MATCH refers to both MATCH and *REP.
LONGREP packets remove the distance used from the list of the most recent distances and reinsert it at the front, to avoid useless repeated entry, while MATCH just adds the distance to the front even if already present in the list and SHORTREP and LONGREP don't alter the list.
The length is encoded as follows:
| Length code | Description |
| 0+ 3 bits | The length encoded using 3 bits, gives the lengths range from 2 to 9. |
| 1+0+ 3 bits | The length encoded using 3 bits, gives the lengths range from 10 to 17. |
| 1+1+ 8 bits | The length encoded using 8 bits, gives the lengths range from 18 to 273. |
As in LZ77, the length is not limited by the distance, because copying from the dictionary is defined as if the copy was performed byte by byte, keeping the distance constant.
Distances are logically 32-bit and distance 0 points to the most recently added byte in the dictionary.
The distance encoding starts with a 6-bit "distance slot", which determines how many further bits are needed.
Distances are decoded as a binary concatenation of, from most to least significant, two bits depending on the distance slot, some bits encoded with fixed 0.5 probability, and some context encoded bits, according to the following table.
| 6-bit distance slot | Highest 2 bits | Fixed 0.5 probability bits | Context encoded bits |
| 0 | 00 | 0 | 0 |
| 1 | 01 | 0 | 0 |
| 2 | 10 | 0 | 0 |
| 3 | 11 | 0 | 0 |
| 4 | 10 | 0 | 1 |
| 5 | 11 | 0 | 1 |
| 6 | 10 | 0 | 2 |
| 7 | 11 | 0 | 2 |
| 8 | 10 | 0 | 3 |
| 9 | 11 | 0 | 3 |
| 10 | 10 | 0 | 4 |
| 11 | 11 | 0 | 4 |
| 12 | 10 | 0 | 5 |
| 13 | 11 | 0 | 5 |
| 14–62 | 10 | slot / 2 − 5 | 4 |
| 15–63 | 11 | / 2 − 5 | 4 |
Decompression algorithm details
No complete natural language specification of the compressed format seems to exist, other than the one attempted in the following text.The description below is based on the compact XZ Embedded decoder by Lasse Collin included in the Linux kernel source from which the LZMA and LZMA2 algorithm details can be relatively easily deduced: thus, while citing source code as reference is not ideal, any programmer should be able to check the claims below with a few hours of work.
Range coding of bits
LZMA data is at the lowest level decoded one bit at a time by the range decoder, at the direction of the LZMA decoder.Context-based range decoding is invoked by the LZMA algorithm passing it a reference to the "context", which consists of the unsigned 11-bit variable prob representing the predicted probability of the bit being 0, which is read and updated by the range decoder.
Fixed probability range decoding instead assumes a 0.5 probability, but operates slightly differently from context-based range decoding.
The range decoder state consists of two unsigned 32-bit variables, range, and code.
Initialization of the range decoder consists of setting range to, and code to the 32-bit value starting at the second byte in the stream interpreted as big-endian; the first byte in the stream is completely ignored.
Normalization proceeds in this way:
- Shift both range and code left by 8 bits
- Read a byte from the compressed stream
- Set the least significant 8 bits of code to the byte value read
- If range is less than, perform normalization
- Set bound to
- If code is less than bound:
- # Set range to bound
- # Set prob to prob +
- # Return bit 0
- Otherwise :
- # Set range to range − bound
- # Set code to code − bound
- # Set prob to
- # Return bit 1
- If range is less than, perform normalization
- Set range to
- If code is less than range:
- # Return bit 0
- Otherwise :
- # Set code to code − range
- # Return bit 1
rc_direct, for performance reasons, does not include a conditional branch, but instead subtracts range from code unconditionally. The resulting sign bit is used to both decide the bit to return and to generate a mask that is combined with code and added to range.Note that:
- The division by when computing bound and floor operation is done before the multiplication, not after
- Fixed probability decoding is not strictly equivalent to context-based range decoding with any prob value, due to the fact that context-based range decoding discards the lower 11 bits of range before multiplying by prob as just described, while fixed probability decoding only discards the last bit
Range coding of integers
To decode unsigned integers less than limit, an array of 11-bit probability variables is provided, which are conceptually arranged as the internal nodes of a complete binary tree with limit leaves.
Non-reverse bit-tree decoding works by keeping a pointer to the tree of variables, which starts at the root. As long as the pointer does not point to a leaf, a bit is decoded using the variable indicated by the pointer, and the pointer is moved to either the left or right children depending on whether the bit is 0 or 1; when the pointer points to a leaf, the number associated with the leaf is returned.
Non-reverse bit-tree decoding thus happens from most significant to least significant bit, stopping when only one value in the valid range is possible.
Reverse bit-tree decoding instead decodes from least significant bit to most significant bits, and thus only supports ranges that are powers of two, and always decodes the same number of bits. It is equivalent to performing non-reverse bittree decoding with a power of two limit, and reversing the last bits of the result.
In the function in the Linux kernel, integers are actually returned in the range, and the variable at index 0 in the array is unused, while the one at index 1 is the root, and the left and right children indices are computed as 2i and 2i + 1. The function instead adds integers in the range to a caller-provided variable, where limit is implicitly represented by its logarithm, and has its own independent implementation for efficiency reasons.
Fixed probability integer decoding simply performs fixed probability bit decoding repeatedly, reading bits from the most to the least significant.