Lempel–Ziv–Welch
Lempel–Ziv–Welch (LZW) is a universal lossless data compression algorithm created by Abraham Lempel, Jacob Ziv, and Terry Welch. It was published by Welch in 1984 as an improvement to the LZ78 algorithm published by Lempel and Ziv in 1978. Its claimed advantages are that it is simple to implement and has the potential for high throughput in a hardware implementation.
A large English text file can typically be compressed via LZW to about half its original size.
The algorithm became the first widely used universal data compression method on computers. It was used in the compress program commonly included in Unix systems starting around 1986. It has since disappeared from many distributions, both because it infringed the LZW patent and because gzip produced better compression ratios using the LZ77-based DEFLATE algorithm. The algorithm found wide use when it became part of the GIF image format in 1987. It may optionally be used in TIFF and PDF files. Although LZW is available in Adobe Acrobat software, Acrobat by default uses DEFLATE for most text and color-table-based image data in PDF files.
Algorithm
The scenario described by Welch's 1984 paper encodes sequences of 8-bit data as fixed-length 12-bit codes. The codes from 0 to 255 represent 1-character sequences consisting of the corresponding 8-bit character, and the codes 256 through 4095 are created in a dictionary for sequences encountered in the data as it is encoded. At each stage in compression, input bytes are gathered into a sequence until the next character would make a sequence with no code yet in the dictionary. The code for the sequence (without that character) is added to the output, and a new code (for the sequence with that character) is added to the dictionary.

In the practical application of image compression and for an image based on a color table, a natural character alphabet is the set of color-table indexes. In the 1980s, many images had small color tables. For such a reduced alphabet, the full 12-bit codes yielded poor compression unless the image was large, so the idea of a variable-width code was introduced: codes typically start one bit wider than the symbols being encoded, and as the codes of each width are used up, the code width increases by 1 bit, up to some prescribed maximum. When the maximum code value is reached, encoding proceeds using the existing table, but new codes are not generated for addition to the table.
Further refinements include reserving a code to indicate that the code table should be cleared and restored to its initial state, and a code to indicate the end of data. The clear code lets the table be reinitialized after it fills up, which lets the encoding adapt to changing patterns in the input data. Smart encoders can monitor the compression efficiency and clear the table whenever the existing table no longer matches the input well.
Since codes are added in a manner determined by the data, the decoder mimics building the table as it sees the resulting codes. It is critical that the encoder and decoder agree on the variety of LZW used: the size of the alphabet, the maximum table size, whether variable-width encoding is used, initial code size, and whether to use the clear and stop codes. Most formats that employ LZW build this information into the format specification or provide explicit fields for them in a compression header for the data.
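As a concrete summary, the parameters both sides must share can be collected in one place. The sketch below is a hypothetical Python structure for illustration; no real format stores exactly these fields.

```python
from dataclasses import dataclass

@dataclass
class LZWParams:
    """Parameters an encoder and decoder must agree on (illustrative only)."""
    alphabet_size: int         # e.g. 256 for byte data, or the color-table size
    initial_code_width: int    # bits per code before any growth
    max_code_width: int        # the table stops growing past this width
    variable_width: bool       # growing code widths vs. fixed-width codes
    use_clear_code: bool       # reserve a code that resets the table
    use_stop_code: bool        # reserve a code that marks end of data
```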
Encoding
The encoding process can be described as:
1. Initialize the dictionary to contain all strings of length one.
2. Find the longest string W in the dictionary that matches the current input.
3. Emit the dictionary index for W to output and remove W from the input.
4. Add W followed by the next symbol in the input to the dictionary.
5. Go to step 2.
In this way, successively longer strings are registered in the dictionary and available for subsequent encoding as single output values. The algorithm works best on data with repeated patterns, so the initial parts of a message see little compression. As the message grows, however, the compression ratio tends asymptotically to the maximum.
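A minimal sketch of this loop in Python follows, assuming byte-oriented input and an unbounded dictionary; it returns plain integer codes rather than a packed bit stream, and the names are illustrative.

```python
def lzw_encode(data: bytes) -> list[int]:
    """LZW-encode a byte string into a list of integer codes (sketch)."""
    # Step 1: the dictionary starts with all 256 one-byte strings.
    dictionary = {bytes([i]): i for i in range(256)}
    next_code = 256

    w = b""        # longest current match (the string W in the steps above)
    output = []
    for byte in data:
        ws = w + bytes([byte])
        if ws in dictionary:
            w = ws                        # step 2: keep extending the match
        else:
            output.append(dictionary[w])  # step 3: emit the code for W
            dictionary[ws] = next_code    # step 4: add W + next symbol
            next_code += 1
            w = bytes([byte])             # restart matching at this symbol
    if w:
        output.append(dictionary[w])      # flush the final match
    return output
```

Note that the final match is flushed after the loop, since the emit step only fires when a mismatch occurs.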
Decoding
The decoding process can be described as:
1. Initialize the dictionary to contain all strings of length one.
2. Read the next encoded symbol.
3. If the symbol is not encoded in the dictionary, go to step 7.
4. Emit the corresponding string W to output.
5. Concatenate the previous string emitted to output with the first symbol of W; add this to the dictionary.
6. Go to step 9.
7. Concatenate the previous string emitted to output with its first symbol; call this string V.
8. Add V to the dictionary and emit V to output.
9. Repeat from step 2 until end of input.
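A matching decoder sketch in Python is below; the `elif` branch is the case of steps 7 and 8, where the encoder has just defined a code the decoder has not yet received. Names are illustrative, and the input is assumed to come from the encoder sketch above.

```python
def lzw_decode(codes: list[int]) -> bytes:
    """Decode integer codes from the encoder sketch above (sketch)."""
    if not codes:
        return b""
    # Step 1: the dictionary starts with all 256 one-byte strings.
    dictionary = {i: bytes([i]) for i in range(256)}
    next_code = 256

    it = iter(codes)
    prev = dictionary[next(it)]   # the first code is always a one-byte string
    output = [prev]
    for code in it:
        if code in dictionary:
            entry = dictionary[code]      # steps 3-4: known code
        elif code == next_code:
            entry = prev + prev[:1]       # steps 7-8: code not yet received
        else:
            raise ValueError("corrupt LZW stream")
        output.append(entry)
        # Step 5: previous string plus the first symbol of the new string.
        dictionary[next_code] = prev + entry[:1]
        next_code += 1
        prev = entry
    return b"".join(output)
```

With these two sketches, `lzw_decode(lzw_encode(data)) == data` holds for any byte string `data`.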
Variable-width codes
If variable-width codes are being used, the encoder and decoder must be careful to change the width at the same points in the encoded data so they don't disagree on boundaries between individual codes in the stream. In the standard version, the encoder increases the width from p to p + 1 when a sequence ω + s is encountered that is not in the table but the next available code in the table is 2ᵖ. The encoder emits the code for ω at width p, and then increases the code width so that the next code emitted is p + 1 bits wide.

The decoder is always one code behind the encoder in building the table, so when it sees the code for ω, it generates an entry for code 2ᵖ − 1. Since this is the point where the encoder increases the code width, the decoder must increase the width here as well, at the point where it generates the largest code that fits in p bits.
Unfortunately, some early implementations of the encoding algorithm increase the code width and then emit ω at the new width instead of the old width, so that to the decoder it looks like the width changes one code too early. This is called "early change"; it caused so much confusion that Adobe now allows both versions in PDF files, but includes an explicit flag in the header of each LZW-compressed stream to indicate whether early change is being used. Of the graphics file formats that support LZW compression, TIFF uses early change, while GIF and most others don't.
When the table is cleared in response to a clear code, both encoder and decoder change the code width after the clear code back to the initial code width, starting with the code immediately following the clear code.
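The width bookkeeping reduces to a few lines of code. The following Python sketch (illustrative names) derives the width of the next code from the next free dictionary slot, covering the standard timing, the early-change variant described above, and the reset after a clear code.

```python
def code_width(next_code: int, min_width: int, early_change: bool = False) -> int:
    """Bit width for the next code, given the next free dictionary slot.

    Standard timing: the width grows from p to p + 1 once the next free
    code reaches 2**p. Early change (as in TIFF) grows it one code sooner.
    After a clear code, the caller resets next_code, so the width falls
    back to min_width automatically.
    """
    n = next_code + 1 if early_change else next_code
    return max(min_width, n.bit_length())
```

For example, with `min_width = 9`, the standard width stays 9 through `next_code = 511` and becomes 10 at `next_code = 512`; with early change it becomes 10 already at 511.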
Packing order
Since the codes emitted typically do not fall on byte boundaries, the encoder and decoder must agree on how codes are packed into bytes. The two common methods are LSB-first and MSB-first. In LSB-first packing, the first code is aligned so that the least significant bit of the code falls in the least significant bit of the first stream byte, and if the code has more than 8 bits, the high-order bits left over are aligned with the least significant bits of the next byte; further codes are packed with their LSB going into the least significant bit not yet used in the current stream byte, proceeding into further bytes as necessary. MSB-first packing aligns the first code so that its most significant bit falls in the MSB of the first stream byte, with overflow aligned with the MSB of the next byte; further codes are written with their MSB going into the most significant bit not yet used in the current stream byte.

GIF files use LSB-first packing order; TIFF and PDF files use MSB-first packing order.
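A sketch of LSB-first packing in Python; the function and its (code, width) input format are made up for illustration. MSB-first packing mirrors it, shifting bits in from the top of the accumulator instead.

```python
def pack_lsb_first(codes: list[tuple[int, int]]) -> bytes:
    """Pack (code, width-in-bits) pairs LSB-first, as GIF does (sketch)."""
    out = bytearray()
    acc = 0      # bit accumulator
    nbits = 0    # number of valid bits currently in the accumulator
    for code, width in codes:
        acc |= code << nbits     # new code lands above the bits already queued
        nbits += width
        while nbits >= 8:
            out.append(acc & 0xFF)   # the least significant byte leaves first
            acc >>= 8
            nbits -= 8
    if nbits:
        out.append(acc & 0xFF)       # flush the remainder, zero-padded at the top
    return bytes(out)
```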
Further coding
Many applications extend the algorithm by applying further encoding to the sequence of output symbols. Some package the coded stream as printable characters using some form of binary-to-text encoding; this increases the encoded length and decreases the compression rate. Conversely, increased compression can often be achieved with an adaptive entropy encoder. Such a coder estimates the probability distribution for the value of the next symbol, based on the observed frequencies of values so far. A standard entropy encoding such as Huffman coding or arithmetic coding then uses shorter codes for values with higher probabilities.
Example
The following example illustrates the LZW algorithm in action, showing the status of the output and the dictionary at every stage, both in encoding and decoding the data. This example has been constructed to give reasonable compression on a very short message. In real text data, repetition is generally less pronounced, so longer input streams are typically necessary before the compression builds up efficiency.

The plaintext to be encoded is:
TOBEORNOTTOBEORTOBEORNOT#
There are 26 symbols in the plaintext alphabet (the capital letters A through Z). The character # is used to represent a stop code: a code outside the plaintext alphabet that triggers special handling. We arbitrarily assign the letters the values 1 through 26, and the stop code '#' the value 0.
A computer renders these as strings of bits. Five-bit codes are needed to give sufficient combinations to encompass this set of 27 values. The dictionary is initialized with these 27 values. As the dictionary grows, the codes must grow in width to accommodate the additional entries. A 5-bit code gives 2⁵ = 32 possible combinations of bits, so when the 33rd dictionary word is created, the algorithm must switch at that point from 5-bit strings to 6-bit strings. Note that since the all-zero code 00000 is used, and is labeled "0", the 33rd dictionary entry is labeled 32.
The initial dictionary, then, consists of the following entries:
| Symbol | Binary | Decimal |
| # | 00000 | 0 |
| A | 00001 | 1 |
| B | 00010 | 2 |
| C | 00011 | 3 |
| D | 00100 | 4 |
| E | 00101 | 5 |
| F | 00110 | 6 |
| G | 00111 | 7 |
| H | 01000 | 8 |
| I | 01001 | 9 |
| J | 01010 | 10 |
| K | 01011 | 11 |
| L | 01100 | 12 |
| M | 01101 | 13 |
| N | 01110 | 14 |
| O | 01111 | 15 |
| P | 10000 | 16 |
| Q | 10001 | 17 |
| R | 10010 | 18 |
| S | 10011 | 19 |
| T | 10100 | 20 |
| U | 10101 | 21 |
| V | 10110 | 22 |
| W | 10111 | 23 |
| X | 11000 | 24 |
| Y | 11001 | 25 |
| Z | 11010 | 26 |
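A short Python sketch of building this initial dictionary and deriving the starting code width; the identifiers are illustrative.

```python
from string import ascii_uppercase

# 0 is the stop code '#'; 1 through 26 are the letters A through Z.
alphabet = "#" + ascii_uppercase
dictionary = {ch: i for i, ch in enumerate(alphabet)}

# 27 entries need 5-bit codes, since 2**4 = 16 < 27 <= 32 = 2**5.
width = (len(dictionary) - 1).bit_length()
assert width == 5

# The width grows to 6 bits when entry 32 (the 33rd) is created,
# because 32 is the first code that does not fit in 5 bits.
```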