ZIP (file format)
ZIP is an archive file format that supports lossless data compression. A ZIP file may contain one or more files or directories that may have been compressed. The ZIP file format permits a number of compression algorithms, though DEFLATE is the most common.
The format was created in 1989 and first implemented in PKWARE, Inc.'s PKZIP utility, as a replacement for the earlier ARC compression format by Thom Henderson. Many software utilities other than PKZIP quickly added support for it. ZIP has since gained native support on major operating systems, including Microsoft Windows, Apple macOS, Linux, FreeBSD, Oracle Solaris, Android and iOS.
ZIP files generally use the file extension .zip or .ZIP and the MIME media type application/zip. ZIP is used as a base file format by many programs, usually under a different name. When navigating a file system via a user interface, graphical icons representing ZIP files often appear as a document or other object prominently featuring a zipper.
History
The file format was designed by Phil Katz of PKWARE and Gary Conway of Infinity Design Concepts. The format was created after Systems Enhancement Associates (SEA) filed a lawsuit against PKWARE claiming that the latter's archiving products, named PKARC, were derivatives of SEA's ARC archiving system. The name "zip" was suggested by Katz's friend, Robert Mahoney. They wanted to imply that their product would be faster than ARC and other compression formats of the time. Because the zip file format was distributed within APPNOTE.TXT, compatibility with the format proliferated widely on the public Internet during the 1990s.
PKWARE and Infinity Design Concepts made a joint press release on February 14, 1989, releasing the file format into the public domain.
Version history
The .ZIP File Format Specification has its own version number, which does not necessarily correspond to the version numbers for the PKZIP tool, especially with PKZIP 6 or later. At various times, PKWARE has added preliminary features that allow PKZIP products to extract archives using advanced features, but PKZIP products that create such archives are not made available until the next major release. Other companies or organizations support the PKWARE specifications at their own pace.
The .ZIP file format specification is formally named "APPNOTE - .ZIP File Format Specification" and has been published on the PKWARE.com website since the late 1990s. Several versions of the specification were not published. Specifications of some features, such as BZIP2 compression and strong encryption, were published by PKWARE a few years after their creation. The URL of the online specification has changed several times on the PKWARE website.
A summary of key advances in various versions of the PKWARE software and/or specification:
- 2.0: File entries can be compressed with DEFLATE and use traditional PKWARE encryption.
- 2.1: Deflate64 compression support. APPNOTE may not have been published for 2.1.
- 2.5: PKWARE DCL Implode compression. APPNOTE may not have been published for 2.5.
- 2.5: Deflate64 compression support
- 4.0: Deflate64 compression support.
- 4.5: Documented 64-bit zip format.
- 4.6: BZIP2 compression.
- 5.0: SES: DES, Triple DES, RC2, RC4 supported for encryption.
- 5.2: AES encryption support for SES and AES from WinZip; corrected version of RC2-64 supported for SES encryption.
- 6.1: Documented certificate storage.
- 6.2.0: Documented Central Directory Encryption.
- 6.3.0: Documented Unicode filename storage. Expanded list of supported compression algorithms, encryption algorithms, and hashes.
- 6.3.1: Corrected standard hash values for SHA-256/384/512.
- 6.3.2: Documented compression method 97.
- 6.3.3: Document formatting changes to facilitate referencing the PKWARE Application Note from other standards using methods such as the JTC 1 Referencing Explanatory Report as directed by JTC 1/SC 34 N 1621.
- 6.3.4: Updates the PKWARE, Inc. office address.
- 6.3.5: Documented compression methods 16, 96 and 99, and the DOS timestamp epoch and precision; added extra fields for keys and decryption; fixed typos and added clarifications.
- 6.3.6: Corrected typographical error.
- 6.3.7: Added Zstandard compression method ID 20.
- 6.3.8: Moved Zstandard compression method ID from 20 to 93, deprecating the former. Documented method IDs 94 and 95.
- 6.3.9: Corrected a typo in Data Stream Alignment description.
- 6.3.10: Added several z/OS attribute values for APPENDIX B. Added several additional 3rd party Extra Field mappings.
Standardization
In April 2010, ISO/IEC JTC 1 initiated a ballot to determine whether a project should be initiated to create an ISO/IEC International Standard format compatible with ZIP. The proposed project, entitled Document Packaging, envisaged a ZIP-compatible 'minimal compressed archive format' suitable for use with a number of existing standards including OpenDocument, Office Open XML and EPUB. It was meant to address the need for a formal standard, the variety of extensions of ZIP, the undesirability of a technology used for Open Standards potentially having proprietary extensions or "submarine" patents, and the need for better internationalization, while avoiding further fragmentation of the technology through an alternative specification to the PKWARE APPNOTE document.
In 2015, ISO/IEC 21320-1 "Document Container File — Part 1: Core" was published. It states that "Document container files are conforming Zip files" and normatively references the PKWARE APPNOTE document. It imposes the following main restrictions on the ZIP file format:
- Files in ZIP archives may only be stored uncompressed, or using the "deflate" compression. The patent on the core "deflate" compression method expired in late 2010.
- The encryption features are prohibited.
- The digital signature features are prohibited.
- The "patched data" features are prohibited.
- Archives may not span multiple volumes or be segmented.
Design
A directory is placed at the end of a ZIP file. It identifies which files are in the ZIP and where in the ZIP each file is located. This allows ZIP readers to load the list of files without reading the entire ZIP archive. ZIP archives can also include extra data that is not related to the ZIP archive. This allows a ZIP archive to be made into a self-extracting archive, by prepending the program code to the ZIP archive and marking the file as executable. Storing the catalog at the end also makes it possible to hide a zipped file by appending it to an innocuous file, such as a GIF image file.
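The effect of keeping the directory at the end can be sketched with Python's standard zipfile module. The stub bytes below are a hypothetical stand-in for an extractor program, not a working self-extractor. The sketch relies on zipfile locating the directory from the end of the data and compensating for the prepended bytes.

```python
import io
import zipfile

# Build a small archive in memory.
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("hello.txt", "hello from inside the archive\n")

# Hypothetical stand-in for an extractor program; any leading bytes would do.
stub = b"#!/bin/sh\necho 'placeholder for an extractor stub'\nexit 0\n"
combined = io.BytesIO(stub + archive.getvalue())

# The reader finds the directory from the end, so the prefix does not matter.
with zipfile.ZipFile(combined) as zf:
    names = zf.namelist()
    text = zf.read("hello.txt").decode("ascii")

print(names, repr(text))
```

The same mechanism is what lets an archive be appended to an unrelated file such as an image.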
The format uses CRC-32 and includes two copies of each entry's metadata to provide greater protection against data loss. The CRC-32 algorithm was contributed by David Schwaderer and can be found in his book "C Programmers Guide to NetBIOS" published by Howard W. Sams & Co. Inc.
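As an illustration of how the checksum is used, the following sketch compares the CRC-32 stored for an entry with one recomputed from the extracted data. It uses zlib.crc32 from Python's standard library as a stand-in. It is not the implementation from the book cited above.

```python
import io
import zipfile
import zlib

payload = b"The quick brown fox jumps over the lazy dog"

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("fox.txt", payload)

with zipfile.ZipFile(buf) as zf:
    stored_crc = zf.getinfo("fox.txt").CRC            # value recorded in the archive
    computed_crc = zlib.crc32(zf.read("fox.txt")) & 0xFFFFFFFF
    first_bad = zf.testzip()                          # None when every entry passes

print(hex(stored_crc), stored_crc == computed_crc, first_bad)
```

A mismatch between the stored and recomputed values indicates that the entry's data was damaged after the archive was written.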
Structure
A ZIP file is correctly identified by the presence of an end of central directory record, which is located at the end of the archive structure so that new files can be appended easily. If the end of central directory record indicates a non-empty archive, the name of each file or directory within the archive should be specified in a central directory entry, along with other metadata about the entry, and an offset into the ZIP file, pointing to the actual entry data. This allows a file listing of the archive to be performed relatively quickly, as the entire archive does not have to be read to see the list of files. The entries within the ZIP file also include this information, for redundancy, in a local file header. Because ZIP files may be appended to, only files specified in the central directory at the end of the file are valid. Scanning a ZIP file for local file headers is invalid, as the central directory may declare that some files have been deleted and other files have been updated.
For example, we may start with a ZIP file that contains files A, B and C. File B is then deleted and C updated. This may be achieved by just appending a new file C to the end of the original ZIP file and adding a new central directory that only lists file A and the new file C.
When ZIP was first designed, transferring files by floppy disk was common, yet writing to disks was very time-consuming. For a large ZIP file, possibly spanning multiple disks, in which only a few files needed updating, it was substantially faster to read the old central directory, append the new files and then append an updated central directory than to read and rewrite all the files.
The order of the file entries in the central directory need not coincide with the order of file entries in the archive.
Each entry stored in a ZIP archive is introduced by a local file header with information about the file such as the compression method, file size and file name, followed by optional "extra" data fields, and then the possibly compressed, possibly encrypted file data. The "extra" data fields are the key to the extensibility of the ZIP format. They are used to support the ZIP64 format, WinZip-compatible AES encryption, file attributes, and higher-resolution NTFS or Unix file timestamps. Other extensions are possible via the "extra" field. ZIP tools are required by the specification to ignore extra fields they do not recognize.
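A minimal sketch of how a reader might walk the extra data follows. It assumes the layout given in APPNOTE: each field is a 2-byte header ID, a 2-byte data size, then the data, all little-endian. The function name parse_extra and the ID 0xCAFE are hypothetical. 0x0001 is the ID that APPNOTE assigns to the ZIP64 field.

```python
import struct

def parse_extra(extra: bytes):
    """Split an extra-field blob into (header_id, data) pairs."""
    fields = []
    pos = 0
    while pos + 4 <= len(extra):
        header_id, size = struct.unpack_from("<HH", extra, pos)
        pos += 4
        fields.append((header_id, extra[pos:pos + size]))
        pos += size
    return fields

# Hypothetical blob: a made-up field (ID 0xCAFE) that a reader would skip,
# followed by a ZIP64 field carrying an uncompressed and a compressed size.
blob = (struct.pack("<HH", 0xCAFE, 3) + b"abc"
        + struct.pack("<HHQQ", 0x0001, 16, 5_000_000_000, 4_000_000_000))

known = {0x0001: "ZIP64"}
# Unrecognized IDs are ignored rather than treated as errors.
recognized = [(known[i], d) for i, d in parse_extra(blob) if i in known]

print(recognized[0][0], struct.unpack("<QQ", recognized[0][1]))
```

Because every field carries its own length, a reader can step over a field without understanding its contents.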
The ZIP format uses specific 4-byte "signatures" to denote the various structures in the file. Each file entry is marked by a specific signature. The end of central directory record is indicated with its specific signature, and each entry in the central directory starts with the 4-byte central file header signature.
There is no BOF or EOF marker in the ZIP specification. Conventionally the first thing in a ZIP file is a ZIP entry, which can be identified easily by its local file header signature. However, the ZIP specification does not require this; most notably, a self-extracting archive will begin with an executable file header.
Tools that correctly read ZIP archives must scan for the end of central directory record signature, and then, as appropriate, the other central directory records that it indicates. They must not scan for entries from the top of the ZIP file, because only the central directory specifies where a file chunk starts and that it has not been deleted. Scanning could lead to false positives, as the format neither forbids other data between chunks nor prevents file data streams from containing such signatures. However, tools that attempt to recover data from damaged ZIP archives will most likely scan the archive for local file header signatures; this is made more difficult by the fact that the compressed size of a file chunk may be stored after the file chunk, making sequential processing difficult.
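The reading procedure described above can be sketched as follows. This is a simplified illustration, not a complete reader. It assumes a single-volume archive with no ZIP64 records, no data prepended to the archive, and a comment that does not itself contain the signature bytes. The record layouts follow APPNOTE, and the name list_entries is hypothetical.

```python
import io
import struct
import zipfile

EOCD_SIG = b"PK\x05\x06"
CDH_SIG = b"PK\x01\x02"
EOCD = struct.Struct("<4sHHHHIIH")            # end of central directory, 22 bytes
CDH = struct.Struct("<4sHHHHHHIIIHHHHHII")    # central file header, 46 bytes

def list_entries(data: bytes):
    """Return (name, local_header_offset) pairs read from the central directory."""
    # Search backwards: the record sits at the end, possibly before a comment.
    pos = data.rfind(EOCD_SIG)
    if pos < 0 or pos + EOCD.size > len(data):
        raise ValueError("end of central directory record not found")
    total, cd_offset = EOCD.unpack_from(data, pos)[4], EOCD.unpack_from(data, pos)[6]
    entries = []
    cur = cd_offset
    for _ in range(total):
        fields = CDH.unpack_from(data, cur)
        if fields[0] != CDH_SIG:
            raise ValueError("bad central file header signature")
        name_len, extra_len, comment_len = fields[10:13]
        local_offset = fields[16]
        name = data[cur + CDH.size:cur + CDH.size + name_len]
        entries.append((name.decode("utf-8"), local_offset))
        # Variable-length name, extra field and comment follow the fixed part.
        cur += CDH.size + name_len + extra_len + comment_len
    return entries

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.txt", "A")
    zf.writestr("b.txt", "B")

entries = list_entries(buf.getvalue())
print(entries)
```

Each offset returned points at a local file header, which is the only reliable way to know where an entry's data begins.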
Most of the signatures end with the short integer 0x4b50, which is stored in little-endian ordering. Viewed as an ASCII string this reads "PK", the initials of the inventor Phil Katz. Thus, when a ZIP file is viewed in a text editor the first two bytes of the file are usually "PK".
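The byte order can be checked directly. The three signature values below are taken from APPNOTE; the dictionary and variable names are only for illustration.

```python
import struct

SIGNATURES = {
    "local file header": 0x04034B50,
    "central file header": 0x02014B50,
    "end of central directory record": 0x06054B50,
}

# Little-endian packing puts the low-order bytes 0x50 ('P') and 0x4B ('K') first.
on_disk = {name: struct.pack("<I", value) for name, value in SIGNATURES.items()}
low_half = struct.pack("<H", 0x4B50)

for name, raw in on_disk.items():
    print(f"{name}: {raw!r}")
```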
The specification also supports spreading archives across multiple file-system files. Originally intended for storage of large ZIP files across multiple floppy disks, this feature is now used for sending ZIP archives in parts over email, or over other transports or removable media.
The FAT filesystem of DOS has a timestamp resolution of only two seconds; ZIP file records mimic this. As a result, the built-in timestamp resolution of files in a ZIP archive is only two seconds, though extra fields can be used to store more precise timestamps. The ZIP format has no notion of time zone, so timestamps are only meaningful if it is known what time zone they were created in.
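The two-second resolution follows from the packed DOS representation, in which only five bits are available for seconds. A minimal sketch of the encoding, with hypothetical helper names to_dos and from_dos, and with no time zone handling because the format stores none:

```python
from datetime import datetime

def to_dos(dt: datetime):
    """Pack a naive datetime into the 16-bit DOS date and time words."""
    date = ((dt.year - 1980) << 9) | (dt.month << 5) | dt.day
    time = (dt.hour << 11) | (dt.minute << 5) | (dt.second // 2)  # seconds halved
    return date, time

def from_dos(date: int, time: int) -> datetime:
    """Unpack DOS date and time words; odd seconds cannot be recovered."""
    return datetime(
        1980 + (date >> 9), (date >> 5) & 0x0F, date & 0x1F,
        time >> 11, (time >> 5) & 0x3F, (time & 0x1F) * 2,
    )

original = datetime(2006, 9, 29, 13, 45, 31)
restored = from_dos(*to_dos(original))
print(original, "->", restored)  # the odd second is rounded down to 30
```

Years are stored as an offset from 1980, so earlier dates cannot be represented in the built-in fields.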
In September 2006, PKWARE released a revision of the ZIP specification providing for the storage of file names using UTF-8, finally adding Unicode compatibility to ZIP.
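In APPNOTE, UTF-8 names are signalled by bit 11 (0x800) of an entry's general purpose bit flag. The sketch below uses Python's zipfile module to observe this. It assumes, as in current CPython, that the module sets the bit only when a name cannot be encoded as ASCII.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11: name and comment are UTF-8

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("plain.txt", "x")
    zf.writestr("na\u00efve.txt", "y")  # contains a non-ASCII character

with zipfile.ZipFile(buf) as zf:
    flags = {info.filename: bool(info.flag_bits & UTF8_FLAG)
             for info in zf.infolist()}

print(flags)
```

Readers that see the bit unset generally fall back to a legacy code page when decoding the name.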