PDF


Portable Document Format, standardized as ISO 32000, is a file format developed by Adobe in 1993 used to present documents, including text formatting and images, in a manner independent of application software, hardware, and operating systems. Based on the PostScript language, each PDF file encapsulates a complete description of a fixed-layout flat document, including the text, fonts, vector graphics, raster images and other information needed to display it. PDF has its roots in "The Camelot Project" initiated by Adobe co-founder John Warnock in 1991.
PDF was standardized as ISO 32000 in 2008. It is maintained by ISO TC 171 SC 2 WG8, of which the PDF Association is the committee manager. The latest edition, ISO 32000-2:2020, was published in December 2020.
PDF files may contain a variety of content besides flat text and graphics including logical structuring elements, interactive elements such as annotations and form-fields, layers, rich media, three-dimensional objects using U3D or PRC, and various other data formats. The PDF specification also provides for encryption and digital signatures, file attachments, and metadata to enable workflows requiring these features.

History

The development of PDF began in 1991 when John Warnock wrote a paper for a project then code-named Camelot, in which he proposed the creation of a simplified version of PostScript called Interchange PostScript (IPS). Unlike traditional PostScript, which was tightly focused on rendering print jobs to output devices, IPS would be optimized for displaying pages on any screen and any platform.
Adobe Systems made the PDF specification available free of charge in 1993. In the early years PDF was popular mainly in desktop publishing workflows, and competed with several other formats, including DjVu, Envoy, Common Ground Digital Paper, Farallon Replica and even Adobe's own PostScript format.
PDF was a proprietary format controlled by Adobe until it was released as an open standard on July 1, 2008, and published by the International Organization for Standardization as ISO 32000-1:2008, at which time control of the specification passed to an ISO Committee of volunteer industry experts. In 2008, Adobe published a Public Patent License to ISO 32000-1 granting royalty-free rights for all patents owned by Adobe necessary to make, use, sell, and distribute PDF-compliant implementations.
PDF 1.7, the sixth edition of the PDF specification that became ISO 32000-1, includes some proprietary technologies defined only by Adobe, such as Adobe XML Forms Architecture and JavaScript extension for Acrobat, which are referenced by ISO 32000-1 as normative and indispensable for the full implementation of the ISO 32000-1 specification. These proprietary technologies are not standardized, and their specification is published only on Adobe's website. Many of them are not supported by popular third-party implementations of PDF.
ISO published version 2.0 of PDF, ISO 32000-2, in 2017. It was available for purchase and replaced the free specification provided by Adobe. In December 2020, the second edition of PDF 2.0, ISO 32000-2:2020, was published, with clarifications, corrections, and critical updates to normative references.
In April 2023 the PDF Association made ISO 32000-2 available for download free of charge.

Technical details

A PDF file is often a combination of vector graphics, text, and bitmap graphics. The basic types of content in a PDF are:
  • Typeset text stored as content streams;
  • Vector graphics for illustrations and designs that consist of shapes and lines;
  • Raster graphics for photographs and other types of images; and
  • Other multimedia objects.
In later PDF revisions, a PDF document can also support links, forms, JavaScript, and other types of embedded content that can be handled using plug-ins.
PDF combines three technologies:
  • An equivalent subset of the PostScript page description programming language but in declarative form, for generating the layout and graphics.
  • A font-embedding/replacement system to allow fonts to travel with the documents.
  • A structured storage system to bundle these elements and any associated content into a single file, with data compression where appropriate.

    PostScript language

PostScript is a page description language run in an interpreter to generate an image. It can handle graphics and has standard features of programming languages such as branching and looping. PDF is a subset of PostScript, simplified to remove such control flow features, while graphics commands remain.
PostScript was originally designed for a drastically different use case: transmission of one-way linear print jobs in which the PostScript interpreter would collect a series of commands until it encountered the showpage command, then execute all the commands to render a page as a raster image to a printing device. PostScript was not intended for long-term storage and real-time interactive rendering of electronic documents to computer monitors, so there was no need to support anything other than consecutive rendering of pages. If there was an error in the final printed output, the user would correct it at the application level and send a new print job in the form of an entirely new PostScript file. Thus, any given page in a PostScript file could be rendered accurately only as the cumulative result of executing all preceding commands for all previous pages (any of which could affect subsequent pages) plus the commands for that particular page, and there was no easy way to bypass that process and skip to a different page.
Traditionally, to go from PostScript to PDF, a source PostScript file is used as the basis for generating PostScript-like PDF code. This is done by applying standard compiler techniques like loop unrolling, inlining and removing unused branches, resulting in code that is purely declarative and static. The result is then packaged into a container format, together with all necessary dependencies for correct rendering, and compressed. Modern applications write to printer drivers that directly generate PDF rather than going through PostScript first.
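As an illustration of the unrolling step described above, the following Python sketch runs a loop at conversion time and keeps only the flat sequence of drawing operators it produces. The function name and the drawing are hypothetical; m, l and S are the PDF path operators for move-to, line-to and stroke.

```python
def unroll_vertical_lines(count, spacing, height):
    """Expand a drawing loop into static, loop-free PDF path operators.

    Hypothetical helper: the loop runs here, at conversion time, so the
    output contains no control flow, only one line of operators per stroke.
    """
    operations = []
    for i in range(count):
        x = i * spacing
        # "x 0 m" moves to the start point, "x height l" adds a line, "S" strokes it.
        operations.append(f"{x} 0 m {x} {height} l S")
    return "\n".join(operations)


print(unroll_vertical_lines(3, 10, 100))
```

A consumer of the output only has to read operands and operators in order; it never needs to evaluate a loop or a branch.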
As a document format, PDF has several advantages over PostScript:
  • PDF contains only static declarative PostScript code that can be processed as data, and does not require a full program interpreter or compiler. This avoids the complexity and security risks of a full language engine.
  • Like Display PostScript, PDF has supported transparent graphics since version 1.4, while standard PostScript does not.
  • PDF enforces the rule that the code for any particular page cannot affect any other pages. That rule is strongly recommended for PostScript code too, but has to be implemented explicitly, as PostScript is a full programming language that allows far greater flexibility and is not limited to the concepts of pages and documents.
  • All data required for rendering is included within the file itself, improving portability.
Its disadvantages are:
Since version 1.6, PDF has supported embedding of interactive 3D documents: 3D drawings can be embedded using U3D, PRC, and various other data formats.

File format

A PDF file is organized using ASCII characters, except for certain elements that may have binary content.
The file starts with a header containing a magic number and the version of the format, for example %PDF-1.7. The format is a subset of a COS ("Carousel" Object Structure) format. A COS tree file consists primarily of objects, of which there are nine types:
  • Boolean values, representing true or false
  • Real numbers
  • Integers
  • Strings, enclosed within parentheses or represented as hexadecimal within single angle brackets. Strings may contain 8-bit characters.
  • Names, starting with a forward slash
  • Arrays, ordered collections of objects enclosed within square brackets
  • Dictionaries, collections of objects indexed by names enclosed within double angle brackets
  • Streams, usually containing large amounts of optionally compressed binary data, preceded by a dictionary and enclosed between the stream and endstream keywords.
  • The null object
Comments using 8-bit characters prefixed with the percent sign may be inserted.
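The object syntax listed above can be illustrated with a small serializer that maps Python values to their PDF text form. This is a minimal sketch under stated assumptions: the Name class and the serialize function are hypothetical helpers, streams are left out, names with special characters are not escaped, and reals are written with a fixed four decimal places.

```python
class Name(str):
    """Hypothetical marker type: a string to be written as a PDF name."""


def serialize(obj):
    """Return the PDF text form of a Python value (streams not handled)."""
    if obj is None:
        return "null"
    if isinstance(obj, bool):  # must be tested before int, since bool is an int
        return "true" if obj else "false"
    if isinstance(obj, Name):  # must be tested before str
        return "/" + obj
    if isinstance(obj, int):
        return str(obj)
    if isinstance(obj, float):
        return f"{obj:.4f}"  # fixed-point text, no exponent notation
    if isinstance(obj, str):
        escaped = obj.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
        return "(" + escaped + ")"  # literal string in parentheses
    if isinstance(obj, bytes):
        return "<" + obj.hex().upper() + ">"  # hexadecimal string
    if isinstance(obj, list):
        return "[" + " ".join(serialize(item) for item in obj) + "]"
    if isinstance(obj, dict):
        pairs = " ".join(f"/{key} {serialize(value)}" for key, value in obj.items())
        return "<< " + pairs + " >>"
    raise TypeError(f"unsupported type: {type(obj).__name__}")


print(serialize({"Type": Name("Catalog"), "Count": 3}))
```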
Objects may be either direct or indirect. Indirect objects are numbered with an object number and a generation number and defined between the obj and endobj keywords if residing in the document root. Beginning with PDF version 1.5, indirect objects may also be located in special streams known as object streams. This technique enables non-stream objects to have standard stream filters applied to them, reduces the size of files that have large numbers of small indirect objects and is especially useful for Tagged PDF. Object streams do not support specifying an object's generation number.
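The obj and endobj framing of indirect objects can be sketched with a regular expression over raw bytes. This is a simplified illustration: the function name is hypothetical, and a real reader would locate objects through the cross-reference data rather than by scanning, because stream contents may themselves contain the bytes "endobj".

```python
import re

# object number, generation number, "obj", body, "endobj"
INDIRECT_OBJECT = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)


def find_indirect_objects(data):
    """Map (object number, generation number) to the raw object body."""
    return {
        (int(match.group(1)), int(match.group(2))): match.group(3).strip()
        for match in INDIRECT_OBJECT.finditer(data)
    }


sample = (
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages /Count 0 /Kids [] >>\nendobj\n"
)
objects = find_indirect_objects(sample)
print(objects[(1, 0)])
```

The "2 0 R" inside the first object is an indirect reference: it names object 2, generation 0, without embedding it.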
An index table, also called the cross-reference table, is located near the end of the file and gives the byte offset of each indirect object from the start of the file. This design allows for efficient random access to the objects in the file, and also allows for small changes to be made without rewriting the entire file. Before PDF version 1.5, the table would always be in a special ASCII format, be marked with the xref keyword, and follow the main body composed of indirect objects. Version 1.5 introduced optional cross-reference streams, which have the form of a standard stream object, possibly with filters applied. Such a stream may be used instead of the ASCII cross-reference table and contains the offsets and other information in binary format. The format is flexible in that it allows for integer width specification, so that for example, a document not exceeding 64 KiB in size may dedicate only 2 bytes for object offsets. To ensure backward compatibility, a hybrid-reference PDF file may include both traditional cross-reference tables and cross-reference streams, allowing older PDF processors to read the file while still taking advantage of the new features introduced in version 1.5.
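A classic ASCII cross-reference table can be read with a few lines of Python. The sketch below assumes the conventional layout (the xref keyword, then subsections that each start with a first object number and a count, then one entry per object holding a 10-digit offset, a 5-digit generation number and an n or f flag). The function names are hypothetical, and min_offset_width only illustrates the integer-width arithmetic for cross-reference streams mentioned above.

```python
def parse_xref_table(text):
    """Return {object number: (byte offset, generation)} for in-use entries."""
    lines = [line.strip() for line in text.strip().splitlines()]
    if not lines or lines[0] != "xref":
        raise ValueError("expected the 'xref' keyword")
    entries = {}
    i = 1
    while i < len(lines):
        header = lines[i].split()
        if len(header) != 2:
            break  # reached the trailer or other content
        first, count = int(header[0]), int(header[1])
        for n in range(count):
            offset, generation, kind = lines[i + 1 + n].split()
            if kind == "n":  # "n" marks an object in use, "f" a free entry
                entries[first + n] = (int(offset), int(generation))
        i += 1 + count
    return entries


def min_offset_width(max_offset):
    """Smallest number of bytes that can hold max_offset."""
    return max(1, (max_offset.bit_length() + 7) // 8)


SAMPLE = (
    "xref\n"
    "0 3\n"
    "0000000000 65535 f \n"
    "0000000009 00000 n \n"
    "0000000058 00000 n \n"
    "trailer\n"
)
print(parse_xref_table(SAMPLE))
print(min_offset_width(65535))
```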
At the end of a PDF file is a footer containing:
  • The startxref keyword followed by an offset to the start of the cross-reference table or the cross-reference stream object, followed by
  • The %%EOF end-of-file marker.
If a cross-reference stream is not being used, the footer is preceded by the trailer keyword followed by a dictionary containing information that would otherwise be contained in the cross-reference stream object's dictionary:
  • A reference to the root object of the tree structure, also known as the catalog
  • The count of indirect objects in the cross-reference table
  • Other optional information
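The pieces described so far (header, indirect objects, cross-reference table, trailer and footer) can be put together in a short sketch that writes a one-page file in memory and then reads the startxref offset back. Both function names are hypothetical, the page is empty, and the result is meant to show the layout rather than to satisfy every viewer.

```python
def build_minimal_pdf():
    """Assemble a tiny PDF with a classic cross-reference table."""
    bodies = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] /Resources << >> >>",
    ]
    out = bytearray(b"%PDF-1.7\n")  # header: magic number and version
    offsets = []
    for number, body in enumerate(bodies, start=1):
        offsets.append(len(out))  # byte offset of this object from file start
        out += b"%d 0 obj\n%s\nendobj\n" % (number, body)
    xref_position = len(out)
    out += b"xref\n0 %d\n" % (len(bodies) + 1)
    out += b"0000000000 65535 f \n"  # entry for object 0, always free
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(bodies) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_position
    return bytes(out)


def read_startxref(data):
    """Return the offset that follows the last startxref keyword."""
    tail = data[-1024:]  # the footer sits at the end of the file
    eof = tail.rfind(b"%%EOF")
    start = tail.rfind(b"startxref", 0, max(eof, 0))
    if eof == -1 or start == -1:
        raise ValueError("footer not found")
    return int(tail[start + len(b"startxref"):eof].split()[0])


pdf = build_minimal_pdf()
position = read_startxref(pdf)
print(pdf[position:position + 4])
```

Reading from the end first is what makes random access possible: the footer points to the cross-reference table, and the table points to every object.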
Within each page, there are one or more content streams that describe the text, vector graphics and images being drawn on the page. The content stream is stack-based, similar to PostScript.
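The stack-based nature of a content stream can be shown with a toy reader: operands are pushed until an operator appears, and the operator then consumes them. This is a simplified sketch with a hypothetical function name; it handles only numbers, names and simple parenthesized strings, and ignores arrays, dictionaries and inline images.

```python
import re

# A parenthesized string (with backslash escapes) or a run of non-space characters.
TOKEN = re.compile(r"\((?:\\.|[^\\()])*\)|\S+")
NUMBER = re.compile(r"[+-]?(?:\d+\.?\d*|\.\d+)")


def run_content_stream(source):
    """Return a list of (operator, operands) pairs in drawing order."""
    operations, stack = [], []
    for token in TOKEN.findall(source):
        if token[0] in "/(":  # name or string operand
            stack.append(token)
        elif NUMBER.fullmatch(token):  # numeric operand
            stack.append(float(token) if "." in token else int(token))
        else:  # anything else is treated as an operator
            operations.append((token, stack))
            stack = []
    return operations


for operator, operands in run_content_stream("BT /F1 12 Tf 72 712 Td (Hello) Tj ET"):
    print(operator, operands)
```

In the sample, Tf selects font /F1 at size 12, Td moves the text position, and Tj shows the string, all between the BT and ET text-object markers.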
PDF files come in two layouts: non-linearized and linearized. Non-linearized PDF files can be smaller than their linearized counterparts, though they are slower to access because portions of the data required to assemble pages of the document are scattered throughout the file. Linearized PDF files are constructed in a manner that enables them to be read in a Web browser plugin without waiting for the entire file to download, since all objects required to display the first page are organized at the start of the file. PDF files may be optimized using Adobe Acrobat software or QPDF.
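A rough way to tell the two layouts apart is to look for the linearization parameter dictionary, which carries a /Linearized key and sits at the very start of a linearized file. The check below is a heuristic sketch with a hypothetical function name; it does not verify that the rest of the file actually follows the linearized layout.

```python
def is_probably_linearized(data):
    """Heuristic: look for the /Linearized key near the start of the file."""
    return data.startswith(b"%PDF-") and b"/Linearized" in data[:1024]


linearized = b"%PDF-1.7\n1 0 obj\n<< /Linearized 1 /L 1234 >>\nendobj\n"
plain = b"%PDF-1.7\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
print(is_probably_linearized(linearized), is_probably_linearized(plain))
```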
Page dimensions are not limited by the format itself. However, Adobe Acrobat imposes a limit of 15 million by 15 million inches, a square about 381 km on a side with an area slightly larger than Tajikistan.
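The size of that limit follows from the definition of the inch as exactly 0.0254 m. The short calculation below uses a hypothetical helper name and gives an area of about 145,161 square kilometres.

```python
INCH_IN_METRES = 0.0254  # exact by definition


def square_area_km2(side_inches):
    """Area in square kilometres of a square with the given side in inches."""
    side_km = side_inches * INCH_IN_METRES / 1000
    return side_km ** 2


area = square_area_km2(15_000_000)
print(f"{area:,.0f} km2")
```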