PDF/A
PDF/A is an ISO-standardized version of the Portable Document Format specialized for use in the archiving and long-term preservation of electronic documents. PDF/A differs from PDF by prohibiting features unsuitable for long-term archiving, such as font linking and encryption. The ISO requirements for PDF/A file viewers include color management guidelines, support for embedded fonts, and a user interface for reading embedded annotations.
Background
PDF is a standard for encoding documents in an "as printed" form that is portable between systems. However, the suitability of a PDF file for archival preservation depends on options chosen when the PDF is created: most notably, whether to embed the necessary fonts for rendering the document; whether to use encryption; and whether to preserve additional information from the original document beyond what is needed to print it.
PDF/A originated as a joint activity between the Association for Suppliers of Printing, Publishing and Converting Technologies (NPES) and the Association for Information and Image Management (AIIM), in conjunction with Adobe, to develop an international standard defining the use of the Portable Document Format for archiving documents. The goal was to address the growing need to archive documents electronically in a way that preserves their contents over an extended period of time and ensures that they can be retrieved and rendered with a consistent and predictable result in the future. This need exists in a wide variety of government, industry and academic areas worldwide, including legal systems, libraries, newspapers, and regulated industries.
Description
The PDF/A standard does not define an archiving strategy or the goals of an archiving system. It identifies a "profile" for electronic documents that ensures the documents can be reproduced exactly the same way using various software in years to come. A key element of this reproducibility is the requirement for PDF/A documents to be 100% self-contained. All of the information necessary for displaying the document in the same manner is embedded in the file. This includes, but is not limited to, all content, fonts, and color information. A PDF/A document is not permitted to be reliant on information from external sources, but may include annotations that link to external documents.
Other key elements of PDF/A conformance include:
- Audio and video content are forbidden.
- JavaScript and executable file launches are forbidden.
- All fonts must be embedded and also must be legally embeddable for unlimited, universal rendering. This also applies to the so-called PostScript standard fonts such as Times or Helvetica.
- Color spaces must be specified in a device-independent manner.
- Encryption is forbidden.
- Use of standards-based metadata is required.
- External content references are forbidden.
- LZW compression is forbidden due to intellectual property constraints. JPEG 2000 image compression is not allowed in PDF/A-1, which is based on PDF 1.4, because JPEG 2000 was first introduced in PDF 1.5. JPEG 2000 compression is allowed in PDF/A-2 and PDF/A-3.
- Transparent objects and layers are forbidden in PDF/A-1 but are allowed in PDF/A-2.
- Provisions for digital signatures in accordance with the PAdES standard are supported in PDF/A-2.
- Embedded files are forbidden in PDF/A-1, but PDF/A-2 allows embedding of PDF/A files, facilitating the archiving of sets of PDF/A documents in a single file. PDF/A-3 allows embedding of any file format such as XML, CAD and others into PDF/A documents.
- The use of XML Forms Architecture (XFA) forms, which are XML-based, is forbidden in PDF/A.
- Interactive PDF form fields must have an appearance dictionary associated with the field's data. The appearance dictionary shall be used when rendering the field.
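The version-dependent rules in the list above can be summarized as a small lookup table. The following is a minimal sketch, not a validator: the feature names and the `is_allowed` helper are hypothetical, and the table covers only parts 1 to 3 as described in this article (PDF/A-3 is treated as PDF/A-2 plus arbitrary embedded files).

```python
# Illustrative lookup of the rules listed above. Feature names are
# made up for this sketch; they are not identifiers from the standard.

# Features the list above describes as forbidden outright.
ALWAYS_FORBIDDEN = {
    "audio_video",
    "javascript",
    "encryption",
    "external_content",
    "lzw_compression",
    "xfa_forms",
}

# Features whose status depends on the part of the standard (1, 2 or 3).
VERSIONED = {
    "jpeg2000": {1: False, 2: True, 3: True},
    "transparency": {1: False, 2: True, 3: True},
    "embedded_pdfa_files": {1: False, 2: True, 3: True},
    "embedded_arbitrary_files": {1: False, 2: False, 3: True},
}


def is_allowed(feature: str, part: int) -> bool:
    """Return True if `feature` is permitted in PDF/A part `part` (1-3)."""
    if part not in (1, 2, 3):
        raise ValueError(f"this sketch only covers parts 1-3, got {part}")
    if feature in ALWAYS_FORBIDDEN:
        return False
    if feature not in VERSIONED:
        raise ValueError(f"unknown feature: {feature}")
    return VERSIONED[feature][part]
```

For example, `is_allowed("jpeg2000", 1)` is `False`, while `is_allowed("jpeg2000", 2)` is `True`.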
Conformance levels and versions
PDF/A-1
Part 1 of the standard was first published on September 28, 2005, and specifies two levels of conformance for PDF files:
- PDF/A-1b – Level B conformance
- PDF/A-1a – Level A conformance
Additional Level A requirements:
- Language specification
- Hierarchical document structure
- Tagged text spans and descriptive text for images and symbols
- Character mappings to Unicode
PDF/A-2
Part 2 of the standard, published on June 20, 2011, addresses some of the new features added with versions 1.5, 1.6 and 1.7 of the PDF Reference. PDF/A-1 files will not necessarily conform to PDF/A-2, and PDF/A-2 compliant files will not necessarily conform to PDF/A-1.
Part 2 of the PDF/A standard is based on PDF 1.7 rather than PDF 1.4 and offers several new features:
- JPEG 2000 image compression.
- support for transparency effects and layers.
- embedding of OpenType fonts.
- provisions for digital signatures in accordance with the PDF Advanced Electronic Signatures – PAdES standard.
- the option of embedding PDF/A files to facilitate archiving of sets of documents with a single file.
PDF/A-3
Part 3 of the standard, published on October 15, 2012, differs from PDF/A-2 in only one regard: it allows embedding of arbitrary file formats into PDF/A conforming documents.
PDF/A-4
Part 4 of the standard, based on PDF 2.0, was published in late 2020.
PDF/A-4 supports two additional conformance levels:
- PDF/A-4f – embedding of arbitrary files
- PDF/A-4e – Engineering conformance, extending PDF/A-4f to additionally support 3D, RichMedia, and JavaScript.
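Conformance labels such as "PDF/A-1b" or "PDF/A-4e" combine a part number with an optional level letter. The sketch below shows one way to split and check such a label. The `parse_label` function and `LEVELS` table are hypothetical helpers; the levels for parts 1 and 4 follow the text above, while the a, b and u levels for parts 2 and 3 are not listed in this section and are included here as an assumption based on the published standards.

```python
import re

# Levels accepted per part. An empty string means "no level letter",
# which this sketch accepts only for PDF/A-4.
LEVELS = {
    1: {"a", "b"},
    2: {"a", "b", "u"},  # assumption: not listed in the text above
    3: {"a", "b", "u"},  # assumption: not listed in the text above
    4: {"", "e", "f"},
}

_LABEL = re.compile(r"^PDF/A-(\d)([a-z]?)$", re.IGNORECASE)


def parse_label(label: str) -> tuple[int, str]:
    """Split a label like 'PDF/A-2u' into (part, level); raise ValueError if invalid."""
    match = _LABEL.match(label.strip())
    if match is None:
        raise ValueError(f"not a PDF/A label: {label!r}")
    part = int(match.group(1))
    level = match.group(2).lower()
    if level not in LEVELS.get(part, ()):
        raise ValueError(f"level {level!r} is not defined for part {part}")
    return part, level
```

With these tables, `parse_label("PDF/A-1b")` yields `(1, "b")`, and a label such as "PDF/A-1u" is rejected.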
How to create a PDF/A File
Archives sometimes ask their users to submit PDF/A files, and therefore provide information on how to convert files to PDF/A. Several conversion methods using standard software exist; they differ in computation time and in how well they preserve links, equations, vector graphics and special characters.
When documents are converted to PDF/A, visual inspection is needed because errors in the visual content are common. In one test sample, 11 percent of the produced PDF/A-1b documents contained visual artefacts. These reproducibility errors included vector graphics issues, loss of links, loss of other document content, updated fields and spelling errors. Archives thus usually do not convert to PDF/A themselves. Instead, some archives ask their users to provide a PDF/A document. Typical computer setups provide several methods for converting documents to PDF/A, each with different pros and cons.
Converting a simple PDF into a PDF/A-2 usually works as expected, except for problems with glyphs. According to the PDF Association, "Problems can occur before and/or during the generation of PDFs. A PDF/A file can be formally correct yet still have incorrect glyphs. Only a careful visual check can uncover this problem. Because generation problems also affect Unicode mapping, the problem attracts the attention when a visual check is carried out on the extracted text. In PDF/A, text/font usage is specified uniquely enough to ensure that it cannot be incorrect. If viewers or printers do not offer complete support for encoding systems, this can result in problems with regard to PDF/A." In other words, a document can be fully compliant with the standard and internally correct, while the system used to view or print it may still produce undesired results.
A document produced with optical character recognition (OCR) conversion into PDF/A-2 or PDF/A-3 does not support the notdefglyph flag. Therefore, this type of conversion can result in unrendered content.
PDF/A standard documents can be created with the following software: SoftMaker Office 2021, Microsoft Word 2010 and newer, Adobe Acrobat Distiller, PDF Creator, OpenOffice or LibreOffice since release 3.0, LaTeX with pdfx or pdfTeX addons, Typst, or by using a virtual PDF printer.
Identification
A PDF/A document can be identified as such through PDF/A-specific metadata. This metadata is only a claim of conformance and does not by itself ensure it:
- a PDF document can be PDF/A-compliant, except for its lack of PDF/A metadata. This may happen for instance with documents that were generated before the definition of the PDF/A standard, by authors aware of features that present long-term preservation issues.
- a PDF document can be identified as PDF/A, but may incorrectly contain PDF features not allowed in PDF/A; hence, documents which claim to be PDF/A-compliant should be tested for PDF/A compliance.
Validation
Validation of PDF/A documents attempts to reveal whether a produced file really is a PDF/A file. Unfortunately, PDF/A validators quite often disagree, since the interpretation of the PDF/A standards is not always clear.
Isartor Test Suite
Industry collaboration in the original PDF/A Competence Center led to the development of the Isartor Test Suite in 2007 and 2008. The test suite consists of 204 PDF files intentionally constructed to systematically fail each of the requirements for PDF/A-1b conformance, allowing developers to test the ability of their software to validate against the standard's most basic level of conformance. By mid-2009 the test suite had already made an appreciable difference in the general quality of PDF/A validation software.
veraPDF
The veraPDF consortium, led by the Open Preservation Foundation and the PDF Association, was created in response to the EU Commission's PREFORMA challenge to develop an open-source validator for the PDF/A format. The PDF Association launched the PDF Validation Technical Working Group in November 2014 to articulate a plan for developing an industry-supported PDF/A validator.
The veraPDF consortium subsequently won phase 2 of the PREFORMA contract in April 2015. Development continued throughout 2016, with phase 2 completed on schedule by December 2016. The phase 3 testing and acceptance period concluded in July 2017. veraPDF now covers
veraPDF is available for installation on Windows, macOS, or Linux using a PDFBox-based or "Greenfields" PDF parser.
PDF/A viewers
The PDF/A specification also states some requirements for a conforming PDF/A viewer, which must
- ignore any data that are not described by the PDF and PDF/A standards;
- ignore any linearization information provided by the file;
- only use the embedded fonts;
- only display using the embedded color profile;
- ensure that form fields do not change the rendered presentation and are rendered without regard to the form data;
- ensure that annotations are rendered consistently.
Reception
A PDF/A document must embed all fonts in use; accordingly, a PDF/A file will often be larger than an equivalent PDF file that does not include embedded fonts.
The use of transparency is forbidden in PDF/A-1. The majority of PDF generation tools that allow for PDF/A document compliance, such as the PDF export in OpenOffice.org or the PDF export tool in Microsoft Office 2007 suites, will also make any transparent images in a given document non-transparent. That restriction was removed in PDF/A-2.
Some archivists have voiced concerns that PDF/A-3, which allows arbitrary files to be embedded in PDF/A documents, could result in circumvention of memory institution procedures and restrictions on archived formats.
The PDF Association has addressed various misconceptions regarding PDF/A in its publication "PDF/A in a Nutshell 2.0".