Digital preservation


In library and archival science, digital preservation is a formal process to ensure that digital information of continuing value remains accessible and usable in the long term. It involves planning, resource allocation, and application of preservation methods and technologies, and combines policies, strategies and actions to ensure access to reformatted and "born-digital" content, regardless of the challenges of media failure and technological change. The goal of digital preservation is the accurate rendering of authenticated content over time.
The Association for Library Collections and Technical Services Preservation and Reformatting Section of the American Library Association defined digital preservation as combination of "policies, strategies and actions that ensure access to digital content over time." According to the Harrod's Librarian Glossary, digital preservation is the method of keeping digital material alive so that they remain usable as technological advances render original hardware and software specification obsolete.
The necessity for digital preservation mainly arises because of the relatively short lifespan of digital media. Widely used hard drives can become unusable in a few years due to a variety of reasons such as damaged spindle motors, and flash memory can start to lose data around a year after its last use, depending on its storage temperature and how much data has been written to it during its lifetime. Currently, archival disc-based media is available, but it is only designed to last for 50 years and it is a proprietary format, sold by just two Japanese companies, Sony and Panasonic. M-DISC is a DVD-based format that claims to retain data for 1,000 years, but writing to it requires special optical disc drives and reading the data it contains requires increasingly uncommon optical disc drives, in addition the company behind the format went bankrupt. Data stored on LTO tapes require periodic migration, as older tapes cannot be read by newer LTO tape drives. RAID arrays could be used to protect against failure of single hard drives, although care needs to be taken to not mix the drives of one array with those of another.

Fundamentals

Appraisal

refers to the process of identifying records and other materials to be preserved by determining their permanent value. Several factors are usually considered when making this decision. It is a difficult and critical process because the remaining selected records will shape researchers' understanding of that body of records, or fonds. Appraisal is identified as A4.2 within the Chain of Preservation model created by the InterPARES 2 project. Archival appraisal is not the same as monetary appraisal, which determines fair market value.
Archival appraisal may be performed once or at the various stages of acquisition and processing. Macro appraisal, a functional analysis of records at a high level, may be performed even before the records have been acquired to determine which records to acquire. More detailed, iterative appraisal may be performed while the records are being processed.
Appraisal is performed on all archival materials, not just digital. It has been proposed that, in the digital context, it might be desirable to retain more records than have traditionally been retained after appraisal of analog records, primarily due to a combination of the declining cost of storage and the availability of sophisticated discovery tools which will allow researchers to find value in records of low information density. In the analog context, these records may have been discarded or only a representative sample kept. However, the selection, appraisal, and prioritization of materials must be carefully considered in relation to the ability of an organization to responsibly manage the totality of these materials.
Often libraries, and to a lesser extent, archives, are offered the same materials in several different digital or analog formats. They prefer to select the format that they feel has the greatest potential for long-term preservation of the content. The Library of Congress has created a set of recommended formats for long-term preservation. They would be used, for example, if the Library was offered items for copyright deposit directly from a publisher.

Identification (identifiers and descriptive metadata)

In digital preservation and collection management, discovery and identification of objects is aided by the use of assigned identifiers and accurate descriptive metadata. An identifier is a unique label that is used to reference an object or record, usually manifested as a number or string of numbers and letters. As a crucial element of metadata to be included in a database record or inventory, it is used in tandem with other descriptive metadata to differentiate objects and their various instantiations.
Descriptive metadata refers to information about an object's content such as title, creator, subject, date etc... Determination of the elements used to describe an object are facilitated by the use of a metadata schema. Extensive descriptive metadata about a digital object helps to minimize the risks of a digital object becoming inaccessible.
Another common type of file identification is the filename. Implementing a file naming protocol is essential to maintaining consistency and efficient discovery and retrieval of objects in a collection, and is especially applicable during digitization of analog media. Using a file naming convention, such as the 8.3 filename or the Warez standard naming, will ensure compatibility with other systems and facilitate migration of data, and deciding between descriptive and non-descriptive file names is generally determined by the size and scope of a given collection. However, filenames are not good for semantic identification, because they are non-permanent labels for a specific location on a system and can be modified without affecting the bit-level profile of a digital file.

Integrity

The cornerstone of digital preservation, "data integrity" refers to the assurance that the data is "complete and unaltered in all essential respects"; a program designed to maintain integrity aims to "ensure data is recorded exactly as intended, and upon later retrieval, ensure the data is the same as it was when it was originally recorded".
Unintentional changes to data are to be avoided, and responsible strategies should be put in place to detect unintentional changes and react as appropriately determined. However, digital preservation efforts may necessitate modifications to content or metadata through responsibly-developed procedures and by well-documented policies. Organizations or individuals may choose to retain original, integrity-checked versions of content and/or modified versions with appropriate preservation metadata. Data integrity practices also apply to modified versions, as their state of capture must be maintained and resistant to unintentional modifications.
The integrity of a record can be preserved through bit-level preservation, fixity checking, and capturing a full audit trail of all preservation actions performed on the record. These strategies can ensure protection against unauthorised or accidental alteration.

Fixity

is the property of a digital file being fixed, or unchanged. File fixity checking is the process of validating that a file has not changed or been altered from a previous state. This effort is often enabled by the creation, validation, and management of checksums.
While checksums are the primary mechanism for monitoring fixity at the individual file level, an important additional consideration for monitoring fixity is file attendance. Whereas checksums identify if a file has changed, file attendance identifies if a file in a designated collection is newly created, deleted, or moved. Tracking and reporting on file attendance is a fundamental component of digital collection management and fixity.

Characterization

Characterization of digital materials is the identification and description of what a file is and of its defining technical characteristics often captured by technical metadata, which records its technical attributes like creation or production environment.

Sustainability

Digital sustainability encompasses a range of issues and concerns that contribute to the longevity of digital information. Unlike traditional, temporary strategies, and more permanent solutions, digital sustainability implies a more active and continuous process. Digital sustainability concentrates less on the solution and technology and more on building an infrastructure and approach that is flexible with an emphasis on interoperability, continued maintenance and continuous development. Digital sustainability incorporates activities in the present that will facilitate access and availability in the future. The ongoing maintenance necessary to digital preservation is analogous to the successful, centuries-old, community upkeep of the Uffington White Horse or the Ise Grand Shrine.

Renderability

Renderability refers to the continued ability to use and access a digital object while maintaining its inherent significant properties.

Physical media obsolescence

can occur when access to digital content requires external dependencies that are no longer manufactured, maintained, or supported. External dependencies can refer to hardware, software, or physical carriers. For example, DLT tape was used for backups and data preservation, but is no longer used.

Format obsolescence

File format obsolescence can occur when adoption of new encoding formats supersedes use of existing formats, or when associated presentation tools are no longer readily available.
While the use of file formats will vary among archival institutions given their capabilities, there is documented acceptance among the field that chosen file formats should be "open, standard, non-proprietary, and well-established" to enable long-term archival use. Factors that should enter consideration when selecting sustainable file formats include disclosure, adoption, transparency, self-documentation, external dependencies, impact of patents, and technical protection mechanisms. Other considerations for selecting sustainable file formats include "format longevity and maturity, adaptation in relevant professional communities, incorporated information standards, and long-term accessibility of any required viewing software". For example, the Smithsonian Institution Archives considers uncompressed TIFFs to be "a good preservation format for born-digital and digitized still images because of its maturity, wide adaptation in various communities, and thorough documentation".
Formats proprietary to one software vendor are more likely to be affected by format obsolescence. Well-used standards such as Unicode and JPEG are more likely to be readable in future.