ZFS


ZFS is a file system with volume management capabilities. It began as part of the Sun Microsystems Solaris operating system in 2001. Large parts of Solaris, including ZFS, were published under an open source license as OpenSolaris for around five years from 2005, before being placed under a closed source license when Oracle Corporation acquired Sun in 2009–2010. Between 2005 and 2010, the open source version of ZFS was ported to Linux, Mac OS X and FreeBSD. In 2010, the illumos project forked a then-recent version of OpenSolaris, including ZFS, to continue its development as an open source project. In 2013, OpenZFS was founded to coordinate the development of open source ZFS. OpenZFS maintains and manages the core ZFS code, while organizations using ZFS maintain the specific code and validation processes required for ZFS to integrate within their systems. OpenZFS is widely used in Unix-like systems.

Overview

The management of stored data generally involves two aspects: the physical volume management of one or more block storage devices, including their organization into logical block devices (vdevs) as seen by the operating system; and the management of data and files that are stored on these logical block devices.
ZFS is unusual because, unlike most other storage systems, it unifies both of these roles and acts as both the volume manager and the file system. Therefore, it has complete knowledge of both the physical disks and volumes as well as of all the files stored on them. ZFS is designed to ensure that data stored on disks cannot be lost due to physical errors, misprocessing by the hardware or operating system, or bit rot events and data corruption that may happen over time. Its complete control of the storage system is used to ensure that every step, whether related to file management or disk management, is verified, confirmed, corrected if needed, and optimized, in a way that the storage controller cards and separate volume and file systems cannot achieve.
ZFS also includes a mechanism for dataset and pool-level snapshots and replication, including snapshot cloning, which is described by the FreeBSD documentation as one of its "most powerful features" with functionality that "even other file systems with snapshot functionality lack". Very large numbers of snapshots can be taken without degrading performance, allowing snapshots to be used prior to risky system operations and software changes, or an entire production file system to be fully snapshotted several times an hour in order to mitigate data loss due to user error or malicious activity. Snapshots can be rolled back "live" or previous file system states can be viewed, even on very large file systems, leading to savings in comparison to formal backup and restore processes. Snapshots can also be cloned to form new independent file systems. ZFS also has the ability to take a pool level snapshot, which allows rollback of operations that may affect the entire pool's structure or that add or remove entire datasets.
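The reason very large numbers of snapshots are cheap is that ZFS is copy-on-write: a snapshot preserves references to the blocks that exist at that moment, and only blocks modified afterwards consume additional space. The following Python listing is a minimal conceptual sketch of that idea; the class and method names are illustrative and are not ZFS APIs.

    # Conceptual copy-on-write snapshot model (illustrative only, not ZFS code).
    # A "dataset" maps names to immutable block contents; a snapshot records the
    # current mapping, so taking one copies references, not data.

    class Dataset:
        def __init__(self):
            self.blocks = {}      # name -> bytes (immutable once written)
            self.snapshots = {}   # snapshot name -> frozen view of self.blocks

        def write(self, name, data):
            # Copy-on-write: a change produces a new mapping; older snapshots
            # still reference the previous one, freed only when unreferenced.
            self.blocks = dict(self.blocks)
            self.blocks[name] = data

        def snapshot(self, snap_name):
            # Only the top-level mapping is captured; no data is copied.
            self.snapshots[snap_name] = self.blocks

        def rollback(self, snap_name):
            # "Live" rollback: the current view becomes the snapshot's view.
            self.blocks = dict(self.snapshots[snap_name])

        def clone(self, snap_name):
            # A clone is a new, writable dataset that initially shares every block.
            child = Dataset()
            child.blocks = dict(self.snapshots[snap_name])
            return child


    ds = Dataset()
    ds.write("report.txt", b"v1")
    ds.snapshot("before-upgrade")
    ds.write("report.txt", b"v2")                                  # current data changes...
    assert ds.snapshots["before-upgrade"]["report.txt"] == b"v1"   # ...the snapshot does not
    ds.rollback("before-upgrade")
    assert ds.blocks["report.txt"] == b"v1"

Because the snapshot shares storage with the live file system, taking or keeping one costs almost nothing until the underlying data diverges, which is why frequent snapshotting does not degrade performance.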

History

2004–2010: Development at Sun Microsystems

In 1987, AT&T Corporation and Sun announced that they were collaborating on a project to merge the most popular Unix variants on the market at that time: Berkeley Software Distribution, UNIX System V, and Xenix. This became Unix System V Release 4. The project was released under the name Solaris, which became the successor to SunOS 4.
ZFS was designed and implemented by a team at Sun led by Jeff Bonwick, Bill Moore, and Matthew Ahrens. It was announced on September 14, 2004, but development started in 2001. Source code for ZFS was integrated into the main trunk of Solaris development on October 31, 2005 and released for developers as part of build 27 of OpenSolaris on November 16, 2005. In June 2006, Sun announced that ZFS was included in the mainstream 6/06 update to Solaris 10.
Solaris was originally developed as proprietary software, but Sun Microsystems was an early commercial proponent of open source software and in June 2005 released most of the Solaris codebase under the CDDL license and founded the OpenSolaris open-source project. In Solaris 10 6/06, Sun added the ZFS file system and frequently updated ZFS with new features during the next 5 years. ZFS was ported to Linux, Mac OS X, and FreeBSD, under this open source license.
The name at one point was said to stand for "Zettabyte File System", but by 2006, the name was no longer considered to be an abbreviation. A ZFS file system can store up to 256 quadrillion zettabytes.
In September 2007, NetApp sued Sun, claiming that ZFS infringed some of NetApp's patents on Write Anywhere File Layout. Sun counter-sued in October the same year claiming the opposite. The lawsuits were ended in 2010 with an undisclosed settlement.

2010–current: Development at Oracle, OpenZFS

Ported versions of ZFS began to appear in 2005. After the Sun acquisition by Oracle in 2010, Oracle's version of ZFS became closed source, and the development of open-source versions proceeded independently, coordinated by OpenZFS from 2013.

Features

Summary

Examples of features specific to ZFS include:
  • Designed for long-term storage of data, with indefinitely scaled datastore sizes, zero data loss, and high configurability.
  • Hierarchical checksumming of all data and metadata, ensuring that the entire storage system can be verified on use, and confirmed to be correctly stored, or remedied if corrupt. Checksums are stored with a block's parent block, rather than with the block itself. This contrasts with many file systems where checksums are stored with the data so that if the data is lost or corrupt, the checksum is also likely to be lost or incorrect.
  • Can store a user-specified number of copies of data or metadata, or selected types of data, to improve the ability to recover from data corruption of important files and structures.
  • Automatic rollback of recent changes to the file system and data, in some circumstances, in the event of an error or inconsistency.
  • Automated and silent self-healing of data inconsistencies and write failures when detected, for all errors where the data is capable of reconstruction. Data can be reconstructed using all of the following: error detection and correction checksums stored in each block's parent block; multiple copies of data held on the disk; write intentions logged on the SLOG for writes that should have occurred but did not occur; parity data from RAID/RAID-Z disks and volumes; copies of data from mirrored disks and volumes.
  • Native handling of standard RAID levels and additional ZFS RAID layouts. The RAID-Z levels stripe data across only the disks required, for efficiency, and checksumming allows rebuilding of inconsistent or corrupted data to be minimized to those blocks with defects;
  • Native handling of tiered storage and caching devices, which is usually a volume-related task. Because ZFS also understands the file system, it can use file-related knowledge to inform, integrate, and optimize its tiered storage handling, which a separate device cannot;
  • Native handling of snapshots and backup/replication which can be made efficient by integrating the volume and file handling. Relevant tools are provided at a low level and require external scripts and software for utilization.
  • Native data compression and deduplication, although the latter is largely handled in RAM and is memory hungry.
  • Efficient rebuilding of RAID arrays—a RAID controller often has to rebuild an entire disk, but ZFS can combine disk and file knowledge to limit any rebuilding to data which is actually missing or corrupt, greatly speeding up rebuilding;
  • Unaffected by RAID hardware changes which affect many other systems. On many systems, if self-contained RAID hardware such as a RAID card fails, or the data is moved to another RAID system, the file system will lack information that was on the original RAID hardware, which is needed to manage data on the RAID array. This can lead to a total loss of data unless near-identical hardware can be acquired and used as a "stepping stone". Since ZFS manages RAID itself, a ZFS pool can be migrated to other hardware, or the operating system can be reinstalled, and the RAID-Z structures and data will be recognized and immediately accessible by ZFS again.
  • Ability to identify data that would have been found in a cache but has been discarded recently instead; this allows ZFS to reassess its caching decisions in light of later use and facilitates very high cache-hit levels;
  • Alternative caching strategies can be used for data that would otherwise cause delays in data handling. For example, synchronous writes, which are capable of slowing down the storage system, can be converted to asynchronous writes by being written to a fast separate caching device, known as the SLOG (a conceptual sketch of this follows the list below).
  • Highly tunable—many internal parameters can be configured for optimal functionality.
  • Can be used for high availability clusters and computing, although not fully designed for this use.
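As a rough illustration of the SLOG idea mentioned in the list above, the following Python sketch shows why acknowledging a synchronous write as soon as its intent is durably recorded on a fast log device allows the slower write to the main pool to be deferred. The names are invented for illustration and are not ZFS APIs.

    # Conceptual sketch of a separate intent log (SLOG); illustrative names only.
    # A synchronous write is acknowledged once its intent is recorded on a fast
    # log device; the slower write to the main pool is flushed later.

    class StoragePool:
        def __init__(self):
            self.fast_log = []        # stand-in for a fast SLOG device
            self.main_pool = {}       # stand-in for the slower main vdevs

        def sync_write(self, name, data):
            # 1. Record the intent on the fast device and acknowledge immediately.
            self.fast_log.append((name, data))
            return "acknowledged"     # the caller's synchronous wait ends here

        def flush(self):
            # 2. Later, apply the logged writes to the main pool asynchronously.
            for name, data in self.fast_log:
                self.main_pool[name] = data
            self.fast_log.clear()

        def recover(self):
            # After a crash, unflushed intents can be replayed from the log,
            # which is why the write could be safely acknowledged early.
            self.flush()


    pool = StoragePool()
    pool.sync_write("db.wal", b"txn-42")   # fast acknowledgement
    pool.flush()                           # slow write happens in the background
    assert pool.main_pool["db.wal"] == b"txn-42"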

Data integrity

One major feature that distinguishes ZFS from other file systems is that it is designed with a focus on data integrity, protecting the user's data on disk against silent data corruption caused by data degradation, power surges, bugs in disk firmware, phantom writes, misdirected reads/writes, DMA parity errors between the array and server memory or from the driver, driver errors, accidental overwrites, etc.
A 1999 study showed that none of the then-major and widespread file systems, nor hardware RAID, provided sufficient protection against data corruption problems. Initial research indicates that ZFS protects data better than earlier efforts.
It is also faster than UFS and can be seen as its replacement.
Within ZFS, data integrity is achieved by using a Fletcher-based checksum or a SHA-256 hash throughout the file system tree. Each block of data is checksummed and the checksum value is then saved in the pointer to that block—rather than at the actual block itself. Next, the block pointer is checksummed, with the value being saved at its pointer. This checksumming continues all the way up the file system's data hierarchy to the root node, which is also checksummed, thus creating a Merkle tree. In-flight data corruption or phantom reads/writes are undetectable by most filesystems as they store the checksum with the data. ZFS stores the checksum of each block in its parent block pointer so that the entire pool self-validates.
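The parent-stored checksum scheme can be pictured as a small Merkle tree. The Python listing below is a minimal, illustrative sketch of that structure only; it is not ZFS code, and real ZFS keeps Fletcher or SHA-256 checksums inside on-disk block pointers rather than Python objects.

    # Minimal Merkle-tree sketch of ZFS-style checksumming (illustrative only).
    # Each block pointer stores the checksum of the block it points to, so the
    # checksum is never stored alongside the data it protects.

    import hashlib

    def checksum(data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()   # ZFS can use SHA-256 or Fletcher

    class BlockPointer:
        def __init__(self, block: bytes):
            self.block = block                      # the child block's contents
            self.stored_checksum = checksum(block)  # kept in the parent, not the child

        def read_verified(self) -> bytes:
            # On every read, recompute and compare against the parent-held checksum.
            if checksum(self.block) != self.stored_checksum:
                raise IOError("checksum mismatch: block is corrupt")
            return self.block

    # Data blocks are referenced by indirect blocks, which are themselves
    # referenced (and checksummed) by their parents, up to the checksummed root.
    data_ptrs = [BlockPointer(b"block A"), BlockPointer(b"block B")]
    indirect_block = b"".join(p.stored_checksum.encode() for p in data_ptrs)
    root_ptr = BlockPointer(indirect_block)         # the root checksum covers everything

    root_ptr.read_verified()                        # whole-tree validation starts here
    for p in data_ptrs:
        p.read_verified()

    # Simulate silent corruption of a data block: the parent's checksum catches it.
    data_ptrs[0].block = b"block A (flipped bits)"
    try:
        data_ptrs[0].read_verified()
    except IOError as e:
        print(e)

Because every pointer up to the root carries a checksum of its child, a corrupt block can never vouch for itself: the mismatch is always detected one level above, which is what allows the entire pool to self-validate.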
When a block is accessed, regardless of whether it is data or metadata, its checksum is calculated and compared with the stored checksum value of what it "should" be. If the checksums match, the data are passed up the programming stack to the process that asked for it; if the values do not match, then ZFS can heal the data if the storage pool provides data redundancy, assuming that the copy of data is undamaged and with matching checksums. It is optionally possible to provide additional in-pool redundancy by specifying copies=2, which means that data will be stored twice on the disk, effectively halving the storage capacity of the disk. Additionally, some kinds of data used by ZFS to manage the pool are stored multiple times by default for safety, even with the default copies=1 setting.
If other copies of the damaged data exist or can be reconstructed from checksums and parity data, ZFS will use a copy of the data and recalculate the checksum—ideally resulting in the reproduction of the originally expected value. If the data passes this integrity check, the system can then update all faulty copies with known-good data and redundancy will be restored.
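A hedged sketch of that repair path, again in plain Python with invented names (real ZFS performs this inside the pool's I/O pipeline, using mirrors, RAID-Z parity, or extra copies): the read tries each redundant copy until one matches the parent-held checksum, then rewrites the damaged copies from the verified one.

    # Conceptual self-healing read (illustrative only; names are not ZFS APIs).
    # With redundancy, a read that fails its checksum falls back to another copy
    # and repairs the damaged one in place.

    import hashlib

    def checksum(data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    def self_healing_read(copies: list, expected_checksum: str) -> bytes:
        # Find any copy whose checksum matches the value stored in the parent.
        good = next((c for c in copies if checksum(c) == expected_checksum), None)
        if good is None:
            # No undamaged copy: the data cannot be reconstructed from redundancy.
            raise IOError("unrecoverable: all copies fail checksum")
        # Self-heal: overwrite every damaged copy with the verified good data.
        for i, c in enumerate(copies):
            if checksum(c) != expected_checksum:
                copies[i] = good
        return good


    original = b"important record"
    stored = checksum(original)                 # held in the parent block pointer
    copies = [b"imp0rtant rec0rd", original]    # first copy silently corrupted

    assert self_healing_read(copies, stored) == original
    assert copies[0] == original                # the damaged copy has been repaired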
If there are no copies of the damaged data, ZFS puts the pool in a faulted state, preventing its future use and providing no documented ways to recover pool contents.
Consistency of data held in memory, such as cached data in the ARC, is not checked by default, as ZFS is expected to run on enterprise-quality hardware with error correcting RAM. However, the capability to check in-memory data exists and can be enabled using "debug flags".