Data contract
In data management, a data contract is an agreement between data producers and data consumers. It contains a detailed schema creating a link between business and technology. A data contract also describes advanced metadata, such as data quality rules, SLA, and behavior. Data contracts can take several forms, but YAML is very common.
The Linux Foundation project Bitol has published a data contract standard called Open Data Contract Standard. Its current version is 3.0.2.
History
In December 2021, Andrew Jones at GoCardless wrote about how the company was using data contracts, and in October 2022 he wrote about their implementation. In August 2022, Jean-Georges Perrin published a popular reference article in the PayPal Technology Blog describing the use of data contracts in a Data Mesh implementation. In May 2023, PayPal open-sourced its Data Contract Template.
In June 2023, Andrew Jones published Driving Data Quality with Data Contracts: A comprehensive guide to building reliable, trusted, and effective data platforms, which is, up to now, the only published book on this topic.
In November 2023, Bitol, a Linux Foundation project, released the first version of the Open Data Contract Standard (ODCS), a compatible fork of the PayPal template.
In September 2024, Ronald Angel at Miro wrote about their implementation of data contracts.
In October 2024, Bitol released ODCS v3.0.0 with enhanced support for data quality.
Implementation
The Apache 2.0-licensed Bitol project divides data contracts into several sections:
- Fundamentals: Contains general information about the contract, such as name, domain, and version, with room for other descriptive details.
- Schema: Describes the dataset and the schema of the data contract. The schema is a critical element of the contract because it supports data quality. A data contract focuses on a single dataset with several tables.
- Data quality: Describes data quality rules and their parameters. They are tightly linked to the schema defined in the schema section.
- Pricing: Explains pricing if/when there is a need to bill customers for using this data product, whether the customer is internal or external.
- Team: Lists stakeholders and the history of their relation with this data contract. It usually excludes consumers.
- Roles: Lists the roles that a consumer may need to access the dataset depending on the type of access they require.
- Service-level agreement: Describes the service-level agreements. Data, data quality, and SLA are combined as data quality of service.
- Infrastructure: Describes the servers and storage in potentially several environments, like production, development, test, and so on.
- Business rules: Describes the organization's business rules that apply to the data.
- Custom properties: Covers custom and other properties in a data contract using a list of key–value pairs. This structure offers flexibility without requiring the creation of a new version of the standard whenever someone needs additional properties.
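As a rough illustration of the sections listed above, the following Python sketch models a contract as a nested dictionary and reports which sections have not been filled in yet. The section keys, the sample values, and the helper functions `missing_sections` and `custom_property` are hypothetical names chosen for this example; they are not the field names defined by the standard.

```python
# Hypothetical section keys mirroring the list above; these are NOT
# the official field names of any data contract standard.
SECTIONS = [
    "fundamentals", "schema", "data_quality", "pricing", "team",
    "roles", "sla", "infrastructure", "business_rules", "custom_properties",
]

# A small, made-up contract that only fills in a few sections.
contract = {
    "fundamentals": {"name": "orders", "domain": "sales", "version": "1.0.0"},
    "schema": {
        "tables": [
            {
                "name": "orders",
                "columns": [
                    {"name": "order_id", "type": "string"},
                    {"name": "amount", "type": "decimal"},
                ],
            },
        ],
    },
    # Custom properties are a list of key-value pairs, so new properties
    # can be added without changing the structure of the contract.
    "custom_properties": [{"key": "costCenter", "value": "CC-42"}],
}


def missing_sections(contract):
    """Return the section names from SECTIONS that the contract does not define."""
    return [name for name in SECTIONS if name not in contract]


def custom_property(contract, key, default=None):
    """Look up a value in the key-value list of custom properties."""
    for prop in contract.get("custom_properties", []):
        if prop["key"] == key:
            return prop["value"]
    return default
```

Keeping custom properties as a list of key-value pairs, rather than as fixed fields, is what lets a contract carry extra information without a new version of the surrounding structure.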
Best practices
Usually, a data contract is created by one data producer for one or many data consumers. A data contract is designed to be enhanced iteratively. Data engineers can start with the few elements in the header and the schema. Over time, data engineers and owners can add more information, like data quality and SLA.
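The iterative approach can be sketched in Python as follows: a first version of a contract holds only a header and a schema, and a later version adds a data quality rule. The dictionary layout, the helper functions `add_quality_rule` and `violations`, and the sample record are all assumptions made for this illustration, not part of any standard or tool.

```python
import copy

# Hypothetical layout; key names are illustrative, not taken from a standard.
# Version 1 contains only the header ("fundamentals") and the schema.
contract_v1 = {
    "fundamentals": {"name": "orders", "version": "0.1.0"},
    "schema": {"columns": {"order_id": str, "amount": float}},
}


def add_quality_rule(contract, column, rule, description, new_version):
    """Return a new contract version with one extra data quality rule."""
    updated = copy.deepcopy(contract)  # leave the earlier version untouched
    updated["fundamentals"]["version"] = new_version
    updated.setdefault("data_quality", []).append(
        {"column": column, "rule": rule, "description": description}
    )
    return updated


def violations(contract, record):
    """List schema problems, or data quality problems if the schema check passes."""
    problems = []
    for column, expected_type in contract["schema"]["columns"].items():
        if column not in record:
            problems.append(f"missing column: {column}")
        elif not isinstance(record[column], expected_type):
            problems.append(f"wrong type for {column}")
    if problems:
        return problems  # quality rules assume a schema-valid record
    for rule in contract.get("data_quality", []):
        if not rule["rule"](record[rule["column"]]):
            problems.append(f"quality rule failed: {rule['description']}")
    return problems


# Later iteration: the owners add a data quality rule on top of the schema.
contract_v2 = add_quality_rule(
    contract_v1, "amount", lambda v: v >= 0, "amount must be non-negative", "0.2.0"
)

record = {"order_id": "A-1", "amount": -5.0}
print(violations(contract_v1, record))  # → []
print(violations(contract_v2, record))  # → ['quality rule failed: amount must be non-negative']
```

The same record passes the first version and fails the second, which shows how a contract can tighten over time without changing its schema.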
Most data contracts are implemented using a YAML file, which is both human- and computer-readable and language-agnostic.
The symbol for a data contract is either an equilateral triangle, symbolizing schema, business meaning, and SLAs, or a file icon.