Data quality


Data quality refers to the state of qualitative or quantitative pieces of information. There are many definitions of data quality, but data is generally considered high quality if it is "fit for intended uses in operations, decision making and planning". Data is deemed of high quality if it correctly represents the real-world construct to which it refers. Apart from these definitions, as the number of data sources increases, the question of internal data consistency becomes significant, regardless of fitness for use for any particular external purpose.
People's views on data quality can often disagree, even when discussing the same set of data used for the same purpose. When this is the case, businesses may adopt recognised international standards for data quality. Data governance can also be used to form agreed-upon definitions and standards, including international standards, for data quality. In such cases, data cleansing, including standardization, may be required in order to ensure data quality.

Definitions

Defining data quality is difficult due to the many contexts data are used in, as well as the varying perspectives among end users, producers, and custodians of data.
From a consumer perspective, data quality is:
  • "data that are fit for use by data consumers"
  • data "meeting or exceeding consumer expectations"
  • data that "satisfies the requirements of its intended use"
From a business perspective, data quality is:
  • data that are "'fit for use' in their intended operational, decision-making and other roles" or that exhibits "'conformance to standards' that have been set, so that fitness for use is achieved"
  • data that "are fit for their intended uses in operations, decision making and planning"
  • "the capability of data to satisfy the stated business, system, and technical requirements of an enterprise"
From a standards-based perspective, data quality is:
  • the "degree to which a set of inherent characteristics of an object fulfills requirements"
  • "the usefulness, accuracy, and correctness of data for its application"
Arguably, in all these cases, "data quality" is a comparison of the actual state of a particular set of data to a desired state, with the desired state being typically referred to as "fit for use," "to specification," "meeting consumer expectations," "free of defect," or "meeting requirements." These expectations, specifications, and requirements are usually defined by one or more individuals or groups, standards organizations, laws and regulations, business policies, or software development policies.

Dimensions of data quality

Drilling down further, those expectations, specifications, and requirements are stated in terms of characteristics or dimensions of the data, such as:
  • accessibility or availability
  • accuracy or correctness
  • comparability
  • completeness or comprehensiveness
  • consistency, coherence, or clarity
  • credibility, reliability, or reputation
  • flexibility
  • plausibility
  • relevance, pertinence, or usefulness
  • timeliness or latency
  • uniqueness
  • validity or reasonableness
A systematic scoping review of the literature suggests that data quality dimensions and methods applied to real-world data are not consistently defined, and that, as a result, quality assessments are challenging due to the complex and heterogeneous nature of these data.
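As a rough illustration, a few of these dimensions can be measured directly against a dataset. The following Python sketch computes simple completeness, uniqueness, and validity scores for a small set of records; the records, field names, and email pattern are hypothetical and serve only to show the idea.

    import re
    from collections import Counter

    # Hypothetical customer records; field names and values are for illustration only.
    records = [
        {"id": 1, "name": "Alice", "email": "alice@example.com"},
        {"id": 2, "name": "Bob", "email": None},
        {"id": 2, "name": "Bob", "email": "bob@example"},  # duplicate id, malformed email
    ]

    EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # assumed validity rule

    def completeness(rows, field):
        """Fraction of records in which the field is present and non-empty."""
        filled = sum(1 for r in rows if r.get(field) not in (None, ""))
        return filled / len(rows)

    def uniqueness(rows, field):
        """Fraction of records whose value for the field occurs exactly once."""
        counts = Counter(r.get(field) for r in rows)
        return sum(1 for r in rows if counts[r.get(field)] == 1) / len(rows)

    def validity(rows, field, pattern):
        """Fraction of non-empty values that match the expected pattern."""
        values = [r.get(field) for r in rows if r.get(field)]
        return sum(1 for v in values if pattern.match(v)) / len(values) if values else 0.0

    print(f"completeness(email) = {completeness(records, 'email'):.2f}")       # 0.67
    print(f"uniqueness(id)      = {uniqueness(records, 'id'):.2f}")            # 0.33
    print(f"validity(email)     = {validity(records, 'email', EMAIL_RE):.2f}") # 0.50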

International standards for data quality

ISO 8000 is an international standard for data quality. Managed by the International Organization for Standardization, the ISO 8000 standards address and describe:
  • general aspects of data quality including principles, vocabulary and measurement
  • data governance
  • data quality management
  • data quality assessment
  • quality of master data, including exchange of characteristic data and identifiers
  • quality of industrial data

History

Before the rise of inexpensive computer data storage, massive mainframe computers were used to maintain name and address data for delivery services so that mail could be properly routed to its destination. The mainframes used business rules to correct common misspellings and typographical errors in name and address data, as well as to track customers who had moved, died, gone to prison, married, divorced, or experienced other life-changing events. Government agencies began to make postal data available to a few service companies to cross-reference customer data with the National Change of Address registry. This technology saved large companies millions of dollars in comparison to manual correction of customer data. Large companies saved on postage, as bills and direct marketing materials made their way to the intended customer more accurately. Initially sold as a service, data quality moved inside the walls of corporations as low-cost and powerful server technology became available.
Companies with an emphasis on marketing often focused their quality efforts on name and address information, but data quality is recognized as an important property of all types of data. Principles of data quality can be applied to supply chain data, transactional data, and nearly every other category of data. For example, making supply chain data conform to a certain standard has value to an organization by: 1) avoiding overstocking of similar but slightly different stock; 2) avoiding false stock-outs; 3) improving the understanding of vendor purchases to negotiate volume discounts; and 4) avoiding logistics costs in stocking and shipping parts across a large organization.
For companies with significant research efforts, data quality can include developing protocols for research methods, reducing measurement error, bounds checking of data, cross tabulation, modeling and outlier detection, verifying data integrity, etc.
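As a minimal sketch of two of these activities, the Python example below applies a hard bounds check and a simple z-score outlier rule to a set of measurements; the sample values, the plausible range, and the cutoff of 2.0 are illustrative assumptions rather than recommended settings.

    import statistics

    # Illustrative body-temperature readings in degrees Celsius; the last value looks
    # like a Fahrenheit reading entered by mistake.
    measurements = [36.6, 37.1, 36.9, 42.5, 36.8, 35.9, 98.6]
    PLAUSIBLE_RANGE = (30.0, 45.0)   # hard bounds check (assumed limits)
    Z_CUTOFF = 2.0                   # soft outlier rule (assumed threshold)

    # Bounds checking: values outside the physically plausible range.
    out_of_bounds = [x for x in measurements
                     if not PLAUSIBLE_RANGE[0] <= x <= PLAUSIBLE_RANGE[1]]

    # Outlier detection: values far from the sample mean in standard-deviation units.
    mean = statistics.mean(measurements)
    stdev = statistics.stdev(measurements)
    outliers = [x for x in measurements if abs(x - mean) / stdev > Z_CUTOFF]

    print("out of bounds:", out_of_bounds)   # [98.6]
    print("z-score outliers:", outliers)     # [98.6]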

Overview

There are a number of theoretical frameworks for understanding data quality. A systems-theoretical approach influenced by American pragmatism expands the definition of data quality to include information quality, and emphasizes the inclusiveness of the fundamental dimensions of accuracy and precision on the basis of the theory of science. One framework, dubbed "Zero Defect Data", adapts the principles of statistical process control to data quality. Another framework seeks to integrate the product perspective and the service perspective. Another framework is based in semiotics to evaluate the quality of the form, meaning, and use of the data. One highly theoretical approach analyzes the ontological nature of information systems to define data quality rigorously.
A considerable amount of data quality research involves investigating and describing various categories of desirable attributes of data. Nearly 200 such terms have been identified, and there is little agreement on their nature, definitions, or measures. Software engineers may recognize this as a problem similar to that of "ilities".
MIT has an Information Quality Program, led by Professor Richard Wang, which produces a large number of publications and hosts a significant international conference in this field. This program grew out of the work done by Hansen on the "Zero Defect Data" framework.
In practice, data quality is a concern for professionals involved with a wide range of information systems, ranging from data warehousing and business intelligence to customer relationship management and supply chain management. One industry study estimated the total cost to the U.S. economy of data quality problems at over U.S. $600 billion per annum. Incorrect data, which includes invalid and outdated information, can originate from different data sources, for example through data entry errors or data migration and conversion projects.
In 2002, the USPS and PricewaterhouseCoopers released a report stating that 23.6 percent of all U.S. mail sent is incorrectly addressed.
One reason contact data becomes stale very quickly in the average database is that more than 45 million Americans change their address every year.
In fact, the problem is such a concern that companies are beginning to set up data governance teams whose sole role in the corporation is to be responsible for data quality. In some organizations, this data governance function has been established as part of a larger regulatory compliance function, in recognition of the importance of data and information quality to organizations.
Problems with data quality do not arise only from incorrect data; inconsistent data is a problem as well. Eliminating data shadow systems and centralizing data in a warehouse is one of the initiatives a company can take to ensure data consistency.
Enterprises, scientists, and researchers are starting to participate within data curation communities to improve the quality of their common data.
The market is going some way toward providing data quality assurance. A number of vendors make tools for analyzing and repairing poor-quality data in situ, service providers can clean the data on a contract basis, and consultants can advise on fixing processes or systems to avoid data quality problems in the first place. Most data quality suites offer a series of tools for improving data, which may include some or all of the following:
  1. Data profiling - initially assessing the data to understand its current state, often including value distributions
  2. Data standardization - a business rules engine that ensures that data conforms to standards
  3. Geocoding - for name and address data; corrects data to U.S. and worldwide geographic standards
  4. Matching or Linking - a way to compare data so that similar but slightly different records can be aligned. Matching may use "fuzzy logic" to find duplicates in the data, often recognizing that "Bob" and "Bbo" may be the same individual (a minimal matching sketch follows this list). It might be able to manage "householding", or finding links between spouses at the same address, for example. Finally, it often can build a "best of breed" record, taking the best components from multiple data sources and building a single super-record.
  5. Monitoring - keeping track of data quality over time and reporting variations in the quality of data. Software can also auto-correct the variations based on pre-defined business rules.
  6. Batch and Real time - Once the data is initially cleansed, companies often want to build the processes into enterprise applications to keep it clean.
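As an illustration of the matching step, the following Python sketch uses the standard library's difflib.SequenceMatcher to flag likely duplicate records such as "Bob" versus "Bbo"; the sample records and the 0.8 similarity threshold are assumptions, and production matching tools typically add phonetic and probabilistic techniques on top of this basic idea.

    from difflib import SequenceMatcher
    from itertools import combinations

    # Hypothetical contact records; names, addresses, and the threshold are illustrative.
    records = [
        {"id": 1, "name": "Bob Smith",   "address": "12 Oak St"},
        {"id": 2, "name": "Bbo Smith",   "address": "12 Oak St"},   # transposed first name
        {"id": 3, "name": "Carol Jones", "address": "98 Elm Ave"},
    ]

    def similarity(a, b):
        """Return a ratio in [0, 1] indicating how closely two strings match."""
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def likely_duplicates(rows, threshold=0.8):
        """Pair up records whose name and address similarities both exceed the threshold."""
        pairs = []
        for r1, r2 in combinations(rows, 2):
            if (similarity(r1["name"], r2["name"]) >= threshold
                    and similarity(r1["address"], r2["address"]) >= threshold):
                pairs.append((r1["id"], r2["id"]))
        return pairs

    print(likely_duplicates(records))   # [(1, 2)]: records 1 and 2 are likely the same person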