Data warehouse
In computing, a data warehouse, also known as an enterprise data warehouse, is a system used for reporting and data analysis and is a core component of business intelligence. Data warehouses are central repositories of data integrated from disparate sources. They store current and historical data organized in a way that is optimized for data analysis, report generation, and the development of insights across the integrated data. They are intended to be used by analysts and managers to help make organizational decisions.
The data stored in the warehouse is uploaded from operational systems. The data may pass through an operational data store and may require data cleansing to ensure data quality before it is used in the data warehouse for reporting.
The two main workflows for building a data warehouse system are extract, transform, load (ETL) and extract, load, transform (ELT).
Components
The environment for data warehouses and marts includes the following:
- Source systems of data;
- Data integration technology and processes to extract data from source systems, transform them, and load them into a data mart or warehouse;
- Architectures to store data in the warehouse or marts;
- Tools and applications for varied users;
- Metadata, data quality, and governance processes. Metadata includes data sources, refresh schedules and data usage measures.
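As a rough illustration of the metadata and governance component, the following sketch (all names and figures are hypothetical) records, for each warehouse table, its source system, refresh schedule, and a simple usage measure, and runs a small governance check over them:

```python
# Hypothetical metadata records for a warehouse environment: each entry
# tracks where a table comes from, when it is refreshed, and how often it
# is queried (a simple usage measure).
catalog = {
    "sales_fact": {
        "source_system": "orders_oltp",        # assumed source-system name
        "refresh_schedule": "daily 02:00 UTC",
        "queries_last_30_days": 1842,
    },
    "customer_dim": {
        "source_system": "crm_oltp",
        "refresh_schedule": "hourly",
        "queries_last_30_days": 967,
    },
}

def stale_tables(catalog, min_queries=1):
    """Return tables that are never queried, as a simple governance check."""
    return [table for table, meta in catalog.items()
            if meta["queries_last_30_days"] < min_queries]

if __name__ == "__main__":
    print(stale_tables(catalog))   # [] for the sample data above
```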
Related systems
Operational databases
Operational databases are optimized for the preservation of data integrity and speed of recording of business transactions through use of database normalization and an entity–relationship model. Operational system designers generally follow database normalization to ensure data integrity. Fully normalized database designs often result in information from a business transaction being stored in dozens to hundreds of tables. Relational databases are efficient at managing the relationships between these tables. The databases have very fast insert/update performance because only a small amount of data in those tables is affected by each transaction. To improve performance, older data are periodically purged.

Data warehouses are optimized for analytic access patterns, which usually involve selecting specific fields rather than all fields as is common in operational databases. Because of these differences in access, operational databases benefit from the use of a row-oriented database management system, whereas analytics databases benefit from the use of a column-oriented DBMS. Operational systems maintain a snapshot of the business, while warehouses maintain historic data through ETL processes that periodically migrate data from the operational systems to the warehouse.
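The difference in access patterns can be illustrated with a small sketch (plain Python, hypothetical records): a row store keeps each record together, while a column store keeps each field together, so an aggregate over one field touches far less data in the columnar layout.

```python
# A minimal sketch (not a real DBMS) contrasting row-oriented and
# column-oriented layouts of the same three sales records.
rows = [  # row-oriented: each record kept together, good for OLTP inserts
    {"order_id": 1, "customer": "A", "amount": 120.0, "region": "EU"},
    {"order_id": 2, "customer": "B", "amount": 75.5,  "region": "US"},
    {"order_id": 3, "customer": "A", "amount": 33.0,  "region": "EU"},
]

columns = {  # column-oriented: each field stored contiguously, good for scans
    "order_id": [1, 2, 3],
    "customer": ["A", "B", "A"],
    "amount":   [120.0, 75.5, 33.0],
    "region":   ["EU", "US", "EU"],
}

# The analytic query "total amount" touches only one column in the
# columnar layout, but every whole record in the row layout.
total_row_store = sum(r["amount"] for r in rows)
total_col_store = sum(columns["amount"])
assert total_row_store == total_col_store == 228.5
```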
Online analytical processing is characterized by a low rate of transactions and complex queries that involve aggregations. Response time is an effective performance measure of OLAP systems. OLAP applications are widely used for data mining. OLAP databases store aggregated, historical data in multi-dimensional schemas. OLAP systems typically have a data latency of a few hours, while data mart latency is closer to one day. The OLAP approach is used to analyze multidimensional data from multiple sources and perspectives. The three basic operations in OLAP are roll-up, drill-down, and slicing & dicing.
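A minimal sketch of the three operations on a toy sales cube, using pandas (all column names and figures are illustrative): roll-up and drill-down aggregate the measure at coarser or finer levels of the dimensions, and a slice fixes one dimension to a single value.

```python
import pandas as pd

# Toy sales cube: three dimensions (year, region, product) and one measure.
sales = pd.DataFrame({
    "year":    [2023, 2023, 2024, 2024],
    "region":  ["EU", "US", "EU", "US"],
    "product": ["bike", "bike", "car", "car"],
    "amount":  [100, 150, 200, 250],
})

# Roll-up: aggregate from (year, region) up to year.
rollup = sales.groupby("year")["amount"].sum()

# Drill-down: return to the finer (year, region) level.
drilldown = sales.groupby(["year", "region"])["amount"].sum()

# Slice: fix one dimension (region == "EU"); a dice would fix several.
eu_slice = sales[sales["region"] == "EU"]

print(rollup, drilldown, eu_slice, sep="\n\n")
```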
Online transaction processing is characterized by a large number of short online transactions. OLTP systems emphasize fast query processing and maintaining data integrity in multi-access environments. For OLTP systems, performance is measured in transactions per second. OLTP databases contain detailed and current data. Transactional databases are typically stored using an entity–relationship schema, and normalization is the norm for data modeling in such systems.
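A minimal sketch of the OLTP pattern, using Python's built-in sqlite3 module purely as a stand-in (accounts and amounts are illustrative): each transaction touches a small amount of current data and either commits as a unit or rolls back, preserving data integrity.

```python
import sqlite3

# Current, detailed data in a small transactional table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO account (id, balance) VALUES (1, 100.0), (2, 50.0)")
conn.commit()

# One short transaction: transfer 25.0 from account 1 to account 2.
try:
    with conn:  # commits on success, rolls back on error (data integrity)
        conn.execute("UPDATE account SET balance = balance - 25 WHERE id = 1")
        conn.execute("UPDATE account SET balance = balance + 25 WHERE id = 2")
except sqlite3.Error:
    pass  # the transfer is rolled back as a unit

print(conn.execute("SELECT id, balance FROM account").fetchall())
# [(1, 75.0), (2, 75.0)]
```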
Predictive analytics is about finding and quantifying hidden patterns in the data, using complex mathematical models to anticipate future outcomes such as demand for products and to support better decisions. By contrast, OLAP focuses on historical data analysis and is reactive. Predictive systems are also used for customer relationship management.
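As a toy illustration of the predictive side (illustrative numbers, not a real model), the sketch below fits a linear trend to historical monthly demand and extrapolates one month ahead:

```python
# Fit a linear trend to historical monthly demand and forecast month 7.
months = [1, 2, 3, 4, 5, 6]
demand = [100, 104, 111, 115, 122, 128]   # units sold per month (made up)

# Ordinary least-squares fit of demand = slope * month + intercept.
n = len(months)
mean_x = sum(months) / n
mean_y = sum(demand) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(months, demand)) / \
        sum((x - mean_x) ** 2 for x in months)
intercept = mean_y - slope * mean_x

forecast_month_7 = slope * 7 + intercept
print(round(forecast_month_7, 1))   # roughly 133 units
```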
Data marts
A data mart is a simple data warehouse focused on a single subject or functional area. Hence it draws data from a limited number of sources such as sales, finance or marketing. Data marts are often built and controlled by a single department in an organization. The sources could be internal operational systems, a central data warehouse, or external data.

| Attribute | Data warehouse | Data mart |
|---|---|---|
| Scope of the data | enterprise | department |
| Number of subject areas | multiple | single |
| How difficult to build | difficult | easy |
| Memory required | larger | limited |
Types of data marts include dependent, independent, and hybrid data marts.
Variants
ETL
The typical extract, transform, load (ETL)-based data warehouse uses staging, data integration, and access layers to house its key functions. The staging layer or staging database stores raw data extracted from each of the disparate source data systems. The integration layer integrates disparate data sets by transforming the data from the staging layer, often storing this transformed data in an operational data store database. The integrated data are then moved to yet another database, often called the data warehouse database, where the data is arranged into hierarchical groups, often called dimensions, and into facts and aggregate facts. The combination of facts and dimensions is sometimes called a star schema. The access layer helps users retrieve data.

The main source of the data is cleansed, transformed, catalogued, and made available for use by managers and other business professionals for data mining, online analytical processing, market research and decision support. However, the means to retrieve and analyze data, to extract, transform, and load data, and to manage the data dictionary are also considered essential components of a data warehousing system. Many references to data warehousing use this broader context. Thus, an expanded definition of data warehousing includes business intelligence tools, tools to extract, transform, and load data into the repository, and tools to manage and retrieve metadata.
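A minimal sketch of these layers, under assumed data and naming: raw rows are staged, cleansed in a transform step, and then arranged into one dimension table and one fact table of a simple star schema.

```python
# Hypothetical ETL sketch: staging -> transform -> star schema load.

# Extract: raw rows as they arrive from a source system (staging layer).
staging = [
    {"order_id": 1, "cust": "alice ", "country": "de", "amount": "120.00"},
    {"order_id": 2, "cust": "Bob",    "country": "US", "amount": "75.50"},
]

# Transform: cleanse and conform values (integration layer).
def transform(row):
    return {
        "order_id": row["order_id"],
        "customer": row["cust"].strip().title(),
        "country": row["country"].upper(),
        "amount": float(row["amount"]),
    }

clean = [transform(r) for r in staging]

# Load: split into a customer dimension and a sales fact table (star schema).
customer_dim = {}   # (customer, country) -> surrogate key
fact_sales = []     # facts reference the dimension by surrogate key
for row in clean:
    key = (row["customer"], row["country"])
    cust_key = customer_dim.setdefault(key, len(customer_dim) + 1)
    fact_sales.append({"customer_key": cust_key,
                       "order_id": row["order_id"],
                       "amount": row["amount"]})

print(customer_dim)
print(fact_sales)
```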
ELT
Extract, load, transform (ELT)-based data warehousing dispenses with a separate ETL tool for data transformation. Instead, it maintains a staging area inside the data warehouse itself. In this approach, data is extracted from heterogeneous source systems and loaded directly into the data warehouse, before any transformation occurs. All necessary transformations are then handled inside the data warehouse itself, and finally the transformed data is loaded into target tables in the same data warehouse.
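A minimal sketch of the ELT pattern, using SQLite as a stand-in for the warehouse engine (table and column names are illustrative): raw rows are loaded into a staging table first, and the transformation is expressed as SQL that runs inside the warehouse.

```python
import sqlite3

# SQLite stands in for the warehouse engine in this sketch.
dw = sqlite3.connect(":memory:")

# Load: raw, untransformed source data goes straight into a staging table.
dw.execute("CREATE TABLE stg_orders (order_id INT, amount TEXT, country TEXT)")
dw.executemany("INSERT INTO stg_orders VALUES (?, ?, ?)",
               [(1, "120.00", "de"), (2, "75.50", "us")])

# Transform: cleansing and typing happen inside the warehouse, in SQL.
dw.execute("""
    CREATE TABLE fact_orders AS
    SELECT order_id,
           CAST(amount AS REAL) AS amount,
           UPPER(country)       AS country
    FROM stg_orders
""")

print(dw.execute("SELECT * FROM fact_orders").fetchall())
# [(1, 120.0, 'DE'), (2, 75.5, 'US')]
```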
Benefits
A data warehouse maintains a copy of information from the source transaction systems. This architectural complexity provides the opportunity to:
- Integrate data from multiple sources into a single database and data model. With the data congregated in a single database, a single query engine can be used to present data in an operational data store.
- Mitigate the problem of isolation-level lock contention caused by long-running analysis queries in transaction processing databases.
- Maintain data history, even if the source transaction systems do not.
- Integrate data from multiple source systems, enabling a central view across the enterprise. This benefit is always valuable, but particularly so when the organization has grown by merger.
- Improve data quality, by providing consistent codes and descriptions, flagging or even fixing bad data.
- Present the organization's information consistently.
- Provide a single common data model for all data of interest regardless of data source.
- Restructure the data so that it makes sense to the business users.
- Restructure the data so that it delivers excellent query performance, even for complex analytic queries, without impacting the operational systems.
- Add value to operational business applications, notably customer relationship management systems.
- Make decision-support queries easier to write.
- Organize and disambiguate repetitive data.
History
With the publication of The IRM Imperative by James M. Kerr, the idea of managing and putting a dollar value on an organization's data resources, and then reporting that value as an asset on a balance sheet, became popular. In the book, Kerr described a way to populate subject-area databases from data derived from transaction-driven systems to create a storage area where summary data could be further leveraged to inform executive decision-making. This concept served to promote further thinking about how a data warehouse could be developed and managed in a practical way within any enterprise.
Key developments in early years of data warehousing:
- 1960s – General Mills and Dartmouth College, in a joint research project, develop the terms dimensions and facts.
- 1970s – ACNielsen and IRI provide dimensional data marts for retail sales.
- 1970s – Bill Inmon begins to define and discuss the term data warehouse.
- 1975 – Sperry Univac introduces MAPPER, a database management and reporting system that includes the world's first 4GL. It is the first platform designed for building information centers.
- 1983 – Teradata introduces the DBC/1012 database computer specifically designed for decision support.
- 1984 – Metaphor Computer Systems, founded by David Liddle and Don Massaro, releases a hardware/software package and GUI for business users to create a database management and analytic system.
- 1988 – Barry Devlin and Paul Murphy publish the article "An architecture for a business and information system" where they introduce the term "business data warehouse".
- 1990 – Red Brick Systems, founded by Ralph Kimball, introduces Red Brick Warehouse, a database management system specifically for data warehousing.
- 1991 – James M. Kerr authors "The IRM Imperative", which suggests data resources could be reported as an asset on a balance sheet, furthering commercial interest in the establishment of data warehouses.
- 1991 – Prism Solutions, founded by Bill Inmon, introduces Prism Warehouse Manager, software for developing a data warehouse.
- 1992 – Bill Inmon publishes the book Building the Data Warehouse.
- 1995 – The Data Warehousing Institute, a for-profit organization that promotes data warehousing, is founded.
- 1996 – Ralph Kimball publishes the book The Data Warehouse Toolkit.
- 1998 – Focal modeling is implemented as an ensemble data warehouse modeling approach, with Patrik Lager as one of the main drivers.
- 2000 – Dan Linstedt releases data vault modeling into the public domain; conceived in 1990 as an alternative to Inmon and Kimball, it provides long-term historical storage of data coming in from multiple operational systems, with emphasis on tracing, auditing and resilience to change of the source data model.
- 2008 – Bill Inmon, along with Derek Strauss and Genia Neushloss, publishes "DW 2.0: The Architecture for the Next Generation of Data Warehousing", explaining his top-down approach to data warehousing and coining the term data warehousing 2.0.
- 2008 – Anchor modeling is formalized in a paper presented at the International Conference on Conceptual Modeling, where it wins the best paper award.
- 2012 – Bill Inmon develops and makes public technology known as "textual disambiguation". Textual disambiguation applies context to raw text and reformats the raw text and context into a standard database format. Once raw text is passed through textual disambiguation, it can easily and efficiently be accessed and analyzed by standard business intelligence technology. Textual disambiguation is accomplished through the execution of textual ETL. Textual disambiguation is useful wherever raw text is found, such as in documents, Hadoop, email, and so forth.
- 2013 – Data vault 2.0 is released, with some minor changes to the modeling method as well as integration with best practices from other methodologies, architectures and implementations, including agile and CMMI principles.