What is Data Warehouse ?
A Data warehouse is a repository of integrated information, available for queries and analysis. Data and information are extracted from heterogeneous sources as they are generated.This makes it much easier and more efficient to run queries over data that originally came from different sources".Another definition for data warehouse is : " A data warehouse is a logical collection of information gathered from many different operational databases used to create business intelligence that supports business analysis activities and decision-making tasks, primarily, a record of an enterprise's past transactional and operational information, stored in a database designed to favour efficient data analysis and reporting (especially OLAP)". Generally, data warehousing is not meant for current "live" data, although 'virtual' or 'point-to-point' data warehouses can access operational data. A 'real' data warehouse is generally preferred to a virtual DW because stored data has been validated and is set up to provide reliable results to common types of queries used in a business. History of data warehousing
In the 1990's as organizations of scale began to need more timely data about their business, they found that traditional information systems technology was simply too cumbersome to provide relevant data efficiently and quickly. Completing reporting requests could take days or weeks using antiquated reporting tools that were designed more or less to 'execute' the business rather than 'run' the business.
From this idea, the data warehouse was born as a place where relevant data could be held for completing strategic reports for management. The key here is the word 'strategic' as most executives were less concerned with the day to day operations than they were with a more overall look at the model and business functions.
As with all technology, over the course of the latter half of the 20th century, we saw increased numbers and types of databases. Many large businesses found themselves with data scattered across multiple platforms and variations of technology, making it almost impossible for any one individual to use data from multiple sources. A key idea within data warehousing is to take data from multiple platforms/technologies (As varied as spreadsheets, DB2 databases, IDMS records, and VSAM files) and place them in a common location that uses a common querying tool. In this way operational databases could be held on whatever system was most efficient for the operational business, while the reporting / strategic information could be held in a common location using a common language. Data Warehouses take this even a step further by giving the data itself commonality by defining what each term means and keeping it standard. (An example of this would be gender which can be referred to in many ways, but should be standardized on a data warehouse with one common way of referring to each sex.)
All of this was designed to make decision support more readily available and without affecting day to day operations. One aspect of a data warehouse that should be stressed is that it is NOT a location for ALL of a businesses data, but rather a location for data that is 'interesting'. Data that is interesting will assist decision makers in making strategic decisions relative to the organization's overall mission.
Data warehouses often hold large amounts of information which are sometimes subdivided into smaller logical units called dependent data marts. Dependent Datamarts allow for easier reporting by keeping relevant data together in one location.
Usually, two basic ideas guide the creation of a data warehouse:
- Integration of data from distributed and differently structured databases, which facilitates a global overview and comprehensive analysis in the data warehouse.
- Separation of data used in daily operations from data used in the data warehouse for purposes of reporting, decision support, analysis and controlling.
Since OLTP databases contain large volumes of data, it is very critical to unload data quickly without adding significant overhead to production database. Periodically, one imports data from enterprise resource planning (ERP) systems and other related business software systems into the data warehouse for further processing. It is common practice to "stage" data prior to merging it into a data warehouse. In this sense, to "stage data" means to queue it for preprocessing, usually with an ETL tool. The preprocessing program reads the staged data (often a business's primary OLTP databases), performs qualitative preprocessing or filtering (including denormalization, if deemed necessary), and writes it into the warehouse.
A data warehouse is created by analyzing ways to categorize data using dimensions and ways to summarize data using measures. Dimensions can be used to filter and navigate summarised data by excluding results or by displaying data in different reporting styles (cross-tabbing). Measures are performance metrics which a business is interested in following up, these are mainly sum & averages of figures collected by OLTP systems. There seems to be some misunderstanding as to how data warehouses should be designed, since in most cases, technical individuals do not really understand the broader scope of the business of their organisations.
- Source Data
- Data Staging
- Data Storage
- Information Delivery
- Management and Control