What is Data Warehouse ?
A Data
warehouse is a repository of integrated information, available for queries
and analysis. Data and information are extracted from heterogeneous sources as
they are generated.This makes it much easier and more efficient to run queries
over data that originally came from different sources".Another definition
for data warehouse is : " A data
warehouse is a logical collection of information gathered from many
different operational databases used to create business intelligence that
supports business analysis activities and decision-making tasks, primarily, a
record of an enterprise's past transactional and operational information,
stored in a database
designed
to favour efficient data analysis and reporting (especially OLAP)". Generally,
data warehousing is not meant for current "live" data, although
'virtual' or 'point-to-point' data warehouses can access operational data. A
'real' data warehouse is generally preferred to a virtual DW because stored
data has been validated and is set up to provide reliable results to common
types of queries used in a business. History
of data warehousing
In the 1990's as organizations of scale began to
need more timely data about their business, they found that traditional
information systems technology was simply too cumbersome to provide relevant
data efficiently and quickly. Completing reporting requests could take days or
weeks using antiquated reporting tools that were designed more or less to
'execute' the business rather than 'run' the business.
From this idea, the data warehouse was born as a
place where relevant data could be held for completing strategic reports for
management. The key here is the word 'strategic' as most executives were less
concerned with the day to day operations than they were with a more overall
look at the model and business functions.
As with all technology, over the course of the
latter half of the 20th century, we saw increased numbers and types of
databases. Many large businesses found themselves with data scattered across
multiple platforms and variations of technology, making it almost impossible
for any one individual to use data from multiple sources. A key idea within
data warehousing is to take data from multiple platforms/technologies (As
varied as spreadsheets, DB2 databases, IDMS records, and VSAM files) and place
them in a common location that uses a common querying tool. In this way operational
databases could be held on whatever system was most efficient for the
operational business, while the reporting / strategic information could be held
in a common location using a common language. Data Warehouses take this even a
step further by giving the data itself commonality by defining what each term
means and keeping it standard. (An example of this would be gender which can be
referred to in many ways, but should be standardized on a data warehouse with
one common way of referring to each sex.)
All of this was designed to make decision support
more readily available and without affecting day to day operations. One aspect
of a data warehouse that should be stressed is that it is NOT a location for
ALL of a businesses data, but rather a location for data that is 'interesting'.
Data that is interesting will assist decision makers in making strategic
decisions relative to the organization's overall mission.
Design of data warehouses
Data warehouses often hold large amounts of information
which are sometimes subdivided into smaller logical units called dependent data
marts. Dependent Datamarts allow for easier reporting by keeping relevant
data together in one location.
Usually, two basic ideas guide the creation of a
data warehouse:
- Integration of data from distributed and differently structured databases, which facilitates a global overview and comprehensive analysis in the data warehouse.
- Separation of data used in daily operations from data used in the data warehouse for purposes of reporting, decision support, analysis and controlling.
Since OLTP databases contain large volumes of
data, it is very critical to unload data quickly without adding significant
overhead to production database. Periodically, one imports data from enterprise resource planning (ERP)
systems and other related business software systems into the data warehouse for
further processing. It is common practice to "stage" data prior to merging
it into a data warehouse. In this sense, to "stage data" means to
queue it for preprocessing, usually with an ETL tool. The preprocessing program reads
the staged data (often a business's primary OLTP databases),
performs qualitative preprocessing or filtering (including denormalization,
if deemed necessary), and writes it into the warehouse.
Dimensions and measures
A data warehouse is created by analyzing ways to
categorize data using dimensions and ways to summarize data
using measures. Dimensions can be used to filter
and navigate summarised data by excluding results or by displaying data in
different reporting styles (cross-tabbing). Measures are performance metrics
which a business is interested in following up, these are mainly sum &
averages of figures collected by OLTP systems. There seems to be some
misunderstanding as to how data warehouses should be designed, since in most
cases, technical individuals do not really understand the broader scope of the
business of their organisations.
Building blocks or Components
- Source Data
- Data Staging
- Data Storage
- Information Delivery
- Metadata
- Management and Control
Reporting
Business Intelligence reports (e.g., MIS reports) may then be generated
from the data managed by the warehouse. In this way the data warehouse supplies
the data for and supports the business intelligence tools that an organization
might use.