UNIT2DM
DATA WAREHOUSING
Presentation Flow
DATA WAREHOUSE
1. INTRODUCTION
2. WHAT IS DATA WAREHOUSE
3. DEFINITION
4. MULTIDIMENSIONAL DATA MODEL
5. OLAP OPERATIONS
6. WAREHOUSE SCHEMA
7. DATA WAREHOUSE ARCHITECTURE
8. WAREHOUSE SERVER
9. META DATA
10. DATA WAREHOUSE BACKEND PROCESS
INTRODUCTION
A data warehouse is an information system that contains
historical and cumulative data from single or multiple
sources.
Data Warehouse Concepts simplify the reporting and analysis
process of organizations.
The concept of the data warehouse has existed since the 1980s,
when it was developed to help transition data from merely
powering operations to fueling decision support systems that
reveal business intelligence.
WHAT IS DATA WAREHOUSE
A single, complete and consistent store of data obtained from a
variety of different sources and made available to end users in
a form they can understand and use in a business context.
[Figure: example dimension hierarchies. Gender: Male, Female. Engineer: Civil, Software. Teacher: Primary, High. Secretary: Executive, Junior.]
Lattice of Cuboids:-
The dimension hierarchy helps us view the multidimensional
data in several different data cube representations.
Multidimensional data can be viewed as a lattice of cuboids.
The cuboid C[A1, A2, A3, ..., An] at the finest level of granularity is
called the base cuboid, and it consists of all data cells.
The coarsest level consists of one cell with the numeric measures
aggregated over all n dimensions. This is called the apex cuboid.
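As a minimal sketch of the lattice, the following Python snippet enumerates every cuboid for three dimensions. The dimension names are illustrative, and dimension hierarchies are ignored, so n dimensions give 2^n cuboids, ranging from the base cuboid (all dimensions kept) to the apex cuboid (all dimensions aggregated away).

```python
from itertools import combinations

def cuboid_lattice(dimensions):
    """Return all cuboids (subsets of the dimensions), grouped by level.

    Level n holds the base cuboid; level 0 holds the apex cuboid.
    """
    n = len(dimensions)
    return {k: list(combinations(dimensions, k)) for k in range(n, -1, -1)}

# Illustrative dimension names (an assumption of this sketch).
lattice = cuboid_lattice(["time", "location", "product"])

base_cuboid = lattice[3][0]   # keeps every dimension: finest granularity
apex_cuboid = lattice[0][0]   # keeps no dimension: one fully aggregated cell
total_cuboids = sum(len(level) for level in lattice.values())

for level, cuboids in lattice.items():
    print(level, cuboids)
```

Each intermediate level corresponds to grouping by a subset of the dimensions, which is why the lattice grows exponentially with the number of dimensions.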
Summary Measures:-
1. Distributive:-
A numeric measure is distributive if it can be computed in a
distributed manner: the data is partitioned, the measure is
computed on each partition, and the partial results are then
aggregated. Ex:- count, sum, min and max are distributive measures.
2. Algebraic:-
An aggregate function is algebraic if it can be computed by an
algebraic function with a bounded number of arguments, each of
which is obtained by a distributive function. Ex:- average is
obtained by sum/count.
3. Holistic:-
An aggregate function is holistic if there is no constant bound
on the storage size needed to describe a sub-aggregate. That is,
there does not exist an algebraic function with a bounded number
of arguments that can compute it. Ex:- median, mode,
most-frequent, etc.
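The difference between the three kinds of measure can be illustrated with a small Python sketch. The two partitions and their values are made up for illustration. The sum and count combine directly from per-partition results, the average is derived from those two sub-aggregates, and the median cannot be recovered from per-partition medians.

```python
from statistics import median

# Made-up measure values, split across two partitions.
partitions = [[4, 8, 15], [16, 23, 42, 7]]
all_values = [v for part in partitions for v in part]

# Distributive: per-partition results are simply aggregated again.
total_sum = sum(sum(part) for part in partitions)
total_count = sum(len(part) for part in partitions)

# Algebraic: computed from a fixed number of distributive sub-aggregates.
average = total_sum / total_count

# Holistic: combining per-partition medians does not give the true median.
true_median = median(all_values)
median_of_medians = median([median(part) for part in partitions])

print(total_sum, total_count, average)
print(true_median, median_of_medians)
```

Here the median of the partition medians differs from the median of the full data set, so a holistic measure generally needs access to all the underlying values.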
Data Cube
The sales data warehouse includes the sales amount in rupees and
the total number of units sold. Note that we can have more than
one numeric measure. The figure shows the multidimensional model
for such a situation.
We do not take all the subdivisions of the dimensions into account. The
dimension hierarchies considered for the data cube are:-
Time: (month < quarter < year);
Location: (city < province < country);
and Product.
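To make the hierarchy idea concrete, here is a hedged Python sketch that aggregates both measures (amount in rupees and units sold) from the month level up to the quarter level of the Time hierarchy. The fact rows, city names and the helper names `quarter_of` and `roll_up_to_quarter` are assumptions made for this example.

```python
from collections import defaultdict

# Made-up fact rows: (month, city, product, amount in rupees, units sold).
sales = [
    ("2023-01", "Pune", "TV", 40000, 2),
    ("2023-02", "Pune", "TV", 20000, 1),
    ("2023-04", "Mumbai", "Radio", 3000, 3),
    ("2023-05", "Pune", "Radio", 1000, 1),
]

def quarter_of(month):
    """Map 'YYYY-MM' to 'YYYY-Qn' (month < quarter in the Time hierarchy)."""
    year, mm = month.split("-")
    return f"{year}-Q{(int(mm) - 1) // 3 + 1}"

def roll_up_to_quarter(rows):
    """Aggregate both measures per (quarter, product), dropping the city."""
    totals = defaultdict(lambda: [0, 0])
    for month, _city, product, amount, units in rows:
        key = (quarter_of(month), product)
        totals[key][0] += amount
        totals[key][1] += units
    return {key: tuple(value) for key, value in totals.items()}

by_quarter = roll_up_to_quarter(sales)
print(by_quarter)
```

The same pattern would apply to the Location hierarchy by mapping each city to its province or country before grouping.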
1. Star Schema
2. Snowflake Schema
3. Fact Constellation
STAR SCHEMA
It is a modelling paradigm in which the data warehouse
contains a single large central fact table and a set of smaller
dimension tables, one for each dimension.
The fact table contains detailed summary data.
Its primary key has one key per dimension.
Every tuple in the fact table consists of the fact or subject of
interest, and the dimensions that provide that fact.
Each dimension is a single highly denormalized table.
Each tuple of the fact table contains a key pointing to each
dimension table; together these keys provide its multidimensional
coordinates.
Each dimension table consists of columns corresponding to
the attributes of the dimension.
There exists a 1:n relation between each dimension table and
the fact table: one dimension tuple can be referenced by many fact tuples.
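The structure described above can be sketched with Python's built-in `sqlite3` module. The table names, column names and sample rows are assumptions made for illustration. The fact table carries one foreign key per dimension plus the numeric measures, and each dimension is one denormalized table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_time     (time_id INTEGER PRIMARY KEY, month TEXT, quarter TEXT, year INTEGER);
CREATE TABLE dim_location (location_id INTEGER PRIMARY KEY, city TEXT, province TEXT, country TEXT);
CREATE TABLE dim_product  (product_id INTEGER PRIMARY KEY, name TEXT);
-- Fact table: one key per dimension, plus the numeric measures.
CREATE TABLE fact_sales (
    time_id     INTEGER REFERENCES dim_time(time_id),
    location_id INTEGER REFERENCES dim_location(location_id),
    product_id  INTEGER REFERENCES dim_product(product_id),
    amount      REAL,
    units       INTEGER,
    PRIMARY KEY (time_id, location_id, product_id)
);
INSERT INTO dim_time VALUES (1, 'Jan', 'Q1', 2023), (2, 'Apr', 'Q2', 2023);
INSERT INTO dim_location VALUES (1, 'Pune', 'Maharashtra', 'India');
INSERT INTO dim_product VALUES (1, 'TV'), (2, 'Radio');
INSERT INTO fact_sales VALUES (1, 1, 1, 40000, 2), (1, 1, 2, 3000, 3), (2, 1, 1, 20000, 1);
""")

# Each dimension used in the query costs exactly one join.
rows = conn.execute("""
    SELECT t.quarter, p.name, SUM(f.amount), SUM(f.units)
    FROM fact_sales f
    JOIN dim_time t    ON t.time_id = f.time_id
    JOIN dim_product p ON p.product_id = f.product_id
    GROUP BY t.quarter, p.name
    ORDER BY t.quarter, p.name
""").fetchall()
print(rows)
```

Because `dim_location` keeps city, province and country in one table, a query at any level of the location hierarchy still needs only a single join to that dimension.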
ADVANTAGES OF STAR SCHEMA
Easy to understand
Easy to define hierarchies
Reduces the number of physical joins
Low maintenance
Simple meta data
SNOWFLAKE SCHEMA
The dimension tables are normalized here.
There is a single fact table and multiple dimensional tables.
Like the Star Schema, each tuple of the fact table consists of a
(foreign) key pointing to each of the dimension tables that
provide its multidimensional coordinates.
It also stores numerical values (non-dimensional attributes,
and results of statistical functions) for those coordinates.
ADVANTAGES OF SNOWFLAKE SCHEMA
Normalized tables are easy to maintain
Saves storage space as it contains non redundant data
Limitations
Navigation across tables is complex.
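The trade-off can be sketched with `sqlite3`, assuming a location dimension normalized into city, province and country tables. All table names and rows are made up for illustration. Each name is stored once, but reaching a higher level of the hierarchy needs an extra join per level.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The location dimension is normalized into three linked tables.
CREATE TABLE country  (country_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE province (province_id INTEGER PRIMARY KEY, name TEXT,
                       country_id INTEGER REFERENCES country(country_id));
CREATE TABLE city     (city_id INTEGER PRIMARY KEY, name TEXT,
                       province_id INTEGER REFERENCES province(province_id));
CREATE TABLE fact_sales (city_id INTEGER REFERENCES city(city_id), amount REAL);
INSERT INTO country VALUES (1, 'India');
INSERT INTO province VALUES (1, 'Maharashtra', 1), (2, 'Karnataka', 1);
INSERT INTO city VALUES (1, 'Pune', 1), (2, 'Mumbai', 1), (3, 'Mysuru', 2);
INSERT INTO fact_sales VALUES (1, 100), (2, 250), (3, 75);
""")

# Province level: two joins from the fact table.
by_province = conn.execute("""
    SELECT pr.name, SUM(f.amount)
    FROM fact_sales f
    JOIN city c      ON c.city_id = f.city_id
    JOIN province pr ON pr.province_id = c.province_id
    GROUP BY pr.name ORDER BY pr.name
""").fetchall()

# Country level: three joins from the fact table.
by_country = conn.execute("""
    SELECT co.name, SUM(f.amount)
    FROM fact_sales f
    JOIN city c      ON c.city_id = f.city_id
    JOIN province pr ON pr.province_id = c.province_id
    JOIN country co  ON co.country_id = pr.country_id
    GROUP BY co.name
""").fetchall()
print(by_province, by_country)
```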
FACT CONSTELLATION SCHEMA
A Fact Constellation is a kind of schema in which more than one
fact table shares some of the dimension tables. It is also
called a Galaxy Schema.
For example, let us assume that Deccan Electronics Company
would like to have a supply-and-delivery fact table. It may
contain five dimensions, or keys: time, item, delivery-agent,
origin and destination, along with the numeric measures: the
number of units supplied and the cost of delivery.
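A rough way to picture the sharing is to describe each fact table by the set of dimensions its keys point to. In the sketch below, the `supply_delivery` dimensions come from the example above, while the `sales` fact table and its dimensions are an assumption added only so that there is a second fact table to share with.

```python
# Each fact table is described by the dimension tables its keys point to.
fact_tables = {
    # Hypothetical second fact table, assumed for illustration.
    "sales": {"time", "item", "location"},
    # Dimensions named in the supply-and-delivery example.
    "supply_delivery": {"time", "item", "delivery_agent", "origin", "destination"},
}

# Dimensions referenced by every fact table are the shared ones.
shared_dimensions = set.intersection(*fact_tables.values())

# A constellation needs several fact tables with at least one shared dimension.
is_constellation = len(fact_tables) > 1 and bool(shared_dimensions)

print(sorted(shared_dimensions), is_constellation)
```

In a physical design, each shared dimension would be a single table referenced by foreign keys from both fact tables.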
DATA WAREHOUSE ARCHITECTURE
TIER-1
Tier-1 is the bottom tier or physical layer.
Tier-1 contains the main data warehouse.
It is essentially the warehouse server, which can follow three
models: enterprise warehouse, data mart and virtual data warehouse.
Tier-1 includes metadata, data marts, the data warehouse,
monitoring and administration.
Back-end tools and utilities first perform extract, clean,
transform and load functions on the data.
TIER-2
Tier-2 is the OLAP engine for analytical processing.
The OLAP engine includes ROLAP, MOLAP and
Extended SQL OLAP.
TIER-3
This tier is the front-end client layer.
This layer holds the query tools, reporting tools, analysis
tools, visualization tools and data mining tools.
The backend process is concerned with extracting data
from multiple operational databases and from external
sources, and with cleaning, transforming and integrating the data
for loading into the data warehouse server.
WAREHOUSE SERVER
1) Enterprise Warehouse:-
This model collects all the information about the subjects. It
provides corporate-wide data integration, usually from one or
more operational systems. An enterprise data warehouse is
typically implemented on a traditional mainframe.
2) Data Mart:-
Data marts are partitions of the overall data warehouse. A data mart
is a subset of that huge data warehouse built specifically for a
department. Data marts may contain some overlapping data.
The industry is moving away from a single, physical data
warehouse towards a set of smaller, more manageable
databases called data marts.
Stand Alone Data Mart:-
This approach enables a department or work group to implement
a data mart with minimal or no impact on the enterprise's
operational database.
Dependent Data Mart:-
This approach is similar to the stand alone data mart, except that
management of the data sources by the enterprise database is
required. These data sources include operational databases
and external sources of data.
3) Virtual Data Warehouse:-
It is a set of views over operational databases. A virtual
warehouse is easy to build but requires excess capacity on
operational database servers.
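As a minimal sketch of the idea, the snippet below uses an in-memory SQLite database as a stand-in for an operational database and defines the "warehouse" purely as a view over it. The table, column and view names are hypothetical.

```python
import sqlite3

# Stand-in for an operational database (names are hypothetical).
ops = sqlite3.connect(":memory:")
ops.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, city TEXT, amount REAL);
INSERT INTO orders VALUES (1, 'Pune', 100), (2, 'Pune', 50), (3, 'Mumbai', 70);
-- The virtual warehouse is only a view: no data is copied or stored separately.
CREATE VIEW v_sales_by_city AS
    SELECT city, SUM(amount) AS total_amount, COUNT(*) AS order_count
    FROM orders
    GROUP BY city;
""")

before = ops.execute("SELECT * FROM v_sales_by_city ORDER BY city").fetchall()

# A new operational transaction is visible in the view immediately.
ops.execute("INSERT INTO orders VALUES (4, 'Mumbai', 30)")
after = ops.execute("SELECT * FROM v_sales_by_city ORDER BY city").fetchall()

print(before)
print(after)
```

Every analytical query against the view is executed by the operational server itself, which is where the need for excess capacity comes from.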
META DATA
Metadata is data about the data.
It serves to identify the contents and location of data in the
warehouse.
Metadata is a bridge between the data warehouse and the
decision support application.
In addition to providing a logical linkage between data and
application, it helps locate and access information across the
data warehouse.
Types of Meta Data:-
1. Build-Time Metadata
2. Usage Metadata
3. Control Metadata
1. Build-time Metadata:-
Whenever a warehouse is designed and built, the metadata
thus generated is known as build-time metadata.
This metadata links business and warehouse terminology and
describes the data's technical structure.
It is the most detailed and exact type of metadata and is used
extensively by warehouse designers, developers and
administrators.
It is the primary source of most of the metadata used in the
warehouse.
2. Usage Metadata:-
Usage metadata is derived from build-time metadata. It is an
important tool for users and data administrators.
3. Control Metadata:-
The control metadata provides vital information about the
timeliness of warehouse data and helps users to keep track of
the sequence and timing of warehouse events.
System programmers mostly use it.
However, one subset which is generated and used by the tools
that populate the warehouse, is of considerable interest to
users and data warehouse administrators.
OLAP ENGINE
The main functions of the OLAP engine are to present the user with a
multidimensional view of the data warehouse and to provide
tools for OLAP operations.
ROLAP
• It stands for Relational Online Analytical Processing.
• Data is stored in a relational database.
• Data is retrieved via SQL from the database for analysis.
• It is used for large volumes of data.
• The access time in ROLAP is slow.
• It uses complex SQL queries.
MOLAP
• It stands for Multidimensional Online Analytical Processing.
• Data is stored in a multidimensional cube.
• Data is retrieved directly from the cube for analysis.
• It is used for less/limited volumes of data.
• The access time is quick in MOLAP.
• A sparse matrix is used in MOLAP.
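The contrast can be sketched in a few lines of Python. The fact rows are made up, and a plain dictionary stands in for a real multidimensional store. The ROLAP-style answer is computed by an SQL query at request time, while the MOLAP-style answer is a direct lookup in a precomputed, sparsely stored cube.

```python
import sqlite3

# Made-up fact rows: (quarter, product, amount).
facts = [("Q1", "TV", 40000), ("Q1", "Radio", 3000), ("Q2", "TV", 20000)]

# ROLAP style: facts stay in a relational table; each question becomes SQL.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (quarter TEXT, product TEXT, amount INTEGER)")
db.executemany("INSERT INTO sales VALUES (?, ?, ?)", facts)
rolap_answer = db.execute(
    "SELECT SUM(amount) FROM sales WHERE quarter = ? AND product = ?",
    ("Q1", "TV"),
).fetchone()[0]

# MOLAP style: cells are precomputed and addressed by their coordinates.
# Only non-empty cells are stored, which mimics a sparse matrix.
cube = {}
for quarter, product, amount in facts:
    cube[(quarter, product)] = cube.get((quarter, product), 0) + amount

molap_answer = cube[("Q1", "TV")]            # direct cell lookup, no SQL
empty_cell = cube.get(("Q2", "Radio"), 0)    # cell absent from the sparse cube

print(rolap_answer, molap_answer, empty_cell)
```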
DATA WAREHOUSE BACKEND PROCESS
The data warehouse server uses back-end utilities to populate
and refresh its data. The various functions that take place in the
backend process are:
Data Extraction
Data Cleaning
Data Transformation
Data Loading
Data Refreshing.
Data Extraction: This function gathers data from multiple
heterogeneous sources. The data may be collected from
production data, legacy data, internal office systems,
external systems and metadata.
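The first four backend functions can be pictured as a small pipeline. The sketch below is a toy Python version under stated assumptions: the source names, record fields and the specific cleaning and transformation rules are invented for illustration and are not prescribed by the text.

```python
def extract(sources):
    """Gather raw records from several heterogeneous sources."""
    return [dict(rec, source=name) for name, recs in sources.items() for rec in recs]

def clean(records):
    """Drop records with a missing amount and trim stray whitespace."""
    return [dict(r, city=r["city"].strip()) for r in records
            if r.get("amount") is not None]

def transform(records):
    """Bring records to a common format: title-case city, float amount."""
    return [dict(r, city=r["city"].title(), amount=float(r["amount"]))
            for r in records]

def load(warehouse, records):
    """Append the prepared records to the warehouse store."""
    warehouse.extend(records)
    return warehouse

# Made-up source systems and records.
sources = {
    "production": [{"city": " pune ", "amount": "100"}],
    "legacy": [{"city": "MUMBAI", "amount": 70}, {"city": "Pune", "amount": None}],
}

warehouse = load([], transform(clean(extract(sources))))
print(warehouse)
```

Refreshing would correspond to rerunning such a pipeline when the source data changes, so that updates are propagated to the warehouse.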