Data Warehouse Concepts & Terminology: - Vamshi Myana
Contents
What is a Data Warehouse?
Why a Separate Data Warehouse?
Data Granularity
Difference between OLTP & DW
Data Warehouse Architecture
Top-Down Versus Bottom-Up Approach
Data Warehouses Versus Data Marts
Dimensional Modeling Fundamentals
Extraction, Transformation and Load (ETL)
OLAP
What is a Data Warehouse?
A data warehouse is a relational database that is
designed for query and analysis rather than for
transaction processing. It usually contains historical
data derived from transaction data, but it can include
data from other sources. It separates analysis
workload from transaction workload and enables an
organization to consolidate data from several sources.
In addition to a relational database, a data warehouse
environment includes an extraction,
transformation, and loading (ETL) solution, an
online analytical processing (OLAP) engine, client
analysis tools, and other applications that manage the
process of gathering data and delivering it to business
users.
Data Warehouse Properties
Subject-Oriented
Integrated
Nonvolatile
Time-Variant
-- Bill Inmon, Building the Data Warehouse, 1996
Subject-Oriented
Data is categorized and stored by business subject
rather than by application.
OLTP applications: Equity Plans, Shares, Insurance, Savings, Loans
Warehouse subject: Customer financial information
Integrated
Constructed by integrating multiple,
heterogeneous data sources
Relational databases, flat files, on-line transaction
records
Time-Variant
Data is stored as a series of snapshots, each
representing a period of time.
Time      Data
Jan-97    January
Feb-97    February
Mar-97    March
Nonvolatile
Typically, data in the data warehouse is not updated or deleted.
Operational: Insert, Update, Delete, Read
Warehouse: Load, Read
Data Warehouse Terminology
Data Mart
Departmental subsets that focus on selected subjects.
Helps the knowledge worker (executive, manager, analyst) make faster and better decisions.
Data Granularity
What is the granularity of your DW?
Granularity is the level of detail we
want to store in the data warehouse.
For a retail store, Point of Sale (POS) data is
the lowest-granularity information
available.
For banking, it is the account-level detail
based on everyday transactions.
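To make the idea of grain concrete, here is a minimal sketch that stores sales at the lowest grain (one row per item scanned) and then summarizes to coarser grains. The rows, column names, and the `summarize` helper are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical point-of-sale rows at the lowest grain: one row per item scanned.
pos_lines = [
    {"store": "S1", "date": "2024-01-01", "item": "tea", "amount": 3.50},
    {"store": "S1", "date": "2024-01-01", "item": "coffee", "amount": 4.25},
    {"store": "S1", "date": "2024-01-02", "item": "tea", "amount": 3.50},
    {"store": "S2", "date": "2024-01-01", "item": "tea", "amount": 3.50},
]

def summarize(rows, grain):
    """Aggregate amounts to a coarser grain, given as a tuple of column names."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[col] for col in grain)
        totals[key] += row["amount"]
    return dict(totals)

# Coarser grains can always be derived from the detailed rows...
daily_by_store = summarize(pos_lines, ("store", "date"))
by_store = summarize(pos_lines, ("store",))
# ...but item-level detail cannot be recovered from daily_by_store alone.
```

The design point is that the chosen grain fixes the most detailed question the warehouse can answer; storing only daily totals would save space but lose the item-level view.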
Difference between OLTP & DW
Property            Operational               Data Warehouse
Response Time       Sub-second to seconds     Seconds to hours
Operations          DML
Nature of Data      30-60 days
Data Organization   Applications              Subject, time
Size                Small to large
Data Source         Operational, Internal     Operational, Internal, External
Activities          Processes                 Analysis
Data Warehouse Architectures
Top-Down Approach
The advantages of this approach are:
A truly corporate effort, an enterprise
view of data
Inherently architected, not a union of
disparate data marts
Single, central storage of data about
the content
Centralized rules and control
Top-Down Approach
The disadvantages are:
Takes longer to build
High exposure/risk to failure
Needs high level of cross-functional
skills
High outlay without proof of concept
Bottom-Up Approach
The advantages of this approach are:
Faster and easier implementation of
manageable pieces
Favorable return on investment and
proof of concept
Less risk of failure
Inherently incremental; can schedule
important data marts first
Bottom-Up Approach
The disadvantages are:
Each data mart has its own narrow view
of data
Spreads redundant data across every data
mart
Perpetuates inconsistent and
irreconcilable data
Dimensional Model
Query oriented
Structured around data usage not business rules
Organized roughly into base facts and dimensions of those facts
Based on identification of key grains of data and on characteristics of those grains
Consisting usually of snapshot, business data
Looks to reduce the number and depth of joins
Common schema patterns:
Star schema: A fact table in the middle connected to a set of dimension tables
Snowflake schema: A refinement of the star schema where some dimensional
hierarchy is normalized into a set of smaller dimension tables, forming a shape
similar to a snowflake
Fact constellations: Multiple fact tables share dimension tables, viewed as a
collection of stars, therefore called a galaxy schema or fact constellation
time dimension: time_key, day, day_of_the_week, month, quarter, year
item dimension: item_key, item_name, brand, type, supplier_type
branch dimension: branch_key, branch_name, branch_type
location dimension: location_key, street, city, province_or_state, country
Fact table keys: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales
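A star schema like this one can be sketched with Python's built-in `sqlite3` module. This is a cut-down illustration under stated assumptions: only the time and item dimensions are included, the table names (`time_dim`, `item_dim`, `sales_fact`) are chosen here, and every row is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, day INTEGER,
                       month TEXT, quarter TEXT, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT,
                       brand TEXT, type TEXT);
-- The fact table holds one foreign key per dimension plus the measures.
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item_dim(item_key),
    units_sold INTEGER,
    dollars_sold REAL
);
INSERT INTO time_dim VALUES (1, 15, 'Jan', 'Q1', 1997), (2, 15, 'Apr', 'Q2', 1997);
INSERT INTO item_dim VALUES (10, 'Green Tea', 'Acme', 'beverage'),
                            (20, 'Espresso', 'Acme', 'beverage');
INSERT INTO sales_fact VALUES (1, 10, 5, 17.5), (1, 20, 2, 8.5), (2, 10, 3, 10.5);
""")

# A typical star query: join the fact table to each dimension once, then group.
star_rows = conn.execute("""
SELECT t.quarter, i.item_name, SUM(f.units_sold), SUM(f.dollars_sold)
FROM sales_fact AS f
JOIN time_dim AS t ON t.time_key = f.time_key
JOIN item_dim AS i ON i.item_key = f.item_key
GROUP BY t.quarter, i.item_name
ORDER BY t.quarter, i.item_name
""").fetchall()
```

Every dimension is exactly one join away from the fact table, which is what keeps the number and depth of joins low.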
Example of Snowflake Schema
Store Dimension: STORE KEY, Store Description, City, State, District ID, Region_ID, Regional Mgr.
District table: District_ID, District Desc., Region_ID
Region table: Region_ID, Region Desc., Regional Mgr.
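The snowflaked store hierarchy can be sketched the same way. This is an assumption-laden illustration: the table and column names are lowercase versions of the ones in the diagram, the store table keeps only the district key, and the sample rows are made up.

```python
import sqlite3

snow = sqlite3.connect(":memory:")
snow.executescript("""
-- Each level of the store hierarchy lives in its own normalized table.
CREATE TABLE region (region_id INTEGER PRIMARY KEY, region_desc TEXT, regional_mgr TEXT);
CREATE TABLE district (district_id INTEGER PRIMARY KEY, district_desc TEXT,
                       region_id INTEGER REFERENCES region(region_id));
CREATE TABLE store (store_key INTEGER PRIMARY KEY, store_desc TEXT, city TEXT, state TEXT,
                    district_id INTEGER REFERENCES district(district_id));
INSERT INTO region VALUES (1, 'West', 'R. Lee');
INSERT INTO district VALUES (11, 'Bay Area', 1);
INSERT INTO store VALUES (101, 'Downtown', 'Oakland', 'CA', 11),
                         (102, 'Harbor', 'San Jose', 'CA', 11);
""")

# Reaching the region now takes two joins instead of reading one wide row.
store_rollup = snow.execute("""
SELECT s.store_desc, d.district_desc, r.region_desc, r.regional_mgr
FROM store AS s
JOIN district AS d ON d.district_id = s.district_id
JOIN region AS r ON r.region_id = d.region_id
ORDER BY s.store_key
""").fetchall()
```

The trade-off is visible in the query: the regional manager is stored once per region rather than once per store, at the cost of deeper joins.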
Dimensional Modeling Terminology
A fact table stores measures as well as keys
representing relationships to various dimensions.
Additive - Measures that can be added across all
dimensions.
Semi-Additive - Measures that can be added across some
dimensions but not others.
Non-Additive - Measures that cannot be added across any
dimension.
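The three kinds of measure can be illustrated with a small banking sketch. The snapshot rows below are invented: deposits stand in for an additive measure, end-of-day balances for a semi-additive one, and a ratio for a non-additive one.

```python
# Hypothetical daily snapshot rows: (account, day, deposit_amount, end_of_day_balance)
snapshots = [
    ("A", 1, 100.0, 100.0),
    ("A", 2, 50.0, 150.0),
    ("B", 1, 200.0, 200.0),
    ("B", 2, 0.0, 200.0),
]

# Additive: deposits can be summed across accounts and across days.
total_deposits = sum(dep for _, _, dep, _ in snapshots)

# Semi-additive: balances can be summed across accounts for a single day...
balance_day2 = sum(bal for _, day, _, bal in snapshots if day == 2)

# ...but summing balances across days counts the same money more than once.
misleading_sum = sum(bal for _, _, _, bal in snapshots)

# Non-additive: a ratio cannot be summed at all; recompute it from additive parts.
deposits_day2 = sum(dep for _, day, dep, _ in snapshots if day == 2)
deposit_ratio_day2 = deposits_day2 / balance_day2
```

Here `misleading_sum` comes out larger than the money actually held on any day, which is why balances are usually averaged or taken at period end along the time dimension.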
Conformed Dimension
Dimension tables that adhere to a common
structure, and therefore allow queries to be
executed across star schemas.
Sales Schema
Item dimension: Item Key, Item Desc., Brand Desc., Category, ..
Sales Fact: DATE KEY, ITEM KEY, STORE KEY, PROMO KEY
Inventory Schema
Item dimension: Item Key, Item Desc., Brand Desc., Category, ..
Inventory Fact: DATE KEY, ITEM KEY, STORE KEY
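Because both schemas use the same item dimension, results from the two fact tables can be lined up on a shared attribute. A minimal sketch, with made-up rows and a hypothetical `by_category` helper:

```python
# Shared (conformed) item dimension, keyed by item key. All values are invented.
item_dim = {
    1: {"desc": "Green Tea", "category": "Beverages"},
    2: {"desc": "Onions", "category": "Produce"},
}

sales_fact = [(1, 5), (2, 3), (1, 2)]     # (item_key, units_sold)
inventory_fact = [(1, 40), (2, 10)]       # (item_key, units_on_hand)

def by_category(fact_rows):
    """Group any fact table by the category attribute of the shared dimension."""
    totals = {}
    for item_key, qty in fact_rows:
        category = item_dim[item_key]["category"]
        totals[category] = totals.get(category, 0) + qty
    return totals

sold = by_category(sales_fact)
on_hand = by_category(inventory_fact)

# Drill across: one row per category with a measure from each star.
drill_across = {
    cat: (sold.get(cat, 0), on_hand.get(cat, 0))
    for cat in sorted(set(sold) | set(on_hand))
}
```

If each schema kept its own differently coded item table, the two result sets would not share keys and could not be merged this simply.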
Extraction, Transformation and Load
OLTP Databases -> Staging File -> Warehouse Database
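The flow from OLTP database to staging file to warehouse can be sketched end to end with the standard library. Everything here is a stand-in: two in-memory SQLite databases play the OLTP and warehouse roles, an `io.StringIO` buffer plays the staging file, and the `orders` and `order_fact` tables are hypothetical.

```python
import csv
import io
import sqlite3

# Extract: read rows from a stand-in OLTP database.
oltp = sqlite3.connect(":memory:")
oltp.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT, amount TEXT)")
oltp.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, " alice ", "10.50"), (2, "BOB", "4.00")])

staging = io.StringIO()  # stands in for the staging file on disk
csv.writer(staging).writerows(
    oltp.execute("SELECT order_id, customer, amount FROM orders ORDER BY order_id"))

# Transform: tidy customer names and convert amounts while reading the staging data.
staging.seek(0)
cleaned = [(int(order_id), customer.strip().title(), float(amount))
           for order_id, customer, amount in csv.reader(staging)]

# Load: insert the cleaned rows into the warehouse table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE order_fact (order_id INTEGER, customer TEXT, amount REAL)")
warehouse.executemany("INSERT INTO order_fact VALUES (?, ?, ?)", cleaned)
loaded = warehouse.execute("SELECT * FROM order_fact ORDER BY order_id").fetchall()
```

Staging decouples the extract from the load, so the OLTP system is read once and any cleanup happens away from it.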
What is OLAP?
Online Analytical Processing: viewing data in a
multidimensional way.
Why OLAP?
Slice and dice for the data warehouse.
An RDBMS is a two-dimensional way of storing /
viewing the data.
OLAP is a multidimensional way of storing /
viewing the data.
OLAP operations
Roll up (drill-up): summarize data by climbing up a
hierarchy or by dimension reduction
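Both forms of roll-up can be sketched on a tiny cube held as a dictionary. The cell values, the month-to-quarter mapping, and the function names are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical cube cells: (product, country, month) -> sales
cells = {
    ("Tea", "Japan", "Jan"): 10,
    ("Tea", "Japan", "Feb"): 12,
    ("Tea", "Italy", "Apr"): 7,
    ("Coffee", "Italy", "Jan"): 20,
}
month_to_quarter = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1", "Apr": "Q2"}

def roll_up_time(cube):
    """Climb the time hierarchy one level: month -> quarter."""
    out = defaultdict(int)
    for (product, country, month), value in cube.items():
        out[(product, country, month_to_quarter[month])] += value
    return dict(out)

def drop_country(cube):
    """Dimension reduction: aggregate the country dimension away."""
    out = defaultdict(int)
    for (product, _country, period), value in cube.items():
        out[(product, period)] += value
    return dict(out)

by_quarter = roll_up_time(cells)
by_product_quarter = drop_country(by_quarter)
```

The first function keeps all three dimensions but at a coarser level; the second removes a dimension entirely.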
OLAP operations
Slicing: Selecting the dimensions of the cube to be viewed.
Example: View Sales volume as a function of Product by
Country by Quarter
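A minimal sketch of that example, assuming a cube with a fourth, hypothetical "channel" dimension that the view leaves out. The `view` function follows the definition on this slide; `slice_at` shows the other common usage of the term, where one dimension is fixed to a single member. All data is invented.

```python
from collections import defaultdict

DIMENSIONS = ("product", "country", "quarter", "channel")  # assumed cube layout

sales_cube = {
    ("Tea", "Japan", "Q1", "web"): 10,
    ("Tea", "Japan", "Q1", "store"): 5,
    ("Tea", "Italy", "Q2", "web"): 7,
    ("Coffee", "Italy", "Q1", "store"): 20,
}

def view(cube, dims):
    """Keep only the named dimensions and sum over the ones left out."""
    positions = [DIMENSIONS.index(d) for d in dims]
    out = defaultdict(int)
    for coords, value in cube.items():
        out[tuple(coords[i] for i in positions)] += value
    return dict(out)

def slice_at(cube, dim, member):
    """Fix one dimension to a single member and keep the matching cells."""
    i = DIMENSIONS.index(dim)
    return {coords: value for coords, value in cube.items() if coords[i] == member}

# Sales volume as a function of Product by Country by Quarter.
product_country_quarter = view(sales_cube, ("product", "country", "quarter"))
q1_only = slice_at(sales_cube, "quarter", "Q1")
```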
Types of OLAP
Three types of OLAP are used in the industry.
1. MOLAP: Multidimensional OLAP (e.g. MSOLAP, Essbase, Cognos).
2. ROLAP: Relational OLAP (e.g. Business Objects, Microstrategy).
3. HOLAP: Hybrid OLAP.
Architecture diagram of ROLAP
Data Warehouse or Data Mart -> App Server (ROLAP tools like BO, Cognos, Microstrategy, etc.; BI Metadata) -> OLAP Report 1, OLAP Report 2, ... OLAP Report n
When a report is executed by an end user, the actual SQL is issued to the RDBMS to get
the data. Some BI tools can even store the result set in the application server and
periodically refresh that report based on the data refreshes that happen in the DW.
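The ROLAP behaviour described here can be sketched as a report object that turns its definition into SQL, runs it against the RDBMS, and keeps the result set until told to refresh. `RolapReport` is a hypothetical class written for this illustration and is not modeled on any particular BI product; SQLite stands in for the warehouse RDBMS.

```python
import sqlite3

class RolapReport:
    """Hypothetical report: builds SQL from metadata and stores the result set."""

    def __init__(self, conn, dimension, measure, table="sales"):
        self.conn = conn
        # Identifiers come from trusted report metadata here, never from user input.
        self.sql = (f"SELECT {dimension}, SUM({measure}) FROM {table} "
                    f"GROUP BY {dimension} ORDER BY {dimension}")
        self._cache = None

    def run(self):
        """Issue the SQL on first use; afterwards serve the stored result set."""
        if self._cache is None:
            self._cache = self.conn.execute(self.sql).fetchall()
        return self._cache

    def refresh(self):
        """Call after a warehouse load so the stored result set is rebuilt."""
        self._cache = None
        return self.run()

rdbms = sqlite3.connect(":memory:")
rdbms.execute("CREATE TABLE sales (country TEXT, amount REAL)")
rdbms.executemany("INSERT INTO sales VALUES (?, ?)", [("Italy", 7.0), ("Japan", 10.0)])

report = RolapReport(rdbms, "country", "amount")
first = report.run()
rdbms.execute("INSERT INTO sales VALUES ('Japan', 5.0)")  # simulated DW refresh
stale = report.run()      # still the stored result set
fresh = report.refresh()  # re-issues the SQL
```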
Architecture diagram of MOLAP
Data Warehouse or Data Mart -> Microsoft Analysis Services (MOLAP cubes; BI Metadata, cube definitions, etc.) -> OLAP Report 1, OLAP Report 2, ... OLAP Report n
When a report is executed by an end user, the actual data is retrieved from the MOLAP
cubes. It is retrieved using MDX queries generated from the report. MDX stands
for Multidimensional Expressions. SQL is used to get data from an RDBMS; MDX is used
to get data from MOLAP. The MOLAP cubes are refreshed periodically
based on the data refreshes that happen in the DW.
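The MOLAP idea of answering reports from a pre-built cube rather than from SQL at report time can be sketched in Python. This is not MDX and not how Analysis Services stores cubes; it is a toy model that precomputes every aggregate of two dimensions from a made-up `sales` table, with `ALL` as an assumed placeholder member.

```python
import sqlite3
from itertools import product

ALL = "*"  # placeholder member meaning "aggregated over this dimension"

def build_cube(conn):
    """Precompute every aggregate of (product, country) from the warehouse table."""
    cube = {}
    for prod, country, amount in conn.execute(
            "SELECT product, country, amount FROM sales"):
        # Each fact row contributes to its own cell and to every roll-up of it.
        for key in product((prod, ALL), (country, ALL)):
            cube[key] = cube.get(key, 0) + amount
    return cube

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (product TEXT, country TEXT, amount INTEGER)")
warehouse.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                      [("Tea", "Japan", 10), ("Tea", "Italy", 7), ("Coffee", "Italy", 20)])

molap_cube = build_cube(warehouse)     # rerun after each DW load to refresh the cube
tea_total = molap_cube[("Tea", ALL)]   # answered from the cube, no SQL at report time
```

Precomputing makes lookups cheap but means the cube is only as current as its last refresh.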
Terminology
Cube
A cube is a
multidimensional structure
of data. Cubes are defined
by a set of dimensions and
measures.
Terminology
Dimension
A structural attribute of a cube that acts as
an index for identifying values within a
multidimensional array.
If all dimensions have a single member selected,
then a single cell is defined.
(Figure: cube with axes Products, Location, Time)
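The way dimension members index into a multidimensional array can be sketched with nested lists. The member lists and numbers are invented, and `cell` and `sub_cube` are hypothetical helpers.

```python
# Each dimension is an ordered list of members; a member's position is its index.
dimensions = {
    "product": ["Coffee", "Tea"],
    "location": ["Japan", "Italy"],
    "time": ["January", "February"],
}

# values[p][l][t] holds the measure for product p, location l, time t.
values = [
    [[1.0, 2.0], [3.0, 4.0]],   # Coffee
    [[5.0, 6.0], [7.0, 8.0]],   # Tea
]

def cell(product, location, time):
    """One member chosen on every dimension identifies exactly one cell."""
    p = dimensions["product"].index(product)
    l = dimensions["location"].index(location)
    t = dimensions["time"].index(time)
    return values[p][l][t]

def sub_cube(product):
    """Choosing a member on only one dimension leaves a 2-D block, not a cell."""
    return values[dimensions["product"].index(product)]
```

For example, `cell("Tea", "Italy", "February")` picks out a single number, while `sub_cube("Tea")` still spans location and time.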
Terminology
Measures
Numeric data of interest.
(Figure: cube with Products axis (Coffee, Apples, Tea, Onions), Time axis (January, February, March, April) and Location axis (China, Peru, Japan, Italy); one cell shows the measure value 1.95)
Summary
This session covered the following topics:
What is a Data Warehouse?
Difference between OLTP & DW
Data Warehouse Architecture and approach
Dimensional Modeling
What is OLAP?
Questions?
Thank You.