DMDW_Operations
Technology
Definition 1: A data warehouse is a repository of
information collected from multiple sources, stored under
a unified schema, and usually residing at a single site.
What is a Data Warehouse?
Definition 2: “A data warehouse is a subject-oriented, integrated,
time-variant, and nonvolatile collection of data in support of
management’s decision-making process.”—W. H. Inmon
Data warehousing:
The process of constructing and using data warehouses
Data Warehouse—Subject-Oriented
Data Warehouse—Integrated
Constructed by integrating multiple, heterogeneous data
sources
relational databases, flat files, on-line transaction
records
Data cleaning and data integration techniques are
applied.
Ensure consistency in naming conventions, encoding
Data Warehouse—Time Variant
Data Warehouse—Non volatile
Data Warehouse vs. Operational DBMS
OLTP (on-line transaction processing)
Major task of traditional relational DBMS
Day-to-day operations: purchasing, inventory, banking,
manufacturing, payroll, registration, accounting, etc.
OLAP (on-line analytical processing)
Major task of data warehouse system
Data analysis and decision making
Distinct features (OLTP vs. OLAP):
User and system orientation
Data contents
Database design
View
Access patterns
OLTP vs. OLAP
Why a Separate Data Warehouse?
A major reason for the separation is to promote high
performance for both systems
DBMS, tuned for OLTP: access methods (read/write), indexing
on primary key, concurrency control, recovery mechanisms
Warehouse, tuned for OLAP: complex OLAP queries,
multidimensional views, consolidation
Different functions and different data:
missing data: decision support (DS) requires historical data, which
operational DBs do not typically maintain
data consolidation: DS requires consolidation (aggregation,
summarization) of data from heterogeneous sources
data quality: different sources typically use inconsistent data
representations, codes, and formats, which have to be reconciled
Data Warehousing and OLAP
Technology
A multi-dimensional data model
From Tables and Spreadsheets to Data Cubes
Cube: A Lattice of Cuboids
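[Figure: lattice of cuboids over the dimensions time, item, location, supplier]
3-D cuboids: (time, item, location), (time, item, supplier), (time, location, supplier), (item, location, supplier)
4-D (base) cuboid: (time, item, location, supplier)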
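The lattice can be enumerated mechanically: each cuboid is one subset of the dimensions. The following is a minimal sketch in Python, assuming no concept hierarchies (one level per dimension); the function name cuboid_lattice is a hypothetical helper, not part of any library.

```python
from itertools import combinations

def cuboid_lattice(dimensions):
    """Group every subset of the dimensions (one cuboid each) by dimensionality."""
    return {k: list(combinations(dimensions, k))
            for k in range(len(dimensions) + 1)}

dims = ["time", "item", "location", "supplier"]
lattice = cuboid_lattice(dims)

for k, cuboids in lattice.items():
    print(f"{k}-D cuboids ({len(cuboids)}):", cuboids)

# With n dimensions and no hierarchies there are 2**n cuboids in total.
total = sum(len(cuboids) for cuboids in lattice.values())
print("total cuboids:", total)
```

The 0-D entry is the apex cuboid (the empty group-by) and the 4-D entry is the base cuboid.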
Conceptual Modeling of Data Warehouses
Example of Star Schema
Sales Fact Table: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales
Dimension tables:
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, state_or_province, country
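A star schema maps directly onto relational tables: one fact table whose foreign keys point at the dimension tables. The sketch below uses Python's built-in sqlite3 module and the column names from the star schema example. The table names (sales_fact, time_dim, and so on) and all sample rows are assumptions made for illustration, and avg_sales is left out because it can be derived at query time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, day INTEGER,
    day_of_the_week TEXT, month INTEGER, quarter TEXT, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT,
    brand TEXT, type TEXT, supplier_type TEXT);
CREATE TABLE branch_dim (branch_key INTEGER PRIMARY KEY, branch_name TEXT,
    branch_type TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, street TEXT,
    city TEXT, state_or_province TEXT, country TEXT);
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item_dim(item_key),
    branch_key INTEGER REFERENCES branch_dim(branch_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    units_sold INTEGER, dollars_sold REAL);
""")

# Illustrative rows only; none of these values come from the slides.
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?, ?, ?, ?)",
                 [(1, 15, "Thu", 1, "Q1", 2004), (2, 20, "Tue", 4, "Q2", 2004)])
conn.execute("INSERT INTO item_dim VALUES (1, 'Sony-TV', 'Sony', 'TV', 'wholesale')")
conn.execute("INSERT INTO branch_dim VALUES (1, 'B1', 'retail')")
conn.executemany("INSERT INTO location_dim VALUES (?, ?, ?, ?, ?)",
                 [(1, "Main Street", "Vancouver", "British Columbia", "Canada"),
                  (2, "First Avenue", "Toronto", "Ontario", "Canada")])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?, ?)",
                 [(1, 1, 1, 1, 2, 800.0), (1, 1, 1, 2, 1, 400.0),
                  (2, 1, 1, 1, 3, 1200.0)])

# Star join: the fact table joined to two dimensions, grouped by quarter and city.
rows = conn.execute("""
    SELECT t.quarter, l.city, SUM(f.units_sold), SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN time_dim AS t ON f.time_key = t.time_key
    JOIN location_dim AS l ON f.location_key = l.location_key
    GROUP BY t.quarter, l.city
    ORDER BY t.quarter, l.city
""").fetchall()
for row in rows:
    print(row)
```

Every analytical query follows the same shape: join the fact table to the dimensions it needs, then group by dimension attributes.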
Example of Snowflake Schema
Sales Fact Table: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales
Dimension tables:
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_key
supplier: supplier_key, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city_key
city: city_key, city, state_or_province, country
Example of Fact Constellation
Sales Fact Table: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales
Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location
Measures: dollars_cost, units_shipped
Dimension tables:
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, province_or_state, country
shipper: shipper_key, shipper_name, location_key, shipper_type
Cube Definition Syntax in DMQL
Defining Star Schema in DMQL
Defining Snowflake Schema in DMQL
Defining Fact Constellation in DMQL
define cube sales [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars),
units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state,
country)
define cube shipping [time, item, shipper, from_location, to_location]:
dollars_cost = sum(cost_in_dollars), units_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location as location in
cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales
A Concept Hierarchy: Dimension (location)
[Figure: concept hierarchy for the location dimension, with "all" at the top level]
Multidimensional Data
[Figure: hierarchical summarization paths; surviving labels: Office, Day, Month]
A Sample Data Cube
[Figure: sample data cube with dimensions Product (TV, PC, VCR, sum), Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), and Country (U.S.A., Canada, Mexico, sum); the highlighted cell is the total annual sales of TV in U.S.A.]
0-D (apex) cuboid: all
1-D cuboids: product; date; country
3-D (base) cuboid: product, date, country
Typical OLAP Operations
Roll up (drill-up)
Drill down (roll down)
Dice
Slice
Pivot (rotate)
1. Roll up (drill-up):
Summarizes data by climbing up a concept hierarchy or by dimension reduction
2. Drill down (roll down):
•Reverse of roll-up
•Goes from a higher-level summary to a lower-level summary or detailed data, or
introduces new dimensions
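Roll-up and drill-down can be sketched on a tiny cube held in a Python dictionary. This is an illustration only: the cell values, the city-to-country hierarchy, and the helper names (roll_up_hierarchy, roll_up_drop, drill_down) are all assumptions, not taken from the slides.

```python
from collections import defaultdict

# Hypothetical concept hierarchy for location: city -> country.
CITY_TO_COUNTRY = {"Vancouver": "Canada", "Toronto": "Canada", "Chicago": "USA"}

# Base cells keyed by (city, quarter, item_type) -> dollars_sold (made-up values).
base = {
    ("Vancouver", "Q1", "TV"): 100,
    ("Vancouver", "Q2", "TV"): 150,
    ("Toronto", "Q1", "TV"): 200,
    ("Chicago", "Q1", "PC"): 300,
    ("Chicago", "Q1", "TV"): 50,
}

def roll_up_hierarchy(cells):
    """Roll up by climbing the location hierarchy from city to country."""
    out = defaultdict(int)
    for (city, quarter, item), value in cells.items():
        out[(CITY_TO_COUNTRY[city], quarter, item)] += value
    return dict(out)

def roll_up_drop(cells, axis):
    """Roll up by dimension reduction: aggregate away the dimension at `axis`."""
    out = defaultdict(int)
    for key, value in cells.items():
        out[key[:axis] + key[axis + 1:]] += value
    return dict(out)

def drill_down(cells, country):
    """Drill down: return the city-level cells behind one country-level summary."""
    return {key: value for key, value in cells.items()
            if CITY_TO_COUNTRY[key[0]] == country}

by_country = roll_up_hierarchy(base)
without_item = roll_up_drop(base, 2)
canada_detail = drill_down(base, "Canada")
print(by_country)
print(without_item)
print(canada_detail)
```

Drill-down needs the more detailed cells to still be available; a summary alone cannot be expanded back into its parts.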
3. Dice:
The dice operation defines a subcube by performing a selection on two or
more dimensions.
4. Slice:
The slice operation performs a selection on one dimension of the given
cube, resulting in a subcube.
5. Pivot (rotate):
A visualization operation that rotates the data axes in view to
provide an alternative presentation of the data.
Fig. Typical OLAP Operations
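Slice, dice, and pivot can be sketched in a few lines of Python on a cube stored as a dictionary keyed by (item, quarter, country). The cell values and the helper names slice_cube, dice_cube, and pivot are assumptions for illustration.

```python
# Cells keyed by (item, quarter, country) -> units sold (made-up values).
cube = {
    ("TV", "Q1", "Canada"): 10,
    ("TV", "Q2", "Canada"): 20,
    ("PC", "Q1", "Canada"): 30,
    ("PC", "Q1", "Mexico"): 40,
    ("VCR", "Q3", "Mexico"): 50,
}

def slice_cube(cells, axis, value):
    """Slice: select one value on one dimension and drop that dimension."""
    return {key[:axis] + key[axis + 1:]: v
            for key, v in cells.items() if key[axis] == value}

def dice_cube(cells, allowed):
    """Dice: keep cells whose coordinates fall in the allowed sets
    given for two or more dimensions (axis index -> set of values)."""
    return {key: v for key, v in cells.items()
            if all(key[axis] in values for axis, values in allowed.items())}

def pivot(table):
    """Pivot: swap the two axes of a 2-D view."""
    return {(col, row): v for (row, col), v in table.items()}

q1 = slice_cube(cube, 1, "Q1")                      # 2-D view: (item, country)
sub = dice_cube(cube, {0: {"TV", "PC"}, 1: {"Q1", "Q2"}, 2: {"Canada"}})
q1_rotated = pivot(q1)                              # 2-D view: (country, item)
print(q1)
print(sub)
print(q1_rotated)
```

A slice reduces the dimensionality of the result, a dice keeps all dimensions but narrows each one, and a pivot changes only how the result is laid out.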
A Starnet Query Model for Querying
Multidimensional Databases
Data Warehousing and OLAP
Technology: An Overview
Design of Data Warehouse: A Business
Analysis Framework
To design an effective data warehouse we need to understand and
analyze business needs and construct a business analysis framework.
Data source view
This information may be documented at various levels of
detail and accuracy, from individual data source tables to
integrated data source tables.
Data warehouse view
consists of fact tables and dimension tables
Business query view
shows the data in the warehouse from the viewpoint
of the end user
The Process of Data Warehouse Design
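Data Warehouse: A Multi-Tier (3-Tier) Architecture
[Figure: three-tier architecture. Data sources (operational DBs, other sources) are extracted, transformed, loaded, and refreshed, under a monitor and integrator with metadata, into the data storage tier (data warehouse, data marts). An OLAP server tier sits above it and serves the front-end tools tier (analysis, query/report, data mining).]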
Data Warehouse Development:
A Recommended Approach
[Figure: recommended development approach; surviving labels: Multi-Tier Data Warehouse, Distributed Data Marts, Enterprise Data Warehouse, Data Mart, Data Mart]
Data extraction
Data cleaning
Data transformation
Load
Refresh
OLAP Server Architectures
Relational OLAP (ROLAP)
Is an extended relational DBMS that maps operations on
Data warehouse implementation
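$T = \prod_{i=1}^{n} (L_i + 1)$
where T is the total number of cuboids in an n-dimensional cube and L_i is the
number of levels of dimension i (the extra 1 counts the virtual top level, all)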
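The count of cuboids is a product over the dimensions of the number of levels plus one. A minimal sketch in Python follows; total_cuboids is a hypothetical helper, and the level counts in the first example are an assumption (time with 4 levels, item with 3, location with 4).

```python
from math import prod

def total_cuboids(levels_per_dimension):
    """Total cuboids T = product over dimensions of (L_i + 1),
    where L_i excludes the virtual top level 'all'."""
    return prod(levels + 1 for levels in levels_per_dimension)

# Assumed hierarchies: time (4 levels), item (3 levels), location (4 levels).
with_hierarchies = total_cuboids([4, 3, 4])

# With a single level per dimension the formula reduces to 2**n.
flat_four_dims = total_cuboids([1, 1, 1, 1])

print(with_hierarchies, flat_four_dims)
```

Even a three-dimensional cube grows quickly once hierarchies are counted, which is why materializing every cuboid is often impractical.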
Cube Operation
Efficient Data Cube Computation
Iceberg Cube
Partially materialized cubes are known as iceberg
cubes
Compute only the cuboid cells whose count or
other aggregate satisfies a condition such as
HAVING COUNT(*) >= min_sup
Motivation
Only a small portion of cube cells may be “above the water” in a
sparse cube (for many cells in a cuboid, the measure (count or
sum) value will be zero; a cuboid that stores many such zero-valued
cells is said to be sparse)
Only calculate “interesting” cells: data above a certain threshold,
e.g., count >= 10 or a sales threshold of $100
Avoid explosive growth of the cube
Example: iceberg cube
compute cube sales_iceberg as
select month, city, customer_group, count(*)
from salesInfo
cube by month, city, customer_group
having count(*) >= min_sup
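The query above can be mimicked in Python to show what an iceberg cube contains. This is a naive sketch under stated assumptions: it counts every group-by and then filters, whereas practical iceberg-cube algorithms prune while computing. The sample salesInfo rows and the function name iceberg_cube are made up for illustration.

```python
from collections import Counter
from itertools import combinations

def iceberg_cube(rows, dims, min_sup):
    """For every group-by over `dims`, keep only cells with count >= min_sup."""
    result = {}
    for k in range(len(dims) + 1):
        for group_by in combinations(dims, k):
            counts = Counter(tuple(row[d] for d in group_by) for row in rows)
            kept = {cell: n for cell, n in counts.items() if n >= min_sup}
            if kept:
                result[group_by] = kept
    return result

# Hypothetical contents of salesInfo.
sales_info = [
    {"month": "Jan", "city": "Vancouver", "customer_group": "business"},
    {"month": "Jan", "city": "Vancouver", "customer_group": "business"},
    {"month": "Jan", "city": "Toronto", "customer_group": "education"},
    {"month": "Feb", "city": "Vancouver", "customer_group": "business"},
]

cube = iceberg_cube(sales_info, ["month", "city", "customer_group"], min_sup=2)
for group_by, cells in cube.items():
    print(group_by, cells)
```

Cells such as (Jan, Toronto) occur only once and are dropped, so the stored cube stays much smaller than the full cube.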
Indexing OLAP Data: Bitmap Index and Join Index
To facilitate efficient data access, most data warehouse systems support index
structures and materialized views (using cuboids).
Indexing OLAP Data: Join indexing
2. The join indexing method gained popularity from its use in relational database
query processing.
A join index registers the joinable rows of two relations from a relational database.
The linkage between a fact table and its corresponding dimension tables comprises
the fact table’s foreign key and the dimension table’s primary key.
Example: the “Main Street” value in the location dimension table joins with
tuples T57, T238, and T884 of the sales fact table. Similarly, the “Sony-TV” value
in the item dimension table joins with tuples T57 and T459 of the sales fact
table.
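The example can be sketched in Python as a mapping from each dimension value to the fact tuples it joins with. The tuple identifiers and the "Main Street" and "Sony-TV" values come from the example; the remaining attribute values are placeholders, and build_join_index is a hypothetical helper.

```python
from collections import defaultdict

# Sales fact tuples; values marked "other-..." are placeholders, not from the slides.
sales_fact = {
    "T57": {"location": "Main Street", "item": "Sony-TV"},
    "T238": {"location": "Main Street", "item": "other-item"},
    "T459": {"location": "other-street", "item": "Sony-TV"},
    "T884": {"location": "Main Street", "item": "other-item"},
}

def build_join_index(fact, dimension):
    """Map each value of one dimension to the fact tuples that join with it."""
    index = defaultdict(list)
    for tuple_id, row in fact.items():
        index[row[dimension]].append(tuple_id)
    return dict(index)

location_index = build_join_index(sales_fact, "location")
item_index = build_join_index(sales_fact, "item")

# A composite lookup intersects the two indices instead of scanning the fact table.
both = sorted(set(location_index["Main Street"]) & set(item_index["Sony-TV"]))

print(location_index["Main Street"])
print(item_index["Sony-TV"])
print(both)
```

Because the joinable rows are precomputed, a query on "Main Street" or "Sony-TV" goes straight to the matching fact tuples.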
Efficient Processing of OLAP Queries
Determine which operations should be performed on the
available cuboids
Transform the selection, projection, roll-up (group-by), and drill-down operations
specified in the query into corresponding SQL and/or OLAP
operations
Determine which materialized cuboid(s) should be selected
for the OLAP operations
Identify all of the materialized cuboids that may potentially be used
to answer the query and select the cuboid with the least cost.
OLAP query processing. Suppose that we define a data cube for AllElectronics of the
form “sales_cube [time, item, location]: sum(sales_in_dollars).”
The dimension hierarchies used are “day < month < quarter < year” for time;
“item_name < brand < type” for item; and “street < city < province_or_state < country” for
location.
Let the query to be processed be on {brand, province_or_state} with the
condition “year = 2004”, and suppose there are 4 materialized cuboids available:
1) {year, item_name, city}
2) {year, brand, country}
3) {year, brand, province_or_state}
4) {item_name, province_or_state} where year = 2004
Which should be selected to process the query?
Cuboid 2 cannot be used, because country is a more general level than
province_or_state. Cuboids 1, 3, and 4 can all answer the query. Cuboid 1 is the
finest-grained and so likely the most costly. Cuboid 3 needs no further roll-up, but
if efficient indices are available for cuboid 4, then cuboid 4 may be a better
choice.
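The first half of this decision, finding which cuboids can answer the query at all, can be sketched in Python. A cuboid qualifies if it holds every dimension the query needs at the same or a finer level. The hierarchies are those of the example; can_answer and the data layout are assumptions, and the sketch does not estimate cost.

```python
# Concept hierarchies from the example, ordered from finest to coarsest level.
HIERARCHIES = {
    "time": ["day", "month", "quarter", "year"],
    "item": ["item_name", "brand", "type"],
    "location": ["street", "city", "province_or_state", "country"],
}
# level name -> (dimension, rank); a lower rank means a finer level.
LEVEL = {level: (dim, rank)
         for dim, levels in HIERARCHIES.items()
         for rank, level in enumerate(levels)}

def can_answer(cuboid_levels, needed_levels):
    """True if the cuboid holds each needed dimension at the same or a finer level."""
    available = {LEVEL[lvl][0]: LEVEL[lvl][1] for lvl in cuboid_levels}
    for level in needed_levels:
        dim, rank = LEVEL[level]
        if dim not in available or available[dim] > rank:
            return False
    return True

# Each cuboid: (levels, already restricted to year = 2004?)
cuboids = {
    1: (["year", "item_name", "city"], False),
    2: (["year", "brand", "country"], False),
    3: (["year", "brand", "province_or_state"], False),
    4: (["item_name", "province_or_state"], True),
}
query_levels = ["brand", "province_or_state"]
filter_levels = ["year"]  # needed only when the cuboid is not already filtered

usable = [number for number, (levels, prefiltered) in cuboids.items()
          if can_answer(levels, query_levels + ([] if prefiltered else filter_levels))]
print(usable)
```

Choosing among the usable cuboids then depends on their sizes and available indices, which this sketch leaves out.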
From Data Warehousing to Data Mining
Data mining
From On-Line Analytical Processing (OLAP)
to On-Line Analytical Mining (OLAM)
OLAM (also called OLAP mining) integrates OLAP with data mining technology.
Most data mining tools need to work on integrated, consistent, and cleaned
data, which requires costly preprocessing steps.
Available information processing infrastructure surrounding data
warehouses
Accessing, integration, consolidation, and transformation of multiple
heterogeneous databases
OLAP-based exploratory data analysis
Mining with drilling, dicing, pivoting, etc.
On-line selection of data mining functions
An OLAM System Architecture
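[Figure: OLAM system architecture in four layers. Layer 1, data repository: databases and data warehouse, populated through data cleaning and data integration. Layer 2, MDDB: multidimensional database with metadata, reached through a database API with filtering and integration. Layer 3, OLAP/OLAM: OLAM engine and OLAP engine. Layer 4, user interface: user GUI API, which accepts mining queries and returns mining results.]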