DW Notes
Business Intelligence
Slides kindly borrowed from the course
Data Warehousing and Machine Learning
Aalborg University, Denmark
Christian S. Jensen
Torben Bach Pedersen
Christian Thomsen
{csj,tbp,chr}@cs.aau.dk
Course Structure
Business intelligence
Purpose
Business Intelligence (BI)
Literature
Multidimensional Databases and Data Warehousing, Christian S. Jensen, Torben Bach Pedersen, Christian Thomsen, Morgan & Claypool Publishers, 2010
Data Warehouse Design: Modern Principles and Methodologies, Golfarelli and Rizzi, McGraw-Hill, 2009
Advanced Data Warehouse Design: From Conventional to Spatial and Temporal Applications, Elzbieta Malinowski, Esteban Zimányi, Springer, 2008
The Data Warehouse Lifecycle Toolkit, Kimball et al., Wiley, 1998
The Data Warehouse Toolkit, 2nd Ed., Kimball and Ross, Wiley, 2002
Overview
Multidimensional modeling
ETL
Performance optimization
Combination of technologies
Why is BI Important?
Worldwide BI revenue in 2005 = US$ 5.7 billion
Each store maintains its own customer records and sales records
Hard to answer questions like: find the total sales of Product X from
stores in Aalborg
Can you see the problems of using those data for business
analysis?
Heterogeneous sources
Data Warehousing
Solution: new analysis environment (DW) where data are
Extracted
Cleansed
Transformed
Aggregated (?)
Loaded into the DW
[Figure: source applications (Appl.) and their databases (DB) are transformed (Trans.) into the (global) data warehouse (DW) and the (local) data marts (DM). These new databases and systems (OLAP) serve the warehouse (data) consumers: OLAP, data mining, and visualization.]
[Figure: subject-oriented systems. Applications (Appl.) and their databases (DB) are transformed (Trans.) into the DW, which holds all subjects, integrated. Data marts (DM) hold selected subjects such as Sales, Costs, and Profit, and serve the D-Appl. applications.]
Top-down:
1. Design of DW
2. Design of DMs
In-between:
1. Design of DW for DM1
2. Design of DM2 and integration with DW
3. Design of DM3 and integration with DW
4. ...
Bottom-up:
1. Design of DMs
2. Maybe integration of DMs in DW
3. Maybe no DW
4th query: use a special table to store IDs of all dairy products,
in advance
Multidimensional Modeling
Example: sales of supermarkets
Facts and measures
Dimensions
[Figure: Sales facts surrounded by the Product (e.g., Beer), Store (City, County), and Time (Date) dimensions.]
Multidimensional Modeling
How do we model the Time dimension?
Hierarchies with multiple levels
Attributes, e.g., holiday, event
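Time dimension table columns: tid, day, day #, week #, month #, year, work day
Example rows:
day = January 1st 2009, year = 2009, work day = No
day = January 2nd 2009, year = 2009, work day = Yes
Hierarchy levels: T, Year, Month, Week, Day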
Disadvantage?
Data redundancy (but controlled redundancy is acceptable)
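A Time dimension like the one above is usually generated rather than typed in. The following is a minimal sketch, assuming ISO week numbering and assuming that weekends and explicitly listed holidays are non-working days. The function name and column names are illustrative and not taken from the course.

```python
from datetime import date, timedelta

def build_time_dimension(start, end, holidays=frozenset()):
    """Return one row per day, with one attribute per hierarchy level."""
    rows = []
    day = start
    tid = 1
    while day <= end:
        _, iso_week, iso_weekday = day.isocalendar()
        rows.append({
            "tid": tid,                # surrogate key of the day
            "day": day.isoformat(),
            "day_no": day.day,
            "week_no": iso_week,       # assumption: ISO 8601 week numbers
            "month_no": day.month,
            "year": day.year,
            # Assumption: weekends and listed holidays are not work days.
            "workday": iso_weekday <= 5 and day not in holidays,
        })
        day += timedelta(days=1)
        tid += 1
    return rows

# January 1st 2009 is passed in as a holiday, January 2nd is a normal Friday.
time_dim = build_time_dimension(date(2009, 1, 1), date(2009, 1, 2),
                                holidays={date(2009, 1, 1)})
```

Storing week, month, and year on every day row is the controlled redundancy mentioned above: it costs a few extra columns but makes grouping at any level a simple column lookup.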
Denormalized product table:
Product ID  Type   Category  Price
001         Beer   Beverage  6.00
002         Rice   Cereal    4.00
003         Beer   Beverage  7.00
004         Wheat  Cereal    5.00
Redundant data
Modification anomalies

Normalized product tables:
Product ID  TypeID  Price
001         013     6.00
002         052     4.00
003         013     7.00
004         067     5.00

TypeID  Type   CategoryID
013     Beer   042
052     Rice   099
067     Wheat  099

CategoryID  Category
042         Beverage
099         Cereal
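The relationship between the two product designs can be checked with a small sketch: joining the three normalized tables reproduces the denormalized table. The example uses Python's built-in sqlite3 module with the product values from these notes. The table names Product, Type, and Category are assumptions, since the slides only show column names.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Category (CategoryID TEXT PRIMARY KEY, Category TEXT);
CREATE TABLE Type (TypeID TEXT PRIMARY KEY, Type TEXT, CategoryID TEXT);
CREATE TABLE Product (ProductID TEXT PRIMARY KEY, TypeID TEXT, Price REAL);
INSERT INTO Category VALUES ('042', 'Beverage'), ('099', 'Cereal');
INSERT INTO Type VALUES ('013', 'Beer', '042'), ('052', 'Rice', '099'),
                        ('067', 'Wheat', '099');
INSERT INTO Product VALUES ('001', '013', 6.00), ('002', '052', 4.00),
                           ('003', '013', 7.00), ('004', '067', 5.00);
""")

# Two joins are needed to get back the flat (denormalized) product rows.
denormalized = con.execute("""
    SELECT p.ProductID, t.Type, c.Category, p.Price
    FROM Product p
    JOIN Type t ON p.TypeID = t.TypeID
    JOIN Category c ON t.CategoryID = c.CategoryID
    ORDER BY p.ProductID
""").fetchall()
```

The normalized form stores "Beer" and "Beverage" once, at the price of two joins per query. The denormalized form repeats them per product but answers the same query from a single table.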
                        OLTP               OLAP
Target                  operational needs  business analysis
Data
Model                   normalized         denormalized/multidimensional
Query language          SQL
Queries                 small              large
Updates
Transactional recovery  necessary          not necessary
Optimized for           update operations  query operations
Store       Product  Time  Sales
Aalborg     Bread    2000  57
Aalborg     Milk     2000  56
Copenhagen  Bread    2000  123
11
102
250
Interactive analysis
Explorative discovery
Fast response times required
All Time
OLAP operations/queries
9 10
11 15
Bus architecture
Data
marts
Time
Customer
Product
Costs
Profit
Supplier
+
ETL
Extract
Transformations / cleansing
Load
Performance Optimization
Sales
tid
pid
locid
sales
10
20
40
1 billion rows
At query time, such partial result can be utilized to derive the final
result very fast
Materialization Example
Sales (tid, pid, locid, sales): 1 billion rows, with sales values such as 10, 20, 40
VIEW TotalSales (pid, locid, sales): 100,000 rows, with sales values such as 30, 40
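The materialization idea can be sketched with Python's built-in sqlite3 module. SQLite has no materialized views, so the sketch stores the aggregate in an ordinary table named TotalSales. The three sample rows are hypothetical stand-ins for the billion-row fact table, chosen so that the aggregates come out as 30 and 40.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE Sales (tid INTEGER, pid INTEGER, locid INTEGER, sales INTEGER)"
)
# Hypothetical sample rows; the real fact table is far larger.
con.executemany("INSERT INTO Sales VALUES (?, ?, ?, ?)", [
    (1, 1, 1, 10),
    (2, 1, 1, 20),
    (1, 2, 3, 40),
])

# Precompute the partial result once: total sales per (pid, locid).
con.execute("""
    CREATE TABLE TotalSales AS
    SELECT pid, locid, SUM(sales) AS sales
    FROM Sales
    GROUP BY pid, locid
""")

# The same question answered from the detail data and from the aggregate.
from_base = con.execute(
    "SELECT pid, SUM(sales) FROM Sales GROUP BY pid ORDER BY pid"
).fetchall()
from_aggregate = con.execute(
    "SELECT pid, SUM(sales) FROM TotalSales GROUP BY pid ORDER BY pid"
).fetchall()
```

Both queries return the same totals, but the second one reads the much smaller aggregate. This only works because SUM is additive, so sums of partial sums equal the overall sum.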
Central DW Architecture
Clients
Simplicity
Easy to manage
Central
DW
Cons
Source
Source
Federated DW Architecture
Clients
Cons
Finance
mart
Mrktng
mart
Distr.
mart
Logical
DW
Source
Source
Tiered Architecture
Central DW is materialized
Data is distributed to data marts in
one or more tiers
Only aggregated data in cube tiers
Data is aggregated/reduced as it
moves through tiers
Pros
Central
DW
Cons
Most complex
Hard to manage
Common DW Issues
Metadata management
DW project management
Summary
Multidimensional modeling
ETL
Performance optimization
Multidimensional Databases
Overview
All types of data are equal, difficult to identify the data that is
important for business analysis
No difference between:
What is important
What just describes the important
What is important
What describes the important
What we want to optimize
Easy for query operations
Facts
Dimensions
Cube Example
Milk
56
67
Bread
Aalborg
57
45
211
Copenhagen
123
127
2000
2001
Cubes
A cube may have many dimensions!
Dimensions
Dimensions are the core of multidimensional databases
Selection of data
Grouping of data at the right level of detail
Dimensions
Dimensions have hierarchies with levels
Dimension Example
Location
Schema: T, Country, City
Instance: Country values include USA and Denmark
Facts
Facts represent the subject of the desired analysis
Types of Facts
Event fact (transaction)
Fact-less facts
Snapshot fact
Often both event facts and both kinds of snapshot facts exist
Granularity
Granularity of facts is important
Scalability
Example: sale
Sometimes the data is aggregated (total sales per store per day
per product)
Might be necessary due to scalability
Measures
Measures represent the fact property that the users want
to study and optimize
Types of Measures
Three types of measures
Additive
Semi-additive
Non-additive
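The difference between the three types shows up as soon as values are aggregated. The sketch below uses hypothetical numbers. It treats sales amount and units as additive, a stock level as semi-additive (it can be summed across stores but not across days), and an average price as non-additive.

```python
# Hypothetical per-store facts: (store, units_sold, sales_amount).
facts = [("A", 10, 50.0), ("B", 30, 90.0)]

# Additive: plain sums are meaningful along every dimension.
total_amount = sum(amount for _, _, amount in facts)
total_units = sum(units for _, units, _ in facts)

# Non-additive: averaging the per-store averages gives the wrong answer.
avg_prices = [amount / units for _, units, amount in facts]
naive_avg = sum(avg_prices) / len(avg_prices)
# Recompute from the additive components instead.
correct_avg = total_amount / total_units

# Semi-additive: stock level per (store, day), hypothetical values.
stock = {("A", 1): 100, ("A", 2): 80, ("B", 1): 40, ("B", 2): 60}
# Summing across stores for one day is meaningful...
stock_day2 = sum(v for (_, day), v in stock.items() if day == 2)
# ...but summing one store's levels across days is not a stock level.
meaningless_sum = sum(v for (store, _), v in stock.items() if store == "A")
```

A common design consequence is to store the additive components (amount and units) in the fact table and derive non-additive values such as the average price at query time.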
Schema Documentation
Store dimension: T, County, Store
Product dimension: T, Category, Product
Customer dimension: T, Cust. group, Customer
Time dimension: T, Year, Month, Day
Measures: Sales price, Count, Avg. sales price
No well-defined standard
Our own notation
Possible reasons
ROLAP
Relational OLAP
Data stored in relational tables
Pros
Cons
Product ID
Store ID
Sales
MOLAP
Multidimensional OLAP
Data stored in special multidimensional data structures
Pros
Cons
HOLAP
Hybrid OLAP
Detail data stored in relational tables (ROLAP)
Aggregates stored in multidimensional structures (MOLAP)
Pros
Cons
High complexity
Relational Implementation
Goal for dimensional modeling: surround the facts with
as much context (dimensions) as we can
Granularity of the fact table is important
Some properties
Relational Design
Product Type Category
Store
Top
Beer
Beverage
Product
City
County Date
Store
Sales
25 May 2009
5.75
Time
Star schemas
Snowflake schemas
Product dimension table:
ProductID  Product  Type  Category
1          Top      Beer  Beverage

Time dimension table:
TimeID  Day  Month  Year
1       25.  May    1997

Store dimension table:
StoreID  Store     City   County
1        Trøjborg  Århus  Århus
Relational Implementation
The fact table stores facts
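A star schema for the supermarket example can be sketched with Python's built-in sqlite3 module. The dimension rows are the ones shown in these notes. The fact row that links them, the column name Sale, and the table name Time are assumptions made for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Product (ProductID INTEGER PRIMARY KEY, Product TEXT,
                      Type TEXT, Category TEXT);
CREATE TABLE Time (TimeID INTEGER PRIMARY KEY, Day INTEGER,
                   Month TEXT, Year INTEGER);
CREATE TABLE Store (StoreID INTEGER PRIMARY KEY, Store TEXT,
                    City TEXT, County TEXT);
-- The fact table holds one foreign key per dimension plus the measure.
CREATE TABLE Sales (ProductID INTEGER REFERENCES Product,
                    TimeID INTEGER REFERENCES Time,
                    StoreID INTEGER REFERENCES Store,
                    Sale REAL);
INSERT INTO Product VALUES (1, 'Top', 'Beer', 'Beverage');
INSERT INTO Time VALUES (1, 25, 'May', 1997);
INSERT INTO Store VALUES (1, 'Trøjborg', 'Århus', 'Århus');
-- Assumed fact row: one sale of 5.75 linking the three dimension rows.
INSERT INTO Sales VALUES (1, 1, 1, 5.75);
""")

# A typical star join: one join per dimension, grouped at chosen levels.
result = con.execute("""
    SELECT s.City, p.Category, t.Year, SUM(f.Sale)
    FROM Sales f
    JOIN Product p ON f.ProductID = p.ProductID
    JOIN Time t ON f.TimeID = t.TimeID
    JOIN Store s ON f.StoreID = s.StoreID
    GROUP BY s.City, p.Category, t.Year
""").fetchall()
```

Because every level of a hierarchy is a column in the same dimension table, changing the grouping level (for example from City to County) only changes the column names in the query, not the joins.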
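Snowflake schemas
Type table:   TypeID = 1, Type = Beer, CategoryID = 1
Time table:   TimeID = 1, Day = 25., MonthID = 1
Month table:  MonthID = 1, Month = May, YearID = 1
Store table:  StoreID = 1, Store = Trøjborg, CityID = 1
City table:   CityID = 1, City = Århus, CountyId = 1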
Question Time
Suppose that we want to replace the original Store hierarchy
A by a new hierarchy B
How do we modify the star schema to reflect this?
How do we modify the snowflake schema to reflect this?
Store Schema A: T, County, City, Store
Store Schema B: T, Country, County, City, Store
Star vs Snowflake
Star Schemas
+
+
+
+
-
Snowflake schemas
+
+
+
-
Redundancy in the DW
Only very little or no redundancy in fact tables
Star dimension tables have redundant entries for the higher levels
Redundancy problems?
Up to a certain limit
OLAP Queries
Starting level
(City, Year, Product)
[Figure: the cube at the starting level, with Aalborg and Copenhagen, Milk and Bread, 2000 and 2001. Panels show Slice/Dice, aggregation to ALL Time, and drill down.]
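The OLAP operations can be sketched on a tiny in-memory cube. The three cells below come from the Store/Product/Time/Sales rows earlier in these notes. The function names roll_up and slice_cube are illustrative and do not belong to any OLAP product.

```python
from collections import defaultdict

# Dimension order of every cell key at the starting level.
DIMS = ("City", "Year", "Product")

cube = {
    ("Aalborg", 2000, "Bread"): 57,
    ("Aalborg", 2000, "Milk"): 56,
    ("Copenhagen", 2000, "Bread"): 123,
}

def roll_up(cells, keep):
    """Aggregate away every dimension that is not listed in `keep`."""
    positions = [DIMS.index(d) for d in keep]
    out = defaultdict(int)
    for key, value in cells.items():
        out[tuple(key[i] for i in positions)] += value
    return dict(out)

def slice_cube(cells, dim, member):
    """Keep only the cells where dimension `dim` equals `member`."""
    i = DIMS.index(dim)
    return {k: v for k, v in cells.items() if k[i] == member}

by_city = roll_up(cube, ("City",))                 # roll up Year and Product
bread_only = slice_cube(cube, "Product", "Bread")  # slice on one product
```

Drill-down is the inverse of roll_up: it returns to a finer level, which is only possible if the finer-grained cells (here the original `cube`) are still available.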
DW Design Steps
Choose the business process(es) to model
Sales
Product dimension
Store dimension
Promotion dimension
Dollar_sales
Unit_sales
Dollar_cost
Customer_count
Summary
Cubes: Dimensions, Facts, Measures
OLAP Queries
Relational Implementation
Redundancy
Overview
Changing Dimensions
In the last lecture, we assumed that
dimensions are stable over time
New rows in dimension tables can be inserted
Existing rows do not change
1. No special handling
2. Versioning dimension values
Example
Store dim.: StoreID, Address, City, District, Size, SCategory
Time dim.: TimeID, Weekday, Week, Month, Quarter, Year, DayNo, Holiday
Product dim.: ProductID, Description, Brand, PCategory
Sales fact: TimeID, StoreID, ProductID, ItemsSold, Amount
Attribute values in dimensions vary over time
Problems
[Figure: a change on a timeline.]
Example
Store dim.
StoreID
Time dim.
Address
Sales fact
Product dim.
City
TimeID
District
StoreID
Size
ProductID
SCategory
ItemsSold
On a certain day,
customers bought 2000
apples from that store.
Amount
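Sales fact (StoreID, ItemsSold) and Store dim. (StoreID, Size) over time:
1. Sales fact: (001, 2000). Store dim.: (001, 250).
2. The store size changes to 450. Sales fact: (001, 2000). Store dim.: (001, 450).
3. A new fact arrives. Sales fact: (001, 2000), (001, 3500). Store dim.: (001, 450).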
Solution 1
Solution 1: Overwrite the old values in the
dimension tables
Consequences
Pros
Easy to implement
Useful if the updated attribute is not significant, or the old
value should be updated for error correction
Cons
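The effect of overwriting can be sketched with Python's built-in sqlite3 module, using the store example from these notes (size 250 changing to 450, sales of 2000 and 3500 items). The two-column tables are a simplification of the full schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Store (StoreID TEXT PRIMARY KEY, Size INTEGER);
CREATE TABLE Sales (StoreID TEXT, ItemsSold INTEGER);
INSERT INTO Store VALUES ('001', 250);
INSERT INTO Sales VALUES ('001', 2000);   -- sold while the size was 250
-- Solution 1: overwrite the old value in the dimension table.
UPDATE Store SET Size = 450 WHERE StoreID = '001';
INSERT INTO Sales VALUES ('001', 3500);   -- sold after the change
""")

# Grouping by Size now attributes ALL sales to the new size.
by_size = con.execute("""
    SELECT s.Size, SUM(f.ItemsSold)
    FROM Sales f JOIN Store s ON f.StoreID = s.StoreID
    GROUP BY s.Size
""").fetchall()
```

The 2000 items sold when the store had size 250 are now reported under size 450. That is harmless for error corrections but misleading when the attribute matters for analysis.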
Solution 2
The key that links the dimension and the fact table identifies a version of a
row, not just a row
Surrogate keys make this easier to implement
What if we had used, e.g., the shop's zip code as key?
Always use surrogate keys!
Consequences
Pros
Cons
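Sales fact (StoreID, ItemsSold) and Store dim. (StoreID, Size) over time:
1. Sales fact: (001, 2000). Store dim.: (001, 250).
2. The store size changes; a new version row is added. Sales fact: (001, 2000). Store dim.: (001, 250), (002, 450).
3. A new fact arrives and refers to the new version. Sales fact: (001, 2000), (002, 3500). Store dim.: (001, 250), (002, 450).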
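Versioning with surrogate keys can be sketched in the same way, again with Python's built-in sqlite3 module and the store values from these notes. The StoreNo column is a hypothetical business key added so that both versions can be recognized as the same physical store. The slides do not name such a column.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- StoreID is a surrogate key that identifies one VERSION of a store.
-- StoreNo is a hypothetical business key shared by all versions.
CREATE TABLE Store (StoreID TEXT PRIMARY KEY, StoreNo TEXT, Size INTEGER);
CREATE TABLE Sales (StoreID TEXT, ItemsSold INTEGER);
INSERT INTO Store VALUES ('001', 'S1', 250);
INSERT INTO Sales VALUES ('001', 2000);
-- Solution 2: insert a new version row instead of overwriting.
INSERT INTO Store VALUES ('002', 'S1', 450);
INSERT INTO Sales VALUES ('002', 3500);   -- new facts use the new version
""")

by_size = con.execute("""
    SELECT s.Size, SUM(f.ItemsSold)
    FROM Sales f JOIN Store s ON f.StoreID = s.StoreID
    GROUP BY s.Size ORDER BY s.Size
""").fetchall()

by_store = con.execute("""
    SELECT s.StoreNo, SUM(f.ItemsSold)
    FROM Sales f JOIN Store s ON f.StoreID = s.StoreID
    GROUP BY s.StoreNo
""").fetchall()
```

Each fact keeps the size that was valid when it was recorded, and totals per physical store are still available through the business key. The price is a dimension table that grows with every change.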
Solution 3
Solution 3: Create two versions of each changing attribute
Consequences
Pros
Cons
Not possible to see when the old value changed to the new
Only possible to capture the two latest values
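versions of an attribute
Sales fact (StoreID, ItemsSold) and Store dim. (StoreID plus two versions of the changing attribute) over time:
1. Sales fact: (001, 2000). Store dim.: (001, 37, 37).
2. The attribute changes. Sales fact: (001, 2000). Store dim.: (001, 37, 73).
3. A new fact arrives. Sales fact: (001, 2000), (001, 2100). Store dim.: (001, 37, 73).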
Solution 2A
Solution 2A: Use special facts for capturing
changes in dimensions via the Time dimension
Pros
Cons
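Sales fact (StoreID, TimeID, ItemsSold) and Store dim. (StoreID, Size) over time:
1. Sales fact: (001, 234, 2000). Store dim.: (001, 250).
2. The store size changes; a special fact records the change. Sales fact: (001, 234, 2000), (002, 345). Store dim.: (001, 250), (002, 450).
3. A new fact arrives. Sales fact: (001, 234, 2000), (002, 345), (002, 456, 3500). Store dim.: (001, 250), (002, 450).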
Solution 2B
Cons
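attributes: From, To
Sales fact (StoreID, TimeID, ItemsSold) and Store dim. (StoreID, Size, From, To) over time:
1. Sales fact: (001, 234, 2000). Store dim.: (001, 250, 98, -).
2. The store size changes. Sales fact: (001, 234, 2000). Store dim.: (001, 250, 98, 99), (002, 450, 00, -).
3. A new fact arrives. Sales fact: (001, 234, 2000), (002, 456, 3500). Store dim.: (001, 250, 98, 99), (002, 450, 00, -).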
Old versions are still in the stores, so new facts can refer
to both the newest and older versions of products
The Time value for a fact is not necessarily between the From
and To values in the fact's Product dimension row
Example
Customer dimension (original): CustID, Name, PostalAddress, Gender, DateofBirth, Customerside, NoKids, MaritialStatus, CreditScore, BuyingStatus, Income, Education
Customer dimension (new), relatively static attributes: CustID, Name, PostalAddress, Gender, DateofBirth, Customerside
Demographics dimension, often-changing attributes: DemographyID, NoKids, MaritialStatus, CreditScoreGroup, BuyingStatusGroup, IncomeGroup, EducationGroup
Solution 4
Solution 4
Insert rows for all combinations of values from these new domains
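Filling such a dimension with every combination of values is a Cartesian product. The sketch below is a minimal illustration with three of the Demographics attributes. The group values in `domains` are hypothetical, since the notes do not list the actual domains.

```python
from itertools import product

# Hypothetical value domains for three grouped attributes.
domains = {
    "NoKids": ["0", "1-2", "3+"],
    "MaritialStatus": ["single", "married"],
    "IncomeGroup": ["low", "medium", "high"],
}

# One dimension row per combination, each with its own surrogate key.
demographics = [
    {"DemographyID": key, **dict(zip(domains, combo))}
    for key, combo in enumerate(product(*domains.values()), start=1)
]
```

The row count is the product of the domain sizes (here 3 * 2 * 3 = 18), which is why the attributes are grouped into a few ranges first: with ungrouped values the number of combinations would grow too large.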
Cons
Applications change
The modeled reality changes
DW Bus Architecture
What method for DW construction?
DW Bus Architecture
Data marts built independently by departments
The same definition across data marts (price excl. sales tax)
Observe units of measurement (also currency, etc.)
Use the same name only if it is exactly the same concept
Facts are not copied between data marts (facts > 95% of data)
DW Bus Architecture
Dimension content managed by dimension owner
No common management/control
Matrix Method
DW Bus Architecture Matrix
Two-dimensional matrix
X-axis: dimensions
Y-axis: data marts
Planning Process
Matrix Example
Time
Customer
Product
Sales
Costs
Profit
Supplier
MS SQL Server
MS Analysis Services
Analysis Services
Integration Services
Reporting Services
Easy to use
MS Analysis Services
Cheap, easy to use, good, and widely used
Supports ROLAP, MOLAP, and HOLAP technology
Intelligent pre-aggregation (for improving query
performance)
Programming: MS OLE DB for OLAP interface
Uses the query language MDX (MultiDimensional
eXpressions)
Summary
Handling Changes in Dimensions
Coordinating Data Cubes / Data Marts
Multidimensional Database Implementation:
MS SQL Server and Analysis Services
ETL Overview
The ETL Process
General ETL issues
Building dimensions
Building fact tables
Extract
Transformations/cleansing
Load
Extract
Transform
Load
Phases
Design phase
Loading phase
Refreshment phase
ETL/DW Refreshment
[Figure: data passes through a preparation phase and an integration phase into the DW and the DMs.]
ETL side
[Figure: operational systems (data sources) feed the Data Staging Area (Extract, Transform, Load). The Data Warehouse Bus, with conformed dimensions and facts, connects it to Query Services (Warehouse Browsing, Access and Security, Query Management, Standard Reporting, Activity Monitor), which serve reporting tools, desktop data access tools, and data mining.]
No user queries
Sequential operations on large data volumes
Plan
1)
2)
3)
Construction of dimensions
4)
5)
6)
High-level diagram
1) Make high-level diagram of source-destination flow
Source: Raw-Sales (RDBMS), Raw-Product (Spreadsheet)
Steps: Check R.I., Add product type, Extract time, Aggregate sales per product per day
Destination: Sales, Time, Product
Building Dimensions
Key mapping for the
Product dimension
pid
DW_pid
Time
11
100
22
100
35
200
11
700
Load of dimensions
Years 01-08
Incremental update
Extract
Cooperative sources
DB triggers are an example
Extract
Goal: fast extract of relevant data
Types of extracts:
Computing Deltas
Delta = changes since last load
Store sorted total extracts in DSA
Updated by DB trigger
Last extract time: 300
Timestamp  DKK
100        10
200        20
300        15
400        60
500        33
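Timestamp-based delta computation can be sketched with the Timestamp/DKK rows and the last extract time of 300 from the example above. The sketch assumes that rows stamped exactly at the last extract time were already loaded, so the comparison is strict. The function name is illustrative.

```python
# (timestamp, DKK) rows as they exist in the source at extract time.
source_rows = [(100, 10), (200, 20), (300, 15), (400, 60), (500, 33)]
last_extract_time = 300

def compute_delta(rows, since):
    """Return the rows changed after the previous extract.

    Assumption: a row stamped exactly `since` was part of the last load.
    """
    return [row for row in rows if row[0] > since]

delta = compute_delta(source_rows, last_extract_time)

# Remember the newest timestamp seen, for use by the next extract.
if delta:
    last_extract_time = max(ts for ts, _ in delta)
```

This only works if the source reliably maintains the timestamp on every insert and update, for example via a DB trigger. Deletions are not visible to this technique.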
DB triggers
Transform
Common Transformations
Data type conversions
EBCDIC → ASCII/Unicode
String manipulations
Date/time format conversions
Normalization/denormalization
Building keys
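A few of these transformations can be sketched with the Python standard library. The code page (cp500), the source date format (day-month-year), and the sample values are assumptions for illustration. Real sources must be checked for the code page and formats they actually use.

```python
from datetime import datetime

# EBCDIC -> Unicode: decode the bytes with the matching EBCDIC code page.
# Assumption: the source uses code page 500 (Python codec "cp500").
ebcdic_bytes = "BEER".encode("cp500")   # stand-in for bytes read from a mainframe file
product_name = ebcdic_bytes.decode("cp500")

# String manipulation: trim whitespace and normalize capitalization.
city = "  aalborg ".strip().title()

# Date format conversion: day-month-year text to an ISO 8601 date string.
iso_date = datetime.strptime("25-05-2009", "%d-%m-%Y").date().isoformat()

# Data type conversion: a price delivered as text with a decimal comma.
price = float("5,75".replace(",", "."))
```

Keeping each transformation as a small, separate step makes it easier to log which rule changed which value, which helps when tracing data quality problems back to the source.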
Data Quality
Data almost never has decent quality
Data in DW must be:
Precise
Complete
Unique
Consistent
The same thing is called the same and has the same key
Timely
Cleansing
Why cleansing? Garbage In Garbage Out
BI does not work on raw data
Spellings, codings,
Types of Cleansing
Special-purpose cleansing
Domain-independent cleansing
Rule-based cleansing
Cleansing
Data Status Dimension (SID, Status): Normal, Abnormal, Out of bounds
E.g., for the time dimension, instead of NULL, use special key
values to represent "Date not known", "Soon to happen"
Avoids problems in joins, since NULL is not equal to NULL
Example Sales values, each marked with a status via SID: 10, 20, 10000, -1
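The join problem with NULL keys can be demonstrated with Python's built-in sqlite3 module. The key value -1 for "Date not known", the table names, and the sample rows are assumptions for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE TimeDim (TimeID INTEGER, Label TEXT);
INSERT INTO TimeDim VALUES (1, '2009-05-25'),
                           (-1, 'Date not known'),   -- special key value
                           (NULL, 'row keyed by NULL');
CREATE TABLE Sales (TimeID INTEGER, Sales INTEGER);
INSERT INTO Sales VALUES (1, 10),
                         (NULL, 20),   -- unknown date stored as NULL
                         (-1, 40);     -- unknown date stored as special key
""")

# NULL = NULL is not true in SQL, so the NULL-keyed fact finds no partner.
joined = con.execute("""
    SELECT d.Label, f.Sales
    FROM Sales f JOIN TimeDim d ON f.TimeID = d.TimeID
    ORDER BY f.Sales
""").fetchall()

total_all = con.execute("SELECT SUM(Sales) FROM Sales").fetchone()[0]
total_joined = sum(sales for _, sales in joined)
```

The fact with a NULL key silently drops out of the join, so totals computed through the dimension (50) no longer match the fact table total (70). The fact that uses the special key stays visible and is reported under "Date not known".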
DW-controlled improvement
Default values
"Not yet assigned 157" note to data steward
Source-controlled improvements
Construct programs that check data quality
Load
Load
Goal: fast loading into DW
Parallelization
Load
Relationships in the data
Aggregates
Can be built and loaded at the same time as the detail data
Load tuning
ETL Tools
ETL tools from the big vendors
Data modeling
ETL code generation
Scheduling DW jobs
Issues
Pipes
Load frequency
Tools
Packages
A package is a collection of
A package
Arrows show precedence constraints
Constraint values:
success (green)
failure (red)
completion (blue)
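How the three constraint values steer execution can be sketched in plain Python. This is not the Integration Services API. The function run_package, the task names, and the rule that all incoming constraints of a task must hold are assumptions made for illustration.

```python
def run_package(tasks, constraints):
    """Run tasks in the given order, honoring precedence constraints.

    tasks:       dict of task name -> callable (assumed listed in a valid order)
    constraints: list of (predecessor, successor, value), where value is
                 "success", "failure", or "completion"
    """
    outcome = {}
    for name, action in tasks.items():
        incoming = [(p, v) for p, s, v in constraints if s == name]
        # Assumption: every incoming constraint must be satisfied.
        satisfied = all(
            p in outcome and (v == "completion" or outcome[p] == v)
            for p, v in incoming
        )
        if not satisfied:
            continue  # the task is skipped
        try:
            action()
            outcome[name] = "success"
        except Exception:
            outcome[name] = "failure"
    return outcome

def ok():
    pass

def fail():
    raise RuntimeError("load failed")

tasks = {"extract": ok, "load": fail, "notify": ok, "cleanup": ok, "report": ok}
constraints = [
    ("extract", "load", "success"),     # green arrow
    ("load", "notify", "failure"),      # red arrow
    ("load", "cleanup", "completion"),  # blue arrow
    ("load", "report", "success"),      # green arrow
]
outcome = run_package(tasks, constraints)
```

Here the failing load task triggers notify (failure) and cleanup (completion), while report is skipped because its success constraint is not met.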
Structure to packages
Services to tasks
Control flow
Sequence container
Tasks
A task is a unit of work
Workflow Tasks
Scripting Tasks
Sources
Transformations
Destinations
Transformations
Row Transformations
Other Transformations
Summary