DW Slides
for BS(CS)
2
Assignments
Implementing a data warehouse in four phases
In the first phase, students are required to develop a relational
model for the DBLP database and create this model physically in
a DBMS.
In the second phase, students write extraction routines and insert
the DBLP data into the database.
In the third phase, students create a dimensional model of
the above database/data model.
In the fourth phase, students create MOLAP or ROLAP
cubes of the dimensional model to view the data for analysis.
3
Course Introduction
Data Marts
4
Operational Sources (OLTP’s)
Operational computer systems provide the information needed to run
day-to-day operations and answer routine daily questions
Also called online transaction processing (OLTP) systems
Data is read or manipulated with each transaction
Transactions/queries are simple and easy to write
Usually serve middle management
Examples
Sales systems
HRM Applications
General-Ledger
Payroll
Etc.
5
Typical decision queries
Data sets are mounting everywhere, but they are not useful for
decision support
Strategic information = enterprise-wide essential data
People need timely strategic information to make appropriate
decisions and gain competitive advantage
Decision-making requires complex questions over integrated
data.
Decision makers want to know:
Where to build a new oil warehouse?
What are the total sales by month/quarter/year for each
office?
Is there any relation between promotion campaigns and sales
growth?
Can OLTP systems answer all such questions efficiently?
6
Characteristics of strategic information
Integrated
Must have a single, enterprise-wide view
Data Integrity
Information must be accurate and must conform to business
rules
Accessible
Easily accessible with intuitive access paths and responsive
for analysis
Credible
Every business factor must have one and only one value
Timely
Information must be available within the stipulated time frame
7
Information Crisis
8
History of Decision Support Systems
(DSS)
Ad-Hoc Reports: IT created a small program for each required
report.
Special Extract Programs: collections of programs written for
anticipated reports.
Small Applications: small programs that could take
parameters as input from the user.
IT Centers: special places users could visit to ask for pre-
generated reports. The IT staff present helped users obtain
information.
Decision Support Systems: much more sophisticated systems, menu-
driven, with online processing and printable reports, but still built
on extracted files.
Executive Information Systems: simple reports on the user's desktop,
but again preprogrammed reports.
9
Failure of old DSS (cont.)
Inability to provide strategic information
IT receives too many ad hoc requests, creating a large
overload
Requests are not only numerous, they also change
over time
More understanding demands more reports
Users are caught in a spiral of reports
Users have to depend on IT for information
They could not provide enough performance; they were slow
Strategic information has to be flexible (not
strictly predefined) and conducive to analysis (able to be guided)
10
Failure of old DSS (Cont.)
A typical spiral of user needs and IT efforts
11
OLTP vs. DSS
OLTP (Get the data in)
12
OLTP vs. DSS (cont.)
Trait: OLTP | DSS
User: Middle management | Executives, decision-makers
Function: Day-to-day operations | Analysis & decision support
DB (modeling): E-R based, after normalization | Star-oriented schemas
Data: Current, isolated | Archived, derived, summarized
Unit of work: Transactions | Complex queries
Access type: DML, read | Read
Access frequency: Very high | Medium to low
13
Expectations of the new solution
14
DW meets expectations
15
Definition of DW
Inmon defined it as:
“A DW is a subject-oriented, integrated, non-volatile, time-variant
collection of data in support of management's decision-making”.
Kelly said:
“Separately available, integrated, time-stamped, subject-oriented, non-
volatile, accessible”
Four properties of DW
16
Subject-oriented
17
Integrated Data
18
Time variant
19
Non-volatile
Data from operational systems is moved into the DW at specific
intervals
Data is persistent/not removed, i.e., non-volatile
Individual business transactions do not update the DW
Data in the DW is not deleted
Nor is it changed by individual transactions
20
Architecture of DW
21
Components
Major components
Source data component
Data staging component
Data storage component
Information delivery component
Metadata component
Management and control component
22
1. Source Data Component
Source data can be grouped into 4 components
Production data
Comes from operational systems of enterprise
Some segments are selected from it depending on requirements
Narrow scope, e.g. order details
Internal data
Private datasheets, documents, customer profiles, etc.
E.g., customer profiles for a specific offering
Special strategies are needed to transform such data (e.g., text documents) for the DW
Archived data
Old data is archived out of operational systems, perhaps into a
separate archival database that is still online; older data may be
stored in flat files, and the oldest data on tapes.
The DW keeps snapshots of historical data
External data
Executives also depend on external sources
E.g., market data about competitors, market indicators.
For example, a car rental business would like information
about the production schedules of car manufacturers to make
strategic decisions
But integration requires making external data conform to your own data
23
Architecture of DW
24
2. Data Staging Component
After data is extracted, it must be prepared
Data extracted from the sources needs to be
cleaned, converted, and made ready in a
suitable format
Three major functions to make data ready
Extract
Transform
Load
25
Architecture of DW
26
3. Data Storage Component
Separate repository
Data structured for efficient processing
Redundancy is increased
Updated after specific periods
Only read-only
27
Architecture of DW
28
4. Information Delivery Component
Depending on the type of users, different methods
and means of information delivery are used.
29
Meta-data component
Metadata is information about the data in the data
warehouse
Three kinds of metadata
Operational: contains information about the mappings
between source systems and the data warehouse, e.g., field
lengths, data types
30
Management and Control Component
This component coordinates all the activities and
components involved in the data warehouse
environment. It is usually part of the data warehouse DBMS.
31
Difference between Data Warehouse
and Data Marts
The terms are sometimes used synonymously; strictly
speaking, however, they differ in
scale
A data warehouse is enterprise-wide.
Data marts are local, at departmental level, or
targeted at a particular group of users.
The main debate is whether to build the
data warehouse first (top-down approach) or the data
marts first (bottom-up approach). Both approaches have
their advantages and disadvantages.
32
Top-down approach
Advantages:
Single enterprise wide integrated data
Disadvantages:
Takes too long to build
High chances of failure
Requires experienced professionals
Senior management is not likely to see the results
immediately
33
Bottom-up approach
Advantages
Faster and easier implementations
Faster return on investment and proof of concept
Less risk of failure
Inherently incremental (can prioritize which data
mart to build first)
Disadvantages:
Narrow view of data in each data mart
Redundancy spread across the data marts
Can cause inconsistent data
34
Practical approach proposed by Ralph
Kimball
Compromise between top-down and bottom-up
approaches
First define the requirements of the enterprise
35
DW Design
36
Designing DW
37
Background (ER Modeling)
For ER modeling, entities are collected from
the environment
Each entity acts as a table
Reasons for its success:
Relations are normalized after ER modeling, since normalization
removes redundancy (to handle update/delete anomalies)
But the number of tables increases
It is useful for fast access to small amounts of data
38
ER Drawbacks for DW / Need of Dimensional
Modeling
ER models are hard to remember, due to the increased number of tables
Complex for queries involving multiple tables (table joins)
Ideally hold no calculated attributes
The DW does not update data the way an OLTP
system does, so normalization is not needed
An efficient indexing scheme is needed to avoid scanning all the data
39
Dimensional Modeling
Dimensional modeling focuses on subject
orientation and the critical factors of the business
Critical factors are stored in facts
Redundancy is not a problem here; it helps achieve efficiency
A logical design technique for high performance
It is the modeling technique used for DW storage
40
Dimensional Modeling (cont.)
41
Dimensional Modeling (cont.)
Facts are stored in a fact table
Dimensions are represented by dimension
tables
Each fact table is surrounded by its dimension tables
The result looks like a star, hence the name Star Schema
42
Example

FACT: time_key (FK), store_key (FK), clerk_key (FK), product_key (FK), customer_key (FK), promotion_key (FK), dollars_sold, units_sold, dollars_cost
PRODUCT: product_key (PK), SKU, description, brand, category
TIME: time_key (PK), SQL_date, day_of_week, month
STORE: store_key (PK), store_ID, store_name, address, district, floor_type
CUSTOMER: customer_key (PK), customer_name, purchase_profile, credit_profile, address
CLERK: clerk_key (PK), clerk_id, clerk_name, clerk_grade
PROMOTION: promotion_key (PK), promotion_name, price_type, ad_type
43
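The star schema above can be sketched in code. The following is a minimal, illustrative version using Python's built-in sqlite3 module with just two of the dimensions; the table name time_dim (rather than TIME) and all data values are assumptions for the example, not from the slides.

```python
import sqlite3

# In-memory database holding a cut-down version of the slide's star schema.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("""CREATE TABLE product (
    product_key INTEGER PRIMARY KEY,
    sku TEXT, description TEXT, brand TEXT, category TEXT)""")
cur.execute("""CREATE TABLE time_dim (
    time_key INTEGER PRIMARY KEY,
    sql_date TEXT, day_of_week TEXT, month TEXT)""")
cur.execute("""CREATE TABLE fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    product_key INTEGER REFERENCES product(product_key),
    dollars_sold REAL, units_sold INTEGER, dollars_cost REAL)""")

# Illustrative rows (made-up data).
cur.execute("INSERT INTO product VALUES (1, 'SKU-1', 'Cola 1L', 'FizzCo', 'Beverages')")
cur.execute("INSERT INTO time_dim VALUES (1, '2024-01-15', 'Monday', 'January')")
cur.execute("INSERT INTO fact VALUES (1, 1, 9.99, 3, 6.00)")

# A typical star-join query: join the fact to its dimensions, then aggregate.
cur.execute("""SELECT p.brand, t.month, SUM(f.dollars_sold)
               FROM fact f
               JOIN product p ON f.product_key = p.product_key
               JOIN time_dim t ON f.time_key = t.time_key
               GROUP BY p.brand, t.month""")
rows = cur.fetchall()
```

Note how every analytical query follows the same shape: constrain or group by dimension attributes, aggregate the fact measures.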
Inside Dimensional Modeling
Inside the dimension table
A key attribute of the dimension table, for
identification
Large number of columns, i.e., a wide table
Non-calculated, textual attributes
Attributes are not directly related to one another (e.g., brand
and package size)
Un-normalized in the Star schema
Drill-down and roll-up are two ways
of exploiting dimensions
Can have multiple hierarchies (e.g., product
category for marketing and product category
for accounting)
Relatively small number of records
44
Inside Dimensional Modeling
The fact table has two types of attributes
Key attributes, for connections
Facts
Inside the fact table
Concatenated key
Grain, or level of data, is identified
Large number of records
Limited number of attributes
Sparse data set
Degenerate dimensions (e.g., order number,
used to compute average products per order)
Fact-less fact tables
45
Star Schema Keys
Surrogate keys in dimension tables
A replacement for the primary key
System generated
Foreign keys in fact tables
The collection of primary keys of the dimension tables
Primary key of the fact table
A collection of the PKs that came from the dimension tables
May include a degenerate dimension
May be a system-generated surrogate key
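The surrogate-key idea above can be sketched as follows; this is a minimal illustration, and the class name and natural-key values are assumptions for the example.

```python
from itertools import count

class SurrogateKeyGenerator:
    """Assigns a system-generated integer key to each distinct natural
    (source-system) key and remembers the mapping for later fact loads."""

    def __init__(self):
        self._next = count(1)   # system-generated sequence: 1, 2, 3, ...
        self._map = {}          # natural key -> surrogate key

    def key_for(self, natural_key):
        if natural_key not in self._map:
            self._map[natural_key] = next(self._next)
        return self._map[natural_key]

gen = SurrogateKeyGenerator()
k1 = gen.key_for("CUST-0042")   # first sighting: a new surrogate key
k2 = gen.key_for("CUST-0042")   # same natural key: the same surrogate key
k3 = gen.key_for("CUST-0077")   # different natural key: next key
```

Fact rows are then loaded with the surrogate key, never the raw source key, which insulates the warehouse from key reuse or format changes in the operational systems.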
46
Advantage of Star Schema
Easy for users to understand
Optimized for navigation (fewer joins,
so fast)
Most suitable for query processing (drill-
down, roll-up)
Special join and indexing techniques
enable further query optimization
47
Normalization [1]
48
1st Normal Form [2]
“A relation is in first normal form if and only if
every attribute is single-valued for each tuple”
STU_ID STU_NAME MAJOR CREDITS CATEGORY
49
1st Normal Form (Cont.)
STU_ID STU_NAME MAJOR CREDITS CATEGORY
50
Another Example (composite key:
SID, Course) [1]
51
1st Normal Form Anomalies [1]
Update anomaly: need to update all six rows
for the student with ID = 1 if we want to change his
location from Islamabad to Karachi
Delete anomaly: deleting the information
about a student who has graduated will
remove all of his information from the
database
Insert anomaly: to insert the information
about a student, that student must be
registered in a course
52
Solution 2nd Normal Form
54
Anomalies [1]
55
Solution 3rd Normal Form
Campus degree
56
Example in 3rd Normal Form [1]
57
Denormalization [1]
58
Five techniques to denormalize
relations [1]
Collapsing tables
Pre-joining
Splitting tables (horizontal, vertical)
Adding redundant columns
Derived attributes
59
Collapsing tables (one-to-one) [1]
61
Splitting tables [1]
62
Redundant columns [1]
63
Derived Attributes
64
Updates to Dimension Tables
65
Updates to Dimension Tables (Cont.)
66
Updates to Dimension Tables (Cont.)
67
Updates to Dimension Tables (Cont.)
Proposed solution:
68
Updates to Dimension Tables (Cont.)
69
Updates to Dimension Tables (Cont.)
Solution: Add a new column of attribute
70
Updates to Dimension Tables (Cont.)
71
Rapidly Changing Dimension
72
Rapidly Changing Dimension (Cont.)
“For example, an important attribute for customers
might be their account status (good, late, very late,
in arrears, suspended), and the history of their
account status” [4]
“If this attribute is kept in the customer dimension
table and a type 2 change is made each time a
customer's status changes, an entire row is added
only to track this one attribute” [4]
“The solution is to create a separate account_status
dimension with five members to represent the
account states” [4] and join this new table or
dimension to the fact table.
73
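The type 2 change referenced in the quote above can be sketched as follows; the column names, dates, and status values here are illustrative assumptions, not taken from the slides.

```python
from datetime import date

# A tiny customer dimension with one current row (illustrative layout).
customer_dim = [
    {"customer_key": 1, "customer_id": "C-100",
     "account_status": "good", "effective_from": date(2020, 1, 1),
     "current": True},
]

def apply_type2_change(dim, customer_id, new_status, change_date):
    """Type 2 change: close the customer's current row and append a whole
    new row with a fresh surrogate key, preserving history."""
    for row in dim:
        if row["customer_id"] == customer_id and row["current"]:
            row["current"] = False
    dim.append({
        "customer_key": max(r["customer_key"] for r in dim) + 1,
        "customer_id": customer_id,
        "account_status": new_status,
        "effective_from": change_date,
        "current": True,
    })

apply_type2_change(customer_dim, "C-100", "late", date(2024, 6, 1))
```

This is exactly the growth problem the quote describes: every status change adds an entire row, which is why a rapidly changing attribute is better split into its own small dimension.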
Example
74
Junk Dimensions
Sometimes there are informative flags and
texts in the source system, e.g., yes/no flags,
textual codes, etc.
If such flags are important, put them in their own
dimension to save storage space
However, use caution: if you have a large number of
combinations, make separate dimensions instead.
For example, 5 attributes with 100 distinct values
each give 100^5 = 10 billion possible rows. In
that case, make a separate dimension for each
attribute
75
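Building a junk dimension can be sketched as follows; the flag names and values are illustrative assumptions.

```python
from itertools import product

# A few low-cardinality flags from a hypothetical source system.
flags = {
    "payment_type": ["cash", "card"],
    "gift_wrapped": ["yes", "no"],
    "coupon_used":  ["yes", "no"],
}

# The junk dimension is simply every combination of the flag values,
# each combination getting one surrogate key.
junk_dim = [
    dict(zip(flags, combo), junk_key=i + 1)
    for i, combo in enumerate(product(*flags.values()))
]
# 2 * 2 * 2 = 8 rows: tiny, so one combined dimension is fine here.
# With 5 attributes of 100 values each the cross product would explode
# to 100**5 rows, which is when separate dimensions are preferable.
```

The fact table then carries a single junk_key foreign key instead of several flag columns.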
Junk Dimension Example [3]
76
Junk Dimension Example (Cont.) [3]
77
The Snowflake Schema
Snowflaking involves normalizing the
dimensions of a Star schema
Reasons:
To save storage space
Easier maintenance of tables with low
cardinality
However, some designers discourage it.
According to them, it is better to sacrifice
storage space for performance and ease of
use.
78
Example 1 of Snowflake Schema
79
Example 2 of Snowflake Schema
80
Aggregate Fact Tables
81
Example
83
A way of making aggregates
Example:
84
Making Aggregates
85
Example of Aggregate Fact Table
86
Families of Stars
87
Families of Stars (Cont.)
Transaction tables (day-to-day data) and snapshot tables (data
captured at specific intervals)
88
Families of Stars (Cont.)
Core and custom tables
89
Families of Stars (Cont.)
Conformed dimension: the attributes of a dimension
must have the same meaning for all the fact tables
to which the dimension is connected.
90
Extract, Transform, Load (ETL)
92
Data Extraction
Data can be extracted using third-party tools
or in-house programs or scripts
Data extraction issues:
1. Identifying the sources
2. Method of extraction for each source (manual,
automated)
3. When and how frequently data will be extracted
from each source
4. Time window
5. Sequencing of extraction processes
93
How data is stored in operational
systems
Current value: values continue to change as
daily transactions are performed. We need to
monitor these changes to maintain history for
the decision-making process, e.g., bank balance,
customer address, etc.
Periodic status: sometimes the history of
changes is maintained in the source system
94
Example
95
Data Extraction Method
96
Incremental data extraction
Immediate data extraction: data is extracted
in real time, as the transaction
occurs.
Possible options:
1. Capture through transaction logs
2. Capture through triggers/stored procedures
3. Capture via the source application
Deferred data extraction: data is extracted
at a later time.
1. Capture on the basis of date and time stamps
2. Capture by comparing files
97
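Deferred extraction via time stamps can be sketched as follows; the row layout and the updated_at column are assumptions for the example.

```python
from datetime import datetime

# Hypothetical source rows, each carrying a last-modified time stamp.
source_rows = [
    {"id": 1, "balance": 500, "updated_at": datetime(2024, 6, 1, 9, 0)},
    {"id": 2, "balance": 750, "updated_at": datetime(2024, 6, 2, 14, 30)},
    {"id": 3, "balance": 120, "updated_at": datetime(2024, 6, 3, 8, 15)},
]

def extract_since(rows, last_extract_time):
    """Deferred incremental extraction: pick up only the rows modified
    after the previous extraction run."""
    return [r for r in rows if r["updated_at"] > last_extract_time]

# Suppose the previous run finished at midnight on 2 June:
changed = extract_since(source_rows, datetime(2024, 6, 2, 0, 0))
```

The ETL job then records the current time as the new watermark for the next run; the approach only works if the source reliably maintains the time stamp.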
Data Transformation
98
Data Transformation (Cont.)
99
Data Loading
Determine when (time) and how (as a whole or in
chunks) to load data
Four modes of loading data
1. Load: removes old data if present, otherwise loads
the data
2. Append: the old data is not removed; the new data
is appended to it
3. Destructive merge: if the primary key of a new record
matches the primary key of an old record,
the old record is updated
4. Constructive merge: if the primary key of a new
record matches the primary key of an old
record, the old record is not updated; the
new record is added and marked as a superseding record
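The four load modes above can be sketched as follows, with the target table modeled as a list of row dicts; the pk/value layout and the superseding flag are illustrative assumptions.

```python
def load(table, incoming):
    """Load: remove old data if present, then load the incoming rows."""
    table.clear()
    table.extend(dict(r, superseding=False) for r in incoming)

def append(table, incoming):
    """Append: the old data stays; incoming rows are added after it."""
    table.extend(dict(r, superseding=False) for r in incoming)

def destructive_merge(table, incoming):
    """Destructive merge: an incoming row with a matching primary key
    overwrites the old row; otherwise it is added."""
    index = {row["pk"]: row for row in table}
    for r in incoming:
        if r["pk"] in index:
            index[r["pk"]].update(r)
        else:
            table.append(dict(r, superseding=False))

def constructive_merge(table, incoming):
    """Constructive merge: the old row is left untouched; the incoming
    row is added as well and marked as superseding."""
    existing = {row["pk"] for row in table}
    for r in incoming:
        table.append(dict(r, superseding=r["pk"] in existing))

table = [{"pk": 1, "value": "old", "superseding": False}]
constructive_merge(table, [{"pk": 1, "value": "new"}])
# Both versions of row 1 now coexist; the new one is flagged as superseding.
```

Constructive merge is the mode that preserves history, which is why it pairs naturally with the time-variant nature of a DW.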
Data Loading (cont.)
101
Data Loading (Cont.)
Data refresh vs. data update
A full refresh reloads all the data after deleting the old data;
data updates change only the changing attributes
Data Loading (Cont.)
Loading dimension tables: you need to define
a mapping between the source system key and the system-
generated key in the data warehouse; otherwise you will
not be able to load/update data correctly
Data Loading (Cont.)
Updates to dimension table
Data Quality Management
http://www.humaninference.com/master-data-management/data-
quality/data-cleansing
Google Refine,
http://code.google.com/p/google-refine/
Wrangler
http://vis.stanford.edu/wrangler/
Text Pipe Pro
http://www.datamystic.com/textpipe.html#.UKjm9eQ3vyo
Information Supply
Naive users
Regular users: daily users who cannot write
queries and reports themselves; they need
query templates and predefined reports
Power users: technically sound users who
can write queries, reports, and scripts and import
and export data themselves
User classes from other perspective
129
Information delivery framework
Online Analytical Processing (OLAP)
Further reading:
What is OLAP?
by Nigel Pendse, Principal of OLAP Solutions
and Co-author of the OLAPreport.com
Source:
http://dssresources.com/papers/features/pends
e04072002.htm
136
Major features of OLAP
Dimensional Analysis
1. What are cubes?
2. What are hyper-cubes?
Drill-down and Roll-up
Drill through, Drill across
Slice and Dice
Pivoting
Dimensional Analysis
OLAP supports multidimensional analysis
Cube: has three dimensions (X-, Y-, and Z-axes)
Same view on spreadsheet
Multidimensional Analysis (Spreadsheet
view) / Hypercube
An example of a hypercube with 4
dimensions
141
Drill-down and Roll-up [5]
Drill through, Drill across
Slicing [5]
Dicing [5]
Pivoting [5]
A SQL server view for cube [6]
OLAP Implementations [1]
155
Relational OLAP (ROLAP) [1]
Different semantics:
1. For example, what constitutes a year: 1st July to
30th June, or 1st January to 31st December?
Another example is the week: does it end on
Monday or on Saturday? (In various departments, e.g., at the
University of Lahore, the CS department's holiday is
on Monday while others take a holiday on Saturday.)
2. Different semantics will force you to create many
pre-aggregates.
3. You have to figure out the best solution
yourself (either calculating pre-aggregates or
convincing the various departments to follow the same
convention).
ROLAP problems (Cont.) [1]
193
How relations/tables are organized on
disk [11]
File Organizations
Relations are stored in terms of logical structures called files; the
way files are organized on disk is called the file organization
Usually files are stored as a sequence of records
A record is a tuple in a relational database
Files are read in terms of physical blocks on disk
There can be many records per block, or many blocks per
record, or a one-to-one relationship, which is very rare
File organization simply refers to the way records are stored in terms
of blocks and the way blocks are placed on the storage medium and
interlinked
194
How relations/tables are organized on
disk [11]
Types of File Organizations
1. Unsorted (piled file organization)
2. Sorted
3. Hashing
195
How relations/tables are organized on
disk [11]
Records: a record is basically a tuple in a
table/relation
A file is a sequence of records
Records can be of fixed length or variable
length
Records are composed of a sequence of fields
(columns or attributes)
196
How relations/tables are organized on
disk [11]
Blocks: the physical units of storage
on storage devices (for example, sectors on hard disks)
Usually (but not necessarily; it depends on the file
system structure) a block stores records of a single file
Blocks are of fixed length, based on the physical
characteristics of the storage/computing device and the
operating system
A storage device is defragmented or fragmented
depending on whether contiguous sets of records lie in
contiguous blocks
197
How relations/tables are organized on
disk [11]
Unspanned records
198
How relations/tables are organized on
disk [11]
Spanned records
(diagram: record 4 spans two blocks; its remaining part is stored in block p)
200
How relations/tables are organized on
disk [11]
Unordered or pile file organization
The simplest file organization
Records are inserted in the order of their
arrival
Usually requires secondary files for efficient
search
Insertion: very easy
Search: very expensive, because we need to
search linearly
201
Unordered or Pile File organization
[11]
Record deletion causes fragmentation and is
inefficient in terms of space
202
Unordered or Pile File organization
[11]
Record update is also problematic in the case of
variable-length records (it needs an auxiliary file to
point to the new location)
203
Sorted Files [11]
A file organization where records are physically
sorted on the value of some field, called
the ordering field
The ordering field should be a key field (unique for
each record) and belong to an ordinal domain
(where values can be compared)
Insertion and deletion: both expensive
Updating may involve physical migration of the
record, if the ordering field is modified
Searching: efficient, based on binary search
204
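The search benefit of sorted files can be sketched with the standard library's bisect module; the records and key values are illustrative.

```python
import bisect

# Records physically sorted on the ordering field (the integer key).
records = [(3, "Ali"), (7, "Bushra"), (12, "Omar"), (19, "Sana"), (25, "Zara")]
keys = [k for k, _ in records]          # the ordering field, in order

def find(key):
    """Binary search on the ordering field: O(log n) comparisons
    instead of a linear scan; returns None if the key is absent."""
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return records[i]
    return None

hit = find(12)     # found in at most ~log2(5) probes
miss = find(13)    # absent: binary search still terminates quickly
```

The same idea applied at the block level (probing block boundaries rather than individual records) is what makes sorted files efficient on disk.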
Sorted Files [11]
205
Benefits of sorted files [11]
206
Hashing Techniques [11]
207
Hashing Techniques [11]
208
Overflow in Hashing Techniques [11]
209
Handling overflow through chaining
[11]
210
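Hashing with overflow handled by chaining can be sketched as follows; the bucket count, mod-hash, and record values are illustrative assumptions.

```python
# Hashed file organization: each bucket holds a chain (list) of records
# whose key hashes to that bucket; overflow just grows the chain.
NUM_BUCKETS = 4
buckets = [[] for _ in range(NUM_BUCKETS)]

def bucket_for(key):
    return key % NUM_BUCKETS            # simple mod hash on an integer key

def insert(key, record):
    buckets[bucket_for(key)].append((key, record))

def search(key):
    """Hash to one bucket, then scan only that bucket's chain."""
    for k, record in buckets[bucket_for(key)]:
        if k == key:
            return record
    return None

insert(3, "rec-A")
insert(7, "rec-B")   # 7 % 4 == 3: collides with key 3, chained in bucket 3
insert(8, "rec-C")   # 8 % 4 == 0: goes to bucket 0
```

Search cost stays proportional to the chain length, so a good hash function and enough buckets keep lookups near constant time.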
Indexing [11]
213
Primary Index [11]
214
Primary Index [11]
215
Primary Index [11]
216
Sparse Index [1]
The index keeps one entry per block of data items
It takes less space, but at the expense of
some efficiency in terms of time
Dense Indexing [1]
For each record, store the key of that record and a
pointer to where the record is actually placed on disk
If the index fits in memory, a lookup requires one I/O operation; if not,
performance will degrade
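The contrast between the two index kinds can be sketched as follows; the block layout and key values are illustrative assumptions.

```python
import bisect

# Records stored in fixed-size blocks, sorted on the key.
blocks = [
    [(3, "A"), (7, "B")],       # block 0
    [(12, "C"), (19, "D")],     # block 1
    [(25, "E"), (31, "F")],     # block 2
]

# Dense index: one entry per record -> larger, but a direct hit.
dense_index = {key: (b, i)
               for b, block in enumerate(blocks)
               for i, (key, _) in enumerate(block)}

# Sparse index: one entry per block (the block's first key) -> smaller.
sparse_index = [block[0][0] for block in blocks]

def sparse_lookup(key):
    """Find the block whose key range covers the key, then scan it."""
    b = max(bisect.bisect_right(sparse_index, key) - 1, 0)
    for k, value in blocks[b]:
        if k == key:
            return value
    return None
```

Here the sparse index has 3 entries against the dense index's 6: less space, at the cost of an extra in-block scan per lookup.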
Primary index [11]
220
Clustering Index [11]
221
Clustering Index [11]
222
Clustering Index [11]
223
Secondary Index [1]
224
Secondary Index on key field (dense
index) [11]
225
Secondary Index on non-key field [11]
226
Secondary Index on non-key field [11]
227
Secondary Index on non-key field [11]
228
Extra Redirection Levels [11]
229
Extra Redirection Levels [11]
230
Multilevel Indexing [1]
It uses a little more space, but it increases efficiency in
terms of time
It is good for queries posing exact-match conditions or range
conditions
Example of B-Tree [1]
B-Tree limitations [1]
234
235
Join Indexes
236
Recommended Indexes for Data
Warehouse
For primary keys, foreign keys, and frequently
accessed metrics in fact tables you can use a
B-Tree index (distinct values)
For primary keys in dimensions use a B-Tree.
For other attributes that are repetitive in
dimensions use a Bitmap index
237
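The bitmap-index recommendation for repetitive dimension attributes can be sketched as follows; the attribute values and rows are illustrative assumptions.

```python
# A low-cardinality attribute (e.g., sales region) across six rows.
rows = ["east", "west", "east", "north", "west", "east"]

# Bitmap index: one bit vector per distinct value, one bit per row.
# Python integers serve as arbitrary-length bit vectors here.
bitmaps = {}
for i, value in enumerate(rows):
    bitmaps[value] = bitmaps.get(value, 0) | (1 << i)   # set bit i

def matching_rows(value):
    """Row positions whose attribute equals the given value."""
    bits = bitmaps.get(value, 0)
    return [i for i in range(len(rows)) if bits >> i & 1]

# Boolean predicates become cheap bitwise operations, e.g. east OR west:
east_or_west = bitmaps["east"] | bitmaps["west"]
```

This is why bitmap indexes suit dimensions: with few distinct values the vectors stay small, and AND/OR predicates combine in single machine operations.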
Join Techniques [1]
238
Join Techniques (cont.) [1]
239
Join Techniques (cont.) [1]
240
Join Techniques (cont.) [1]
Sort-merge join
- Sort both relations on the join attributes, then
find matching rows
- Both relations are scanned once
- However, there is the additional cost of sorting
241
Join Techniques (cont.) [1]
Sort merge join
242
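The sort-merge join described above can be sketched as follows; the tuple layout (join key in position 0) is an assumption for the example.

```python
def sort_merge_join(left, right, key=lambda row: row[0]):
    """Sort both relations on the join attribute, then scan each once,
    advancing whichever side currently has the smaller key."""
    left = sorted(left, key=key)        # the additional sorting cost
    right = sorted(right, key=key)
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if key(left[i]) < key(right[j]):
            i += 1
        elif key(left[i]) > key(right[j]):
            j += 1
        else:                           # equal keys: emit all matching pairs
            k = key(left[i])
            j0 = j
            while i < len(left) and key(left[i]) == k:
                j = j0
                while j < len(right) and key(right[j]) == k:
                    out.append((left[i], right[j]))
                    j += 1
                i += 1
    return out

pairs = sort_merge_join([(1, "a"), (2, "b"), (2, "c")],
                        [(2, "x"), (3, "y")])
```

After sorting, each relation is read exactly once; the inner backtracking only revisits rows that share a duplicate join key.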
Hash join [1]
- Suitable for large databases
- The choice of which table gets hashed first
plays a pivotal role in the overall performance of
the join and is left to the optimizer
- The optimizer uses the smaller of
the two tables (say Table_A) or data sources to
build an in-memory hash table on the
join key used in the WHERE clause
- It then scans the larger table (say Table_B) and
probes the hash table to find the joining rows
Hash Join [12]
244
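The build-then-probe strategy above can be sketched as follows; the tuple layout (join key in position 0) is an assumption for the example.

```python
def hash_join(table_a, table_b, key=lambda row: row[0]):
    """Build an in-memory hash table on the smaller relation's join key,
    then probe it while scanning the larger relation once."""
    build, probe = ((table_a, table_b) if len(table_a) <= len(table_b)
                    else (table_b, table_a))
    hashed = {}
    for row in build:                       # build phase (smaller table)
        hashed.setdefault(key(row), []).append(row)
    out = []
    for row in probe:                       # probe phase (larger table)
        for match in hashed.get(key(row), []):
            out.append((match, row))
    return out

# Smaller dimension-like table joined to a larger fact-like table.
joined = hash_join([(1, "Ali"), (2, "Sana")],
                   [(2, 100), (2, 250), (3, 80)])
```

Picking the smaller input as the build side keeps the hash table in memory, which is exactly the optimizer decision the slide describes.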
Questions?
245
References
[1] Abdullah, A.: “Data warehousing handouts”, Virtual
University of Pakistan
[2] Ricardo, C. M.: “Database Systems: Principles,
Design, and Implementation”, Macmillan Coll Div.
[3] Junk Dimension,
http://www.1keydata.com/datawarehousing/junk-
dimension.html
[4] Advanced Topics of Dimensional Modeling
https://mis.uhcl.edu/rob/Course/DW/Lectures/Advanced
%20Dimensional%20Modeling.pdf
[5] http://en.wikipedia.org/wiki/OLAP_cube
246
References
[6]
http://www.mssqltips.com/sqlservertutorial/2011/processi
ng-dimensions-and-cube/
[7] http://dev.mysql.com/doc/refman/5.1/en/partitioning-
hash.html
[8] Karen Corral, et al. (2006) The impact of alternative
diagrams on the accuracy of recall: A comparison of
star-schema diagrams and entity-relationship diagrams,
Decision Support Systems, 42(1), 450-468.
[9]
http://www.ibm.com/developerworks/data/library/techarti
cle/dm-0909infospherevirtualcubes/
[10] “Readings in Database Systems” by Joseph M.
Hellerstein, Michael Stonebraker, 2005
References
[11] Lecture Series on Database Management
System by Dr. S. Srinath, IIIT Bangalore.
Available at: http://nptel.iitm.ac.in
248