Lecture 3 Data Warehouse Modelling
Conducted by
Ms. Akila Brahmana
Department of ICT
Faculty of Technology
University of Ruhuna
Objectives
❑ Identify the different models used in the business environment.
❑ Identify the importance of users in early definition of the data
requirements.
❑ Describe three different data models for the warehouse,
identifying their properties and use.
❑ Discuss modeling considerations in the context of specific
dimensions such as time.
❑ Demonstrate how an operational model may be used for the
warehouse model.
Physical Structure of Data Warehouse
There are three basic architectures for constructing a data
warehouse:
❑ Centralized
❑ Federated
❑ Tiered
The data warehouse is distributed for load balancing, scalability, and
higher availability.
Physical Structure of Data Warehouse
❑ Centralized architecture
[Figure: the sources feed a single central data warehouse, which serves all clients.]
Physical Structure of Data Warehouse
❑ Federated architecture
[Figure: the sources feed a logical data warehouse, which feeds local data marts (Marketing, Financial, Distribution) used by end users.]
Physical Structure of Data Warehouse
❑ Tiered architecture
[Figure: the sources feed a physical data warehouse, which feeds local data marts, which in turn feed workstations holding highly summarized data.]
Physical Structure of Data Warehouse
❑ Federated architecture
  ❑ The logical data warehouse is only virtual
❑ Tiered architecture
  ❑ The central data warehouse is physical
  ❑ There exist local data marts on different tiers which store copies
  or summaries of the previous tier.
Logical And Physical Design
❑ A logical design is conceptual and abstract
❑ In the logical design, you look at the logical relationships among
the objects
❑ In the physical design, you look at the most effective way of storing
and retrieving the objects as well as handling them from a
transportation and backup/recovery perspective
Logical design (LD) compared with physical design (PD)
The Process of Modeling the Warehouse
❑ Determine user requirements
❑ Assist users in understanding new technology
❑ Consider imperatives
  ❑ User analytical requirements
  ❑ Performance objectives
❑ Keep acquainted with changing requirements
The Process of Modeling the Warehouse
❑ Create a business model
❑ Create a dimensional model
❑ Create a physical model
[Figure: Conceptual Model, Logical Model, Physical Model]
Basic Concepts
❑ Dimensional modeling has several basic concepts:
❑ Facts
❑ Dimensions
❑ Measures (variables)
Dimension               Measures
ProductWiseSales        3000 units
AgeWisePopulation       2 million people
DesignationWiseSalary   Rs. 200000
Fact Tables
❑ Fact tables are the large tables in your warehouse schema that
store business measurements.
❑ Fact tables typically contain facts and foreign keys to the dimension
tables. Fact tables represent data, usually numeric and additive, that
can be analyzed and examined. Examples include sales, cost, and
profit.
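A minimal sketch of this structure, using Python's built-in sqlite3 module. The table names, column names, and sample rows (product_dim, sales_fact, and so on) are hypothetical and chosen only to illustrate a fact table that holds additive measures plus a foreign key to a dimension table.

```python
import sqlite3

# Hypothetical schema: one dimension table and one fact table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product_dim (
        product_key INTEGER PRIMARY KEY,
        name        TEXT
    );
    CREATE TABLE sales_fact (
        product_key INTEGER REFERENCES product_dim(product_key),
        sales       REAL,   -- additive measure
        cost        REAL    -- additive measure
    );
""")
conn.executemany("INSERT INTO product_dim VALUES (?, ?)",
                 [(1, "bolt"), (2, "nut")])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)",
                 [(1, 100.0, 60.0), (1, 50.0, 30.0), (2, 20.0, 5.0)])

# Additive facts can be summed per dimension value; profit is derived here.
rows = conn.execute("""
    SELECT p.name, SUM(f.sales), SUM(f.cost), SUM(f.sales - f.cost)
    FROM sales_fact AS f
    JOIN product_dim AS p ON p.product_key = f.product_key
    GROUP BY p.name
    ORDER BY p.name
""").fetchall()
print(rows)
```

The fact table stays narrow (keys and numbers), while descriptive text lives in the dimension table and is reached through the join.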
Fact Data (Business Subjects)
❑ Bulk of the warehouse data
❑ Accessed by dimensions
❑ Partitioned
❑ Series of snapshots
❑ Numerical data
❑ Element of time
❑ Dates for calculations
❑ Composite keys
Star schema
product
prodId   name   price
p1       bolt   10
p2       nut    5

store
storeId  city
c1       nyc
c2       sfo
c3       la
❑ Dimension: for example, Product
❑ Attribute
  ❑ Color, size, weight on the product dimension
  ❑ Days, weeks, holidays on the time dimension
Warehouse Model Components
❑ Hierarchy of attributes in a dimension
❑ Relationship
❑ Fact
[Figure: three dimension hierarchies linked to the Sales Data fact]
Product Dimension: Product Family, Product Type, Product
Location Dimension: Region, Store
Time Dimension: Account Year, Account Week
Sales Data (fact) keys: Item_key, Store_key, Acct_Week_key
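Example of Star Schema
[Figure: a central Sales Fact Table joined to four dimension tables]
Date dimension: Date, Month, Year
Product dimension: ProductNo, ProdName, ProdDesc, Category, QOH
Store dimension: StoreID, City, State, Country, Region
Customer dimension: CustId, CustName, CustCity, CustCountry
Sales Fact Table: Date, Product, Store, Customer (keys); unit_sales, dollar_sales, schilling_sales (measurements)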
Dimension Hierarchies
For each dimension, the set of associated attributes can be
structured as a hierarchy
[Figure: store rolls up to sType; store also rolls up to city, which rolls up to region]
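As a rough illustration of how a hierarchy supports aggregation, the sketch below rolls store-level figures up to city and then to region. The store and city codes are taken from the star schema example; the region names and the sales figures are hypothetical.

```python
from collections import defaultdict

# Child -> parent mappings for the store -> city -> region hierarchy.
store_to_city = {"c1": "nyc", "c2": "sfo", "c3": "la"}
city_to_region = {"nyc": "east", "sfo": "west", "la": "west"}  # regions are assumed

# Hypothetical unit sales recorded at the store level (the finest level).
sales_by_store = {"c1": 12, "c2": 7, "c3": 5}

def roll_up(values, mapping):
    """Aggregate values one level up the hierarchy using a child->parent mapping."""
    totals = defaultdict(int)
    for child, amount in values.items():
        totals[mapping[child]] += amount
    return dict(totals)

sales_by_city = roll_up(sales_by_store, store_to_city)
sales_by_region = roll_up(sales_by_city, city_to_region)
print(sales_by_city)
print(sales_by_region)
```

Drilling down is the reverse walk: start from a region total and look at the cities, then the stores, that contribute to it.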
[Figure: a fact table with measures Cost_dollars, Sales_dollars, Qty_on_hand, Sales_units, linked to a Time Table (Week_id, Period_id, Year_id) and an Item Table (Item_id, Dept_id)]
Modeling Dimensions
❑ Model according to data content
❑ Model with aggregation needs in mind
❑ Model to satisfy drilling requirements
❑ May be fully denormalized - star
❑ May be normalized - snowflake
❑ Categorical dimensions may be included
Modeling the Time Dimension
❑ The time dimension
❑ Is mandatory
❑ Is unique
❑ Is powerful
❑ Needs careful design
❑ Include business time periods
❑ Include special dates
Modeling the Time Dimension
[Figure: three fact tables, each joined on Day_id to a time dimension containing Day_id, Month_id, Quarter_id, Half_year_id, Year_id; one variant also carries Fiscal_month_id, Fiscal_quarter_id, Fiscal_half_year_id, Fiscal_year_id]
Modeling the Time Dimension
❑ Analyze and plan
❑ Consider relevant date range
❑ Use uniform time rollup characteristics
❑ Model for flexibility
Modeling the Time Dimension
❑ Include special dates
❑ Identify date requirements
  ❑ Simple
  ❑ Fiscal
  ❑ Rolling
[Figure: a Fact table linked to Calendar, Fiscal, and Rolling time structures]
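One way to populate such a time dimension is to generate a row per day with both calendar and fiscal attributes. The sketch below is illustrative only: the column names follow the Day_id/Month_id/Fiscal_* naming used in these slides, while the April fiscal-year start and the is_weekend flag are assumptions, not requirements from the lecture.

```python
from datetime import date, timedelta

FISCAL_START_MONTH = 4  # assumption: the fiscal year starts in April

def time_dimension_row(d):
    """Build one time-dimension row with calendar and fiscal attributes."""
    # Months elapsed since the start of the fiscal year (0..11).
    fiscal_offset = (d.month - FISCAL_START_MONTH) % 12
    # Label the fiscal year by the calendar year in which it starts.
    fiscal_year = d.year if d.month >= FISCAL_START_MONTH else d.year - 1
    return {
        "day_id": d.isoformat(),
        "month_id": d.month,
        "quarter_id": (d.month - 1) // 3 + 1,
        "half_year_id": (d.month - 1) // 6 + 1,
        "year_id": d.year,
        "fiscal_month_id": fiscal_offset + 1,
        "fiscal_quarter_id": fiscal_offset // 3 + 1,
        "fiscal_half_year_id": fiscal_offset // 6 + 1,
        "fiscal_year_id": fiscal_year,
        "is_weekend": d.weekday() >= 5,  # example of a special-date flag
    }

def build_time_dimension(start, end):
    """Generate one row per day for the inclusive date range."""
    days = (end - start).days + 1
    return [time_dimension_row(start + timedelta(days=i)) for i in range(days)]

rows = build_time_dimension(date(2024, 1, 1), date(2024, 12, 31))
print(len(rows), rows[0])
```

Holidays and other business-specific dates would be added as further flag columns in the same way.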
Modeling Summary Tables
Define before design

SALES FACTS
Sales$   Region  Month
10,000   North   Jan 97
12,000   South   Feb 97
11,000   North   Jan 97
15,000   West    Mar 97
18,000   South   Feb 97
20,000   North   Jan 97
10,000   East    Jan 97
2,000    West    Mar 97

SALES BY MONTH/REGION
Month    Region  Tot_Sales$
Jan 97   North   41,000
Jan 97   East    10,000
Feb 97   South   30,000
Mar 97   West    17,000

SALES BY MONTH
Month    Tot_Sales
Jan 97   51,000
Feb 97   30,000
Mar 97   17,000
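A summary table is simply a pre-computed aggregation of the detail facts. As a small sketch, the code below derives both summary levels from the eight SALES FACTS rows; the variable names are hypothetical.

```python
from collections import defaultdict

# The eight detail rows from the SALES FACTS table: (sales, region, month).
sales_facts = [
    (10_000, "North", "Jan 97"),
    (12_000, "South", "Feb 97"),
    (11_000, "North", "Jan 97"),
    (15_000, "West", "Mar 97"),
    (18_000, "South", "Feb 97"),
    (20_000, "North", "Jan 97"),
    (10_000, "East", "Jan 97"),
    (2_000, "West", "Mar 97"),
]

by_month_region = defaultdict(int)  # summary at month/region level
by_month = defaultdict(int)         # coarser summary at month level
for amount, region, month in sales_facts:
    by_month_region[(month, region)] += amount
    by_month[month] += amount

for (month, region), total in by_month_region.items():
    print(month, region, total)
for month, total in by_month.items():
    print(month, total)
```

Defining the summary levels before design means deciding up front which of these groupings are worth storing rather than recomputing at query time.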
When to Summarize Data
❑ Trade-off between direct access and calculation at the time of
execution
❑ Compression ratio calculations can help you decide
30 22 22/30 0.73
20 22 22/20 1.1
Maintaining History
❑ Maintain history of data changes
❑ Enable in the model
Customer History Table
Cust_id  Cust_name  Version
1        ABC Co     1
1        ABC Ltd    2
2        XYZ Inc    1

Customer Table
Cust_id
1
2
3
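The idea behind the Customer History Table can be sketched as follows: a change appends a new version row instead of overwriting the old one. The helper function names and the "XYZ Corp" rename are hypothetical.

```python
# Rows mirror the Customer History Table: (cust_id, cust_name, version).
customer_history = [
    (1, "ABC Co", 1),
    (1, "ABC Ltd", 2),
    (2, "XYZ Inc", 1),
]

def record_change(history, cust_id, new_name):
    """Append a new version instead of overwriting the old name."""
    versions = [v for cid, _, v in history if cid == cust_id]
    next_version = max(versions, default=0) + 1
    history.append((cust_id, new_name, next_version))
    return next_version

def current_name(history, cust_id):
    """Return the name carried by the highest version for this customer."""
    rows = [(v, name) for cid, name, v in history if cid == cust_id]
    return max(rows)[1] if rows else None

print(current_name(customer_history, 1))               # latest name for customer 1
print(record_change(customer_history, 2, "XYZ Corp"))  # hypothetical rename
print(current_name(customer_history, 2))
```

Because earlier versions are never deleted, older facts can still be reported under the name that applied at the time.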
Step Activity
1 Choosing the process
2 Choosing the grain
3 Identifying and conforming the dimensions
4 Choosing the facts
5 Storing the precalculations in the fact table
6 Rounding out the dimension tables
7 Choosing the duration of the database
8 Tracking slowly changing dimensions
9 Deciding the query priorities and the query modes
Database design methodology for data warehouses
❑ There are many approaches that offer alternative routes to the
creation of a data warehouse.
❑ Typical approach: decompose the design of the data warehouse
into manageable parts (data marts). At a later stage, the
integration of the smaller data marts leads to the creation of the
enterprise-wide data warehouse.
❑ The methodology specifies the steps required for the design of a
data mart; however, it also ties together separate
data marts so that over time they merge into a coherent
overall data warehouse.
Step 1: Choosing the process
❑ The process (function) refers to the subject matter of a particular
data mart. The first data mart to be built should be the one that
is most likely to be delivered on time, within budget, and to
answer the most commercially important business questions.
❑ The best choice for the first data mart tends to be the one that is
related to ‘sales’
Step 2: Choosing the grain (unit of analysis)
❑ Choosing the grain means deciding exactly what a fact table record
represents. For example, the entity ‘Sales’ may represent the facts about
each property sale. Therefore, the grain of the ‘Property_Sales’ fact
table is an individual property sale.
❑ Only once the grain of the fact table has been chosen can we identify
the dimensions of the fact table.
❑ The grain decision for the fact table also determines the grain of each
of the dimension tables. For example, if the grain for the
‘Property_Sales’ is an individual property sale, then the grain of the
‘Client’ dimension is the detail of the client who bought a particular
property.
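A quick way to sanity-check a chosen grain is to confirm that no two fact rows share the same grain key. The sketch below assumes a hypothetical sale_id column that identifies one individual property sale; all rows and column names are invented for illustration.

```python
from collections import Counter

# Hypothetical Property_Sales rows; sale_id identifies one individual property sale.
property_sales = [
    {"sale_id": 101, "client_id": "C1", "property_id": "P7", "sale_price": 250_000},
    {"sale_id": 102, "client_id": "C2", "property_id": "P9", "sale_price": 180_000},
    {"sale_id": 103, "client_id": "C1", "property_id": "P3", "sale_price": 320_000},
]

def grain_violations(rows, grain_columns):
    """Return grain-key values that appear on more than one row."""
    keys = Counter(tuple(row[c] for c in grain_columns) for row in rows)
    return [key for key, count in keys.items() if count > 1]

# One row per sale: no duplicates at this grain.
print(grain_violations(property_sales, ["sale_id"]))
# A coarser key (client) is not the grain: client C1 appears on two rows.
print(grain_violations(property_sales, ["client_id"]))
```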
Step 3: Identifying and conforming the dimensions
❑ Dimensions set the context for formulating queries about the
facts in the fact table.
❑ We identify dimensions in sufficient detail to describe things such
as clients and properties at the correct grain.
❑ If a dimension occurs in two data marts, the two must be exactly
the same dimension, or one must be a subset of the other (this is
the only way that two data marts can share one or more dimensions
in the same application).
❑ When a dimension is used in more than one data mart, the dimension
is referred to as being conformed.
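The "identical or subset" rule can be expressed as a simple set comparison. This is only a sketch of the row-level check under assumed data; the mart names and client rows are hypothetical, and a fuller check would also compare attribute definitions.

```python
def is_conformed(dim_a, dim_b):
    """True if the two versions are identical or one is a subset of the other."""
    a, b = set(dim_a), set(dim_b)
    return a <= b or b <= a

# Hypothetical Client dimension rows, (client_id, client_name), in three data marts.
sales_mart_clients = [("C1", "Ann"), ("C2", "Bob"), ("C3", "Cara")]
rental_mart_clients = [("C1", "Ann"), ("C3", "Cara")]      # subset of the sales mart
adverts_mart_clients = [("C1", "Ann"), ("C2", "Robert")]   # C2 is described differently

print(is_conformed(sales_mart_clients, rental_mart_clients))
print(is_conformed(sales_mart_clients, adverts_mart_clients))
```

When the check fails, queries that combine the two data marts would disagree about the same client, which is exactly what conforming is meant to prevent.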
Step 4: Choosing the facts
❑ The grain of the fact table determines which facts can be used in
the data mart – all facts must be expressed at the level implied by
the grain.
❑ In other words, if the grain of the fact table is an individual
property sale, then all the numerical facts must refer to this
particular sale (the facts should be numeric and additive).
Step 5: Storing pre-calculations in the fact table
❑ Once the facts have been selected each should be re-examined to
determine whether there are opportunities to use pre-calculations.
❑ Common example: a profit or loss statement
❑ These types of facts are useful since they are additive quantities,
from which we can derive valuable information.
❑ This is particularly true for a value that is fundamental to an
enterprise, or if there is any chance of a user calculating the value
incorrectly.
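A small sketch of the idea, assuming a hypothetical SaleFact record with revenue and cost measures: profit is computed once when the row is loaded and stored alongside them, so every query reads the same value.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SaleFact:
    revenue: float
    cost: float
    profit: float  # pre-calculated and stored, not derived at query time

def make_fact(revenue, cost):
    """Compute profit once, at load time, so every query sees the same value."""
    return SaleFact(revenue=revenue, cost=cost, profit=revenue - cost)

# Hypothetical figures: one profitable sale and one loss-making sale.
facts = [make_fact(250_000, 200_000), make_fact(180_000, 190_000)]

# The stored profit is additive, so it can be summed directly.
total_profit = sum(f.profit for f in facts)
print(total_profit)
```

The trade-off is extra storage in exchange for consistency: users cannot accidentally apply a different formula.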
Step 6: Rounding out the dimension tables
❑ In this step we return to the dimension tables and add as many
text descriptions to the dimensions as possible.
❑ The text descriptions should be as intuitive and understandable to
the users as possible
Step 7: Choosing the duration of the data warehouse
❑ The duration measures how far back in time the fact table goes.
❑ For some companies (e.g. insurance companies) there may be a
legal requirement to retain data extending back five or more years.
❑ Very large fact tables raise at least two very significant data
warehouse design issues:
❑ The older the data, the more likely there will be problems in
reading and interpreting the old files
❑ It is mandatory that the old versions of the important
dimensions be used, not the most current versions (we will
discuss this issue later on)
Step 8: Tracking slowly changing dimensions
❑ The slowly changing dimension problem means that the proper
description of the old client and the old branch must be used with the
old data in the warehouse
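One common way to keep the old description available is to version the dimension rows with validity dates, an approach often called a Type 2 slowly changing dimension. The slides do not name a specific technique, so the sketch below is an assumption; the column names, dates, and cities are hypothetical.

```python
from datetime import date

# Hypothetical versioned Client dimension: each row is valid for a date range.
client_dim = [
    {"client_key": 1, "client_id": "C1", "city": "Galle",
     "valid_from": date(2020, 1, 1), "valid_to": date(2022, 6, 30)},
    {"client_key": 2, "client_id": "C1", "city": "Matara",
     "valid_from": date(2022, 7, 1), "valid_to": date(9999, 12, 31)},
]

def client_as_of(dim, client_id, on_date):
    """Return the dimension row that was in effect on the given date."""
    for row in dim:
        if (row["client_id"] == client_id
                and row["valid_from"] <= on_date <= row["valid_to"]):
            return row
    return None

# An old sale picks up the old description; a recent sale picks up the new one.
old_sale = client_as_of(client_dim, "C1", date(2021, 3, 15))
new_sale = client_as_of(client_dim, "C1", date(2023, 3, 15))
print(old_sale["city"], new_sale["city"])
```

In a warehouse the fact row would typically store the surrogate client_key found at load time, so the lookup happens once rather than in every query.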
2. Introduction to the tools and technologies used to design the data warehouse
3. Dimensional modelling (logical schema and physical schema for specific target database management systems (DBMSs))
5. 10 business queries
6. Results of 10 queries