Lecture 3 Data Warehouse Modelling

The document discusses data warehouse modeling. It describes three common data warehouse architectures: centralized, federated, and tiered. It also covers dimensional modeling concepts like facts, dimensions, and measures. Key aspects of dimensional modeling include star schemas with a central fact table linked to dimension tables, and snowflake schemas which extend dimensions into hierarchies.

Uploaded by

lasithrandima123

Data Warehouse Modelling

Lecture 3
Conducted by
Ms. Akila Brahmana
Department of ICT
Faculty of Technology
University of Ruhuna
Objectives
❑ Identify the different models used in the business environment.
❑ Identify the importance of users in early definition of the data
requirements.
❑ Describe three different data models for the warehouse,
identifying their properties and use.
❑ Discuss modeling considerations in the context of specific
dimensions such as time.
❑ Demonstrate how an operational model may be used for the
warehouse model.
Physical Structure of Data Warehouse
There are three basic architectures for constructing a data
warehouse:
❑ Centralized
❑ Federated
❑ Tiered
The data warehouse is distributed for: load balancing, scalability and
higher availability
Physical Structure of Data Warehouse
❑ Centralized architecture
[Figure: clients connect to a single central data warehouse, which is fed by the source systems]
Physical Structure of Data Warehouse
❑ Federated architecture
[Figure: end users access local data marts (Marketing, Financial, Distribution), which sit on a logical data warehouse fed by the source systems]
Physical Structure of Data Warehouse
❑ Tiered architecture
[Figure: workstations (highly summarized data) draw on local data marts, which draw on a physical data warehouse fed by the source systems]
Physical Structure of Data Warehouse
❑ Federated architecture
❑ The logical data warehouse is only virtual

❑ Tiered architecture
❑ The central data warehouse is physical
❑ Local data marts exist on different tiers and store copies
or summarizations of the previous tier.
Logical And Physical Design
❑ A logical design is conceptual and abstract
❑ In the logical design, you look at the logical relationships among
the objects
❑ In the physical design, you look at the most effective way of storing
and retrieving the objects as well as handling them from a
transportation and backup/recovery perspective
Logical Design Compared with Physical Design
The Process of Modeling the Warehouse
❑ Determine user requirements
❑ Assist users in understanding new technology
❑ Consider imperatives
❑ User analytical requirements
❑ Performance objectives
❑ Keep acquainted with changing requirements
The Process of Modeling the Warehouse
❑ Create a business model
❑ Create a dimensional model
❑ Create a physical model

Conceptual Model

Logical Model

Physical Model
Basic Concepts
❑ Dimensional modeling has several basic concepts:
❑ Facts
❑ Dimensions
❑ Measures (variables)

Dimension                Measure
ProductWiseSales         3000 units
AgeWisePopulation        2 million people
DesignationWiseSalary    Rs. 200000
Fact Tables
❑ Fact tables are the large tables in your warehouse schema that
store business measurements.
❑ Fact tables typically contain facts and foreign keys to the dimension
tables. Fact tables represent data, usually numeric and additive, that
can be analyzed and examined. Examples include sales, cost, and
profit
Fact Data (Business Subjects)
❑ Bulk of the warehouse data
❑ Accessed by dimensions
❑ Partitioned

❑ Series of snapshots
❑ Numerical data
❑ Element of time
❑ Dates for calculations
❑ Composite keys

❑ Indexed primary keys


Dimension Tables
❑ A dimension is a structure, often composed of one or more
hierarchies, that categorizes data.
❑ Dimensional attributes help to describe the dimensional value
(measures in fact table). Dimensional attributes are normally
descriptive, textual values.
❑ Several distinct dimensions, combined with facts, enable you to
answer business questions. Commonly used dimensions are
customers, products, and time.
Dimension Data
❑ Qualifies user query constraints
❑ Drives the query
❑ Is linked to fact by keys
Examples for Dimensions and Facts
❑ DIMENSIONS
❑ Time
❑ Location/region
❑ Customers
❑ Salesperson
❑ FACTS
❑ Sales
Measures
❑ A measure is a numeric attribute of a fact, representing the
performance or behavior of the business relative to the dimensions.
The actual numbers are called variables.
❑ A measure is determined by combinations of the members of the
dimensions and is located on facts.
❑ Examples of Measures:
❑ Quantity Sold
❑ Unit Price
❑ Amount Sold
❑ Profit
Conceptual Modeling of Data Warehouses
Three basic conceptual schemas:
❑ Star schema
❑ Snowflake schema
❑ Fact constellations
Star schema
Star schema: A single object (fact table) in the middle connected to a
number of dimension tables

❑ The star schema is a data modeling technique used to map multidimensional
decision support data into a relational database.
❑ Star schemas yield an easily implemented model for multidimensional data
analysis while still preserving the relational structure of the operational
database.
❑ Other names: star-join schema, data cube, data list, grid file and
multi-dimensional schema
Star schema
sale (fact table): orderId, date, custId, prodId, storeId, qty, amt
customer: custId, name, address, city
product: prodId, name, price
store: storeId, city
Star schema
product
prodId  name  price
p1      bolt  10
p2      nut   5

store
storeId  city
c1       nyc
c2       sfo
c3       la

sale
orderId  date    custId  prodId  storeId  qty  amt
o100     1/7/97  53      p1      c1       1    12
o102     2/7/97  53      p2      c1       2    11
o105     3/8/97  111     p1      c3       5    50

customer
custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la
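The sample tables above can be loaded and queried directly. The following is a minimal sketch using Python's built-in sqlite3 module; the table names, columns, and rows come from the slide, while the column types and the query itself are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product  (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE store    (storeId TEXT PRIMARY KEY, city TEXT);
CREATE TABLE customer (custId INTEGER PRIMARY KEY, name TEXT, address TEXT, city TEXT);
-- Fact table: one row per order, with a foreign key to each dimension.
CREATE TABLE sale (
    orderId TEXT PRIMARY KEY,
    date    TEXT,
    custId  INTEGER REFERENCES customer(custId),
    prodId  TEXT REFERENCES product(prodId),
    storeId TEXT REFERENCES store(storeId),
    qty     INTEGER,
    amt     INTEGER
);
""")
conn.executemany("INSERT INTO product VALUES (?, ?, ?)",
                 [("p1", "bolt", 10), ("p2", "nut", 5)])
conn.executemany("INSERT INTO store VALUES (?, ?)",
                 [("c1", "nyc"), ("c2", "sfo"), ("c3", "la")])
conn.executemany("INSERT INTO customer VALUES (?, ?, ?, ?)",
                 [(53, "joe", "10 main", "sfo"), (81, "fred", "12 main", "sfo"),
                  (111, "sally", "80 willow", "la")])
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?, ?, ?, ?)",
                 [("o100", "1/7/97", 53, "p1", "c1", 1, 12),
                  ("o102", "2/7/97", 53, "p2", "c1", 2, 11),
                  ("o105", "3/8/97", 111, "p1", "c3", 5, 50)])

# Star join: group by a dimension attribute and sum the measure.
amt_by_store_city = dict(conn.execute("""
    SELECT store.city, SUM(sale.amt)
    FROM sale JOIN store ON sale.storeId = store.storeId
    GROUP BY store.city
""").fetchall())
print(amt_by_store_city)
```

The query touches only the fact table and the one dimension it needs, which is the typical access pattern a star schema is meant to support.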
Terms
❑ Basic notion: a measure (e.g. sales, qty, etc)
❑ Given a collection of numeric measures
❑ Each measure depends on a set of dimensions (e.g. sales volume
as a function of product, time, and location)
Terms
❑ Relation, which relates the dimensions to the measure of interest,
is called the fact table (e.g. sale)
❑ Information about dimensions can be represented as a collection
of relations – called the dimension tables (product, customer,
store)
❑ Each dimension can have a set of associated attributes
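The idea of a measure as a function of its dimensions can be sketched in a few lines of Python. The figures below are hypothetical and are not taken from the lecture; `total_by` is an illustrative helper name.

```python
from collections import defaultdict

# Hypothetical sales volumes keyed by (product, month, location).
sales_volume = {
    ("bolt", "Jan", "nyc"): 120,
    ("bolt", "Jan", "la"): 80,
    ("nut", "Jan", "nyc"): 40,
    ("bolt", "Feb", "nyc"): 100,
}

def total_by(dimension_index):
    """Sum the measure over every dimension except the one at dimension_index."""
    totals = defaultdict(int)
    for key, volume in sales_volume.items():
        totals[key[dimension_index]] += volume
    return dict(totals)

by_product = total_by(0)
by_location = total_by(2)
print(by_product, by_location)
```

Each key plays the role of one fact table row's combination of dimension values, and the stored number is the measure.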
Warehouse Model Components
❑ Dimension
❑ Attribute
❑ Color, size, weight
❑ Days, weeks, holidays on the time dimension
[Figure: Location Dimension with levels Region and Store; Product Dimension with levels Product Family, Product Type and Product]
Warehouse Model Components
❑ Hierarchy of attributes in a dimension
❑ Relationship
❑ Fact
[Figure: Location Dimension (Region, Store; Store_key), Product Dimension (Product Family, Product Type, Product; Item_key) and Time Dimension (Account Year, Account Week; Acct_Week_key), each linked by its key to the Sales Data fact]
Example of Star Schema
Sales Fact Table: Date, Product, Store, Customer, unit_sales, dollar_sales, schilling_sales (measurements)
Date: Date, Month, Year
Product: ProductNo, ProdName, ProdDesc, Category, QOH
Store: StoreID, City, State, Country, Region
Customer: CustId, CustName, CustCity, CustCountry
Dimension Hierarchies
For each dimension, the set of associated attributes can be
structured as a hierarchy

store → sType
store → city → region
customer → city → state → country


Dimension Hierarchies
sType
tId  size   location
t1   small  downtown
t2   large  suburbs

store
storeId  cityId  tId  mgr
s5       sfo     t1   joe
s7       sfo     t2   fred
s9       la      t1   nancy

city
cityId  pop  regId
sfo     1M   north
la      5M   south

region
regId  name
north  cold region
south  warm region
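Walking a dimension hierarchy amounts to following keys from one level to the next. The sketch below uses the store, city, and region rows from the slide; the per-store sales figures and the helper name `region_of` are assumptions made for illustration.

```python
# Dimension tables from the slide, as dictionaries keyed by primary key.
store = {"s5": {"cityId": "sfo", "tId": "t1", "mgr": "joe"},
         "s7": {"cityId": "sfo", "tId": "t2", "mgr": "fred"},
         "s9": {"cityId": "la", "tId": "t1", "mgr": "nancy"}}
city = {"sfo": {"pop": "1M", "regId": "north"},
        "la": {"pop": "5M", "regId": "south"}}
region = {"north": "cold region", "south": "warm region"}

def region_of(store_id):
    """Walk store -> city -> region and return (regId, region name)."""
    reg_id = city[store[store_id]["cityId"]]["regId"]
    return reg_id, region[reg_id]

# Hypothetical per-store sales, rolled up to the region level.
sales_by_store = {"s5": 100, "s7": 250, "s9": 75}
sales_by_region = {}
for store_id, amount in sales_by_store.items():
    reg_id, _ = region_of(store_id)
    sales_by_region[reg_id] = sales_by_region.get(reg_id, 0) + amount
print(sales_by_region)
```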
Hierarchies and Levels
❑ Hierarchies are logical structures that use ordered levels as a
means of organizing data. A hierarchy can be used to define data
aggregation. For example, in a time dimension, a hierarchy might
aggregate data from the month level to the quarter level to the
year level. A hierarchy can also be used to define a navigational drill
path and to establish a family structure.
❑ A level represents a position in a hierarchy. For example, a time
dimension might have a hierarchy that represents data at the
month, quarter, and year levels
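Aggregating along a time hierarchy can be sketched as follows. The monthly amounts are hypothetical, and `quarter_of` is an illustrative helper that assumes calendar quarters.

```python
from collections import defaultdict

# Hypothetical monthly sales: (year, month) -> amount.
monthly_sales = {(1997, 1): 51000, (1997, 2): 40000,
                 (1997, 3): 17000, (1997, 4): 20000}

def quarter_of(month):
    """Map a calendar month (1-12) to its calendar quarter (1-4)."""
    return (month - 1) // 3 + 1

# Roll the month level up to the quarter level and the year level.
quarterly = defaultdict(int)
yearly = defaultdict(int)
for (year, month), amount in monthly_sales.items():
    quarterly[(year, quarter_of(month))] += amount
    yearly[year] += amount
print(dict(quarterly), dict(yearly))
```

Drilling down is the reverse navigation: from a year total back to its quarters, and from a quarter back to its months.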
Snowflake Schema
Snowflake schema: A refinement of star schema where the
dimensional hierarchy is represented explicitly by normalizing the
dimension tables
❑ Snowflake schema is a variant of the star schema where dimension tables
contain normalized data
❑ e.g. ‘city’ and ‘province’ can be split into separate tables to normalize
dimension tables
❑ ‘Starflake’ schema is a hybrid structure that contains a mixture of star
(denormalized) and snowflake (normalized) schemas
❑ This allows dimension tables to be present in both forms for different
query requirements
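The difference between the two forms can be shown by normalizing a small store dimension. This is a sketch with sqlite3; the table names, column names, and rows are illustrative assumptions, not part of the lecture.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Star form: the province name is repeated on every store row.
CREATE TABLE store_star (storeId TEXT PRIMARY KEY, city TEXT, province TEXT);
INSERT INTO store_star VALUES
    ('s1', 'Galle', 'Southern'), ('s2', 'Matara', 'Southern'),
    ('s3', 'Galle', 'Southern'), ('s4', 'Colombo', 'Western');

-- Snowflake form: each level of the hierarchy gets its own table.
CREATE TABLE province (provinceId INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE city (cityId INTEGER PRIMARY KEY, name TEXT UNIQUE,
                   provinceId INTEGER REFERENCES province(provinceId));
CREATE TABLE store_snow (storeId TEXT PRIMARY KEY,
                         cityId INTEGER REFERENCES city(cityId));

-- Populate the normalized tables from the denormalized one.
INSERT INTO province (name) SELECT DISTINCT province FROM store_star;
INSERT INTO city (name, provinceId)
    SELECT DISTINCT s.city, p.provinceId
    FROM store_star s JOIN province p ON p.name = s.province;
INSERT INTO store_snow
    SELECT s.storeId, c.cityId FROM store_star s JOIN city c ON c.name = s.city;
""")

star_rows = conn.execute(
    "SELECT storeId, city, province FROM store_star ORDER BY storeId").fetchall()
snow_rows = conn.execute("""
    SELECT s.storeId, c.name, p.name
    FROM store_snow s
    JOIN city c ON c.cityId = s.cityId
    JOIN province p ON p.provinceId = c.provinceId
    ORDER BY s.storeId
""").fetchall()
print(star_rows == snow_rows)
```

Both forms return the same information; the snowflake form stores each province name once but needs two extra joins to read it back.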
Example of Snowflake Schema
Sales Fact Table: Date, Product, Store, Customer, unit_sales, dollar_sales, schilling_sales (measurements)
Date: Date, Month
Month: Month, Year
Year: Year
Product: ProductNo, ProdName, ProdDesc, Category, QOH
Store: StoreID, City
City: City, State
State: State, Country
Country: Country, Region
Cust: CustId, CustName, CustCity, CustCountry
Fact constellations
Fact constellations: Multiple fact tables share dimension tables
Warehouse Model - Constellation
Inventory Fact Table: Cost_dollars, Qty_on_hand
Sales Fact Table: Sales_dollars, Sales_units
Warehouse Table: Warehouse_id, Warehouse_loc
Product Table: Product_id, Product_desc
Store Table: Store_id, District_id
Time Table: Week_id, Period_id, Year_id
Item Table: Item_id, Dept_id
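A constellation lets two fact tables be compared through a dimension they share. The sketch below is a reduced version of the model above, assuming that both fact tables reference the time and product tables; the key columns on the fact tables and all rows are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim    (week_id INTEGER PRIMARY KEY, year_id INTEGER);
CREATE TABLE product_dim (product_id INTEGER PRIMARY KEY, product_desc TEXT);
-- Two fact tables that share the same dimension tables.
CREATE TABLE sales_fact (
    week_id INTEGER REFERENCES time_dim(week_id),
    product_id INTEGER REFERENCES product_dim(product_id),
    sales_dollars REAL, sales_units INTEGER);
CREATE TABLE inventory_fact (
    week_id INTEGER REFERENCES time_dim(week_id),
    product_id INTEGER REFERENCES product_dim(product_id),
    cost_dollars REAL, qty_on_hand INTEGER);
INSERT INTO time_dim VALUES (1, 1997), (2, 1997);
INSERT INTO product_dim VALUES (10, 'bolt'), (20, 'nut');
INSERT INTO sales_fact VALUES
    (1, 10, 50.0, 5), (1, 20, 10.0, 2), (2, 10, 30.0, 3);
INSERT INTO inventory_fact VALUES
    (1, 10, 200.0, 40), (1, 20, 60.0, 30), (2, 10, 170.0, 37);
""")

# Aggregate each fact table separately per week, then line the results up
# on the shared time dimension (joining the two facts directly would
# multiply rows).
weekly = conn.execute("""
    SELECT t.week_id,
           (SELECT SUM(sales_units) FROM sales_fact s WHERE s.week_id = t.week_id),
           (SELECT SUM(qty_on_hand) FROM inventory_fact i WHERE i.week_id = t.week_id)
    FROM time_dim t
    ORDER BY t.week_id
""").fetchall()
print(weekly)
```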
Modeling Dimensions
❑ Model according to data content
❑ Model with aggregation needs in mind
❑ Model to satisfy drilling requirements
❑ May be fully denormalized - star
❑ May be normalized - snowflake
❑ Categorical dimensions may be included
Modeling the Time Dimension
❑ The time dimension
❑ Is mandatory
❑ Is unique
❑ Is powerful
❑ Needs careful design
❑ Include business time periods
❑ Include special dates
Modeling the Time Dimension
[Figure: three alternative designs, each with a fact table joined on Day_id to a time dimension holding Day_id, Month_id, Quarter_id, Half_year_id and Year_id; one design also carries Fiscal_month_id, Fiscal_quarter_id, Fiscal_half_year_id and Fiscal_year_id]
Modeling the Time Dimension
❑ Analyze and plan
❑ Consider relevant date range
❑ Use uniform time rollup characteristics
❑ Model for flexibility
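A time dimension is usually generated rather than extracted from a source system. The sketch below builds one row per day with both calendar and fiscal attributes; the column names follow the time dimension shown earlier, while the April fiscal-year start and the convention of labelling a fiscal year by the calendar year in which it begins are assumptions.

```python
from datetime import date, timedelta

FISCAL_YEAR_START_MONTH = 4  # assumption: the fiscal year starts in April

def time_dimension_row(day):
    """Build one time-dimension row for a calendar day."""
    quarter = (day.month - 1) // 3 + 1
    # Months elapsed since the start of the fiscal year (0-11).
    fiscal_offset = (day.month - FISCAL_YEAR_START_MONTH) % 12
    return {
        "day_id": day.isoformat(),
        "month_id": day.month,
        "quarter_id": quarter,
        "half_year_id": 1 if quarter <= 2 else 2,
        "year_id": day.year,
        "fiscal_month_id": fiscal_offset + 1,
        "fiscal_quarter_id": fiscal_offset // 3 + 1,
        "fiscal_half_year_id": fiscal_offset // 6 + 1,
        # Label the fiscal year by the calendar year in which it starts.
        "fiscal_year_id": (day.year if day.month >= FISCAL_YEAR_START_MONTH
                           else day.year - 1),
    }

def build_time_dimension(start, end):
    """Generate one row per day from start to end, inclusive."""
    days = (end - start).days + 1
    return [time_dimension_row(start + timedelta(days=n)) for n in range(days)]

rows = build_time_dimension(date(1997, 1, 1), date(1997, 12, 31))
print(len(rows), rows[0])
```

Special dates such as holidays would be added as further columns on the same rows.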
Modeling the Time Dimension
❑ Include special dates
❑ Identify date requirements
❑ Simple
❑ Fiscal
❑ Rolling
[Figure: a fact table linked to calendar, fiscal and rolling time dimensions]
Modeling Summary Tables
Define before design

SALES FACTS
Sales$   Region  Month
10,000   North   Jan 97
12,000   South   Feb 97
11,000   North   Jan 97
15,000   West    Mar 97
18,000   South   Feb 97
20,000   North   Jan 97
10,000   East    Jan 97
2,000    West    Mar 97

SALES BY MONTH/REGION
Month    Region  Tot_Sales$
Jan 97   North   41,000
Jan 97   East    10,000
Feb 97   South   40,000
Mar 97   West    17,000

SALES BY MONTH
Month    Tot_Sales
Jan 97   51,000
Feb 97   40,000
Mar 97   17,000
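The summary tables can be derived from the detail rows with a single pass, as in this minimal sketch. The detail rows are those listed under SALES FACTS. Note that the two Feb 97 rows as printed add up to 30,000 rather than the 40,000 shown in the summaries, so the sketch reports 30,000 for that month.

```python
from collections import defaultdict

# Detail rows from the SALES FACTS table: (sales, region, month).
sales_facts = [
    (10000, "North", "Jan 97"), (12000, "South", "Feb 97"),
    (11000, "North", "Jan 97"), (15000, "West", "Mar 97"),
    (18000, "South", "Feb 97"), (20000, "North", "Jan 97"),
    (10000, "East", "Jan 97"), (2000, "West", "Mar 97"),
]

by_month_region = defaultdict(int)  # SALES BY MONTH/REGION
by_month = defaultdict(int)         # SALES BY MONTH
for sales, region, month in sales_facts:
    by_month_region[(month, region)] += sales
    by_month[month] += sales
print(dict(by_month_region))
print(dict(by_month))
```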
When to Summarize Data
❑ Trade-off between direct access and calculation at the time of
execution
❑ Compression ratio calculations can help you decide

Queried Rows  Displayed Rows  Calculation  Ratio
1,341         22              22/1,341     0.0164
234           22              22/234       0.09
30            22              22/30        0.73
20            22              22/20        1.1
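The ratios in the table are simply displayed rows divided by queried rows. A small sketch follows; the function name is illustrative, and the reading of the result in the closing note is an interpretation rather than something the slide states.

```python
def compression_ratio(queried_rows, displayed_rows):
    """Ratio of rows displayed to rows read in order to produce them."""
    return displayed_rows / queried_rows

# Reproduce the table: 22 displayed rows against varying queried rows.
for queried in (1341, 234, 30, 20):
    print(queried, round(compression_ratio(queried, 22), 4))
```

A ratio far below 1 suggests that many detail rows are read for each row shown, which is where a pre-built summary is most likely to pay off.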
Maintaining History
❑ Maintain history of data changes
❑ Enable in the model
Customer History Table
Cust_id  Cust_name  Version
1        ABC Co     1
1        ABC Ltd    2
2        XYZ Inc    1

Customer Table
Cust_id
1
2
3

Sales Fact Table: Unit Sell Price, Dollar Sales, Unit Sales, Dollar Cost
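A versioned history table makes it possible to look up a customer either as currently known or as recorded at an earlier version. The sketch below uses the rows from the Customer History Table; the function names are illustrative.

```python
# Customer history rows from the slide: (cust_id, cust_name, version).
customer_history = [
    (1, "ABC Co", 1),
    (1, "ABC Ltd", 2),
    (2, "XYZ Inc", 1),
]

def name_as_of_version(cust_id, version):
    """Return the customer name recorded for a given version, or None."""
    for row_id, name, row_version in customer_history:
        if row_id == cust_id and row_version == version:
            return name
    return None

def current_name(cust_id):
    """Return the name from the highest version recorded for the customer."""
    versions = [(v, name) for row_id, name, v in customer_history
                if row_id == cust_id]
    return max(versions)[1] if versions else None

print(name_as_of_version(1, 1), current_name(1))
```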
Entity-Relationship vs. Dimensional Models

Entity-Relationship                Dimensional Models
One table per entity               One fact table for data organization
Minimize data redundancy           Maximize understandability
Optimize update                    Optimized for retrieval
The Transaction Processing Model   The data warehousing model
Database design methodology for
data warehouses (1)
Nine-step methodology – proposed by Kimball

Step Activity
1 Choosing the process
2 Choosing the grain
3 Identifying and conforming the dimensions
4 Choosing the facts
5 Storing the precalculations in the fact table
6 Rounding out the dimension tables
7 Choosing the duration of the database
8 Tracking slowly changing dimensions
9 Deciding the query priorities and the query modes
Database design methodology for
data warehouses (2)
❑ There are many approaches that offer alternative routes to the
creation of a data warehouse
❑ Typical approach – decompose the design of the data warehouse
into manageable parts – data marts. At a later stage, the
integration of the smaller data marts leads to the creation of the
enterprise-wide data warehouse.
❑ The methodology specifies the steps required for the design of a
data mart; however, the methodology also ties together separate
data marts so that over time they merge together into a coherent
overall data warehouse.
Step 1: Choosing the process
❑ The process (function) refers to the subject matter of a particular
data mart. The first data mart to be built should be the one that
is most likely to be delivered on time, within budget, and to
answer the most commercially important business questions.
❑ The best choice for the first data mart tends to be the one that is
related to ‘sales’
Step 2: Choosing the grain (unit of analysis)
❑ Choosing the grain means deciding exactly what a fact table record
represents. For example, the entity ‘Sales’ may represent the facts about
each property sale. Therefore, the grain of the ‘Property_Sales’ fact
table is individual property sale.
❑ Only when the grain for the fact table is chosen can we identify the
dimensions of the fact table.
❑ The grain decision for the fact table also determines the grain of each
of the dimension tables. For example, if the grain for the
‘Property_Sales’ is an individual property sale, then the grain of the
‘Client’ dimension is the detail of the client who bought a particular
property.
Step 3: Identifying and conforming the
dimensions
❑ Dimensions set the context for formulating queries about the
facts in the fact table.
❑ We identify dimensions in sufficient detail to describe things such
as clients and properties at the correct grain.
❑ If any dimension occurs in two data marts, it must be exactly
the same dimension in both, or one must be a subset of the other (this is
the only way that two data marts can share one or more dimensions in the
same application).
❑ When a dimension is used in more than one data mart, the dimension is
referred to as being conformed.
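The "same or subset" rule can be checked mechanically. The sketch below treats a dimension as a set of rows and interprets "subset" as a subset of rows, which is an assumption; the dimension contents and the function name are hypothetical.

```python
def is_conformed(dim_a, dim_b):
    """True if the two dimensions are identical or one is a subset of the other."""
    a, b = set(dim_a), set(dim_b)
    return a <= b or b <= a

# Hypothetical client dimensions used by two data marts: (client_id, name).
sales_mart_clients = [(1, "ABC Ltd"), (2, "XYZ Inc"), (3, "PQR Co")]
rental_mart_clients = [(1, "ABC Ltd"), (2, "XYZ Inc")]
drifted_clients = [(1, "ABC Co"), (2, "XYZ Inc")]

print(is_conformed(sales_mart_clients, rental_mart_clients))
print(is_conformed(sales_mart_clients, drifted_clients))
```

In the second comparison, client 1 is described differently in the two marts, so the dimensions could not be shared as they stand.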
Step 4: Choosing the facts
❑ The grain of the fact table determines which facts can be used in
the data mart – all facts must be expressed at the level implied by
the grain.
❑ In other words, if the grain of the fact table is an individual
property sale, then all the numerical facts must refer to this
particular sale (the facts should be numeric and additive).
Step 5: Storing pre-calculations in the
fact table
❑ Once the facts have been selected each should be re-examined to
determine whether there are opportunities to use pre-calculations.
❑ Common example: a profit or loss statement
❑ These types of facts are useful since they are additive quantities,
from which we can derive valuable information.
❑ This is particularly true for a value that is fundamental to an
enterprise, or if there is any chance of a user calculating the value
incorrectly.
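Storing a pre-calculated value means computing it once at load time and keeping it in the fact row next to its inputs. A minimal sketch, with hypothetical column names and figures:

```python
def add_precalculated_profit(fact_row):
    """Return a copy of the fact row with profit stored alongside its inputs."""
    row = dict(fact_row)
    row["profit"] = row["dollar_sales"] - row["dollar_cost"]
    return row

# Hypothetical fact rows as they arrive from the source system.
fact_rows = [
    {"sale_id": 1, "dollar_sales": 1200, "dollar_cost": 900},
    {"sale_id": 2, "dollar_sales": 500, "dollar_cost": 650},
]
loaded = [add_precalculated_profit(r) for r in fact_rows]

# Because profit is additive, it can be summed directly across rows.
total_profit = sum(r["profit"] for r in loaded)
print(total_profit)
```

Users then read the stored profit instead of re-deriving it, which removes the chance of each query computing it differently.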
Step 6: Rounding out the dimension tables
❑ In this step we return to the dimension tables and add as many
text descriptions to the dimensions as possible.
❑ The text descriptions should be as intuitive and understandable to
the users as possible
Step 7: Choosing the duration of the
data warehouse
❑ The duration measures how far back in time the fact table goes.
❑ For some companies (e.g. insurance companies) there may be a
legal requirement to retain data extending back five or more years.
❑ Very large fact tables raise at least two very significant data
warehouse design issues:
❑ The older the data, the more likely there will be problems in
reading and interpreting the old files
❑ It is mandatory that the old versions of the important
dimensions be used, not the most current versions (we will
discuss this issue later on)
Step 8: Tracking slowly changing dimensions
❑ The changing dimension problem means that the proper description of
the old client and the old branch must be used with the old data
warehouse schema
❑ Usually, the data warehouse must assign a generalized key to these
important dimensions in order to distinguish multiple snapshots of clients
and branches over a period of time
❑ There are different types of changes in dimensions:
❑ A dimension attribute is overwritten
❑ A dimension attribute causes a new dimension record to be created, etc.
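The two kinds of change listed above are commonly called Type 1 (overwrite) and Type 2 (new record) in Kimball's terminology. The sketch below shows both on a small client dimension; the column names, the `current` flag, and the sample values are assumptions made for illustration.

```python
import itertools

_next_key = itertools.count(1)  # source of generalized (surrogate) keys

def new_client_row(client_id, branch):
    """Create a dimension row with a fresh generalized key."""
    return {"client_key": next(_next_key), "client_id": client_id,
            "branch": branch, "current": True}

def overwrite(dimension, client_id, branch):
    """Overwrite the attribute in place; the old value is lost."""
    for row in dimension:
        if row["client_id"] == client_id and row["current"]:
            row["branch"] = branch

def add_version(dimension, client_id, branch):
    """Keep the old row and append a new one with a new generalized key."""
    for row in dimension:
        if row["client_id"] == client_id and row["current"]:
            row["current"] = False
    dimension.append(new_client_row(client_id, branch))

clients = [new_client_row("C1", "Galle")]
add_version(clients, "C1", "Matara")   # history kept: two rows for C1
overwrite(clients, "C1", "Colombo")    # current row changed, no new row
print(clients)
```

Old fact rows keep pointing at the old generalized key, so they remain associated with the description that was valid when they were recorded.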


Step 9: Deciding the query priorities and the
query modes
❑ In this step we consider physical design issues.
❑ The presence of pre-stored summaries and aggregates
❑ Indices
❑ Materialized views
❑ Security issue
❑ Backup issue
❑ Archive issue
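Two of these physical design choices, indexes and pre-stored aggregates, can be sketched with sqlite3. The table, column, and index names and the rows are hypothetical. SQLite has no materialized views, so an ordinary summary table stands in for one here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_fact (day_id TEXT, store_id TEXT, dollar_sales INTEGER);
INSERT INTO sales_fact VALUES
    ('1997-01-05', 's5', 100), ('1997-01-20', 's7', 250),
    ('1997-02-03', 's5', 75);

-- Index on a key column that queries commonly constrain on.
CREATE INDEX idx_sales_store ON sales_fact(store_id);

-- Pre-stored aggregate, built once and queried many times.
CREATE TABLE sales_by_month AS
    SELECT substr(day_id, 1, 7) AS month_id,
           SUM(dollar_sales) AS dollar_sales
    FROM sales_fact
    GROUP BY month_id;
""")

summary = dict(conn.execute("SELECT month_id, dollar_sales FROM sales_by_month"))
print(summary)
```

A summary table built this way has to be refreshed whenever the fact table is loaded, which is part of the maintenance cost to weigh against the faster queries.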
Database design methodology for
data warehouses - summary
❑ At the end of this methodology, we have a design for a data mart
that supports the requirements of a particular business process
and allows the easy integration with other related data marts to
ultimately form the enterprise-wide data warehouse.
❑ A dimensional model, which contains more than one fact table
sharing one or more conformed dimension tables, is referred to as
a fact constellation.
Example
Thank You!
Assignment 1
You are the data design specialist on the data warehouse project team. Design a data warehouse for a
selected business or organization
1. Introduction to business/organization

2. Introduction to the tools and technologies used to design the data warehouse

3. Dimensional modelling (logical schema and the physical schema for specific target database management systems
(DBMSs))

4. Implementation of data warehouse with dummy data

5. 10 business queries

6. Results of 10 queries

Evaluation methods: Report & Demonstration of warehouse implementation

Assignment Type: Group
