Data Warehouse Modeling
1
Learning Objectives
• Understand the Entity Relationship Model
• Understand dimensional modelling with the Star Schema and Snowflake Schema
2
ERD Overview
• Entity Relationship Diagram (ERD)
▪ A conceptual blueprint of a database
▪ Graphical representation of all entity relationships
• Basic Concept
▪ Entities – Boxes (Rectangles)
▪ Relationships – Lines and Diamonds
▪ Attributes – Ovals
3
ERD Overview
• Entity
▪ A person, place, event, or thing for which we intend to collect data
• Entity type
▪ A collection of entities that share common properties or characteristics
▪ Represented by a rectangle containing the entity’s name
• Entity instance
▪ A single occurrence of an entity type
• Example: University
▪ Entity types: Student, Course, Professor, Department
4
ERD Overview
• Attribute
▪ Property or characteristic of an entity type
▪ Represented by an oval containing the attribute’s name, connected to the entity with a line
▪ Primary key
◦ An attribute (or a combination of attributes) that uniquely identifies each instance of an entity
◦ Each entity must have a primary key
◦ Primary keys are underlined in an ERD
• Example: entity Student with attributes Student_No, Name, Address, Date_of_Birth, Phone_No
5
ERD Overview
• Relationship
▪ A relationship is an association between or among entities
▪ Represented by diamond-shaped symbols
▪ Usually described by a verb
▪ Three basic types
◦ 1:1, 1:M, M:N
◦ The type gives the number of instances of one entity that can or must be associated
with each instance of another entity
• Example: Student (entity), Takes (relationship), Course (entity)
6
ERD Overview
• Examples of One-to-One Relationship
▪ Employee (1) Manages (1) Division
▪ Professor (1) Chairs (1) Department
7
ERD Overview
• Examples of One-to-Many Relationship
[Figure: three 1:M relationship diagrams, labelled Faculty, Customer, and Company]
8
ERD Overview
• Examples of Many-to-Many Relationship
[Figure: three M:N relationship diagrams, labelled Faculty, Bar, and Student]
9
ERD Overview
• Example of ERD
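▪ Entities: Professor, Course, Student, Department
▪ Attributes shown: Prof_id, CR_id, CR_Desc, Stud_id, Dept_id, Name
▪ Relationships:
◦ Professor (1) Teaches (M) Course
◦ Course (M) Enrolled_By (N) Student
◦ Professor (M) Work_In (1) Department
◦ Professor (1) Chairs (1) Department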
10
ERD Overview
• Example of ERD
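▪ Entities: Employee, Project, Department
▪ Attributes shown: E_id, E_Name, Address, Proj_id, Proj_Name, Dept_id, Dept_Name, Location
▪ Relationships:
◦ Employee (M) Work_on (N) Project
◦ Employee (M) Works_For (1) Department
◦ Employee (1) Manages (1) Department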
11
Try it Yourself
• ERD Exercise 1: Fun Place Parent Directory
▪ The director of the “Fun Place” preschool wants to keep track of the contact
information of the doctors (M.D.) for children (in case she needs to contact
them during an emergency). She wants to keep track of the ID (a unique
number assigned to each doctor), the name, and telephone number, of the
doctors and some basic information about the child (student no, name, and
age). (Each student will have only one doctor. However, a doctor may have
more than one child in the Preschool)
12
Try it Yourself
• ERD Exercise 2: Extracurricular Activities
▪ A database is being constructed to keep track of data about students'
extracurricular activities in an elementary school. A student can participate in
more than one activity, and each activity may have multiple participating
students. An activity can have only one supervisor, but each supervisor can be
in charge of more than one activity. For each student, the database is to hold
the student's ID (primary key), name, birth date, and telephone number. For
each activity, it will store the activity's ID (primary key), name, description,
and weekly meeting time. For each supervisor, it will store the supervisor's ID
(primary key), name, and telephone number. The database also stores data
concerning when a student started to participate in an extracurricular activity.
Draw an E-R diagram based on the description.
13
Relational Data
• Example of Small-Town Book Store
Customer
Cus_Num  Name
100      Poulos
114      Simmons
115      Thompson
118      Simmons
Book
Book_Code  Title
GW4-5      Blazing Sun
BS7-8      Bright Star
PQ1-2      Cattle Run
PQ1-3      Bright Star
Purchase
Pur_Num  Cus_Num  Book_Code  Date
1000     114      BS7-8      9/16/02
1001     100      GW4-5      6/17/03
1002     114      PQ1-3      9/17/04
1003     114      GW4-5      7/17/04
1004     115      PQ1-2      9/17/04
1005     100      BS7-8      9/18/04
Primary keys?
Foreign keys?
14
Relational Data
• Example of Trucking Company
Truck
Truck_Num Base_Code Type_Code Truck_Miles Truck_Buy_Date
1001 506 1 32123 11/8/94
1002 502 1 76984 3/23/92
1003 501 2 12346 12/27/95
1004 505 1 894 2/21/96
1005 503 2 45673 4/15/94
1006 501 2 93245 3/23/92
1007 507 3 32012 12/1/94
1008 502 3 74213 11/8/94
1009 503 2 32015 4/15/94
15
Relational Data
• Example of Trucking Company (Continued)
Base
Base_Code  Base_City     Base_State  Area_Code  Base_Phone  Base_Mgr
501        Murfreesboro  TN          615        523-4567    Peter McAvee
502        Columbus      OH          614        293-5678    John Smith
503        Hampton       MO          456        345-6789    Maria Talindo
504        Columbus      GA          770        233-3843    John Smith
Type
Type_Code  Description_1   Description_2
1          Single box      Double-axle
2          Single box      Single-axle
3          Tandem trailer  Single-axle
Keys
Table  Primary Key  Foreign Key
Truck  Truck_Num    Base_Code, Type_Code
Base   Base_Code    -
Type   Type_Code    -
16
Integrity Constraints
• Entity Integrity
▪ Every primary key value must be unique
▪ No primary key attribute may contain a null value
• Referential Integrity
▪ A foreign key may have either (1) a null value, or (2) a value that matches a
value in the primary key of a linked relation
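The two constraints can be tried out with a minimal sketch using Python's built-in sqlite3 module. The Department and Employee columns follow the slides; the choice of SQLite, the reduced column set, and the helper name try_insert are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only after this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# INT (not INTEGER) keeps D_ID an ordinary column, so NOT NULL is enforced
# instead of SQLite silently generating a row id for a null key.
conn.execute("""
    CREATE TABLE Department (
        D_ID     INT NOT NULL PRIMARY KEY,   -- entity integrity
        Name     TEXT,
        Location TEXT
    )
""")
conn.execute("""
    CREATE TABLE Employee (
        E_ID  INT NOT NULL PRIMARY KEY,
        LName TEXT,
        Dept  INT REFERENCES Department(D_ID)  -- referential integrity
    )
""")
conn.execute("INSERT INTO Department VALUES (1, 'MKTG', 'B200')")


def try_insert(sql, params):
    """Hypothetical helper: run an INSERT and report whether it was accepted."""
    try:
        conn.execute(sql, params)
        return "accepted"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"


results = {
    # Entity integrity: key values must be unique and not null.
    "duplicate key": try_insert(
        "INSERT INTO Department VALUES (?, ?, ?)", (1, "FINANCE", "A308")),
    "null key": try_insert(
        "INSERT INTO Department VALUES (?, ?, ?)", (None, "ACCTG", "B332")),
    # Referential integrity: a foreign key is null or matches a primary key.
    "null foreign key": try_insert(
        "INSERT INTO Employee VALUES (?, ?, ?)", (10, "Abin", None)),
    "matching foreign key": try_insert(
        "INSERT INTO Employee VALUES (?, ?, ?)", (11, "Baxter", 1)),
    "dangling foreign key": try_insert(
        "INSERT INTO Employee VALUES (?, ?, ?)", (12, "Chen", 7)),
}

for case, outcome in results.items():
    print(f"{case}: {outcome}")
```

Only the null and the matching foreign key are accepted; the other three inserts raise sqlite3.IntegrityError.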
17
Integrity Constraints
• Violations of Entity Constraints
Department (first example)
D_ID     Name     Location
1        MKTG     B200
2        FINANCE  A308
(blank)  ACCTG    B332
3        MIS      A322
4        PROD     A432
Violation: Y or N
If yes, why?
Department (second example)
D_ID  Name     Location
1     MKTG     B200
2     FINANCE  A308
3     ACCTG    B332
2     MIS      A322
4     PROD     A432
Violation: Y or N
If yes, why?
18
Integrity Constraints
• Violations of Referential Constraints
Department
D_ID  Name     Location
1     MKTG     B200
2     FINANCE  A308
3     ACCTG    B332
4     MIS      A322
Employee
E_ID  LName   FName  Hours  Rate    Dept
10    Abin    Smith  40.00  $10.35  3
11    Baxter  Alex   38.00  $9.50   4
12    Chen    Koo    40.00  $9.25   (blank)
13    Denver  Lewis  38.00  $9.50   2
Violation: Y or N
If yes, why?
19
Integrity Constraints
• Violations of Referential Constraints
Department
D_ID  Name     Location
1     MKTG     B200
2     FINANCE  A308
3     ACCTG    B332
4     MIS      A322
Employee
E_ID  LName   FName  Hours  Rate    Dept
10    Abin    Smith  40.00  $10.35  3
11    Baxter  Alex   38.00  $9.50   4
12    Chen    Koo    40.00  $9.25   7
13    Denver  Lewis  38.00  $9.50   2
Violation: Y or N
If yes, why?
20
Limitations of Entity Relationship Modeling
• Very symmetric
• Cannot tell which table is most important or largest
• Cannot tell which tables hold static or dynamic business information
• Users can join any of the tables
[Figure: symmetric ER diagram with tables such as Employee, Product, Project, and Department]
21
Star Schema
Store
StoreId  City
S1       NYC
S2       SFO
S3       LA
Product
ProdId  Name  Price
P1      Bolt  10
P2      Nut   5
Sale
OrdId  Date    CustId  ProdId  StoreId  Qty  Amt
100    1/7/97  53      P1      S1       1    12
102    2/7/97  53      P2      S1       2    11
105    3/8/97  111     P1      S3       5    50
Customer
CustId  Name   Address    City
53      Joe    10 Main    SFO
81      Fred   12 Main    SFO
111     Sally  80 Willow  LA
Data Instance of Star Schema
22
Star Schema
• The basic star schema contains 4 components:
▪ Fact table
▪ Dimension tables
▪ Attributes
▪ Attribute hierarchies
• Example:
▪ Fact table Sale: OrdId, Date, CustId, ProdId, StoreId, Qty, Amt
▪ Dimension table Product: ProdId, Name, Price
▪ Dimension table Customer: CustId, Name, Address, City
▪ Dimension table Store: StoreId, City
23
Star Schema
• Very asymmetric
• Fact table is the only table that has multiple connections connecting it to other tables
• All other tables have only a single connection attaching them to the central table
[Figure: a central Fact Table surrounded by Dimensions A through H]
24
Star Schema
• Facts
▪ Facts are numerical values that represent a certain view of a business activity
▪ Facts represent performance measures of business activity
◦ Common facts are productivity, expenses, prices, sales, and profit
▪ Facts are stored in a fact table
• Fact Table
▪ A fact table is also called a detail table
▪ Every fact table constitutes the center of a star schema
▪ A fact table is periodically updated by inserting aggregated data from
operational databases
▪ Facts can also be calculated while a query executes
▪ In the latter case, they are often called metrics
25
Star Schema
• Fact Table (Continued)
▪ Every fact table is associated with corresponding dimension tables
▪ Fact table tends to contain additive facts
▪ Fact tables have composite keys
▪ Tables that have composite keys tend to be fact tables
▪ All other tables are dimension tables
▪ Every combination of key values would give rise to a different record in the
fact table
▪ Fact table is naturally highly normalized
26
Star Schema
• Dimensions
▪ Dimensions are characteristics of facts
▪ They describe the context of the facts
▪ Facts are associated with dimensions
▪ Example: If sales of products in locations during a time period is a fact, then
its dimensions are:
◦ Products,
◦ Locations, and
◦ Time
▪ Dimensions are stored in dimension tables
27
Star Schema
• Dimension Tables
▪ Dimension tables tend to contain textual or non-additive facts
▪ In a star schema, dimension tables must not be normalized
▪ Normalized dimension tables destroy the ability to browse
▪ Disk space savings gained by normalizing are not significant
28
Star Schema
• Attributes
▪ Attributes are properties of dimensions
▪ They are used to search, retrieve, and classify facts
▪ Dimensions contain only those attributes that are used in the decision making
process
▪ Example: If product, location, and time are sales dimensions, then the
possible attributes may be:
◦ For the product: ProductId, ProdName, Prod_Type, Supplier
◦ For the location: District, City, ShopId
◦ For the time: Year, Quarter, Month, Week, DayId
29
Star Schema
• Attribute Hierarchy
▪ Attributes can be arranged in a hierarchical structure
▪ The relationship between hierarchy levels is N:1
▪ An attribute hierarchy determines a sequence of functional dependencies
▪ For example, a product hierarchy:
◦ Product→Product_Type
◦ Product_Type→Industry
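As a minimal sketch of how such a hierarchy supports aggregation, the N:1 dependencies can be held as lookup tables and facts rolled up one level at a time. The product types, the industry, the sales amounts, and the function name roll_up are hypothetical; only the level names Product, Product_Type, and Industry come from the slide.

```python
from collections import defaultdict

# Each mapping is one functional dependency (N:1) between hierarchy levels.
# All values below are made up for illustration.
product_to_type = {"Bolt": "Fastener", "Nut": "Fastener", "Drill": "Power tool"}
type_to_industry = {"Fastener": "Hardware", "Power tool": "Hardware"}

# Facts at the lowest level of the hierarchy: (product, sales amount).
sales = [("Bolt", 12), ("Nut", 11), ("Bolt", 50), ("Drill", 200)]


def roll_up(facts, mapping):
    """Aggregate (key, amount) pairs to the next level given by mapping."""
    totals = defaultdict(int)
    for key, amount in facts:
        totals[mapping[key]] += amount
    return dict(totals)


by_type = roll_up(sales, product_to_type)
by_industry = roll_up(by_type.items(), type_to_industry)

print(by_type)      # {'Fastener': 73, 'Power tool': 200}
print(by_industry)  # {'Hardware': 273}
```

Because every product maps to exactly one type and every type to exactly one industry, each roll-up step is unambiguous, which is what the N:1 relationship between levels guarantees.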
30
Star Schema
▪ Example of Attribute Hierarchy
◦ Location hierarchy, from highest to lowest level: State, District, City, Shop
◦ Time hierarchy, from highest to lowest level: All, Year, Month, Date
31
Star Schema
• Use of Attribute Hierarchies
▪ Attribute hierarchies are used:
◦ To analyze facts at the various aggregation levels, usually starting from a higher one, and
◦ To enhance query rewriting
▪ Here, we focus on analysis
▪ If an analysis shows significant differences in the yearly sales in an industry,
we can use the corresponding attribute hierarchy to find products that mainly
contributed to the difference
32
Snowflake Schema
[Figure: Store links to SType and City; City links to Region]
Store
StoreId  CityId  TId  Mgr
S1       SFO     T1   Joe
S2       SFO     T2   Fred
S3       LA      T1   Nancy
SType
TId  Size   Location
T1   Small  Downtown
T2   Large  Suburbs
City
CityId  Pop  RegId
SFO     1M   North
LA      5M   South
Region
RegId  Name
North  Cold region
South  Warm region
Data Instance of Snowflake Schema (Consider a dimension table Store)
33
Snowflake Schema
• A snowflake schema is a variation of the star schema in which
dimension tables are in third normal form (3NF) or Boyce-Codd normal form (BCNF)
• Through normalization, each dimension attribute hierarchy is split into a
number of relation schemas
• These relation schemas are associated by primary key / foreign key
pairs
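A minimal sketch of the normalized Store dimension, using the Store, City, and Region rows from the snowflake data instance and Python's built-in sqlite3 module. The SType table is left out for brevity, and the use of SQLite is an assumption made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Each level of the location hierarchy gets its own relation.
    CREATE TABLE Region (RegId TEXT PRIMARY KEY, Name TEXT);
    CREATE TABLE City (
        CityId TEXT PRIMARY KEY,
        Pop    TEXT,
        RegId  TEXT REFERENCES Region(RegId)
    );
    CREATE TABLE Store (
        StoreId TEXT PRIMARY KEY,
        CityId  TEXT REFERENCES City(CityId),
        Mgr     TEXT
    );
    INSERT INTO Region VALUES ('North', 'Cold region'), ('South', 'Warm region');
    INSERT INTO City   VALUES ('SFO', '1M', 'North'), ('LA', '5M', 'South');
    INSERT INTO Store  VALUES ('S1', 'SFO', 'Joe'), ('S2', 'SFO', 'Fred'),
                              ('S3', 'LA', 'Nancy');
""")

# Reaching the region of a store now takes one join per hierarchy level.
store_regions = conn.execute("""
    SELECT s.StoreId, c.CityId, r.Name
    FROM Store s
    JOIN City   c ON s.CityId = c.CityId
    JOIN Region r ON c.RegId  = r.RegId
    ORDER BY s.StoreId
""").fetchall()

for row in store_regions:
    print(row)
```

In a star schema the same information would sit in a single Store table, with the city and region attributes repeated on every store row.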
34
A Little about Multi-Dimensional Cube
• From the data perspective
Fact table view:
Sale
ProdId  StoreId  Amt
P1      C1       12
P2      C1       11
P1      C3       50
P2      C2       8
Multi-dimensional cube:
    C1  C2  C3
P1  12  -   50
P2  11  8   -
Dimensions = 2
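The step from the fact table view to the cube view can be sketched in a few lines of Python. The fact rows are the ones on the slide; the nested-dictionary representation, with None for empty cells, is just one possible choice.

```python
# Fact rows from the slide: (ProdId, StoreId, Amt).
facts = [("P1", "C1", 12), ("P2", "C1", 11), ("P1", "C3", 50), ("P2", "C2", 8)]

# The distinct values of each key column become the axes of the cube.
products = sorted({prod for prod, _, _ in facts})
stores = sorted({store for _, store, _ in facts})

# Start with an empty cell for every (product, store) combination,
# then fill in the cells for which a fact row exists.
cube = {prod: {store: None for store in stores} for prod in products}
for prod, store, amt in facts:
    cube[prod][store] = amt

# Print the cube as a grid, using "-" for empty cells.
print("    " + "  ".join(stores))
for prod in products:
    cells = ["-" if cube[prod][s] is None else str(cube[prod][s]) for s in stores]
    print(prod + "  " + "  ".join(c.ljust(2) for c in cells))
```

Each key column of the fact table becomes one axis, so a fact table with two key columns yields a two-dimensional cube.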
35
A Little about Multi-Dimensional Cube
36
Star Constellation Schema
• As a rule, a data warehouse contains a large number of fact tables
• The same dimension table can represent a component of more than
one star schema
• This schema is viewed as a collection of stars, hence it is called a galaxy
schema or fact constellation
• Sophisticated applications require such a schema
• Example:
▪ Suppose sales and orders constitute two data warehouse subjects,
represented by two star schemas
▪ Time, products, and customers represent dimensions of both star
schemas, and they would share the same instances of these dimensions
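A minimal sketch of the sales and orders example, reduced to a single shared Product dimension. The table names SalesFact and OrdersFact, all rows, and the use of SQLite are hypothetical and only illustrate how two fact tables can reference the same dimension instances.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One dimension table, shared by both stars.
    CREATE TABLE Product (ProdId TEXT PRIMARY KEY, Name TEXT);
    -- Two fact tables, one per subject, both keyed by the same dimension.
    CREATE TABLE SalesFact  (ProdId TEXT REFERENCES Product(ProdId), Units INTEGER);
    CREATE TABLE OrdersFact (ProdId TEXT REFERENCES Product(ProdId), Units INTEGER);
    INSERT INTO Product    VALUES ('P1', 'Bolt'), ('P2', 'Nut');
    INSERT INTO SalesFact  VALUES ('P1', 6), ('P2', 2), ('P1', 1);
    INSERT INTO OrdersFact VALUES ('P1', 10), ('P2', 5);
""")

# Aggregate each fact table on its own, then line the totals up
# on the shared dimension.
ordered_vs_sold = conn.execute("""
    SELECT p.Name,
           (SELECT COALESCE(SUM(o.Units), 0) FROM OrdersFact o
             WHERE o.ProdId = p.ProdId) AS ordered,
           (SELECT COALESCE(SUM(s.Units), 0) FROM SalesFact s
             WHERE s.ProdId = p.ProdId) AS sold
    FROM Product p
    ORDER BY p.ProdId
""").fetchall()

print(ordered_vs_sold)  # [('Bolt', 10, 7), ('Nut', 5, 2)]
```

Aggregating each fact table separately before combining avoids multiplying rows, which would happen if the two fact tables were joined to each other directly.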
37
Star Constellation Schema
• Another example of Star Constellation
38
Case Study
• Afco Foods & Beverages is a new company which produces dairy,
bread and meat products, with its production unit located at Baroda
• Their products are sold in the North, North West and Western regions of
India
• They have sales units at Mumbai, Pune, Ahmedabad, Delhi and
Baroda
• The President of the company wants sales information
39
Case Study
• Sales Information
▪ Report: The number of units sold: 113
▪ Report: The number of units sold over time
January February March April
14 41 33 25
▪ Report: The number of items sold for each product with time
             Jan  Feb  Mar  Apr
Wheat Bread  -    -    6    17
Cheese       6    16   6    8
Swiss Rolls  8    25   21   -
40
Case Study
• Sales Information (Continued)
▪ Report: The number of items sold in each City for each product with time
                     Jan  Feb  Mar  Apr
Mumbai  Wheat Bread  -    -    3    10
        Cheese       3    16   6    -
        Swiss Rolls  4    16   6    -
Pune    Wheat Bread  -    -    3    7
        Cheese       3    -    -    8
        Swiss Rolls  4    9    15   -
41
Case Study
• Sales Information (Continued)
▪ Report: The number of items sold and income in each region for each product
with time
                     Jan        Feb        Mar        Apr
                     Rs    U    Rs     U   Rs     U   Rs     U
Mumbai  Wheat Bread  -     -    -      -   7.44   3   24.80  10
        Cheese       7.95  3    42.40  16  15.90  6   -      -
        Swiss Rolls  7.32  4    29.98  16  10.98  6   -      -
Pune    Wheat Bread  -     -    -      -   7.44   3   17.36  7
        Cheese       7.95  3    -      -   -      -   21.20  8
        Swiss Rolls  7.32  4    16.47  9   27.45  15  -      -
42
Case Study
• Sales Measures & Dimensions
▪ Measures
◦ Units sold
◦ Amount
▪ Dimensions
◦ Product
◦ Time
◦ Region
43
Case Study
• Sales Data Warehouse Model
▪ Fact Table
Sales Fact (descriptive values)
City    Product      Month     Units  Rupees
Mumbai  Cheese       January   3      7.95
Mumbai  Swiss Rolls  January   4      7.32
Pune    Cheese       January   3      7.95
Pune    Swiss Rolls  January   4      7.32
Mumbai  Cheese       February  16     42.40
Sales Fact (dimension keys)
City_Id  Prod_Id  Month     Units  Rupees
1        589      1/1/1998  3      7.95
1        1218     1/1/1998  4      7.32
2        589      1/1/1998  3      7.95
2        1218     1/1/1998  4      7.32
1        589      2/1/1998  16     42.40
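The reports earlier in the case study are aggregations over this fact table. A minimal sketch, loading only the five fact rows shown on the slide into SQLite through Python's sqlite3 module; the table name SalesFact, the ISO date strings, and the composite primary key declaration are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE SalesFact (
        City_Id INTEGER,
        Prod_Id INTEGER,
        Month   TEXT,     -- stored as an ISO date string for sorting
        Units   INTEGER,
        Rupees  REAL,
        PRIMARY KEY (City_Id, Prod_Id, Month)  -- composite key of dimension keys
    )
""")
conn.executemany(
    "INSERT INTO SalesFact VALUES (?, ?, ?, ?, ?)",
    [
        (1, 589, "1998-01-01", 3, 7.95),
        (1, 1218, "1998-01-01", 4, 7.32),
        (2, 589, "1998-01-01", 3, 7.95),
        (2, 1218, "1998-01-01", 4, 7.32),
        (1, 589, "1998-02-01", 16, 42.40),
    ],
)

# Units sold and income over time: aggregate the measures by the time key.
by_month = conn.execute("""
    SELECT Month, SUM(Units), ROUND(SUM(Rupees), 2)
    FROM SalesFact GROUP BY Month ORDER BY Month
""").fetchall()

# Units sold per city: the same measure aggregated by a different key.
by_city = conn.execute("""
    SELECT City_Id, SUM(Units)
    FROM SalesFact GROUP BY City_Id ORDER BY City_Id
""").fetchall()

print(by_month)
print(by_city)
```

The January total of 14 units agrees with the "units sold over time" report. The February total covers only the single February row listed on the slide, so it is lower than the 41 units in that report.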
44
Case Study
• Sales Data Warehouse Model
▪ Product Dimension Tables
Product
Prod_Id Product_Name Product_Category_Id
589 Cheese 1
1218 Swiss Rolls 1
288 Coconut Cookies 2
Product Category
Product_Category_Id Product_Category
1 Bread
2 Cookies
45
Case Study
• Sales Data Warehouse Model
▪ Region Dimension Table
Region
City_Id  City    Region      Country
1        Mumbai  West        India
2        Pune    North West  India
46
Case Study
• Sales Data Warehouse Model
[Figure: Sales Fact table at the center, linked to the Time, Product, and Region dimension tables; Product is linked to Product Category]
47
Joins
• Generally there are only a few joins in a dimensional database
(typically joining the fact table with one or more of the dimension
tables)
• Each of the joins expresses a fundamental relationship between items
in the underlying business
• Any joins are in principle possible in an ER database
▪ most have little administrative significance
48
Star Join
• A star join is a primary-key to foreign-key join of the dimension tables
to a fact table
• The fact table normally has a concatenated index on the key columns
to facilitate this type of join
• The main advantages of star schemas are that they:
▪ Provide a direct and intuitive mapping between the business entities being
analyzed by end users and the schema design
▪ Provide highly optimized performance for typical data warehouse queries
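A star join can be sketched with the Store, Product, Customer, and Sale rows from the star schema data instance, again using Python's built-in sqlite3 module. The index name sale_keys and the use of SQLite are assumptions made for illustration, and the third sale is entered with ProdId P1 (the slide prints it in lower case).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Product  (ProdId TEXT PRIMARY KEY, Name TEXT, Price INTEGER);
    CREATE TABLE Store    (StoreId TEXT PRIMARY KEY, City TEXT);
    CREATE TABLE Customer (CustId INTEGER PRIMARY KEY, Name TEXT,
                           Address TEXT, City TEXT);
    CREATE TABLE Sale (
        OrdId   INTEGER PRIMARY KEY,
        "Date"  TEXT,
        CustId  INTEGER REFERENCES Customer(CustId),
        ProdId  TEXT    REFERENCES Product(ProdId),
        StoreId TEXT    REFERENCES Store(StoreId),
        Qty     INTEGER,
        Amt     INTEGER
    );
    -- Concatenated index on the key columns of the fact table.
    CREATE INDEX sale_keys ON Sale (CustId, ProdId, StoreId);

    INSERT INTO Product  VALUES ('P1', 'Bolt', 10), ('P2', 'Nut', 5);
    INSERT INTO Store    VALUES ('S1', 'NYC'), ('S2', 'SFO'), ('S3', 'LA');
    INSERT INTO Customer VALUES (53, 'Joe', '10 Main', 'SFO'),
                                (81, 'Fred', '12 Main', 'SFO'),
                                (111, 'Sally', '80 Willow', 'LA');
    INSERT INTO Sale VALUES (100, '1/7/97', 53, 'P1', 'S1', 1, 12),
                            (102, '2/7/97', 53, 'P2', 'S1', 2, 11),
                            (105, '3/8/97', 111, 'P1', 'S3', 5, 50);
""")

# Star join: each dimension joins to the fact table on its primary key.
detail = conn.execute("""
    SELECT st.City, p.Name, c.Name, s.Qty, s.Amt
    FROM Sale s
    JOIN Product  p  ON s.ProdId  = p.ProdId
    JOIN Store    st ON s.StoreId = st.StoreId
    JOIN Customer c  ON s.CustId  = c.CustId
    ORDER BY s.OrdId
""").fetchall()

# A typical warehouse query: a measure aggregated by a dimension attribute.
amount_by_store_city = conn.execute("""
    SELECT st.City, SUM(s.Amt)
    FROM Sale s JOIN Store st ON s.StoreId = st.StoreId
    GROUP BY st.City ORDER BY st.City
""").fetchall()

print(detail)
print(amount_by_store_city)  # [('LA', 50), ('NYC', 23)]
```

No dimension table is joined to another dimension table; every join passes through the fact table.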
49