Lecture 3 & 4 - 5610
Modeling
Lecture 3 and 4
Agenda
Classroom Activity – Traffic Ticket
Goal: Design a database to capture traffic
violations/tickets for the City
Key Considerations:
• Vehicle Operator and Vehicle can have two
relationships: Driver, and Owner
• Which one of these relationships will you focus
on?
Dimensional Modeling
Case for Dimensional Modeling
ER Models:
• Capture microscopic relationships between data elements
• Are extremely helpful for transactional processing
• ERP systems like SAP have thousands of entities, hence thousands of
tables
• Are designed to minimize data redundancy
• Application (Operational) databases are ER modeled databases
• Typical ER models for small/mid size applications have hundreds of
tables with thousands of Joins
• Are NOT ideal for reporting because reporting requires complex
logic, which needs to be calculated at query time in ER models
Simple Retail Sales ER Model
Querying Issues in ER Schemas
• Hundreds of tables and joins make querying slow and difficult
✓Every Join in SQL taxes the query performance
✓More joins mean slower queries
✓More joins also mean longer and more difficult queries
• You need to know more than basic SQL to be able to query
✓Most queries are complex because you’re transforming the data in your SQL
✓Data logic ends up living at the report/query level
• An army of report writers needed because reporting is complex and most
business users are not comfortable with writing complex SQL
✓Report writing becomes an I.T. thing
✓No Self-service reporting
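To make the join burden concrete, here is a minimal sketch using Python's built-in sqlite3 module. The four-table retail schema, its column names, and the sample rows are hypothetical and chosen only for illustration; a real ER model would have far more tables. Even here, a simple "net sales by category" question needs three joins and a calculation written into the query itself.

```python
import sqlite3

# Hypothetical, heavily simplified normalized (ER-style) retail schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE category   (category_id INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE product    (product_id INTEGER PRIMARY KEY, product_name TEXT,
                         category_id INTEGER REFERENCES category);
CREATE TABLE orders     (order_id INTEGER PRIMARY KEY, order_date TEXT);
CREATE TABLE order_line (order_id INTEGER REFERENCES orders,
                         product_id INTEGER REFERENCES product,
                         quantity INTEGER, unit_price REAL, discount_pct REAL);

INSERT INTO category VALUES (1, 'Dairy'), (2, 'Bakery');
INSERT INTO product VALUES (10, 'Milk', 1), (11, 'Cheese', 1), (12, 'Bread', 2);
INSERT INTO orders VALUES (100, '2024-01-05'), (101, '2024-02-10');
INSERT INTO order_line VALUES
    (100, 10, 2, 3.00, 0.0),
    (100, 12, 1, 4.00, 0.5),
    (101, 11, 1, 10.00, 0.0);
""")

# The report writer must know every join path AND the net-sales formula.
er_query = """
SELECT c.category_name,
       SUM(ol.quantity * ol.unit_price * (1 - ol.discount_pct)) AS net_sales
FROM order_line ol
JOIN orders   o ON o.order_id    = ol.order_id
JOIN product  p ON p.product_id  = ol.product_id
JOIN category c ON c.category_id = p.category_id
WHERE o.order_date >= '2024-01-01'
GROUP BY c.category_name
ORDER BY c.category_name
"""
er_rows = conn.execute(er_query).fetchall()
print(er_rows)
```

The net-sales formula is repeated in every report that needs it, which is the "data logic at report/query level" problem described above.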
What is Dimensional Modeling
• Dimensional Modeling is a data modeling technique used for designing Data
Warehouses
✓Dimensional Modeling was introduced by Ralph Kimball
• Dimensional modeled databases are NOT used as operational databases,
rather they are solely used as reporting databases
• The goal of dimensional modeling is to make reporting faster and simpler
• Dimension models de-normalize the data to achieve the above goal
✓Normalization splits tables apart to remove redundancy; de-normalization combines
tables to reduce the number of tables and the number of joins
• Every dimensional model contains two kinds of tables:
✓ Fact tables
✓ Dimension tables
Retail Sales Dimensional Model
• A fact table contains only foreign keys, measures, and degenerate dimensions (to
be discussed later)
• Design tip: Always design the first fact table(s) with most granular (detailed) data
✓ You can always summarize if you have the details, BUT you cannot generate details from
the summary
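The design tip can be illustrated with a small sketch, again using sqlite3. The table and column names (fact_sales, dim_product, dim_date, receipt_number) and the sample rows are assumptions made for this example, not the schema from the slide. The fact table stores one row per receipt line (the most granular level), and a monthly summary is derived from it with a GROUP BY; the reverse derivation would not be possible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical star schema: two dimensions and one fact table.
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY,
                          product_name TEXT, product_category TEXT);
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY,
                          calendar_year INTEGER, calendar_month INTEGER);
CREATE TABLE fact_sales (
    date_key       INTEGER REFERENCES dim_date,     -- foreign key
    product_key    INTEGER REFERENCES dim_product,  -- foreign key
    receipt_number TEXT,                            -- degenerate dimension
    quantity       INTEGER,                         -- measure
    sales_amount   REAL                             -- measure
);

INSERT INTO dim_product VALUES (1, 'Milk', 'Dairy'), (2, 'Bread', 'Bakery');
INSERT INTO dim_date VALUES (20240105, 2024, 1), (20240210, 2024, 2);
INSERT INTO fact_sales VALUES
    (20240105, 1, 'R-100', 2, 6.0),
    (20240105, 2, 'R-100', 1, 4.0),
    (20240210, 1, 'R-101', 3, 9.0);
""")

# Detail rows can always be rolled up to a coarser level on demand.
monthly_sales = conn.execute("""
    SELECT d.calendar_year, d.calendar_month, SUM(f.sales_amount)
    FROM fact_sales f
    JOIN dim_date d ON d.date_key = f.date_key
    GROUP BY d.calendar_year, d.calendar_month
    ORDER BY d.calendar_year, d.calendar_month
""").fetchall()
print(monthly_sales)
```

Had only the monthly totals been stored, there would be no way to recover the individual receipt lines from them.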
What are Dimension tables?
• Dimension tables
• Contain textual attributes that are highly correlated
✓This is where you de-normalize
✓Highly correlated means that you can put all Product attributes including
Product Type, Product Category, etc. in one Product dimension
(remember you broke them apart in 3NF)
✓Highly correlated does not mean that you can put Customer attributes in
the Product table
• Dimensions provide the context of analysis for facts
• Some examples are Product Dimension, Store dimension, Student
dimension, Faculty dimension etc.
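As a rough sketch of what "this is where you de-normalize" can look like, the example below flattens three hypothetical 3NF tables (product, product_type, product_category) into a single product dimension. All table names, column names, and rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical 3NF source tables: one table per level of the hierarchy.
CREATE TABLE product_category (category_id INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE product_type     (type_id INTEGER PRIMARY KEY, type_name TEXT,
                               category_id INTEGER REFERENCES product_category);
CREATE TABLE product          (product_id INTEGER PRIMARY KEY, product_name TEXT,
                               type_id INTEGER REFERENCES product_type);

INSERT INTO product_category VALUES (1, 'Food');
INSERT INTO product_type VALUES (1, 'Dairy', 1), (2, 'Bakery', 1);
INSERT INTO product VALUES (10, 'Milk', 1), (11, 'Cheese', 1), (12, 'Bread', 2);

-- De-normalize: do the joins once, at load time, and store the result.
CREATE TABLE dim_product AS
SELECT p.product_id,
       p.product_name,
       t.type_name     AS product_type,
       c.category_name AS product_category
FROM product p
JOIN product_type     t ON t.type_id     = p.type_id
JOIN product_category c ON c.category_id = t.category_id;
""")

dim_rows = conn.execute(
    "SELECT product_id, product_name, product_type, product_category "
    "FROM dim_product ORDER BY product_id"
).fetchall()
print(dim_rows)
```

The category name 'Food' is now repeated on every row. That redundancy is accepted deliberately so that reports need a single join to reach any product attribute.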
Grain (or Granularity) of a Table?
• We already know the concept of uniqueness – which makes every record of a
table unique
✓Primary keys make every record in a table unique
• The grain of a table describes what one record in the table represents, stated in
terms of what makes each record unique
• For example, if you have a Customer table & Customer_Id defines the
uniqueness of every record in the table, then the grain of this table will be
defined as ‘One record per Customer’, or simply ‘Per Customer’
• The concept of grain can be applied to any table
✓In dimensional modeling, we will talk about the grain of fact and dimension tables
• For the Sales Fact table on Slide 10, the grain can be defined as ‘Per
Organization Per Product Per Location Per Date’ or simply as ‘Per transaction’
• Remember that it is NOT necessary that all FKs become part of the grain
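One way to test a proposed grain is to check whether the chosen columns really identify each record uniquely. The helper below, grain_violations, is a hypothetical function written for this sketch, and the fact_sales table with its promo_key column is likewise an assumed example. It also shows a foreign key (promo_key) that is present in the table but is not needed as part of the grain.

```python
import sqlite3

def grain_violations(conn, table, grain_columns):
    """Return grain-column combinations that occur more than once.

    An empty list means the columns uniquely identify every record.
    Table and column names are interpolated directly into the SQL,
    so they must come from trusted code, never from user input.
    """
    cols = ", ".join(grain_columns)
    sql = (
        f"SELECT {cols}, COUNT(*) FROM {table} "
        f"GROUP BY {cols} HAVING COUNT(*) > 1 ORDER BY {cols}"
    )
    return conn.execute(sql).fetchall()

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_sales (date_key INTEGER, product_key INTEGER,
                         store_key INTEGER, promo_key INTEGER,
                         sales_amount REAL);
INSERT INTO fact_sales VALUES
    (20240105, 1, 1, 7, 6.0),
    (20240105, 2, 1, 7, 4.0),
    (20240106, 1, 1, 8, 3.0);
""")

# 'Per Date Per Product Per Store' holds without promo_key in the grain.
ok = grain_violations(conn, "fact_sales", ["date_key", "product_key", "store_key"])
# 'Per Date' alone is too coarse: two records share 20240105.
too_coarse = grain_violations(conn, "fact_sales", ["date_key"])
print(ok, too_coarse)
```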
Surrogate Keys
• All dimension tables have single-part, meaningless, locally generated keys
called surrogate keys
• These keys are generated as part of the ETL process and are not dependent
on the source system(s)
• Surrogate keys remove the data warehouse dependency on source system
primary keys
✓As a data warehouse developer, you do not have any control over the source system
✓What if the source system keys change? The accuracy of your data warehouse would suffer
• Since these keys are meaningless, they are merely used to join the tables
and are not used in reporting
✓The BI tools hide these keys from the users
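A minimal sketch of surrogate key assignment during ETL is shown below. The SurrogateKeyMap class, the source system names ("crm", "billing"), and the customer IDs are all hypothetical; a real warehouse would usually persist this mapping in a table or rely on a database sequence rather than keep it in memory.

```python
from itertools import count

class SurrogateKeyMap:
    """Assigns a meaningless integer key to each (source system, natural key) pair."""

    def __init__(self):
        self._next_key = count(1)   # locally generated, independent of any source
        self._keys = {}

    def key_for(self, source_system, natural_key):
        lookup = (source_system, natural_key)
        if lookup not in self._keys:
            # First time this source record is seen: issue a new surrogate key.
            self._keys[lookup] = next(self._next_key)
        return self._keys[lookup]

customer_keys = SurrogateKeyMap()
k1 = customer_keys.key_for("crm", "C-001")
k2 = customer_keys.key_for("billing", "C-001")  # same ID, different source system
k3 = customer_keys.key_for("crm", "C-001")      # same record seen again
print(k1, k2, k3)
```

Because the fact tables reference only the surrogate key, two source systems that happen to reuse the same ID do not collide inside the warehouse.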
Time Dimension
• Typically the first dimension created in the Data Warehouse
• Typically stores one row per day – the grain is per day
✓It is possible to have an hour or minute grain, but it is very rare
• Calendar logic is handled through the time dimension instead of queries
• [...] years of data
Time Dimension table columns:
Date_Key (PK)
Date
Calendar Year
Calendar Quarter
Calendar Month
yyyymm
Fiscal Year
Fiscal Quarter
Fiscal Month
Reporting Period
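The sketch below shows one possible way to pre-compute the calendar logic into a per-day time dimension. The function name build_date_dimension, the yyyymmdd format for Date_Key, the July fiscal year start, and the convention of naming a fiscal year after the calendar year in which it ends are all assumptions for this example. Reporting Period is left out because the slides do not define it.

```python
from datetime import date, timedelta

def build_date_dimension(start, end, fiscal_year_start_month=7):
    """Return one dict per calendar day from start to end, inclusive."""
    rows = []
    day = start
    while day <= end:
        # Position of this month within the fiscal year (1..12).
        fiscal_month = (day.month - fiscal_year_start_month) % 12 + 1
        # Assumption: a fiscal year is named after the calendar year it ends in.
        in_next_fiscal_year = (
            fiscal_year_start_month > 1 and day.month >= fiscal_year_start_month
        )
        rows.append({
            "date_key": day.year * 10000 + day.month * 100 + day.day,
            "date": day.isoformat(),
            "calendar_year": day.year,
            "calendar_quarter": (day.month - 1) // 3 + 1,
            "calendar_month": day.month,
            "yyyymm": day.year * 100 + day.month,
            "fiscal_year": day.year + 1 if in_next_fiscal_year else day.year,
            "fiscal_quarter": (fiscal_month - 1) // 3 + 1,
            "fiscal_month": fiscal_month,
        })
        day += timedelta(days=1)
    return rows

date_dim = build_date_dimension(date(2024, 1, 1), date(2024, 12, 31))
print(date_dim[0])
```

With the fiscal attributes stored on every row, a report can filter on Fiscal Quarter directly instead of re-deriving it in each query.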
Classroom Activity – Time Dimension Design
• Question: How many rows will this table have if you have 50 years of
data? Write down your answer on the paper.
• Question: How many rows will this table have with 50 years of data at
a minute level grain? Write down your answer on the paper.
Important Characteristic of Dimensional Models
• The fact table is the central component of every query
✓Dimensions only provide additional attributes
✓For a query, if your fact table returns 20 records, you will have 20 records in the
result set
• Each fact table record links to one and only one record in each
dimension table
✓Think of the attributes from the dimension table as extension of the fact table
records
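The row-count property can be checked directly. In this sketch (hypothetical tables and data), joining the fact table to a well-formed dimension returns exactly as many rows as the fact table has, while a dimension that wrongly holds two records for the same key inflates the result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_store  (store_key INTEGER PRIMARY KEY, store_name TEXT);
CREATE TABLE fact_sales (store_key INTEGER REFERENCES dim_store, sales_amount REAL);
INSERT INTO dim_store VALUES (1, 'Downtown'), (2, 'Airport');
INSERT INTO fact_sales VALUES (1, 5.0), (1, 7.5), (2, 3.0);

-- A badly built dimension: store_key 1 appears twice.
CREATE TABLE bad_dim_store (store_key INTEGER, store_name TEXT);
INSERT INTO bad_dim_store VALUES
    (1, 'Downtown'), (1, 'Downtown (old name)'), (2, 'Airport');
""")

def count_rows(sql):
    return conn.execute(sql).fetchone()[0]

fact_count = count_rows("SELECT COUNT(*) FROM fact_sales")
joined_count = count_rows(
    "SELECT COUNT(*) FROM fact_sales f JOIN dim_store s ON s.store_key = f.store_key"
)
fanned_out_count = count_rows(
    "SELECT COUNT(*) FROM fact_sales f JOIN bad_dim_store s ON s.store_key = f.store_key"
)
print(fact_count, joined_count, fanned_out_count)
```

When the joined count exceeds the fact count, measures such as sales_amount get double-counted, which is why each dimension must supply exactly one record per fact record.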
How things tie together in Dimensional Models
4-Step Dimensional Design Process
1. Identify the Business process
• In other words, identify the data you’re dealing with
• Come up with the logic to load the Fact table
2. Identify the Grain of the Fact table
• What does 1 row in fact table represent?
• Remember, you should have data available at the most detailed level
3. Identify the Dimensions
• Descriptive context, true to the grain of the Fact table
• The Dimension table grain should not change the Fact table grain
✓ Every dimension should provide ONLY one record for each Fact record
4. Identify the Facts
• Numeric additive measurements, true to the grain
Classroom Activity
Create a dimensional model for the diagram shown.
Assumption(s):
• This is the first fact table you’re designing for this problem
• There is a $ Amount on the ticket for the violation