Dimensional Modeling

Nature of business data


Business Data
• The users tend to think in terms of business dimensions and analyze measurements along such business dimensions.
• The business dimensions differ by industry and by the subject for analysis.
• The time dimension is a common dimension.
• Almost all business analyses are performed over time.
Nature of business data
• Sometimes, the users are unable to describe fully what they expect.
• So, when requirements cannot be fully determined, a new and innovative concept is needed to gather and record the requirements.
• The traditional methods are not adequate in this context.
• The new methodology for determining requirements for a data warehouse system is based on business dimensions.
• The new concept incorporates the basic measurements and the business dimensions along which the users analyze these basic measurements.
• The result is known as an information package for a specific subject.
• The primary goal in the requirements definition phase is to compile information packages for all the subjects for the data warehouse.
Example: an automobile manufacturer analyzing sales.
Dimensions: product, dealer, customer demographic, method of payment, and time.
Example: a hotel chain analyzing hotel occupancy.
Dimensions: hotel, room type, and time.
Metrics for analyzing hotel occupancy
• Occupied rooms
• Vacant rooms
• Unavailable rooms
• Number of occupants
• Revenue
Information Subject: Hotel Occupancy

Dimensions:     Time            Hotel          Room Type
Hierarchies/    Year            Hotel Line     Room Type
Categories:     Quarter         Branch Name    Room Size
                Month           Branch Code    Number of Beds
                Date            Region         Type of Bed
                Day of Week     Address        Max. Occupants
                Day of Month                   Suite
                Holiday Flag                   Refrigerator
                                               Kitchenette

Facts: Occupied Rooms, Vacant Rooms, Unavailable Rooms, Number of Occupants, Revenue
Design Decisions
• Choosing the process: Selecting the subjects from the information packages for the first set of logical structures to be designed.
• Choosing the grain: Determining the level of detail for data in the data structures.
• Identifying and conforming the dimensions: Choosing the business dimensions (such as product, market, time, etc.) to be included in the first set of structures and making sure that each data element in every business dimension is conformed across the structures.
• Choosing the facts: Selecting the metrics or units of measurement (such as product sale units, dollar sales, dollar revenue, etc.) to be included in the first set of structures.
• Choosing the duration of the database: Determining how far back in time you should go for historical data.
• Dimensional modeling gets its name from the business
dimensions we need to incorporate into the logical data
model.
• It is a logical design technique to structure the business
dimensions and the metrics that are analyzed along
these dimensions.
• This modeling technique is intuitive.
• The model provides high performance for queries and
analysis.
• The model consists of the specific data structures needed to represent the business dimensions. These data structures also contain the metrics or facts.
From the STAR schema, the users can easily visualize answers to these questions:
For a given amount of dollars, what was the product sold?
Who was the customer?
Which salesperson brought the order?
When was the order placed?
• Let us examine a typical query against the automaker
sales data. How much sales proceeds did the Jeep
Cherokee, Year 2000 Model with standard options,
generate in January 2000 at Big Sam Auto dealership
for buyers who own their homes and who took 3-year
leases, financed by Daimler-Chrysler Financing?
• The attributes in the dimension tables act as constraints
and filters in our queries. Any or all of the attributes of
each dimension table can participate in a query.
• Each dimension table has an equal chance to be part of
a query.
The marketing department wants the quantity sold and order dollars for product
bigpart-1, relating to customers in the state of Maine, obtained by salesperson
Jane Doe, during the month of June.
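The marketing request above maps onto a star join: constrain each dimension table, then pick up the matching fact table rows. The following is only a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and sample rows (order_fact, product_dim, and so on) are assumptions made for illustration and do not come from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim     (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE customer_dim    (customer_key INTEGER PRIMARY KEY, state TEXT);
CREATE TABLE salesperson_dim (salesperson_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE time_dim        (time_key INTEGER PRIMARY KEY, month TEXT);
-- Hypothetical fact table: one foreign key per dimension plus two facts.
CREATE TABLE order_fact (
    product_key     INTEGER REFERENCES product_dim(product_key),
    customer_key    INTEGER REFERENCES customer_dim(customer_key),
    salesperson_key INTEGER REFERENCES salesperson_dim(salesperson_key),
    time_key        INTEGER REFERENCES time_dim(time_key),
    quantity_sold   INTEGER,
    order_dollars   REAL
);
INSERT INTO product_dim VALUES (1, 'bigpart-1'), (2, 'bigpart-2');
INSERT INTO customer_dim VALUES (1, 'Maine'), (2, 'Ohio');
INSERT INTO salesperson_dim VALUES (1, 'Jane Doe'), (2, 'John Roe');
INSERT INTO time_dim VALUES (1, 'June'), (2, 'July');
INSERT INTO order_fact VALUES
    (1, 1, 1, 1, 10, 500.0),  -- matches all four constraints
    (1, 1, 1, 1,  5, 250.0),  -- matches all four constraints
    (1, 2, 1, 1,  7, 350.0),  -- wrong state
    (2, 1, 1, 1,  3,  90.0),  -- wrong product
    (1, 1, 2, 1,  4, 200.0),  -- wrong salesperson
    (1, 1, 1, 2,  8, 400.0);  -- wrong month
""")

# Dimension attributes act as the filters; the fact table supplies the metrics.
star_join = """
SELECT SUM(f.quantity_sold), SUM(f.order_dollars)
FROM order_fact f
JOIN product_dim p     ON f.product_key = p.product_key
JOIN customer_dim c    ON f.customer_key = c.customer_key
JOIN salesperson_dim s ON f.salesperson_key = s.salesperson_key
JOIN time_dim t        ON f.time_key = t.time_key
WHERE p.product_name = ? AND c.state = ? AND s.name = ? AND t.month = ?
"""
quantity, dollars = conn.execute(
    star_join, ("bigpart-1", "Maine", "Jane Doe", "June")
).fetchone()
print(quantity, dollars)
```

Any subset of the four constraints could be dropped from the WHERE clause without changing the shape of the query, which is what gives each dimension table an equal chance to take part.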
Drill-down analysis from the STAR schema
Inside a Dimension Table
Inside the Fact Table
The Factless Fact Table
Data Granularity
• Fact tables at the lowest grain facilitate "graceful"
extensions.
• But we have to pay the price in terms of storage and
maintenance for the fact table at the lowest grain.
• In practice, however, we build aggregate fact tables to
support queries looking for summary numbers.
STAR SCHEMA KEYS
Primary Keys
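Suppose the product code in the operational system is an 8-position code, of which:
2 positions - code of the warehouse where the product is normally stored
2 positions - product category

What if a product is now stored in a different warehouse of the company?
The warehouse part of the product code changes, so using the operational code as the primary key causes problems in aggregation.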
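The aggregation problem can be illustrated with a small sketch. The code layout used here (positions 1-2 for the warehouse, 3-4 for the category, 5-8 for an item number), the sample codes, and the unit counts are assumptions for illustration only. The sketch compares totals grouped on the operational code with totals grouped on a generated key that ignores the warehouse positions.

```python
from collections import defaultdict

# Hypothetical 8-position codes: positions 1-2 = warehouse,
# 3-4 = product category, 5-8 = item number.
sales = [
    ("07120345", 100),  # product while stored in warehouse 07
    ("11120345", 40),   # same product after moving to warehouse 11
]

# Grouping on the operational code splits one product into two groups.
by_code = defaultdict(int)
for code, units in sales:
    by_code[code] += units

# A generated key per product keeps the rows together.
generated_keys = {}

def generated_key(code: str) -> int:
    """Return a stable generated key, ignoring the warehouse positions."""
    identity = code[2:]
    return generated_keys.setdefault(identity, len(generated_keys) + 1)

by_key = defaultdict(int)
for code, units in sales:
    by_key[generated_key(code)] += units

print(dict(by_code))
print(dict(by_key))
```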
Foreign Keys
The primary key of each dimension table must be a foreign key in the fact table.
There are three options for the primary key of the fact table:
1) A single compound primary key whose length is the total length
of the keys of the individual dimension tables. Under this option,
in addition to the compound primary key, the foreign keys must
also be kept in the fact table as additional attributes. This option
increases the size of the fact table.
2) Concatenated primary key that is the concatenation of all the
primary keys of the dimension tables. Here you need not keep the
primary keys of the dimension tables as additional attributes to
serve as foreign keys. The individual parts of the primary keys
themselves will serve as the foreign keys.
3) A generated primary key independent of the keys of the
dimension tables. In addition to the generated primary key, the
foreign keys must also be kept in the fact table as additional
attributes. This option also increases the size of the fact table.
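As a rough sketch of options 2 and 3, the table definitions below use Python's built-in sqlite3 module. All table and column names are hypothetical. With the concatenated key, the dimension keys serve as both primary key and foreign keys, so a second row for the same combination is rejected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY);
CREATE TABLE store_dim   (store_key INTEGER PRIMARY KEY);
CREATE TABLE time_dim    (time_key INTEGER PRIMARY KEY);

-- Option 2: the dimension keys together form the primary key
-- and double as the foreign keys.
CREATE TABLE sales_fact_concatenated (
    product_key INTEGER NOT NULL REFERENCES product_dim(product_key),
    store_key   INTEGER NOT NULL REFERENCES store_dim(store_key),
    time_key    INTEGER NOT NULL REFERENCES time_dim(time_key),
    sales_units INTEGER,
    PRIMARY KEY (product_key, store_key, time_key)
);

-- Option 3: a generated key, with the foreign keys kept as extra columns.
CREATE TABLE sales_fact_generated (
    sales_fact_id INTEGER PRIMARY KEY,
    product_key INTEGER NOT NULL REFERENCES product_dim(product_key),
    store_key   INTEGER NOT NULL REFERENCES store_dim(store_key),
    time_key    INTEGER NOT NULL REFERENCES time_dim(time_key),
    sales_units INTEGER
);

INSERT INTO product_dim VALUES (1);
INSERT INTO store_dim VALUES (1);
INSERT INTO time_dim VALUES (1);
INSERT INTO sales_fact_concatenated VALUES (1, 1, 1, 10);
""")

def insert_duplicate_combination() -> bool:
    """Try to add a second row for the same product/store/time combination."""
    try:
        conn.execute("INSERT INTO sales_fact_concatenated VALUES (1, 1, 1, 99)")
        return True
    except sqlite3.IntegrityError:
        return False

duplicate_accepted = insert_duplicate_combination()
print(duplicate_accepted)
```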
• The STAR schema reflects exactly how the users think and need data for querying and analysis.
• The STAR schema defines the join paths in exactly the same way users normally visualize the relationships.
• The STAR schema is intuitively understood by the users.
• It is easy to use as a vehicle for communicating with the users during the development of the data warehouse.
Irrespective of the number of dimensions that
participate in the query and irrespective of the
complexity of the query, every query is simply
executed first by selecting rows from the
dimension tables using the filters based on the
query parameters and then finding the
corresponding fact table rows.
UPDATES TO THE DIMENSION TABLES

• The fact table continues to grow in the number of rows over time.
• Very rarely are the rows in a fact table updated with changes.
• Even when there are adjustments to the prior numbers, these are also processed as additional adjustment rows and added to the fact table.
• Compared to the fact table, the dimension tables are more stable and less volatile. However, a dimension table changes through the attributes themselves.
Slowly Changing Dimensions

1) A customer's status changes from rental home to own home
2) The finance type changes for one of the payment methods
Slowly Changing Dimensions
• Most dimensions are generally constant over time
• Many dimensions, though not constant over time, change slowly
   - The product key of the source record does not change
   - The description and other attributes change slowly over time
• In the source OLTP systems, the new values overwrite the old ones
• Overwriting of dimension table attributes is not always the appropriate option in a data warehouse
• The ways changes are made to the dimension tables depend on the types of changes and what information must be preserved in the data warehouse
Type 1 Changes: Correction of Errors

1) A spelling error in the customer name
2) Customer name is changed
3) The marital status changed from single to married
Type 1 Changes: Correction of Errors

• Usually, the changes relate to correction of errors in source systems
• Sometimes the change in the source system has no significance
• The old value in the source system needs to be discarded
• The change in the source system need not be preserved in the warehouse
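A Type 1 change amounts to an overwrite in place. The sketch below models the dimension table as a plain Python dictionary. The key value, attribute names, and sample customer are hypothetical.

```python
# Hypothetical customer dimension: surrogate key -> attributes.
customer_dim = {
    101: {"customer_name": "Jon Smtih", "marital_status": "single"},
}

def apply_type1(dim: dict, key: int, attribute: str, new_value) -> None:
    """Type 1: overwrite the attribute in place; the old value is discarded."""
    dim[key][attribute] = new_value

# Correct the spelling error; no new row, no history kept.
apply_type1(customer_dim, 101, "customer_name", "Jon Smith")
print(customer_dim)
```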
Type 2 Changes: Preservation of History

E.g., a change in marital status and customer address
• They usually relate to true changes in source systems
• There is a need to preserve history in the data warehouse
• This type of change partitions the history in the data warehouse
• Every change for the same attribute must be preserved
Type 2 Changes: Preservation of History

• Add a new dimension table row with the new value of the changed attribute
• An effective date field may be included in
the dimension table
• There are no changes to the original row in
the dimension table
• The key of the original row is not affected
• The new row is inserted with a new
surrogate key
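The Type 2 steps can be sketched as follows, with the dimension table modeled as a list of rows. The column names (customer_key as the surrogate key, customer_id as the source key), the dates, and the sample values are assumptions for illustration.

```python
from datetime import date

# Hypothetical customer dimension rows.
customer_dim = [
    {"customer_key": 1, "customer_id": "C-42", "marital_status": "single",
     "effective_date": date(1995, 1, 1)},
]

def apply_type2(dim: list, customer_id: str, attribute: str,
                new_value, effective: date) -> int:
    """Type 2: add a row with a new surrogate key; older rows stay untouched."""
    latest = max((row for row in dim if row["customer_id"] == customer_id),
                 key=lambda row: row["effective_date"])
    new_row = dict(latest)  # copy, so the original row is not modified
    new_row["customer_key"] = max(row["customer_key"] for row in dim) + 1
    new_row[attribute] = new_value
    new_row["effective_date"] = effective
    dim.append(new_row)
    return new_row["customer_key"]

new_key = apply_type2(customer_dim, "C-42", "marital_status", "married",
                      date(2000, 10, 15))
print(new_key, len(customer_dim))
```

Fact rows loaded before the effective date keep pointing at the original key, and later fact rows use the new key, which is how the change partitions the history.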
Type 3 Changes: Tentative Soft Revisions
• Type 1 changes are more common.
• Type 2 changes preserve the history. When a Type 2 change is applied on a certain date, that date is a cut-off point.
• Sometimes, there is a need to track both the old and new values of changed attributes for a certain period, in both forward and backward directions.
• These types of changes are Type 3 changes.
• Type 3 changes are tentative or soft changes.
• E.g., realignment of the territorial assignments for salespersons.
Type 3 Changes: Tentative Soft Revisions

• They usually relate to "soft" or tentative changes in the source systems
• There is a need to keep track of history with old
and new values of the changed attribute
• They are used to compare performances
across the transition
• They provide the ability to track forward and
backward
Type 3 Changes: Tentative Soft Revisions
• Add an "old" field in the dimension table for the affected
attribute
• Push down existing value of attribute from "current" field to
the "old" field
• Keep the new value of the attribute in the "current" field
• Also, you may add a "current" effective date field for the
attribute
• The key of the row is not affected
• No new dimension row is needed
• The existing queries will seamlessly switch to the "current"
value
• Any queries that need to use the "old" value must be
revised accordingly
• The technique works best for one "soft" change at a time
• If there is a succession of changes, more sophisticated
techniques must be devised
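A minimal sketch of the Type 3 steps, again with a dictionary standing in for the dimension table. The territory names, key value, and field names are hypothetical.

```python
from datetime import date

# Hypothetical salesperson dimension: key -> attributes, with "current"
# and "old" fields for the territory attribute.
salesperson_dim = {
    12: {"name": "Jane Doe", "current_territory": "Chicago",
         "old_territory": None, "effective_date": date(1998, 1, 1)},
}

def apply_type3(dim: dict, key: int, new_territory: str, effective: date) -> None:
    """Type 3: push the current value down to the old field, keep the same row."""
    row = dim[key]
    row["old_territory"] = row["current_territory"]
    row["current_territory"] = new_territory
    row["effective_date"] = effective

apply_type3(salesperson_dim, 12, "Milwaukee", date(2000, 12, 1))
print(salesperson_dim[12])
```

Queries that read current_territory pick up the new value without change, while a comparison across the transition reads old_territory as well. A second realignment would overwrite the old field, which is why the technique suits one soft change at a time.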
Large Dimensions

• very deep - a very large number of rows
• very wide - a large number of attributes
E.g., the customer and product dimensions:

Customer                          Product
Huge: 20 million rows             100,000 product variations
Up to 150 dimension attributes    100 dimension attributes
Can have multiple hierarchies     Can have multiple hierarchies

• Data warehouse functions could be slow and inefficient.
• Fact table queries are inefficient when large dimensions need to be used.
• Additional rows are created to handle Type 2 slowly changing dimensions.
Multiple Hierarchies
Rapidly Changing Dimensions
Junk Dimensions
• Some fields like miscellaneous flags and textual
fields from source data structures may not be
included as significant fields in the major
dimensions, but cannot be discarded either.
• Keep only those flags and texts that are
meaningful; group all the useful flags into a
single "junk" dimension.
• "Junk" dimension attributes are useful for
constraining queries based on flag/text values.
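One possible way to build such a dimension is to generate a row for every combination of the retained flag values and look up the key when loading facts. The flag names and values below are hypothetical, and this is a sketch of one approach rather than a prescribed method.

```python
from itertools import product

# Hypothetical leftover flags and texts from the source order records.
flag_values = {
    "gift_wrap_flag": ["Y", "N"],
    "rush_order_flag": ["Y", "N"],
    "payment_note": ["cash", "credit"],
}

# One junk-dimension row per combination of values, each with its own key.
junk_dim = {
    key: dict(zip(flag_values, combo))
    for key, combo in enumerate(product(*flag_values.values()), start=1)
}
lookup = {tuple(row.values()): key for key, row in junk_dim.items()}

def junk_key(order: dict) -> int:
    """Find the junk-dimension key for the flags carried by one source order."""
    return lookup[tuple(order[name] for name in flag_values)]

order = {"gift_wrap_flag": "N", "rush_order_flag": "Y", "payment_note": "credit"}
print(len(junk_dim), junk_dim[junk_key(order)])
```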
THE SNOWFLAKE SCHEMA

500 - product brands
10 - product categories
500,000 - product dimension rows
Consider a query constraining just on product category.
THE SNOWFLAKE SCHEMA

1. Partially normalize only a few dimension tables, leaving the others intact
2. Partially or fully normalize only a few
dimension tables, leaving the rest intact
3. Partially normalize every dimension table
4. Fully normalize every dimension table
THE SNOWFLAKE SCHEMA
Eliminating all long text fields from the dimension tables can save storage space. For example:
• Category name: "men's furnishings"
• Product dimension table: 500,000 rows
• Snowflaking can remove 500,000 20-byte category names.
• It adds a 4-byte artificial category key to each row of the dimension table.
• Net savings = 500,000 * 16 bytes = 8 MB.
• The 500,000-row product dimension table takes 200 MB; the fact table takes 20 GB.
• The savings are just 4% of the dimension table.
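The arithmetic can be checked with a few lines of Python. The figures are the ones quoted above. The last ratio, the savings against dimension table plus fact table, is an added comparison and assumes 1 GB = 1,000 MB.

```python
rows = 500_000
name_bytes = 20           # category name removed from each row
key_bytes = 4             # artificial category key added to each row
dimension_table_mb = 200
fact_table_mb = 20_000    # 20 GB, assuming 1 GB = 1,000 MB

net_savings_mb = rows * (name_bytes - key_bytes) / 1_000_000
share_of_dimension = net_savings_mb / dimension_table_mb
share_of_total = net_savings_mb / (dimension_table_mb + fact_table_mb)

print(net_savings_mb)                 # megabytes saved
print(f"{share_of_dimension:.0%}")    # share of the dimension table
print(f"{share_of_total:.2%}")        # share of dimension plus fact table
```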
THE SNOWFLAKE SCHEMA
Advantages
• Small savings in storage space
• Normalized structures are easier to update and maintain

Disadvantages
• The schema is less intuitive, and end-users are put off by the complexity
• Browsing through the contents is difficult
• Degraded query performance because of additional joins
Aggregate Fact Tables
Query 1: Total sales for customer number 12345678 during the first week of December 2000 for product Widget-1.

Query 2: Total sales for customer number 12345678 during the first three months of 2000 for product Widget-1.

Query 3: Total sales for all customers in the South-Central territory for the first two quarters of 2000 for product category Bigtools.

Assume that there is at least one sale per product per store per week.
1. Query involves 1 product, 1 store, 1 week: only 1 fact table row
2. Query involves 1 product, all stores, 1 week: 300 fact table rows
3. Query involves 1 brand, 1 store, 1 week: 500 fact table rows
4. Query involves 1 brand, all stores, 1 year: 7,800,000 fact table rows

If the totals are summarized for a brand, per store, per week, the 3rd query reads only one row and the 4th query reads 15,600 rows.
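The row counts can be reproduced with simple multiplication. The figures of 300 stores, 500 products per brand, and 52 weeks per year are not stated outright here. They are inferred from the quoted row counts and should be read as assumptions.

```python
# Inferred from the quoted row counts (assumptions, not stated figures).
stores = 300
products_per_brand = 500
weeks_per_year = 52

# Rows read from the base fact table (grain: product by store by week).
base_rows = {
    "1 product, 1 store, 1 week": 1 * 1 * 1,
    "1 product, all stores, 1 week": 1 * stores * 1,
    "1 brand, 1 store, 1 week": products_per_brand * 1 * 1,
    "1 brand, all stores, 1 year": products_per_brand * stores * weeks_per_year,
}

# Rows read from an aggregate table summarized by brand, store, and week.
aggregate_rows = {
    "1 brand, 1 store, 1 week": 1,
    "1 brand, all stores, 1 year": stores * weeks_per_year,
}

for query, rows in base_rows.items():
    print(f"{query}: {rows:,} base rows")
for query, rows in aggregate_rows.items():
    print(f"{query}: {rows:,} aggregate rows")
```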
When you rise to higher levels in the hierarchy of one dimension and keep the level at the lowest in the other dimensions, you create one-way aggregate tables.

Product category by store by date
Product department by store by date
All products by store by date
Territory by product by date
Region by product by date
All stores by product by date
Month by store by product
Quarter by store by product
Year by store by product
When you rise to higher levels in the hierarchies of two dimensions and keep the level at the lowest in the other dimension, you create two-way aggregate tables.

Product category by territory by date
Product category by region by date
Product category by all stores by date
Product category by month by store
Product category by quarter by store
Product category by year by store
Product department by territory by date
Product department by region by date
Product department by all stores by date
Product department by month by store
Product department by quarter by store
Product department by year by store
When you rise to higher levels in the hierarchies of all three dimensions, you create three-way aggregate tables.

Product category by territory by month
Product department by territory by month
All products by territory by month
Product category by region by month
Product department by region by month
All products by region by month
Product category by all stores by month
Product department by all stores by month
Product category by territory by quarter
Product department by territory by quarter
All products by territory by quarter
Product category by region by quarter
Product department by region by quarter
All products by region by quarter
Product category by all stores by quarter
Product department by all stores by quarter
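The one-way, two-way, and three-way combinations can be enumerated mechanically. The sketch below assumes three dimensions with the hierarchy levels named in the lists above. It yields every possible combination, so the two-way and three-way results contain more entries than the selections listed above.

```python
from itertools import product

# Hierarchy levels named in the lists above, lowest level first.
hierarchies = {
    "product": ["product", "product category", "product department", "all products"],
    "store": ["store", "territory", "region", "all stores"],
    "time": ["date", "month", "quarter", "year"],
}

def aggregates(n_way: int) -> list:
    """Level combinations where exactly n_way dimensions sit above the lowest level."""
    combos = []
    for levels in product(*hierarchies.values()):
        raised = sum(
            level != dim_levels[0]
            for level, dim_levels in zip(levels, hierarchies.values())
        )
        if raised == n_way:
            combos.append(" by ".join(levels))
    return combos

one_way, two_way, three_way = aggregates(1), aggregates(2), aggregates(3)
print(len(one_way), len(two_way), len(three_way))
print(one_way[:3])
```

In practice only a subset of these candidate tables would be built, chosen by the summary queries that actually need them.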
