Dimensional Modeling Tutorial
Dimensional Modeling Tutorial
Dimensional modeling is the design concept used by many data warehouse designers to
build their data warehouse. Dimensional model is the underlying data model used by many
of the commercial OLAP products available today in the market. In this model, all data is
contained in two types of tables called Fact Table and Dimension Table.
Fact Table
Fact table contains the measurements or metrics or facts of business processes. If your
business process is Sales, then a measurement of this business process such as "monthly
sales number" is captured in the fact table. In addition to the measurements, the only other
things a fact table contains are foreign keys for the dimension tables.
Dimension Table
Context of the measurements are represented in dimension tables. You can also think of the
context of a measurement as the characteristics such as who, what, where, when, how of a
measurement (subject ). In your business process Sales, the characteristics of the 'monthly
sales number' measurement can be a Location (Where), Time (When), Product Sold (What).
The Dimension Attributes are the various columns in a dimension table. In the Location
dimension, the attributes can be Location Code, State, Country, Zip code. Generally the
Dimension Attributes are used in report labels, and query constraints such as where
Country='USA'. The dimension attributes also contain one or more hierarchical
relationships.
Before designing your data warehouse, you need to decide what this data warehouse
contains. Say if you want to build a data warehouse containing monthly sales numbers
across multiple store locations, across time and across products then your dimensions are:
Location
Time
Product
Each dimension table contains data for one dimension. In the above example you get all
your store location information and put that into one single table called Location. Your
store location data may be spanned across multiple tables in your OLTP system (unlike
OLAP), but you need to de-normalize all that data into one single table.
Online Transaction Processing (OLTP) Schema
Unlike OLAP, normalization is very important to reduce duplicates and also cut down on
the size of the data. our OLTP schema may look like this:
Locations Table
Field Name Type
Loc_Id INTEGER (4)
Loc_Code VARCHAR (5)
Loc_Name VARCHAR (30)
State_Id INTEGER (4)
Country_Id INTEGER (4)
States Table
Field Name Type
Sate_Id INTEGER (4)
State_Name VARCHAR (50)
Countries Table
Field Name Type
Country_Id INTEGER (4)
Country_Name VARCHAR (50)
In order to query for all locations that are in country 'USA' we will have to join these three
tables. The SQL will look like:
Dimensional modeling allows only one table per dimension. But your OLTP data spans
across multiple tables as described. So we need de-normalize the OLTP schema and export
into your Dimension Tables.
For example, for the location dimension, you achieve this by joining the three OLTP tables
and inserting the data into the single Location table.
All Dimension tables contain a key column called the dimension key. In this example
Dim_Id is our dimension Id. This is the unique key into our Location dimension table.
The actual data in your Location Table may look like this
You may notice that some of the information is repeated in the above dimension table. The
State Name and Country Name are repeated through out the table. You may feel that this is
waste of data space and against the normalization principles. But in dimensional modeling
this type of design makes the querying very optimized and reduces the query times. Also
we will learn later that in a typical data warehouse, the dimension tables make up only 10 to
15 % of the storage as the fact table is by far the largest table and takes up the rest of the
storage allocation.
Time Dimension Table
The actual data in your Time Table may will look like this:
In this table PR_Dim_Id is our dimension Id. This is the unique key into our Product
dimension table.
The actual data in your Product Table may look like this:
Product Dimension Table Data
Dim_Id SKU Name Category
1001 DOVE6K Dove Soap 6Pk Sanitary
1002 MLK66F# Skim Milk 1 Gal Dairy
1003 SMKSAL55 Smoked Salmon 6oz Meat
Fact Table
Fact table contains the actual business process measurements or metrics called facts.
Usually these facts are numeric.
In the above example where you are building a data warehouse of monthly sales in dollars,
your fact table will contain the actual sales numbers, one row per month. In addition to the
data itself, you will have the foreign keys for the various dimensions in this row.
The level of detail of the fact table is known as the grain of the fact table. In this example
the grain of your fact table is monthly sales number per location per product.
In this table the combination of all three dimension table foreign keys make up the primary
key in the fact table. (TM_Dim_Id, PR_ Dim_Id, LOC_ Dim_Id) is our primary key. This
is the unique key into our Sales fact table.
The actual data in your MonthlySales Table will look like this:
MonthlySales Fact Table Data
TM_Dim_Id PR_ Dim_Id LOC_ Sales
Dim_Id
1001 1001 1003 435677
1002 1002 1001 451121
1003 1001 1003 98765
1001 1004 1001 65432
A fact table may contain one or more Facts. Usually you create one fact table per business
process or event. For example if you want to analyze the sales numbers and also advertising
spending, they are two separate business processes. So you will create two separate fact
tables, one for sales data and one for advertising cost data. On the other hand if you want to
track the sales tax in addition to the sales number, you simply create one more fact column
in the Sales fact table called Tax.
Star Schema
Now if you look at our new dimensional modeled schema, it will look like this:
You can easily tell this looks like a STAR. Hence it is also known as Star Schema.
Star Schema is very easy to understand, even for non technical business managers
Star Schema provides better performance and smaller query times
Star schema is easily extensible and will handle future changes easily
A Typical SQL Query Template for the Sales Schema will look like:
--- Select the measurements that you want to aggregate using SUM clause
-- finally the 'group by' clause identifies the aggregation level. In this example you are
aggregating
-- all sales within a product category.
GROUP BY P.Category
Now if you look at a report generated from your dimensional modeled data warehouse, it
will look like this:
You can tell from the above report that all your dimensions are the row and column
headings in the report. The facts are the numeric numbers in the report.
Star Schema is very easy to understand, even for non technical business managers
Star Schema provides better performance and smaller query times
Star schema is easily extensible and will handle future changes easily
Snowflake Schema
If we did not de-normalize our dimensions into one table each, then the schema would have
looked like this
You can see, this looks like a snow flake, hence this type of schema is called SNOW
FLAKE schema. General rule of thumb is keep away from snow flake schemas as even
though they may save you some space, they will cost a lot in terms of query times.