Unit 4
1. Dimensional Modeling:
Definition: A technique used in data warehouse design to model data in a way that makes it
easy to understand and retrieve, primarily for reporting and analysis purposes.
Key Concepts:
o Fact Tables: Central tables that store quantitative data (metrics or measurements).
Dimensional data modeling is one of the data modeling techniques used in data warehouse design.
The concept was developed by Ralph Kimball and is built from fact tables and dimension tables.
Because the main goal of this modeling is to improve data retrieval, it is optimized for SELECT
operations. The advantage of this model is that data is stored in a way that makes it easier to
retrieve once it is in the data warehouse. The dimensional model is the data model used by many
OLAP systems.
Facts
Facts are the measurable data elements that represent the business metrics of interest. For example,
in a sales data warehouse, the facts might include sales revenue, units sold, and profit margins. Each
fact is associated with one or more dimensions, creating a relationship between the fact and the
descriptive data.
Dimension
Dimensions are the descriptive data elements that are used to categorize or classify the data. For
example, in a sales data warehouse, the dimensions might include product, customer, time, and
location. Each dimension is made up of a set of attributes that describe the dimension. For example,
the product dimension might include attributes such as product name, product category, and
product price.
Attributes
Characteristics of a dimension in data modeling are known as attributes. These are used to filter
and search facts. For a location dimension, attributes can be State, Country, Zipcode, etc.
Fact Table
In a dimensional data model, the fact table is the central table that contains the measures or metrics
of interest, surrounded by the dimension tables that describe the attributes of the measures. The
dimension tables are related to the fact table through foreign key relationships.
Dimension Table
The dimension table describes the dimensions of a fact and is joined to the fact table by a foreign
key. Dimension tables are typically de-normalized tables. A dimension can take part in one or more
relationships.
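As a minimal sketch of how a fact table and its dimension tables relate, the example below uses plain Python dictionaries. The table names, keys, and numbers (product_dim, location_dim, sales_fact) are illustrative assumptions, not taken from a real warehouse.

```python
# Hypothetical dimension tables: surrogate key -> descriptive attributes.
product_dim = {
    1: {"product_name": "Pen", "product_category": "Stationery"},
    2: {"product_name": "Lamp", "product_category": "Home"},
}
location_dim = {
    10: {"state": "Karnataka", "country": "India"},
    20: {"state": "Delhi", "country": "India"},
}

# Hypothetical fact table: foreign keys plus numeric measures.
sales_fact = [
    {"product_key": 1, "location_key": 10, "units_sold": 5, "revenue": 50.0},
    {"product_key": 2, "location_key": 10, "units_sold": 1, "revenue": 40.0},
    {"product_key": 1, "location_key": 20, "units_sold": 3, "revenue": 30.0},
]


def revenue_by(attribute, dim, key_column):
    """Sum revenue grouped by one attribute of one dimension table."""
    totals = {}
    for row in sales_fact:
        # Follow the foreign key from the fact row to its dimension row.
        label = dim[row[key_column]][attribute]
        totals[label] = totals.get(label, 0.0) + row["revenue"]
    return totals


by_category = revenue_by("product_category", product_dim, "product_key")
by_state = revenue_by("state", location_dim, "location_key")
print(by_category)
print(by_state)
```

The fact rows hold only keys and measures; every descriptive label used for grouping comes from a dimension table.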
Types of Dimensions
Conformed Dimension
Outrigger Dimension
Shrunken Dimension
Role-Playing Dimension
Junk Dimension
Degenerate Dimension
Swappable Dimension
Step Dimension
Steps to Create a Dimensional Data Model
Step-1: Identifying the business objective: The first step is to identify the business objective,
that is, the business process to be modeled. Sales, HR, and Marketing are some examples, depending
on the needs of the organization. This is the most important step of data modeling, and the choice
of business objective also depends on the quality of the data available for that process.
Step-2: Identifying granularity: Granularity is the lowest level of detail stored in the table. The
grain describes the level of detail needed for the business problem and its solution.
Step-3: Identifying dimensions and their attributes: Dimensions are objects or things. Dimensions
categorize and describe data warehouse facts and measures in a way that supports meaningful
answers to business questions. A data warehouse organizes descriptive attributes as columns in
dimension tables. For example, the date dimension may contain data such as year, month, and
weekday.
Step-4: Identifying the facts: The fact table holds the measurable data. Most fact table columns
are numerical values such as price or cost per unit.
Step-5: Building the schema: The dimensional model is implemented in this step. A schema is a
database structure. There are two popular schemas: Star Schema and Snowflake Schema.
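The steps above can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the table and column names (dim_date, dim_product, fact_sales) and all rows are made up, and the grain chosen here is one row per product per day.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Step 3: dimensions and their attributes (hypothetical names).
    CREATE TABLE dim_date (
        date_key INTEGER PRIMARY KEY,
        year INTEGER, month INTEGER, weekday TEXT
    );
    CREATE TABLE dim_product (
        product_key INTEGER PRIMARY KEY,
        product_name TEXT, product_category TEXT
    );
    -- Steps 2 and 4: grain is one row per product per day; measures are numeric.
    CREATE TABLE fact_sales (
        date_key INTEGER REFERENCES dim_date(date_key),
        product_key INTEGER REFERENCES dim_product(product_key),
        units_sold INTEGER,
        revenue REAL
    );
""")
conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?, ?)", [
    (20240101, 2024, 1, "Monday"),
    (20240201, 2024, 2, "Thursday"),
])
conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?)", [
    (1, "Pen", "Stationery"),
    (2, "Lamp", "Home"),
])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)", [
    (20240101, 1, 5, 50.0),
    (20240101, 2, 1, 40.0),
    (20240201, 1, 3, 30.0),
])

# Step 5 result: a typical read query joins the fact table to a dimension.
monthly_revenue = conn.execute("""
    SELECT d.month, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_date AS d ON d.date_key = f.date_key
    GROUP BY d.month
    ORDER BY d.month
""").fetchall()
print(monthly_revenue)
```

Declaring the grain before adding measures keeps every row of the fact table at the same level of detail, which is what makes the SUM in the final query safe.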
Dimensional data modeling is a technique used in data warehousing to organize and structure data in
a way that makes it easy to analyze and understand. In a dimensional data model, data is organized
into dimensions and facts.
Overall, dimensional data modeling is an effective technique for organizing and structuring data in a
data warehouse for analysis and reporting. By providing a simple and intuitive structure for the data,
the dimensional model makes it easy for users to access and understand the data they need to make
informed business decisions.
Advantages of Dimensional Data Modeling
Simplified Data Access: Dimensional data modeling enables users to easily access data
through simple queries, reducing the time and effort required to retrieve and analyze data.
Enhanced Query Performance: The simple structure of dimensional data modeling allows for
faster query performance, particularly when compared to highly normalized relational data
models that need many joins.
Increased Flexibility: Dimensional data modeling allows for more flexible data analysis, as
users can quickly and easily explore relationships between data.
Improved Data Quality: Dimensional data modeling can improve data quality by reducing
redundancy and inconsistencies in the data.
Easy to Understand: Dimensional data modeling uses simple, intuitive structures that are
easy to understand, even for non-technical users.
Disadvantages of Dimensional Data Modeling
Limited Complexity: Dimensional data modeling may not be suitable for very complex data
relationships, as it relies on simple structures to organize data.
Limited Integration: Dimensional data modeling may not integrate well with other data
models, particularly those that rely on normalization techniques.
Limited Scalability: Dimensional data modeling may not be as scalable as other data
modeling techniques, particularly for very large datasets.
Limited History Tracking: Dimensional data modeling may not track changes to historical
data unless techniques such as slowly changing dimensions are applied, as it typically
focuses on current data.
2. Multi-Dimensional Data Model:
Definition: Organizes data into dimensions and facts to enable complex queries and analysis.
Data is viewed in multiple dimensions (like time, geography, product) rather than in flat,
relational structures.
Data Cube: A multi-dimensional array of values, typically used in OLAP (Online Analytical
Processing). A data cube allows data to be modeled and viewed in three or more dimensions,
making it easy to pivot and drill into data.
The Multi-Dimensional Data Model is a method of organizing the data in a database so that
its contents are well arranged and assembled.
The Multi-Dimensional Data Model allows users to ask analytical questions about market or
business trends, unlike relational databases, which give users access to data through
queries over flat tables. Users receive answers to their requests quickly because the data
can be created and examined comparatively fast.
OLAP (online analytical processing) and data warehousing use multi-dimensional databases
to show multiple dimensions of the data to users.
The model represents data in the form of data cubes. Data cubes allow data to be modeled
and viewed from many dimensions and perspectives. A cube is defined by dimensions and facts
and is represented by a fact table. Facts are numerical measures, and the fact table contains
the names of the facts (the measures) along with keys to each of the related dimension tables.
The Multi-Dimensional Data Model works on the basis of pre-decided steps.
Every project that builds a Multi-Dimensional Data Model should follow these stages :
Stage 1 : Assembling data from the client : In the first stage, correct data is collected
from the client. Software professionals usually make clear to the client what range of data
can be obtained with the selected technology, and then collect the complete data in detail.
Stage 2 : Grouping different segments of the system : In the second stage, all the data is
recognized and classified into the sections it belongs to, which makes the model easier to
apply step by step.
Stage 3 : Noticing the different proportions : The third stage is the basis on which the
design of the system rests. In this stage, the main factors are recognized from the
user’s point of view. These factors are also known as “Dimensions”.
Stage 4 : Preparing the actual-time factors and their respective qualities : In the fourth
stage, the factors recognized in the previous stage are used to identify their related
qualities. These qualities are also known as “attributes” in the database.
Stage 5 : Finding the actuality of factors which are listed previously and their qualities : In
the fifth stage, the facts are separated and distinguished from the factors collected so far.
These facts play a significant role in the arrangement of a Multi-Dimensional Data Model.
Stage 6 : Building the Schema to place the data, with respect to the information collected
from the steps above : In the sixth stage, a schema is built on the basis of the data
collected previously.
For Example :
1. Let us take the example of a firm. The revenue of a firm can be analyzed on the basis of
different factors such as the geographical location of the firm’s workplace, the products of
the firm, the advertisements done, the time taken to develop a product, etc.
(Figure: Example 1)
2. Let us take the example of the data of a factory which sells products per quarter in
Bangalore. The data is represented in the table given below :
(Table: 2D factory data)
In the above presentation, the factory’s sales for Bangalore are shown along the time
dimension, which is organized into quarters, and the item dimension, which is sorted
according to the kind of item sold. The facts here are represented in rupees (in
thousands).
Now, if we want to view the sales data in three dimensions, it can still be laid out as a
two-dimensional table, as shown below. Let us consider the data according to item, time and
location (like Kolkata, Delhi, Mumbai). Here is the table :
(Table: 3D data representation as 2D)
This data can be represented in the form of three dimensions conceptually, which is shown
in the image below :
(Figure: 3D data representation)
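The item, time, and location example above can be sketched as a small cube in Python. The numbers and item names below are illustrative only, since the original tables are not reproduced here, and the helper names slice_location and total_by_location are assumptions made for this sketch.

```python
# Illustrative numbers only; each key is one cell of the cube:
# (location, quarter, item) -> sales in thousands of rupees.
cube = {
    ("Bangalore", "Q1", "Egg"): 10,
    ("Bangalore", "Q1", "Milk"): 20,
    ("Bangalore", "Q2", "Egg"): 30,
    ("Kolkata", "Q1", "Egg"): 40,
    ("Kolkata", "Q2", "Milk"): 50,
}


def slice_location(cube, location):
    """Fix the location dimension, leaving a 2D (quarter, item) table."""
    return {(q, i): v for (loc, q, i), v in cube.items() if loc == location}


def total_by_location(cube):
    """Collapse the quarter and item dimensions, keeping only location."""
    totals = {}
    for (loc, _quarter, _item), value in cube.items():
        totals[loc] = totals.get(loc, 0) + value
    return totals


bangalore = slice_location(cube, "Bangalore")
totals = total_by_location(cube)
print(bangalore)
print(totals)
```

Slicing on one location gives back a two-dimensional table like the Bangalore one, which is why a three-dimensional cube can be printed as a series of 2D tables.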
Measures: Measures are numerical data that can be analyzed and compared, such as sales or
revenue. They are typically stored in fact tables in a multidimensional data model.
Dimensions: Dimensions are attributes that describe the measures, such as time, location, or
product. They are typically stored in dimension tables in a multidimensional data model.
Cubes: Cubes are structures that represent the multidimensional relationships between measures
and dimensions in a data model. They provide a fast and efficient way to retrieve and analyze data.
Aggregation: Aggregation is the process of summarizing data across dimensions and levels of detail.
This is a key feature of multidimensional data models, as it enables users to quickly analyze data at
different levels of granularity.
Drill-down and roll-up: Drill-down is the process of moving from a higher-level summary of data to a
lower level of detail, while roll-up is the opposite process of moving from a lower-level detail to a
higher-level summary. These features enable users to explore data in greater detail and gain insights
into the underlying patterns.
Hierarchies: Hierarchies are a way of organizing dimensions into levels of detail. For example, a time
dimension might be organized into years, quarters, months, and days. Hierarchies provide a way to
navigate the data and perform drill-down and roll-up operations.
OLAP (Online Analytical Processing): OLAP is a category of systems built on the multidimensional
data model that supports fast and efficient querying of large datasets. OLAP systems are designed
to handle complex queries and provide fast response times.
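Aggregation, roll-up, and drill-down over a hierarchy can be sketched as follows. This assumes a time hierarchy of year > quarter > month stored as tuple keys; the data and the function names roll_up and drill_down are hypothetical.

```python
from collections import defaultdict

# Hypothetical monthly sales keyed by (year, quarter, month); values are made up.
monthly_sales = {
    (2024, "Q1", "Jan"): 100,
    (2024, "Q1", "Feb"): 150,
    (2024, "Q2", "Apr"): 200,
    (2025, "Q1", "Jan"): 50,
}


def roll_up(data, levels):
    """Aggregate to the first `levels` levels of the hierarchy."""
    out = defaultdict(int)
    for key, value in data.items():
        out[key[:levels]] += value
    return dict(out)


def drill_down(data, prefix):
    """Return the more detailed rows that sit under one summary key."""
    return {k: v for k, v in data.items() if k[:len(prefix)] == prefix}


by_quarter = roll_up(monthly_sales, 2)   # month -> quarter
by_year = roll_up(monthly_sales, 1)      # month -> year
q1_detail = drill_down(monthly_sales, (2024, "Q1"))
print(by_quarter)
print(by_year)
print(q1_detail)
```

Roll-up shortens the key and sums the measure; drill-down goes the other way by returning the detailed rows behind one summary value.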
Advantages of the Multi-Dimensional Data Model
It is easy to maintain.
Its performance for analytical queries is better than that of normal databases (e.g.
relational databases).
The representation of data is better than in traditional databases, because multi-dimensional
databases offer multiple views and carry different types of factors.
It is a good fit for projects that have limited maintenance staff.
Disadvantages of the Multi-Dimensional Data Model
While a Multi-Dimensional Data Model is in use, system caching has a great effect on the
working of the system.
It is complicated in nature, so the databases are generally dynamic in design.
The path to achieving the end product is complicated most of the time.
Because a Multi-Dimensional Data Model involves complicated systems with a large number of
databases, the system is very vulnerable when there is a security breach.
There are three types of facts in multi-dimensional data modeling:
Additive facts: These facts can be summed up across any dimension in a database. Example use cases
are total profit, revenue, income, or quantity.
Usecase-1:
o If a person gets a profit of 100 units by selling product A and a profit of 500 units by
selling product B
o The total profit of the person = profit gained by selling product A + profit gained by
selling product B
o = 100 units + 500 units = 600 units
Usecase-2:
o If a person buys 250 units of product A and buys 300 units of product B
o The total quantity bought = 250 units + 300 units = 550 units
Semi-Additive facts: These facts can be summed up across some dimensions but cannot be
summed up across other dimensions in a database. Example use cases are inventory levels and bank
account balances.
1. Usecase-1:
If a person has a balance of 500 units in account A, deposits 1000 units of money in
account A, and deposits 400 units of money in account A
The balance of account A = 500 units + 1000 units + 400 units = 1900 units
2. Usecase-2:
If a person has a balance of 500 units in account A, deposits 1000 units of money in
account A, and deposits 400 units of money in account B
The balance of account A = 500 units + 1000 units = 1500 units
o but the result of summing everything up is 500 units (initial balance) + 1000 units + 400
units = 1900 units
o The above use cases come under the category of Semi-Additive facts because in
some scenarios summing them up does not give accurate results.
Non-Additive facts: These facts cannot be summed up across any dimension in a database. Example
use cases are profit margin or average temperatures.
1. Usecase-1:
If a company has a profit margin of 20% on day-1 and 80% on day-2, the current profit
margin is 80%
but the profit margin obtained by summing up day-1 and day-2 would be 20% + 80% = 100%,
which is not a meaningful profit margin
The above use case comes under the category of Non-Additive facts as summing them up does
not give accurate results in any dimension.
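The three kinds of facts can be compared side by side in a short sketch. The snapshot rows below are hypothetical; account A follows the earlier use case (a starting balance of 500, then deposits of 1000 and 400), while account B and the revenue and profit figures are invented for illustration.

```python
# Hypothetical account snapshots: (day, account, deposit, closing_balance).
snapshots = [
    (1, "A", 1000, 1500),
    (2, "A", 400, 1900),
    (1, "B", 0, 300),
    (2, "B", 200, 500),
]

# Additive: deposits can be summed across accounts and across days.
total_deposits = sum(dep for _, _, dep, _ in snapshots)

# Semi-additive: balances add up across accounts on the same day ...
balance_all_accounts_day2 = sum(bal for day, _, _, bal in snapshots if day == 2)
# ... but not across days; for one account, take the latest snapshot instead.
account_a = [(day, bal) for day, acct, _, bal in snapshots if acct == "A"]
latest_balance_a = max(account_a)[1]
wrong_balance_a = sum(bal for _, bal in account_a)

# Non-additive: a ratio must be recomputed from its additive parts.
daily = [(100.0, 20.0), (100.0, 80.0)]  # (revenue, profit) per day
summed_margins = sum(profit / revenue for revenue, profit in daily)
overall_margin = sum(p for _, p in daily) / sum(r for r, _ in daily)

print(total_deposits, balance_all_accounts_day2)
print(latest_balance_a, wrong_balance_a)
print(summed_margins, overall_margin)
```

Summing the two daily margins gives a value near 100%, while recomputing the margin from total profit and total revenue gives 50%, which is why ratios are usually stored as their additive components.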
3. Star Schema:
Definition: The simplest form of dimensional modeling. It consists of a central fact table
surrounded by dimension tables.
Structure:
o Simple design.
4. Snowflake Schema:
Definition: A more normalized version of the star schema where dimension tables are
further split into related sub-dimensions.
Structure: Similar to the star schema, but dimension tables are normalized (split into
multiple related tables).
Pros:
o Reduces redundancy.
Cons:
5. Star Schema vs. Snowflake Schema:
Star Schema: Easier to design and faster to query, but can lead to data redundancy.
Snowflake Schema: More storage efficient due to normalized dimensions, but queries are
slower because of the need for more joins.
Use Case:
o Star schema is often used when performance is critical (e.g., in OLAP systems).
o Snowflake schema is used when minimizing storage and maintaining data integrity
are more important.
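The trade-off between the two schemas can be seen in a small sqlite3 sketch. All table names (star_product, snow_product, snow_category) and rows are assumptions made for illustration; the point is only that both layouts return the same answer, with the snowflake version needing one more join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_sales (product_key INTEGER, revenue REAL);

    -- Star: the category name is repeated on every product row.
    CREATE TABLE star_product (
        product_key INTEGER PRIMARY KEY, product_name TEXT, category_name TEXT
    );

    -- Snowflake: the category is normalized into its own table.
    CREATE TABLE snow_category (
        category_key INTEGER PRIMARY KEY, category_name TEXT
    );
    CREATE TABLE snow_product (
        product_key INTEGER PRIMARY KEY, product_name TEXT,
        category_key INTEGER REFERENCES snow_category(category_key)
    );

    INSERT INTO fact_sales VALUES (1, 50.0), (2, 40.0), (3, 30.0);
    INSERT INTO star_product VALUES
        (1, 'Pen', 'Stationery'), (2, 'Pencil', 'Stationery'), (3, 'Lamp', 'Home');
    INSERT INTO snow_category VALUES (1, 'Stationery'), (2, 'Home');
    INSERT INTO snow_product VALUES (1, 'Pen', 1), (2, 'Pencil', 1), (3, 'Lamp', 2);
""")

# Star schema: one join from the fact table to the dimension.
star_rows = conn.execute("""
    SELECT p.category_name, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN star_product AS p ON p.product_key = f.product_key
    GROUP BY p.category_name
    ORDER BY p.category_name
""").fetchall()

# Snowflake schema: a second join is needed to reach the category name.
snow_rows = conn.execute("""
    SELECT c.category_name, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN snow_product AS p ON p.product_key = f.product_key
    JOIN snow_category AS c ON c.category_key = p.category_key
    GROUP BY c.category_name
    ORDER BY c.category_name
""").fetchall()

print(star_rows)
print(snow_rows)
```

In the star layout 'Stationery' is stored twice; in the snowflake layout it is stored once, at the cost of the extra join.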
6. Fact Constellation Schema:
Definition: A more complex schema design that includes multiple fact tables sharing
dimension tables. It's also called a "galaxy schema."
Use Case: When multiple related fact tables are needed, or when there are different levels of
granularity in the data warehouse.
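A minimal sketch of two fact tables sharing one dimension is shown below. The tables sales_fact and shipment_fact, their grains, and all values are hypothetical and chosen only to illustrate the idea.

```python
# Shared dimension used by both fact tables; names are hypothetical.
product_dim = {1: "Pen", 2: "Lamp"}

# Two fact tables at different grains that reference the same dimension keys.
sales_fact = [  # one row per product per day
    {"product_key": 1, "day": 1, "units_sold": 5},
    {"product_key": 1, "day": 2, "units_sold": 3},
    {"product_key": 2, "day": 1, "units_sold": 1},
]
shipment_fact = [  # one row per product per shipment
    {"product_key": 1, "units_shipped": 10},
    {"product_key": 2, "units_shipped": 4},
]


def total_by_product(fact_rows, measure):
    """Sum one measure of a fact table, grouped by product name."""
    totals = {}
    for row in fact_rows:
        name = product_dim[row["product_key"]]
        totals[name] = totals.get(name, 0) + row[measure]
    return totals


sold = total_by_product(sales_fact, "units_sold")
shipped = total_by_product(shipment_fact, "units_shipped")
# Both totals are keyed by the same dimension, so they can be compared directly.
unsold = {name: shipped[name] - sold[name] for name in product_dim.values()}
print(sold, shipped, unsold)
```

Because both fact tables use the same product keys, results from each can be lined up without any extra mapping.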
7. Schema Definition:
Schema: The structure that defines the organization of data in a database or data
warehouse.
Types of Schemas:
o Logical Schema: Describes the design of the database at a high level, detailing
entities and relationships without considering technical details.
o Physical Schema: Describes how the data is stored physically, including storage
format, indexes, and partitions.
Staging Area: A temporary location where raw data is cleaned and transformed before being
loaded into the data warehouse.
Metadata: Data about the data, such as descriptions of tables, columns, and their
relationships.
OLAP: Systems that support analytical queries and reporting, often involving multi-
dimensional data models (e.g., data cubes).
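The role of the staging area described above can be sketched as a tiny clean-then-load step. The raw rows, the clean function, and the cleaning rules (trim whitespace, normalize case, drop rows whose revenue is not numeric) are assumptions for illustration, not a prescribed process.

```python
# Hypothetical raw rows as they might arrive from a source system.
raw_rows = [
    {"product": " Pen ", "revenue": "50.0"},
    {"product": "lamp", "revenue": "40"},
    {"product": "", "revenue": "n/a"},  # unusable row
]


def clean(row):
    """Return a cleaned row, or None if the row cannot be repaired."""
    name = row["product"].strip().title()
    try:
        revenue = float(row["revenue"])
    except ValueError:
        return None
    if not name:
        return None
    return {"product": name, "revenue": revenue}


# Staging area: cleaned and transformed rows held temporarily.
staging = [c for c in (clean(r) for r in raw_rows) if c is not None]

# Load step: only staged rows reach the warehouse table.
warehouse = list(staging)
print(warehouse)
```

Keeping the cleaning in a separate staging step means bad source rows never reach the warehouse tables that reports read from.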
Intra-query Parallelism: Breaks a single query into smaller parts that can be processed
concurrently by multiple processors.
o Pipeline Parallelism: Breaks query execution into stages, which are executed in
parallel.
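Intra-query parallelism can be illustrated with a single aggregation split into partitions. This sketch uses Python's concurrent.futures thread pool only to show the structure (partition, process concurrently, combine); the data and the function names partial_sum and parallel_sum are hypothetical, and a real database engine would schedule this work itself.

```python
from concurrent.futures import ThreadPoolExecutor

revenue = list(range(1, 101))  # hypothetical measure column with values 1..100


def partial_sum(chunk):
    """Work done by one worker on its own partition of the data."""
    return sum(chunk)


def parallel_sum(values, workers=4):
    """Split one SUM query into partitions, run them concurrently, combine."""
    size = (len(values) + workers - 1) // workers
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum, chunks))
    return sum(partials)


total = parallel_sum(revenue)
print(total)
```

The final sum over the partial results is the combine step; it works because SUM is an additive aggregate that can be computed per partition.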
ETL Tools:
o Informatica: Popular ETL tool used to extract data from multiple sources, transform
it into a standard format, and load it into the data warehouse.
OLAP Tools:
o Cognos: Business intelligence tool used for OLAP reporting and dashboarding.
o RapidMiner: Open-source tool for data science, offering data mining and machine
learning features.