Data Modeling
Data Model:
A data model presents the relationships between objects, tables, and queries so that reports
return valid data.
Advantages:
a) Report accuracy
b) Data from multiple sources (files, tables, cloud sources, etc.) can be used in a single view.
This means Excel data can relate to a flat file, and a flat file can relate to a table, with different
cardinalities, so that all of them can be used in a single visual or report.
A general understanding of the three data models is that a business analyst uses the conceptual
and logical models to model the business objects that exist in the system, while a database
designer or database engineer elaborates the conceptual and logical ER models to produce the
physical model, which presents the physical database structure ready for database creation. The
table below shows the difference between the three data models.
Conceptual data model
Conceptual ERD models the business objects that should exist in a system and the relationships
between them. A conceptual model is developed to present an overall picture of the system by
recognizing the business objects involved. It defines what entities exist, NOT which tables. For
example, 'many to many' tables may exist in a logical or physical data model but they are just
shown as a relationship with no cardinality under the conceptual data model.
NOTE: Conceptual ERD supports the use of generalization in modeling the 'a kind of'
relationship between two entities, for instance, Triangle, is a kind of Shape. The usage is like
generalization in UML. Notice that only conceptual ERD supports generalization.
Logical data model
Logical ERD is a detailed version of a Conceptual ERD. A logical ER model is developed to enrich
a conceptual model by defining explicitly the columns in each entity and introducing
operational and transactional entities. Although a logical data model is still independent of the
actual database system in which the database will be created, you can still take that into
consideration if it affects the design.
Physical data model
Physical ERD represents the actual design blueprint of a relational database. A physical data
model elaborates on the logical data model by assigning each column a type, length, nullability,
etc. Since a physical ERD represents how data should be structured and related in a specific
DBMS, it is important to consider the conventions and restrictions of the actual database system
in which the database will be created. Make sure the column types are supported by the DBMS
and that reserved words are not used in naming entities and columns.
NOTE: A many-to-many relationship between Order and Product is implemented physically with one
bridge (intermediate) table and two one-to-many relationships.
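The note above can be sketched with SQLite from Python's standard library. This is a minimal illustration only: the table and column names (orders, product, order_product, quantity) and the sample rows are hypothetical, not taken from any particular system.

```python
import sqlite3

# Hypothetical schema: all table/column names and rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE orders  (order_id   INTEGER PRIMARY KEY, order_date TEXT);
CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT);
-- Bridge table: one row per (order, product) pair.
-- orders 1-many order_product, and product 1-many order_product.
CREATE TABLE order_product (
    order_id   INTEGER NOT NULL REFERENCES orders(order_id),
    product_id INTEGER NOT NULL REFERENCES product(product_id),
    quantity   INTEGER NOT NULL,
    PRIMARY KEY (order_id, product_id)
);
""")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, "2024-01-05"), (2, "2024-01-06")])
conn.executemany("INSERT INTO product VALUES (?, ?)",
                 [(10, "Pen"), (20, "Notebook")])
conn.executemany("INSERT INTO order_product VALUES (?, ?, ?)",
                 [(1, 10, 3), (1, 20, 1), (2, 10, 5)])

# One order holds many products ...
products_in_order_1 = [r[0] for r in conn.execute(
    "SELECT p.name FROM order_product op "
    "JOIN product p ON p.product_id = op.product_id "
    "WHERE op.order_id = 1 ORDER BY p.name")]
# ... and one product appears on many orders.
orders_with_pen = [r[0] for r in conn.execute(
    "SELECT order_id FROM order_product WHERE product_id = 10 ORDER BY order_id")]
print(products_in_order_1, orders_with_pen)
```

Neither orders nor product refers to the other directly; each has a one-to-many relationship to the bridge table only.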
For example, package size is not directly related to product brand; nevertheless, package size
and product brand could both be attributes of the product dimension table.
Not normalized
The attributes in a dimension table are used over and over again in queries. An attribute is
taken as a constraint in a query and applied directly to the metrics in the fact table.
For efficient query performance, it is best if the query picks up an attribute from the dimension
table and goes directly to the fact table and not through other intermediary tables. If you
normalize the dimension table, you will be creating such intermediary tables and that will not
be efficient. Therefore, a dimension table is flattened out, not normalized.
Multiple hierarchies
In the example of the customer dimension, there is a single hierarchy going up from individual
customer to zip, city, and state. But dimension tables often provide for multiple hierarchies, so
that drilling down may be performed along any of the multiple hierarchies.
A fact is a measurable attribute. A table that holds a set of measurable attributes is a fact
table (also called a measure group table). It contains two sections: a) the foreign key section
and b) the measures section.
Concatenated Key (combination of foreign keys)
A row in the fact table relates to a combination of rows from all the dimension tables. The
primary key of the fact table must be the concatenation of the primary keys of all the
dimension tables.
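As a rough sketch of a concatenated key, the following uses SQLite with three hypothetical dimensions (product, customer, date). The table names, column names, and values are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Foreign keys are declared to document the design; SQLite does not enforce
# them unless PRAGMA foreign_keys is switched on, which this sketch skips.
conn.executescript("""
CREATE TABLE dim_product  (product_key  INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_date     (date_key     INTEGER PRIMARY KEY, day  TEXT);
CREATE TABLE fact_order (
    product_key      INTEGER NOT NULL REFERENCES dim_product(product_key),
    customer_key     INTEGER NOT NULL REFERENCES dim_customer(customer_key),
    date_key         INTEGER NOT NULL REFERENCES dim_date(date_key),
    quantity_ordered INTEGER,
    order_dollars    REAL,
    -- The fact table's primary key is the combination of its foreign keys.
    PRIMARY KEY (product_key, customer_key, date_key)
);
""")
conn.execute("INSERT INTO fact_order VALUES (1, 1, 20240105, 3, 30.0)")
try:
    # Same product, customer, and date: the composite key rejects the row.
    conn.execute("INSERT INTO fact_order VALUES (1, 1, 20240105, 9, 90.0)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(duplicate_rejected)
```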
Data Grain
This is an important characteristic of the fact table. As we know, the data grain is
the level of detail for the measurements or metrics. If we keep the quantity ordered as the
quantity of a specific product for each month, then the data grain is different and is at a higher
level.
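To make the change of grain concrete, here is a small sketch that rolls hypothetical daily rows up to a monthly grain. The products, dates, and quantities are made-up sample data.

```python
from collections import defaultdict

# Hypothetical daily-grain rows: (product, date, quantity_ordered)
daily = [
    ("Pen", "2024-01-05", 3),
    ("Pen", "2024-01-20", 2),
    ("Pen", "2024-02-02", 4),
    ("Notebook", "2024-01-05", 1),
]

# Roll up to a higher (coarser) grain: one row per product per month.
totals = defaultdict(int)
for product, day, qty in daily:
    totals[(product, day[:7])] += qty  # day[:7] is the 'YYYY-MM' prefix

monthly = dict(totals)
print(monthly)
```

Once only the monthly rows are stored, the daily detail can no longer be recovered, which is why the grain has to be decided before the fact table is loaded.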
Fully Additive Measures
Let us look at the attributes order dollars, extended cost, and quantity
ordered. Each of these relates to a particular product on a certain date for a specific customer
procured by an individual sales representative. When we run queries to aggregate measures in
the fact table, we will have to make sure that these measures are fully additive. Otherwise, the
aggregated numbers may not show the correct totals.
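Semi additive Measures
Consider the margin dollars attribute in the fact table. For example, if the order dollars is 120
and the extended cost is 100, the margin dollars is 20, and the margin percentage (margin relative
to extended cost) is 20. Both are calculated metrics derived from the order dollars and extended
cost. Margin dollars can still be summed across rows, but derived ratios such as margin
percentage are not additive: adding the percentages from several rows does not give the
percentage for the group. They are referred to here as semi additive measures, although a ratio
that cannot be summed across any dimension is, strictly speaking, non additive. Distinguish such
measures from fully additive measures when you perform aggregations in queries.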
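The aggregation caveat can be checked with a few lines of arithmetic. This sketch assumes margin percentage is defined as margin dollars divided by extended cost, which matches the 120 / 100 / 20 figures in the text. The second row is made-up sample data.

```python
rows = [
    {"order_dollars": 120.0, "extended_cost": 100.0},
    {"order_dollars": 300.0, "extended_cost": 200.0},  # hypothetical second row
]
for r in rows:
    r["margin_dollars"] = r["order_dollars"] - r["extended_cost"]
    # Assumption: percentage is taken relative to extended cost.
    r["margin_pct"] = 100.0 * r["margin_dollars"] / r["extended_cost"]

# Additive components can simply be summed.
total_margin = sum(r["margin_dollars"] for r in rows)
total_cost = sum(r["extended_cost"] for r in rows)

# Summing the ratio itself gives a meaningless number ...
wrong_pct = sum(r["margin_pct"] for r in rows)
# ... so the ratio is recomputed from the summed components instead.
correct_pct = 100.0 * total_margin / total_cost
print(wrong_pct, correct_pct)
```

Storing the additive components (order dollars, extended cost) and deriving the ratio at query time avoids the problem.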
Table Deep, Not Wide
Typically a fact table contains fewer attributes than a dimension table. If
you lay the fact table out as a two-dimensional table, you will note that the fact table is narrow
with a small number of columns, but very deep with a large number of rows.
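Sparse Data
Not every combination of dimension values has a measurement; for example, no orders may be
received on a holiday. It is important to recognize this type of sparse data and understand that
the fact table could have gaps.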
Types of Dimensions
Various types of dimensions are available.
Conformed dimension
A sharable dimension, that is, one used by multiple fact tables. Ex: Time, Location, etc. can be
shared across fact tables, so we call them conformed dimensions.
Degenerate dimension
This is neither a dimension table nor a fact, but an attribute stored in the fact table. Ex:
InvoiceNo, TransactionID, SalesOrderNO, etc.
Role playing dimension
If a dimension is referenced by multiple foreign keys in the fact table, then it is a role playing
dimension. Ex: If the fact table has Enquiry_Date, Join_Date, and CourseStart_Date keys that all
reference the DimDate table, then the Date table is a role playing dimension table.
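A role playing dimension is typically queried by joining the same dimension table once per role, each time under a different alias. The sketch below uses the DimDate keys named in the example. The fact table name FactEnrollment, the other column names, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimDate (date_key INTEGER PRIMARY KEY, full_date TEXT);
-- Hypothetical fact table: three foreign keys, all pointing at DimDate.
CREATE TABLE FactEnrollment (
    student_id       INTEGER,
    Enquiry_Date     INTEGER REFERENCES DimDate(date_key),
    Join_Date        INTEGER REFERENCES DimDate(date_key),
    CourseStart_Date INTEGER REFERENCES DimDate(date_key)
);
INSERT INTO DimDate VALUES
    (20240101, '2024-01-01'), (20240110, '2024-01-10'), (20240201, '2024-02-01');
INSERT INTO FactEnrollment VALUES (1, 20240101, 20240110, 20240201);
""")

# The one physical DimDate table plays three roles through three aliases.
row = conn.execute("""
SELECT enq.full_date, joi.full_date, cs.full_date
FROM FactEnrollment f
JOIN DimDate enq ON enq.date_key = f.Enquiry_Date
JOIN DimDate joi ON joi.date_key = f.Join_Date
JOIN DimDate cs  ON cs.date_key  = f.CourseStart_Date
""").fetchone()
print(row)
```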
Dirty dimension
If a dimension table has different non-key values for the same business key, so that it is
difficult to identify the business operation, then it is a dirty dimension.
Ex:
ID  NAME  LOC
1   RAVI  HYD
1   RAVI  MUM
1   RAVI  BGLR
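One simple way to spot a dirty dimension is to group rows by business key and flag keys that carry more than one distinct set of non-key values. The sketch below uses the rows from the example plus one extra clean row (ID 2), which is an assumption added for contrast.

```python
from collections import defaultdict

# (ID, NAME, LOC); the last row is a hypothetical clean record.
rows = [
    (1, "RAVI", "HYD"),
    (1, "RAVI", "MUM"),
    (1, "RAVI", "BGLR"),
    (2, "ANIL", "HYD"),
]

# Collect the distinct non-key value combinations seen for each business key.
values_by_key = defaultdict(set)
for business_key, name, loc in rows:
    values_by_key[business_key].add((name, loc))

# A key with more than one combination is a candidate dirty record.
dirty_keys = sorted(k for k, v in values_by_key.items() if len(v) > 1)
print(dirty_keys)
```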
Junk Dimension
If a dimension holds non-business data and is used to store statuses, indicators, flags, or other
miscellaneous information, then it is a junk dimension.
Ex: Country and currency code table -> Junk table
Gender information table -> Junk table
Status or flag table -> On / Off, Current / Expired, etc.
Inferred Dimensions
While loading fact records, a dimension record may not yet be ready. One solution is to generate
a surrogate key with null for all the other attributes. This should technically be called an
inferred member, but is often called an inferred dimension.
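The inferred member approach can be sketched as a lookup that creates a placeholder row when the business key is unknown. All names here (dim_customer, get_customer_key, the is_inferred flag) and the sample keys are hypothetical. The flag in particular is an assumption that makes placeholders easy to find and fill in later.

```python
import itertools

surrogate_keys = itertools.count(1)   # simple surrogate key generator
dim_customer = {}                     # business key -> dimension row

def get_customer_key(business_key):
    """Return the surrogate key, creating an inferred member if needed."""
    row = dim_customer.get(business_key)
    if row is None:
        # Dimension record not loaded yet: create a placeholder with a new
        # surrogate key and null for every other attribute.
        row = {"customer_key": next(surrogate_keys),
               "name": None, "city": None, "is_inferred": True}
        dim_customer[business_key] = row
    return row["customer_key"]

# One customer is already known; the other arrives first via the fact load.
dim_customer["C001"] = {"customer_key": next(surrogate_keys),
                        "name": "Ravi", "city": "HYD", "is_inferred": False}
incoming_facts = [("C001", 50.0), ("C002", 75.0)]
fact_rows = [{"customer_key": get_customer_key(bk), "amount": amt}
             for bk, amt in incoming_facts]
print(fact_rows)
```

When the real record for C002 arrives, the load would update the placeholder's attributes and keep the same surrogate key, so the fact rows need no change.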
Static Dimensions
Static dimensions are not extracted from the original data source, but are created within the
context of the data warehouse. A static dimension can be loaded manually, for example with
status codes, or it can be generated by a procedure, such as a date or time dimension.
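As an illustration of a dimension generated by a procedure, the following sketch builds a small date dimension. The function name build_date_dimension and the chosen columns are assumptions. A real date dimension usually carries many more attributes (quarter, fiscal period, holiday flag, and so on).

```python
from datetime import date, timedelta

def build_date_dimension(start, end):
    """Generate one row per calendar day from start to end, inclusive."""
    rows = []
    day = start
    while day <= end:
        rows.append({
            "date_key": int(day.strftime("%Y%m%d")),  # e.g. 20240101
            "full_date": day.isoformat(),
            "year": day.year,
            "month": day.month,
            "iso_weekday": day.isoweekday(),          # Monday = 1 ... Sunday = 7
        })
        day += timedelta(days=1)
    return rows

dim_date = build_date_dimension(date(2024, 1, 1), date(2024, 1, 7))
print(len(dim_date), dim_date[0])
```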
Types of Facts:
Fully Additive Facts-
– Can be summed across any and all dimensions
– Stored in fact table
– Examples: revenue, quantity
Semi Additive Facts-
– Can be summed across some dimensions but not all
– Anything that measures a “level” (for example, a balance)
– Must be careful with ad-hoc reporting
– Often aggregated across the “forbidden dimension” by averaging
Non Additive Facts-
– Cannot be summed across any dimension
– All ratios are non-additive
– Break down into fully additive components and store them in the fact table
Additive
Additive facts are facts that can be summed up through all of the dimensions in the fact table.
Eg: Sales fact
Semi-Additive
Semi-additive facts are facts that can be summed up for some of the dimensions in the fact
table, but not the others. Eg: Daily balances fact can be summed up through the customers
dimension but not through the time dimension.
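The daily balance example can be worked through in a short sketch. The customers, dates, and balances are made-up sample data.

```python
# Hypothetical daily balance facts: (customer, day, balance)
balances = [
    ("A", "2024-01-01", 100.0),
    ("B", "2024-01-01", 50.0),
    ("A", "2024-01-02", 120.0),
    ("B", "2024-01-02", 50.0),
]

# Summing across customers for one day is meaningful: total held that day.
total_by_day = {}
for customer, day, balance in balances:
    total_by_day[day] = total_by_day.get(day, 0.0) + balance

# Summing across days is not a balance anyone ever held.
misleading_sum_over_time = sum(total_by_day.values())
# Across time, an average or the closing value is used instead.
average_over_time = misleading_sum_over_time / len(total_by_day)
closing_balance = total_by_day[max(total_by_day)]  # latest day (ISO dates sort as text)
print(total_by_day, average_over_time, closing_balance)
```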
Non-Additive
Non-additive facts are facts that cannot be summed up for any of the dimensions present in the
fact table. Eg: Facts that hold percentages or calculated ratios.
Cumulative
This type of fact table describes what has happened over a period of time. For example, this
fact table may describe the total sales by product by store by day. The facts for this type of fact
tables are mostly additive facts. The first example presented here is a cumulative fact table.
Eg: Sales fact
The following is an example of a table with a natural key (SSN column) along with some sample
data. Notice that the key for the data in this table has business meaning.
Natural Key Pros
– Key values have business meaning and can be used as a search key when querying the
  table.
– Column(s) and primary key index already exist, so no extra disk space is required for the
  extra column/index that would be used by a surrogate key column.
– Fewer table joins since join columns have meaning. For example, this can reduce disk
  IO by not having to perform extra reads on a lookup table.
Natural Key Cons
– May need to change/rework key if business requirements change. For example, if you
  used SSN for your employee as in the example above and your company expands
  outside of the United States not all employees would have a SSN so you would have to
  come up with a new key.
– More difficult to maintain if key requires multiple columns. It's much easier from the
  application side dealing with a key column that is constructed with just a single column.
– Poorer performance since key value is usually larger and/or is made up of multiple
  columns. Larger keys will require more IO both when inserting/updating data as well as
  when you query.
– Can't enter record until key value is known. It's sometimes beneficial
  for an application to load a placeholder record in one table then load other tables and
  then come back and update the main table.
– Can sometimes be difficult to pick a good key. There might be multiple candidate keys
  each with their own trade-offs when it comes to design and/or performance.
Surrogate Key Pros
– No business logic in key so no changes based on business requirements. For example, if
  the Employee table above used an integer surrogate key you could simply add a separate
  column for SIN if you added an office in Canada (to be used in place of SSN).
– Less code if maintaining same key strategy across all entities. For example, application
  code can be reused when referencing primary keys if they are all implemented as a
  sequential integer.
– Better performance since key value is smaller. Less disk IO is required when
  accessing single column indexes.
– Surrogate key is guaranteed to be unique. For example, when moving data between test
  systems you don't have to worry about duplicate keys since a new key will be generated as
  data is inserted.
– If a sequence is used then there is little index maintenance required since the value is ever
  increasing, which leads to less index fragmentation.
Surrogate Key Cons
– Extra column(s)/index for surrogate key will require extra disk space.
– Extra column(s)/index for surrogate key will require extra IO when inserting/updating data.
– Requires more table joins to child tables since data has no meaning on its own.
– Can have duplicate values of natural key in table if there is no other unique constraint
  defined on the natural key.
– Difficult to differentiate between test and production data. For example, since
  surrogate key values are just auto-generated values with no business meaning it's hard
  to tell if someone took production data and loaded it into a test environment.
– Key value has no relation to data so technically design breaks 3NF.
– The surrogate key value can't be used as a search key.
– Different implementations are required based on database platform. For example, SQL
  Server identity columns are implemented a little differently than they are in
  Postgres or DB2.
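Several of the points above can be seen in a small SQLite sketch: an integer surrogate key, a natural key kept as an ordinary column, and a unique constraint that stops duplicate natural keys. The table layout, column names, and all identifier values are made up for illustration. Other database platforms declare auto-generated keys with different syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee (
    employee_id INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key, no business meaning
    ssn  TEXT UNIQUE,   -- natural key kept as an ordinary column (made-up values below)
    sin  TEXT UNIQUE,   -- added later for a Canadian office; the key is unaffected
    name TEXT NOT NULL
);
""")
conn.execute("INSERT INTO employee (ssn, name) VALUES ('123-45-6789', 'Ann')")
conn.execute("INSERT INTO employee (sin, name) VALUES ('046-454-286', 'Bob')")
ids = [r[0] for r in conn.execute(
    "SELECT employee_id FROM employee ORDER BY employee_id")]

# Without the UNIQUE constraint, this second row with the same SSN would be
# accepted, because the surrogate key alone never collides.
try:
    conn.execute("INSERT INTO employee (ssn, name) VALUES ('123-45-6789', 'Ann again')")
    duplicate_ssn_rejected = False
except sqlite3.IntegrityError:
    duplicate_ssn_rejected = True
print(ids, duplicate_ssn_rejected)
```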
An Entity–relationship model (ER model) describes the structure of a database with the help of
a diagram, which is known as Entity Relationship Diagram (ER Diagram). An ER model is a
design or blueprint of a database that can later be implemented as a database. The main
components of E-R model are: entity set and relationship set.
What is an Entity Relationship Diagram (ER Diagram)?
An ER diagram shows the relationship among entity sets. An entity set is a group of similar
entities, and these entities can have attributes. In terms of a DBMS, an entity is a table or an
attribute of a table in the database, so by showing the relationships among tables and their
attributes, an ER diagram shows the complete logical structure of a database. Let's have a look
at a simple ER diagram to understand this concept.
A simple ER Diagram:
In the following diagram we have two entities Student and College and their relationship. The
relationship between Student and College is many to one as a college can have many students
however a student cannot study in multiple colleges at the same time. Student entity has
attributes such as Stu_Id, Stu_Name & Stu_Addr and College entity has attributes such as
Col_ID & Col_Name.
Here are the geometric shapes and their meaning in an E-R Diagram. We will discuss these
terms in detail in the next section (Components of an ER Diagram) of this guide, so don’t worry
too much about these terms now; just go through them once.
Components of an ER Diagram
1. Entity
An entity is an object or component of data. An entity is represented as a rectangle in an ER
diagram.
For example: In the following ER diagram we have two entities Student and College and these
two entities have many to one relationship as many students study in a single college. We will
read more about relationships later, for now focus on entities.
Weak Entity:
An entity that cannot be uniquely identified by its own attributes and relies on its relationship
with another entity is called a weak entity. A weak entity is represented by a double rectangle.
For example, a bank account cannot be uniquely identified without knowing the bank to which
the account belongs, so bank account is a weak entity.
2. Attribute
An attribute describes a property of an entity. There are four types of attributes:
1. Key attribute
2. Composite attribute
3. Multivalued attribute
4. Derived attribute
1. Key attribute:
A key attribute can uniquely identify an entity from an entity set. For example, student roll
number can uniquely identify a student from a set of students. A key attribute is represented by
an oval, the same as other attributes, but the text of the key attribute is underlined.
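2. Composite attribute:
An attribute that is a combination of other attributes is known as a composite attribute. For
example, an address is composed of other attributes such as pin code, state, and country.
3. Multivalued attribute:
An attribute that can hold multiple values is known as a multivalued attribute. It is represented
with double ovals in an ER Diagram. For example, a person can have more than one phone
number, so the phone number attribute is multivalued.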
4. Derived attribute:
A derived attribute is one whose value is dynamic and derived from another attribute. It is
represented by dashed oval in an ER Diagram. For example – Person age is a derived attribute
as it changes over time and can be derived from another attribute (Date of birth).
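A derived attribute is normally computed when needed instead of being stored. A minimal sketch for the age example follows. The function name age_on and the sample dates are illustrative.

```python
from datetime import date

def age_on(date_of_birth, today):
    """Derive age in whole years from the stored date of birth."""
    years = today.year - date_of_birth.year
    # Subtract one year if this year's birthday has not been reached yet.
    if (today.month, today.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years

dob = date(1990, 6, 15)
print(age_on(dob, date(2024, 6, 14)), age_on(dob, date(2024, 6, 15)))
```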
3. Relationship
Cardinality
Cardinality defines the number of occurrences in one entity that can be associated with
the number of occurrences in another. For example, one student can join multiple courses
[1:Many]. Multiple cardinalities are available (1:1, 1:Many, Many:1, Many:Many).
1. One to One
2. One to Many
3. Many to One
4. Many to Many
1. One to One Relationship
When a single instance of an entity is associated with a single instance of another entity then it
is called one to one relationship. For example, a person has only one passport and a passport is
given to one person.
Note:
1. With one to one cardinality, single-direction and bidirectional relationships behave the same.
2. No bridge table is required.
2. One to Many Relationship
When a single instance of an entity is associated with more than one instance of another
entity then it is called one to many relationship. For example, a customer can place many
orders but an order cannot be placed by many customers.
3. Many to One Relationship
When more than one instance of an entity is associated with a single instance of another
entity then it is called many to one relationship. For example, many students can study in a
single college but a student cannot study in many colleges at the same time.
4. Many to Many Relationship
When more than one instance of an entity is associated with more than one instance of
another entity then it is called many to many relationship. For example, a student can be
assigned to many projects and a project can be assigned to many students.
Note:
Total participation of an entity set means that each entity in the entity set must take part in at
least one relationship in the relationship set. For example: In the below diagram each college
must have at least one associated Student.
How bidirectional filtering helps in practice
1-Many Single Direction
DimCourse to Fact has a one to many relationship with single-direction filtering. If you want to
know how many unique courses were used in a year in the Fact table, you get the count of all
courses as the result however you slice (by year, location, or anything else), because filters on
the fact side do not flow back to DimCourse.
Bidirectional relationship
Here the DimCourse to Fact relationship filters in both directions, so if you slice on any column
(Ex: year), you get only the courses actually used in the fact table for that slice.
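The difference between the two filter directions can be imitated in plain Python. This is a simplified simulation of filter propagation, not how any particular BI engine is implemented. The course names and fact rows are made-up sample data.

```python
# Hypothetical data: DimCourse has four courses; "Azure" is never used in the fact.
dim_course = ["SQL", "Power BI", "Python", "Azure"]
fact = [  # (course, year)
    ("SQL", 2023), ("Power BI", 2023),
    ("SQL", 2024), ("Python", 2024), ("Power BI", 2024),
]

def distinct_courses_single_direction(year):
    # The year slice restricts the fact rows, but that restriction does not
    # flow back to DimCourse, so the dimension is counted in full.
    return len(set(dim_course))

def distinct_courses_bidirectional(year):
    # The year slice restricts the fact rows, and the surviving fact rows in
    # turn restrict which DimCourse rows are visible.
    used = {course for course, y in fact if y == year}
    return len({c for c in dim_course if c in used})

for y in (2023, 2024):
    print(y, distinct_courses_single_direction(y), distinct_courses_bidirectional(y))
```

With single direction the count stays at 4 for every year. With bidirectional filtering it follows the courses that actually appear in the fact rows for that year.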