Data Modeling

The document discusses different types of data models including conceptual, logical, and physical models. It describes how conceptual models define business objects and relationships, logical models add more detail, and physical models represent the actual database structure. The document also covers different types of dimensions, facts, and keys used in data modeling.

Uploaded by

Ashok kari
Copyright
© All Rights Reserved

Data Modeling

Data Model:

Presents the relationship between objects / tables / queries to provide valid data.

Advantages:
a) Report accuracy
b) Data from multiple sources (files, tables, cloud sources, etc.) can be used in a single view.
For example, Excel data can relate to a flat file, and a flat file can relate to a table, with
different cardinalities, so that all of them can be used in a single visual or report.

Types of models [Business End-End Models]

A general understanding of the three data models is that a business analyst uses the conceptual
and logical models to model the business objects that exist in the system, while a database
designer or database engineer elaborates the conceptual and logical ER models to produce the
physical model, which presents the physical database structure ready for database creation. The
sections below describe the differences between the three data models.

Conceptual data model

Conceptual ERD models the business objects that should exist in a system and the relationships
between them. A conceptual model is developed to present an overall picture of the system by
recognizing the business objects involved. It defines what entities exist, NOT which tables. For
example, 'many to many' tables may exist in a logical or physical data model but they are just
shown as a relationship with no cardinality under the conceptual data model.
NOTE: Conceptual ERD supports the use of generalization to model the 'a kind of' relationship
between two entities; for instance, Triangle is a kind of Shape. The usage is like generalization
in UML. Notice that only the conceptual ERD supports generalization.

Logical data model

Logical ERD is a detailed version of a Conceptual ERD. A logical ER model is developed to enrich
a conceptual model by defining explicitly the columns in each entity and introducing
operational and transactional entities. Although a logical data model is still independent of the
actual database system in which the database will be created, you can still take that into
consideration if it affects the design.

Physical data model

Physical ERD represents the actual design blueprint of a relational database. A physical data
model elaborates on the logical data model by assigning each column a type, length, nullability,
etc. Since a physical ERD represents how data should be structured and related in a specific
DBMS, it is important to consider the conventions and restrictions of the actual database system
in which the database will be created. Make sure the column types are supported by the DBMS
and that reserved words are not used in naming entities and columns.
NOTE: A many-to-many relationship between Order and Product is physically implemented with one
bridge (intermediate) table and two one-to-many relationships.
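
The note above can be sketched with Python's built-in sqlite3 module. This is only an illustrative
sketch: the table names (orders, product, order_product), the columns and the sample rows are
assumptions, not part of the original notes. The order table is named orders because ORDER is a
reserved word in SQL.

```python
import sqlite3

# In-memory database; all table and column names below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders  (order_id   INTEGER PRIMARY KEY, order_date TEXT NOT NULL);
CREATE TABLE product (product_id INTEGER PRIMARY KEY, name       TEXT NOT NULL);
-- Bridge table: one row per (order, product) pair, so each side is 1-to-many.
CREATE TABLE order_product (
    order_id   INTEGER NOT NULL REFERENCES orders(order_id),
    product_id INTEGER NOT NULL REFERENCES product(product_id),
    quantity   INTEGER NOT NULL,
    PRIMARY KEY (order_id, product_id)
);
INSERT INTO orders  VALUES (1, '2024-01-05'), (2, '2024-01-06');
INSERT INTO product VALUES (10, 'Pen'), (20, 'Notebook');
INSERT INTO order_product VALUES (1, 10, 3), (1, 20, 1), (2, 10, 5);
""")

# Resolve the many-to-many relationship by joining through the bridge table.
rows = conn.execute("""
    SELECT o.order_id, p.name, op.quantity
    FROM orders o
    JOIN order_product op ON op.order_id  = o.order_id
    JOIN product p        ON p.product_id = op.product_id
    ORDER BY o.order_id, p.name
""").fetchall()
print(rows)
```

Order 1 contains two products and the product Pen appears in two orders, which is the
many-to-many case expressed through two one-to-many relationships.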

Two Different Types of Physical Models


OLTP Terminology

Entity: A table (each row is one instance of the entity)
Tuple: A row (a collection of column values)
Attribute: Column
Primary Key: A column (or set of columns) that takes only unique values and is never NULL
Foreign Key: A column that refers to the primary key of another table; the value is unique in
the referenced table but may repeat in the referencing table
Dimension: Textual attribute. [Does not support aggregations {sum, avg, min, etc.}]
Fact: Measurable attribute. [Supports aggregations]
SalesID [Numeric]: 1000, 2000, 3000 -- never used for aggregations, so dimension
Salesincome [Numeric]: 1000, 2000, 3000 -- supports aggregations, so measure / fact

Dimension Table & Features


DEFINITION: A dimension is a textual attribute, and a table made up of a set of textual
attributes is a dimension table.
Attributes not directly related
Frequently you will find that some of the attributes in a dimension table are not directly related
to the other attributes in the table.

For example, package size is not directly related to product brand; nevertheless, package size
and product brand could both be attributes of the product dimension table.

Not normalized
The attributes in a dimension table are used over and over again in queries. An attribute is
taken as a constraint in a query and applied directly to the metrics in the fact table.

For efficient query performance, it is best if the query picks up an attribute from the dimension
table and goes directly to the fact table and not through other intermediary tables. If you
normalize the dimension table, you will be creating such intermediary tables and that will not
be efficient. Therefore, a dimension table is flattened out, not normalized.

Drilling down, rolling up


The attributes in a dimension table provide the ability to get to the details from higher levels of
aggregation to lower levels of details. For example, the three attributes zip, city, and state form
a hierarchy. You may get the total sales by state, then drill down to total sales by city, and then
by zip. Going the other way, you may first get the totals by zip, and then roll up to totals by city
and state.
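
A minimal sketch of rolling up and drilling down along the zip, city, state hierarchy, in plain
Python. The sample rows and the helper name total_by are hypothetical and only illustrate the idea.

```python
from collections import defaultdict

# Hypothetical fact rows already joined to the customer dimension attributes.
sales = [
    {"state": "TS", "city": "Hyderabad", "zip": "500001", "amount": 100},
    {"state": "TS", "city": "Hyderabad", "zip": "500002", "amount": 50},
    {"state": "TS", "city": "Warangal",  "zip": "506001", "amount": 30},
    {"state": "MH", "city": "Mumbai",    "zip": "400001", "amount": 70},
]

def total_by(rows, *levels):
    """Sum the amount grouped by the given hierarchy levels."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[level] for level in levels)
        totals[key] += row["amount"]
    return dict(totals)

by_state = total_by(sales, "state")                # highest level of aggregation
by_city = total_by(sales, "state", "city")         # drill down one level
by_zip = total_by(sales, "state", "city", "zip")   # lowest level of detail

print(by_state)
print(by_city)
```

Adding a level to the grouping is a drill down; removing one is a roll up.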

Multiple hierarchies
In the example of the customer dimension, there is a single hierarchy going up from individual
customer to zip, city, and state. But dimension tables often provide for multiple hierarchies, so
that drilling down may be performed along any of the multiple hierarchies.

Fewer records

A dimension table typically has fewer records or rows than the fact table. A product
dimension table for an automaker may have just 500 rows. On the other hand, the fact table
may contain millions of rows.

Facts or Measures or Measure Group or Fact Table

A fact is a measurable attribute. A table made up of a set of measurable attributes is a fact
table (also called a measure group table). It contains two sections: a) Foreign key section
b) Measures section

Concatenated Key (combination of foreign keys)
A row in the fact table relates to a combination of rows from all the dimension tables. The
primary key of the fact table must be the concatenation of the primary keys of all the
dimension tables.

Data Grain
This is an important characteristic of the fact table. The data grain is the level of detail
for the measurements or metrics. For example, if we keep the quantity ordered as the quantity
of a specific product for each month rather than for each individual order, then the data grain
is different and is at a higher level.

Fully Additive Measures
Let us look at the attributes order dollars, extended cost, and quantity ordered. Each of these
relates to a particular product on a certain date for a specific customer procured by an
individual sales representative. When we run queries to aggregate measures in the fact table,
we have to make sure that these measures are fully additive. Otherwise, the aggregated numbers
may not show the correct totals.

Semi Additive Measures
Consider the margin percentage attribute in the fact table. For example, if the order dollars is
120 and the extended cost is 100, the margin dollars is 20 and the margin percentage, taken
relative to cost, is 20. This is a calculated metric derived from the order dollars and extended
cost. You cannot add up the margin percentages from several rows to get the aggregated figure.
This section groups such derived attributes under semi additive measures; under the
classification in Types of Facts later in this document, ratios such as margin percentage are
non-additive. Distinguish these measures from fully additive measures when you perform
aggregations in queries.

Table Deep, Not Wide
Typically a fact table contains fewer attributes than a dimension table. If you lay the fact
table out as a two-dimensional table, you will note that the fact table is narrow with a small
number of columns, but very deep with a large number of rows.

Sparse Data
A fact table row exists only for those combinations of dimension values where a measurement
actually occurred, so many possible combinations have no row. It is important to recognize this
type of sparse data and understand that the fact table could have gaps.

Type of Dimensions
Various types of dimensions are available.

Conformed dimension
A sharable dimension, where the same dimension is used by multiple fact tables. Ex: Time,
Location, etc. tables can be shared, so we call them conformed dimensions.

Degenerate dimension
A dimension key that is stored in the fact table without a dimension table of its own; it is
neither a regular dimension nor a fact. Ex: InvoiceNo, TransactionID, SalesOrderNO, etc.

Role playing dimension
If a single dimension table is referenced by multiple foreign keys in the fact table, then it is
a role playing dimension. Ex: If the DimDate table is referenced by the Enquiry_Date, Join_Date,
and CourseStart_Date keys in the fact table, then the Date table is a role playing dimension table.
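
As a sketch of the role playing case, the same date table can be joined once per role under a
different alias. The example uses Python's sqlite3 module; the table names dim_date and
fact_enrollment, the key format and the sample row are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, calendar_date TEXT NOT NULL);
-- Three foreign keys in the fact table point at the same dimension table.
CREATE TABLE fact_enrollment (
    enquiry_date_key      INTEGER REFERENCES dim_date(date_key),
    join_date_key         INTEGER REFERENCES dim_date(date_key),
    course_start_date_key INTEGER REFERENCES dim_date(date_key),
    fee                   REAL
);
INSERT INTO dim_date VALUES
    (20240101, '2024-01-01'), (20240110, '2024-01-10'), (20240201, '2024-02-01');
INSERT INTO fact_enrollment VALUES (20240101, 20240110, 20240201, 500.0);
""")

# Each alias (enq, jn, st) is one "role" played by the single dim_date table.
row = conn.execute("""
    SELECT enq.calendar_date, jn.calendar_date, st.calendar_date
    FROM fact_enrollment f
    JOIN dim_date enq ON enq.date_key = f.enquiry_date_key
    JOIN dim_date jn  ON jn.date_key  = f.join_date_key
    JOIN dim_date st  ON st.date_key  = f.course_start_date_key
""").fetchone()
print(row)
```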

Dirty dimension
If a dimension table has different non-key values for the same business key, making it difficult
to identify the business operation, then it is a dirty dimension.
Ex:
ID  NAME  LOC
1   RAVI  HYD
1   RAVI  MUM
1   RAVI  BGLR

Junk Dimension
If a dimension holds non-business data and is used to store status values, indicators, flags, or
any other kind of miscellaneous information, then it is a junk dimension.
Ex: Country and Currency code table -> Junk table
Gender information table -> Junk table
Status or flag table -> On / Off, Current / Expired, etc.

Rapidly Changing Dimensions


A dimension attribute that changes frequently is a rapidly changing attribute. If you don’t need
to track the changes, the rapidly changing attribute is no problem, but if you do need to track
the changes, using a standard slowly changing dimension technique can result in a huge
inflation of the size of the dimension. One solution is to move the attribute to its own
dimension, with a separate foreign key in the fact table. This new dimension is called a rapidly
changing dimension.

Slowly Changing Dimensions


An attribute of a dimension may undergo changes over time. Whether the history of changes to a
particular attribute should be preserved in the data warehouse depends on the business
requirement. Such an attribute is called a slowly changing attribute, and a dimension containing
such an attribute is called a slowly changing dimension.

Inferred Dimensions While loading fact records, a dimension record may not yet be ready. One
solution is to generate a surrogate key with null for all the other attributes. This should
technically be called an inferred member, but is often called an inferred dimension.
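
A minimal sketch of the inferred member approach in plain Python. The names dim_customer and
lookup_or_infer, the inferred flag and the sample business keys are hypothetical; a real load
would do the same lookup against the dimension table.

```python
from itertools import count

_surrogate_keys = count(1)   # simple sequence used to generate surrogate keys
dim_customer = {}            # business key -> dimension row (hypothetical store)

def lookup_or_infer(business_key):
    """Return the surrogate key, creating a placeholder row if none exists yet."""
    row = dim_customer.get(business_key)
    if row is None:
        # Inferred member: surrogate key only, all other attributes left null.
        row = {"customer_key": next(_surrogate_keys),
               "name": None, "city": None, "inferred": True}
        dim_customer[business_key] = row
    return row["customer_key"]

# Fact rows arrive before the customer dimension has been loaded.
incoming = [("C100", 250), ("C200", 75), ("C100", 40)]
fact_rows = [{"customer_key": lookup_or_infer(bk), "amount": amt}
             for bk, amt in incoming]

# Later, when the real dimension record arrives, fill in the placeholder.
dim_customer["C100"].update(name="Ravi", city="HYD", inferred=False)
print(fact_rows)
```

The fact rows keep the same surrogate key before and after the placeholder is completed, so
nothing in the fact table has to be rewritten.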

Shrunken Dimensions
A shrunken dimension is a subset of another dimension. For example, the orders fact table may
include a foreign key for product, but the target fact table may include a foreign key only for
product category, which is in the product table, but much less granular. Creating a smaller
dimension table, with product category as its primary key, is one way of dealing with this
situation of heterogeneous grain. If the product dimension is snowflaked, there is probably
already a separate table for product category, which can serve as the shrunken dimension.
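
One way to picture this, as a sketch with made-up data: derive the category-level dimension from
the product dimension, then use it to key a fact table held at the coarser grain. The names
dim_product, orders_fact and target_fact are assumptions.

```python
# Hypothetical product dimension at the finest grain.
dim_product = [
    {"product_key": 1, "name": "Pen",      "category": "Stationery"},
    {"product_key": 2, "name": "Notebook", "category": "Stationery"},
    {"product_key": 3, "name": "Mouse",    "category": "Electronics"},
]

# Shrunken dimension: one row per product category, keyed by the category itself.
dim_product_category = sorted({row["category"] for row in dim_product})

# Orders fact is at product grain; the target fact is at category grain.
orders_fact = [
    {"product_key": 1, "amount": 10},
    {"product_key": 2, "amount": 20},
    {"product_key": 3, "amount": 300},
]
category_of = {row["product_key"]: row["category"] for row in dim_product}

target_fact = {}
for row in orders_fact:
    category = category_of[row["product_key"]]
    target_fact[category] = target_fact.get(category, 0) + row["amount"]

print(dim_product_category)
print(target_fact)
```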

Static Dimensions
Static dimensions are not extracted from the original data source, but are created within the
context of the data warehouse. A static dimension can be loaded manually — for example with
status codes — or it can be generated by a procedure, such as a date or time dimension.

Types of Facts:

Fully Additive Facts
– Can be summed across any and all dimensions
– Stored in fact table
Examples: revenue, quantity

Semi Additive Facts
– Can be summed across most dimensions but not all
Examples: inventory quantities, account balances, or personnel counts
– Be careful with anything that measures a “level” in ad hoc reporting; such facts are usually
aggregated across the “forbidden dimension” by averaging rather than summing

Non Additive Facts
– Cannot be summed across any dimension
– All ratios are non-additive
– Break them down into fully additive components and store those in the fact table
Additive
Additive facts are facts that can be summed up through all of the dimensions in the fact table.
Eg: Sales fact

Semi-Additive
Semi-additive facts are facts that can be summed up for some of the dimensions in the fact
table, but not the others. Eg: Daily balances fact can be summed up through the customers
dimension but not through the time dimension.
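
The daily balances case can be sketched in a few lines of Python. The sample balances are made up
for illustration.

```python
# Hypothetical daily balance snapshot: (date, customer, balance).
balances = [
    ("2024-01-01", "A", 100), ("2024-01-01", "B", 50),
    ("2024-01-02", "A", 120), ("2024-01-02", "B", 50),
]

# Summing across the customer dimension for one day is meaningful.
total_on_jan_2 = sum(b for d, c, b in balances if d == "2024-01-02")

# Summing across the time dimension for one customer is not: customer A
# never held 220. Use an average or the closing balance instead.
a_balances = [b for d, c, b in balances if c == "A"]
misleading_sum = sum(a_balances)
average_balance = sum(a_balances) / len(a_balances)
closing_balance = a_balances[-1]

print(total_on_jan_2, misleading_sum, average_balance, closing_balance)
```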

Non-Additive
Non-additive facts are facts that cannot be summed up for any of the dimensions present in the
fact table. Eg: Facts which have percentages, Ratios calculated.
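
A short sketch of why ratios are non-additive, reusing the order dollars and extended cost
measures from the margin example earlier in these notes. The second row and the choice to express
margin as a percentage of cost are assumptions for illustration.

```python
rows = [
    {"order_dollars": 120, "extended_cost": 100},
    {"order_dollars": 300, "extended_cost": 200},
]

def margin_pct(order_dollars, extended_cost):
    """Margin as a percentage of cost (an assumed definition for this sketch)."""
    return 100 * (order_dollars - extended_cost) / extended_cost

# Per-row ratios: 20% and 50%.
row_pcts = [margin_pct(r["order_dollars"], r["extended_cost"]) for r in rows]

# Adding (or even averaging) the ratios does not give the overall margin.
summed_pcts = sum(row_pcts)
averaged_pcts = summed_pcts / len(row_pcts)

# Correct approach: sum the additive components, then compute the ratio once.
total_orders = sum(r["order_dollars"] for r in rows)
total_cost = sum(r["extended_cost"] for r in rows)
overall_pct = margin_pct(total_orders, total_cost)

print(row_pcts, summed_pcts, averaged_pcts, overall_pct)
```

This is why the additive components, not the ratio, are the values to store in the fact table.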

Types of Fact Table


• Snapshot
• Cumulative
• Factless Fact Table

Snapshot
This type of fact table describes the state of things at a particular instant in time, and
usually includes more semi-additive and non-additive facts. The second example presented here is
a snapshot fact table.
Eg: Daily balances fact can be summed up through the customers dimension but not through
the time dimension.

Cumulative
This type of fact table describes what has happened over a period of time. For example, this
fact table may describe the total sales by product by store by day. The facts for this type of fact
tables are mostly additive facts. The first example presented here is a cumulative fact table.
Eg: Sales fact

Factless Fact Table


In the real world, it is possible to have a fact table that contains no measures or facts. These
tables are called “Factless Fact tables”.
Eg: A fact table which has only a product key and a date key is a factless fact table. There are
no measures in this table, but you can still get the number of products sold over a period of
time by counting rows.

A table without any value-added (additive) measures is called a factless fact table.
Features: Used for recording an event.
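
As a sketch, a factless fact table can be pictured as a list of key pairs with no measure column;
the answers come from counting rows. The key values below are made up.

```python
from collections import Counter

# Hypothetical factless fact rows: (date_key, product_key), no measures.
fact_product_sold = [
    (20240101, 10),
    (20240101, 20),
    (20240102, 10),
]

# Number of sale events per day, obtained purely by counting rows.
events_per_day = Counter(date_key for date_key, _ in fact_product_sold)

# Number of distinct products sold over the whole period.
distinct_products = len({product_key for _, product_key in fact_product_sold})

print(dict(events_per_day), distinct_products)
```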
SURROGATE KEY & REAL-TIME USAGES
1. A surrogate key is a system generated (could be GUID, sequence, etc.) value with no business
meaning that is used to uniquely identify a record in a table.
2. The key itself could be made up of one or multiple columns.
3. It is usually created in the dimension table and placed in the fact table to look up
dimension data quickly.
4. Surrogate keys are created using regular sequence generation mechanisms or custom algorithms.
5. Surrogate key values do not change.

The following diagram shows an example of a table with a surrogate key (AddressID column) along
with some sample data. Notice the key itself has no business meaning; it's just a sequential
integer.
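
A minimal sketch of a sequence-generated surrogate key, using SQLite through Python's sqlite3
module. The address table and its columns are assumptions loosely modeled on the AddressID example;
other database platforms use their own mechanisms (identity columns, sequences).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# address_id is the surrogate key: generated by the database, no business meaning.
conn.execute("""
    CREATE TABLE address (
        address_id INTEGER PRIMARY KEY AUTOINCREMENT,
        street     TEXT NOT NULL,
        city       TEXT NOT NULL
    )
""")

generated_ids = []
for street, city in [("1 Main St", "Hyderabad"), ("22 Lake Rd", "Mumbai")]:
    cur = conn.execute(
        "INSERT INTO address (street, city) VALUES (?, ?)", (street, city)
    )
    generated_ids.append(cur.lastrowid)   # key assigned by the database

print(generated_ids)
```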

Natural Key Overview


A natural key is a column or set of columns that already exist in the table (e.g. they are
attributes of the entity within the data model) and uniquely identify a record in the table. Since
these columns are attributes of the entity they obviously have business meaning.

The following is an example of a table with a natural key (SSN column) along with some sample
data. Notice that the key for the data in this table has business meaning.

Natural Key Pros
• Key values have business meaning and can be used as a search key when querying the
table
• Column(s) and primary key index already exist, so no extra disk space is required for the
extra column/index that would be used by a surrogate key column
• Fewer table joins since join columns have meaning. For example, this can reduce disk
IO by not having to perform extra reads on a lookup table

Natural Key Cons
• May need to change/rework the key if business requirements change. For example, if you
used SSN for your employee as in the example above and your company expands
outside of the United States, not all employees would have an SSN, so you would have to
come up with a new key.
• More difficult to maintain if the key requires multiple columns. It's much easier from the
application side to deal with a key that is constructed with just a single column.
• Poorer performance since the key value is usually larger and/or is made up of multiple
columns. Larger keys will require more IO both when inserting/updating data and
when you query.
• Can't enter a record until the key value is known. It's sometimes beneficial
for an application to load a placeholder record in one table, then load other tables, and
then come back and update the main table.
• Can sometimes be difficult to pick a good key. There might be multiple candidate keys,
each with its own trade-offs when it comes to design and/or performance.

Surrogate Key Pros
• No business logic in the key, so no changes based on business requirements. For example, if
the Employee table above used an integer surrogate key, you could simply add a separate
column for SIN if you added an office in Canada (to be used in place of SSN)
• Less code if maintaining the same key strategy across all entities. For example, application
code can be reused when referencing primary keys if they are all implemented as a
sequential integer
• Better performance since the key value is smaller. Less disk IO is required when
accessing single-column indexes.
• Surrogate key is guaranteed to be unique. For example, when moving data between test
systems you don't have to worry about duplicate keys, since a new key will be generated as
data is inserted.
• If a sequence is used, then there is little index maintenance required, since the value is
ever increasing, which leads to less index fragmentation.

Surrogate Key Cons
• Extra column(s)/index for the surrogate key will require extra disk space
• Extra column(s)/index for the surrogate key will require extra IO when inserting/updating data
• Requires more table joins to child tables since the data has no meaning on its own.
• Can have duplicate values of the natural key in the table if there is no other unique
constraint defined on the natural key
• Difficult to differentiate between test and production data. For example, since
surrogate key values are just auto-generated values with no business meaning, it's hard
to tell if someone took production data and loaded it into a test environment.
• Key value has no relation to the data, so technically the design breaks 3NF
• The surrogate key value can't be used as a search key
• Different implementations are required based on the database platform. For example, SQL
Server identity columns are implemented a little differently than they are in
Postgres or DB2.

RELATIONSHIP CONCEPTS IN DBMS

Entity Relationship Diagram – ER Diagram in DBMS

An Entity–relationship model (ER model) describes the structure of a database with the help of
a diagram, which is known as Entity Relationship Diagram (ER Diagram). An ER model is a
design or blueprint of a database that can later be implemented as a database. The main
components of E-R model are: entity set and relationship set.
What is an Entity Relationship Diagram (ER Diagram)?
An ER diagram shows the relationship among entity sets. An entity set is a group of similar
entities, and these entities can have attributes. In terms of DBMS, an entity is a table or
an attribute of a table in a database, so by showing the relationships among tables and their
attributes, an ER diagram shows the complete logical structure of a database. Let's have a look
at a simple ER diagram to understand this concept.

A simple ER Diagram:
In the following diagram we have two entities Student and College and their relationship. The
relationship between Student and College is many to one as a college can have many students
however a student cannot study in multiple colleges at the same time. Student entity has
attributes such as Stu_Id, Stu_Name & Stu_Addr and College entity has attributes such as
Col_ID & Col_Name.

Here are the geometric shapes and their meaning in an E-R Diagram. We will discuss these
terms in detail in the next section (Components of an ER Diagram) of this guide, so don’t worry
too much about these terms now; just go through them once.

Rectangle: Represents Entity sets.
Ellipses: Attributes
Diamonds: Relationship Set
Lines: They link attributes to Entity Sets and Entity sets to Relationship Set
Double Ellipses: Multivalued Attributes
Dashed Ellipses: Derived Attributes
Double Rectangles: Weak Entity Sets
Double Lines: Total participation of an entity in a relationship set

Components of an ER Diagram

As shown in the above diagram, an ER diagram has three main components:


1. Entity
2. Attribute
3. Relationship

1. Entity
An entity is an object or component of data. An entity is represented as rectangle in an ER
diagram.
For example: In the following ER diagram we have two entities Student and College and these
two entities have many to one relationship as many students study in a single college. We will
read more about relationships later, for now focus on entities.
Weak Entity:
An entity that cannot be uniquely identified by its own attributes and relies on the relationship
with other entity is called weak entity. The weak entity is represented by a double rectangle.
For example – a bank account cannot be uniquely identified without knowing the bank to which
the account belongs, so bank account is a weak entity.

2. Attribute

An attribute describes the property of an entity. An attribute is represented as an oval in an
ER diagram. There are four types of attributes:

1. Key attribute
2. Composite attribute
3. Multivalued attribute
4. Derived attribute

1. Key attribute:

A key attribute can uniquely identify an entity from an entity set. For example, student roll
number can uniquely identify a student from a set of students. Key attribute is represented by
oval same as other attributes however the text of key attribute is underlined.
2. Composite attribute:

An attribute that is a combination of other attributes is known as a composite attribute. For
example, in the student entity, the student address is a composite attribute, as an address is
composed of other attributes such as pin code, state, and country.

3. Multivalued attribute:

An attribute that can hold multiple values is known as a multivalued attribute. It is represented
with double ovals in an ER Diagram. For example, a person can have more than one phone
number, so the phone number attribute is multivalued.

4. Derived attribute:

A derived attribute is one whose value is dynamic and derived from another attribute. It is
represented by dashed oval in an ER Diagram. For example – Person age is a derived attribute
as it changes over time and can be derived from another attribute (Date of birth).
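
The age example can be sketched as a small Python function. The function name age_on and the
sample dates are illustrative; the point is that age is computed from date of birth rather than
stored.

```python
from datetime import date

def age_on(date_of_birth, today):
    """Derive age in whole years from the stored date of birth."""
    years = today.year - date_of_birth.year
    # Subtract one year if this year's birthday has not happened yet.
    if (today.month, today.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years

dob = date(1990, 6, 15)
print(age_on(dob, date(2024, 6, 14)))
print(age_on(dob, date(2024, 6, 15)))
```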

E-R diagram with multivalued and derived attributes:

3. Relationship

A relationship is represented by a diamond shape in an ER diagram; it shows the relationship
among entities. There are four types of relationships, distinguished by cardinality.

Cardinality
Cardinality defines the number of occurrences in one entity that can be associated with
the number of occurrences in another. For example, one student can join multiple courses
[1:Many]. Multiple cardinalities are available (1:1, 1:Many, Many:1, Many:Many).

1. One to One
2. One to Many
3. Many to One
4. Many to Many
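
The four cardinalities can be illustrated by inspecting pairs of related instances. The helper
below is a hypothetical sketch: it only reports what the observed pairs show, whereas in a real
model the cardinality is a design decision.

```python
def cardinality(pairs):
    """Classify the relationship shown by a list of (left, right) instance pairs."""
    left_to_right = {}
    right_to_left = {}
    for left, right in pairs:
        left_to_right.setdefault(left, set()).add(right)
        right_to_left.setdefault(right, set()).add(left)
    # Does any left instance relate to several right instances, and vice versa?
    left_has_many = any(len(v) > 1 for v in left_to_right.values())
    right_has_many = any(len(v) > 1 for v in right_to_left.values())
    if left_has_many and right_has_many:
        return "Many to Many"
    if left_has_many:
        return "One to Many"
    if right_has_many:
        return "Many to One"
    return "One to One"

print(cardinality([("person1", "passport1"), ("person2", "passport2")]))
print(cardinality([("cust1", "order1"), ("cust1", "order2"), ("cust2", "order3")]))
print(cardinality([("stu1", "college1"), ("stu2", "college1")]))
print(cardinality([("stu1", "projA"), ("stu1", "projB"), ("stu2", "projA")]))
```
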
1. One to One Relationship

When a single instance of an entity is associated with a single instance of another entity then it
is called one to one relationship. For example, a person has only one passport and a passport is
given to one person.

Note:
1. For one-to-one cardinality, single-direction and bidirectional filtering give the same result.
2. No bridge table is required.

2. One to Many Relationship

When a single instance of an entity is associated with more than one instance of another
entity, it is called a one to many relationship. For example, a customer can place many
orders but an order cannot be placed by many customers.

Note:
1. For one-to-many cardinality, single-direction and bidirectional filtering give different
results.
2. No bridge table is required.

3. Many to One Relationship

When more than one instance of an entity is associated with a single instance of another
entity, it is called a many to one relationship. For example, many students can study in a
single college but a student cannot study in many colleges at the same time.

4. Many to Many Relationship

When more than one instance of an entity is associated with more than one instance of
another entity, it is called a many to many relationship. For example, a student can be
assigned to many projects and a project can be assigned to many students.

Note:
1. For many-to-many cardinality, single-direction and bidirectional filtering give different
results.
2. A bridge table is required.
3. A many-to-many relationship is split into a pair of one-to-many relationships.

Total Participation of an Entity set

Total participation of an entity set means that each entity in the entity set must take part in
at least one relationship in the relationship set. For example: in the diagram below, each
college must have at least one associated Student.

How Bidirectional is helpful in real-time

1-Many Single Direction
DimCourse to Fact has a one-to-many relationship. If I would like to know how many unique
courses are used in a year in the Fact table, I get the count of all courses as the result,
however I slice (year, location, or any other column).

Bidirectional relationship
Here the DimCourse to Fact relationship filters in both directions, so if you slice on any
column (Ex: year), you get only the courses that are used in the fact table for that slice.
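
The difference can be imitated in plain Python. This is a simplified model of filter direction,
not how any particular BI tool is implemented; the course names, years and function names are
made up.

```python
# Hypothetical course dimension and fact rows.
dim_course = ["Power BI", "SQL", "Python", "Tableau"]
fact = [
    {"year": 2023, "course": "Power BI"},
    {"year": 2023, "course": "SQL"},
    {"year": 2024, "course": "SQL"},
    {"year": 2024, "course": "Python"},
    {"year": 2024, "course": "Python"},
]

def course_count_single_direction(year):
    # The year filter applies to the fact table only; it does not flow back
    # to the dimension, so every course is counted whatever the slice.
    return len(dim_course)

def course_count_bidirectional(year):
    # The year filter on the fact table also filters the dimension, so only
    # courses that appear in the fact rows for that year are counted.
    used = {row["course"] for row in fact if row["year"] == year}
    return len([course for course in dim_course if course in used])

print(course_count_single_direction(2023), course_count_bidirectional(2023))
print(course_count_single_direction(2024), course_count_bidirectional(2024))
```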
