Unit I : Database Design
Normalization
Normalization is a systematic way of organizing the data in a relational database. It arranges the columns and tables so that the dependencies between them are enforced correctly through database constraints.
Its main goal is to minimize duplicated (redundant) data. It also removes the insertion, deletion, and update anomalies that redundancy causes, typically by splitting one large table into several smaller, normalized tables.
Relationships between the resulting tables (for example, foreign keys) link the data back together while keeping redundancy low. Normalization, also known as database normalization or data normalization, is an important part of relational database design because it helps to improve the accuracy and consistency of the data, and often the efficiency of updates.
SQL is the language used to interact with the database. SQL also works on unnormalized tables, but such tables are prone to insertion, deletion, and update anomalies, so normalizing the schema first makes the data easier to query and maintain correctly.
Normalization also helps in designing a structure in which every value is atomic (that is, it cannot be broken down into smaller parts).
Usually, we break large tables into smaller tables to reduce redundancy.
Edgar F. Codd defined the first normal form in 1970; further normal forms followed later.
When normalizing a database, organize the data into tables and columns, and make sure that each table contains only related data. If some data is not directly related, create a new table for it. Normalization ensures that:
• each table contains only data directly related to its primary key,
• each data field contains only one data element, and
• redundant (duplicated and unnecessary) data is removed.
In short, normalization organizes the data in a database to avoid data redundancy and the insertion, update, and deletion anomalies.
Inference Rules
The inference rules for functional dependencies are also known as Armstrong's Axioms. These rules govern the functional dependencies (FDs) in a relational database: using them, new functional dependencies can be derived from a given set of FDs. The rules were introduced by William W. Armstrong.
Prerequisites
• Attributes: A database is an organized collection of information. Suppose we have a table called “Student.” The columns of this table are also called “Attributes,” and each one stores a specific detail about the students. For example:
o Student_name: stores the names of the students.
o Roll_no: stores their roll numbers.
o Marks: stores their exam scores.
• Functional Dependencies (FDs): The attributes of a table can be logically related to each other. An FD X → Y states that any two rows with the same value of X also have the same value of Y. For example, Roll_no → Marks means that the Roll_no of a student determines that student's Marks, so Marks is functionally dependent on Roll_no.
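The idea of a functional dependency can be checked mechanically on sample data. The following is a minimal Python sketch, not a definitive implementation: the function name fd_holds and the sample rows are assumptions made for illustration, and a check on sample rows can only show that an FD is not contradicted by that data.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds in rows.

    rows is a list of dicts; lhs and rhs are lists of column names.
    """
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        value = tuple(row[c] for c in rhs)
        # setdefault stores the first value seen for this key and returns it,
        # so a different value for the same key means the FD is violated.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Hypothetical rows for the Student table (names and marks are made up).
students = [
    {"Roll_no": 1, "Student_name": "Asha", "Marks": 82},
    {"Roll_no": 2, "Student_name": "Ravi", "Marks": 82},
    {"Roll_no": 3, "Student_name": "Asha", "Marks": 67},
]

print(fd_holds(students, ["Roll_no"], ["Marks"]))  # True
print(fd_holds(students, ["Marks"], ["Roll_no"]))  # False: 82 maps to roll numbers 1 and 2
```

In this sample, Roll_no determines Marks, but Marks does not determine Roll_no because two students share the same marks.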
Inference Rules
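Six inference rules are commonly listed; the five defined below are reflexivity, augmentation, transitivity, union, and decomposition (the sixth, pseudo-transitivity, is not covered here).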
• Reflexive Rule: According to this rule, if B is a subset of A, then A logically determines B. Formally, if B ⊆ A then A → B.
o Example: Consider the Address (A) of a house, which consists of several parts such as House no, Street no, and City. Each of these is a subset of A. Thus, Address (A) → House no (B).
• Augmentation Rule: According to this rule, if A logically determines B, then adding the same extra attribute to both sides does not change the dependency. (This rule should not be confused with partial dependency, which is a separate concept used in 2NF.)
o Example: If A → B, then adding an extra attribute C to both sides gives AC → BC.
o Example: Let X be a customer ID, Y a customer name, and Z a birth date.
X → Y means that, given a customer ID, it is possible to determine the customer name.
XZ → YZ means that, given both a customer ID and a birth date, it is possible to determine the customer name and the birth date.
• Transitive Rule: The transitive rule states that if A determines B and B determines C, then A indirectly determines C.
o Example: If A → B and B → C then A → C.
o Example: Consider a student table whose primary key is roll_no and which also stores city and zip-code, neither of which is part of the primary key. In this table the city identifies the zip-code. So here roll_no → city and city → zip-code, which together give roll_no → zip-code. In other words, one non-key attribute can be found from another non-key attribute. For example, roll_no = 1 has city = pune, and city = pune has zip-code = 411044, so wherever the city is pune, the zip-code will be 411044.
• Union Rule: The union rule states that if A determines B and A determines C, then A determines BC.
o Example: If A → B and A → C then A → BC.
o In design terms: if two separate tables have the same primary key (PK), you may want to consider combining them. For example, if:
• SIN → EmpName
• SIN → SpouseName
you may want to join these two tables into one as follows:
SIN → EmpName, SpouseName
Some database administrators (DBAs) might choose to keep these tables separate for a couple of reasons. One, each table describes a different entity, so the entities should be kept apart. Two, if SpouseName is going to be NULL most of the time, there is no need to include it in the same table as EmpName.
• Decomposition Rule: This is the exact reverse of the union rule above. According to this rule, if A determines BC, then the dependency can be decomposed into A → B and A → C.
o Example: If A → BC then A → B and A → C.
o In design terms: if a table appears to contain two entities that are determined by the same PK, consider breaking it up into two tables.
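A common way to apply the inference rules in practice is to compute the closure of a set of attributes: every attribute that can be derived from it by repeatedly applying the given FDs. The sketch below is a minimal illustration in Python; the function names closure and implies are assumptions, and the FDs are the roll_no, city, and zip-code dependencies from the transitive rule example (with an extra attribute called name added for the augmentation case).

```python
def closure(attrs, fds):
    """Return the closure of attrs under fds.

    fds is a list of (lhs, rhs) pairs, each side a set of attribute names.
    """
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If everything on the left is already derived, add the right side.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def implies(fds, lhs, rhs):
    """Return True if lhs -> rhs can be derived from fds."""
    return set(rhs) <= closure(lhs, fds)


fds = [({"roll_no"}, {"city"}), ({"city"}, {"zip_code"})]

print(implies(fds, {"roll_no"}, {"zip_code"}))              # True (transitivity)
print(implies(fds, {"roll_no", "name"}, {"city", "name"}))  # True (augmentation)
print(implies(fds, {"zip_code"}, {"city"}))                 # False
```

An FD X → Y follows from a set of FDs exactly when Y is contained in the closure of X, so this one routine covers the reflexive, augmentation, transitive, union, and decomposition rules together.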
Normalization
Here are the most commonly used normal forms:
• First normal form(1NF)
• Second normal form(2NF)
• Third normal form(3NF)
First normal form (1NF)
As per the rule of first normal form, an attribute (column) of a table cannot hold multiple
values. It should hold only atomic values.
Example: Suppose a company wants to store the names and contact details of its
employees. It creates a table that looks like this:
emp_id   emp_name   emp_address   emp_mobile
101      Herschel   New Delhi     8912312390
102      Jon        Kanpur        8812121212, 9900012222
103      Ron        Chennai       7778881212
104      Lester     Bangalore     9990000123, 8123450987
This table is not in 1NF. The rule says “each attribute of a table must have atomic (single) values”, and the emp_mobile values for employees Jon and Lester violate that rule because each holds two numbers.
To make the table comply with 1NF, store each mobile number in its own row:
emp_id   emp_name   emp_address   emp_mobile
101      Herschel   New Delhi     8912312390
102      Jon        Kanpur        8812121212
102      Jon        Kanpur        9900012222
103      Ron        Chennai       7778881212
104      Lester     Bangalore     9990000123
104      Lester     Bangalore     8123450987
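The conversion to 1NF can be sketched in Python as follows. This is only an illustration under assumptions: the function name to_first_normal_form is hypothetical, and the multi-valued emp_mobile column is modelled as a Python list.

```python
def to_first_normal_form(rows, multi_valued_column):
    """Emit one row per value of multi_valued_column so every cell is atomic."""
    flat = []
    for row in rows:
        for value in row[multi_valued_column]:
            atomic_row = dict(row)  # copy the other columns unchanged
            atomic_row[multi_valued_column] = value
            flat.append(atomic_row)
    return flat


# The employee table from the example, with emp_mobile held as a list.
employees = [
    {"emp_id": 101, "emp_name": "Herschel", "emp_address": "New Delhi",
     "emp_mobile": ["8912312390"]},
    {"emp_id": 102, "emp_name": "Jon", "emp_address": "Kanpur",
     "emp_mobile": ["8812121212", "9900012222"]},
    {"emp_id": 103, "emp_name": "Ron", "emp_address": "Chennai",
     "emp_mobile": ["7778881212"]},
    {"emp_id": 104, "emp_name": "Lester", "emp_address": "Bangalore",
     "emp_mobile": ["9990000123", "8123450987"]},
]

flat = to_first_normal_form(employees, "emp_mobile")
for row in flat:
    print(row["emp_id"], row["emp_name"], row["emp_address"], row["emp_mobile"])
print(len(flat))  # 6
```

Note that after this step emp_id alone no longer identifies a row; the pair (emp_id, emp_mobile) does.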
Second normal form (2NF)
A table is said to be in 2NF if both of the following conditions hold:
• The table is in 1NF (first normal form).
• No non-prime attribute is dependent on a proper subset of any candidate key of the table.
An attribute that is not part of any candidate key is known as a non-prime attribute.
Example: Suppose a school wants to store the data of teachers and the subjects they teach. Since a teacher can teach more than one subject, the table can have multiple rows for the same teacher. The school creates a table that looks like this:
teacher_id   subject     teacher_age
111          Maths       38
111          Physics     38
222          Biology     38
333          Physics     40
333          Chemistry   40
Candidate key: {teacher_id, subject}
Non-prime attribute: teacher_age
The table is in 1NF because each attribute has atomic values. However, it is not in 2NF because the non-prime attribute teacher_age depends on teacher_id alone, which is a proper subset of the candidate key. This violates the rule for 2NF: “no non-prime attribute is dependent on a proper subset of any candidate key of the table”.
To make the table comply with 2NF, we can break it into two tables like this:
teacher_details table:
teacher_id   teacher_age
111          38
222          38
333          40
teacher_subject table:
teacher_id   subject
111          Maths
111          Physics
222          Biology
333          Physics
333          Chemistry
Now the tables comply with second normal form (2NF).
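The 2NF decomposition can be reproduced by projecting the original table onto two column sets and checking that joining the projections gives back the original rows. The Python sketch below is a minimal illustration; the helper name project and the variable names are assumptions.

```python
def project(rows, columns):
    """Project rows onto columns, dropping duplicates and keeping first-seen order."""
    seen = []
    for row in rows:
        projected = tuple(row[c] for c in columns)
        if projected not in seen:
            seen.append(projected)
    return seen


teacher = [
    {"teacher_id": 111, "subject": "Maths", "teacher_age": 38},
    {"teacher_id": 111, "subject": "Physics", "teacher_age": 38},
    {"teacher_id": 222, "subject": "Biology", "teacher_age": 38},
    {"teacher_id": 333, "subject": "Physics", "teacher_age": 40},
    {"teacher_id": 333, "subject": "Chemistry", "teacher_age": 40},
]

teacher_details = project(teacher, ["teacher_id", "teacher_age"])
teacher_subject = project(teacher, ["teacher_id", "subject"])

# Rejoin the two tables on teacher_id and compare with the original rows.
ages = dict(teacher_details)
rejoined = [(tid, subj, ages[tid]) for tid, subj in teacher_subject]
original = project(teacher, ["teacher_id", "subject", "teacher_age"])

print(teacher_details)       # [(111, 38), (222, 38), (333, 40)]
print(rejoined == original)  # True
```

Each teacher's age is now stored once, and no information is lost because the join on teacher_id recovers every original row.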
Third normal form (3NF)
A table design is said to be in 3NF if both of the following conditions hold:
• The table is in 2NF.
• No non-prime attribute is transitively dependent on a candidate key.
An attribute that is not part of any candidate key is known as a non-prime attribute.
In other words, a table is in 3NF if it is in 2NF and, for each non-trivial functional dependency X → Y, at least one of the following conditions holds:
• X is a super key of the table.
• Y is a prime attribute of the table.
An attribute that is part of one of the candidate keys is known as a prime attribute.
Example: Suppose a company wants to store the complete address of each employee. It creates a table named employee_details that looks like this:
emp_id   emp_name   emp_zip   emp_state   emp_city   emp_district
1001     John       282005    UP          Agra       Dayal Bagh
1002     Ajeet      222008    TN          Chennai    M-City
1006     Lora       282007    TN          Chennai    Urrapakkam
1101     Lilly      292008    UK          Pauri      Bhagwan
1201     Steve      222999    MP          Gwalior    Ratan
Super keys: {emp_id}, {emp_id, emp_name}, {emp_id, emp_name, emp_zip}, and so on
Candidate key: {emp_id}
Non-prime attributes: all attributes except emp_id are non-prime, as they are not part of any candidate key.
Here, emp_state, emp_city, and emp_district depend on emp_zip, and emp_zip depends on emp_id. This makes the non-prime attributes (emp_state, emp_city, and emp_district) transitively dependent on the candidate key (emp_id), which violates the rule of 3NF.
To make this table comply with 3NF, we have to break it into two tables to remove the transitive dependency:
employee table:
emp_id   emp_name   emp_zip
1001     John       282005
1002     Ajeet      222008
1006     Lora       282007
1101     Lilly      292008
1201     Steve      222999
employee_zip table:
emp_zip   emp_state   emp_city   emp_district
282005    UP          Agra       Dayal Bagh
222008    TN          Chennai    M-City
282007    TN          Chennai    Urrapakkam
292008    UK          Pauri      Bhagwan
222999    MP          Gwalior    Ratan
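As a rough illustration, the two 3NF tables can be created in SQLite through Python's standard sqlite3 module, with a foreign key linking them and a join that rebuilds the original rows. The table names employee and employee_zip and the choice of column types are assumptions made for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript(
    """
    CREATE TABLE employee_zip (
        emp_zip      TEXT PRIMARY KEY,
        emp_state    TEXT NOT NULL,
        emp_city     TEXT NOT NULL,
        emp_district TEXT NOT NULL
    );
    CREATE TABLE employee (
        emp_id   INTEGER PRIMARY KEY,
        emp_name TEXT NOT NULL,
        emp_zip  TEXT NOT NULL REFERENCES employee_zip (emp_zip)
    );
    """
)

conn.executemany(
    "INSERT INTO employee_zip VALUES (?, ?, ?, ?)",
    [
        ("282005", "UP", "Agra", "Dayal Bagh"),
        ("222008", "TN", "Chennai", "M-City"),
        ("282007", "TN", "Chennai", "Urrapakkam"),
        ("292008", "UK", "Pauri", "Bhagwan"),
        ("222999", "MP", "Gwalior", "Ratan"),
    ],
)
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [
        (1001, "John", "282005"),
        (1002, "Ajeet", "222008"),
        (1006, "Lora", "282007"),
        (1101, "Lilly", "292008"),
        (1201, "Steve", "222999"),
    ],
)

# Joining on emp_zip rebuilds the rows of the original employee_details table.
joined = conn.execute(
    """
    SELECT e.emp_id, e.emp_name, e.emp_zip, z.emp_state, z.emp_city, z.emp_district
    FROM employee AS e
    JOIN employee_zip AS z ON z.emp_zip = e.emp_zip
    ORDER BY e.emp_id
    """
).fetchall()

for row in joined:
    print(row)
```

With this layout the state, city, and district for a zip code are stored in one place, so correcting them requires changing a single row.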
Surrogate key
A surrogate key is a column whose values are not derived from the data stored in the table. Instead, the DBMS generates a unique identifier for each row. Surrogate keys are frequently used as primary keys in database tables.
A surrogate key is typically a sequential number. It may be made available to the user and the application, or it may exist in the database without being visible to them.
If a table has no natural primary key, we need to create one artificially in order to uniquely identify each row. This key is called the surrogate key or synthetic primary key of the table.
Example: Suppose two schools each keep a table of student details.
School A:
registration_no   name      percentage
210101            Harry     90
210102            Maxwell   65
210103            Lee       87
210104            Chris     76
School B:
registration_no   name      percentage
CS107             Taylor    49
CS108             Simon     86
CS109             Sam       96
CS110             Andy      58
Suppose we want to merge the details of both schools into a single table. The registration_no values of the two schools do not follow a common format, so although they are all unique, registration_no is a poor primary key for the merged table. In this case, we artificially create a primary key by adding a column surr_no that contains arbitrary integers with no direct relation to the other columns. The resulting table is given here:
surr_no   registration_no   name      percentage
1         210101            Harry     90
2         210102            Maxwell   65
3         210103            Lee       87
4         210104            Chris     76
5         CS107             Taylor    49
6         CS108             Simon     86
7         CS109             Sam       96
8         CS110             Andy      58
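One way to let the DBMS generate the surrogate key is sketched below with Python's standard sqlite3 module. In SQLite, a column declared INTEGER PRIMARY KEY is filled in automatically when no value is supplied. The table name student and the decision to store registration_no as text are assumptions made for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE student (
        surr_no         INTEGER PRIMARY KEY,  -- surrogate key, assigned by SQLite
        registration_no TEXT NOT NULL,
        name            TEXT NOT NULL,
        percentage      INTEGER NOT NULL
    )
    """
)

school_a = [("210101", "Harry", 90), ("210102", "Maxwell", 65),
            ("210103", "Lee", 87), ("210104", "Chris", 76)]
school_b = [("CS107", "Taylor", 49), ("CS108", "Simon", 86),
            ("CS109", "Sam", 96), ("CS110", "Andy", 58)]

# surr_no is omitted from the INSERT, so the database assigns it.
conn.executemany(
    "INSERT INTO student (registration_no, name, percentage) VALUES (?, ?, ?)",
    school_a + school_b,
)

rows = conn.execute(
    "SELECT surr_no, registration_no, name, percentage FROM student ORDER BY surr_no"
).fetchall()
for row in rows:
    print(row)
```

Because the table starts empty, the generated values run from 1 to 8 in insertion order, matching the surr_no column of the merged table.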
Boyce-Codd Normal Form (BCNF)
When a table has more than one candidate key, anomalies may result even though the relation is in 3NF. Boyce-Codd normal form (BCNF) is a special, stricter case of 3NF. A relation is in BCNF if, and only if, every determinant is a candidate key.
BCNF Example 1
Consider the following table (St_Maj_Adv):
Student_id   Major     Advisor
111          Physics   Smith
111          Music     Chan
320          Math      Dobbs
671          Physics   White
803          Physics   Smith
Put another way, a table is in BCNF if, for every non-trivial functional dependency X → Y, X is a super key of the table. That is, the table should be in 3NF and the left-hand side (LHS) of every FD should be a super key.
Example: Let's assume there is a company where employees work in more than one department.
For the employee table of this example, the functional dependencies are as follows:
1. EMP_ID → EMP_COUNTRY
2. EMP_DEPT → {DEPT_TYPE, EMP_DEPT_NO}
Candidate key: {EMP_ID, EMP_DEPT}
The table is not in BCNF because neither EMP_DEPT nor EMP_ID alone is a key. To convert the table into BCNF, we decompose it into three tables. The same functional dependencies still hold, and the candidate keys are:
• For the first table: EMP_ID
• For the second table: EMP_DEPT
• For the third table: {EMP_ID, EMP_DEPT}
The design is now in BCNF because the left-hand side of both functional dependencies is a key.
Let us see another example: a college enrolment table with the columns student_id, subject, and professor.
In this table:
• One student can enrol for multiple subjects. For example, the student with student_id 101 has opted for the subjects Java and C++.
• For each subject, a professor is assigned to the student.
• There can be multiple professors teaching one subject, as is the case for Java.
student_id and subject together form the primary key, because using student_id and subject we can find all the columns of the table.
One more important point to note is that one professor teaches only one subject, but one subject may have two different professors. Hence, there is a dependency between subject and professor, where the subject depends on the professor name (professor → subject).
This table satisfies 1NF because all the values are atomic, the column names are unique, and all the values stored in a particular column are of the same domain. It also satisfies 2NF, as there is no partial dependency of a non-prime attribute. And there is no transitive dependency of a non-prime attribute, hence the table also satisfies 3NF.
Why is this table not in BCNF? student_id and subject form the primary key, which means that subject is a prime attribute. But there is one more dependency, professor → subject, in which the determinant professor is a non-prime attribute and not a super key. BCNF does not allow this.
How to satisfy BCNF? We decompose the table into two tables: a student table that records which professor each student is enrolled with (student_id, professor), and a professor table that records the subject each professor teaches (professor, subject).
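The BCNF condition (every determinant is a super key) can be tested mechanically from the list of FDs. The Python sketch below is a minimal illustration applied to the college enrolment example; the function names closure and bcnf_violations are assumptions, and the FDs are the two dependencies described in the text.

```python
def closure(attrs, fds):
    """Return every attribute derivable from attrs using the FDs in fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def bcnf_violations(all_attrs, fds):
    """List the non-trivial FDs whose left-hand side is not a super key."""
    return [
        (sorted(lhs), sorted(rhs))
        for lhs, rhs in fds
        if not rhs <= lhs and not closure(lhs, fds) >= all_attrs
    ]


# College enrolment table: (student_id, subject) is the key,
# and each professor teaches exactly one subject.
attrs = {"student_id", "subject", "professor"}
fds = [
    ({"student_id", "subject"}, {"professor"}),
    ({"professor"}, {"subject"}),
]
print(bcnf_violations(attrs, fds))  # [(['professor'], ['subject'])]

# After decomposition, the professor table holds only professor -> subject.
print(bcnf_violations({"professor", "subject"}, [({"professor"}, {"subject"})]))  # []
```

The only violation reported is professor → subject, which is the dependency that the decomposition moves into its own table.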
4th & 5th Normal Form
Two of the highest levels of database normalization are the fourth normal form (4NF) and the fifth normal form (5NF). Multivalued dependencies are handled by 4NF, whereas join dependencies are handled by 5NF.
A multivalued dependency (MVD) arises when two or more independent relationships are kept in a single relation. It occurs when the presence of one or more rows in a table implies the presence of one or more other rows in that same table. Put another way, two attributes (or columns) in a table are independent of one another, but both depend on a third attribute. A multivalued dependency therefore always requires at least three attributes.
For a dependency A → B, if multiple values of B exist for a single value of A, then the table may have a multivalued dependency. For the multivalued dependency A →→ B to hold, the table should have at least three attributes A, B, and C, and B and C should be independent of each other.
Example:
person →→ mobile
person →→ food_likes
These are read as “person multidetermines mobile” and “person multidetermines food_likes.”
Fourth normal form (4NF) is a level of database normalization in which there are no non-trivial multivalued dependencies other than those whose determinant is a super key. It builds on the first three normal forms (1NF, 2NF, and 3NF) and the Boyce-Codd normal form (BCNF).
A relation R is in 4NF if and only if the following conditions are satisfied:
1. It is in Boyce-Codd normal form (BCNF).
2. It has no non-trivial multivalued dependency other than one whose determinant is a super key.
A table with such a multivalued dependency violates 4NF because it creates unnecessary redundancy and can lead to inconsistent data. To bring it up to 4NF, it is necessary to break the information into two tables.
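A multivalued dependency can be checked on sample data: for each value of the determinant, every combination of the dependent values and the remaining values must appear as a row. The Python sketch below illustrates this for the person, mobile, and food_likes example; the function name mvd_holds and the sample values are assumptions made for illustration.

```python
from itertools import product


def mvd_holds(rows, x, y):
    """Check the multivalued dependency x ->> y on a list of dict rows."""
    columns = list(rows[0])
    z = [c for c in columns if c not in x and c not in y]  # the remaining columns
    groups = {}
    for row in rows:
        key = tuple(row[c] for c in x)
        y_val = tuple(row[c] for c in y)
        z_val = tuple(row[c] for c in z)
        ys, zs, pairs = groups.setdefault(key, (set(), set(), set()))
        ys.add(y_val)
        zs.add(z_val)
        pairs.add((y_val, z_val))
    # The MVD holds when each group contains every (y, z) combination.
    return all(pairs == set(product(ys, zs)) for ys, zs, pairs in groups.values())


# Hypothetical data: one person with two mobiles and two food preferences.
rows = [
    {"person": "Ana", "mobile": "M1", "food_likes": "Pizza"},
    {"person": "Ana", "mobile": "M1", "food_likes": "Dosa"},
    {"person": "Ana", "mobile": "M2", "food_likes": "Pizza"},
    {"person": "Ana", "mobile": "M2", "food_likes": "Dosa"},
]
print(mvd_holds(rows, ["person"], ["mobile"]))      # True
print(mvd_holds(rows[:3], ["person"], ["mobile"]))  # False: one combination is missing

# The 4NF decomposition stores each independent fact once.
person_mobile = {(r["person"], r["mobile"]) for r in rows}
person_food = {(r["person"], r["food_likes"]) for r in rows}
print(len(rows), len(person_mobile), len(person_food))  # 4 2 2
```

The four rows of the combined table carry the same information as two rows in each of the decomposed tables, which is the redundancy that 4NF removes.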
Example: Consider the database of a class with two relations: R1 contains student ID (SID) and student name (SNAME), and R2 contains course ID (CID) and course name (CNAME). Taking their cross product results in multivalued dependencies.
The multivalued dependencies (MVDs) are:
SID →→ CID; SID →→ CNAME; SNAME →→ CNAME
A join dependency is a further generalization of a multivalued dependency. Let R1(A, B, C) and R2(C, D) be a decomposition of a given relation R(A, B, C, D). If the join of R1 and R2 over C is equal to R, then we say that a join dependency (JD) exists. Alternatively, R1 and R2 are a lossless decomposition of R.
In general, a JD ⋈ {R1, R2, …, Rn} is said to hold over a relation R if R1, R2, …, Rn is a lossless-join decomposition of R. The notation *(R1, R2, …, Rn) is also used to indicate that R1, R2, …, Rn are a JD of R. For example, *((A, B, C), (C, D)) is a JD of R if the join of the two projections over their common attribute C is equal to R.
Let R be a relation schema and let R1, R2, …, Rn be a decomposition of R. An instance r(R) is said to satisfy the join dependency if and only if joining its projections gives back r:
r = Π_R1(r) ⋈ Π_R2(r) ⋈ … ⋈ Π_Rn(r)
For the Agent-Company-Product example used in the next section, the multivalued dependencies are:
Agent →→ Company
Agent →→ Product
Company →→ Product
Fifth Normal Form / Project-Join Normal Form (5NF)
A relation R is in fifth normal form if and only if every join dependency in R is implied by the candidate keys of R. A decomposition of a relation must have the lossless-join property, which ensures that no spurious (extra) tuples are generated when the relations are reunited through a natural join.
A relation R is in 5NF if and only if it satisfies the following conditions:
• R is in 4NF.
• R cannot be further decomposed without loss (it has no join dependency that is not implied by its candidate keys).
Consider the Agent-Company-Product (ACP) schema above with the rule: “if a company makes a product and an agent is an agent for that company, then he always sells that product for the company”. Under these circumstances, the relation ACP is decomposed into three relations, R1, R2, and R3.
The natural join of R1 and R3 over ‘Company’, followed by the natural join of the result R13 and R2 over ‘Agent’ and ‘Product’, gives back the table ACP.
Hence, in this example, the redundancies are eliminated and the decomposition of ACP is a lossless-join decomposition. Therefore, the decomposed relations are in 5NF, as they do not violate the lossless-join property.
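The two-step natural join described above can be traced on a small data set. The Python sketch below is only an illustration: the ACP rows are hypothetical sample data, and the assignment R1 = (Agent, Company), R2 = (Agent, Product), R3 = (Company, Product) is an assumption inferred from the join attributes named in the text.

```python
# Hypothetical ACP rows as (agent, company, product) tuples.
acp = {
    ("A1", "PQR", "Nut"),
    ("A1", "PQR", "Bolt"),
    ("A1", "XYZ", "Nut"),
    ("A1", "XYZ", "Bolt"),
    ("A2", "PQR", "Nut"),
}

# Project ACP onto the three pairs of attributes.
r1 = {(a, c) for a, c, p in acp}  # R1(Agent, Company)
r2 = {(a, p) for a, c, p in acp}  # R2(Agent, Product)
r3 = {(c, p) for a, c, p in acp}  # R3(Company, Product)

# Step 1: natural join of R1 and R3 over Company.
r13 = {(a, c, p) for a, c in r1 for c2, p in r3 if c == c2}

# Step 2: natural join of R13 and R2 over Agent and Product.
rejoined = {(a, c, p) for a, c, p in r13 if (a, p) in r2}

print(len(r13))            # 6: includes the spurious tuple ("A2", "PQR", "Bolt")
print(sorted(r13 - acp))   # [('A2', 'PQR', 'Bolt')]
print(rejoined == acp)     # True
```

Joining only two of the projections produces a spurious tuple, while joining all three recovers exactly the original rows. This is what it means for the three-way decomposition to be a lossless join for this sample.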