Normalization
Different anomalies in designing a database, The idea of
normalization, Functional dependency,
Armstrong’s Axioms (proofs not required), Closures and
their computation, Equivalence of
Functional Dependencies (FD), Minimal Cover (proofs not
required). First Normal Form (1NF),
Second Normal Form (2NF), Third Normal Form (3NF),
Boyce Codd Normal Form (BCNF),
Lossless join and dependency preserving decomposition,
Algorithms for checking Lossless Join
(LJ) and Dependency Preserving (DP) properties.
Different anomalies in designing a
database
If a table is not properly normalized and have
data redundancy then it will not only eat up
extra memory space but will also make it
difficult to handle and update the database,
without facing data loss.
Insertion, Updation and Deletion Anomalies
are very frequent if database is not
normalized.
Student table
rollno name branch hod office_tel
401 Akon CSE Mr. X 53337
402 Bkon CSE Mr. X 53337
403 Ckon CSE Mr. X 53337
404 Dkon CSE Mr. X 53337
In the table above, we have data of 4
Computer Sci. students.
As we can see, data for the
fields branch, hod(Head of Department)
and office_tel is repeated for the students
who are in the same branch in the college,
this is Data Redundancy.
Insertion Anomaly
Suppose for a new admission, until and unless
a student opts for a branch, data of the student
cannot be inserted, or else we will have to set
the branch information as NULL.
Also, if we have to insert data of 100 students
of same branch, then the branch information
will be repeated for all those 100 students.
These scenarios are nothing but Insertion
anomalies.
Updation Anomaly
What if Mr. X leaves the college? or is no
longer the HOD of computer science
department?
In that case all the student records will
have to be updated, and if by mistake we
miss any record, it will lead to data
inconsistency.
This is Updation anomaly.
Deletion Anomaly
In our Student table, two different
informations are kept together, Student
information and Branch information.
Hence, at the end of the academic year, if
student records are deleted, we will also
lose the branch information.
This is Deletion anomaly.
What is Normalization?
Normalization is a database design
technique that reduces data redundancy and
eliminates undesirable characteristics like
Insertion, Update and Deletion Anomalies.
Normalization rules divides larger tables into
smaller tables and links them using
relationships.
The purpose of Normalisation in SQL is to
eliminate redundant (repetitive) data and
ensure data is stored logically.
Functional dependencies in DBMS
A functional dependency is a constraint that
specifies the relationship between two sets
of attributes where one set can accurately
determine the value of other sets.
It is denoted as X → Y, where X is a set of
attributes that is capable of determining the
value of Y.
X is called Determinant,
Y is called the Dependent.
dept_buildin
roll_no name dept_name
g
42 abc CO A4
43 pqr IT A3the
44 xyz CO A4
valid functional dependencies:
roll_no→ { name, dept_name,
dept_building },→
◦ Here, roll_no can determine values of fields name,
dept_name and dept_building, hence a valid
Functional dependency
roll_no → dept_name ,
◦ Since, roll_no can determine whole set of {name,
dept_name, dept_building}, it can determine its
subset dept_name also.
dept_name → dept_building ,
◦ Dept_name can identify the dept_building
accurately, since departments with different
dept_name will also have a different dept_building
invalid functional dependencies:
name → dept_name
◦ Students with the same name can have
different dept_name, hence this is not a valid
functional dependency.
dept_building → dept_name
◦ There can be multiple departments in the
same building, For example, in the above table
departments ME and EC are in the same
building B2,
◦ hence dept_building → dept_name is an
invalid functional dependency.
Armstrong’s axioms/properties of
functional dependencies:
Reflexivity: If Y is a subset of X, then X→Y holds by
reflexivity rule
◦ For example, {roll_no, name} → name is valid.
Augmentation: If X → Y is a valid dependency, then XZ →
YZ is also valid by the augmentation rule.
◦ For example, If {roll_no, name} → dept_building is valid, hence
{roll_no, name, dept_name} → {dept_building, dept_name} is
also valid.→
Transitivity: If X → Y and Y → Z are both valid
dependencies, then X→Z is also valid by the Transitivity
rule.
◦ For example, roll_no → dept_name & dept_name →
dept_building, then roll_no → dept_building is also valid.
Types of Functional dependencies
in DBMS:
Fully functional dependency
Partial Functional Dependency
Trivial functional dependency
Non-Trivial functional dependency
Multivalued functional dependency
Transitive functional dependency
Trivial functional dependency
A → B has trivial functional dependency
if B is a subset of A.
◦ {Emp_id, Emp_name} -> Emp_id is a trivial
functional dependency as Emp_id is a subset
of {Emp_id,Emp_name}.
Non Trivial Functional
Dependency
when A->B holds true where B is not a
subset of A. In a relationship, if attribute
B is not a subset of attribute A, then it is
conside
◦ (Company} -> {CEO} (if we know the
Company, we knows the CEO name)
◦ But CEO is not a subset of Company, and
hence it's non-trivial functional dependency.
Multivalued Dependency
Multivalued dependency occurs in the
situation where there are multiple
independent multivalued attributes in a
single table.
A multivalued dependency is a complete
constraint between two sets of attributes in a
relation.
It requires that certain tuples be present in a
relation.
Car(carmodel,mafyear,colour)
◦ maf_year and color are independent of each
other but dependent on car_model.
◦ In this example, these two columns are said to
be multivalue dependent on car_model.
◦ This dependence can be represented like this:
◦ car_model -> maf_year
◦ car_model-> colour
Transitive Dependency in DBMS
Company} -> {CEO} (if we know the
compay, we know its CEO's name)
{CEO } -> {Age} If we know the CEO,
we know the Age
Therefore according to the rule of
transitive dependency:
{ Company} -> {Age} should hold, that
makes sense because if we know the
company name, we can know his age.
Closure of an Attribute
Closure of an Attribute: Closure of an
Attribute can be defined as a set of
attributes that can be functionally
determined from it.
Closure of an attribute X is X+
Find the closure of A,B,C,D given
R(A,B,C,D),FD : {A->B,B->D,C-
>B}
A+=A->B->D
◦ A+=ABD
B+=B->D
=BD
C+=C->B->D
=CBD
D+=D
Closure of attribute A-
Consider a relation A+ ={A}
R(A,B,C,D, = { A , B , C } ( Using A →
E , F , G ) with the BC )
functional = { A , B , C , D , E }
dependencies- ( Using BC → DE )
A → BC = { A , B , C , D , E , F }
BC → DE ( Using D → F )
D → F = { A , B , C , D , E , F , G }
CF → G ( Using CF → G )
Thus,
A+ = { A , B , C , D , E , F ,
G}
A → BC D+ ={D}
BC → DE = { D , F } ( Using D
D → F →F)
CF → G
A → BC { B , C }+= { B , C }
BC → DE = { B , C , D , E } ( Using
D → F BC → DE )
CF → G = { B , C , D , E , F }
( Using D → F )
= { B , C , D , E , F , G }
( Using CF → G )
Thus,
{ B , C }+ = { B , C , D , E ,
F,G}
Given relational schema R( P Q R S T U V) having
following attribute P Q R S T U and V, also there is a set of
functional dependency denoted by FD = { P->Q, QR->ST,
PTV->V }. Determine Closure of (QR)+ and (PR)+
QR+ = QR FD QR→ST
=QRST
PR + = PR → P → Q
=PRQ →ST
=PRQST
Given relational schema R( P Q R S T) having following
attributes P Q R S and T, also there is a set of functional
dependency denoted by FD = { P->QR, RS->T, Q->S, T->
P }. Determine Closure of ( T )+
T+=T → P → QR → S
=TPQRS
Different kinds of keys
candidate key
A candidate key may be defined as-
◦ A set of minimal attribute(s) that can identify
each tuple uniquely in the given relation is
called as a candidate key.
OR
◦ A minimal super key is called as a candidate
key.
Consider the following Student schema-
Student ( roll , name , sex , age , address , class , section )
Given below are the examples of candidate keys-
( class , section , roll )
( name , address )
These are candidate keys because each set consists of
minimal attributes required to identify each student
uniquely in the Student table.
Let R = (A, B, C, D, Determine all
E, F) be a relation essential attributes of
scheme with the the given relation.
following Essential attributes of
dependencies- the relation are- C and
C → F E.
E → A So, attributes C and E
EC → D will definitely be a
part of every
A → B
candidate key.
So, we have-
{ CE }+
C →F
= { C , E }
E → A
= { C , E , F } ( Using C → F )
EC → D = { A , C , E , F } ( Using E →
A → B A)
= { A , C , D , E , F } ( Using EC
→D)
= { A , B , C , D , E , F } ( Using
A→B)
We conclude that CE can
determine all the attributes of the
given relation.
So, CE is the only possible
candidate key of the relation.
Let R = (A, B, C,
D, E) be a
relation scheme
with the
following
dependencies-
AB → C
C → D
B → E
Determine the
total number of
candidate keys
Closures of a set of functional dependencies
A Closure is a set of FDs is a set of all
possible FDs that can be derived from a
given set of FDs.
It is also referred as a Complete set of
FDs.
If F is used to donate the set of FDs for
relation R, then a closure of a set of FDs
implied by F is denoted by F+.
Find F+
Closure Algorithm
Closure=X
Repeat
{
Foreach FD u v in F
Such that u C closure
Then set closure = closure U v
}
Until there is no change to closure
Ssn+
a) ssn+
result=ssn
repeat
{
pno →(pname,ploc) (u v)
Pno C ssn
}
result=ssn
repeat
{
ssn →ename (u v)
ssn C ssn
Then result ssn U ename
Result=(ssn,ename)
}
Normalization Rule
Normalization rules are divided into the
following normal forms:
First Normal Form
Second Normal Form
Third Normal Form
BCNF
Fourth Normal Form
Database Normalization is a technique of organizing
the data in the database.
Normalization is a systematic approach of
decomposing tables to eliminate data
redundancy(repetition) and undesirable
characteristics like Insertion, Update and Deletion
Anomalies.
It is a multi-step process that puts data into tabular
form, removing duplicated data from the relation
tables.
Normalization is used for mainly two purposes,
◦ Eliminating redundant(useless) data.
◦ Ensuring data dependencies make sense i.e data is logically
stored.
First Normal Form (1NF)
For a table to be in the First Normal Form,
it should follow the following 4 rules:
It should only have single(atomic) valued
attributes/columns.
Values stored in a column should be of the same
domain
All the columns in a table should have unique
names.
And the order in which data is stored, does not
matter.
roll_no name subject
101 Akon OS, CN
103 Ckon Java
102 Bkon C, C++
Our table already satisfies 3 rules out of
the 4 rules, as all our column names are
unique, we have stored data in the order
we wanted to and we have not inter-mixed
different type of data in columns.
But out of the 3 different students in our
table, 2 have opted for more than 1
subject.
And we have stored the subject names in
a single column. But as per the 1st
Normal form each column must contain
atomic value.
How to solve this Problem?
roll_no name subject
101 Akon OS
101 Akon CN
103 Ckon Java
102 Bkon C
102 Bkon C++
By doing so, although a few values are
getting repeated but values for
the subject column are now atomic for
each record/row.
Using the First Normal Form, data
redundancy increases, as there will be
many columns with same data in multiple
rows but each row as a whole will be
unique.
Second Normal Form
For a table to be in the Second Normal
Form, it must satisfy two conditions:
◦ The table should be in the First Normal Form.
◦ There should be no Partial Dependency.
What is Dependency?
Let's take an example of a Student table
with columns student_id, name, reg_no
branch and address .
In this table, student_id is the primary key
and will be unique for every row, hence we
can use student_id to fetch any row of data
from this table
Even for a case, where student names are
same, if we know the student_id we can
easily fetch the correct record.
student_id name reg_no branch address
10 Akon 07-WY CSE Kerala
11 Akon 08-WY IT Gujarat
Hence we can say a Primary Key for a
table is the column or a group of
columns(composite key) which can
unique
if I ask for name of student
with student_id 10 or 11, I will get it.
So all I need is student_id and every other
column depends on it, or can be fetched
using it.
This is Dependency and we also call
it Functional Dependency.
Partial Dependency
For a simple table like Student, a single
column like student_id can uniquely identfy
all the records in a table.
But this is not true all the time.
Let's create another table for Subject,
which will
have subject_id and subject_name fields
and subject_id will be the primary key.
Subject
subject_id(Primary Key) subject_name
1 Java
2 C++
3 Php
Now we have a Student table with
student information and another
table Subject for storing subject
information.
Let's create another table Score, to store
the marks obtained by students in the
respective subjects.
We will also be saving name of the
teacher who teaches that subject along
with marks.
Score
score_id student_id subject_id marks teacher
1 10 1 70 Java Teacher
2 10 2 75 C++ Teacher
3 11 1 80 Java Teacher
In the score table we are saving the student_id to know
which student's marks are these and subject_id to know for
which subject the marks are for.
Together, student_id + subject_id forms a Candidate Key
for this table, which can be the Primary key
Now if you look at the Score table, we have a column
names teacher which is only dependent on the subject, for
Java it's Java Teacher and for C++ it's C++ Teacher & so
on.
primary key for this table is a composition of two columns
which is student_id & subject_id but the teacher's name
only depends on subject, hence the subject_id, has nothing
to do with student_id.
This is Partial Dependency, where an attribute in a table
depends on only a part of the primary key and not on the
How to remove Partial Dependency
There can be many different solutions for
this, but out objective is to remove teacher's
name from Score table.
The simplest solution is to remove
columns teacher from Score table and add it
to the Subject table. Hence, the Subject table
will become:
The simplest solution is to remove
columns teacher from Score table and add it to the
Subject table. Hence, the Subject table will
become:
id subject_name teacher
1 Java Java Teacher
2 C++ C++ Teacher
3 Php Php Teacher
And our Score table is now in the second normal
form, with no partial dependency
score_id student_id subject_id marks
1 10 1 70
2 10 2 75
3 11 1 80
Third Normal Form (3NF)
A relation will be in 3NF
◦ if it is in 2NF and not contain any transitive
partial dependency.
3NF is used to reduce the data
duplication.
It is also used to achieve the data
integrity.
If there is no transitive dependency for
non-prime attributes, then the relation
must be in third normal form.
Employee Table
EMP_ID EMP_NAM EMP_ZIP EMP_STAT EMP_CITY
E E
222 Harry 201010 UP Noida
333 Stephan 02228 US Boston
444 Lan 60007 US Chicago
Super key in the table above:
◦ {EMP_ID}, {EMP_ID, EMP_NAME}, {EMP_ID, E
M
Non-prime attributes: In the given table, all
attributes except EMP_ID are non-prime.
◦ Here, EMP_STATE & EMP_CITY dependent on
EMP_ZIP and
◦ EMP_ZIP dependent on EMP_ID.
◦ The non-prime attributes (EMP_STATE,
EMP_CITY) transitively dependent on super
key(EMP_ID). It violates the rule of third normal
form.(ab,bc ac)
◦ So need to move the EMP_CITY and EMP_STATE
to the new <EMPLOYEE_ZIP> table, with EMP_ZIP
as a Primary key.
Employee Table
EMP_ID EMP_NAME EMP_ZIP
222 Harry 201010
333 Stephan 02228
444 Lan 60007
EMPLOYEE_ZIP table
EMP_ZIP EMP_STATE EMP_CITY
201010 UP Noida
02228 US Boston
60007 US Chicago
Boyce Codd normal form (BCNF)
BCNF is the advance version of 3NF. It is
stricter than 3NF.
◦ A table is in BCNF if every functional
dependency X → Y, X is the super key of the
table.
◦ For BCNF, the table should be in 3NF, and for
every FD, LHS is super key.
EMPLOYEE table
EMP_ID EMP_COU EMP_DEPT DEPT_TYP EMP_DEPT
NTRY E _NO
264 India Designing D394 283
264 India Testing D394 300
364 UK Stores D283 232
In the above table Functional
dependencies are as follows:
◦ EMP_ID → EMP_COUNTRY
◦ EMP_DEPT → {DEPT_TYPE, EMP_DEP
T_NO}
Candidate key: {EMP-ID, EMP-DEPT}
The table is not in BCNF because neither
EMP_DEPT nor EMP_ID alone are keys.
To convert the given table into BCNF, we
decompose it into three tables:
EMP_COUNTRY table:
EMP_ID EMP_COUNTRY
264 India
264 India
EMP_DEPT table:
EMP_DEPT DEPT_TYPE EMP_DEPT_NO
Designing D394 283
Testing D394 300
Stores D283 232
Developing D283 549
EMP_DEPT_MAPPING table
EMP_ID EMP_DEPT
D394 283
D394 300
D283 232
D283 549
Candidate keys:
◦ For the first table: EMP_ID
For the second table: EMP_DEPT
For the third table: {EMP_ID, EMP_DEPT}
Functional dependencies:
◦ EMP_ID → EMP_COUNTRY
◦ EMP_DEPT → {DEPT_TYPE, EMP_DEPT_
NO}
Now, this is in BCNF because left side part
of both the functional dependencies is a
key.
Fourth Normal Form (4NF)
For a table to satisfy the Fourth Normal
Form, it should satisfy the following two
conditions:
◦ It should be in the Boyce-Codd Normal
Form.
◦ And, the table should not have any Multi-
valued Dependency.
STUDENT
STU_ID COURSE HOBBY
21 Computer Dancing
21 Math Singing
34 Chemistry Dancing
The given STUDENT table is in 3NF, but the
COURSE and HOBBY are two independent
entity. Hence, there is no relationship between
COURSE and HOBBY.
In the STUDENT relation, a student with
STU_ID, 21 contains two
courses, Computer and Math and two
hobbies, Dancing and Singing. So there is a
Multi-valued dependency on STU_ID, which leads
to unnecessary repetition of data.
So to make the above table into 4NF, we can
decompose it into two tables:
STUDENT_COURSE
STU_ID COURSE
21 Computer
21 Math
34 Chemistry
STUDENT_HOBBY
STU_ID HOBBY
21 Dancing
21 Singing
34 Dancing
Relational Decomposition
When a relation in the relational model is not in
appropriate normal form then the decomposition of
a relation is required.
In a database, it breaks the table into multiple
tables.
If the relation has no proper decomposition, then it
may lead to problems like loss of information.
Decomposition is used to eliminate some of the
problems of bad design like anomalies,
inconsistencies, and redundancy.
Types of Decomposition
Lossless Decomposition
If the information is not lost from the
relation that is decomposed, then the
decomposition will be lossless.
The lossless decomposition guarantees that
the join of relations will result in the same
relation as it was decomposed.
The relation is said to be lossless
decomposition if natural joins of all the
decomposition give the original relation.
Lossless Join Decomposition
If we decompose a relation R into
relations R1 and R2,
Decomposition is lossy if R1 ⋈ R2 ⊃ R
Decomposition is lossless if R1 ⋈ R2 =
R
EMPLOYEE_DEPARTMENT
table:
EMP_ID EMP_NA EMP_AG EMP_CI DEPT_ID DEPT_N
ME E TY AME
22 Denim 28 Mumbai 827 Sales
33 Alina 25 Delhi 438 Marketing
46 Stephan 30 Bangalore 869 Finance
52 Katherine 36 Mumbai 575 Production
60 Jack 40 Noida 678 Testing
The above relation is decomposed
into two relations EMPLOYEE and
DEPARTMENT
EMP_ID EMP_NAME EMP_AGE EMP_CITY
22 Denim 28 Mumbai
33 Alina 25 Delhi
46 Stephan 30 Bangalore
52 Katherine 36 Mumbai
60 Jack 40 Noida
DEPARTMENT table
DEPT_ID EMP_ID DEPT_NAME
827 22 Sales
438 33 Marketing
869 46 Finance
575 52 Production
678 60 Testing
Employee ⋈ Department
EMP_ID EMP_NA EMP_AG EMP_CI DEPT_ID DEPT_N
ME E TY AME
22 Denim 28 Mumbai 827 Sales
33 Alina 25 Delhi 438 Marketing
46 Stephan 30 Bangalore 869 Finance
52 Katherine 36 Mumbai 575 Production
60 Jack 40 Noida 678 Testing
To check for lossless join decomposition using
FD set, following conditions must hold:
Union of Attributes of R1 and R2 must be equal to
attribute of R. Each attribute of R must be either in
R1 or in R2.
◦ Att(R1) U Att(R2) = Att(R)
Intersection of Attributes of R1 and R2 must not
be NULL.
◦ Att(R1) ∩ Att(R2) ≠ Φ
Common attribute must be a key for at least one
relation (R1 or R2)
◦ Att(R1) ∩ Att(R2) -> Att(R1) or Att(R1) ∩ Att(R2) ->
Att(R2)
A relation R (A, B, C, D) with FD set{A-
>BC}.Perform decomposition and check whether it
is lossy or lossless ?
R is decomposed into R1(ABC) and R2(AD)
First condition holds true as Att(R1) U Att(R2)
= (ABC) U (AD) = (ABCD) = Att(R).
Second condition holds true as Att(R1) ∩
Att(R2) = (ABC) ∩ (AD) ≠ Φ
Third condition holds true as Att(R1) ∩ Att(R2)
= A is a key of R1(ABC) because A->BC is
given.
Dependency Preserving
It is an important constraint of the database.
In the dependency preservation, at least one decomposed
table must satisfy every dependency.
If a relation R is decomposed into relation R1 and R2, then
the dependencies of R either must be a part of R1 or R2 or
must be derivable from the combination of functional
dependencies of R1 and R2.
For example, suppose there is a relation R (A, B, C, D) with
functional dependency set (A->BC).
The relational R is decomposed into R1(ABC) and R2(AD)
which is dependency preserving because FD A->BC is a
part of relation R1(ABC).