Module 4 - Normalization
Lecture by:
Prathima M. G, B.E., M.E.
Assistant Professor
Dept. of Computer Science & Engineering
Bangalore Institute of Technology
Bangalore – 560 004
[email protected]
Relational Database Design
Schema Refinement
RELATIONAL DATABASE DESIGN
NORMALIZATION
BOTTOM - UP APPROACH
(THROUGH SYNTHESIS)
Phases of database design:
Step-01: Requirement gathering
Step-02: Conceptual database design (ER modeling)
Step-03: Logical database design (ER-to-relational mapping)
Step-04: Schema refinement (normalization)
Step-05: Implementation (using SQL)
Overview
Informal guidelines for database design
The concept of Functional Dependencies (FDs)
Trivial and Nontrivial Dependencies
Closure of set of Functional Dependencies
Minimal Set of FD
Finding the Candidate key
Normal Forms (1NF, 2NF, 3NF, and BCNF)
Examples on Normalization
Relational Database Design
Schema refinement:
The process of evaluating relational schemas
for design quality, that is,
measuring the appropriateness (goodness) of a
relational schema by means other than the intuition of the
designer
Approaches to database design
Analysis:
Top-down approach
Identify ENTITIES and associate ATTRIBUTES with them.
Synthesis:
Bottom-up approach
Consider the individual attributes and
group them appropriately into TABLES.
Normalization
The formal process that can be followed to achieve
a good database design
Also used to check that an existing design is of
good quality
The different stages of normalization are known as
“normal forms”
To understand this, we need to understand the
concept of functional dependency
Informal Guidelines for Good Database Design
Four Informal Measures
The SEMANTICS of the relation attributes must be clear and
maintained.
Reduce REDUNDANT information in tuples.
Reduce NULL values in tuples.
Disallow the possibility of generating
SPURIOUS TUPLES.
Drawbacks of an unnormalized relation
Consider a WASE DB.
WASE needs to keep track of:
its STUDENTS (USN, Name, DOB, Gen, Addr, ...),
the COURSES/SUBJECTS offered (Cno, Cname, Sem),
the enrollment of students in courses (a student may enroll
in many courses, and a course may have many students),
along with the Marks_Range obtained by each student in each
enrolled course,
and the Grade awarded based on Marks_Range.
ER Diagram for WASE DB
[Figure: ER diagram for the WASE DB; recoverable labels: USN, M_Range, Grade, Cno, Cname, Sem]
ER_Relational Mapping
WASE Student_Course Details Primary Key: USN,CNO
???? CS14 CN 3
Insertion Anomalies
Identifying Attribute: USN,CNO
USN Sname DOB Addr CNO Cname Sem M-range Grade
Insertion Anomalies:
Experienced when we attempt to store a value for one field but
cannot do so because the value of another field is unknown.
E.g., to add a STUDENT to the database, we MUST specify a
course in which the student is enrolled.
To add a COURSE to the database, we MUST specify a
student who is enrolled in the course.
Deletion Anomalies
Experienced when the value of an attribute or field of a
relation is unexpectedly removed when the value of another
attribute/field is deleted.
Assume that a particular student is no longer with the institution and we need to
delete the student's details.
E.g., if we delete student S6 from the table, then the
corresponding Cno, Cname, Sem, ... values of that row are
also deleted.
This results in a loss of information. Here, course CS14
is removed from the database.
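The insertion and deletion anomalies described above can be reproduced with a small in-memory table. This is only a sketch using Python's built-in sqlite3 module; the table is reduced to a few columns, and the row values (student names, course CS11) are made-up placeholders rather than data from the WASE example.

```python
import sqlite3

# A cut-down STUDENT_COURSE table; key columns are declared NOT NULL
# so that a row without a student is rejected.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE STUDENT_COURSE (
        USN   TEXT NOT NULL,
        Sname TEXT,
        CNO   TEXT NOT NULL,
        Cname TEXT,
        Sem   INTEGER,
        PRIMARY KEY (USN, CNO)
    )
""")
# Hypothetical sample rows; S6 is the only student enrolled in CS14.
conn.executemany(
    "INSERT INTO STUDENT_COURSE VALUES (?, ?, ?, ?, ?)",
    [("S1", "Asha", "CS11", "DBMS", 3), ("S6", "Ravi", "CS14", "CN", 3)],
)

def course_known(cno):
    """Return True if any row still records the given course."""
    cur = conn.execute(
        "SELECT COUNT(*) FROM STUDENT_COURSE WHERE CNO = ?", (cno,)
    )
    return cur.fetchone()[0] > 0

# Deletion anomaly: removing student S6 also removes course CS14.
known_before = course_known("CS14")
conn.execute("DELETE FROM STUDENT_COURSE WHERE USN = 'S6'")
known_after = course_known("CS14")

# Insertion anomaly: a course without any enrolled student cannot be stored.
try:
    conn.execute(
        "INSERT INTO STUDENT_COURSE VALUES (NULL, NULL, 'CS15', 'OS', 4)"
    )
    insert_rejected = False
except sqlite3.IntegrityError:
    insert_rejected = True

print(known_before, known_after, insert_rejected)
```

Splitting the data into separate student, course, and enrollment tables removes both effects, which is what the later normalization steps aim for.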
Deletion Anomalies
DELETE FROM STUDENT_COURSE WHERE USN = 'S6';
WASE Student_Course Details Primary Key: USN,CNO
USN Sname DOB Addr CNO Cname Sem M-range Grade
P6 ERP Hubli D1
Project_Department
Pno Pname DNo Deptno Dname loc
Spurious Tuples
Report(USN, CNO, Sname, DOB, Addr, Cname, Sem, Marks_Range, Grade)
CNO -> Cname, Sem
USN -> Sname, DOB, Addr
USN, CNO -> Marks_Range
Marks_Range -> Grade
A -> B and B -> C => A -> C
The attribute Room# is said to be transitively
dependent on the key C# since it is dependent on
LName which in turn is dependent on C#.
Inference Rules(IR) or
Armstrong's Axioms
IR-1: Reflexivity: If X ⊇ Y, then X -> Y (in particular, X -> X).
IR-2: Augmentation: If X -> Y, then XZ -> YZ.
IR-3: Transitivity: If X -> Y and Y -> Z,
then X -> Z.
IR-4: Decomposition: If X -> YZ, then X -> Y
and X -> Z.
IR-5: Union: If X -> Y and X -> Z,
then X -> YZ.
IR-6: Pseudotransitivity: If X -> Y and WY -> Z, then
WX -> Z.
IR-1: Reflexivity: If X ⊇ Y, then X -> Y.
Proof: Let r be some relation state of R in which
there exist two tuples t1 and t2
such that t1[X] = t2[X]. Then we must have
t1[Y] = t2[Y]
because X ⊇ Y; hence X -> Y must hold in R.
Ex: X = {ssn, ename}
Y = {ename}
{ssn, ename} -> {ename}
IR-2: Augmentation: If X -> Y, then XZ -> YZ.
Ex: {ssn} -> {ename}
{ssn, add} -> {ename, add}
{ssn, phn, add} -> {ename, phn, add}
Therefore XZ -> YZ.
IR-3: Transitivity: If X -> Y and Y -> Z,
then X -> Z.
(1) A -> BC {Given}
(2) A -> C {Decomposition of (1)}
(3) AD -> CD {Augmentation of (2) by adding D}
(4) D -> EF {Given}
(5) CD -> EF {Augmentation of (4) by adding C, then Decomposition}
(6) AD -> EF {Transitivity of (3) and (5)}
(7) AD -> F {Decomposition of (6)}
Attribute closure, X+
To compute F+, start with the FDs in F and repeatedly apply IR-1 to
IR-3 until no new FD can be derived.
Armstrong's Axioms are sound: they do not add any incorrect FDs
to F+. However, finding F+ is too expensive; the
complexity grows exponentially.
The solution is to find the attribute closure of X, denoted as X+.
Algorithm to find X+
Algorithm Attribute_Closure(X, F)
{
X+ = X;
repeat {
for each FD Y -> Z in F do
if Y ⊆ X+ then X+ = X+ ∪ Z
// i.e., if Y is in X+, then add Z to X+
} until no change;
// until no more attributes are added to X+
}
Example: Let us consider
SSN ename pnumber pname plocation hours
F = {ssn -> ename,
pnumber -> {pname, plocation},
{ssn, pnumber} -> hours}
Closure sets w.r.t. F
{ssn}+ = {ssn, ename}
{pnumber}+ = {pnumber, pname, plocation}
{ssn, pnumber}+ = {ssn, ename, pnumber, pname, plocation, hours}
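The closure algorithm can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `attribute_closure` and the representation of each FD as a `(lhs, rhs)` pair of sets are assumptions made for this example. The FDs are the ssn/pnumber ones from the example above.

```python
def attribute_closure(attrs, fds):
    """Return the closure of `attrs` under the FDs in `fds`.

    `fds` is a list of (lhs, rhs) pairs; each side is a set of attribute names.
    """
    closure = set(attrs)
    changed = True
    while changed:  # repeat until no more attributes are added
        changed = False
        for lhs, rhs in fds:
            # If the whole left-hand side is already in the closure,
            # the right-hand side can be added.
            if set(lhs) <= closure and not set(rhs) <= closure:
                closure |= set(rhs)
                changed = True
    return closure


F = [
    ({"ssn"}, {"ename"}),
    ({"pnumber"}, {"pname", "plocation"}),
    ({"ssn", "pnumber"}, {"hours"}),
]

ssn_closure = attribute_closure({"ssn"}, F)
pnumber_closure = attribute_closure({"pnumber"}, F)
key_closure = attribute_closure({"ssn", "pnumber"}, F)
print(sorted(ssn_closure), sorted(pnumber_closure), sorted(key_closure))
```

Because {ssn, pnumber}+ contains every attribute of the relation, {ssn, pnumber} is a superkey.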
Example
Consider R (A, B, C) and a set of FDs
F = {AB -> C, C -> B}
Using the algorithm, we calculate the following
closure sets with respect to F:
A+ = {A},
B+ = {B},
C+ = {C, B} because of FD-2
{AB}+ = {ABC} because of FD-1, add attribute C
{AC}+ = {ACB} because of AC -> AB (IR-2), add
attribute B
{BC}+ = {BC} nothing can be added
{ABC}+ = {ABC} nothing can be added
Minimal Cover
A set of FDs G is a minimal cover of a set of FDs F
if G is equivalent to F and satisfies the
following conditions:
a) Every FD in G has a single attribute on its right-hand
side, i.e., X -> A, where A is a single attribute.
b) No FD can be removed from G and still have a set of
FDs that is equivalent to F.
c) We cannot replace any FD X -> A in G with a
dependency Y -> A, where Y ⊂ X, and still have a
set of dependencies that is equivalent to F.
Algorithm to find the Minimal cover
Algorithm MinimalCover(F)
{
Step-1: G = F.
Step-2: Transform G into a set of FDs whose right-hand sides
contain only one attribute (canonical form).
Step-3: Eliminate redundant attributes from the left-hand sides.
For each dependency A1, A2, ..., Ak -> B in the current set G,
and each attribute Ai on its left-hand side,
if (G - {A1, A2, ..., Ak -> B}) ∪ {A1, A2, ..., Ai-1, Ai+1, ..., Ak -> B}
is equivalent to G,
then delete Ai from the left-hand side of A1, A2, ..., Ak -> B.
Step-4: Eliminate redundant dependencies.
For each dependency X -> Y in the current set of dependencies G,
if G - {X -> Y} is equivalent to G, then delete X -> Y from G.
}
Given F = {B -> A, D -> A, AB -> D}, find the
minimal cover of F.
STEP 1: G = {B -> A, D -> A, AB -> D}
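The remaining steps of the algorithm can be traced with a short Python sketch. The names `closure` and `minimal_cover` and the encoding of FDs as pairs of attribute strings are assumptions made for this illustration. The result depends on the order in which attributes and FDs are examined, so other valid minimal covers may exist for other inputs.

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under `fds` (pairs of frozensets)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def minimal_cover(fds):
    """Return one minimal cover of `fds` as a set of (lhs, rhs) frozenset pairs."""
    # Step 2: split every FD so that each right-hand side is a single attribute.
    g = [(frozenset(lhs), frozenset({a})) for lhs, rhs in fds for a in sorted(rhs)]

    # Step 3: drop a left-hand attribute when the reduced left-hand side
    # still determines the right-hand side under the current set.
    for i in range(len(g)):
        lhs, rhs = g[i]
        for attr in sorted(lhs):
            reduced = lhs - {attr}
            if reduced and rhs <= closure(reduced, g):
                lhs = reduced
                g[i] = (lhs, rhs)

    # Step 4: drop an FD when the remaining FDs still imply it.
    for fd in list(g):
        rest = [other for other in g if other != fd]
        if fd[1] <= closure(fd[0], rest):
            g = rest
    return set(g)


# F = {B -> A, D -> A, AB -> D}, written as (lhs, rhs) attribute strings.
F = [("B", "A"), ("D", "A"), ("AB", "D")]
cover = minimal_cover(F)
for lhs, rhs in sorted(cover, key=lambda fd: sorted(fd[0])):
    print("".join(sorted(lhs)), "->", "".join(sorted(rhs)))
```

For this input, Step 3 reduces AB -> D to B -> D (since B+ already contains D), and Step 4 then drops B -> A, which follows from B -> D and D -> A.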
[Dependency graph over the attributes A, B, C, D, E, G, H]
Vni = {B,C,G,H}
Voi = {A, E}
Candidate keys = AE
Example - 2
Consider R (A, B, C, D, E, H), and
F = {A -> B, AB -> E, BH -> C, C -> D, D -> A}
[Dependency graph over the attributes A, B, C, D, E, H]
Vni = {H}
Voi = {E}
Candidate keys = AH, BH, CH, and DH
BS = {A, B, C, D}. To find the candidate keys w.r.t. BS:
AH+ = {AH} = {AHB} = {AHBE} = {AHBEC} = {AHBECD}, so CK1 = {AH}
BH+ = {BH} = {BHC} = {BHCD} = {BHCDA} = {BHCDAE}, so CK2 = {BH}
CH+ = {CH} = {CHD} = {CHDA} = {CHDAB} = {CHDABE}, so CK3 = {CH}
DH+ = {DH} = {DHA} = {DHAB} = {DHABE} = {DHABEC}, so CK4 = {DH}
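The result of Example 2 can be cross-checked by brute force: try attribute subsets in increasing size and keep those whose closure is the whole relation and which contain no smaller key. This is a sketch for small schemas only (the search is exponential); `closure`, `fd`, and `candidate_keys` are names chosen for this illustration, and the method differs from the dependency-graph shortcut used on the slide.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attribute closure of `attrs` under `fds` (pairs of frozensets)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def fd(lhs, rhs):
    """Build an FD from two attribute strings, e.g. fd('AB', 'E')."""
    return frozenset(lhs), frozenset(rhs)


def candidate_keys(attributes, fds):
    """Return all minimal attribute sets whose closure covers `attributes`."""
    attributes = frozenset(attributes)
    keys = []
    for size in range(1, len(attributes) + 1):
        for combo in combinations(sorted(attributes), size):
            candidate = frozenset(combo)
            # Skip supersets of keys already found: they are not minimal.
            if any(key <= candidate for key in keys):
                continue
            if closure(candidate, fds) == attributes:
                keys.append(candidate)
    return keys


# R(A, B, C, D, E, H) with F = {A -> B, AB -> E, BH -> C, C -> D, D -> A}
F = [fd("A", "B"), fd("AB", "E"), fd("BH", "C"), fd("C", "D"), fd("D", "A")]
keys = {"".join(sorted(k)) for k in candidate_keys("ABCDEH", F)}
print(sorted(keys))
```

Note that H appears in every key, since no FD has H on its right-hand side.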
F={A->C,C->D,D->B,E->F}
F={CH->G,A->BC,B->CFH,E->A,F->EG}
First normal form: 1NF
A relation schema is in 1NF if all of its attributes are:
single-valued
restricted to assuming atomic values,
1NF implies:
Composite attributes are represented only by their
component attributes
Attributes cannot have multiple values
The solution
Course_Offering (Lecturer#, Course#, Num-of-Students)
Lecturer (Lecturer#, Dept#)
Case Study
The HR dept of an organization is planning a big recruitment drive.
They wish to organize the data required for the process in a
database. The data that needs to be captured is as follows:
Functional dependencies
Enroll# -> Name
Enroll# -> Address
Enroll# -> DOB
Enroll# -> Gender
Enroll# -> Phone
Enroll# -> interviewer
interviewer -> Int_Name (transitive dependency)
interviewer -> Extension (transitive dependency)
Qualifications
{Enroll#, qualification, year_of_passing} -> awarded_by
{Enroll#, qualification, year_of_passing} -> class
Assumptions:
A person may acquire the same qualification several times from
the same university (e.g., M.A. in English, M.A. in History).
Only one degree can be obtained in a year.
Functional dependencies
Employment
Enroll#, Employername, date_joined -> designation
Enroll#, Employername, date_joined -> reason_for_Leaving
Enroll#, Employername, date_joined -> date_left
Enroll#, Employername, date_joined -> last_salary
Employername -> address (partial dependency)
Employername -> telephone (partial dependency)
1NF
Applicant(Enroll#, Name, Address, DOB, Gender, Phone, interviewer, Int_Name, Extension)
Qualifications(Enroll#, qualification, year_of_passing, awarded_by, class)
Employment(Enroll#, Employername, date_joined, address, telephone, designation, reason_for_Leaving, date_left, last_salary)
2NF
Applicant(Enroll#, Name, Address, DOB, Gender, Phone, interviewer, Int_Name, Extension)
Qualifications(Enroll#, qualification, year_of_passing, awarded_by, class)
Employment(Enroll#, Employername, date_joined, designation, reason_for_Leaving, date_left, last_salary)
Employer(Employername, Address, Phone)
SupplierId -> City
{SupplierId, ProdId} -> City
End of Chapter
Functional Dependencies
Consider the following Relation
Student#
Marks
Course#
Let's observe the data of the Online Retail Application table in a flat file
In this scenario:
Can we insert the record of an item which has not been
purchased by any customer?
The table is not meant to maintain
the record of items; it is meant to
keep the record of purchases
of items by customers.
(iv) UnitPrice -> Class
Second Normal Form : (Cont..)
Key and Non Key Attributes of Retail Application Table
CustomerId -> CustomerName, AccountNo
ItemId -> ItemName, UnitPrice, Class
Second Normal Form : (Cont..)
After removing the partial dependencies on key attributes, we get
the tables below, which are in 2NF:
Customer
CustomerId CustomerName Accountno
1001 John 1500012351
1002 Tom 1200354611
1003 Maria 2134724532
Item
ItemId ItemName UnitPrice Class
STN001 Pen 10 A
BAK003 Bread 10 A
GRO001 Potato 20 B
Second Normal Form : (Cont..)
ItemPurchase
CustomerId ItemId QtyPurchased NetAmt
1001 STN001 5 50
1002 BAK003 1 10
1003 GRO001 1 20
Third Normal Form: 3 NF
A relation R is said to be in the Third Normal Form (3NF) if and only if
It is in 2NF and
No transitive dependency exists between non-key attributes and key
attributes through another non key attribute.
A -> B -> C
A should be a key attribute; B and C should be non-key attributes.
ItemId -> UnitPrice
UnitPrice -> Class
Item
ItemId ItemName UnitPrice
STN001 Pen 10
BAK003 Bread 10
GRO001 Potato 20
ItemClass
UnitPrice Class
10 A
20 B
ename ssn bdate add dno dname dmgrssn
lots
pid Country name Lot# area price Tax_rate
area price
Boyce-Codd Normal Form: BCNF
A relation schema R is in BCNF if,
for every FD X -> A that holds in R, either
i) A ⊆ X (the FD is trivial), or
ii) X is a superkey of R.
FD1: AB -> CD
FD2: C -> B
A+ = {A}
AB+ = {A, B, D, C}
AC+ = {A, C, B, D}
Logically, R1{A, C, D} is a better choice than R1{A, B, D},
as the join operation will not generate SPURIOUS
tuples.
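The claim about spurious tuples can be checked with the standard test for a binary decomposition: the join of R1 and R2 is lossless if the closure of their common attributes contains all of R1 or all of R2. The sketch below applies that test to both choices of R1; the function names are assumptions for this illustration, and the second relation is taken to be R2{C, B} from FD2.

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under `fds` (pairs of frozensets)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def is_lossless(r1, r2, fds):
    """Binary lossless-join test: (R1 ∩ R2)+ must contain R1 or R2."""
    r1, r2 = set(r1), set(r2)
    common_closure = closure(r1 & r2, fds)
    return r1 <= common_closure or r2 <= common_closure


# R(A, B, C, D) with FD1: AB -> CD and FD2: C -> B
F = [(frozenset("AB"), frozenset("CD")), (frozenset("C"), frozenset("B"))]

with_acd = is_lossless("ACD", "CB", F)  # common attribute C, C+ = {C, B}
with_abd = is_lossless("ABD", "CB", F)  # common attribute B, B+ = {B}
print(with_acd, with_abd)
```

With R1{A, C, D} the shared attribute C determines all of R2, so the join is lossless; with R1{A, B, D} the shared attribute B determines nothing else, so the join can produce spurious tuples.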
Example - 2
Consider R (City, Street, Zipcode) or R (C, S, Z) and F = {CS -> Z, Z -> C}.
The candidate keys for R are CS and ZS (using the dependency
graph).
The relation R is in 3NF (since each attribute is prime) but not
in BCNF, because in Z -> C, Z is not a superkey and the FD is
not trivial. In R, we cannot store the city to which a
zipcode belongs unless we know a street address with the
zipcode. This introduces an insertion anomaly.
To convert this into BCNF, decompose R into:
R1 = {Z, C} and R2 = {S, Z}
If we instead had R2 = {C, S} as the other table, then
we could not have a foreign key reference to link the two tables:
the attributes of R2 (i.e., C, S) do not determine any other attribute.
BCNF
Also, from the FDs F = {FD1: CS -> Z, FD2: Z -> C}
we know that C is dependent (from FD2).
Vni = {S}
Voi = { }
S is part of every candidate key, but S alone is not a key:
S+ = {S}
SZ+ = {S, Z, C}
SC+ = {C, S, Z}
Logically, R2{S, Z} is a better choice than R2{C, S}, as
the join operation will not generate SPURIOUS
tuples.
Example - 3
Consider the relation GradeList (S, N, C, G)
FD-1: {Name, Course} -> GPA (NC -> G)
FD-2: {StudentNo, Course} -> GPA (SC -> G)
FD-3: Name -> StudentNo (N -> S)
FD-4: StudentNo -> Name (S -> N)
Candidate keys are:
{Name, Course}
{StudentNo, Course}
The relation is in 3NF,
but there is redundancy of data:
the association between Name and the corresponding
StudentNo is repeated.
There is an insertion anomaly.
There is a deletion anomaly too
(if a student fails in all subjects, the student
information is lost!).
The relation GradeList is not in BCNF, because
FD-3 and FD-4 are nontrivial and
their determinants (left-hand sides) are not
superkeys of GradeList.
BCNF Checking
For each nontrivial FD X -> Y in R, calculate X+.
If X+ includes all the attributes of R for every such FD, then R is in BCNF;
otherwise it is not.
E.g., assume R (C, S, Z) and F = {CS -> Z, Z -> C}, which
is not in BCNF.
Attribute closure:
(CS)+ = (C, S, Z) and
Z+ = (Z, C). The closure for the second FD, Z -> C, does not include
all attributes, and hence R is not in BCNF. So,
decompose R based on FD2 as R2(Z, C).
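The BCNF check above translates directly into code: for every nontrivial FD, test whether the closure of its left-hand side is the whole relation. The sketch below reports the violating FDs for R(C, S, Z); the function names and the FD encoding are assumptions for this illustration.

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under `fds` (pairs of frozensets)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def bcnf_violations(attributes, fds):
    """Return the nontrivial FDs whose left-hand side is not a superkey."""
    attributes = set(attributes)
    violations = []
    for lhs, rhs in fds:
        trivial = rhs <= lhs
        if not trivial and closure(lhs, fds) != attributes:
            violations.append((lhs, rhs))
    return violations


# R(C, S, Z) with F = {CS -> Z, Z -> C}
F = [(frozenset("CS"), frozenset("Z")), (frozenset("Z"), frozenset("C"))]
violations = bcnf_violations("CSZ", F)
for lhs, rhs in violations:
    print("".join(sorted(lhs)), "->", "".join(sorted(rhs)), "violates BCNF")
```

An empty result means the relation is in BCNF with respect to the given FDs; here only Z -> C is reported, matching the closure computation above.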
Emp_Details relation
E_ssn E_Name E-Dob E-Sal Dno D_name D_loc
FD1 FD2
btype listprice
Aname Aff
FD1: AB -> CD
FD2: C -> B
AB+ = {ABCD} // OK
C+ = {CB} // NOT OK
So decompose:
R1(A, C, D): remove B, i.e., the RHS of FD2,
from FD1.
R2(C, B)
2.3 Equivalence of Sets of FDs
Two sets of FDs F and G are equivalent if:
Every FD in F can be inferred from G, and
Every FD in G can be inferred from F
Hence, F and G are equivalent if F+ = G+
Definition (Covers):
F covers G if every FD in G can be inferred from F
(i.e., if G+ ⊆ F+)
F and G are equivalent if F covers G and G covers F
There is an algorithm for checking equivalence of
sets of FDs
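One common way to carry out this check is with attribute closures: F covers G if, for every FD X -> Y in G, Y is contained in X+ computed under F. The sketch below uses that approach; the helper names `fd`, `covers`, and `equivalent` are assumptions for this illustration, and the two test pairs are FD sets that appear in the worked examples of this module.

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under `fds` (pairs of frozensets)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def fd(lhs, rhs):
    """Build an FD from two attribute strings, e.g. fd('AB', 'C')."""
    return frozenset(lhs), frozenset(rhs)


def covers(f, g):
    """True if every FD in g can be inferred from f."""
    return all(rhs <= closure(lhs, f) for lhs, rhs in g)


def equivalent(f, g):
    """True if f covers g and g covers f."""
    return covers(f, g) and covers(g, f)


# F = {A -> B, A -> C} and G = {A -> B, B -> C}
F1 = [fd("A", "B"), fd("A", "C")]
G1 = [fd("A", "B"), fd("B", "C")]

# F = {A -> B, AB -> C, D -> AC, D -> E} and G = {A -> BC, D -> AE}
F2 = [fd("A", "B"), fd("AB", "C"), fd("D", "AC"), fd("D", "E")]
G2 = [fd("A", "BC"), fd("D", "AE")]

first_pair = equivalent(F1, G1)
second_pair = equivalent(F2, G2)
print(first_pair, second_pair)
```

In the first pair, B -> C cannot be inferred from F (B+ = {B} under F), so F does not cover G even though G covers F.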
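Example
X = {A -> B, B -> C}
Y = {A -> B, B -> C, A -> C}
Check whether X covers Y (Y ⊆ X+)
and Y covers X (X ⊆ Y+);
if both hold, then X and Y are equivalent.
First take Y (closures computed under X):
A -> B: A+ = ABC
B -> C: B+ = BC
A -> C: A+ = ABC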
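X = {AB -> CD, B -> C, C -> D}
Y = {AB -> C, AB -> D, C -> D}
X COVERS Y (closures computed under X):
AB -> C, AB -> D: AB+ = ABCD
C -> D: C+ = CD
Y COVERS X (closures computed under Y):
AB -> CD: AB+ = ABCD
B -> C: B+ = B, so B -> C cannot be inferred from Y
C -> D: C+ = CD
X covers Y, but Y does not cover X: not equivalent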
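F = {A -> B, A -> C}
G = {A -> B, B -> C}
F COVERS G (closures computed under F):
A -> B: A+ = ABC
B -> C: B+ = B, so B -> C cannot be inferred from F
G COVERS F (closures computed under G):
A -> B, A -> C: A+ = ABC
Not equivalent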
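F = {A -> B, B -> C, C -> A}
G = {C -> B, B -> A, A -> C}
F COVERS G (closures computed under F):
C -> B: C+ = ABC
B -> A: B+ = ABC
A -> C: A+ = ABC
G COVERS F (closures computed under G):
A -> B: A+ = ABC
B -> C: B+ = ABC
C -> A: C+ = ABC
F covers G and G covers F, so they are equivalent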
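R(A, B, C, D, E)
F = {A -> B, AB -> C, D -> AC, D -> E}
G = {A -> BC, D -> AE}
F COVERS G (closures computed under F):
A -> BC: A+ = ABC
D -> AE: D+ = DACEB
G COVERS F (closures computed under G):
A -> B: A+ = ABC
AB -> C: AB+ = ABC
D -> AC, D -> E: D+ = DAEBC
F covers G and G covers F,
so they are equivalent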
R(ABCDEF)
FD={AB->C,C->DE,E->F,F->A}
CHECK THE HIGHEST NORMAL FORM?
Merits of Normalization
Normalization is based on a mathematical
foundation.
extent.
Tables in 2 NF
Eliminate partial dependency Tables in 3 NF
Summary
When converting an ER diagram into a relational schema, each strong entity becomes a table.
Three normal forms (1NF, 2NF, and 3NF) are commonly used.
1NF makes sure that all the attributes are atomic in nature.