Normalisation Formatted Unit 4
INTRODUCTION TO DBMS
MIT-ADT UNIVERSITY
Compiled by Dr Ashwin Tomar
[MCA, PhD (Computer Science), MBA]
SYLLABUS
Model 3. (5 Hrs) Entity-Relationship model: Basic concepts, Design
process, constraints, Keys, Design issues, E-R diagrams, weak entity
sets, extended E-R features – generalization, specialization,
aggregation, reduction to E-R database schema
In the database, every entity set or relationship set can be represented in tabular form.
There are some points for converting the ER diagram to the table:
In the given ER diagram, LECTURE, STUDENT, SUBJECT and COURSE each form an individual
table.
In the given ER diagram, the student address is a composite attribute. It contains CITY, PIN,
DOOR#, STREET, and STATE. In the STUDENT table, each of these components becomes an
individual column.
In the STUDENT table, Age is a derived attribute. It can be calculated at any point in time
as the difference between the current date and the Date of Birth.
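As a small illustration of the derived attribute, the sketch below computes Age from Date of
Birth instead of storing it. The function name age_on and the sample dates are assumptions
made for this example only; they do not come from the ER diagram.

```python
from datetime import date


def age_on(date_of_birth: date, today: date) -> int:
    """Return the age in completed years on the given day."""
    years = today.year - date_of_birth.year
    # Subtract one year if the birthday has not yet occurred in the current year.
    if (today.month, today.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years


# Hypothetical student born on 15 June 2000.
print(age_on(date(2000, 6, 15), date(2024, 6, 14)))  # 23
print(age_on(date(2000, 6, 15), date(2024, 6, 15)))  # 24
```

Because the value is always recomputed from Date of Birth, it never goes stale, which is why
a derived attribute normally does not need its own column.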
Using these rules, you can convert the ER diagram to tables and columns and assign the
mapping between the tables. Table structure for the given ER diagram is as below:
X → Y
Here Y (the dependent) is functionally dependent on X (the determinant).
The left side of an FD is known as the determinant; the right side is known as the
dependent.
Each value of the determinant is associated with one and only one value of the dependent.
For example:
Here the Emp_Id attribute uniquely identifies the Emp_Name attribute of the employee table,
because if we know the Emp_Id, we can tell the employee name associated with it.
Emp_Id → Emp_Name
Example:
ID → Name,
Name → DOB
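The definition can be tested mechanically on sample data: an FD X → Y holds in a table if no
two rows agree on X but differ on Y. The sketch below is a minimal check under that
definition. The function name fd_holds and the three employee rows are hypothetical,
introduced only for illustration.

```python
def fd_holds(rows, determinant, dependent):
    """Return True if determinant -> dependent holds in the given rows."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in determinant)
        value = tuple(row[a] for a in dependent)
        # The first time a determinant value appears, remember its dependent value;
        # any later row with the same determinant must carry the same dependent value.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Hypothetical sample rows: two employees share a name.
employees = [
    {"Emp_Id": 1, "Emp_Name": "Asha"},
    {"Emp_Id": 2, "Emp_Name": "Ravi"},
    {"Emp_Id": 3, "Emp_Name": "Asha"},
]

print(fd_holds(employees, ["Emp_Id"], ["Emp_Name"]))  # True
print(fd_holds(employees, ["Emp_Name"], ["Emp_Id"]))  # False
```

Note that such a check can only refute an FD on the data at hand; whether an FD holds in
general is a statement about the meaning of the attributes, not about one sample.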
Normalization
o Normalization is the process of organizing the data in the database.
o Normalization is used to minimize redundancy in a relation or set of
relations. It is also used to eliminate undesirable characteristics such as insertion,
update and deletion anomalies.
o Normalization divides a larger table into smaller tables and links them using
relationships.
o Normal forms are used to reduce redundancy in database tables.
o Normalisation is a two-step process:
i) data is put in tabular form by removing repeating groups
ii) duplicate data is removed from relational tables.
Objectives of Normalisation:
Un-Normalised Tables: Table entries that hold more than one value are
called multivalued entries. Tables with multivalued entries are known as
un-normalised tables.
Normalisation is defined as the process of decomposing redundant schemas by
breaking up their attributes into smaller relation schemas that possess desirable
properties. The method of splitting is known as projection.
The goal is to have only primary keys on the left-hand side of a functional dependency.
Normal Form    Description
EMPLOYEE table:
Example: A school can store the data of teachers and the subjects they teach.
In a school, a teacher can teach more than one subject.
TEACHER table:
TEACHER_ID SUBJECT TEACHER_AGE
25 Chemistry 30
25 Biology 30
47 English 35
83 Math 38
83 Computer 38
To convert the given table into 2NF, we decompose it into two tables:
TEACHER_DETAIL table:
TEACHER_ID TEACHER_AGE
25 30
47 35
83 38
TEACHER_SUBJECT table:
TEACHER_ID SUBJECT
25 Chemistry
25 Biology
47 English
83 Math
83 Computer
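The decomposition above is a pair of projections, and it loses no information: joining the
two smaller tables on TEACHER_ID gives back the original rows. The sketch below illustrates
this with the data shown. The column order (TEACHER_ID, SUBJECT, TEACHER_AGE) is inferred
from the two decomposed tables, and the helper name project is an assumption of this example.

```python
# Rows of the TEACHER table as (TEACHER_ID, SUBJECT, TEACHER_AGE).
teacher = [
    (25, "Chemistry", 30),
    (25, "Biology", 30),
    (47, "English", 35),
    (83, "Math", 38),
    (83, "Computer", 38),
]


def project(rows, *indexes):
    """Project rows onto the given column positions, removing duplicates."""
    return sorted({tuple(row[i] for i in indexes) for row in rows})


teacher_detail = project(teacher, 0, 2)   # TEACHER_ID, TEACHER_AGE
teacher_subject = project(teacher, 0, 1)  # TEACHER_ID, SUBJECT

# Natural join of the two projections on TEACHER_ID.
rejoined = sorted(
    (tid, subject, age)
    for tid, subject in teacher_subject
    for tid2, age in teacher_detail
    if tid == tid2
)

print(teacher_detail)               # [(25, 30), (47, 35), (83, 38)]
print(rejoined == sorted(teacher))  # True
```

Each teacher's age is now stored once in TEACHER_DETAIL instead of once per subject.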
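A relation is in 3NF if at least one of the following conditions holds for every
non-trivial functional dependency X → Y:
1. X is a super key.
2. Y is a prime attribute, i.e., each element of Y is part of some candidate
key.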
Example:
EMPLOYEE_DETAIL table:
That's why we need to move EMP_CITY and EMP_STATE to the new
EMPLOYEE_ZIP table, with EMP_ZIP as the primary key.
EMPLOYEE table:
EMPLOYEE_ZIP table:
E.g: In the ER diagram primary key is represented by underlining the primary key attribute.
Ideally a primary key is composed of only a single attribute. But it is possible to have a
primary key composed of more than one attribute.
Candidate Keys
Candidate Keys are super keys for which no proper subset is a super key. In other words
candidate keys are minimal super keys.
Composite Key:
A composite key consists of more than one attribute.
Example: Consider a relation (table) R with attributes A, B, C, D, E.
R(A,B,C,D,E)
A→BCDE This means the attribute 'A' uniquely determines the other attributes B,C,D,E.
BC→ADE This means the attributes 'BC' jointly determine all the other attributes A,D,E in
the relation.
Composite Key
Super Key
Candidate key
Primary key
Examples:
Super key
Super Key in DBMS: A super key is a set of one or more attributes (columns), which can
uniquely identify a row in a table.
Table: Employee
Super keys: The above table has following super keys. All of the following sets of super key
are able to uniquely identify a row of the employee table.
{Emp_SSN}
{Emp_Number}
{Emp_SSN, Emp_Number}
{Emp_SSN, Emp_Name}
{Emp_SSN, Emp_Number, Emp_Name}
{Emp_Number, Emp_Name}
Candidate Keys: A candidate key is a minimal super key with no redundant attributes. The
following two sets are chosen from the above super keys because they contain no redundant
attributes.
{Emp_SSN}
{Emp_Number}
Only these two sets are candidate keys, as all other sets have redundant attributes that
are not necessary for unique identification.
In the above example, we have not chosen {Emp_SSN, Emp_Name} as a candidate key
because {Emp_SSN} alone can identify a unique row in the table and Emp_Name is
redundant.
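The selection can be written as a simple filter: keep a super key only if no other super key
in the list is a proper subset of it. This is a minimal sketch that assumes the list of
super keys is given as above; the function name candidate_keys is hypothetical.

```python
# Super keys of the Employee table, as listed above.
super_keys = [
    {"Emp_SSN"},
    {"Emp_Number"},
    {"Emp_SSN", "Emp_Number"},
    {"Emp_SSN", "Emp_Name"},
    {"Emp_SSN", "Emp_Number", "Emp_Name"},
    {"Emp_Number", "Emp_Name"},
]


def candidate_keys(keys):
    """Keep the super keys that have no other listed super key as a proper subset."""
    # For sets, `other < k` is True when `other` is a proper subset of `k`.
    return [k for k in keys if not any(other < k for other in keys)]


print(candidate_keys(super_keys))  # [{'Emp_SSN'}, {'Emp_Number'}]
```

The filter is only as good as its input: it assumes every minimal super key actually appears
in the list.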
Primary key:
A Primary key is selected from a set of candidate keys. This is done by database admin or
database designer. We can say that either {Emp_SSN} or {Emp_Number} can be chosen as a
primary key for the table Employee.
Here Emp_Id and Emp_Number have unique values, while Emp_Name can have
duplicate values because more than one employee can have the same name.
Let's select the candidate keys from the above set of super keys.
Note: A primary key is selected from the set of candidate keys. That means we can have
either Emp_Id or Emp_Number as the primary key. The decision is made by the DBA (database
administrator).
For example:
In the below example the Stu_Id column in Course_enrollment table is a foreign key as it
points to the primary key of the Student table.
Course_enrollment table:
Course_Id Stu_Id
C01 101
C02 102
C03 101
C05 102
C06 103
C07 102
Student table:
Note: In practice, a foreign key does not have to reference the column tagged as the primary
key of another table; if it points to any unique column (not necessarily the primary key) of
another table, it is still a foreign key. So, a more accurate definition would be: foreign
keys are the columns of a table that point to a candidate key of another table.
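To see a foreign key being enforced, the sketch below uses Python's built-in sqlite3 module
with an in-memory database. It is an illustration under assumptions: the Student table is
reduced to a single Stu_Id column holding the IDs 101, 102 and 103 that appear in
Course_enrollment, and the helper try_enroll is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE Student (Stu_Id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE Course_enrollment ("
    " Course_Id TEXT PRIMARY KEY,"
    " Stu_Id INTEGER REFERENCES Student(Stu_Id))"
)
conn.executemany("INSERT INTO Student VALUES (?)", [(101,), (102,), (103,)])
conn.executemany(
    "INSERT INTO Course_enrollment VALUES (?, ?)",
    [("C01", 101), ("C02", 102), ("C03", 101)],
)


def try_enroll(course_id, stu_id):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        conn.execute(
            "INSERT INTO Course_enrollment VALUES (?, ?)", (course_id, stu_id)
        )
        return True
    except sqlite3.IntegrityError:
        return False


print(try_enroll("C06", 103))  # True: student 103 exists
print(try_enroll("C08", 999))  # False: there is no student 999
```

The second insert is rejected because its Stu_Id does not match any row of the referenced
table, which is exactly the guarantee a foreign key provides.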
Note: Any key such as a super key, primary key, candidate key etc. can be called a composite
key if it has more than one attribute.
Table – Sales
None of these columns alone can play the role of a key in this table.
Column cust_Id alone cannot be a key because the same customer can place multiple orders,
and thus can have multiple entries.
Column order_Id alone cannot be a primary key because one order can contain
multiple products, so the same order_Id can be present multiple times.
Column product_code alone cannot be a primary key because more than one customer can place
an order for the same product.
Column product_count alone cannot be a primary key because two orders can have
the same product count.
Based on this, it is safe to assume that the key must have more than one attribute:
Key in above table: {cust_id, product_code}
Table: Employee
The DBA (database administrator) can choose any of the above keys as the primary key. Let's
say Emp_Id is chosen as the primary key.
Since we have selected Emp_Id as the primary key, the remaining key Emp_Number is
called the alternate or secondary key.
Example:
Let's assume there is a company where employees work in more than one
department.
EMPLOYEE table:
1. EMP_ID → EMP_COUNTRY
2. EMP_DEPT → {DEPT_TYPE, EMP_DEPT_NO}
The table is not in BCNF because neither EMP_DEPT nor EMP_ID alone is a
key.
To convert the given table into BCNF, we decompose it into three tables:
EMP_COUNTRY table:
EMP_ID EMP_COUNTRY
264 India
264 India
EMP_DEPT table:
EMP_DEPT_MAPPING table:
EMP_ID EMP_DEPT
D394 283
D394 300
D283 232
D283 549
Functional dependencies:
1. EMP_ID → EMP_COUNTRY
2. EMP_DEPT → {DEPT_TYPE, EMP_DEPT_NO}
Candidate keys:
Now, this is in BCNF because the left side of both functional dependencies
is a key.
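The BCNF test used in this example can be sketched in code: for every non-trivial FD that
applies to a relation, the left side must be a super key, i.e. its attribute closure must
cover all attributes of the relation. The function names closure and bcnf_violations are
assumptions of this sketch, and the attribute list of the EMP_DEPT table is inferred from
the second functional dependency.

```python
def closure(attrs, fds):
    """Return all attributes functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def bcnf_violations(attributes, fds):
    """Return the non-trivial FDs whose left side is not a super key."""
    return [
        (lhs, rhs)
        for lhs, rhs in fds
        if not rhs <= lhs and not attributes <= closure(lhs, fds)
    ]


fd1 = (frozenset({"EMP_ID"}), frozenset({"EMP_COUNTRY"}))
fd2 = (frozenset({"EMP_DEPT"}), frozenset({"DEPT_TYPE", "EMP_DEPT_NO"}))

employee = {"EMP_ID", "EMP_COUNTRY", "EMP_DEPT", "DEPT_TYPE", "EMP_DEPT_NO"}
print(len(bcnf_violations(employee, [fd1, fd2])))  # 2: both FDs violate BCNF

# After decomposition, each table is checked against the FDs that apply to it.
print(bcnf_violations({"EMP_ID", "EMP_COUNTRY"}, [fd1]))                # []
print(bcnf_violations({"EMP_DEPT", "DEPT_TYPE", "EMP_DEPT_NO"}, [fd2]))  # []
print(bcnf_violations({"EMP_ID", "EMP_DEPT"}, []))                       # []
```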
Example
STUDENT
STU_ID COURSE HOBBY
21 Computer Dancing
21 Math Singing
34 Chemistry Dancing
74 Biology Cricket
59 Physics Hockey
The given STUDENT table is in 3NF, but COURSE and HOBBY are two independent
entities. Hence, there is no relationship between COURSE and HOBBY.
In the STUDENT relation, the student with STU_ID 21 has two courses, Computer and
Math, and two hobbies, Dancing and Singing. So there is a multi-valued dependency on
STU_ID, which leads to unnecessary repetition of data.
So to make the above table into 4NF, we can decompose it into two tables:
STUDENT_COURSE
STU_ID COURSE
21 Computer
21 Math
34 Chemistry
74 Biology
59 Physics
STUDENT_HOBBY
STU_ID HOBBY
21 Dancing
21 Singing
34 Dancing
74 Cricket
59 Hockey
Example
SUBJECT LECTURER SEMESTER
In the above table, John takes both the Computer and Math classes for Semester 1, but
he doesn't take the Math class for Semester 2. In this case, a combination of all these
fields is required to identify valid data.
Suppose we add a new semester, Semester 3, but do not yet know the
subject or who will be taking it, so we would leave Lecturer and Subject as
NULL. But all three columns together act as the primary key, so we can't leave
the other two columns blank.
So to bring the above table into 5NF, we can decompose it into three relations
P1, P2 & P3:
P1
SEMESTER SUBJECT
Semester 1 Computer
Semester 1 Math
Semester 1 Chemistry
Semester 2 Math
P2
SUBJECT LECTURER
Computer Anshika
Computer John
Math John
Math Akash
Chemistry Praveen
P3
SEMESTER LECTURER
Semester 1 Anshika
Semester 1 John
Semester 2 Akash
Semester 1 Praveen
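The three projections only make sense together: joining just two of them produces rows that
the third one rules out. The sketch below joins P1, P2 and P3 as listed above (with P3 taken
as a set of distinct rows). The variable names are assumptions of this example.

```python
# P1: (SEMESTER, SUBJECT), P2: (SUBJECT, LECTURER), P3: (SEMESTER, LECTURER)
p1 = {("Semester 1", "Computer"), ("Semester 1", "Math"),
      ("Semester 1", "Chemistry"), ("Semester 2", "Math")}
p2 = {("Computer", "Anshika"), ("Computer", "John"), ("Math", "John"),
      ("Math", "Akash"), ("Chemistry", "Praveen")}
p3 = {("Semester 1", "Anshika"), ("Semester 1", "John"),
      ("Semester 2", "Akash"), ("Semester 1", "Praveen")}

# Joining only P1 and P2 on SUBJECT.
p1_p2 = sorted(
    (subject, lecturer, semester)
    for semester, subject in p1
    for subject2, lecturer in p2
    if subject == subject2
)

# Joining all three: keep a row only if P3 also contains its (SEMESTER, LECTURER) pair.
joined = [row for row in p1_p2 if (row[2], row[1]) in p3]

print(len(p1_p2))  # 7
print(len(joined))  # 5
for row in joined:
    print(row)
```

The two rows dropped by P3 are (Math, Akash, Semester 1) and (Math, John, Semester 2); the
second is consistent with the statement that John does not take the Math class in Semester 2.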
The two basic keys of any database are the super key and the candidate key. Every candidate
key is a super key, but a super key may or may not be a candidate key.
Comparison Chart
Basis for Comparison | Super Key | Candidate Key
Selection | The set of super keys forms the base for selection of candidate keys. | The set of candidate keys forms the base for selection of a single primary key.
Count | There are comparatively more super keys in a relation. | There are comparatively fewer candidate keys in a relation.
A super key is a basic key of any relation. It is defined as a key that can identify all other
attributes in a relation.
A super key can be a single attribute or a set of attributes. No two entities have the same
values for the attributes that compose a super key. There is at least one super key in
every relation.
A minimal super key is also called a candidate key. So each super key can be
checked for being a candidate key.
Let us take a relation R (A, B, C, D, E, F); we have the following dependencies for R,
and we have checked each for being a super key.
Using the key AB we are able to identify the rest of the attributes of the table, i.e. CDEF.
Similarly, using the keys CD, ABD, DF, and DEF we can identify the remaining attributes of R.
So all of these are super keys.
But using the key CB we can only find values for attributes D and F; we cannot find the values
of attributes A and E. Hence, CB is not a super key. The same is the case with D: we cannot
find the values of all attributes in the table using D alone. So, D is not a super key.
A super key none of whose proper subsets is itself a super key of the same relation is called
a minimal super key. A minimal super key is called a candidate key. Like a super key, a
candidate key also identifies each tuple in a table uniquely. A candidate key’s attribute can
accept a NULL value.
One of the candidate keys is chosen as the primary key by the DBA, provided that the key
attribute values are unique and do not contain NULL. The attributes of a candidate key are
called prime attributes.
In the above example, we found the super keys for relation R. Now, let us check all the
super keys for being candidate keys.
Super key AB is a proper subset of super key ABD. Since AB alone is capable of identifying
all attributes in the table, the bigger key ABD contains a redundant attribute and is not
minimal. Hence, super key AB is a candidate key while ABD is only a super key.
Similarly, super key DF is a proper subset of super key DEF. Since DF alone is capable of
identifying all attributes in the relation, DEF is not minimal. Hence, super key
DF is a candidate key while DEF is only a super key.
No proper subset of super key CD is itself a super key. So, CD is a
minimal super key that identifies all attributes in the relation. Hence, CD is a candidate key.
Keys CB and D are not super keys, so they cannot be candidate keys either. Viewing the
above table, you can conclude that each candidate key is a super key but the converse is not
true.
Conclusion:
The super key is a basic key of any relation. Super keys must be identified first, before
the other keys of the relation, as they form the base for the other keys. Candidate keys are
important as they help in recognizing the most important key of any relation, the primary key.
1. Candidate Key: individual columns in a table that qualify for uniquely identifying all the
rows. E.g.: in the Employee table, Employee ID and SSN are candidate keys.
3. Alternate Key: a candidate column other than the primary key column; if EmployeeID is the
PK, then SSN is the alternate key.
4. Super Key: if you add any other column/attribute to a primary key, it becomes a super
key; for example, EmployeeID + FullName is a super key.
5. Composite Key: if a table does not have a single column that qualifies as a candidate key,
then you have to select 2 or more columns to make a row unique. For example, if there are no
Employee ID or SSN columns, you can make Full Name + Date Of Birth a composite
primary key. There can still be a small chance of duplicate rows.
A functional dependency A->B holds in a relation if any two tuples having the same value of
attribute A also have the same value for attribute B. For example, in relation STUDENT shown
in table 1, the functional dependencies
STUD_NO->STUD_NAME, STUD_NO->STUD_ADDR hold
but
STUD_NAME->STUD_ADDR does not hold
Functional Dependency Set: Functional Dependency set or FD set of a relation is the set of
all FDs present in the relation.
For Example, FD set for relation STUDENT shown in table 1 is:
Attribute Closure: The attribute closure of an attribute set is the set of attributes
that can be functionally determined from it.
If the attribute closure of an attribute set contains all attributes of the relation, the
attribute set is a super key. If, in addition, no proper subset of this attribute set can
functionally determine all attributes of the relation, the set is a candidate key as well.
For example, using the FD set of table 1,
(STUD_NO, STUD_NAME)+ = {STUD_NO, STUD_NAME, STUD_PHONE,
STUD_STATE, STUD_COUNTRY, STUD_AGE}
(STUD_NO)+ = {STUD_NO, STUD_NAME, STUD_PHONE, STUD_STATE,
STUD_COUNTRY, STUD_AGE}
(STUD_NO, STUD_NAME) is a super key but not a candidate key, because the closure of its
subset, (STUD_NO)+, is equal to all attributes of the relation. So, STUD_NO is a candidate key.
GATE Question:
Consider the relation scheme R = {E, F, G, H, I, J, K, L, M, N} and the set of functional
dependencies {{E, F} -> {G}, {F} -> {I, J}, {E, H} -> {K, L}, {K} -> {M}, {L} -> {N}} on R.
What is the key for R? (GATE-CS-2014)
A. {E, F}
B. {E, F, H}
C. {E, F, H, K, L}
D. {E}
Answer: Finding the attribute closure of each given option, we get:
{E,F}+ = {EFGIJ}
{E,F,H}+ = {EFHGIJKLMN}
{E,F,H,K,L}+ = {EFHGIJKLMN}
{E}+ = {E}
Both {EFH}+ and {EFHKL}+ result in the set of all attributes, but EFH is minimal. So it is
the candidate key, and the correct option is (B).
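The closures in this answer can be reproduced with a short fixed-point loop: start from the
given attributes and keep adding the right side of any FD whose left side is already
covered. This is a minimal sketch; the function name attribute_closure and the encoding of
attribute sets as strings of single letters are choices made for this example.

```python
def attribute_closure(attrs, fds):
    """Return the set of attributes functionally determined by attrs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # Fire an FD when its whole left side is already in the closure.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


# FDs of the question; each attribute is a single letter.
fds = [("EF", "G"), ("F", "IJ"), ("EH", "KL"), ("K", "M"), ("L", "N")]

for option in ("EF", "EFH", "EFHKL", "E"):
    print(option, "".join(sorted(attribute_closure(option, fds))))
# EF EFGIJ
# EFH EFGHIJKLMN
# EFHKL EFGHIJKLMN
# E E
```

Only EFH and EFHKL reach all ten attributes, and EFH is the smaller of the two, which
matches option (B).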
GATE Question:
Which of the following functional dependencies is NOT implied by the above set?
(GATE IT 2005)
A. CD -> AC
B. BD -> CD
C. BC -> CD
D. AC -> BC
Answer: Using the FD set given in the question,
(CD)+ = {CDEAB}, which means CD -> AC also holds true.
(BD)+ = {BD}, which means BD -> CD can’t hold true. So this FD is not implied by the FD set.
So (B) is the required option.
The others can be checked in the same way.
Assume that F is a set of functional dependencies for a relation R. The closure of F, denoted
by F+, is the set of all functional dependencies logically implied by F, i.e., F+ is the
set of FD’s that can be derived from F.
Furthermore, F+ is the smallest set of FD’s such that F+ is a superset of F and no FD can be
derived from F by using the axioms that is not contained in F+. If we have identified all the
functional dependencies in a relation, then we can easily identify super keys, candidate
keys, and other determinants necessary for normalization.
Algorithm: To compute F+, the closure of FD’s
Input: Given a relation with a set of FD’s F.
Output: The closure of a set of FD’s F.
Step 3. For each functional dependency f in F+, apply the Reflexivity and Augmentation
axioms to f and add the resulting functional dependencies to F+.
Step 4. For each pair of functional dependencies f1 and f2 in F+, apply the Transitivity
axiom to f1 and f2. If f1 and f2 can be combined, add the resulting FD to F+.
Step 5. For each functional dependency f in F+, apply the Union and Decomposition axioms to f
and add the resulting functional dependencies to F+.
Example. Consider the relation schema R = {H, D, X, Y, Z} and the functional dependencies
X→YZ, DX→W, Y→H. Find the closure F+ of FD’s.
Sol. Applying Decomposition on X→YZ gives X→Y and X→Z. Applying Transitivity on
X→Y and Y→H gives X→H. Thus the closure F+ has the FD’s X→YZ, DX→W, Y→H,
X→Y, X→Z, X→H.
Sol. Applying Decomposition on A→BC gives A→B and A→C. Applying Transitivity on A→B and
B→D gives A→D. Applying Transitivity on CD→E and E→A gives CD→A. Thus the closure F+ has
the FD’s A→BC, CD→E, B→D, E→A, A→B, A→C, A→D, CD→A.
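Listing F+ by hand grows quickly, so in practice one usually tests a single FD for
membership instead: X→Y is in F+ exactly when Y is contained in the attribute closure of X
under F. The sketch below applies this test to the FD set of the last solution (A→BC, CD→E,
B→D, E→A). It is a closure-based shortcut, not the axiom-by-axiom procedure described above,
and the function names closure and is_implied are assumptions of this example.

```python
def closure(attrs, fds):
    """Return the set of attributes functionally determined by attrs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def is_implied(lhs, rhs, fds):
    """Return True if lhs -> rhs belongs to the closure F+ of fds."""
    return set(rhs) <= closure(lhs, fds)


# Each attribute is a single letter, so "BC" stands for the set {B, C}.
fds = [("A", "BC"), ("CD", "E"), ("B", "D"), ("E", "A")]

print(is_implied("A", "D", fds))   # True, as derived by transitivity
print(is_implied("CD", "A", fds))  # True, as derived by transitivity
print(is_implied("B", "A", fds))   # False: the closure of B is only {B, D}
```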