DBMS: Unit-3
Then the following will represent the functional dependency between attributes with an
arrow sign −
A -> B
Example
The following is an example that would make it easier to understand functional
dependency −
The DeptId is our primary key. Here, DeptId uniquely identifies the DeptName attribute.
This is because if you want to know the department name, then at first you need to have
the DeptId.
DeptId DeptName
001 Finance
002 Marketing
003 HR
Therefore, the above functional dependency between DeptId and DeptName can be
written as follows, since DeptName is functionally dependent on DeptId −
DeptId → DeptName
For example:
Here, the Emp_Id attribute can uniquely identify the Emp_Name attribute of the employee
table, because if we know the Emp_Id, we can tell the employee name associated with it.
Emp_Id → Emp_Name
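As a rough illustration, the DeptId → DeptName dependency can be checked against the sample rows shown earlier. The helper name fd_holds and the list-of-dicts layout are assumptions made for this sketch. Passing the check only shows that the sample data does not violate the dependency; a functional dependency is a rule about every legal state of the relation.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if no two rows agree on lhs but differ on rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        # The same determinant value must always map to the same dependent value.
        if seen.setdefault(key, value) != value:
            return False
    return True


dept = [
    {"DeptId": "001", "DeptName": "Finance"},
    {"DeptId": "002", "DeptName": "Marketing"},
    {"DeptId": "003", "DeptName": "HR"},
]

print(fd_holds(dept, ["DeptId"], ["DeptName"]))  # True
```

If a second row with DeptId 001 and a different DeptName were added, the same call would return False.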
Example:
1. ID → Name,
2. Name → DOB
A ->B
3. Partial Dependency
Partial Dependency occurs when a non-prime attribute is functionally dependent on part
of a candidate key.
The 2nd Normal Form (2NF) eliminates the Partial Dependency.
Let us see an example −
Example
<StudentProject>
StudentID ProjectNo StudentName ProjectName
<ProjectInfo>
ProjectNo ProjectName
X->Y
Y does not ->X
Y->Z
{Book} -> {Author} (if we know the book, we know the author name)
1. If X ⊇ Y then X → Y
Example:
1. X = {a, b, c, d, e}
2. Y = {a, b, c}
1. If X → Y then XZ → YZ
Example:
1. If X → Y and X → Z then X → YZ
Proof:
1. X → Y (given)
2. X → Z (given)
3. X → XY (using IR2 on 1 by augmentation with X. Where XX = X)
4. XY → YZ (using IR2 on 2 by augmentation with Y)
5. X → YZ (using IR3 on 3 and 4)
1. If X → YZ then X → Y and X → Z
Proof:
1. X → YZ (given)
2. YZ → Y (using IR1 Rule)
3. X → Y (using IR3 on 1 and 2)
1. If X → Y and YZ → W then XZ → W
Proof:
1. X → Y (given)
2. YZ → W (given)
3. XZ → YZ (using IR2 on 1 by augmenting with Z)
4. XZ → W (using IR3 on 3 and 2)
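The derived rules above can be sanity-checked mechanically. The sketch below is not a proof: it treats X, Y, Z and W as single attributes and tests whether the right side lies in the attribute closure of the left side (closure is covered later in these notes). The function names closure and implies, and the use of strings of single-letter attributes, are assumptions for illustration.

```python
def closure(attrs, fds):
    """Attributes determined by attrs under fds (pairs of attribute strings)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def implies(fds, lhs, rhs):
    """True if lhs -> rhs can be derived from fds."""
    return set(rhs) <= closure(lhs, fds)


print(implies([("X", "Y"), ("X", "Z")], "X", "YZ"))   # union: True
print(implies([("X", "YZ")], "X", "Y"))               # decomposition: True
print(implies([("X", "Y"), ("YZ", "W")], "XZ", "W"))  # pseudo-transitivity: True
print(implies([("X", "Y")], "Y", "X"))                # not derivable: False
```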
Closures of a set of functional dependencies
The closure of a set of FDs is the set of all FDs that can be derived from a given
set of FDs. It is also referred to as the complete set of FDs. If F is used to denote the set of
FDs for relation R, then the closure of the set of FDs implied by F is denoted by F+. Let's
consider the set F of functional dependencies given below:
Now, by applying Rule-6 (Union), it is possible to derive A+ = ABCD, which can be
written as the FD A -> ABCD. All FDs derived in this way from the FDs of F form the
closure of F.
Steps to determine F+:
Determine each set of attributes X that appears as a left hand side of some FD in
F.
Determine the set X+ of all attributes that are dependent on X, as given in above
example.
In other words, X+ represents a set of attributes that are functionally determined
by X based on F. And, X+ is called the Closure of X under F.
All such sets X+, taken together, form the closure of F.
Steps:
1. X+ = X //initialize X+ to X
2. Repeat
For each FD : Y -> Z in F Do
If Y ⊆ X+ Then //If Y is contained in X+
X+ = X+ ∪ Z //add Z to X+
End If
End For
Until X+ does not change //a full pass added no new attribute
3. Return X+ //Return closure of X
The set of all those attributes which can be functionally determined from an attribute
set is called the closure of that attribute set. The closure of attribute set {X} is denoted
as {X}+.
Step-01:
Add the attributes contained in the attribute set for which closure is being calculated
to the result set.
Step-02:
Recursively add the attributes to the result set which can be functionally determined
from the attributes already contained in the result set.
Example-
The steps below use the functional dependencies A → BC, BC → DE, D → F and CF → G.
Now, let us find the closure of some attributes and attribute sets-
Closure of attribute A-
A+ = { A }
={A,B,C} ( Using A → BC )
={A,B,C,D,E} ( Using BC → DE )
={A,B,C,D,E,F} ( Using D → F )
={A,B,C,D,E,F,G} ( Using CF → G )
Thus,
A+ = { A , B , C , D , E , F , G }
Closure of attribute D-
D+ = { D }
= { D , F } ( Using D → F )
We can not determine any other attribute using attributes D and F contained in the result
set.
Thus,
D+ = { D , F }
Closure of attribute set { B , C }-
{ B , C }+ = { B , C }
={B,C,D,E} ( Using BC → DE )
={B,C,D,E,F} ( Using D → F )
={B,C,D,E,F,G} ( Using CF → G )
Thus,
{ B , C }+ = { B , C , D , E , F , G }
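The two steps and the worked closures above can be reproduced with a short sketch. It assumes single-letter attribute names written as strings and uses only the four dependencies cited in the derivations (A → BC, BC → DE, D → F, CF → G); the function name closure is chosen for illustration.

```python
def closure(attrs, fds):
    """Closure of an attribute set under a list of (lhs, rhs) FDs."""
    result = set(attrs)              # Step-01: start with the given attributes
    changed = True
    while changed:                   # Step-02: repeat until nothing new is added
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


fds = [("A", "BC"), ("BC", "DE"), ("D", "F"), ("CF", "G")]

print(sorted(closure("A", fds)))   # ['A', 'B', 'C', 'D', 'E', 'F', 'G']
print(sorted(closure("D", fds)))   # ['D', 'F']
print(sorted(closure("BC", fds)))  # ['B', 'C', 'D', 'E', 'F', 'G']
```

The outer loop matters: an FD that could not fire on the first pass may become applicable after a later FD has added attributes.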
Super Key-
If the closure result of an attribute set contains all the attributes of the relation,
then that attribute set is called as a super key of that relation.
Thus, we can say-
“The closure of a super key is the entire relation schema.”
Example-
Candidate Key-
If an attribute set is a super key and no proper subset of it is a super key (that is, no
proper subset has a closure containing all the attributes of the relation), then that
attribute set is called a candidate key of that relation.
Example-
Problem-
Option-(A):
{ CF }+ = { C , F }
={C,F,G} ( Using C → G )
={C,E,F,G} ( Using F → E )
={A,C,E,F,G} ( Using G → A )
={A,C,D,E,F,G} ( Using AF → D )
Since, our obtained result set is same as the given result set, so, it means it is correctly
given.
Option-(B):
{ BG }+ = { B , G }
={A,B,G} ( Using G → A )
={A,B,C,D,G} ( Using AB → CD )
Since, our obtained result set is same as the given result set, so, it means it is correctly
given.
Option-(C):
{ AF }+ = { A , F }
={A,D,F} ( Using AF → D )
={A,D,E,F} ( Using F → E )
Since our obtained result set is different from the given result set, it means it is not
correctly given.
Option-(D):
{ AB }+ = { A , B }
={A,B,C,D} ( Using AB → CD )
={A,B,C,D,G} ( Using C → G )
Since our obtained result set is different from the given result set, it means it is not
correctly given.
Thus,
Option (C) and Option (D) are correct.
Example:
Question:
Consider a relation R with the schema R(A, B, C, D, E, F) with a set of functional
dependencies F as follows;
{AB → C, BC → AD, D → E, CF → B}
Find the super key for this relation.
Solution:
Finding (AB)+
First, let us find (AB)+ , the closure of attribute set AB (We do not need to test all the
attributes individually. Instead we can try with those attributes that are on the left
hand side of any FD.)
result = AB
As AB determines C, C can be included with the result. Hence, result = AB U C = ABC.
According to second FD BC → AD, the attributes B and C together can identify both A
and D. Hence, result = ABC U AD = ABCD
If you know D, you will know E according to third FD D → E. so, result = ABCD U E =
ABCDE.
F cannot be identified by any of these FDs. And our result can include ABCDE attributes
only.
Hence, the solution is AB is not the key for R. The reason is, the closure of AB, i.e.,
(AB)+ does not include all the attributes of R in the result.
Then what would be the key for R? As noted earlier, we can try all the left-hand
side attributes (because they are the determinants), or some combination of them. From
the above example, we get the idea to include F as one of the key attributes, since F does
not appear on the right-hand side of any FD. So, let us try to find (ABF)+ , the closure of
attribute set ABF.
result = ABF
From the above example, we know that (AB)+ = ABCDE.
F is already in the result, so result = ABCDE U F = ABCDEF, which includes all the
attributes of R. (The FD CF → B adds nothing new, because B is already in the result.)
Hence, the solution is that ABF is one of the keys for R, because (ABF)+ includes all the
attributes of R.
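The search for keys in this example can also be done by brute force. The sketch below, under the assumption that attributes are single letters, tries attribute sets from smallest to largest and keeps those whose closure is the whole schema and that contain no smaller key. The names closure and candidate_keys are illustrative, and the approach is only practical for small schemas.

```python
from itertools import combinations


def closure(attrs, fds):
    """Closure of an attribute set under a list of (lhs, rhs) FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def candidate_keys(attributes, fds):
    """All minimal attribute sets whose closure is the whole schema."""
    attributes = set(attributes)
    keys = []
    for size in range(1, len(attributes) + 1):
        for combo in combinations(sorted(attributes), size):
            candidate = set(combo)
            if any(key <= candidate for key in keys):
                continue  # contains a smaller key, so it is not minimal
            if closure(candidate, fds) == attributes:
                keys.append(candidate)
    return keys


fds = [("AB", "C"), ("BC", "AD"), ("D", "E"), ("CF", "B")]
for key in candidate_keys("ABCDEF", fds):
    print("".join(sorted(key)))
```

For this relation the search reports CF as well as ABF: starting from CF, the FD CF → B gives B, BC → AD gives A and D, and D → E gives E.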
Equivalence of Two Sets of Functional Dependencies
Two different sets of functional dependencies for a given relation may or may not be
equivalent.
If F and G are the two sets of functional dependencies, then following 3 cases are
possible-
Case-01: F covers G (F ⊇ G)
Case-02: G covers F (G ⊇ F)
Case-03: Both F and G cover each other (F = G)
Step-01:
Step-02:
Step-03:
Step-01:
Step-02:
Step-03:
Set F-
A→C
AC → D
E → AD
E→H
Set G-
A → CD
E → AH
Solution-
Step-01:
Step-03:
Step-01:
Step-02:
Step-03:
How to find whether two sets of functional dependencies are equal or not? -
Equivalent sets of functional dependencies
In other words,
Two sets of functional dependencies F and G are said to be equal if F+ = G+.
Alternative definition;
Two sets of functional dependencies F and G are said to be equal if F covers G and G
covers F.
Example:
Let R (A, B, C, D, E) be a relation with set of functional dependencies F = { A → BC,
A → D, CD → E } and G = { A → BCE, A → ABD, CD → E }. Is F = G?
Does F cover G?
If set of FDs of G can be inferred from F, then we would say that F covers G.
The FD A → BCE of G can be inferred from the FDs A → BC, A → D, and CD → E of F.
[here, A gives BCD. If you know C and D then E can be derived]
The FD A → ABD of G can be inferred from the FDs A → BC, and A → D of F.
The FD CD → E of G can be inferred from the FD CD → E of F.
All the three FDs of G can be inferred from FDs of F. Hence, F covers G.
Does G cover F?
If set of FDs of F can be inferred from G, then we would say that G covers F.
The FD A → BC of F can be inferred from the FD A → BCE of G.
The FD A → D of F can be inferred from the FD A → ABD of G.
The FD CD → E of F can be inferred from the FD CD → E of G.
All the three FDs of F can be inferred from FDs of G. Hence, G covers F.
Since F covers G and G covers F, the two sets are equivalent (F = G).
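The cover test used in this example can be sketched in a few lines: F covers G when, for every FD X → Y in G, Y is contained in the closure of X computed under F. The function names and the single-letter string encoding are assumptions for illustration.

```python
def closure(attrs, fds):
    """Closure of an attribute set under a list of (lhs, rhs) FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def covers(f, g):
    """True if every FD in g can be inferred from f."""
    return all(set(rhs) <= closure(lhs, f) for lhs, rhs in g)


def equivalent(f, g):
    return covers(f, g) and covers(g, f)


F = [("A", "BC"), ("A", "D"), ("CD", "E")]
G = [("A", "BCE"), ("A", "ABD"), ("CD", "E")]

print(covers(F, G), covers(G, F), equivalent(F, G))  # True True True
```

The same functions applied to the earlier pair, F = { A → C, AC → D, E → AD, E → H } and G = { A → CD, E → AH }, also report that the two sets are equivalent.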
In DBMS,
A canonical cover is a simplified and reduced version of the given set of functional
dependencies.
Since it is a reduced version, it is also called an irreducible set.
Characteristics-
Working with the set containing extraneous functional dependencies increases the
computation time.
Therefore, the given set is reduced by eliminating the useless functional dependencies.
This reduces the computation time and working with the irreducible set becomes easier.
Step-01:
Write the given set of functional dependencies in such a way that each functional
dependency contains exactly one attribute on its right side.
Example-
Step-02:
Consider each functional dependency one by one from the set obtained in Step-01.
Determine whether it is essential or non-essential.
NOTE-
Step-03:
Consider the newly obtained set of functional dependencies after performing Step-02.
Check if there is any functional dependency that contains more than one attribute on its
left side.
Case-01: No-
There exists no functional dependency containing more than one attribute on its left side.
In this case, the set obtained in Step-02 is the canonical cover.
Case-02: Yes-
There exists at least one functional dependency containing more than one attribute on its
left side.
In this case, consider all such functional dependencies one by one.
Check if their left side can be reduced.
Problem-
The following functional dependencies hold true for the relational scheme R ( W , X , Y , Z )
–
X→W
WZ → XY
Y → WXZ
Write the irreducible equivalent for this set of functional dependencies.
Solution-
Step-01:
Write all the functional dependencies such that each contains exactly one attribute on its
right side-
X→W
WZ → X
WZ → Y
Y→W
Y→X
Y→Z
Step-02:
For X → W:
Considering X → W, (X)+ = { X , W }
Ignoring X → W, (X)+ = { X }
Now,
Clearly, the two results are different.
Thus, we conclude that X → W is essential and can not be eliminated.
For WZ → X:
Considering WZ → X, (WZ)+ = { W , X , Y , Z }
Ignoring WZ → X, (WZ)+ = { W , X , Y , Z }
Now,
Clearly, the two results are same.
Thus, we conclude that WZ → X is non-essential and can be eliminated.
For WZ → Y:
Considering WZ → Y, (WZ)+ = { W , X , Y , Z }
Ignoring WZ → Y, (WZ)+ = { W , Z }
Now,
Clearly, the two results are different.
Thus, we conclude that WZ → Y is essential and can not be eliminated.
For Y → W:
Considering Y → W, (Y)+ = { W , X , Y , Z }
Ignoring Y → W, (Y)+ = { W , X , Y , Z }
Now,
Clearly, the two results are same.
Thus, we conclude that Y → W is non-essential and can be eliminated.
For Y → X:
Considering Y → X, (Y)+ = { W , X , Y , Z }
Ignoring Y → X, (Y)+ = { Y , Z }
Now,
Clearly, the two results are different.
Thus, we conclude that Y → X is essential and can not be eliminated.
For Y → Z:
Considering Y → Z, (Y)+ = { W , X , Y , Z }
Ignoring Y → Z, (Y)+ = { W , X , Y }
Now,
Clearly, the two results are different.
Thus, we conclude that Y → Z is essential and can not be eliminated.
Step-03:
Consider the functional dependencies having more than one attribute on their left side.
Check if their left side can be reduced.
In our set,
Only WZ → Y contains more than one attribute on its left side.
Considering WZ → Y, (WZ)+ = { W , X , Y , Z }
Now,
Consider all the possible subsets of WZ.
Check if the closure result of any subset matches to the closure result of WZ.
(W)+ = { W }
(Z)+ = { Z }
Clearly,
None of the subsets has the same closure result as the entire left side.
Thus, we conclude that we can not write WZ → Y as W → Y or Z → Y.
Thus, the set of functional dependencies obtained in Step-02 is the canonical cover, that is,
{ X → W , WZ → Y , Y → X , Y → Z }.
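The three steps of this solution can be sketched as follows. This is a minimal sketch that follows the order used in these notes (split right sides, drop non-essential FDs, then try to shrink left sides) and assumes single-letter attribute names; the function names are illustrative. A more careful version might repeat the passes until nothing changes.

```python
def closure(attrs, fds):
    """Closure of an attribute set under (frozenset lhs, frozenset rhs) FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def canonical_cover(fds):
    # Step-01: exactly one attribute on each right side.
    work = [(frozenset(lhs), frozenset([a])) for lhs, rhs in fds for a in rhs]
    # Step-02: drop every FD that the remaining FDs already imply.
    for fd in list(work):
        rest = [other for other in work if other != fd]
        if fd[1] <= closure(fd[0], rest):
            work = rest
    # Step-03: try to remove attributes from left sides with more than one attribute.
    cover = []
    for lhs, rhs in work:
        for attr in sorted(lhs):
            if len(lhs) > 1 and rhs <= closure(lhs - {attr}, work):
                lhs = lhs - {attr}
        cover.append((lhs, rhs))
    return cover


fds = [("X", "W"), ("WZ", "XY"), ("Y", "WXZ")]
for lhs, rhs in canonical_cover(fds):
    print("".join(sorted(lhs)), "->", "".join(sorted(rhs)))
```

For the problem above this prints X -> W, WZ -> Y, Y -> X and Y -> Z, one per line.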
Types of Decomposition-
Lossless Decomposition
Decomposition is lossless if it is feasible to reconstruct relation R from the decomposed
tables using joins. This is the preferred choice. No information is lost from the relation
when it is decomposed. The join results in the same original relation.
Consider there is a relation R which is decomposed into sub relations R1 , R2 , …. , Rn.
This decomposition is called lossless join decomposition when the join of the sub
relations results in the same relation R that was decomposed.
For lossless join decomposition, we always have-
R1 ⋈ R2 ⋈ R3 ……. ⋈ Rn = R
Example-
Consider the following relation R( A , B , C )-
A B C
1 2 1
2 5 3
3 3 3
R( A , B , C )
Consider this relation is decomposed into two sub relations R1( A , B ) and R2( B , C )-
A B
1 2
2 5
3 3
R1( A , B )
B C
2 1
5 3
3 3
R2( B , C )
Now, if we perform the natural join ( ⋈ ) of the sub relations R1 and R2 , we get-
A B C
1 2 1
2 5 3
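3 3 3
This relation is the same as the original relation R, so the above decomposition is a
lossless join decomposition.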
NOTE-
Lossless join decomposition is also known as non-additive join decomposition.
This is because the resultant relation after joining the sub relations is same as the
decomposed relation.
No extraneous tuples appear after joining of the sub-relations.
Example −
<EmpInfo>
Emp_ID Emp_Name Emp_Age Emp_Location Dept_ID Dept_Name
<DeptDetails>
Dept_ID Emp_ID Dept_Name
Dpt2 E002 HR
Therefore, the above relation had lossless decomposition i.e. no loss of information.
If we decompose a relation R into relations R1 and R2,
Decomposition is lossy if R1 ⋈ R2 ⊃ R
Decomposition is lossless if R1 ⋈ R2 = R
To check for lossless join decomposition using FD set, following conditions must
hold:
Lossy Decomposition
As the name suggests, when a relation is decomposed into two or more relational
schemas, the loss of information is unavoidable when the original relation is retrieved.
Consider there is a relation R which is decomposed into sub relations R1 , R2 , …. , Rn.
This decomposition is called lossy join decomposition when the join of the sub relations
does not result in the same relation R that was decomposed.
The natural join of the sub relations is always found to have some extraneous tuples.
For lossy join decomposition, we always have-
R1 ⋈ R2 ⋈ R3 ……. ⋈ Rn ⊃ R
Example-
A B C
1 2 1
2 5 3
3 3 3
R( A , B , C )
Consider this relation is decomposed into two sub relations as R1( A , C ) and R2( B , C )-
A C
1 1
2 3
3 3
R1( A , C )
B C
2 1
5 3
3 3
R2( B , C )
Now, if we perform the natural join ( ⋈ ) of the sub relations R1 and R2 we get-
A B C
1 2 1
2 5 3
2 3 3
3 5 3
3 3 3
This relation is not same as the original relation R and contains some extraneous tuples.
Clearly, R1 ⋈ R2 ⊃ R.
Thus, we conclude that the above decomposition is lossy join decomposition.
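Both decompositions of R( A , B , C ) discussed in these notes can be replayed with a small natural-join sketch. Relations are modelled as lists of dicts, and the helper names natural_join, project and as_set are assumptions for illustration.

```python
def natural_join(r1, r2):
    """Natural join of two relations given as lists of dicts."""
    joined = []
    for t1 in r1:
        for t2 in r2:
            common = t1.keys() & t2.keys()
            if all(t1[a] == t2[a] for a in common):
                row = {**t1, **t2}
                if row not in joined:   # relations hold no duplicate tuples
                    joined.append(row)
    return joined


def project(rel, attrs):
    """Projection of a relation onto attrs, with duplicates removed."""
    out = []
    for t in rel:
        row = {a: t[a] for a in attrs}
        if row not in out:
            out.append(row)
    return out


def as_set(rel):
    """Order-independent form of a relation, for comparison."""
    return {tuple(sorted(t.items())) for t in rel}


R = [{"A": 1, "B": 2, "C": 1}, {"A": 2, "B": 5, "C": 3}, {"A": 3, "B": 3, "C": 3}]

lossless = natural_join(project(R, "AB"), project(R, "BC"))
lossy = natural_join(project(R, "AC"), project(R, "BC"))

print(as_set(lossless) == as_set(R))  # True: R1(A, B) with R2(B, C) gives back R
print(len(lossy))                     # 5: R1(A, C) with R2(B, C) adds extra tuples
print(as_set(lossy) > as_set(R))      # True: the join is a proper superset of R
```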
NOTE-
Example −
<EmpInfo>
Emp_ID Emp_Name Emp_Age Emp_Location Dept_ID Dept_Name
<DeptDetails>
Dept_ID Dept_Name
Dpt1 Operations
Dpt2 HR
Dpt3 Finance
Now, you won’t be able to join the above tables, since Emp_ID isn’t part of
the DeptDetails relation.
Therefore, the above relation has lossy decomposition.
Answer: For lossless join decomposition, these three conditions must hold true:
1. Att(R1) U Att(R2) = ABCD = Att(R)
2. Att(R1) ∩ Att(R2) = Φ, which violates the condition of lossless join decomposition.
Hence the decomposition is not lossless.
For dependency preserving decomposition,
A->B can be ensured in R1(AB) and C->D can be ensured in R2(CD). Hence it is
dependency preserving decomposition.
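The two checks in this answer can be sketched for a decomposition into two relations. The lossless test below uses the usual three conditions (the attributes of R1 and R2 together give R, they share at least one attribute, and the shared attributes determine all of R1 or all of R2). The dependency-preservation test grows the left side of each FD using only closures restricted to one sub relation at a time. Function names and the single-letter encoding are assumptions for illustration.

```python
def closure(attrs, fds):
    """Closure of an attribute set under a list of (lhs, rhs) FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def is_lossless_binary(r, r1, r2, fds):
    r, r1, r2 = set(r), set(r1), set(r2)
    if r1 | r2 != r:
        return False                 # condition 1 fails
    common = r1 & r2
    if not common:
        return False                 # condition 2 fails
    determined = closure(common, fds)
    return r1 <= determined or r2 <= determined   # condition 3


def preserves_dependencies(parts, fds):
    for lhs, rhs in fds:
        result = set(lhs)
        changed = True
        while changed:
            changed = False
            for part in parts:
                part = set(part)
                gained = closure(result & part, fds) & part
                if not gained <= result:
                    result |= gained
                    changed = True
        if not set(rhs) <= result:
            return False
    return True


fds = [("A", "B"), ("C", "D")]
print(is_lossless_binary("ABCD", "AB", "CD", fds))   # False
print(preserves_dependencies(["AB", "CD"], fds))     # True
```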
Normal Form    Description
2NF            A relation will be in 2NF if it is in 1NF and all non-key attributes are fully
               functionally dependent on the primary key.
4NF            A relation will be in 4NF if it is in Boyce-Codd normal form and has no
               multi-valued dependency.
5NF            A relation is in 5NF if it is in 4NF, does not contain any join dependency,
               and joining is lossless.
Consider the following database schema:
The relation schemas in Figures 15.3(a) and 15.3(b) also have clear semantics. A tuple in the
EMP_DEPT relation schema in Figure 15.3(a) represents a single employee but includes additional
information—namely, the name (Dname) of the department for which the employee works and the
Social Security number (Dmgr_ssn) of the department manager. For the EMP_PROJ relation in
Figure 15.3(b), each tuple relates an employee to a project but also includes the employee name
(Ename), project name (Pname), and project location (Plocation).
One goal of schema design is to minimize the storage space used by the base relations (and hence
the corresponding files). Grouping attributes into relation schemas has a significant effect on
storage space. For example, compare the space used by the two base relations EMPLOYEE and
DEPARTMENT in Figure 15.2 with that for an EMP_DEPT base relation in Figure 15.4, which
is the result of applying the NATURAL JOIN operation to EMPLOYEE and DEPARTMENT. In
EMP_DEPT, the attribute values pertaining to a particular department (Dnumber, Dname,
Dmgr_ssn) are repeated for every employee who works for that department. In contrast, each
department’s information appears only once in the DEPARTMENT relation in Figure 15.2. Only
the department number (Dnumber) is repeated in the EMPLOYEE relation for each employee who
works in that department as a foreign key. Similar comments apply to the EMP_PROJ relation
(see Figure 15.4), which augments the WORKS_ON relation with additional attributes from
EMPLOYEE and PROJECT.
Storing natural joins of base relations leads to an additional problem referred to as update
anomalies. These can be classified into insertion anomalies, deletion anomalies, and modification
anomalies.
Insertion Anomalies. Insertion anomalies can be differentiated into two types, illustrated by the
following examples based on the EMP_DEPT relation:
To insert a new employee tuple into EMP_DEPT, we must include either the attribute values for
the department that the employee works for, or NULLs (if the employee does not work for a
department as yet). For example, to insert a new tuple for an employee who works in department
number 5, we must enter all the attribute values of department 5 correctly so that they are consistent
with the corresponding values for department 5 in other tuples in EMP_DEPT. In the design of
Figure 15.2, we do not have to worry about this consistency problem because we enter only the
department number in the employee tuple; all other attribute values of department 5 are recorded
only once in the database, as a single tuple in the DEPARTMENT relation.
It is difficult to insert a new department that has no employees as yet in the EMP_DEPT relation.
The only way to do this is to place NULL values in the attributes for employee. This violates the
entity integrity for EMP_DEPT because Ssn is its primary key. Moreover, when the first employee
is assigned to that department, we do not need this tuple with NULL values any more. This problem
does not occur in the design of Figure 15.2 because a department is entered in the DEPARTMENT
relation whether or not any employees work for it, and whenever an employee is assigned to that
department, a corresponding tuple is inserted in EMPLOYEE.
Deletion Anomalies. The problem of deletion anomalies is related to the second insertion
anomaly situation just discussed. If we delete from EMP_DEPT an employee tuple that happens
to represent the last employee working for a particular department, the information concerning that
department is lost from the database. This problem does not occur in the database of Figure 15.2
because DEPARTMENT tuples are stored separately.
Normalization of data can be considered a process of analyzing the given relation schemas based
on their FDs and primary keys to achieve the desirable properties of (1) minimizing redundancy
and (2) minimizing the insertion, deletion, and update anomalies.
Definition. The normal form of a relation refers to the highest normal form condition that it
meets, and hence indicates the degree to which it has been normalized.
Normal forms, when considered in isolation from other factors, do not guarantee a good database
design. It is generally not sufficient to check separately that each relation schema in the database
is, say, in BCNF or 3NF. Rather, the process of normalization through decomposition must also
confirm the existence of additional properties that the relational schemas, taken together, should
possess. These would include two properties:
■ The nonadditive join or lossless join property, which guarantees that the spurious tuple
generation problem does not occur with respect to the relation schemas created after
decomposition.
■ The dependency preservation property, which ensures that each functional dependency is
represented in some individual relation resulting after decomposition.
Definition. A superkey of a relation schema R = {A1, A2, ... , An} is a set of attributes S ⊆ R
with the property that no two tuples t1 and t2 in any legal relation state r of R will have t1[S] =
t2[S]. A key K is a superkey with the additional property that removal of any attribute from K
will cause K not to be a superkey anymore.
The difference between a key and a superkey is that a key has to be minimal; that is, if we have a
key K = {A1, A2, ..., Ak} of R, then K – {Ai} is not a key of R for any Ai, 1 ≤ i ≤ k. In Figure 15.1,
{Ssn} is a key for EMPLOYEE, whereas {Ssn}, {Ssn, Ename}, {Ssn, Ename, Bdate}, and any set
of attributes that includes Ssn are all superkeys. If a relation schema has more than one key, each
is called a candidate key. One of the candidate keys is arbitrarily designated to be the primary
key, and the others are called secondary keys. In a practical relational database, each relation
schema must have a primary key. If no candidate key is known for a relation, the entire relation
can be treated as a default superkey. In Figure 15.1, {Ssn} is the only candidate key for
EMPLOYEE, so it is also the primary key.
In Figure 15.1, both Ssn and Pnumber are prime attributes of WORKS_ON, whereas other
attributes of WORKS_ON are nonprime.
First normal form (1NF) is now considered to be part of the formal definition of a relation in the
basic (flat) relational model; historically, it was defined to disallow multivalued attributes,
composite attributes, and their combinations. It states that the domain of an attribute must include
only atomic (simple, indivisible) values and that the value of any attribute in a tuple must be a
single value from the domain of that attribute. Hence, 1NF disallows having a set of values, a tuple
of values, or a combination of both as an attribute value for a single tuple. In other words, 1NF
disallows relations within relations or relations as attribute values within tuples. The only attribute
values permitted by 1NF are single atomic (or indivisible) values. Consider the DEPARTMENT
relation schema shown in Figure 15.1, whose primary key is Dnumber, and suppose that we extend
it by including the Dlocations attribute as shown in Figure 15.9(a). We assume that each
department can have a number of locations. The DEPARTMENT schema and a sample relation
state are shown in Figure 15.9. As we can see, this is not in 1NF because Dlocations is not an
atomic attribute, as illustrated by the first tuple in Figure 15.9(b). There are two ways we
can look at the Dlocations attribute:
■ The domain of Dlocations contains atomic values, but some tuples can have a set of these
values. In this case, Dlocations is not functionally dependent on the primary key
Dnumber.
■ The domain of Dlocations contains sets of values and hence is nonatomic. In this case,
Dnumber→Dlocations because each set is considered a single member of the attribute
domain.
In either case, the DEPARTMENT relation in Figure 15.9 is not in 1NF; in fact, it does not even
qualify as a relation according to our definition of relation in Section 3.1. There are three main
techniques to achieve first normal form for such a relation:
1. Remove the attribute Dlocations that violates 1NF and place it in a separate relation
DEPT_LOCATIONS along with the primary key Dnumber of DEPARTMENT. The
primary key of this relation is the combination {Dnumber, Dlocation}, as shown in Figure
15.2. A distinct tuple in DEPT_LOCATIONS exists for each location of a department.
This decomposes the non-1NF relation into two 1NF relations.
2. Expand the key so that there will be a separate tuple in the original DEPARTMENT relation for
each location of a DEPARTMENT, as shown in Figure 15.9(c). In this case, the primary key
becomes the combination {Dnumber, Dlocation}. This solution has the disadvantage of
introducing redundancy in the relation.
3. If a maximum number of values is known for the attribute—for example, if it is known that at
most three locations can exist for a department—replace the Dlocations attribute by three atomic
attributes: Dlocation1, Dlocation2, and Dlocation3. This solution has the disadvantage of
introducing NULL values if most departments have fewer than three locations. It further introduces
spurious semantics about the ordering among the location values that is not originally intended.
Querying on this attribute becomes more difficult; for example, consider how you would write the
query: List the departments that have ‘Bellaire’ as one of their locations in this design.
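The first technique can be sketched in a few lines. The rows below are hypothetical placeholders, not the data of Figure 15.9, and the function name to_1nf is an assumption. The set-valued Dlocations attribute is moved into a separate DEPT_LOCATIONS relation keyed by {Dnumber, Dlocation}.

```python
# Hypothetical rows: department numbers, names and locations are illustrative only.
department = [
    {"Dnumber": 1, "Dname": "Dept-1", "Dlocations": {"Loc-A", "Loc-B"}},
    {"Dnumber": 2, "Dname": "Dept-2", "Dlocations": {"Loc-A"}},
]


def to_1nf(rows):
    """Split the set-valued Dlocations attribute into its own relation."""
    # DEPARTMENT keeps only the atomic attributes.
    dept = [{k: v for k, v in row.items() if k != "Dlocations"} for row in rows]
    # DEPT_LOCATIONS gets one tuple per (department, location) pair.
    dept_locations = [
        {"Dnumber": row["Dnumber"], "Dlocation": loc}
        for row in rows
        for loc in sorted(row["Dlocations"])
    ]
    return dept, dept_locations


dept, dept_locations = to_1nf(department)
print(dept)
print(dept_locations)
```

With this layout, finding the departments at a given location is a simple filter on DEPT_LOCATIONS, which is harder to express against the Dlocation1, Dlocation2, Dlocation3 design of the third technique.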
Second Normal Form
Second normal form (2NF) is based on the concept of full functional dependency. A functional
dependency X → Y is a full functional dependency if removal of any attribute A from X means
that the dependency does not hold any more; that is, for any attribute A ∈ X, (X – {A}) does not
functionally determine Y. A functional dependency X→Y is a partial dependency if some attribute
A ∈ X can be removed from X and the dependency still holds;
that is, for some A ∈ X, (X – {A}) → Y.
In Figure 15.3(b), {Ssn, Pnumber} → Hours is a full dependency (neither Ssn → Hours nor
Pnumber→Hours holds). However, the dependency {Ssn, Pnumber}→Ename is partial because
Ssn→Ename holds.
Therefore, the functional dependencies FD1, FD2, and FD3 in Figure 15.3(b) lead to the
decomposition of EMP_PROJ into the three relation schemas EP1, EP2, and EP3 shown in Figure
15.11(a), each of which is in 2NF.
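The partial dependencies in EMP_PROJ can be found mechanically by taking the closure of every proper subset of the key. In the sketch below, the FD Pnumber → {Pname, Plocation} is an assumption about FD3 (the figure is not reproduced in these notes), {Ssn, Pnumber} is assumed to be the only candidate key, and the function names are illustrative.

```python
from itertools import combinations


def closure(attrs, fds):
    """Closure of an attribute set under (set lhs, set rhs) FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def partial_dependencies(key, attributes, fds):
    """Nonprime attributes determined by a proper subset of the key."""
    key = set(key)
    nonprime = set(attributes) - key   # assumes `key` is the only candidate key
    found = {}
    for size in range(1, len(key)):
        for part in combinations(sorted(key), size):
            dependent = closure(part, fds) & nonprime
            if dependent:
                found[part] = dependent
    return found


attributes = {"Ssn", "Pnumber", "Hours", "Ename", "Pname", "Plocation"}
fds = [
    ({"Ssn", "Pnumber"}, {"Hours"}),       # full dependency
    ({"Ssn"}, {"Ename"}),                  # partial dependency
    ({"Pnumber"}, {"Pname", "Plocation"}), # assumed form of FD3
]

for part, dependent in partial_dependencies({"Ssn", "Pnumber"}, attributes, fds).items():
    print(part, "->", sorted(dependent))
```

Each reported left side, together with the attributes it determines, corresponds to one relation split off during 2NF normalization; Hours stays with the full key.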
Third Normal Form
Third normal form (3NF) is based on the concept of transitive dependency. A functional
dependency X→Y in a relation schema R is a transitive dependency if there exists a set of
attributes Z in R that is neither a candidate key nor a subset of any key of R and both X→Z and
Z→Y hold. The dependency Ssn→Dmgr_ssn is transitive through Dnumber in EMP_DEPT in
Figure 15.3(a), because both the dependencies Ssn → Dnumber and Dnumber → Dmgr_ssn hold
and Dnumber is neither a key itself nor a subset of the key of EMP_DEPT.
Intuitively, we can see that the dependency of Dmgr_ssn on Dnumber is undesirable in
EMP_DEPT since Dnumber is not a key of EMP_DEPT.
Intuitively, we can see that any functional dependency in which the left-hand side is part (a proper
subset) of the primary key, or any functional dependency in which the left-hand side is a nonkey
attribute, is a problematic FD. 2NF and 3NF normalization remove these problem FDs by
decomposing the original relation into new relations. In terms of the normalization process, it is
not necessary to remove the partial dependencies before the transitive dependencies, but
historically, 3NF has been defined with the assumption that a relation is tested for 2NF first before
it is tested for 3NF.
Table 15.1 informally summarizes the three normal forms based on primary keys, the tests used in
each case, and the corresponding remedy or normalization performed to achieve the normal form.
Table 15.1
In words, the dependency FD3 says that the tax rate is fixed for a given county (does not vary lot
by lot within the same county), while FD4 says that the price of a lot is determined by its area
regardless of which county it is in. (Assume that this is the price of the lot for tax purposes.) The
LOTS relation schema violates the general definition of 2NF because Tax_rate is partially
dependent on the candidate key {County_name, Lot#}, due to FD3. To normalize LOTS into 2NF,
we decompose it into the two relations LOTS1 and LOTS2, shown in Figure 15.12(b). We
construct LOTS1 by removing the attribute Tax_rate that violates 2NF from LOTS and placing it
with County_name (the left-hand side of FD3 that causes the partial dependency) into another
relation LOTS2. Both LOTS1 and LOTS2 are in 2NF. Notice that FD4 does not violate 2NF and
is carried over to LOTS1.
General Definition of Third Normal Form
Definition. A relation schema R is in third normal form (3NF) if, whenever a nontrivial
functional dependency X→A holds in R, either (a) X is a superkey of R, or (b) A is a prime attribute
of R. According to this definition, LOTS2 (Figure 15.12(b)) is in 3NF. However, FD4 in LOTS1
violates 3NF because Area is not a superkey and Price is not a prime attribute in LOTS1. To
normalize LOTS1 into 3NF, we decompose it into the relation schemas LOTS1A and LOTS1B
shown in Figure 15.12(c). We construct LOTS1A by removing the attribute Price that violates 3NF
from LOTS1 and placing it with Area (the left hand side of FD4 that causes the transitive
dependency) into another relation LOTS1B. Both LOTS1A and LOTS1B are in 3NF.
Two points are worth noting about this example and the general definition of 3NF:
■ LOTS1 violates 3NF because Price is transitively dependent on each of the candidate keys of
LOTS1 via the nonprime attribute Area.
■ This general definition can be applied directly to test whether a relation schema is in 3NF; it
does not have to go through 2NF first. If we apply the above 3NF definition to LOTS with the
dependencies FD1 through FD4, we find that both FD3 and FD4 violate 3NF. Therefore, we could
decompose LOTS into LOTS1A, LOTS1B, and LOTS2 directly. Hence, the transitive and partial
dependencies that violate 3NF can be removed in any order.
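The general 3NF definition can be applied directly in code. The sketch below assumes that LOTS has the attributes Property_id#, County_name, Lot#, Area, Price and Tax_rate, and that FD1 and FD2 have Property_id# and {County_name, Lot#} determining all other attributes; these forms are not spelled out in these notes and are assumptions, as are the function names.

```python
from itertools import combinations


def closure(attrs, fds):
    """Closure of an attribute set under (set lhs, set rhs) FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def candidate_keys(attributes, fds):
    keys = []
    for size in range(1, len(attributes) + 1):
        for combo in combinations(sorted(attributes), size):
            cand = set(combo)
            if not any(k <= cand for k in keys) and closure(cand, fds) == set(attributes):
                keys.append(cand)
    return keys


def violations_3nf(attributes, fds):
    prime = set().union(*candidate_keys(attributes, fds))
    bad = []
    for lhs, rhs in fds:
        if closure(lhs, fds) == set(attributes):
            continue                      # (a) the left side is a superkey
        for attr in sorted(rhs - lhs):    # nontrivial part of the FD
            if attr not in prime:         # (b) fails as well
                bad.append((sorted(lhs), attr))
    return bad


LOTS = {"Property_id#", "County_name", "Lot#", "Area", "Price", "Tax_rate"}
fds = [
    ({"Property_id#"}, LOTS - {"Property_id#"}),                 # FD1 (assumed)
    ({"County_name", "Lot#"}, LOTS - {"County_name", "Lot#"}),   # FD2 (assumed)
    ({"County_name"}, {"Tax_rate"}),                             # FD3
    ({"Area"}, {"Price"}),                                       # FD4
]

print(violations_3nf(LOTS, fds))
# [(['County_name'], 'Tax_rate'), (['Area'], 'Price')]
```

Under these assumptions the check reports FD3 and FD4 together, without testing 2NF first.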
Boyce-Codd Normal Form
Boyce-Codd normal form (BCNF) was proposed as a simpler form of 3NF, but it was found to
be stricter than 3NF. That is, every relation in BCNF is also in 3NF; however, a relation in 3NF is
not necessarily in BCNF. Intuitively, we can see the need for a stronger normal form than 3NF by
going back to the LOTS relation schema in Figure 15.12(a) with its four functional dependencies
FD1 through FD4. Suppose that we have thousands of lots in the relation but the lots are from only
two counties: DeKalb and Fulton. Suppose also that lot sizes in DeKalb County are only 0.5, 0.6,
0.7, 0.8, 0.9, and 1.0 acres, whereas lot sizes in Fulton County are restricted to 1.1, 1.2, ..., 1.9, and
2.0 acres. In such a situation we would have the additional functional dependency
FD5: Area→County_name.
If we add this to the other dependencies, the relation schema LOTS1A still is in 3NF because
County_name is a prime attribute. The area of a lot that determines the county, as specified by
FD5, can be represented by 16 tuples in a separate relation R (Area, County_name), since there
are only 16 possible Area values (see Figure 15.13). This representation reduces the redundancy
of repeating the same information in the thousands of LOTS1A tuples. BCNF is a stronger normal
form that would disallow LOTS1A and suggest the need for decomposing it.
In practice, most relation schemas that are in 3NF are also in BCNF. Only if X→A holds in a
relation schema R with X not being a super key and A being a prime attribute will R be in 3NF but
not in BCNF. The relation schema R shown in Figure 15.13(b) illustrates the general case of such
a relation. Ideally, relational database design should strive to achieve BCNF or 3NF for every
relation schema. Achieving the normalization status of just 1NF or 2NF is not considered adequate,
since they were developed historically as stepping stones to 3NF and BCNF.
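The LOTS1A case above can be checked mechanically. The following Python sketch is illustrative only. It assumes that LOTS1A has the attributes Property_id, County_name, Lot_no, and Area, and that FD1 and FD2 take the form shown in the comments; only FD5 is taken directly from the text. The helper names closure, candidate_keys, and normal_form_violations are hypothetical. The sketch tests only the listed dependencies, not every dependency in the closure.

```python
from itertools import combinations

def closure(attrs, fds):
    """Return the closure of attrs under fds, a list of (lhs, rhs) frozenset pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def candidate_keys(schema, fds):
    """Enumerate minimal attribute sets whose closure is the whole schema."""
    keys = []
    for size in range(1, len(schema) + 1):
        for combo in combinations(sorted(schema), size):
            attrs = frozenset(combo)
            if any(k <= attrs for k in keys):
                continue  # contains a smaller key, so it is not minimal
            if closure(attrs, fds) == schema:
                keys.append(attrs)
    return keys

def normal_form_violations(schema, fds):
    """Return (bcnf_violations, third_nf_violations) as lists of (lhs, attribute)."""
    prime = set().union(*candidate_keys(schema, fds))
    bcnf, third = [], []
    for lhs, rhs in fds:
        if closure(lhs, fds) == schema:
            continue  # lhs is a superkey, so this FD violates neither form
        for attr in sorted(rhs - lhs):
            bcnf.append((lhs, attr))       # BCNF allows no exception
            if attr not in prime:
                third.append((lhs, attr))  # 3NF tolerates prime attributes
    return bcnf, third

# Assumed schema and dependencies for LOTS1A (only FD5 is quoted from the text).
LOTS1A = frozenset({"Property_id", "County_name", "Lot_no", "Area"})
FDS = [
    (frozenset({"Property_id"}), frozenset({"County_name", "Lot_no", "Area"})),  # FD1 (assumed)
    (frozenset({"County_name", "Lot_no"}), frozenset({"Property_id", "Area"})),  # FD2 (assumed)
    (frozenset({"Area"}), frozenset({"County_name"})),                           # FD5
]

bcnf, third = normal_form_violations(LOTS1A, FDS)
print("BCNF violations:", bcnf)
print("3NF violations:", third)
```

Under these assumptions the only BCNF violation reported is Area → County_name, and no 3NF violation is reported, because County_name belongs to the candidate key {County_name, Lot_no} and is therefore prime.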
Multivalued Dependency and Fourth Normal Form
So far we have discussed the concept of functional dependency, which is by far the most important
type of dependency in relational database design theory, and normal forms based on functional
dependencies. However, in many cases relations have constraints that cannot be specified as
functional dependencies. In this section, we discuss the concept of multivalued dependency (MVD)
and define fourth normal form, which is based on this dependency. Multivalued dependencies are
a consequence of first normal form (1NF), which disallows an attribute in a tuple to have a set of
values, and the accompanying process of converting an unnormalized relation into 1NF. If we
have two or more multivalued independent attributes in the same relation schema, we get into a
problem of having to repeat every value of one of the attributes with every value of the other
attribute to keep the relation state consistent and to maintain the independence among the attributes
involved. This constraint is specified by a multivalued dependency.
Whenever X →→ Y holds, we say that X multidetermines Y. Because of the symmetry in the
definition, whenever X →→ Y holds in R, so does X →→ Z, where Z = R − (X ∪ Y) denotes the
remaining attributes of R. Hence, X →→ Y implies X →→ Z,
and therefore it is sometimes written as X →→ Y|Z. An MVD X →→ Y in R is called a trivial
MVD if (a) Y is a subset of X, or (b) X ∪ Y = R. For example, the relation EMP_PROJECTS in
Figure 15.15(b) has the trivial MVD Ename →→ Pname. An MVD that satisfies neither (a) nor
(b) is called a nontrivial MVD. A trivial MVD will hold in any relation state r of R; it is called
trivial because it does not specify any significant or meaningful constraint on R. If we have a
nontrivial MVD in a relation, we may have to repeat values redundantly in the tuples. In the EMP
relation of Figure 15.15(a), the values ‘X’ and ‘Y’ of Pname are repeated with each value of
Dname (or, by symmetry, the values ‘John’ and ‘Anna’ of Dname are repeated with each value of
Pname). This redundancy is clearly undesirable.
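The repetition described above can be tested on a concrete relation state. The following Python sketch is illustrative, not a definitive implementation: the function name mvd_holds is hypothetical, and the sample EMP tuples (including the employee name 'Smith') are assumed for the example. It checks that, for each X-value, the (Y, Z) combinations present form the full cross product of the Y-values and Z-values seen with that X-value, which is what X →→ Y requires of a state.

```python
from itertools import product

def mvd_holds(rows, attrs, X, Y):
    """Check whether X ->> Y holds in one relation state.

    rows  : iterable of tuples, one value per attribute in attrs
    attrs : attribute names in column order
    X, Y  : disjoint lists of attribute names
    """
    Z = [a for a in attrs if a not in X and a not in Y]
    pos = {a: i for i, a in enumerate(attrs)}

    def proj(t, names):
        return tuple(t[pos[a]] for a in names)

    # Group the (Y-value, Z-value) pairs by X-value.
    groups = {}
    for t in set(rows):
        groups.setdefault(proj(t, X), set()).add((proj(t, Y), proj(t, Z)))

    # Every Y-value must appear with every Z-value inside each group.
    for pairs in groups.values():
        y_values = {y for y, _ in pairs}
        z_values = {z for _, z in pairs}
        if pairs != set(product(y_values, z_values)):
            return False
    return True

ATTRS = ["Ename", "Pname", "Dname"]
EMP = [  # assumed sample state
    ("Smith", "X", "John"),
    ("Smith", "Y", "Anna"),
    ("Smith", "X", "Anna"),
    ("Smith", "Y", "John"),
]

print(mvd_holds(EMP, ATTRS, ["Ename"], ["Pname"]))      # full state
print(mvd_holds(EMP[:3], ATTRS, ["Ename"], ["Pname"]))  # one tuple removed
```

The MVD holds in the full four-tuple state and fails once any one tuple is removed, which is why every project must be repeated with every dependent.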
However, the EMP schema is in BCNF because no functional dependencies hold in EMP.
Therefore, we need to define a fourth normal form that is stronger than BCNF and disallows
relation schemas such as EMP. Notice that relations containing nontrivial MVDs tend to be all-key
relations; that is, their key is all their attributes taken together. Furthermore, it is rare that
such all-key relations with a combinatorial occurrence of repeated values would be designed in
practice. However, recognition of MVDs as a potentially problematic dependency is essential in
relational design.
We now present the definition of fourth normal form (4NF), which is violated when a relation
has undesirable multivalued dependencies, and hence can be used to identify and decompose such
relations.
Definition. A relation schema R is in 4NF with respect to a set of dependencies F (that includes
functional dependencies and multivalued dependencies) if, for every nontrivial multivalued
dependency X →→ Y in F+, X is a superkey for R.
We can state the following points:
■ An all-key relation is always in BCNF since it has no FDs.
■ An all-key relation such as the EMP relation in Figure 15.15(a), which has no FDs but has the
MVD Ename →→ Pname | Dname, is not in 4NF.
■ A relation that is not in 4NF due to a nontrivial MVD must be decomposed to convert it into a
set of relations in 4NF.
■ The decomposition removes the redundancy caused by the MVD.
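The definition and the points above can be sketched as a small check. This Python example is illustrative only: the function names are hypothetical, and it inspects only the MVDs that are listed explicitly rather than every MVD in F+. An MVD X →→ Y is reported as a violation when it is nontrivial and the closure of X under the FDs is not the whole schema.

```python
def closure(attrs, fds):
    """Return the closure of attrs under fds, a list of (lhs, rhs) frozenset pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def is_trivial_mvd(schema, X, Y):
    """X ->> Y is trivial if Y is a subset of X or X and Y together cover R."""
    return Y <= X or (X | Y) == schema

def fourth_nf_violations(schema, fds, mvds):
    """Return the listed MVDs that are nontrivial and whose left side is not a superkey."""
    return [
        (X, Y)
        for X, Y in mvds
        if not is_trivial_mvd(schema, X, Y) and closure(X, fds) != schema
    ]

ENAME = frozenset({"Ename"})
PNAME = frozenset({"Pname"})
DNAME = frozenset({"Dname"})

EMP = frozenset({"Ename", "Pname", "Dname"})
EMP_PROJECTS = frozenset({"Ename", "Pname"})

# EMP is all-key: it has no FDs, only the MVD Ename ->> Pname | Dname.
emp_violations = fourth_nf_violations(EMP, [], [(ENAME, PNAME), (ENAME, DNAME)])
proj_violations = fourth_nf_violations(EMP_PROJECTS, [], [(ENAME, PNAME)])
print(len(emp_violations), len(proj_violations))
```

For EMP both MVDs are reported, since Ename alone is not a superkey. For EMP_PROJECTS nothing is reported, because Ename ∪ Pname is the whole schema and the MVD is trivial there.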
The process of normalizing a relation involving the nontrivial MVDs that is not in 4NF consists
of decomposing it so that each MVD is represented by a separate relation where it becomes a trivial
MVD. Consider the EMP relation in Figure 15.15(a). EMP is not in 4NF because, in the nontrivial
MVDs Ename →→ Pname and Ename →→ Dname, Ename is not a superkey of EMP. We
decompose EMP into EMP_PROJECTS and EMP_DEPENDENTS, shown in Figure 15.15(b).
Both EMP_PROJECTS and EMP_DEPENDENTS are in 4NF, because the MVDs Ename →→
Pname in EMP_PROJECTS and Ename →→ Dname in EMP_DEPENDENTS are trivial MVDs.
No other nontrivial MVDs hold in either EMP_PROJECTS or EMP_DEPENDENTS. No FDs
hold in these relation schemas either.
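The decomposition of EMP can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the helper names relation, project, and natural_join are hypothetical, and the sample tuples (including the employee name 'Smith') are assumed for the example. It projects EMP onto the two smaller schemas and confirms that their natural join reproduces EMP exactly, with no spurious tuples.

```python
def relation(attrs, tuples):
    """Build a relation state as a set of rows; each row is a frozenset of (attribute, value)."""
    return {frozenset(zip(attrs, t)) for t in tuples}

def project(rows, keep):
    """Project every row onto the attributes in keep, removing duplicates."""
    return {frozenset((a, v) for a, v in row if a in keep) for row in rows}

def natural_join(r1, r2):
    """Join two relation states on all attributes they have in common."""
    out = set()
    for t in r1:
        dt = dict(t)
        for u in r2:
            du = dict(u)
            if all(dt[a] == du[a] for a in dt.keys() & du.keys()):
                out.add(frozenset({**dt, **du}.items()))
    return out

EMP = relation(
    ("Ename", "Pname", "Dname"),
    [  # assumed sample state
        ("Smith", "X", "John"),
        ("Smith", "Y", "Anna"),
        ("Smith", "X", "Anna"),
        ("Smith", "Y", "John"),
    ],
)

EMP_PROJECTS = project(EMP, {"Ename", "Pname"})
EMP_DEPENDENTS = project(EMP, {"Ename", "Dname"})

rejoined = natural_join(EMP_PROJECTS, EMP_DEPENDENTS)
print(len(EMP), len(EMP_PROJECTS), len(EMP_DEPENDENTS), rejoined == EMP)
```

The four EMP tuples shrink to two tuples in each projection, and joining the projections on Ename restores the original state. The redundancy removed is the number of tuples no longer needed to pair every project with every dependent.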
Join Dependencies and Fifth Normal Form
In our discussion so far, we have pointed out the problematic functional dependencies and showed
how they were eliminated by a process of repeated binary decomposition to remove them during
the process of normalization to achieve 1NF, 2NF, 3NF and BCNF. Achieving 4NF typically
involves eliminating MVDs by repeated binary decompositions as well. However, in some cases
there may be no nonadditive join decomposition of R into two relation schemas, but there may be
a nonadditive join decomposition into more than two relation schemas. Moreover, there may be no
functional dependency in R that violates any normal form up to BCNF, and there may be no
nontrivial MVD present in R either that violates 4NF. We then resort to another dependency called
the join dependency and, if it is present, carry out a multiway decomposition into fifth normal
form (5NF). It is important to note
that such a dependency is a very peculiar semantic constraint that is very difficult to detect in
practice; therefore, normalization into 5NF is very rarely done in practice.
Definition. A join dependency (JD), denoted by JD (R1, R2, ..., Rn), specified on relation schema
R, specifies a constraint on the states r of R. The constraint states that every legal state r of R should
have a nonadditive join decomposition into R1, R2, ..., Rn. Hence, for every such r we have
∗ (π_R1(r), π_R2(r), ..., π_Rn(r)) = r
Notice that an MVD is a special case of a JD where n = 2. That is, a JD denoted as JD (R1, R2)
implies an MVD (R1 ∩ R2) →→ (R1 – R2) (or, by symmetry, (R1 ∩ R2) →→ (R2 – R1)). A join
dependency JD (R1, R2, ..., Rn), specified on relation schema R, is a trivial JD if one of the relation
schemas Ri in JD(R1, R2, ..., Rn) is equal to R.
Such a dependency is called trivial because it has the nonadditive join property for any relation
state r of R and thus does not specify any constraint on R. We can now define fifth normal form,
which is also called project-join normal form.
Definition. A relation schema R is in fifth normal form (5NF) (or project-join normal form
(PJNF)) with respect to a set F of functional, multivalued, and join dependencies if, for every
nontrivial join dependency JD (R1, R2, ..., Rn) in F+ (that is, implied by F), every Ri is a
superkey of R.
For an example of a JD, consider once again the SUPPLY all-key relation in Figure 15.15(c).
Suppose that the following additional constraint always holds: Whenever a supplier s supplies part
p, and a project j uses part p, and the supplier s supplies at least one part to project j, then supplier
s will also be supplying part p to project j.
This constraint can be restated in other ways and specifies a join dependency JD (R1, R2, R3)
among the three projections R1 (Sname, Part_name), R2 (Sname, Proj_name), and R3 (Part_name,
Proj_name) of SUPPLY. If this constraint holds, the tuples below the dashed line in Figure
15.15(c) must exist in any legal state of the SUPPLY relation that also contains the tuples above
the dashed line. Figure 15.15(d) shows how the SUPPLY relation with the join dependency is
decomposed into three relations R1, R2, and R3 that are each in 5NF. Notice that applying a natural
join to any two of these relations produces spurious tuples, but applying a natural join to all three
together does not. The reader should verify this on the sample relation in Figure 15.15(c) and its
projections in Figure 15.15(d). This is because only the JD exists, but no MVDs are specified.
Notice, too, that the JD (R1, R2, R3) is specified on all legal relation states, not just on the one
shown in Figure 15.15(c).
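The verification suggested above can be sketched in Python. This is an illustration under stated assumptions, not a definitive implementation: the helper names are hypothetical, and the seven SUPPLY tuples are assumed sample data modeled on the figure, so they should be adjusted if the figure differs. The sketch projects SUPPLY onto R1, R2, and R3, then compares each pairwise join and the three-way join against the original state.

```python
ATTRS = ("Sname", "Part_name", "Proj_name")

def relation(attrs, tuples):
    """Build a relation state as a set of rows; each row is a frozenset of (attribute, value)."""
    return {frozenset(zip(attrs, t)) for t in tuples}

def project(rows, keep):
    """Project every row onto the attributes in keep, removing duplicates."""
    return {frozenset((a, v) for a, v in row if a in keep) for row in rows}

def natural_join(r1, r2):
    """Join two relation states on all attributes they have in common."""
    out = set()
    for t in r1:
        dt = dict(t)
        for u in r2:
            du = dict(u)
            if all(dt[a] == du[a] for a in dt.keys() & du.keys()):
                out.add(frozenset({**dt, **du}.items()))
    return out

SUPPLY = relation(ATTRS, [  # assumed sample state
    ("Smith", "Bolt", "ProjX"),
    ("Smith", "Nut", "ProjY"),
    ("Adamsky", "Bolt", "ProjY"),
    ("Walton", "Nut", "ProjZ"),
    ("Adamsky", "Nail", "ProjX"),
    # The two tuples below are the ones forced by the join dependency.
    ("Adamsky", "Bolt", "ProjX"),
    ("Smith", "Bolt", "ProjY"),
])

R1 = project(SUPPLY, {"Sname", "Part_name"})
R2 = project(SUPPLY, {"Sname", "Proj_name"})
R3 = project(SUPPLY, {"Part_name", "Proj_name"})

pairwise = [natural_join(R1, R2), natural_join(R1, R3), natural_join(R2, R3)]
three_way = natural_join(natural_join(R1, R2), R3)

print([len(j) for j in pairwise], len(three_way), three_way == SUPPLY)
```

With this sample data each pairwise join contains tuples that are not in SUPPLY, while the three-way join returns exactly the original state. Removing either of the last two tuples from SUPPLY makes the three-way join differ from the state, which is the sense in which the join dependency forces those tuples to exist.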
Discovering JDs in practical databases with hundreds of attributes is next to impossible. It can be
done only with a great degree of intuition about the data on the part of the designer. Therefore, the
current practice of database design pays scant attention to them.