Normalization
Functional Dependencies (Cont…)
• This means that the values of the Y component of a tuple in r depend
on, or are determined by, the values of the X component;
• Alternatively, the values of the X component of a tuple uniquely (or
functionally) determine the values of the Y component.
• We also say that there is a functional dependency from X to Y, or that
Y is functionally dependent on X.
• Relation extensions r(R) that satisfy the functional dependency
constraints are called legal relation states (legal extensions) of R.
• Hence, the main use of functional dependencies is to describe further
a relation schema R by specifying constraints on its attributes that
must hold at all times.
Functional Dependencies (Cont…)
• Thus, X functionally determines Y in a relation schema R if, and only if,
whenever two tuples of r(R) agree on their X-value, they must
necessarily agree on their Y-value. Note the following:
⁻ If a constraint on R states that there cannot be more than one
tuple with a given X-value in any relation instance r(R)—that is, X is
a candidate key of R—this implies that X → Y for any subset of
attributes Y of R (because the key constraint implies that no two
tuples in any legal state r(R) will have the same value of X). If X is a
candidate key of R, then X→R.
⁻ If X→Y in R, this does not say whether or not Y→X in R.
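The definition above can be checked mechanically against a single relation state. The following is a minimal sketch, assuming tuples are represented as Python dicts; the helper name `fd_holds` and the sample rows are hypothetical. Passing the check only shows that this one state does not violate the dependency; it does not establish that the dependency holds for the schema.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if no two rows agree on lhs but differ on rhs."""
    seen = {}  # maps each X-value to the Y-value first seen with it
    for row in rows:
        x = tuple(row[a] for a in lhs)
        y = tuple(row[a] for a in rhs)
        if seen.setdefault(x, y) != y:
            return False  # two tuples agree on X but not on Y
    return True


# Hypothetical relation state; the values are made up for illustration.
rows = [
    {"Ssn": "111", "Ename": "Smith", "Dnumber": 5},
    {"Ssn": "222", "Ename": "Wong", "Dnumber": 5},
    {"Ssn": "333", "Ename": "Smith", "Dnumber": 4},
]

# Ssn is unique in this state, so Ssn -> Y is satisfied for any Y.
print(fd_holds(rows, ["Ssn"], ["Ename", "Dnumber"]))  # True
# X -> Y says nothing about Y -> X: Ename does not determine Ssn here.
print(fd_holds(rows, ["Ename"], ["Ssn"]))  # False
```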
Functional Dependencies (Cont…)
• Examples
⁻ Ssn → Ename (an Ssn value uniquely determines the Ename)
⁻ Pnumber →{Pname, Plocation} (Pnumber uniquely determines
Pname and Plocation)
⁻ {Ssn, Pnumber}→Hours (a combination of Ssn and Pnumber
uniquely determines hours)
• A functional dependency is a property of the relation schema R, not of
a particular legal relation state r of R.
• Given a populated relation, one cannot determine which FDs hold and
which do not unless the meaning of and the relationships among the
attributes are known.
Functional Dependencies (Cont…)
• The Figure in the next slide shows a particular state of the TEACH
relation schema.
• Although at first glance we may think that Text →Course, we cannot
confirm this unless we know that it is true for all possible legal states
of TEACH.
• It is, however, sufficient to demonstrate a single counterexample to
disprove a functional dependency.
Functional Dependencies (Cont…)
• For example, because ‘Smith’ teaches both ‘Data Structures’ and
‘Data Management,’ we can conclude that Teacher does not
functionally determine Course.
Functional Dependencies (Cont…)
• From the first Figure shown in the next slide, the following FDs may
hold because the four tuples in the current extension have no
violation of these constraints:
⁻ B → C; C → B; {A, B} → C; {A, B} → D; and {C, D} → B.
• However, the following do not hold because we already have
violations of them in the given extension:
⁻ A → B (tuples 1 and 2 violate this constraint);
⁻ B → A (tuples 2 and 3 violate this constraint);
⁻ D → C (tuples 3 and 4 violate this constraint).
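The violations listed above can be located by comparing every pair of tuples. This is a sketch only: the figure is not reproduced in the text, so the tuple values below are an assumption chosen to be consistent with the violations stated, and the helper name `violations` is hypothetical.

```python
from itertools import combinations

# Assumed extension of R(A, B, C, D); illustrative values only.
r = [
    {"A": "a1", "B": "b1", "C": "c1", "D": "d1"},
    {"A": "a1", "B": "b2", "C": "c2", "D": "d2"},
    {"A": "a2", "B": "b2", "C": "c2", "D": "d3"},
    {"A": "a3", "B": "b3", "C": "c4", "D": "d3"},
]


def violations(rows, lhs, rhs):
    """Return 1-based tuple-number pairs that agree on lhs but differ on rhs."""
    found = []
    for (i, t1), (j, t2) in combinations(enumerate(rows, start=1), 2):
        same_lhs = all(t1[a] == t2[a] for a in lhs)
        same_rhs = all(t1[a] == t2[a] for a in rhs)
        if same_lhs and not same_rhs:
            found.append((i, j))
    return found


# Each string lists single-letter attribute names, e.g. "AB" means {A, B}.
candidates = [("B", "C"), ("C", "B"), ("AB", "C"), ("AB", "D"), ("CD", "B"),
              ("A", "B"), ("B", "A"), ("D", "C")]
for lhs, rhs in candidates:
    print(f"{lhs} -> {rhs}: violating pairs {violations(r, lhs, rhs)}")
```

An empty list means only that the dependency may hold; a non-empty list is a counterexample that rules it out.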
Functional Dependencies (Cont…)
• A relation R (A, B, C, D) with its extension.
Normal Forms Based on Primary Keys
• We assume that a set of functional dependencies is given for each
relation, and that each relation has a designated primary key; this
information combined with the tests (conditions) for normal forms
drives the normalization process for relational schema design.
• Most practical relational design projects take one of the following two
approaches:
• Perform a conceptual schema design using a conceptual model such
as ER or EER and map the conceptual design into a set of relations
• Design the relations based on external knowledge derived from an
existing implementation of files or forms or reports
Normalization of Relations
• The normalization process, as first proposed by Codd (1972a), takes a
relation schema through a series of tests to certify whether it satisfies
a certain normal form.
• Initially, Codd proposed three normal forms, which he called first,
second, and third normal form.
• A stronger definition of 3NF, called Boyce-Codd normal form (BCNF),
was proposed later by Boyce and Codd.
• All these normal forms are based on a single analytical tool: the
functional dependencies among the attributes of a relation.
• Later, a fourth normal form (4NF) and a fifth normal form (5NF) were
proposed, based on the concepts of multivalued dependencies and
join dependencies, respectively.
Normalization of Relations (Cont…)
• Normalization of data can be considered a process of analyzing the
given relation schemas based on their FDs and primary keys to
achieve the desirable properties of
1) Minimizing redundancy and
2) Minimizing the insertion, deletion, and update anomalies.
• It can be considered as a “filtering” or “purification” process to make
the design have successively better quality.
• The normalization procedure provides database designers with the
following:
• A formal framework for analyzing relation schemas based on their keys
and on the functional dependencies among their attributes
• A series of normal form tests that can be carried out on individual
relation schemas so that the relational database can be normalized to
any desired degree.
Definitions of Keys and Attributes
Participating in Keys
• Review the definitions of superkey, candidate key, and primary key.
• Definition. An attribute of relation schema R is called a prime
attribute of R if it is a member of some candidate key of R. An
attribute is called nonprime if it is not a prime attribute—that is, if it
is not a member of any candidate key.
• In the Figure below, both Ssn and Pnumber are prime attributes of
WORKS_ON, whereas other attributes of WORKS_ON are nonprime.
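Once the candidate keys are known, the prime/nonprime split is a set union and a set difference. A minimal sketch, assuming WORKS_ON has the attributes Ssn, Pnumber, and Hours with the single candidate key {Ssn, Pnumber} (the figure is not reproduced here); the function name `classify_attributes` is hypothetical.

```python
def classify_attributes(attributes, candidate_keys):
    """Split attributes into (prime, nonprime) given all candidate keys."""
    prime = set().union(*candidate_keys)
    return prime, set(attributes) - prime


# Assumed schema and key for WORKS_ON.
prime, nonprime = classify_attributes(
    ["Ssn", "Pnumber", "Hours"],
    [{"Ssn", "Pnumber"}],
)
print(sorted(prime))     # ['Pnumber', 'Ssn']
print(sorted(nonprime))  # ['Hours']
```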
First Normal Form (1NF)
• First normal form (1NF) is now considered to be part of the formal
definition of a relation in the basic (flat) relational model;
• Historically, it was defined to disallow multivalued attributes,
composite attributes, and their combinations.
• It states that the domain of an attribute must include only atomic
(simple, indivisible) values and that the value of any attribute in a
tuple must be a single value from the domain of that attribute.
• In other words, 1NF disallows relations within relations or relations as
attribute values within tuples.
First Normal Form (Cont…)
• Consider the DEPARTMENT relation schema shown below.
• As we can see, this is not in 1NF because Dlocations is not an atomic
attribute, as illustrated by the first tuple in the following figure.
First Normal Form (Cont…)
• There are two ways we can look at the Dlocations attribute:
• The domain of Dlocations contains atomic values, but some tuples
can have a set of these values. In this case, Dlocations is not
functionally dependent on the primary key Dnumber.
• The domain of Dlocations contains sets of values and hence is
nonatomic. In this case, Dnumber→Dlocations because each set is
considered a single member of the attribute domain.
• In either case, the DEPARTMENT relation is not in 1NF.
• There are three main techniques to achieve first normal form for such
a relation.
First Normal Form (Cont…)
1) Remove the attribute Dlocations that violates 1NF and place it in a
separate relation DEPT_LOCATIONS along with the primary key
Dnumber of DEPARTMENT. The primary key of this relation is the
combination {Dnumber, Dlocation} as shown in the first figure of
next slide.
2) Expand the key so that there will be a separate tuple in the original
DEPARTMENT relation for each location of a DEPARTMENT, as
shown in the second figure of next slide. In this case, the primary
key becomes the combination {Dnumber, Dlocation}. This solution
has the disadvantage of introducing redundancy in the relation.
3) If a maximum number of values is known for the attribute—for
example, if it is known that at most three locations can exist for a
department—replace the Dlocations attribute by three atomic
attributes: Dlocation1, Dlocation2, and Dlocation3. This solution has
the disadvantage of introducing NULL values.
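The three techniques can be compared side by side on a small example. This is a sketch under assumptions: the rows are illustrative rather than taken from the figure, relations are modeled as lists of Python dicts, and the variable names are hypothetical.

```python
# Non-1NF DEPARTMENT: Dlocations holds a set of values (illustrative rows).
department = [
    {"Dname": "Research", "Dnumber": 5, "Dlocations": ["Bellaire", "Houston"]},
    {"Dname": "Headquarters", "Dnumber": 1, "Dlocations": ["Houston"]},
]

# Technique 1: move Dlocations to DEPT_LOCATIONS(Dnumber, Dlocation).
department_1nf = [
    {k: v for k, v in d.items() if k != "Dlocations"} for d in department
]
dept_locations = [
    {"Dnumber": d["Dnumber"], "Dlocation": loc}
    for d in department
    for loc in d["Dlocations"]
]

# Technique 2: one tuple per location, key {Dnumber, Dlocation}.
# Dname is repeated for every location, which is the redundancy noted above.
department_expanded = [
    {"Dname": d["Dname"], "Dnumber": d["Dnumber"], "Dlocation": loc}
    for d in department
    for loc in d["Dlocations"]
]

# Technique 3: fixed columns Dlocation1..Dlocation3, padded with None (NULL).
MAX_LOCATIONS = 3
department_wide = []
for d in department:
    padded = d["Dlocations"] + [None] * (MAX_LOCATIONS - len(d["Dlocations"]))
    row = {"Dname": d["Dname"], "Dnumber": d["Dnumber"]}
    row.update({f"Dlocation{i}": loc for i, loc in enumerate(padded, start=1)})
    department_wide.append(row)

print(dept_locations)
print(department_expanded)
print(department_wide)
```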
First Normal Form (Cont…)
• Example for the first case
First Normal Form (Cont…)
• First normal form also disallows multivalued attributes that are
themselves composite. These are called nested relations.
• The Figure below shows how the EMP_PROJ relation could appear if
nesting is allowed.
• The schema of this EMP_PROJ relation can be represented as follows:
EMP_PROJ(Ssn, Ename, {PROJS(Pnumber, Hours)})
First Normal Form (Cont…)
• To normalize this into 1NF, we remove the nested relation attributes
into a new relation and propagate the primary key into it; the primary
key of the new relation will combine the partial key with the primary
key of the original relation. Decomposition and primary key
propagation yield the schemas EMP_PROJ1 and EMP_PROJ2, as
shown in the Figure.
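The unnesting step described above can be sketched as follows. The sample values are made up, and modeling the nested relation as a list under a `Projs` field is an assumption made only for this illustration.

```python
# Nested EMP_PROJ(Ssn, Ename, {PROJS(Pnumber, Hours)}); hypothetical values.
emp_proj = [
    {"Ssn": "111", "Ename": "Smith",
     "Projs": [{"Pnumber": 1, "Hours": 32.5}, {"Pnumber": 2, "Hours": 7.5}]},
    {"Ssn": "222", "Ename": "Wong",
     "Projs": [{"Pnumber": 2, "Hours": 10.0}]},
]

# EMP_PROJ1(Ssn, Ename): what remains after removing the nested relation.
emp_proj1 = [{"Ssn": e["Ssn"], "Ename": e["Ename"]} for e in emp_proj]

# EMP_PROJ2(Ssn, Pnumber, Hours): the nested tuples with Ssn propagated.
# Its primary key is {Ssn, Pnumber}: the original key plus the partial key.
emp_proj2 = [
    {"Ssn": e["Ssn"], **p}
    for e in emp_proj
    for p in e["Projs"]
]

print(emp_proj1)
print(emp_proj2)
```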
First Normal Form (Cont…)
• The existence of more than one multivalued attribute in one relation
must be handled carefully. As an example, consider the following non-
1NF relation: PERSON (Ss#, {Car_lic#}, {Phone#})
• This relation represents the fact that a person has multiple cars and
multiple phones.
• If strategy 2 above is followed, it results in an all-key relation:
PERSON_IN_1NF (Ss#, Car_lic#, Phone#).
• The right way to deal with the two multivalued attributes in PERSON
shown previously is to decompose it into two separate relations, using
strategy 1 discussed above: P1(Ss#, Car_lic#) and P2(Ss#,
Phone#).
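One way to see why strategy 2 needs care is to count tuples. In the sketch below, the person's cars and phones are hypothetical values, and it is assumed that cars and phones are unrelated to each other, so the single flat relation has to pair every car with every phone.

```python
from itertools import product

# Non-1NF PERSON(Ss#, {Car_lic#}, {Phone#}); hypothetical values.
person = {
    "ss": "123",
    "cars": ["CAR-1", "CAR-2"],
    "phones": ["555-0001", "555-0002", "555-0003"],
}

# Strategy 2: one tuple per combination, giving an all-key relation.
person_in_1nf = [
    (person["ss"], car, phone)
    for car, phone in product(person["cars"], person["phones"])
]

# Strategy 1: one relation per multivalued attribute.
p1 = [(person["ss"], car) for car in person["cars"]]
p2 = [(person["ss"], phone) for phone in person["phones"]]

print(len(person_in_1nf))  # 6 tuples: every car paired with every phone
print(len(p1) + len(p2))   # 5 tuples in total
```

The gap between the two counts grows with the number of values, since the flat relation multiplies them while the decomposition adds them.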
Second Normal Form (2NF)
• Second normal form (2NF) is based on the concept of full functional
dependency.
• A functional dependency X → Y is a full functional dependency if
removal of any attribute A from X means that the dependency does
not hold any more.
• A functional dependency X→Y is a partial dependency if some
attribute A ∈ X can be removed from X and the dependency still holds.
• In the Figure (next slide), {Ssn, Pnumber} → Hours is a full
dependency (neither Ssn → Hours nor Pnumber→Hours holds).
• However, the dependency {Ssn, Pnumber}→Ename is partial because
Ssn→Ename holds.
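The full/partial distinction can be tested against a declared set of FDs. The sketch below uses attribute closure, a standard technique that is not described in these slides, and assumes the three FDs listed are the only ones that hold; the helper names `closure` and `is_full` are hypothetical.

```python
def closure(attrs, fds):
    """Attributes determined by attrs under fds (a list of (lhs, rhs) sets)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def is_full(lhs, rhs, fds):
    """True if lhs -> rhs holds and fails after removing any one attribute."""
    if not rhs <= closure(lhs, fds):
        return False  # the dependency does not hold at all
    return all(not rhs <= closure(lhs - {a}, fds) for a in lhs)


# FDs given in the text for EMP_PROJ.
fds = [
    ({"Ssn", "Pnumber"}, {"Hours"}),
    ({"Ssn"}, {"Ename"}),
    ({"Pnumber"}, {"Pname", "Plocation"}),
]

print(is_full({"Ssn", "Pnumber"}, {"Hours"}, fds))  # True: full dependency
print(is_full({"Ssn", "Pnumber"}, {"Ename"}, fds))  # False: Ssn -> Ename holds
```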
Second Normal Form (Cont…)
• Definition. A relation schema R is in 2NF if every nonprime attribute A
in R is fully functionally dependent on the primary key of R.
• The test for 2NF involves testing for functional dependencies whose
left-hand side attributes are part of the primary key.
• If the primary key contains a single attribute, the test need not be
applied at all.
Second Normal Form (Cont…)
• The EMP_PROJ relation in Figure (previous slide) is in 1NF but is not in
2NF.
• The nonprime attribute Ename violates 2NF because of FD2, as do the
nonprime attributes Pname and Plocation because of FD3.
• The functional dependencies FD2 and FD3 make Ename, Pname, and
Plocation partially dependent on the primary key {Ssn, Pnumber} of
EMP_PROJ, thus violating the 2NF test.
• Therefore, the functional dependencies FD1, FD2, and FD3 lead to the
decomposition of EMP_PROJ into the three relation schemas EP1,
EP2, and EP3 shown in the Figure (next slide), each of which is in 2NF.
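The decomposition can be sketched at the schema level by grouping attributes under the part of the key that determines them. This is an illustration rather than a general 2NF algorithm: it assumes every FD has a left-hand side inside the primary key, that FD1 is {Ssn, Pnumber} → Hours (the figure is not reproduced here), and the function name `decompose_2nf` is hypothetical.

```python
def decompose_2nf(primary_key, fds):
    """Group attributes by the part of the key that determines them.

    fds is a list of (lhs, rhs) pairs of sets whose lhs is the primary key
    or a subset of it; returns one attribute set per distinct lhs.
    """
    relations = {}
    for lhs, rhs in fds:
        if not lhs <= primary_key:
            raise ValueError("this sketch only handles FDs on key attributes")
        relations.setdefault(frozenset(lhs), set(lhs)).update(rhs)
    return list(relations.values())


primary_key = {"Ssn", "Pnumber"}
fds = [
    ({"Ssn", "Pnumber"}, {"Hours"}),        # FD1 (assumed numbering)
    ({"Ssn"}, {"Ename"}),                   # FD2
    ({"Pnumber"}, {"Pname", "Plocation"}),  # FD3
]

for schema in decompose_2nf(primary_key, fds):
    print(sorted(schema))
```

The three attribute sets printed correspond to the three schemas produced by the decomposition; in each one, the left-hand side of the generating FD serves as the key.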
Second Normal Form (Cont…)
Third Normal Form (3NF)
• Third normal form (3NF) is based on the concept of transitive
dependency.
• A functional dependency X→Y in a relation schema R is a transitive
dependency if there exists a set of attributes Z in R that is neither a
candidate key nor a subset of any key of R, and both X→Z and Z→Y
hold.
• The dependency Ssn→Dmgr_ssn is transitive through Dnumber in
EMP_DEPT in the Figure (next slide), because both the dependencies
Ssn → Dnumber and Dnumber → Dmgr_ssn hold and Dnumber is
neither a key itself nor a subset of the key of EMP_DEPT.
• The dependency of Dmgr_ssn (and also Dname) on Dnumber is
undesirable in EMP_DEPT since Dnumber is not a key of EMP_DEPT.
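Because a key determines every attribute, a transitive dependency on the key shows up as an FD Z → Y whose left-hand side is neither a key nor part of one and whose right-hand side contains a nonprime attribute. The sketch below flags such FDs; it assumes Ssn is the only candidate key of EMP_DEPT, that each left-hand side is minimal, and that only the FDs mentioned in the text are declared. The function name `transitive_violations` is hypothetical.

```python
def transitive_violations(candidate_keys, fds):
    """Return (Z, Y) pairs where Z is not a key or part of a key and Y is nonprime.

    candidate_keys is a list of attribute sets; fds is a list of (lhs, rhs)
    pairs of sets.
    """
    prime = set().union(*candidate_keys)
    bad = []
    for lhs, rhs in fds:
        # lhs <= key covers both "is a key" and "is a subset of a key".
        inside_key = any(lhs <= key for key in candidate_keys)
        nonprime_rhs = rhs - prime - lhs
        if not inside_key and nonprime_rhs:
            bad.append((lhs, nonprime_rhs))
    return bad


candidate_keys = [{"Ssn"}]  # assumed key of EMP_DEPT
fds = [
    ({"Ssn"}, {"Dnumber"}),
    ({"Dnumber"}, {"Dname", "Dmgr_ssn"}),
]

for lhs, rhs in transitive_violations(candidate_keys, fds):
    print(sorted(lhs), "->", sorted(rhs))
```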
Third Normal Form (Cont…)
• We can normalize EMP_DEPT by decomposing it into the two 3NF
relation schemas ED1 and ED2 shown in the Figure below.
• Definition. According to Codd’s original definition, a relation schema
R is in 3NF if it satisfies 2NF and no nonprime attribute of R is
transitively dependent on the primary key.
Third Normal Form (Cont…)
• Intuitively, we see that ED1 and ED2 represent independent entity
facts about employees and departments.
• A NATURAL JOIN operation on ED1 and ED2 will recover the original
relation EMP_DEPT without generating spurious tuples.
• In terms of the normalization process, it is not necessary to remove
the partial dependencies before the transitive dependencies.
• Historically, however, 3NF has been defined with the assumption that
a relation is tested for 2NF before it is tested for 3NF.
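The claim that a NATURAL JOIN of ED1 and ED2 recovers EMP_DEPT can be checked on a small state. This is a sketch only: the rows are made up, the split of attributes between ED1 and ED2 is an assumption (the figure is not reproduced here), and `project`, `natural_join`, and `as_set` are hypothetical helpers.

```python
def project(rows, attrs):
    """Projection with duplicate elimination, preserving first-seen order."""
    seen = dict.fromkeys(tuple((a, row[a]) for a in attrs) for row in rows)
    return [dict(t) for t in seen]


def natural_join(left, right):
    """Join rows that agree on every attribute name the two inputs share."""
    joined = []
    for l_row in left:
        for r_row in right:
            common = l_row.keys() & r_row.keys()
            if all(l_row[a] == r_row[a] for a in common):
                joined.append({**l_row, **r_row})
    return joined


def as_set(rows):
    """Order-insensitive view of a relation for comparison."""
    return {frozenset(row.items()) for row in rows}


# Hypothetical EMP_DEPT state.
emp_dept = [
    {"Ssn": "111", "Ename": "Smith", "Dnumber": 5, "Dname": "Research", "Dmgr_ssn": "333"},
    {"Ssn": "222", "Ename": "Wong", "Dnumber": 5, "Dname": "Research", "Dmgr_ssn": "333"},
    {"Ssn": "444", "Ename": "Zelaya", "Dnumber": 4, "Dname": "Administration", "Dmgr_ssn": "555"},
]

ed1 = project(emp_dept, ["Ssn", "Ename", "Dnumber"])       # employee facts
ed2 = project(emp_dept, ["Dnumber", "Dname", "Dmgr_ssn"])  # department facts

rejoined = natural_join(ed1, ed2)
print(len(ed2))                             # 2: each department stored once
print(as_set(rejoined) == as_set(emp_dept))  # True: no spurious tuples
```

The join attribute Dnumber is the key of ED2 in this state, so each ED1 tuple matches exactly one ED2 tuple.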
Summary of the Three Normal Forms
End of Chapter 4