MODULE-IV
Normalization of Database Tables
Normalization is a process for evaluating and correcting table structures to minimize data redundancies, thereby
reducing the likelihood of data anomalies. The normalization process involves assigning attributes to tables based on
the concepts of the relational database model.
Normalization works through a series of stages called normal forms. The first three stages are described as first
normal form (1NF), second normal form (2NF), and third normal form (3NF). From a structural point of view, 2NF
is better than 1NF, and 3NF is better than 2NF. For most purposes in business database design, 3NF is as high as
you need to go in the normalization process. However, you will discover in Section 5.3 that properly designed 3NF
structures also meet the requirements of fourth normal form (4NF).
Although normalization is a very important database design ingredient, you should not assume that the highest level
of normalization is always the most desirable. Generally, the higher the normal form, the more relational join
operations required to produce a specified output and the more resources required by the database system to respond
to end-user queries. A successful design must also consider end-user demand for fast performance. Therefore, you
will occasionally be expected to denormalize some portions of a database design in order to meet performance
requirements. Denormalization produces a lower normal form; that is, a table in 3NF will be converted to one in 2NF through
denormalization. However, the price you pay for increased performance through denormalization is greater data
redundancy.
Normalization of data can be considered a process of analyzing the given relation schemas based on their FDs and
primary keys to achieve the desirable properties of (1) minimizing redundancy and (2) minimizing the insertion,
deletion, and update anomalies.
SCHEMA REFINEMENT:
Consider a relation Hourly_Emps(ssn, name, lot, rating, hourly_wages, hours_worked) that stores information about employees paid by the hour. The key for Hourly_Emps is ssn. In addition, suppose that the hourly_wages attribute is determined by the rating attribute. That is, for a given rating value, there is only one permissible hourly_wages value. This integrity constraint (IC) is an example of a functional dependency. It leads to possible redundancy in the relation Hourly_Emps, as illustrated in Figure 19.1.
If the same value appears in the rating column of two tuples, the IC tells us that the same value must appear in the hourly_wages column as well. This redundancy has the same negative consequences as before:
Redundant Storage: The rating value 8 corresponds to the hourly wage 10, and this association is repeated three
times.
Update Anomalies: The hourly_wages value in the first tuple could be updated without making a similar change in the second tuple.
Insertion Anomalies: We cannot insert a tuple for an employee unless we know the hourly wage for the employee’s
rating value.
Deletion Anomalies: If we delete all tuples with a given rating value (e.g., we delete the tuples for Smethurst and Guldu), we lose the association between that rating value and its hourly_wages value.
Decompositions:
A decomposition of a relation schema R consists of replacing the relation schema by two (or more) relation
schemas that each contain a subset of the attributes of R and together include all attributes in R. Intuitively, we want
to store the information in any given instance of R by storing projections of the instance.
For example, we can decompose Hourly_Emps into two relations, Hourly_Emps2(ssn, name, lot, rating, hours_worked) and Wages(rating, hourly_wages). The instances of these relations corresponding to the instance of the Hourly_Emps relation in Figure 19.1 are
shown in Figure 19.2. Note that we can easily record the hourly wage for any rating simply by adding a tuple to
Wages, even if no employee with that rating appears in the current instance of Hourly_Emps. Changing the wage
associated with a rating involves updating a single Wages tuple. This is more efficient than updating several tuples
(as in the original design), and it eliminates the potential for inconsistency.
FUNCTIONAL DEPENDENCIES:
A functional dependency, denoted by X -> Y, between two sets of attributes X and Y that are subsets of R
specifies a constraint on the possible tuples that can form a relation state r of R. The constraint is that, for any two
tuples t1 and t2 in r that have t1[X] = t2[X], they must also have t1[Y] = t2[Y].
This means that the values of the Y component of a tuple in r depend on, or are determined by, the values of
the X component; alternatively, the values of the X component of a tuple uniquely (or functionally) determine the
values of the Y component. We also say that there is a functional dependency from X to Y, or that Y is functionally
dependent on X. The abbreviation for functional dependency is FD or f.d. The set of attributes X is called the left-
hand side of the FD, and Y is called the right-hand side.
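To make the definition concrete, here is a minimal Python sketch (an illustrative addition, not from the original text; the function name fd_holds and the sample rows are my own) that checks whether an FD X -> Y holds in a relation instance represented as a list of dictionaries:

```python
def fd_holds(rows, X, Y):
    """Return True if every pair of tuples that agrees on X also agrees on Y."""
    seen = {}  # maps each X-value combination to the Y-value combination it determines
    for t in rows:
        x_val = tuple(t[a] for a in X)
        y_val = tuple(t[a] for a in Y)
        if x_val in seen and seen[x_val] != y_val:
            return False  # two tuples agree on X but differ on Y: FD violated
        seen[x_val] = y_val
    return True

# Example based on Hourly_Emps: rating functionally determines hourly_wages.
hourly_emps = [
    {"ssn": "111", "rating": 8, "hourly_wages": 10},
    {"ssn": "222", "rating": 8, "hourly_wages": 10},
    {"ssn": "333", "rating": 5, "hourly_wages": 7},
]
print(fd_holds(hourly_emps, ["rating"], ["hourly_wages"]))  # True
```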
Figure 19.3 illustrates the meaning of the FD AB -> C by showing an instance that satisfies this dependency. The first two tuples show that an FD is not the same as a key constraint: although the FD is not violated, AB is clearly not a key for the relation. The third and fourth tuples illustrate that if two tuples differ in either the A field or the B field, they can differ in the C field without violating the FD. On the other hand, if we add a tuple (a1, b1, c2, d1) to the instance shown in this figure, the resulting instance would violate the FD; to see this violation, compare the first tuple in the figure with the new tuple.
Given a set of FDs over a relation schema R, typically several additional FDs hold over R whenever all of
the given FDs hold. As an example, consider:
Workers(ssn, name, lot, did, since)
We know that ssn -> did holds, since ssn is the key, and the FD did -> lot is given to hold. Therefore, in any legal instance of Workers, if two tuples have the same ssn value, they must have the same did value (from the first FD), and because they have the same did value, they must also have the same lot value (from the second FD). Therefore, the FD ssn -> lot also holds on Workers.
We say that an FD f is implied by a given set F of FDs if f holds on every relation instance that satisfies all
dependencies in F; that is, f holds whenever all FDs in F hold. Note that it is not sufficient for f to hold on some
instance that satisfies all dependencies in F; rather, f must hold on every instance that satisfies all dependencies in F.
The set of all FDs implied by a given set F of FDs is called the closure of F, denoted F+.
- In order to check for the presence of redundancy and anomalies, we need to ascertain the possible presence of other FDs implied by those stated explicitly. This means that we have to calculate the closure F+.
- This may be done using Armstrong's Axioms, which may be stated as follows: letting X, Y, Z denote sets of attributes of a relation R,
• Reflexivity: If X ⊇ Y (i.e., X contains Y), then X -> Y. (This rule generates only trivial FDs.)
• Augmentation: If X -> Y, then XZ -> YZ for any set of attributes Z.
• Transitivity: If X -> Y and Y -> Z, then X -> Z.
It is convenient to add the following additional rules, which can be derived from the axioms:
• Union: If X -> Y and X -> Z, then X -> YZ.
• Decomposition: If X -> YZ, then X -> Y and X -> Z.
Note that these rules do not imply that you may 'cancel' attributes appearing on both sides. Thus if AB -> BC, you may not conclude that A -> C.
Consider the relation ABC with FDs {(i) A->B and (ii) B->C}
1. From Reflexivity we get all the trivial FDs, which are of the form
X -> Y, where Y ⊆ X, X ⊆ ABC and Y ⊆ ABC.
2. From Transitivity we get A -> C.
3. From Augmentation we get, among others, AC -> BC, AB -> AC and AB -> CB.
Thus the closure of the set F of given FDs is (apart from trivial FDs):
F+ = {A -> B, B -> C, A -> C, AC -> BC, AB -> AC, AB -> CB}
Consider the relation Contracts, with attributes contract id (C), supplier id (S), project id (J), department id (D), part id (P), quantity (Q) and value (V), which is characterized by the set of FDs
{(i) C -> CSJDPQV, (ii) JP -> C, (iii) SD -> P}.
Attribute Closure:
Constructing the closure of a set of FDs may be fairly laborious. It may be avoided when one wishes to check the possible right-hand sides of an FD X -> Y for a given X, by means of the following algorithm, which calculates the so-called attribute closure, denoted X+, of a set X = {A1, A2, … , An} of attributes with respect to the set F of FDs.
1. Let X be a set of attributes that eventually will become the closure. First we initialize X to be {A1, A2, … ,An}.
2. We repeatedly search for some FD B1 B2 …Bm ->C such that all of B1 B2 … Bm are in the set of attributes X,
but C is not. We then add C to the set X.
3. Repeat step 2 as many times as necessary until no more new attributes can be added to X.
4. The final set X is the correct value of {A1, A2, … ,An}+.
As an example, let us compute (JP)+ for the Contracts FDs above.
1. We initialize X = {J, P}.
2. (i) does not satisfy the requirement that its left-hand side be contained in X;
(ii) does, therefore we set X = X ∪ {C} = {J, P, C};
(iii) does not.
We now repeat step 2:
(i) now does satisfy the requirement that its left-hand side be contained in X, therefore we set X = X ∪ {C, S, J, D, P, Q, V} = {J, P, C, S, D, Q, V};
(ii) and (iii) add nothing new, and repeating step 2 does not change X.
Therefore we stop, having obtained (JP)+ = {J, P, C, S, D, Q, V}, that is, all the attributes of Contracts.
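The attribute-closure algorithm above translates almost line for line into code. Here is a Python sketch (the function name and the encoding of FDs as pairs of attribute strings are illustrative choices, not from the original text) that reproduces the (JP)+ computation for Contracts:

```python
def attribute_closure(attrs, fds):
    """Compute X+: start from X and repeatedly add the right-hand side of any
    FD whose left-hand side is already contained in the growing set
    (steps 1-3 of the algorithm above)."""
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for left, right in fds:
            if set(left) <= closure and not set(right) <= closure:
                closure |= set(right)
                changed = True
    return closure

# Contracts: (i) C -> CSJDPQV, (ii) JP -> C, (iii) SD -> P
fds = [("C", "CSJDPQV"), ("JP", "C"), ("SD", "P")]
print(sorted(attribute_closure("JP", fds)))  # ['C', 'D', 'J', 'P', 'Q', 'S', 'V']
```

The output confirms that (JP)+ contains every attribute of Contracts, matching the hand computation.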
The objective of normalization is to produce tables with the following characteristics:
- Each table represents a single subject. For example, a course table will contain only data that directly pertain to courses. Similarly, a student table will contain only student data.
- No data item will be unnecessarily stored in more than one table (in short, tables have minimum controlled
redundancy). The reason for this requirement is to ensure that the data are updated in only one place.
- All nonprime attributes in a table are dependent on the primary key—the entire primary key and nothing but the
primary key. The reason for this requirement is to ensure that the data are uniquely identifiable by a primary key
value.
- Each table is void of insertion, update, or deletion anomalies. This is to ensure the integrity and consistency of the
data.
To accomplish the objective, the normalization process takes you through the steps that lead to successively higher normal forms. The most common normal forms and their basic characteristics are listed in Table 5.2.
First Normal Form (1NF):
First normal form states that the domain of an attribute must include only atomic (simple, indivisible) values and that the
value of any attribute in a tuple must be a single value from the domain of that attribute. Hence, 1NF disallows
having a set of values, a tuple of values, or a combination of both as an attribute value for a single tuple. In other
words, 1NF disallows relations within relations or relations as attribute values within tuples. The only attribute
values permitted by 1NF are single atomic (or indivisible) values.
Consider the DEPARTMENT relation schema shown in Figure 15.9(a), whose primary key is Dnumber, and suppose that each department can have a number of locations. The DEPARTMENT schema and a sample relation state are shown in Figure 15.9(b). As we can see, this is not in 1NF because Dlocations is not an atomic attribute, as illustrated by the first tuple in Figure 15.9(b). One way to achieve 1NF is to expand the primary key to {Dnumber, Dlocation} so that there is a separate tuple for each location of a department; the relation converted into 1NF in this way is shown in Figure 15.9(c).
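The conversion can be illustrated with a short Python sketch (the sample location values are illustrative, not taken from Figure 15.9): each tuple with a set-valued Dlocations attribute is expanded into one tuple per atomic location.

```python
# Non-1NF: Dlocations holds a set of values inside a single tuple.
department = [
    {"Dnumber": 5, "Dname": "Research", "Dlocations": ["Bellaire", "Sugarland", "Houston"]},
    {"Dnumber": 4, "Dname": "Administration", "Dlocations": ["Stafford"]},
]

# 1NF: one atomic Dlocation value per tuple; the key expands to {Dnumber, Dlocation}.
department_1nf = [
    {"Dnumber": d["Dnumber"], "Dname": d["Dname"], "Dlocation": loc}
    for d in department
    for loc in d["Dlocations"]
]
for row in department_1nf:
    print(row)
```

Note that Dname is now repeated for every location of a department; this is exactly the kind of redundancy that the higher normal forms go on to address.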
Second Normal Form (2NF):
A relation schema R is in 2NF if every nonprime attribute of R is fully functionally dependent on the primary key of R. The test for 2NF involves testing for functional dependencies whose left-hand side attributes are part of the primary key. If the primary key contains a single attribute, the test need not be applied at all. The EMP_PROJ relation in
Figure 15.10 is in 1NF but is not in 2NF. The nonprime attribute Ename violates 2NF because of FD2, as do the
nonprime attributes Pname and Plocation because of FD3. The functional dependencies FD2 and FD3 make Ename,
Pname, and Plocation partially dependent on the primary key {Ssn, Pnumber} of EMP_PROJ, thus violating the
2NF test.
If a relation schema is not in 2NF, it can be normalized further into a number of 2NF relations in
which nonprime attributes are associated only with the part of the primary key on which they are fully functionally
dependent. Therefore, the functional dependencies FD1, FD2, and FD3 in Figure 15.10 lead to the decomposition of
EMP_PROJ into the three relation schemas EP1, EP2, and EP3 shown in Figure 15.11, each of which is in 2NF.
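For the common case of a single candidate key, this 2NF test can be mechanized directly: any given FD whose left-hand side is a proper subset of the primary key and whose right-hand side contains a nonprime attribute is a partial dependency. A Python sketch (the function name is an illustrative addition) applied to the EMP_PROJ dependencies:

```python
def partial_dependencies(primary_key, fds):
    """Flag FDs whose left side is a proper subset of the primary key and
    whose right side contains a nonprime attribute (assumes the primary
    key is the only candidate key)."""
    prime = set(primary_key)
    violations = []
    for left, right in fds:
        if set(left) < prime and set(right) - prime:
            violations.append((set(left), set(right) - prime))
    return violations

# EMP_PROJ (Figure 15.10): key {Ssn, Pnumber};
# FD1: {Ssn, Pnumber} -> Hours; FD2: Ssn -> Ename; FD3: Pnumber -> {Pname, Plocation}
fds = [
    ({"Ssn", "Pnumber"}, {"Hours"}),
    ({"Ssn"}, {"Ename"}),
    ({"Pnumber"}, {"Pname", "Plocation"}),
]
for lhs, attrs in partial_dependencies({"Ssn", "Pnumber"}, fds):
    print(f"{sorted(lhs)} partially determines {sorted(attrs)}")
# FD2 and FD3 are flagged, matching the split into EP1, EP2, and EP3.
```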
Figure 15.10
Third Normal Form (3NF):
Third normal form (3NF) is based on the concept of transitive dependency. A functional dependency X -> Y in a relation schema R is a transitive dependency if there exists a set of attributes Z in R that is neither a candidate key nor a subset of any key of R, and both X -> Z and Z -> Y hold.
Definition: According to Codd's original definition, a relation schema R is in 3NF if it satisfies 2NF and no nonprime attribute of R is transitively dependent on the primary key. A more general definition is: a relation schema R is in third normal form (3NF) if, whenever a nontrivial functional dependency X -> A holds in R, either (a) X is a superkey of R, or (b) A is a prime attribute of R.
Figure 15.5
The dependency Ssn -> Dmgr_ssn is transitive through Dnumber in EMP_DEPT in Figure 15.5, because both the dependencies Ssn -> Dnumber and Dnumber -> Dmgr_ssn hold and Dnumber is neither a key itself nor a subset of
the key of EMP_DEPT. Intuitively, we can see that the dependency of Dmgr_ssn on Dnumber is undesirable in
EMP_DEPT since Dnumber is not a key of EMP_DEPT.
The relation schema EMP_DEPT in Figure 15.5 is in 2NF, since no partial dependencies on a key exist. However,
EMP_DEPT is not in 3NF because of the transitive dependency of Dmgr_ssn (and also Dname) on Ssn via
Dnumber. We can normalize EMP_DEPT by decomposing it into the two 3NF relation schemas ED1 and ED2
shown in Figure 15.6.
Intuitively, we see that ED1 and ED2 represent independent entity facts about employees and departments. A
NATURAL JOIN operation on ED1 and ED2 will recover the original relation EMP_DEPT without generating
spurious tuples.
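The recovery claim can be checked mechanically. Below is a Python sketch of a natural join (the helper and the sample rows are illustrative stand-ins, not the actual data of Figures 15.5 and 15.6): joining ED1 and ED2 on their common attribute Dnumber returns exactly the original EMP_DEPT tuples.

```python
def natural_join(r1, r2):
    """Join two nonempty relations (lists of dicts) on their shared attributes."""
    common = set(r1[0]) & set(r2[0])  # attribute names both relations share
    return [
        {**t1, **t2}
        for t1 in r1
        for t2 in r2
        if all(t1[a] == t2[a] for a in common)
    ]

ed1 = [  # employee facts, with Dnumber as a foreign key
    {"Ssn": "123", "Ename": "Smith", "Dnumber": 5},
    {"Ssn": "456", "Ename": "Wong", "Dnumber": 5},
    {"Ssn": "789", "Ename": "Zelaya", "Dnumber": 4},
]
ed2 = [  # department facts
    {"Dnumber": 5, "Dname": "Research", "Dmgr_ssn": "333"},
    {"Dnumber": 4, "Dname": "Administration", "Dmgr_ssn": "987"},
]
emp_dept = natural_join(ed1, ed2)
print(len(emp_dept))  # 3 rows: one per original employee tuple, none spurious
```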
Figure 15.6
Boyce-Codd Normal Form (BCNF):
Boyce-Codd normal form (BCNF) was proposed as a simpler form of 3NF, but it was found to be stricter
than 3NF. That is, every relation in BCNF is also in 3NF; however, a relation in 3NF is not necessarily in BCNF.
Definition: A relation schema R is in BCNF if whenever a nontrivial functional dependency X->A holds in R, then
X is a superkey of R.
The formal definition of BCNF differs from the definition of 3NF in that condition (b) of 3NF, which allows A to be
prime, is absent from BCNF. That makes BCNF a stronger normal form compared to 3NF. In our example, FD5
violates BCNF in LOTS1A because AREA is not a superkey of LOTS1A. Note that FD5 satisfies 3NF in LOTS1A
because County_name is a prime attribute (condition b), but this condition does not exist in the definition of
BCNF.We can decompose LOTS1A into two BCNF relations LOTS1AX and LOTS1AY, shown in Figure 15.13(a).
This decomposition loses the functional dependency FD2 because its attributes no longer coexist in the same
relation after decomposition.
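The BCNF condition is also easy to test mechanically. The sketch below assumes the LOTS1A dependencies from the referenced figure, which is not reproduced here (FD1: Property_id determines all attributes; FD2: {County_name, Lot} determines the rest; FD5: Area -> County_name), so treat the FD list as an assumption; the closure helper repeats the attribute-closure algorithm shown earlier.

```python
def closure(attrs, fds):
    """Attribute closure, as in the earlier section."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for left, right in fds:
            if set(left) <= result and not set(right) <= result:
                result |= set(right)
                changed = True
    return result

def bcnf_violations(all_attrs, fds):
    """Return the nontrivial FDs X -> A whose left-hand side is not a superkey."""
    bad = []
    for left, right in fds:
        nontrivial = set(right) - set(left)
        if nontrivial and closure(left, fds) != set(all_attrs):
            bad.append((set(left), nontrivial))  # X is not a superkey of R
    return bad

attrs = {"Property_id", "County_name", "Lot", "Area"}
fds = [
    ({"Property_id"}, {"County_name", "Lot", "Area"}),  # FD1 (assumed)
    ({"County_name", "Lot"}, {"Property_id", "Area"}),  # FD2 (assumed)
    ({"Area"}, {"County_name"}),                        # FD5
]
print(bcnf_violations(attrs, fds))  # [({'Area'}, {'County_name'})]: FD5 violates BCNF
```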
Multivalued Dependencies:
A multivalued dependency (MVD) X ->> Y on relation schema R requires that whenever two tuples of an instance agree on X, the instance must also contain the two tuples obtained by swapping their Y values; intuitively, the set of Y values associated with an X value is independent of the remaining attributes. If we have a nontrivial MVD in a relation, we may have to repeat values redundantly in the tuples. In the
EMP relation of Figure 15.15(a), the values ‘X’ and ‘Y’ of Pname are repeated with each value of Dname (or, by
symmetry, the values ‘John’ and ‘Anna’ of Dname are repeated with each value of Pname). This redundancy is
clearly undesirable. However, the EMP schema is in BCNF because no functional dependencies hold in EMP.
Therefore, we need to define a fourth normal form that is stronger than BCNF and disallows relation schemas such
as EMP. Notice that relations containing nontrivial MVDs tend to be all-key relations—that is, their key is all their
attributes taken together.
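An MVD can be checked against an instance directly. The Python sketch below (function name illustrative; the sample rows follow the Pname and Dname values mentioned in the text) tests Ename ->> Pname by verifying that swapping the Pname values of any two tuples that agree on Ename yields a tuple that is also present:

```python
def mvd_holds(rows, X, Y):
    """Check X ->> Y: for every pair of tuples agreeing on X, the tuple that
    takes its Y values from the second tuple (and everything else from the
    first) must also be in the relation."""
    tuples = {tuple(sorted(t.items())) for t in rows}
    for t1 in rows:
        for t2 in rows:
            if all(t1[a] == t2[a] for a in X):
                swapped = {**t1, **{a: t2[a] for a in Y}}
                if tuple(sorted(swapped.items())) not in tuples:
                    return False
    return True

# EMP-style instance: every (Pname, Dname) combination must be present.
emp = [
    {"Ename": "Smith", "Pname": "X", "Dname": "John"},
    {"Ename": "Smith", "Pname": "Y", "Dname": "Anna"},
    {"Ename": "Smith", "Pname": "X", "Dname": "Anna"},
    {"Ename": "Smith", "Pname": "Y", "Dname": "John"},
]
print(mvd_holds(emp, ["Ename"], ["Pname"]))  # True
```

Removing any one of the four tuples would make the check fail; the MVD forces all combinations to be stored, which is exactly the redundancy that 4NF removes.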
Definition: A relation schema R is in fourth normal form (4NF) with respect to a set of dependencies F if, for every nontrivial multivalued dependency X ->> Y in F+, X is a superkey of R. For example, decomposing EMP into EMP_PROJECTS(Ename, Pname) and EMP_DEPENDENTS(Ename, Dname) removes the redundancy, and each of the resulting relations is in 4NF.
Beyond 4NF, if we can decompose a table further to eliminate redundancy and anomalies, then when we rejoin the decomposed tables by means of candidate keys, we should neither lose the original data nor generate any new records. In simple words, joining two or more decomposed tables should neither lose records nor create new ones. There should be no join dependency in the relation.
Figure 15.15
DECOMPOSITION:
Decomposition is a tool that allows us to eliminate redundancy. However, it is important to check that
decomposition does not introduce new problems. In particular, we should check whether decomposition allows us to
recover the original relation, and whether it allows us to check integrity constraints efficiently.
Properties Of Decompositions:
1. Lossless-Join Decomposition:
Let R be a relation schema and let F be a set of FDs over R. A decomposition of R into two schemas with
attribute sets X and Y is said to be a lossless-join decomposition with respect to F if, for every instance r of R that
satisfies the dependencies in F, ∏x (r) ⋈ ∏y (r) = r. In other words, we can recover the original relation from the
decomposed relations.
From the definition it is easy to see that r is always a subset of the natural join of the decomposed relations. In general, however, if we take projections of a relation and recombine them using natural join, we may obtain tuples that were not in the original relation (spurious tuples); a lossless-join decomposition is precisely one in which this cannot happen.
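For a decomposition into two schemas there is a standard test (a known result, though not stated in the text above): the decomposition of R into X and Y is lossless with respect to F if and only if F+ contains the FD (X ∩ Y) -> X or the FD (X ∩ Y) -> Y. A Python sketch, reusing the attribute-closure helper from earlier (all names illustrative):

```python
def closure(attrs, fds):
    """Attribute closure, as in the earlier section."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for left, right in fds:
            if set(left) <= result and not set(right) <= result:
                result |= set(right)
                changed = True
    return result

def lossless_binary(X, Y, fds):
    """Lossless iff the shared attributes determine all of X or all of Y."""
    common_closure = closure(set(X) & set(Y), fds)
    return set(X) <= common_closure or set(Y) <= common_closure

# Hourly_Emps split into Hourly_Emps2 and Wages: the common attribute is
# rating, and rating -> hourly_wages, so the decomposition is lossless.
fds = [({"rating"}, {"hourly_wages"}),
       ({"ssn"}, {"name", "lot", "rating", "hours_worked", "hourly_wages"})]
print(lossless_binary({"ssn", "name", "lot", "rating", "hours_worked"},
                      {"rating", "hourly_wages"}, fds))  # True
```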
2. Dependency-Preserving Decomposition:
The decomposition of relation schema R with FDs F into schemas with attribute sets X and Y is dependency-preserving if (Fx ∪ Fy)+ = F+, where Fx (the projection of F onto X) denotes the set of FDs in F+ that involve only attributes of X. That is, if we take the dependencies in Fx and Fy and compute the closure of their union, we get back all dependencies in the closure of F. Therefore, we need to enforce only the dependencies in Fx and Fy; all FDs in F+ are then sure to be satisfied. To enforce Fx, we need to examine only relation X (on inserts to that relation). To enforce Fy, we need to examine only relation Y.
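Checking dependency preservation can be automated for small schemas by brute force: the projection Fx contains every FD U -> A implied by F with U and A drawn only from the attributes of X. The sketch below (all names illustrative) projects F onto each part and verifies that the union of the projections implies every original FD; it confirms that splitting a schema with ssn -> dnumber and dnumber -> dmgr_ssn into {ssn, dnumber} and {dnumber, dmgr_ssn} preserves both dependencies.

```python
from itertools import combinations

def closure(attrs, fds):
    """Attribute closure, as in the earlier section."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for left, right in fds:
            if set(left) <= result and not set(right) <= result:
                result |= set(right)
                changed = True
    return result

def project(fds, S):
    """Fx: all FDs U -> A implied by fds with U and A drawn only from S."""
    projected = []
    for k in range(1, len(S) + 1):
        for U in combinations(sorted(S), k):
            rhs = (closure(U, fds) & set(S)) - set(U)
            if rhs:
                projected.append((set(U), rhs))
    return projected

def preserves(fds, parts):
    """True if the union of the projected FD sets implies every FD in fds."""
    union = [fd for S in parts for fd in project(fds, S)]
    return all(set(r) <= closure(l, union) for l, r in fds)

fds = [({"ssn"}, {"dnumber"}), ({"dnumber"}, {"dmgr_ssn"})]
print(preserves(fds, [{"ssn", "dnumber"}, {"dnumber", "dmgr_ssn"}]))  # True
```

Because project enumerates all subsets of each part, this is exponential in the number of attributes and is intended only as a teaching aid, not a production algorithm.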
DENORMALIZATION:
It’s important to remember that the optimal relational database implementation requires that all tables be at least in
third normal form (3NF). A good relational DBMS excels at managing normalized relations; that is, relations void of
any unnecessary redundancies that might cause data anomalies. Although the creation of normalized relations is an
important database design goal, it is only one of many such goals. Good database design also considers processing
(or reporting) requirements and processing speed. The problem with normalization is that as tables are decomposed
to conform to normalization requirements, the number of database tables expands. Therefore, in order to generate
information, data must be put together from various tables. Joining a large number of tables takes additional
input/output (I/O) operations and processing logic, thereby reducing system speed. Most relational database systems
are able to handle joins very efficiently. However, occasional circumstances may warrant some degree of denormalization so that processing speed can be increased.
Keep in mind that the advantage of higher processing speed must be carefully weighed against the disadvantage of
data anomalies. On the other hand, some anomalies are of only theoretical interest. For example, should people in a real-world database environment worry that a ZIP_CODE determines CITY in a CUSTOMER table whose primary key is the customer number? Is it really practical to produce a separate table for ZIP codes and cities just to eliminate this dependency?
A more comprehensive example of the need for denormalization due to reporting requirements is the case of a faculty evaluation report in which each row lists the scores obtained during the last four semesters taught. See Figure 5.17.
Although this report seems simple enough, the problem arises from the fact that the data are stored in a normalized table in which each row represents a different score for a given faculty member in a given semester. See Figure 5.18.
The difficulty of transposing multirow data to multicolumnar data is compounded by the fact that the last four
semesters taught are not necessarily the same for all faculty members (some might have taken sabbaticals, some
might have had research appointments, some might be new faculty with only two semesters on the job, etc.). To
generate this report, the two tables you see in Figure 5.18 were used. The EVALDATA table is the master data table
containing the evaluation scores for each faculty member for each semester taught; this table is normalized. The
FACHIST table contains the last four data points—that is, evaluation score and semester—for each faculty member.
The FACHIST table is a temporary denormalized table created from the EVALDATA table via a series of queries.
(The FACHIST table is the basis for the report shown in Figure 5.17.)
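The series of queries is not reproduced here, but the transformation itself is a pivot from rows to columns. Below is a Python sketch of the idea (the table names EVALDATA and FACHIST come from the text; the sample data, term encoding, and column names are illustrative assumptions):

```python
from collections import defaultdict

# Normalized EVALDATA-style rows: one row per faculty member per term
# (term codes like 20191 = Spring 2019, 20192 = Fall 2019 sort chronologically).
evaldata = [
    {"fac_id": "F1", "term": 20191, "score": 4.2},
    {"fac_id": "F1", "term": 20192, "score": 4.5},
    {"fac_id": "F1", "term": 20201, "score": 4.4},
    {"fac_id": "F1", "term": 20202, "score": 4.6},
    {"fac_id": "F1", "term": 20211, "score": 4.7},
    {"fac_id": "F2", "term": 20202, "score": 3.9},
    {"fac_id": "F2", "term": 20211, "score": 4.1},
]

# Collect each faculty member's scores in chronological order.
history = defaultdict(list)
for row in sorted(evaldata, key=lambda r: r["term"]):
    history[row["fac_id"]].append((row["term"], row["score"]))

# Temporary denormalized FACHIST-style rows: the last four data points
# spread across columns, so the report shows one row per faculty member.
fachist = []
for fac_id, points in history.items():
    record = {"fac_id": fac_id}
    for i, (term, score) in enumerate(points[-4:], start=1):
        record[f"term{i}"], record[f"score{i}"] = term, score
    fachist.append(record)

for r in fachist:
    print(r)
```

Because the FACHIST-style rows are rebuilt from EVALDATA each time the report runs, the redundancy they contain cannot drift out of sync with the normalized data.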
As seen in the faculty evaluation report, the conflicts between design efficiency, information requirements, and
performance are often resolved through compromises that may include denormalization. In this case and assuming
there is enough storage space, the designer’s choices could be narrowed down to:
- Store the data in a permanent denormalized table. This is not the recommended solution, because the denormalized table is subject to data anomalies (insert, update, and delete). This solution is viable only if performance is an issue.
- Create a temporary denormalized table from the permanent normalized table(s). Because the denormalized table exists only as long as it takes to generate the report, it disappears after the report is produced. Therefore, there are no data anomaly problems. This solution is practical only if performance is not an issue and there are no other viable processing options.