EDM - E1 - Data Architecture and Modeling - Normalization v1.1
Normalization
• Normalization
– Normal Forms
– 1st NF
– 2nd NF
– 3rd NF
– Denormalization
• Normalization Examples
– Example 1
– Example 2
• Synthesis of Relations
• References
• A relation has exactly one primary key (PK) and may have additional unique
keys. Every unique key, including the PK, is a candidate key (CK).
• A PK is a CK, but not every CK is a PK. CKs that are not the PK
are called Alternate Keys (AKs).
• Surrogate keys (SKs) replace long composite PKs (i.e. PKs that take a lot of
bytes/row) in the referencing table. Using the SK as the FK in referential
integrity constraints reduces disk space. The SK is defined in the
referenced table that contains the original PK, which now
becomes an AK.
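The surrogate key idea can be sketched with Python's built-in sqlite3 module. The table and column names below (rental, payment, rental_id and so on) are hypothetical, chosen only for illustration; the point is that the composite key stays unique as an AK while the referencing table carries a single short FK column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE rental (
    rental_id   INTEGER PRIMARY KEY,      -- surrogate key (SK), hypothetical name
    client_no   TEXT NOT NULL,
    property_no TEXT NOT NULL,
    UNIQUE (client_no, property_no)       -- original composite PK, now an AK
);
CREATE TABLE payment (                    -- hypothetical referencing table
    payment_id INTEGER PRIMARY KEY,
    rental_id  INTEGER NOT NULL REFERENCES rental (rental_id),  -- one short FK
    amount     INTEGER NOT NULL
);
""")
conn.execute("INSERT INTO rental (client_no, property_no) VALUES ('CR56', 'PG4')")
rental_id = conn.execute(
    "SELECT rental_id FROM rental WHERE client_no = 'CR56' AND property_no = 'PG4'"
).fetchone()[0]
conn.execute("INSERT INTO payment (rental_id, amount) VALUES (?, 350)", (rental_id,))


def try_insert(sql):
    """Run an INSERT and report whether the constraints accepted it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False


# The AK still rejects a duplicate composite key, and the FK rejects an orphan row.
duplicate_ok = try_insert("INSERT INTO rental (client_no, property_no) VALUES ('CR56', 'PG4')")
orphan_ok = try_insert("INSERT INTO payment (rental_id, amount) VALUES (999, 100)")
print(duplicate_ok, orphan_ok)
```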
• The attribute (or set of attributes) on the left side of a functional dependency is called the determinant, e.g.
– ProfName → ProfDept, ProfEmail
• Given the name of a professor, I can tell you with certainty his/her dept and email. In
other words, ProfDept and ProfEmail are functionally dependent on/determined by
ProfName
– StudentID → DormName, Fee
– (CustomerNumber, ItemNumber, Quantity) → Price
• A partial dependency occurs when a set of attributes that forms only part of a key
determines the value of a second set of attributes.
• A transitive dependency occurs when one set of attributes determines the value of a
second set of attributes, which in turn determines the value of a third set of attributes.
e.g. If A → B and B → C, then A → C
As an example,
• FD2: C → D :: Partial dependency :: Non-key attribute D is dependent on part (C) of the key (A, B, C)
• FD4: B → G :: Partial dependency :: Non-key attribute G is dependent on part (B) of the key (A, B, C)
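The transitivity rule (if A → B and B → C, then A → C) can be checked mechanically by computing an attribute closure. The following is a minimal sketch; the function name closure and the representation of each FD as a pair of sets are assumptions made for this illustration.

```python
def closure(attrs, fds):
    """Return every attribute determined by `attrs` under the given FDs.

    Each FD is a (determinant, dependent) pair of sets.
    """
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If we already know the whole determinant, we also know the dependents.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


fds = [({"A"}, {"B"}), ({"B"}, {"C"})]
a_closure = closure({"A"}, fds)
print(sorted(a_closure))  # C is reached from A through B
```

Because C appears in the closure of A, the dependency A → C follows from the two stated FDs.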
• Normalization is a technique for producing a set of relations with desirable properties, given the
data requirements of an enterprise.
• A table that is sufficiently normalized is less vulnerable to insertion, update and deletion
anomalies, because its structure reflects the basic assumptions for when multiple instances of
the same information should be represented by a single instance only.
• Anomalies can be removed by splitting the relation into two or more relations; each with a
different, single theme. However, breaking up a relation may create referential integrity
constraints.
• For the relational data model, it is important to recognize that it is only first
normal form (1NF) that is critical in creating relations. All the subsequent
normal forms are optional.
• A relation is in 2NF iff it is in 1NF and each of its non-key attributes depends either on
all of the PK or on no key (i.e. on other non-key attributes), never on only part of the PK
(no partial dependencies)
• A relation is in 5NF iff it is in 4NF and every join dependency in it is implied by the
CKs
clientNo  cName          propertyNo  pAddress                rentStart  rentFinish  rent  ownerNo  oName
CR76      …              PG4         6 Lawrence St, Glasgow  1-Jul-00   31-Aug-01   350   CO40     Tina Murphy
CR56      Aline Stewart  PG4         6 Lawrence St, Glasgow  1-Sep-99   10-Jun-00   350   CO40     Tina Murphy
CR56      Aline Stewart  PG36        2 Manor Rd, Glasgow     10-Oct-00  1-Dec-01    370   CO93     Tony Shaw
CR56      Aline Stewart  PG16        5 Novar Dr, Glasgow     1-Nov-02   1-Aug-03    450   CO93     Tony Shaw
Client
fd2 clientNo → cName (Primary key)
Rental
fd1 clientNo, propertyNo → rentStart, rentFinish (Primary key)
fd5 clientNo, rentStart → propertyNo, rentFinish (Candidate key)
fd6 propertyNo, rentStart → clientNo, rentFinish (Candidate key)
PropertyOwner
fd3 propertyNo → pAddress, rent, ownerNo, oName (Primary key)
fd4 ownerNo → oName (Transitive dependency)
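A functional dependency can be tested against sample rows: it is refuted as soon as two rows agree on the determinant but differ on a dependent column. The sketch below uses the rental rows from the sample data; the helper name fd_holds is an assumption for this illustration. Note that sample data can only refute an FD, never prove that it holds for all future data.

```python
def fd_holds(rows, lhs, rhs):
    """Return False if two rows agree on `lhs` but differ on `rhs`, else True."""
    seen = {}
    for row in rows:
        left = tuple(row[c] for c in lhs)
        right = tuple(row[c] for c in rhs)
        # setdefault stores the first value seen for this determinant.
        if seen.setdefault(left, right) != right:
            return False
    return True


columns = ["clientNo", "propertyNo", "rentStart", "rentFinish", "rent", "ownerNo", "oName"]
data = [
    ("CR76", "PG4", "1-Jul-00", "31-Aug-01", 350, "CO40", "Tina Murphy"),
    ("CR56", "PG4", "1-Sep-99", "10-Jun-00", 350, "CO40", "Tina Murphy"),
    ("CR56", "PG36", "10-Oct-00", "1-Dec-01", 370, "CO93", "Tony Shaw"),
    ("CR56", "PG16", "1-Nov-02", "1-Aug-03", 450, "CO93", "Tony Shaw"),
]
rows = [dict(zip(columns, r)) for r in data]

fd1 = fd_holds(rows, ["clientNo", "propertyNo"], ["rentStart", "rentFinish"])
fd3 = fd_holds(rows, ["propertyNo"], ["rent", "ownerNo", "oName"])
fd4 = fd_holds(rows, ["ownerNo"], ["oName"])
# clientNo alone is not a determinant of propertyNo: CR56 rents three properties.
client_to_property = fd_holds(rows, ["clientNo"], ["propertyNo"])
print(fd1, fd3, fd4, client_to_property)
```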
– Identify any determinants, other than the primary key, and the columns
they determine.
– Create and name a new table for each determinant and the unique
columns it determines.
– Move the determined columns from the original table to the new table.
The determinant becomes the primary key of the new table.
– Delete the columns you just moved from the original table except for
the determinant which will serve as a foreign key.
Applying these steps to fd4 gives: PropertyOwner (propertyNo, pAddress, rent, ownerNo) and Owner (ownerNo, oName)
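The steps above can be sketched as a small function that splits a table on a non-key determinant. The function name split_on_determinant and the dictionary-based row format are assumptions for this illustration; the rows are the PropertyOwner values from the sample data.

```python
def split_on_determinant(rows, determinant, determined):
    """Move `determined` columns into a new table keyed by `determinant`.

    Returns (original table without the moved columns, new table).
    The determinant stays in the original table as a foreign key.
    """
    new_table = {}
    original = []
    for row in rows:
        new_row = {determinant: row[determinant]}
        new_row.update({c: row[c] for c in determined})
        new_table[row[determinant]] = new_row  # one row per determinant value
        original.append({c: v for c, v in row.items() if c not in determined})
    return original, list(new_table.values())


property_owner = [
    {"propertyNo": "PG4", "pAddress": "6 Lawrence St, Glasgow", "rent": 350,
     "ownerNo": "CO40", "oName": "Tina Murphy"},
    {"propertyNo": "PG36", "pAddress": "2 Manor Rd, Glasgow", "rent": 370,
     "ownerNo": "CO93", "oName": "Tony Shaw"},
    {"propertyNo": "PG16", "pAddress": "5 Novar Dr, Glasgow", "rent": 450,
     "ownerNo": "CO93", "oName": "Tony Shaw"},
]

# fd4: ownerNo -> oName is the transitive dependency to remove.
property_owner_3nf, owner = split_on_determinant(property_owner, "ownerNo", ["oName"])
print(owner)
```

Each owner name is now stored once, and the three property rows keep only ownerNo as a reference to the Owner table.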
• The tables which describe the dimensions in the snowflake schema are in
third normal form.
• 3NF schemas are typically chosen for large data warehouses, especially
environments with significant data-loading requirements that are used to
feed data marts and execute long-running queries.
• Queries on 3NF schemas are often very complex and involve a large
number of tables. The performance of joins between large tables is thus a
primary consideration when using 3NF schemas.
• Minimize Redundancy
– One fact in one place – single theme
– Defined once – used consistently by all stakeholders
• Example
– Normalized relation
• CUSTOMER (CustNumber, CustName, Zip)
• CODES (Zip, City, State)
– Denormalized relation
• CUSTOMER (CustNumber, CustName, City, State, Zip)
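A short sketch with Python's built-in sqlite3 module shows how the two designs relate. The table and column names come from the example; all row values are hypothetical placeholders. The denormalized rows are recovered by joining the normalized tables, and a change to a city name touches a single CODES row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE CODES (Zip TEXT PRIMARY KEY, City TEXT, State TEXT);
CREATE TABLE CUSTOMER (
    CustNumber INTEGER PRIMARY KEY,
    CustName   TEXT,
    Zip        TEXT REFERENCES CODES (Zip)
);
-- Hypothetical sample rows, for illustration only
INSERT INTO CODES VALUES ('00001', 'Sample City', 'XX');
INSERT INTO CUSTOMER VALUES (1, 'Customer One', '00001');
INSERT INTO CUSTOMER VALUES (2, 'Customer Two', '00001');
""")

# Joining the normalized tables reproduces the denormalized CUSTOMER layout.
denormalized = conn.execute("""
    SELECT c.CustNumber, c.CustName, z.City, z.State, c.Zip
    FROM CUSTOMER AS c JOIN CODES AS z ON z.Zip = c.Zip
    ORDER BY c.CustNumber
""").fetchall()

# One fact in one place: renaming the city is a single-row update.
conn.execute("UPDATE CODES SET City = 'Renamed City' WHERE Zip = '00001'")
cities = conn.execute("""
    SELECT z.City FROM CUSTOMER AS c JOIN CODES AS z ON z.Zip = c.Zip
    ORDER BY c.CustNumber
""").fetchall()
print(denormalized, cities)
```

In the denormalized design the same rename would have to be applied to every customer row with that Zip, which is the redundancy the normalized design avoids at the cost of a join.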
FD1: A, B, C → D, E, F
FD2: C → D
FD3: B → E
FD4: E → F
• 2NF violation:
FD2 and FD3 violate 2NF for R, since they are partial dependencies, i.e.
their determinant is only part of the key (A, B, C).
1) S (C, D)
FD2: C → D
2) T (B, E)
FD3: B → E
3) R1 (A, B, C, E, F)
FD11: A, B, C → E, F
FD4: E → F
3NF violation:
Only FD4 violates 3NF, and only for R1: its non-key attribute F is determined by
another non-key attribute E, i.e. F is dependent on NO KEY. Note
that S and T are already in 3NF with respect to their FDs.
1) S (C, D)
FD2: C → D
2) T (B, E)
FD3: B → E
3) Q (E, F)
FD4: E → F
4) R12 (A, B, C, E)
FD112: A, B, C → E
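The classification used in this example (whole key, part of the key, or no key) can be sketched as a small function. The function name classify and the result labels are assumptions for this illustration; determinants that mix key and non-key attributes are outside the scope of this sketch.

```python
def classify(determinant, key):
    """Classify an FD by comparing its determinant with the primary key."""
    determinant, key = set(determinant), set(key)
    if determinant == key:
        return "whole key"
    if determinant < key:
        return "partial dependency (2NF violation)"
    if not determinant & key:
        return "non-key determinant (3NF violation)"
    return "mixed determinant (not covered here)"


key = {"A", "B", "C"}
fds = {
    "FD1": ({"A", "B", "C"}, {"D", "E", "F"}),
    "FD2": ({"C"}, {"D"}),
    "FD3": ({"B"}, {"E"}),
    "FD4": ({"E"}, {"F"}),
}
results = {name: classify(lhs, key) for name, (lhs, _rhs) in fds.items()}
for name, verdict in results.items():
    print(name, verdict)
```

This reproduces the findings above: FD2 and FD3 are the 2NF violations, and FD4 is the 3NF violation.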
1) S (C, D)
FD2: C → D
2) Talt (B, E, F) :: Take the complete transitive (and its partial) dependency into this table
FD3alt: B → E, F :: satisfies 2NF
FD4: E → F :: satisfies 2NF, violates 3NF
3) R1alt (A, B, C, E)
FD11alt: A, B, C → E :: satisfies 2NF
• After 3NF resolution of Talt, you will arrive at the same final answer as shown above.
Stu_Gr (A, B, C, D, E, F, G, H, I)
b) Assume that there are normalization anomalies present in this relation
Stu_Gr containing the 9 attributes
PROBLEM
Design a set of tables that will be in 3NF for this real world scenario
Row  A            B  C       D  E       F                     G                H         I
17   Wkeeping102  A  Summer  A  101006  Mahendra Singh Dhoni  Farukh Engineer  Fielding  [email protected]
…    …            …  …       …  …       …                     …                …         …
Example 2 - Constraints
CONSTRAINTS
Stu_Gr
6 A professor must have exactly 1 email, which may be shared with a spouse who teaches in the same college
9 "E" is a failing grade, whereas "A", "B", "C", "D" are passing grades
10 A student can enroll for the same class in the same term in 2 different sections, although this is not likely
• What happens if Hrithik Roshan joins the faculty as a professor in Fall, but doesn’t
start teaching till Spring? Can he be entered into the Stu_Gr table?
• Assuming there were only 24 records in the Stu_Gr table (as shown), and we found
that Irfan Pathan’s info (row 13) need no longer be tracked – Does that mean Vijay
Mallya has left the college?
• Assuming there were only 24 records in the Stu_Gr table (as shown), and we found
that Irfan Pathan’s info (row 13) need no longer be tracked – Does that mean that the
class “Fielding101” Section B was never offered in Spring? And that too by Vijay
Mallya?
• Based on constraints, and looking at the sample data, there is only 1 CK in the Stu_Gr
relation
Hence, the PK is
PK : A, B, C, E
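A proposed key can be sanity-checked against rows by testing whether its values are unique. In the sketch below the first row uses values from row 17 of the sample data; the second row is hypothetical and illustrates constraint 10 (the same student enrolled in two sections of the same class in the same term). The helper name is_unique is an assumption.

```python
def is_unique(rows, columns):
    """Return True if no two rows share the same values in `columns`."""
    seen = set()
    for row in rows:
        value = tuple(row[c] for c in columns)
        if value in seen:
            return False
        seen.add(value)
    return True


rows = [
    {"A": "Wkeeping102", "B": "A", "C": "Summer", "E": "101006"},  # from row 17
    {"A": "Wkeeping102", "B": "B", "C": "Summer", "E": "101006"},  # hypothetical row
]

without_section = is_unique(rows, ["A", "C", "E"])
with_section = is_unique(rows, ["A", "B", "C", "E"])
print(without_section, with_section)
```

Without B the key would not distinguish the two enrollments, which is why B is part of the PK.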
Stu_Gr (A, B, C, D, E, F, G, H, I)

Functional Dependency            Based on                   Non-Key attribute(s)  Determinant  Satisfies 1NF  Satisfies 2NF
FD1: A, B, C, E → D, F, G, H, I  Definition of Primary Key  D, F, G, H, I         Whole Key    Y              Y

3NF design:
Student (E, F) :: E = Student_Number, F = Student_Name
Prof (G, H, I)
Course_Prof (A, B, C, G)
Student_Course_Grade (A, B, C, D, E)
• Have all the identified anomalies been eliminated by the 3NF design?
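One way to explore this question is to build the four 3NF tables with Python's built-in sqlite3 module and replay the insertion and deletion scenarios. The primary and foreign keys below are assumptions inferred from the table structures, and the student number and grade values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
-- Keys are assumptions: the slides list the attributes but not the PKs.
CREATE TABLE Student (E TEXT PRIMARY KEY, F TEXT);
CREATE TABLE Prof (G TEXT PRIMARY KEY, H TEXT, I TEXT);
CREATE TABLE Course_Prof (
    A TEXT, B TEXT, C TEXT,
    G TEXT REFERENCES Prof (G),
    PRIMARY KEY (A, B, C)
);
CREATE TABLE Student_Course_Grade (
    A TEXT, B TEXT, C TEXT, D TEXT,
    E TEXT REFERENCES Student (E),
    PRIMARY KEY (A, B, C, E),
    FOREIGN KEY (A, B, C) REFERENCES Course_Prof (A, B, C)
);
""")

# Insertion scenario: a professor who is not teaching yet can still be recorded.
conn.execute("INSERT INTO Prof (G) VALUES ('Hrithik Roshan')")

# Deletion scenario: remove a student's records and check what survives.
conn.execute("INSERT INTO Prof (G) VALUES ('Vijay Mallya')")
conn.execute("INSERT INTO Course_Prof VALUES ('Fielding101', 'B', 'Spring', 'Vijay Mallya')")
conn.execute("INSERT INTO Student VALUES ('S-1', 'Irfan Pathan')")  # hypothetical number
conn.execute("INSERT INTO Student_Course_Grade VALUES ('Fielding101', 'B', 'Spring', 'A', 'S-1')")
conn.execute("DELETE FROM Student_Course_Grade WHERE E = 'S-1'")
conn.execute("DELETE FROM Student WHERE E = 'S-1'")

profs = [row[0] for row in conn.execute("SELECT G FROM Prof ORDER BY G")]
classes = conn.execute("SELECT A, B, C, G FROM Course_Prof").fetchall()
print(profs, classes)
```

Under these assumptions, the professor and the class offering both remain on record after the student's rows are deleted, and a professor with no classes can be entered.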
2. Database Concepts, 2nd edition, David M. Kroenke, Pearson Prentice Hall, ISBN 0-13-145141-3