Databases 03 - Normalisation
Data normalization is a process in which data attributes within a data model are organized
to increase the cohesion of entity types. In other words, the goal of data normalization is to
reduce and even eliminate data redundancy, an important consideration for application
developers because it is incredibly difficult to store objects in a relational database that
maintains the same information in several places.
Table 1 summarizes the three most common forms of normalization (first normal form
(1NF), second normal form (2NF), and third normal form (3NF)), describing how to put
entity types into a series of increasing levels of normalization. Higher levels of data
normalization are beyond the scope of this article.
With respect to terminology, a data schema is considered to be at the level of normalization
of its least normalized entity type. For example, if all of your entity types are at second
normal form (2NF) or higher then we say that your data schema is at 2NF.
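The rule above amounts to taking a minimum over the entity types. A minimal sketch in Python, assuming levels are encoded as integers (1 for 1NF, and so on); the entity type names and their levels are illustrative assumptions, not taken from the figures:

```python
# Normal form of each entity type, encoded as an integer (1 = 1NF, 2 = 2NF, 3 = 3NF).
# The entity type names and their levels are illustrative assumptions.
entity_levels = {"Order": 3, "OrderItem": 2, "Item": 3}


def schema_level(levels):
    """Return the schema's level: that of its least normalized entity type."""
    return min(levels.values())


level = schema_level(entity_levels)
print(f"Schema is at {level}NF")
```

Here one entity type at 2NF is enough to hold the whole schema at 2NF, even though the others are at 3NF.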
Level                       Rule
First normal form (1NF)     An entity type is in 1NF when it contains no repeating groups of data.
Second normal form (2NF)    An entity type is in 2NF when it is in 1NF and when all of its non-key attributes are fully dependent on its primary key.
Third normal form (3NF)     An entity type is in 3NF when it is in 2NF and when all of its attributes are directly dependent on the primary key.
1. First Normal Form (1NF)
An entity type is in first normal form (1NF) when it contains no repeating groups of data.
For example, in Figure 1 you see that there are several repeating attributes in the
Order0NF table: the ordered item information repeats nine times and the contact
information is repeated twice, once for shipping information and once for billing
information.
Although this initial version of orders could work, what happens when an order has more
than nine order items? Do you create additional order records for them? What about the
vast majority of orders that only have one or two items? Do we really want to waste all that
storage space in the database for the empty fields? Likely not.
Furthermore, do you want to write the code required to process the nine copies of item
information, even if it is only to marshal it back and forth between the appropriate number
of objects? Once again, likely not.
Figure 2 presents a reworked data schema where the order schema is put in first normal
form.
The introduction of the OrderItem1NF table enables us to have as many, or as few, order
items associated with an order, increasing the flexibility of our schema while reducing
storage requirements for small orders (the majority of our business).
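The move from repeating column groups to one row per order item can be sketched in plain Python. This is only an illustration under assumptions: the numbered column names (ItemName1, NumberOrdered1, and so on) are hypothetical stand-ins for the repeating group in Order0NF, and dictionaries stand in for table rows.

```python
# A hypothetical Order0NF row: item data repeats in numbered column groups.
# Column names such as ItemName1 and NumberOrdered1 are assumptions for illustration.
MAX_ITEMS = 9

order_0nf = {"OrderID": 1001}
for n in range(1, MAX_ITEMS + 1):
    order_0nf[f"ItemName{n}"] = None
    order_0nf[f"NumberOrdered{n}"] = None
order_0nf.update({"ItemName1": "widget", "NumberOrdered1": 3,
                  "ItemName2": "gadget", "NumberOrdered2": 1})


def split_order(row):
    """Split a flat 0NF row into one order row plus one row per filled item slot."""
    order = {"OrderID": row["OrderID"]}
    items = []
    for n in range(1, MAX_ITEMS + 1):
        if row[f"ItemName{n}"] is None:
            continue  # empty slot: nothing needs to be stored in 1NF
        items.append({
            "OrderID": row["OrderID"],
            "ItemSequence": len(items) + 1,
            "ItemName": row[f"ItemName{n}"],
            "NumberOrdered": row[f"NumberOrdered{n}"],
        })
    return order, items


order_1nf, order_items_1nf = split_order(order_0nf)
print(len(order_0nf), "columns in the flat row;",
      len(order_items_1nf), "item rows after the split")
```

The flat row carries all nine item slots whether or not they are used, while the split form stores only the two items actually ordered and has no upper limit on items.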
The ContactInformation1NF table offers a similar benefit: when an order is shipped and
billed to the same person (once again the majority of cases) we could use the same contact
information record in the database to reduce data redundancy. OrderPayment1NF was
introduced to enable customers to make several payments against an order.
Order0NF could accept up to two payments, the type being something like "MC" and the
description "MasterCard Payment", whereas with the new approach far more than two
payments could be supported.
Multiple payments are accepted only when the total of an order is large enough that a
customer must pay via more than one approach, perhaps paying some by check and some
by credit card.
An important thing to notice is the application of primary and foreign keys in the new
solution. Order1NF has kept OrderID, the original key of Order0NF, as its primary key. To
maintain the relationship back to Order1NF, the OrderItem1NF table includes
the OrderID column within its schema, which is why it has the stereotype of FK. When a
new table is introduced into a schema, in this case OrderItem1NF, as the result of first
normalization efforts, it is common to use the primary key of the original table (Order0NF) as
part of the primary key of the new table. Because OrderID is not unique for order items (you
can have several order items on an order), the column ItemSequence was added to form a
composite primary key for the OrderItem1NF table. A different approach to keys was taken
with the ContactInformation1NF table. The column ContactID, a surrogate key that has no
business meaning, was made the primary key.
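The key choices described here can be tried out with Python's built-in sqlite3 module. This is a sketch, not the schema of Figure 2: only the table names, OrderID, ItemSequence and ContactID come from the text, while Name, ItemName, ShipToContactID and BillToContactID are assumed columns added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE ContactInformation1NF (
    ContactID INTEGER PRIMARY KEY,       -- surrogate key, no business meaning
    Name      TEXT NOT NULL              -- assumed column
);
CREATE TABLE Order1NF (
    OrderID         INTEGER PRIMARY KEY, -- original key kept from Order0NF
    ShipToContactID INTEGER NOT NULL REFERENCES ContactInformation1NF(ContactID),
    BillToContactID INTEGER NOT NULL REFERENCES ContactInformation1NF(ContactID)
);
CREATE TABLE OrderItem1NF (
    OrderID      INTEGER NOT NULL REFERENCES Order1NF(OrderID),  -- FK back to the order
    ItemSequence INTEGER NOT NULL,
    ItemName     TEXT NOT NULL,          -- assumed column
    PRIMARY KEY (OrderID, ItemSequence)  -- composite primary key
);
""")

conn.execute("INSERT INTO ContactInformation1NF VALUES (1, 'Hal Jordan')")
conn.execute("INSERT INTO Order1NF VALUES (1001, 1, 1)")  # one contact ships and bills
conn.executemany("INSERT INTO OrderItem1NF VALUES (?, ?, ?)",
                 [(1001, 1, "widget"), (1001, 2, "gadget")])


def try_insert(sql, params):
    """Return True if the insert succeeds, False if a key constraint rejects it."""
    try:
        conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


# Same (OrderID, ItemSequence) pair twice: rejected by the composite primary key.
duplicate_ok = try_insert("INSERT INTO OrderItem1NF VALUES (?, ?, ?)", (1001, 2, "gizmo"))
# Order 9999 does not exist: rejected by the foreign key.
orphan_ok = try_insert("INSERT INTO OrderItem1NF VALUES (?, ?, ?)", (9999, 1, "gizmo"))
item_count = conn.execute("SELECT COUNT(*) FROM OrderItem1NF").fetchone()[0]
```

OrderID alone repeats across the two item rows, so it cannot be the key by itself; the pair (OrderID, ItemSequence) is what identifies an order item.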
2. Second Normal Form (2NF)
Although the solution presented in Figure 2 is improved over that of Figure 1, it can be
normalized further. Figure 3 presents the data schema of Figure 2 in second normal form
(2NF). An entity type is in second normal form (2NF) when it is in 1NF and when every
non-key attribute (any attribute that is not part of the primary key) is fully dependent on the
primary key. This was definitely not the case with the OrderItem1NF table, so we
need to introduce the new table Item2NF. The problem with OrderItem1NF is that item
information, such as the name and price of an item, does not depend upon an order for that
item.
For example, if Hal Jordan orders three widgets and Oliver Queen orders five widgets, the
facts that the item is called a "widget" and that the unit price is $19.95 are constant. This
information depends on the concept of an item, not the concept of an order for an item,
and therefore should not be stored in the order items table, which is why the Item2NF table
was introduced. OrderItem2NF retained the TotalPriceExtended column, a calculated value
that is the number of items ordered multiplied by the price of the item. The value of the
SubtotalBeforeTax column within the Order2NF table is the total of the values of the total
price extended for each of its order items.
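A small sqlite3 sketch of this split follows. Only the table names and TotalPriceExtended come from the text; ItemID, ItemName, UnitPriceCents and NumberOrdered are assumed column names, and prices are held as integer cents (an assumption made here to avoid floating point rounding).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Item2NF (
    ItemID         INTEGER PRIMARY KEY,  -- assumed key column
    ItemName       TEXT NOT NULL,        -- depends on the item only
    UnitPriceCents INTEGER NOT NULL      -- depends on the item only
);
CREATE TABLE OrderItem2NF (
    OrderID            INTEGER NOT NULL,
    ItemSequence       INTEGER NOT NULL,
    ItemID             INTEGER NOT NULL REFERENCES Item2NF(ItemID),
    NumberOrdered      INTEGER NOT NULL,
    TotalPriceExtended INTEGER NOT NULL, -- calculated value, in cents
    PRIMARY KEY (OrderID, ItemSequence)
);
""")
conn.execute("INSERT INTO Item2NF VALUES (1, 'widget', 1995)")  # $19.95, stored once


def add_order_item(order_id, sequence, item_id, number_ordered):
    """Insert an order item, calculating TotalPriceExtended from the item's price."""
    (unit_price,) = conn.execute(
        "SELECT UnitPriceCents FROM Item2NF WHERE ItemID = ?", (item_id,)
    ).fetchone()
    conn.execute(
        "INSERT INTO OrderItem2NF VALUES (?, ?, ?, ?, ?)",
        (order_id, sequence, item_id, number_ordered, number_ordered * unit_price),
    )


add_order_item(1001, 1, 1, 3)  # Hal Jordan orders three widgets
add_order_item(1002, 1, 1, 5)  # Oliver Queen orders five widgets

rows = conn.execute("""
    SELECT oi.OrderID, i.ItemName, oi.NumberOrdered, oi.TotalPriceExtended
    FROM OrderItem2NF AS oi
    JOIN Item2NF AS i ON i.ItemID = oi.ItemID
    ORDER BY oi.OrderID
""").fetchall()
item_rows = conn.execute("SELECT COUNT(*) FROM Item2NF").fetchone()[0]
```

Both orders refer to the same single widget row, so the name and unit price are recorded once no matter how many orders include the item.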
3. Third Normal Form (3NF)
An entity type is in third normal form (3NF) when it is in 2NF and when all of its attributes
are directly dependent on the primary key, that is, when no attribute depends on the key
only through another non-key attribute. A better way to word this rule might be that the
attributes of an entity type must depend on all portions of the primary key. In this case
there is a problem with the OrderPayment2NF table: the payment type description (such as
"MasterCard" or "Check") depends only on the payment type, not on the combination of
the order id and the payment type. To resolve this problem the PaymentType3NF table was
introduced in Figure 4, containing a description of the payment type as well as a unique
identifier for each payment type.
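The effect of moving the description into its own table can be sketched with sqlite3. PaymentType3NF is named in the text; the OrderPayment3NF table name, the column names, and the choice of (OrderID, PaymentTypeID) as the key are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PaymentType3NF (
    PaymentTypeID INTEGER PRIMARY KEY,   -- unique identifier for each payment type
    Description   TEXT NOT NULL          -- depends on the payment type only
);
CREATE TABLE OrderPayment3NF (           -- assumed table name
    OrderID       INTEGER NOT NULL,
    PaymentTypeID INTEGER NOT NULL REFERENCES PaymentType3NF(PaymentTypeID),
    AmountCents   INTEGER NOT NULL,      -- assumed column
    PRIMARY KEY (OrderID, PaymentTypeID) -- assumed key: order id plus payment type
);
""")
conn.executemany("INSERT INTO PaymentType3NF VALUES (?, ?)",
                 [(1, "MasterCard"), (2, "Check")])
conn.executemany("INSERT INTO OrderPayment3NF VALUES (?, ?, ?)",
                 [(1001, 1, 5000), (1001, 2, 2500), (1002, 1, 1000)])

# Changing the wording of a description touches exactly one row.
cursor = conn.execute(
    "UPDATE PaymentType3NF SET Description = 'MasterCard Payment' WHERE PaymentTypeID = 1"
)
rows_updated = cursor.rowcount

descriptions = [row[0] for row in conn.execute("""
    SELECT pt.Description
    FROM OrderPayment3NF AS op
    JOIN PaymentType3NF AS pt ON pt.PaymentTypeID = op.PaymentTypeID
    ORDER BY op.OrderID, op.PaymentTypeID
""")]
```

Because payments hold only the identifier, every payment of that type picks up the new description through the join, and two payments can never disagree about what a payment type is called.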
4. Beyond 3NF
The data schema of Figure 4 can still be improved upon, at least from the point of view of
data redundancy, by removing attributes that can be calculated/derived from other
ones. In this case we could remove the SubtotalBeforeTax column within
the Order3NF table and the TotalPriceExtended column of OrderItem3NF, as you see
in Figure 5.
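Once the stored columns are gone, the same values can be derived at query time. The sqlite3 sketch below assumes table and column names (Item3NF, ItemID, UnitPriceCents, NumberOrdered) that are not given in the text, and keeps prices in integer cents.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Item3NF (                   -- assumed table name
    ItemID         INTEGER PRIMARY KEY,
    ItemName       TEXT NOT NULL,
    UnitPriceCents INTEGER NOT NULL
);
CREATE TABLE OrderItem3NF (
    OrderID       INTEGER NOT NULL,
    ItemSequence  INTEGER NOT NULL,
    ItemID        INTEGER NOT NULL REFERENCES Item3NF(ItemID),
    NumberOrdered INTEGER NOT NULL,      -- no TotalPriceExtended column is stored
    PRIMARY KEY (OrderID, ItemSequence)
);
INSERT INTO Item3NF VALUES (1, 'widget', 1995), (2, 'gadget', 500);
INSERT INTO OrderItem3NF VALUES (1001, 1, 1, 3), (1001, 2, 2, 2);
""")


def total_price_extended(order_id):
    """Derive (ItemSequence, extended price in cents) for each item on an order."""
    return conn.execute("""
        SELECT oi.ItemSequence, oi.NumberOrdered * i.UnitPriceCents
        FROM OrderItem3NF AS oi
        JOIN Item3NF AS i ON i.ItemID = oi.ItemID
        WHERE oi.OrderID = ?
        ORDER BY oi.ItemSequence
    """, (order_id,)).fetchall()


def subtotal_before_tax(order_id):
    """Derive the order subtotal in cents; an order with no items yields 0."""
    (subtotal,) = conn.execute("""
        SELECT COALESCE(SUM(oi.NumberOrdered * i.UnitPriceCents), 0)
        FROM OrderItem3NF AS oi
        JOIN Item3NF AS i ON i.ItemID = oi.ItemID
        WHERE oi.OrderID = ?
    """, (order_id,)).fetchone()
    return subtotal
```

With nothing stored, a derived total cannot drift out of step with the quantities and prices it is calculated from; the cost is that each read has to do the join.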
5. Why Data Normalization?
The advantage of having a highly normalized data schema is that information is stored in
one place and one place only, reducing the possibility of inconsistent data. Furthermore,
highly normalized data schemas in general are closer conceptually to object-oriented
schemas because the object-oriented goals of promoting high cohesion and loose coupling
between classes result in similar solutions (at least from a data point of view). This
generally makes it easier to map your objects to your data schema.
6. Denormalization
From a purist point of view you want to normalize your data structures as much as possible,
but from a practical point of view you will find that you need to "back out" of some of your
normalizations for performance reasons. This is called "denormalization". For example, with
the data schema of Figure 1 all the data for a single order is stored in one row (assuming
orders of up to nine order items), making it very easy to access. With the data schema
of Figure 1 you could quickly determine the total amount of an order by reading the single
row from the Order0NF table. To do so with the data schema of Figure 5 you would need to
read data from a row in the Order table, data from all the rows from the OrderItem table for
that order and data from the corresponding rows in the Item table for each order item. For
this query, the data schema of Figure 1 very likely provides better performance.
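The difference in work can be made concrete by counting the rows each approach reads. This is a rough sketch with in-memory dictionaries standing in for tables; the column names, the sample data, and the presence of a stored total on the wide row are all assumptions, and row counts are only a proxy for real query cost.

```python
# Hypothetical in-memory stand-ins for the tables; all names and values are assumptions.
order_0nf = {1001: {"SubtotalBeforeTax": 6985}}  # one wide row already holds the total
orders = {1001: {"OrderID": 1001}}
order_items = {1001: [{"ItemID": 1, "NumberOrdered": 3},
                      {"ItemID": 2, "NumberOrdered": 2}]}
items = {1: {"UnitPriceCents": 1995}, 2: {"UnitPriceCents": 500}}


def total_denormalized(order_id):
    """Return (total in cents, rows read) using the single wide order row."""
    return order_0nf[order_id]["SubtotalBeforeTax"], 1


def total_normalized(order_id):
    """Return (total in cents, rows read) by combining order, order item and item rows."""
    _order = orders[order_id]
    rows_read = 1  # the order row
    total = 0
    for line in order_items[order_id]:
        rows_read += 2  # one order item row plus its corresponding item row
        total += line["NumberOrdered"] * items[line["ItemID"]]["UnitPriceCents"]
    return total, rows_read
```

Both functions return the same total for order 1001, but the normalized version reads one row for the order plus two for every item on it, so its cost grows with the size of the order.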
7. Acknowledgements
I'd like to thank Jon Heggland for his thoughtful review and feedback. He found several bugs
which had gotten by both me and my tech reviewers.
See more at: http://agiledata.org/essays/dataNormalization.html#sthash.SO3JA4bW.dpuf