Logical Database Design Using Entity-Relationship Modeling: Normalization To Avoid Redundancy
1. Data modeling
Logical data modeling is the process of documenting the comprehensive business information requirements in an accurate and consistent format. Designing and implementing a successful database, one that satisfies the needs of an organization, requires a logical data model. Analysts who do data modeling define the data items and the business rules that affect those data items. The process of data modeling acknowledges that business data is a vital asset that the organization needs to understand and carefully manage. This topic contains information that was adapted from Handbook of Relational Database Design.
Consider the following business facts that a manufacturing company needs to represent in its data model:
Customers purchase products.
Products consist of parts.
Suppliers manufacture parts.
Warehouses store parts.
Transportation vehicles move the parts from suppliers to warehouses and then to manufacturers.
These are all business facts that a manufacturing company's logical data model needs to include. Many people inside and outside the company rely on information that is based on these facts. Many reports include data about these facts. Any business, not just manufacturing companies, can benefit from the task of data modeling. Database systems that supply information to decision makers, customers, suppliers, and others are more successful if their foundation is a sound data model.
publib.boulder.ibm.com/infocenter/dzichelp/v2r2/advanced/print.jsp?topic=/com.ibm.db2z10.doc.intro/src/tpc/db2z_logicaldbdesignentityrelationshp.htm&topicIn 1/12
12/29/12
books on those subjects.
4. Determine additional business rules that affect attributes.
Next, analysts clarify the data-driven business rules. Data-driven business rules are constraints on particular data values. These constraints need to be true, regardless of any particular processing requirements. Analysts define these constraints during the data design stage, rather than during application design. The advantage of defining data-driven business rules is that programmers of many applications don't need to write code to enforce these business rules.
Example 3: Assume that a business rule requires that a customer entity have either a phone number or an address, or both. If this rule is not enforced on the data itself, programmers must develop, test, and maintain applications that verify the existence of one of these attributes. Data-driven business requirements have a direct relationship with the data, thereby relieving programmers from extra work.
5. Integrate user views.
In this last phase of data modeling, analysts combine the different user views that they have built into a consolidated logical data model. If other data models already exist in the organization, the analysts integrate the new data model with the existing one. At this stage, analysts also strive to make their data model flexible so that it can support the current business environment and possible future changes.
Example 4: Assume that a retail company operates in a single country and that business plans include expansion to other countries. Armed with knowledge of these plans, analysts can build the model so that it is flexible enough to support expansion into other countries.
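The rule in Example 3 can be attached to the data rather than to each application. The following is a minimal sketch under stated assumptions, not DB2 syntax taken from this information: it uses Python's built-in sqlite3 module, and the table name customer, its column names, and the helper try_insert are hypothetical names chosen for illustration.

```python
import sqlite3


def create_customer_table(conn: sqlite3.Connection) -> None:
    """Create a hypothetical CUSTOMER table that carries the rule itself."""
    conn.execute(
        """
        CREATE TABLE customer (
            customer_number TEXT PRIMARY KEY,
            phone_number    TEXT,
            address         TEXT,
            -- Data-driven rule: a phone number, an address, or both.
            CHECK (phone_number IS NOT NULL OR address IS NOT NULL)
        )
        """
    )


def try_insert(conn, customer_number, phone_number, address) -> bool:
    """Return True if the database accepts the row, False if it rejects it."""
    try:
        conn.execute(
            "INSERT INTO customer VALUES (?, ?, ?)",
            (customer_number, phone_number, address),
        )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint failed; no application code had to test for it.
        return False


conn = sqlite3.connect(":memory:")
create_customer_table(conn)
```

With the constraint in place, any program that inserts a customer with neither a phone number nor an address is rejected by the database, so no application has to repeat the check.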
for the entities. In the case of the EMPLOYEE entity, you might define the following additional attributes:
Birth date
Hire date
Home address
Office phone number
Gender
Resume
You can read more about defining attributes later in this information. Finally, you normalize the data.
If you look at the example tables in this information, you can find answers for the following questions:
What does Wing Lee work on?
Who works on project number OP2012?
Both questions yield multiple answers. Wing Lee works on project numbers OP2011 and OP2012. The employees who work on project number OP2012 are Ramlal Mehta and Wing Lee.
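The two questions above can be answered from a third entity that pairs employees with projects. This is a minimal sketch using Python's built-in sqlite3 module. The employee names and project numbers come from the text; the employee numbers, table names, and function names are assumptions made for illustration and are not taken from the DB2 sample tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE employee (
        employee_number TEXT PRIMARY KEY,
        employee_name   TEXT NOT NULL
    );
    CREATE TABLE project (
        project_number TEXT PRIMARY KEY
    );
    -- One row per (employee, project) pairing resolves the
    -- many-to-many relationship into two one-to-many relationships.
    CREATE TABLE employee_project (
        employee_number TEXT NOT NULL REFERENCES employee (employee_number),
        project_number  TEXT NOT NULL REFERENCES project (project_number),
        PRIMARY KEY (employee_number, project_number)
    );
    -- Employee numbers E1 and E2 are hypothetical.
    INSERT INTO employee VALUES ('E1', 'Wing Lee'), ('E2', 'Ramlal Mehta');
    INSERT INTO project VALUES ('OP2011'), ('OP2012');
    INSERT INTO employee_project VALUES
        ('E1', 'OP2011'), ('E1', 'OP2012'), ('E2', 'OP2012');
    """
)


def projects_for(employee_name: str) -> list:
    """Project numbers that the named employee works on, in sorted order."""
    rows = conn.execute(
        """
        SELECT ep.project_number
        FROM employee_project ep
        JOIN employee e ON e.employee_number = ep.employee_number
        WHERE e.employee_name = ?
        ORDER BY ep.project_number
        """,
        (employee_name,),
    )
    return [row[0] for row in rows]


def employees_on(project_number: str) -> list:
    """Names of the employees who work on the project, in sorted order."""
    rows = conn.execute(
        """
        SELECT e.employee_name
        FROM employee_project ep
        JOIN employee e ON e.employee_number = ep.employee_number
        WHERE ep.project_number = ?
        ORDER BY e.employee_name
        """,
        (project_number,),
    )
    return [row[0] for row in rows]
```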
Database designers and data analysts can be more effective when they have a good understanding of the business. If they understand the data, the applications, and the business rules, they can succeed in building a sound database design. When you define relationships, you have a large influence on how smoothly your business runs. If you perform this task poorly, your database and associated applications are likely to have many problems, some of which might not manifest themselves for years.
CHARACTER: Fixed-length character strings. The common short name for this data type is CHAR.
VARCHAR: Varying-length character strings.
CLOB: Varying-length character large object strings, typically used when a character string might exceed the limits of the VARCHAR data type.
GRAPHIC: Fixed-length graphic strings that contain double-byte characters.
VARGRAPHIC: Varying-length graphic strings that contain double-byte characters.
DBCLOB: Varying-length strings of double-byte characters in a large object.
BINARY: A sequence of bytes that is not associated with a code page.
VARBINARY: Varying-length binary strings.
BLOB: Varying-length binary strings in a large object.
XML: Varying-length string that is an internal representation of XML.
Numeric
Data that contains digits. Numeric data types are listed below:
SMALLINT: for small integers.
INTEGER: for large integers.
BIGINT: for bigger values.
DECIMAL(p,s) or NUMERIC(p,s), where p is precision and s is scale: for packed decimal numbers with precision p and scale s. Precision is the total number of digits, and scale is the number of digits to the right of the decimal point.
DECFLOAT: for decimal floating-point numbers.
REAL: for single-precision floating-point numbers.
DOUBLE: for double-precision floating-point numbers.
Datetime
Data values that represent dates, times, or timestamps. Datetime data types are listed below:
DATE: Dates with a three-part value that represents a year, month, and day.
TIME: Times with a three-part value that represents a time of day in hours, minutes, and seconds.
TIMESTAMP: Timestamps with a seven-part value that represents a date and time by year, month, day, hour, minute, second, and microsecond.
Examples: You might use the following data types for attributes of the EMPLOYEE entity:
EMPLOYEE_NUMBER: CHAR(6)
EMPLOYEE_LAST_NAME: VARCHAR(15)
EMPLOYEE_HIRE_DATE: DATE
EMPLOYEE_SALARY_AMOUNT: DECIMAL(9,2)
The data types that you choose are business definitions of the data type. During physical database design you might need to change data type definitions or use a subset of these data types. The database or the host language might not support all of these definitions, or you might make a different choice for performance reasons. For example, you might need to represent monetary amounts, but DB2 and many host languages do not have a data type MONEY. In the United States, a natural choice for the SQL data type in this situation is DECIMAL(10,2) to represent dollars. But you might also consider the INTEGER data type for fast, efficient performance.
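The choice between DECIMAL(10,2) and INTEGER for monetary amounts can be illustrated outside the database. The following Python sketch is an assumption about how an application might mirror the two representations: the function names are hypothetical, the INTEGER variant is assumed to hold whole cents, and rounding half up is a choice made here, not a rule from this information.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")
# DECIMAL(10,2) holds 10 digits in total, 2 of them after the decimal point,
# so the integer part can have at most 8 digits.
DECIMAL_10_2_LIMIT = Decimal("100000000")


def to_decimal_10_2(amount: str) -> Decimal:
    """Round a dollar amount to 2 decimal places and check that it fits."""
    value = Decimal(amount).quantize(CENT, rounding=ROUND_HALF_UP)
    if abs(value) >= DECIMAL_10_2_LIMIT:
        raise ValueError("amount does not fit in DECIMAL(10,2)")
    return value


def to_cents(amount: str) -> int:
    """Represent the same dollar amount as a whole number of cents."""
    return int(to_decimal_10_2(amount) * 100)


def from_cents(cents: int) -> Decimal:
    """Convert a whole number of cents back to a 2-decimal dollar amount."""
    return (Decimal(cents) / 100).quantize(CENT)
```

Both representations avoid binary floating-point rounding. The integer form trades readability for simpler arithmetic, and every program that reads the value must know that the unit is cents.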
For example, you would not want to allow numeric data in an attribute for a person's name. The data types that you choose limit the values that apply to a given attribute, but you can also use other mechanisms. These other mechanisms are domains, null values, and default values.
Domain
A domain describes the conditions that an attribute value must meet to be a valid value. Sometimes the domain identifies a range of valid values. By defining the domain for a particular attribute, you apply business rules to ensure that the data makes sense.
Examples:
A domain might state that a phone number attribute must be a 10-digit value that contains only numbers. You would not want the phone number to be incomplete, nor would you want it to contain alphabetic or special characters and thereby be invalid. You could choose to use either a numeric data type or a character data type. However, the domain states the business rule that the value must be a 10-digit value that consists of numbers. Before finalizing this rule, consider whether you need to support international phone numbers, which have different formats.
A domain might state that a month attribute must be a 2-digit value from 01 to 12. Again, you could choose to use datetime, character, or numeric data types for this value, but the domain demands that the value must be in the range of 01 through 12. In this case, incorporating the month into a datetime data type is probably the best choice. This decision should be reviewed again during physical database design.
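The two example domains can be written as simple validity checks. This is a sketch that assumes both attributes are held as character strings; the pattern names and function names are hypothetical, and the phone rule deliberately ignores international formats, as the text cautions.

```python
import re

# Domain for the phone number attribute: exactly 10 digits, nothing else.
PHONE_DOMAIN = re.compile(r"[0-9]{10}")
# Domain for the month attribute: a 2-digit value from 01 through 12.
MONTH_DOMAIN = re.compile(r"0[1-9]|1[0-2]")


def in_phone_domain(value: str) -> bool:
    """True if the whole value is a 10-digit string of numbers."""
    return PHONE_DOMAIN.fullmatch(value) is not None


def in_month_domain(value: str) -> bool:
    """True if the whole value is a 2-digit month from 01 through 12."""
    return MONTH_DOMAIN.fullmatch(value) is not None
```

The data type alone (character, in this sketch) would accept "408-555-0100" or "13"; only the domain rules them out.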
Null values
When you are designing attributes for your entities, you will sometimes find that an attribute does not have a value for every instance of the entity. For example, you might want an attribute for a person's middle name, but you can't require a value because some people have no middle name. For these occasions, you can define the attribute so that it can contain null values. A null value is a special indicator that represents the absence of a value. The value can be absent because it is unknown, not yet supplied, or nonexistent. The DBMS treats the null value as an actual value, not as a zero value, a blank, or an empty string. Just as some attributes should be allowed to contain null values, other attributes should not contain null values. Example: For the EMPLOYEE entity, you might not want to allow the attribute EMPLOYEE_LAST_NAME to contain a null value.
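The difference between a null value and an empty string, and the effect of disallowing nulls, can be observed directly. This sketch uses Python's built-in sqlite3 module rather than DB2. The attribute EMPLOYEE_LAST_NAME comes from the text; the employee numbers, the middle-name column, and the helper names are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE employee (
        employee_number      TEXT PRIMARY KEY,
        employee_last_name   TEXT NOT NULL,  -- must always have a value
        employee_middle_name TEXT            -- may be null
    )
    """
)
# One employee with no middle name on record (null) and one whose
# middle name was recorded as an empty string. These are different.
conn.execute("INSERT INTO employee VALUES ('000010', 'Lee', NULL)")
conn.execute("INSERT INTO employee VALUES ('000020', 'Mehta', '')")


def count_where(predicate: str) -> int:
    """Count rows that satisfy a fixed, trusted predicate (demo only)."""
    query = f"SELECT COUNT(*) FROM employee WHERE {predicate}"
    return conn.execute(query).fetchone()[0]


def last_name_accepts_null() -> bool:
    """Try to store a null last name and report whether it was accepted."""
    try:
        conn.execute("INSERT INTO employee VALUES ('000030', NULL, NULL)")
        return True
    except sqlite3.IntegrityError:
        return False
```

Note that a comparison such as employee_middle_name = NULL matches no rows at all; a null value is found with IS NULL.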
Default values
In some cases, you might not want a specific attribute to contain a null value, but you don't want to require that the user or program always provide a value. In this case, a default value might be appropriate. A default value is a value that applies to an attribute if no other valid value is available.
Example: Assume that you don't want the EMPLOYEE_HIRE_DATE attribute to contain null values and that you don't want to require users to provide this data. If data about new employees is generally added to the database on the employee's first day of employment, you could define a default value of the current date.
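The EMPLOYEE_HIRE_DATE example can be sketched as a column that disallows nulls and falls back to the current date. This uses Python's built-in sqlite3 module, where CURRENT_DATE yields the UTC date as text in YYYY-MM-DD form. The employee numbers, the explicit date, and the helper name hire_date are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE employee (
        employee_number    TEXT PRIMARY KEY,
        -- No nulls allowed; the current date applies when none is given.
        employee_hire_date TEXT NOT NULL DEFAULT CURRENT_DATE
    )
    """
)
# The hire date is omitted, so the default value applies.
conn.execute("INSERT INTO employee (employee_number) VALUES ('000010')")
# An explicit (hypothetical) hire date overrides the default.
conn.execute("INSERT INTO employee VALUES ('000020', '2012-01-15')")


def hire_date(employee_number: str) -> str:
    """Return the stored hire date for one employee."""
    row = conn.execute(
        "SELECT employee_hire_date FROM employee WHERE employee_number = ?",
        (employee_number,),
    ).fetchone()
    return row[0]
```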
Normalization helps you avoid redundancies and inconsistencies in your data. There are several forms of normalization.
After you define entities and decide on attributes for the entities, you normalize entities to avoid redundancy. An entity is normalized if it meets a set of constraints for a particular normal form, which this information describes. Entities can be in first, second, third, and fourth normal forms, each of which has certain rules that are associated with it. In some cases, you follow these rules, and in other cases, you do not follow them.
The rules for normal form are cumulative. In other words, for an entity to satisfy the rules of second normal form, it also must satisfy the rules of first normal form. An entity that satisfies the rules of fourth normal form also satisfies the rules of first, second, and third normal form.
In the context of logical data modeling, an instance is one particular occurrence. An instance of an entity is a set of data values for all the attributes that correspond to that entity.
Example: The following figure shows one instance of the EMPLOYEE entity.
Figure 1. One instance of an entity
First normal form
A relational entity satisfies the requirement of first normal form if every attribute of every instance contains only one value and the entity never contains multiple repeating attributes.
Second normal form
An entity is in second normal form if each attribute that is not in the primary key provides a fact that depends on the entire key.
Third normal form
An entity is in third normal form if each nonprimary key attribute provides a fact that is independent of other non-key attributes and depends only on the key.
Fourth normal form
An entity is in fourth normal form if no instance contains two or more independent, multivalued facts about an entity.
This situation violates the requirement of first normal form, because JANUARY_SALARY_AMOUNT, FEBRUARY_SALARY_AMOUNT, and MARCH_SALARY_AMOUNT are essentially the same attribute, EMPLOYEE_MONTHLY_SALARY_AMOUNT.
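One way to remove the repeating attributes is to turn each monthly column into its own instance, keyed by employee and month. This Python sketch is an illustration under assumptions: the three column names come from the text, while the function name, the numeric month encoding, and the sample amounts are hypothetical.

```python
from typing import Dict, List, Tuple

# The repeating attributes named in the text, mapped to a month number.
MONTH_COLUMNS: Dict[str, int] = {
    "JANUARY_SALARY_AMOUNT": 1,
    "FEBRUARY_SALARY_AMOUNT": 2,
    "MARCH_SALARY_AMOUNT": 3,
}


def to_first_normal_form(wide_row: dict) -> List[Tuple[str, int, int]]:
    """Split one wide instance into (employee, month, amount) instances.

    Each result carries a single EMPLOYEE_MONTHLY_SALARY_AMOUNT value,
    so the same attribute is no longer repeated under different names.
    """
    employee_number = wide_row["EMPLOYEE_NUMBER"]
    return [
        (employee_number, month, wide_row[column])
        for column, month in MONTH_COLUMNS.items()
    ]


# Hypothetical sample instance in the unnormalized shape.
sample = {
    "EMPLOYEE_NUMBER": "000010",
    "JANUARY_SALARY_AMOUNT": 4000,
    "FEBRUARY_SALARY_AMOUNT": 4100,
    "MARCH_SALARY_AMOUNT": 4200,
}
```

In the normalized shape, adding April requires a new instance rather than a new attribute.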
Here, the primary key consists of the PART and the WAREHOUSE attributes together. Because the attribute WAREHOUSE_ADDRESS depends only on the value of WAREHOUSE, the entity violates the rule for second normal form. This design causes several problems:
Each instance for a part that this warehouse stores repeats the address of the warehouse.
If the address of the warehouse changes, every instance referring to a part that is stored in that warehouse must be updated.
Because of the redundancy, the data might become inconsistent. Different instances could show different addresses for the same warehouse.
If at any time the warehouse has no stored parts, the address of the warehouse might not exist in any instances in the entity.
To satisfy second normal form, the information in the figure above would be in two entities, as the following figure shows.
Figure 2. Two entities that satisfy second normal form
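The split into two entities can be sketched as a small transformation. PART, WAREHOUSE, and WAREHOUSE_ADDRESS come from the text; the QUANTITY attribute, the function name, and the sample values are assumptions added so that the first entity has a non-key fact that depends on the whole key.

```python
from typing import Dict, List, Tuple

# (PART, WAREHOUSE, QUANTITY, WAREHOUSE_ADDRESS); QUANTITY is hypothetical.
UnnormalizedRow = Tuple[str, str, int, str]


def to_second_normal_form(
    rows: List[UnnormalizedRow],
) -> Tuple[List[Tuple[str, str, int]], Dict[str, str]]:
    """Split rows into a part/warehouse entity and a warehouse entity."""
    part_warehouse: List[Tuple[str, str, int]] = []
    warehouse: Dict[str, str] = {}
    for part, warehouse_id, quantity, address in rows:
        # Facts that depend on the entire key (PART, WAREHOUSE).
        part_warehouse.append((part, warehouse_id, quantity))
        # The address depends on WAREHOUSE alone, so store it once.
        known = warehouse.setdefault(warehouse_id, address)
        if known != address:
            # The redundancy already produced an inconsistency.
            raise ValueError(f"conflicting addresses for {warehouse_id}")
    return part_warehouse, warehouse


# Hypothetical sample data in the unnormalized shape.
sample_rows: List[UnnormalizedRow] = [
    ("P1", "W1", 10, "100 First Street"),
    ("P2", "W1", 25, "100 First Street"),
    ("P1", "W2", 5, "200 Second Street"),
]
```

After the split, a change of address touches one entry in the warehouse entity, and a warehouse with no stored parts can still keep its address.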
attribute.
Example: The first entity in the following figure contains the attributes EMPLOYEE_NUMBER and DEPARTMENT_NUMBER. Suppose that a program or user adds an attribute, DEPARTMENT_NAME, to the entity. The new attribute depends on DEPARTMENT_NUMBER, whereas the primary key is on the EMPLOYEE_NUMBER attribute. The entity now violates third normal form.
Changing the DEPARTMENT_NAME value based on the update of a single employee, David Brown, does not change the DEPARTMENT_NAME value for other employees in that department. The updated version of the entity in the following figure illustrates the resulting inconsistency. Additionally, updating the DEPARTMENT_NAME in this table does not update it in any other table that might contain a DEPARTMENT_NAME column.
Figure 1. The update of an unnormalized entity. Information in the entity has become inconsistent.
You can normalize the entity by modifying the EMPLOYEE_DEPARTMENT entity and creating two new entities: EMPLOYEE and DEPARTMENT. The following figure shows the new entities. The DEPARTMENT entity contains attributes for DEPARTMENT_NUMBER and DEPARTMENT_NAME. Now, an update such as changing a department name is much easier. You need to make the update only to the DEPARTMENT entity.
Figure 2. Normalized entities: EMPLOYEE, DEPARTMENT, and EMPLOYEE_DEPARTMENT
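The effect of the normalized design on updates can be sketched with three small tables. This uses Python's built-in sqlite3 module rather than DB2. The entity names and David Brown come from the text; the employee numbers, the second employee, the department number, the department names, and the function names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE employee (
        employee_number TEXT PRIMARY KEY,
        employee_name   TEXT NOT NULL
    );
    CREATE TABLE department (
        department_number TEXT PRIMARY KEY,
        department_name   TEXT NOT NULL  -- stored once per department
    );
    CREATE TABLE employee_department (
        employee_number   TEXT PRIMARY KEY
            REFERENCES employee (employee_number),
        department_number TEXT NOT NULL
            REFERENCES department (department_number)
    );
    -- All identifiers and department names below are hypothetical.
    INSERT INTO employee VALUES
        ('000200', 'David Brown'), ('000210', 'Other Employee');
    INSERT INTO department VALUES ('D11', 'Manufacturing Systems');
    INSERT INTO employee_department VALUES
        ('000200', 'D11'), ('000210', 'D11');
    """
)


def rename_department(department_number: str, new_name: str) -> None:
    """Change a department name in the one place where it is stored."""
    conn.execute(
        "UPDATE department SET department_name = ? "
        "WHERE department_number = ?",
        (new_name, department_number),
    )


def department_name_by_employee() -> dict:
    """Map each employee name to the name of that employee's department."""
    rows = conn.execute(
        """
        SELECT e.employee_name, d.department_name
        FROM employee e
        JOIN employee_department ed
          ON ed.employee_number = e.employee_number
        JOIN department d
          ON d.department_number = ed.department_number
        """
    )
    return dict(rows.fetchall())
```

Because the name is stored only in DEPARTMENT, a single update is seen by every employee in that department, and the inconsistency described above cannot arise.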
Instead, you can avoid this violation by creating two entities that represent both relationships, as the following figure shows. Figure 2. Entities that are in fourth normal form
If, however, the facts are interdependent (that is, the employee applies certain languages only to certain skills), you should not split the entity. You can put any data into fourth normal form. A good rule to follow when doing logical database design is to arrange all the data in entities that are in fourth normal form. Then decide whether the result gives you an acceptable level of performance. If the performance is not acceptable, denormalizing your design is a good approach to improving performance.
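Whether skills and languages are independent can be tested by splitting the instances and joining them back: if the rejoined set equals the original, the split loses nothing and adds nothing. This Python sketch illustrates that test. The function names and sample values are hypothetical, and the rows are assumed to have the shape (employee, skill, language).

```python
from typing import Iterable, Set, Tuple

Row = Tuple[str, str, str]  # (employee, skill, language)


def split_to_fourth_normal_form(
    rows: Iterable[Row],
) -> Tuple[Set[Tuple[str, str]], Set[Tuple[str, str]]]:
    """Project the rows onto (employee, skill) and (employee, language)."""
    rows = set(rows)
    employee_skill = {(emp, skill) for emp, skill, _ in rows}
    employee_language = {(emp, lang) for emp, _, lang in rows}
    return employee_skill, employee_language


def facts_are_independent(rows: Iterable[Row]) -> bool:
    """True if joining the two projections reproduces exactly the rows."""
    rows = set(rows)
    employee_skill, employee_language = split_to_fourth_normal_form(rows)
    rejoined = {
        (emp, skill, lang)
        for emp, skill in employee_skill
        for other_emp, lang in employee_language
        if emp == other_emp
    }
    return rejoined == rows


# Hypothetical data: every skill is paired with every language.
independent_rows = [
    ("E1", "design", "French"),
    ("E1", "design", "German"),
    ("E1", "testing", "French"),
    ("E1", "testing", "German"),
]
# Hypothetical data: each language is applied only to one skill.
interdependent_rows = [
    ("E1", "design", "French"),
    ("E1", "testing", "German"),
]
```

For the interdependent rows the rejoined set contains pairings that were never recorded, which is why such an entity should not be split.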