Data Modeling 101
The goals of this article are to overview fundamental data modeling skills that all developers should have, skills that can be applied on both traditional projects that take a serial approach and agile projects that take an evolutionary approach. My personal philosophy is that every IT professional should have a basic understanding of data modeling. They don't need to be experts at data modeling, but they should be prepared to be involved in the creation of such a model, be able to read an existing data model, understand when and when not to create a data model, and appreciate fundamental data design techniques. This article is a brief introduction to these skills. The primary audience for this article is application developers who need to gain an understanding of some of the critical activities performed by an Agile DBA. This understanding should lead to an appreciation of what Agile DBAs do and why they do it, and it should help to bridge the communication gap between these two roles.
Table of Contents
1. What is data modeling?
   - How are data models used in practice?
   - What about conceptual models?
   - Common data modeling notations
2. How to model data
   - Identify entity types
   - Identify attributes
   - Apply naming conventions
   - Identify relationships
   - Apply data model patterns
   - Assign keys
   - Normalize to reduce data redundancy
   - Denormalize to improve performance
3. Evolutionary/agile data modeling
4. How to become better at modeling data
Although LDMs and PDMs sound very similar, and in fact they are, the level of detail that they model can be significantly different. This is because the goals for each diagram are different: you can use an LDM to explore domain concepts with your stakeholders and the PDM to define your database design. Figure 1 presents a simple LDM and Figure 2 a simple PDM, both modeling the concept of customers and addresses as well as the relationship between them. Both diagrams apply the Barker notation, summarized below. Notice how the PDM shows greater detail, including an associative table required to implement the association as well as the keys needed to maintain the relationships. More on these concepts later. PDMs should also reflect your organization's database naming standards; in this case an abbreviation of the entity name is appended to each column name and an abbreviation for Number was consistently introduced. A PDM should also indicate the data types for the columns, such as integer and char(5). Although Figure 2 does not show them, lookup tables (also called reference tables or description tables) for how the address is used as well as for states and countries are implied by the attributes ADDR_USAGE_CODE, STATE_CODE, and COUNTRY_CODE.
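To make the extra detail of a PDM concrete, here is a minimal sketch of what a physical schema along these lines might look like, expressed as SQLite DDL driven from Python. The table names, column names, data types, and lookup tables below are illustrative assumptions, not the actual contents of Figure 2.

```python
import sqlite3

# Hypothetical physical schema along the lines described above; names and
# types are assumptions for illustration only.
ddl = """
CREATE TABLE Country (
    COUNTRY_CODE CHAR(3) PRIMARY KEY,       -- lookup (reference) table
    COUNTRY_NAME VARCHAR(60) NOT NULL
);
CREATE TABLE State (
    STATE_CODE CHAR(2) PRIMARY KEY,         -- lookup (reference) table
    STATE_NAME VARCHAR(60) NOT NULL
);
CREATE TABLE Customer (
    CUSTOMER_NUMBER INTEGER PRIMARY KEY,
    FIRST_NAME_CUST VARCHAR(30),
    LAST_NAME_CUST  VARCHAR(30)
);
CREATE TABLE Address (
    ADDRESS_ID    INTEGER PRIMARY KEY,
    STREET_ADDR   VARCHAR(60),
    CITY_ADDR     VARCHAR(30),
    STATE_CODE    CHAR(2) REFERENCES State(STATE_CODE),
    COUNTRY_CODE  CHAR(3) REFERENCES Country(COUNTRY_CODE),
    ZIP_CODE_ADDR CHAR(10)
);
-- Associative table implementing the many-to-many relationship between
-- customers and addresses, plus a code describing how the address is used.
CREATE TABLE CustomerAddress (
    CUSTOMER_NUMBER INTEGER NOT NULL REFERENCES Customer(CUSTOMER_NUMBER),
    ADDRESS_ID      INTEGER NOT NULL REFERENCES Address(ADDRESS_ID),
    ADDR_USAGE_CODE CHAR(1) NOT NULL,       -- points at an implied lookup table
    PRIMARY KEY (CUSTOMER_NUMBER, ADDRESS_ID)
);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")    # SQLite leaves FK checks off by default
conn.executescript(ddl)
```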
An important observation about Figures 1 and 2 is that I'm not slavishly following Barker's approach to naming relationships. For example, between Customer and Address there really should be two names: "Each CUSTOMER may be located in one or more ADDRESSES" and "Each ADDRESS may be the site of one or more CUSTOMERS." Although these names explicitly define the relationship, I personally think that they're visual noise that clutters the diagram. I prefer simple names such as "has" and then trust my readers to interpret the name in each direction. I'll only add more information where it's needed, and in this case I think that it isn't. However, a significant advantage of describing the names the way that Barker suggests is that it's a good test to see if you actually understand the relationship: if you can't name it then you likely don't understand it.

Data models can be used effectively at both the enterprise level and on projects. Enterprise architects will often create one or more high-level LDMs that depict the data structures that support your enterprise, models typically referred to as enterprise data models or enterprise information models. An enterprise data model is one of several views that your organization's enterprise architects may choose to maintain and support; other views may explore your network/hardware infrastructure, your organization structure, your software infrastructure, and your business processes (to name a few). Enterprise data models provide information that a project team can use both as a set of constraints and as important insights into the structure of their system.

Project teams will typically create LDMs as a primary analysis artifact when their implementation environment is predominantly procedural in nature, for example when they are using structured COBOL as an implementation language. LDMs are also a good choice when a project is data-oriented in nature, perhaps because a data warehouse or reporting system is being developed (having said that, experience seems to show that usage-centered approaches work even better). However, LDMs are often a poor choice when a project team is using object-oriented or component-based technologies, because the developers would rather work with UML diagrams, or when the project is not data-oriented in nature. As Agile Modeling advises, apply the right artifact(s) for the job. Or, as your grandfather likely advised you, use the right tool for the job. It's important to note that traditional approaches to Master Data Management (MDM) will often motivate the creation and maintenance of detailed LDMs, an effort that is rarely justifiable in practice once you consider the total cost of ownership (TCO) when calculating the return on investment (ROI) of those sorts of efforts. When a relational database is used for data storage, project teams are best advised to create a PDM to model its internal schema. My experience is that a PDM is often one of the critical design artifacts for business application development projects.
My experience is that people will capture information in the best place that they know. As a result I typically discard ORMs after I'm finished with them. I sometimes use ORMs to explore the domain with project stakeholders but later replace them with a more traditional artifact such as an LDM, a class diagram, or even a PDM. As a generalizing specialist, someone with one or more specialties who also strives to gain general skills and knowledge, this is an easy decision for me to make; I know that the information I've just discarded will be captured in another artifact, such as a model, the tests, or even the code, that I understand. A specialist who only understands a limited number of artifacts and therefore hands off their work to other specialists doesn't have this option. Not only are they tempted to keep the artifacts that they create, but also to invest even more time to enhance those artifacts. Generalizing specialists are more likely than specialists to travel light.
Table 1. Discussing common data modeling notations.

IE: The IE notation (Finkelstein 1989) is simple and easy to read, and is well suited for high-level logical and enterprise data modeling. The only drawback of this notation, arguably an advantage, is that it does not support the identification of attributes of an entity. The assumption is that the attributes will be modeled with another diagram or simply described in the supporting documentation.

Barker: The Barker notation is one of the more popular ones; it is supported by Oracle's toolset and is well suited for all types of data models. Its approach to subtyping can become clunky with hierarchies that go several levels deep.

IDEF1X: This notation is overly complex. It was originally intended for physical modeling but has been misapplied to logical modeling as well. Although popular within some U.S. government agencies, particularly the Department of Defense (DoD), this notation has been all but abandoned by everyone else. Avoid it if you can.

UML: This is not an official data modeling notation (yet). Although several suggestions for a data modeling profile for the UML exist, none are complete and, more importantly, none are official UML yet. However, in December 2005 the Object Management Group (OMG) announced an RFP for data-oriented models.
Identify entity types
Identify attributes
Apply naming conventions
Identify relationships
Apply data model patterns
Assign keys
Normalize to reduce data redundancy
Denormalize to improve performance
Very good practical books about data modeling include Joe Celko's Data & Databases and Data Modeling for Information Professionals, as they both focus on practical issues with data modeling. The Data Modeling Handbook and Data Model Patterns are both excellent resources once you've mastered the fundamentals. An Introduction to Database Systems is a good academic treatise for anyone wishing to become a data specialist.
You also need to identify the cardinality and optionality of a relationship (the UML combines the concepts of optionality and cardinality into the single concept of multiplicity). Cardinality represents the concept of "how many" whereas optionality represents the concept of "whether you must have something." For example, it is not enough to know that customers place orders. How many orders can a customer place? None, one, or several? Furthermore, relationships are two-way streets: not only do customers place orders, but orders are placed by customers. This leads to questions such as: how many customers can be involved with any given order, and is it possible to have an order with no customer involved? Figure 5 shows that customers place zero or more orders and that any given order is placed by one customer and one customer only. It also shows that a customer lives at one or more addresses and that any given address has zero or more customers living at it.

Although the UML distinguishes between different types of relationships (associations, inheritance, aggregation, composition, and dependency), data modelers often aren't as concerned with this issue as object modelers are. Subtyping, one application of inheritance, is often found in data models, an example of which is the "is a" relationship between Item and its two sub-entities Service and Product. Aggregation and composition are much less common and typically must be implied from the data model, as you see with the "part of" role that Line Item takes with Order. UML dependencies are typically a software construct and therefore wouldn't appear on a data model, unless of course it was a very highly detailed physical model that showed how views, triggers, or stored procedures depended on other aspects of the database schema.
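Returning to the cardinality and optionality rules read from Figure 5, here is a small sketch of how they might be enforced in a physical schema, again using SQLite from Python. The table and column names are assumptions for illustration, not taken from the figure: the mandatory foreign key captures "an order is placed by one and only one customer," while "a customer places zero or more orders" needs no constraint at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Customer (
    CustomerNumber INTEGER PRIMARY KEY,
    Name           VARCHAR(60)
);
-- "A customer places zero or more orders": nothing extra to enforce here.
-- "An order is placed by one and only one customer": the foreign key is
-- mandatory (NOT NULL), so an order cannot exist without its customer.
CREATE TABLE "Order" (
    OrderNumber    INTEGER PRIMARY KEY,
    CustomerNumber INTEGER NOT NULL REFERENCES Customer(CustomerNumber),
    DateOrdered    DATE
);
""")

conn.execute("INSERT INTO Customer VALUES (1701, 'A. Customer')")
conn.execute('INSERT INTO "Order" VALUES (?, ?, ?)', (1, 1701, "2011-06-01"))

# An order with no customer violates the optionality rule and is rejected.
try:
    conn.execute('INSERT INTO "Order" VALUES (?, ?, ?)', (2, None, "2011-06-02"))
except sqlite3.IntegrityError as err:
    print("Rejected as expected:", err)
```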
Some data modelers will apply common data model patterns (David Hay's book Data Model Patterns is the best reference on the subject), just as object-oriented developers will apply analysis patterns (Fowler 1997; Ambler 1997) and design patterns (Gamma et al. 1995). Data model patterns are conceptually closest to analysis patterns because they describe solutions to common domain issues. Hay's book is a very good reference for anyone involved in analysis-level modeling, even when you're taking an object approach instead of a data approach, because his patterns model business structures from a wide variety of business domains.
Let's consider Figure 6 in more detail. Figure 6 presents an alternative design to that presented in Figure 2: a different naming convention was adopted and the model itself is more extensive. In Figure 6 the Customer table has the CustomerNumber column as its primary key and SocialSecurityNumber as an alternate key. This indicates that the preferred way to access customer information is through the value of a person's customer number, although your software can get at the same information if it has the person's social security number. The CustomerHasAddress table has a composite primary key, the combination of CustomerNumber and AddressID. A foreign key is one or more attributes in an entity type that represents a key, either primary or secondary, in another entity type. Foreign keys are used to maintain relationships between rows. For example, the relationships between rows in the CustomerHasAddress table and the Customer table are maintained by the CustomerNumber column within the CustomerHasAddress table. The interesting thing about the CustomerNumber column is the fact that it is part of the primary key for CustomerHasAddress as well as the foreign key to the Customer table. Similarly, the AddressID column is part of the primary key of CustomerHasAddress as well as a foreign key to the Address table to maintain the relationship with rows of Address. Although the "natural vs. surrogate" debate is one of the great religious issues within the data community, the fact is that neither strategy is perfect and you'll discover in practice (as we see in Figure 6) that sometimes it makes sense to use natural keys and sometimes it makes sense to use surrogate keys. In Choosing a Primary Key: Natural or Surrogate? I describe the relevant issues in detail.
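The key structure just described can be sketched directly in DDL. The following is only an approximation based on the description of Figure 6 (the data types and non-key columns are assumptions): a primary key plus an alternate key on Customer, a surrogate key on Address, and a composite primary key on CustomerHasAddress whose columns double as foreign keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Customer (
    CustomerNumber       INTEGER PRIMARY KEY,        -- primary key
    SocialSecurityNumber CHAR(11) NOT NULL UNIQUE,   -- alternate key
    FirstName            VARCHAR(30),
    LastName             VARCHAR(30)
);
CREATE TABLE Address (
    AddressID INTEGER PRIMARY KEY,                   -- surrogate key
    Street    VARCHAR(60),
    City      VARCHAR(30)
);
CREATE TABLE CustomerHasAddress (
    -- Composite primary key; each part is also a foreign key that maintains
    -- the relationship to the corresponding parent table.
    CustomerNumber INTEGER NOT NULL REFERENCES Customer(CustomerNumber),
    AddressID      INTEGER NOT NULL REFERENCES Address(AddressID),
    PRIMARY KEY (CustomerNumber, AddressID)
);
""")
```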
Data normalization is a process in which data attributes within a data model are organized to increase the cohesion of entity types. In other words, the goal of data normalization is to reduce and even eliminate data redundancy, an important consideration for application developers because it is incredibly difficult to store objects in a relational database that maintains the same information in several places. Table 2 summarizes the three most common normalization rules, describing how to put entity types into a series of increasing levels of normalization. Higher levels of data normalization (Date 2000) are beyond the scope of this article. With respect to terminology, a data schema is considered to be at the level of normalization of its least normalized entity type. For example, if all of your entity types are at second normal form (2NF) or higher then we say that your data schema is at 2NF.
Table 2. Data Normalization Rules.

First normal form (1NF): An entity type is in 1NF when it contains no repeating groups of data.
Second normal form (2NF): An entity type is in 2NF when it is in 1NF and when all of its non-key attributes are fully dependent on its primary key.
Third normal form (3NF): An entity type is in 3NF when it is in 2NF and when all of its attributes are directly dependent on the primary key.
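As a small, hypothetical illustration of the second rule (these tables are not taken from the article's figures): if ItemDescription lives in an order-item table whose composite key is OrderNumber plus ItemNumber, then it depends on only part of that key, so the design is in 1NF but not 2NF; moving the item-only fact into its own table fixes this.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- 1NF but not 2NF: ItemDescription depends only on ItemNumber, which is just
-- part of the composite key, so the same description is repeated on every
-- order line that references the item.
CREATE TABLE OrderItem_1NF (
    OrderNumber     INTEGER,
    ItemNumber      INTEGER,
    Quantity        INTEGER,
    ItemDescription VARCHAR(60),
    PRIMARY KEY (OrderNumber, ItemNumber)
);

-- 2NF (and 3NF): the item-only fact is moved to its own table, stored once.
CREATE TABLE Item (
    ItemNumber      INTEGER PRIMARY KEY,
    ItemDescription VARCHAR(60)
);
CREATE TABLE OrderItem (
    OrderNumber INTEGER,
    ItemNumber  INTEGER REFERENCES Item(ItemNumber),
    Quantity    INTEGER,
    PRIMARY KEY (OrderNumber, ItemNumber)
);
""")
```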
Figure 7 depicts a database schema in 0NF whereas Figure 8 depicts a normalized schema in 3NF. Read the Introduction to Data Normalization essay for details.

Why data normalization? The advantage of having a highly normalized data schema is that information is stored in one place and one place only, reducing the possibility of inconsistent data. Furthermore, highly normalized data schemas in general are closer conceptually to object-oriented schemas because the object-oriented goals of promoting high cohesion and loose coupling between classes result in similar solutions (at least from a data point of view). This generally makes it easier to map your objects to your data schema. Unfortunately, normalization usually comes at a performance cost. With the data schema of Figure 7 all the data for a single order is stored in one row (assuming orders of up to nine order items), making it very easy to access. With the data schema of Figure 7 you could quickly determine the total amount of an order by reading the single row from the Order0NF table. To do so with the data schema of Figure 8 you would need to read data from a row in the Order table, data from all the rows in the OrderItem table for that order, and data from the corresponding rows in the Item table for each order item. For this query, the data schema of Figure 7 very likely provides better performance.
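To see the trade-off in code, here is a rough sketch in the spirit of Figures 7 and 8, not a copy of them (the table layouts and column names are assumptions): the 0NF design answers "what is the total of this order?" with a single-row read, whereas the 3NF design needs a three-table join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- 0NF style: one wide row per order, with repeating groups of item columns.
CREATE TABLE Order0NF (
    OrderNumber INTEGER PRIMARY KEY,
    Total       DECIMAL(10,2),
    Item1Number INTEGER, Item1Qty INTEGER, Item1Price DECIMAL(10,2),
    Item2Number INTEGER, Item2Qty INTEGER, Item2Price DECIMAL(10,2)
    -- ...and so on, up to the maximum number of items supported
);

-- 3NF style: each fact is stored in one place only.
CREATE TABLE Item (
    ItemNumber INTEGER PRIMARY KEY,
    UnitPrice  DECIMAL(10,2) NOT NULL
);
CREATE TABLE "Order" (
    OrderNumber INTEGER PRIMARY KEY
);
CREATE TABLE OrderItem (
    OrderNumber INTEGER NOT NULL REFERENCES "Order"(OrderNumber),
    ItemNumber  INTEGER NOT NULL REFERENCES Item(ItemNumber),
    Quantity    INTEGER NOT NULL,
    PRIMARY KEY (OrderNumber, ItemNumber)
);
""")

# 0NF: the order total is a single-row read.
total_0nf = "SELECT Total FROM Order0NF WHERE OrderNumber = ?"

# 3NF: the same answer requires reading the order, its order items, and the
# corresponding items, i.e. a three-table join.
total_3nf = """
    SELECT SUM(oi.Quantity * i.UnitPrice)
    FROM   "Order" o
    JOIN   OrderItem oi ON oi.OrderNumber = o.OrderNumber
    JOIN   Item i       ON i.ItemNumber   = oi.ItemNumber
    WHERE  o.OrderNumber = ?
"""

# No data is loaded here; these calls only show that both queries are valid.
print(conn.execute(total_0nf, (1,)).fetchone())
print(conn.execute(total_3nf, (1,)).fetchone())
```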
In class modeling, there is a similar concept called Class Normalization although that is beyond the scope of this article.
To denormalize the data schema the following decisions were made:

1. To support quick searching of item information, the Item table was left alone.

2. To support the addition and removal of order items to an order, the concept of an OrderItem table was kept, albeit split in two to support outstanding orders and fulfilled orders. New order items can easily be inserted into the OutstandingOrderItem table, or removed from it, as needed.

3. To support order processing, the Order and OrderItem tables were reworked into pairs to handle outstanding and fulfilled orders respectively. Basic order information is first stored in the OutstandingOrder and OutstandingOrderItem tables and then, when the order has been shipped and paid for, the data is removed from those tables and copied into the FulfilledOrder and FulfilledOrderItem tables respectively (a sketch of such a move appears below). Data access time to the two tables for outstanding orders is reduced because only the active orders are stored there. On average an order may be outstanding for a couple of days, whereas for financial reporting reasons it may be stored in the fulfilled order tables for several years until archived. There is a performance penalty under this scheme because of the need to delete outstanding orders and then resave them as fulfilled orders, clearly something that would need to be processed as a transaction.

4. The contact information for the person(s) the order is being shipped and billed to was also denormalized back into the Order table, reducing the time it takes to write an order to the database because there is now one write instead of two or three. The retrieval and deletion times for that data would also be similarly improved.

Note that if your initial, normalized data design meets the performance needs of your application then it is fine as is. Denormalization should be resorted to only when performance testing shows that you have a problem with your objects and subsequent profiling reveals that you need to improve database access time. As my grandfather said, "if it ain't broke don't fix it."
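Picking up on decision 3 above, here is a rough sketch of the outstanding-to-fulfilled move run as a single transaction, using SQLite from Python. The table layouts are stripped down to a few assumed columns; the point is simply that the copy and the deletes either all succeed or all roll back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE OutstandingOrder     (OrderNumber INTEGER PRIMARY KEY, DateOrdered DATE);
CREATE TABLE OutstandingOrderItem (OrderNumber INTEGER, ItemNumber INTEGER, Quantity INTEGER);
CREATE TABLE FulfilledOrder       (OrderNumber INTEGER PRIMARY KEY, DateOrdered DATE);
CREATE TABLE FulfilledOrderItem   (OrderNumber INTEGER, ItemNumber INTEGER, Quantity INTEGER);
""")

def fulfill_order(conn, order_number):
    """Copy an order and its items into the fulfilled tables and delete the
    originals. The `with` block wraps the four statements in one transaction:
    it commits on success and rolls back if anything fails, so an order is
    never left half-moved."""
    with conn:
        conn.execute(
            "INSERT INTO FulfilledOrder SELECT * FROM OutstandingOrder WHERE OrderNumber = ?",
            (order_number,))
        conn.execute(
            "INSERT INTO FulfilledOrderItem SELECT * FROM OutstandingOrderItem WHERE OrderNumber = ?",
            (order_number,))
        conn.execute("DELETE FROM OutstandingOrderItem WHERE OrderNumber = ?", (order_number,))
        conn.execute("DELETE FROM OutstandingOrder WHERE OrderNumber = ?", (order_number,))

# Example: record an order, then move it once it has been shipped and paid for.
conn.execute("INSERT INTO OutstandingOrder VALUES (1, '2011-06-01')")
conn.execute("INSERT INTO OutstandingOrderItem VALUES (1, 42, 3)")
fulfill_order(conn, 1)
```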
Agile/Evolutionary Data Modeling
Agile Database Best Practices
Agile Master Data Management (MDM)
Agile Modeling Best Practices
Choosing a Primary Key: Natural or Surrogate?
Comparing the Various Approaches to Modeling in Software Development
Data & Databases
Data Model Patterns
Data Modeling for Information Professionals
The Data Modeling Handbook
Database Modeling Within an XP Methodology (Ronald Bradford)
Initial High-Level Architectural Envisioning
Initial High-Level Requirements Envisioning
Introduction to Data Normalization
Logical Data Modeling: What It Is and How To Do It by Alan Chmura and J. Mark Heumann
On Relational Theory
The "One Truth Above All Else" Anti-Pattern
Prioritized Requirements: An Agile Best Practice
Survey Results (Agile and Data Management)
When is Enough Modeling Enough?
This book describes the philosophies and skills required for developers and database administrators to work together effectively on project teams following evolutionary software processes such as Extreme Programming (XP), the Rational Unified Process (RUP), the Agile Unified Process (AUP), Feature Driven Development (FDD), Dynamic System Development Method (DSDM), or The Enterprise Unified Process (EUP). In March 2004 it won a Jolt Productivity award.
This book describes, in detail, how to refactor a database schema to improve its design. The first section of the book overviews the fundamentals of evolutionary database techniques in general and of database refactoring in detail. More importantly, it presents strategies for implementing and deploying database refactorings, in the context of both "simple" single-application databases and "complex" multi-application databases. The second section, the majority of the book, is a database refactoring reference catalog. It describes over 60 database refactorings, presenting data models that overview each refactoring and the code to implement it.
This book presents a full-lifecycle, agile model driven development (AMDD) approach to software development. It is one of the few books that cover both object-oriented and data-oriented development in a comprehensive and coherent manner. Techniques the book covers include Agile Modeling (AM), Full Lifecycle Object-Oriented Testing (FLOOT), over 30 modeling techniques, agile database techniques, refactoring, and test-driven development (TDD). If you want to gain the skills required to build mission-critical applications in an agile manner, this is the book for you.
8. Acknowledgements
I'd like to thank Jon Heggland for his thoughtful review and feedback regarding the normalization section in this essay. He found several bugs that had gotten past both me and my tech reviewers. David Dautbegović was also kind enough to let me know about an error.