CONCEPTUAL DATA
MODELING AND
DATABASE DESIGN:
A FULLY ALGORITHMIC
APPROACH
Volume 1
The Shortest Advisable Path
Christian Mancas
Mathematics and Computer Science Department,
Ovidius State University, Constanta, Romania
Computer Science Taught in English Department,
Politehnica University, Bucharest, Romania
Software R&D Department, Asentinel International and
DATASIS ProSoft, Bucharest, Romania
Apple Academic Press, Inc.
3333 Mistwell Crescent
Oakville, ON L6L 0A2
Canada

CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2016 by Apple Academic Press, Inc.
Exclusive worldwide distribution by CRC Press, an imprint of Taylor & Francis Group, an Informa business
This book contains information obtained from authentic and highly regarded sources. Reason-
able efforts have been made to publish reliable data and information, but the author and publisher
cannot assume responsibility for the validity of all materials or the consequences of their use. The
authors and publishers have attempted to trace the copyright holders of all material reproduced in
this publication and apologize to copyright holders if permission to publish in this form has not
been obtained. If any copyright material has not been acknowledged please write and let us know so
we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information
storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.
copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc.
(CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organiza-
tion that provides licenses and registration for a variety of users. For organizations that have been
granted a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and
are used only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
For information about Apple Academic Press products
http://www.appleacademicpress.com
To the loving memory of my parents,
to my revered Professors,
to my students, colleagues, love, children, and friends.
Our standards in databases (and not only) should always be at least equal to
the maximum of the corresponding highest academic and customer ones.
Dedication..................................................................................................... v
About the Author.......................................................................................... ix
Foreword by Professor Bernhard Thalheim................................................. xi
Foreword by Professor Dan Suciu...............................................................xv
Preface...................................................................................................... xvii
Acknowledgments..................................................................................... xxiii
List of Symbols...........................................................................................xxv
1. Data, Information and Knowledge in the Computer Era....................... 1
2. The Quest for Data Adequacy and Simplicity:
The Entity-Relationship Data Model (E-RDM)..................................... 35
3. The Quest for Data Independence, Minimal Plausibility,
and Formalization: The Relational Data Model (RDM)..................... 115
4. Relational Schemas Implementation and Reverse Engineering......... 341
5. Conclusion................................................................................................ 595
Appendix: Mathematic Prerequisites for the Math Behind....................... 615
Index.......................................................................................................... 635
ABOUT THE AUTHOR
His main research areas are conceptual data and knowledge modeling and querying; database design, implementation, and optimization; as well as the architecture, design, development, fine-tuning, and maintenance of data and knowledge base management systems.
Dr. Mancas graduated in 1977 from the Computers Department at
Politehnica University of Bucharest, Romania, with a thesis on Generating
Parsers for LR(k) Grammars, under the supervision of Professor Dan Luca
Serbanati. Up until the fall of communism in 1990, he worked as a software engineer and, from 1980, as the R&D manager of a state-owned computer center in Bucharest (contributing to the design, development, and maintenance of a dedicated ERP), and conducted, from time to time, computer programming labs at the Politehnica University of Bucharest; for political reasons, however, he was not accepted for PhD studies. He started his PhD program under the supervision of Professor Cristian Giumale in 1992 and obtained his PhD degree in 1997 from the same department, with a thesis on Conceptual Data and Knowledge Modeling.
FOREWORD BY
PROFESSOR BERNHARD THALHEIM
correct typos and real errors, and propose some insight into how things should be developed for real applications.
So, we have now a glimpse into the real needs of practical database development. One must start with a good understanding of the topic and the application, must use the best language, must be supported by the most appropriate development tools, and must be completely sure that the approach is well founded and will not add errors due to the partial background of the language. The classic approach uses some extension of the entity-relationship modeling language for the introduction to conceptual database design, allows transferring schemata developed in that language into another language, say some abstraction of a relational database management system language, and finally derives the description within a system language. This book follows such an order as well. Whether the given extension is the most appropriate one is a matter of culture, education, and background. Whether the transfer or mapping works well is a matter of insight. This book is, however, based on methods (called algorithms) for database design and on stewardship by good guidelines (called good practices). I know only four books in our area that follow such an approach. Practical people wrote all four books during the last almost three decades. And that is it. All four books provide an insight into what should be done for database development. They do not, however, provide an insight into how database design should be backed by an appropriate language and theoretical underpinning.
Since I have a reputation for expertise in object-relational database design and development, in database theory, in database programming, and in methodology development for this area, I was one of the people Christian Mancas asked for a review of the manuscript. And I was surprised and satisfied to discover stuff and tricks I had not seen before. While improving the co-design methodology for information systems development with the goal of reaching maturity at SPICE level 3, we had to realize that database design is an engineering activity far beyond an art and must be based on a well-founded technology of design. Many areas of knowledge begin as an art and only later become scientific. The maturity of such an area can be measured by the existence of well-defined methods, by a chance of being repeatable, by becoming manageable, and by logistics on the basis of methods for continuous optimization. Currently, database design is still an art. An artisan or design mechanic
Bernhard Thalheim
Professor with the Department of Computer Science,
Christian-Albrechts-University Kiel, Germany
E-mail: [email protected]
October 2014
FOREWORD BY
PROFESSOR DAN SUCIU
The first volume of this book is best suited for the practitioner who
wants to achieve a thorough understanding of the fundamental concepts
in data management. The increasingly central role that data, sometimes
called big data, plays in our society today makes it imperative for every
computer professional to be skilled in the topics covered by the book. One
should think of this first volume as an important first step in understanding
the complexities of data today.
Dan Suciu
Professor
University of Washington, Seattle, Washington, USA
September 2014
PREFACE
There are those who believe that databases (dbs) do not need architecture and design. In the best such case, they consider that it is enough to architect and design your software application, then develop it accordingly by using a CASE1 ORM2 development environment like, for example, Java's Hibernate, which automatically generates the needed underlying db scheme too.
If the application's architecture is correct, then the architecture of the corresponding db is also correct; however, generally, even the best software architects who are not also db ones are extremely tempted, for example in the case of commercial applications, to treat cities, states, and sometimes even countries as character strings instead of objects, and/or to semantically overload classes, which results in db instance anomalies and/or semantically overloaded corresponding db tables, both of them prone to implausible data storage.
Moreover, even if the db architecture is perfect, it is not enough: in
order to guarantee data plausibility, its design should also take into con-
sideration all existing business rules (constraints) in the given universe of
discourse.
Consequently, even when perfectly using such CASE ORM tools, the
generated db scheme should then be refined.
In the worst such case, they simply launch a database management system (DBMS), like SQLite, MySQL, MS Access, or even IBM DB2, Oracle Database (Oracle, for short), Sybase Adaptive Server Enterprise (formerly Sybase SQL Server), or MS SQL Server, etc., and start creating tables and filling them with data.
1 The acronym for Computer Aided Software Engineering.
2 The acronym for Object-Relational Mapping.
Generally, such persons believe that dbs are only some technological artifacts and almost completely ignore the old software adage "garbage in, garbage out"; this is equivalent to building not only houses, but even skyscrapers, without architecture and design.
Unfortunately, in dbs there is no law prohibiting such “constructions”,
and the day when mankind will realize that data is, after health, education,
and love, perhaps our most valuable asset (and so it deserves legal regula-
tions too) is not yet foreseeable.
Then, there are those who consider db architecture and design as being
rather an art, or even a “mystery”, but definitely not a science.
This book advocates a dual approach: conceptual data modeling,
strengthened with CASE algorithms, is db architecting, while db design
and querying is applied mathematics, namely the (semi-naïve) algebraic
theory of sets, relations, and functions, plus the first-order predicate logic.
The main four chapters of this book are devoted to four data models that I consider to be cornerstones of this field: the relational, the entity-relationship (E-R), the logical, and the (elementary) mathematical ones.
Two chapters are devoted to implementing optimized relational dbs
(rdbs), by using five of the currently most widely used relational DBMSs
(RDBMSs): IBM DB2, Oracle Database and MySQL, Microsoft SQL
Server and Access.
Essentially, this book emphasizes that mastering RDBMSs and, generally, technology is a must, but that much more important are the previous two steps of db architecture and design: it almost does not matter what pen you write with (and RDBMSs too are mainly pens…); it is what you write that matters the most.
The proposed approach is fully algorithmic: Entity-Relationship (E-R) diagrams and business rule (restriction) sets are obtained in the conceptual data modeling architectural phase by using an assistance algorithm; they are next algorithmically translated into mathematical schemes, which are then refined by applying five other algorithms; next, mathematical schemes are algorithmically translated into relational ones and associated sets of nonrelational constraints, which, in their turn, are then algorithmically translated into actual rdbs on RDBMSs. Finally, rdbs are optimized with a couple of other algorithms.
The goal is to turn data into information, and information into insight.
—Carly Fiorina
Data comes from the Latin datum, which means both a fact (known from direct observation) and a premise (from which conclusions are drawn). Information also comes from Latin: informatio (meaning formation, conception, education) is the participle stem of informare (from which "to inform" descends).
Generally, in Information Technology (IT), by data we mean descriptions of the properties of facts (objects, concepts, etc.), through some values (generally of type character string, number, calendar date, etc.) that are worth noticing and (electronically) storing for later retrieval and processing, while by information we mean the increments of knowledge that can be derived from data.
For example, all cells in Table 1.1 contain data: those in the first row (of type character string) store the names of the corresponding columns (the so-called table header or scheme), which, in fact, together with the table name, is metadata (that is, data on data), while all of the others (the table instance or content) contain data on some Carpathians peaks.
Obviously, for example, line two of this table embeds the informa-
tion that in the (Carpathians) “Făgăraşi” mountains there is a peak called
“Moldoveanu” whose altitude (in meters) is 2544. Generally, each (data,
not metadata) line of a table embeds the information that there is (in the
corresponding set of objects) an element (object, item, etc.) having cor-
responding values for the corresponding properties.
But more information, of higher degrees, may also be derived from this
table, for example:
1. There are 3 mountains and 5 peaks for which data is stored: 2 peaks each from the "Făgăraşi" and "Retezat" mountains, and 1 from the "Piatra Craiului" mountains.
2. "La Om (Piscul Baciului)" is lower than "Retezat", which is lower than "Peleaga", which is lower than "Negoiu", which is lower than "Moldoveanu"; "Moldoveanu" is the highest, while "La Om (Piscul Baciului)" is the lowest.
3. Only 3 peaks are at least 2,500 m high: two from "Făgăraşi" and one from "Retezat".
4. The overall average altitude is 2461.6 m; for "Făgăraşi" it is 2539.5 m, for "Retezat" it is 2495.5 m, and for "Piatra Craiului" it is 2238 m.
5. There is only one peak (“Retezat”) that gives its name to its mountain.
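Were Table 1.1 stored as a relational table MOUNTAIN_PEAKS(Mountain, Peak, Altitude) (the aggregation mentioned below in this chapter), such pieces of information would correspond to simple SQL queries; a minimal sketch, with all names assumed:

  -- information 1: how many mountains and peaks data is stored for
  SELECT COUNT(DISTINCT Mountain) AS mountains, COUNT(*) AS peaks
    FROM MOUNTAIN_PEAKS;
  -- information 3: peaks of at least 2,500 m, counted per mountain
  SELECT Mountain, COUNT(*) AS high_peaks
    FROM MOUNTAIN_PEAKS
   WHERE Altitude >= 2500
   GROUP BY Mountain;
  -- information 4: overall and per-mountain average altitudes
  SELECT AVG(Altitude) AS overall_avg FROM MOUNTAIN_PEAKS;
  SELECT Mountain, AVG(Altitude) AS mountain_avg
    FROM MOUNTAIN_PEAKS
   GROUP BY Mountain;
  -- information 5: peaks giving their names to their mountains
  SELECT Peak FROM MOUNTAIN_PEAKS WHERE Peak = Mountain;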
Traditionally, data storage uses some communication means (e.g., images, languages, sculptures, shrines, computers, etc.) and media (limestone, granite, papyrus, paper, HDD/CD/DVD/SSD, etc.). Getting
TABLE 1.1 Carpathians Peaks
only negative and incomplete data, as well as inference rules. When not otherwise stated, this book deals with homogeneous data.
As usual, we are not happy only seeing trees (data): we especially want to see forests (information). As such, some data interpretation is always needed, one abstract and powerful enough to let us understand not only data values, but especially data interconnections.
Generally, models are such intellectual tools, intensively used in sci-
ence for abstracting details and outlining structure; from mathematical and
physical ones (e.g., Big Bang, Big Crunch, relativity, quantum, strings,
unification theories, etc.), economics (e.g., Nash equilibrium), geography
and meteorology (e.g., maps), and up to everyday life (e.g., phone books,
yellow pages, time schedules and tables, etc.) models allow us to at least
partially derive and understand information.
Data models existed long before computers: from maps to phone books and all color pages, from alumni books to cookbooks, and from ship/train/plane/bus timetables to accounting worksheets, mankind has stored data according to several data models since at least the ancient Egyptians, Babylonians, Mayans, and Chinese. We will, however, focus here only on computer-manageable ones.
Such models structure elementary data items on which a predefined set of operations is allowed. Elementary data items are of type <object set, object unique id, object properties, properties' values, timestamp>, as object descriptions are finite and characterized by sets of aggregated properties' values that may vary in time. For example, MOUNTAIN_PEAKS objects from Table 1.1 are aggregated from the Mountain, Peak, and Altitude properties. PEOPLE might be aggregated from SSN, FirstName, LastName, Sex, BirthDate, BirthPlace, and e-mail.
The next chapter is devoted to both data analysis and its deliverables: an
exhaustive, concise and clear informal description of the needed data and
existing business rules, as well as a corresponding set of E-RDs.
The shortest advisable path for db design and implementation needs
then an algorithm to translate E-RDs and associated business rules into rdb
schemas and (possibly empty) sets of associated nonrelational constraints
(i.e., business rules that are not expressible in the RDM framework and,
consequently, need to be enforced by software applications built on top of the
corresponding database).
However, the best approach to conceptual data analysis and modeling,
as well as database design and implementation is to first translate E-RDs
and associated business rules into a higher level semantic data model
scheme, refine it as much as possible, and only afterwards to translate this
refined scheme into RDM schemas and (possibly empty) sets of associated
nonrelational constraints.
Finally, RDM schemas can be implemented as databases (through
corresponding translation algorithms presented in the fourth chapter), by
using any desired (or imposed) relational database management system
(RDBMS) version.
Knowledge is stored both using RDM and logic (for negative and
incomplete data) and inferred by logical data models (presented in the
first chapter of the second volume), based on first order logic.
As constraints are first-order logic formulas,8 one constraint or a set of constraints may imply other constraints as well: a set of constraints C implies a constraint c if c holds in all instances in which C holds (dually, C does not imply a constraint c if there is an instance for which c does not hold, but C holds).
The standard logical notation for implication between formulas is C ⊨ c; c is called an implied constraint. For example, in Table 1.1, the constraint set
8 In particular, closed ones, that is, formulas whose variable occurrences are all bound to logic quantifiers (be it "for any" or "there is"). Dually, open formulas have at least one free occurrence of a variable (i.e., one not bound to any logic quantifier); in dbs, open formulas formalize queries.
{Altitude > 1000, Altitude < 2550} implies the constraint Altitude ∈ (1000, 2550); trivially, the converse is also true. Constraints that are not implied are called fundamental.
We should never enforce implied constraints in db schemas: it would only be superfluously time consuming.
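For example, on an RDBMS it suffices to declare only the fundamental constraints; a hedged sketch (table and column names are merely illustrative):

  CREATE TABLE MOUNTAIN_PEAKS (
    ID       INT PRIMARY KEY,
    Peak     VARCHAR(64) NOT NULL,
    -- the two fundamental constraints below already imply
    -- Altitude ∈ (1000, 2550); declaring that interval as a third
    -- CHECK would only superfluously slow down every insert/update:
    Altitude INT CHECK (Altitude > 1000 AND Altitude < 2550)
  );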
Constraint implication is important not only for minimizing constraint sets, but also for defining constraint set closures, which are needed for defining many important db theoretical notions, including the highest RDM normal form: given any constraint class CC and any subset C of it, the subset of CC containing all constraints implied by C is called the closure of C (with respect to that class) and is denoted by (C)CC+ (or, if CC is understood from the context, simply C+). For example, if C = {S ⊆ T, T ⊆ U}, its transitive closure is C+ = {S ⊆ T, T ⊆ U, S ⊆ U}.
Closures of a set with respect to different classes are generally different (e.g., the reflexive closure of the above C is C+ = {S ⊆ T, T ⊆ U, S ⊆ S, T ⊆ T, U ⊆ U}).
Note that any set is included in its closure and that the closure of a closure is itself (i.e., for any C and CC, (C+)+ = C+). Moreover, the closure of a subset is a subset of the closure of the corresponding superset (i.e., for any C, D, and CC, C ⊆ D ⇒ C+ ⊆ D+), whereas the union of two closures generally differs from the closure of the corresponding union (i.e., there are C, D, and CC such that C+ ∪ D+ ≠ (C ∪ D)+).
A set which is equal to its closure is called closed with respect to implication (e.g., C′ = {S ⊆ T, T ⊆ U, S ⊆ U} is transitively closed).
Two sets having the same closure are called equivalent and are said to cover each other (e.g., C and C′ above are equivalent, i.e., they cover each other).
To think out in every implication the ethic of love for all creation…
this is the difficult task which confronts our age.
—Albert Schweitzer
9 A problem is said to be undecidable if it is neither provable nor refutable, that is, if it is impossible to design an algorithm that always (i.e., in any context) correctly answers its questions (see Appendix).
are varying in real time)? (i.e., what should be the scheme of the needed db and how should it be implemented on a DBMS? What columns and constraints in what tables?) For example, Table 1.1 should be split into two tables, one for mountains and the other for peaks, with all existing constraints added to them (see the sketch after this list).
• a dynamic one: how should the corresponding db be queried for getting correct and prompt answers? (i.e., what should be the architecture, design, and development of the needed application built on top of the above db?)
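A minimal relational sketch of this split might read as follows (column names and sizes are assumptions, and the altitude bounds merely reuse the earlier plausibility example):

  CREATE TABLE MOUNTAINS (
    ID       INT PRIMARY KEY,                 -- surrogate key
    Mountain VARCHAR(64) NOT NULL UNIQUE
  );
  CREATE TABLE PEAKS (
    ID       INT PRIMARY KEY,                 -- surrogate key
    Peak     VARCHAR(64) NOT NULL,
    Mountain INT NOT NULL REFERENCES MOUNTAINS(ID),
    Altitude INT CHECK (Altitude > 1000 AND Altitude < 2550),
    UNIQUE (Mountain, Peak)  -- no two same-named peaks of a same mountain
  );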
Generally, dbs have four dimensions: design, implementation, optimi-
zation, and usage (i.e., programming for querying and modifying data).
The design, implementation and optimization are static, while usage is
dynamic.
Note that, generally, it is true that daily db tasks involve, on average, some 90% usage, 3% design and implementation, and 7% optimization; consequently, at first glance, it might seem that mastering only SQL's DML is almost the only key to success in the db field.
However, design, implementation, and optimization are all crucial and can make the huge difference between a very poor and cumbersome set of interrelated tables and a true database: who would not love to decorate and live in a beautifully and intelligently architected, designed, and built spacious home, full of light and splendid views all around, rather than in a dark underground labyrinth shared with some minotaur?
Data has been stored since the dawn of humanity on various media in order to infer information and knowledge from it. Most data is discrete, which makes tables and graphs, still the simplest and fastest means, ideal candidates for storing it, with tables having great advantages for both storing and querying.
Especially in the computer era, when the amount of stored data is huge (think, for example, of the astronomical one), when there may be several thousands of people simultaneously updating it, but especially as crucial global and local decisions, be they in politics, intelligence, military, science, health, commerce, etc., are more and more heavily based on data, guaranteeing at least its plausibility is of paramount importance.
Consequently, on one hand, more and more sophisticated data models
were introduced, even if, commercially, from the implementation point of
view, the vast majority of the existing database management systems are
relational. This means that only part of the existing business rules may be
enforced through them and for the rest we have to develop software appli-
cations built on top of the corresponding databases.
On the other hand, just as starting to directly write Java (or C#, or whatever other PL) code immediately after getting some desired software application requirements, without proper problem analysis, software architecture, and design, is a huge mistake, starting to directly create tables in a database on an RDBMS is an even greater one: databases are the foundations of db software applications.
Just like no building resists earthquakes, tsunamis or even significant
winds or rains if its foundations are not solid enough, no db software
application may be easily designed, developed, maintained, and used if its
underlying db was not properly designed and implemented.
As such, whenever a new database is needed or an existing one has to be expanded, first of all, besides establishing exactly what the corresponding data needs are, data analysis also has to discover all business rules that govern the corresponding subuniverse.
Then, these findings should be structured according to a data model
that, even if it is not as powerful as needed in order to guarantee data plau-
sibility, is comprehensible by the customers too. We will see in the next
chapter that E-R data models (i.e., E-RDs accompanied by restriction sets
FIGURE 1.0 Proposed algorithms for data analysis, db design, implementation, and
optimization, as well as their dual reverse engineering ones.
from REAF0′ followed by REA0 and REA1; finally, any algorithm from
REAF0–2 is the composition of one algorithm from REAF0–1 followed
by REA2.
16 algorithms are of translation and 9 of refinement types: A2, A3, A3’,
A4, A5, A6, A7/8–3, AF10, and AF11 are refining-type algorithms; all of the
others are translation-type (from one formalism to another) ones. 15 algo-
rithms need no human intervention, while 10 (the assisting ones) do need
it: A0, A2, A3, A3’, A4, A7/8–3, AF9, AF9’, AF10, and AF11 are assisting
algorithms, as only humans may take corresponding needed decisions.
All of these 25 algorithms may and should be embedded in any power-
ful DBMS. Unfortunately, today’s RDBMSs are only providing A8, AF8′
and, partially, REAF0–2 and/or AF10/AF11; only a couple of them (e.g.,
DB2 and Oracle) are also offering AF1–8 (but without the refinements
brought by the algorithms A2 to A6 or, at least, by A7/8–3).
13 Data is heterogeneous when it has several format types (e.g., a mixture of at least two of the following: binary, contiguous, graph, image, map, audio, video, ranking, URL, etc.). Heterogeneous DBMSs (HDBMSs) are systems able to integrate heterogeneous data managed by other DBMSs.
14 NoSQL DBMSs (NoSQL = Not Only SQL) are systems that replaced RDM with some other, less restrictive types of data models (e.g., based on pairs of type <key, value>, or graphs, or documents, etc.), but still use SQL-like languages (for both declaring and processing data).
I like the dreams of the future better than the history of the past.
—Thomas Jefferson
Real generosity toward the future lies in giving all to the present.
—Albert Camus
In my opinion, modern data modeling era opened with Mealy (1967) and
Quillian (1968).
Codd (1970) introduced the RDM.
Chen (1976) introduced the E-RDM.
Smith and Smith (1977) were the first to explore using generalization and aggregation as data modeling tools.
Gallaire and Minker (1978) first married logic and databases: knowledge bases (in particular, deductive databases) sprang from it.
Hammer and McLeod (1978) proposed a first semantic data model.
Mancas (1985) introduced the (E)MDM, and Mancas (1990, 1997,
2002) extended it.
The two main prototype implementations that immensely contributed
to the technology of the RDBMS field were the System R project at IBM
Almaden Research Center (Astrahan, 1976) and the INGRES project at
the University of California at Berkeley (Stonebraker et al., 1976). Both of them were crucial
for establishing this type of DBMS as the dominant db technology of today,
including the relational languages SQL (Chamberlin and Boyce, 1974), the
de facto standard one today, and its rival Quel (Stonebraker et al., 1976).
IBM DB2 (http://www-01.ibm.com/software/data/db2/), System R’s
direct descendant, celebrated a couple of years ago its 30 years of success
and superior innovation (Campbell et al., 2012).
DB2's main competitor is another System R descendant, the Oracle Database (http://www.oracle.com/us/products/database/overview/index.html), or, simply, Oracle, which managed to reach the market some two years before DB2 and has continued ever since to challenge it in a beautiful race of innovation and excellence (Ashdown and Kyte, 2014).
INGRES too spawned several very successful commercial RDBMSs: the
current open source commercial Ingres supervised by Actian Corporation
companies too, among which, for example, are CERN, the Swiss-based European Council for Nuclear Research, which built and uses the Large Hadron Collider, eBay, and The Weather Channel), HBase (http://www.webuzo.com/sysapps/databases/HBase), the main Hadoop db engine (http://hadoop.apache.org/, whose underlying technology was invented by Google and then developed by Yahoo), and Hive (http://hive.apache.org/, of SQL type, born within the Hadoop project too, but now a project of its own), all of them open source and best suited for mission-critical data that is replicated in hundreds of data centers scattered worldwide.
For example, (Redmond and Wilson, 2012) presents 7 NoSQL open
source solutions (including HBase and Postgres). Moreover, there are also
bridges built between the complementary SQL and NoSQL solutions, for
bringing speed and BI capabilities to querying heterogeneous big data,
like, for example, the one between the MS SQL Server 2012 and Hadoop
(Sarkar, 2013).
O-ODBMS were introduced starting with the 1982 Gemstone proto-
type (see http://gemtalksystems.com/index.php/products/gemstones/).
Even if they only managed to conquer niche areas (CAD, multimedia pre-
sentations, telecom, high energy physics, molecular biology, etc.), their
influence on RDM and RDBMSs, at least for the hybrid object-relational
(O-R) systems was significant. Out of them, probably the most famous,
especially after IBM bought it (from Illustra), is Informix (http://www-01.
ibm.com/software/data/informix/).
Only some 10 years after the first commercial RDBMSs were launched, Chang (1981) described a first system marrying rdbs and inference rules. Most probably, the name Datalog was first coined for the corresponding language in Maier (1983).
It then took almost another decade until Naqvi and Tsur (1989), after a five-year effort at MCC, introduced LDL, a language extending Datalog with sets, negation, and data updating capabilities, which was the foundation of the first widely known homonym KDBMS (Chimenti et al., 1990).
The University of Wisconsin at Madison started the CORAL project in 1992 (Ramakrishnan et al., 1992), experimenting with a marriage between SQL and Prolog, which was implemented in a corresponding homonym KDBMS one year later (Ramakrishnan et al., 1993). CORAL++ (Srivastava et al., 1993) is its object-oriented extension.
Today, there are lots of prototype KDBMSs and very few commercial ones. Among the former, for example, are the notable open source semantic web framework Jena (http://jena.apache.org//), developed in Java (first by HP, then by Apache) and also embedding Datalog and OWL, the Datalog educational system DES (http://www.fdi.ucm.es/profesor/fernan/des/), licensed by the Free Software Foundation, and Cascalog (http://cascalog.org/), the Apache Clojure and Java library for querying data stored on Hadoop clusters.
Among the latter, for example, are QL (de Moor et al., 2008), a commercial object-oriented variant of Datalog created by Semmle Ltd. (http://semmle.com/); a free-of-charge database binding for pyDatalog (a logic extension for Python, see https://sites.google.com/site/pydatalog/) from FoundationDB (https://foundationdb.com/key-value-store/documentation/datalog.html); and Datomic (http://www.datomic.com/rationale.html), from Cognitect (http://cognitect.com/), a distributed db system for scalable, flexible, and intelligent applications running on cloud architectures, which uses Datalog as its query language.
Even Microsoft Research developed and provides a security policy
language based on Datalog called SecPal (http://secpal.codeplex.com/).
MatBase has several prototype versions (MS C and Borland Paradox,
MS Access, C# and SQL Server) and was presented in a series of papers:
Mancas et al. (2003), Mancas and Dragomir (2004), and Mancas and Man-
cas (2005, 2006). Its users may either work in (E)MDM (and the system
automatically translates everything into both E-RDM and RDM), which
also includes a Datalog engine, in E-RDM (and the system automatically
translates everything into both (E)MDM and RDM), or in RDM (and
the system automatically translates everything into both E-RDM and (E)
MDM).
The excellent (Tsichritzis and Lochovsky, 1982) and the more recent
(Hoberman, 2009; Hoberman et al., 2009; Simsion and Witt, 2004; Sim-
sion, 2007), etc. are devoted to data modeling.
The recently revived graph-based database models are discussed in
depth, for example, in Robinson et al. (2013).
For data model examples (even if not all of them correct and/or opti-
mal), see, for example, http://www.databaseanswers.org/data_models/
index.htm.
Not only in my opinion, the current "bible" for database theory still remains (Abiteboul et al., 1995), while (Garcia-Molina et al., 2014) is the one for DBMSs.
Other remarkable books in this field are, for example, Date
(2003, 2011, 2012, 2013), Churcher (2012), Hernandez (2013), Light-
stone et al. (2011), Ullman (1988, 1989), and Ullman and Widom
(2007).
Especially in O-ODBMSs, but not only, data is frequently stored in XML; as such data is somewhere at the half-road between dbs and plain-text documents, it is called semistructured; again, not only in my opinion, the current "bible" in this field remains (Abiteboul et al., 2000).
DBLP (http://www.sigmod.org/dblp/db/index.html), started and still managed by Michael Ley, now lists more than 1,200,000 publications in the db field. A smaller searchable index of db research papers (only some 40,000 entries, out of a total of more than 3,000,000 in computer science, with more than 600,000 also having links to the full papers), http://liinwww.ira.uka.de/bibliography/Database/, is maintained by Alf-Christian Achilles.
All URLs mentioned in this section were last accessed on July
11th, 2014.
KEYWORDS
• big data
• business rule (BR)
• closed domain assumption
• closed world assumption
• coherent constraint set
• conceptual data modeling
• consistent database instance
• constraint
• constraint satisfaction
• constraint set
• constraint set closure
• information
• knowledge and database management system (KDBMS)
• knowledge base (kb)
• knowledge-based management system (KBMS)
• Logical Data Model (LDM)
• metadata
• minimal constraint set
• negative data
• NoSQL DBMS
• object set
• object set property
• plausible constraint
• positive data
• redundant constraint
• redundant constraint set
• Relational Data Model (RDM)
• semistructured data
• Structured Query Language (SQL)
• trivial constraint
• unique objects assumption
REFERENCES
Abiteboul, S., Buneman, P., Suciu, D. (2000). Data on the Web. From Relations to Semis-
tructured Data and XML. Morgan Kaufman: San Francisco, CA.
Abiteboul, S., Hull, R., Vianu, V. (1995). Foundations of Databases; Addison-Wesley:
Reading, MA.
Ashdown, L., Kyte, T. (2014). Oracle Database Concepts, 12c Release 1 (12.1). Oracle
Corp. (http://docs.oracle.com/cd/E16655_01/server.121/e17633.pdf).
Astrahan, M. M., et al. (1976). System R: A Relational Approach to Database Manage-
ment. In ACM TODS, 1(2), 97–137.
Campbell, J., Haderle, D., Parekh, S., Purcell, T. (2012). IBM DB2: The Past, Present and
Future: 30 Years of Superior Innovation. MC Press Online, LLC, Boise, ID.
Chamberlin, D. D., Boyce, R. F. (1974). SEQUEL—A Structured English QUEry Lan-
guage. In SIGMOD Workshop, 1, 249–264.
Chang, C. C. (1981). On the Evaluation of Queries Containing Derived Relations in a Rela-
tional Database. In H. Gallaire, J. Minker, and J. Nicolas, eds., Advances in Database
Theory, Vol. I, Plenum Press.
Chen, P. P. (1976). The entity-relationship model: Toward a unified view of data. ACM
TODS 1(1), 9–36.
Chimenti, D., Gamboa, R., Krishnamurti, R., Naqvi, S., Tsur, S., Zaniolo, C. (1990). The
LDL system prototype. IEEE TKDE 2(1):76–90.
Churcher, C. (2012). Beginning Database Design: From Novice to Professional, 2nd ed.,
Apress Media LLC: New York, NY.
Codd, E. F. (1970). A relational model for large shared data banks. CACM 13(6), 377–387.
Date, C. J. (2003). An Introduction to Database Systems, 8th ed., Addison-Wesley: Read-
ing, MA.
Date, C. J. (2011). SQL and Relational Theory: How to Write Accurate SQL Code, 2nd ed., Theories in Practice; O'Reilly Media, Inc.: Sebastopol, CA.
Date, C. J. (2012). Database Design & Relational Theory: Normal Forms and All That Jazz; Theories in Practice; O'Reilly Media, Inc.: Sebastopol, CA.
Date, C. J. (2013). Relational Theory for Computer Professionals: What Relational Databases are Really All About; Theories in Practice; O'Reilly Media, Inc.: Sebastopol, CA.
De Moor, O. et al., (2008). QL: Object-Oriented Queries Made Easy. Lammel, R., Visser,
J., Saraiva, J. (Eds.): GTTSE 2007, LNCS 5235: 78–133, Springer-Verlag, Berlin
Heidelberg.
Gallaire, H., Minker, J. (1978). Logic and Databases. Plenum Press, New York, U.S.A.
Garcia-Molina, H., Ullman, J. D., Widom, J. (2014). Database Systems: The Complete
Book, 2nd ed., Pearson Education Ltd.: Harlow, U.K., (Pearson New International
Edition)
Hammer, M., McLeod, D. (1978). The Semantic Data Model: a Modeling Mechanism for
Database Applications. In ACM SIGMOD Int. Conf. on the Manag. of Data.
Hernandez, M. J. (2013). Database Design for Mere Mortals: A Hands-on Guide to Rela-
tional Database Design, 3rd ed., Addison-Wesley: Reading, MA.
Hoberman, S. (2009). Data Modeling Made Simple 2nd Edition. Technics Publications
LLC, Bradley Beach, NJ.
Hoberman, S., Blaha, M., Inmon, B., Simsion, G. (2009). Data Modeling Made Simple:
A Practical Guide for Business and IT Professionals 2nd Edition. Technics Publica-
tions LLC, Bradley Beach, NJ, (Fourth Printing).
Lightstone, S. S., Teorey, T. J., Nadeau, T., Jagadish, H. V. (2011). Database Modeling and
Design: Logical Design, 5th ed., Data Management Systems; Morgan Kaufmann:
Burlington, MA.
Maier, D. (1983). The Theory of Relational Databases. Computer Science Press: Rock-
ville, MD.
Mancas, C. (1985). Introduction to a data model based on the elementary theory of sets, re-
lations and functions (in Romanian). In Proc. of INFO IASI’85, 314–320, A.I. Cuza
University, Iasi, Romania.
Mancas, C. (1990). A Deeper Insight into the Mathematical Data Model. Proc. 13th Intl.
Seminar on DBMS, ISDBMS’90, 122–134, Mamaia, Romania.
Mancas, C. (1997). Conceptual data modeling. (in Romanian) Ph.D., Thesis: Politehnica
University, Bucharest, Romania.
Mancas, C. (2002). On Knowledge Representation Using an Elementary Mathematical
Data Model. In Proc. IASTED IKS 2002. Conf. on Inf. and Knowledge Sharing,
206–211, Acta Press, St. Thomas, U.S. Virgin Islands, U.S.A.
Mancas, C., Dragomir S., Crasovschi, L. (2003). On modeling First Order Predicate Cal-
culus using the Elementary Mathematical Data Model in MatBase DBMS. In Proc.
IASTED AI 2003. MIT Conf. on Applied Informatics, 1197–1202, Acta Press, Inns-
bruck, Austria.
Mancas, C., Dragomir S. (2004). MatBase Datalog Subsystem Metacatalog Conceptual
Design. In Proc. IASTED SEA 2004. MIT Conf. on Software Eng. and App., 34–41,
Acta Press, Cambridge, MA.
Mancas, C., Mancas, S. (2005). MatBase E-R Diagrams Subsystem Metacatalog Concep-
tual Design. In Proc. IASTED DBA 2005. Conf. on DB and App., 83–89, Acta Press,
Innsbruck, Austria.
Mancas, C., Mancas, S. (2006). MatBase Relational Import Subsystem. In Proc. IASTED
DBA 2006. Conf. on DB and App., 123–128, Acta Press, Innsbruck, Austria.
Mealy, G. H. (1967). Another Look at Data. In AFIPS Fall Joint Comp. Conf., 31, 525–534.
Naqvi, S., Tsur, S. (1989). A Logical Language for Data and Knowledge Bases. Computer
Science Press: Rockville, MD.
Quillian, M. R. (1968). Semantic memory. In Semantic Inf. Processing, Minsky, M., ed., M.I.T. Press.
Ramakrishnan, R., Srivastava, D., Sudarshan, S. (1992). CORAL: Control, Relations and
Logic. In VLDB.
Ramakrishnan, R., Srivastava, D., Sudarshan, S., Sheshadri, P. (1993). Implementation of
the {CORAL} deductive database system. In SIGMOD.
Redmond, E., Wilson, J. R. (2012) Seven Databases in Seven Weeks: A Guide to Modern
Databases and the NoSQL Movement. Pragmatic Bookshelf.
Robinson, I., Webber, J., Eifrem, E. (2013). Graph Databases. O’Reilly Media Inc.,
Sebastopol, CA.
Sarkar, D. (2013) Microsoft SQL Server 2012 with Hadoop. Packt Publishing, Birmingham,
Mumbai.
Simsion, G. C. (2007). Data Modeling: Theory and Practice. Technics Publications LLC,
Bradley Beach, NJ.
Simsion, G. C., Witt, G. C. (2004). Data Modeling Essentials, 3rd Edition. Morgan-Kauf-
mann Publishers, Burlington, MA.
Smith, J. M., Smith, D. P. C. (1977). Database abstractions: Aggregation and generaliza-
tion. ACM TODS 2(2):105–133.
Srivastava, D., Ramakrishnan, R., Sudarshan, S., Sheshadri, P. (1993). CORAL++; Adding
Object-orientation to a Logic Database Language. In Proc. VLDB Dublin, Ireland,
158–170.
Stonebraker, M., Wong, E. A., Kreps, P., Held, G. (1976) The design and implementation
of Ingres. In ACM TODS, 1(3), 189–222.
Tsichritzis, D. C., Lochovsky, F. (1982). Data models. Prentice-Hall: Upper Saddle River, NJ.
CONTENTS
2.12 Conclusion.................................................................................... 98
2.13 Exercises....................................................................................... 99
2.14 Past and Present...........................................................................110
Keywords................................................................................................111
References...............................................................................................112
Both types of these object sets have associated attributes (e.g., Sex, First-
Name, LastName, e-mailAddr, BirthDate, MarriageDate, HireDate, Cus-
tomerName, CityName, ZipCode, Address, CountryName, CountryCode,
CountryPopulation, WarehouseName, ProductTypeName, StockQuantity,
EndDate, etc.), graphically represented by ellipses attached to the corre-
sponding rectangles or diamonds.
Please note that in any such data model, the names of the object sets (be they of entity or relationship type) must be unique. Similarly, all attributes of any given object set should have distinct names (but there may be attributes of different object sets having the same names).
Mathematically, attributes are mappings defined on corresponding
associated sets and taking values into corresponding value sets (i.e., sub-
sets of data types; for example, {‘F’, ‘M’}, [01/01/1900, 31/12/2012],
{True, False}, [30, 250], ASCII(32), UNICODE(32), etc.).
In the original E-RDM, attributes may take values into Cartesian prod-
uct sets as well; we do not allow it and restrict their codomains to atomic
sets only.
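In relational implementations, such atomic value sets typically surface as data types narrowed by CHECK constraints; a hedged sketch reusing the value-set examples above (all names assumed):

  CREATE TABLE PEOPLE (
    ID        INT PRIMARY KEY,
    Sex       CHAR(1) CHECK (Sex IN ('F', 'M')),          -- {'F', 'M'}
    BirthDate DATE CHECK (BirthDate BETWEEN DATE '1900-01-01'
                                        AND DATE '2012-12-31'),
    Retired   BOOLEAN                                     -- {True, False}
  );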
The E-RDM also generally recommends adding for each object set (be
it of type entity or relationship) a distinguished attribute called surrogate
key, having no other semantics except for numeric unique identification
The relationship examples from Figs. 2.3 and 2.4 suggest that not all rela-
tionship types need to be associated with object sets: all the one-to-one,
needed object sets and mappings, one of the main measures of model complexity.
From this point of view too, but not only, please note that the cardinalities of relationship types are not absolute, but highly dependent on the context.
For example, if we are only interested in current marriages (and not in their history too) for a subuniverse where only one marriage is allowed at a time, then MARRIAGES from Fig. 2.1 is no longer a many-to-many, but a one-to-one relationship type.
In such cases, obviously, by applying KPP, we should model MARRIAGES rather as the one-to-one Spouse, like in Fig. 2.9 (which is an example of a so-called recursive relationship18, that is, a functional relationship between a set and itself); please also note that, in such contexts, MarriageDate is no longer a property of MARRIAGES, but of PEOPLE.
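Relationally, such a one-to-one recursive relationship would typically become a self-referencing foreign key; a hedged sketch (names assumed):

  CREATE TABLE PEOPLE (
    ID           INT PRIMARY KEY,                  -- surrogate key
    Name         VARCHAR(255) NOT NULL,
    Spouse       INT UNIQUE REFERENCES PEOPLE(ID), -- autofunction PEOPLE → PEOPLE
    MarriageDate DATE                              -- now a property of PEOPLE
  );
  -- UNIQUE keeps Spouse one-to-one: nobody may be somebody else's
  -- spouse twice; NULLs stand for currently unmarried people.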
18 Mathematically, recursive relationships are autofunctions (functions having the same domain and codomain).
FIGURE 2.10 An example of a relationship and entity type object sets hierarchy.
FIGURE 2.11 Alternative flat solution instead of using relationship and entity types
object sets hierarchy.
Otherwise, rather sooner than later, you will lose them, as competition in
this field too is merciless.
Consequently, abstract relationship hierarchies whenever this is pos-
sible in the considered subuniverse of discourse. Please note that this pro-
cess is, essentially, just another type of factorization of data, as much as possible, so that each datum is placed on the highest possible generalization level, in order to avoid storing any duplicates.
The above example should not mislead us: properly defined, without
semantic overloading, relationships having arity greater than two may be
successfully used too. For example, in order to store various options for
flying between pairs of airports, the following ternary relationship might
be used, like in Fig. 2.12.
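Figure 2.12 is not reproduced here; assuming it links two airport roles and an airline, such a ternary relationship might be sketched as follows (all names are illustrative assumptions):

  CREATE TABLE AIRPORTS (
    ID   INT PRIMARY KEY,
    Name VARCHAR(64) NOT NULL UNIQUE
  );
  CREATE TABLE AIRLINES (
    ID   INT PRIMARY KEY,
    Name VARCHAR(64) NOT NULL UNIQUE
  );
  CREATE TABLE FLIGHT_OPTIONS (      -- a ternary relationship-type set
    ID          INT PRIMARY KEY,
    FromAirport INT NOT NULL REFERENCES AIRPORTS(ID),
    ToAirport   INT NOT NULL REFERENCES AIRPORTS(ID),
    Airline     INT NOT NULL REFERENCES AIRLINES(ID),
    UNIQUE (FromAirport, ToAirport, Airline)
  );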
However, we will see in the second volume that (E)MDM includes both a theorem proving that the only relationships fundamentally needed for data modeling are the functions (i.e., the arrows in E-RDM), which means that we need never use relationship-type sets, as well as counterexamples proving that, generally, it is not a good idea to use relationships of arity greater than two.
20 For example, Barker (Oracle) notation uses dotted half-lines for indicating that no element might correspond to the corresponding object set.
The most trivial restriction type is the attributes' ranges: what should be, for example, the plausible values for the attributes in Fig. 2.10?
For some of them, the answer is obvious to most humans; for example, Weekday should probably take values from {1, 2, 3, 4, 5}, as there are no classes on either Saturdays or Sundays. Note that, even for these attributes, their ranges are not absolute, but relative to the corresponding context: for example, under communism, in all east European countries almost everybody was working on Saturdays too, including students; probably today
too there are still countries where students have classes on Saturdays (or
even Sundays).
Similar considerations apply for StartH and EndH (most probably the
former always ranges from 8 to 19 and the latter from 9 to 20).
Even for text-type attributes, we should indicate the longest possible corresponding text string; for example, how long could a person (be it student or teacher) or discipline name be? Obviously, it is not plausible that they might have, for example, 4 billion characters each.
Similarly, it is trivial that Date should not take values between, for example, January 1st, 4712 BC and December 31st, 9999 (as Oracle lets us store), but only between something like October 1st, 2010 and today.
For other attributes, the answer is not that obvious; for example, should Room# take natural or text values? Probably, as at least some schools label rooms by using letters too, in order to accommodate them as well (for having as many customers as possible) we would decide to restrict Room# to text strings (e.g., of at most 8 or 16 ASCII characters), even if numbers are stored and processed more conveniently by computers.
Similar considerations apply for Grade (in some countries varying
from F to A+, in others from 1 to 7, or to 10, or to 20 etc.).
The nine digits social security number (SSN) is used for uniquely iden-
tifying U.S. citizens and residents; in Romania, for example, its equivalent
(called personal numeric code), has 13 digits (and embeds birthdate and
sex); other countries have other types of equivalent codes that include let-
ters too (e.g., Greece, Italy, Hong Kong, etc.), while others (e.g., U.K.) do
not have such codes.
Last, but not least, we should always also specify the maximum car-
dinality of each object set. For example, corresponding values might be
1,000 for DISCIPLINES, TEACHERS and ROOMS, 10,000 for COMPE-
TENCES and CLASSES, 100,000 for STUDENTS and SCHEDULES, and
1,000,000,000 for ATTENDANCES.
The maximum cardinality restriction of an object set determines the range of its surrogate key: for example, the range of ID_Discipline would be [1; 1,000].
Specifying maximum cardinality restrictions should be done too, not only for later implementation considerations, but also for having the most precise possible idea of the complexity of the considered subuniverse.
Range restrictions are the basic ones for guaranteeing data plausibility. Consequently, for any attribute (ellipsis) of any E-RD you should add a corresponding range restriction.
Unfortunately, there are still very many of those believing that it is enough to specify data types as ranges: numbers (generally divided into implementation subtypes: naturals between 0 and 255, integers between –2,147,483,648 and 2,147,483,647, etc.), calendar dates, etc. Obviously, this does not generally guarantee data plausibility (except for rare cases, like the Boolean data type): for example, StockQty should never be negative and should always be less than a maximum value that depends on the context, but which is never as large as 2,147,483,647 (also see Altitude from Table 1.1).
To conclude, to any attribute of any E-RD you should add a sensible range restriction, such that its values are plausible in the corresponding context.
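For example, assuming a STOCKS table and a context-dependent upper bound of 100,000, such a range restriction might be enforced as:

  ALTER TABLE STOCKS
    ADD CONSTRAINT StockQtyRange CHECK (StockQty BETWEEN 0 AND 100000);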
Notationally, each range restriction may be represented by the name of the corresponding attribute, followed by a colon (or by the keyword "in" or the symbol "∈"), the corresponding range set, and the (unique) restriction label (possibly enclosed in parentheses) (see R02, R03, etc.).
Distinction between integer and noninteger ranges should be done by
corresponding minimum and maximum values: for example, [1, 5] is a
subset of the naturals, while [1.0, 5.0] is one of the reals (rationals only
for computers).
ASCII(n), UNICODE(n), etc. represent the sets of strings made out of
characters from the corresponding alphabets and having at most n (natu-
ral) such characters.
Similarly, NAT(n) represents the subset of naturals having at most n
digits, INT(n) the corresponding one of integers, RAT(n.m) the one of
rationals having at most n digits before the decimal dot (or comma in
Europe) and m after it, and RAT+(n.m) its subset of positive values (obvi-
ously, NAT(n) = INT+(n)).
CURRENCY(n) is a subset of RAT+(n.2).
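These notation sets have natural (if approximate) SQL counterparts, again data types narrowed by CHECKs; a hedged correspondence sketch:

  CREATE TABLE NOTATION_EXAMPLES (
    Name     VARCHAR(255),                          -- ASCII(255)
    Quantity NUMERIC(4, 0) CHECK (Quantity >= 0),   -- NAT(4)
    Ratio    NUMERIC(7, 2) CHECK (Ratio >= 0),      -- RAT+(5.2)
    Price    NUMERIC(8, 2) CHECK (Price >= 0)       -- CURRENCY(6)
  );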
For expressing the particular case of cardinality constraints, we may use the standard algebraic mappings card, which returns the number of elements in a set, and max, which computes the maximum value of an expression (see R01, R04, R07, etc.).
For example, the following might be (for Romania) the range restric-
tions associated to the E-RD in Fig. 2.10 (where SysDate() is a system
function returning current OS calendar date):
STUDENTS
max(card(STUDENTS)) = 10⁵ (R01)
SSN: [1000101000000, 8991231999999] (R02)
Name: ASCII(255) (R03)
TEACHERS
max(card(TEACHERS)) = 10³ (R04)
SSN: [1000101000000, 8991231999999] (R05)
Name: ASCII(255) (R06)
DISCIPLINES
max(card(DISCIPLINES)) = 10³ (R07)
Discipline: ASCII(128) (R08)
ROOMS
max(card(ROOMS)) = 10³ (R09)
Room#: [1, 10⁴] (R10)
CLASSES
max(card(CLASSES)) = 10⁴ (R11)
Date: [01/10/2010, SysDate()] (R12)
SCHEDULES
max(card(SCHEDULES)) = 10⁵ (R13)
Weekday: [1, 5] (R14)
StartH: [8, 19] (R15)
EndH: [9, 20] (R16)
ATTENDANCES
max(card(ATTENDANCES)) = 10⁹ (R17)
Grade: [1, 10] (R18)
COMPETENCES
max(card(COMPETENCES)) = 10⁴ (R19)
ASCII(2) for Grade) and the corresponding widest intervals (e.g., [1, 7]
for Weekday).
Please note that, generally, besides choosing the right data type set for every attribute, it is crucial that all of your range restrictions specify intervals as narrow as possible (explicitly or implicitly of the type [minValue, maxValue]), so as to actually guarantee that only plausible data will be accepted for all attributes.
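For instance, restrictions R12 and R14 to R16 above might be enforced along the following lines (a sketch only: the EndH > StartH rule is an extra assumption, Date is renamed because DATE is a reserved word, and some RDBMSs reject nondeterministic functions like CURRENT_DATE inside CHECKs, requiring a trigger instead):

  CREATE TABLE CLASSES (
    ID        INT PRIMARY KEY,
    ClassDate DATE NOT NULL CHECK (ClassDate BETWEEN DATE '2010-10-01'
                                          AND CURRENT_DATE)       -- R12
  );
  CREATE TABLE SCHEDULES (
    ID      INT PRIMARY KEY,
    Weekday SMALLINT NOT NULL CHECK (Weekday BETWEEN 1 AND 5),    -- R14
    StartH  SMALLINT NOT NULL CHECK (StartH BETWEEN 8 AND 19),    -- R15
    EndH    SMALLINT NOT NULL CHECK (EndH BETWEEN 9 AND 20),      -- R16
    CHECK (EndH > StartH)                 -- assumed plausibility rule
  );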
As a counterexample, many RDBMSs (including DB2, MS SQL Server, Access, and MySQL) chose for the DATE data type ranges starting in our era and ending on Dec. 31st, 9999 (most probably dreaming that their products will last unchanged, from this point of view, for at least another 7,986 years); obviously, we generally need no more than 100 years in the future, but, whenever confronted with mankind's history data, we badly need dates before our era, which are very hard to store and process with these RDBMSs.
For example, the famous battle of Actium took place on September 2, 31 BC; the almost equally famous one of Kadesh, in May 1274 BC; the oldest recorded one (on the walls of the Hall of Annals from the Temple of Amun-Re in Karnak, Thebes, today's Luxor) is the one of Megiddo, which took place on April 16, 1457 BC (or, after other calculations, 1479 BC or even 1482 BC), etc.
Consequently, a much more intelligent choice for DATE would have been something like [1/1/–7000, 12/31/2999]: almost surely, 985 years from now on are enough from any practical point of view, and, for example, the oldest intact bows discovered (in Holmegaard, Denmark) are some 8,000 years old, the Goseck Circle astronomic observatory discovered in Germany is some 7,000 years old, the oldest discovered wheel (near Ljubljana, Slovenia) is some 5,200 years old, etc.
object set and separating the names of the properties by commas (see R20, R21, R24, R25, R26, and R27).
For example, the following might be the compulsory data restrictions
associated to the E-RD in Fig. 2.10.
Just like for (algebraic) sets, db object sets do not allow for duplicates: it
does not make sense to store data on objects that are not uniquely identifi-
able.
For example, in the subuniverse modeled by the E-RD from Fig. 2.10,
we should be able to uniquely distinguish between any two students,
teachers, disciplines, rooms, etc.
Any time when we do not need to uniquely distinguish between ele-
ments of a same set of objects (like, for example, two kilos of sugar that
we are buying) for which we would like to store data, we should abstract
a corresponding higher conceptual level set instead (e.g., PRODUCT_
TYPES, having “sugar” as one of its elements and a Quantity property,
besides the ProductTypeName that should be unique).
Uniqueness is a constraint that applies either on properties or on sets of
properties. For example, by definition, SSN values are unique, but in order
to uniquely distinguish between world’s states we need both their names
and the corresponding country (as there may not be states having same
names in a same country, but, worldwide, there may be states of different
countries having same names: for example, both Spain and Argentina have
states called Cordoba and La Rioja).
Object sets may have several uniqueness constraints; for example, as
in no room may simultaneously either start or end more than one class,
SCHEDULES above has two such triple constraints: one built with Room,
Weekday and StartH, and the other (its dual one) with Room, Weekday and
EndH.
At this conceptual level, however, in order to be sure that object sets are correctly defined (i.e., from this point of view, that they do not allow for duplicates), it is enough to add to any of them at least one uniqueness constraint; of course, anytime we discover several of them even at this stage, we should add them all to the corresponding uniqueness restriction set.
There are only a couple of exception types among object sets that might
not have associated uniqueness restrictions: subsets, people genealogy-
type sets, and poultry/rabbit/etc. cages type sets.
For example, the subset DRIVERS of EMPLOYEES (that we might
abstract in order to store only for them the otherwise nonapplicable prop-
erties LicenseType, LicenseDate, etc.) might not have any uniqueness
restriction. For subsets, unique identification may be done through the
uniqueness restrictions of the corresponding supersets: all subsets of any
set inherit all of its restrictions.
PEOPLE in a genealogical db might not have uniqueness restrictions either: for example, there may be several persons having same names about whom we do not know anything else (but their names). For such sets, the only way to uniquely distinguish between their elements is by the values of their surrogate keys.
Poultry/rabbit/etc. CAGES in a (farming) db might not need any other
uniqueness than the one provided by surrogate keys: you can use their
system generated numbers as cage labels too (surrogate keys thus getting
semantics too).
Notationally, we may simply specify uniqueness restrictions just like the compulsory data ones; however, as they are not that obvious, we should add the corresponding reason in parentheses after each of them.
STUDENTS: SSN (there may not be two students having same SSN) (R28)
TEACHERS: SSN (there may not be two teachers having same SSN) (R29)
DISCIPLINES: Discipline (there may not be two disciplines having
same name) (R30)
ROOMS: Room# (there may not be two rooms having same number) (R31)
CLASSES: Date • Schedule (there may not be two classes having same schedule at a same date) (R32)
SCHEDULES:
Room • Weekday • StartH (in no room may simultaneously start more
than one class) (R33)
Room • Weekday • EndH (in no room may simultaneously end more
than one class) (R34)
ATTENDANCES: Student • Class (there is no use to store more than once
the fact that a student attended a class) (R35)
COMPETENCES: Teacher • Discipline (there is no use to store more
than once the fact that a teacher has the competence to teach a discipline) (R36)
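At the relational level (see Chapter 3), each such restriction becomes a (possibly concatenated) unique constraint; for example, (R33) and (R34) might be declared in generic SQL as follows, continuing the SCHEDULES sketch above (Room is assumed to be the column storing the corresponding role, and exact syntax varies between RDBMSs):

    ALTER TABLE SCHEDULES
      ADD CONSTRAINT SCHEDULES_U1 UNIQUE (Room, Weekday, StartH); -- (R33)
    ALTER TABLE SCHEDULES
      ADD CONSTRAINT SCHEDULES_U2 UNIQUE (Room, Weekday, EndH);   -- (R34)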
Even a very simple subuniverse, like, for example, the one in Fig. 2.10, may be governed by other types of business rules than the three above. Some of them are of common-sense type, like, for example, the following:
21 Mathematically, '•' is the Cartesian product operator on mappings.
SCHEDULES:
StartH < EndH (no class may end before it starts) (R37)
No teacher may be simultaneously present in
more than one room. (R38)
No student may be simultaneously present in more than
one room. (R39)
No room may simultaneously host more than one class.22 (R40)
There may not be two people (be them teachers or students)
having same SSN.23 (R41)
22 Note that (R40) implies both (R33) and (R34).
23 Note that (R41) implies both (R28) and (R29).
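Of the above, only (R37) compares values within a single SCHEDULES element, so, at the relational level, it may be enforced with a simple CHECK constraint (a generic SQL sketch follows); rules like (R38)–(R41), which relate several elements or several object sets, generally need triggers or application code instead (see Chapter 3):

    ALTER TABLE SCHEDULES
      ADD CONSTRAINT StartBeforeEnd CHECK (StartH < EndH); -- (R37): no class may end before it starts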
In order to skip redrawing the E-RDs too many times, I'm then also drawing their attention to two other relevant facts, proving that, if left as is, EDITIONS is also semantically overloaded:
• Often (possibly regardless of the initial (co)author(s)' book volume structuring, which is of no great interest to libraries), editions are split into several volumes, each with its own distinct ISBN and possibly its own title.
• Sometimes, even if not that often, volumes may contain several books
(e.g., “Complete Shakespeare’s Dramatic and Poetical Works”).
This proves that their initial BOOKS object set was overloaded with the
semantics of five object sets, namely: BOOKS, VOLUMES_CONTENTS,
VOLUMES, EDITIONS, and COPIES.
Finally, I’m also suggesting them to add a Boolean Author? to PEO-
PLE, which will prove very useful for both users and db application
developers.
Never trust anyone who has not brought a book with them.
—Lemony Snicket
b. Data ranges:
ISBN: ASCII(16) (RV1)
VTitle: ASCII(255) (RV2)
VNo: [1, 255] (RV3)
Price: [0, 100000] (RV4)
c. Compulsory data: Edition, VNo, Price (RV5)
d. Uniqueness: ISBN (by definition, ISBNs are unique for
any volume) (RV6)
The db should store data on books (title, writing year, first author, coauthors, and their order), on people (authors and/or library subscribers: e-mail addresses, first and last names, whether they are authors or not), as well as on (book copies) borrows by subscribers (borrow, due, and actual return dates).
Books are published by publishing houses, possibly in several
editions, even by same publishers (in different years). Each edition
may contain several volumes, each having a number, a title, a price,
and an ISBN code. Volumes may contain several books.
The library owns several copies (uniquely identified by an inven-
tory code) of any volume and may lend several copies for any bor-
row; not all borrowed copies should be returned at a same date;
maximum lending period is 300 days.
Last names, book titles, first authors, publisher names, edition pub-
lishers, first books and titles, as well as volume numbers and prices,
copy inventory codes, borrow subscribers, dates, and due return dates
are compulsory; for subscribers, first names and e-mail addresses are
compulsory too.
People are uniquely identified by their first and last names, plus
e-mail address; books by their first author, title, and writing year;
publishers by their names; editions by their first book, publisher,
title, and year; volumes by corresponding edition and volume num-
ber, as well as by their ISBN; copies by their inventory number.
There may be at most 1,000,000 persons, books, and editions,
2,000,000 coauthoring and volumes, 10,000 publishers, 4,000,000
books publishing, 32,000,000 copies, 100,000,000,000 borrows, and
1,000,000,000,000 copy borrows of interest. For example, first bor-
row date of interest is 01/06/2000.
No author should appear more than once in any book coauthors list.
For any edition, its first book should be the first one published in
its first volume. No edition may contain same book more than once.
No copy may be borrowed less than 0 days or more than 300 days.
No copy may be simultaneously borrowed to more than one
subscriber.
No copy may be returned before it was borrowed or after 100
years since corresponding borrow date.
The secret of getting ahead is getting started. The secret of getting started is
breaking your complex overwhelming tasks into small manageable tasks,
and starting on the first one.
—Mark Twain
The intuition gained with the above case study should make it easier to always apply first the algorithm for assisting data analysis and conceptual modeling processes (A0) presented in Fig. 2.16.24
24 Whose complexity is linear in the sum of the object set collection, property, and restriction set cardinals (see Exercise 2.9).
    if current E-R subdiagram is a new but not the first one then
        link the new object set with at least another existing one
        (from previous E-R subdiagrams);
    case: object set property
        if the corresponding object set does not exist in the data model then
            add a corresponding object set like above;
        if property is a functional relationship then
            add corresponding arrow (correspondingly labeled);
        else add corresponding ellipse (correspondingly labeled);
    case: business rule
        add corresponding restriction to the restriction set (note:
        sometimes it may be the case that, in order to do it, you should
        first add new object sets and/or properties);
    end select;
end repeat;
meet again customers and ask them about everything which is still not
clear or missing, if any, or whether they agree with your deliveries;
if everything is clear and you got customers' OK then done = true;
end while;
End algorithm A0;
Note that such algorithms are sometimes referred to as forward engineering ones (as opposed to their reverse engineering duals—see, for example, Sections 3.5, 4.4, and 4.6).
By applying A0 to our variant of the E-RDM, the corresponding E-R
data metamodel from Fig. 2.17 is obtained.
b. Data ranges:
Name, Description: ASCII(255) (RO2)
MaxCard: NAT(38) (maximum cardinality) (RO3)
c. Compulsory: Name, MaxCard, Description
(RO4)
d. Uniqueness: Name (There may not be two object sets
having same names in a same subuniverse of discourse.) (RO5)
e. Other type restrictions:
There are no other types of object sets than entity and relationship ones, and any object set is either of entity or of relationship-type.25 (RO6)
3. NON_FUNCTIONAL_RELATIONSHIP_TYPES ⊆ OBJECT_SETS (the finite object sets subcollection of Cartesian product type sets)
*Arity: corresponding Cartesian product arity
7. FUNCTIONAL_RELATIONSHIPS ⊆ MAPPINGS (the finite mapping subset of functional relationships)
a. Compulsory: CodomainObjectSet (RF1)
There are at most 10^38 (for example) object sets of interest in any subuniverse of discourse; some of them may be included in others; they are characterized by the compulsory properties name (an ASCII string of maximum 255 characters, for example), uniquely identifying each object set in any given subuniverse of discourse, maximum cardinal (a natural having at most 38 digits, for example), and a description (an ASCII string of maximum 255 characters, for example).
Object sets may be either atomic (of the so-called “entity” type) or of
Cartesian product type (the so-called “relationship”).
The Cartesian product type ones obviously have arity greater than
one and are nonrecursively defined on other underlying object sets. For
any such relationship-type set, each canonical Cartesian projection has a
compulsory name (of at most, for example, 255 ASCII characters), unique
within the set and called “role”.
Each subuniverse of interest may also need at least one and at most
100 (for example) so-called (programming) “data types”, which are finite
subsets of the naturals, integers, rationals, strings of characters over some
alphabet (generally ASCII or UNICODE), calendar dates, Boolean truth
values, etc. They are uniquely identified by their compulsory name (an
ASCII string of maximum 255 characters, for example) and some of them
may be included in other ones.
There are at most 10^38 (for example) mappings of interest in any sub-
universe of discourse defined on object sets; they are characterized by the
compulsory properties name (an ASCII string of maximum 255 characters,
for example), uniquely identifying each mapping within those defined on
the same object set, and a description (an ASCII string of maximum 255
characters, for example); it is also compulsory to specify for any map-
ping whether or not its values are compulsory (or may be sometimes left
unspecified)—the so-called “compulsory restriction”—and are uniquely
identifying the elements of its domain object set (i.e., if the mapping is
one-to-one or not)—the so-called “uniqueness restriction”.
Mappings may take values in either object sets (in which case they
are called “functional relationships”, which are formalizing existing links
between object sets) or data ranges (in which case they are called “attri-
butes”, which are formalizing object set properties).
Attribute codomains (the so-called “range restrictions”) are subsets
(called “ranges”) of data types, uniquely identified by the triple made out
of their compulsory data type superset, minimum and maximum values
(both being, for example, ASCII strings of maximum 255 characters);
some ranges may be included into other ones.
In order to keep things simpler, this E-R data metamodel does not
include restrictions of other types and uniquenesses made out of several
mappings (see Exercise 2.6, as well as Section 3.3).
Here is a set of best practice rules in data analysis and conceptual modeling.
R-DA-1.
a. All software objects (e.g., dbs, object sets, properties, tables, col-
umns, constraints, applications, classes, variables, methods, librar-
ies, etc.) should be named consistently, adequately, in a uniform
manner, and their names should be the same at all involved levels
(data analysis, E-RDs, restriction sets, mathematical, relational,
and implemented db schemas, SQL, high level programming lan-
guages code, as well as both global/external and local/embedded
in-line documentation).
b. Object set names should be plurals and uppercase; property and restriction names should be singular, generally using lowercase, except for acronyms and word-starting letters. Examples:
R-DA-2. Object sets are either atomic (i.e., of "entity" type) or not (i.e., of non-functional "relationship" type), that is, mathematically, relations (i.e., subsets of Cartesian products).
a. For each abstracted object set, you should always first determine
its type in the corresponding context.
b. In different contexts, object sets might have different types.
c. Obviously, if a set is not atomic, then you should also exactly deter-
mine and specify its underlying object sets.
For example, considering a logistic subuniverse where several product
types might be stored in a same warehouse and warehouses may store sev-
eral product types, we should abstract three related object sets:
• PRODUCTS (atomic),
• WAREHOUSES (atomic), and
• STOCKS (nonatomic, linking the former two atomic ones, which are its underlying sets, through two corresponding roles: Product and Warehouse).
The attribute StockQty is a property of STOCKS, as there may be, for
example, q1 products of type p1 in warehouse w1, q2 in w2, and so on.
The corresponding E-RD fragment should then be as in Fig. 2.18.
When there is only one warehouse, both WAREHOUSES and STOCKS are no longer needed, and attribute StockQty is associated to PRODUCTS instead.
R-DA-4.
a. Objects should always be abstracted as set elements, not as character strings, numbers, calendar dates, etc., that is, as object property values.
FIGURE 2.19 Example of using abstraction in data modeling for reducing model complexity.
A theory is the more impressive the greater the simplicity of its prem-
ises is, the more different kinds of things it relates, and the more extended
is its area of applicability. Therefore the deep impression which classical
thermodynamics made upon me. It is the only physical theory of univer-
sal content concerning which I am convinced that within the framework
of the applicability of its basic concepts, it will never be overthrown.
—Albert Einstein
I bought a two story house. … One story before I bought, and another after.
—(anonymous author)
R-DA-7.
a. Fundamental dbs should store any fundamental data only once.
Any computable data is redundant; hence, generally, it should not
be stored.
b. Exceptions justified by query speed-up needs should not only be documented as such (starting with prefixing their names with a special character, like '*', and indicating the corresponding expression for computing them, as well as the reasons why they were added to the model), but always be made read-only for end-users in any application built on that db, and their values be updated only automatically, by trigger-like methods, immediately when needed (which is referred to as controlled data redundancy).
c. Generally, such exceptions should not be introduced at this level
(the only conceptual one available to customers too), as it may con-
fuse customers (possible exception: when you want to justify to
your customers higher costs for faster solutions), but only starting
with the next conceptual level.
d. Before introducing such exceptions, you should make sure that they are either critical or that significant querying speed-up advantages justify the additional costs in development, db size, and the time needed for the corresponding automatic updates.
e. You might add both computed sets (possibly having associated properties) and computed properties of fundamental sets.
f. Computed restrictions and uncontrolled redundancy of fundamental objects should be banned!
For example, let us consider the E-RD from Fig. 2.22 (see also, for
example, the standard MS Northwind demo db), consisting of object sets:
• CUSTOMERS,
• CITIES,
• STATES, and
• COUNTRIES,
where:
– every customer has its headquarters in only one city,
– any city belongs to only one state, and
– any state belongs to only one country, and, consequently,
– any headquarter is located in only one state and country.
Both CState and CCountry are computable, hence redundant:
CState = State ° CCity and
CCountry = Country ° CState = Country ° State ° CCity.27
This means that, whenever you also add them to your solution, you
should also add two corresponding additional constraints (CState = State
° CCity and either CCountry = Country ° CState or CCountry = Country
° State ° CCity) to it: otherwise, implausible instances would be possible.
For example, even if, in CITIES, Paris belongs to Ile-de-France, in STATES Ile-de-France belongs to France, and Tennessee to the U.S., in CUSTOMERS Sephora, whose headquarters are actually in Paris, may otherwise, according to such a db instance, be located, for example, in an unknown state of Greenland, in Romania!
Obviously, such a solution is either very costly (and not that much
because of the needed extra db space for storing these two computed col-
umns and time to update their values too—be it automatically, as it should,
or manually, as it shouldn’t—, but especially for enforcing these two extra
needed constraints) or an incorrect one (whenever these two constraints
are not added too).
27 Recall please that "°" is the algebraic symbol for function composition; for example, for any city c, the country to which c belongs may be computed by composing Country and State: (Country ° State)(c) = Country(State(c)).
FIGURE 2.23 Resulting E-RD after breaking the cycles in Fig. 2.22.
FIGURE 2.24 Resulting E-RD after adding controlled redundancies to the one in
Fig. 2.23.
FIGURE 2.25 Resulting E-RD after optimizing controlled redundancies of the one in
Fig. 2.24.
R-DA-8. Fundamental dbs should never store computable data instead of the corresponding fundamental one.
For example, you should never store (for people/living beings/spiritual periods, etc.) Age instead of, say, BirthDate/Year: trivially, Age is computable from BirthDate/Year; moreover, if you added Age instead of BirthDate/BirthYear, you would force users each day/year to check whether or not one or more of its values should be incremented and, whenever it is the case, to do it!
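Computing Age on demand is trivial; for example, in generic SQL (a sketch assuming a PEOPLE table with a BirthYear column; the exact date functions vary between RDBMSs):

    SELECT x, BirthYear,
           EXTRACT(YEAR FROM CURRENT_DATE) - BirthYear AS Age -- always up to date, never stored
    FROM PEOPLE;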
Computers are like Old Testament gods; lots of rules and no mercy.
—Joseph Campbell
Any kind of restrictions put on free speech would have worse conse-
quences than bullying.
—Colleen Martin (aka Lady Starlight)
R-DA-12. For each object set its maximum possible number of elements
should be added as its first associated restriction.
For example, #Cities values should be between 0 and 1,000 for states/
regions/departments/lands/etc., or 100,000 for countries, or 100,000,000
worldwide, whereas #Countries values should be between 0 and 250.
I admire how she protects her energy and understands her limitations.
—Terry Tempest Williams
R-DA-14. For each object set, there should be at least one compulsory
property (i.e., one that should not accept NULL values), besides the cor-
responding surrogate key. Do not forget to also ban dirty nulls.
R-DA-15.
a. For each object set, except for very rare cases, we should be able to
uniquely identify any of its elements through at least one semantic
(i.e., nonsurrogate) attribute, functional relationship, or aggrega-
tion of attributes and/or functional relationships.
For example, countries should be uniquely identified by their Coun-
tryName, CountryCode, and Capital, states should be uniquely
identified by the product StateName • Country (as no country may
simultaneously have two states having same name), people should
be uniquely identified by their SSN, etc.
b. There are only three exception types to “a.” above:
• Subsets: some of them might not have any semantic key (please note that they may, however, be uniquely identified by the semantic uniquenesses of their corresponding supersets, as they inherit them all). For example, DRIVERS ⊆ EMPLOYEES generally does not have any semantic uniqueness of its own (but it inherits EMPLOYEES' ones, for example SSN).
• Object sets for which it makes sense to overload their surro-
gate (syntactic) keys with some semantics (and, consequently,
it does not make sense anymore to duplicate its values in an
identical semantic property). For example, RABBIT_CAGES
R-DA-17.
a. You should consider object sets in the descending order of their
importance in the corresponding subuniverse.
b. For each object set you should specify its restrictions grouped by their types (always ordering the types consistently; the order of rule types R-DA-12 to R-DA-16 above is preferable).
c. For other restriction types involving several object sets (e.g., Bor-
rowDate ≤ DueReturnDate and BorrowDate ≤ ActualReturnDate
from the case study above), you should place them either in the
restrictions subset of the last involved object set or at the end of the
restriction set, in a dedicated multiset restrictions section.
d. All restrictions of any data model should be uniquely named
within it.
R-DA-18. Except for cases when (preferably generally, not only in that particular subuniverse) object set, property, and restriction names are unambiguous and self-explanatory (e.g., PEOPLE, FirstName, SSN, PEOPLE_SSNuniqueness, etc.), you should always associate substantial, comprehensive, concise, clear, nonambiguous descriptions with object sets, properties, and restrictions alike.
R-DA-19.
a. First establish atomic subdiagrams, only containing one object set and its basic properties (i.e., one rectangle or one diamond – with its underlying object sets too – and all associated ellipses).
b. All subsequent E-RDs should be structural diagrams (i.e., only containing rectangles, diamonds, and arrows, without ellipses).
c. Structural E-RDs should be established by (sub)domains of the universe of discourse (HR, Salaries, Accounting, Finance, Customers, Orders, etc.).
d. Links between these structural E-RDs are naturally ensured by commonly shared object sets (e.g., EMPLOYEES will appear in HR, Salaries, and Orders alike).
e. Each E-RD should fit on one page.
f. Rectangles, diamonds, ellipses, arrows, and lines should not intersect each other.
For example, the E-RD from Fig. 2.15 could be split into nine atomic
subdiagrams and one structural E-RD, as presented in Figs. 2.27–2.37.
The person who reads too much and uses his brain too little will fall
into lazy habits of thinking.
—Albert Einstein
Surrogate keys are natural injections defined on object sets (not more than one per set) having no semantics other than uniquely identifying their elements: ∀O∈Ω, there may be only one surrogate key x : O → NAT(n), n natural; other authors name them ID or OID or IDO or #O, etc. Obviously, surrogate keys are attributes.
Any restriction set C associated to an E-RD D is a nonempty set of closed Horn clauses whose first order predicates only refer to sets and mappings of D.
Any (attribute) range restriction c∈C is, in fact, specifying the codomain of a mapping a∈A (c = "range a is C", where dom(a) = O∈Ω, is equivalent to a : O → C).
Any (attribute) compulsory restriction c∈C is stating that the codomain of a mapping m∈M is disjoint from NULLS (c = "compulsory m" is equivalent to codom(m) ∩ NULLS = Ø).
Any uniqueness restriction c∈C is stating that a mapping m∈M is one-to-one or that a mapping product m1 • … • mn, mi∈M, ∀i∈{1, …, n}, n > 1 natural, is minimally one-to-one. Note that, unfortunately, some authors and db "experts" wrongly consider one-to-one "relationships" as also being onto.
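For instance, the uniqueness restriction (R33) from Section 2.8 may be sketched as such a closed Horn clause (minimality additionally requiring that no proper subproduct of Room • Weekday • StartH be one-to-one):
(∀x, y ∈ SCHEDULES)(Room(x) = Room(y) ∧ Weekday(x) = Weekday(y) ∧ StartH(x) = StartH(y) ⇒ x = y)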
From the E-RDM point of view, the relationship-type object set UNDER-
LYING_SETS from Subsection 2.9.2 above is correctly used: any object
set may be the underlying set of several non-functional relationship-type
sets and any non-functional relationship-type set has several (at least two)
underlying object sets, so this is a “typical” so-called many-to-many rela-
tionship-type set.
In fact, modeled as such, it is not even a set, as it should accept duplicates: any non-functional relationship-type set may have a same object set as underlying set several times.
2.12 CONCLUSION
Db design should start with a thorough data analysis and modeling, for
correctly identifying all needed object sets, their properties, as well as the
rules governing the corresponding subuniverse of discourse. This process
should result in an E-R data model—that is a set of E-RDs, an associ-
ated restriction set, and an informal but clear, concise, and comprehensive
description of the corresponding subuniverse.
Not only should you never start creating the tables of a db without a proper data analysis; moreover, during data analysis you should never think in terms of db tables, but in terms of object sets, their properties, the relationships between them, and the business rules that govern the corresponding subuniverse.
Algorithm A0 from Section 2.9, together with the best practice rules from Section 2.10, provides a framework for assisting this process. As also seen from the case study presented in Section 2.8, this process is an iterative one. In the end, customers and db designers should agree (preferably formally) on all three deliverables of this process.
Obviously, this process is greatly favored by abstraction power, knowl-
edge of the subuniverses of discourse semantics and governing rules, com-
munication skills (in order to both find out from customers exactly what
they need and everything relevant you don’t know, but they do, as well
as to adequately present your deliverables as to convince and get from
your customers the final agreements), and experience (i.e., a library of best
solutions to as many subuniverses as possible: contributing to this library
is also one of the main goals of this book).
Two golden rules are crucial to the success or failure of data analysis:
• Never overload object sets with several semantics.
• Discover and include in your models not only all needed object
sets, properties of and functional relationships between them, but
also all business rules that are governing the corresponding subuni-
verses of discourse.
Although further refinements are generally needed, as shown in the
second volume, a corresponding relational db scheme and an associated
(possibly empty) set of non-relational constraints (that will have to be
enforced either through db triggers or by db applications) may already be
obtained from these deliverables by applying algorithm A1-7 introduced
in Section 3.3 and also exemplified in Section 3.4.
Dually, when you have to work with an unknown db for which there is no such documentation, by using the reverse engineering algorithm REA1-2 introduced and exemplified in Section 3.5 you will benefit a lot, especially from the structural E-RD and restriction set that you will obtain, which always provide a very useful overview of both the set of all existing inter-object-set connections and the business rules governing that subuniverse.
Hopefully, the E-R data model of the E-RDM introduced in Section
2.9 makes understanding of this data model simpler. The math behind this
E-RDM kernel (presented in Section 2.11) is very straightforward, while
providing a deeper understanding of this simple, yet powerful data model.
As many wrongly assume that one-to-one relationships are also onto, as using many-to-many ones proves dangerous (see Subsection 2.11.2) and not fundamental (see Section 2.10 of the second volume), and as many-to-one and one-to-many ones may be misleading too (as, in fact, they are both ordinary mappings, with only the arrow direction differing between them), we strongly recommend getting rid of them all and always thinking in data modeling only in terms of atomic object sets and mappings defined on them.
2.13 EXERCISES
In any situation, the best thing you can do is the right thing;
the next best thing you can do is the wrong thing;
the worst thing you can do is nothing.
—Theodore Roosevelt
2.0. Apply algorithm A0 and the above best practice rules to design the
E-RDs, restriction set, and description for the following simplified pay-
ments subuniverse: payments to suppliers (names and locations) by cus-
tomers (names and locations) are done for invoices (supplier, customer,
number, date, total amount) either cash, with credit/debit cards (type,
credit/debit, number, expiration month and year, holder, security code,
account), or through bank wires; payment documents are, correspond-
ingly, receipts, bank payment confirmations and orders; for all of them,
supplier, customer, number, date, and total amount are needed; moreover,
the paid total amount of a payment may represent the sum of several par-
tial invoice payments; for bank payments also needed are involved bank
accounts (bank subsidiaries’ addresses and cities, holders, number, IBAN
code, available amount); suppliers may be customers too; there may not be
two cities of a same country having same zip code.
Solution:
Banks may also be suppliers and/or customers, so we abstract them
too as partners; as not all suppliers and/or customers are banks, we add to
PARTNERS a Boolean property Bank? too.
For simplicity, we assume that each bank account has only one holder
and at most one associated card valid in any given time period, and that
any card is valid at least one year.
Obviously, total amounts on both invoices and payment documents are
computable (for payment documents, from corresponding amounts in pay-
ments; for invoices, from corresponding subtotal amounts in invoice lines,
not figured here as it does not matter from the payments point of view);
however, in order to speed up querying, we have added two computed *TotAmount properties, one for INVOICES and the other for PAYM_DOCS.
The corresponding description of this subuniverse is as follows:
1. Payments to suppliers by customers are done for invoices (supplier,
customer, number, date, all of them compulsory) either cash, with
How you think about a problem is more important than the problem itself
– so always think positively.
—Norman Vincent Peale
deteriorated, etc.). Establish E-R subdiagrams for each object set and one
or several structural ones, and update correspondingly restriction set, and
description of this augmented subuniverse.
2.2. Extend the public library db for also managing journals, maga-
zines, newspapers, movies, music, etc., both in printed and electronic edi-
tions. Update correspondingly E-RDs, restriction set, and description of
this extended subuniverse.
2.3. The subuniverses from Figs. 2.5, 2.12, 2.17, 2.18, 2.23–2.25, as
well as the one mentioned in Section 1.6.
2.6. Add to the E-R data metamodel presented in Section 2.9 unique-
ness restrictions made out of several mappings, as well as all existing
restrictions of other types.
2.7. Design an E-R data metamodel for the original E-RDM.
2.8. Add to Exercise 2.0 products (unique name and unit price) and
invoice details (invoice, line number, product, unit price, quantity, and
total amount).
2.9. Characterize algorithm A0, proving that its complexity is linear in the sum of the object set collection, property, and restriction set cardinals.
2.10. Analyze the data models proposed by http://www.databaseanswers.org/data_models/index.htm and establish E-R data models for each of them.
2.11. Extend problem 2.0 considering that a bank account can have
several holders and several associated cards simultaneously valid, one for
each such holder.
The E-RDM, introduced by Chen (1976), was the first hugely successful higher level conceptual data model, not only still in use today, but, very probably, to be at least as used and revered in the foreseeable future too.
Each year there are dozens of E-RDM international conferences,
mainly proposing new extensions to it (e.g., the International Conferences
on Conceptual Modeling (ER), see, for example, http://www.informatik.
uni-trier.de/~ley/db/conf/er/index.html).
Besides the thousands of papers on it, there are dozens of good books
too; for example, one of the “bibles” in E-RDM is Thalheim (2000).
A recent book in this field is Bagui and Earp (2012).
The math behind E-RDM (Section 2.11 of this chapter) follows the E-RDM algebraic formalization from Mancas (1985), which also extended E-RDM with hierarchies of relationships, refined in Mancas (2005).
Even if E-RDM is so simple and with very limited constraint expres-
sive power, it is still heavily used in conceptual data modeling, generally
as a first step, due to its graphical suggestive power.
E-RDs (also abbreviated ERDs) too are ancestors of UML dia-
grams (see http://www.iso.org/iso/iso_catalog/catalog_tc/catalog_detail.
htm?csnumber=32624 and http://www.omg.org/spec/UML/2.4.1/Super-
structure/PDF/), both being types of organograms.
Most RDBMSs offer their own variants of E-RDs, at least for visualizing links between db tables. For example, MS Access provides among its db tools the so-called Relationships window; its MS SQL Server equivalent is the Database Diagrams one; IBM Data Studio (http://www.ibm.com/developerworks/downloads/im/data/index.html) and Oracle SQL
KEYWORDS
•• attribute
•• attribute range restriction
•• Barker E-RD notations
•• Chen E-RD notations
•• compulsory data restriction
•• computed data
•• computed objects
•• controlled data redundancy
•• data analysis
•• data computability
•• data modeling
•• E-R data model (E-RDM)
•• E-RD architecture
•• entity
•• Entity-Relationship (E-R)
•• Entity-Relationship Diagram (E-RD)
•• functional relationship
•• implausible restriction
•• Key Propagation Principle (KPP)
REFERENCES
Bagui, S., Earp, R. (2012). Database Design Using Entity-Relationship Diagrams, 2nd Edition. CRC Press, Boca Raton, FL.
Chen, P. P. (1976). The entity-relationship model: Toward a unified view of data. ACM
TODS 1(1), 9–36.
CONTENTS
3.1 First Normal Form Tables, Columns, Constraints, Rows,
Instances.......................................................................................118
3.2 The Five Basic Relational Constraint Types............................... 123
3.3 The Algorithm for Translating E-R Data Models into
Relational Schemas and Non-Relational Constraint Sets (A1-7):
An RDM Model of the E-RDM.................................................. 146
3.4 Case Study: The Relational Scheme of the Public Library
Data Model.................................................................................. 152
3.5 The Reverse Engineering Algorithm for Translating Relational
Schemas into E-R Data Models (REA1-2).................................. 156
3.6 The Algorithm for Assisting Keys Discovery (A7/8-3).............. 164
TABLE 3.1B RDM 1NF Table Storing Data from Which Cross-Tabulation in
Table 3.1a May Be Computed
28 But there are both theoretical (e.g., O-O extensions of RDM) and implementation (e.g., MS Access 2010 "multiple values" combo and list boxes) approaches of nested table schemes too.
Also note that RDM's relations (derived from the homonymous ones of set algebra) may also be represented using graphs or lists, etc., but, fortunately, their most common representation uses tables. Historically, in RDM tables are called relations, columns are called attributes, and rows are called tuples (or n-tuples).
Table headers (schemas) contain both the corresponding finite set of
column names (which should be unique across any table) and a finite set
of relational constraints. The number of columns of a table (its arity) is a
natural strictly positive number. Table rows (if any) are making up corre-
sponding table (data) instances. Note that, abusively, relation sometimes
designates in RDM only the graph of a relation (i.e., its corresponding
table instance). The number of rows of a table (or its cardinal) is a natural
number. The union of all table headers of an rdb constitutes its scheme; the
union of all table instances of an rdb constitutes its instance.
Figure 3.1 summarizes basic notion equivalences between these three
formalisms’ notational conventions.
For example, in Fig. 3.2, COMPOSERS has arity 5 and cardinality 8;
MUSICAL_WORKS has arity 6 and cardinality 9; TONALITIES has arity 2
and cardinality 6; in total, this rdb instance has cardinality 8 + 9 + 6 = 23.
Parentheses after table names, as well as the second and third table header rows, contain the associated relational constraints:
Ø in the parentheses after the table names are listed the key (uniqueness) constraints, stating that corresponding columns or concatenations29 of columns do not accept duplicate values (in this example: in table COMPOSERS there may not be two rows having same values in columns FirstName and LastName; in table MUSICAL_WORKS there may not be two rows having same values in columns Composer, Opus, and No; in table TONALITIES there may not be two rows having same values in column Tonality); consequently, they may be used for unique identification of corresponding instance rows;
Ø note that each table has a so-called primary key, denoted x in this example (and almost always in this book: Subsection 3.2.3.1 explains why); primary key names are underlined; unfortunately, RDM allows declaring as primary any key, which you should never do (see why in Subsection 3.2.3.1); also note that all tables should have all existing constraints, so all existing keys, which means that, with a couple of exceptions to be explained and exemplified in Subsection 3.2.3.1, all tables should have at least one nonprimary key as well: for example, if Tonality were not a key in TONALITIES, instances of this table might store, for example, "G major" twice (or any number of times, thus violating axiom A-DI17 from Section 5.1.2);
29 Formally, f • g denotes the mapping (Cartesian) product of mappings f and g (see Appendix).
Ø after these parentheses are listed the check constraints, each of which is stating that the corresponding formula should hold for every corresponding instance row (in this example, BirthYear + 3 < PassedAwayYear states that every composer has to live at least 3 years, as it is not plausible that somebody could compose before being 3, while PassedAwayYear − BirthYear < 120 states that no composer may live longer than 120 years);
Ø in the first header line are listed the names of the columns (attri-
butes, mappings);
Ø in the second header line are listed both the domain constraints and the foreign keys:
✓ domain constraints specify column range restrictions (in this example, all x column values are naturals auto-generated by the system: for COMPOSERS having at most 8 digits, for MUSICAL_WORKS at most 12, and for TONALITIES at most 2; FirstName and LastName may only be character strings using the ASCII alphabet, having length at most 32; PassedAwayYear may only be a natural number between 1200 and the current year; etc.);
✓ foreign keys specify that corresponding columns should take values only among the values of other columns (in this example, there are only two such foreign keys, both in the MUSICAL_WORKS table: Composer has to take values only among those of column x in table COMPOSERS, while Tonality has to take values only among those of column x in table TONALITIES); foreign keys are interrelating tables or same table's instances; if intelligently used, always only like in the MUSICAL_WORKS table (and everywhere throughout this book).
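Putting the above together, the scheme of MUSICAL_WORKS from Fig. 3.2 might be declared in generic SQL roughly as follows (a sketch only: the remaining columns of the table are omitted, auto-generation of x is left out, the types and ranges are read off the figure, and whether nulls in No are accepted within the UNIQUE constraint depends on the RDBMS, per the discussion of nulls below):

    CREATE TABLE MUSICAL_WORKS (
      x        NUMERIC(12) PRIMARY KEY,                      -- surrogate key
      Composer NUMERIC(8) NOT NULL REFERENCES COMPOSERS(x),  -- referential integrity
      Opus     VARCHAR(16),
      No       NUMERIC(3) CHECK (No BETWEEN 1 AND 255),
      Tonality NUMERIC(2) REFERENCES TONALITIES(x),
      CONSTRAINT MUSICAL_WORKS_K UNIQUE (Composer, Opus, No) -- semantic key
    );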
FIGURE 3.3 Equivalences between informal, RDM, and math relational constraint
types terminology.
30 Where the Cartesian product is not commutative.
The shallow consider liberty a release from all law, from every constraint.
The wise man sees in it, on the contrary, the potent Law of Laws.
—Walt Whitman
Very often, it is the case that some values are not applicable or, at least
temporarily, are unknown. In table MUSICAL_WORKS of Fig. 3.2, for
example, there are no values in some rows for the No columns: some
musical works have both an opus and number (within that opus), like, for
example, Chopin’s Nocturne in G major opus 37 no. 2; others do not have
a number (as there is only one musical work within that opus), as it is the
case for all other musical works in Fig. 3.2—for which this number is not
applicable. On the other hand, if we were, for example, to add to COMPOSERS Vladimir Cosma (born in 1940 and, thank God, still alive at the time of finishing this volume), his passing away year, although applicable, remains unknown.
Conventionally, such values are called nulls. The ANSI report (ANSI/
X3/SPARC, 1975) distinguishes between 14 types of nulls; most of us
consider, however, that there are only three basic types (the rest being
either particular cases of these three or only due to implementation con-
siderations): nonexistent (inapplicable), (temporarily) unknown, and no-
information (the logical or between the first two: it is not known whether
such a value exists or, if it were existing, nothing is known on its nature).
Examples for this last type do not abound; my favorite one is on
Moldavian king Stephen the Great (1433–1504) and the Russian lan-
guage: did he read Russian? As Russians existed at that time, even if it
took another two centuries to become Moldavia’s neighbors, it is possi-
ble that the answer would be affirmative; however, no proof exists for or
against it, and the chances to find at least one in the foreseeable future are
almost negligible.
Consequently, in dbs we have to accommodate null values too. However, constraints should also be placed on their usage: what would be the meaning of a row having only nulls (even if except for its autonumber primary key)? Obviously, at least one non-surrogate key column per each table should not accept nulls: not-null (mandatory, compulsory, required, totality) constraints are used to specify such columns.
Please note that there is great misunderstanding about null values: some DBMSs (e.g., MS SQL Server) wrongly assume that there is only one null value, so columns accepting nulls are not allowed in unique constraints; others (IBM DB2, Oracle, and even MS Access) correctly assume that there is a countably infinite number of nulls (so they accept nullable columns into keys).
One final note on dirty nulls—that is, nonempty text strings made out of only non-printable/displayable characters (spaces, for example): obviously, they should be banned too, but, unfortunately, very few DBMSs do it (and sometimes, even when they may be banned, as, for example, in MS Access, this is not the default setting). Consequently, whenever a DBMS provides this facility, we should always use it for all columns, regardless of whether they accept nulls or not (what would be the sense of allowing dirty nulls even in columns that accept nulls?); whenever a DBMS does not provide this facility, we should enforce dirty null banning through the software applications managing the corresponding dbs.
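Where the DBMS offers no built-in setting for this, banning dirty nulls reduces to one more CHECK constraint per text column; a generic SQL sketch (FirstName is used just for illustration, and note that some RDBMSs, e.g., Oracle, already treat '' as null):

    ALTER TABLE COMPOSERS
      ADD CONSTRAINT NoDirtyFirstName
      CHECK (FirstName IS NULL OR TRIM(FirstName) <> ''); -- rejects all-blank strings, keeps genuine nulls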
The truly important things in life –love, beauty, and one’s own uniqueness–
are constantly being overlooked.
—Pablo Casals
table and for each of its rows we would store a unique type name (plus
description, total number of chairs per type or, in other tables, per type and
room/apartment/building, if needed).
A key constraint is a statement of the type “C1 • … • Cn key”, where
n > 0 is a natural, Ci are columns of a table T, “•” denotes concatenation31
(which is, generally, omitted) of these columns and key means minimally
unique32—that is it is unique and it does not include any other key (see, for
example, Fig. 3.2). When n = 1, the key is called simple; when n > 1 it is
called concatenated; if a unique column concatenation properly contains a
key (i.e., the included key has smaller than n arity), then it is not a key (as
it is not minimal), but a superkey; note that superkeys are of no actual, but
only theoretical interest.
For example, in Fig. 3.2, Tonality is a simple key of TONALITIES,
while FirstName • LastName and Composer • Opus • No are concatenated
keys of COMPOSERS and MUSICAL_WORKS, respectively. FirstName
• LastName • BirthYear, FirstName • LastName • PassedAwayYear, and
FirstName • LastName • BirthYear • PassedAwayYear are all superkeys33.
FirstName • LastName : COMPOSERS → ASCII(32) × ASCII(32) is,
in this context, one-to-one (as all “old” composers had unique such pairs
and today no composer would dare to take both first and last names of an
“old” one and even if a composer were having, by chance, identical first
and last names as another today’s one, he/she would at least add a surname
in order to differentiate him/herself) and also minimally one-to-one (as
neither of its subproducts are one-to-one: FirstName, as there are several
composers having same first name –for example Franz Schubert and Franz
Liszt– and LastName, as there are several composers having same last
name—for example Johann Strauss and Richard Strauss).
Similarly, Composer • Opus • No : MUSICAL_WORKS → [0,
99999999] × ASCII(16) × [1, 255] is one-to-one too (as, by definition,
musical works are uniquely identified by their composer, opus (catalog)
and number) and also minimally one-to-one (as neither of its subproducts
are one-to-one: Composer • Opus, as there are several works by a same
composer having same opus but different numbers, Composer • No, as
there are several works by a same composer having same number but different
31 Mathematically, this means mapping (Cartesian) product.
32 That is, minimally one-to-one (see Appendix).
33 Recall that if f : A → B is one-to-one then, for any g : A → C, f • g is one-to-one too (see Appendix).
opuses, and Opus • No, as there are several works having same opus
and number, but different composers).
Only keys need to be declared, not superkeys34: for example, if instead
of FirstName • LastName you declare FirstName • LastName • BirthYear
as a key, then implausible data might be stored (e.g., a second Antonio
Vivaldi, born in 1927 or any other valid year than 1678, could be stored
too); if FirstName • LastName and FirstName • LastName • BirthYear
are both declared as keys, then no implausible data might be stored, but
updates to COMPOSERS would be slower (and the db would need addi-
tional disk and memory space to store and process the superkey FirstName
• LastName • BirthYear), as the system would enforce the superkey too.
Although RDM does not require it, it is a best practice to declare for each table a primary key: a key that does not accept nulls and which is the one referenced by default by foreign keys35. Conventionally, its name is underlined.
Unfortunately, RDM allows users to declare any key as being the
primary one; consequently, it is a widespread bad practice to declare as
primary keys concatenated ones and/or keys containing columns that are
non-numeric. For example, a vast majority of db “experts” would declare
FirstName • LastName as COMPOSERS’ primary key (and would not add
x to this table); consequently, they would need FirstName • LastName
as a foreign key in MUSICAL_WORKS (referencing COMPOSERS) and
declare FirstName • LastName • Opus • No as its primary key (not adding
x to this table either).
There are several disadvantages of this approach:
34 Unfortunately, most RDBMSs (including, for example, Oracle, MS SQL Server and Access, etc.) allow for enforcement of both keys and superkeys!
35 Moreover, for most RDBMS versions, table instances are stored by default in the order of their primary keys, if any.
become clear in Section 3.6 and especially in the second chapter of the
second volume.
There are only two things in the world: nothing and semantics.
—Werner Erhard
Note that, unfortunately, there are lots of dbs whose fundamental tables do not have keys (and, generally, no other constraints either); others always have only the primary one (generally, of surrogate type). Any table should have all keys existing in the subuniverse modeled by the corresponding db, just like it should have any other existing constraint: a table lacking an existing constraint allows for implausible data storage in its instances. All nonprimary keys are called semantic (candidate36) keys.
For example, let us consider table COUNTRIES from Fig. 3.4, where all
columns except for Population are keys: x, the surrogate primary one, by
definition; Country, as there may not be two countries having same names;
CountryCode, as there may not be two countries having same codes37;
finally, Capital, as there may not be two countries having same city as their
capitals.
If we were not declaring, for example, Capital as a key too, then
users might have entered 4 (instead of 5) in this column for the fifth row
too, storing the highly implausible fact that Washington D.C. is not only
U.S.A.’s capital, but also Romania’s one.
Please note that uniqueness is not an absolute, but a relative property:
in some contexts a column or concatenation of columns are unique, while
in others (even in a same subuniverse) they are not.
For example, in the subuniverse modeled by Fig. 3.4, where only big
cities are considered, there may not be two cities having same name in a
same state; if we needed to store villages and small cities too, this product (City • State) would no longer be unique, as there may be several such villages or even small cities of a same state having same name.
36 Unfortunately (from the semantic point of view), in RDM candidate means that any of them might be chosen as the primary one.
37 See ISO 3166 codes.
In this case, for example, in most of Europe (where states are called
regions, lands, departments, cantons, etc., depending on countries) there
are three rules:
Ø villages and small cities may be subordinated to bigger ones, gen-
erally called “communes” and there may not be two cities of a
same commune having same name;
Ø there may not be two communes of a same state having same
names;
Ø there may not be two identical zip codes in a same country (where
small cities have one zip code, bigger cities none, but their street
addresses have one too).
Figures 3.4 and 3.5 show the corresponding CITIES tables.
Another, even worse example (as the subuniverse is strictly the same),
is computer file names: almost everybody knows that “there may not be
two files having same name and extension in a folder”, which means that
the triple FileName • FileExt • Folder is a key; in fact, OSs enforce a
supplementary constraint too: there may not be two files in a same folder
having null (no) extensions and same name, which means that for the sub-
set of files without extension FileName • Folder is a key (and FileName •
FileExt • Folder is a superkey).
Consequently, first of all, semantic keys, just like any other type of fundamental (i.e., not derived) constraints, may only be discovered and declared by humans: there will never be any tools, be they hardware, software, conceptual, etc., able to do such a job. Moreover, keys especially are not at all easy to discover: besides their relativeness, as the number of columns increases, the number of their products (theoretically, very many of them being possible keys) increases exponentially. This is why Section 3.6 presents an algorithm for assisting their discovery.
Please note that, dually, again just like for any other constraint type, you should never assert keys that do not exist in the corresponding subuniverse: if you did, you would aberrantly prevent users from storing plausible data. For example, if in any table of Figs. 3.4 and 3.5 you were declaring Population as a key too, then no two corresponding objects (countries, states, communes, or/and cities) having same population figures could be stored simultaneously.
You can use all the quantitative data you can get, but you still have to
distrust it and use your own intelligence and judgment.
—Alvin Toffler
FIGURE 3.5 Tables COMMUNES for big cities and CITIES for small ones.
For example, in Fig. 3.5, columns x from both tables, Commune and State
from COMMUNES, as well as City, Commune, *Country, and Zip from CIT-
IES are prime, whereas columns Population from both tables are nonprime.
Just like RDM itself, defined as such, primeness and nonprimeness are purely syntactic concepts: they apply only after the set of table keys is established.
As it will become clear why in (E)MDM (see the second chapter of the
second volume), we need in fact slightly different semantic definitions for
them, to be applied before the set of all keys for a table is established: a
Links between tables, as well as those between a table and itself, are done
by foreign keys: pointer-type columns whose values should always be
among the values of the corresponding referenced columns. For example,
from Fig. 3.2, Composer and Tonality in MUSICAL_WORKS, from
Fig. 3.4, Capital in both COUNTRIES and STATES, Country in STATES,
and State in CITIES, and from Fig. 3.5, State in COMMUNES and Commune
and *Country in CITIES are all foreign keys; for example, by its definition,
values of *Country are always among the values of STATES.Country,
which, in their turn, are always among the values of COUNTRIES.x.
This is why, on the second table header row, instead of a corresponding
domain constraint, for all foreign keys the so-called referential integrity
constraint is shown.
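In ANSI SQL terms, such a referential integrity constraint is declared with a FOREIGN KEY ... REFERENCES clause; here is a minimal sketch for Country in STATES (the column sizes and the constraint name sfkCountry are assumptions, not the book's figures):
CREATE TABLE STATES (
   x INT PRIMARY KEY,                 -- surrogate key
   State VARCHAR(64) NOT NULL,
   Country INT NOT NULL,              -- foreign key
   CONSTRAINT sfkCountry FOREIGN KEY (Country)
      REFERENCES COUNTRIES(x)         -- values must be among COUNTRIES.x ones
);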
Foreign keys arise from the following Key Propagation Principle (KPP),
explained in what follows by an example (also see Sections 2.4 and
3.12): let us consider the tables COUNTRIES and STATES from Fig. 3.4,
except for the foreign key column Country of STATES. In order to be able
to link data from these tables, we should store for each state the country
to which it belongs, that is, the graph of the mapping Country : STATES →
COUNTRIES (as each state belongs to one and only one country).
As for any finite mapping f : A → B, the first (brute force type) idea
of storing its graph is a table with two columns: x (from A) and f(x) (from
45. Probably the most famous example is MS Access: when programmatically creating a foreign key, the default is "ON DELETE RESTRICT", whereas when interactively creating it, for example, with its Lookup Wizard (but not only), the default is "ON DELETE NO ACTION".
B). Table 3.2 shows this solution for the graph STATES_COUNTRIES of
Country : STATES → COUNTRIES, where, for optimality reasons, State
references STATES.x and Country references COUNTRIES.x.
From both A and B, keys should be chosen for storing f's graph (in
order to uniquely identify any element of both A and B). As all keys
of a set are equivalent (see Exercise 3.42(iv)), for any mapping f, any key
from A could be chosen as x and any key from B could be chosen as f(x).
For optimality reasons, however, the best choice for them is always the
two involved surrogate primary keys.
When we add this table to the rdb scheme in Fig. 3.4, it is almost impossible
not to notice that both of its columns x and State are redundant, as they
duplicate the values of column x from STATES. Consequently, an equivalent
but much more elegant and optimal solution to this data modeling
problem is simply adding the foreign key Country from STATES_COUNTRIES
to table STATES, instead of adding the three columns of STATES_COUNTRIES,
out of which two are redundant.
This process is called key propagation for obvious reasons: for any
such mapping (i.e., functional relationship between two tables), according
to KPP, instead of adding a new table, it is better to propagate into the
domain table46 a key from the codomain one47. This is also why propagated
columns are called foreign keys: generally, they take values from keys of
other (foreign) tables.
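For example, the two solutions above might be sketched in SQL as follows (a sketch only; table and column names follow Fig. 3.4 and Table 3.2, while the data types are assumptions):
-- brute force: a separate table storing the graph of
-- Country : STATES → COUNTRIES
CREATE TABLE STATES_COUNTRIES (
   x INT PRIMARY KEY,
   State INT NOT NULL REFERENCES STATES(x),     -- redundant
   Country INT NOT NULL REFERENCES COUNTRIES(x)
);
-- KPP: propagate the codomain's key into the domain table instead
ALTER TABLE STATES ADD Country INT REFERENCES COUNTRIES(x);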
Note that "foreign" in the "foreign key" syntagm is a false friend: when
applying KPP to auto-mappings (i.e., to mappings defined on and taking
values from the same set), the corresponding foreign keys reference a key
from the same table (not from a "foreign" one).
For example, applying KPP to ReportsTo : EMPLOYEES → EMPLOYEES
results in the "foreign" key ReportsTo of table EMPLOYEES, which
references its x column, as shown by Table 3.3, where, obviously, the first
employee does not report to anybody, the second one reports to the first
one, whereas the third one reports to the second one.
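A minimal SQL sketch of this auto-mapping implementation (only x and ReportsTo are taken from Table 3.3; any other columns are omitted) might be:
CREATE TABLE EMPLOYEES (
   x INT PRIMARY KEY,
   ReportsTo INT REFERENCES EMPLOYEES(x) -- "foreign" key referencing its own table;
                                         -- nullable: the first employee reports to nobody
);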
46. Generally, the "many" one: in the above example, STATES. For one-to-one mappings, it is, generally, the one potentially having fewer rows than the other, like COUNTRIES (as compared to CITIES) for Capital (see Fig. 3.4).
47. For those including one "many", it is the "one" one: in the above example, COUNTRIES. For one-to-one mappings, it is, generally, the one potentially having more rows than the other, like CITIES (as compared to COUNTRIES) for Capital (see Fig. 3.4).
TABLE 3.2 Brute Force Type Graph Storage Implementation of Mapping Country :
STATES → COUNTRIES
TABLE 3.3 An Example of a Foreign Key Which is Actually Not “Foreign” At All
In fact, the syntagm "foreign key" is a double false friend: not only is
"foreign" a false friend, but "key" is one too, as, generally, foreign keys are
not also (unique) keys.
For example, let us consider table COUNTRIES from Fig. 3.4 augmented
with a foreign key Currency (referencing a table CURRENCIES).
Obviously, Capital is both a key and a foreign key, Country is a key but
not a foreign key, and Currency is a foreign key but not a key (as there are, for
example, several countries having the same currency).
even the current 12c version of Oracle doesn't have anything against it! It
is true that the only harm done by enforcing such trivial inclusions is that
additional time is wasted for nothing, enforcing the corresponding
referential integrity constraint for each row insertion into the corresponding
table.
Moreover, pay attention to inclusion cycles too: recall from the algebra
of sets that, for any sets S1, S2, …, Sn, n natural, S1 ⊆ S2 ⊆ … ⊆ Sn ⊆ S1 implies
S1 = S2 = … = Sn (i.e., all sets involved in a cycle of inclusions are equal).
First of all, equal sets should be relationally modeled by using only one
table for all of them; but what is much, much worse is that most RDBMSs
(including, for example, MS Access 2010 and Oracle 12c) allow
you to define such cycles (even when they only involve two tables!),
which results in not being able to insert any line in any of the involved tables.
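For example, here is a sketch of such an ill-advised two-table cycle (hypothetical table and column names), which most RDBMSs accept as long as both foreign keys are mandatory:
CREATE TABLE A (x INT PRIMARY KEY, b INT NOT NULL);
CREATE TABLE B (y INT PRIMARY KEY,
   a INT NOT NULL REFERENCES A(x));
ALTER TABLE A ADD CONSTRAINT AbFK
   FOREIGN KEY (b) REFERENCES B(y);
-- from now on, no row may be inserted into either A or B:
-- each insertion requires a preexisting row in the other table.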
The more constraints one imposes, the more one frees one’s self.
—Igor Stravinsky
48. The set of db standard operators always includes the corresponding math ones, plus some additional ones, either derived from them, like between ... and, or generally coming from regular expression manipulation, such as like (which uses, just like in OSs, meta-characters too: '%' or '*' for any character string, including empty ones, etc.).
There are truths that you cannot justly understand without previously
experiencing some errors.
—Lucian Blaga
The algorithm is shown in Figs. 3.7–3.1149. Note that A1–7 assumes the
following: the corresponding rdb exists and there are no duplicated object
set names or mapping names defined on the same object set.
49. Its complexity is linear in the sum of the object sets collection, property, and restriction set cardinals, as proved in Section 3.12 (see Proposition 3.4).
FIGURE 3.7 Algorithm A1–7 (Translation of E-R data models into corresponding
relational schemas and non-relational constraint sets).
Section 3.4 applies algorithm A1–7 to the public library E-RD and
restriction set designed in Section 2.8.
Solved exercise 3.0 provides another example of its application (to a
subuniverse of invoice payments).
By applying this algorithm to the E-R data metamodel of the E-RDM
presented in Section 2.9 above, the RDM model of the E-RDM shown in
Fig. 3.12 is obtained (regardless of whether it is applied to the E-RD
from Subsection 2.9.1 or to the one from Subsection 2.11.2).
Note that the unlabeled autofunctions of OBJECT_SETS, DATA_TYPES,
and RANGES, which model the partial ordering of the elements
of these sets by inclusion, have all been labeled Superset, because
their graphs store, for each such element, the corresponding smallest
superset that includes it.
For example, ASCII(16) ⊂ ASCII(32) ⊂ ASCII(64) ⊂ ASCII(128) ⊂
ASCII(255), [2, 16] ⊂ [1, 16] ⊂ [1, 255] ⊂ NAT(38) ⊂ INT(38) ⊂ RAT(38,
2), CURRENCY(38) ⊂ RAT(38, 2), [–2500, Year(SysDate())] ⊂ INT(38),
[6/1/2011, SysDate()] ⊂ [6/1/2011, SysDate() + 300] ⊂ DATE, etc.
This rdb may be used as the kernel of the metadata catalog (see Section
3.7) of a DBMS providing E-RDM to its users.
The example instance provided corresponds to the public library case
study from Section 2.8 (also see Section 3.4).
Note that there is no need (in this kernel metacatalog) for a table
ENTITY_TYPES, as it would only have an x primary column, which would
duplicate the one of OBJECT_SETS.
The associated non-relational constraint set is made out of the follow-
ing two constraints:
FIGURE 3.12 A RDM model of the E-RDM and a valid instance of it corresponding to
the public library case study from Section 2.8.
1. PERSONS: FName and e-mail should be compulsory for subscribers (RP7)
The hardest thing is to go to sleep at night, when there are so many urgent
things needing to be done. A huge gap exists between what we know is pos-
sible with today’s machines and what we have so far been able to finish.
—Donald Knuth
Figure 3.13 shows the corresponding public library rdb and a possible
plausible instance of it.
Very often, probably much more often than having opportunities to design
a new db, we are faced with a dual problem, of reverse engineering (RE)
FIGURE 3.14 Algorithm REA1–2 for reverse engineering RDM schemas into
E-R data models.
50. Moreover, this implies that E-RDM and functional data models are strictly equivalent in expressive power.
FIGURE 3.16 The E-RD corresponding to the rdb example from Fig. 3.15.
Do the difficult things while they are easy and do the great things while
they are small. A journey of a thousand miles must begin with a single step.
—Lao Tzu
Generally, even db architects and designers design very few new
dbs and much more often modify/use existing ones, which very
rarely have all constraints (and, consequently, all keys as well) enforced.
FIGURE 3.17 The restriction set associated to the E-RD from Fig. 3.16.
FIGURE 3.18 The description corresponding to the rdb example from Fig. 3.15.
As we will see in the second chapter of the second volume, keys should
be designed earlier, on higher conceptual levels, not on the RDM one or,
even lower, on the one of RDBMS-managed rdbs; but better late than never.
It is true that both this chapter and the second chapter of the
second volume also provide reverse engineering algorithms which,
given any rdb scheme, compute corresponding higher conceptual
level schemas, so db designers might first apply them to obtain an
E-R data model, then apply A1 to translate this model into a(n) (E)MDM
scheme, A2 to A6 (which includes A3/3′) to refine it, and then A7, A8, and
the needed algorithm from AF8′, in order to obtain the new rdb scheme
with the keys designed according to A3 or A3′.
51. Depending on RDBMS versions, some data types, generally those needing "huge" storage sizes (for example, memos, pictures, etc.), are not actually allowed to be declared as prime.
07. StateCode is not one-to-one (as there may be two states having
the same code, but from different countries), so neither T′ nor K′ are
modified;
10. StateCode is prime (as there may not be two states having the same
code in the same country), so T′ is not modified;
07. Population is not one-to-one (as there may be two states having
the same population, even in the same country), so neither T′ nor K′ are
modified;
TABLE 3.6 Output Example Table for Applying Algorithm A7/8–3 to Table 3.5
Consequently, A7/8–3 needs to deal with them all; in fact, it will ask users
at most C(m, [m/2]) times whether or not some declared keys are actually
keys, as it will automatically discard the remaining 2^m – 1 – C(m, [m/2])
ones as superkeys.
For example, if, initially, K = { x, Capital, State • Country, State •
Capital, Country • Capital, State • Capital • Country }, immediately after
users confirm that Capital is a key, the algorithm automatically
discards the superkeys State • Capital, Country • Capital, and State • Capital
• Country.
The purpose of models is not to fit the data, but to sharpen the questions.
—Samuel Karlin
Table 3.7 shows the metacatalog table DATABASES and its instance for the
two dbs above: System and BigCities. Note that db names (maximum 255
ASCII characters) are mandatory and unique (for any RDBMS).
This table actually has other columns too (e.g., DateCreated, Pass-
word, Status, etc.) and associated constraints.
Table 3.8 presents the metacatalog table RELATIONS and its instance
for the two dbs above. Note that table (relation) names (maximum 255
ASCII characters) are mandatory and unique within any db. As any col-
umn belongs to only one table and primary keys are columns, the Pri-
maryKey foreign key (referencing the CONSTRAINTS’ primary one) is a
semantic key too. As primary keys are optional, this column is not manda-
tory. The foreign key Database (referencing DATABASES’ primary one) is
mandatory, as any table belongs to a db. The first seven rows correspond
to the System metacatalog tables, while the last two to the BigCities tables.
This table actually has many other columns too (e.g., DateCreated,
Tablespace, Status, Cardinal, AverageRowLength, Cached, Partitioned,
Temporary, ReadOnly, etc.) and associated constraints.
Table 3.9 shows the metacatalog table DOMAINS, which stores both
the RDBMS-provided column data types and the user-defined ones, as
well as a partial valid instance of it. Note that domain names (maximum
32 ASCII characters) are mandatory and unique (for any RDBMS). This
table actually has other columns too (e.g., Owner, UpperBound, Length,
CharacterSet, etc.) and associated constraints, as well as very many other
rows.
All seven DOMAINS rows are system ones, as BigCities does not contain
any user-defined domain. Note that INT, RAT, and DATE/TIME
denote the corresponding OS/RDBMS-representable subsets of the
integers, rationals, and calendar dates, respectively, that autonumber is
actually a synonym for INT (the only difference being that its values are
automatically generated by the system and are unique for any primary
key), and that BOOLE is the set {True, False}.
ASCII and UNICODE are subsets of the sets55 of strings built upon
these two character sets and having a maximum (RDBMS dependent)
allowed length.
This table actually has other columns too (e.g., minValue, maxValue,
etc.) and associated constraints as well.
Table 3.10 presents the metacatalog table ATTRIBUTES and its instance
for the two dbs above (System and BigCities). Note that column (attribute)
55. Mathematically, the freely generated monoids over the corresponding alphabets.
names (maximum 255 ASCII characters) are mandatory and unique within
any table. The foreign key Relation (referencing RELATIONS’ primary
one) is mandatory, as any column belongs to a table.
The foreign key Domain (referencing DOMAINS’ primary one) is
mandatory too, as any column stores values of a desired data type. Total?
is also mandatory, as for any column the system has to know whether or
not to accept null values for it.
Note that columns Domain, DConstr, and Total? store domain
(the first two) and not-null (the third) constraints. Note also that all primary
keys take values in the autonumber subset of the integers, while
foreign keys have integer values. The first twenty-eight rows correspond
to the System metacatalog tables, while the last six to the BigCities ones.
This table actually has very many other columns too (e.g., Position,
DefaultValue, LowestValue, HighestValue, DistinctValues, CharacterSet,
AverageLength, Hidden, Computed, etc.) and associated constraints.
Table 3.11 shows the metacatalog table CONSTRAINTS, which stores
the relational constraints of the remaining three types (‘K’ for (unique)
keys, ‘F’ for foreign keys, and ‘T’ for tuple/check ones), as well as its
instance for the two dbs above (System and BigCities).
Note that constraint names (maximum 255 ASCII characters) are man-
datory and unique within any db. This is why the computed foreign key
*DB (referencing DATABASES’ primary one) is needed; its graph is com-
putable56.
Note that, except for the key, no constraints are needed on it. The
foreign key ConstrRelation (referencing RELATIONS’ primary one) is
mandatory, as any constraint belongs to a table scheme. ConstrType is
mandatory too, for partitioning the table’s instance according to the three
constraint types stored in it. The first twenty-seven rows correspond to the
System metacatalog tables, while the last seven to the BigCities ones.
This table actually has many other columns too (e.g., Generated, Status,
Deferrable, Deferred, Validated, Invalid, etc.) and associated constraints.
56. Mathematically, *DB = Database ° ConstrRelation. Relationally:
SELECT CONSTRAINTS.x, Database
FROM CONSTRAINTS INNER JOIN RELATIONS
ON CONSTRAINTS.ConstrRelation = RELATIONS.x
No foreign key should contain the same column more than once: hence,
FKAttribute • ForeignKey is also a semantic key of this table. For example, its
first row (x=0) stores the foreign key Database (RelationsDatabaseFK,
constraint 17, attribute 5, position 1) of system table 1 (RELATIONS), which
references attribute 0 (x), the autonumber (domain 0) primary key (#DatabasesPK,
constraint 0) of table 0 (DATABASES). The first ten rows correspond
to the System metacatalog keys, while the last two to the BigCities ones.
This table too may actually have other columns and associated con-
straints as well.
Note that some actual RDBMS metacatalog tables are not even in
BCNF (see Subsection 3.10.5). Moreover, surprisingly, some of them do
not even correctly implement all of the RDM concepts.57
57. For example, Oracle, MS SQL Server, and Access do not enforce either C3 or C11, that is, minimal uniqueness (as they allow defining both keys and superkeys) or foreign key acyclicity (as they allow attributes to reference themselves, both directly, except for MS Access, and indirectly).
Also note that RDBMS engines have to enforce, besides the relational
constraints stored in CONSTRAINTS (see Table 3.11), many other non-
relational constraints, including the ones shown in Fig. 3.21.
Note that for also storing tuple constraints you should either add at
least a column to table ATTRIBUTES above or add another kernel table (see
Exercise 3.75).
There are truths so little striking that their discovery is almost creation.
—Lucian Blaga
d. Uniqueness: x,
FKAttribute • ForeignKey (no attribute should take
part more than once in a foreign key), (RFK3)
ForeignKey • Position (no position of a foreign key
may be occupied by more than one attribute). (RFK4)
7. KEYS_COLUMNS (The set of pairs of type <k, c> storing the fact
that column c is a member of the (unique) key constraint k)
a. Cardinality: max(card(KEYS_COLUMNS)) = 10^9 (RK0)
b. Data ranges: Position: [1, 64] (RK1)
c. Compulsory data: x, Key, Attribute, Position (RK2)
d. Uniqueness: x,
Attribute • Key (no attribute should take part more than
once in a key), (RK3)
Key • Position (no position of a key may be occupied
by more than one attribute). (RK4)
All nonrelational constraints from Fig. 3.21 must be added here too.
For example (for all the following maximum cardinalities), any RDBMS
manages at most 10^4 rdbs, which are characterized by their compulsory
unique name (at most 255 ASCII characters), and provides at most 10^2
data types, which are characterized by their compulsory unique name (at
most 32 ASCII characters).
Any rdb may include several relations, at most 10^8 in total, for all
rdbs, which are characterized by their compulsory name (at most 255
ASCII characters), unique within their rdb.
Any relation should include at least one attribute; attributes are charac-
terized by their compulsory name (at most 255 ASCII characters), unique
within their relation, NOT NULL constraint (does the corresponding column
also accept null values or not?) and data type; optionally, a domain restriction
FIGURE 3.23 ANSI-92 SQL statements for creating the db scheme of Fig. 3.2.
In order to create the db scheme of Fig. 3.2, the statements shown in
Fig. 3.23 are needed (see algorithm A8 from Fig. 4.1).
Please note the following:
0. Actual RDBMS implementations' syntax varies both as compared to
the ANSI SQL standards and even between their own versions. However, for
most of them, ";" ends statements.
1. Auto-generated numbers (autonumbers)58 were introduced in the
ANSI-2003 SQL standard.
2. "--" introduces (i.e., is followed by) comments.
3. As table names should be unique only within db schemas, the full reference
to any table should be of the type db_name.table_name (e.g.,
Classical_Music.COMPOSERS). When the context is unambiguous, the
db names may be skipped.
4. Except for some aggregation ones (e.g., COUNT, SUM, MAX,
MIN, AVG, obviously computing cardinals, total amounts, maximum,
minimum, and arithmetic average values, respectively), there are no
ANSI-92 SQL standard functions; however, SysDate (returning the current
system date) and Year (returning the year part of a calendar date value)
above, as well as many others, are provided by any RDBMS (possibly
slightly renamed)59.
5. CREATE TABLE statements create the corresponding tables and
their columns; for each column, its name is followed by its data type and,
whenever needed, its maximum allowed length (e.g., VARCHAR2(32)
means variable length strings of at most 32 characters; NUMBER(12,0)
means integer numbers (i.e., having 0 digits after the decimal point) using
at most 12 digits; INT(12) … IDENTITY(1,1) means an auto-generated
integer of at most 12 digits, whose values start with and are
incremented always by 1, which is the default60). PRIMARY KEY declares
the corresponding column as being the corresponding table's primary
key, while NOT NULL declares that the corresponding column does not
accept null values.
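For instance, here is a condensed sketch in the spirit of Fig. 3.23 (a hypothetical fragment, not the figure's actual content; PassedAwayYear is an assumed column name):
CREATE TABLE COMPOSERS (
   x INT(12) IDENTITY(1,1) PRIMARY KEY, -- autonumber surrogate key
   FirstName VARCHAR(32) NOT NULL,
   LastName VARCHAR(32) NOT NULL,
   BirthYear NUMBER(4,0) NOT NULL,
   PassedAwayYear NUMBER(4,0)           -- nullable: still-living composers
);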
58. As such, for example, Oracle continued not to provide them, to the despair of most of its users, up until December 2013, when version 12c was released.
59. Note that, for example, Oracle does not allow SysDate either in table constraints or in computed column definitions; consequently, the corresponding constraints may only be implemented through PL/SQL (see Subsection 4.3.2).
60. Implementations sometimes also require explicit start and increment values.
Databases are not used only for storing data, but especially for
retrieving and processing it (e.g., by computing cardinals, totals, averages,
minimums, maximums, etc.). This is why all computer data models
include data manipulation languages. Querying languages are tools for
formulating queries, checking their syntactic and semantic correctness, and
computing their answers over any db. Data manipulation languages
that include them also provide statements for adding, modifying,
and deleting rows in/from tables.
Although data manipulation languages are beyond the scope of this
book, we have to review their basics, as they are needed even in db design,
implementation and, especially, optimization.
Get the facts first and then you can distort them as much as you please.
—Mark Twain
FIGURE 3.24 ANSI-92-type SQL statements for inserting instances of Fig. 3.2 into the
rdb scheme presented in Fig. 3.23.
Caution: if you omit the optional WHERE clause of UPDATE, then all
corresponding rows will be updated. For example, all rows of table
MUSICAL_WORKS would contain 3 in column Tonality after executing the
statement: UPDATE MUSICAL_WORKS SET Tonality = 3;
3. For removing existing data, SQL offers the DELETE statement; for
example, if we wanted to remove Brahms' 2nd piano concerto, we
could execute: DELETE FROM MUSICAL_WORKS WHERE x=8;
if we tried to delete any line in TONALITIES, we wouldn't be
successful, as all of them are referenced by at least one line in
MUSICAL_WORKS.
Caution: if you omit the optional WHERE clause of DELETE, then all
corresponding rows will be deleted. For example, all rows of table
MUSICAL_WORKS would be deleted after executing the statement:
DELETE FROM MUSICAL_WORKS;
Note, however, that this is advisable in Oracle, DB2, and the latest versions
of SQL Server only, because these RDBMSs have internal optimizers that
convert any such Cartesian product followed by WHERE filters into the
corresponding inner join (for which the best computation strategies are also
chosen, depending on current data instances, indexes, etc.). Generally, this is not at all
advisable for RDBMSs which do not have such optimizing capabilities (e.g.,
MS Access and MySQL), as they will first compute the Cartesian product and
then retain from it only the subset satisfying the WHERE filter (which
generally takes much, much more time than the corresponding inner join).
8. Outer joins are of three types: left, right, and full ones.
9. The left (outer) join unions the inner join and all other rows from
its left operand, if any (padded with nulls for the corresponding right
operand's columns). For example, the inner join statement
computes the result presented in Table 3.14, which does not contain
either Bach's Well tempered clavier or Mozart's Don Giovanni, as
they have no tonality; if we also need these two works, then we may
use a left join with TONALITIES instead (Table 3.15 shows its result).
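The two statements themselves are not reproduced in this extraction; plausible sketches over the Fig. 3.2 scheme (assuming TONALITIES stores tonality names in a column also named Tonality) would be:
-- inner join (Table 3.14): works without tonality are dropped
SELECT Title, TONALITIES.Tonality
FROM MUSICAL_WORKS INNER JOIN TONALITIES
   ON MUSICAL_WORKS.Tonality = TONALITIES.x;
-- left join (Table 3.15): all works kept, nulls pad missing tonalities
SELECT Title, TONALITIES.Tonality
FROM MUSICAL_WORKS LEFT JOIN TONALITIES
   ON MUSICAL_WORKS.Tonality = TONALITIES.x;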
10. The right (outer) join is the dual of the left one: it unions the inner
join and all other rows from its right operand, if any (padded with
nulls for corresponding left operand’s columns). For example, the
statement:
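Again, the original statement is lost to this extraction; a plausible right-join sketch (same assumptions as above) would be:
SELECT Title, TONALITIES.Tonality
FROM MUSICAL_WORKS RIGHT JOIN TONALITIES
   ON MUSICAL_WORKS.Tonality = TONALITIES.x;
-- every tonality appears, with Title padded by nulls for unused ones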
63. For example, the set of MUSICAL_WORKS rows may be partitioned by Tonality values into six partitions: one for those that do not have tonality values (here only "The well-tempered clavier" and "Don Giovanni") and five for those that have, namely one for each of the five distinct tonalities (three of them made out of only one row in this example, while the ones corresponding to G major, which comprises Haydn's "Surprise" symphony and Chopin's Nocturne, and to B minor, which includes Schubert's "Unfinished" symphony and Brahms' clarinet quintet, have two rows each). Generally, partitions are the classes of an equivalence relation. In this case, the equivalence relation is "has the same tonality as".
13. The GROUP BY clause may contain several columns from the table
expression of the FROM clause (separated by commas), which
define partition hierarchies. For example, if we were to compute
how many works were composed by each composer in each tonality,
the following statement partitions each partition corresponding to
a composer into as many subpartitions as the number of tonalities
that he/she used (in this example, all partitions are subpartitioned
into only one subpartition, as there is only one work per composer,
except for the one corresponding to Brahms, which is subpartitioned
into two, as there are two works by him, each written in a distinct
tonality):
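The statement itself is not reproduced here; a plausible sketch would be:
SELECT Composer, Tonality, COUNT(*) AS Works
FROM MUSICAL_WORKS
GROUP BY Composer, Tonality; -- one group per <composer, tonality> pair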
Note that, by default, UNION eliminates duplicates; you can ask for
preserving them by using UNION ALL instead.
27. Not all query results are updatable: for example, those obtained
through DISTINCT, GROUP BY, and UNION are not.
28. One might also ask queries whose answer cannot be computed only
based on a db because either there is no such stored data or queries
are incorrectly or ambiguously formulated.
For example, considering Fig. 3.2, the answer to the query “Which is
the tonality in which Mozart wrote his Requiem?” (SELECT Tonality
FROM MUSICAL_WORKS WHERE Composer = 4 and Title =
‘Requiem’;) is the empty set (to be interpreted as “There is no data on
a work by Mozart called Requiem”); the answer to the query “In which
year was first publicly performed Brahms’ 2nd piano concerto?” (SELECT
FirstPublicPerformanceYear FROM MUSICAL_WORKS WHERE
Composer = 8 and Title = ‘The 2nd piano concerto’;)
is a syntactic error (to be interpreted as "this db does not store musical
works' first public performance years"); the answer to the query "Who are
the composers whose birth year is greater than their last name?" (SELECT
FirstName, LastName FROM COMPOSERS WHERE BirthYear >
LastName;) is a semantic error (to be interpreted as "there is no
meaning in comparing birth years, which are naturals, with last names,
which are text strings").
Note that the following two subsections closing 3.9.2, devoted to relational
calculus and algebra, may be skipped by all those not interested in
the theoretical foundations of querying. However, especially the second
one (on the relational algebra) could be interesting in order to understand
how the SELECT SQL statement is internally evaluated by RDBMSs.
Moreover, this second subsection also provides intuition on the math
formalization (presented in Section 3.12) of the GROUP BY clause.
64. SQL is the most frequently used, but there are other equivalent languages too, for example, Quel, ISBL, UnQL, YQL, as well as subsets and/or extensions of them designed and used for particular domains, such as DMX (for data mining), MQL and SMARTS (for chemistry), MDX (for OLAP dbs), TMQL (for topic maps), XQuery (for XML files), etc.
65. QBE is the only successful DRC language, which was imitated by nearly all RDBMSs (e.g., MS calls it Query Design mode in Access and Query Designer in SQL Server, pretending that they are Visual QBEs). Its success is due to its graphical nature, simplicity, and intuitiveness.
66. In dbs, constraints are propositions, while queries are open formulas. The db instance subsets for which queries are true are considered to be their interpretation (meaning).
The tuple and domain relational calculi (TRC and DRC) are equivalent in
expressive power67, with TRC formulas being slightly simpler. This allows, for
example, translating QBE into SQL (and vice-versa too).
The relational algebra (RA) is an extension of the algebra of sets (SA) with
five fundamental and several derived operators; RA operands are table
instances.
Recall that the fundamental algebra of sets operators are the union, intersection,
complementation, and Cartesian product (denoted ∪, ∩, C, ×,
respectively). Derived ones include, for example, difference (or relative
complement: \, −) and direct sum (⊕). Moreover, besides equality, inclusions
(⊆, ⊂, ⊃, ⊇, etc.) play a significant role.
The fundamental RA-proper initial operators (i.e., those added from the first
RDM specification) are selection, projection, and renaming.
Selection is a "horizontal splitting" operator: it selects from a RA
expression E's instance only those rows that satisfy a logic formula F (notation:
σF(E)). SQL implements it in its WHERE and HAVING clauses. Trivially,
if all of E's rows satisfy F, then no selection is actually performed, the
result being equal to E's instance; dually, if no row of E satisfies F, then
the result of the selection is the empty instance (of E's scheme).
Note that selection compositions may be consolidated into only one
selection, by using the logical and operator: σF(σG(E)) = σF∧G(E).
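In SQL, this consolidation corresponds, for example, to flattening a filtered subquery into a single WHERE with and (T, F, and G being hypothetical):
SELECT * FROM (SELECT * FROM T WHERE G) AS S WHERE F;
-- may be consolidated into the equivalent:
SELECT * FROM T WHERE F AND G;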
Projection is a dual, "vertical splitting" operator: it retains from
a RA expression E's corresponding instance only the desired, explicitly stated
columns included in its nonempty selection set (also called target list)
S (notation: πS(E)). SQL implements it in its SELECT clause. Trivially, if
S = E, no projection is actually performed: πE(E) = E; in order to simplify
writing in this case (i.e., not being obliged to mention all E's columns),
SQL offers the "*" notation68.
67. In the rdbs querying context, a language's expressive power means the class of queries that the language can express.
68. Unfortunately, perhaps the most frequently used statement in MySQL (be it under PHP or whatever other high-level programming platform), as its users do not generally care about projecting desired expressions on only the actually needed columns.
69. ker is the kernel (nucleus) operator on mappings (see Appendix), which places in the same equivalence class all elements of the operand mapping's domain for which the mapping computes the same value.
70. A quotient set is the set of partitions generated by an equivalence relation.
71. Actually, the best of them will optimize it (as union is the most expensive operator) by embedding the union within the inner join.
There is also a dual notion to the full join, which is also important
in higher RDM so-called normal forms: a table is said to have a lossless(–join)
decomposition (LD) with respect to some of its projections if
its instance can be exactly reconstructed from these projections (i.e., if
πX1(r) |><| … |><| πXn(r) = r, where r is the instance of a table R and X1, …,
Xn are (concatenations of) its columns).72
Joins are dual to projections, as they are “vertical composition” opera-
tors (building up tables with columns from two tables, through the embed-
ded Cartesian product).
The nature of this duality is much more profound than graphical
appearances; it can be shown, for inner joins, that:
ü the instance of the projection of a join between n RA expressions
over the scheme of any of its operands is a subset of that operand’s
instance;
ü above inclusions are equalities when joins are full;
ü the composed operator defined above is idempotent (i.e., repeat-
edly applying same projection of a join over the scheme of one
of its operands yields same result as applying it only once) and,
dually, that:
ü any RA expression instance is a subset of the instance of the join of
any of its scheme full projections (i.e., projections whose union is
the expression’s scheme);
ü for any full projections of a RA expression the composition between
the corresponding join and projections is idempotent too.
When F is missing, joins are equal to the corresponding Cartesian
products: E1 |><|Ø E2 = E1 × E2.
Joins of the same type are associative (consequently, they may also be
considered as n-ary operators, n natural), and the inner as well as the full outer
ones are commutative too.
However, compositions of several types of joins are sometimes ambiguous,
so they are not allowed: for example, E1 L|><|F E2 |><|G E3 L|><|H E4,
where L|><| denotes a left join, is ambiguous, as it is not associative:
(E1 L|><|F E2) |><|G (E3 L|><|H E4) ≠ E1 L|><|F (E2 |><|G (E3 L|><|H E4)).
72. Note that, in this context, "lossless" is a false friend: lossy joins are not "losing" stored data but, on the contrary, computing additional, not stored data.
Also note that joins (even if heavily optimized) are the third most
time-consuming operators (after unions and Cartesian products), taking
some 95% of the average time for typical queries not having groupings
and orderings and, as approximate orders of magnitude, some 66% for
the ones having groupings and orderings, which spend only some
10% for grouping, 5% for filtering before grouping, 1% after, and 18% for
ordering. Outer joins (and especially the full ones) are more expensive
than inner ones.73, 74, 75
As, most frequently, RA expressions only use the select, project, and
join operators (and much more rarely renaming, Cartesian product, union, etc.),
they are also called SPJ expressions.
73. R and S must have the same number of columns, pairwise compatible.
74. R and S have to be joinable on at least one column.
75. Beware that, if the corresponding RDBMS does not have a query optimizer able to detect and replace the Cartesian product followed by a selection with a corresponding join, then this alternative solution is very inefficiently computed and may take hours or even days to complete for large instances.
Other derived, less frequently used RA operators are the left and right
semijoins (joins projected on their corresponding operands: the left one is
T |>< U = πT(T |><| U), whereas the right one is T ><| U = πU(T |><| U)),
the left and right antijoins (or antisemijoins: T |> U = T − (T |>< U), T <| U =
U − (T ><| U)), as well as division (T ÷ U = πC1, …, Cn(T) − πC1, …, Cn((πC1, …, Cn(T) ×
U) − T), where C1, …, Cn are T's columns which are not joined to U's ones).
Note that RA's expressive power is considered the minimum acceptable
standard for any relational querying language: any language that has
at least its expressive power is called complete. Also note that RA and the RCs
are incomparable from this viewpoint: while the RCs are, generally, much
more expressive, they cannot, however, express unions. In order for them
to have at least RA's expressive power (and thus become comparable),
the union operator has to be added to the RCs too. Also note that RA expressions
can compute all possible table instances only containing data stored
in the corresponding rdb instances.
Table 3.24 shows the equivalences between RA and SQL.
Most RDBMSs provide for saving queries (under unique names per db) as
views, which are considered to be computed tables; the rationale is that
their parsing, compiling, and optimization is done only once and, each
time their results are needed, they may be referred to in any FROM
clause just like an ordinary table. Note that, generally, views do not accept
parameters76 and that their names should be distinct not only from any
other view's, but also from any other table's of the db. Often, hierarchies of
views are declared, saved, and used later too.
Similarly, any parameterized sequence of extended SQL manipulation
statements may be saved too (under unique names per db) as stored
procedures.
76. However, for example, MS Access ones (called queries) do accept them, as any symbol which is not a constant, a table, or a column name is considered as being a parameter.
The first relational languages (RLs) limitation, a minor one, is the lack of
function symbols, except for the aggregate ones. Practically, RDBMSs provide
lots of library functions that empower their SQLs.
The second RLs limitation, a major one, is their incapacity to compute,
for example, transitive closures of binary homogeneous relations (e.g., the
set of all ancestors and/or descendants of somebody, or the
set of all files belonging to an OS logical drive, or the set of all employees
subordinated, directly or indirectly, to any given employee, etc.)77.
Consequently, two types of solutions were devised: embedded and
extended SQL.
The vast majority of high-level PLs embed SQL, that is, they include
at least one statement that accepts as parameters text strings containing
SQL statements and/or clauses, or views and/or stored procedure names
and their actual parameters, that are passed (generally, through middleware
like ADO78) to SQL engines; if they are queries, their results are then
passed back from these engines to the host high-level PL in accordingly
structured memory buffers called data (or record) sets.
The dual approach consists of extending SQL with high-level PL con-
structs (variable declarations, if, while, etc.79): IBM DB2 SQL PL, Oracle
PL/SQL, Sybase and MS T-SQL, etc. are examples of extended SQLs.
77. In fact, RLs limitations are much more severe: they cannot express any unbounded computation (i.e., computations on unbounded data), as they are not Turing complete (or computationally universal) languages.
78. The acronym for ActiveX Data Objects, a MS COM-type middleware between PLs and OLE DB (the acronym for Object Linking and Embedding DataBase, a MS application programming interface (API) providing uniform-manner access to data stored in a variety of formats).
79. Moreover, in order to also compute transitive closures (more generally, in fact, basic linear and even nonlinear and mutual recursion), SQL was extended by powerful RDBMSs with a "WITH RECURSIVE" statement (see the first chapter of the 2nd volume of this book).
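For example, all direct and indirect subordinates of the first employee of Table 3.3 might be computed with a sketch like the following (assuming that employee's x value is 1; actual RDBMS recursive syntax varies):
WITH RECURSIVE SUBORDINATES (x) AS (
   SELECT x FROM EMPLOYEES WHERE ReportsTo = 1 -- direct reports
   UNION ALL
   SELECT E.x FROM EMPLOYEES E
      INNER JOIN SUBORDINATES S
         ON E.ReportsTo = S.x                  -- indirect ones
)
SELECT x FROM SUBORDINATES;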
Static SQL is code created before execution, which does not change
during execution, be it inline or as views and/or parameterized stored
procedures.
Dynamic SQL is SQL code created dynamically (be it from scratch or
by modifying static SQL) during its execution, either inline or
in stored procedures, be it in host high-level PLs embedding SQL or in
extended SQLs.
Static SQL has no disadvantage and the following speed advantages:
ü Only at creation/replacement time, its code is parsed and, if ok, then
translated into internal RA expressions, which are then optimized,
and for which, finally, the system builds and saves best execution
plans possible, depending on current instance, indexes and, option-
ally, on DBA directions too. Note that optimization is done not only
locally, for each statement, but globally too, for each views hierar-
chy, transaction, and stored procedure.
ü When executed (which is generally very frequent when com-
pared to creation/replacement), there is no need for any of the
above steps to be redone: the system just executes directly the
corresponding execution plan (possibly parameterized with actual
parameter values).
Dynamic SQL has no significant advantage, except for increasing
developers' productivity, and the following speed disadvantage: for each
execution, each dynamic statement has to be parsed and translated into internal
RA expressions, which are then optimized, and for which, finally, the
system builds (but does not save) and executes the best execution plans
possible. Optimization may be done only locally, for each statement, but not
globally too, either for inline code or for stored procedures.
Any dynamic SQL code may be equivalently rewritten statically.
It is true that, sometimes, dynamic SQL is easier to develop than its
equivalent static one; as such, it could be a temporary solution when deadlines
are tight. However, keep in mind that, generally, there are no solutions
more permanent than the temporary ones! Normally, once deadlines
are met successfully, dynamic SQL should be replaced by equivalent
static code.
Unfortunately, there are always new deadlines, almost all the time
tighter, plus fatigue, which do not at all encourage rewriting SQL code.
This leads to a PHP+MySQL-type architecture, which is extremely slow,
not using the RDBMS's powerful views and static parameterized stored
procedure facilities.
Here is an example of equivalent static and dynamic SQL, for querying
some data for which end-users might ask for any combination of n conditions
c1, …, cn, n > 0 natural; obviously, there are 2^n combinations:
CASE c1 IS NULL AND … AND cn IS NULL:
SELECT tl FROM te;
CASE c1 IS NOT NULL AND c2 IS NULL AND … AND cn IS NULL:
SELECT tl FROM te WHERE c1;
…
CASE c1 IS NOT NULL AND … AND cn IS NOT NULL:
SELECT tl FROM te WHERE c1 AND … AND cn;
In order to reduce complexity from exponential (2^n) to linear (n), most
programmers, unfortunately, write the following type of dynamic embedded
SQL code:
sqlStr := "SELECT tl FROM te";
where := false;
for i := 1, n begin
   if not IsNull(ci) then
      if not where then begin
         -- first non-null condition: open the WHERE clause
         sqlStr := sqlStr & " WHERE " & ci;
         where := true;
      end
      else
         -- subsequent conditions: chain them with AND
         sqlStr := sqlStr & " AND " & ci;
end;
execute sqlStr;
The following static, mixed extended-and-embedded solution is equivalent
and has the same linear complexity:
CREATE OR REPLACE PROCEDURE p (
   c1 VARCHAR DEFAULT NULL, …, cn VARCHAR DEFAULT NULL) IS
BEGIN
   SELECT tl FROM te WHERE
      ((c1 IS NOT NULL AND c1) OR c1 IS NULL) AND … AND
      ((cn IS NOT NULL AND cn) OR cn IS NULL);
END;
This is executed only once, for creating parameterized stored proce-
dure p; the only necessary embedded SQL statement for calling it is of
the type:
execute “p(“ & c1 & ” , “ & … & ” , “ & cn & “);”;
Whenever a condition ci is null, the corresponding WHERE line reduces to
"AND true", which is neither applying ci, nor tampering with the rest of
the conditions; whenever condition ci is not null, the corresponding WHERE
line reduces to "AND ci", which applies ci, so that exactly the same desired
functionality is achieved, also in only n steps.
Readers who are not interested in why db design should not be done
in RDM could skip this subsection, except for 3.10.1, without losing
anything.
There is a hierarchy of relational normal forms (NFs), each of which is
essentially trying to eliminate particular cases of the so-called data manip-
ulation anomalies: insertion, update, and deletion ones.
TABLE 3.18 A Table Having an Insertion, a Deletion, and Several Update Anomalies
CustomerID  CustomerName    OrderNo   OrderDate              OrderedItems
NAT(4)      ASCII(64)       NAT(4)    [1/1/2010, SysDate()]  Im(PRODUCTS.x)
NOT NULL    NOT NULL        NOT NULL  NOT NULL               NOT NULL
1           Some Cola       1         3/15/2013              1
1           Someother Cola  1         3/16/2013              2
2           Some Customer   2         3/16/2013              1
Among other NFs, the 2nd, the 3rd, and the Boyce-Codd one, situated at
the base of the relational NFs hierarchy, are based on a constraint called
functional dependency.
Let A and B be any two columns of a table T; B is said to be functionally
dependent on A (or, equivalently, A is said to functionally determine B) if
and only if, by definition, to any value of A there corresponds only one
value of B. Symbolically, any such assertion, which is called a functional
dependency (FD) or, abbreviated, dependency, is denoted by A → B (or,
sometimes, dually: B ← A). A is called the left hand and B the right hand
of A → B.
For example, in Table 3.6, x and Capital each functionally determine
any other column (including themselves): for example, both x →
Capital and Capital → x, as well as Capital → Population, hold.
These notions are immediately extendable to column concatenations:
let X and Y be any such column products of a table T; Y is said to be
functionally dependent on X if, for any rows having the same values for X,
the corresponding values for Y are the same.
For example, in Table 3.6, Country • State → Population • Area.
80. Moreover, trivially, Ø → Ø and X → Ø are trivial FDs too, consequently completely uninteresting, whereas FDs of type Ø → Y, which imply that Y is a constant function, should never be asserted: on the contrary, such completely uninteresting constant Ys should be eliminated from all schemes.
Any BCNF table is also in 3NF, but BCNF is not always achievable:
for example, the set of FDs {A • B → C, C → B} cannot be represented by
a BCNF scheme.
Knowledge rests not upon truth alone, but upon error also.
—Carl Jung
Table 3.21 is in BCNF (as both of its FDs, Student • Course • Club → x
and x → Student • Course • Club, have keys as their left-hand sides), but
exhibits, however, bad design issues.
No FDs govern this table; unfortunately, two enrollment types unrelated
between them (courses and clubs) are each related, however, to
students. RDM calls this a multivalued dependency (MVD) and denotes it
by Student →→ Course • Club.
Generally, a multivalued dependency X →→ Y • Z, where X, Y, and Z
are (concatenations of) columns of a table T, is defined as a constraint (of
"tuple generating" type) asserting that whenever two rows of T have the same
values a for X, for which Y • Z's values are <b, c> and <d, e> respectively,
then there should also exist two other rows in the same T's instance having
values a for X, but <b, e> and <d, c> values for Y • Z (i.e., all combinations
of Y and Z values should exist for each common X value). Note that
a MVD X →→ U of T is trivial if U ⊆ X or if T = X • U.
A table is said to be in the 4th NF (4NF) if any nontrivial MVD has a
superkey as its left-hand side.
Table 3.21 is in BCNF (as it has no FDs), but not in 4NF (as Student is
not a key); this is why it stores that much redundant data for each student
in both Course and Club.
Note that:
ü any table in 4NF is in BCNF too;
ü tables that are not in 4NF have their semantics overloaded with 2
distinct related ones;
In fact, MVDs are only the simplest particular case (for n = 2) of a yet
weaker constraint type called join dependency (JD), denoted |><| [X1, X2,
…, Xn], where n > 1 natural, X1, X2, …, Xn ⊆ T, and X1 • X2 • … • Xn = T (i.e., all
Before being able to define the so-called "ultimate" RDM NF, the Domain-Key
one (DKNF), as it has been shown that any table in DKNF is in 5NF
too, we need to investigate the interactions between domain and other types
of constraints and also formally define compatibility between rows and
instances, and their associated anomalies.
Trivially, empty domains only allow for corresponding empty instances,
which trivially satisfy any FD, MVD, and JD. If the domain of a column
C of a table T has only one value, then any FD X → C holds, for any (concatenation
of) column(s) X of T. Moreover, no JD having n components,
n > 2 natural, may hold in any table not having at least n values for each of
the involved columns. This is why, theoretically, RDM considers infinite
domains.
It can be shown that, for any table, if all of its involved columns may
take at least two values, then any of its FDs and MVDs is implied only by
FDs and MVDs (not by its domain constraints), and that if they may take
at least n values, n > 2 natural, then any of its JDs with n components is
implied only by its keys.
Domain and key constraints are called primitive constraints.
By definition, a table T having an associated constraint set CS is said
to be in DKNF if any constraint c of CS is implied by the primitive con-
straints belonging to CS+.
Row t is said to be compatible with table T if, together with T's
instance, it satisfies T's primitive constraints (i.e., all of its values satisfy
the corresponding domain constraints and none of them duplicates values
of T's keys).
A table T is said to have an insertion anomaly whenever there is a
consistent instance of it and a row compatible with it, but the union of that
instance and that compatible row is not a consistent instance of T; dually, T
is said to have a deletion anomaly whenever there is a consistent instance
and a row of it such that, when removing that row, the remaining instance
is inconsistent.
81. Unfortunately, published true counter-examples of tables in 5NF but not in DKNF are very, very rare; a fortunate exception is the one given by Wikipedia.
The U.S. states that allow for citizens' initiatives tend to have fewer laws
and lower taxes than the ones that don't. But the beauty of the system is
that it encourages the spread of best practices.
—Daniel Hannan
At the RDM level, to the db axioms presented in Section 5.1 below
correspond the following best practice rules (which are divided into two
categories, corresponding to the DDL and DML, respectively).
3.11.1.1 Tables
Einstein’s results again turned the tables and now very few philoso-
phers or scientists still think that scientific knowledge is, or can be,
proven knowledge.
—Imre Lakatos
R-T-1. (Table creation, update, and drop)
1. For any dynamic fundamental data object set having at least one
dynamic property, a corresponding table should be added to the
corresponding relational scheme. For example, even for a set of
configuration data always having only one element, a corresponding
table (which will always have only one line and one column)
should be added.
2. On the contrary, for static sets (i.e., sets whose elements should
never be updated; for example, people's titles, rainbow colors,
payment methods, etc.), and not only for small instances, corresponding
static enumerated sets or ranges may be used instead in the
corresponding "foreign-like" keys.
3.11.1.2 Keys
Angels are like diamonds. They can’t be made, you have to find them.
Each one is unique.
—Jaclyn Smith
R-K-1.(Primary keys)
Recall that, by definition, primary keys are unique and do not accept
NULLs.
1. Any fundamental table should have a primary key (implementing
the corresponding surrogate key).
2. Any primary key should be of integer type (range restricted according
to the maximum possible corresponding instance cardinality;
see R-DA-12) and, except for subsets (see R-T-2), it should
be of the autonumber type.
3. By exception, non-referenced tables might have concatenated primary
keys instead, provided that all of their columns are of integer
type. You should never make such exceptions for referenced
tables: as soon as such a table is referenced, the former primary
key should be downgraded to a non-primary (semantic) one, and a
sole surrogate primary key should be added to it.
4. Derived/computed tables may have no keys (be they primary or not).
R-K-2. (Semantic keys)
1. Any fundamental table should have all of the corresponding
semantic keys. With extremely rare exceptions (see the rabbit
cages example in R-DA-15 above), any nonsubset table should
consequently have at least one semantic key: any of its lines should
differ in at least one non-primary-key column (note that there is an
infinite number of NULLs, all of them distinct!).
An expert is a person who has made all the mistakes that can be made in
a very narrow field.
—Niels Bohr
You have to learn the rules of the game. And then you have to play
better than anyone else.
—Albert Einstein
R-G-0. (Think in sets of elements, not elements of sets)
Whenever you design queries, think in terms of sets of elements, not elements of sets.
For example, by using either extended or embedded SQL, you can always process rows of tables one by one, just like in ordinary sequential programming; however, especially as you are expected to always design the fastest possible solutions, never forget that SQL is optimized for performing any (parameterized) data processing task in parallel, on any number of rows, not just one.
Consequently, whenever not compulsory (which is extremely rare), use pure SQL and forget about cursors, as in the sketch below.
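For example, the following minimal sketch (the uppercasing task is an illustrative assumption) shows the set-oriented alternative to a cursor loop:
-- set-oriented: a single statement processes all qualifying rows at once,
-- letting the engine parallelize, instead of a cursor loop that fetches
-- and updates one row at a time
UPDATE PERSONS SET Name = UPPER(Name) WHERE Sex = 'M';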
R-G-1. (Use parallelism)
Whenever possible, be it for querying and/or updating, use parallel programming (an Oracle-flavored sketch follows this list). For example:
• whenever available, partition large table instances (e.g., by months/years/etc.) and use parallel queries on several such partitions;
• whenever available, use parallel-enabled functions (including pipelined table ones), which allows them to be used safely in slave sessions of parallel DML evaluations;
• process several queries in parallel by declaring and opening multiple explicit cursors, especially when the corresponding data is stored on different disks and the usage environment is a low-concurrency one.
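For example, here is an Oracle-flavored sketch of the first item (all names are illustrative assumptions, and the exact syntax varies between RDBMS versions):
-- partitioning a large table instance by years
CREATE TABLE PAYMENTS (
x INT PRIMARY KEY,
PayDate DATE NOT NULL,
Amount NUMBER(16,2)
)
PARTITION BY RANGE (PayDate) (
PARTITION p2013 VALUES LESS THAN (DATE '2014-01-01'),
PARTITION p2014 VALUES LESS THAN (DATE '2015-01-01')
);
-- requesting a parallel scan (here of degree 4) over the partitioned table
SELECT /*+ PARALLEL(PAYMENTS, 4) */ SUM(Amount) FROM PAYMENTS;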
R-G-2. (Minimize I/Os)
Always factorize I/O operations, in order to keep them to the minimum possible. For example:
• always use only one UPDATE SQL statement to update several columns simultaneously, instead of several such statements (one per column); for example, replace:
UPDATE T SET A = v;
UPDATE T SET B = w;
by:
UPDATE T SET A = v, B = w;
• always update only the needed rows; for example, replace:
UPDATE T SET A = TRUE WHERE B = w;
by:
UPDATE T SET A = TRUE
WHERE B = w AND A = FALSE;
• when selecting data, always select only the needed columns, from only the needed tables, and eliminate all unneeded rows as soon as possible; for example, if, in the end, you only need male person names (even if, for the time being, you are only a PHP + MySQL developer), replace:
SELECT * FROM PERSONS;
by:
SELECT Name FROM PERSONS WHERE Sex = 'M';
• use caches and, generally, internal memory whenever available; for example, small static tables, functions, and queries should be cached whenever they are frequently used.
R-G-3. (Reuse db connections)
1. Avoid continually creating and releasing db connections.
2. In web-based or multi-tiered applications where application serv-
ers multiplex user connections, connection pooling should be used
to ensure that db connections are not reestablished for each request.
3. When using connection pooling, always set a connection wait timeout to prevent frequent attempts to reconnect to the db if there are no connections available in the pool.
4. Provide applications with connect-time failover for high-availabil-
ity and client load balancing.
R-G-6. (Unit testing)
Never deliver any query, view, stored procedure, function, or trigger without at least completely and thoroughly unit testing it.
All subsequent tests should be run only with nearly perfectly functioning units.
3.11.2.3 Querying
R-Q-1. (WHERE-HAVING rule)
Always use the WHERE clause to eliminate all unneeded rows before grouping (if any); always use the HAVING clause only for conditions involving the application of aggregation functions on groups (see the sketch below).
Dually, never place in HAVING a filter that can be placed in WHERE.
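For example, reusing the PERSONS table from R-G-2 above (column BirthPlace is an assumption):
SELECT BirthPlace, COUNT(*) AS Persons#
FROM PERSONS
-- row filter: correctly placed in WHERE, eliminating rows before grouping
WHERE Sex = 'M'
GROUP BY BirthPlace
-- group filter: HAVING is reserved for conditions on aggregation functions
HAVING COUNT(*) > 10;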
R-Q-2.(WHERE rules)
a. In any adjacent conditions pair, the first condition should eliminate
more rows than the second one.
b. Do first inner and then outer joins.
c. Join tables in their increasing number of rows order.
d. Do not prevent index usage by applying unneeded functions to the
corresponding columns; take also care of implicit conversions.
e. Not only in Oracle, if you use filtered Cartesian products instead of
explicit joins, always place non join filters before join conditions.
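For example, here is a minimal sketch of rule (d), assuming an index on column BirthDate and a YEAR extraction function (both are assumptions; not all RDBMSs provide such a function):
-- may prevent index usage: a function is applied to the indexed column
SELECT * FROM PEOPLE WHERE YEAR(BirthDate) = 2000;
-- index-friendly equivalent: the column is kept "naked" in the condition
SELECT * FROM PEOPLE
WHERE BirthDate >= DATE '2000-01-01' AND BirthDate < DATE '2001-01-01';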
(uncorrelated) ones, for which subqueries are evaluated only once, before evaluating the corresponding query; the dynamic (correlated) ones need to be reevaluated for each row of the corresponding query).
END Get_Orders_ID;
…
if obType = Get_Orders_ID then …
It is true that, unfortunately, in Oracle, for example, you can use named constants only in PL/SQL, but not also in SQL (e.g., in DDL statements, views, triggers, etc.), which needs constant functions instead; however, this is not a reason not to use constants in PL/SQL.
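For example, here is a minimal PL/SQL sketch of this constant function workaround (the package and its names are illustrative assumptions, echoing the Get_Orders_ID fragment above):
CREATE OR REPLACE PACKAGE Consts AS
Orders_ID CONSTANT INT := 1; -- named constant, directly usable only in PL/SQL
FUNCTION Get_Orders_ID RETURN INT DETERMINISTIC; -- wrapper usable in SQL too
END Consts;
/
CREATE OR REPLACE PACKAGE BODY Consts AS
FUNCTION Get_Orders_ID RETURN INT DETERMINISTIC IS
BEGIN
RETURN Orders_ID;
END Get_Orders_ID;
END Consts;
/
-- e.g., in a view or trigger: ... WHERE obType = Consts.Get_Orders_ID()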
82 Unfortunately, especially as g need not be one-to-one, which would have always guaranteed that fk not only exists, but is also unique.
Definition 3.1 (First Normal Form) A relation scheme for which all function members are atomic, that is, they are neither function products, nor take values from Cartesian products or power sets, etc., is said to be in the First Normal Form (1NF). An rdb scheme whose relations are all in 1NF is in 1NF too.
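For example (a minimal sketch; all names are illustrative assumptions): a PEOPLE column taking sets of phone numbers as values would violate 1NF; the standard relational workaround stores one atomic value per row in a separate table:
CREATE TABLE PHONES (
x INT PRIMARY KEY,
-- foreign key replacing a (non-1NF) set-valued column of PEOPLE
Person INT NOT NULL REFERENCES PEOPLE (x),
Phone VARCHAR(16) NOT NULL,
UNIQUE (Person, Phone)
);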
implies that:
(ii) Im(fk) ⊆ Im(a) (called the associated referential integrity constraint);
(iii) card(Im(fk)) = card(Im(f));
(iv) if f is one-to-one, then fk is one-to-one too.
Proof:
(i) Let fk : R → A such that ∀x∈R, fk(x) = a(f(x)); trivially, fk is totally defined: ∀x∈R, ∃y∈Im(f) ⊆ R′ such that y = f(x) and ∃z∈Im(a) ⊆ A such that z = a(y) = a(f(x)) = fk(x); moreover, fk is well-defined: ∀x, y∈R such that x = y, according to the a ∘ f definition, a(f(x)) = a(f(y)) ⇒ fk(x) = fk(y).
Obviously, fk = a ∘ f: ∀x∈R, let y = a(f(x)) ∈ Im(a) ⊆ A; according to the fk definition, fk(x) = y ⇒ a(f(x)) = fk(x) ⇒ a ∘ f = fk.
Moreover, fk is the only function having these properties: let us suppose that ∃fk′ : R → A such that a ∘ f = fk′ ⇒ ∀x∈R, a(f(x)) = fk′(x); according to the fk definition, fk(x) = a(f(x)) ⇒ fk(x) = fk′(x) ⇒ fk ≡ fk′.
(ii) Trivially, ∀y∈Im(fk), ∃x∈R such that y = fk(x); as f is totally defined, it follows that ∃z∈R′ such that z = f(x); as a is totally defined, it follows that ∃u∈Im(a) such that u = a(z) = a(f(x)) = fk(x) = y.
Note that, trivially, even if a is not one-to-one, fk still exists, is the only function satisfying 3.2(i) above, and its associated referential integrity constraint still holds (as a's one-to-oneness is not needed in their proofs). Unfortunately, this allows RDM to define foreign keys referencing any columns or column products, be they one-to-one or not.
Obviously, referencing a column or column product which is not one-
to-one has two disadvantages (also see Exercise 3.43) and no advantage;
let us consider, for example, a = CityName : CITIES → ASCII(256), which
is not generally one-to-one, as there may be cities having same names
even inside a same country; suppose, for example, that CityName(1) =
CityName(2); trivially, any foreign key fk referencing CityName has the
following two disadvantages:
• card(Im(fk)) ≤ card(Im(f)), that is, fk does not always unambiguously identify the elements of R′ computed by f; for example, if f = BirthPlace : PEOPLE → CITIES, BirthPlace(1) = 1, BirthPlace(2) = 2, then, although, in fact, people 1 and 2 were born in different cities, the corresponding foreign key fk = CityName ∘ BirthPlace would mislead us by indicating that both of them were born in a same city (as, in fact, it does not compute unique cities, but their names, which are the same in this case);
• fk does not preserve f's one-to-oneness anymore; for example, if f = Capital : COUNTRIES → CITIES, which is obviously one-to-one (as no city may simultaneously be the capital of two or more countries), Capital(1) = 1, Capital(2) = 2, then, although, in fact, countries 1 and 2 have distinct capitals, the corresponding foreign key fk = CityName ∘ Capital would mislead us by indicating same-name capitals for both of them, hence fk would not be one-to-one anymore.
Proof:
(⇒) Let us define fd : Im(f) → Im(g) such that ∀y∈Im(f), fd(y) = z, where y = f(x) and z = g(x), and first prove that fd is well-defined:
(fd totally defined) According to the mapping image definition, ∀y∈Im(f), ∃x∈S, y = f(x); as g is totally defined, ∀x∈S, ∃z∈Im(g), z = g(x) ⇒ ∀y∈Im(f), fd(y) = z.
(fd functional) Let t, u∈Im(f), t = u; by the fd definition, ∃x, y∈S such that fd(t) = v, where t = f(x), v = g(x), and fd(u) = w, where u = f(y) and w = g(y); t = u ⇒ f(x) = f(y) ⇒ (x, y)∈ker(f); as ker(f) ⊆ ker(g), it follows that (x, y)∈ker(g) ⇒ g(x) = g(y) ⇒ v = w ⇒ fd(t) = fd(u).
Let us also show that g = fd ∘ f: ∀x∈S, denote z = g(x)∈Im(g) and y = f(x)∈Im(f); according to the fd definition, fd(y) = z ⇒ fd(f(x)) = g(x) ⇒ fd ∘ f = g.
Finally, we prove that fd is unique: assume that ∃h : Im(f) → Im(g) such that h ∘ f = g; it follows that ∀x∈Im(f), ∃y∈S, x = f(y), such that h(f(y)) = g(y); according to the fd definition, fd(x) = g(y) ⇒ fd(x) = h(x) ⇒ fd ≡ h.
(⇐) Assume ¬(f → g); ¬(ker(f) ⊆ ker(g)) ⇒ ∃(x, y)∈ker(f) such that (x, y)∉ker(g), that is ∃(x, y)∈S², f(x) = f(y) and g(x) ≠ g(y); as fd is functional, f(x) = f(y) ⇒ fd(f(x)) = fd(f(y)), which, by the fd definition, implies that g(x) = g(y); this trivially means that the assumption is false and, consequently, f → g. Q.E.D.
finite (hence, A1–7 never loops infinitely). Consequently, in the worst case
scenario (when no surrogate key is declared in the input E-R data model),
the total number of steps is 2*(e + d) + r + f + a + c + v. Q.E.D.
Proposition 3.5 (Algorithm REA1–2 complexity) Let t be the total number of tables and views of S, and let the corresponding numbers be f for foreign keys, a for non foreign keys, k for keys, n for NOT NULL constraints, and c for tuple (check) constraints (t, f, a, k, n, c naturals); REA1–2 has complexity O(t + 2*a + f + k + n + c).
Proof:
Obviously, REA1–2 processes each table, view, column (be it a foreign key or not), and constraint (be it domain, key, NOT NULL, or tuple/check) only once; it has one outer loop, which is executed t times and embeds four other ones: in total, the first one is executed a times (for creating the ellipses), the second one f times (for the arrows), the third one k times (for the keys), and the last one a + n + c times (for adding the corresponding domain, NOT NULL, and tuple/check constraints, respectively); trivially, all of the above loops are finite (hence REA1–2 never loops infinitely). Consequently, the total number of steps is t + 2*a + f + k + n + c. Q.E.D.
3.13 CONCLUSION
Besides the five basic types implemented by nearly all RDBMSs, dependency theory has formalized a plethora of integrity constraints, out of which only the most "famous" ones are presented in Section 3.10 (some others are briefly discussed in the last section of this chapter).
Relational languages used by programmers are of logic type, being
declarative and nonprocedural; internally, they are translated by RDBMSs
into optimized relational algebra expressions. Their main limitation (not
being able to compute transitive closures) is overcome by recursive
extensions that are presented in the first chapter of the second volume of
this book.
RDM is also important because of its very interesting interactions with logic and object-oriented programming, as well as with expert systems and AI. Moreover, I consider that it is a must to include RDM in any db book, even though the literature in this field is so abundant, at least for stating your point of view on its most important results, its limits, and how to overcome them.
Hopefully, besides this point of view, this chapter also provides a rich collection of correct relational solutions for interesting subuniverses of discourse, as well as some of our results in the field (especially the algorithms A1–7, A7/8–3, and REA1–2, the best practice rules, the formalization of the SQL GROUP BY clause with quotient sets computed by the kernel of mapping products equivalence relations, Section 3.7 (including its RDM and E-R data models of the RDM), Section 3.12 (mainly including the formalization of the KPP and the proof that any table having n columns may have at most C(n, [n/2]) keys), and the subsections 3.2.3.3, 3.2.3.4, 3.2.4.2, 3.2.4.3, and 3.3.2 (the relational model of the E-RDM)); the Domain-Key-Referential-Integrity Normal Form (DKRINF) is characterized in the second volume of this book.
RDM ones are needed for db design, in order to fully guarantee data
plausibility.
For (just another) example, RDM is not able to accept even very common constraints such as "the population of a country should be greater than or equal to the sum of the populations of its states" and "the population of a state should be greater than or equal to the sum of the populations of its cities", which are very similar to check (tuple) constraints, except for the fact that they involve columns from different tables (see the sketch below).
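For example, the first of them could be stated, in SQL-92 style, as the following assertion sketch (CREATE ASSERTION is standard but almost never implemented, so actual RDBMSs need triggers or db application code instead; the table and column names are assumptions):
CREATE ASSERTION COUNTRY_POPULATION_PLAUSIBILITY CHECK
(NOT EXISTS
(SELECT * FROM COUNTRIES C
WHERE C.Population <
(SELECT SUM(S.Population) FROM STATES S WHERE S.Country = C.x)));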
Consequently, db design should always be done at a higher, semantic
level and RDM should only be used as an abstraction layer above RDBMS
versions.
On the contrary, querying should always be done relationally, as it is
not only powerful, simple, and elegant, but also benefits from very fast
processing.
3.14 EXERCISES
3.0. (a) Apply algorithm A1–7 to the E-R data model of Exercise 2.0 above and provide a valid instance of at least two rows per table.
(b) Apply algorithm A7/8–3 to the relational scheme obtained above, in order to discover and add to it any other existing keys (hint: skip verifying the existing keys, as we have just correctly identified all of them).
Solution:
(a) The associated non-relational constraint set (which will have to be enforced either through db triggers or by db applications) is presented in Fig. 3.25.
Obviously, acyclicity (P6) is not a relational-type constraint.
Each one of (PD6), (PD7), (PM4), and (C8) involves columns from more than one table.
(P7), (PD8), (PD9), and (A5) apply only to some rows of the corresponding tables.
The corresponding relational scheme, together with a plausible and valid test instance, is presented in Fig. 3.26.
(b)
(i) Table COUNTRIES
m = 2; K’ = K = {x, Country}; T’ = Ø; n = 0; l = k = 2;
FIGURE 3.26 Relational Scheme from Exercise 2.0 and a Valid Test Instance.
i = 1 (C(3, 1) = 3)
1. City is not a key, as there may be cities having same names but dif-
ferent zip codes in a same country or even same zip codes too but from
different countries;
2. ZipCode is not a key, as there may be cities having same zip codes
but different names in a same country or even same names too but in dif-
ferent countries;
3. City • Country is not a key, as there may be cities from a same coun-
try having same name, but having different zip codes;
i = 3 (C(3, 3) = 1)
20. Date • Type • From is not a key, as there may be several payments
of a same type, from a same bank account, and at a same date, but
having different numbers and/or destination bank accounts, etc.;
21. Date • Type • To is not a key, as there may be several payments of a
same type, to a same bank account, and at a same date, but having
different numbers and/or source bank accounts, etc.;
22. Date • Type • Card is not a key, as there may be several payments
having a same date, done with a same card, but having different
numbers and/or destination bank accounts;
23. Date • From • To is not a key, as there may be several payments
done at a same date, from a same bank account to a same bank
account, but with different numbers, etc.;
24. Date • From • Card is not a key, as there may be several payments
done at a same date, from a same bank account, with a same card,
but with different numbers and/or to different destination bank
accounts, etc.;
25. Date • To • Card is not a key, as there may be several payments
done at a same date, to a same bank account, with a same card, but
with different numbers, etc.;
26. Debtor • Type • From is not a key, as there may be several pay-
ments of a same type, from a same bank account, done by a same
debtor, but having different numbers and/or dates and/or destina-
tion bank accounts;
27. Debtor • Type • To is not a key, as there may be several payments
of a same type, to a same bank account, done by a same debtor,
but having different numbers and/or dates and/or source bank
accounts;
28. Debtor • Type • Card is not a key, as there may be several payments
done by a same debtor with a same card, but having different num-
bers and/or dates and/or destination bank accounts;
29. Debtor • From • To is not a key, as there may be several payments
from a same bank account to a same bank account, done by a same
debtor, but having different numbers and/or dates;
30. Debtor • From • Card is not a key, as there may be several pay-
ments done by a same debtor with a same card, but having different
numbers and/or dates and/or destination bank accounts;
15. No. • Debtor • From • Card is not a key, as there may be several payments done by a same debtor with a same card and having a same number, but different dates and/or destination bank accounts;
16. No. • Debtor • To • Card is not a key, as there may be several payments done by a same debtor with a same card to a same bank account and having a same number, but different dates;
17. No. • Type • From • To is not a key, as there may be several payments of a same type from a same bank account to a same bank account with a same number, but having different dates;
18. No. • Type • From • Card is not a key, as there may be several payments done with a same card and having a same number, but different dates and/or destination bank accounts;
19. No. • Type • To • Card is not a key, as there may be several payments done with a same card to a same bank account and having a same number, but different dates;
20. No. • From • To • Card is not a key, as there may be several payments done with a same card to a same bank account and having a same number, but different dates;
21. Date • Debtor • Type • From is not a key, as there may be several payments of a same type done by a same debtor from a same bank account at a same date, but having different numbers and/or destination bank accounts;
22. Date • Debtor • Type • To is not a key, as there may be several payments of a same type done by a same debtor to a same bank account at a same date, but having different numbers and/or source bank accounts;
23. Date • Debtor • Type • Card is not a key, as there may be several payments done by a same debtor with a same card at a same date, but having different numbers and/or destination bank accounts;
24. Date • Debtor • From • To is not a key, as there may be several payments done by a same debtor from a same bank account to a same bank account at a same date, but having different numbers;
25. Date • Debtor • From • Card is not a key, as there may be several payments done by a same debtor with a same card at a same date, but having different numbers and/or destination bank accounts;
21. Debtor • Type • From • To • Card is not a key, as there may be sev-
eral payments done by a same debtor with a same card to a same
bank account, but having different numbers and/or dates;
i = 6 (C(7, 6) = 7)
1-5. No. • Date • Debtor • Type • From • To, No. • Date • Debtor • Type
• From • Card, No. • Date • Debtor • Type • To • Card, No. • Date
• Debtor • From • To • Card, and No. • Date • Type • From • To •
Card are superkeys;
6. No. • Debtor • Type • From • To • Card is not a key, as there may
be several payments done by a same debtor with a same card to a
same bank account and having a same number, but different dates;
7. Date • Debtor • Type • From • To • Card is not a key, as there may
be several payments done by a same debtor with a same card to a
same bank account and having a same date, but different numbers;
i = 7 (C(7, 7) = 1)
No. • Date • Debtor • Type • From • To • Card is a superkey;
To conclude, K′ = {x, No. • Date • Debtor • Type, No. • Date • From, No. • Date • To, No. • Date • Card}; l = 5.
(a) Design, develop, and test ANSI-92 standard SQL statements for
creating and populating these two tables.
(b) Consider that triggers exist for enforcing irreflexivity and asymmetry of the NEIGHBORS binary relation (i.e., rejecting any attempts to store in columns Country and Neighbors pairs of type <x, x>, as well as <y, x> whenever <x, y> is already stored) and design, develop, and test SQL statements for computing:
1. The number of neighbors of Romania. What is the corresponding
result?
2. The number of neighbors of Moldova. What is the corresponding
result?
3. The set of countries without neighbors, in ascending order of their
names. What is the corresponding result?
4. The set of countries having at least k neighbors (k natural), in
descending order of the number of their neighbors and then ascend-
ing on their names. What is the corresponding result for k = 3?
5. The set of countries that do not appear in column Country of table
NEIGHBORS, in ascending order of their names, without using
subqueries. What is the corresponding result?
6. The set of pairs <country name, neighbors number>, in descending
order of neighbors number and then ascending on country name,
also including countries that do not have neighbors. What is the
corresponding result?
7. Adding the fact that Ukraine is also neighbor to Russia, without knowing anything about the two table instances, except for the fact that Ukraine exists in COUNTRIES, while Russia does not. What lines will be added according to the current test instance?
8. Changing to uppercase the names of the countries having the prop-
erty that they are neighbors to at least one neighbor of one of their
neighbors (i.e., country x will have its name modified if it is neigh-
bor both to countries y and z, where z is a neighbor of y). What are
the countries whose names will be modified?
9. Discarding all neighborhood data for all countries whose names
start with the letter ‘G’. What lines will disappear?
Solution:
(a)
CREATE TABLE COUNTRIES (
X INT(3) PRIMARY KEY,
COUNTRY VARCHAR(128) NOT NULL UNIQUE
);
Insert into COUNTRIES (X, COUNTRY) values
(1, 'Romania');
Insert into COUNTRIES (X, COUNTRY) values
(2, 'Moldova');
Insert into COUNTRIES (X, COUNTRY) values
(3, 'Serbia');
Insert into COUNTRIES (X, COUNTRY) values
(4, 'Bulgaria');
Insert into COUNTRIES (X, COUNTRY) values
(5, 'Hungary');
Insert into COUNTRIES (X, COUNTRY) values
(6, 'Ukraine');
Insert into COUNTRIES (X, COUNTRY) values
(7, 'Greece');
Insert into COUNTRIES (X, COUNTRY) values
(8, 'Malta');
CREATE TABLE NEIGHBORS (
X INT(4) PRIMARY KEY,
COUNTRY INT(3) NOT NULL,
NEIGHBOR INT(3) NOT NULL,
ORDER BY COUNTRY;
Result: Malta
4. SELECT COUNTRIES.COUNTRY, NEIGHBORS#
FROM COUNTRIES INNER JOIN
(SELECT COUNTRY, SUM (N#) AS NEIGHBORS# FROM
(SELECT COUNTRY, COUNT(*) AS N#
FROM NEIGHBORS GROUP BY COUNTRY
UNION ALL
SELECT NEIGHBOR, COUNT(*) AS N#
FROM NEIGHBORS GROUP BY NEIGHBOR)
GROUP BY COUNTRY
HAVING SUM (N#) > :K - 1) S
ON COUNTRIES.X = S.COUNTRY
ORDER BY NEIGHBORS# DESC, COUNTRIES.COUNTRY;
Result:
COUNTRY NEIGHBORS#
Romania 5
Bulgaria 3
Serbia 3
5. SELECT COUNTRIES.COUNTRY
FROM COUNTRIES LEFT JOIN NEIGHBORS
ON COUNTRIES.X = NEIGHBORS.COUNTRY
WHERE NEIGHBORS.COUNTRY IS NULL
ORDER BY COUNTRIES.COUNTRY;
Result: Greece
Hungary
Malta
Ukraine
6. SELECT COUNTRIES.COUNTRY, SUM (N#) AS NEIGHBORS#
FROM COUNTRIES INNER JOIN
(SELECT COUNTRY, COUNT(*) AS N# FROM NEIGHBORS
GROUP BY COUNTRY
UNION ALL
SELECT NEIGHBOR, COUNT(*) AS N# FROM NEIGHBORS
GROUP BY NEIGHBOR) S
CREATE TABLE T AS
SELECT X FROM COUNTRIES
WHERE COUNTRY IN ('Ukraine', 'Russia');
INSERT INTO NEIGHBORS
SELECT max(NEIGHBORS.X) + 1, min(T.X), max(T.X)
FROM NEIGHBORS, T;
Result: line < 10, 6, 9 > is appended to table NEIGHBORS
DROP TABLE T;
Note that, for example, the equivalent, much more elegant statement:
3.4. Consider a table r(U) and X1, X2, X ⊆ U such that X1 ∪ X2 = U and X = X1 ∩ X2; prove that r(U)'s decomposition over X1, X2 is lossless ⇔ r(U) satisfies the MVD X →→ X1 (or, equivalently, X →→ X2).
Proof:
(⇒) Let r = πX1(r) |><| πX2(r) and t1, t2 ∈ r such that t1[X] = t2[X]; it has to be shown that ∃t∈r such that t[X1] = t1[X1] and t[X2] = t2[X2], which would confirm the fact that r satisfies X →→ X1. Let t1′ = t1[X1] and t2′ = t2[X2]; it follows that t1′ ∈ πX1(r) and t2′ ∈ πX2(r). By the join definition (as t1′ and t2′ agree on X), t is a tuple of πX1(r) |><| πX2(r) that comes out of t1′ and t2′. Finally, as r = πX1(r) |><| πX2(r), it follows that t∈r, and that r satisfies X →→ X1.
(⇐) Let r be an instance satisfying X →→ X1 and t ∈ πX1(r) |><| πX2(r); for proving this implication, it has to be shown that t∈r (because r ⊆ πX1(r) |><| πX2(r) always holds). From the join and projection operators' definitions, ∃t1, t2∈r such that t[X1] = t1[X1] and t[X2] = t2[X2]; from the MVD definition, as X = X1 ∩ X2, so t1[X] = t2[X], ∃t′∈r such that t′[X1] = t1[X1] and t′[X2] = t2[X2]. This obviously means that t = t′, which proves that t∈r. Q.E.D.
Proof:
(⇒) Let R be a DKNF table, r a valid instance of R, and t a tuple compatible with r; then, r ∪ {t} satisfies all constraints of R as, from the compatibility definition, primitive constraints are satisfied, and, from the DKNF definition, primitive constraints imply all other constraints.
Similarly, for any t∈r, r − {t} satisfies all constraints of R as, obviously, deleting a tuple cannot violate any primitive constraint, and, from the DKNF definition, primitive constraints imply all other constraints.
Consequently, R does not have update anomalies.
(⇐) Let R be a table free of update anomalies; let us assume that R is not in DKNF; as R is not in DKNF, there is at least one constraint c that is not implied by the primitive constraints; this means that there is an instance r of R in which c does not hold, although all primitive constraints are holding; as the constraint set associated to R is coherent, it follows that there is at least an instance r* of R that is valid (and, consequently, c holds in it too); let us build r from r* in the following two steps: first, delete from r* all rows, one by one, and, secondly, insert into r*, again one by one, all rows of r (which is obviously possible, as both instances are finite and all constraints on R are static); as r* is valid and r is not, during this building process there has to be a tuple t and two instances r′ and r″ such that:
1. r′ is valid,
2. r″ is invalid, and
3. r″ = r′ ∪ {t} or r″ = r′ − {t}.
Let us consider both possible cases:
– if r″ = r′ ∪ {t}, then r′ ⊂ r; as r satisfies the primitive constraints, t is compatible with r′; as r″ is invalid, it follows that R has an insertion anomaly;
– if r″ = r′ − {t}, then r′ ⊂ r*, so r′ is valid; as r″ is not valid, it follows that R has a deletion anomaly.
Trivially, as in both possible cases the assumption that R does not have update anomalies is violated, it follows that R is in DKNF.
Q.E.D.
1. COMPOSERS
a. Range restrictions:
- x (cardinality): 10^8 (RC1)
- FirstName, LastName: ASCII(32) (RC2)
- BirthYear: [1200, currentYear() − 3] (RC3)
- PassedAwayYear: [1200, currentYear()] (RC4)
b. Compulsory data: x, LastName, BirthYear (RC5)
c. Uniquenesses:
- x (surrogate key) (RC6)
- FirstName • LastName (there may not be two
composers having same first and last names) (RC7)
d. Other restrictions:
- Nobody can compose before he/she is three or
after his/her death. (RC8)
- Composers may not live more than 120 years. (RC9)
2. MUSICAL_WORKS
a. Range restrictions:
- x (cardinality): 10^12 (RM1)
- Opus: ASCII(16) (RM2)
- Title: ASCII(255) (RM3)
- No: [1, 255] (RM4)
b. Compulsory data: x, Composer, Opus (RM5)
c. Uniquenesses:
- x (surrogate key) (RM6)
- Composer • Opus • No (there may not be two
musical works of a same composer having same
opus and number) (RM7)
3. TONALITIES
a. Range restrictions:
- x (cardinality): 100 (RT1)
- Tonality: ASCII(13) (RT2)
b. Compulsory data: x, Tonality (RT3)
c. Uniquenesses:
- x (surrogate key) (RT4)
- Tonality (there may not be two tonalities having
same names) (RT5)
- *Country • Zip is a key, as there may not be two identical zip codes
in a same country;
- City and Commune are prime, but are not keys as they are part of the
key City • Commune;
- *Country and Zip are prime, but are not keys as they are part of the
key *Country • Zip;
SK = { x, City, Commune, *Country, Zip };
T’ = { Population };
- Population is not a key, as there may be two cities having same
population; moreover, it is nonprime, as population cannot help with
uniquely identifying cities.
T’ = Ø;
T’ = { City, Commune, *Country, Zip };
n = 4;
- as 4 > 1:
l = 3;
kmax = C(4, 2) = 4!/(2! * 2!) = 4 * 3 * 2/(2 * 2) = 3 * 2 = 6;
i = 2;
allSuperkeys = false;
- as 2 ≤ 4 and 3 < 6 and true = true:
allSuperkeys = true;
- City • *Country is not a key, as there may be two cities of a same
country (but of different communes) having same names;
- City • Zip is not a key, as there may be two cities (but of different
countries) having same zip codes;
- Commune • *Country is not a key, as, generally, communes have
several cities;
- Commune • Zip is not a key, as there may be two cities of a same com-
mune having same zip codes;
allSuperkeys = false;
i = 3;
- as 3 ≤ 4 and 3 < 6 and true = true:
allSuperkeys = true;
- as all products of arity 3 are superkeys, the algorithm halts without
discovering any new keys, so that K’ = {x, City • Commune, *Country
• Zip}.
c. Design, develop, and test a SQL query for computing the set of all
pairs <“leaf” employee, employee to whom the “leaf” employee
reports to>.
d. Design, develop, and test a SQL query for computing the set of
all similar pairs for the first and the second hierarchical levels
(hint: if 1 reports to 2 and 2 reports to 3, then the answer should
include the set {<1,2>, <2,3>, <1,3>}).
e.* Design, develop, and test a high level programming language
program (e.g., C#, Java, C++, VBA, etc.) that generates the
SQL query for computing the set of all such pairs up to the k-th
hierarchical level, k being a natural parameter; test the result-
ing SQL output for k = 4.
f.* Design, develop, and test for k = 4 and k greater than the
ReportsTo tree height a high level programming language pro-
gram (e.g., C#, Java, C++, VBA, etc.) with embedded SQL to compute the same thing, by using a temporary table for storing
the result; if k ≤ 0, then compute the subset of all employees
reporting to nobody (hint: pad second column with nulls); if k
greater than the highest hierarchical level h, stop immediately
after adding level h pairs, that is, when the transitive closure
of ReportsTo was computed (hint: add to the temporary table a
Level numeric column; when k > 0, initialize it with the result
of query (c) above, all of them having 1 for Level; loop until k
or until no more lines are added to the temporary result table;
in each step i, increase Level values by 1 and add to the tempo-
rary tables the pairs of the current level, which can be obtained
by a join between the rows of the temporary table having the
Level values equal to i – 1 and Table 3.3).
g.** Compute the transitive closure of ReportsTo in extended SQL
(e.g., in T-SQL, PL/SQL, etc.); establish a top of execution
speeds for computing it with solutions (e), (f), and (g), when
Table 3.3 has at least 100 rows and 16 levels.
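As a starting point for (e) to (g), recall that transitive closures can also be computed with the SQL-99 recursion mentioned in Section 3.15; here is a minimal sketch, assuming Table 3.3 is stored as a table EMPLOYEES(x, ReportsTo) and :k is the natural parameter:
WITH RECURSIVE REPORTS (Emp, Boss, Lvl) AS (
-- level 1: the direct <employee, boss> pairs
SELECT x, ReportsTo, 1 FROM EMPLOYEES WHERE ReportsTo IS NOT NULL
UNION ALL
-- level i + 1: follow each pair's boss to his/her own boss
SELECT R.Emp, E.ReportsTo, R.Lvl + 1
FROM REPORTS R INNER JOIN EMPLOYEES E ON R.Boss = E.x
WHERE E.ReportsTo IS NOT NULL AND R.Lvl < :k
)
SELECT Emp, Boss FROM REPORTS;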
3.28. Consider the subuniverse of a simple OS network file manage-
ment system that is storing only file names, extensions, sizes, folders
(which are files too, having size 0!), virtual, logic, and physical drives, as
well as computers.
a. Using algorithms A0, A1–7 and A7/8–3, as well as the best prac-
tice rules from sections 2.10 and 3.11 design and implement its
DKNF rdb.
b. Populate it with at least 2 computers, each of which having at
least 2 physical, 4 logic, and 2 virtual drives; each logic drive
should have at least 16 non folder files and 8 folders structured
in hierarchies of at least 4 levels.
c.* Design, develop, and test a program for computing the total
size of all files of a subtree of folders, the parameter being the
path of its root folder (hint: see (f) and (g) of exercise 3.27
above, as this is also a transitive closure type problem).
d.* Design, develop, and test a program for computing the set
<computer name, physical drive name, logic drive letter, full
folder path, file name and extension, size> for all computers
having at least k logic drives, each of which having at least
m folders with at least n files each, each such file having size
at least s, where k, m, n, and s are natural parameters, in the
ascending order of computer names, logic drive letters, full
folder paths, then descending on size and ascending on file
names and extensions (hint: also see (f) and (g) of exercise
3.27 above, as computing full paths is also a transitive closure
type problem).
3.29.** Prove that transitive closures are not the only meaningful
queries that complete relational languages cannot express, although their
results contain only rdb stored data87 (hint: consider Table 3.3 and try to
design SQL statements for computing pairs of employees reporting to a
same person such that the result does not contain any pair (f, g) if it already
contains the pair (g, f). Design, develop, and test a program in a high level
programming language at your choice, which embeds SQL, for computing
this desired result. Generalize!).
3.30. Translate all remaining SQL queries from subsection 3.9.2,
as well as those from exercises 3.12 to 3.14 into corresponding RA
expressions.
87 And, moreover, they are invariant with respect to any automorphism of its instance (where automorphisms are autofunctions renaming values such that the corresponding set remains the same).
3.31. Prove that the RA selection and projection operators are monotonic. Extend the monotonicity definition to n-ary operators, n > 1, natural, and prove that the RA join operator is monotonic too.
3.32. Consider any m tables t1(X1), …, tm(Xm), m ≥ 1, natural (where the Xj-s are their schemas); prove that the projection operator is idempotent, as:
πXj(|><|k=1..m (πXk(|><|i=1..m ti))) = πXj(|><|i=1..m ti), ∀j, 1 ≤ j ≤ m.
3.33. Consider the same tables as above; prove that there is a duality between projection and join, as:
(a) πXj(|><|i=1..m ti) ⊆ tj, ∀j, 1 ≤ j ≤ m;
(b) πXj(|><|i=1..m ti) = tj, ∀j, 1 ≤ j ≤ m ⇔ t1, …, tm have a full join.
3.34. Consider table t(X) and X1, …, Xm, m ≥ 1, natural, attribute sets such that ∪j=1..m Xj = X; prove the following somewhat dual results of the above ones:
(a) |><|j=1..m (πXj(t)) ⊇ t;
(b) |><|k=1..m (πXk(|><|j=1..m (πXj(t)))) = |><|j=1..m (πXj(t)).
3.35. Prove that any RA expression may be transformed into an
equivalent one in which selections have only atomic conditions, projec-
tions eliminate only one attribute, and the rest of the needed operators are
unions, renaming, differences, and Cartesian products.
3.36. Prove that any RA expression may be transformed into an equiva-
lent one whose constant tables are all defined over only one attribute and
contain only one row.
3.37. Prove that the proposition corresponding to no-information null values (i.e., R−k(t[A1], ..., t[Ak−1], t[Ak+1], ..., t[An])) is equivalent to the disjunction of the formulas associated to the other two types of null values: ∃x(R(t[A1], ..., t[Ak−1], x, t[Ak+1], ..., t[An])) ∨ ¬∃x(R(t[A1], ..., t[Ak−1], x, t[Ak+1], ..., t[An])) ≡ R−k(t[A1], ..., t[Ak−1], t[Ak+1], ..., t[An]).
3.38. Prove that ∀Y ⊂ X, card(X) = card(πY(X)) ⇒ ∃K ⊆ Y, K key.
3.39. Apply algorithm A7/8–3 to the following table schemas:
a. RULERS (x, Title, Name, Sex, Dynasty, Father, Mother, BirthDay,
BirthPlace, PassedAwayDay, PassedAwayPlace, BurrialPlace,
KilledBy, Nationality, Notes)
b. REIGNS (x, Ruler, Country, FromDate, ToDate)
c. BATTLES (x, BattleYear, BattleSite, Ruler, Opponent, RuledArmy,
EnnemyArmy, Victory?, Notes)
d. MARRIAGES (x, Husband, Wife, WeddingDate, DivorceDate)
3.48. Let t(U) be a table and X1, X2, X ⊆ U such that X1 • X2 = U and X = X1 ∩ X2; prove that t(U) has a lossless decomposition over X1, X2 if it satisfies at least one of the following FDs: X → X1 or X → X2.
3.49. Prove that if instance i of a table satisfies X → Y, Z ⊇ X, and W ⊆ Y, then i satisfies Z → W as well.
3.50. Prove that if instance i of a table satisfies X → Y and Z → W, Z ⊆ Y, then i satisfies X → YW as well.
3.51. Prove that (unikey FD schemas characterization theorem): a scheme [R, F], such that R(U) and F = {X1 → Y1, …, Xk → Yk}, k > 0, natural, has only one key if and only if U − Z1 • … • Zk is a superkey, where Zi = Yi − Xi, 1 ≤ i ≤ k.
3.52. Prove that BCNF ⇒ 3NF ⇒ 2NF ⇒ 1NF (any scheme in BCNF is in 3NF too, any one in 3NF is in 2NF too, and any one in 2NF is in 1NF too).
3.53. By definition, a table scheme S with FDs is in (3,3)-NF if, for any logic proposition F, a nontrivial FD X → A holds in σF(s) for any valid instance s of S if and only if X is a superkey of S; prove that (3,3)-NF ⇒ BCNF (a scheme in (3,3)-NF is in BCNF too).
3.54.* A scheme [R, F] includes an elementary key K if there is at least an attribute A∈R such that ¬∃K′ ⊂ K with K′ → A ∈ F⁺; [R, F] is in the elementary keys NF (EKNF) if, for any nontrivial FD X → A ∈ F, either X is an elementary key or A is part of such a key. Prove that EKNF is strictly stronger than 3NF, but strictly weaker than BCNF.
3.55. Prove that (MVD complementary property) if R(U) satisfies the MVD X →→ Y, then it satisfies the MVD X →→ U − XY too.
3.56. Prove that (trivial MVDs characterization theorem) X →→ U of T is trivial if and only if U ⊆ X or T = X • U.
3.57. Consider table t(U) and XY ⊆ U; prove that:
(a) t satisfies the FD X → Y ⇒ t satisfies the MVD X →→ Y;
(b) the reverse is not true.
3.58. Prove that a relation may satisfy MVD X →→ YZ, although it
violates FD X → Y.
3.59. Prove that X →→ Y ∧ X →→ Z ⇒ X →→ Y − Z (and, consequently, due to complementarity, it also implies X →→ Y ∩ Z).
3.60. Prove that:
(a) 4NF ⇒ BCNF (a scheme in 4NF is in BCNF too);
(b) the above implication would not be true were 4NF only referring to MVDs from Γ instead of Γ⁺.
3.61. Prove that a table in 4NF cannot have more than one pair of
complementary nontrivial MVDs.
3.62. (i) What is the truth value of the following sentence: "Any table with two columns is in PJNF"?
(ii) Propose and prove a proposition that provides necessary and sufficient conditions for a two-column table to be in PJNF.
3.63. Prove that 5NF ⇒ 4NF (a scheme in 5NF is in 4NF too).
3.64. Prove that:
(i) a scheme [R, F] is in BCNF (4NF, respectively) if, for any FD (MVD, respectively) d∈F, there is only one key dependency implying d;
(ii) a similar result for PJNF (i.e., d is a JD) is not true.
3.65.* Prove that, generally, given a set F of FDs, it is not possible
to find a set M of MVDs such that an instance satisfies F if and only if it
satisfies M.
3.66.* Let R(U) be a relation scheme, D its associated domain constraint set, K its key dependency set, and M an associated set of FDs and MVDs; prove that:
(i) if, for any attribute A∈U, card(DA) ≥ 2, then, for any FD or MVD d of M, d is implied by D ∪ M if and only if it is implied by M;
(ii) in the absence of card(DA) ≥ 2, (i) above does not hold anymore;
(iii) if, for any attribute A∈U, card(DA) ≥ n, n > 1, natural, then, for any JD j over U having n components (i.e., of the form |><| [X1, X2, …, Xn], where Xi ⊆ U, ∀i, 1 ≤ i ≤ n), j is implied by D ∪ K if and only if it is implied by K.
3.67. (DKNF is the highest RDM normal form theorem) Consider a scheme [R, Γ] such that all domain constraints allow at least k distinct values, where natural k > 1 is the number of components of the JD having the greatest number of components among all JDs of a cover J of the JD subset of Γ⁺, and, when this subset is empty, there are at least two possible distinct values for any attribute of any table from R; prove that:
(i) DKNF ⇒ PJNF (any scheme in DKNF is in PJNF too);
(ii) DKNF ⇒ (3,3)-NF (any scheme in DKNF is in (3,3)-NF too).
There are very many books on RDM. Not only from my point of view, the current "Bible" in this domain still remains Abiteboul et al. (1995). Chris Date's books (the latest being Date (2013, 2012, 2011, 2003)) are a must too. Other recent ones not to be missed are Garcia-Molina et al. (2014), Hernandez (2013), Churcher (2012), Lightstone et al. (2011), and Halpin and Morgan (2008). Jeffrey D. Ullman's books (Ullman (1988, 1989), Ullman and Widom (2007)) are still of very great value.
Although the big word on the left is ‘compassion’, the big agenda on
the left is dependency.
—Thomas Sowell
89 Which simply states that unknown nulls may take on values only in a finitely restricted attribute domain.
Orlowska and Zhang (1992) proved that the PJNF (generally thought
of as being equivalent to 5NF) is stronger than 5NF. Date and Fagin (1992)
proved that any 3NF table only having single keys is also in PJNF.
Unknown nulls were introduced by Grant (1977); Vassiliou (1979)
and Vassiliou (1980a) consider both unknown and nonexistent ones, in
the framework of denotational semantics; no-information ones were intro-
duced by Zaniolo (1977) and then studied by a lot of researchers for more
than a decade.
Lipski (1979) and Lipski (1981) proposed a first null values gener-
alization, by using partially specified values (nonvoid subsets of attri-
bute domains from which nulls may take values). Wong (1982) advanced
another generalization, based on assigning probabilities to each value of
any domain: 1 to specified ones, 0 to all others, 0 too for all values corre-
sponding to a nonexistent null, 1/n to all unknown nulls for a domain hav-
ing n values90, 1/k to partially specified nulls for a subset having k values,
and again 0 for all of the rest of the values.
Korth and Ullman (1980), Maier (1980), Sagiv (1981), Imielinski and
Lipski (1981) proposed and studied marked nulls: numbered null values
that allow for storing the fact that certain ones should coincide.
Interaction between FDs and nulls was studied by Vassiliou (1980b), Honeyman (1982) (who considered FDs as being interrelational constraints), Lien (1982), Atzeni and Morfuni (1984), and Atzeni and Morfuni (1986).
Existence constraints are due to Maier (1980a). Other constraints on nulls were proposed by Maier (1980a) (disjunctive existence constraints), then studied in depth by Goldstein (1981) and Atzeni and Morfuni (1986), and by Sciore (1981) (objects).
Although referential integrity was already well-known in the 70s and
despite their crucial role in any RDBMS, inclusion dependencies were
first considered much later, starting with Casanova (1981) and Casanova
and Fagin (1984). IND interaction with other dependency types is studied,
for example, in Casanova and Vidal (1983).
Tuple (check) constraints were very rarely formally studied. Mancas et al. (2003) introduced a generalization of tuple (check) constraints called object constraints; it also presents and studies a first order object calculus, strictly more powerful than the relational ones and less powerful than the predicates one, designed to express both domain and object constraints.
90 When domains are infinite, distributions should be considered instead.
More general constraint classes were introduced and studied by Fagin
(1982) (where also studied is domain independence—see Subsection
3.15.2), Yannakakis and Papadimitriou (1982), Beeri and Vardi (1984a),
and Beeri and Vardi (1984b).
General dependency theory was addressed by very many papers: for example, Fagin and Vardi (1986), Vardi (1987), Kanellakis (1991), and Thalheim (1991); Nicolas (1978), Gallaire and Minker (1978), and Nicolas (1982) approach it from the first order logic perspective; Imielinski and Lipski (1983) had an interesting and singular point of view in this field too.
The set of all known dependencies may be partitioned in the following
two classes (Beeri and Vardi (1984a) and Beeri and Vardi (1984b)):
FIGURE 3.32 RDM normal forms and corresponding dependency types’ hierarchy.
NF, Class (the considered constraints class), O(?) (which stores the algorithm complexity: P(olynomial) or E(xponential)), and Prop. (the normalization properties: C(onstraint) P(reservation) and L(ossless) D(ecomposition)):
Initially, normalization aimed to provide automatic design of rdb schemas. At least starting with the 90s (see, for example, Atzeni and de Antonellis (1993)), more and more researchers are convinced that such an approach is not adequate and that it is preferable to design db schemas in a higher, semantic data model that provides an algorithm for then translating its schemas into the desired relational NFs. In this context, RDM normalization theory only provides milestones and characterizes the targets of these translating algorithms.
Whatever may happen to you was prepared for you from all eternity;
and the implication of causes was from eternity spinning
the thread of your being.
—Marcus Aurelius
Derivation rules were the first technique used for studying the RDM constraints implication problem: Armstrong (1974) introduced a sound and complete set of inference rules for FDs. Of course, generally, besides being a "nice" theoretical result per se, the existence of such a set is very useful for solving the corresponding implication problem.
Note, however, that, surprisingly, for decades, the existence of such a sound and complete set of inference rules for any constraint class (the so-called complete axiomatization of the corresponding class) was considered more important than the implication problem decidability, although the existence or nonexistence of complete axiomatizations and the decidability (generally, the complexity) of the corresponding implication problem are orthogonal.
The second technique used for studying the RDM constraints implication problem was the chasing tableaux algorithms (or, simply, the chase): algorithms that start with a tableaux (i.e., a table whose instance contains not only constants, but variables too, thus abstracting several possible (sub)instances) and a constraint set, try to build a counterexample tableaux (of the fact that every set is included in its closure), and force it to satisfy all given constraints, in order to obtain a tableaux which satisfies the given set of constraints and is as similar as possible to the original one.
Note that a given constraint is implied by the given set of constraints iff
it holds in the obtained tableaux too. Also note that chase’s appeal comes
not only from its very intuitive nature, but also from the fact that it is
somewhat parameterizable with respect to the considered constraint class.
A chase foremother was already prefigured in Aho et al. (1979a); chase
was introduced in Maier et al. (1979) (that developed ideas from Aho et al.
(1979a) and Aho et al. (1979b)), which also uses it both for solving logic
implication problems and tableaux queries, as well as for studying the FD
and MVD implication problem.
Chang and Lee (1973) and Beeri and Vardi (1980) compare the chase with the paramodulation resolution theorem proving technique. Beeri and Vardi (1984b) extends the chase's use to the more general data dependencies.
Typed INDs are studied using the chase and equational theories in Cos-
madakis and Kanellakis (1986).
Abiteboul et al. (1995) used the chase both for JDs and JDs ∪ FDs, proving that they are NP-complete and that the implication problem JDs ⊢ MVD is NP-hard.
mains; a language is domain independent iff all of its expressions are domain independent.
moves the focus from query results for a given db instance to queries viewed as functions: a language L is CH-complete iff, for any db scheme and generic query Q, there exists an L-expression E such that, for any corresponding db instance i, Q(i) = E(i).
Moreover, this paper also proposed QL, a canonical CH-complete lan-
guage: informally, QL includes RA extended with variables over relations,
an assignment statement, and a looping one equipped with a test for reach-
ing the empty set, the kernel of today’s extended SQLs.
Related notions over generic queries were studied by Hull (1986).
Note that, as we have already discussed, actual commercial SQL
and Quel are generally extended in the CH-completeness sense92 and/or
embedded in high-level host programming languages93 that provide full
computing power.
Aho and Ullman (1979) proved that transitive closures cannot be
expressed in RA.
Chandra (1988) is an excellent overview of query languages, present-
ing hierarchies of such languages based on expressive power, complexity,
and programming primitives; moreover, it provides an impressive bibliog-
raphy, including mathematical logic previously obtained results (in larger
contexts) that apply to rdb querying as well.
There are three approaches for dealing with nulls in rdb querying:
1. based on derived RA operators: for example, augmentation (unary,
computing a relation over the operand’s attributes, plus other ones,
and whose tuples are obtained from those of the operand’s, but
extending them with nulls) and external (natural) join (or full outer
join) from Gallaire and Minker (1978), and total projection (the
usual projection followed by elimination of all tuples containing
nulls) from Kandzia (1979);
2. based on constraints on nulls: see Imielinski and Lipski (1981,
1983, 1984), (Abiteboul et al., 1987), and (Grahne, 1989);
3. based on the IsNULL predicate, which does not originate from the
db research, but from the RDBMS industry.
SQL stemmed from IBM’s SEQUEL (“Structured English QUEry
Language”), introduced by Chamberlin and Boyce (1974), while work-
92 See, for example, IBM DB2 SQL PL, Oracle PL/SQL, MS T-SQL, etc.
93 See, for example, IBM embedded SQL for Java, C, C++, Cobol, etc., MS VBA and .NET languages, etc.
ing for the world's first RDBMS prototype, System R (Astrahan et al., 1976; Chamberlin et al., 1981). SQL's formal semantics is provided by Negri et al. (1991). SQL is both the de facto and the de jure relational standard language (ISO/IEC, 1991; ISO/IEC, 1992); its O-O extension (called O2SQL) was proposed in Cattell (1997).
Starting with the SQL-99 standard (ISO/IEC, 2003), which extended SQL, among other things, with recursion, linear, nonlinear, and mutual recursion are all allowed, but recursive subqueries and aggregation are disallowed. The current standard is ISO/IEC (2011).
Lots of good books on SQL programming are available, from those offered by the RDBMS providers (e.g., Zamil et al. (2004), Lorentz et al. (2013), Moore et al. (2013)), as part of their online documentation, up to the commercial ones (e.g., Rockoff (2011), Ben-Gan (2012), Berkowitz (2008), Murach (2012)).
A similar TRC-type language, Quel, was introduced for the RDBMS
INGRES (Stonebraker et al., 1976).
QBE is due to another IBM researcher, Moshe Zloof (Zloof, 1977).
Ozsoyoglu and Wang (1993) presents all of its important extensions, as well as all of its flavors. QBE's exceptional graphic power is exploited by several RDBMSs: for example, IBM embeds QBE into its DB2 QMF (Query Management Facility); Microsoft provides it as its queries' Design View, in both Access and SQL Server, and also in Excel, as MS Query; PostgreSQL offers its OBELisQ; etc.
Selinger et al. (1979) describes System R's query evaluation optimizations. Ullman (1982) presents a six-step heuristic optimization algorithm that is still widely used today. An exhaustive discussion of query evaluation optimizations is presented in Graefe (1993). Sun et al. (1993) contains both a proposal of an estimation technique, based on series approximating instance value distributions and on regression analysis, for estimating the result cardinality of queries with joins, as well as an impressive list of references dealing with query evaluation plans' cost estimations.
For a couple of decades, db instance modifications were almost neglected, as it was believed that they may be trivially formalized with three simple operators (insert, update, and delete of tuples satisfying desired criteria) that might be reducible to basic set operators. Abiteboul and Vianu (1988a, 1988b, 1989, 1990) proved that there are, however, deep problems in this direction too, which opened a new, still active field of study. An overview of the fundamentals in this area is provided by Abiteboul (1988).
3.15.3 MISCELLANEA
Note that the relational one from (Mancas, 2002b) is slightly simpler
than this one (coming from Mancas (2008)), as it was in fact designed for
MatBase, which only uses single surrogate primary keys and single for-
eign keys always referencing primary keys.
DKRINF and the best practice rules (see Section 3.11) come from
Mancas (2001).
Definition 3.3 and proposition 3.3, as well as Exercises 3.2, 3.40, 3.41
and 3.42 are due to Val Tannen (Breazu and Mancas, 1980, 1981, and
1982).
The key propagation principle formalization (proposition 3.2), comes
from Mancas (2001) and was first published outside Romania in Mancas
and Crasovschi (2003), in the (E)MDM framework.
For the theory of NP-completeness, see, for example, Garey and John-
son (1979). For the complexity theory, see, for example, the excellent
(Calude, 1988).
Exercises 3.0, 3.1, 3.6 to 3.17, 3.26 to 3.30, 3.39, 3.43, 3.68, and 3.71 to
3.87 are original (most of them are from Mancas (2001), Mancas (2002),
Mancas (2007), and Mancas (2008)). All other exercises are from other
books (mainly from Abiteboul et al. (1995) and Atzeni and de Antonellis
(1993)).
Cons on relational db design (see also, for example, Atzeni and de Antonellis (1993)) were first presented in Kent (1979). The ones from Subsection 3.13.2 also come from Mancas (1985, 1997, 2007).
For ISO 3166 country (and their subdivisions) codes international stan-
dard, see http://www.iso.org/iso/home/standards/country_codes.htm.
All URLs mentioned in this section were last accessed on
August 15th, 2014.
KEYWORDS
• acyclicity constraint
• aggregation function
• ALL
• ALTER TABLE
• anti join
• ANY
• arity
• AS
• attribute
• attribute concatenation
• attribute domain
• attribute name
• attribute product
• augmentation
• autonumber
• AVG
• BEGIN TRANS
• Boyce-Codd Normal Form (BCNF, 3.5NF)
• candidate key
• cardinal
• Cartesian product
• CH-completeness
• chase
• CHECK
• check constraint
• column
• COMMIT
• complementary property
• complementation
• complete axiomatization
• complete join
• completeness
• composed key
• compulsory attribute
• computable queries
• computed attribute
• computed relation
• concatenated key
• constraint
• constraint name
• constraint preservation
• constraint representation
• COUNT
• CREATE TABLE
• cross tabulation
• cursor
• dangling pointer
• data definition language (DDL)
• data dependency
• data manipulation anomalies
• data manipulation language (DML)
• data set
• decidability
• decomposition
• DELETE
• delete anomaly
• dependency
• derivation rules
• dirty null
• DISTINCT
• DISTINCTROW
• Division
• Domain-Key Normal Form (DKNF)
• Domain-Key Referential Integrity Normal Form (DKRINF)
• domain constraint
• domain independence
•• IsNULL predicate
•• join
•• join dependency
•• kernel equivalence relation
•• key
•• key constraint
•• Key Propagation Principle (KPP)
•• left join
•• LIMIT
•• limited complete axiomatization
•• lossless decomposition
•• mandatory attribute
•• marked null
•• MAX
•• metacatalog
•• metadata catalog
•• MIN
•• minimally one-to-one
•• minimal uniqueness
•• multivalued dependency
•• natural join
•• nested relation
•• no information null
•• non-relational constraint
•• non prime attribute
•• normalization
•• normalization problem
•• not applicable null
•• NOT EXISTS
•• NOT IN
•• NOT NULL
•• not null
•• not null constraint
•• null value
•• object constraint
•• ORDER BY
•• outer join
•• partial functional dependency
•• primary key
•• prime attribute
•• primitive constraint
•• projection
•• Projection-Join Normal Form (PJNF)
•• QBE
•• QL
•• Quel
•• query
•• quotient set
•• range
•• rdb instance
•• rdb name
•• rdb scheme
•• record set
•• referential integrity
•• relation
•• relational algebra (RA)
•• relational calculus (RC)
•• relational completeness
•• relational constraint
•• relational db (rdb)
•• relation name
•• renaming
•• required attribute
•• REVOKE
•• right join
•• ROLLBACK
•• row
•• scheme
•• SELECT
•• selection
•• semantic key
•• semi join
•• simple key
•• SOME
•• soundness
•• SQL
•• static SQL
•• static subquery
•• stored function
•• stored procedure
•• subquery
•• SUM
•• superkey
•• surrogate key
•• syntactic key
•• synthesis
•• table
•• tableaux
•• table name
•• temporarily unknown null
•• temporary relation
•• theta join
•• TOP
•• total attribute
•• totality constraint
•• total projection
•• transaction
•• transitive closure
•• transitive functional dependency
•• trigger
•• tuple
•• tuple constraint
•• tuple relational calculus (TRC)
•• tuple-generating constraint
•• typed inclusion constraint
•• type domain
•• undecidability
•• union
•• UNION
•• UNION ALL
•• UNIQUE
•• unique attribute
•• uniqueness constraint
•• universal relation
•• UPDATE
•• update anomaly
•• value domain
•• view
•• WHERE
REFERENCES
Abiteboul, S. (1988). Updates, a new frontier. ICDT’88 2nd Intl. Conf. on DB Theory,
1–18, Springer-Verlag, Berlin, Germany.
Abiteboul, S., Hull, R., Vianu, V. (1995). Foundations of Databases; Addison-Wesley:
Reading, MA.
Abiteboul, S., Kanellakis, P. C., Grahne, G. (1987). On the representation and querying of
possible worlds. ACM SIGMOD Intl. Conf. on Manag. of Data, 34–48.
Abiteboul, S., Vianu, V. (1988a). Equivalence and optimization of relational transactions.
JACM, 35(1), 70–120.
Abiteboul, S., Vianu, V. (1988b). Procedural and declarative database update languages.
7th ACM SIGACT SIGMOD SIGART Symp. Princ. DB Syst., 240–250.
Abiteboul, S., Vianu, V. (1989). A transaction-based approach to relational database speci-
fication. JACM, 36(4), 758–789.
Abiteboul, S., Vianu, V. (1990). Procedural languages for database queries and updates. J.
Comp. Syst. Sci., 41(2), 181–229.
Aho, A. V., Beeri C., Ullman J. D. (1979). The theory of joins in relational databases. ACM
TODS 4(3), 297–314.
Aho, A. V., Sagiv, Y., Ullman, J. D. (1979a). Efficient optimization of a class of relational
expressions. ACM TODS 4(3), 435–454.
Aho, A. V., Sagiv, Y., Ullman, J. D. (1979b). Equivalence of relational expressions. SIAM
J. Computing, 8(2), 218–246.
Aho, A. V., Ullman, J. D. (1979). Universality of data retrieval languages. 6th ACM Symp.
on Princ. of Progr. Lang., 110–117.
Allison, C., Berkowitz, N. (2008). SQL for Microsoft Access, 2nd edition: Wordware Pub-
lishing, Inc. Plano, TX.
ANSI/X3/SPARC Study Group on Database Management Systems. (1975). Interim Report
75–02–08. FDT-Bulletin ACM SIGMOD, 7(2).
Armstrong, W. W. (1974). Dependency structure of database relationships. IFIP Congress,
580–583.
Astrahan, M. M., et al. (1976). System R: A Relational Approach to Database Manage-
ment. ACM TODS, 1(2), 97–137.
Atzeni, P., de Antonellis, V. (1993). Relational Database Theory. Benjamin/ Cummings:
Redwood, CA.
Atzeni, P., Morfuni, N. M. (1984). Functional dependencies in relations with null values.
Inf. Processing Letters, 18(4), 233–238.
Atzeni, P., Morfuni, N. M. (1986). Functional dependencies and constraints on null values
in database relations. Information and Control, 70(1), 1–31.
Bancilhon, F. (1978). On the completeness of query languages for relational databases.
LNCS Math. Found. of Comp. Sci., 64, 112–124, Springer-Verlag: Berlin, Germany.
Beeri, C., Bernstein, P. A. (1979). Computational problems related to the design of normal
form relational schemas. ACM TODS 4(1), 30–59.
Beeri, C., Fagin, R., Howard, J. H. (1977). A complete axiomatization for functional and
multivalued dependencies. ACM SIGMOD Intl. Symp. Manag. Data, 47–61.
Beeri, C., Honeyman, P. (1981). Preserving functional dependencies. SIAM Journal on
Computing, 10(3), 647–656.
Beeri, C., Rissanen, J. (1980). Faithful representation of relational database schemata. Re-
port RJ 2722. IBM Research. San Jose.
Beeri, C., Vardi, M. Y. (1980). A proof procedure for data dependencies (preliminary re-
port). Tech. Rep., Hebrew Univ.: Jerusalem, Israel.
Beeri, C., Vardi, M. Y. (1984a). Formal systems for tuple and equality-generating depen-
dencies. SIAM J. Computing, 13(1), 76–98.
Beeri, C., Vardi, M. Y. (1984b). A proof procedure for data dependencies. JACM 31(4),
718–741.
Ben-Gan, I. (2012). Microsoft SQL Server 2012. T-SQL Fundamentals (Developer Refer-
ence). O’Reilly Media, Inc.: Sebastopol, CA.
Bernstein, P. A. (1976). Synthesizing third normal form relations from functional depen-
dencies. ACM TODS, 1(4), 277–298.
Bernstein, P. A., Goodman N. (1980). What does Boyce-Codd form do?. 6th Intl. Conf. on
VLDB, 245–259, Montreal, Canada.
Biskup, J., Dayal, U., Bernstein, P. A. (1979). Synthesizing independent database schemes.
ACM SIGMOD Intl. Conf. on Manag. of Data, 143–151.
Breazu, V., Mancas, C. (1980). On a Functional Data Model in the database mathematical
theory. Comparison to the Relational Data Model (in Romanian). Unpublished com-
munication at the 6th Symposium on Informatics for Management CONDINF’80,
Cluj-Napoca, Romania.
Breazu, V., Mancas, C. (1981). On the description power of a Functional Database Model.
Proc. 4th Intl. Conf. on Control Systems and Computer Science, 4:223–226, Po-
litehnica University, Bucharest, Romania.
Breazu, V., Mancas, C. (1982). Normal forms for schemas with functional dependencies in the
Relational Database Model (in Romanian). Proc. 12th Conf. on Electronics, Telecom-
munications and Computers, 152–156, Politehnica University, Bucharest, Romania.
Calude, C. (1988). Theories of Computational Complexity. North Holland: Amsterdam,
Holland.
Casanova, M. A. (1981). The theory of functional and subset dependencies over relational
expressions. Tech. Rep. 3/81, Rio de Janeiro, Brazil.
Casanova, M. A., Fagin R., Papadimitriou C. H. (1984). Inclusion dependencies and their
interaction with functional dependencies. J. Comp. Syst. Sci., 28(1), 29–59.
Casanova, M. A., Vidal, V. M. P. (1983). Towards a sound view integration technology.
ACM SIGACT SIGMOD Symp. Princ. DB Syst., 36–47.
Cattell, R. G. G., ed. (1994). The Object Database Standard: ODMG-93. Morgan Kaufmann:
Los Altos, CA.
Chamberlin, D. D., Boyce, R. F. (1974). SEQUEL—A Structured English QUEry Lan-
guage. SIGMOD Workshop, 1, 249–264.
Chamberlin, D.D., et al. (1981). A History and Evaluation of System R. CACM, 24(10),
632–646.
Chan, E. P. F. (1989). A design theory for solving the anomalies problem. SIAM J. Com-
puting, 18(3), 429–448.
Chandra, A. K. (1988). Theory of database queries. 7th ACM SIGACT SIGMOD SIGART
Symp. Princ. DB Syst., 1–9.
Chandra, A. K., Harel, D. (1980). Computable queries for relational databases. J. Comp.
Syst. Sci., 21, 333–347.
Chandra, A. K., Lewis, H. R., Makowsky, J. A. (1981). Embedded implicational dependen-
cies and their inference problem. 13th ACM SIGACT Symp. on Theory of Compu-
ting, 342–354.
Chandra, A. K., Vardi, M. Y. (1985). The implication problem for functional and inclusion
dependencies is undecidable. SIAM J. Computing, 14(3), 671–677.
Chang, C. L., Lee, R. C. T. (1973). Symbolic Logic and Mechanical Theorem Proving. Aca-
demic Press: New York, NY.
Chiang, R. H. L., Barron, T. M., Storey, V. C. (1994). Reverse engineering of relational da-
tabases: Extraction of an EER model from a relational database. Data & Knowledge
Engineering 12(2): 107–142, Elsevier B.V.
Churcher, C. (2012). Beginning Database Design: From Novice to Professional, 2nd ed.,
Apress Media LLC: New York, NY.
Codd, E. F. (1970). A relational model for large shared data banks. CACM, 13(6), 377–387.
Codd, E. F. (1971a). A database sublanguage founded on the relational calculus. SIGFI-
DET’71, Proceedings of the ACM SIGFIDET (now SIGMOD) Workshop on Data
Description, Access, and Control, San Diego, CA, November 11–12, (1971); ACM:
New York, NY, 35–68.
Codd, E. F. (1971b). Further normalization of the data base relational model; Research
Report RJ909; IBM: San Jose, CA.
Codd, E. F. (1972). Relational completeness of data base sublanguages; Research Report
RJ987; IBM: San Jose, CA.
Codd, E. F. (1974). Recent investigations into relational database systems; Research Re-
port RJ1385; IBM: San Jose, CA.
Codd, E. F. (1975). Understanding Relations. FDT (ACM SIGMOD Records), Vol. 7, No.
3–4, 23–28.
Codd, E. F. (1979). Extending the database relational model to capture more meaning.
ACM TODS, 4(4), 397–434.
Cosmadakis, S., Kanellakis, P. C. (1985). Equational theories and database constraints.
ACM SIGACT Symp. on the Theory of Computing, 273–284.
Cosmadakis, S., Kanellakis, P. C. (1986). Functional and inclusion dependencies: A graph
theoretic approach. Adv. in Computing Res., 3, 164–185, JAI Press.
Cosmadakis, S., Kanellakis, P. C., Spyratos, N. (1986). Partition semantics for relations. J.
Comp. Syst. Sci., 32(2), 203–233.
Cosmadakis, S., Kanellakis, P. C., Vardi, M. Y. (1990). Polynomial-time implication prob-
lems for unary inclusion dependencies. JACM, 37(1), 15–46.
Date, C. J. (2003). An Introduction to Database Systems, 8th ed., Addison-Wesley: Read-
ing, MA.
Date, C. J. (2011). SQL and Relational Theory: How to Write Accurate SQL Code, 2nd ed.,
Theories in Practice; O’Reilly Media, Inc.: Sebastopol, CA.
Date, C. J. (2012). Database Design & Relational Theory: Normal Forms and All That
Jazz; Theories in Practice; O’Reilly Media, Inc.: Sebastopol, CA.
Date, C. J. (2013). Relational Theory for Computer Professionals: What Relational Data-
bases are Really All About; Theories in Practice; O’Reilly Media, Inc.: Sebastopol,
CA.
Date, C. J., Fagin, R. (1992). Simple conditions for guaranteeing higher normal forms in
relational databases. ACM TODS 17(3), 465–476.
Delobel, C. (1978). Normalization and hierarchical dependencies in the relational data
model. ACM TODS 3(3), 201–222.
Di Paola, R. (1969). The recursive unsolvability of the decision problem for the class of
definite formulas. JACM, 16(2), 324–327.
Fagin, R. (1977). Multivalued dependencies and a new normal form for relational data-
bases. ACM TODS 2(3), 226–278.
Fagin, R. (1979). Normal forms and relational database operators. ACM SIGMOD Intl.
Conf. on Manag. of Data, 123–134.
Fagin, R. (1981). A normal form for relational databases that is based on domains and keys.
ACM TODS 6(3), 387–415.
Fagin, R. (1982). Horn clauses and database dependencies. JACM, 29(4), 952–983.
Fagin, R., Vardi, M. Y. (1986). The theory of data dependencies: A survey. Math. of Infor-
mation Processing: Proc. of Symp. In Applied Math., 34, 19–71, American Math.
Soc., Providence, RI.
Gallaire, H., Minker, J. (1978). Logic and Databases. Plenum Press: New York, NY.
Garcia-Molina, H., Ullman, J.D., Widom, J. (2014). Database Systems: The Complete
Book, 2nd ed., Pearson Education Ltd.: Harlow, U.K., (Pearson New International
Edition)
Garey, M. R., Johnson, D. S. (1979). Computers and Intractability: A Guide to the Theory
of NP-Completeness (V. Klee, ed.). A Series of Books in the Mathematical Sciences.
W. H. Freeman and Co.
Goldstein, B. S. (1981). Constraints on null values in relational databases. 7th Intl. Conf.
on VLDB, 101–111.
Graefe, G. (1993). Query evaluation techniques for large databases. ACM Computing Sur-
veys, 25(2), 73–170.
Grahne, G. (1989). Horn tables—an efficient tool for handling incomplete information in
databases. 8th ACM SIGACT SIGMOD SIGART Symp. Princ. DB Syst., 75–82.
Grant, J. (1977). Null values in a relational database. Inf. Processing Letters, 6(5), 156–159.
Groff, J. R. (1990). Using SQL. Osborne McGraw-Hill, New York, NY.
Gurevich, Y. (1966). The word problem for certain classes of semigroups. Algebra and
Logic, 5, 25–35.
Halpin, T., Morgan, T. (2008). Information Modeling and Relational Databases, 2nd Edi-
tion. Morgan Kaufmann: Burlington, MA.
Heath, I. (1971). Unacceptable File Operations in a Relational Database. SIGFIDET’71,
Proceedings of the ACM SIGFIDET (now SIGMOD) Workshop on Data Descrip-
tion, Access, and Control, San Diego, CA, November 11–12, ACM: New York, NY,
19–33.
Hernandez, M. J. (2013). Database Design for Mere Mortals: A Hands-on Guide to Rela-
tional Database Design, 3rd ed., Addison-Wesley: Reading, MA.
Honeyman, P. (1982). Testing satisfaction of functional dependencies. JACM, 29(3), 668–
677.
Hull, R. B. (1986). Relative information capacity of simple relational schemata. SIAM J.
Computing, 15(3), 856–886.
Imielinski, T., Lipski, W. (1981). On representing incomplete information in relational da-
tabases. 7th Intl. Conf. on VLDB, 388–397, Cannes, France.
Imielinski, T., Lipski, W. (1983). Incomplete information and dependencies in relational
databases. ACM SIGMOD Intl. Conf. on Manag. of Data, 178–184.
Imielinski, T., Lipski, W. (1984). Incomplete information in relational databases. JACM,
31(4), 761–791.
ISO/IEC. Database language SQL (SQL-2011). 9075: 2011, 2011.
ISO/IEC. Database language SQL (SQL-99). 9075–2: 1999, 2003.
ISO/IEC. Database language SQL (SQL3). JTC1/SC21 N 6931, 1992.
ISO/IEC. Database language SQL. JTC1/SC21 N 5739, 1991.
Johnson, D. S., Klug, A. (1984). Testing containment of conjunctive queries under func-
tional and inclusion dependencies. J. Comp. Syst. Sci., 28, 167–189.
RELATIONAL SCHEMAS
IMPLEMENTATION AND REVERSE
ENGINEERING
CONTENTS
4.10 Conclusion.................................................................................. 534
4.11 Exercises..................................................................................... 535
4.12 Past and Present.......................................................................... 577
Keywords............................................................................................... 578
References.............................................................................................. 579
As they are very easy to design and even implement, especially after
studying Sections 4.2 and 4.3, the AF8′ algorithms are left to the reader (see
Exercise 4.3). Similarly, the dual REAF0′ algorithms are very simple,
except for the Access one, which is dealt with in Section 4.4; consequently, they
are left to the reader too (see Exercise 4.5). The same goes for the dual of
A8, the reverse engineering algorithm REA0 (see Exercise 4.6).
Please note that A8 only considers single foreign keys referencing (single)
surrogate primary keys, which is what you should always use. If, however, you
also use the strongly discouraged concatenated foreign keys that the RDM
provides too, then you need a slight variation of this algorithm (see Exercise 4.7).
Section 4.5 discusses the family of algorithms AF1–8, probably the
most important set of forward “shortcuts” for actually implementing E-R
data models into rdbs as fast as advisable, and presents, as an example, one
of its members (for MS SQL Server dbs).
Section 4.6 discusses the REAF0–2 family, dual to the above one
and at least as important, especially for decrypting the architecture and
the semantics of legacy (but not only), undocumented (or poorly docu-
mented) rdbs, and also presents, as an example, one of its members (for
MS Access dbs).
Section 4.7 presents (as a rdb reverse engineering case study) the
results of applying to an Access db (Access again, as it is the only one
of the five RDBMSes considered in this book for which reverse engi-
neering is not that obvious) the corresponding algorithms of the families
REAF0′ (composed with REA0) and REAF0–2.
Finally, like all the other chapters of this book, this one ends with sections
dedicated to the corresponding best practice rules, the math behind it, con-
clusions, exercises, and past, present, and references.
Figure 4.1 presents algorithm A8, for translating relational schemas into
corresponding SQL DDL ANSI standard scripts. For the column domains
D′, see Table 4.1, which presents the SQL ANSI-92 data types (with the
addition of the autonumbers, collections, and XML that were intro-
duced in later ANSI standards).
Variants of algorithm A8 (which target as output language their own
SQL idioms instead of the ANSI standard ones) are implemented
by all RDBMS versions that provide automatic generation of SQL DDL
scripts for the rdbs they manage (which includes DB2 10.5, Oracle 12c,
MySQL 5.7, and SQL Server 2014).
FIGURE 4.1 Algorithm A8 (Translation of RDM Schemas into SQL ANSI-92 Scripts).
Its n and m parameters are optional, so all of (and only) the following expres-
sions are valid: DECIMAL, DECIMAL(n), and DECIMAL(n, m).
DB2 object names may generally have at most 128 bytes (i.e., between 32
and 128 characters, depending on the chosen alphabet and encoding) and
are not case sensitive: you can freely use both upper and lower case (and
DB2 stores them exactly as you defined them if they were enclosed in
double quotes, in which case you should always refer to them alike, or in
uppercase, if they were not delimited with double quotes).
95 Third place, with some 17%, was occupied by MS SQL Server; all other competitors, including
the very popular open source MySQL, which was leading this pack, are fighting to survive in or emerge
from the remaining some 20%; surprisingly, at that time, the only NoSQL DBMS ranked among the top
ten in popularity was MongoDB; again by popularity, Access ranked 7th, only two seats behind DB2.
The so-called ordinary identifiers should start with a letter and can
contain only letters (with all lowercase ones automatically translated to
uppercase), digits, and ‘_’. The delimited identifiers, which should be
declared and used enclosed in double quotes, can use almost anything,
including lowercase letters and even leading spaces.
Date and time constants, as well as text strings, are enclosed in single
quotation marks. You can use apostrophes within text strings by doubling
them; for example, the text string '''quote'''’s value is 'quote'.
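For instance, here is a minimal sketch (all names are illustrative) mixing
a delimited identifier, ordinary identifiers, and a string constant with a
doubled apostrophe:

CREATE TABLE "Order Details" ( -- delimited: case and spaces preserved
  ORDER_ID INTEGER NOT NULL,   -- ordinary: stored in uppercase
  NOTE VARCHAR(100)
);
INSERT INTO "Order Details" (ORDER_ID, NOTE)
  VALUES (1, 'the customer''s first order');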
DB2 provides the SQL CREATE DATABASE statement too but, mainly
due to security and physical disk space management concerns, creating dbs
is typically done only by DBAs, who generally use for this task the GUI of
the IBM Data Studio tool. However, you can do it yourself in its Express
editions, as there you are the DBA too.
Computed tables are called views and can be created with the CRE-
ATE VIEW … AS SELECT … SQL DDL statement; they can be created
graphically through the IBM DB2 Query Management Facility’s (QMF)
QBE-type GUI too. Unfortunately, QMF (which also provides analytics)
currently has only a commercial Enterprise edition.
DB2 was the first RDBMS to introduce in-memory tables, very useful
especially (but not only) for OLAP processing, as part of the BLU technologies
(a development code name standing for “big data, lightning fast, ultra-easy”),
a bundle of novel techniques for columnar processing, data deduplication,
parallel vector processing, and data compression that speeds up data analysis
even more than 25 times.
Temporary tables, either local or global, are created in user temporary
tablespaces with the CREATE [GLOBAL] TEMPORARY TABLE … SQL
DDL statement. The local ones exist during the current user session or the
lifetime of the procedure in which they are declared. The global ones have
persistent definitions, but their data is not persistent (as it is automatically deleted
by the system at the end of the db sessions) and they generate no redo or
rollback information. If, however, you would like their instances to be persistent
too, you can add the ON COMMIT PRESERVE ROWS clause at the end of
the above statement.
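For instance, a minimal sketch (table and column names are illustrative):

CREATE GLOBAL TEMPORARY TABLE SESSION_TOTALS (
  CustomerID INTEGER NOT NULL,
  Total DECIMAL(15, 2)
) ON COMMIT PRESERVE ROWS; -- rows survive COMMITs until the session ends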
Moreover, DB2 also provides temporary tables which only exist in a
FROM or WHERE clause during the execution of a single SQL statement, as
well as temporary tables that are part (and exist only during the execution)
of a WITH SQL statement.
You can associate indexes, triggers, statistics, and views to temporary
tables too. Dually, for temporary tables you cannot define referential integ-
rity, or specify default values for their columns, or have columns declared
as a ROWID or LOB (V6), or create such tables LIKE a declared global
temporary table, or issue LOCK TABLE statements on them, or use DB2
parallelism, etc.
DB2 also provides table hierarchies, in which subtables may inherit the
privileges of their supertables.
All the problems of the world could be settled easily if men were only
willing to think.
The trouble is that men very often resort to all sorts of devices in
order not to think, because thinking is such hard work.
—Thomas J. Watson
You can get the current system date, time, and timestamp directly
from the registers CURRENT DATE (or CURRENT_DATE), CURRENT
TIME (or CURRENT_TIME), and CURRENT TIMESTAMP (or CUR-
RENT_TIMESTAMP), respectively. For getting the current system locale
(by default, the “en_US” one, for English U.S.), you can use the CUR-
RENT LOCALE LC_TIME register. Similarly, you can get the local offset
with respect to the current UTC time from the CURRENT TIMEZONE (or
CURRENT_TIMEZONE) register. You can get the year part of a date or
timestamp d with the function Year(d).
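For example (a sketch using the standard one-row SYSIBM.SYSDUMMY1
table):

SELECT CURRENT DATE, CURRENT TIME, CURRENT TIMESTAMP,
       CURRENT TIMEZONE, Year(CURRENT DATE)
FROM SYSIBM.SYSDUMMY1;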
There is no CURRENCY data type: during data integrations, DB2
translates such data types into DECIMAL.
Every time we’ve moved ahead in IBM, it was because someone was
willing to take a chance,
put his head on the block, and try something new.
—Thomas J. Watson
There are also six other DB2 data types: one built-in (XML), one extended
(ANCHOR), and four user-defined (DISTINCT, STRUCTURED, REFER-
ENCE, and ARRAY).
Ø XML stores well-formed XML documents; such values share the
first 5 restrictions of the CLOB, DBCLOB, and BLOB data types
above.
Ø ANCHOR is a data type based on another SQL object such as a
column, global variable, SQL variable or parameter, or a table
or view row. A data type defined using an anchored type defini-
tion maintains a dependency on the object to which it is anchored.
Any change in the data type of the anchor object will impact the
96 ROW is a user-defined data type that can be used only within IBM's extended SQL Procedural
Language (SQL PL) applications; it is a structure composed of multiple fields, each with its own name
and data type, which can be used to store the column values of a row in a result set or other similarly
formatted data.
operators of its source type, because these functions and operators might
not be meaningful.
For example, a LENGTH function could be defined to support a param-
eter with the data type AUDIO that returns the length of the object in sec-
onds instead of bytes.
A weakly typed DISTINCT data type (WTDDT) is considered to be
the same as its source type for all operations, except when it applies con-
straints on values during assignments, casts, and function resolutions.
The NATURALS type in the following example is a WTDDT:
CREATE TYPE NATURALS AS INTEGER
WITH WEAK TYPE RULES CHECK (VALUE >= 0);
Weak typing means, in this case, that except for accepting only non-
negative integer values, NATURALS operates in the same way as its underlying
INTEGER data type.
WTDDTs can be used as alternative methods of referring to built-in
data types within application code: the ability to define constraints on the
values that are associated with them provides a method for checking val-
ues during assignments and casts.
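For instance, here is a hedged sketch of this checking-on-assignment
behavior, under the assumption that a weak DISTINCT type such as the
above NATURALS may also type a global variable:

CREATE VARIABLE MY_COUNT NATURALS DEFAULT 0;
SET MY_COUNT = 10; -- accepted
SET MY_COUNT = -1; -- rejected by CHECK (VALUE >= 0)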
There are two distinguished built-in system DISTINCT types that can
be used as table column types:
ü SYSPROC.DB2SECURITYLABEL must be used to define the row
security label column of a protected table; its underlying data type
is VARCHAR(128) FOR BIT DATA; a table can have at most
one column of type DB2SECURITYLABEL; for such a column,
NOT NULL WITH DEFAULT is implicit and cannot be explicitly
specified; the default value is the session authorization ID’s secu-
rity label for write access.
ü SYSPROC.DB2SQLSTATE is used to store DB2 errors, warnings,
and information codes; its underlying data type is INTEGER.
Using DISTINCT types provides benefits in the following areas:
ü Extensibility: increasing the set of data types available to support
your applications.
ü Flexibility: any semantics and behavior can be specified for these
data types by using user-defined functions to augment the diversity
of the data types available in the system.
the hierarchy (i.e., when the hierarchy is a tree, the subtree rooted in T).
A proper subtype of a structured type T is a structured type below T in the
type hierarchy.
Recursive type definitions in type hierarchies are subject to some
restrictions, for which it is necessary to develop a shorthand way of refer-
ring to the specific type of recursive definitions that are allowed; the fol-
lowing definitions are used:
ü Direct usage: a type A is said to directly use another type B, if and
only if one of the following statements is true:
1. Type A has an attribute of type B.
2. Type B is a subtype of A or a supertype of A.
ü Indirect usage: a type A is said to indirectly use a type B, if one
of the following statements is true:
1. Type A directly uses type B.
2. Type A directly uses some type C and type C indirectly uses
type B.
A type cannot be defined so that one of its attribute types directly or
indirectly uses itself (i.e., STRUCTURED type definitions should be acy-
clic). If it is, however, necessary to have such a configuration, consider
using a REFERENCE (see below) as the attribute. For example, with struc-
tured type attributes, there cannot be an instance of “employee” with an
attribute of “manager” when “manager” is of type “employee”. There can,
however, be an attribute of “manager” with a type of REF(employee).
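Here is a hedged sketch of this employee/manager case (names are
illustrative, and the optional REF representation clause is omitted):

CREATE TYPE employee AS (
  name VARCHAR(64),
  manager REF(employee) -- allowed: a REF may target its own type
) MODE DB2SQL;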
A type cannot be dropped if certain other objects use the type, either
directly or indirectly. For example, a type cannot be dropped if a table or
view column makes direct or indirect use of the type.
Ø REFERENCE (or REF) is a companion type to a structured type.
Similar to a DISTINCT type, a reference one is a scalar type that
shares a common representation with one of the built-in data types.
This same representation is shared for all types in the type hier-
archy. The reference type representation is defined when the root
type of a type hierarchy is created. When using a reference type, a
structured type is specified as a parameter of the type. This param-
eter is called the target type of the reference.
Ø ARRAY is a data type that is defined as an array with elements of
another data type. An array is a structure that contains an ordered
set of such elements.
Wisdom is the power to put our time and our knowledge to the proper use.
—Thomas J. Watson
Table 4.3 shows how to implement in DB2 your most frequently needed
data types.
To conclude this subsection, Table 4.4 presents the rest of the data
types provided by DB2.
The unique values closest to GUIDs (or UUIDs) are those obtainable by
calls to the GENERATE_UNIQUE() function; as they are numeric, not
string values, in order to obtain their GUID-like string equivalents also use
the HEX(), CHAR(), and TRIM() functions, as follows:
SELECT TRIM(CHAR(HEX(GENERATE_UNIQUE())))
FROM SYSIBM.SYSDUMMY1;
However, you do not need GUIDs in DB2: data replication is managed
by the dedicated powerful tool IBM InfoSphere Data Replication.
DB2 also provides an improved syntax of the INSERT … VALUES
SQL DML statement that allows for inserting as many rows as desired with
only one such statement.
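For example (a minimal sketch, on a hypothetical COUNTRIES table):

INSERT INTO COUNTRIES (Code, CountryName) VALUES
  ('RO', 'Romania'),
  ('FR', 'France'),
  ('DE', 'Germany');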
The functions in the SYSFUN schema taking a VARCHAR as an argu-
ment will not accept VARCHARs longer than 4,000 bytes. However, many
of these functions also have an alternative signature accepting a CLOB(1 M).
For these functions, you can explicitly cast VARCHAR strings longer than
4,000 bytes into CLOBs and then recast the result back into VARCHARs
of the required length.
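For instance, assuming that LTRIM is one of these dual-signature functions
and that NOTES.note_text is a hypothetical VARCHAR(8000) column, the
double cast would look like:

SELECT CAST(LTRIM(CAST(note_text AS CLOB(1M))) AS VARCHAR(8000))
FROM NOTES;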
Special restrictions apply to expressions that result in CLOB, DBCLOB,
or BLOB data types, as well as to structured type columns; such expres-
sions and columns are not allowed in:
There are dozens of other components too, including OLAP, BI, spatial
and graph data, and the new multitenant architecture, which allows users
to centrally manage any number of dbs. Oracle 12c can also be accessed
through ODBC, ADO, and OLEDB.
Besides several commercial versions, there is also a free-to-download
Oracle Express (XE) edition (for 11 GB dbs maximum), which can also be
redistributed with your products; it allows for 1 GB of RAM and 1 CPU core.
Oracle 12c partially adheres to the SQL 2011 ANSI standard (which
includes the 1999 object-relational extensions, recursive SELECT queries,
triggers, and support for procedural and control-of-flow statements in its
extended SQL PL/SQL language, as well as the 2003 and 2006 SQL/XML),
plus its own extensions (that can be detected with its FIPS Flagger, provided
you execute the SQL statement ALTER SESSION SET FLAGGER = ENTRY).
Oracle 12c runs on (z)Linux, MS Windows, HP-UX, AIX, Solaris, and
OpenVMS OSs.
Oracle object names may have at most 30 characters, are not case sensitive
(and Oracle stores them in uppercase), except for diagrams, and, among
other reserved characters, do not accept ‘-’ (which you have to replace with
‘_’). Apart from letters and digits, you may only use ‘$’, ‘#’, ‘@’, and ‘_’.
Only parameters may start with any of these characters; all other identifiers
must start with a letter. However, you may use any number of otherwise not
allowed ASCII characters in names, including in the first position, provided
you embed such names in double quotation marks.
Text strings have to be embedded in apostrophes (single quotation
marks). Calendar date values have to be entered as the result of the system
conversion function To_Date applied to their text string values and system
calendar date template (e.g., in the U.S., ‘MM/DD/YYYY’).
You can use apostrophes within text strings by doubling them; for
example, the text string '''quote'''’s value is 'quote'.
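For example (a sketch using Oracle's standard one-row DUAL table and
the U.S. calendar date template):

SELECT To_Date('07/04/2014', 'MM/DD/YYYY') FROM DUAL;
SELECT 'the customer''s order' FROM DUAL; -- doubled apostrophe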
Comments in Oracle SQL start immediately after “--” and end at the
end of the line; alternatively, you can embed any number of lines in a
single comment by enclosing them between “/*” (for start) and “*/” (for end).
They don’t call it the Internet anymore, they call it cloud computing.
I’m no longer resisting the name. Call it what you want.
—Larry Ellison
In Oracle, dbs are curiously named users.97 Oracle provides a SQL CRE-
ATE USER statement too but, similarly to DB2, mainly due to security
and physical disk space management, creating them is typically done only by
DBAs, who generally use for this task the GUI of the Oracle Enterprise
Console tool. However, in the free Express edition, you can do it yourself,
as you are also the DBA.
Computed tables are called views and can be created with the CREATE
VIEW … AS SELECT … SQL DDL statement; they cannot be created
graphically, though, as Oracle SQL Developer's GUI does not have a QBE-
type facility.
You can also define in-memory tables, very useful for OLAP process-
ing, but only for some additional $23,000 per CPU.
Temporary tables, either local or global, are created in user temporary
tablespaces with the CREATE [GLOBAL] TEMPORARY TABLE … SQL
DDL statement. The local ones exist during the current user session or
the lifetime of the procedure in which they are declared. The global ones
have persistent definitions, but their data is not persistent (as it is automatically
deleted by the system at the end of the db sessions) and they generate no
redo or rollback information. If, however, you would like their instances
to be persistent too, you can add the ON COMMIT PRESERVE ROWS clause
at the end of the above statement.
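For instance, a minimal sketch (names are illustrative):

CREATE GLOBAL TEMPORARY TABLE SESSION_CART (
  ProductID NUMBER(10),
  Qty NUMBER(5)
) ON COMMIT PRESERVE ROWS; -- instances survive COMMITs until session end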
You can associate indexes, triggers, statistics, and views to temporary
tables too.
97 A reminiscence of its beginnings, when Oracle did not anticipate that a user might need to manage
several dbs: this is another great counterexample of how not to name objects, especially conceptual ones.
Ø BINARY_DOUBLE = [–1.79769313486231 × 10^308;
1.79769313486231 × 10^308];
Both BINARY_FLOAT and BINARY_DOUBLE support the distin-
guished values infinity and NaN (Not a Number).
Ø FLOAT[(p)] = [–8.5 × 10^37; 8.5 × 10^37]; precision p ∈ [1; 126], default
126 (binary); scale is interpreted from the data; used internally,
when converting ANSI FLOAT values; not recommended to be
used explicitly, due to truncations; use instead the more robust
NUMBER, BINARY_FLOAT, and BINARY_DOUBLE data types.
There are 4 DATE/TIME data types (DATE, TIMESTAMP, TIMESTAMP
WITH TIME ZONE, and TIMESTAMP WITH LOCAL TIME ZONE), plus
2 additional interval (INTERVAL YEAR TO MONTH and INTERVAL
DAY TO SECOND) data types:
Ø DATE = [1/1/4712 BC; 12/31/9999];
Ø TIMESTAMP [(fractional_seconds_precision)]; extension of DATE
with fractional second precision between 0 and 9 digits (default: 6);
Ø TIMESTAMP [(fractional_seconds_precision)] WITH TIME ZONE;
variant of TIMESTAMP that includes in its values a time zone
region name or a time zone offset (i.e., the difference (in hours
and minutes) between local time and UTC — formerly Greenwich
Mean Time).
Ø TIMESTAMP [(fractional_seconds_precision)] WITH LOCAL
TIME ZONE; variant of TIMESTAMP WITH TIME ZONE: it differs
in that data stored in the db is normalized to the db time zone, and
the time zone information is not stored as part of the corresponding
columns data. When a user retrieves such data, Oracle returns it in
the user’s local session time zone. This data type is useful for date
information that is always to be displayed in the time zone of the
client system, in two-tier applications;
Ø INTERVAL YEAR [(year_precision)] TO MONTH = [–4712/1;
9999/12]; year_precision is the number of digits in the YEAR field
and has the default value 2;
Ø INTERVAL DAY [(day_precision)] TO SECOND [(fractional_
seconds_precision)] = [0 0:0:0; 9 23:59:59.999999999]; day_pre-
cision is the number of digits in the DAY field, and accepts values
between 0 and 9, the default being 2; fractional_seconds_precision
99 4,000 is the standard default maximum value, but you can set it to 32,767 (see above footnote).
2,000 is the standard default maximum value, but you can set it to 32,767 (see above two footnotes).
100 However, extended RAW values are stored as out-of-line LOBs only if their size is greater than
4,000 bytes; otherwise, they are stored inline.
representing the address of each row. These strings have the data type
ROWID.
Rowids contain the following information:
ü The data block of the data file containing the row; the length of this
string depends on your operating system.
ü The row in the data block.
ü The database file containing the row; the first data file has the
number 1; the length of this string depends on your OS.
ü The data object number, which is an identification number assigned
to every db segment. You can retrieve the data object number from
the data dictionary views USER_OBJECTS, DBA_OBJECTS, and
ALL_OBJECTS. Objects that share the same segment (clustered
tables in the same cluster, for example) have the same object number.
Rowids are stored as base 64 values that can contain the characters
A-Z, a-z, 0–9, and the plus sign (+) and forward slash (/). Rowids are
not available directly: you can use the supplied package DBMS_ROWID to
interpret rowid contents. The package functions extract and provide infor-
mation on the four rowid elements listed above.
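For example, here is a hedged sketch (EMPLOYEES is a hypothetical
heap-organized table) that extracts these four elements with the documented
DBMS_ROWID functions:

SELECT ROWID,
       DBMS_ROWID.ROWID_OBJECT(ROWID) AS data_object_no,
       DBMS_ROWID.ROWID_RELATIVE_FNO(ROWID) AS file_no,
       DBMS_ROWID.ROWID_BLOCK_NUMBER(ROWID) AS block_no,
       DBMS_ROWID.ROWID_ROW_NUMBER(ROWID) AS row_no
FROM EMPLOYEES WHERE ROWNUM = 1;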
The rows of some tables have addresses that are not physical or perma-
nent or were not generated by Oracle. For example, the row addresses of
index-organized tables are stored in index leaves, which can move. Row-
ids of foreign tables (such as DB2 tables accessed through a gateway) are
not standard Oracle rowids.
Oracle uses universal rowids (urowids) to store the addresses of index-
organized and foreign tables. Index-organized tables have logical urowids
and foreign tables have foreign urowids. Both types of urowid are stored
in the ROWID pseudocolumn (as are the physical rowids of heap-orga-
nized tables).
Oracle creates logical rowids based on the primary key of the table.
The logical rowids do not change as long as the primary key does not
change. The ROWID pseudocolumn of an index-organized table has a
data type of UROWID. You can access this pseudocolumn as you would
the ROWID pseudocolumn of a heap-organized table (i.e., by using a
SELECT…ROWID statement). If you want to store the rowids of an index-
organized table, then you can define a column of type UROWID for the
table and retrieve the value of the ROWID pseudocolumn into that column.
Most men and women, by birth or nature, lack the means to advance
in wealth or power, but all have the ability to advance in knowledge.
—Pythagoras
101 An object identifier (OID) uniquely identifies an Oracle object and enables you to reference it from
other objects or from db table columns.
102 When a REF value points to a nonexistent object, it is said to be “dangling”; a dangling REF is
different from a null REF; to determine whether a REF is dangling or not, use the condition IS [NOT]
DANGLING.
104 For example, in order to reference the Salary value of an employee having x = 205, stored in an
HR XML file under the tags EMPLOYEES/ROW, the corresponding value is: /HR/EMPLOYEES/
ROW[x=205]/Salary.
Above the cloud with its shadow is the star with its light.
—Pythagoras
Oracle computed columns are referred to as virtual and are defined using
the predicate AS. Their data type may be explicitly stated or implicitly
derived by Oracle from the corresponding expression. Here are the main
characteristics of and restrictions on virtual columns (see the sketch after
this list):
• Indexes defined against virtual columns are equivalent to function-
based indexes.
• Virtual columns can be referenced in the WHERE clause of updates
and deletes, but they cannot be manipulated by SQL DML state-
ments.
• Tables containing virtual columns can still be eligible for result
caching.
• Functions in expressions must be deterministic at the time of table
creation, but can subsequently be recompiled and made nondeter-
ministic without invalidating the corresponding virtual columns. In
such cases the following steps must be taken after the function is
recompiled:
ü Constraints on the virtual column must be disabled and reen-
abled.
ü Indexes on the virtual column must be rebuilt.
ü Materialized views that access the virtual column must be fully
refreshed.
ü The result cache must be flushed if cached queries have
accessed the virtual column.
ü Table statistics must be regathered.
• Virtual columns are not supported for index-organized, external,
object, cluster, or temporary tables.
• Expressions used in virtual column definitions have the following
restrictions:
ü cannot refer to another virtual column by name;
ü can only refer to columns defined in the same table;
ü if they refer to a deterministic user-defined function, they can-
not be used as a partitioning key column;
ü their output must be a scalar value: it cannot return an Ora-
cle supplied data type, a user-defined type, or LOB or LONG
RAW.
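As announced above, here is a minimal sketch of a virtual column (all
names and the expression are illustrative):

CREATE TABLE EMPLOYEES (
  x NUMBER(10) PRIMARY KEY,
  Salary NUMBER(10, 2) NOT NULL,
  Bonus NUMBER(10, 2) DEFAULT 0 NOT NULL,
  TotalPay AS (Salary + Bonus) -- virtual: computed, never stored
);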
Table 4.5 shows how to implement in Oracle your most frequently needed
data types.
To conclude this subsection, Table 4.6 presents the rest of the data
types provided by Oracle.
Beware that Oracle triggers are very slow. For example, as Oracle does
not accept using the SysDate system function (which returns the current
system calendar date and time) in constraints, although such constraints
might be enforced through triggers, we prefer to use instead additional
virtual columns and associated functions, as they are four times faster than
equivalent triggers.
For example, in order to detect whether or not the year values that users
would like to store are greater than the current year, we first declare
a Year_in_future function (based on SysDate, see Subsection 4.3.2); then,
this function is used to define virtual columns in all tables that need such
a domain or tuple constraint, for rejecting the storage of implausible data in
the associated actual columns (see examples in Subsection 4.3.2).
Note that, fortunately, this type of solution is possible only due to the
existence of a bug in Oracle: virtual columns can be based only on deter-
ministic functions, whereas, in fact, Year_in_future is not deterministic,
because it is based on SysDate; the helping bug is that Oracle nevertheless
accepts the false DETERMINISTIC keyword in its definition. Obviously,
the same is true for the other two such needed functions, Day_in_future
and Day_in_far_future.
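The following is only a hedged sketch of this trick (the function body and
all names are illustrative; see Subsection 4.3.2 for the actual examples):

CREATE OR REPLACE FUNCTION Year_in_future(y NUMBER)
  RETURN NUMBER DETERMINISTIC -- false, but accepted by Oracle
IS BEGIN
  RETURN CASE WHEN y > EXTRACT(YEAR FROM SysDate) THEN 1 ELSE 0 END;
END;
/
CREATE TABLE PEOPLE (
  BirthYear NUMBER(4) NOT NULL,
  byFuture AS (Year_in_future(BirthYear)) CHECK (byFuture = 0)
    -- rejects storing birth years greater than the current one
);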
LOB columns cannot be used either in keys (be they primary or not),
or in SELECT DISTINCT, joins, GROUP BY, or ORDER BY clauses,
or in the UPDATE OF clauses of AFTER UPDATE triggers. You cannot
define VARRAYs or ANYDATAs using LOB columns either.
However, you can specify a LOB attribute of an object type column in
a SELECT…DISTINCT statement, or in a query that uses the UNION or
MINUS set operators, if the object type of the column has a MAP or ORDER
function defined on it.
LOB columns have full transactional support; however, you cannot
save a LOB locator in a PL/SQL or OCI variable in one transaction and
then use it in another transaction or session.
Binary file LOBs (pointed by the BFILE data type) do not participate
in transactions and are not recoverable: the underlying OS should provide
their integrity and durability. DBAs must ensure that the corresponding
external file exists and that Oracle processes have OS read permissions
on them: BFILE enables only read-only support of such large binary files
(you cannot modify or replicate such a file: Oracle provides only APIs to
access file data). The primary interfaces that can be used to access file data
are the DBMS_LOB package and the OCI.
The only real security that a man can have in this world is
a reserve of knowledge, experience and ability.
—Henry Ford
Thinking is the hardest work there is, which is probably the reason
why so few engage in it.
—Henry Ford
All is Number.
—Pythagoras
MySQL provides the following 9 numeric data types (if not otherwise
specified, n = 1 by default):
Ø TINYINT[(n)] [UNSIGNED] = [–128; 127] (signed) or [0; 255]
(unsigned);
Ø SMALLINT[(n)] [UNSIGNED] = [–32,768; 32,767] (signed) or
[0; 65,535] (unsigned);
Ø MEDIUMINT[(n)] [UNSIGNED] = [–8,388,608; 8,388,607]
(signed) or [0; 16,777,215] (unsigned);
Ø INT[(n)] [UNSIGNED] = [–2,147,483,648; 2,147,483,647]
(signed) or [0; 4,294,967,295] (unsigned); synonym: INTEGER;
Ø BIGINT[(n)] [UNSIGNED] = [–9,223,372,036,854,775,808;
9,223,372,036,854,775,807] (signed) or [0; 18,446,744,073,709,551,615]
(unsigned); synonym: INT8;
Ø DECIMAL[(n[, m])] [UNSIGNED] = [–10^65; 10^65 – 1] (signed) or
[0; 10^65 – 1] (unsigned), 0 ≤ n ≤ 65, 0 ≤ m ≤ 30; by default, n = 10 and
m = 0; precision for arithmetic calculations is 65 digits; synonyms:
NUMERIC, FIXED;
Ø FLOAT[(n[, m])] [UNSIGNED] = [–3.402823466 × 10^38;
3.402823466 × 10^38] (signed) or [0; 3.402823466 × 10^38]
(unsigned), 0 ≤ n, m ≤ 38 (3.402823466 × 10^38 is only theoretical:
actually, depending on the host hardware and software platform, it
can be less than that); no default: if n and/or m are not specified,
their actual values depend on the host hardware platform; precision
for arithmetic calculations is some 7 scale digits; unfortunately,
you might sometimes get unexpected results, because calculations
are never actually done with FLOAT;
Ø DOUBLE[(n[, m])] [UNSIGNED] = [–1.7976931348623157
× 10^308; 1.7976931348623157 × 10^308] (signed) or [0;
1.7976931348623157 × 10^308] (unsigned), 0 ≤ n, m ≤ 308
(1.7976931348623157 × 10^308 is only theoretical: actually, depend-
ing on the host hardware and software platform, it can be less than
that); no default: if n and/or m are not specified, their actual values
depend on the host hardware platform; precision for arithmetic cal-
culations is some 15 scale digits; synonyms: DOUBLE PRECI-
SION, REAL;
There are 14 types of strings in MySQL (using, for the CHAR, VARCHAR,
and TEXT ones, either ASCII or UNICODE encodings, at your choice, depend-
ing on their CHARSET attribute and/or the execution of SET NAMES and/or
SET CHARACTER SET statements, the default being ASCII; except
for CHAR and VARCHAR, for which n is in characters, n is in bytes):
Ø CHAR(n) = UNICODE(n), 0 ≤ n ≤ 255, fixed length (right padding);
Ø VARCHAR(n) = UNICODE(n), 0 ≤ n ≤ 65,535;
Ø TINYTEXT = UNICODE(n), 0 ≤ n ≤ 255;
Ø TEXT = UNICODE(n), 0 ≤ n ≤ 65,535;
Ø MEDIUMTEXT = UNICODE(n), 0 ≤ n ≤ 16,777,215;
Ø LONGTEXT = UNICODE(n), 0 ≤ n ≤ 4,294,967,295;
Ø BINARY(n): equivalent to CHAR(n), except that there is no char-
acter set, n is in bytes (not characters) and padding is done with
binary 0 (not space);
Ø VARBINARY(n): equivalent to VARCHAR(n), except that there is
no character set and n is in bytes (not characters);
Ø TINYBLOB: binary equivalent of TINYTEXT;
Ø BLOB: binary equivalent of TEXT;
Ø MEDIUMBLOB: binary equivalent of MEDIUMTEXT;
Ø LONGBLOB: binary equivalent of LONGTEXT;
Ø ENUM: static sets of (maximum under 3,000) string literals (that
are stored in numeric coding, but shown as defined); only one of
them is stored per table cell (example: ENUM('red', 'orange',
'yellow', 'green', 'blue', 'indigo', 'violet') is stored
as TINYINT values 1, 2, 3, 4, 5, 6, and 7, respectively);
Ø SET: static sets of (maximum 64) string literals (that are stored in
numeric coding, but shown as defined); none or any combination
of them may be stored per table cell.
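For example (a sketch with illustrative names):

CREATE TABLE SHIRTS (
  Size ENUM('S', 'M', 'L', 'XL') NOT NULL,
  Colors SET('red', 'green', 'blue') -- stores any subset, e.g. 'red,blue'
);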
BLOB and TEXT data types differ from VARBINARY and VARCHAR,
respectively, in the following respects:
Failure is simply the opportunity to begin again, this time more intelligently.
—Henry Ford
Table 4.7 shows how to implement in MySQL your most frequently needed
data types. To conclude this subsection, Table 4.8 presents the rest of the
data types provided by MySQL.
There are no big problems, there are just a lot of little problems.
—Henry Ford
ü Columns that are part of a PRIMARY KEY are made NOT NULL,
even if not declared that way.
ü Trailing spaces are automatically deleted from ENUM and SET
member values when the table is created.
ü Certain data types used by other SQL db vendors are mapped to
corresponding MySQL types.
ü When you include a USING clause to specify an index type that
is not permitted for a given storage engine, but there is another
index type available that the engine can use without affecting query
results, the engine uses the available type.
ü If strict SQL mode is not enabled, a VARCHAR column with a
length specification greater than 65,535 is converted to TEXT, and
a VARBINARY column with a length specification greater than
65,535 is converted to BLOB. Otherwise, an error occurs in either
of these cases.
ü Specifying the CHARACTER SET binary attribute for a character
data type causes the column to be created as the corresponding
binary data type: CHAR becomes BINARY, VARCHAR becomes
VARBINARY, and TEXT becomes BLOB.
I really had a lot of dreams when I was a kid, and I think a great deal
of that grew out of the fact that
I had a chance to read a lot.
—Bill Gates
SQL Server object names may have at most 128 characters (116 for tem-
porary tables), are not case sensitive, but you can freely use both upper
and lower case (and SQL Server stores them exactly as you defined
them). They should start with a letter or one of the special characters
‘@’, ‘#’, and ‘_’.
Those starting with ‘@’ are considered as local variables; moreover,
‘@’ cannot be used in any other name position.
Those starting with ‘#’ are considered as temporary objects, only exist-
ing during the current work session; all those starting with “##” are global,
that is, accessible to all concurrent users, while the others are local, acces-
sible only to the current user.
Money constants are prefixed (after the sign, if any) with ‘$’ or one of
over 30 other available currency symbols (including ‘€’, ‘£’, ‘¥’, etc.).
The only other reserved character for names is the space. However,
you may use any number of spaces in names provided you embed such
names in square brackets or double quotation marks.
Computed tables are called views and can be created with the CREATE
VIEW … AS SELECT … SQL DDL statement; they can also be created
and stored through the SQL Server Management Studio’s QBE-type Query
Design GUI too.
You can define temporary tables as variables, but they only exist dur-
ing the execution of the corresponding script (batch, transaction) and
you cannot use the SELECT … INTO SQL statement for populating
them.
You can also define in-memory tables, very useful for OLTP processing.
Temporary tables, either local (names prefixed by ‘#’) or global (names
prefixed by “##”) exist during the current user session or the lifetime of the
procedure in which they are declared.
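For example (a minimal sketch; names are illustrative):

-- local: visible only in the current session
CREATE TABLE #Totals (CustomerID INT, Total DECIMAL(9, 2));
-- global: visible to all concurrent sessions
CREATE TABLE ##SharedTotals (CustomerID INT, Total DECIMAL(9, 2));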
using these two data types; moreover, as MONEY is one byte less than
a large DECIMAL, with up to 19 digits of precision and as most real-
world monetary calculations (up to $9.99 M) can fit in a DECIMAL(9,2),
which requires just five bytes, you can save size, worry less about round-
ing errors, and make your code more portable by using DECIMAL instead
of MONEY.
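For example, a hedged sketch (names are illustrative):

CREATE TABLE INVOICES (
  x INT IDENTITY (1,1) PRIMARY KEY,
  -- exact, portable, five bytes, fits amounts up to 9,999,999.99:
  Amount DECIMAL(9, 2) NOT NULL
);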
There are several DATE/TIME data types (except for DATETIMEOFF-
SET, all others date/time types ignore time zones):
Ø DATE = [01/01/1; 31/12/9999], with 1 day accuracy, 10 digits pre-
cision, no scale;
Ø DATETIME2 = [01/01/1; 31/12/9999], with 100 nanoseconds
accuracy, precision and (user defined) scale of maximum 7 digits;
Ø DATETIME = [01/01/1753; 31/12/9999], with accuracy rounded
to increments of .000, .003, or .007 seconds and fixed scale of 3
digits;
Ø SMALLDATETIME = [01/01/1900; 06/06/2079], with 1 min accu-
racy and seconds always :00; 23:59:59 is rounded to 00:00:00 of
the next day;
Ø TIME = [00:00:00; 23:59:59.9999999], with 100 nanoseconds
accuracy, precision and (user defined) scale of maximum 16 and 7
digits, respectively;
Ø DATETIMEOFFSET = [01/01/1; 31/12/9999], with 100 nanosec-
onds accuracy, precision and (user defined) scale of maximum 7
digits, and from −14:00 to +14:59 user defined offset (depending
on desired time zone);
Ø ROWVERSION (TIMESTAMP) is an automatically generated rela-
tive date and time stamp used for row version-stamping (avoid
using the TIMESTAMP synonym, which will soon no longer be supported).
A nonnullable ROWVERSION column is semantically equivalent to a
BINARY(8) one (see Section 4.2.4.3.2). A nullable ROWVERSION col-
umn is semantically equivalent to a VARBINARY(8) one (see Section
4.2.4.3.2).
A table can have only one ROWVERSION column. Every time a
row of such a table is modified or inserted, the incremented db rowver-
sion value is inserted into its ROWVERSION column. To get the current
db rowversion value, use the @@DBTS function.
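For example (a sketch; names are illustrative):

CREATE TABLE ORDERS (
  x INT IDENTITY (1,1) PRIMARY KEY,
  Status VARCHAR(16) NOT NULL,
  rv ROWVERSION -- automatically stamped on every insert/update of a row
);
SELECT @@DBTS; -- the current db rowversion value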
Just like for tree-like file management systems, a slash starts the rep-
resentation and a path that only visits the root is represented by a single
slash; for levels underneath the root, each label is encoded as a sequence of
integers separated by dots; comparison between children is performed by
comparing the integer sequences separated by dots, in dictionary order; each
level is followed by a slash (i.e., slash separates parents from their children).
For example, the following are valid HIERARCHYID paths of lengths
1, 2, 2, 3, and 3 levels respectively:
• /
• /1/
• /0.3/7/
• /1/3/
• /0.1/0.2/
Nodes can be inserted in any location. For example, nodes inserted
after /1/2/ but before /1/3/ can be represented as /1/2.5/. Nodes inserted
before 0 have logical representations as negative numbers. For example,
a node that comes before /1/1/ can be represented as /1/-1/. Nodes cannot
have leading zeros: for example, /1/1.1/ is valid, but /1/1.01/ is not. To
prevent errors, insert nodes by using the GetDescendant() method.
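For example, a hedged sketch of generating nodes with GetDescendant():

DECLARE @root HIERARCHYID = HIERARCHYID::GetRoot();          -- /
DECLARE @first HIERARCHYID = @root.GetDescendant(NULL, NULL); -- /1/
SELECT @root.GetDescendant(@first, NULL).ToString();          -- /2/, after /1/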
Columns of type HIERARCHYID can be used on any replicated table.
The requirements for your application depend on whether replication is
one-directional or bidirectional, and on the versions of SQL Server that
are used.
XML has the following syntax:
XML([content | document] xml_schema_collection),
where:
ü content restricts the XML instance to be a well-formed XML
fragment; the XML data can contain zero or more elements at
the top level; text nodes are also allowed at the top level (and this
is the default behavior);
ü document restricts the XML instance to be a well-formed XML
document; XML data must have one and only one root element;
text nodes are not allowed at the top level;
ü xml_schema_collection is the name of an XML schema col-
lection; to create a typed XML column or variable, you can option-
ally specify the XML schema collection name.
The stored representation of XML data type instances size cannot exceed
2 GB. The content and document facets apply only to typed XML.
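For example, a hedged sketch (HRSchemas is a hypothetical XML schema
collection):

CREATE TABLE RESUMES (
  x INT IDENTITY (1,1) PRIMARY KEY,
  Resume XML(DOCUMENT HRSchemas) -- typed: exactly one root element
);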
GEOGRAPHY represents data in a round-earth coordinate system and
stores it ellipsoidally (round-earth), using GPS latitude and longitude
coordinates.
GEOMETRY is a planar spatial data type, representing data in a Euclid-
ean (flat) coordinate system.
SQL_VARIANT is a data type that may store values of any SQL
Server-supported data type, except for the following 14 ones: DATE-
TIMEOFFSET, GEOGRAPHY, GEOMETRY, HIERARCHYID, IMAGE,
NTEXT, NVARCHAR(MAX), ROWVERSION, SQL_VARIANT, TEXT,
USER_DEFINED, VARBINARY(MAX), VARCHAR(MAX), and XML.
SQL_VARIANT can have a maximum length of 8,016 bytes. This
includes both the base type information and the base type value. The maxi-
mum length of the actual base type value is 8,000 bytes.
A SQL_VARIANT data type value must first be cast to its base data
type one, before participating in operations such as addition and sub-
traction. SQL_VARIANT cannot have another SQL_VARIANT as its
base type.
SQL_VARIANT can be assigned a default value. This data type can
also have NULL as its underlying value, but the NULL values will not
have an associated base type.
A unique, primary, or foreign key may include columns of type SQL_
VARIANT, but the total length of the data values that make up the key of
a specific row should not be more than the maximum length of an index,
which is 900 bytes.
A table can have any number of SQL_VARIANT columns.109
Finally, SQL Server's users can define their own data types. Such types are created with the sp_addtype system stored procedure and then used in table definitions. They are based on the system data types and are recommended when several tables must store the same type of data in a column and you must ensure that these columns have exactly the same data type, length, and nullability.
When a user-defined data type is created, you must supply the follow-
ing three parameters: new data type name (unique across the db), system
data type upon which the new data type is based, and nullability (i.e.,
109
However, SQL_VARIANT cannot be used in either CONTAINSTABLE or FREETEXTTABLE.
whether the data type allows null values or not). When nullability is not
explicitly defined, it will be assigned based on the ANSI null default set-
ting for the db or connection.
User-defined data types created in the model db exist in all new user-
defined dbs; if a data type is created in a user-defined db, then it exists only
in that db.
For example, a user-defined data type called Address could be created based on the varchar data type in the model db (the "supreme" model) as follows:
USE model
EXEC sp_addtype Address, 'VARCHAR(128)', 'NOT NULL'
User-defined data types are not supported in table variables.
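Once created, such a type can be used in table definitions just like any system data type; for example (a minimal sketch, with an illustrative SUBSCRIBERS table):

CREATE TABLE SUBSCRIBERS (
   x INT IDENTITY(1,1) PRIMARY KEY,
   LName VARCHAR(64) NOT NULL,
   HomeAddress Address);   -- i.e., VARCHAR(128) NOT NULL, as defined above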
You can also add computed columns to tables: columns that are not physi-
cally stored in the table, unless the column is marked PERSISTED. A
computed column expression can use data from other columns of its table
to calculate a value for that column.
A computed column cannot be used as a DEFAULT or FOREIGN KEY
constraint definition or with a NOT NULL constraint definition. However,
if the computed column value is defined by a deterministic expression and
the data type of the result is allowed in index columns, a computed column
can be used as a key column in an index or as part of any PRIMARY KEY
or UNIQUE constraint.
For example, if the table has integer columns A and B, the computed
column C = A + B may be indexed, but computed column D = A
+ DATEPART(dd, GETDATE()) cannot be indexed, because the value
might change in subsequent invocations.
Computed columns can be created either through the GUI of the SQL Server Management Studio or with the AS clause of SQL. For example,
the following SQL DDL statement creates a table PRODUCTS having the primary key autonumber column x, a column Product for storing product names, a QtyAvailable column for storing the corresponding quantities in stock, a UnitPrice one, and a virtual computed column InventoryValue that calculates, for each row, the product of QtyAvailable and UnitPrice:
CREATE TABLE PRODUCTS (
x INT IDENTITY (1,1) PRIMARY KEY,
Product VARCHAR(64) NOT NULL,
QtyAvailable SMALLINT NOT NULL
CHECK (QtyAvailable BETWEEN 0 AND 100),
UnitPrice SMALLMONEY NOT NULL
CHECK (UnitPrice BETWEEN 0 AND 10000),
InventoryValue AS QtyAvailable * UnitPrice);
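As its defining expression is deterministic and precise, InventoryValue may then also be indexed; for example (the index name is, of course, arbitrary):

CREATE INDEX idxInventoryValue ON PRODUCTS (InventoryValue);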
Table 4.9 shows how to implement in SQL Server your most frequently needed data types. To conclude this subsection, Table 4.10 presents the rest of the data types provided by SQL Server.
SQL Server stores nulls in indexes and considers that there is only one null value; consequently, it does not accept nullable columns in keys.
not be more than one coauthor in any of its positions), a solution similar to the above one would imply relaxing the domain constraint PosInList ∈ [2; 16], which is unacceptable (as you would then be unable to distinguish between plausible and implausible PosInList values).
Fortunately, in such cases no workaround is needed: it does not actually matter whether the values entered in PosInList are correct or not, as long as they are plausible. For example, even if, to begin with, a user does not know exactly whether Prof. Dan Suciu is the second or the third coauthor of the book Data on the Web: From Relations to Semistructured Data and XML, he/she might enter him as the second one and then Prof. Peter Buneman as the third one, which can be corrected later by interchanging their list positions.
Generally, Access 2013 adheres to the pure ANSI 1992 SQL standard, with some extensions (both standard, such as the identity autonumber columns, and nonstandard, such as the very powerful Lookup feature for foreign keys or the TRANSFORM SQL statement).
Access dbs are limited to 2 GB (but any number of them can be linked together, so that from one db you can access a huge number of other dbs, and not only Access ones) and are also accessible through ODBC, ADO, and OLEDB. Besides its rdb engine, Access comes equipped with the high-level PL VBA, which embeds SQL, ADO, and DAO, and with which you can develop db applications. By also using the MS SharePoint 2013 server or the Office 365 website as hosts, you can develop web-based Access 2013 applications too.
There are no free Access versions, but there is a freely download-
able Access 2013 Runtime rdb engine that you can distribute together
with your Access 2013 desktop applications to users that do not own full
Access versions.
Access' main limitation is that it only runs on the MS Windows and Mac OS X OSs.
I believe that if you show people the problems and you show them the
solutions they will be moved to act.
—Bill Gates
Access object names may have at most 255 characters and are not case sensitive, but you can freely use both upper and lower case (and Access stores them exactly as you defined them).
Among other reserved characters, they do not accept '?' and '#'. However, you may use any reserved characters in names (as well as reserved words), provided you embed them in square brackets.
Text strings have to be embedded in double quotation marks, whereas date and time constants go within sharp (numeric: '#') signs. You can use double quotation marks within text strings by doubling them; for example, the value of the text string """quote""" is "quote".
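For example (a sketch against the LibraryDB db from Subsection 4.3.5; the Notes column is an illustrative assumption):

SELECT * FROM BORROWS
WHERE BorrowDate >= #1/1/2014# AND Notes = "a ""quoted"" word";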
As Access manages only one rdb at a time110, its SQL does not include a CREATE DATABASE statement. The simplest way to create an Access rdb is to use its GUI for opening a new blank db, selecting the desired path, giving it the desired name, and saving it.
Computed tables (views) are called queries and cannot be materialized (only their definitions are stored): they are evaluated each time they are invoked. They can be created in SQL only through VBA and ADOX, with the CREATE VIEW … AS SELECT … statement; they can also be created and stored through Access' QBE-type GUI Design View.
You can create temporary tables too (which exist only during the work session in which they were created; at the end of these sessions, they are automatically deleted by the system) with CREATE TEMPORARY TABLE …
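For example (a minimal sketch with illustrative names):

CREATE TEMPORARY TABLE TMP_TOTALS (
   x COUNTER PRIMARY KEY,
   Label TEXT(64),
   Total CURRENCY);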
The CURRENCY (MONEY) data type stores fixed-point numbers with up to 15 digits before and 4 after the decimal point. By default, USD is the currency used, but it can be changed to any other world one (either through Access option settings or/and through the host OS' ones).
The DATETIME (DATE, TIME, TIMESTAMP) data type is a subset of DOUBLE and can store any second from January 1st, 100 up to December 31st, 9999. You can obtain the current system date and time from the function Now(), only the date from the function Date() (and only the time from Time()), and the year of a date with the Year() function.
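For example (a sketch against the LibraryDB BORROWS table from Subsection 4.3.5):

SELECT x, BorrowDate, Year(BorrowDate) AS BorrowYear, Now() AS RightNow
FROM BORROWS;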
There are two types of strings (both of them stored using their corresponding actual (variable) lengths and UTF-8, for both UNICODE and ASCII):
Ø TEXT (VARCHAR, CHAR, STRING, ALPHANUMERIC) = UNI-
CODE(255);
Ø LONGTEXT (LONGCHAR, MEMO, NOTE) = UNICODE(4096).
You can search within LONGTEXT values, but you cannot group or
order by them.
The Internet is becoming the town square for the global village of tomorrow.
—Bill Gates
There are also two other data types stored in binary format, namely:
Ø BINARY (VARBINARY); Length(BINARY) ∈ [0; 510 B];
Ø LONGBINARY (OLEOBJECT, GENERAL); Length(LONGBINARY) ∈ [0; 1 GB].
LONGBINARY is used only for storing OLE objects (embedded pictures, audio, video, or other Binary Large OBjects (BLOBs) of at most 1 GB each).
You can also add computed columns to tables, but only through Access' GUI. Moreover, you can also add a Total row to any table and specify, for each of its columns, the aggregation function with which Access should compute the corresponding total (the only function available for text-type columns being COUNT).
Table 4.11 shows how to implement in Access your most frequently needed
data types. To conclude this subsection, Table 4.12 presents the rest of the
data types provided by Access.
The ANSI SQL standard wildcards '_' (any character) and '%' (any character string, including the empty one) are different in Access: '?' and '*', respectively.
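For example (a sketch against the LibraryDB PERSONS table), the Access SQL query

SELECT * FROM PERSONS WHERE LName LIKE "Man*";

would be written in ANSI SQL with the condition LName LIKE 'Man%'.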
The subsections of this section present how to create and populate rdbs corresponding to the public library relations from Fig. 3.13 (Subsection 3.4.2) in IBM DB2 10, Oracle Database 12c, MySQL 5.7, MS SQL Server 2014, and Access 2013, by applying algorithm A8 and taking into consideration the peculiarities of these RDBMS versions described in Section 4.2.
Due to differences between these RDBMS versions, table and column names slightly differ: for example, because Oracle does not accept '-' in object names, table CO-AUTHORS has been renamed CO_AUTHORS, and because Oracle stores all such names in uppercase only, column DueReturnDate has been renamed DUE_RETURN_DATE.
Year(CURRENT DATE)),
PRIMARY KEY (x),
CONSTRAINT fkPub FOREIGN KEY (Publisher)
REFERENCES PUBLISHERS,
CONSTRAINT fkBk FOREIGN KEY (First_Book)
REFERENCES BOOKS,
CONSTRAINT EKey UNIQUE (First_Book, Publisher,
EYear));
Note that, as no particular tablespace is requested, the default user one (generally called USERS) is used by Oracle to create the tables and the needed indexes in. As 12c is, for the time being, not yet widely deployed, the following code does not use its newest autonumber feature, but the "classic" trigger method needed for all of its previous versions.
CREATE FUNCTION Day_in_future (day varchar2)
RETURN VARCHAR2
DETERMINISTIC IS
begin
return case when to_date(day, 'MM/DD/YYYY') >
sysdate then 'N' else 'Y' end;
end Day_in_future;
CREATE FUNCTION Day_in_far_future (day varchar2)
RETURN VARCHAR2
DETERMINISTIC IS
begin
return case when to_date(day, 'MM/DD/YYYY') >
sysdate + 300
then 'N' else 'Y'
end;
end Day_in_far_future;
begin
if (:new.X is null) then
select CO_AUTHORS_SEQ.nextval into :new.X
from dual;
end if;
end;
DELIMITER |
CREATE FUNCTION checkBYear(y INTEGER, message
VARCHAR(256))
RETURNS INTEGER DETERMINISTIC
BEGIN
IF y IS NOT NULL AND (y < -5000 OR
y > EXTRACT(Year FROM CurDate())) THEN
SIGNAL SQLSTATE 'ERROR'
SET MESSAGE_TEXT = message;
END IF;
RETURN y;
END;
|
DELIMITER $$
CREATE TRIGGER CHCKI_BYear BEFORE INSERT ON BOOKS
FOR EACH ROW
BEGIN
SET @dummy := checkBYear(NEW.Byear,
CONCAT(NEW.Byear, ' is not a year between -5000
and ', EXTRACT(Year FROM CurDate()),'!'));
END;
$$
DELIMITER $$
CREATE TRIGGER CHCKU_BYear BEFORE UPDATE ON BOOKS
FOR EACH ROW
BEGIN
SET @dummy := checkBYear(NEW.Byear,
CONCAT(NEW.Byear, ' is not a year between -5000
and ', EXTRACT(Year FROM CurDate()),'!'));
END;
$$
DELIMITER |
CREATE FUNCTION checkPosInList(p INTEGER, message
VARCHAR(256))
RETURNS INTEGER DETERMINISTIC
BEGIN
IF p < 2 OR p > 16 THEN
SIGNAL SQLSTATE 'ERROR'
SET MESSAGE_TEXT = message;
END IF;
RETURN p;
END;
|
DELIMITER $$
CREATE TRIGGER CHCKI_PosInList BEFORE INSERT ON
CO_AUTHORS
FOR EACH ROW
BEGIN
SET @dummy := checkPosInList(NEW.PosInList,
CONCAT(NEW.PosInList, ' is not a position
between 2 and 16!'));
END;
$$
DELIMITER $$
CREATE TRIGGER CHCKU_PosInList BEFORE UPDATE ON
CO_AUTHORS
FOR EACH ROW
BEGIN
SET @dummy := checkPosInList(NEW.PosInList,
CONCAT(NEW.PosInList, ' is not a position
between 2 and 16!'));
END;
$$
DELIMITER $$
CREATE TRIGGER CHCKU_EYear BEFORE UPDATE ON
EDITIONS FOR EACH ROW
BEGIN
SET @dummy := checkBYear(NEW.Eyear,
CONCAT(NEW.Eyear, ' is not a year between -2500 and ',
EXTRACT(Year FROM CurDate()),'!'));
END;
$$
VTitle VARCHAR(255),
ISBN VARCHAR(16) UNIQUE,
Price DECIMAL(8,2) NOT NULL,
CONSTRAINT fkE FOREIGN KEY (Edition)
REFERENCES EDITIONS(x),
UNIQUE INDEX (Edition, VNo),
UNIQUE INDEX (Edition, VTitle));
DELIMITER |
CREATE FUNCTION checkVNo_Price(n INTEGER, p
DECIMAL, messageN VARCHAR(256),
messageP VARCHAR(256))
RETURNS INTEGER DETERMINISTIC
BEGIN
IF n < 1 THEN
SET @r := 1;
SIGNAL SQLSTATE 'ERROR'
SET MESSAGE_TEXT = messageN;
ELSEIF p < 0 OR p > 100000 THEN
SET @r := 2;
SIGNAL SQLSTATE 'ERROR'
SET MESSAGE_TEXT = messageP;
ELSE
SET @r := 0;
END IF;
RETURN @r;
END;
|
DELIMITER $$
CREATE TRIGGER CHCKI_VNo_Price BEFORE INSERT ON
VOLUMES FOR EACH ROW
BEGIN
SET @dummy := checkVNo_Price(NEW.VNo,
NEW.Price, CONCAT(NEW.VNo, ' is not a volume number
greater than 0!'),
CONCAT(NEW.Price, ' is not a price between
0 and 100,000!'));
END;
$$
DELIMITER $$
CREATE TRIGGER CHCKU_VNo_Price BEFORE UPDATE ON
VOLUMES FOR EACH ROW
BEGIN
SET @dummy := checkVNo_Price(NEW.VNo,
NEW.Price, CONCAT(NEW.VNo, ' is not a volume number
greater than 0!'),
CONCAT(NEW.Price, ' is not a price between
0 and 100,000!'));
END;
$$
DELIMITER $$
CREATE TRIGGER CHCKI_BookPos BEFORE INSERT ON
VOLUMES_CONTENTS
FOR EACH ROW
BEGIN
SET @dummy := checkBookPos(NEW.BookPos,
CONCAT(NEW.BookPos, ' is not a position
between 1 and 16!'));
END;
$$
DELIMITER $$
CREATE TRIGGER CHCKU_BookPos BEFORE UPDATE ON
VOLUMES_CONTENTS
FOR EACH ROW
BEGIN
SET @dummy := checkBookPos(NEW.BookPos,
CONCAT(NEW.BookPos, ' is not a position
between 1 and 16!'));
END;
$$
BEGIN
SET @dummy := checkBorrowDate(NEW.BorrowDate,
CONCAT(NEW.BorrowDate, ' is not a date
between 2000-01-06 and ', CurDate(),'!'));
END;
$$
SET @r := 0;
END IF;
RETURN @r;
END;
|
DELIMITER $$
CREATE TRIGGER CHCKI_Due_ActualReturnDates BEFORE
INSERT ON BORROWS_LISTS FOR EACH ROW
BEGIN
SET @dummy := checkDue_ActualReturnDates(
NEW.DueReturnDate,
NEW.ActualReturnDate,
CONCAT(NEW.DueReturnDate,
' is not a date between 2000-01-06 and ',
DATE_ADD(CurDate(),
INTERVAL 300 DAY), '!'),
CONCAT(NEW.ActualReturnDate,
' is not a date between 2000-01-06 and ',
CurDate(),'!'));
END;
$$
DELIMITER $$
FName VARCHAR(128),
LName VARCHAR(64) NOT NULL,
[e-mail] VARCHAR(255),
[Author?] BIT,
CONSTRAINT PKey UNIQUE ([e-mail], LName,
FName));
Especially as Access does not translate into SQL for its users the schemas of the tables it manages, in order to create tables through SQL you need to save each CREATE TABLE statement in a separate query and then run each of them.
For example, in order to create table PERSONS, you should click on the Query Design icon of the Create ribbon menu, then click on the Close button of the Show Table pop-up window (without selecting any table or view), then click on the SQL View icon of the File ribbon menu, type the corresponding DDL SQL statement in the newly opened Query1 tab (see Fig. 4.2), then close and save it (changing its default Query1 name to, for example, createBooks), and, finally, execute this DDL query by double-clicking on its name.
Here is the DDL SQL needed for creating all the tables of the
LibraryDB db:
The easiest way to add the check (tuple) and domain constraints is
through the Access GUI, by using the Validation Rule and Validation
Text properties of the corresponding table and columns, respectively (in
the corresponding tables’ Design View). Figures 4.3–4.11 show how to
enforce all the LibraryDB domain constraints.
Here is the DML SQL needed for populating the tables (these statements should be stored one by one in distinct queries, just like the DDL ones (see Fig. 4.2), as Access only runs single-statement SQL scripts):
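For example, a row could be added to the PERSONS table above with the following statement, saved in its own query (the actual values are, of course, illustrative):

INSERT INTO PERSONS (FName, LName, [e-mail], [Author?])
VALUES ("Dan", "Suciu", "dan@example.com", TRUE);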
Very few RDBMSs do not provide facilities for generating SQL DDL scripts to create the tables of the dbs they are managing; the only one among the five considered in this book is MS Access.
This is why we chose Access 2013 as an example for designing algorithms of the reverse engineering family REAF0′. Obviously, as, for example, all of the other four RDBMS versions considered in this book have DDL GUI interfaces too, the algorithm presented in this subsection (let's call it REA2013A0, although it also works for many previous Access versions) may serve as an example for designing your own similar algorithms for them too, although it is trivially simpler to first use their facilities for generating DDL scripts in their own SQL idioms and then apply much simpler algorithms to translate these scripts into ANSI standard ones.
Not only because it is a reverse engineering algorithm, but especially in order to first gain experience, we start with how to actually do it, before summarizing, in the corresponding algorithm, what generally has to be done.
In order to simplify things as much as possible, the figures that fol-
low are captured from the Access 2013 LibraryDB rdb (see Subsection
4.3.5).
For opening an Access db, just double-click on its name; note that if you only have a Runtime Access version, you can only manipulate Access-created db applications and data (if any): you need a full Access license in order to also modify the db schemas and applications of such dbs.
When double-clicking the LibraryDB.accdb file on a computer hav-
ing a full Access 2013 license installed, the window shown in Fig. 4.12
is opened. In the db Navigation left pane, under the All Tables group, you
see all the tables belonging to the corresponding db. In order to open the
scheme of anyone of them, right-click on its name and then click on the
Design View option; then, in order to get more screen space, you can mini-
mize the Navigation left pane by clicking on its upper rightmost button
(the “«” one; to reopen anytime this pane, just click on the same button,
which, when this pane is minimized, displays “»”).
Figure 4.13 shows the scheme of an Access table: its name (in this case
BORROWS_LISTS) is displayed in the newly opened tab header; its col-
umn names are displayed in the Field Name column; their corresponding
data types are shown in the middle Data Type column; the third Descrip-
tion (optional) column is intended to store the semantics associated to
each column.
The columns that make up the primary key of the table (if any) have a small yellow key icon to their left: in Fig. 4.13, this is displayed only to the left of the x column.
You can see the data subtypes and the domain constraints associated
to each column in the General left subpane under the scheme table; for
example, from Fig. 4.13, you can see that the Autonumber data type is
by default a subtype of the LONG one and, if you open the corresponding
Field Size combo-box, that the only other alternative for it is the Repli-
cation ID one (see Fig. 4.14).
If you open the New Values combo-box, you can see that Access gener-
ates autonumbers either incrementally (which is the default) or randomly
(see Fig. 4.15). From the Access documentation, you can see that both the
starting and the increment values for its incremental autonumbers are 1.
Finally, for single primary keys, just like for any other single key, you can see in Fig. 4.13 that the Indexed property is set to Yes (No Duplicates), which means that a unique index is built on this column and enforces its one-to-oneness.
Up to this point, after consulting Tables 4.1 and 4.9 above (where we
can see that LONG is NAT(9) in Access), we may infer that algorithm
REA2013A0 should output for this table the following DDL statement:
If the cursor is then moved to the second (Borrow) line of the table from Fig. 4.13, we can see (Fig. 4.16) that this column is a LONG that does not accept null values (because the Required property is set to Yes). Consequently, the output of REA2013A0 should now be the following:
When moving the cursor to the third line (Copy, see Fig. 4.17), we can see that this column is similar to the previous one, so the output of REA2013A0 should now be the following:
FIGURE 4.14 The two types of Access autonumbers (LONG and REPLICATION).
FIGURE 4.15 The two types of Access autonumbering (Increment and Random).
111
Unfortunately, there is no standard at least for the most frequently needed functions in dbs, although
this would be of great help for everybody.
when clicking on the Indexes icon of the ribbon (see Fig. 4.20). Note that all indexes are defined in this window, but only those that have the property Unique set to Yes enforce keys (the other ones are of no interest in this first volume: they will be discussed in the fourth chapter of the second one).
In this example, there are only two indexes, named blKey and PrimaryKey (according to the Index Name column from Fig. 4.20), both enforcing keys: the first one is made out of the columns Borrow and Copy (in this order), and the second one, which corresponds to the primary key of the table, only of the column x (according to the Field Name column). Consequently, the output of REA2013A0 should now be the following:
Once the inspection of all table keys is done, you should close the Indexes window. The next step is to discover the referential integrity constraints associated with this table: the simplest way to start is to browse the instance of the Access metacatalog table MSysRelationships, which stores all the referential integrity constraints enforced in the db.
By default, all metacatalog tables are hidden; in order to also be able to see the system tables in the Navigation pane, right-click the pane (e.g., to the right of its All Tables header) and then click on Navigation Options. Figure 4.21 shows the corresponding popup window that opens, in which you should check the Show System Objects check-box (bottom left). When you then click on its OK button (bottom right), this window closes and you can see the system tables in the Navigation pane too (with their icons and names, prefixed by MSys, slightly dimmed; see Fig. 4.22).
When you double-click on the MSysRelationships table icon, its instance opens (see Fig. 4.23); right-click on the name of the szObject column and then click on the Sort A to Z option; for better viewing of the table instance, double-click the dividing lines between columns to size them according to their content.
Column szRelationship stores the names of the corresponding referen-
tial integrity constraints; column grbit stores their subtypes (4352 for ON
UPDATE CASCADE and ON DELETE CASCADE; 4096 for ON UPDATE
RESTRICT and ON DELETE CASCADE; 256 for ON UPDATE CAS-
CADE and ON DELETE RESTRICT; and 0 for the default ON UPDATE
RESTRICT and ON DELETE RESTRICT).
Note that whenever the referenced column is an autonumber one112, cascading updates does not make sense –as values of automatically generated numbers cannot ever be changed–, so Access ignores the ON UPDATE CASCADE setting in such cases. Column szColumn stores the names of the foreign keys; column szObject stores the corresponding table names; column szReferencedColumn stores the columns that are referenced by the foreign keys; column szReferencedObject stores the corresponding table names.
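For example, the following Access SQL query (a sketch) lists all such constraints between user tables, leaving out the metacatalog ones:

SELECT szRelationship, szObject, szColumn,
   szReferencedObject, szReferencedColumn, grbit
FROM MSysRelationships
WHERE Left(szObject, 4) <> "MSys"
ORDER BY szObject, szRelationship;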
For example, from the third and fourth lines of the table instance from Fig. 4.23 we can find out that table BORROWS_LISTS has two such referential integrity constraints (both of the default subtype, as grbit is 0, named COPIESBORROWS_LISTS and BORROWSBORROWS_LISTS, respectively), the first one associated to its foreign key Copy (which references column x of table COPIES) and the second one to its foreign key Borrow (which references column x of table BORROWS). Consequently, the output of REA2013A0 should now be the following:
FIGURE 4.22 Access Navigation Pane also showing its system tables.
Borrow INTEGER(9) NOT NULL,
Copy INTEGER(9) NOT NULL,
DueReturnDate DATE NOT NULL CHECK (DueReturnDate
BETWEEN
113
But it is not, as this would prevent storing the fact that a book copy has actually been returned whenever its actual return date is greater than the corresponding due one.
FIGURE 4.24 Access Table Properties Property Sheet window (with no tuple
constraint).
FIGURE 4.25 Access Table Properties Property Sheet window (with a tuple constraint).
FIGURE 4.26 Algorithm REA2013A0 (Translation of (all current versions of) Access
db schemas into corresponding SQL ANSI DDL scripts).
As already seen in Section 1.7, the algorithms of the AF1–8 family are "shortcut" ones composed out of the following (families of) algorithms (in the order of their application): A1–7, A7/8–3, A8, and AF8′ (i.e., any of its members g viewed as a mapping can be obtained as g = f ∘ A8 ∘ A7/8–3 ∘ A1–7, where f is a member of AF8′, which means that, for any E-RDM data model m, the corresponding rdb g(m) is computable as g(m) = f(A8(A7/8–3(A1–7(m))))).
Of course, the challenge and beauty of any shortcut is to cut as short as advisable (i.e., without jeopardizing the journey), in order to gain time. To begin with, in this case, as we have already seen in the previous chapter, it is not at all advisable to skip A7/8–3. Fortunately, as A7/8–3 can also be applied directly on rdb tables, we can apply it after applying AF1–8, so one step is already gained (and, actually, g = f ∘ A8 ∘ A1–7, g(m) = f(A8(A1–7(m)))).
What can be done next to gain more time is to design AF1–8 as an extended A1–7 that outputs directly in the targeted RDBMS versions' SQL idioms, completely skipping both the RDM schemas and the ANSI standard SQL DDL scripts. The only price to be paid is losing the portability of your implementations; but, as portability is very rarely needed today, doing so is generally worthwhile.
method createSQLSTable(R)
if there is no table R in rdb M and R has attributes, or is referencing, or is
not referencing but is not both static and of small cardinality then
add to S.sql line "CREATE TABLE [" & R & "] (";
if R is a subset of S then
addSQLSForeignKey(R, x, S, "PK");
loop for all other sets T with R ⊆ T
addSQLSForeignKey(R, xT, T, "NNU");
end loop;
else
N = chooseSQLSnumericDT(maxcard(R));
add to S.sql line "[" & x & "]" & N & " IDENTITY PRIMARY KEY";
end if; completeSQLScheme(R); add to S.sql line ");";
end if;
end method createSQLSTable;
FIGURE 4.28 Algorithm A2014SQLS1–8’s createSQLSTable method.
method addSQLSForeignKey(R, A, S, o)
if S is a table then
N = chooseSQLSnumericDT(maxcard(S), 0);
s = s & ", [" & A & "] " & N;
else
l = 0;
loop for all e ∈ S
if length(e) > l then l = length(e);
end loop;
s = s & "[" & A & "] VARCHAR(" & l & ") CHECK ([" & A & "] IN (";
first = true;
loop for all e ∈ S
if first then first = false; else s = s & ", "; end if; s = s & "'" & e & "'";
end loop;
end if;
if o ∈ {"NN", "NNU"} then s = s & " NOT NULL "; if o = "NNU" then s =
s & " UNIQUE ";
elseif o = "PK" then s = s & " PRIMARY KEY "; end if;
s = s & ", CONSTRAINT [" & R & "-" & A & "-FK] FOREIGN KEY
([" & A & "]) REFERENCES [" & S & "]";
add to S.sql line s;
end method addSQLSForeignKey;
FIGURE 4.29 Algorithm A2014SQLS1–8's addSQLSForeignKey method.
method completeSQLScheme(R)
loop for all arrows A leaving R and heading to object set S, except for set
inclusion-type ones
if A is compulsory then addSQLSForeignKey(R, A, S, "NN");
else addSQLSForeignKey(R, A, S,); end if;
end loop (for all arrows);
loop for all ellipses e connected to R, except for the surrogate key x
if e is compulsory then addSQLSColumn(R, e, "NN"); else
addSQLSColumn(R, e,); end if;
end loop (for all ellipses);
loop for all uniqueness restrictions u associated to R
s = ", CONSTRAINT [" & u & "] UNIQUE ("; first = true;
loop for all mappings m making up u
if first then first = false; else s = s & ", "; end if; s = s & "[" & m &
"]";
end loop;
add to S.sql line s & ")";
end loop (for all uniqueness restrictions);
i = 1;
loop for all tuple-type restrictions t associated to R
add to S.sql line ", CONSTRAINT CHK_" & R & i & " CHECK (" &
t & ")"; i = i + 1;
end loop (for all tuple-type restrictions);
end method completeSQLScheme;
method addSQLSColumn(R, e, o)
s = ", " & e & " " & chooseSQLSDT(R, e);
if o ∈ {"NN", "NNU"} then s = s & " NOT NULL ";
if o = "NNU" then s = s & " UNIQUE ";
if e is computed according to expression E then s = s & " AS " & E;
add to S.sql line s;
end method addSQLSColumn;
FIGURE 4.31 Algorithm A2014SQLS1–8’s addSQLSColumn method.
method chooseSQLSnumericDT(n, m)
if m = 0 then
select case n
case < 3: return " TINYINT";
case ∈ {3, 4}: return " SMALLINT";
case ∈ [5; 9]: return " INT";
case ∈ [10; 18]: return " BIGINT";
case ∈ [19; 38]: return " DECIMAL(" & n & ", 0)";
case ∈ [39; 308]: return " FLOAT(53)";
else return " INTEGER OVERFLOW";
end select;
else
select case n
case < 39: return " DECIMAL(" & n & ", " & m & ")";
case ∈ [39; 308]: return " FLOAT(53)";
else return " RATIONAL OVERFLOW";
end select;
end if;
end method chooseSQLSnumericDT;
method chooseSQLStextDT(t, p)
if t = ASCII then
if p ≤ 8000 then return " VARCHAR(" & p & ")";
elseif p ∈ [8001; 2³¹ – 1] then return " VARCHAR(MAX)";
else return " ASCII OVERFLOW"; end if;
else
if p ≤ 4000 then return " NVARCHAR(" & p & ")";
elseif p ∈ [4001; 2³⁰ – 1] then return " NVARCHAR(MAX)";
else return " UNICODE OVERFLOW"; end if;
end if;
end method chooseSQLStextDT;
order of their application): REAF0′, REA0, REA1, and REA2 (or REAF0–1 and REA2, as REAF0–1 is the composition of REAF0′, REA0, and REA1).
In other words, any of its members g viewed as a mapping can be obtained as g = REA2 ∘ REA1 ∘ REA0 ∘ f, where f is a member of REAF0′, which means that, for any rdb r, the corresponding E-R data model g(r) is computable as g(r) = REA2(REA1(REA0(f(r)))) (or, equivalently, g = REA2 ∘ REAF0–1, g(r) = REA2(REAF0–1(r))).
Please note, first of all, that, as (E)MDM is not tackled in this first volume, neither are REA1, REA2, nor REAF0–1.
FIGURE 4.35 Algorithm REA2013A0–2 (Translation of (all current versions of) Access
db schemas into corresponding E-R data models).
method primaryKey(C, t, D)
underline C;
if t = ‘e’ then
draw the one-to-oneness arrow on the edge from ellipse C to its
rectangle;
write in RS the maximum cardinality restriction for T, labeled Ti,
equal to D’s cardinal;
i = i + 1;
else
make the arrow labeled C a double one;
end if;
end primaryKey;
FIGURE 4.36 Algorithm REA2013A0–2′s method primaryKey.
method addSemanticsAndClose
investigate the resulted E-RD and RS, and, based on your gained
insights, document in RS the semantics of all object sets, properties,
and restrictions, and write in D a first clear, concise, and as much as
possible exhaustive description of rdb S;
save and close .txt files E-RD, RS, and D, which make up the E-R data
model M;
close Access;
end addSemanticsAndClose;
FIGURE 4.37 Algorithm REA2013A0–2′s method addSemanticsAndClose.
4.7.1 THE STOCKS DB
In Fig. 4.38 we can see that this db is made out of the following 10 tables: Colors, Customers, Entries, EntryItems, EntryTypes, OrderItems, Orders, Products, Stocks, and Warehouses (Switchboard Items is the standard Access table for storing the application menu, which is a table-driven one, and the 6 dimmed tables –whose names start with MSys– make up the standard Access metadata catalog).
Figure 4.39 shows the metadata stored by Access on the referential integrity constraints of this db in its table MSysRelationships. We can see that there are the following 12 foreign keys (the other two belong to and reference metacatalog tables: their names are prefixed by "MSys"):
Ø fk1068REF548: Entries.EntryType references EntryTypes.#EntryType
Ø fk1065REF545: Entries.Customer references Customers.#Customer
table does not have other keys (as its third index, CustomerIdx, is not a
unique one).
Figures 4.48–4.51 present the five columns of table EntryItems: #Entry-
Item (surrogate autonumber primary key), Entry (long integer, no nulls,
indexed), Product (long integer, no nulls, indexed), Qty (double rational
strictly positive and at most 1000, scale 2, not null), and Warehouse (long
integer, no nulls). As seen in Fig. 4.52, this table has the key EntryItem-
Key, made up of Entry and Product (as its other two indexes, EIProductIdx
and EIEntryIdx, are not unique).
Figure 4.53 presents the two columns of table EntryTypes: #EntryType
(surrogate autonumber primary key) and EntryType (string of at most 64
characters, no nulls, one-to-one).
Figures 4.54–4.57 present the five columns of table OrderItems:
#OrderItem (surrogate autonumber primary key), Order (long integer,
no nulls, indexed), Product (long integer, no nulls, indexed), Qty (double
rational strictly positive and at most 1000, scale 2, not null), and Ware-
house (long integer, no nulls). As seen in Fig. 4.58, this table has the key
OrderItemKey, made up of Order and Product (as its other two indexes,
OIProductIdx and OIOrderIdx, are not unique).
nulls), and InitialStockDate (date between Jan. 3rd, 2000 and the current
system date and time, no nulls). As seen in Fig. 4.73, this table also has the
key StocksKey made of Warehouse and Product (as its other two indexes,
#Product and #Warehouse, are not unique).
Figure 4.74 presents the two columns of table Warehouses: #Ware-
house (surrogate autonumber primary key) and Warehouse (string of at
most 64 characters, no nulls, one-to-one).
The illiterate of the twenty-first century will not be those who cannot
read and write, but those who cannot learn, unlearn, and relearn.
—Alvin Toffler
Figure 4.75 shows the result of applying the reverse engineering algorithm
REA2013A0 to the above rdb. Note that for all foreign keys the columns
they reference are omitted, as all such columns are primary keys. Also
note that, due to their corresponding domain (check) constraints, the preci-
sion of the Qty columns has been downsized to only 6.
Note that the range for Stock and InitialStock was computed as follows:
according to Table 4.11, the DOUBLE Access data type may store decimal
numbers of up to 308 digits; as these two columns need 2 of them for the
scale, the maximum possible mantissa is 306; moreover, given that they
both accept only positive values, RAT+ is their base set (instead of RAT,
the set of all rationals).
FIGURE 4.75 The Stocks db ANSI standard SQL DDL creation script.
4.8.1 DATABASES
4.8.2 TABLES
For example, as in a CITIES table (see, for example, Fig. 3.5) there are many more distinct ZipCode values (rare duplicates being possible only between countries) than *Country name ones (as, generally, there are very many cities in a country), the corresponding key should be defined in the order <ZipCode, *Country> and not vice versa.
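For example (a sketch in ANSI SQL; the constraint name is illustrative, and *Country is implemented as the foreign key column Country):

ALTER TABLE CITIES
   ADD CONSTRAINT keyZipCountry UNIQUE (ZipCode, Country);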
4.8.3 COLUMNS
4.8.4 CONSTRAINTS
(e.g., for PEOPLE or CREATURES Sex you only need one character and
only a few letters out of the ASCII alphabet); for numbers and dates, only
a subset of a corresponding data type (e.g. [0; 1,000.00] for Stock or Price,
[1/1/2010, SysDate()] for PaymentDate and OrderDate, etc.).
R-I-P-2. (Always enforce all not null constraints, at least one per table in addition to the corresponding surrogate primary key one)
You should always enforce, for any column that should always be filled with data, its associated not null constraint: any table should have at least one such semantic column (i.e., not only the syntactic surrogate key one).
For example, there may be no countries, cities, companies, books, songs, movies, etc. without names (titles), and no order, invoice, bill, or delivery lines without product, price, and quantity, etc.
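For example (a minimal ANSI SQL sketch with illustrative names):

CREATE TABLE COUNTRIES (
   x INT PRIMARY KEY,                      -- syntactic surrogate key
   Country VARCHAR(128) NOT NULL UNIQUE);  -- semantic column: no country without a name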
There is no math behind this chapter, except for the compulsory propositions characterizing the algorithms that it introduces. As they are very simple and similar, for example, to Propositions 3.1 and 4.1 (see Exercise 4.1), they are all left to the reader.
Thus, Exercise 4.1 characterizes A8, while Exercise 4.20 invites readers to characterize all of the other algorithms presented in this chapter, as well as all of the extensions to them and all of the similar ones designed by readers, as proposed by the other exercises from Subsection 4.11.
4.10 CONCLUSION
4.11 EXERCISES
It is exercise alone that supports the spirits, and keeps the mind in vigor.
—Cicero
4.2.
a. Apply algorithm REA2013A0–2 to the MS Access Northwind
demo db.
b. Apply algorithm A7/8–3 to the table Purchase Order Details of the
MS Access Northwind demo db and add all discovered semantic
keys to its scheme.
Solution:
a. If you have not already downloaded the MS Access Northwind db, first do the following:
1. Open MS Access 2013.
2. In its top screen Search for Online Templates text-box type
Northwind and then press Enter: the window shown in Fig.
4.89 will open.
3. Click on its Northwind icon: the window shown in Fig. 4.90
will open.
4. Choose the desired name for the Northwind db and the folder of
your PC where you want it downloaded, and then click on the Cre-
ate tile: the window shown in Fig. 4.91 will open.
5. Click on the Enable Content button from its SECURITY WARN-
ING row: the window shown in Fig. 4.92 will open.
6. Either leave the default demo employee “Andrew Cencini” or
select any other one and then click on the Login button: the win-
dow shown in Fig. 4.93 will open.
7. Close the Home tab, open the Navigation Pane, select its All
Access Objects container, and open its Tables subset one: the win-
dow shown in Fig. 4.94 will open.
As can be seen from Fig. 4.94, the Northwind db has the following 20 tables: Customers, Employee Privileges, Employees, Inventory
Transaction Types, Inventory Transactions, Invoices, Order Details,
Order Details Status, Orders, Orders Status, Orders Tax Status, Priv-
ileges, Products, Purchase Order Details, Purchase Order Status,
Purchase Orders, Sales Reports, Shippers, Strings, and Suppliers.
8. Enable showing of the Access system metacatalog tables too (see
subsection 4.4, Figs. 4.21 and 4.22): the window shown in Fig.
4.94 will then look like in Fig. 4.95.
9. Double-click on the MSysRelationships table, close the Navigation
Pane, resize MSysRelationships columns so that all data is fully
visible.
FIGURE 4.95 Access 2013 Northwind demo db tables, including the system ones.
Orders both Approved By and Submitted By are, in fact, foreign keys ref-
erencing Employees.
This is why all these 4 columns are represented in the corresponding
E-RDs above both as ellipses (as they are) and as arrows (as they should
actually be).
The corresponding associated restrictions set is the following:
1. Customers (The set of customers of interest)
a. Cardinality: max(card(Customers)) = 10⁹ (ID) (C0)
b. Data ranges:
Company, Last Name, First Name, E-mail Address,
Job Title, City, State/Province,
Country/Region: UNICODE(50) (C1)
Business Phone, Home Phone, Mobile Phone,
Fax Number: UNICODE(25) (C2)
Address, Notes: UNICODE(4096) (C3)
ZIP/Postal Code: UNICODE(15) (C4)
Web Page: HYPERLINK (C5)
Attachments: ATTACHMENT (C6)
c. Compulsory data: ID (C7)
d. Uniqueness: ID (trivial: there may not be two
customers having same autogenerated id numbers) (C8)
2. Employees (The set of employees of interest)
a. Cardinality: max(card(Employees)) = 10⁹ (ID) (E0)
b. Data ranges:
Company, Last Name, First Name, E-mail Address,
Job Title, City, State/Province,
Country/Region: UNICODE(50) (E1)
Business Phone, Home Phone, Mobile Phone,
Fax Number: UNICODE(25) (E2)
Address, Notes: UNICODE(4096) (E3)
ZIP/Postal Code: UNICODE(15) (E4)
Web Page: HYPERLINK (E5)
Attachments: ATTACHMENT (E6)
c. Compulsory data: ID (E7)
d. Uniqueness: ID (trivial: there may not be two
employees having same autogenerated id numbers) (E8)
114
In fact, very probably, Purchase Order ID should be a foreign key referencing Purchase Orders,
that is an arrow in Fig. 4.116, not an ellipse.
115
In fact, very probably, Inventory ID should be a foreign key referencing Inventory Transactions,
that is an arrow in Fig. 4.116, not an ellipse.
116
Multivalued attribute, taking values in the powerset of Suppliers.
and comments –at most 255 characters–; compulsory are only the
type, product, and quantity);
Ø Purchase orders (supplier, status, employees that submitted,
created, and approved it, payment type –one of 3 predefined
ones–, submitted, creation, expected, payment, approved, and
(per product) received dates, shipping fee, taxes, payment
amount, and (per product) unit cost currencies, products and
corresponding inventory transactions, per product quantities –
rationals of at most 18 mantissa and 4 scale digits–, per product
notes –at most 4096 characters–, and whether received products
were posted to inventory or not; compulsory are only the id,
purchase order id, quantity, and unit cost);
Ø Customer orders (customer, status, tax status, and per product
detail status, employee in charge, shipper, payment type –one of
3 predefined ones–, order, shipped, paid, and (per product) allo-
cated dates, shipping fee, taxes, and (per product) unit cost cur-
rencies, products and corresponding inventory transactions and
purchase orders, per product quantities –rationals of at most 18
mantissa and 4 scale digits–, ship name, city, state/province, zip/
postal code, and country/region –at most 50 characters–, ship
address and notes –at most 4096 characters–, and per product
discount percentage; compulsory are only the order id, quantity,
and discount);
Ø Purchase and Customer Order and Details Status, including
tax ones, and inventory transaction types (compulsory name, at
most 50 characters);
Ø Employee Privilege Types (name, at most 50 characters);
Ø Employee Privileges (unique compulsory pairs of employees
and privilege types).
Figure 4.121 shows the Index window of this table after declaring the above keyPODInventoryID key, while Fig. 4.122 shows it after also declaring the keyPODProductID one.
4.9. Consider the case of rdbs in which there is at least one cycle in
their corresponding E-RD graph in which no two arrows meet sharp point
to sharp point:
a. should such cycles be allowed by RDBMSs? Why?
b. if yes, design needed (if any) modifications to algorithm
REA2013A0 such that it also copes with such cases.
4.10. Apply algorithm A2014SQLS1–8 to:
a. the E-R data model of the public library case study from Subsec-
tions 2.8.1 and 2.8.2; compare its result with the DDL script from
Subsection 4.3.4
b. the E-R data metamodel of the E-RDM presented in Section 2.9,
in order to obtain a MS SQL Server DDL script for generating new
metacatalog tables for it (that might be used when designing and
developing an E-RDM GUI)
c. the RDM metamodel of itself presented in subsection 3.7.2
*d. compare the results of c) above with SQL Server 2014's corresponding metacatalog tables.
4.11.
a. Design algorithms similar to A2014SQLS1–8 for DB2 10.5, Oracle
12c, MySQL 5.7, Access 2013.
*b. Same thing as above for the latest versions of PostgreSQL and
MongoDB.
4.12.
a. Design algorithms similar to A2014SQLS1–8 for SQL Server 2014,
DB2 10.5, Oracle 12c, MySQL 5.7, Access 2013, but for corre-
sponding instructions on how to implement rdbs through their
GUIs (not through SQL DDL scripts).
*b. Same thing as above for the latest versions of PostgreSQL and
MongoDB.
4.13.
a. Design extensions to both REA2013A0 and REA2013A0–2 so that
they also generate DML scripts to populate dbs with the source rdb
instances.
b. Apply these extensions to the public library case study db instance
from Figure 3.13 and compare their output with the DML script
from Subsection 4.3.5.
4.19.
a. Would you add to the Customers table of the Stocks db (see Section
4.7) two Boolean columns Customer? and Supplier? Why?
b. Make all necessary changes, both in the ANSI standard SQL DDL
script from Subsection 4.7.2 and in the E-R data model from Sub-
section 4.7.3 in order to add these two columns to the Stocks db.
4.20. Based on the examples provided by Proposition 3.1 and Exercise 4.1, characterize and prove the corresponding properties of the algorithms REA2013A0, REA2013A0–2, and A2014SQLS1–8, as well as of all of the other algorithms that you designed from scratch or extended as asked by Exercises 4.3.a and b, 4.4 to 4.8, 4.9.b, 4.11, 4.12, 4.13.a, 4.14, 4.15, 4.16.a, 4.17, and 4.18.b.
4.21. Apply the algorithm A8 to the relational schemas from Figs. 3.12
and 3.20, Tables 3.7–3.13, as well as to all those obtained by solving Exer-
cises 3.9, 3.28.a, and 3.76–3.81.
4.22. Apply algorithm A2014SQLS1–8 to the E-R data models from Subsections 2.9.1 and 2.9.2, 3.7.3.1 and 3.7.3.2, Figs. 3.28–3.30, as well as to all those obtained by solving the exercises from Chapter 2.
c. IBM DB2 10.5 Express-C (hint: only the most frequently used SYSCAT ones, indicated by the following web page and the next (third) one, http://www.devx.com/dbzone/Article/29585/0/page/2, would be enough for this exercise).
**4.40.
a. Reapply algorithm A0 to the descriptions from the E-R data models
obtained in Exercise 4.39, in order to correct these E-R data mod-
els (i.e., conceptually reverse engineer the corresponding views,
in order to discover the metacatalog tables from which they were
computed).
b. Apply corresponding algorithms of the family AF1–8 to the cor-
rected E-R data models obtained at a. above, in order to obtain SQL
DDL scripts for creating the corresponding metacatalog tables, run
them in the corresponding RDBMS versions, and then design for
them the views similar to those in Exercise 4.39 (your starting
point to this journey; hint: trivially, you have to do it in user dbs of
your own, not in the corresponding system ones!).
All of the best practice rules and algorithms presented in this section, as
well as their characterization propositions, application examples and exer-
cises come from (Mancas, 2001). All these algorithms are also included in
MatBase (see, e.g., Mancas and Mancas (2005)).
Algorithms similar to A8 are included in all RDBMSes that provide
translation of the rdbs they manage into SQL DDL scripts.
Algorithms similar to REA0 are included in all RDBMSes for creating
or modifying rdbs when executing SQL DDL scripts.
Different variants of algorithms from the AF1–8 family are embedded
in the data modeling tools provided by DB2, Oracle, SQL Server, as well
as by third party solutions (see, for example, the universal Toad one, now
owned by the Dell Corp.: http://www.quest.com/toad-data-modeler/).
There also exist other third party tools for generating corresponding SQL DDL scripts from Access dbs: for example, see DBWScript from DBWeigher.com (http://dbweigher.com/dbwscript.php).
For obtaining E-RD variants from rdbs there are also tools, both RDBMS-embedded (like, for example, the MS Access Relationships and SQL Server Database Diagrams, the Oracle SQL Developer Data Modeler, and the IBM DB2 Data Studio or InfoSphere Data Architect) and third party (see, for example, the CA ERwin Data Modeler, http://erwin.com/products/data-modeler, the Visual Paradigm, http://www.visual-paradigm.com/features/database-design/, or the SmartDraw, http://www.smartdraw.com/resources/tutorials/entity-relationship-diagrams/). Unfortunately, since its 2013 version, MS Visio no longer has this facility.
Excellent DB2 10.5, Oracle 12.c, MySQL 5.7, SQL Server 2014, and
Access 2013 documentation is provided by IBM, Oracle, and Microsoft,
respectively (see references in the next subsection).
Even if rdb optimizations will only be tackled in the fourth chapter of the next volume, some initial aspects of them were also introduced in this chapter, from our own experience, from the RDBMS documentations, and from Bradford (2011), Feuerstein (2008), MacDonald (2013), and Vaughn and Blackburn (2006).
All URLs mentioned in this section were last accessed on Sep-
tember 30th, 2014.
KEYWORDS
•• (z)Linux
•• @@DBTS
•• @@IDENTITY
•• A1-7
•• A2014SQLS1-8
•• A7/8-3
•• A8
•• Access
•• addSemanticsAndClose
•• addSQLSColumn
•• ADO
•• ADOX
•• AF1-8
•• AF8’
•• AFTER UPDATE
•• AIX
•• ALL_OBJECTS
•• Allow Zero Length
•• ALTER SESSION
•• ALTER TABLE
•• ANCHOR
•• ANSI
•• ANYDATA
•• ANYDATASET
•• ANYTYPE
•• Array
•• ASCII
•• AS ROW BEGIN
•• AS ROW CHANGE
•• AS ROW END
•• associative array
•• ATTACHMENT
•• AUTO_INCREMENT
•• Azure
•• BasicFile LOB
•• BEFORE INSERT
•• BEFORE UPDATE
•• BFILE
•• BIGINT
•• BINARY
•• BINARY_DOUBLE
•• BINARY_FLOAT
•• BIT
•• BITEMPORAL
•• BLOB
•• BLU
•• BOOL
•• BOOLEAN
•• built-in data type
•• BUSINESS_TIME
•• BYTE
•• cardinality
•• CASCADE DELETE
•• CASCADE UPDATE
•• CAST
•• CCSID
•• CHAR
•• check constraint
•• chooseSQLSDT
•• chooseSQLSnumericDT
•• chooseSQLStextDT
•• CLOB
•• CODEUNITS16
•• CODEUNITS32
•• COMMENT
•• completeSQLScheme
•• compulsory data
•• computed column
•• CONCAT
•• COUNT
•• COUNTER
•• CREATE DATABASE
•• CREATE FUNCTION
•• CREATE SEQUENCE
•• createSQLSForeignKey
•• createSQLSTable
•• CREATE TABLE
•• CREATE TRIGGER
•• CREATE TYPE
•• CREATE USER
•• CurDate
•• CURRENCY
•• CURRENT_DATE
•• CURRENT_TIME
•• CURRENT_TIMESTAMP
•• CURRENT_TIMEZONE
•• CURRENT DATE
•• CURRENT LOCALE LC_TIME
•• CURRENT TIME
•• CURRENT TIMESTAMP
•• CURRENT TIMEZONE
•• CurTime
•• DANGLING
•• DAO
•• Data Architect
•• database
•• Database Partitioning Feature (DPF)
•• data event
•• DATALINK
•• Data Links Manager
•• data range
•• Data Studio
•• data type
•• DATE
•• DATE_ADD
•• DATE_SUB
•• DATETIME2
•• DATETIMEOFFSET
•• DB2
•• DB2SECURITYLABEL
•• DB2SQLSTATE
•• DBA_OBJECTS
•• DBCLOB
•• DBCS
•• DBMS_XMLSCHEMA
•• DBTimeZone
•• DBURIType
•• DBWScript
•• DDL
•• DECFLOAT
•• DECIMAL
•• DELETE RESTRICT
•• delimited identifier
•• DEREF
•• DESCRIBE
•• Design View
•• DETERMINED BY
•• DETERMINISTIC
•• Developer Data Modeler
•• directly usage
•• direct recursive type
•• DISTINCT
•• DML
•• domain constraint
•• DO statement
•• DOUBLE
•• DOUBLEPRECISION
•• DROP
•• dummy date
•• HP-UX
•• HTTPURIType
•• IBM
•• identifier
•• IDENTITY
•• indexed
•• index file
•• indirectly usage
•• indirect recursive type
•• infinity
•• information_schema
•• InfoSphere Data Replication
•• InfoSphere Warehouse
•• INSERT INTO
•• INT
•• INTEGER
•• INTERVAL DAY
•• INTERVAL YEAR
•• iSeries
•• key
•• LAST_INSERT_ID
•• LIMIT
•• Limit to List
•• Linux
•• LOAD DATA INFILE
•• LOB technology
•• Local_Timestamp
•• local temporary table
•• LOGICAL
•• LONG
•• LONGBLOB
•• LONG RAW
•• LONGTEXT
•• Lookup
•• Lookup Wizard
•• Management Studio
•• MatBase
•• materialized view
•• MAX_STRING_SIZE
•• MAXDB
•• MBCS
•• MEDIUMBLOB
•• MEDIUMINT
•• MEDIUMTEXT
•• Microsoft
•• minimal uniqueness
•• Mixed Based Replication (MBR)
•• MIXED DATA
•• MIXED DECP
•• MONEY
•• MongoDB
•• MSysRelationships
•• MULTILINESTRING
•• MULTIPLE BYTE
•• MULTIPOINT
•• MULTIPOLYGON
•• MyISAM
•• MySQL
•• NaN
•• NCHAR
•• NCLOB
•• NDBCLUSTER
•• NESTED TABLE
•• NEWID
•• Northwind
•• NOT NULL
•• QBE
•• QMF
•• Query Design
•• Query Management Facility
•• quiet NaN
•• QUOTED_IDENTIFIER
•• R-I-C-0
•• R-I-C-1
•• R-I-C-2
•• R-I-C-3
•• R-I-C-4
•• R-I-C-5
•• R-I-D-0
•• R-I-D-1
•• R-I-D-2
•• R-I-P-0
•• R-I-P-1
•• R-I-P-2
•• R-I-P-3
•• R-I-P-4
•• R-I-P-5
•• R-I-T-1
•• R-I-T-2
•• R-I-T-3
•• R-I-T-4
•• RAW
•• REA0
•• REA1
•• REA2
•• REA2013A0
•• REA2013A0-2
•• REAF0-1
•• REAF0-2
•• REAF0’
•• REAL
•• recursive type
•• REF
•• REFERENCE
•• REFERENCES
•• referential integrity
•• REGEXP
•• relational scheme
•• REPLACE
•• Required
•• RESET
•• reverse engineering
•• root table
•• root type
•• Row Based Replication (RBR)
•• ROW BEGIN
•• ROW CHANGE TIMESTAMP
•• ROW END
•• ROWGUIDCOL
•• ROWID
•• ROWVERSION
•• Runtime
•• Sakila
•• SBCS
•• SDO_GEOMETRY
•• SDO_GEORASTER
•• SDO_TOPO_GEOMETRY
•• SecureFile LOB
•• SELECT
•• Sequence
•• SERIAL
•• Server Pubs
•• SessionTimeZone
•• SET
•• SharePoint
•• SHORT
•• SHORT TEXT
•• SHOW
•• Show System Objects
•• SI_AverageColor
•• SI_Color
•• SI_ColorHistogram
•• SI_FeatureList
•• SI_PositionalColor
•• SI_StillImage
•• SI_Texture
•• signaling NaN
•• SINGLE
•• SMALLDATETIME
•• SMALLINT
•• SMALLMONEY
•• Smart Draw
•• Solaris
•• Spatial and Graph
•• Spatial Extender
•• SQL
•• SQL/MM StillImage
•• SQL_VARIANT
•• SQL Server
•• Statement Based Replication (SBR)
•• STDDT
•• STRAIGHT_JOIN
•• strongly typed data type
•• STRUCTURED
•• subnormal number
•• subtable
•• SUM
•• Supertable
•• supertype
•• SYSCAT
•• SysDate
•• SysDateTime
•• SysDateTimeOffset
•• SYSDUMMY
•• SYSFUN
•• SYSIBM
•• SYSPROC
•• SYSTEM_TIME
•• system views
•• SysTimestamp
•• SysUTCDateTime
•• szColumn
•• szObject
•• szReferencedColumn
•• szReferencedObject
•• szRelationship
•• table
•• tablespace
•• target type
•• temporary table
•• TEXT
•• Time
•• TIMESTAMP
•• TIMESTAMP WITH LOCAL TIME ZONE
•• TIMESTAMP WITH TIME ZONE
•• TINYBLOB
•• TINYINT
•• TINYTEXT
•• TOP
•• TRANSACTION START ID
•• TRANSFORM
•• TRANSLATE
•• Trigger
•• TRIM
•• tuple constraint
•• typed table
•• typed view
•• type hierarchy
•• UDT
•• UNICODE
•• UNION ALL
•• UNIQUEIDENTIFIER
•• unique index
•• uniqueness
•• Universal Naming Convention (UNC)
•• Unix
•• UNPIVOT
•• UNSIGNED
•• UPDATE OF
•• UPDATE RESTRICT
•• URIFactory
•• URIType
•• UROWID
•• USER_OBJECTS
•• user defined data type
•• UTF-16
•• UTF-8
•• VALIDATED
•• VALIDATING
•• Validation Rule
•• VARBINARY
•• VARCHAR
•• VARCHAR2
•• VARGRAPHIC
•• VARRAYS
•• VBA
•• virtual column
•• Visio
•• Visual Paradigm
•• VM/VSE
•• weakly typed data type
•• Windows
•• WTDDT
•• XML
•• XMLEXISTS
•• XMLQUERY
•• XMLType
•• XPath
•• XQuery
•• Year
•• Yes (No Duplicates)
•• YESNO
•• zOS
REFERENCES
CONCLUSION
The possession of knowledge does not kill the sense of wonder and mystery.
There is always more mystery.
—Anaïs Nin
CONTENTS
We cannot solve our problems with the same thinking we used when we
created them.
—Albert Einstein
The relational db scheme design problem is, essentially, twofold: what table schemes should be included in a db scheme? and what columns and constraints should each of its table schemes contain?
The following db design axioms provide criteria for comparing and
hierarchizing alternative design solutions:
A-DA0. Non-relational rdb design axiom: rdb schemes should not be designed relationally; instead, they should first be designed by using E-RDs and associated restriction sets; then, the conceptual design should be refined by using a more powerful data model, able to heavily assist us in discovering and designing all existing constraints, as well as in correcting and refining all errors in the E-RDs; finally, the resulting conceptual schemes should be (preferably, automatically) translated into corresponding relational ones, plus associated non-relational constraint sets. As an emergency temporary shortcut, you could skip the intermediate data model level, but never the E-RDM one.
A-DA01. Data plausibility axiom: any db instance should always store only plausible data; implausible (“garbage”) data might be stored only temporarily, during updating transactions (i.e., at any time before the start and after the end of such a transaction, all data should be plausible).
A-DA02. No semantic overloading axiom: no fundamental object set
or property should be semantically overloaded (i.e., any such object set
or property should have only one simple and clear associated semantics).
A-DA03. Non-redundancy axiom: in any db, any fundamental data should be stored only once, such that inserting new data, as well as updating or deleting existing data, should be done in/from only one table row.
A-DA04. Unique objects axiom: just like, generally, for set elements,
object sets do not allow for duplicates (i.e., each object for which data is
stored in a db should always be uniquely identifiable through its corre-
sponding data).118
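For example, here is a minimal SQL sketch (table and column names are only illustrative, echoing the *Country • Zip key example of footnote 119 below): besides its surrogate primary key, the table declares a semantic key, so that no two of its rows may differ only in their surrogate values:

-- Every city is uniquely identifiable both by its surrogate key
-- and, semantically, by its country and zip code.
CREATE TABLE CITIES (
  CityID   INT PRIMARY KEY,      -- surrogate key
  Country  INT NOT NULL,         -- assumed foreign key to a COUNTRIES table
  Zip      VARCHAR(16) NOT NULL,
  CityName VARCHAR(64) NOT NULL,
  UNIQUE (Country, Zip)          -- semantic key: no duplicate objects
);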
118
Consequently, no db fundamental table should ever contain duplicate rows (the so-called table syntactic horizontal minimality). Moreover, for such tables, except for some very few exceptions (namely, tables corresponding to subsets or to object sets of the type poultry/rabbit/etc. cages), their rows should differ not only in their surrogate key column, but in other columns as well (the so-called table semantic horizontal minimality).
119
see, for example, keys *Country • Zip from Fig. 3.5 (Subsection 3.2.3) and ConstrName • *DB from
Table 3.11 (Subsection 3.7.2)
thus only allowing for some object sets or properties the empty set as their
instances).
A-DA13. Minimum inclusions axiom: in any db model, for any inclusion between object sets T ⊆ S, S should be the smallest set including T.
A-DA14. Inclusions acyclicity axiom: no fundamental object set inclu-
sions chain should be a cycle (dually, for any db scheme, its fundamental
inclusions graph should be acyclic).
A-DA15. Sets equality axiom: no equality of two fundamental object
sets should ever be added to a conceptual db scheme (i.e., such equali-
ties, if any, are always to be interpreted simply as declaring object set
aliases).120
A-DA16. Mappings equality axiom: no equality of two fundamental mappings should ever be added to a db scheme (i.e., such equalities, if any, are always to be interpreted simply as declaring mapping aliases).121
A-DA17. E-RDs axiom: except for an overview one, no E-RD should exceed one letter/A4 page, and its nodes and edges should not intersect one another, except for the tangent points of edges to their connected nodes.
A-DA18. E-RD cycles axiom: all E-RD cycles should be thoroughly analyzed, in order to discover whether they should be broken or whether additional constraints need to be declared in the corresponding db scheme.
A-DA19. Naming axiom: all dbs and their components (object sets,
mappings, constraints, tables, columns, queries, stored procedures,
indexes, etc.) should be consistently named (i.e., strictly adhering to some
naming conventions), always embedding as much unambiguous semantics
as possible in each name. Names given during data analysis should be kept
identical on all design levels and down to corresponding db implementa-
tions. Names for elements of any such sets should be unique within their
corresponding sets.
A-DA20. Documentation axiom: Document all of your software (con-
sequently, data analysis, db design, implementation, manipulation and
120
Dually, rdb implementations may also contain, for optimality reasons, several tables corresponding to the same object set, obtained by vertical and/or horizontal splitting, provided that all of them are correctly linked and processed: the instances of those obtained horizontally should always remain pairwise disjoint, whereas any of the n–1 schemas (n > 1, natural) obtained vertically should be implemented as corresponding to subsets of the remaining (master) one.
121
That is, no table should ever contain two columns whose values should always be equal (the so-called table vertical minimality).
Beware of the problem of testing too many hypotheses; the more you torture
the data, the more likely they are to confess, but confessions obtained
under duress may not be admissible in the court of scientific opinion.
—Stephen M. Stigler
122
That is, all existing constraints (in the corresponding subuniverse of discourse), be they relational or not, should be enforced in the db.
The competent programmer is fully aware of the limited size of his own skull.
He therefore approaches his task with full humility,
and avoids clever tricks like the plague.
—Edsger W. Dijkstra
A-DM0. Positive data axiom: all data stored in conventional dbs is positive.
A-DM01. Closed world axiom: facts that are not known to be true (i.e.,
for which there is no stored data) are implicitly considered as being false.
A-DM02. Closed domains axiom: there are no other objects than those
for which there is stored data.
A-DM03. Null values set axiom: there is a distinguished countable set
of null values.
A-DM04. Null values axiom: Null values may model either (tempo-
rary) unknown or inapplicable or no information data.
A-DM05. Available columns axiom: only the columns of the tables, views, and subqueries of the FROM clause are available to all the other clauses of the corresponding SQL statement.
A-DM06. GROUP BY (golden rule) axiom: in the presence of a GROUP BY clause, the corresponding SELECT clause may freely contain only column expressions also present in the GROUP BY one; all other available columns may be used in SELECT only as arguments of aggregation functions.
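For example, over a hypothetical ORDERS(OrderID, CustomerID, Amount) table (names chosen only for illustration), the golden rule permits the first query below and forbids the second:

-- Legal: CustomerID appears in GROUP BY; Amount is used only
-- inside the aggregation function SUM.
SELECT CustomerID, SUM(Amount) AS Total
FROM ORDERS
GROUP BY CustomerID;

-- Illegal under the golden rule: Amount is neither grouped by
-- nor aggregated.
-- SELECT CustomerID, Amount FROM ORDERS GROUP BY CustomerID;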
124
Obviously, this is a particularization of axiom A-DA11 above.
125
That is, not only anomaly-free with respect to the domain and key constraints, but also with respect to all other constraints, regardless of their type.
A-DM07. Relevant data and processing axiom: any query should consider and minimally process, in each of its steps, only relevant data; for example,
✓ the SELECT and FROM clauses should contain only needed columns and tables, respectively;
✓ use the GROUP BY clause only when needed (e.g., computing groups always containing only one row is senseless);
✓ in the presence of a GROUP BY clause, all logical conditions that may be evaluated before grouping (i.e., not containing aggregate functions) should be placed in the corresponding WHERE clause and only those that contain aggregated columns should be placed in the corresponding HAVING one;
✓ ordering should be done only on needed columns (expressions);
✓ joins, groupings, and as much as possible of the needed logical conditions should be performed on numeric columns (surrogate keys and foreign keys referencing them): only in the last step of a query should surrogate keys and pointers (i.e., foreign keys) be replaced by the corresponding desired semantic columns, as sketched after this list.
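As a sketch of these rules (assuming hypothetical CUSTOMERS(CustomerID, CustomerName) and ORDERS(CustomerID, OrderDate, Amount) tables), the following query places the pre-grouping condition in WHERE and the aggregate one in HAVING, joins and groups only on the numeric surrogate and foreign keys, replaces the surrogate key by the semantic CustomerName column only in its last step, and orders only once, at the very end:

-- WHERE filters rows before grouping; HAVING keeps only the
-- condition that needs the aggregate; the inner query works
-- exclusively on numeric key columns.
SELECT c.CustomerName, t.Total
FROM CUSTOMERS c
     INNER JOIN (SELECT CustomerID, SUM(Amount) AS Total
                 FROM ORDERS
                 WHERE OrderDate >= '2015-01-01'
                 GROUP BY CustomerID
                 HAVING SUM(Amount) > 1000) t
     ON t.CustomerID = c.CustomerID
ORDER BY t.Total DESC;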
A-DM08. ORDER BY axiom: except for well-founded exceptions, as part of a minimal courtesy towards our customers, query results should be presented in the most interesting possible order (even if not explicitly asked for); dually, ordering should be done only once per query, as its last step.
A-DM09. Data importance axiom: as data is one of the most important assets of our customers, preserving its desired values, as well as the consistency of the corresponding db instances, is a must (e.g., do not ever insert, update, or delete data that should not be inserted/updated/deleted).
A-DM10. Query semantics correctness axiom: any syntactically cor-
rect query computes a table, but neither its instance, nor even its header is
guaranteed to be the result expected by our customers (e.g., some of our
queries are, in fact, equivalent possible definitions of the empty set).126
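For instance, the following syntactically correct query (over a hypothetical PEOPLE table with a BirthYear column) is nothing but a disguised definition of the empty set, as its two conditions can never hold simultaneously:

-- Always computes an empty instance: no BirthYear value can be
-- simultaneously less than 1900 and greater than 2000.
SELECT *
FROM PEOPLE
WHERE BirthYear < 1900 AND BirthYear > 2000;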
A-DM11. Fastest possible manipulation axiom: at least in production environments, data manipulations should be done with the best possible speed.
126
In order to guarantee their semantic correctness (the only one relevant to our customers), if you cannot formally prove it, at least informally read each query’s meaning immediately after you consider it finalized and compare the reading with the corresponding initial request: if they do not match exactly, then reconsider the query design and/or development.
Even if no intuitive support can be given for the time being for optimization, which heavily depends on the current corresponding rdb instances, workload, and data access paths (see the second volume of this book), the optimization axioms are already presented in this section too, so that all axioms are grouped together.128
Optimization also heavily depends on the DBMS versions used; as such, only the very general optimization axioms are presented here.
Finally, specific optimization axioms are presented in the second volume for the five RDBMS versions considered as examples in this book.
A-DO0. Reduce data contention axiom: intelligently use hard disks, big files, multiple tablespaces, partitions, and segments with adequate block sizes, separate user and system data, avoid constant updates of the same row, etc., in order to always reduce data contention to the minimum possible.
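For example, here is a minimal PostgreSQL-style sketch (table and column names are only illustrative) that spreads the rows of a big table over per-year partitions, so that sessions touching different years do not contend for the same physical segments:

-- Range-partitioned table: rows of different years are stored in
-- different physical segments, reducing block-level contention.
CREATE TABLE INVOICES (
  InvoiceID   BIGINT NOT NULL,
  InvoiceDate DATE   NOT NULL,
  Amount      NUMERIC(12,2)
) PARTITION BY RANGE (InvoiceDate);

CREATE TABLE INVOICES_2015 PARTITION OF INVOICES
  FOR VALUES FROM ('2015-01-01') TO ('2016-01-01');
CREATE TABLE INVOICES_2016 PARTITION OF INVOICES
  FOR VALUES FROM ('2016-01-01') TO ('2017-01-01');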
127
Data is not like trees (that you have to cut one by one), but rather like grains (that you may crop one
by one, but then all of us would eventually starve).
128
Consequently, beginner readers might skip it at a first sequential read of this volume and return to
it after reading and understanding the second volume.
Whenever you set out to do something, something else must be done first.
—(Corollary 6 to the first Murphy’s Law:
If anything can go wrong, it will.)
Essentially, this first volume advocates using at least three levels from Fig. 1.1 (namely, the E-RDM, RDM, and RDBMS ones) and five corresponding FE algorithms (A0, A1–7, A7/8–3, A8, and one of the AF8′), whenever having to create a new rdb or to extend an existing one. Dually, when having to understand an insufficiently documented rdb, it advocates using the same three levels and three corresponding RE algorithms (namely, one of the REAF0’, REA0, and REA1–2).
Obviously, there is an even shorter possible path presented in this book too, namely using one of the composed algorithms from the AF1–8 family, instead of the composition of one of the AF8 and A1–7. This should be considered only by exception, when under time pressure: as we have already seen, skipping the RDM level means losing portability (sometimes even between different versions of the same RDBMS).
129
The need for synchronous write is one of the reasons why the online redo log can become a bottle-
neck for applications.
designers ignore any constraint in their fields, hoping that the workers who
implement their models will discover and enforce them during manufac-
turing or, even worse, that not the workers, but the users of those products
will do it?
Moreover, generally, I am not the only one who would always stress that, for its own sake, as well as for ours (workers in this field and/or users of its artifacts), the IT industry should rapidly mature at least to the level of, let us say, car manufacturing (as building is already more than 5,000 years old, and such a gap is very, very hard to narrow rapidly…): would you buy, or even only drive or use, a car (driven by somebody else) that was artisanally manufactured, without proper design, without taking into consideration all the physics and safety rules that govern this subuniverse?
To conclude the above considerations, I am only one of the very many
who are convinced that we should not skip the E-RDM, but, on the con-
trary, we should always make full and intelligent use of it in the first step
of the conceptual data analysis and modeling process.
Dually, unfortunately, even extended with restriction sets, E-RDM is not enough for data modeling, at least from the following four points of view:
• E-RDM totally ignores the following db design axioms (see Subsection 5.1.1): A-DA03, A-DA05, A-DA07, A-DA08, A-DA11 to A-DA16, A-DA18, and A-DA20.
• E-RDM only partially takes into account the following db design axioms (see Subsection 5.1.1): A-DA01 (as the discovery of all existing business rules in any subuniverse of discourse is not even an aim of restriction sets), A-DA04 (as, for restriction sets, declaring only one uniqueness per object set is enough), and A-DA20 (as only informal descriptions of the subuniverses of discourse to be modeled are compulsory).
• E-RDM has no means to check at least the syntactical correctness
of E-R data models (e.g., are all nonfunctional relationship-type
object sets and all functional relationships and all attributes cor-
rectly defined mathematically?).
• Moreover, there are very few commercial (and even prototype) DBMSs that provide E-R user interfaces making the RDM (or any other) level of db scheme implementation completely transparent to users, and even fewer that also provide E-R query languages at least as powerful as SQL.
All these four considerations imply that E-RDM alone is not enough
for accurate, expert conceptual data analysis, db design, implementation,
and usage.
To conclude this subsection, the answers to the questions in its title are:
“definitely yes” and “definitely no”, respectively.
Due to the fact that RDBMSs are still the norm in this field and that they are so close to RDM (and, hence, to one another), there is a great temptation to skip RDM completely in the processes of both db design and usage. However, as already discussed in detail in Section 3.13, the answers to the questions in the title of this subsection are: “yes” (at least for portability and RLs reasons) and “definitely no”, respectively.
I would only add here, to the pros, that it is always much better not to try to solve any problem from the beginning in technological terms, but in higher, more abstract ones. In particular, in the RDM case, despite the great number of fundamental compatibilities between nearly all RDBMS versions, there are, however, enough technological differences between them too, sometimes even between different versions of the same RDBMS.
Dually, for the cons, I would also add here the following two other reasons:
• RDM totally ignores the following db design axioms (see Section 5.1): A-DA07, A-DA08, A-DA11 to A-DA16, A-DA18, A-DI01 to A-DI05, A-DI08, A-DI09, and A-DI12 to A-DI17.
• RDM only partially takes into account the following db design axioms (see Section 5.1): A-DA01, A-DI0 (as it does not take into account non-relational constraints), and A-DA04 (as users are not coerced to declare at least one key for any fundamental table).
Not only from my point of view, such a data model should, first of all, fully take into account all the db axioms from Section 5.1: even for those that it cannot automatically guarantee, it should at least provide a framework and tools to help its users comply with them.
Next, it should also incorporate:
• an algorithm of type A3, for the (at least syntactical) validation of the initial E-R data models;
• one of type A4, for assisting E-RD cycles analysis (because, as shown in the second chapter of the second volume of this book, such cycles very frequently embed a lot of non-relational constraints);
• one of type A5, for assisting the not always obvious task of guaranteeing the coherence of the constraint sets;
• one of type A6, for assisting the much more difficult task of guaranteeing the minimality of the constraint sets.
Finally, above all, in order to guarantee data plausibility, such a model
should also incorporate all needed constraint types, not only the relational
ones: ideally, it should provide at least the subclass of closed Horn clauses130.
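For instance (a purely illustrative example, written in LaTeX notation over hypothetical Ancestor and Father predicates), the transitive ancestor rule also used as an inference-rule example in the Appendix is exactly such a closed Horn clause: all of its variables are quantified and it has at most one positive literal:

% a closed Horn clause expressing a non-relational (transitivity-like) constraint
\forall x \, \forall y \, \forall z \, \bigl( Ancestor(x, y) \wedge Father(z, x) \rightarrow Ancestor(z, y) \bigr)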
There are hundreds of higher (than RDM and E-RDM) level concep-
tual data models, but, to my knowledge, none of them incorporates all the
above as well, except the (E)MDM that is presented and used in the second
volume of this book.
130
Recall that Horn clauses are the largest class of first-order predicate formulas for which the implication problem is decidable.
As we have already seen, no data model and/or DBMS can take into account all of the above db axioms and best practice rules. Consequently, I would like to take this final (in the realm of this volume) opportunity to summarize once more the main ideas behind the crucial ones in the following dodecalog:
1. Data is increasingly becoming the most important asset of customers: we must always commit to keeping it plausible, safe, and available at the maximum possible speed.
2. Only humans can understand semantics and break it down into
atomic ones: algorithms and/or machinery can only assist us with it.
3. It is not only in our customers’, but also in our own interest to
never semantically overload either object sets or mappings.
4. Discovering all existing business rules that govern any subuniverse of interest, adding them to our data models, and enforcing them in any corresponding db is crucial in guaranteeing data plausibility: missing any one of them allows for storing implausible data.
5. Adding any implausible constraint to a data model should always be banned, as it would not allow for storing plausible data in the corresponding dbs.
6. For any db, its associated constraint set should be coherent and
minimal.
7. Without proper data analysis and conceptual db design, the resulting dbs have very little chance of satisfying customers.
8. Just like, for example, in the case of aircraft, following the prescriptions of algorithms and best practice rules is the safest thing to do in the db field too.
9. Redundant data should always be strictly controlled: the corresponding values should always be read-only for users and should be automatically calculated immediately when needed.
10. Object elements represented in dbs only by null values and/or which are not semantically uniquely identifiable (i.e., not only by surrogate keys) within their object sets are useless and confusing, taking up storage space and slowing down query execution for nothing: consequently, they should be banned from any db instance.
11. When executed, a query always returns the correct answer to the corresponding customer question if and only if it was also semantically correct.
KEYWORDS
•• A-DA0
•• A-DA1
•• A-DA2
•• A-DA3
•• A-DA4
•• A-DA5
•• A-DA6
•• A-DA7
•• A-DA8
•• A-DA9
•• A-DA10
•• A-DA11
•• A-DA12
•• A-DA13
•• A-DA14
•• A-DA15
•• A-DA16
•• A-DA17
•• A-DA18
•• A-DA19
•• A-DA20
•• A-DI0
•• A-DI1
•• A-DI2
•• A-DI3
•• A-DI4
•• A-DI5
•• A-DI6
•• A-DI7
•• A-DI8
•• A-DI9
•• A-DI10
•• A-DI11
•• A-DI12
•• A-DI13
•• A-DI14
•• A-DI15
•• A-DI16
•• A-DI17
•• A-DM0
•• A-DM1
•• A-DM2
•• A-DM3
•• A-DM4
•• A-DM5
•• A-DM6
•• A-DM7
•• A-DM8
•• A-DM9
•• A-DM10
•• A-DM11
•• A-DM12
•• A-DM13
•• closed world
•• constraint enforcement
•• constraint set coherence
•• constraint set minimality
•• controlled redundancy
•• data analysis
•• data contention
•• data plausibility
•• database design
•• database design axiom
•• database implementation axiom
•• database instance manipulation axiom
•• database optimization axiom
•• (E)MDM
•• E-RDM
For example, Euclid’s fourth postulate states that all right angles are equal to one another, while Thales’ theorem states that, in any circle, any triangle formed by three distinct points of the circle is a right one if and only if two of the points are the endpoints of a diameter of the circle.
Especially in metalogic and computability (recursion) theory (but not only), methods (procedures, algorithms131) that take some class of problems and reduce their solution to a finite set of steps which, for all problem instances of the class, always compute the right answers are called effective.
Functions (see their definition below) with an effective method are
called effectively calculable. Several independent efforts to provide a
formal characterization of effective calculability led to a variety of pro-
posed definitions132 (that were ultimately shown to be equivalent): the
notion captured by these definitions is called (recursive) computabil-
ity. The Church–Turing thesis133 states that these two notions coincide:
any number-theoretic function that is effectively calculable is recursively
computable and vice-versa.
Generally, any function defined on and taking values from a set of for-
mulas is called an inference (derivation, deductive, transformation) rule.
In classical logic (as well as the semantics of many other nonclassical
logics) inference rules are also truth preserving134: if the premises are true
(under an interpretation) then so is the conclusion. Usually, only rules that
are recursive (i.e., for which there is an effective procedure for determin-
ing whether, according to the rule, any given formula is the conclusion of
a given set of formulae) are important.
For example, “if x is an ancestor of y and z is x’s father, then z is an ancestor of y too” is a (transitive recursive) inference rule. Modus ponens ((p ∧ (p → q)) → q), modus tollens ((¬q ∧ (p → q)) → ¬p), resolution (((p ∨ q) ∧ (¬p ∨ r)) → (q ∨ r)), hypothetical (((p → q) ∧ (q → r)) → (p → r)), and disjunctive (((p ∨ q) ∧ ¬p) → q) syllogisms are well-known examples of propositional logic (see below) inference rules.
In linguistics, computer science, and mathematics, a formal language
is a set of strings of symbols that may be constrained by rules that are
131
Some authors consider only effective methods for calculating the values of a function as being
algorithms.
132
Namely, general recursion, Turing machines, and λ-calculus.
133
Note that this is not a mathematical statement, so it cannot be proven mathematically.
134
In many-valued logics, inference rules preserve the designated truth values.
specific for it. The alphabet of a formal language is the set of symbols,
letters, or tokens from which the strings of the language may be formed;
frequently it is required to be finite. The strings formed from this alphabet
are called words, and the words that belong to a particular formal language
are sometimes called well-formed words/formulas. For example, all computer programming languages (and not only) are studied using formal language theory.
A formal language is often defined by means of a formal grammar135:
a set of formation rules for rewriting strings (rewriting rules), along with
a start symbol from which rewriting must start. The rules describe how to
form strings from the language’s alphabet that are valid according to the
language’s syntax. A grammar is usually thought of as a language genera-
tor. However, it can also sometimes be used as the basis for a “recognizer”
— a tool that determines whether a given string belongs to the language or
is grammatically incorrect.
A formal system is a well-defined mathematical system of abstract thought, consisting of finite sets of symbols (its alphabet), axioms, and inference rules (the latter two constituting its deductive system), plus a grammar for generating and/or accepting the well-formed formulas forming its formal language. For example, Euclid’s Elements (or even Spinoza’s Ethics, its apparently nonmathematical imitation), as well as the predicate, propositional, lambda, and domain relational calculi, are formal systems.
Deductive systems should be sound (i.e., all provable statements are
true) and complete (i.e., all true statements are provable). Similarly, algo-
rithms are said to be sound and complete (if they analyze/compute only
and all possible valid cases/values of the corresponding subuniverse).
Propositional (or sentential) logic (or calculus) is a formal system in
which formulas of a formal language may be interpreted as representing
propositions. A system of inference rules and axioms allows certain for-
mulas (called theorems) to be derived, which may be interpreted as true
propositions. The series of formulas which is constructed within such a
system is called a derivation and the last formula of the series is a theorem,
whose derivation may be interpreted as a proof of the truth of the proposi-
tion represented by the theorem.
135
For example, regular, context-free, deterministic context-free, etc.
mapping r~ : A → A/~ called its canonical surjection, which maps any element of A to the partition to which it belongs. Sometimes, partitions are identified by a (unique) representative element from A, in which case A/~ ⊆ A is called a representative system (system of representatives).
For example, let A = CITIZENS and ~ ⊆ CITIZENS × CITIZENS, defined by “∀x, y ∈ CITIZENS, x ~ y iff x and y have the same supreme head of state”; then CITIZENS/~ includes, for example, the classes of all
subjects of HM Queen Elisabeth II of the U.K. (also including Canadians,
Australians, New-Zealanders, etc.), of all citizens of the U.S. and their
territories, of all German citizens, etc. The corresponding representa-
tive system would now include the Japanese Emperor Akihito, Queens
Elisabeth II of the U.K., Margrethe II of Denmark, presidents Barack
Obama, Xi Jinping, Vladimir Putin, François Hollande (of the U.S.,
China, Russia, and France, respectively), etc. The canonical surjection
SupremeStateHead : CITIZENS → CITIZENS/~ has, for example, the
following values: SupremeStateHead(Seiji Ozawa) = Emperor Akihito,
SupremeStateHead(Sir Elton Hercules John) = Queen Elisabeth II of the
U.K., SupremeStateHead(Céline Marie Claudette Dion) = Queen Elisa-
beth II of the U.K., SupremeStateHead(Eldrick Tont “Tiger” Woods) =
President Barack Obama, SupremeStateHead(Dalai Lama) = President Xi
Jinping, etc.
The kernel (nucleus) of a mapping f : A → B, denoted ker(f), is an equivalence relation over its domain, partitioning it into classes grouping together all of its elements for which f takes the same values: ker(f) = {(x, x′) ∈ A² | f(x) = f(x′)}. Sometimes, instead of the functional notation ker(f), an infix one is used: x =f x′ ⇔ f(x) = f(x′). The corresponding induced partition (and quotient set) is {{w ∈ A | f(x) = f(w)} | x ∈ A} = A/ker(f), which is also called the function coimage (denoted coim(f)). Note that coimages are naturally isomorphic to images, as there is a bijection from coim(f) to Im(f). It is very easy to show that ker(f•g) = ker(f) ∩ ker(g), for any function product f•g.
For example, with Country : STATES → COUNTRIES and Region :
STATES → REGIONS, ker(Country) = {(Alabama, Alabama), (Ala-
bama, Alaska), …, (Wisconsin, Wyoming), (Wyoming, Wyoming),
(Alba, Alba), (Alba, Arad), …, (Vâlcea, Vrancea), (Vrancea, Vrancea),
…}, ker(Region) = {(Connecticut, Connecticut), (Connecticut, Maine),
1. R’ is transitive,
2. R ⊆ R’, and
3. for any relation R”, if R ⊆ R” and R” is transitive, then R’ ⊆ R” (i.e., R’ is the smallest relation that satisfies (1) and (2)).
Similarly, any other such closure (e.g., reflexive, symmetric, etc.) can
be defined.
Obviously, the transitive closure of a relation can be obtained by clos-
ing it, closing the result, and continuing to close the results of the previous
closures until no further elements are added. Note that the digraph of the
transitive closure of a relation is obtained from the digraph of the relation
by adding for each directed path the arc that shunts the path if one is not
already there.
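Since this appendix accompanies a db design book, a sketch of that iterated closing in SQL may be helpful (assuming a hypothetical table R(x, y) storing the pairs of the relation); the recursive step repeatedly adds the shunting arcs until a fixpoint is reached:

-- Transitive closure of a binary relation stored in R(x, y):
-- start from R itself and keep composing with R until no new
-- pairs appear (UNION discards duplicates, ensuring termination).
WITH RECURSIVE TC (x, y) AS (
  SELECT x, y FROM R
  UNION
  SELECT TC.x, R.y
  FROM TC INNER JOIN R ON TC.y = R.x
)
SELECT x, y FROM TC;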
Let R be a relation from X to Y and S one from Y to Z.
The composition (or join or concatenation) of R and S, written R.S, is the relation R.S = {(x, z) ∈ X × Z | xRy and ySz, y ∈ Y}.
A function-style notation S°R is also sometimes used, although it is
quite inconvenient for relations. The notation R.S is easier to deal with, as
the relations are named in the order that leaves them adjacent to the ele-
ments that they apply to (thus, x(R.S)z, because xRy and ySz for some y).
Transitivity and composition may seem similar, but they are not: both
are defined using x, y, and z, for one thing, but transitivity is a property of
a single relation, while composition is an operator on two relations that
produces a third relation (which may or may not be transitive).
The product of two relations R and S is the relation R × S = {(w, x, y, z) | wRx ∧ ySz}.
The converse (or transpose) of R, written R−1, is the relation R−1 = {(y,
x) | xRy}.
Symmetric and converse may also seem similar, but they are actually
unrelated: both are described by swapping the order of pairs, but sym-
metry is a property of a single relation, while converse is an operator that
takes a relation and produces another relation (which may or may not be
symmetric). It is trivially true, however, that the union of a relation with
its converse is a symmetric relation.
Examples:
Let X = {Airplane, Pool, Restaurant}, Y = {Goggles, Heels, Seatbelt,
Tuxedo}, Z = {Buckle, Pocket, Strap}, and R = {(Airplane, Seatbelt), (Air-
R”, where n, natural, is the size of the input, f(n) is a function of n (e.g., n, n², 2ⁿ, etc.), O(x) is the complexity operator (called big O), and, for example, resources are the number of steps (e.g., needed to be performed by an algorithm in the corresponding worst case), execution time, memory space, etc.
For example, the NC class is a set of decision problems decidable
in polylogarithmic time on a parallel computer with a polynomial num-
ber of processors—a class of problems having highly efficient parallel
algorithms.
L (also known as LOGSPACE, LSPACE or DLOGSPACE) is the
complexity class containing decision problems that can be solved by a
DTM using a logarithmic amount of memory space. L contains precisely
those languages expressible in first-order logic with an added commuta-
tive transitive closure operator. This result has applications to database
query languages: the data complexity of a query is defined as the complex-
ity of answering a fixed query, considering the data size as the variable
input. For this measure, queries against relational databases with complete
information (i.e., having no notion of nulls) as expressed, for instance, in
relational algebra are in L.
The main idea of logspace is that you can store a polynomial-mag-
nitude number in logspace and use it to remember pointers to a position
of the input. The logspace class is therefore useful to model computation
where the input is too big to fit in the internal memory (RAM) of a com-
puter. Long DNA sequences and databases are good examples of problems
where only a constant part of the input will be in RAM at a given time and
where we have pointers to compute the next part of the input to inspect,
thus using only logarithmic memory.
L is a subclass of NL (NLOGSPACE), which is the class of languages decidable in logarithmic space on a nondeterministic Turing machine. It has been proved that NC¹ ⊆ L ⊆ NL ⊆ NC², that is: given a parallel computer C with a polynomial number O(nᵏ) of processors for some constant k, any problem that can be solved on C in O(log n) time is in L, and any problem in L can be solved in O(log² n) time on C. Still open problems include whether L = NL.
The polynomial complexity class (P, or PTIME, or DTIME(n^O(1))) contains all decision problems that can be solved by a deterministic Turing machine in polynomial time.
widely suspected that there are no polynomial-time algorithms for NP-hard problems, this has not
been proven yet. Moreover, NP also contains all problems that can be solved in polynomial time.
decision subset sum problem (given a set of integers, does any nonempty
subset of them add up to zero?) is NP-hard.
A decision problem is NP-complete when it is both in NP and NP-
hard. The set of NP-complete problems is denoted by NP-C (or NPC).
Although any given solution to an NP-complete problem can be verified
in polynomial time, there is no known efficient way to find such a solution
in the first place (the most notable characteristic of NP-complete problems
is that no fast solution to them is known): the time required to solve such
problems using any currently known algorithm increases very quickly as
the size of the problem grows. As a consequence, determining whether or
not it is possible to solve these problems quickly, called the P versus NP
problem, is one of the main still unsolved problems in computer science.
NP-complete problems are generally addressed by using heuristic meth-
ods and approximation algorithms.
For example, the above NP-hard subset sum problem is also NP-complete. Another famous NP-hard example is the Boolean Satisfiability Problem (also called the Propositional Satisfiability Problem and abbreviated as SATISFIABILITY or SAT): does there exist an interpretation that satisfies a given Boolean formula? In other words, can the variables of a given Boolean formula be consistently replaced by true or false in such a way that the formula evaluates to true?141 Despite the fact that no algorithms are known to solve SAT efficiently, correctly, and for all possible input instances, many instances of SAT-equivalent problems that occur in practice (e.g., in AI, circuit design, automatic theorem proving, etc.) can actually be solved rather efficiently using heuristic SAT solvers. Although such algorithms are not believed to be efficient on all SAT instances, they tend to work well for many practical applications.
There are decision problems that are NP-hard but not NP-complete,
such as the programs halting problem: given a program and its input, will
it run forever?142 For example, the Boolean satisfiability problem can be
reduced to the programs halting problem by transforming it to the descrip-
tion of a Turing machine that tries all truth value assignments and, when
141
If this is the case, the formula is called satisfiable. On the other hand, if no such assignment exists, the function expressed by the formula is identically false for all possible variable assignments and the formula is unsatisfiable. For example, the formula a ∧ ¬b is satisfiable, because a = true and b = false make it true; dually, a ∧ ¬a is unsatisfiable.
142
That’s a yes/no question, so this is a decision problem.
it finds one that satisfies the formula, it halts; otherwise, it goes into an
infinite loop. The halting problem is not in NP since all problems in NP
are decidable in a finite number of operations, while the halting problem
is generally undecidable.
There are also NP-hard problems that are neither NP-complete nor undecidable. For instance, the language of true quantified Boolean formulas (TQBF)143 is decidable in polynomial space, but not in nondeterministic polynomial time (unless NP = PSPACE).
The polynomial space complexity class (PSPACE) is the set of decision
problems that can be solved by a DTM only using a polynomial amount
of (memory) space.
A decision problem is PSPACE-complete if it can be solved using an
amount of memory that is polynomial in the input length and if every
other problem that can be solved in polynomial space can be transformed
to it in polynomial time. The problems that are PSPACE-complete can be
thought of as the hardest problems in PSPACE, because a solution to any
such problem could easily be used to solve any other problem in PSPACE.
The PSPACE-complete problems are thought to be outside of the P and
NP complexity classes, but this has not been proved.144
For example, the canonical PSPACE-complete problem is the quantified Boolean formula (QBF) problem, a generalization of the Boolean satisfiability problem, in which both existential and universal quantifiers can be applied to each variable: it asks whether a quantified sentential form over a set of Boolean variables is true or false; for example, the following is an instance of QBF: ∀x ∃y ∃z ((x ∨ z) ∧ y).
The complexity class EXPSPACE is the set of all decision problems solvable by a DTM in O(2^p(n)) space, where p(n) is a polynomial function of n.
NEXPTIME (or NEXP) is the class of problems solvable in exponential
time by a nondeterministic Turing machine.
143
A (fully) quantified Boolean formula is a formula in quantified propositional logic where every
variable is quantified (or bound), using either existential or universal quantifiers at the beginning of
the sentence. Such a formula is equivalent to either true or false (since there are no free variables). If
such a formula evaluates to true, then it belongs to TQBF (also known as QSAT or Quantified SAT).
144
It is, however, known that they lie outside of the NC class, because problems in NC can be solved
in an amount of space polynomial in the logarithm of the input size, and the class of problems solvable
in such a small amount of space is strictly contained in PSPACE (which is proved by a space hierarchy
theorem).
145
Recall that the same problem with the number of steps written in unary is P-complete.
146
By the time hierarchy theorem and the space hierarchy theorem.