4-Fundamentals of Database Management
Table of contents
Introduction
Design Process
Determining data to be stored
Conceptual schema
Logically structuring data
Physical database design
Difference between a Database System and a File System
Introduction
Self-Describing Nature of a Database System
Insulation Between Programs And Data
Support of Multiple Views of the Data
Sharing Of Data and Multi-User Transaction Processing
Moving to Relational Model
Introduction
Schema
Subschema
Levels of Abstraction
Data Independence
Relation
Types of Relationship
One-to-one relationships
One-to-many relationships
Many-to-many relationships
The Relational Data Structure
Relational Data Integrity
Integrity Constraints
Domain Constraints
Referential Integrity
Operational Constraints
CODD’S Rules
Introduction
Why It Is Called Relational Calculus?
Tuple Calculus
Domain Calculus
Analogies
Entity
Attribute
Single Valued vs. Multi Valued
Database Architecture Explained
Types of Database Architecture
Two-Tier Architecture (Client-Server Architecture)
Presentation Services
Business Services/objects
Application Services
Advantages of Two-tier Architecture
Drawbacks of Two-tier Architecture
Three-tier Architecture
Multitier Architecture
E-R Diagrams
Introducing E/R Diagram
Analogies
Entity
Attribute
Movie World Example:
Student World Example:
Single Valued vs. Multi Valued
Movie World Example:
Student World Example:
Movie World Example:
Student World Example:
E-R Diagrams
An aside on null values
Symbols Used In E-R Diagrams
Entity Type
Movie World Example:
Student World Example:
Key Attributes
Movie World Example:
Student World Example:
Relationship
Movie World Example:
Student World Example:
Relationship Type
Movie World Example:
Student World Example:
Cardinality Ratio
Movie World Example:
What is a Database?
A computer database is a structured collection of records or data stored in a computer system. The
structure is achieved by organizing the data according to a database model. The model in most common
use today is the relational model. Other models, such as the hierarchical model and the network model, use
a more explicit representation of relationships.
What is a DBMS?
One of the oldest categories of system software, the database management system (DBMS) is a program
designed to manage all the databases installed on a system's storage or network. Different types of DBMS
exist, and some are designed to control databases configured for specific purposes. The following
paragraphs give some examples of DBMS products currently in use and describe the basic elements that are
part of DBMS software.
The DBMS, as the tool used to manage databases, is marketed in many forms. Some of the more popular
examples include Microsoft Access, FileMaker, DB2, and Oracle. All these products allow a set of rights or
privileges to be associated with a specific user. This makes it possible to designate one or more database
administrators who control each function, and to give other users various levels of administration rights.
With this flexibility, oversight of a system can be centrally controlled or allocated to several different
people.
Components of a DBMS
Four essential elements are found in almost every DBMS currently on the market. The first is a modeling
language that defines the schema of each database hosted by the DBMS. Several approaches are currently
in use, including the hierarchical, network, relational, and object models. Essentially, the modeling
language determines how the databases are described to the DBMS and thus how they operate on the system.
Second, data structures are administered by the DBMS. Examples of data organized by this function are
individual profiles or records, files, fields and their definitions, and objects such as visual media. Data
structures are what allow the DBMS to interact with the data without damaging the integrity of the data
itself.
A third component of DBMS software is the data query language. This element is involved in maintaining
the security of the database by monitoring the use of login data, the assignment of access rights and
privileges, and the definition of the criteria that must be met to add data to the system. The data
query language works with the data structures to make it harder to input irrelevant data into any
of the databases in use on the system.
Last, a mechanism that allows for transactions is essential to any DBMS. It permits concurrent access to
the database by multiple users, prevents two users from manipulating one record at the same time, and
prevents the creation of duplicate records.
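The transaction mechanism can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The `account` table, the `transfer` and `add_account` helpers, and the sample data are hypothetical and chosen only for illustration. The sketch uses a single connection, so it shows all-or-nothing updates and the rejection of duplicate records, not locking between concurrent users.

```python
import sqlite3

# Hypothetical schema: the UNIQUE constraint lets the DBMS reject duplicate records.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (id INTEGER PRIMARY KEY, owner TEXT UNIQUE, balance INTEGER)"
)
conn.execute("INSERT INTO account (owner, balance) VALUES ('alice', 100), ('bob', 50)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` between two accounts as one all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes the block
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE owner = ?", (amount, src)
            )
            (balance,) = conn.execute(
                "SELECT balance FROM account WHERE owner = ?", (src,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE owner = ?", (amount, dst)
            )
        return True
    except ValueError:
        return False


def add_account(conn, owner, balance):
    """Insert a new account; returns False if the owner already exists."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO account (owner, balance) VALUES (?, ?)", (owner, balance)
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return the current balances as a dict keyed by owner."""
    return dict(conn.execute("SELECT owner, balance FROM account").fetchall())
```

A failed transfer leaves both balances exactly as they were, because the first update is rolled back together with the rest of the transaction.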
Data in a database has the following characteristics:
1. Shared - Data in a database is shared among different users and applications.
2. Persistence - Data in a database exists permanently, in the sense that it can live beyond the scope of
the process that created it.
3. Correctness - Data should be correct.
4. Security - Data should be protected from unauthorized access.
5. Consistency - Whenever more than one data element in a database represents related real-world values,
the values should be consistent with respect to the relationship.
6. Non-redundancy - No two data items in a database should represent the same real-world entity.
Introduction
A DBMS can take any one of several approaches to managing data. Each approach constitutes
a database model. A data model is a collection of descriptions of data structures and their contained fields,
together with the operations or functions that manipulate them. It is a comprehensive
scheme for describing how data is to be represented for manipulation by humans or computer
programs. A thorough representation details the types of data, the topological arrangements of data,
spatial and temporal maps onto which data can be projected, and the operations and structures that can
be invoked to handle data and its maps. The various database models are the following:
Relational Model
The relational model organizes data logically in tables. It is a formal theory of data consisting of three
major components: (a) a structural aspect, meaning that data in the database is perceived as tables, and only
tables; (b) an integrity aspect, meaning that those tables satisfy certain integrity constraints; and
(c) a manipulative aspect, meaning that the tables can be operated upon by means of operators which
derive tables from tables. Each table corresponds to an application entity and each row
represents an instance of that entity. The relational model was developed by E. F. Codd; a DBMS based on
it is called a relational database management system (RDBMS).
A relational database allows the definition of data structures, storage and retrieval operations and
integrity constraints. In such a database the data and relations between them are organized in tables. A
table is a collection of records and each record in a table contains the same fields.
Certain fields may be designated as keys, which means that searches for specific values of that field will
use indexing to speed them up. Where fields in two different tables take values from the same set, a join
operation can select related records in the two tables by matching values in those fields. Often, but not
always, the fields will have the same name in both tables.
For example, an "orders" table might contain (customer-ID, product-code) pairs and a "products" table
might contain (product-code, price) pairs, so to calculate a given customer's bill you would sum the prices
of all products ordered by that customer by joining on the product-code fields of the two tables. This can
be extended to joining multiple tables on multiple fields. Because these relationships are only specified at
retrieval time, relational databases are classed as dynamic database management systems. The
relational database model is based on relational algebra.
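The orders/products example above can be sketched with Python's built-in sqlite3 module. The column names, the sample rows and the `customer_bill` helper are assumptions made for illustration; only the join on the product-code field comes from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (product_code TEXT PRIMARY KEY, price REAL);
    CREATE TABLE orders (
        customer_id  TEXT,
        product_code TEXT REFERENCES products(product_code)
    );
    -- Hypothetical sample data.
    INSERT INTO products VALUES ('P1', 2.50), ('P2', 10.00), ('P3', 4.00);
    INSERT INTO orders VALUES ('C1', 'P1'), ('C1', 'P2'), ('C2', 'P3'), ('C1', 'P1');
""")


def customer_bill(conn, customer_id):
    """Sum the prices of everything a customer ordered by joining on product_code."""
    row = conn.execute(
        """
        SELECT COALESCE(SUM(p.price), 0)
        FROM orders AS o
        JOIN products AS p ON o.product_code = p.product_code
        WHERE o.customer_id = ?
        """,
        (customer_id,),
    ).fetchone()
    return row[0]
```

The relationship between the two tables is not stored anywhere; it exists only in the `JOIN ... ON` clause, which is what "specified at retrieval time" means here.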
Advantages
Structural Independence
Conceptual Simplicity
Ease of design, implementation, maintenance and usage.
Ad hoc query capability
Disadvantages
Hardware Overheads
Ease of design can lead to bad design
Network Model
The popularity of the network data model coincided with the popularity of the hierarchical data model.
Some data were more naturally modelled with more than one parent per child. So, the network model
permitted the modelling of many-to-many relationships in data. In 1971, the Conference on Data
Systems Languages (CODASYL) formally defined the network model. The basic data modelling
construct in the network model is the set construct. A set consists of an owner record type, a set name,
and a member record type.
A member record type in the network model can have that role in more than one set; hence the
multiparent concept is supported. An owner record type can also be a member or owner in another set.
The data model is a simple network, and link and intersection record types (called junction records by
IDMS) may exist, as well as sets between them. Thus, the complete network of relationships is
represented by several pairwise sets; in each set one record type is the owner (at the tail of the
network arrow) and one or more record types are members (at the head of the relationship arrow).
Usually, a set defines a 1:M relationship, although 1:1 is permitted. The CODASYL network model is based
on mathematical set theory.
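The set construct can be sketched in Python as follows. This is an in-memory illustration under stated assumptions, not CODASYL syntax: the `Record` and `SetOccurrence` classes, the `owners_of` helper and the supplier/part data are all hypothetical. It shows how a link record that is a member of two sets gives a record more than one parent, which is how a many-to-many relationship is built from two 1:M sets.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    record_type: str
    name: str


@dataclass
class SetOccurrence:
    """One occurrence of a named set: a single owner record and its members (1:M)."""
    set_name: str
    owner: Record
    members: list = field(default_factory=list)

    def connect(self, member):
        self.members.append(member)


def owners_of(member, occurrences):
    """Return every owner whose set occurrence contains `member`."""
    return [occ.owner for occ in occurrences if any(m is member for m in occ.members)]


# Hypothetical data: SUPPLY is a link record joining a supplier and a part.
acme = Record("SUPPLIER", "Acme")
bolt = Record("PART", "Bolt")
supply = Record("SUPPLY", "Acme supplies Bolt")

supplier_supply = SetOccurrence("SUPPLIER-SUPPLY", owner=acme)
part_supply = SetOccurrence("PART-SUPPLY", owner=bolt)
supplier_supply.connect(supply)
part_supply.connect(supply)

parents = owners_of(supply, [supplier_supply, part_supply])
```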
Advantages
Conceptual Simplicity
Ease of data access
Data Integrity and capability to handle more relationship types
Data independence
Database standards
Disadvantages
System complexity
Absence of structural independence
Hierarchical Model
The hierarchical data model organizes data in a tree structure. There is a hierarchy of parent and child
data segments. This structure implies that a record can have repeating information, generally in the child
data segments. Data is organized in a series of records, each of which has a set of field values attached
to it. All the instances of a specific record are collected together as a record type. These record types are
the equivalent of tables in the relational model, with the individual records being the equivalent of rows.
To create links between these record types, the hierarchical model uses parent-child relationships,
which are 1:N mappings between record types. These mappings are represented with trees, a structure
"borrowed" from mathematics in the same way that the relational model borrows set theory. For example,
an organization might store information about an employee, such as name, employee number, department,
salary. The organization might also store information about an employee's children, such as name and date
of birth. The employee and children data form a hierarchy, where the employee data represents the parent
segment and the children data represents the child segment.
If an employee has three children, then there would be three child segments associated with one
employee segment. In a hierarchical database the parent-child relationship is one to many. This restricts a
child segment to having only one parent segment. Hierarchical DBMSs were popular from the late
1960s, with the introduction of IBM's Information Management System (IMS) DBMS, through the
1970s.
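The employee/children hierarchy described above can be sketched as a small tree in Python. The `Segment` class, its field names and the sample values are assumptions for illustration and do not reflect how IMS stores segments. The one rule the sketch enforces is the one stated in the text: a child segment has exactly one parent segment.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Segment:
    segment_type: str
    data: dict
    parent: Optional["Segment"] = field(default=None, repr=False)
    children: list = field(default_factory=list)

    def add_child(self, child):
        """Attach `child` under this segment; a child may have only one parent."""
        if child.parent is not None:
            raise ValueError("a child segment can have only one parent segment")
        child.parent = self
        self.children.append(child)
        return child


# Hypothetical data: one employee segment with three child segments.
employee = Segment("EMPLOYEE", {"name": "R. Shah", "number": 1001, "department": "Sales"})
for child_name in ("Asha", "Vikram", "Meera"):
    employee.add_child(Segment("CHILD", {"name": child_name}))
```

Trying to attach one of these child segments to a second employee raises `ValueError`, which is the 1:N restriction in code form.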
Advantages
Simplicity
Data Security and Data Integrity
Efficiency
Disadvantages
Implementation Complexity
Lack of structural independence
Programming complexity
Object-Oriented Model
Object DBMSs add database functionality to object programming languages. They bring much more than
persistent storage of programming language objects. Object DBMSs extend the semantics of the
C++, Smalltalk and Java object programming languages to provide full-featured database
programming capability, while retaining native language compatibility. A major benefit of this
approach is the unification of the application and database development into a seamless data model
and language environment. As a result, applications require less code, use more natural data
modeling, and code bases are easier to maintain. Object developers can write complete database
applications with a modest amount of additional effort.
In a relational DBMS, a complex data structure must be flattened out to fit into tables, or joined together
from those tables to form the in-memory structure. Object DBMSs, by contrast, avoid this mapping
overhead when storing or retrieving a web or hierarchy of interrelated objects. This one-to-one mapping of
object programming language objects to database objects has two benefits over other storage approaches: it
provides higher performance management of objects, and it enables better management of the
complex interrelationships between objects. This makes object DBMSs better suited to support
applications such as financial portfolio risk analysis systems, telecommunications service
applications, World Wide Web document structures, design and manufacturing systems, and hospital
patient record systems, which have complex relationships between data.
Advantages
Disadvantages
Semistructured Model
In the semi-structured data model, the information that is normally associated with a schema is contained
within the data, which is sometimes called "self-describing". In such a database there is no clear
separation between the data and the schema, and the degree to which it is structured depends on the
application. In some forms of semistructured data there is no separate schema; in others it exists but
places only loose constraints on the data. Semi-structured data is naturally modelled in terms of graphs
whose labels give semantics to the underlying structure. Such databases subsume the modelling
power of recent extensions of flat relational databases: nested databases, which allow the nesting (or
encapsulation) of entities, and object databases, which, in addition, allow cyclic references between
objects.
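As a rough illustration of self-describing data, the sketch below uses JSON, which is only one possible notation and is not named in the text. The two sample documents and the `labels` helper are hypothetical. Each record carries its own labels, the two records do not share the same fields, and the structure is discovered by walking the data rather than by consulting a separate schema.

```python
import json

# Hypothetical documents: same collection, different fields, one nested entity.
documents = json.loads("""
[
  {"name": "Alice", "email": "alice@example.org"},
  {"name": "Bob",
   "phones": ["555-0100", "555-0101"],
   "address": {"city": "Pune", "postcode": "411001"}}
]
""")


def labels(node, prefix=""):
    """Collect the label paths found in the data itself (no separate schema is used)."""
    paths = set()
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}/{key}"
            paths.add(path)
            paths |= labels(value, path)
    elif isinstance(node, list):
        for item in node:
            paths |= labels(item, prefix)
    return paths
```

Calling `labels(documents[1])` yields label paths such as `/address/city`, showing that the nesting of entities is recoverable from the data alone.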
Semistructured data has recently emerged as an important topic of study for a variety of reasons. First,
there are data sources such as the Web, which we would like to treat as databases but which cannot be
constrained by a schema. Second, it may be desirable to have an extremely flexible format for data
exchange between disparate databases. Third, even when dealing with structured data, it may be helpful
to view it as semi-structured for the purposes of browsing.
Associative Model
The associative model divides the real-world things about which data is to be recorded into two sorts:
Entities are things that have discrete, independent existence. An entity's existence does not depend on
any other thing.
Associations are things whose existence depends on one or more other things, such that
if any of those things ceases to exist, then the thing itself ceases to exist or becomes meaningless.
An associative database comprises two data structures:
A set of items, each of which has a unique identifier, a name and a type.
A set of links, each of which has a unique identifier, together with the unique identifiers of three
other things, that represent the source, verb and target of a fact that is recorded about the source in the
database. Each of the three things identified by the source, verb and target may be either a link or an
item.
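The two data structures can be sketched in Python as follows. The `Item` and `Link` classes mirror the fields listed above; the identifiers, the type names and the flight example are hypothetical. The second link uses the first link as its source, which illustrates that the source, verb and target of a link may themselves be links.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    id: int
    name: str
    type: str


@dataclass(frozen=True)
class Link:
    id: int
    source: int  # identifier of an item or a link
    verb: int    # identifier of an item or a link
    target: int  # identifier of an item or a link


# Hypothetical data.
items = {
    1: Item(1, "Flight BA1234", "entity"),
    2: Item(2, "arrived at", "verb"),
    3: Item(3, "London Heathrow", "entity"),
    4: Item(4, "on", "verb"),
    5: Item(5, "12-Dec", "entity"),
}
links = {
    10: Link(10, source=1, verb=2, target=3),
    11: Link(11, source=10, verb=4, target=5),  # the source is itself a link
}


def describe(thing_id):
    """Render an item by name, or a link by recursively rendering its three parts."""
    if thing_id in items:
        return items[thing_id].name
    link = links[thing_id]
    return " ".join(describe(part) for part in (link.source, link.verb, link.target))
```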
Entity-Attribute-Value (EAV) Model
The best way to understand the rationale of entity-attribute-value (EAV) design is to understand row
modelling, of which EAV is a generalized form. Consider a supermarket database that must manage
thousands of products and brands, many of which have a transitory existence. Here, it is intuitively
obvious that product names should not be hard-coded as names of columns in tables. Instead, one stores
product descriptions in a Products table; purchases and sales of individual items are recorded in other
tables as separate rows with a product ID referencing this table. Conceptually, an EAV design involves a
single table with three columns: an entity (such as an olfactory receptor ID), an attribute (such as species,
which is actually a pointer into the metadata table) and a value for the attribute (e.g., rat). In EAV design,
one row stores a single fact. In a conventional table that has one column per attribute, by contrast, one row
stores a set of facts. EAV design is appropriate when the number of parameters that potentially apply to an
entity is vastly more than those that actually apply to an individual entity.
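A minimal sketch of the three-column EAV table, using Python's built-in sqlite3 module. The table name, the entity identifiers, the `tissue` attribute and the `facts_about` helper are assumptions for illustration. For brevity the attribute is stored as plain text here, whereas the text describes it as a pointer into a metadata table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One row stores one fact: (entity, attribute, value).
    CREATE TABLE eav (
        entity    TEXT,
        attribute TEXT,
        value     TEXT,
        PRIMARY KEY (entity, attribute)
    );
    -- Hypothetical sample facts.
    INSERT INTO eav VALUES
        ('receptor-1', 'species', 'rat'),
        ('receptor-1', 'tissue',  'olfactory epithelium'),
        ('receptor-2', 'species', 'mouse');
""")


def facts_about(conn, entity):
    """Gather the single-fact rows for one entity into an attribute -> value mapping."""
    rows = conn.execute(
        "SELECT attribute, value FROM eav WHERE entity = ? ORDER BY attribute",
        (entity,),
    )
    return dict(rows.fetchall())
```

Only the attributes that actually apply to an entity occupy rows, so `receptor-2` needs one row rather than a full row of mostly empty columns.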
Context Model
The context data model combines features of all the above models. It can be considered a collection of
object-oriented, network and semi-structured models, or some kind of object database. In other words,
it is a flexible model: any type of database structure can be used, depending on the task. Such a data
model has been implemented in the DBMS Context.
The fundamental unit of information storage in Context is a CLASS. A Class contains METHODS and
describes an OBJECT. The Object contains FIELDS and a PROPERTY. A field may be composite, in which
case it contains sub-fields, and so on. The property is a set of fields that belongs to a particular Object
(similar to an AVL database). In other words, fields are the permanent part of an Object, while the Property
is its variable part.
The header of a Class contains the definition of the internal structure of the Object, which includes the
description of each field: its type, length, attributes and name. The context data model has a set of
predefined types as well as user-defined types. The predefined types include not only character strings,
texts and digits but also pointers (references) and aggregate types (structures).
Advantages of DBMS
There are three main features of a database management system that make it attractive to use a DBMS in
preference to more conventional software. These features are centralized data management, data
independence, and systems integration.
In a database system, the data is managed by the DBMS and all access to the data is through the DBMS
providing a key to effective data processing. This contrasts with conventional data processing systems
where each application program has direct access to the data it reads or manipulates. In a conventional
DP system, an organization is likely to have several files of related data that are processed by several
different application programs.
In conventional data processing, application programs usually depend on considerable knowledge of the
data structure and format. In such an environment, any change of data structure or format requires
corresponding changes to the application programs. These changes could be as
small as the following:
Coding of some field is changed. For example, a null value that was coded as -1 is
now coded as -9999.
A new field is added to the records.
The length of one of the fields is changed. For example, the maximum number of
digits in a telephone number field or a postcode field needs to be changed.
The field on which the file is sorted is changed.
If some major changes were to be made to the data, the application programs may need to be rewritten.
In a database system, the database management system provides the interface between the application
programs and the data. When changes are made to the data representation, the metadata maintained by
the DBMS is changed but the DBMS continues to provide data to application programs in the previously
used way. The DBMS handles the task of transformation of data wherever necessary.
This independence between the programs and the data is called data independence. Data independence is
important because every time some change needs to be made to the data structure, the programs that
were being used before the change would continue to work. To provide a high degree of data
independence, a DBMS must include a sophisticated metadata management system.
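Data independence can be illustrated with a small sketch in Python's built-in sqlite3 module. It uses a view as the stable interface, which is one common way to obtain this effect and is an assumption of this example rather than something the text prescribes. The table names, the `customer` view and the `phone_of` helper are hypothetical. The application function is written once, and it keeps working after the stored representation of the phone number is changed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_data (id INTEGER PRIMARY KEY, phone TEXT);
    INSERT INTO customer_data VALUES (1, '5550100');
    CREATE VIEW customer AS SELECT id, phone FROM customer_data;
""")


def phone_of(conn, customer_id):
    """Application code: it only knows the view customer(id, phone)."""
    row = conn.execute(
        "SELECT phone FROM customer WHERE id = ?", (customer_id,)
    ).fetchone()
    return row[0] if row else None


before = phone_of(conn, 1)

# Change the stored representation: the phone number is split into two columns.
# Only the metadata (tables and view definition) changes; phone_of is untouched.
conn.executescript("""
    DROP VIEW customer;
    CREATE TABLE customer_data_v2 (
        id INTEGER PRIMARY KEY, area_code TEXT, local_number TEXT
    );
    INSERT INTO customer_data_v2
        SELECT id, substr(phone, 1, 3), substr(phone, 4) FROM customer_data;
    DROP TABLE customer_data;
    CREATE VIEW customer AS
        SELECT id, area_code || local_number AS phone FROM customer_data_v2;
""")

after = phone_of(conn, 1)
```

In a file-based system the same change would have required editing every program that reads the phone field.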
In DBMS, all files are integrated into one system thus reducing redundancies and making data
management more efficient. In addition, DBMS provides centralized control of the operational data. Some
of the advantages of data independence, integration and centralized control are:
In conventional data systems, an organization often builds a collection of application programs often
created by different programmers and requiring different components of the operational data of the
organisation. The data in conventional data systems is often not centralised. Some applications may
require data to be combined from several systems. These several systems could well have data that is
redundant as well as inconsistent (that is, different copies of the same data may have different values).
Data inconsistencies are often encountered in everyday life. For example, we have all come across
situations where, after a new address is communicated to an organisation that we deal with (e.g. a bank, or
Telecom, or a gas company), some of the communications from that organisation are
received at the new address while others continue to be mailed to the old address. Combining all the data
in a database reduces redundancy as well as inconsistency. It is also likely to reduce the
costs of collecting, storing and updating data.
A DBMS is often used to provide better service to the users. In conventional systems, availability of
information is often poor since it normally is difficult to obtain information that the existing systems were
not designed for. Once several conventional systems are combined to form one centralised database, the
availability of information and how up to date it is are likely to improve, since the data can now be shared
and the DBMS makes it easy to respond to unforeseen information requests.
Centralizing the data in a database also often means that users can obtain new and combined information
that would have been impossible to obtain otherwise. Also, use of a DBMS should allow users that do not
know programming to interact with the data more easily.
The ability to quickly obtain new and combined information is becoming increasingly important in an
environment where various levels of governments are requiring organizations to provide more and more
information about their activities. An organization running a conventional data processing system would
require new programs to be written (or the information compiled manually) to meet every new demand.
Changes are often necessary to the contents of data stored in any system. These changes are more easily
made in a database than in a conventional system in that these changes do not need to have any impact
on application programs.
As noted earlier, it is much easier to respond to unforeseen requests when the data is centralized in a
database than when it is stored in conventional file systems. Although the initial cost of setting up of a
database can be large, one normally expects the overall cost of setting up a database and developing and
maintaining application programs to be lower than for similar service using conventional systems since the
productivity of programmers can be substantially higher in using non-procedural languages that have
been developed with modern DBMS than using procedural languages.
Since all access to the database must be through the DBMS, standards are easier to enforce. Standards
may relate to the naming of the data, the format of the data, the structure of the data etc.
In conventional systems, applications are developed in an ad hoc manner. Often different systems of an
organisation would access different components of the operational data. In such an environment,
enforcing security can be quite difficult.
Setting up of a database makes it easier to enforce security restrictions since the data is now centralized.
It is easier to control who has access to what parts of the database. However, setting up a database can
also make it easier for a determined person to breach security. We will discuss this in the next section.
Since the data of the organization using a database approach is centralized and would be used by a
number of users at a time, it is essential to enforce integrity controls.
Integrity may be compromised in many ways. For example, someone may make a mistake in data input
and the salary of a full-time employee may be input as $4,000 rather than $40,000. A student may be
shown to have borrowed books but has no enrolment. Salary of a staff member in one department may be
coming out of the budget of another department.
If a number of users are allowed to update the same data item at the same time, there is a possibility that
the result of the updates is not quite what was intended. For example, in an airline DBMS we could have a
situation where the number of bookings made is larger than the capacity of the aircraft that is to be used
for the flight. Controls must therefore be introduced to prevent such errors from occurring because of concurrent
updating activities. However, since all data is stored only once, it is often easier to maintain integrity than
in conventional systems.
All enterprises have sections and departments, and each of these units often considers the work of its unit
as the most important and therefore considers its needs as the most important. Once a database has
been set up with centralized control, it will be necessary to identify enterprise requirements and to
balance the needs of competing units. It may become necessary to ignore some requests for information if
they conflict with higher priority needs of the enterprise.
Perhaps the most important advantage of setting up a database system is the requirement that an overall
data model for the enterprise be built. In conventional systems, it is more likely that files will be designed
as needs of particular applications demand. The overall view is often not considered. Building an overall
view of the enterprise data, although often an expensive exercise, is usually very cost-effective in the long
term.
Introduction
Database design is the process of producing a detailed data model of a database. This data model
contains all the logical and physical design choices and the physical storage parameters needed to
generate a design in a Data Definition Language, which can then be used to create a database. A fully
attributed data model contains detailed attributes for each entity.
The term database design can be used to describe many different parts of the design of an overall
database system. Principally, and most correctly, it can be thought of as the logical design of the base
data structures used to store the data. In the relational model these are the tables and views. In an
Object database the entities and relationships map directly to object classes and named relationships.
However, the term database design could also be used to apply to the overall process of designing, not
just the base data structures, but also the forms and queries used as part of the overall database
application within the Database Management System or DBMS.
Design Process
The process of doing database design generally consists of a number of steps which will be carried out by
the database designer. Not all of these steps will be necessary in all cases. Usually, the designer must:
Within the relational model the final step can generally be broken down into two further steps: first
determining the grouping of information within the system (generally, determining the basic
objects about which information is being stored), and then determining the relationships between these
groups of information, or objects. This step is not necessary with an Object database.
The tree structure of data may enforce a hierarchical model organization, with a parent-child relationship
table. An Object database will simply use a one-to-many relationship between instances of an object class.
It also introduces the concept of a hierarchical relationship between object classes, termed inheritance.
Determining data to be stored
In a majority of cases, the person who is doing the design of a database is a person with expertise in the
area of database design, rather than expertise in the domain from which the data to be stored is drawn
e.g. financial information, biological information etc. Therefore the data to be stored in the database must
be determined in cooperation with a person who does have expertise in that domain, and who is aware of
what data must be stored within the system.
This process is one which is generally considered part of requirements analysis, and requires skill on the
part of the database designer to elicit the needed information from those with the domain knowledge. This
is because those with the necessary domain knowledge frequently cannot express clearly what their
system requirements for the database are as they are unaccustomed to thinking in terms of the discrete
data elements which must be stored. Data to be stored can be determined by Requirement Specification.
Conceptual schema
Once a database designer is aware of the data which is to be stored within the database, they must then
determine how the various pieces of that data relate to one another. When performing this step, the
designer is generally looking out for the dependencies in the data, where one piece of information is
dependent upon another, i.e. when one piece of information changes, the other will also. For example, in a
list of names and addresses, assuming the normal situation where two people can have the same address,
but one person cannot have two addresses, the address is dependent upon the name: given a name, the
list determines exactly one address. However, the inverse is not necessarily true: given an address, the
name cannot be uniquely determined, because several people may share that address.
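A dependency of this kind can be checked mechanically against sample data. The sketch below is only an illustration: the names, the addresses and the helper `determines` are made up, and a check on one sample can refute a dependency but cannot prove that it holds for all future data.

```python
# Hypothetical sample data: (name, address) pairs.
people = [
    ("Alice", "12 Oak St"),
    ("Bob", "12 Oak St"),    # two people may share one address
    ("Carol", "5 Elm Rd"),
]

def determines(rows, lhs, rhs):
    """Return True if column index lhs determines column index rhs in rows."""
    seen = {}
    for row in rows:
        key, value = row[lhs], row[rhs]
        if seen.setdefault(key, value) != value:
            return False  # the same lhs value appears with two different rhs values
    return True

name_determines_address = determines(people, 0, 1)
address_determines_name = determines(people, 1, 0)
```

In this sample each name appears with a single address, while the address "12 Oak St" appears with two names, so the dependency holds in one direction only.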
(NOTE: A common misconception is that the relational model is so called because of the stating of
relationships between data elements therein. This is not true. The relational model is so named
because it is based upon the mathematical structures known as relations.)
Logically structuring data
Once the relationships and dependencies amongst the various pieces of information have been
determined, it is possible to arrange the data into a logical structure which can then be mapped into the
storage objects supported by the database management system. In the case of relational databases the
storage objects are tables which store data in rows and columns.
Each table may represent an implementation of either a logical object or a relationship joining one or more
instances of one or more logical objects. Relationships between tables may then be stored as links
connecting child tables with parents. Since complex logical relationships are themselves tables they will
probably have links to more than one parent.
In an Object database the storage objects correspond directly to the objects used by the Object-oriented
programming language used to write the applications that will manage and access the data. The
relationships may be defined as attributes of the object classes involved or as methods that operate on
the object classes.
Physical database design
The physical design of the database specifies the physical configuration of the database on the storage
media. This includes detailed specification of data elements, data types, indexing options, and other
parameters residing in the DBMS data dictionary. It is the detailed design of a system that includes the
modules and the database's hardware and software specifications.
Difference between a Database System and a File System
Introduction
In the database approach, a single repository of data is maintained that is defined once and then accessed by
various users. The main characteristics that distinguish the database approach from file processing are:
Self-describing nature of a database system
Insulation between programs and data
Support of multiple views of the data
Sharing of data and multi-user transaction processing
Self-Describing Nature of a Database System
A database system contains not only the database itself but also a complete definition of the database
structure and constraints. This definition is stored in the DBMS catalog.
The information stored in the catalog is called meta-data (data about data), and it describes the structure
of the primary database.
Insulation Between Programs And Data
In file processing, any change to the structure of a file may require changing all programs that access
the file.
In a database system, the structure of data files is stored in the DBMS catalog separately from the access
programs. This is called program-data independence.
Support of Multiple Views of the Data
Each user may see a different view of the database, which describes only the data of interest to that user.
Sharing Of Data and Multi-User Transaction Processing
A multi-user DBMS allows a set of concurrent users to retrieve from and to update the database. Concurrency
control within the DBMS guarantees that each transaction is correctly executed or aborted.
Moving to Relational Model
Introduction
The relational model is an abstract theory of data based on mathematical principles laid down by
Dr. E. F. Codd. Codd's relational model uses certain terms and principles, and relational database
management systems are based on it. More precisely, the relational model is concerned with three
aspects of data: data structure, data integrity and data manipulation.
Schema
A schema describes the organization of data and relationships within the database. In some DBMSs
(Oracle, for example), a schema is owned by a database user and has the same name as that user. A
schema separates physical aspects of data storage from logical aspects of data representation. The
internal schema defines how and where data are organized in physical data storage. The conceptual
schema defines the stored data structure in terms of the database model used. The external schema
defines a view or views of the database for particular users. An instance of a database is the data it
contains at some particular time.
Subschema
That part of a database definition, to be viewed by particular applications, that describes all or a subset
of the data elements, record types, set types, and areas defined in the schema. It is basically a
portion of a schema - usually to show a particular user department's portion of the database. It
identifies a subset of areas, sets, records, and data names defined in the database schema available to
user sessions.
Levels of Abstraction
Data Independence
It is the ability to modify a schema definition in one level without affecting a schema definition
in the next higher level. The interfaces between the various levels and components should be
well defined so that changes in some parts do not seriously influence others.
Relation
row ~ tuple
column ~ attribute
Values in a tuple are related to each other. A relation R can be thought of as a predicate: R(x, y, z) is true
if and only if the tuple (x, y, z) is in R.
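The predicate view can be sketched with a set of tuples. The values and the name `R_pred` are made up for illustration.

```python
# A relation as a set of tuples; the sample values are hypothetical.
R = {(1, "Bill", "A"), (2, "Sarah", "C")}

def R_pred(x, y, z):
    """Predicate view of R: true exactly when (x, y, z) is a tuple of R."""
    return (x, y, z) in R

in_relation = R_pred(1, "Bill", "A")
not_in_relation = R_pred(1, "Bill", "C")
```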
Types of Relationship
One-to-one relationships
Every student has a mobile telephone number (probably!), and every mobile telephone number
corresponds to just one person. There is a one-to-one relationship between students and mobile telephone
numbers. If we have a table whose entity is Student (i.e. a table with information about students), we
could simply add "Mobile number" as one of the fields in that table. However, we might not want to clutter
the table with this kind of information, so we might make another table showing simply the student ID
and the telephone number. We can then look up this information if we ever need it. The link between the
two tables Students and Mobiles would be a one-to-one link. You can represent such a link with an entity
relationship diagram:
One-to-many relationships
A student only has one Director of Studies in any one year (usually), but a Director of Studies can have
many different students. The relationship between Directors of Studies and Students is a one-to-many
relationship. So, if our database were to show the director of studies of each student, we could have a
table listing all the Directors of Studies of the Colleges, with their IDs (primary key) and any extra
information wanted (full names and contact information, for example).
You could then simply add a field "DoS" to your Students table (since a student can only have one DoS)
containing the DoS's ID, and link the two tables, DoS and Students, with a one-to-many relationship. The
linked fields are the DoS_IDs (which appear in both tables).
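A minimal sketch of this one-to-many link, using Python's built-in sqlite3 module. The table layout follows the description above, but the extra column names and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
    CREATE TABLE DoS (DoS_ID TEXT PRIMARY KEY, Name TEXT);
    CREATE TABLE Students (
        Student_ID TEXT PRIMARY KEY,
        Name TEXT,
        DoS_ID TEXT REFERENCES DoS(DoS_ID)  -- each student row holds exactly one DoS_ID
    );
    INSERT INTO DoS VALUES ('d1', 'Dr Hypothetical');
    INSERT INTO Students VALUES ('s1', 'Student One', 'd1'), ('s2', 'Student Two', 'd1');
""")

# One Director of Studies, many students: count the students linked to each DoS.
rows = conn.execute("""
    SELECT DoS.Name, COUNT(*) FROM DoS
    JOIN Students ON Students.DoS_ID = DoS.DoS_ID
    GROUP BY DoS.DoS_ID
""").fetchall()
```

The "many" side carries the link field, so adding a student never requires a change to the DoS table.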
Many-to-many relationships
There is a kind of relationship that needs special handling in relational databases, the many-to-many
relationship. One student may have many supervisors, but equally, one supervisor will have many
students. This poses a problem in terms of how to represent the relationship without resorting to
repeating attributes like this:
[Table not reproduced: a student row with repeating supervisor ID columns.]
If you find yourself wanting to put repeating attributes in a table, then it is a sure sign that there is
something wrong with your data structure. Imagine the complications here if the Supervisors table were
to list all the students taught by each supervisor: you would have to have an indeterminate number of
fields: Student1, Student2, Student3, Student4, ..., Student25 ...
The solution is to provide a third linking table, one which simply lists pairs of supervisors and supervisees.
In relational databases, many-to-many relationships always require a third linking table between the two
entities which are linked by this kind of relationship. An entity diagram shows how this works:
Student_ID Supervisor_ID
frt20 egk10
lmnu1 rpu5
frt20 ull200
yt1001 egk10
This table might also contain Course Codes (FR9, SP5, etc.), in which case it would also be linked to the
Courses table with a crow's-foot line in the diagram. Note that it doesn't matter if a student ID or a
supervisor ID appears twice, in fact that's the whole point since a student can have many supervisors and
vice versa. This table doesn't need a separate primary key because each pair of IDs together forms a unique
composite key. In case you think that entering data into this kind of table, with just IDs (and perhaps
Course Codes), would be error-prone, do not worry: a data entry form would present you with the
surnames, forenames and IDs of both students and supervisors, and Course Titles, as a drop-down-list to
choose from, and then would insert the appropriate IDs and codes into the table for you.
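The linking table can be sketched with sqlite3 as follows. The table name `Supervisions` is an assumption; the ID pairs are the ones listed above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Supervisions (
        Student_ID    TEXT NOT NULL,
        Supervisor_ID TEXT NOT NULL,
        PRIMARY KEY (Student_ID, Supervisor_ID)  -- composite key: each pair at most once
    )
""")
pairs = [("frt20", "egk10"), ("lmnu1", "rpu5"), ("frt20", "ull200"), ("yt1001", "egk10")]
conn.executemany("INSERT INTO Supervisions VALUES (?, ?)", pairs)

# The same table answers questions in both directions.
supervisors_of_frt20 = [r[0] for r in conn.execute(
    "SELECT Supervisor_ID FROM Supervisions WHERE Student_ID = ? ORDER BY Supervisor_ID",
    ("frt20",))]
students_of_egk10 = [r[0] for r in conn.execute(
    "SELECT Student_ID FROM Supervisions WHERE Supervisor_ID = ? ORDER BY Student_ID",
    ("egk10",))]

# An individual ID may repeat, but a repeated pair violates the composite key.
try:
    conn.execute("INSERT INTO Supervisions VALUES ('frt20', 'egk10')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```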
The smallest unit of data in the relational model is the individual data value. Such values are assumed to
be atomic, which means that they have no internal structure as far as the model is concerned. A domain is
a set of all possible data values. For example, in the supplier-parts example, the domain of supplier numbers is
the set of all valid supplier numbers. Thus domains are pools of values, from which the actual values
appearing in the attributes are drawn. The domain concept is a very important and integral part of the
relational model. Now let us take a look at relations.
The Cartesian product specifies all combinations of values from the underlying domains.
Hence, if we denote the total number of values, or cardinality, in domain D by |D| (assuming
that all domains are finite), the total number of tuples in the Cartesian product is
|dom(A1)| × |dom(A2)| × ... × |dom(An)|
If we think of a relation as a table, then a tuple corresponds to a row of the table; the number of tuples is
called the cardinality; the number of attributes is called the degree; and a domain is a pool of values
from which the values of specific attributes of a specific relation are taken.
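The counting argument and the terms cardinality and degree can be illustrated with a small sketch. The three domains and the sample relation are made up for the example.

```python
from itertools import product
from math import prod

# Hypothetical finite domains.
domains = [{"S1", "S2", "S3"}, {"London", "Paris"}, {10, 20}]

all_tuples = set(product(*domains))        # every combination of values
expected = prod(len(d) for d in domains)   # 3 * 2 * 2

# A relation is a subset of the Cartesian product of its domains.
relation = {("S1", "London", 20), ("S2", "Paris", 10)}
cardinality = len(relation)   # number of tuples
degree = len(domains)         # number of attributes
```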
Most relations have an attribute, or a combination of attributes, that uniquely identifies each tuple in the
relation. In some cases more than one attribute can uniquely identify each tuple. Any such attribute is
called a candidate key, so if there is more than one, each of them is eligible to be identified as a
candidate key. One of the candidate keys is arbitrarily designated to be the primary key and the others
are called secondary or alternate keys. A key is a minimal set of attributes guaranteeing separation for
the members of the relation. When more than one key exists, a primary key is selected.
In the above table, symbol, name and atomic number can each uniquely identify a row, so any one of them
can be a candidate key; that is, the Element_Table has three candidate keys. Let R be a relation with
attributes A1, A2, ..., An. The set of attributes K = (Ai, Aj, ..., Ak) of R is said to be a candidate key of R
if and only if the following two properties are satisfied:
Uniqueness - At any given point of time, no two distinct tuples of R have the same
value for Ai, the same value for Aj, ..., and the same value for Ak.
Minimality - No proper subset of the set (Ai, Aj, ..., Ak) has the uniqueness
property.
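The two properties can be tested against a sample of rows. The sketch below uses hypothetical rows standing in for the Element_Table, and it only checks the rows it is given: a true candidate key must satisfy the properties in every legal state of the relation, not just in one sample.

```python
from itertools import combinations

# Hypothetical rows standing in for the Element_Table.
attributes = ("Symbol", "Name", "AtomicNumber")
rows = [("H", "Hydrogen", 1), ("He", "Helium", 2), ("Ag", "Silver", 47)]

def is_unique(attrs):
    """Uniqueness: no two rows agree on every attribute in attrs."""
    idx = [attributes.index(a) for a in attrs]
    projected = [tuple(r[i] for i in idx) for r in rows]
    return len(set(projected)) == len(projected)

def is_candidate_key(attrs):
    """Uniqueness plus minimality: no proper subset of attrs is itself unique."""
    if not is_unique(attrs):
        return False
    return not any(is_unique(sub)
                   for k in range(1, len(attrs))
                   for sub in combinations(attrs, k))

symbol_is_key = is_candidate_key(("Symbol",))
pair_is_key = is_candidate_key(("Symbol", "Name"))
```

The pair (Symbol, Name) is unique but not minimal, because Symbol alone already identifies each row.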
In the Element_Table relation there are three candidate keys, so we can choose any one of them as the
primary key. There are no hard and fast rules on how to choose the primary key from the list of
candidate keys. It is a matter of preference and convenience of database designer.
Let us take a look at another relation, SHIPMENT_TABLE.
In the ELEMENT_TABLE the attribute Symbol and in the SHIPMENT_TABLE the attribute Item draw on the same
data values. It is clear that a given value for that attribute, say Item 'Ag', should be permitted to
appear in the database only if the same value appears as a value of the primary key 'Symbol' in the
relation ELEMENT_TABLE.
Such an attribute is a foreign key. A foreign key is an attribute or attribute combination of one relation
whose values are required to match those of the primary key of some other relation. Also, the foreign
key and the primary key should be defined on the same underlying domain.
Integrity Constraints
Relational model includes several types of constraints whose purpose is to maintain the accuracy and
integrity of the data in the database. The major types of integrity constraints are:
Domain Constraints
Entity Integrity
Referential Integrity
Operational Constraints
Domain Constraints
All the values that appear in a column of a relation must be taken from the same domain. A domain
usually consists of the following components:
Domain Name
Meaning
Data Type
Size or length
Allowable values or allowable range (if applicable)
Entity Integrity
The Entity Integrity rule is designed to assure that every relation has a primary key and that the data
values for the primary key are all valid. Entity integrity guarantees that every primary key attribute is
non-null: no attribute participating in the primary key of a base relation is allowed to contain nulls. The
primary key performs the unique identification function in a relational model. Allowing a null primary
key would be like saying that there is some entity that has no
known identity. An entity that cannot be identified is a contradiction in terms, hence the name
entity integrity.
Referential Integrity
In the relational model the association between tables is defined using foreign keys. The association
between the SHIPMENT and ELEMENT tables is defined by the Item attribute of the SHIPMENT table, a
foreign key that references Symbol in the ELEMENT table. This implies that before we insert a row in the
SHIPMENT table, the element for that shipment must already exist in the ELEMENT table.
A referential integrity constraint is a rule that maintains consistency among the rows of two tables or
relations. The rule states that if there is a foreign key in one relation, each foreign key value
must either match a primary key value in the other table or be null.
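A minimal sketch of the rule using sqlite3. The column `Shipment_No` is an assumption added so that each shipment row has a key; Item and Symbol follow the text. Note that SQLite checks foreign keys only after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE ELEMENT (Symbol TEXT PRIMARY KEY, Name TEXT);
    CREATE TABLE SHIPMENT (
        Shipment_No INTEGER PRIMARY KEY,        -- hypothetical key column
        Item TEXT REFERENCES ELEMENT(Symbol)    -- foreign key
    );
    INSERT INTO ELEMENT VALUES ('Ag', 'Silver');
""")

conn.execute("INSERT INTO SHIPMENT VALUES (1, 'Ag')")   # matches a primary key value
conn.execute("INSERT INTO SHIPMENT VALUES (2, NULL)")   # a null foreign key is allowed

try:
    conn.execute("INSERT INTO SHIPMENT VALUES (3, 'Xx')")  # no such element
    violation_rejected = False
except sqlite3.IntegrityError:
    violation_rejected = True

shipment_count = conn.execute("SELECT COUNT(*) FROM SHIPMENT").fetchone()[0]
```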
Operational Constraints
These are the constraints enforced in the database by business rules or real-world
limitations. For example, if the retirement age of the employees in an organization is 60, then the age
column of the employee table can have a constraint "Age should be less than or equal to 60". These
kinds of constraints enforced by the business and the environment are called operational constraints.
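One common way to state such a rule is a CHECK constraint. The sketch below uses sqlite3; the table layout and the `Emp_ID` column are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employee (Emp_ID INTEGER PRIMARY KEY, Age INTEGER CHECK (Age <= 60))"
)
conn.execute("INSERT INTO Employee VALUES (1, 45)")  # satisfies the rule

try:
    conn.execute("INSERT INTO Employee VALUES (2, 61)")  # breaks the rule
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```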
Codd's Rules
Dr. E. F. Codd, the founder of relational database systems, places the relational model's
characteristics in three main categories. First, structural features that support the view of the data. They
include relations and their underlying components, and views and queries, both mechanisms for creating
virtual relations. Second, integrity features such as entity and referential integrity and also
application-specific constraints. Finally, data manipulation features for data retrieval, insertion, deletion
and update. These features must be able to emulate any operation from relational algebra. We will see
Codd's rules now.
Information Rule.
Guaranteed Access Rule.
Systematic Treatment of nulls Rule.
Active on-line catalog based on the Relational model.
Comprehensive data Sub-language Rule.
View Updating Rule.
High-Level Insert, Update and Delete.
A Brief Introduction
Relational algebra and relational calculus are formal languages associated with the relational
model.
Informally, relational algebra is a (high-level) procedural language and relational calculus a non-
procedural language.
However, formally both are equivalent to one another.
A language that can produce any relation that can be derived using relational calculus is relationally
complete.
Relational algebra operations work on one or more relations to define another relation
without changing the original relations.
Both operands and results are relations, so output from one operation can become input to
another operation.
This allows expressions to be nested, just as in arithmetic. This property is called closure.
What? Why?
Similar to normal algebra (as in 2+3*x-y), except we use relations as values instead of
numbers.
Not used as a query language in actual DBMSs. (SQL instead.)
The inner, lower-level operations of a relational DBMS are, or are similar to, relational
algebra operations. We need to know about relational algebra to understand query
execution and optimization in a relational DBMS.
Some advanced SQL queries require explicit relational algebra operations, most commonly outer
join.
SQL is declarative, which means that you tell the DBMS what you want, but not how it is to be
calculated. A C++ or Java program is procedural, which means that you have to state, step by
step, exactly how the result should be calculated. Relational algebra is (more) procedural than
SQL. (Actually, relational algebra is mathematical expressions.)
It provides a formal foundation for operations on relations.
It is used as a basis for implementing and optimizing queries in DBMS software.
DBMS programs add more operations which cannot be expressed in the relational algebra.
Relational calculus (tuple and domain calculus systems) also provides a foundation,
but is more difficult to use. We'll skip these for now.
Basic Operations:
Selection (σ): choose a subset of rows.
Projection (π): choose a subset of columns.
Cross Product (×): combine two tables.
Union (∪): unique tuples from either table.
Set difference (−): tuples in R1 not in R2.
Renaming (ρ): change names of tables & columns.
Additional Operations (for convenience):
Intersection, joins (very useful), division, outer joins, aggregate functions, etc.
Now we will see the various operations in relational algebra in detail.
Selection Operation σ
The select command gives a programmer the ability to choose tuples from a relation (rows from
a table). Please do not confuse the Relational Algebra select command with the more powerful SQL select
command that we will discuss later.
Idea: choose tuples of a relation (rows of a table)
Format: σ selection-condition (R). Choose tuples that satisfy the selection condition.
The result has the same schema as the input.
σ Major = 'CS' (Students)
This means that the desired output is the set of tuples for the students who have taken CS as their Major.
The selection condition is a Boolean expression that may include =, ≠, <, ≤, >, ≥, and, or, not.
This is an "abstract" table because there is no way to determine the real-world model that the table
represents. All we know is that attribute (column) A is the primary key, and that fact is reflected in the
fact that no two items currently in the A column of R are the same. Now, using a popular variant of
Relational Algebra notation, suppose we were to do the Relational Algebra command:
The following figure shows a sample output for the preceding SELECT command:
The resulting relation has the same attributes as the original relation. The selection condition is applied to
each tuple in turn - it cannot therefore involve more than one tuple.
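Selection can be sketched as a filter over tuples. The Students rows and the helper name `select` are made up for illustration; each tuple is represented as a dictionary from attribute name to value.

```python
# Hypothetical Students relation.
Students = [
    {"Name": "Ann", "Major": "CS"},
    {"Name": "Raj", "Major": "Math"},
    {"Name": "Lee", "Major": "CS"},
]

def select(relation, condition):
    """sigma_condition(relation): keep the tuples for which condition is true."""
    return [t for t in relation if condition(t)]

# The condition looks at one tuple at a time; the result keeps every attribute.
cs_students = select(Students, lambda t: t["Major"] == "CS")
```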
Project Operation
The Relational Algebra project command allows the programmer to choose attributes (columns) of
a given relation and delete information in the other attributes.
Idea: Choose certain attributes of a relation (columns of a table)
Format: π Attribute_List (Relation)
Returns: a relation with the same tuples as (Relation) but limited to those attributes of interest (in the
attribute list). It selects some of the columns of a table; it constructs a vertical subset of a relation; and it
implicitly removes any duplicate tuples (so that the result will
be a relation).
π Major (Students)
There are two (c3, d2) tuples. This is not allowed in a "legal" relation. What is to be done? In
Relational Algebra, all duplicates are deleted. Now consider the following examples:
PROJECT S OVER CITY
Following figure shows a sample output for the preceding PROJECT command:
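Projection, including the removal of duplicates, can be sketched in the same style. The Students rows and the helper name `project` are hypothetical.

```python
# Hypothetical Students relation.
Students = [
    {"Name": "Ann", "Major": "CS"},
    {"Name": "Raj", "Major": "Math"},
    {"Name": "Lee", "Major": "CS"},
]

def project(relation, attrs):
    """pi_attrs(relation): keep only attrs and drop duplicate tuples."""
    result = []
    for t in relation:
        row = {a: t[a] for a in attrs}
        if row not in result:   # a relation cannot contain the same tuple twice
            result.append(row)
    return result

majors = project(Students, ["Major"])
```

Three input tuples yield two output tuples, because two students share the major CS.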
Sequences Of Operations
Now we can see the sequence of operations based on both selection and Projection operations.
E.g.
TEMP <- SELECT P WHERE WEIGHT < 17
RESULT <- PROJECT TEMP OVER PNAME
or
(nested operations)
PROJECT (SELECT P WHERE WEIGHT < 17) OVER PNAME
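Both forms can be sketched in code. The part names and weights below are made up; only the attribute names PNAME and WEIGHT and the condition WEIGHT < 17 come from the example.

```python
# Hypothetical parts relation P.
P = [
    {"PNAME": "Nut", "WEIGHT": 12},
    {"PNAME": "Bolt", "WEIGHT": 17},
    {"PNAME": "Screw", "WEIGHT": 14},
]

def select(relation, condition):
    """Keep the tuples that satisfy condition."""
    return [t for t in relation if condition(t)]

def project(relation, attrs):
    """Keep only attrs, dropping duplicate tuples."""
    result = []
    for t in relation:
        row = {a: t[a] for a in attrs}
        if row not in result:
            result.append(row)
    return result

# Step by step, with a named intermediate relation.
temp = select(P, lambda t: t["WEIGHT"] < 17)
result = project(temp, ["PNAME"])

# Nested: the output of one operation is the input of the next.
nested = project(select(P, lambda t: t["WEIGHT"] < 17), ["PNAME"])
```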
Renaming Operation
Format: ρ S (R) or ρ S(A1, A2, ...) (R): change the name of relation R to S and, in the second form, the
names of the attributes of R to A1, A2, ...:
ρ CS_Students (σ Major = 'CS' (Students))
It would be useful to have a notation that allows us to "save" the output of an operation for future use.
Tmp1 ← σ cond1 (R1)
Tmp2 ← π List1 (Tmp1)
Tmp3 ← π List2 (Tmp2)
Tmp4 ← σ cond2 (Tmp2)
The resulting temporary relations will have the same attribute names as the originals. We might also
want to change the attribute names:
To avoid confusion between relations.
To make them agree with names in another table.
The Cartesian product of two tables combines each row in one table with each row in the other table.
The Cartesian product of n domains, written dom(A1) × dom(A2) × ... × dom(An), is defined
as follows:
A1 × A2 × ... × An = {(a1, a2, ..., an) | a1 ∈ A1 AND a2 ∈ A2 AND ... AND an ∈ An}
We will call each element in the Cartesian product a tuple. So each (a1, a2, ..., an) is known
as a tuple. Note that in this formulation, the order of values in a tuple matters.
Example: The table E (for EMPLOYEE)
1 Bill A
2 Sarah C
3 John A
The table D (for DEPARTMENT)
dnr dname
A Marketing
B Sales
C Legal
The result E × D
1 Bill A A Marketing
1 Bill A B Sales
1 Bill A C Legal
2 Sarah C A Marketing
2 Sarah C B Sales
2 Sarah C C Legal
3 John A A Marketing
3 John A B Sales
3 John A C Legal
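The same product can be computed with a short sketch, using the rows shown above as tuples.

```python
from itertools import product

E = [(1, "Bill", "A"), (2, "Sarah", "C"), (3, "John", "A")]
D = [("A", "Marketing"), ("B", "Sales"), ("C", "Legal")]

# Pair every tuple of E with every tuple of D and concatenate the values.
E_x_D = [e + d for e, d in product(E, D)]
```

With 3 tuples in E and 3 in D the product has 3 × 3 = 9 tuples, whether or not the department codes match.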
Division
Is expressed as R ÷ S.
Defines a relation over the attributes C that consists of the set of tuples from R that match the
combination of every tuple in S. Expressed using the basic operations:
T1 ← πC(R)
T2 ← πC((S × T1) − R)
T ← T1 − T2
The division operation ( ÷ ) is useful for a particular type of query that sometimes occurs in database
applications. For example, if I want to organize a study group, I would like to find people who do the same
subjects I do. The division operator provides you with the facility to perform this query without the need
to "hard code" the subjects involved.
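The three steps can be traced on the study-group example. The data is made up: R holds (student, subject) pairs and S holds the subjects I take. The pairs are built as (student, subject) so that they line up with the attribute order of R.

```python
# Hypothetical data: R(Student, Subject) and S(Subject).
R = {("Ann", "Databases"), ("Ann", "Networks"),
     ("Raj", "Databases"),
     ("Lee", "Databases"), ("Lee", "Networks"), ("Lee", "Logic")}
S = {"Databases", "Networks"}

# T1: every student who appears in R.
T1 = {student for (student, _) in R}

# T2: students for whom some (student, subject) pair with subject in S is missing from R.
T2 = {student for student in T1 for subject in S if (student, subject) not in R}

# T: students who take every subject in S.
T = T1 - T2
```

Changing the contents of S changes the answer without any change to the three steps, which is the point of not hard coding the subjects.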
Joins
Theta-join:-
The theta-join operation is the most general join operation. We can define theta-join in terms of the
operations that we are familiar with already.
R ⋈θ S = σθ(R × S)
So the join of two relations results in a subset of the Cartesian product of those relations. Which subset is
determined by the join condition θ.
Let's look at an example. The result of
Professions ⋈ Job = Job Careers is shown below.
Equi-join:-
The join condition, θ, can be any well-formed logical expression, but usually it is just the conjunction of
equality comparison between pairs of attributes, one from each of the joined relations. This common
case is called an equi-join. The example given above is an example of an equi-join.
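The definition of the theta-join as a selection over a Cartesian product can be sketched directly. The rows of Professions and Careers below are hypothetical, as is the helper name `theta_join`; result tuples are kept as pairs so that the two Job attributes do not clash.

```python
from itertools import product

# Hypothetical relations, each tuple a dictionary of attribute -> value.
Professions = [{"Name": "Ann", "Job": "Nurse"}, {"Name": "Raj", "Job": "Pilot"}]
Careers = [{"Job": "Nurse", "Sector": "Health"}, {"Job": "Chef", "Sector": "Food"}]

def theta_join(r, s, theta):
    """Pair every tuple of r with every tuple of s, then keep the pairs satisfying theta."""
    return [(a, b) for a, b in product(r, s) if theta(a, b)]

# An equi-join: the condition is an equality between one attribute from each side.
equi = theta_join(Professions, Careers, lambda a, b: a["Job"] == b["Job"])
```

Of the four pairs in the Cartesian product, only the one with matching Job values survives.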
Outer Joins:-
A join operation is complete if all the tuples of the operands contribute to the result. Tuples not
participating in the result are said to be dangling. Outer join operations are variants of the join
operations in which the dangling tuples are padded with NULL fields and kept in the result.
They can be categorized into:
LEFT OUTER JOIN - keep dangling tuples from the left-hand table
RIGHT OUTER JOIN - keep dangling tuples from the right-hand table
FULL OUTER JOIN - keep dangling tuples from both tables
The following figure illustrates examples of left and full outer joins:
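A left outer join can also be sketched in code. The relations and the helper name `left_outer_join` are hypothetical, and Python's None stands in for NULL.

```python
# Hypothetical relations.
Professions = [{"Name": "Ann", "Job": "Nurse"}, {"Name": "Raj", "Job": "Pilot"}]
Careers = [{"Job": "Nurse", "Sector": "Health"}, {"Job": "Chef", "Sector": "Food"}]

def left_outer_join(r, s, attr):
    """Equi-join on attr; dangling tuples of r are kept and padded with None."""
    s_attrs = [a for a in (s[0] if s else {}) if a != attr]
    out = []
    for a in r:
        matches = [b for b in s if b[attr] == a[attr]]
        if matches:
            out.extend({**a, **b} for b in matches)
        else:
            out.append({**a, **{x: None for x in s_attrs}})
    return out

joined = left_outer_join(Professions, Careers, "Job")
```

Raj has no matching row in Careers, so his tuple is dangling and appears with a null Sector. The dangling Chef row of Careers is dropped; a full outer join would keep it as well.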
Natural Joins
The join is the method whereby two or more tables are combined so that data from them can be used
together to extract information. In Relational Algebra, Codd defined the idea of a natural join. The
following describes a natural join.
To do a natural join of two relations, you examine the relations for common attributes (columns with the
same name and domain). For example, look at the following abstract tables: R (A, B, C, D) and Q (B, E,
F). Their only common attribute is B, so the natural join combines each tuple of R with each tuple of Q
that has the same B value, keeping a single B column.
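A natural join of R (A, B, C, D) and Q (B, E, F) can be sketched as follows. The attribute names come from the text; the values and the helper name `natural_join` are made up.

```python
def natural_join(r, q):
    """Join on all attributes the two relations share, keeping one copy of each."""
    out = []
    for t in r:
        for u in q:
            common = t.keys() & u.keys()
            if all(t[a] == u[a] for a in common):
                out.append({**t, **u})
    return out

# Hypothetical tuples for R(A, B, C, D) and Q(B, E, F).
R = [{"A": 1, "B": "x", "C": 10, "D": 100}, {"A": 2, "B": "y", "C": 20, "D": 200}]
Q = [{"B": "x", "E": "e1", "F": "f1"}, {"B": "z", "E": "e2", "F": "f2"}]

joined = natural_join(R, Q)
```

The result has the attributes A, B, C, D, E and F, and contains only the combination whose B values agree.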
Set Operations
Consider two relations R and S. You can perform following set operations on these two relations:
UNION of R and S
The union of two relations is a relation that includes all the tuples that are either in R or in S or in both R
and S. Duplicate tuples are eliminated.
INTERSECTION of R and S
The intersection of R and S is a relation that includes all tuples that are both in R and S.
DIFFERENCE of R and S
The difference of R and S is the relation that contains all the tuples that are in R but that are not in
S.
For set operations to function correctly the relations R and S must be union compatible. Two
relations are union compatible if:
They have the same number of attributes
The domain of each attribute in column order is the same in both R and S
The following figures illustrate each of the set operations:
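The set operations and the union-compatibility test can be sketched with Python sets. In this sketch a relation is assumed to be a pair (schema, rows), where the schema is a tuple of Python types standing in for attribute domains; this representation, the sample tuples, and the function name set_operation are illustrative assumptions.

```python
def set_operation(op, r, s):
    """Apply union, intersection, or difference to union-compatible relations."""
    schema_r, rows_r = r
    schema_s, rows_s = s
    # Union compatible: same number of attributes, same domain per column.
    if schema_r != schema_s:
        raise ValueError("relations are not union compatible")
    if op == "union":
        rows = rows_r | rows_s        # duplicates vanish automatically
    elif op == "intersection":
        rows = rows_r & rows_s
    elif op == "difference":
        rows = rows_r - rows_s
    else:
        raise ValueError("unknown operation: " + op)
    return schema_r, rows


# Hypothetical relations over (name, age).
R = ((str, int), {("Ann", 30), ("Bob", 25)})
S = ((str, int), {("Bob", 25), ("Cid", 41)})
T = ((str,), {("Ann",)})              # different degree: not compatible

union_rows = set_operation("union", R, S)[1]
intersection_rows = set_operation("intersection", R, S)[1]
difference_rows = set_operation("difference", R, S)[1]
```

Calling set_operation("union", R, T) raises ValueError, because R and T have a different number of attributes.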
Introduction
Relational calculus is an operational methodology, founded on predicate calculus, that deals with
descriptive expressions equivalent to the operations of relational algebra. Codd's reduction algorithm can
convert from relational calculus to relational algebra. Two forms of the relational calculus exist: the tuple
calculus and the domain calculus. Codd proposed the concept of a relational calculus (an applied predicate
calculus tailored to relational databases).
It is founded on a branch of mathematical logic called the predicate calculus. Relational calculus is a
formal query language where we write one declarative expression to specify a retrieval request and hence
there is no description of how to evaluate a query; a calculus expression specifies what is to be retrieved
rather than how to retrieve it. Therefore, the relational calculus is considered to be a nonprocedural
language. This differs from relational algebra, where we must write a sequence of operations to specify a
retrieval request; hence it can be considered as a procedural way of stating a query. It is possible to nest
algebra operations to form a single expression; however, a certain order among the operations is always
explicitly specified in a relational algebra expression. This order also influences the strategy for evaluating
the query.
It has been shown that any retrieval that can be specified in the relational algebra can also be specified
in the relational calculus, and vice versa; in other words, the expressive power of the two
languages is identical. This has led to the definition of the concept of a relationally complete language. A
relational query language L is considered relationally complete if we can express in L any query that can
be expressed in relational calculus. Relational completeness has become an important basis for
comparing the expressive power of high-level query languages. However, certain frequently required
queries in database applications cannot be expressed in relational algebra or calculus. Most
relational query languages are relationally complete but have more expressive power than relational
algebra or relational calculus because of additional operations such as aggregate functions,
grouping, and ordering.
Tuple Calculus
The tuple calculus is a calculus that was introduced by Edgar F. Codd as part of the relational
model in order to give a declarative database query language for this data model. It formed the
inspiration for the database query languages QUEL and SQL of which the latter, although far less
faithful to the original relational model and calculus, is now used in almost all relational database
management systems as the ad-hoc query language. Along with the tuple calculus Codd also
introduced the domain calculus which is closer to first-order logic and showed that these two calculi (and
the relational algebra) are equivalent in expressive power. The SQL language is based on the tuple
relational calculus (TRC) which in turn is a subset of classical predicate logic. Queries in the TRC all have
the form:
{QueryTarget | QueryCondition}
The QueryTarget is a tuple variable which ranges over tuples of values. The
QueryCondition is a logical expression such that
It uses the QueryTarget and possibly some other variables.
If a concrete tuple of values is substituted for each occurrence of the QueryTarget in
QueryCondition, the condition evaluates to a boolean value of true or false.
The result of a TRC query with respect to a database instance is the set of all choices of values for the
query variable that make the query condition a true statement about the database instance. The
relation between the TRC and logic is that the QueryCondition is a logical expression of classical
first-order logic.
The tuple relational calculus is based on specifying a number of tuple variables. Each tuple variable
usually ranges over a particular database relation, meaning that the variable may take as its
value any individual tuple from that relation. A simple tuple relational calculus query is of the form
{t | COND(t)} where t is a tuple variable and COND(t) is a conditional expression involving t.
The result of such a query is the set of all tuples t that satisfy COND(t).
For example, to find all employees whose salary is above $50,000, we can write the following tuple
calculus expression:
{t | EMPLOYEE(t) and t.SALARY>50000}
The condition EMPLOYEE(t) specifies that the range relation of tuple variable t is EMPLOYEE. Each
EMPLOYEE tuple t that satisfies the condition t.SALARY>50000 will be retrieved. Notice that t.SALARY
references attribute SALARY of tuple variable t; this notation resembles how attribute names are qualified
with relation names or aliases in SQL. The above query retrieves all attribute values for each
selected EMPLOYEE tuple t. To retrieve only some of the attributes, say the first and last names, we
write
{t.FNAME, t.LNAME | EMPLOYEE(t) and t.SALARY>50000}
This is equivalent to the following SQL query:
SELECT T.FNAME, T.LNAME FROM EMPLOYEE AS T WHERE T.SALARY>50000;
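The set-builder form {t | COND(t)} maps closely onto a Python set comprehension, which gives a quick way to experiment with such expressions. The sketch below assumes a tiny, hypothetical EMPLOYEE instance with only three attributes; the tuples are invented for illustration.

```python
from collections import namedtuple

# Hypothetical EMPLOYEE relation reduced to three attributes.
Employee = namedtuple("Employee", ["FNAME", "LNAME", "SALARY"])
EMPLOYEE = {
    Employee("John", "Smith", 30000),
    Employee("Franklin", "Wong", 40000),
    Employee("James", "Borg", 55000),
}

# {t | EMPLOYEE(t) and t.SALARY > 50000}
whole_tuples = {t for t in EMPLOYEE if t.SALARY > 50000}

# {t.FNAME, t.LNAME | EMPLOYEE(t) and t.SALARY > 50000}
names_only = {(t.FNAME, t.LNAME) for t in EMPLOYEE if t.SALARY > 50000}
```

The clause "for t in EMPLOYEE" plays the role of the range condition EMPLOYEE(t), the "if" clause is the selection condition, and the expression before "for" lists the requested attributes.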
Informally, we need to specify the following information in a tuple calculus expression:
1. For each tuple variable t, the range relation R of t. This value is specified by a
condition of the form R(t).
2. A condition to select particular combinations of tuples. As tuple variables range over their respective
range relations, the condition is evaluated for every possible combination of tuples to identify the
selected combinations for which the condition evaluates to TRUE.
3. A set of attributes to be retrieved, the requested attributes. The values of these
attributes are retrieved for each selected combination of tuples.
Observe the correspondence of the preceding items to a simple SQL query: item 1 corresponds
to the FROM-clause relation names; item 2 corresponds to the WHERE- clause condition; and item
3 corresponds to the SELECT-clause attribute list.
Before we discuss the formal syntax of tuple relational calculus, consider another query we have
seen before.
Retrieve the birthdate and address of the employee (or employees) whose name is 'John B. Smith'.
Q0 : {t.BDATE, t.ADDRESS | EMPLOYEE(t) and t.FNAME='John' and t.MINIT='B' and t.LNAME='Smith'}
In tuple relational calculus, we first specify the requested attributes t.BDATE and t.ADDRESS for
each selected tuple t. Then we specify the condition for selecting a tuple following the bar ( | ),
namely that t be a tuple of the EMPLOYEE relation whose FNAME, MINIT, and LNAME attribute
values are 'John', 'B', and 'Smith', respectively.
Domain Calculus
There is another type of relational calculus called the domain relational calculus, or simply,
domain calculus. The language QBE that is related to domain calculus was developed almost
concurrently with SQL at IBM Research, Yorktown Heights. The formal specification of the domain
calculus was proposed after the development of the QBE system.
The domain calculus differs from the tuple calculus in the type of variables used in formulas:
rather than having variables range over tuples, the variables range over single values from domains of
attributes. To form a relation of degree n for a query result, we must have n of these domain variables—
one for each attribute. An expression of the Domain calculus is of the form
{x1, x2, . . ., xn | COND(x1, x2, . . ., xn, xn+1, xn+2, . . ., xn+m)}
where x1, x2, . . ., xn, xn+1, xn+2, . . ., xn+m are domain variables that range over domains (of attributes)
and COND is a condition or formula of the domain relational calculus. A formula is made up of atoms.
As in tuple calculus, atoms evaluate to either TRUE or FALSE for a specific set of values, called
the truth values of the atoms.
In a similar way to the tuple relational calculus, formulas are made up of atoms, variables, and
quantifiers, so we will not repeat the specifications for formulas here. Some examples of queries
specified in the domain calculus follow. We will use lowercase letters l, m, n, . . ., x, y, z for domain
variables.
Example: Q0
Retrieve the birthdate and address of the employee whose name is 'John B. Smith'.
Q0 : {uv | (∃q)(∃r)(∃s)(∃t)(∃w)(∃x)(∃y)(∃z)
(EMPLOYEE(qrstuvwxyz) and q='John' and r='B' and s='Smith')}
Example: Q1
Retrieve the name and address of all employees who work for the 'Research' department.
Q1 : {qsv | (∃z)(∃l)(∃m)(EMPLOYEE(qrstuvwxyz) and
DEPARTMENT(lmno) and l='Research' and m=z)}
A condition relating two domain variables that range over attributes from two relations, such as m = z in
Q1, is a join condition; whereas a condition that relates a domain variable to a constant, such as l =
'Research', is a selection condition.
Example: Q2
For every project located in 'Stafford', list the project number, the controlling department number, and
the department manager's last name, birthdate, and address.
Q2 : {iksuv | (∃j)(∃m)(∃n)(∃t)(PROJECT(hijk) and EMPLOYEE(qrstuvwxyz) and
DEPARTMENT(lmno) and k=m and n=t and j='Stafford')}
As mentioned earlier, it can be shown that any query that can be expressed in the relational
algebra can also be expressed in the domain or tuple relational calculus. Also, any safe expression in
the domain or tuple relational calculus can be expressed in the relational algebra.
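To make the role of domain variables concrete, query Q1 can be sketched as a Python set comprehension in which each tuple is unpacked into one variable per attribute. The sample tuples are hypothetical. The sketch assumes only what the queries above imply about attribute positions: q and s hold the first and last name, v the address, z the employee's department number, l the department name, and m the department number. The remaining positions hold placeholder values.

```python
# Hypothetical instances; only the attribute positions follow the queries.
EMPLOYEE = {
    ("John", "B", "Smith", "e1", "1965-01-09", "addr-1", "M", 30000, "e2", 5),
    ("Alicia", "J", "Zelaya", "e3", "1968-07-19", "addr-2", "F", 25000, "e4", 4),
}
DEPARTMENT = {
    ("Research", 5, "e2", "1988-05-22"),
    ("Administration", 4, "e4", "1995-01-01"),
}

# Q1: {qsv | (exists z)(exists l)(exists m)(EMPLOYEE(qrstuvwxyz) and
#            DEPARTMENT(lmno) and l='Research' and m=z)}
q1 = {
    (q, s, v)
    for (q, r, s, t, u, v, w, x, y, z) in EMPLOYEE
    for (l, m, n, o) in DEPARTMENT
    if l == "Research" and m == z    # selection condition, join condition
}
```

The two "for" clauses bind the domain variables of EMPLOYEE and DEPARTMENT, and the variables that do not appear in the result tuple (q, s, v) behave like the existentially quantified ones.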
The Entity/Relationship (E/R) model was developed to give an overall, conceptual view of the organization
of data. In these notes, we present the modeling concepts. The E/R model has an associated graphical
representation, called E/R diagrams which will be discussed later.
Analogies
A mini world is a small part of the real world that we are interested in modeling.
Movie World Example: For a running example we will assume that our mini world is the motion picture
industry.
Student World Example: For another running example we will assume that our mini world is the
students and subjects at JCU.
Entity
An entity is a thing or an object in that world, usually one that physically exists, that is distinguishable
from other entities.
Attribute
blond
entity a3 has attributes Name = Yul Brenner, Age = 60, HairColour = bald
entity m1 has attributes Name = Sneakers, Cost = $10M, Earnings = $40M, Profit = $30M, When-Released
= 1995
Here entities labelled a (such as a3) are stars and entities labelled m (such as m1) are movies.
Student World Example: Let's assume that we have several "Student" and "Subject" entities.
entity s1 has attributes Name = Charles Walker, Id = 484350
entity s2 has attributes Name = Jasper, Id = 2234433
entity u1 has attributes Code = CP1500, Name = Information Systems
entity u2 has attributes Code = CP1200, Name = Programming
Now we will consider the following observations from the above.
Even among these simple entities we notice that there are several different kinds of attributes.
One distinction is simple vs. composite. A simple attribute has an atomic value, while a composite
attribute is (naturally) composed of other attributes.
Movie World Example: We could view a "Star's" Name attribute as a composite attribute, since it is the
composition of Given Names and Surname attributes.
Student World Example: We could view a "Student's" Name attribute as a composite attribute,
since it is the composition of Given Names and Surname attributes.
Another distinction we can make is single-valued vs. multivalued. A single-valued attribute can only be a
single value, while a multivalued attribute can be a list or set of values.
Movie World Example: The HairColour attribute is multivalued since Meryl Streep's hair colour is three
different colours. We will assume that it is three different colours all at the same time!
Student World Example: A Location attribute could be added to each Subject indicating in which rooms
lectures are held. It is often the case that a subject is taught in different rooms. So Location is a
multivalued attribute.
In general, the fact that a single-valued attribute changes value over time (e.g., when a person dyes their
hair) does not mean that it is multivalued. A third distinction is stored vs. derived. While the vast majority
of attributes will be stored, some attributes can be computed or derived from other attributes.
Movie World Example: A movie's Profit is a derived attribute, computable from the Cost and Earnings
attributes.
Student World Example: Assume that each subject has a When multivalued attribute that indicates when
the lectures are held. Then a possible derived attribute would be Lecture hours, which is total number of
hours that the class meets each week. Lecture hours is derived from the When attribute.
By now you should have the basic idea of entities and attributes.
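The attribute kinds discussed above can be sketched with Python dataclasses. The class and field names are illustrative assumptions. The movie values are those of entity m1, with money in millions of dollars, while the star and its hair colours are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Name:
    """Composite attribute: composed of other attributes."""
    given_names: str
    surname: str


@dataclass
class Star:
    name: Name                                        # composite
    age: int                                          # simple, single-valued
    hair_colours: set = field(default_factory=set)    # multivalued


@dataclass
class Movie:
    name: str
    cost: int            # stored, in millions of dollars
    earnings: int        # stored, in millions of dollars
    when_released: int

    @property
    def profit(self):
        """Derived attribute: computed from stored attributes, never stored."""
        return self.earnings - self.cost


m1 = Movie("Sneakers", cost=10, earnings=40, when_released=1995)
# Hypothetical star, used only to show a composite and a multivalued attribute.
star = Star(Name("Jane", "Doe"), 45, {"blond", "red", "brown"})
```

Because profit is computed on demand, it cannot drift out of step with cost and earnings, which is the usual argument for deriving an attribute rather than storing it.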
Database architecture essentially describes the location of all the pieces of information that make up the
database application. The database architecture can be broadly classified into two-, three-, and multitier
architecture.
The two-tier architecture is a client–server architecture in which the client contains the presentation code
and the SQL statements for data access. The database server processes the SQL statements and sends
query results back to the client. The two-tier architecture is shown in the figure depicted below. Two-tier
client/server provides a basic separation of tasks. The client, or first tier, is primarily responsible for the
presentation of data to the user and the server, or second tier, is primarily responsible for supplying
data services to the client.
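The division of work in a two-tier system can be sketched with Python's built-in sqlite3 module. This is a simplification under stated assumptions: an in-memory SQLite database stands in for a separate database server, and the table, the rows, and the function client_report are hypothetical.

```python
import sqlite3

# Second tier: an in-memory SQLite database stands in for the database server.
server = sqlite3.connect(":memory:")
server.execute("CREATE TABLE employee (fname TEXT, lname TEXT, salary INTEGER)")
server.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [("John", "Smith", 30000), ("James", "Borg", 55000)],
)


def client_report(conn, min_salary):
    """First tier: holds both the SQL statement and the presentation code."""
    rows = conn.execute(
        "SELECT fname, lname FROM employee WHERE salary > ? ORDER BY lname",
        (min_salary,),
    ).fetchall()
    # Presentation: format each row for display to the user.
    return [f"{lname}, {fname}" for fname, lname in rows]


report = client_report(server, 50000)
```

Note that the SQL text lives inside the client function. Any change to the query logic therefore means changing every installed client.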
Presentation Services
Presentation services refers to the portion of the application which presents data to the user. In
addition, it also provides for the mechanisms in which the user will interact with the data. More simply
put, presentation logic defines and interacts with the user interface. The presentation of the data should
generally not contain any validation rules.
Business Services/objects
Business services are a category of application services. Business services encapsulate an organization's
business processes and requirements. These rules are derived from the steps necessary to carry out
day-to-day business in an organization. These rules can be validation rules, used to be sure that the incoming
information is of a valid type and format, or they can be process rules, which ensure that the proper
business process is followed in order to complete an operation.
Application Services
Data services provide access to data independent of their location. The data can come from legacy
mainframe, SQL RDBMS, or proprietary data access systems. Once again, the data services provide a
standard interface for accessing data.
The two-tier architecture is a good approach for systems with stable requirements and a moderate
number of clients. The two-tier architecture is the simplest to implement, due to the number of good
commercial development environments.
Software maintenance can be difficult because PC clients contain a mixture of presentation, validation, and
business logic code. To make a significant change in the business logic, code must be modified on many
PC clients. Moreover, the performance of two-tier architecture can be poor when a large number of clients
submit requests because the database server may be overwhelmed with managing messages. With a
large number of simultaneous clients, three-tier architecture may be necessary.
Three-tier Architecture
Through standard tiered interfaces, services are made available to the application. A single application can
employ many different services which may reside on dissimilar platforms or are developed and maintained
with different tools. This approach allows a developer to leverage investments in existing systems while
creating new applications that can use existing resources.
Although the three-tier architecture addresses performance degradations of the two-tier architecture, it
does not address division-of-processing concerns.
The PC clients and the database server still contain the same division of code although the tasks of the
database server are reduced. Multiple-tier architectures provide more flexibility on division of processing.
Multitier Architecture
Application Servers can take many forms. An Application Server may be anything from custom application
services, Transaction Processing Monitors, Database Middleware, Message Queue to a CORBA/COM based
solution.
E-R Diagrams
The entity-relationship (ER) data model allows us to describe the data involved in a real-world
enterprise in terms of objects and their relationships and is widely used to develop an initial database
design. Here, we introduce the ER model and discuss how its features allow us to model a wide
range of data faithfully.
The ER model is important primarily for its role in database design. It provides useful concepts that allow
us to move from an informal description of what users want from their database to a more detailed,
and precise, description that can be implemented in a DBMS. Within the larger context of the overall
design process, the ER model is used in a phase called conceptual database design.
Many variations of ER diagrams are in use, and no widely accepted standards prevail.
The presentation here is representative of the family of ER models and includes a selection of the most
popular features.
E-R Diagrams
In an E/R diagram we will represent an attribute using an oval inscribed with the name of the attribute, as
follows.
At least that is how we will represent a simple, single-valued, stored attribute. A composite
attribute will be represented by a hierarchy of ovals, where each oval represents an attribute
value within the composite. A multivalued attribute will be represented as an oval within an oval.
Finally, a derived attribute will be represented as an attribute with dashed or dotted lines.
One interesting question is what happens when we don't know the value of a particular attribute? When an
attribute value is unknown we will use a null value. For the above entities, we have complete information,
but in real world databases null values will often be present. We will represent a null value with the special
symbol @.
For some entities an attribute is inapplicable, which means that the entity does not have a value
for that attribute. For instance, the HairColour attribute for Yul Brenner is really inapplicable since he does
not have any hair. We will use a @ to represent inapplicable values as well. We have thus
overloaded the semantics of @ with two completely disparate meanings. This overloading is
nevertheless common in databases, since the SQL NULL is overloaded in the same way.
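The same overloading can be observed in SQL. The sketch below uses Python's built-in sqlite3 module with a hypothetical star table. One row has an inapplicable hair colour and another has an unknown age, and both are stored as NULL (None in Python).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE star (name TEXT, age INTEGER, hair_colour TEXT)")
conn.executemany(
    "INSERT INTO star VALUES (?, ?, ?)",
    [
        ("Yul Brenner", 60, None),        # inapplicable: no hair
        ("Unnamed Star", None, "blond"),  # hypothetical row: age unknown
    ],
)

# IS NULL finds the missing value but cannot say why it is missing.
no_hair_value = conn.execute(
    "SELECT name FROM star WHERE hair_colour IS NULL"
).fetchall()

# A comparison with NULL is never true, so this query selects nothing.
equals_null = conn.execute(
    "SELECT name FROM star WHERE hair_colour = NULL"
).fetchall()
```

Nothing in the stored data distinguishes "inapplicable" from "unknown". If an application needs that distinction, it has to record it separately.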
The following figure shows the various symbols used in an E/R diagram.
Entity Type
Now that you know what an entity is, we turn to the entity type.
An entity type is a description of the attributes that a set of possible entities has in common.
In our running example, we so far have two entity types: Star and Movie. We will use a third, Studio
as well. We will assume that Star has attributes Name, Age, and HairColour. Movie has attributes
Name, When Released, Cost, Earnings, and Profit. Finally, Studio has attributes Name and Location. Name
is certainly a popular attribute name for these entity types!
In our running example, we so far have two entity types: Student and Subject. We will use a third,
Lecturer as well. We will assume that Student has attributes Name, Address, and Id. Subject has
attributes Code, Name, and When. Finally, Lecturer has attributes Name and Age. Name is certainly
a popular attribute name for these entity types!
An entity type is sometimes called an entity set; however, some authors distinguish between the
two. More specifically an entity set is a set of actual entities (that is, it is an extension of an entity type,
rather than an entity type itself). We will use the two terms interchangeably. In an E/R diagram an entity
type is represented with a rectangular box inscribed with the name of that entity type.
Key Attributes
Key attributes (or just keys) are a set of attributes which have distinct values for any possible
entity. There may be several keys for a particular entity type.
By convention, two movies with the same name cannot be released during the same year. So
the attributes Name and When Released form a perfectly reasonable key for the Movie entity type.
Each student has a unique Id, so that attribute makes a perfectly reasonable key for the Student entity
type. In an E/R diagram we depict a key attribute (or an attribute that is part of a key) by underlining the
attribute name.
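Whether a set of attributes can serve as a key can be tested against sample data, as sketched below. The helper is_key and the second movie are hypothetical. A key must hold for every possible entity, so such a check can rule a candidate out but can never prove that it is a key.

```python
def is_key(entities, attrs):
    """Return True if no two entities agree on all of the given attributes."""
    seen = set()
    for entity in entities:
        value = tuple(entity[a] for a in attrs)
        if value in seen:
            return False
        seen.add(value)
    return True


# "Sneakers" (1995) is from the running example; the 2010 entry is hypothetical.
movies = [
    {"Name": "Sneakers", "WhenReleased": 1995},
    {"Name": "Sneakers", "WhenReleased": 2010},
]

name_alone = is_key(movies, ["Name"])
name_and_year = is_key(movies, ["Name", "WhenReleased"])
```

Here Name alone fails because two movies share it, while the combination of Name and WhenReleased still distinguishes every entity in the sample.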
Relationship
Just as a student and a faculty member stand in a student-faculty relationship, in the E-R model a
relationship is an association between two or more entities.
Relationship Type
A relationship type or relationship set is a set of "similar in kind" relationships among one or more entities.
Mathematically, a relationship type, R, among entity types E1, E2, ..., En is R ⊆ E1 × E2 × ... × En. In other
words, a relationship set can be thought of as a subset of the Cartesian product of the participating entity
types. The Cartesian product is just the space of all possible associations among the entity types. A
relationship type is often also called a role because it describes a role that one entity plays with another.
Each star may "star in" one or more movies. So we could have a relationship type StarsIn that
captures all the associations between stars and the movies in which they star.
The relationship type EnrolledIn is the set of associations between Student and the Subject in which they
are enrolled.
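The view of a relationship type as a subset of a Cartesian product can be checked directly in Python. The identifiers s1, s2, u1, and u2 follow the Student World entities introduced earlier, while the particular enrolments are hypothetical.

```python
from itertools import product

students = {"s1", "s2"}
subjects = {"u1", "u2"}

# The Cartesian product is the space of all possible associations.
all_pairs = set(product(students, subjects))

# Hypothetical instance of EnrolledIn: one particular subset of that space.
enrolled_in = {("s1", "u1"), ("s1", "u2"), ("s2", "u1")}

is_valid_relationship = enrolled_in <= all_pairs
```

A pair such as ("s2", "u2") lies in the product but not in enrolled_in, which simply means that this student is not enrolled in that subject.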
Cardinality Ratio
We will often be interested in the cardinality ratio of a relationship type, that is, how many of
each entity type participate in the relationship. Possible cardinality ratios are the following.
One-to-one (1-to-1)
Each entity in E1 is associated with zero or one entity in E2, and vice versa.
Assume that Married is a relationship type between Star and Star, which captures who is married to
whom. It is a 1-1 relationship since each Star is married to at most one other Star (let's not worry
too much about people who currently have multiple wives or husbands!).
Assume that Married is a relationship type between Student and Student, which captures who is
married to whom. It is a 1-1 relationship since each Student is married to at most one other
Student (let's not worry too much about students who currently have multiple wives or husbands!).
One-to-many
A one-to-many relationship type (1-N or 1:N) is one in which a single entity of one entity type can be
related to several entities of another type, but each entity of the other type is related to at most
one entity of the first type.
Assume that Produces is a relationship type between Studio and Movie, which captures which studio
produces which movies. It is a 1-N relationship since each Studio may produce several different
Movies, but each movie can be produced by at most one Studio (assuming that only one studio can
produce a movie, let's not worry too much about collaboration between studios).
Assume that Teaches is a relationship type between Lecturers and Subjects, which captures which
lecturer teaches which subject. It is a 1-N relationship since each Lecturer can teach several
different subjects, but each Subject has a single Lecturer (let's not worry too much about subjects
that have more than one lecturer).
Many-to-many
A many-to-many relationship type (N-M or N:M) is one in which a single entity of one entity type can be
related to several entities of another type, and vice versa.
Assume that StarsIn is a relationship type between Star and Movie, which captures who stars in what
movies. It is an N-M relationship since each Star may star in many different Movies, and each Movie may
have many different Stars.
Assume that EnrolledIn is a relationship type between Student and Subject, which captures who is
enrolled in what subject. It is an N-M relationship since each Student may enroll in many different
Subjects, and each Subject may have many different Students.
In an E/R diagram we depict a relationship type as a diamond. The cardinality ratio is also shown by
adding 1, N, or M to the lines connecting the relationship type to the entity type.
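The cardinality ratios can be illustrated by inspecting a set of (E1, E2) pairs, as sketched below. The function cardinality_ratio and all sample pairs are hypothetical. A cardinality ratio is a constraint on every possible instance, so the function only reports what one particular instance exhibits.

```python
from collections import Counter


def cardinality_ratio(pairs):
    """Classify a set of distinct (e1, e2) pairs as 1:1, 1:N, N:1, or N:M."""
    left = Counter(e1 for e1, _ in pairs)
    right = Counter(e2 for _, e2 in pairs)
    # An E1 entity appearing in several pairs relates to several E2 entities.
    left_fans_out = any(n > 1 for n in left.values())
    right_fans_out = any(n > 1 for n in right.values())
    if left_fans_out and right_fans_out:
        return "N:M"
    if left_fans_out:
        return "1:N"
    if right_fans_out:
        return "N:1"
    return "1:1"


# Hypothetical instances of the relationship types discussed above.
married = {("star A", "star B")}
produces = {("studio X", "movie 1"), ("studio X", "movie 2")}
stars_in = {("star A", "movie 1"), ("star A", "movie 2"), ("star B", "movie 1")}

ratios = [cardinality_ratio(r) for r in (married, produces, stars_in)]
```

For these samples, Married comes out as 1:1, Produces as 1:N, and StarsIn as N:M, matching the classification given in the text.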
A weak entity type is an entity that needs the key attributes from another entity to uniquely
identify tuples. Weak entities lack keys. In an E/R diagram a weak entity type is represented by a nested
pair of rectangles as shown below. The weak entity is connected by an identifying or owning relationship
to the entity type that supplies the key attributes, which in turn is called the owning entity type. An
owning relationship is depicted as a nested pair of diamonds.
Each Star could have several children. We choose to represent a Child entity type using Child
Name and Age attributes. For instance assume that Meryl Streep has a child named Joe who is 6 years
old. The key of the Star entity type needs to be used to help identify which Child is dependent on
which Star since children in different families could be the same age with the same first name.
For instance assume that Robert Redford also has a child named Joe who is 6 years old. We need the
Star's key to identify which child is owned by which Star, to keep the two Joes separate.
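The need for the owner's key can be sketched with a Python dictionary. Each Child is identified by the pair (owner key, child name). The sketch assumes that a Star is identified by name alone, which the text does not state.

```python
# Weak entity Child, keyed by (owning Star's key, child's name).
children = {
    ("Meryl Streep", "Joe"): {"Age": 6},
    ("Robert Redford", "Joe"): {"Age": 6},
}

# Without the owner's key the two children collapse into a single entry.
by_name_only = {}
for (star_name, child_name), attributes in children.items():
    by_name_only[child_name] = attributes

distinct_with_owner_key = len(children)
distinct_without_owner_key = len(by_name_only)
```

With the owner's key both children remain distinct entities. Keyed by name alone, one Joe silently overwrites the other.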
In addition to E-R diagrams, another tool that comes in handy during database as well as system design is
the Data Flow Diagram (DFD). Both DFD and ERD are important for an organization. While entities,
whether they are people, places, events or objects are represented in an ERD, DFD talks about how data
flows between entities. One gets to know about the entities for which data is stored in the organization
through ERD while DFD gives information about the flow of data between entities and how and where it is
stored.
A data flow diagram supports four main activities:
Analysis: DFD is used to determine requirements of users
Design: DFD is used to map out a plan and illustrate solutions to analysts and users while designing
a new system
Communication: One of the strengths of DFD is its simplicity and ease of understanding for analysts and
users
Documentation: DFD is used to provide special description of requirements and system design.
A DFD provides an overview of the key functional components of the system, but it does not provide any
detail on these components. We have to use other tools, such as the data dictionary and process
specifications, to get an idea of which information will be exchanged and how.
The data dictionary is an organized listing of all the data elements pertinent to the system, with precise,
rigorous definitions so that both user and systems analyst will have a common understanding of all inputs,
outputs, components of stores, and intermediate calculations. The data dictionary defines the data
elements by doing the following:
Describing the meaning of the flows and stores shown in the data flow diagrams;
Describing the composition of aggregate packets of data moving along the flow;
Describing the composition of packets of data in stores;
Specifying the relevant values and units of elementary chunks of information in the
data flows and data stores.
Describing the details of relationships between stores that are highlighted in an
entity-relationship diagram.
The systems analyst can ensure that the dictionary is complete, consistent, and
noncontradictory by examining it and asking the following
questions:
Has every flow on the data flow diagram been defined in the data dictionary?
Have all the components of composite data elements been defined?
Has any data element been defined more than once?
Has the correct notation been used for all data dictionary definitions?
Are there any data elements in the data dictionary that are not referenced in the
functioning diagrams, data flow diagrams, or entity-relationship diagrams?
Building a data dictionary is one of the more important and time-consuming aspects of systems analysis.
But, without a formal dictionary that defines the meaning of all the terms, there can be no hope for
precision.
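Several of the analyst's questions above are mechanical enough to automate. The following sketch assumes a very simple, hypothetical representation: the dictionary is a list of (name, components) pairs, with an empty component list for an elementary element, and the data flow diagram is reduced to a list of flow names. All element and flow names are invented.

```python
def check_dictionary(definitions, dfd_flows):
    """Report undefined, duplicated, and unreferenced data elements."""
    names = [name for name, _ in definitions]
    defined = set(names)
    components = {c for _, parts in definitions for c in parts}
    referenced = set(dfd_flows) | components
    return {
        # Has every flow on the DFD been defined?
        "undefined_flows": set(dfd_flows) - defined,
        # Have all components of composite elements been defined?
        "undefined_components": components - defined,
        # Has any element been defined more than once?
        "duplicates": {n for n in defined if names.count(n) > 1},
        # Are any elements defined but never referenced?
        "unreferenced": defined - referenced,
    }


# Hypothetical dictionary and flows, seeded with one problem of each kind.
definitions = [
    ("order", ["customer_name", "order_item"]),
    ("customer_name", []),
    ("order", []),
    ("legacy_code", []),
]
dfd_flows = ["order", "invoice"]

report = check_dictionary(definitions, dfd_flows)
```

The question about correct notation is not covered here, since it depends on the notation the project has adopted.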
As we know, there is a variety of tools that we can use to produce a process specification: decision tables,
structured English, pre/post conditions, flowcharts, and so on. Most of the systems analysts use structured
English. But, any method can be used as long as it satisfies two important requirements:
The process specification must be expressed in a form that can be verified by the
user and the systems analysts;
The process specification must be expressed in a form that can be
effectively communicated to the various audiences involved.
The process specification represents the largest amount of detailed work in building a system model.
Because of the amount of work involved, you may want to consider the top-down implementation
approach: begin the design and implementation phase of your project before all the process specifications
have been finished.
The activity of writing process specifications can be regarded as a check of the data flow diagrams that
have already been developed. In writing process specifications, you may discover that a process
needs additional functions, input data flows, or output data flows. Thus, the DFD model may undergo
changes, revisions, and corrections based on the detailed work of writing the process specifications.
A data flow diagram can be described by answering questions such as the following:
What functions should the system perform?
How do the functions interact?
What does the system have to transfer?
What inputs are transformed into what outputs?
What type of work does the system do?
Where does the system get the information it needs to work?
Where does it deliver the results of its work?
Regardless of how it is described, the data flow diagram needs to meet the following
requirements:
Without explanation in words, the diagram can still convey the system's functions and
its information flows. Moreover, it must be simple for users and
systems analysts to understand.
The diagram must be laid out in a balanced way on one page (for small systems), or with
every page showing system functions of the same level (for larger systems).
It is better for the diagram to be laid out with computer-based tools, because that
way the diagram will be consistent and standardized. Also, any adjustments (when
needed) can be made quickly and easily.
The main components of a data flow diagram are:
The process: The process shows a part of the system that transforms inputs into outputs;
that is, it shows how one or more inputs are changed into outputs. Generally, the process
is represented graphically as a circle or a rectangle with rounded edges. The process name
describes what the process does.
The flow: The flow is used to describe the movement of information from one part of the
system to another. Thus, the flow represents data in motion, whereas the stores represent
data at rest. A flow is represented graphically by an arrow into or out of a process.
The store: The store is used to model a collection of data packets at rest. A
store is represented graphically by two parallel lines. The name that identifies a
store is the plural of the name of the packets that are carried by flows into and out of the
store.
External factors: An external factor can be a person, a group of persons, or an organization
that lies outside the scope of the system under study (it may be inside or outside
the organization) but has certain contact with the system. The presence of these factors
on the diagram shows the boundary of the system and identifies the system's
relationship to the outside world. External factors are crucial to
the survival of every system, because they are the sources of information for the
system and the destinations to which system products are transferred. An external factor tends to
be represented by a rectangle, one shorter edge of which is omitted while the other is
drawn as a double line.
Internal factors: While the names of external factors are always nouns
denoting a department or an organization, the names of internal factors are
expressed by verbs or modifiers. Internal factors are the system's functions or processes. To
distinguish it from an external factor, an internal factor is represented by a
rectangle, one shorter edge of which is omitted while the other is drawn as a single
line.
You can construct the DFD model of a system using the following guidelines:
Choose meaningful names for processes, flows, stores, and terminators
Number the processes
Redraw the DFD as many times as necessary
Avoid overly complex DFDs
Make sure the DFD is consistent internally and with any associated DFDs
To recap, the DFD is one of the most important tools in structured systems analysis. It presents a method of
establishing the relationship between the functions or processes of the system and the information it uses. The DFD is a
key component of the system requirement specification, because it determines what information is needed
for each process before it is implemented. Many systems analysts reckon that the DFD is all they need to know
about structured analysis.
On the one hand, this is because the DFD is often the only thing that a systems analyst remembers after reading a
book focusing on DFDs or after a course in structured analysis. On the other hand, without additional
modelling tools such as the data dictionary and process specifications, a DFD not only cannot show all the necessary
details but also becomes meaningless and useless.
In the example of the library management system, we develop a data flow diagram corresponding to each level of
the function hierarchy diagram:
Functional Dependencies
Introduction
For our discussion on functional dependencies assume that a relational schema has attributes (A,
B, C... Z) and that the whole database is described by a single universal relation called R = (A,
B, C, ..., Z). This assumption means that every attribute in the database has a unique name.
A functional dependency is a property of the semantics of the attributes in a relation. The semantics
indicate how attributes relate to one another, and specify the functional dependencies between
attributes. When a functional dependency is present, the dependency is specified as a constraint between
the attributes.
Consider a relation with attributes A and B, where attribute B is functionally dependent on attribute A. If
we know the value of A and we examine the relation that holds this dependency, we will find only
one value of B in all of the tuples that have a given value of A, at any moment in time. Note however,
that for a given value of B there may be several different values of A.
The functional dependency staff# → position clearly holds on this relation instance. However, the reverse
functional dependency position → staff# clearly does not hold.
The relationship between staff# and position is 1:1 – for each staff member there is only one position. On
the other hand, the relationship between position and staff# is 1:M – there are several staff numbers
associated with a given position.
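The check described above can be sketched in a few lines of Python. The helper name holds and the three staff rows are hypothetical, invented only for illustration; they are not the relation instance the text refers to.

```python
def holds(rows, lhs, rhs):
    """Return True if the FD lhs -> rhs is satisfied by this list of dict rows."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        # setdefault stores the first value seen for this key and returns it;
        # a later, different value for the same key violates the dependency.
        if seen.setdefault(key, val) != val:
            return False
    return True

# Hypothetical rows, invented for this sketch.
staff = [
    {"staff#": "S1", "position": "Manager"},
    {"staff#": "S2", "position": "Assistant"},
    {"staff#": "S3", "position": "Assistant"},
]

staff_determines_position = holds(staff, ["staff#"], ["position"])
position_determines_staff = holds(staff, ["position"], ["staff#"])
print(staff_determines_position)  # True
print(position_determines_staff)  # False
```

Note that a check like this can only refute a dependency on one instance; passing it does not establish that the dependency holds for the schema.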
For the purposes of normalization we are interested in identifying functional dependencies between
attributes of a relation that have a 1:1 relationship.
When identifying FDs between attributes in a relation, it is important to distinguish clearly between the
values held by an attribute at a given point in time and the set of all possible values that an attribute
may hold at different times.
In other words, a functional dependency is a property of a relational schema (its intension) and
not a property of a particular instance of the schema (extension).
The reason that we need to identify FDs that hold for all possible values of the attributes of a relation is that
these represent the types of integrity constraints that we need to identify. Such constraints indicate the
limitations on the values that a relation can legitimately assume. In other words, they identify the
legal instances which are possible.
Let‘s identify the functional dependencies that hold using the relation schema
STAFFBRANCH.
In order to identify the time-invariant FDs, we need to clearly understand the semantics of the various
attributes in each of the relation schemas in question.
For example, suppose we know that a staff member's position and the branch at which they are located
determine their salary. There is no way of knowing this constraint unless you are familiar with the
enterprise, but this is what the requirements analysis phase and the conceptual design phase are
all about!
staff# → sname, position, salary, branch#, baddress
branch# → baddress
baddress → branch#
branch#, position → salary
baddress, position → salary
As well as identifying FDs that hold for all possible values of the attributes involved, we also
want to ignore trivial functional dependencies. A functional dependency is trivial if the consequent is a
subset of the determinant; in other words, it is impossible for it not to be satisfied.
Although trivial FDs are valid, they offer no additional information about integrity constraints for the
relation. As far as normalization is concerned, trivial FDs are ignored.
We'll denote by F the set of functional dependencies that are specified on a relational schema R.
Typically, the schema designer specifies the FDs that are semantically obvious; usually, however,
numerous other FDs hold in all legal relation instances that satisfy the dependencies in F.
These additional FDs are those that can be inferred or deduced from the FDs in F.
The set of all functional dependencies implied by a set of functional dependencies F is called the closure of
F and is denoted F+.
The notation F ⊨ X → Y denotes that the functional dependency X → Y is implied by the set of FDs F.
Formally, F+ = {X → Y | F ⊨ X → Y}
A set of inference rules is required to infer the set of FDs in F+.
For example, if Kristi is older than Debi and Debi is older than Traci, you are able to infer that Kristi is
older than Traci. How did you make this inference? Without thinking about it, or perhaps even knowing about it,
you used a transitivity rule. In the same way, the set of all FDs that are implied by a
given set S of FDs is called the closure of S, written S+.
Clearly we need an algorithm that will allow us to compute S+ from S. The first attack on this
problem appeared in a paper by Armstrong, which gives a set of inference rules. The following are the
six well-known inference rules that apply to functional dependencies.
IR1: reflexive rule – if X ⊇ Y, then X → Y
IR2: augmentation rule – if X → Y, then XZ → YZ
IR3: transitive rule – if X → Y and Y → Z, then X → Z
IR4: projection rule – if X → YZ, then X → Y and X → Z
IR5: additive rule – if X → Y and X → Z, then X → YZ
IR6: pseudo transitive rule – if X → Y and YZ → W, then XZ → W
The first three of these rules (IR1-IR3) are known as Armstrong's Axioms. They are sound: each of them
can be proved directly from the definition of functional dependency. They are also complete, in the sense
that, given a set S of FDs, all FDs implied by S can be derived from S using the rules. The rules
can be stated in a variety of equivalent ways, and the other rules (IR4-IR6) are derived from these
three rules.
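In practice one rarely enumerates the whole closure. A common alternative, not described in the text above, is to compute the closure of a set of attributes and use it to test whether a single FD is implied. The following is a minimal sketch of that attribute-closure technique; the function names are hypothetical, and the FD list is the STAFFBRANCH example as read from this section, which is an assumption.

```python
def attribute_closure(attrs, fds):
    """Return every attribute functionally determined by attrs under fds.

    fds is a list of (lhs, rhs) pairs, each a set of attribute names.
    """
    closure = set(attrs)  # reflexive rule: a set determines itself
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If lhs is already determined, the transitive rule adds rhs.
            if lhs <= closure and not rhs <= closure:
                closure |= rhs
                changed = True
    return closure

def implies(fds, lhs, rhs):
    """True if lhs -> rhs can be inferred from fds."""
    return set(rhs) <= attribute_closure(lhs, fds)

# FDs of the STAFFBRANCH example, as read from the text (an assumption).
F = [
    ({"staff#"}, {"sname", "position", "salary", "branch#", "baddress"}),
    ({"branch#"}, {"baddress"}),
    ({"baddress"}, {"branch#"}),
    ({"branch#", "position"}, {"salary"}),
    ({"baddress", "position"}, {"salary"}),
]

print(sorted(attribute_closure({"branch#", "position"}, F)))
# ['baddress', 'branch#', 'position', 'salary']
print(implies(F, {"baddress", "position"}, {"salary"}))  # True
print(implies(F, {"position"}, {"staff#"}))              # False
```

The loop stops when a full pass adds nothing new, so it terminates after at most as many passes as there are attributes.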
Chapter 5 : Normalization
Analysis of Redundancies
Before going into the details of normalization, we discuss redundancies
in databases.
A redundancy in a conceptual schema corresponds to a piece of information that can be derived (that is,
obtained through a series of retrieval operations) from other data in the database.
Whether to keep a redundancy in a database depends on the following factors:
An advantage: a reduction in the number of accesses necessary to obtain the
derived information;
A disadvantage: larger storage requirements (usually at negligible
cost) and the need to carry out additional operations in order to keep the derived data
consistent.
The decision to maintain or delete a redundancy is made by comparing the cost of operations
that involve the redundant information and the storage needed, in the case of presence or absence of
redundancy.
We now look in detail at why normalization is needed.
A serious problem with poorly designed relations is that of update anomalies. These can be classified
into:
Insertion anomalies
Deletion anomalies
Modification anomalies
Insertion Anomalies
An "insertion anomaly" is a failure to place information about a new database entry into all the places in
the database where information about that new entry needs to be stored.
In a properly normalized database, information about a new entry needs to be inserted into only one place
in the database; in an inadequately normalized database, information about a new entry may need to be
inserted into more than one place and, human fallibility being what it is, some of the needed additional
insertions may be missed.
Insertion anomalies can be differentiated into two types based on the following example:
Emp_Dept
Codd proposed the relational data model in 1970. At that time most database systems were based on one
of two older data models (the hierarchical model and the network model); the relational model
revolutionized the database field and largely supplanted these earlier models. Prototype relational database
management systems were developed in pioneering research projects at IBM and UC-Berkeley by the
mid-70s, and several vendors were offering relational database products shortly thereafter. Today, the
relational model is by far the dominant data model and is the foundation for the leading DBMS products,
including IBM's DB2 family, Informix, Oracle, Sybase, Microsoft's Access and SQLServer, FoxBase, and
Paradox. Relational database systems are ubiquitous in the marketplace and represent a multibillion dollar
industry.
The relational model is very simple and elegant; a database is a collection of one or more relations, where
each relation is a table with rows and columns. This simple tabular representation enables even novice
users to understand the contents of a database, and it permits the use of simple, high-level languages to
query the data. The major advantages of the relational model over the older data models are its simple
data representation and the ease with which even complex queries can be expressed.
This chapter introduces the relational model and covers the following issues:
SQL: It was the query language of the pioneering System-R relational DBMS developed at IBM. Over the
years, SQL has become the most widely used language for creating, manipulating, and querying relational
DBMSs. Since many vendors offer SQL products, there is a need for a standard that defines `official SQL.'
The existence of a standard allows users to measure a given vendor's version of SQL for completeness. It
also allows users to distinguish SQL features that are specific to one product from those that are standard;
an application that relies on non-standard features is less portable.
The first SQL standard was developed in 1986 by the American National Standards Institute (ANSI), and
was called SQL-86. There was a minor revision in 1989 called SQL-89, and a major revision in 1992 called
SQL-92. The International Standards Organization (ISO) collaborated with ANSI to develop SQL-92. Most
commercial DBMSs currently support SQL-92. An exciting development is the imminent approval of
SQL:1999, a major extension of SQL-92. While the coverage of SQL in this book is based upon SQL-92,
we will cover the main extensions of SQL:1999 as well.
While we concentrate on the underlying concepts, we also introduce the Data Definition Language (DDL)
features of SQL-92, the standard language for creating, manipulating, and querying data in a relational
DBMS. This allows us to ground the discussion firmly in terms of real database systems.
We discuss the concept of a relation in Section 3.1 and show how to create relations using the SQL
language. An important component of a data model is the set of constructs it provides for specifying
conditions that must be satisfied by the data. Such conditions, called integrity constraints (ICs), enable the
DBMS to reject operations that might corrupt the data. We present integrity constraints in the relational
model in Section 3.2, along with a discussion of SQL support for ICs. We discuss how a DBMS enforces
integrity constraints in Section 3.3. In Section 3.4 we turn to the mechanism for accessing and retrieving
data from the database, query languages, and introduce the querying features of SQL, which we examine
in greater detail in a later chapter.
We then discuss the step of converting an ER diagram into a relational database schema in Section 3.5.
Finally, we introduce views, or tables defined using queries, in Section 3.6. Views can be used to define the
external schema for a database and thus provide the support for logical data independence in the
relational model.
The main construct for representing data in the relational model is a relation. A relation consists of a
relation schema and a relation instance. The relation instance is a table, and the relation schema
describes the column heads for the table. We first describe the relation
schema and then the relation instance. The schema specifies the relation's name, the name of each field (or
column, or attribute), and the domain of each field. A domain is referred to in a relation schema by the
domain name and has a set of associated values.
We use the example of student information in a university database from Chapter 1 to illustrate the parts
of a relation schema:
Students(sid: string, name: string, login: string, age: integer, gpa: real)
This says, for instance, that the field named sid has a domain named string. The set of values associated
with domain string is the set of all character strings.
We now turn to the instances of a relation. An instance of a relation is a set of tuples, also called records,
in which each tuple has the same number of fields as the relation schema. A relation instance can be
thought of as a table in which each tuple is a row, and all rows have the same number of fields. (The term
relation instance is often abbreviated to just relation, when there is no confusion with other aspects of a
relation such as its schema.)
An instance of the Students relation appears in Figure-1. The instance S1 contains
six tuples and has, as we expect from the schema, five fields. Note that no two rows are identical. This is a
requirement of the relational model: each relation is defined to be a set of unique tuples or rows.1 The
order in which the rows are listed is not important.
Figure-1: An instance S1 of the Students relation
                          FIELDS (ATTRIBUTES, COLUMNS)
Field names               sid     name      login           age   gpa
TUPLES (RECORDS, ROWS)    50000   Dave      dave@cs         19    3.3
                          53666   Jones     jones@cs        18    3.4
                          53688   Smith     smith@ee        18    3.2
                          53650   Smith     smith@math      19    3.8
                          53831   Madayan   madayan@music   11    1.8
                          53832   Guldu     guldu@music     12    2.0
1 In practice, commercial systems allow tables to have duplicate rows, but we will assume that a relation is
indeed a set of tuples unless otherwise noted.
Figure below shows the same relation instance. If the fields are named, as in
our schema definitions and figures depicting relation instances, the order of fields does not matter either. However,
an alternative convention is to list fields in a specific order and to refer to a field by its position.
Thus sid is field 1 of Students, login is field 3, and so on. If this convention is used, the order of fields is
significant. Most database systems use a combination of these conventions. For example, in SQL the named fields
convention is used in statements that retrieve tuples, and the ordered fields convention is commonly used
when inserting tuples.
A relation schema specifies the domain of each field or column in the relation instance. These domain
constraints in the schema specify an important condition that we want each instance of the relation to
satisfy: The values that appear in a column must be drawn from the domain associated with that column.
Thus, the domain of a field is essentially the type of that field, in programming language terms, and restricts
the values that can appear in the field.
More formally, let R(f1:D1, ..., fn:Dn) be a relation schema, and for each fi, 1 ≤ i ≤ n, let Domi be the set of
values associated with the domain named Di. An instance of R that satisfies the domain constraints in the
schema is a set of tuples with n fields:
{ ⟨f1 : d1, ..., fn : dn⟩ | d1 ∈ Dom1, ..., dn ∈ Domn }
The angular brackets ⟨...⟩ identify the fields of a tuple. Using this notation, the first Students tuple shown
in Figure-1 is written as ⟨sid: 50000, name: Dave, login: dave@cs, age: 19, gpa: 3.3⟩. The curly
brackets {...} denote a set (of tuples, in this definition). The vertical bar | should be read `such that,' the
symbol ∈ should be read `in,' and the expression to the right of the vertical bar is a condition that must
be satisfied by the field values of each tuple in the set. Thus, an instance of R is defined as a set of tuples.
The fields of each tuple must correspond to the fields in the relation schema.
Domain constraints are so fundamental in the relational model that we will henceforth consider only
relation instances that satisfy them; therefore, relation instance means relation instance that satisfies the
domain constraints in the relation schema.
The degree, also called arity, of a relation is the number of fields. The cardinality of a relation instance
is the number of tuples in it. In Figure-1, the degree of the relation (the number of columns) is five, and the
cardinality of this instance is six.
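The definitions of domain constraint, degree, and cardinality can be made concrete with a small Python sketch. The mapping of the domains string, integer, and real to the Python types str, int, and float is an assumption of this sketch, and the helper names are hypothetical.

```python
# Schema as (field name, Python type) pairs. Mapping string -> str,
# integer -> int and real -> float is an assumption of this sketch.
STUDENTS_SCHEMA = [
    ("sid", str), ("name", str), ("login", str), ("age", int), ("gpa", float),
]
FIELDS = [name for name, _ in STUDENTS_SCHEMA]

def satisfies_domains(tup, schema):
    """True if tup has exactly the schema's fields, each value in its domain."""
    if set(tup) != {name for name, _ in schema}:
        return False
    return all(isinstance(tup[name], dom) for name, dom in schema)

# The six tuples of the Students instance in Figure-1.
ROWS = [
    ("50000", "Dave", "dave@cs", 19, 3.3),
    ("53666", "Jones", "jones@cs", 18, 3.4),
    ("53688", "Smith", "smith@ee", 18, 3.2),
    ("53650", "Smith", "smith@math", 19, 3.8),
    ("53831", "Madayan", "madayan@music", 11, 1.8),
    ("53832", "Guldu", "guldu@music", 12, 2.0),
]
s1 = [dict(zip(FIELDS, row)) for row in ROWS]

degree = len(STUDENTS_SCHEMA)   # number of fields
cardinality = len(s1)           # number of tuples in this instance
all_legal = all(satisfies_domains(t, STUDENTS_SCHEMA) for t in s1)
bad_tuple = dict(s1[0], age="nineteen")  # age is not drawn from integer

print(degree, cardinality, all_legal)                  # 5 6 True
print(satisfies_domains(bad_tuple, STUDENTS_SCHEMA))   # False
```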
A relational database is a collection of relations with distinct relation names. The relational database
schema is the collection of schemas for the relations in the database. For example, in Chapter 1, we
discussed a university database with relations called Students, Faculty, Courses, Rooms, Enrolled,
Teaches, and Meets In. An instance of a relational database is a collection of relation instances, one per
relation schema in the database schema; of course, each relation instance must satisfy the domain
constraints in its schema.
The SQL-92 language standard uses the word table to denote relation, and we will often follow this
convention when discussing SQL. The subset of SQL that supports the creation, deletion, and modification
of tables is called the Data Definition Language (DDL). Further, while there is a command that lets users
define new domains, analogous to type definition commands in a programming language, we postpone a
discussion of domain definition until Section 5.11. For now, we will just consider domains that are built-in
types, such as integer.
The CREATE TABLE statement is used to define a new table.2 To create the Students relation, we can use
the following statement:
CREATE TABLE Students ( sid CHAR(20),
name CHAR(30),
login CHAR(20),
age INTEGER,
gpa REAL )
Tuples are inserted using the INSERT command. We can insert a single tuple into the Students table as
follows:
INSERT
INTO Students (sid, name, login, age, gpa)
VALUES (53688, 'Smith', 'smith@ee', 18, 3.2)
We can optionally omit the list of column names in the INTO clause and list the values in the appropriate
order, but it is good style to be explicit about column names.
2 SQL also provides statements to destroy tables and to change the columns associated with a table.
We can delete tuples using the DELETE command. We can delete all Students tuples with name equal to
Smith using the command:
DELETE
FROM Students S
WHERE S.name = 'Smith'
We can modify the column values in an existing row using the UPDATE command. For example, we can
increment the age and decrement the gpa of the student with sid 53688:
UPDATE Students S
SET S.age = S.age + 1, S.gpa = S.gpa - 1
WHERE S.sid = 53688
These examples illustrate some important points. The WHERE clause is applied first and determines which
rows are to be modified. The SET clause then determines how these rows are to be modified. If the column
that is being modified is also used to determine the new value, the value used in the expression on the
right side of equals (=) is the old value, that is, before the modification. To illustrate these points
further, consider the following variation of the previous query:
UPDATE Students S
If this query is applied on the instance S1 of Students shown in Figure-1, we obtain the instance shown
in Figure-3.
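The statements discussed in this section can be tried out with Python's built-in sqlite3 module. This is only a sketch under stated assumptions: SQLite is used as a stand-in for a SQL-92 DBMS, the table alias S is dropped to keep the SQL portable, sid values are stored as strings, and the variable names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Students "
    "(sid CHAR(20), name CHAR(30), login CHAR(20), age INTEGER, gpa REAL)"
)

# The six tuples of the Students instance shown in Figure-1.
s1 = [
    ("50000", "Dave", "dave@cs", 19, 3.3),
    ("53666", "Jones", "jones@cs", 18, 3.4),
    ("53688", "Smith", "smith@ee", 18, 3.2),
    ("53650", "Smith", "smith@math", 19, 3.8),
    ("53831", "Madayan", "madayan@music", 11, 1.8),
    ("53832", "Guldu", "guldu@music", 12, 2.0),
]
# Naming the columns explicitly, as the text recommends.
conn.executemany(
    "INSERT INTO Students (sid, name, login, age, gpa) VALUES (?, ?, ?, ?, ?)", s1
)

# WHERE picks the rows first; SET then computes new values from the old ones.
conn.execute("UPDATE Students SET age = age + 1, gpa = gpa - 1 WHERE sid = '53688'")
age_after, gpa_after = conn.execute(
    "SELECT age, gpa FROM Students WHERE sid = '53688'"
).fetchone()

# Remove every tuple whose name is Smith (two rows in this instance).
conn.execute("DELETE FROM Students WHERE name = 'Smith'")
remaining = conn.execute("SELECT COUNT(*) FROM Students").fetchone()[0]

print(age_after, round(gpa_after, 1), remaining)  # 19 2.2 4
```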
A database is only as good as the information stored in it, and a DBMS must therefore help prevent the
entry of incorrect information. An integrity constraint (IC) is a condition that is specified on a database
schema, and restricts the data that can be stored in an instance of the database.
If a database instance satisfies all the integrity constraints specified on the database schema, it is a legal
instance. A DBMS enforces integrity constraints, in that it permits only legal instances to be stored
in the database.
Integrity constraints are specified and enforced at different times:
1. When the DBA or end user defines a database schema, he or she specifies the ICs that must hold on
any instance of this database.
2. When a database application is run, the DBMS checks for violations and disallows changes to the data
that violate the specified ICs. (In some situations, rather than disallow the change, the DBMS might
instead make some compensating changes to the data to ensure that the database instance satisfies
all ICs. In any case, changes to the database are not allowed to create an instance that violates any
IC.)
Many kinds of integrity constraints can be specified in the relational model. We have already seen one
example of an integrity constraint in the domain constraints associated with a relation schema (Section 3.1).
In general, other kinds of constraints can be specified as well; for example, no two students have the same
sid value. In this section we discuss the integrity constraints, other than domain constraints, that a DBA or
user can specify in the relational model.
Consider the Students relation and the constraint that no two students have the same student id. This IC is
an example of a key constraint. A key constraint is a statement that a certain minimal subset of the fields of
a relation is a unique identifier for a tuple. A set of fields that uniquely identifies a tuple according to a key
constraint is called a candidate key for the relation; we often abbreviate this to just key. In the case of the
Students relation, the (set of fields containing just the) sid field is a candidate key.
Let us take a closer look at the above definition of a (candidate) key. There are two parts to the definition:3
1. Two distinct tuples in a legal instance (an instance that satisfies all ICs, including the key constraint)
cannot have identical values in all the fields of a key.
2. No proper subset of the set of fields in a key is a unique identifier for a tuple.
3 The term key is rather overworked. In the context of access methods, we speak of search keys, which
are quite different.
The first part of the definition means that in any legal instance, the values in the key fields uniquely identify
a tuple in the instance.
When specifying a key constraint, the DBA or user must be sure that this constraint will not prevent them
from storing a `correct' set of tuples. (A similar comment applies to the specification of other kinds of integrity
constraints as well.) The notion of `correctness' here depends upon the nature of the data being stored.
For example, several students may
have the same name, although each student has a unique student id. If the name field is declared to be a
key, the DBMS will not allow the Students relation to contain two tuples describing different students with
the same name!
The second part of the definition means, for example, that the set of fields {sid, name} is not a key for
Students, because this set properly contains the key {sid}. The set {sid, name} is an example of a superkey,
which is a set of fields that contains a key.
Look again at the instance of the Students relation in Figure 3. Observe that two different rows always
have different sid values; sid is a key and uniquely identifies a tuple. However, this does not hold for
nonkey fields. For example, the relation contains two rows with Smith in the name field.
Note that every relation is guaranteed to have a key. Since a relation is a set of tuples, the set of all fields is
always a superkey. If other constraints hold, some subset of the fields may form a key, but if not, the set of
all fields is a key.
A relation may have several candidate keys. For example, the login and age fields of the Students relation
may, taken together, also identify students uniquely. That is, {login, age} is also a key. It may seem that
login is a key, since no two rows in the example instance have the same login value. However, the key must
identify tuples uniquely in all possible legal instances of the relation. By stating that {login, age} is a key,
the user is declaring that two students may have the same login or age, but not both.
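The difference between what one instance shows and what a key constraint declares can be seen in a short Python sketch. The function names are hypothetical, and the rows are the Students tuples from Figure-1 with the gpa field left out for brevity.

```python
from itertools import combinations

def is_superkey(rows, fields):
    """True if no two rows of this instance agree on all of the given fields."""
    projected = [tuple(r[f] for f in fields) for r in rows]
    return len(set(projected)) == len(projected)

def is_minimal_key(rows, fields):
    """True if fields is a superkey of this instance and no proper subset is."""
    if not is_superkey(rows, fields):
        return False
    return not any(
        is_superkey(rows, subset)
        for k in range(1, len(fields))
        for subset in combinations(fields, k)
    )

# Students tuples from Figure-1 (gpa omitted).
s1 = [
    {"sid": "50000", "name": "Dave", "login": "dave@cs", "age": 19},
    {"sid": "53666", "name": "Jones", "login": "jones@cs", "age": 18},
    {"sid": "53688", "name": "Smith", "login": "smith@ee", "age": 18},
    {"sid": "53650", "name": "Smith", "login": "smith@math", "age": 19},
    {"sid": "53831", "name": "Madayan", "login": "madayan@music", "age": 11},
    {"sid": "53832", "name": "Guldu", "login": "guldu@music", "age": 12},
]

print(is_superkey(s1, ["sid"]))              # True
print(is_superkey(s1, ["name"]))             # False, two rows share Smith
print(is_superkey(s1, ["login"]))            # True, but only in this instance
print(is_minimal_key(s1, ["login", "age"]))  # False here, login alone suffices
```

In this particular instance login alone happens to be unique, which is exactly why inspecting one instance cannot establish a key; the constraint is a declaration about all legal instances.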
Out of all the available candidate keys, a database designer can identify a primary key. Intuitively, a tuple
can be referred to from elsewhere in the database by storing the values of its primary key fields. For
example, we can refer to a Students tuple by storing its sid value. As a consequence of referring to student
tuples in this manner, tuples are frequently accessed by specifying their sid value. In principle, we can use
any key, not just the primary key, to refer to a tuple. However, using the primary key is preferable because
it is what the DBMS expects and optimizes for (this is the significance of designating a particular candidate
key as a primary key). For example, the DBMS may create an index with the primary key fields as the search
key, to make the retrieval of a tuple given its primary key value efficient. The idea of referring to a tuple is
developed further in the next section.
In SQL we can declare that a subset of the columns of a table constitute a key by using the UNIQUE
constraint. At most one of these `candidate' keys can be declared to be a primary key, using the PRIMARY
KEY constraint. (SQL does not require that such constraints be declared for a table.)
Let us revisit our example table definition and specify key information:
CREATE TABLE Students ( sid CHAR(20),
name CHAR(30),
login CHAR(20),
age INTEGER,
gpa REAL,
UNIQUE (name, age),
CONSTRAINT StudentsKey PRIMARY KEY (sid) )
This definition says that sid is the primary key and that the combination of name and age is also a key. The
definition of the primary key also illustrates how we can name a constraint by preceding it with CONSTRAINT
constraint-name. If the constraint is violated, the constraint name is returned and can be used to identify
the error.
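How a DBMS reacts to key violations can be observed with Python's sqlite3 module. This is a sketch only: SQLite stands in for a SQL-92 system, the constraint name StudentsKey and the helper try_insert are illustrative, and the wording of the error message is specific to each DBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Students (
        sid   CHAR(20),
        name  CHAR(30),
        login CHAR(20),
        age   INTEGER,
        gpa   REAL,
        UNIQUE (name, age),
        CONSTRAINT StudentsKey PRIMARY KEY (sid)
    )
""")
conn.execute("INSERT INTO Students VALUES ('53688', 'Smith', 'smith@ee', 18, 3.2)")

def try_insert(row):
    """Return None on success, or the DBMS error text if a key is violated."""
    try:
        conn.execute("INSERT INTO Students VALUES (?, ?, ?, ?, ?)", row)
        return None
    except sqlite3.IntegrityError as exc:
        return str(exc)

# Same sid as an existing row: violates the primary key.
dup_sid = try_insert(("53688", "Jones", "jones@cs", 19, 3.4))
# Same (name, age) pair as an existing row: violates the UNIQUE constraint.
dup_name_age = try_insert(("53650", "Smith", "smith@math", 18, 3.8))
# Same name but a different age: allowed.
accepted = try_insert(("53650", "Smith", "smith@math", 19, 3.8))

print(dup_sid is not None, dup_name_age is not None, accepted is None)
# True True True
```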
Sometimes the information stored in a relation is linked to the information stored in another relation. If one
of the relations is modified, the other must be checked, and perhaps modified, to keep the data consistent. An
IC involving both relations must be specified if a DBMS is to make such checks. The most common IC
involving two relations is a foreign key constraint.
Suppose that, in addition to Students, we have a second relation, Enrolled, that records the courses in which
students are enrolled. To ensure that only bona fide students can enroll in courses, any value that appears in the sid
field of an instance of the Enrolled relation should also appear in the sid field of some tuple in the Students
relation. The sid field of Enrolled is called a foreign key and refers to Students. The foreign key in
the referencing relation (Enrolled, in our example) must match the primary key of the referenced relation
(Students), i.e., it must have the same number of columns and compatible data types, although the column
names can be different.
This constraint is illustrated in Figure-4. As the figure shows, there may be some students who are not
referenced from Enrolled (e.g., the student with sid=50000).
However, every sid value that appears in the instance of the Enrolled table appears in the primary key
column of a row in the Students table.
Figure-4: An instance E1 of Enrolled (the referencing relation)
course         grade   sid
Carnatic101    C       53831
Reggae203      B       53832
Topology112    A       53650
History105     B       53666
If we try to insert the tuple ⟨55555, Art104, A⟩ into E1, the IC is violated because there is no tuple
in S1 with the id 55555; the database system should reject such an insertion. Similarly, if we
delete the tuple ⟨53666, Jones, jones@cs, 18, 3.4⟩ from S1, we violate the foreign key constraint
because the tuple ⟨53666, History105, B⟩ in E1 contains sid value 53666, the sid of the deleted
Students tuple. The DBMS should disallow the deletion or, perhaps, also delete the Enrolled tuple
that refers to the deleted Students tuple. We discuss foreign key constraints and their impact on
updates in Section 3.3.
Finally, we note that a foreign key could refer to the same relation. For example, we could extend
the Students relation with a column called partner and declare this column to be a foreign key
referring to Students. Intuitively, every student could then have a partner, and the partner field
contains the partner's sid. The observant reader will no doubt ask, "What if a student does not
(yet) have a partner?" This situation is handled in SQL by using a special value called null. The use
of null in a field of a tuple means that the value in that field is either unknown or not applicable (e.g., we
do not know the partner yet, or there is no partner). The appearance of null in a foreign key field
does not violate the foreign key constraint. However, null values are not allowed to appear in a
primary key field (because the primary key fields are used to identify a tuple uniquely). We will
discuss null values further.
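The two violations described above, and the treatment of null in a foreign key field, can be reproduced with Python's sqlite3 module. The sketch makes several assumptions: SQLite stands in for the DBMS and enforces foreign keys only after PRAGMA foreign_keys = ON, the course column is named cid, the Students table is cut down to three columns, and the helper rejected is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled
conn.execute("""
    CREATE TABLE Students (
        sid     CHAR(20) PRIMARY KEY,
        name    CHAR(30),
        partner CHAR(20) REFERENCES Students (sid)  -- refers to the same relation
    )
""")
conn.execute("""
    CREATE TABLE Enrolled (
        sid   CHAR(20),
        cid   CHAR(20),
        grade CHAR(10),
        PRIMARY KEY (sid, cid),
        FOREIGN KEY (sid) REFERENCES Students (sid)
    )
""")
# A null partner does not violate the foreign key constraint.
conn.execute("INSERT INTO Students VALUES ('53666', 'Jones', NULL)")
conn.execute("INSERT INTO Enrolled VALUES ('53666', 'History105', 'B')")

def rejected(sql):
    """True if the DBMS refuses the statement because of an integrity constraint."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True

# No Students tuple has sid 55555, so this insertion must fail.
insert_rejected = rejected("INSERT INTO Enrolled VALUES ('55555', 'Art104', 'A')")
# Student 53666 is still referenced from Enrolled, so the deletion must fail.
delete_rejected = rejected("DELETE FROM Students WHERE sid = '53666'")

print(insert_rejected, delete_rejected)  # True True
```

Here the DBMS disallows the deletion; the alternative mentioned in the text, deleting the referring Enrolled tuple as well, would need an explicit ON DELETE CASCADE clause on the foreign key.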
The foreign key constraint states that every sid value in Enrolled must also appear in Students, that
is, sid in Enrolled is a foreign key referencing Students. Incidentally, the primary key constraint
states that a student has exactly one grade for each course that he or she is enrolled in. If we want
to record more than one grade per student per course, we should change the primary key
constraint.
General Constraints
Domain, primary key, and foreign key constraints are considered to be a fundamental part of the
relational data model and are given special attention in most commercial systems. Sometimes,
however, it is necessary to specify more general constraints.
For example, we may require that student ages be within a certain range of values; given such an
IC specification, the DBMS will reject inserts and updates that violate the constraint. This is very
useful in preventing data entry errors. If we specify that all students must be at least 16 years old,
the instance of Students shown in Figure- 1 is illegal because two students are underage. If we
disallow the insertion of these two tuples, we have a legal instance, as shown in Figure- 5.
The IC that students must be at least 16 years old can be thought of as an extended domain constraint,
since we are essentially defining the set of permissible age values more stringently than is possible
by simply using a standard domain such as integer. In general, however, constraints that go well
beyond domain, key, or foreign key constraints can be specified. For example, we could require that
every student whose age is greater than 18 must have a gpa greater than 3.
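Both example constraints can be written as CHECK clauses. The following is a minimal sketch, again using Python's sqlite3 module as an assumed test bed; the two rejected rows are hypothetical, and the implication "age greater than 18 implies gpa greater than 3" is encoded as `age <= 18 OR gpa > 3`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Students (
        sid CHAR(20) PRIMARY KEY, name CHAR(30), login CHAR(20),
        age INTEGER, gpa REAL,
        CHECK (age >= 16),            -- extended domain constraint on age
        CHECK (age <= 18 OR gpa > 3)  -- age > 18 implies gpa > 3
    )
""")

def try_insert(row):
    """Attempt one insertion and report whether the DBMS accepted it."""
    try:
        conn.execute("INSERT INTO Students VALUES (?, ?, ?, ?, ?)", row)
        return "accepted"
    except sqlite3.IntegrityError:
        return "rejected"

results = [
    try_insert(("53666", "Jones", "jones@cs", 18, 3.4)),  # satisfies both constraints
    try_insert(("90001", "Young", "young@cs", 12, 3.9)),  # hypothetical row: underage
    try_insert(("90002", "Older", "older@cs", 19, 2.5)),  # hypothetical row: gpa too low
]
```

Constraints that span several tables cannot be expressed this way; those are the cases for which assertions are intended.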
Current relational database systems support such general constraints in the form of table
constraints and assertions. Table constraints are associated with a single table and are checked
whenever that table is modified. In contrast, assertions involve several tables and are checked
whenever any of these tables is modified. Both table constraints and assertions can use the full
power of SQL queries to specify the desired restriction. We discuss SQL support for table
constraints and assertions in Section 5.11 because a full appreciation of their power requires a
good grasp of SQL's query capabilities.
As we observed earlier, ICs are specified when a relation is created and enforced when a relation is
modified. The impact of domain, PRIMARY KEY, and UNIQUE constraints is straightforward: if an
insert, delete, or update command causes a violation, it is rejected. Potential IC violation is
generally checked at the end of each SQL statement execution, although it can be deferred until
the end of the transaction executing the statement.
Consider the instance S1 of Students shown in Figure 1. The following insertion violates the
primary key constraint because there is already a tuple with the sid 53688, and it will be rejected
by the DBMS:
INSERT
The following insertion violates the constraint that the primary key cannot contain null:
INSERT
Of course, a similar problem arises whenever we try to insert a tuple with a value in a field that is not
in the domain associated with that field, i.e., whenever we violate a domain constraint. Deletion
does not cause a violation of domain, primary key or unique constraints. However, an update can
cause violations, similar to an insertion:
UPDATE Students S
This update violates the primary key constraint because there is already a tuple with sid 50000.
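The three kinds of primary key violation discussed here can be reproduced with the sketch below. It is an illustration under assumptions: Python's sqlite3 module stands in for the DBMS, every field value other than the sids 53688 and 50000 is made up, and NOT NULL is spelled out because SQLite, unlike the SQL standard, does not imply it for this kind of primary key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Students (sid CHAR(20) NOT NULL PRIMARY KEY, name CHAR(30),
                           login CHAR(20), age INTEGER, gpa REAL);
    -- Only the sid values come from the text; the other fields are placeholders.
    INSERT INTO Students VALUES ('50000', 'Student A', 'a@cs', 19, 3.3);
    INSERT INTO Students VALUES ('53688', 'Student B', 'b@ee', 18, 3.2);
""")

def rejected(sql):
    """Return True if the DBMS rejects the statement with an integrity error."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True

# A second tuple with sid 53688 duplicates an existing primary key value.
duplicate_rejected = rejected(
    "INSERT INTO Students VALUES ('53688', 'Mike', 'mike@ee', 17, 3.4)")
# A primary key field cannot contain null.
null_rejected = rejected(
    "INSERT INTO Students VALUES (NULL, 'Mike', 'mike@ee', 17, 3.4)")
# Changing sid 53688 to 50000 would collide with the existing tuple 50000.
update_rejected = rejected("UPDATE Students SET sid = '50000' WHERE sid = '53688'")
```

In each case the statement fails as a whole and the relation is left unchanged.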
The impact of foreign key constraints is more complex because SQL sometimes tries to rectify a
foreign key constraint violation instead of simply rejecting the change. We will
discuss the referential integrity enforcement steps taken by the DBMS in terms of our Enrolled and
Students tables, with the foreign key constraint that Enrolled.sid is a reference to (the primary key
of) Students.
In addition to the instance S1 of Students, consider the instance of Enrolled shown in Figure 4.
Deletions of Enrolled tuples do not violate referential integrity, but insertions of Enrolled tuples
could. The following insertion is illegal because there is no student with sid 51111:
INSERT
On the other hand, insertions of Students tuples do not violate referential integrity although
deletions could. Further, updates on either Enrolled or Students that change the sid value could
potentially violate referential integrity.
SQL-92 provides several alternative ways to handle foreign key violations. We must consider three
basic questions:
1. What should we do if an Enrolled row is inserted, with a sid column value that does not appear
in any row of the Students table? In this case the insertion is simply rejected.
2. What should we do if a Students row is deleted? The options are:
Delete all Enrolled rows that refer to the deleted Students row.
Disallow the deletion of the Students row if an Enrolled row refers to it.
Set the sid column to the sid of some (existing) 'default' student, for every Enrolled row that refers
to the deleted Students row.
For every Enrolled row that refers to it, set the sid column to null. In our example, this option
conflicts with the fact that sid is part of the primary key of Enrolled and therefore cannot be set to
null. Thus, we are limited to the first three options in our example, although this fourth option
(setting the foreign key to null) is available in the general case.
3. What should we do if the sid value of a Students row is updated? The options are similar to
those for deletion.
SQL-92 allows us to choose any of the four options on DELETE and UPDATE. For example, we can
specify that when a Students row is deleted, all Enrolled rows that refer to it are to be deleted as
well, but that when the sid column of a Students row is modified, this update is to be rejected if an
Enrolled row refers to the modified Students row:
ON DELETE CASCADE
ON UPDATE NO ACTION )
The options are specified as part of the foreign key declaration. The default option is
NO ACTION, which means that the action (DELETE or UPDATE) is to be rejected. Thus, the ON
UPDATE clause in our example could be omitted, with the same effect. The CASCADE keyword says
that if a Students row is deleted, all Enrolled rows that refer to it are to be deleted as well. If the
ON UPDATE clause specified CASCADE, and the sid column of a Students row is updated, this update is
also carried out in each Enrolled row that refers to the updated Students row.
If a Students row is deleted, we can switch the enrollment to a 'default' student by using ON
DELETE SET DEFAULT. The default student is specified as part of the definition of the sid field in
Enrolled; for example, sid CHAR(20) DEFAULT '53666'. Although the specification of a default value
is appropriate in some situations (e.g., a default parts supplier if a particular supplier goes out of
business), it is really not appropriate to switch enrollments to a default student. The correct
solution in this example is to also delete all enrollment tuples for the deleted student (that is,
CASCADE), or to reject the deletion.
SQL also allows the use of null as the default value by specifying ON DELETE SET NULL.
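The effect of ON DELETE CASCADE combined with ON UPDATE NO ACTION can be observed with a short sketch. It assumes Python's sqlite3 module as the test bed and a Students table reduced to two columns for brevity; the replacement sid 99999 is a made-up value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when asked
conn.executescript("""
    CREATE TABLE Students (sid CHAR(20) PRIMARY KEY, name CHAR(30));  -- reduced schema
    CREATE TABLE Enrolled (sid CHAR(20), cid CHAR(20), grade CHAR(10),
                           PRIMARY KEY (sid, cid),
                           FOREIGN KEY (sid) REFERENCES Students
                               ON DELETE CASCADE
                               ON UPDATE NO ACTION);
    INSERT INTO Students VALUES ('53666', 'Jones'), ('53650', 'Smith');
    INSERT INTO Enrolled VALUES ('53666', 'History105', 'B'),
                                ('53650', 'Topology112', 'A');
""")

# ON UPDATE NO ACTION: changing a sid that Enrolled still refers to is rejected.
try:
    conn.execute("UPDATE Students SET sid = '99999' WHERE sid = '53650'")
    update_rejected = False
except sqlite3.IntegrityError:
    update_rejected = True

# ON DELETE CASCADE: deleting Jones also deletes the Enrolled row that refers to Jones.
conn.execute("DELETE FROM Students WHERE sid = '53666'")
remaining = conn.execute("SELECT sid, cid FROM Enrolled").fetchall()
```

After the deletion only the enrollment of student 53650 is left, which is the behavior the CASCADE option promises.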
A relational database query (query, for short) is a question about the data, and the answer consists
of a new relation containing the result. For example, we might want to find all students younger than
18 or all students enrolled in Reggae203. A query language is a specialized language for writing
queries.
SQL is the most popular commercial query language for a relational DBMS. We now present some
SQL examples that illustrate how easily relations can be queried. Consider the instance of the
Students relation shown in Figure 1. We can retrieve rows corresponding to students who are
younger than 18 with the following SQL query:
SELECT *
FROM Students S
WHERE S.age < 18
The symbol * means that we retain all fields of selected tuples in the result. To understand this
query, think of S as a variable that takes on the value of each tuple in Students, one tuple after the
other. The condition S.age < 18 in the WHERE clause specifies that we want to select only tuples in
which the age field has a value less than 18.
This example illustrates that the domain of a field restricts the operations that are permitted on field
values, in addition to restricting the values that can appear in the field. The condition S.age < 18
involves an arithmetic comparison of an age value with an integer and is permissible because the
domain of age is the set of integers. On the other hand, a condition such as S.age = S.sid does not
make sense because it compares an integer value with a string value, and this comparison is
defined to fail in SQL; a query containing this condition will produce no answer tuples.
In addition to selecting a subset of tuples, a query can extract a subset of the fields of each selected
tuple. We can compute the names and logins of students who are younger than 18 with the
following query:
SELECT S.name, S.login
FROM Students S
WHERE S.age < 18
Figure 7 shows the answer to this query; it is obtained by applying the selection to the instance
S1 of Students (to get the relation shown in Figure 6), followed by removing unwanted fields. Note
that the order in which we perform these operations does matter: if we remove unwanted fields
first, we cannot check the condition S.age < 18, which involves one of those fields.
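The selection and the projection can be run side by side in a small sketch. Python's sqlite3 module is assumed as the test bed; the Jones tuple is taken from the text, while the age and gpa values of the two music students are assumptions made only so that both fall below 18. The ORDER BY clause is added to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Students (sid CHAR(20) PRIMARY KEY, name CHAR(30), login CHAR(20),
                           age INTEGER, gpa REAL);
    INSERT INTO Students VALUES ('53666', 'Jones', 'jones@cs', 18, 3.4);
    -- age and gpa below are assumed values, not taken from the text
    INSERT INTO Students VALUES ('53831', 'Madayan', 'madayan@music', 11, 1.8);
    INSERT INTO Students VALUES ('53832', 'Guldu', 'guldu@music', 12, 2.0);
""")

# Selection only: keep every field of the tuples that satisfy the condition.
selected = conn.execute(
    "SELECT * FROM Students S WHERE S.age < 18 ORDER BY S.sid").fetchall()

# Selection followed by projection: keep only name and login of those tuples.
answer = conn.execute(
    "SELECT S.name, S.login FROM Students S WHERE S.age < 18 ORDER BY S.sid").fetchall()
```

Here `selected` plays the role of the intermediate relation and `answer` the role of the final result, with the age field already removed.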
We can also combine information in the Students and Enrolled relations. If we want to obtain the
names of all students who obtained an A and the id of the course in which they got an A, we could
write the following query:
SELECT S.name, E.cid
FROM Students S, Enrolled E
WHERE S.sid = E.sid AND E.grade = 'A'
DISTINCT types in SQL: A comparison of two values drawn from different domains should fail, even
if the values are 'compatible' in the sense that both are numeric or both are string values, etc. For
example, if salary and age are two different domains whose values are represented as integers, a
comparison of a salary value with an age value should fail. Unfortunately, SQL-92's support for the
concept of domains does not go this far: we are forced to define salary and age as integer types,
and the comparison S < A will succeed when S is bound to the salary value 25 and A is bound to
the age value 50. The latest version of the SQL standard, called SQL:1999, addresses this problem,
and allows us to define salary and age as DISTINCT types even though their values are represented
as integers. Many systems, e.g., Informix UDS and IBM DB2, already support this feature.
Name login
Madayan madayan@music
Guldu guldu@music
The query that combines Students and Enrolled can be understood as follows: "If there is a Students
tuple S and an Enrolled tuple E such that S.sid = E.sid (so that S describes the student who is
enrolled in E) and E.grade = 'A', then print the student's name and the course id." When evaluated
on the instances of Students and Enrolled in Figure 4, this query returns a single tuple, ⟨Smith, Topology112⟩.
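The reading given above can be checked against the Enrolled instance with a sketch along these lines. Python's sqlite3 module is assumed as the test bed, Students is reduced to two columns, and the pairing of the names Madayan and Guldu with sids 53831 and 53832 is an assumption; the Smith and Jones pairings follow from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Students (sid CHAR(20) PRIMARY KEY, name CHAR(30));  -- reduced schema
    CREATE TABLE Enrolled (cid CHAR(20), grade CHAR(10), sid CHAR(20),
                           PRIMARY KEY (sid, cid),
                           FOREIGN KEY (sid) REFERENCES Students);
    INSERT INTO Students VALUES ('53831', 'Madayan'), ('53832', 'Guldu'),
                                ('53650', 'Smith'), ('53666', 'Jones');
    INSERT INTO Enrolled VALUES ('Carnatic101', 'C', '53831'),
                                ('Reggae203', 'B', '53832'),
                                ('Topology112', 'A', '53650'),
                                ('History105', 'B', '53666');
""")

# Pair each student with his or her enrollments and keep only the A grades.
answer = conn.execute("""
    SELECT S.name, E.cid
    FROM Students S, Enrolled E
    WHERE S.sid = E.sid AND E.grade = 'A'
""").fetchall()
```

Only one Enrolled tuple carries the grade A, so the answer contains the single pair of Smith and Topology112.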
The ER model is convenient for representing an initial, high-level database design. Given an ER
diagram describing a database, there is a standard approach to generating a relational database
schema that closely approximates the ER design. (The translation is approximate to the extent that
we cannot capture all the constraints implicit in the ER design using SQL-92, unless we use certain
SQL-92 constraints that are costly to check.) We now describe how to translate an ER diagram into
a collection of tables with associated constraints, i.e., a relational database schema.
An entity set is mapped to a relation in a straightforward way: Each attribute of the entity set
becomes an attribute of the table. Note that we know both the domain of each attribute and the
(primary) key of an entity set.
Consider the Employees entity set with attributes ssn, name, and lot shown in Figure 8. A
possible instance of the Employees entity set, containing three Employees entities, is shown
below in tabular format.
[Figure 8: The Employees entity set, with attributes ssn, name, and lot]
ssn          name       lot
123-22-3666  Attishoo   48
231-31-5368  Smiley     22
131-24-3650  Smethurst  35
The following SQL statement captures the preceding information, including the domain constraints
and key information:
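CREATE TABLE Employees ( ssn CHAR(11),  -- length chosen to fit values such as 123-22-3666
name CHAR(30),
lot INTEGER,
PRIMARY KEY (ssn) )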
A relationship set, like an entity set, is mapped to a relation in the relational model. We begin by
considering relationship sets without key and participation constraints, and we discuss how to
handle such constraints in subsequent sections. To represent a relationship, we must be able to
identify each participating entity and give values to the descriptive attributes of the relationship.
Thus, the attributes of the relation include:
The primary key attributes of each participating entity set, as foreign key fields.
The descriptive attributes of the relationship set.
The set of non-descriptive attributes is a superkey for the relation. If there are no key constraints
(see Section 2.4.1), this set of attributes is a candidate key.
Consider the Works_In2 relationship set shown in Figure 10. Each department has offices in several
locations and we want to record the locations at which each employee works.
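[Figure 10: The Works_In2 relationship set, with descriptive attribute since, relating Employees (ssn, name, lot), Departments (did, dname, budget), and Locations]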
All the available information about the Works_In2 table is captured by the following SQL definition:
Note that the address, did, and ssn fields cannot take null values. Because these fields are part of
the primary key for Works_In2, a NOT NULL constraint is implicit for each of these fields. This
constraint ensures that these fields uniquely identify a department, an employee, and a location in
each tuple of Works_In2. We can also specify that a particular action is desired when a referenced
Employees, Departments or Locations tuple is deleted, as explained in the discussion of integrity
constraints in Section 3.2. In this chapter we assume that the default action is appropriate except
for situations in which the semantics of the ER diagram require some other action.
[Figure: The Reports_To relationship set, relating Employees (ssn, name, lot) to itself in the roles supervisor and subordinate]
FOREIGN KEY (supervisor_ssn) REFERENCES Employees(ssn),
FOREIGN KEY (subordinate_ssn) REFERENCES Employees(ssn) )
Observe that we need to explicitly name the referenced field of Employees because the field name
differs from the name(s) of the referring field(s).
If a relationship set involves n entity sets and some m of them are linked via arrows in the ER
diagram, the key for any one of these m entity sets constitutes a key for the relation to which the
relationship set is mapped. Thus we have m candidate keys, and one of these should be designated
as the primary key. The translation discussed in Section 2.3 from relationship sets to a relation can
be used in the presence of key constraints, taking into account this point about keys.
Consider the relationship set Manages shown in Figure 12.
[Figure 12: The Manages relationship set, with descriptive attribute since, relating Employees (ssn, name, lot) and Departments (did, dname, budget)]
The table corresponding to Manages has the attributes ssn, did, since. However, because each
department has at most one manager, no two tuples can have the same did value but differ on the
ssn value. A consequence of this observation is that did is itself a key for Manages; indeed, the
set {did, ssn} is not a key (because it is not minimal). The Manages relation can be defined using
the following SQL statement:
did INTEGER,
since DATE,
A second approach to translating a relationship set with key constraints is often superior because
it avoids creating a distinct table for the relationship set. The idea is to include the information
about the relationship set in the table corresponding to the entity set with the key, taking
advantage of the key constraint. In the Manages example, because a department has at most one
manager, we can add the key fields of the Employees tuple denoting the manager and the since
attribute to the Departments tuple.
This approach eliminates the need for a separate Manages relation, and queries asking for a
department's manager can be answered without combining information from two relations. The
only drawback to this approach is that space could be wasted if several departments have no
managers. In this case the added fields would have to be filled with null values. The first translation
(using a separate table for Manages) avoids this inefficiency, but some important queries require us
to combine information from two relations, which can be a slow operation.
The following SQL statement, defining a DeptMgr relation that captures the information in both
Departments and Manages, illustrates the second approach to translating relationship sets with key
constraints:
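A minimal sketch of such a definition follows, run through Python's sqlite3 module as an assumed test bed. The column types, the department names and budgets, and the dates are illustrative assumptions; the attributes themselves (did, dname, budget, ssn, since) are those of Departments and Manages.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when asked
conn.executescript("""
    CREATE TABLE Employees (ssn CHAR(11) PRIMARY KEY, name CHAR(30), lot INTEGER);
    CREATE TABLE DeptMgr (did INTEGER, dname CHAR(20), budget REAL,
                          ssn CHAR(11),  -- the manager; null if there is none
                          since DATE,
                          PRIMARY KEY (did),
                          FOREIGN KEY (ssn) REFERENCES Employees);
    INSERT INTO Employees VALUES ('123-22-3666', 'Attishoo', 48);
    INSERT INTO Employees VALUES ('231-31-5368', 'Smiley', 22);
    -- department rows are hypothetical
    INSERT INTO DeptMgr VALUES (51, 'Toys', 100000.0, '123-22-3666', '2001-01-01');
    INSERT INTO DeptMgr VALUES (56, 'Games', 50000.0, NULL, NULL);
""")

# Because did is the key, a department cannot be given a second manager.
try:
    conn.execute(
        "INSERT INTO DeptMgr VALUES (51, 'Toys', 100000.0, '231-31-5368', '2002-01-01')")
    second_manager_rejected = False
except sqlite3.IntegrityError:
    second_manager_rejected = True

# The manager of a department is found without combining two relations.
manager_of_51 = conn.execute("SELECT ssn FROM DeptMgr WHERE did = 51").fetchone()[0]
# Departments without a manager carry nulls in the added fields.
unmanaged = conn.execute("SELECT COUNT(*) FROM DeptMgr WHERE ssn IS NULL").fetchone()[0]
```

The second department illustrates the drawback noted above: its ssn and since fields hold nothing but nulls.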
This idea can be extended to deal with relationship sets involving more than two entity sets. In
general, if a relationship set involves n entity sets and some m of them are linked via arrows in the
ER diagram, the relation corresponding to any one of the m sets can be augmented to capture the
relationship.
We discuss the relative merits of the two translation approaches further after considering how to
translate relationship sets with participation constraints into tables.
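Consider the ER diagram in Figure 13, which shows two relationship sets, Manages and Works_In.
[Figure 13: The Manages and Works_In relationship sets, each with attribute since, relating Employees (ssn, name, lot) and Departments (did, dname, budget)]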
First instance: To insert a new employee tuple into the Emp_Dept table, we must include either the
attribute values for the department that the employee works for, or nulls (if the employee does not
yet work for a department). For example, to insert a new tuple for an employee who works in
department number 5, we must enter the attribute values of department 5 correctly so
that they are consistent with the values for department 5 in other tuples in Emp_Dept.
Second instance: It is difficult to insert a new department that has no employees as yet in the
Emp_Dept relation. The only way to do this is to place null values in the attributes for the
employee. This causes a problem because SSN is the primary key of the Emp_Dept table and
each tuple is supposed to represent an employee entity, not a department entity.
Moreover, when the first employee is assigned to that department, we no longer need this tuple with
null values.
Deletion Anomalies
A "deletion anomaly" is a failure to remove information about an existing database entry when it is
time to remove that entry. In a properly normalized database, information about an obsolete
entry needs to be deleted from only one place in the database; in an
inadequately normalized database, information about that old entry may need to be deleted from
more than one place, and, human fallibility being what it is, some of the needed additional
deletions may be missed.
The problem of deletion anomaly is related to the second insertion anomaly situation
discussed earlier: if we delete from Emp_Dept an employee tuple that happens to
represent the last employee working for a particular department, the information
concerning that department is lost from the database.
Modification Anomalies
In Emp_Dept, if we change the value of one of the attributes of a particular department (say, the
manager of department 5), we must update the tuples of all employees who work in that
department; otherwise, the database will become inconsistent. If we fail to update some tuples, the
same department will be shown to have two different values for manager in different employee tuples,
which would be wrong.
All three kinds of anomalies are highly undesirable, since their occurrence constitutes
corruption of the database. Properly normalized databases are much less susceptible to
corruption than are unnormalized databases.
Normalization
Designing a normalized database structure is the first step when building a database that is meant
to last. Normalization is a simple, commonsense process that leads to flexible, efficient,
maintainable database structures. We'll examine the major principles and objectives of
normalization and denormalization, and then take a look at some powerful optimization techniques
that can break the rules of normalization.
What is Normalization?
Simply put, normalization is a formal
process for determining which fields belong in which tables in a relational database.
Normalization follows a set of rules worked out at the time relational databases were born. A
normalized relational database provides several benefits:
Elimination of redundant data storage.
Close modeling of real world entities, processes, and their relationships.
Structuring of data so that the model is flexible.
Normalization ensures that you get the benefits relational databases offer. Time spent
learning about normalization will begin paying for itself immediately.
Now we will look into the tasks associated with designing and
implementing a database.
Designing a database structure and implementing a database structure are different tasks. When
you design a structure it should be described without reference to the specific database
tool you will use to implement the system, or what concessions you plan to make for
performance reasons. These steps come later. After you've designed the database structure
abstractly, then you implement it in a particular environment (4D in our case). Too often people
new to database design combine design and implementation in one step. 4D makes this
tempting because the structure editor is so easy to use. Implementing a structure without
designing it quickly leads to flawed structures that are difficult and costly to modify. Design first,
implement second, and you'll finish faster and cheaper.
We've implied that there are various advantages to producing a properly
normalized design before you implement your system. Let's look at a detailed list of the pros and
cons:
Pros of Normalizing:
Cons of Normalizing:
You can't start building the database before you know what the user needs.
From the above, it is clear that the pros outweigh the cons.
Terminology
There are a couple of terms that are central to a discussion of normalization: "key" and
"dependency". These are probably familiar concepts to anyone who has built relational database
systems, though they may not be using these words. We define and discuss them here as
necessary background for the discussion of normal forms that follows.
The above definition merely states that relations are always in first normal form, which
is always correct. However, a relation that is only in first normal form has a structure
that is undesirable for a number of reasons.
First normal form (1NF) sets the very basic rules for an organized database:
Eliminate duplicative columns from the same table.
Create separate tables for each group of related data and identify each row with a unique
column or set of columns (the primary key).
2nd Normal Form (2NF)
Def: A table is in 2NF if it is in 1NF and if all non-key attributes are dependent on the entire key.
Note: Since a partial dependency occurs when a non-key attribute is dependent on only a part of
the (composite) key, the definition of 2NF is sometimes phrased as, "A table is in
2NF if it is in 1NF and if it has no partial dependencies." Recall the general requirements of 2NF:
Remove subsets of data that apply to multiple rows of a table and place them in
separate tables.
Create relationships between these new tables and their predecessors through the
use of foreign keys.
These rules can be summarized in a simple statement: 2NF attempts to reduce the
amount of redundant data in a table by extracting it, placing it in new table(s) and
creating relationships between those tables.
Let's look at an example. Imagine an online store that maintains customer information in a
database. Their Customers table might look something like this:
A brief look at this table reveals a small amount of redundant data. We're storing the
"Sea Cliff, NY 11579" and "Miami, FL 33157" entries twice each. Now, that might not seem like too
much added storage in our simple example, but imagine the wasted space if we had thousands of
rows in our table. Additionally, if the ZIP code for Sea Cliff were to change, we'd need to make
that change in many places throughout the database.
In a 2NF-compliant database structure, this redundant information is extracted and stored in a
separate table. Our new table (let's call it ZIPs) might look like this:
If we want to be super-efficient, we can even fill this table in advance -- the post office provides a
directory of all valid ZIP codes and their city/state relationships. Surely, you've encountered
a situation where this type of database was utilized. Someone taking an order might have asked
you for your ZIP code first and then knew the city and state you were calling from. This type
of arrangement reduces operator error and increases efficiency.
Now that we've removed the duplicative data from the Customers table, we've satisfied the first
rule of second normal form. We still need to use a foreign key to tie the two tables together. We'll
use the ZIP code (the primary key from the ZIPs table) to create that relationship. Here's our new
Customers table:
We've now minimized the amount of redundant information stored within the database and our
structure is in second normal form.
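The decomposition into Customers and ZIPs can be sketched as follows, using Python's sqlite3 module as an assumed test bed. The two city, state and ZIP combinations come from the example above; the customer names, the column names, and the corrected city spelling are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when asked
conn.executescript("""
    CREATE TABLE ZIPs (zip CHAR(5) PRIMARY KEY, city CHAR(30), state CHAR(2));
    CREATE TABLE Customers (cust_num INTEGER PRIMARY KEY, name CHAR(30),
                            zip CHAR(5) REFERENCES ZIPs(zip));
    INSERT INTO ZIPs VALUES ('11579', 'Sea Cliff', 'NY'), ('33157', 'Miami', 'FL');
    -- customer rows are hypothetical
    INSERT INTO Customers VALUES (1, 'Customer A', '11579'), (2, 'Customer B', '33157'),
                                 (3, 'Customer C', '11579'), (4, 'Customer D', '33157');
""")

# Each city and state is stored once; a join recovers the full address per customer.
addresses = conn.execute("""
    SELECT C.name, Z.city, Z.state, Z.zip
    FROM Customers C, ZIPs Z
    WHERE C.zip = Z.zip
    ORDER BY C.cust_num
""").fetchall()

# A correction to a city name touches one row, however many customers live there.
rows_changed = conn.execute(
    "UPDATE ZIPs SET city = 'Sea Cliff Village' WHERE zip = '11579'").rowcount
```

The foreign key from Customers.zip to ZIPs is what ties the two tables back together.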
3rd Normal Form (3NF)
Def: A table is in 3NF if it is in 2NF and if it has no transitive dependencies. The basic requirements
of 3NF are as follows:
Meet the requirements of 1NF and 2NF
Remove columns that are not fully dependent upon the primary key.
Imagine that we have a table of widget orders:
Remember, our first requirement is that the table must satisfy the requirements of 1NF and 2NF.
Are there any duplicative columns? No. Do we have a primary key? Yes, the order number.
Therefore, we satisfy the requirements of 1NF. Are there any subsets of data that apply to multiple
rows? No, so we also satisfy the requirements of 2NF.
Now, are all of the columns fully dependent upon the primary key? The customer number
varies with the order number and it doesn't appear to depend upon any of the other
fields. What about the unit price? This field could be dependent upon the customer number in a
situation where we charged each customer a set price. However, looking at the data above, it
appears we sometimes charge the same customer different prices. Therefore, the unit price is
fully dependent upon the order number. The quantity of items also varies from order to order, so
we're OK there.
What about the total? It looks like we might be in trouble here. The total can be derived by
multiplying the unit price by the quantity; therefore it's not fully dependent upon the primary key.
We must remove it from the table to comply with the third normal form:
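As a sketch of the resulting design, the table below stores only the order number, customer number, unit price and quantity, and the total is computed when it is needed. Python's sqlite3 module is assumed as the test bed, and all order data and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Orders (order_num INTEGER PRIMARY KEY, cust_num INTEGER,
                         unit_price REAL, quantity INTEGER);
    -- hypothetical orders; customer 241 is charged two different prices
    INSERT INTO Orders VALUES (1, 241, 10.0, 2), (2, 842, 9.0, 20), (3, 241, 12.5, 4);
""")

# The total is derived from unit_price and quantity instead of being stored.
totals = conn.execute("""
    SELECT order_num, unit_price * quantity AS total
    FROM Orders
    ORDER BY order_num
""").fetchall()
```

Because the total is never stored, it cannot drift out of step with the price and quantity it depends on.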
Def: A table is in DKNF if every constraint on the table is a logical consequence of the definition of
keys and domains.
The term instance is typically used to describe a complete database environment, including the
RDBMS software, table structure, stored procedures and other functionality. It is most commonly
used when administrators describe multiple instances of the same database.
Example:
An organization with an employees database might have three different instances: production
(used to contain live data), pre-production (used to test new functionality prior to release into
production) and development (used by database developers to create new functionality).
Database language is another important part of DBMS. It is used to access the required data from
database as well as to design the structure of database. A user uses a database language for
interfacing with the DBMS to access the data from database. A user can either be an application
programmer or an end-user. For example, an application programmer may use COBOL or C++ or
Visual Basic or any fourth-generation language (4GL).
Similarly, an end-user may use a database access language, which is also known as a query language.
Usually, the application programmer embeds statements of the database access language in a
program written in a general-purpose programming language. For this reason, the database access
language is also referred to as a data sub-language. A data sub-language does not
provide the complete set of programming language features. Many DBMSs have their own unique sub-
languages.
Users use the database access language to enter new data, change existing data in the
database, and retrieve required data from databases. The user writes a set of appropriate
commands or statements in a database access language and submits these to the DBMS. The
DBMS translates the user commands and sends them to a specific part of the DBMS called the
Database Jet Engine. The database engine generates a set of results according to the commands
submitted by the user, converts these into a user-readable form called an Inquiry Report, and then
displays them on the screen. Administrators use the database access language to create and
maintain the databases.
The most popular database access language is SQL (Structured Query Language). Relational
databases are required to have a database query language. Today most RDBMSs use
SQL as their database access language. MS Access also uses SQL to perform different operations on
databases. These operations are hidden from the users.
Database security is the system, processes, and procedures that protect a database from
unintended activity. Unintended activity can be categorized as authenticated misuse, malicious
attacks or inadvertent mistakes made by authorized individuals or processes. Database security is
also a specialty within the broader discipline of computer security.
Traditionally, databases have been protected from external connections by firewalls or routers on
the network perimeter, with the database environment existing on the internal network as opposed to
being located within a demilitarized zone. Additional network security devices that detect and alert
on malicious database protocol traffic include network intrusion detection systems along with host-
based intrusion detection systems.
Database security is more critical as networks have become more open.
Databases provide many layers and types of information security, typically specified in the data
dictionary, including:
Access control
Auditing
Authentication
Encryption
Integrity controls
Discretionary access control verifies whether the user who is attempting to perform an operation
has been granted the required privileges to perform that operation. You can perform the following
types of discretionary access control:
Create user roles to control which users can perform operations on which database objects.
Control who is allowed to create databases.
Prevent unauthorized users from registering user-defined routines.
Control whether other users besides the DBSA are allowed to view executing SQL statements.
User Roles
A role is a work-task classification, such as payroll or payroll manager. Each defined role has
privileges on the database object granted to the role. You use the CREATE ROLE statement to
define a role.
External routines with shared libraries that are outside the database server can be security risks.
External routines include user-defined routines (UDRs) and the routines in DataBlade modules.
Mandatory Access
Statistical Databases
A statistical database is a database used for statistical analysis purposes. It is an OLAP rather than an
OLTP system, although the term precedes that modern distinction, and classical statistical databases
are often closer to the relational model than the multidimensional model commonly used in OLAP
systems today.
Statistical databases often incorporate support for advanced statistical analysis techniques, such as
correlations, which go beyond SQL. They also pose unique security concerns, which were the focus
of much research, particularly in the late 1970s and early to mid 1980s.
In a statistical database, it is often desired to allow query access only to aggregate data, not
individual records. However, securing such a database is a difficult problem, since intelligent users
can use a combination of aggregate queries to derive information about a single individual.
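The inference problem described above can be sketched with two aggregate queries whose difference isolates one person. The example uses Python's built-in sqlite3 module; the staff table, the names, and the salary figures are all made up for illustration.

```python
import sqlite3

# Hypothetical data: three employees in one department.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO staff VALUES (?, ?, ?)",
    [
        ("Ann", "Finance", 50000),
        ("Bob", "Finance", 60000),
        ("Cal", "Finance", 70000),
    ],
)

# Both queries return only an aggregate, never an individual row.
total_all = conn.execute(
    "SELECT SUM(salary) FROM staff WHERE dept = 'Finance'"
).fetchone()[0]
total_without_cal = conn.execute(
    "SELECT SUM(salary) FROM staff WHERE dept = 'Finance' AND name <> 'Cal'"
).fetchone()[0]

# Subtracting one aggregate from the other reveals a single salary.
inferred_salary = total_all - total_without_cal
print(inferred_salary)
```

Neither query on its own exposes an individual record, which is why restricting users to aggregate queries is not sufficient protection.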
Some common approaches are:
Data Encryption
Here we want to emphasize that SQL is both deep and wide. It is deep in the sense that it is
implemented at many levels of database communication, from a simple Access form list box right
up to high-volume communications between mainframes. It is wide in
that almost every DBMS supports SQL statements for communication. The reason for
this level of acceptance is partially explained by the amount of effort that went into the theory
and development of the standards.
Current State
The ANSI-SQL group has published three standards over the years:
SQL89 (SQL1)
SQL92 (SQL2)
SQL99 (SQL3)
The vast majority of the language has not changed through these updates. We can all
profit from the fact that almost all of the code we wrote to SQL standards of 1989 is still perfectly
usable. Or in other words, as a new student of SQL there is over ten years of SQL code out there
that needs your expertise to maintain and expand.
Most DBMS are designed to meet the SQL92 standard. Virtually all of the material in this book
was available in the earlier standards as well. Since many of the advanced features of
SQL92 have yet to be implemented by DBMS vendors, there has been little pressure for a new
version of the standard. Nevertheless a SQL99 standard was developed to address advanced
issues in SQL. All of the core functions of SQL, such as adding, reading and modifying data, are the
same. Therefore, the topics in this book are not affected by the new standard. As of early
2001, no vendor has implemented the SQL99 standard.
There are three areas of current development in SQL standards. The first is improving
Internet access to data, particularly to meet the needs of the emerging XML standards. The second
is integration with Java, either through Sun's Java Database Connectivity (JDBC) or
through internal implementations. Last, the groups that establish SQL standards are considering
how to integrate object-based programming models.
temporary_tablespace_clause::=
Semantics
BIGFILE | SMALLFILE
Use this clause to determine whether the tablespace is a bigfile or smallfile tablespace. This clause
overrides any default tablespace type setting for the database.
A bigfile tablespace contains only one datafile or tempfile, which can contain up to
approximately 4 billion (2^32) blocks. The maximum size of the single datafile or tempfile is
128 terabytes (TB) for a tablespace with 32K blocks and 32TB for a tablespace with 8K
blocks.
A smallfile tablespace is a traditional Oracle tablespace, which can contain 1022 datafiles
or tempfiles, each of which can contain up to approximately 4 million (2^22) blocks.
If you omit this clause, then Oracle Database uses the current default tablespace type of
permanent or temporary tablespace set for the database. If you specify BIGFILE for a permanent
tablespace, then the database by default creates a locally managed tablespace with automatic
segment-space management.
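The size limits quoted above for bigfile and smallfile tablespaces follow from multiplying the block count by the block size. The short calculation below checks that arithmetic. It assumes exactly 2^32 and 2^22 blocks (the text says "approximately"), and the helper name max_file_bytes is made up for this sketch.

```python
# Worked arithmetic for the tablespace limits, assuming exactly 2**32 blocks
# per bigfile datafile and 2**22 blocks per smallfile datafile.

KB = 1024
TB = 1024 ** 4


def max_file_bytes(max_blocks, block_size_bytes):
    """Largest single file size for a given block count and block size."""
    return max_blocks * block_size_bytes


bigfile_32k = max_file_bytes(2 ** 32, 32 * KB)
bigfile_8k = max_file_bytes(2 ** 32, 8 * KB)

# A smallfile tablespace: up to 1022 files of 2**22 blocks each (8K blocks).
smallfile_8k_total = 1022 * max_file_bytes(2 ** 22, 8 * KB)

print(bigfile_32k // TB, "TB")
print(bigfile_8k // TB, "TB")
print(smallfile_8k_total / TB, "TB")
```

Under these assumptions the two bigfile results come to 128 TB and 32 TB, matching the figures in the text, while a full smallfile tablespace with 8K blocks comes to just under 32 TB spread across 1022 files.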
Oracle Database stores data logically in tablespaces and physically in datafiles associated with
the corresponding tablespace. Figure A illustrates this relationship.
Databases, tablespaces, and datafiles are closely related, but they have
important differences:
An Oracle database consists of at least two logical storage units called tablespaces, which
collectively store all of the database's data. You must have
the SYSTEM and SYSAUX tablespaces; a third tablespace, called TEMP, is
optional.
Each tablespace in an Oracle database consists of one or more files called datafiles, which
are physical structures that conform to the operating system in which Oracle Database is
running.
A database's data is collectively stored in the datafiles that constitute each tablespace of
the database. For example, the simplest Oracle database would have one tablespace and
one datafile. Another database can have three tablespaces, each consisting of two datafiles
(for a total of six datafiles).
So What is SQL?
SQL Commands
Here you can see that SQL commands follow a number of basic rules:
SQL keywords are not normally case sensitive, though in this tutorial all
commands (SELECT, UPDATE, etc.) are upper-cased.
Variable and parameter names are displayed here as lower-case.
New-line characters are ignored in SQL, so a command may be all on one line or
broken up across a number of lines for the sake of clarity.
Many DBMS systems expect to have SQL commands terminated with a semi-colon
character.
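The rules above about case, line breaks, and the terminating semicolon can be checked with a small experiment. This sketch uses Python's built-in sqlite3 module as a convenient SQL engine; the demo table is hypothetical, and other DBMS products may differ in detail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (n INTEGER)")
conn.execute("INSERT INTO demo VALUES (1)")

# Upper-case keywords, all on one line, terminated with a semicolon.
one_line = conn.execute("SELECT n FROM demo WHERE n = 1;").fetchall()

# Lower-case keywords, broken across several lines.
multi_line = conn.execute(
    """
    select n
    from demo
    where n = 1;"""
).fetchall()

# Both spellings of the command return the same rows.
print(one_line == multi_line)
```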
The Data Definition Language (DDL) part of SQL permits database tables to be created or deleted.
We can also define indexes (keys), specify links between tables, and impose constraints between
database tables.
The most important DDL statements in SQL are:
The TEXT datatype, supported by many of the most common DBMS, specifies a string of characters
of any length. In practice there is often a default string length which varies by product. In some
DBMS TEXT is not supported, and instead a specific string length has to be declared. Strings
with a declared length are often called CHAR(x), VCHAR(x) or VARCHAR(x), where x is the string
length; CHAR(x) is fixed length, while VARCHAR(x) holds variable-length strings of up to x
characters. In the case of INTEGER there are often multiple flavors of integer available.
Remembering that larger integers require more bytes for data storage, the choice of int size is
usually a design decision that ought to be made up front.
Once a table is created its structure is not necessarily fixed in stone. In time requirements change
and the structure of the database is likely to evolve to match your wishes. SQL can be used to
change the structure of a table, so, for example, if we need to add a new field to our User table to
tell us if the user has Internet access, then we can execute an SQL ALTER TABLE command as
shown below:
ALTER TABLE User ADD COLUMN Internet BOOLEAN;
To delete a column the ADD keyword is replaced with DROP, so to delete the field we have just
added the SQL is:
ALTER TABLE User DROP COLUMN Internet;
How to Delete a Table
If you have already executed the original CREATE TABLE command your database will already
contain a table called User, so let's get rid of that using the DROP command:
DROP TABLE User;
And now we'll recreate the User table we'll use throughout the rest of this tutorial:
CREATE TABLE User (FirstName VARCHAR(20), LastName VARCHAR(20), UserID
VARCHAR(12) UNIQUE, Dept VARCHAR(20), EmpNo INTEGER UNIQUE, PCType
VARCHAR(20));
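The DDL steps in this section (create the table, add a column, drop the table, recreate it) can be run end to end as a quick check. The sketch below uses Python's built-in sqlite3 module. The column_names helper is made up for this example and relies on SQLite's PRAGMA table_info, which is specific to SQLite. The DROP COLUMN step is left out because only recent SQLite versions support it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

create_user = """
CREATE TABLE User (
    FirstName VARCHAR(20),
    LastName  VARCHAR(20),
    UserID    VARCHAR(12) UNIQUE,
    Dept      VARCHAR(20),
    EmpNo     INTEGER UNIQUE,
    PCType    VARCHAR(20)
)"""


def column_names(connection, table):
    """List a table's column names via SQLite's PRAGMA table_info."""
    rows = connection.execute(f"PRAGMA table_info({table})").fetchall()
    return [row[1] for row in rows]


conn.execute(create_user)
before = column_names(conn, "User")

# Add the Internet column with ALTER TABLE ... ADD COLUMN.
conn.execute("ALTER TABLE User ADD COLUMN Internet BOOLEAN")
after_add = column_names(conn, "User")

# Drop the whole table and recreate it with the original structure.
conn.execute("DROP TABLE User")
conn.execute(create_user)
after_recreate = column_names(conn, "User")

print(before, after_add, after_recreate, sep="\n")
```

Note that User is a reserved word in some DBMS products, so the table name may need quoting or renaming there; SQLite accepts it as written.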
The SQL language also includes syntax to update, insert, and delete records.
These query and update commands together form the Data Manipulation Language (DML)
part of SQL:
around some kind of GUI form. The form gives a representation of the information required for the
application, rather than providing a simple mapping onto the tables. So, in this sample application
you would imagine a form with text boxes for the user details, drop-down lists to select from
the PC table, drop-down selection of the software packages etc. In such a situation the
database user is shielded both from the underlying structure of the database and from the SQL
which may be used to enter data into it. However we are going to use the SQL directly to populate
the tables so that we can move on to the next stage of learning SQL.
The command to add new records to a table (usually referred to as an append query), is:
INSERT INTO target [(field1[, field2[, ...]])] VALUES (value1[, value2[, ...]]);
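The general form above, including the optional field list, can be tried out as follows. This is a sketch using Python's built-in sqlite3 module. The two users are made up, and string literals are written with single quotes, which is the form standard SQL defines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE User (FirstName VARCHAR(20), LastName VARCHAR(20), "
    "UserID VARCHAR(12) UNIQUE, Dept VARCHAR(20), EmpNo INTEGER UNIQUE, "
    "PCType VARCHAR(20))"
)

# No field list: one value per column, in table order.
conn.execute(
    "INSERT INTO User VALUES ('Ann', 'Smith', 'Asmith', 'Sales', 1, 'DellDimR450')"
)

# Explicit field list: the unlisted columns (Dept, PCType) are left NULL.
conn.execute(
    "INSERT INTO User (FirstName, LastName, UserID, EmpNo) "
    "VALUES ('Raj', 'Patel', 'Rpatel', 2)"
)

# EmpNo is UNIQUE, so reusing employee number 2 is rejected.
try:
    conn.execute("INSERT INTO User (FirstName, EmpNo) VALUES ('Joe', 2)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

rows = conn.execute(
    "SELECT FirstName, Dept, EmpNo FROM User ORDER BY EmpNo"
).fetchall()
print(rows, duplicate_rejected)
```

Listing the fields explicitly is usually the safer habit, since the statement then keeps working if columns are later added to the table.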
So, to add a User record for user Jim Jones, we would issue the following INSERT query:
INSERT INTO User (FirstName, LastName, UserID, Dept, EmpNo, PCType) VALUES ('Jim', 'Jones',
'Jjones', 'Finance', 9, 'DellDimR450');
The INSERT command is used to add records to a table, but what if you need to make an
amendment to a particular record? In this case the SQL command to perform updates is the
UPDATE command, with syntax:
UPDATE table SET newvalue WHERE criteria;
For example, let's assume that we want to move user Jim Jones from the Finance
department to Marketing. Our SQL statement would then be:
UPDATE User
SET Dept='Marketing' WHERE EmpNo=9;
Notice that we used the EmpNo field to set the criteria because we know it is unique. If we'd used
another field, for example LastName, we might have accidentally updated the records for any other
user with the same surname.
The UPDATE command can be used for more than just changing a single field or record at a time.
The SET keyword can be used to set new values for a number of different fields, so we
could have moved Jim Jones from Finance to Marketing and changed the PCType as well in the
same statement (SET Dept='Marketing', PCType='PrettyPC'). Or if all of the Finance
department were suddenly granted Internet access then we could have issued the following
SQL query:
UPDATE User
SET Internet=TRUE WHERE Dept='Finance';
You can also use the SET keyword to perform arithmetical or logical operations on the values. For
example if you have a table of salaries and you want to give everybody a 10% increase you can
issue the following command:
UPDATE PayRoll
SET Salary=Salary * 1.1;
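The UPDATE variations discussed above (a unique key in the WHERE clause, several fields in one SET, a whole-department update, and arithmetic on existing values) can be compared side by side. The sketch uses Python's built-in sqlite3 module. The extra rows, the Internet column, and the PayRoll figures are made up, and 1 is used instead of TRUE because older SQLite builds do not recognise the TRUE keyword.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE User (LastName VARCHAR(20), Dept VARCHAR(20), "
    "EmpNo INTEGER UNIQUE, PCType VARCHAR(20), Internet BOOLEAN)"
)
conn.executemany(
    "INSERT INTO User VALUES (?, ?, ?, ?, ?)",
    [
        ("Jones", "Finance", 9, "DellDimR450", 0),
        ("Jones", "Finance", 10, "DellDimR450", 0),  # same surname, made up
        ("Smith", "Sales", 11, "DellDimR450", 0),    # made up
    ],
)

# Grant Internet access to everyone currently in Finance (two rows).
cur = conn.execute("UPDATE User SET Internet = 1 WHERE Dept = 'Finance'")
finance_rows_changed = cur.rowcount

# Move only employee 9, changing two fields in one statement.
cur = conn.execute(
    "UPDATE User SET Dept = 'Marketing', PCType = 'PrettyPC' WHERE EmpNo = 9"
)
by_empno_changed = cur.rowcount

# A WHERE clause on LastName would have matched both Jones records.
jones_count = conn.execute(
    "SELECT COUNT(*) FROM User WHERE LastName = 'Jones'"
).fetchone()[0]

# Arithmetic in SET: a 10% raise for every row of a hypothetical PayRoll table.
conn.execute("CREATE TABLE PayRoll (EmpNo INTEGER, Salary REAL)")
conn.executemany("INSERT INTO PayRoll VALUES (?, ?)", [(9, 1000.0), (10, 2000.0)])
conn.execute("UPDATE PayRoll SET Salary = Salary * 1.1")
salaries = [row[0] for row in conn.execute("SELECT Salary FROM PayRoll ORDER BY EmpNo")]

print(finance_rows_changed, by_empno_changed, jones_count, salaries)
```

An UPDATE with no WHERE clause, like the salary increase, touches every row in the table, so it is worth double-checking before running it against real data.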
How to Delete Data
Now that we know how to add new records and to update existing records it only remains to learn
how to delete records before we move on to look at how we search through and collate data. As
you would expect SQL provides a simple command to delete complete records. The syntax of the
command is:
DELETE FROM table [WHERE <condition>];
Let's assume we have a user record for John Doe (with an employee number of 99)
which we want to remove from our User table. We could issue the following query:
DELETE FROM User
WHERE EmpNo=99;
In practice delete operations are not handled by manually keying in SQL queries, but are likely to
be generated from a front-end system which will handle warnings and add safeguards against
accidental deletion of records.
Note that the DELETE query will delete an entire record or group of records. If you want to delete
a single field or group of fields without destroying that record then use an UPDATE query
and set the fields to Null to over-write the data that needs deleting. It is also worth noting that the
DELETE query does not do anything to the structure of the table itself; it deletes data only. To
delete a whole table you have to use the DROP TABLE command, and to delete a column you have to
use the DROP clause of an ALTER TABLE query.
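The difference between deleting a record, clearing a single field, and leaving the table structure alone can be seen in a short run. This sketch uses Python's built-in sqlite3 module with a cut-down, hypothetical version of the User table; the PRAGMA table_info call is specific to SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE User (LastName VARCHAR(20), EmpNo INTEGER UNIQUE, "
    "PCType VARCHAR(20))"
)
conn.executemany(
    "INSERT INTO User VALUES (?, ?, ?)",
    [("Doe", 99, "DellDimR450"), ("Jones", 9, "DellDimR450")],
)

# Clear one field but keep the record.
conn.execute("UPDATE User SET PCType = NULL WHERE EmpNo = 9")

# Remove a whole record.
conn.execute("DELETE FROM User WHERE EmpNo = 99")

remaining = conn.execute("SELECT LastName, EmpNo, PCType FROM User").fetchall()

# The table structure itself is untouched by DELETE.
columns = [row[1] for row in conn.execute("PRAGMA table_info(User)")]
print(remaining, columns)
```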
The SQL Data Control Language (DCL) provides security for your database. The DCL consists of
the GRANT, REVOKE, COMMIT, and ROLLBACK statements. GRANT and REVOKE statements
enable you to determine whether a user can view, modify, add, or delete database information.
Working With Transaction Control
Applications execute a SQL statement or group of logically related SQL statements to
perform a database transaction. The SQL statement or statements add, delete, or modify data in
the database.
Transactions are atomic and durable. To be considered atomic, a transaction must
successfully complete all of its statements; otherwise none of the statements execute. To be
considered durable, a transaction's changes to a database must be permanent.
Complete a transaction by using either the COMMIT or ROLLBACK statements. COMMIT
statements make permanent the changes to the database created by a transaction.
ROLLBACK restores the database to the state it was in before the transaction was performed.
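The effect of COMMIT and ROLLBACK on a group of related statements can be sketched as below, using Python's built-in sqlite3 module. The Account table, its balances, and the balances helper are made up for illustration. Passing isolation_level=None is a choice made for this sketch so that BEGIN, COMMIT and ROLLBACK can be issued explicitly as SQL.

```python
import sqlite3

# isolation_level=None puts the sqlite3 module in autocommit mode, so the
# transaction boundaries below are exactly the ones written out as SQL.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE Account (Name TEXT, Balance INTEGER)")
conn.execute("INSERT INTO Account VALUES ('A', 100), ('B', 0)")


def balances():
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT Name, Balance FROM Account"))


# Transaction 1: both statements are undone together by ROLLBACK.
conn.execute("BEGIN")
conn.execute("UPDATE Account SET Balance = Balance - 30 WHERE Name = 'A'")
conn.execute("UPDATE Account SET Balance = Balance + 30 WHERE Name = 'B'")
conn.execute("ROLLBACK")
after_rollback = balances()

# Transaction 2: the same statements are made permanent by COMMIT.
conn.execute("BEGIN")
conn.execute("UPDATE Account SET Balance = Balance - 30 WHERE Name = 'A'")
conn.execute("UPDATE Account SET Balance = Balance + 30 WHERE Name = 'B'")
conn.execute("COMMIT")
after_commit = balances()

print(after_rollback, after_commit)
```

Grouping the two updates in one transaction is what keeps the transfer atomic: either both balances change or neither does.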
SQL Transaction Control Language (TCL) Commands
This section contains some SQL TCL commands that I think might be useful. Each
command's description is taken and modified from the SQLPlus help. They are provided as is and
are most likely only partially described. So, if you want more detail or other commands, please
use HELP in SQLPlus directly.
COMMIT
PURPOSE:
To end your current transaction and make permanent all changes performed in the
transaction. This command also erases all savepoints in the transaction and releases the
transaction's locks. You can also use this command to manually commit an in-doubt
distributed transaction.
SYNTAX:
SQL>COMMIT;