Database Topic 1
member of a team, or to work individually, in the areas of database application
development or administration. In addition, some coverage of current research
areas is provided, partly as a stimulus for possible future dissertation topics,
and also to provide an awareness of possible future developments within the
database arena.
Module objectives
At the end of this module you will have acquired practical and theoretical
knowledge and skills relating to modern database systems. The module is designed
so that this knowledge will be applicable across a wide variety of database
environments. At the end of the module you will be able to:
• Understand and explain the key ideas underlying database systems and
the database approach to information storage and manipulation.
• Design and implement database applications.
• Carry out actions to improve the performance of existing database
applications.
• Understand the issues involved in providing multiple users with concurrent
access to database systems.
• Design adequate backup, recovery and security measures for
a database installation, and understand the facilities provided by typical
database systems to support these tasks.
• Understand the types of tasks involved in database administration and
the facilities provided in a typical database system to support these tasks.
• Describe the issues and objectives in a range of areas of contemporary
database research.
Chapter objectives
Introduction
In parallel with this chapter, you should read Chapter 1 and Chapter 2 of
Thomas Connolly and Carolyn Begg, “Database Systems: A Practical Approach
to Design, Implementation, and Management” (5th edn.).
This chapter sets the scene for all of the forthcoming chapters of the module.
We begin by examining the approach to storing and processing data that was
used before the arrival of database systems, and that is still appropriate today
in certain situations (which will be explained). We then go on to examine the
difference between this traditional, file-based approach to data storage, and that
of the database approach. We do this first by examining inherent limitations of
the file-based approach, and then by discussing ways in which the database approach
can be used to overcome these limitations.
A particular model of database systems, known as the Relational model, has
been the dominant approach in the database industry since the early ’80s. There
are now important rivals and extensions to the Relational model, which will be
examined in later chapters, but the Relational model remains the core
technology on which the database industry worldwide is based, and for this reason this
model will be central to the entire module.
If there are a small number of records to be kept, and these do not need to be
changed very often, a card index might be all that is required. However, where
there is a high volume of data, and a need to manipulate this data on a regular
basis, a computer-based solution will often be chosen. This might sound like a
simple solution, but there are a number of different approaches that could be
taken.
The term ‘file-based approach’ refers to the situation where data is stored in one
or more separate computer files defined and managed by different application
programs. Typically, for example, the details of customers may be stored in
one file, orders in another, etc. Computer programs access the stored files to
perform the various tasks required by the business. Each program, or
sometimes a related set of programs, is called a computer application. For example,
all of the programs associated with processing customers’ orders are referred to
as the order processing application. The file-based approach might have
application programs that deal with purchase orders, invoices, sales and marketing,
suppliers, customers, employees, and so on.
Limitations
• Data duplication: Each program stores its own separate files. If the same
data is to be accessed by different programs, then each program must store
its own copy of the same data.
• Data inconsistency: If the data is kept in different files, there could be
problems when an item of data needs updating, as it will need to be
updated in all the relevant files; if this is not done, the data will be
inconsistent, and this could lead to errors.
• Difficult to implement data security: Data is stored in different files by
different application programs. This makes it difficult and expensive to
implement organisation-wide security procedures on the data.
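The first two limitations can be illustrated with a minimal Python sketch. The
customer record, the two application names and the helper find_inconsistencies
are all hypothetical, and in-memory dictionaries stand in for the private files
that each application would keep.

```python
# Hypothetical data: each application keeps its own private copy of customer details.
order_processing_customers = {"C001": {"name": "A. Smith", "address": "1 High Street"}}
invoicing_customers = {"C001": {"name": "A. Smith", "address": "1 High Street"}}

# The customer moves house, but only the order processing application is updated.
order_processing_customers["C001"]["address"] = "22 Mill Lane"


def find_inconsistencies(copy_a, copy_b):
    """Return the keys whose records differ between two private copies."""
    shared_keys = copy_a.keys() & copy_b.keys()
    return sorted(key for key in shared_keys if copy_a[key] != copy_b[key])


inconsistent = find_inconsistencies(order_processing_customers, invoicing_customers)
print(inconsistent)
```

The duplicated record is what makes the inconsistency possible: with a single
stored copy there would be nothing to compare and nothing to drift apart.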
The following diagram shows how different applications will each have their own
copy of the files they need in order to carry out the activities for which they are
responsible:
The shared file approach
One approach to solving the problem of each application having its own set
of files is to share files between different applications. This will alleviate the
problem of duplication and inconsistent data between different applications, and
is illustrated in the diagram below:
The introduction of shared files solves the problem of duplication and
inconsistent data across different versions of the same file held by different departments,
but other problems may emerge, including:
• File incompatibility: When each department had its own version of a file
for processing, each department could ensure that the structure of the
file suited their specific application. If departments have to share files,
the file structure that suits one department might not suit another. For
example, data might need to be sorted in a different sequence for different
applications (for instance, customer details could be stored in alphabetical
order, or numerical order, or ascending or descending order of customer
number).
• Difficult to control access: Some applications may require access to more
data than others; for instance, a credit control application will need access
to customer credit limit information, whereas a delivery note printing
application will only need access to customer name and address details.
The file will still need to contain the additional information to support
the application that requires it.
• Physical data dependence: If the structure of the data file needs to be
changed in some way (for example, to reflect a change in currency), this
alteration will need to be reflected in all application programs that use
that data file. This problem is known as physical data dependence, and
will be examined in more detail later in the chapter.
• Difficult to implement concurrency: While a data file is being processed
by one application, the file will not be available for other applications or
for ad hoc queries. This is because, if more than one application is allowed
to alter data in a file at one time, serious problems can arise in ensuring
that the updates made by each application do not clash with one another.
This issue of ensuring consistent, concurrent updating of information is an
extremely important one, and is dealt with in detail for database systems
in the chapter on concurrency control. File-based systems avoid these
problems by not allowing more than one application to access a file at one
time.
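Physical data dependence can be sketched in a few lines of Python. The
fixed-width record layout, the field positions and the function
read_credit_limit are assumptions made purely for illustration; they are not
taken from any particular system.

```python
# Hypothetical fixed-width layout shared by several applications:
# characters 0-5 customer number, 6-25 name, 26-33 credit limit.
OLD_RECORD = "000042" + "A. Smith".ljust(20) + "00001500"


def read_credit_limit(record):
    """Application code that hard-codes the physical position of a field."""
    return int(record[26:34])


old_limit = read_credit_limit(OLD_RECORD)

# The file structure changes: a 3-character currency code is inserted after the name.
NEW_RECORD = "000042" + "A. Smith".ljust(20) + "GBP" + "00001500"

try:
    new_limit = read_credit_limit(NEW_RECORD)
except ValueError:
    # The slice now straddles the currency code, so the conversion fails.
    new_limit = None

print(old_limit, new_limit)
```

Every program that slices the record in this way would need to be amended when
the layout changes, which is the cost that the text describes.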
Review question 1
What is meant by the file-based approach to storing data? Describe some of the
disadvantages of this approach.
Review question 2
How can some of the problems of the file-based approach to data storage be
avoided?
Review question 3
What are the problems that remain with the shared file approach?
The database approach is an improvement on the shared file solution as the use
of a database management system (DBMS) provides facilities for querying, data
security and integrity, and allows simultaneous access to data by a number of
different users. At this point we should explain some important terminology:
• Database: A database is a collection of related data.
• Database management system: The term ‘database management
system’, often abbreviated to DBMS, refers to a software system used to
create and manage databases. The software of such systems is complex,
consisting of a number of different components, which are described later
in this chapter. The term database system is usually an alternative term
for database management system.
• System catalogue/Data dictionary: A description of the data held in
the database (that is, metadata), maintained by the database management system.
• Database application: Database application refers to a program, or
related set of programs, which use the database management system to
perform the computer-related tasks of a particular business function, such
as order processing.
One of the benefits of the database approach is that the problem of physical
data dependence is resolved; this means that the underlying structure of a data
file can be changed without the application programs needing amendment. This
is achieved by a hierarchy of levels of data specification. Each such specification
of data in a database system is called a schema. The different levels of schema
provided in database systems are described below. Further details of what is
included within each specific schema are discussed later in the chapter.
The Standards Planning and Requirements Committee (SPARC) of the American National
Standards Institute (ANSI) encapsulated the concept of schema in its three-level
database architecture model, known as the ANSI/SPARC architecture, which
is shown in the diagram below:
ANSI/SPARC three-level architecture
user applications. The external schema maps onto the conceptual schema, which
is described below.
There may be many external schemas, each reflecting a simplified model of the
world, as seen by particular applications. External schemas may be modified, or
new ones created, without the need to make alterations to the physical storage
of data. The interface between the external schema and the conceptual schema
can be amended to accommodate any such changes.
The external schema allows the application programs to see as much of the
data as they require, while excluding other items that are not relevant to that
application. In this way, the external schema provides a view of the data that
corresponds to the nature of each task.
The external schema is more than a subset of the conceptual schema. While
items in the external schema must be derivable from the conceptual schema,
this could be a complicated process, involving computation and other activities.
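In a relational system, an external schema of this kind is commonly realised as
a view. The following is a minimal sketch using Python's built-in sqlite3
module as a stand-in DBMS; the table order_line, the view order_clerk_view and
all of the column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Conceptual level: the full description of an order line (hypothetical table).
    CREATE TABLE order_line (
        order_no   INTEGER,
        item_code  INTEGER,
        quantity   INTEGER,
        unit_price REAL,
        cost_price REAL
    );
    INSERT INTO order_line VALUES (1, 20217, 4, 2.75, 1.90);

    -- External level: hides cost_price and adds a computed line_total.
    CREATE VIEW order_clerk_view AS
        SELECT order_no, item_code, quantity,
               quantity * unit_price AS line_total
        FROM order_line;
""")

cursor = conn.execute("SELECT * FROM order_clerk_view")
view_columns = [description[0] for description in cursor.description]
view_rows = cursor.fetchall()
print(view_columns, view_rows)
```

The view is not a simple subset of the table: line_total is not stored anywhere,
but it is derivable from items in the underlying schema.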
itself as physical. This may contrast with the perspective of a systems
programmer, who may consider data files as logical in concept, but their implementation
on magnetic disks in cylinders, tracks and sectors as physical.
Components of a DBMS
DBMS engine
The engine is the central component of a DBMS. This component provides access
to the database and coordinates all of the functional elements of the DBMS. An
important source of data for the DBMS engine, and the database system as a
whole, is known as metadata. Metadata means data about data. Metadata is
contained in a part of the DBMS called the data dictionary (described below),
and is a key source of information to guide the processes of the DBMS engine.
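As one concrete illustration of metadata, SQLite (used here through Python's
sqlite3 module, purely as an example system) exposes its catalogue as an
ordinary table called sqlite_master, and describes the columns of a table
through PRAGMA table_info. The customer table is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (customer_no INTEGER PRIMARY KEY, name TEXT NOT NULL)"
)

# sqlite_master is SQLite's catalogue: one row per table, index, view or trigger.
catalogue = conn.execute("SELECT type, name FROM sqlite_master").fetchall()

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk).
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(customer)")]

print(catalogue)
print(columns)
```

Other database systems hold the same kind of information in their own catalogue
structures, with different names and layouts.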
The DBMS engine receives logical requests for data (and metadata) from human
users and from applications, determines the secondary storage location (i.e. the
disk address) of the requested data, and issues physical input/output requests
to the computer operating system. The data requested is fetched from physical
storage into computer main memory; it is contained in special data structures
provided by the DBMS. While the data remains in memory, it is managed by the
DBMS engine. Additional data structures are created by the database system
itself, or by users of the system, in order to provide rapid access to data being
processed by the system. These data structures include indexes to speed up
access to the data, buffer areas into which particular types of data are retrieved,
lists of free space, etc. The management of these additional data structures is
also carried out by the DBMS engine.
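The effect of an index on how the engine locates data can be observed in a
small sketch, again using sqlite3 as a stand-in DBMS. The customer table, the
index name idx_customer_city and the helper query_plan are hypothetical. The
exact wording of SQLite's plan text varies between versions, so the sketch only
checks whether the index is mentioned.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (customer_no INTEGER PRIMARY KEY, name TEXT, city TEXT)"
)
conn.executemany(
    "INSERT INTO customer (name, city) VALUES (?, ?)",
    [("A. Smith", "Leeds"), ("B. Jones", "York"), ("C. Patel", "Leeds")],
)

QUERY = "SELECT name FROM customer WHERE city = 'Leeds'"


def query_plan(connection, sql):
    """Return the engine's plan description for a query as one string."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(row[-1] for row in rows)


plan_before = query_plan(conn, QUERY)
conn.execute("CREATE INDEX idx_customer_city ON customer (city)")
plan_after = query_plan(conn, QUERY)

# The logical request and its result are unchanged; only the access path differs.
names = sorted(row[0] for row in conn.execute(QUERY))
print("index used before:", "idx_customer_city" in plan_before)
print("index used after:", "idx_customer_city" in plan_after)
```

The application's query text is identical before and after the index is
created, which is why such structures can be added without amending programs.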
which will be triggered by the actions of users as they use the forms-based
user interface.
• A DBMS procedural programming language, often based on standard
third-generation programming languages such as C and COBOL, which
allows programmers to develop sophisticated applications.
• Fourth-generation and other high-level languages, such as Smalltalk and JavaScript. These
permit applications to be developed relatively quickly compared to the
procedural languages mentioned above.
• A natural language user interface that allows users to present requests in
free-form English statements.
Data integrity management subsystem
The data integrity management subsystem provides facilities for managing the
integrity of data in the database and the integrity of metadata in the dictionary.
This subsystem is concerned with ensuring that data is, as far as software can
ensure, correct and consistent. There are three important functions:
• Intra-record integrity: Enforcing constraints on data item values and types
within each record in the database.
• Referential integrity: Enforcing the validity of references between records
in the database.
• Concurrency control: Ensuring the validity of database updates when
multiple users access the database (discussed in a later chapter).
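The first two of these functions can be sketched with declarative constraints.
The example below uses sqlite3 as a stand-in DBMS; the tables customer and
customer_order and the helper is_rejected are hypothetical. Note that SQLite
only enforces foreign keys once PRAGMA foreign_keys is switched on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
    CREATE TABLE customer (
        customer_no  INTEGER PRIMARY KEY,
        name         TEXT NOT NULL,
        credit_limit REAL CHECK (credit_limit >= 0)   -- intra-record integrity
    );
    CREATE TABLE customer_order (
        order_no    INTEGER PRIMARY KEY,
        customer_no INTEGER NOT NULL
                    REFERENCES customer (customer_no)  -- referential integrity
    );
    INSERT INTO customer VALUES (1, 'A. Smith', 500.0);
""")


def is_rejected(connection, sql):
    """Return True if the DBMS refuses the statement on integrity grounds."""
    try:
        connection.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False


negative_limit_rejected = is_rejected(
    conn, "INSERT INTO customer VALUES (2, 'B. Jones', -10.0)"
)
orphan_order_rejected = is_rejected(
    conn, "INSERT INTO customer_order VALUES (100, 99)"
)
valid_order_rejected = is_rejected(
    conn, "INSERT INTO customer_order VALUES (101, 1)"
)
print(negative_limit_rejected, orphan_order_rejected, valid_order_rejected)
```

Because the rules are declared once in the database, every application that
updates these tables is subject to the same checks.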
The second of the three environments is often called pre-production.
Applications that have been tested in the development environment will be moved into
pre-production for volume testing; that is, testing with quantities of data that
are typical of the application when it is in live operation.
The final environment is known as the production or live environment.
Applications should only be moved into this environment when they have been fully
tested in pre-production. Security is nearly always a very important issue in the
production environment, as the data being used reflects important information
in current use by the organisation.
Each of these separate environments will have at least one database system,
and because of the widely varying activities and security measures required in
each environment, the volume of data and degree of administration required will
themselves vary considerably between environments, with the production database(s)
requiring by far the most support.
Given the need for the database administrator to migrate both programs and
data between these environments, an important tool in performing this
process will be a set of utilities or programs for migrating applications and their
associated data both forwards and backwards between the environments in use.
• Better modelling of real-world data: Databases are based on semantically
rich data models that allow the accurate representation of real-world
information.
• Uniform security and integrity controls: Security control ensures that
applications can only access the data they are required to access. Integrity
control ensures that the database represents what it purports to represent.
• Economy of scale: Concentration of processing, control, personnel and
technical expertise.
Organisations need data to provide details of the current state of affairs; for
example, the number of product items in stock, customer orders, staff details,
office and warehouse space, etc. Raw data can then be processed to enable
decisions to be made and actions to be taken. Data is therefore an important
resource that needs to be safeguarded. Organisations will accordingly have rules,
standards, policies and procedures for data handling to ensure that accuracy is
maintained and that proper and appropriate use is made of the data. It is for
this reason that organisations may employ data administrators and database
administrators.
It is important that the data administrator is aware of any issues that may
affect the handling and use of data within the organisation. Data administration
includes the responsibility for determining and publicising policy and standards
for data naming and data definition conventions, access permissions and
restrictions for data and processing of data, and security issues.
The data administrator needs to be a skilled manager, able to implement policy
and make strategic decisions concerning the organisation’s data resource. It is
not sufficient for the data administrator to propose a set of rules and regulations
for the use of data within an organisation; the role also requires the investigation
of ways in which the organisation can extract the maximum benefit from the
available data.
One of the problems facing the data administrator is that data may exist in
a range of different formats, such as plain text, formatted documents, tables,
charts, photographs, spreadsheets, graphics, diagrams, multimedia (including
video, animated graphics and audio), plans, etc. In cases where the data is
available on computer-readable media, consideration needs to be given to whether
the data is in the correct format.
The variety of formats in which data may appear is further complicated by the
range of terms used to describe it within the organisation. One problem is
the use of synonyms, where a single item of data may be known by a number
of different names. An example of the use of synonyms would be the terms
‘telephone number’, ‘telephone extension’, ‘direct line’, ‘contact number’ or just
‘number’ to mean the organisation’s internal telephone number for a particular
member of staff. In an example such as this, it is easy to see that the terms
refer to the same item of data, but it might not be so clear in other contexts.
A further complication is the existence of homonyms. A homonym is a term
which may be used for several different items in different contexts; this can
often happen when acronyms are used. One example is the use of the terms
‘communication’ and ‘networking’; these terms are sometimes used to refer to
interpersonal skills, but may also be employed in the context of data
communication and computer networks.
When the items of data that are important to an organisation have been
identified, it is important to ensure that there is a standard representation format.
It might be acceptable to tell a colleague within the organisation that your
telephone extension is 5264, but this would be insufficient information for someone
outside the organisation. It may be necessary to include full details, such as
international access code, national code, area code and local code as well as
the telephone extension to ensure that the telephone contact details are usable
worldwide.
Dates are a typical example of an item of data with a wide variety of formats.
The range of date formats includes day-month-year, month-day-year,
year-month-day, etc. The month may appear as a value in the range 1 to 12, as the
name of the month in full, or as a three-letter abbreviation. These formats can be
varied by changing the separating character between fields from a hyphen (-) to
a slash (/), full stop (.) or space ( ).
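A standard representation can be enforced in software. The following sketch
converts several of the layouts mentioned above to a single year-month-day
form. It rests on stated assumptions: only numeric months are handled, only
year-month-day and day-month-year are accepted, and the function name
to_iso_date is hypothetical.

```python
from datetime import datetime


def to_iso_date(text):
    """Convert a date string to year-month-day (ISO 8601) form.

    Assumption: two accepted layouts, year-month-day and day-month-year,
    with '-', '/', '.' or a space as separator and numeric months.
    """
    normalised = text.strip()
    for separator in "/. ":
        normalised = normalised.replace(separator, "-")
    for layout in ("%Y-%m-%d", "%d-%m-%Y"):
        try:
            return datetime.strptime(normalised, layout).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date format: {text!r}")


examples = ["25-12-2024", "25/12/2024", "25.12.2024", "25 12 2024", "2024-12-25"]
standardised = [to_iso_date(example) for example in examples]
print(standardised)
```

A month-day-year value such as 03-04-2024 cannot be told apart from
day-month-year by inspection, so this sketch would read it as 3 April. This
ambiguity is one reason why an agreed standard format is needed.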
The use of standardised names and formats will assist an organisation in making
good use of its data. The role of the data administrator involves the creation
of these standards and their publication (including the reasons for them and
guidelines for their use) across the organisation. Data administration provides
a service to the organisation, and it is important that it is perceived as such,
rather than as the introduction of unnecessary rules and regulations.
A number of different approaches or models have been developed for the logical
organisation of data within a database system. This ‘logical’ organisation must
be distinguished from the ‘physical’ organisation of data, which describes how
the data is stored on some suitable storage medium such as a disk. The physical
organisation of data will be dealt with in the chapter on physical storage. By
far the most commonly used approach to the logical organisation of data is the
Relational model. In this section we shall introduce the basic concepts of the
Relational model, and give examples of its use. Later in the module, we shall
make practical use of this knowledge in both using and developing examples of
Relational database applications.
It is very often necessary to identify uniquely each of the different
instances of an entity in a database. In order to do this we use something called
a primary key. We will discuss the nature of primary keys in detail in the next
chapter, but for now we shall use examples where the primary key is
the first of the attributes in each tuple of a relation.
Relation: Stationery
Here, the attributes are item-code, item-name, colour and price. The values for
each attribute for each item are shown as a single value in each column for a
particular row. Thus for item-code 20217, the values are A4 paper 250 sheets
for the item-name, Blue for the attribute colour, and 2.75 is stored as the
price.
Question: Which of the attributes in the stationery relation do you think would
make a suitable key, and why?
The schema defines the ‘shape’ or structure of a relation. It defines the number
of attributes, their names and domains. Column headings in a table represent
the schema. The extension is the set of tuples that comprise the relation at
any time. The extension (contents) of a relation may vary, but the schema
(structure) generally does not.
From the example above, the schema is represented as:
The extension from the above example is given as:
The extension will vary as rows are inserted or deleted from the table, or values
of attributes (e.g. price) change. The number of attributes will not change, as
this is determined by the schema. The number of rows in a relation is sometimes
referred to as its cardinality. The number of attributes is sometimes referred to
as the degree or grade of a relation.
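The distinction between schema and extension, and the terms cardinality and
degree, can be sketched with sqlite3 as a stand-in DBMS. The row for item-code
20217 follows the description in the text; the second row, the underscored
column names and the SQL data types are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE stationery (
        item_code INTEGER PRIMARY KEY,
        item_name TEXT,
        colour    TEXT,
        price     REAL
    )
""")
conn.executemany(
    "INSERT INTO stationery VALUES (?, ?, ?, ?)",
    [
        (20217, "A4 paper 250 sheets", "Blue", 2.75),   # row described in the text
        (20218, "A4 paper 250 sheets", "White", 2.50),  # hypothetical row
    ],
)

# Schema: the attribute names; its length is the degree of the relation.
schema = [row[1] for row in conn.execute("PRAGMA table_info(stationery)")]
degree = len(schema)

# Extension: the current set of rows; its size is the cardinality.
cardinality = conn.execute("SELECT COUNT(*) FROM stationery").fetchone()[0]

# Deleting a row changes the extension but leaves the schema untouched.
conn.execute("DELETE FROM stationery WHERE item_code = 20218")
cardinality_after_delete = conn.execute(
    "SELECT COUNT(*) FROM stationery"
).fetchone()[0]
degree_after_delete = len(conn.execute("PRAGMA table_info(stationery)").fetchall())

print(schema, degree, cardinality, cardinality_after_delete, degree_after_delete)
```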
Each relation needs to be declared, its attributes defined, a domain specified for
each attribute, and a primary key identified.
Review question 7
Distinguish between the terms ‘entity’ and ‘attribute’. Give some examples of
entities and attributes that might be stored in a hospital database.
Review question 8
The range of values that a column in a relational table may be assigned is called
the domain of that column. Many database systems provide the possibility of
specifying limits or constraints upon these values, and this is a very effective
way of screening out incorrect values from being stored in the system. It is
useful, therefore, when identifying which attributes or columns we wish to store
for an entity, to consider carefully what the domain of each column is, and
which values are permissible for that domain.
Consider, then, what the corresponding domains are for the following attributes,
and whether there are any restrictions we can identify which we might use to
validate the correctness of data values entered into attributes with each domain:
• Attribute: EMPLOYEE_NAME
• Attribute: JOB (i.e. the job held by an individual in an organisation)
• Attribute: DATE_OF_BIRTH
Discussion topic
External schemas can be used to give individual users, or groups of users, access
to a part of the data in a database. Many systems also allow the format of the
data to be changed for presentation in the external schema, or for calculations to
be carried out on it to make it more usable to the users of the external schema.
Discuss the possible uses of external schemas, and the sorts of calculations
and/or reformatting that might be used to make the data more usable to specific
users or user groups.
External schemas might be used to provide a degree of security in the database,
by making available to users only that part of the database that they require
in order to perform their jobs. So for example, an Order Clerk may be given
access to order information, while employees working in Human Resources may
be given access to the details of employees.
In order to improve the usability of an external schema, the data in it may be
summarised or organised into categories. For example, an external schema for a
Sales Manager, rather than containing details of individual sales, might contain
summarised details of sales over the last six months, perhaps organised into
categories such as geographical region. Furthermore, some systems provide the
ability to display data graphically, in which case it might be formatted as a bar,
line or pie chart for easier viewing.
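A summarised external schema of the kind described for the Sales Manager might
be sketched as an aggregating view. The example uses sqlite3 as a stand-in
DBMS; the sale table, the view sales_by_region and the figures are all
hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Detailed data: one row per individual sale (hypothetical figures).
    CREATE TABLE sale (sale_no INTEGER PRIMARY KEY, region TEXT, amount REAL);
    INSERT INTO sale (region, amount) VALUES
        ('North', 100.0), ('North', 50.0), ('South', 75.0);

    -- External schema for the Sales Manager: totals per region only.
    CREATE VIEW sales_by_region AS
        SELECT region, COUNT(*) AS sales_count, SUM(amount) AS total_amount
        FROM sale
        GROUP BY region;
""")

summary = conn.execute("SELECT * FROM sales_by_region ORDER BY region").fetchall()
print(summary)
```

The manager sees one row per region rather than one row per sale, while the
detailed data remains available to the applications that need it.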