Database Notes
Def 3: A data structure that stores metadata, i.e. data about data. More generally we can say an organized collection of information.

Def 4: A collection of information organized and presented to serve a specific purpose. (A telephone book is a common database.) A computerized database is an updated, organized collection of this kind of information.

Data as a resource
The Data Resource is the component of the information technology infrastructure that represents all the data available to an organization, whether automated or non-automated. Different business organizations may have different needs.

The data resource is a separate component of this infrastructure. Like the network and server components, and many other components not mentioned, it is important to carefully plan the data resource of the IT infrastructure.

The data resource encompasses every representation of each and every piece of data available to an organization. This means that even non-automated data, such as bulks of paper files in individual staff desks, confidential paper data hidden in steel cabinets, sales receipts, invoices and all other transaction paper documents, constitute the Data Resource. It cannot be denied that, despite the digitalization of business processes, paper still plays a large part in business operations.

A digital Data Resource provides a faster and more efficient means of managing data for the company. In today's world, Data Resource implementation does not end with the digitalization aspect.

Data Resource Management
A managerial activity that applies information systems technology to managing data resources to meet the needs of business stakeholders.
• Levels of data
– Character
• A single alphabetical, numeric, or other symbol
– Field
• Groupings of characters
• Represents an attribute of some entity
– Records
• Related fields of data
• Collection of attributes that describe an entity
• Fixed-length or variable-length
– Files (tables)
• A group of related records
• Classified by
– Primary use
– Type of data
– Permanence

Database
• Integrated collection of logically related data elements
• Consolidates records into a common pool of data elements
• Data is independent of the application programs using it and of the type of storage device

Types of Databases
• Operational
– Supports business processes and operations
– Also called subject-area databases, transaction databases, and production databases
• Distributed
– Replicated and distributed copies or parts of databases on network servers at a variety of sites
– Done to improve database performance and security
• External
– Available for a fee from commercial sources, or with or without charge on the Internet or World Wide Web
• Hypermedia
– Hyperlinked pages of multimedia
• Data warehouse
– Stores data extracted from operational, external, or other databases of an organization
– Central source of "structured" data
– May be subdivided into data marts
Data mining
– A major use of data warehouse databases
– Data is analyzed to reveal hidden correlations, patterns, and trends
– Consolidates data records and objects into databases that can be accessed by many different application programs

Database Management System
– Software interface between users and databases
– Controls creation, maintenance, and use of the database

Query
– Supports ad hoc requests and tells the software how you want to organize the data

Report Generator
– Turns the results of a query into a usable report

Multi-user access control creates structures that allow multiple users to access the data.

Backup and recovery management provides backup and data recovery procedures.

Data integrity management promotes and enforces integrity rules to eliminate data integrity problems.

Database access languages and application programming interfaces provide data access through a query language.

Database communication interfaces allow the database to accept end-user requests within a computer network environment.
File processing evolution (Traditional vs. Database approach)

Traditional Approach
Each application, such as sales by salesperson, invoicing and payroll, maintained its own files to store its data. This can lead to significant duplication of data and the problem of updating all files if a piece of data changes (e.g. the address of a salesperson held in both the payroll and commissions files).

Database Approach
The database approach is to store data about an entity (e.g. a student) only one time, so that if the data changes we only have to change one location, and if we need to get a piece of information about that entity (student) we know the one location where it is stored. All applications share the same data!

PROPERTIES OF DATABASE PACKAGE
- Support of a data model through which the user can view data
- Support of a high-level language that allows the user to define the structure of data
- Transaction management
- Ability to limit access to data by unauthorized users
- Checking validity of data
- Ability to recover from system failure

Advantages Of Databases
- Data independence
  The data files are separate from the applications (HR, payroll, invoicing) and thus can be used by many applications.
- Improved data integrity
  Since data is stored only once for each entity, we don't need to worry about updating multiple records for the same entity (e.g. storing a home address several times for the same person).
- Consistency
- Improved strategic use of data
- Non-duplication of data items / reduced data redundancy
  Data about a person / invoice / product is stored only one time.
- Easier to enforce standards
- Sharing of data
- Improved security
- Cost effective

Disadvantages of Databases
- Concurrency problems
- Prone to network congestion
- Prone to catastrophic failure
- Not suitable for limited records

Database components and their relationships
Hardware
Software
Data
People
Procedures

Components of DBMS
A database management system (DBMS) consists of several components. Each component plays a very important role in the database management system environment. The major components of a database management system are:
Software
Hardware
Data
Procedures
Database Access Language
People (Users)

Software
The main component of a DBMS is the software. It is the set of programs used to handle the database and to control and manage the overall computerized database.
1. The DBMS software itself is the most important software component in the overall system.
2. The operating system, including the network software used in a network to share the data of the database among multiple users.
3. Application programs developed in programming languages such as C++ or Visual Basic that are used to access the database in the database management system. Each program contains statements that request the DBMS to perform operations on the database. The operations may include retrieving, updating or deleting data. The application programs may run on conventional or online workstations or terminals.

Hardware
Hardware consists of a set of physical electronic devices such as computers (together with associated I/O devices like disk drives), storage devices, I/O channels, electromechanical devices that form the interface between computers and real-world systems, and so on. It is impossible to implement the DBMS without the hardware devices. In a network, a powerful computer with high data-processing speed and a storage device with large storage capacity is required as the database server.

Data
Data is the most important component of the DBMS. The main purpose of a DBMS is to process the data. In a DBMS, databases are defined and constructed, and then data is stored in, updated in and retrieved from the databases. The database contains both the actual (or operational) data and the metadata (data about data, or descriptions of data).

Procedures
Procedures refer to the instructions and rules that help to design the database and to use the DBMS. The users that operate and manage the DBMS require documented procedures on how to use or run the database management system. These may include:
1. Procedures to install the new DBMS.
2. To log on to the DBMS.
3. To use the DBMS or application program.
4. To make backup copies of the database.
5. To change the structure of the database.
6. To generate reports of data retrieved from the database.

Database Access Language
The database access language is used to move data to and from the database. The users use the database access language to enter new data, change existing data in the database and retrieve required data from the databases. The user writes a set of appropriate commands in a database access language and submits these to the DBMS. The administrators may also use the database access language to create and maintain the databases.
The most popular database access language is SQL (Structured Query Language).

Users
The users are the people who manage the databases and perform different operations on the databases in the database system. There are three kinds of people who play different roles in the database system:
1. Application Programmers
2. Database Administrators
3. End-Users
Application Programmers
The people who write application programs in programming languages (such as Visual Basic, Java, or C++) to interact with databases are called Application Programmers.

Database Administrators
A person who is responsible for managing the overall database management system is called the database administrator, or simply DBA.

End-Users
The end-users are the people who interact with the database management system to perform different operations on the database, such as retrieving, updating, inserting or deleting data.

Database Components
An Access database consists of several different components. Each component listed is called an object.

Tables: tables are where the actual data is defined and entered. Tables consist of records (rows) and fields (columns).

Queries: queries are basically questions about the data in a database. A query consists of specifications indicating which fields, records, and summaries you want to see from a database. Queries allow you to extract data based on the criteria you define.

Forms: forms are designed to ease the data entry process. For example, you can create a data entry form that looks exactly like a paper form. People generally prefer to enter data into a well-designed form, rather than a table.

Reports: when you want to print records from your database, design a report. Access even has a wizard to help produce mailing labels.

Pages: a data access page is a special type of Web page designed for viewing and working with data from the Internet or an intranet. This data is stored in a Microsoft Access database or a Microsoft SQL Server database.

Macros: a macro is a set of one or more actions that each performs a particular operation, such as opening a form or printing a report. Macros can help you automate common tasks. For example, you can run a macro that prints a report when a user clicks a command button.

Modules: a module is a collection of Visual Basic for Applications declarations and procedures that are stored together as a unit.

A database is used to store and retrieve data. The database is housed in a database server and largely controlled by a database management system. All SQL-based databases, whether they are MS SQL Server, MySQL, Oracle, or Progress, have several components in common. They are:
Tables
Indexes
Views
Stored Procedures
Triggers
It is these various pieces that are used to house, retrieve, and process data with the database.

Tables
Tables are used to store data within the database. They are its main component and without them, the database would serve little purpose. Tables are uniquely named within a database. Many operations, such as queries, use these names. Typically a table is named to represent the type of data stored within. For example, a table holding employee data may be called Employees. A table consists of rows and columns. The columns are defined to house a specific data type, such as dates, numeric, or textual data. Each column is also given a name. Continuing with our example, an
employee's name may be defined in the table as two columns: FirstName and LastName.

Indexes
Indexes are used to make data retrieval faster. Rather than having to scan an entire table for data, an index allows the database to, essentially, directly retrieve the data being asked of it. An index consists of keys, which in most cases directly relate to columns in a table. For example, we could create an index using FirstName and LastName to make it quicker to look up employees by their name. One common property of an index is uniqueness. If an index is unique, then it can only contain unique values for its defined keys. In our employee example, this wouldn't be practical, as a company may have more than one John Smith working at it; however, it would make sense to create a unique index on employee number.
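In SQL, indexes like these are created with CREATE INDEX. A small sketch, assuming the hypothetical Employees table described above (the EmployeeNumber column is an assumed name):

-- non-unique index to speed up look-ups by name
CREATE INDEX idx_employee_name
ON Employees (LastName, FirstName);

-- unique index: no two rows may share the same EmployeeNumber
CREATE UNIQUE INDEX idx_employee_number
ON Employees (EmployeeNumber);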
Views
Relationships between database tables can become quite complicated as data is stored in separate tables. Views help combat this issue by allowing the database administrator to create "canned" or pre-built queries that developers, report writers, and users can use in their own database queries. In this way, the view hides some of the database complexity. This makes it easier to read queries.
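Such a canned query is created with CREATE VIEW. A sketch, assuming hypothetical Employees and Departments tables:

-- the view hides the join; users simply query EmployeeDirectory
CREATE VIEW EmployeeDirectory AS
SELECT e.EmployeeNumber, e.FirstName, e.LastName, d.DepartmentName
FROM Employees e
JOIN Departments d ON d.DepartmentID = e.DepartmentID;

-- developers and report writers can now treat the view like a table
SELECT * FROM EmployeeDirectory WHERE DepartmentName = 'Sales';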
Stored Procedures
There are many situations where queries alone are insufficient to solve a problem. In these cases, developers rely on programming languages to process logic, to loop through records, and to perform conditional comparisons as required. These programs can be stored in the database as stored procedures. The languages used to create stored procedures are vendor specific: T/SQL is the language used by Microsoft SQL Server, whereas PL/SQL is used by Oracle. In each case the language provides the same basic abilities, such as being able to move record by record through a query, perform if-then logic, and call special built-in functions to assist with complicated calculations.
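As a sketch only (T/SQL syntax for Microsoft SQL Server; the Employees table and its EmployeeNumber and Salary columns are assumed names), a stored procedure that applies simple if-then logic before updating a row might look like this:

CREATE PROCEDURE GiveRaise
    @EmployeeNumber INT,
    @Amount DECIMAL(10,2)
AS
BEGIN
    -- simple if-then logic: cap very large raises
    IF @Amount > 10000
        SET @Amount = 10000;

    UPDATE Employees
    SET Salary = Salary + @Amount
    WHERE EmployeeNumber = @EmployeeNumber;
END;

The procedure is then run by name, for example: EXEC GiveRaise @EmployeeNumber = 42, @Amount = 500;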
Triggers
Triggers are special instructions that are executed when important events, such as inserting or updating records in a table, happen. The most common triggers are Insert, Update, and Delete triggers. Two items define a trigger on a table: a stored procedure and an event, such as inserting a record, that invokes its execution. Triggers are useful to ensure that data is updated consistently.
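A sketch of such a pair in T/SQL (the Employees and EmployeeAudit tables are assumed names): the event is an update on Employees, and the stored logic writes an audit row so that related data is kept consistent automatically:

CREATE TRIGGER trg_employee_update
ON Employees
AFTER UPDATE
AS
BEGIN
    -- record which rows changed and when, every time Employees is updated
    INSERT INTO EmployeeAudit (EmployeeNumber, ChangedAt)
    SELECT EmployeeNumber, GETDATE()
    FROM inserted;
END;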
The database management systems can be classified based on several criteria.

The DBMS software is partitioned into several modules. Each module or component is assigned a specific operation to perform. Some of the functions of the DBMS are supported by the operating system (OS), which provides basic services, and the DBMS is built on top of it. The physical data and system catalog are stored on a physical disk. Access to the disk is controlled primarily by the OS, which schedules disk input/output. Therefore, while designing a DBMS its interface with the OS must be taken into account.

(i) Query processor: The query processor transforms user queries into a series of low-level instructions. It is used to interpret the online user's query and convert it into an efficient series of operations in a form capable of being sent to the run-time data manager for execution. The query processor uses the data dictionary to find the structure of the relevant portion of the database and uses this information in modifying the query and preparing an optimal plan to access the database.
(ii) Run-time database manager: The run-time database manager is the central software component of the DBMS, which interfaces with user-submitted application programs and queries. It handles database access at run time. It converts operations in users' queries, coming directly via the query processor or indirectly via an application program, from the user's logical view to the physical file system. It accepts queries and examines the external and conceptual schemas to determine what conceptual records are required to satisfy the user's request. It enforces constraints to maintain the consistency and integrity of the data, as well as its security. It also performs backup and recovery operations. The run-time database manager is sometimes referred to as the database control system and has the following components:

• Authorization control: The authorization control module checks the authorization of users in terms of the various privileges granted to them.
• Command processor: The command processor processes the queries passed by the authorization control module.
• Integrity checker: It checks the integrity constraints so that only valid data can be entered into the database.
• Query optimizer: The query optimizer determines an optimal strategy for the query execution.
• Transaction manager: The transaction manager ensures that the transaction properties are maintained by the system.
• Scheduler: It provides an environment in which multiple users can work on the same piece of data at the same time; in other words, it supports concurrency.

(iii) Data manager: The data manager is responsible for the actual handling of data in the database. It provides recovery to the system, in that the system should be able to recover the data after some failure. It includes the recovery manager and the buffer manager. The buffer manager is responsible for the transfer of data between main memory and secondary storage (such as disk or tape). It is also referred to as the cache manager.

Execution Process of a DBMS
Conceptually, the following logical steps are followed while executing a user's request to access the database system:
(i) Users issue a query using a particular database language, for example SQL commands.
(ii) The parsed query is presented to a query optimizer, which uses information about how the data is stored to produce an efficient execution plan for evaluating the query.
(iii) The DBMS accepts the user's SQL commands and analyses them.
(iv) The DBMS produces query evaluation plans, that is, the external schema for the user, the corresponding external/conceptual mapping, the conceptual schema, the conceptual/internal mapping, and the storage structure definition. Thus, an evaluation plan is a blueprint for evaluating a query.
(v) The DBMS executes these plans against the physical database and returns the answers to the user.
Using components such as the transaction manager, buffer manager, and recovery manager, the DBMS supports concurrency and recovery.

2.0 DATABASE SYSTEMS ARCHITECTURE
Three level schema (External, Conceptual and Internal Schema)
Logical and physical data independence
The Three Schema Architecture
Proposed by ANSI/SPARC in 1975
- fits most DB models: e.g. Relational, Network, Hierarchical
- divides the system into three levels: external, conceptual and internal

External level (or external view, logical view)
Is the individual user's view of the DB (content and organisation of the data).

Users
Relative to the database a user could be:
a professional non-data-processing person using a query language
an application programmer writing a program that accesses the DB
- written in, say, Java, COBOL, C++, Perl (for web programming) or even Microsoft Access
- such programs include some sort of data sublanguage [where data sublanguages consist of: Data Definition/Description Language (DDL) and Data Manipulation Language (DML)] which is used to access the DB, e.g. SQL
Note that the following, generally, are users of application programs and not of the database:
a data entry person using an application program, e.g. a shop counter clerk
a data entry person using a forms-based system, e.g. a telephone call centre staff member

Each user's application program has its own host language (e.g. Java, COBOL or Perl) with the DB's data sublanguage (e.g. SQL) embedded within the host language.
An end data-entry user cannot differentiate between an application's host language and the DB's embedded language.

An external view only contains the data relevant to that user, and an external view's definition of a record (logical record) may be different to what is actually stored.
e.g. an end user may see the entry of a new customer's order as a new, single record; however, that single order may be mapped over many tables in the DB. This same end user should not even see (i.e. have access to) the rest of the DB, which may include customer accounts, suppliers' accounts, etc.

The external schema describes the external view.
The external schema is defined by the Data Definition Language (DDL).
The Data Manipulation Language (DML) is used to access data.
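For instance, with SQL as the data sublanguage, the first statement below is DDL because it defines structure, while the remaining two are DML because they change and access the stored data (a minimal sketch, based on the customer relation used later in these notes):

-- DDL: defines the structure of a table
CREATE TABLE customer (
    custcode CHAR(4) NOT NULL PRIMARY KEY,
    name     CHAR(25)
);

-- DML: manipulates and retrieves the data held in that structure
INSERT INTO customer (custcode, name) VALUES ('REE1', 'REED, JIM');
SELECT name FROM customer WHERE custcode = 'REE1';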
Conceptual level (Abstraction level)
The conceptual view is the representation of the entire DB, of the data "as it really is", i.e. the logical model of the whole DB.
The conceptual schema is a definition of the total DB.
Security checks are defined so that only allowed users obtain access, e.g. passwords and grant permissions for each table.
Integrity checks (validation) are also defined to ensure that users do not harm the correctness of the database, e.g. range checks on certain fields and referential integrity checks (a small SQL sketch follows this section).
Specifications should be data independent, with no reference to physical storage structure or access methods.
Ideally, the conceptual schema describes the complete enterprise, including data flows from point to point, audit considerations etc.
Once a good conceptual schema is developed, the rest is easy.
The Conceptual Data Definition Language (DDL) provides for creation and maintenance of the full schema.
The Database Administrator is responsible for the maintenance of the conceptual schema.

Schema
A formal definition of the logical structure of the database, often represented as an Entity-Relationship diagram.
The schema for the external and internal levels is kept by the database in its Data Dictionary, also known as the Catalogue or System Tables.
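A small sketch of how such integrity checks are declared in SQL (the orders table and its columns here are illustrative; the customer relation itself appears later in the Relational Schema section):

CREATE TABLE orders (
    ordnum   INTEGER NOT NULL PRIMARY KEY,
    custcode CHAR(4) NOT NULL REFERENCES customer(custcode), -- referential integrity check
    quantity INTEGER CHECK (quantity BETWEEN 1 AND 999)      -- range check on a field
);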
Internal level (Physical view)
The internal view is the representation of the actual storage of the DB.
The internal schema defines the records, indexes, files and other physical attributes.
For every conceptual table (file) there may be several actual storage files.
It should be possible to change, reorganise or optimise the physical characteristics without affecting the conceptual view.
Fine tuning here affects all users (programs) that access the DB.
The Internal Data Definition Language is used to write the internal schema.
This language does not specify actual disk block sizes, disk pages etc., which are handled by the OS. So the database's internal level is one removed from the computer's physical level.

Mappings
The DB needs two sets of mappings:
i. between conceptual/internal levels:
defines how the conceptual schema is actually to be stored
alterations to the internal level should be hidden from the conceptual view by updating the mappings
ii. between external/conceptual levels:
defines how an external user sees the database, e.g. only certain fields from a table are visible, or fields from different tables are combined as one single view; a new external view is created by specifying a new mapping

Alterations to one level may not necessarily have any impact on another level.
e.g.: addition of a new field to the database (at the conceptual level) does not impact an external user, as the external schema mapped for the user cannot see the new field anyway, but naturally the new field needs to be mapped onto the internal database.
Similarly, addition of a new index for this new field (or an existing field) at the internal level may significantly improve performance, but has no effect on the external schemas.
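As an illustration of that point, the sketch below (standard SQL; the column and index names are assumed) adds a field and then an index on it; existing external views and the queries written against them continue to work unchanged:

-- conceptual-level change: a new field is added to the table
ALTER TABLE customer ADD COLUMN phone CHAR(15);

-- internal-level change: an index to speed up searches on the new field
CREATE INDEX idx_customer_phone ON customer (phone);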
The Database Administrator
Person(s) responsible for overall control of the total system.
Responsibilities include:
Defining the conceptual schema
Defining the internal schema
Liaising with users, mainly application programmers; includes defining external views for external users for access to the DB
Defining security and integrity rules
Defining backup and recovery procedures
Monitoring and fine-tuning performance

File Organization Methods
Pile.
Sequential.
Indexed-sequential.
Indexed.
Hashed.

Pile
It's the simplest possible organization: the data are collected in the file in the order in which they arrive, and it's not even required that the records have a common format across the file (different fields/sizes, same fields in different orders, etc. are possible). This implies that each record/field must be self-describing. Despite the obvious storage efficiency and the easy update, it's quite clear that this "structure" is not suited for easy data retrieval, since retrieving a datum basically requires detailed analysis of the file content. It makes sense only as temporary storage for data to be later structured in some way.

Sequential
This is the most common structure for large files that are typically processed in their entirety, and it's at the heart of the more complex schemes. In this scheme, all the records have the same size and the same field format, with the fields having fixed size as well. The records are sorted in the file according to the content of a field of a scalar type, called the "key". The key must identify each record uniquely, hence different records have different keys. This organization is well suited for batch processing of the entire file, without adding or deleting items: this kind of operation can take advantage of the fixed size of records and file; moreover, this organization is easily stored both on disk and tape. The key ordering, along with the fixed record size, makes this organization amenable to dichotomic (binary) search. However, adding and deleting records in this kind of file is a tricky process: the logical sequence of records typically matches their physical layout on the storage media, so as to ease file navigation, hence adding a record and maintaining the key order requires a reorganization of the whole file. The usual solution is to make use of a "log file" (also called a "transaction file"), structured as a pile, to perform this kind of modification, and periodically perform a batch update on the master file.

Indexed sequential
An index file can be used to effectively overcome the above-mentioned problem, and to speed up the key search as well. The simplest indexing structure is the single-level one: a file whose records are key-pointer pairs, where the pointer is the position in the data file of the record with the given key. Only a subset of data records, evenly spaced along the data file, are indexed, so as to mark intervals of data records.
A key search then proceeds as follows: the search key is compared with the index ones to find the highest index key preceding the search one, and a linear search is performed from the record the index key points to onward, until the search key is matched or until the record pointed to by the next index entry is reached. In spite of the double file
access (index + data) needed by this kind of search, the decrease in access time with respect to a sequential file is significant.

Consider, for example, the case of simple linear search on a file with 1,000 records. With the sequential organization, an average of 500 key comparisons is necessary (assuming the search keys are uniformly distributed among the data ones). However, using an evenly spaced index with 100 entries, the number of comparisons is reduced to 50 in the index file plus 50 in the data file: a 5:1 reduction in the number of operations.

This scheme can obviously be extended hierarchically: an index is a sequential file in itself, amenable to being indexed in turn by a second-level index, and so on, thus exploiting more and more the hierarchical decomposition of the searches to decrease the access time. Obviously, if the layering of indexes is pushed too far, a point is reached when the advantages of indexing are hampered by the increased storage costs, and by the index access times as well.

Indexed
Why use a single index for only one key field of a data record? Indexes can obviously be built for each field that uniquely identifies a record (or set of records within the file), and whose type is amenable to ordering. Multiple indexes hence provide a high degree of flexibility for accessing the data via searches on various attributes; this organization also allows the use of variable-length records (containing different fields).
It should be noted that when multiple indexes are used, the concept of sequentiality of the records within the file is useless: each attribute (field) used to construct an index typically imposes an ordering of its own. For this very reason it is typically not possible to use the "sparse" (or "spaced") type of indexing previously described. Two types of indexes are usually found in applications: the exhaustive type, which contains an entry for each record in the main file, in the order given by the indexed key, and the partial type, which contains an entry for all those records that contain the chosen key field (for variable records only).

Hashed
As with sequential or indexed files, a key field is required for this organization, as well as fixed record length. However, no explicit ordering of the keys is used for the hash search, other than the one implicitly determined by a hash function.

File Access Method
The way by which information/data can be retrieved. There are two methods of file access:
1. Direct Access
2. Sequential Access

Direct Access
With this access method the information/data stored on a device can be accessed randomly and immediately, irrespective of the order in which it was stored. Data access with this method is quicker than with sequential access. It is also known as the random access method. Examples: hard disk, flash memory.

Sequential Access
With this access method the information/data stored on a device is accessed in the exact order in which it was stored. Sequential access methods are seen in older storage devices such as magnetic tape.

File Organization Method
The process that determines how data/information is stored so that file access can be as easy and quick as possible. Three main ways of file organization:
1. Sequential
2. Index-Sequential
3. Random
Sequential file organization
All records are stored in some sort of order (ascending, descending, alphabetical). The order is based on a field in the record. For example, in a file holding records of employee ID, date of birth and address, the employee ID is used and the records are grouped accordingly (ascending/descending). Can be used with both direct and sequential access.

Index-Sequential organization
The records are stored in some order, but there is a second file, called the index file, that indicates where exactly certain key points are. Cannot be used with the sequential access method.

Random file organization
The records are stored randomly, but each record has its own specific position on the disk (address). With this method no time is wasted searching for a file; instead it jumps to the exact position and accesses the data/information. Can only be used with the direct access method.

What is indexing?
Indexing is a way of sorting a number of records on multiple fields. Creating an index on a field in a table creates another data structure which holds the field value and a pointer to the record it relates to. This index structure is then sorted, allowing binary searches to be performed on it.
An index is a copy of selected columns of data from a table that can be searched very efficiently, and that also includes a low-level disk block address or direct link to the complete row of data it was copied from. Some databases extend the power of indexing by letting developers create indices on functions or expressions.
Put simply, database indexes help speed up retrieval of data. The other great benefit of indexes is that your server doesn't have to work as hard to get the data. They are much the same as book indexes, providing the database with quick jump points on where to find the full reference (or to find the database row).

There are both advantages and disadvantages to using indexes, however.
One disadvantage is they can take up quite a bit of space – check a textbook or reference guide and you'll see it takes quite a few pages to include those page references.
Another disadvantage is that using too many indexes can actually slow your database down. Thinking of a book again, imagine if every "the", "and" or "at" was included in the index. That would stop the index being useful – the index becomes as big as the text! On top of that, each time a page or database row is updated or removed, the reference or index also has to be updated.
So indexes speed up finding data, but slow down inserting, updating or deleting data.

Some fields are automatically indexed. A primary key or a field marked as 'unique' – for example an email address, a userid or a social security number – is automatically indexed so the database can quickly check to make sure that you're not going to introduce bad data.

So when should a database field be indexed?
The general rule is: anything that is used to limit the number of results you're trying to find.

What is hashing?
Hashing is the transformation of a string of characters into a usually shorter fixed-length value or key that represents the original string.
Hashing is used to index and retrieve items in a database because it is faster to find the item using the shorter hashed key than to find it using the original value. It is also used in many encryption algorithms.

As a simple example of the use of hashing in databases, a group of people could be arranged in a database like this:
Abernathy, Sara
Epperdingle, Roscoe
Moore, Wilfred
Smith, David
(and many more sorted into alphabetical order)
Each of these names would be the key in the database for that person's data. A database search mechanism would first have to start looking character-by-character across the name for matches until it found the match (or ruled the other entries out). But if each of the names were hashed, it might be possible (depending on the number of names in the database) to generate a unique four-digit key for each name. For example:
7864 Abernathy, Sara
9802 Epperdingle, Roscoe
1990 Moore, Wilfred
8822 Smith, David
(and so forth)
A search for any name would first consist of computing the hash value (using the same hash function used to store the item) and then comparing for a match using that value. It would, in general, be much faster to find a match across four digits, each having only 10 possibilities, than across an unpredictable value length where each character had 26 possibilities.

The hashing algorithm is called the hash function – probably the term is derived from the idea that the resulting hash value can be thought of as a "mixed up" version of the represented value.

In addition to faster data retrieval, hashing is also used to encrypt and decrypt digital signatures (used to authenticate message senders and receivers). The digital signature is transformed with the hash function and then both the hashed value (known as a message-digest) and the signature are sent in separate transmissions to the receiver. Using the same hash function as the sender, the receiver derives a message-digest from the signature and compares it with the message-digest it also received. (They should be the same.)

The hash function is used to index the original value or key and is then used later each time the data associated with the value or key is to be retrieved. Thus, hashing is always a one-way operation. There's no need to "reverse engineer" the hash function by analyzing the hashed values. In fact, the ideal hash function can't be derived by such analysis. A good hash function also should not produce the same hash value from two different inputs. If it does, this is known as a collision. A hash function that offers an extremely low risk of collision may be considered acceptable.

Here are some relatively simple hash functions that have been used:
Division-remainder method: The size of the number of items in the table is estimated. That number is then used as a divisor into each original value or key to extract a quotient and a remainder. The remainder is the hashed value. (Since this method is liable to produce a number of collisions, any search mechanism would have to be able to recognize a collision and offer an alternate search mechanism.)
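A quick worked example of the division-remainder method: suppose the table is expected to hold roughly 100 items and 97 (a prime near that size) is chosen as the divisor. The key 123456 then hashes to 123456 mod 97 = 72, because 97 * 1272 = 123384 and 123456 - 123384 = 72. Most SQL dialects can show this directly with the modulo operator:

-- the remainder is the hashed value (the bucket where the record is stored)
SELECT 123456 % 97 AS bucket;   -- returns 72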
that I believe it will simply become an embedded feature in storage devices.

Data compression is an effective means for saving storage space and network bandwidth. A large number of compression schemes have been devised based on character encoding or on detection of repetitive strings [2, 18]. Many compression schemes achieve data reduction rates of 2.3-2.5 bits per character for English text [2], i.e., compression factors of about 3¼. Since compression schemes are so successful for network bandwidth, the advantageous effects of data compression on I/O performance in database systems are rather obvious, i.e., its effects on disk space, bandwidth, and throughput. However, we believe that the benefits of compression in database systems can be observed and exploited beyond I/O performance.
Database performance strongly depends on the amount of available memory, be it as I/O buffers or as work space for query processing algorithms. Therefore, it seems logical to try to use all available memory as effectively as possible — in other words, to keep and manipulate data in memory in compressed form. This requires, of course, that the query processing algorithms can operate on compressed data. In this report, we introduce techniques to allow just that, and demonstrate their effect on database performance.

Field-level de-duplication: This involves processing the source data on a column-by-column basis, reducing the dataset to only the list of the unique values that each column holds, together with a frequency count of the number of times each value appears. In this instance the storage space required using field-level de-duplication is a fraction of the original data.

Pattern-level de-duplication: In order to store compressed data in a lossless state, a binary tree is built with pointers that can be used to reconstitute the data as it existed in its original form. Pattern-level de-duplication builds on field-level de-duplication by further leveraging the ability to store only unique values of the branches, again with a frequency count. This is achieved using exactly the same technique as used at the field level to work out the unique combinations.

Algorithmic compression: Field and pattern compression techniques save disk space as much as they save memory. RainStor's algorithmic compression involves innovative techniques designed to reduce the amount of disk space required for storage.

Byte-level compression: In this scenario, components of the tree are aggressively compressed independently, using industry-standard byte-compression algorithms tuned to offer optimal savings.

Front-End Compression
Front-end compression leaves the first entry in an index block unchanged. In all other entries of the block, the compression removes the leftmost characters that are identical to the leftmost characters of the first entry; however, the compression does not necessarily occur when the entry is first added. The index entry will contain a count of the number of characters removed from each entry name.

For example, given the following index block:
FEG.ABC
FEG.ADE
FEG.F
FEX.P
FOM
GES.B

RACF® compresses the entries of the block, preceded by the compression counts, as follows:
0 FEG.ABC
5 DE
4 F
2 X.P
1 OM
0 GES.B

If an entry in an index block has exactly the same name as the first entry in the block, the entry may be totally compressed.

3.0 CLASSIFICATION OF DATABASES
Categorizing databases
According to number of users
According to geographical location
According to purpose

4.0 CLASSIFICATION OF DATABASE/CATEGORIES OF DATABASES
According to number of users

SINGLE USER DBMS VS MULTI USER DBMS
A single user can access the database at one point in time. These types of systems are optimized for a personal desktop experience, not for multiple users of the system at the same time. All the resources are always available for the user to work. The architecture implemented is either one or two tier. Both the application and the physical layer are operated by the user. For example: standalone personal computers, Microsoft Access, etc.

Multi-user DBMS are the systems that support two or more simultaneous users. All mainframes and minicomputers are multi-user systems, but most personal computers and workstations are not. A multiuser database may exist on a single machine, such as a mainframe or other powerful computer, or it may be distributed and exist on multiple computers. Multiuser databases are accessible from multiple computers simultaneously. Many people can be working together to update information at the same time. All employees have access to the most up-to-date information all of the time. Customers have instant access to their personal information held by companies.

COMPARISON BETWEEN SINGLE USER AND MULTIPLE USER DATABASES
Single user DBMS:
- Access is restricted to a single user at a time.
- The database structure is relatively simple.
- Switching between projects is difficult, as different schema repositories are used.
- Changes can be committed in the database without causing deadlock.
- Infrastructure cost is minimal, as the database is accessed by a single user at a time.
- CPU and other resources are wasted while the user/application remains idle.
Multiple user DBMS:
- Access can be shared by multiple users at a time.
- The database structure is complex due to shared access; complexity increases with the structure of the database.
- Switching between projects is easy, as a single schema repository is used.
- Access sharing makes it difficult to make changes, and sometimes causes deadlock.
- Infrastructure such as servers and networks is needed for shared access; maintenance is also an overhead expense.
- Optimum usage/optimization of resources between the various users.

According to Geographical location
Centralized Systems
With centralized database systems, the system is stored at a single site.
According to data model
The most popular data model in use today is the relational data model. Well-known DBMSs like Oracle, MS SQL Server, DB2 and MySQL support this model. Other traditional models are the hierarchical data model and the network data model. In recent years we have become familiar with object-oriented data models, but these models have not had widespread use. Some examples of object-oriented DBMSs are O2, ObjectStore and Jasmine.

4. Efficiency:
The hierarchical database model is a very efficient one when the database contains a large number of 1:N (one-to-many) relationships and when the users require a large number of transactions, using data whose relationships are fixed.

Disadvantages
1. Complexity of Implementation: The actual implementation of a hierarchical database depends on the physical storage of data. This makes the implementation complicated.

2. Difficulty in Management: The movement of a data segment from one location to another causes all the accessing programs to be modified, making database management a complex affair.

3. Complexity of Programming: Programming a hierarchical database is relatively complex because the programmers must know the physical path of the data items.

4. Poor Portability: The database is not easily portable, mainly because there is little or no standard existing for these types of database.

5. Database Management Problems: If you make any changes to the database structure of a hierarchical database, then you need to make the necessary changes in all the application programs that access the database. Thus, maintaining the database and the applications can become very difficult.

6. Lack of structural independence: Structural independence exists when changes to the database structure do not affect the DBMS's ability to access data. Hierarchical database systems use physical storage paths to navigate to the different data segments, so the application programs must have a good knowledge of the relevant access paths to access the data. So, if the physical structure is changed, the applications will also have to be modified. Thus, in a hierarchical database the benefits of data independence are limited by structural dependence.

7. Program Complexity: Due to the structural dependence and the navigational structure, the application programs and the end users must know precisely how the data is distributed physically in the database in order to access data. This requires knowledge of complex pointer systems, which is often beyond the grasp of ordinary users (users who have little or no programming knowledge).

8. Operational Anomalies: As discussed earlier, the hierarchical model suffers from insert anomalies, update anomalies and deletion anomalies; also the retrieval operation is complex and asymmetric, thus the hierarchical model is not suitable for all cases.

9. Implementation Limitation: Many of the common relationships do not conform to the 1:N format required by the hierarchical model. Many-to-many (N:N) relationships, which are more common in real life, are very difficult to implement in a hierarchical model.

Network, e.g. CODASYL
More complex
Many-to-many relationships
More flexible, but doesn't support ad hoc requests well

Relational
Data elements stored in simple tables
Can link data elements from various tables
Very supportive of ad hoc requests, but slower at processing large amounts of data than hierarchical or network models

Object Oriented
Key technology of multimedia web-based applications
Good for complex, high-volume applications
Multi-Dimensional
A variation of the relational model
Cubes of data and cubes within cubes
Popular for online analytical processing (OLAP) applications

Other Database Architectures
Client/Server
DBMS is housed within the server
Clients access the server via some network
Application programs are stored on the client machines
Distributed DBMS
The DB is stored across many computers (servers or hosts)
data is combined from several servers
many computers may help in the processing of one query
machines may be both a client and a server

Relational Database
The Relational Database Model
This was proposed by Edgar (Ted) Codd in 1970. It:
was developed from scratch.
is completely different to earlier models (Hierarchical and CODASYL network).
is based on branches of mathematics including set theory, relational algebra and relational calculus.
requires data to have a simplified structure - at least to first normal form (1NF), preferably at least 3NF.
has more overhead than earlier models.

Features of the Relational Database Model
Very simple (although concepts are tricky at first)
Database is made up of many tables, each of a fixed-length record structure (a table may be stored as a single file - that is up to the DBMS)
There is no physical connection between tables; however, they are linked by cross-references.
All records are directly accessible by any means imaginable.
Data accessed via a query language is based on:
Relational algebra - procedural language - state how data are to be found.
Relational calculus - non-procedural language - state what is required.
Relational database queries lend themselves to optimisation because of their mathematical foundation.

Properties of Tables (Relations)
Definition:
A Table (or Relation) is a time-varying, two-dimensional table of values with column headings and the values within each column being of the same data type.
Entries (like cells in a spreadsheet) in a table are all single valued. This implies 1NF.
Entries in a column (attribute or field) are all of the same data type.
Each column has a unique name within the table.
The order of the columns is not important! (Not always true for SQL.)
No two rows (tuples) are allowed to be identical. (Not true in most implementations.)
Each column has a domain, which is the set of allowable values for the column.
The number of columns is the degree of the relation.
Today the terms table, row and column are preferred over relation, tuple and attribute respectively.
e.g.: The table (relation) CUSTOMER:
CUSTCODE  NAME               STREET              TOWN          BALANCE
REE1      REED, JIM          13 GREGORY'S LANE   SYDNEY        $167.00
DIF1      DIFILLIPO, DARREN  34 SAINTY DRIVE     MELBOURNE     $34.00
REE2      REED, STEVE        23 POMEGRANATE ST   BRISBANE      $702.50
SPL1      SPLATT, RACHELLE   23 DRAGWAYS AVENUE  MELBOURNE     $0.00
KIR1      KIRBY, ROBIN       67 PENNZOIL CLOSE   MAIDEN GULLY  $328.35

Important Definitions
Candidate key
A minimum combination of columns that always uniquely identifies a row ('record'). The set of all columns is always a candidate key.
(Primary) key
One chosen candidate key. (Each table has only one key.)
Foreign key
In a table, a combination of columns which is not the (primary) key, but which is the (primary) key in another table, providing a link between the two tables.
Any data item can be accessed through:
table name
+ primary key value (specifies row)
+ column name
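In SQL terms, that access path corresponds directly to a query of the following form (using the CUSTOMER table above):

SELECT TOWN              -- column name
FROM CUSTOMER            -- table name
WHERE CUSTCODE = 'REE2'; -- primary key value (specifies the row)
-- returns BRISBANE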
Relational Schema
(Schema = data structure or view definition)
Conceptual Schema Description (Data Definition)
Describes the composition of ALL the relations as they really are, often indicating any keys, e.g.:

customer = (custcode CHAR(4) NOT NULL,
            name CHAR(25),
            town CHAR(25),
            balance CURRENCY)
            PRIMARY KEY (custcode);

order = (ordnum INTEGER NOT NULL,
         custcode CHAR(4) NOT NULL,
         orderdate DATE)
         PRIMARY KEY (ordnum)
         FOREIGN KEY (custcode) REFERENCES customer;

Data Structure Diagram
Shows the relationships between the tables.
Derivation of the contents (attributes) of each table is by a process called Normalisation.

Other Relational Features
According to Codd (1990), the advantages of the Relational DBMS are:
Power and Productivity
Adaptability and Safety of Investment
Database controllability
Efficient optimiser (response times)
User (external) views
Integrity control
Concurrency control
Access security
Distributability
The System Catalogue is also stored as a set of tables and can be accessed and queried using the same relational language as the regular data.
Recovery from both soft and hard crashes.
Ease of conversion to/from any other approach
Other features provided by RDBMS:
A language that provides query, data manipulation, data definition and data control facilities.
Efficient file structures for storage and access.
Report and graphic generators

Some criticisms of the Relational Database Model
It is inefficient in that it requires high processing overheads
There is inadequate semantic support (support of meaning)
There is no formal support for multimedia and large objects

Difference between File processing system and DBMS:
1. A database management system coordinates both the physical and the logical access to the data, whereas a file-processing system coordinates only the physical access.

3. A database management system is designed to coordinate multiple users accessing the same data at the same time. A file-processing system is usually designed to allow one or more programs to access different data files at the same time. In a file-processing system, a file can be accessed by two programs concurrently only if both programs have read-only access to the file.

4. Redundancy is controlled in a DBMS, but not in a file system.

5. Unauthorized access is restricted in a DBMS but not in a file system.

7. A DBMS provides backup and recovery, whereas data lost in a file system can't be recovered.

8. A DBMS provides multiple user interfaces. Data is isolated in a file system.

Choosing a file organization is a design decision; hence it must be done having in mind the achievement of good performance with respect to the most likely usage of the file. The criteria usually considered important are:
1. Fast access to a single record or a collection of related records.
2. Easy record adding/update/removal, without disrupting (1).
3. Storage efficiency.
4. Redundancy as a warranty against data corruption.
5.0 DATABASE LIFECYCLE
Database life cycle phases
Activities and personnel involved at each stage
Roles of personnel at each stage
Database life cycle
DBA Responsibilities
The job of the DBA seems to be everything that everyone else either
doesn't want to do, or doesn't have the ability to do. DBAs get the
enviable task of figuring out all of the things no one else can figure
out. More seriously though, here is a list of typical DBA
responsibilities:
Installation, configuration and upgrading of Oracle server
software and related products
Evaluate Oracle features and Oracle related products
Establish and maintain sound backup and recovery policies and
procedures
Take care of the Database design and implementation
Implement and maintain database security (create and
maintain users and roles, assign privileges)
Perform database tuning and performance monitoring
Perform application tuning and performance monitoring
Setup and maintain documentation and standards
Plan growth and changes (capacity planning)
Work as part of a team and provide 24x7 support when required
Perform general technical troubleshooting and give consultation to development teams
Interface with Oracle Corporation for technical support
Patch Management and Version Control

DBA Skills Required
Good understanding of the Oracle database, related utilities and tools
A good understanding of the underlying operating system
A good knowledge of the physical database design
Ability to perform both Oracle and operating system performance tuning and monitoring
Knowledge of ALL Oracle backup and recovery scenarios
A good knowledge of Oracle security management
A good knowledge of how Oracle acquires and manages resources
A good knowledge of Oracle data integrity
Sound knowledge of the implemented application systems
Experience in code migration, database change management and data management through the various stages of the development life cycle
A sound knowledge of both database and system performance tuning
A DBA should have sound communication skills with management, development teams, vendors and systems administrators
A DBA should have the ability to handle multiple projects and deadlines
A DBA should possess a sound understanding of the business

DBA Qualifications
May be certified as an Oracle DBA. See Oracle Certification Program.
Preferably a BS in computer science or a related engineering field
Lots and lots of EXPERIENCE

Application Database Administrator (ADBA)
Application DBAs or ADBAs are responsible for looking after the application tasks pertaining to a specific application. This includes the creation of database objects, snapshots, SQL tuning, etc.
Typical ADBA responsibilities:
Implement and maintain the database design
Create database objects (tables, indexes, etc.)
Write database procedures, functions and triggers
Assist developers with database activities
Tune database queries
Monitor application related jobs and data replication activities

Applications Database Administrator (Apps DBA)
Administration of the Oracle E-Business Suite environment, including the normal DBA and ADBA functions for such an environment.

Oracle Developer
Person that develops application systems that will be deployed on an Oracle database.

Generic responsibilities
The responsibilities below can be added to any of the above.
Complete time sheets
Document work performed
Interview new candidates
Provide a strategic database direction for the organisation

6.0 DATABASE LANGUAGES (SQL)
Data Definition Language (DDL), database views and indexes
Data Manipulation Language (DML) (select, insert, delete and update)

What is SQL?
SQL stands for Structured Query Language
SQL lets you access and manipulate databases
SQL is an ANSI (American National Standards Institute) standard

What Can SQL do?
SQL can execute queries against a database
SQL can retrieve data from a database
SQL can insert records in a database
SQL can update records in a database
SQL can delete records from a database
SQL can create new databases
SQL can create new tables in a database
SQL can create stored procedures in a database
SQL can create views in a database
SQL can set permissions on tables, procedures, and views

The SELECT statement is used to select data from a database. The result is stored in a result table, called the result-set.
SQL SELECT Syntax
SELECT column_name,column_name
FROM table_name;
and
SELECT * FROM table_name;
Example
SELECT CustomerName,City FROM Customers;

SELECT * Example
The following SQL statement selects all the columns from the "Customers" table:
Example
SELECT * FROM Customers;

The SQL SELECT DISTINCT Statement
The SELECT DISTINCT statement is used to return only distinct (different) values.
In a table, a column may contain many duplicate values; and sometimes you only want to list the different (distinct) values.
The DISTINCT keyword can be used to return only distinct (different) values.
SQL SELECT DISTINCT Syntax
SELECT DISTINCT column_name,column_name
FROM table_name;
Example
SELECT DISTINCT City FROM Customers;

The SQL WHERE Clause
The WHERE clause is used to filter records.
The WHERE clause is used to extract only those records that fulfill a specified criterion.
SQL WHERE Syntax
SELECT column_name,column_name
FROM table_name
WHERE column_name operator value;
Example
SELECT * FROM Customers
WHERE Country='Mexico'; The AND & OR operators are used to filter records based on more than
one condition.
The SQL AND & OR Operators
Example The AND operator displays a record if both the first condition AND the
SELECT * FROM Customers second condition are true.
WHERE CustomerID=1; The OR operator displays a record if either the first condition OR the
second condition is true.
29
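As an illustration (this query is not from the notes above, but it uses the same Customers table), the two operators can be combined with parentheses:

SELECT * FROM Customers
WHERE Country='Germany'
AND (City='Berlin' OR City='Munich');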
The second form specifies both the column names and the values to be inserted:
INSERT INTO table_name (column1,column2,column3,...)
VALUES (value1,value2,value3,...);
Assume we wish to insert a new row in the "Customers" table.
We can use the following SQL statement:
Example
INSERT INTO Customers (CustomerName, ContactName, Address, City, PostalCode, Country)
VALUES ('Cardinal','Tom B. Erichsen','Skagen 21','Stavanger','4006','Norway');
Insert Data Only in Specified Columns
It is also possible to insert data only in specific columns.
The following SQL statement will insert a new row, but only insert data in the "CustomerName", "City", and "Country" columns (the CustomerID field will of course also be updated automatically):
Example

SQL UPDATE Syntax
UPDATE table_name
SET column1=value1,column2=value2,...
WHERE some_column=some_value;
Assume we wish to update the customer "Alfreds Futterkiste" with a new contact person and city.
We use the following SQL statement:
Example
UPDATE Customers
SET ContactName='Alfred Schmidt', City='Hamburg'
WHERE CustomerName='Alfreds Futterkiste';

The SQL DELETE Statement
The DELETE statement is used to delete records (rows) in a table.
SQL DELETE Syntax
DELETE FROM table_name
WHERE some_column=some_value;
It is possible to delete all rows in a table without deleting the table. This means that the table structure, attributes, and indexes will remain intact:
DELETE FROM table_name;
or
DELETE * FROM table_name;

The SQL CREATE DATABASE Statement
The CREATE DATABASE statement is used to create a database.
SQL CREATE DATABASE Syntax
CREATE DATABASE dbname;
SQL CREATE DATABASE Example
The following SQL statement creates a database called "my_db":
CREATE DATABASE my_db;

The SQL CREATE TABLE Statement
Database tables can be added with the CREATE TABLE statement.
The CREATE TABLE statement is used to create a table in a database.
Tables are organized into rows and columns, and each table must have a name.
SQL CREATE TABLE Syntax
CREATE TABLE table_name
(
column_name1 data_type(size),
column_name2 data_type(size),
column_name3 data_type(size),
....
);
SQL CREATE TABLE Example
Now we want to create a table called "Persons" that contains five columns: PersonID, LastName, FirstName, Address, and City.
We use the following CREATE TABLE statement:
Example
CREATE TABLE Persons
(
PersonID int,
LastName varchar(255),
FirstName varchar(255),
Address varchar(255),
City varchar(255)
);

SQL Statements
The following is an alphabetical list of SQL statements that can be issued against an Oracle database. These commands are available to any user of the Oracle database. Emphasized items are most commonly used.
ALTER - Change an existing table, view or index definition
AUDIT - Track the changes made to a table
COMMENT - Add a comment to a table or column in a table
COMMIT - Make all recent changes permanent
CREATE - Create new database objects such as tables or views
DELETE - Delete rows from a database table
DROP - Drop a database object such as a table, view or index
GRANT - Allow another user to access database objects such as tables or views
INSERT - Insert new data into a database table
NOAUDIT - Turn off the auditing function
REVOKE - Disallow a user access to database objects such as tables and views
ROLLBACK - Undo any recent changes to the database
SELECT - Retrieve data from a database table
UPDATE - Change the values of some data items in a database table

Some examples of SQL statements follow. For all examples in this tutorial, keywords used by SQL and Oracle are given in all uppercase while user-specific information, such as table and column names, is given in lower case.
To create a new table to hold employee data, we use the CREATE TABLE statement:
CREATE TABLE employee
(fname    VARCHAR2(8),
 minit    VARCHAR2(2),
 lname    VARCHAR2(8),
 ssn      VARCHAR2(9) NOT NULL,
 bdate    DATE,
 address  VARCHAR2(27),
 sex      VARCHAR2(1),
 salary   NUMBER(7) NOT NULL,
 superssn VARCHAR2(9),
 dno      NUMBER(1) NOT NULL);
To insert new data into the employee table, we use the INSERT statement:
INSERT INTO employee
VALUES ('BUD', 'T', 'WILLIAMS', '132451122', '24-JAN-54',
'987 Western Way, Plano, TX', 'M', 42000, NULL, 5);
To retrieve a list of all employees with salary greater than 30000 from the employee table, the following SQL statement might be issued (note that all SQL statements end with a semicolon):
SELECT fname, lname, salary
FROM employee
WHERE salary > 30000;
To give each employee in department 5 a 4 percent raise, the following SQL statement might be issued:
UPDATE employee
SET salary = salary * 1.04
WHERE dno = 5;
To delete an employee record from the database, the following SQL statement might be issued:
DELETE FROM employee
WHERE empid = 101;

7.0 NORMALISATION
Purpose and utilization
Normal forms up to BCNF
Limitations of normalization
1. First normal form: A table is in the first normal form if it contains no repeating columns.
2. Second normal form: A table is in the second normal form if it is in the first normal form and contains only columns that are dependent on the whole (primary) key.
3. Third normal form: A table is in the third normal form if it is in the second normal form and all the non-key columns are dependent only on the primary key. If the value of a non-key column is dependent on the value of another non-key column we have a situation known as transitive dependency. This can be resolved by removing the columns dependent on non-key items to another table.

The Process of Normalisation
Normalisation is a data analysis technique used to design a database system. It allows the database designer to understand the current data structures in an organisation. Furthermore, it aids any future changes and enhancements to the system.
Normalisation is a technique for producing relational schema with the following properties:
No information redundancy
No update anomalies
The end result of normalisation is a set of entities, which removes unnecessary redundancy (ie duplication of data) and avoids the anomalies which will be discussed next.
Anomalies
Anomalies are inconvenient or error-prone situations arising when we process the tables. There are three types of anomalies:
Update Anomalies
Delete Anomalies
Insert Anomalies
Update Anomalies
An Update Anomaly exists when one or more instances of duplicated data are updated, but not all. For example, consider Jones moving address - you need to update all instances of Jones's address.
Delete Anomalies
A Delete Anomaly exists when certain attributes are lost because of the deletion of other attributes. For example, consider what happens if Student S30 is the last student to leave the course - all information about the course is lost.
Insert Anomalies
An Insert Anomaly occurs when certain attributes cannot be inserted into the database without the presence of other attributes. This is the converse of the delete anomaly - we can't add a new course unless we have at least one student enrolled on the course.

Normalisation Stages
The process involves applying a series of tests on a relation to determine whether it satisfies or violates the requirements of a given normal form.
When a test fails, the relation is decomposed into simpler relations that individually meet the normalisation tests.
The higher the normal form, the less vulnerable the relations become to update anomalies.
Three normal forms (1NF, 2NF and 3NF) were initially proposed by Codd.
All these normal forms are based on the functional dependencies among the attributes of a relation.
Normalisation follows a staged process that obeys a set of rules. The steps of normalisation are:
Step 1: Select the data source and convert it into an unnormalised table (UNF)
Step 2: Transform the unnormalised data into first normal form (1NF)
Step 3: Transform data in first normal form (1NF) into second normal form (2NF)
Step 4: Transform data in second normal form (2NF) into third normal form (3NF)
Occasionally, the data may still be subject to anomalies in third normal form. In this case, we may have to perform further transformations:
Transform third normal form to Boyce-Codd normal form (BCNF)
Transform Boyce-Codd normal form to fourth normal form (4NF)
Transform fourth normal form to fifth normal form (5NF)
BCNF is based on the concept of a determinant.
A determinant is any attribute (simple or composite) on which some other attribute is fully functionally dependent.
A relation is in BCNF if, and only if, every determinant is a candidate key.

EXAMPLES
Patient No   Patient Name   Appointment Id   Time    Doctor
1            John           0                09:00   Zorro
2            Kerr           0                09:00   Killer
3            Adam           1                10:00   Zorro
4            Robert         0                13:00   Killer
5            Zane           1                14:00   Zorro
or
DB(Patno,PatName,appNo,time,doctor) (example 1b)
Example 1a - DB(Patno,PatName,appNo,time,doctor)
1NF: Eliminate repeating groups.
None:
DB(Patno,PatName,appNo,time,doctor)
2NF: Eliminate partial key dependencies
DB(Patno,appNo,time,doctor)
R1(Patno,PatName)
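As a sketch only (the notes do not give DDL for this example, and the column types are assumed), the 2NF decomposition above could be declared so that PatName depends on the whole key of its own table:

CREATE TABLE r1
( patno    NUMBER(4) PRIMARY KEY,
  patname  VARCHAR2(20) );

CREATE TABLE db
( patno    NUMBER(4) REFERENCES r1(patno),
  appno    NUMBER(2),
  app_time VARCHAR2(5),   -- "time" renamed here to avoid clashing with a keyword
  doctor   VARCHAR2(20),
  PRIMARY KEY (patno, appno) );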
Without normalisation, it becomes difficult to handle and update the database without facing data loss. Insertion, update and deletion anomalies are very frequent if the database is not normalised. To understand these anomalies, let us take the example of a Student table.
Student Table:
Student   Age   Subject
Adam      15    Biology, Maths
Alex      14    Maths
Stuart    17    Maths

In First Normal Form, any row must not have a column in which more than one value is saved, for example separated with commas. Rather than that, we must separate such data into multiple rows.
Student Table following 1NF will be:
Student   Age   Subject
Adam      15    Biology
Adam      15    Maths
Alex      14    Maths
Stuart    17    Maths
Using the First Normal Form, data redundancy increases, as there will be many columns with the same data in multiple rows, but each row as a whole will be unique.

As per the Second Normal Form, there must not be any partial dependency of any column on the primary key. It means that for a table that has a concatenated primary key, each column in the table that is not part of the primary key must depend upon the entire concatenated key for its existence. If any column depends only on one part of the concatenated key, then the table fails Second Normal Form.
In the example of First Normal Form there are two rows for Adam, to include the multiple subjects that he has opted for. While this is searchable, and follows First Normal Form, it is an inefficient use of space. Also, in the above table in First Normal Form, while the candidate key is {Student, Subject}, the Age of a student only depends on the Student column, which is incorrect as per Second Normal Form. To achieve Second Normal Form, it would be helpful to split out the subjects into an independent table, and match them up using the student names as foreign keys.
New Student Table following 2NF will be:
Student   Age
Adam      15
Alex      14
Stuart    17
In the Student Table the candidate key will be the Student column, because the only other column, i.e. Age, is dependent on it.
New Subject Table introduced for 2NF will be:
Student   Subject
Adam      Biology
Adam      Maths
Alex      Maths
Stuart    Maths
In the Subject Table the candidate key will be the {Student, Subject} column.
Now, both the above tables qualify for Second Normal Form and will never suffer from Update Anomalies. Although there are a few complex cases in which a table in Second Normal Form suffers Update Anomalies, and to handle those scenarios Third Normal Form is there.

Third Normal Form requires that every non-prime attribute of a table must be dependent on the primary key; or we can say that there should not be a case where a non-prime attribute is determined by another non-prime attribute. So this transitive functional dependency should be removed from the table, and the table must also be in Second Normal Form. For example, consider a table with the following fields.
Student_Detail Table:
Student_id   Student_name   DOB   Street   City   State   Zip
In this table Student_id is the primary key, but Street, City and State depend upon Zip. The dependency between Zip and the other fields is called a transitive dependency. Hence to apply 3NF, we need to move Street, City and State to a new table, with Zip as the primary key.
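A minimal sketch of that 3NF decomposition (the new table's name, zip_detail, and the data types are assumed for illustration):

CREATE TABLE zip_detail
( zip    VARCHAR2(10) PRIMARY KEY,
  street VARCHAR2(30),
  city   VARCHAR2(20),
  state  VARCHAR2(20) );

CREATE TABLE student_detail
( student_id   NUMBER(6) PRIMARY KEY,
  student_name VARCHAR2(30),
  dob          DATE,
  zip          VARCHAR2(10) REFERENCES zip_detail(zip) );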
Boyce-Codd Normal Form is a stricter version of Third Normal Form. This form deals with a certain type of anomaly that is not handled by 3NF. A 3NF table which does not have multiple overlapping candidate keys is said to be in BCNF. For a table to be in BCNF, the following conditions must be satisfied: the table must be in Third Normal Form, and, for each functional dependency X -> Y, X must be a superkey.
Here you see the Movies Rented column has multiple values. Now let's move in to 1st Normal Form.
1NF Rules
Each table cell should contain a single value.
Each record needs to be unique.
The above table in 1NF:
Table 1: In 1NF Form
Let's move into 2NF.
2NF Rules
Rule 1 - Be in 1NF
Rule 2 - Single Column Primary Key
It is clear that we can't move forward to make our simple database in 2nd Normal Form unless we partition the table above.
Table 1
Table 2
We have divided our 1NF table into two tables, viz. Table 1 and Table 2. Table 1 contains member information. Table 2 contains information on movies rented.
We have introduced a new column called Membership_id, which is the primary key for Table 1. Records can be uniquely identified in Table 1 using the membership id.
A primary key is a single column value used to uniquely identify a database record.
It has the following attributes:
A primary key cannot be NULL
A primary key value must be unique
The primary key values can not be changed
The primary key must be given a value when a new record is inserted
What is a composite key?
A composite key is a primary key composed of multiple columns used to identify a record uniquely.
In our database, we have two people with the same name Robert Phil, but they live at different places.

In Table 3, Salutation ID is the primary key, and in Table 1, Salutation ID is a foreign key referring to the primary key in Table 3.
A foreign key can have a different name from its primary key
It ensures rows in one table have corresponding rows in another
Unlike primary keys, they do not have to be unique. Most often they aren't
Foreign keys can be null even though primary keys can not

The above problem can be overcome by declaring membership id from Table 2 as a foreign key of membership id from Table 1.
Now, if somebody tries to insert a value in the membership id field that does not exist in the parent table, an error will be shown!
You will only be able to insert values into your foreign key that exist in the unique key in the parent table. This helps in referential integrity.

What is a transitive functional dependency?
A transitive functional dependency is when changing a non-key column might cause any of the other non-key columns to change.
Consider Table 1. Changing the non-key column Full Name may change Salutation.
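A minimal sketch of that foreign key declaration (the full column lists of Table 1 and Table 2 are not reproduced in these notes, so the columns shown here are only illustrative):

CREATE TABLE table1
( membership_id NUMBER(4) PRIMARY KEY,
  full_name     VARCHAR2(40) );

CREATE TABLE table2
( membership_id NUMBER(4) REFERENCES table1(membership_id),
  movies_rented VARCHAR2(40) );

-- An INSERT into table2 with a membership_id that does not exist in table1
-- is rejected by the DBMS; this is the referential integrity check described above.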
Data Integrity
Integrity ensures that the data in a database is both accurate and complete; in other words, that the data makes sense. There are at least five different types of integrity that need to be considered:
Domain constraints
Entity integrity
Column constraints
User-defined integrity constraints
Referential integrity
Domain Constraints
A domain is defined as the set of all unique values permitted for an attribute. For example, a domain of Date is the set of all possible valid dates, a domain of Integer is all possible whole numbers, and a domain of day-of-week is Monday, Tuesday ... Sunday.
This in effect is defining rules for a particular attribute. If it is determined that an attribute is a date, then it should be implemented in the database to prevent invalid dates being entered.
A classic example of this is where the data from a legacy system is loaded into a newly designed database. The new system is well designed: columns that hold dates are defined as such whereas, in the old system, they were held as character strings. Much data is rejected because of invalid dates, eg 30 February 2000.
If the system supports domain constraints, then this invalid data would not have been stored in the first place. That is, the integrity of the database is being preserved.
Entity Integrity
Entity integrity is concerned with ensuring that each row of a table has a unique and non-null primary key value; this is the same as saying that each row in a table represents a single instance of the entity type modelled by the table. A requirement of E F Codd in his seminal paper is that a primary key of an entity, or any part of it, can never take a null value. Oracle, and most other relational database management systems, will enforce this.
Column Constraints
During the data analysis phase, business rules will identify any column constraints. For example, a salary cannot be negative; an employee number must be in the range 1000 - 2000, etc.
User-Defined Integrity Constraints
Business rules may dictate that when a specific action occurs, further actions should be triggered. For example, deletion of a record automatically writes that record to an audit table.
Referential Integrity
Referential integrity is concerned with the relationships between the tables of a database, ie that the data of one table does not contradict the data of another table. Specifically, every foreign key value in a table must have a matching primary key value in the related table. This is the most common type of integrity constraint. It is used to manage the relationships between primary and foreign keys. Referential integrity is best illustrated by an example.
Let's assume the department and employee entities have been implemented as tables in a relational database system. When entering a new employee, the department in which they work needs to be specified. Department number is the foreign key in the employee table and the primary key in the department table. In order to preserve the integrity of the data in the database, there is a set of rules that need to be observed:
If inserting an employee in the employee table, the insertion should only be allowed if their department number exists in the department table.
If deleting a department in the department table, the deletion should only be allowed if there are no employees working in that department.
If changing the value of a department number in the department table, the update should only be allowed if there are no employees working in the department whose number is being changed.
If changing the value of a department number in the employee table, the update should only be allowed if the new value exists in the department table.
If any of the above is allowed to happen, then we have data in an inconsistent state. The integrity of the data is compromised - the data does not make sense.

8.0 DATABASE SECURITY AND INTEGRITY
Database integrity rules
Entity, semantic and referential integrity

INTEGRITY RULES
Integrity rules are needed to inform the DBMS about certain constraints in the real world.
Specific integrity rules apply to one specific database.
Example: part weights must be greater than zero.
General integrity rules apply to all databases.
Two general rules will be discussed, dealing with primary keys and foreign keys.

PRIMARY KEYS
A primary key is a unique identifier for a relation.
There could be several candidate keys, as long as they satisfy two properties:
1. uniqueness
2. minimality
From the set of candidate keys, one is chosen to be the primary key. The others become alternate keys.
EXAMPLE: The relation R(ID, SSN, License_Number, NAME) has several candidate keys.
If we select ID to be the primary key, then the other candidate keys become alternate keys.

THE ENTITY INTEGRITY RULE
No component of the primary key of a base relation is allowed to accept nulls.

WHAT ARE NULLS?
Null may mean "property does not apply". For example, the supplier may be a country, in which case the attribute CITY has a null value because such a property does not apply.
Null may mean "value is unknown". For example, if the supplier is a person, then a null value for the CITY attribute means we do not know the location of this supplier.
Nulls cannot be in primary keys, but can be in alternate keys.
EXAMPLE: SSN may be null for one and only one person (why?)
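A small illustrative sketch (not taken from the notes) of how these rules look in a table definition: the primary key can never be null, while an alternate key declared UNIQUE may be:

CREATE TABLE supplier
( s#   VARCHAR2(3)  PRIMARY KEY,   -- entity integrity: never null, always unique
  ssn  VARCHAR2(9)  UNIQUE,        -- alternate key: unique when present, but may be null
  city VARCHAR2(20) );             -- null may mean "unknown" or "property does not apply"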
FOREIGN KEYS
A foreign key is an attribute of one relation R2 whose values are required to match those of the primary key of some other relation R1 (R1 and R2 can be identical).
EXAMPLE: The SP relation has attribute S#, and the S relation has primary key S#. Then S# in SP is considered a foreign key.
SP is called the "referencing relation". S is called the "referenced relation".
We can draw a "referential diagram":
SP ---S#---> S
or simply
SP --------> S

WHY ARE FOREIGN KEYS IMPORTANT?
Foreign-to-primary-key matchings are the "glue" which holds the database together.
Another way of saying it: foreign keys provide the "links" between two relations.
A relation's foreign key can refer to the same relation.
EXAMPLE: EMP ( EMP#, SALARY, MGR_EMP#, ... )
EMP# is the primary key; MGR_EMP# is the foreign key.
EMP is a "self-referencing relation".

THE REFERENTIAL INTEGRITY RULE
The database must not contain any unmatched foreign key values.
REFERENTIAL INTEGRITY RULE EXAMPLE: The three 3NF relations are:
SP(S#,P#,QTY)
SC(S#,CITY)
CS(CITY,STATUS)
The referential diagrams are:
SP ---S#---> SC ---CITY---> CS
DELETE INTEGRITY RULE:
We should not delete (S5,London) from SC if S5 is present in SP.
INSERT INTEGRITY RULE:
We should not insert (S3,P2,200) into SP unless S3 is present in SC.

Integrity rule 1: Entity integrity
It says that no component of a primary key may be null.
All entities must be distinguishable. That is, they must have a unique identification of some kind. Primary keys perform the unique identification function in a relational database. An identifier that was wholly null would be a contradiction in terms. It would be like there was some entity that did not have any unique identification. That is, it was not distinguishable from other entities. If two entities are not distinguishable from each other, then by definition there are not two entities but only one.

Integrity rule 2: Referential integrity
The referential integrity constraint is specified between two relations and is used to maintain the consistency among tuples of the two relations.
Suppose we wish to ensure that a value that appears in one relation for a given set of attributes also appears for a certain set of attributes in another. This is referential integrity.
The referential integrity constraint states that a tuple in one relation that refers to another relation must refer to an existing tuple in that relation. This means that referential integrity is a constraint specified on more than one relation. This ensures that consistency is maintained across the relations.
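A sketch of how the DELETE and INSERT integrity rules above can be enforced declaratively (data types are assumed for illustration):

CREATE TABLE sc
( s#   VARCHAR2(3) PRIMARY KEY,
  city VARCHAR2(20) );

CREATE TABLE sp
( s#  VARCHAR2(3) REFERENCES sc(s#),  -- INSERT into SP fails unless S# already exists in SC
  p#  VARCHAR2(3),
  qty NUMBER(5) );
-- With this constraint in place, the DBMS will also reject deleting a supplier
-- row from SC while matching rows remain in SP.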
Table A
DeptID   DeptName    DeptManager
F-1001   Financial   Nathan
S-2012   Software    Martin
H-0001   HR          Jason

Table B
EmpNo   DeptID   EmpName
1001    F-1001   Tommy
1002    S-2012   Will
1003    H-0001   Jonathan

Concurrency problems (transactions, locks and deadlocks)
Concurrency is the ability of multiple users to access data at the same time. When the number of simultaneous operations that the database engine can support is large, the database concurrency is increased. In Microsoft SQL Server Compact, concurrency control is achieved by using locks to help protect data. The locks control how multiple users can access and change shared data at the same time without conflicting with each other.
Concurrency Problems
If you do not manage the modification and reading of data by multiple users, concurrency problems can occur. For example, if several users access a database at the same time, their transactions could try to perform operations on the same data at the same time. Some of the concurrency problems that occur while using SQL Server Compact include the following:
Lost updates.
Lost updates occur when two or more transactions select the same row, and then update the row based on the value originally selected. The last update overwrites updates made by the other transactions, resulting in lost data.
Inconsistent analysis (nonrepeatable reads).
Nonrepeatable reads occur when a second transaction accesses the same row several times and reads different data every time. This involves multiple reads of the same row. Every time, the information is changed by another transaction.
Phantom reads.
Phantom reads occur when an insert or a delete action is performed against a row that belongs to a range of rows being read by a transaction. The transaction's first read of the range of rows shows a row that no longer exists in the subsequent read, because of a deletion by a different transaction. Similarly, as the result of an insert by a different transaction, the subsequent read of the transaction shows a row that did not exist in the original read.

Database transactions and the ACID rules
The concept of a database transaction (or atomic transaction) has evolved in order to enable both a well understood database system behavior in a faulty environment where crashes can happen at any time, and recovery from a crash to a well understood database state. A database transaction is a unit of work, typically encapsulating a number of operations over a database (e.g., reading a database object, writing, acquiring a lock, etc.), an abstraction supported in database and also other systems. Each transaction has well defined boundaries in terms of which program/code executions are included in that transaction (determined by the transaction's programmer via special transaction commands). Every database transaction obeys the following rules (by support in the database system; i.e., a database system is designed to guarantee them for the transactions it runs):
Atomicity - Either the effects of all or none of its operations remain ("all or nothing" semantics) when a transaction is completed (committed or aborted respectively). In other words, to the outside world a committed transaction appears (by its effects on the database) to be indivisible (atomic), and an aborted transaction does not affect the database at all, as if it never happened.
Consistency - Every transaction must leave the database in a consistent (correct) state, i.e., maintain the predetermined integrity rules of the database (constraints upon and among the database's objects). A transaction must transform a database from one consistent state to another consistent state (however, it is the responsibility of the transaction's programmer to make sure that the transaction itself is correct, i.e., performs correctly what it intends to perform from the application's point of view, while the predefined integrity rules are enforced by the DBMS). Thus, since a database can normally be changed only by transactions, all the database's states are consistent.
Isolation - Transactions cannot interfere with each other (as an end result of their executions). Moreover, usually (depending on the concurrency control method) the effects of an incomplete transaction are not even visible to another transaction. Providing isolation is the main goal of concurrency control.
Durability - Effects of successful (committed) transactions must persist through crashes (typically by recording the transaction's effects and its commit event in non-volatile memory).
The concept of the atomic transaction has been extended over the years to what have become business transactions, which actually implement types of workflow and are not atomic. However, such enhanced transactions typically utilize atomic transactions as components.

Why is concurrency control needed?
If transactions are executed serially, i.e., sequentially with no overlap in time, no transaction concurrency exists. However, if concurrent transactions with interleaving operations are allowed in an uncontrolled manner, some unexpected, undesirable results may occur, such as:
1. The lost update problem: A second transaction writes a second value of a data-item (datum) on top of a first value written by a first concurrent transaction, and the first value is lost to other transactions running concurrently which need, by their precedence, to read the first value. The transactions that have read the wrong value end with incorrect results.
2. The dirty read problem: Transactions read a value written by a transaction that has been later aborted. This value disappears from the database upon abort, and should not have been read by any transaction ("dirty read"). The reading transactions end with incorrect results.
3. The incorrect summary problem: While one transaction takes a summary over the values of all the instances of a repeated data-item, a second transaction updates some instances of that data-item. The resulting summary does not reflect a correct result for any (usually needed for correctness) precedence order between the two transactions (if one is executed before the other), but rather some random result, depending on the timing of the updates, and on whether certain update results have been included in the summary or not.
Most high-performance transactional systems need to run transactions concurrently to meet their performance requirements. Thus, without concurrency control such systems can neither provide correct results nor maintain their databases consistent.
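An illustrative sketch (not from the notes) of a transaction on the employee table from section 6.0, using an explicit row lock to avoid the lost update problem:

SELECT salary FROM employee
WHERE ssn = '132451122'
FOR UPDATE;               -- pessimistic locking: concurrent writers must wait

UPDATE employee
SET salary = salary * 1.04
WHERE ssn = '132451122';

COMMIT;                   -- atomicity: both statements take effect together;
                          -- ROLLBACK instead would undo the work entirely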
Concurrency control mechanisms
The main categories of concurrency control mechanisms are:
Optimistic - Delay the checking of whether a transaction meets the isolation and other integrity rules (e.g., serializability and recoverability) until its end, without blocking any of its (read, write) operations ("...and be optimistic about the rules being met..."), and then abort a transaction to prevent the violation, if the desired rules are to be violated upon its commit. An aborted transaction is immediately restarted and re-executed, which incurs an obvious overhead (versus executing it to the end only once). If not too many transactions are aborted, then being optimistic is usually a good strategy.
Pessimistic - Block an operation of a transaction, if it may cause violation of the rules, until the possibility of violation disappears. Blocking operations is typically involved with performance reduction.
Semi-optimistic - Block operations in some situations, if they may cause violation of some rules, and do not block in other situations while delaying rules checking (if needed) to the transaction's end, as done with optimistic.
Different categories provide different performance, i.e., different average transaction completion rates (throughput), depending on transaction type mix, computing level of parallelism, and other factors. If selection and knowledge about trade-offs are available, then the category and method should be chosen to provide the highest performance.
The mutual blocking between two (or more) transactions, where each one blocks the other, results in a deadlock, where the transactions involved are stalled and cannot reach completion. Most non-optimistic mechanisms (with blocking) are prone to deadlocks, which are resolved by an intentional abort of a stalled transaction (which releases the other transactions in that deadlock), and its immediate restart and re-execution. The likelihood of a deadlock is typically low.
Blocking, deadlocks, and aborts all result in performance reduction, and hence the trade-offs between the categories.

Methods
Many methods for concurrency control exist. Most of them can be implemented within either main category above. The major methods, which each have many variants, and in some cases may overlap or be combined, are:
1. Locking (e.g., Two-phase locking - 2PL) - Controlling access to data by locks assigned to the data. Access of a transaction to a data item (database object) locked by another transaction may be blocked (depending on lock type and access operation type) until lock release.
2. Serialization graph checking (also called Serializability, Conflict, or Precedence graph checking) - Checking for cycles in the schedule's graph and breaking them by aborts.
3. Timestamp ordering (TO) - Assigning timestamps to transactions, and controlling or checking access to data by timestamp order.
4. Commitment ordering (or Commit ordering; CO) - Controlling or checking transactions' chronological order of commit events to be compatible with their respective precedence order.
Other major concurrency control types that are utilized in conjunction with the methods above include:
Multiversion concurrency control (MVCC) - Increasing concurrency and performance by generating a new version of a database object each time the object is written, and allowing transactions' read operations to use one of several recent relevant versions (of each object), depending on the scheduling method.

Major goals of concurrency control mechanisms
Concurrency control mechanisms firstly need to operate correctly, i.e., to maintain each transaction's integrity rules (as related to concurrency; application-specific integrity rules are out of scope here) while transactions are running concurrently, and thus the integrity of the entire transactional system. Correctness needs to be achieved with as good performance as possible. In addition, increasingly a need exists to operate effectively while transactions are distributed over processes, computers, and computer networks. Other subjects that may affect concurrency control are recovery and replication.
Almost all implemented concurrency control mechanisms achieve serializability by providing conflict serializability, a broad special case of serializability (i.e., it covers and enables most serializable schedules, and does not impose significant additional delay-causing constraints) which can be implemented efficiently.
Recoverability
Comment: While in the general area of systems the term "recoverability" may refer to the ability of a system to recover from failure or from an incorrect/forbidden state, within concurrency control of database systems this term has received a specific meaning.
Concurrency control typically also ensures the recoverability property of schedules, for maintaining correctness in cases of aborted transactions (which can always happen for many reasons). Recoverability (from abort) means that no committed transaction in a schedule has read data written by an aborted transaction. Such data disappear from the database (upon the abort) and are parts of an incorrect database state. Reading such data violates the consistency rule of ACID. Unlike serializability, recoverability cannot be compromised or relaxed in any case, since any relaxation results in quick database integrity violation upon aborts. The major methods listed above provide serializability mechanisms. None of them in its general form automatically provides recoverability, and special considerations and mechanism enhancements are needed to support recoverability. A commonly utilized special case of recoverability is strictness, which allows efficient database recovery from failure (but excludes optimistic implementations; e.g., Strict CO (SCO) cannot have an optimistic implementation, but has semi-optimistic ones).
Comment: Note that the recoverability property is needed even if no database failure occurs and no database recovery from failure is needed. It is rather needed to correctly and automatically handle transaction aborts, which may be unrelated to database failure and recovery from it.

Distribution
With the fast technological development of computing, the difference between local and distributed computing over low latency networks or buses is blurring. Thus the quite effective utilization of local techniques in such distributed environments is common, e.g., in computer clusters and multi-core processors. However, the local techniques have their limitations and use multi-processes (or threads) supported by multi-processors (or multi-cores) to scale. This often turns transactions into distributed ones, if they themselves need to span multi-processes. In these cases most local concurrency control techniques do not scale well.

Database security techniques
Authorization mechanism
Access matrix and delegation hierarchy
Views
Audit trails
Encryption

Secrecy, Integrity and Availability
The objective of data security can be divided into three separate, but interrelated, areas as follows.
Secrecy is concerned with improper disclosure of information. The terms confidentiality or non-disclosure are synonyms for secrecy.
Integrity is concerned with improper modification of information or processes.
Availability is concerned with improper denial of access to information. The term denial of service is also used as a synonym for availability.
These three objectives arise in practically every information system. For example, in a payroll system secrecy is concerned with preventing an employee from finding out the boss's salary; integrity is concerned with preventing an employee from changing his or her salary; and availability is concerned with ensuring that the paychecks are printed on time as required by law.
Similarly, in a military command and control system secrecy is concerned with preventing the enemy from determining the target coordinates of a missile; integrity is concerned with preventing the enemy from altering the target coordinates; and availability is concerned with ensuring that the missile does get launched when the order is given.
Any system will have these three requirements co-existing to some degree. There are of course differences regarding the relative importance of these objectives in a given system. The commercial and military sectors both have similar needs for high integrity systems. The secrecy and availability requirements of the military are often more stringent than in typical commercial applications.
These three objectives also differ with respect to our understanding of the objectives themselves and of the technology to achieve them. It is easiest to understand the objective of secrecy. Integrity is a less tangible objective on which experts in the field have diverse opinions. Availability is technically the least understood aspect. In terms of technology, the dominance of the commercial sector in the marketplace has led vendors to emphasize mechanisms for integrity rather than for military-like secrecy needs. The availability objective is so poorly understood that no product today even tries to address it directly. Availability is discussed only in passing in this chapter.

Tips on Database Security
Enable Security Controls: Unlike older databases, the newer databases require passwords to gain full access to the stored data. Often when the databases are shipped, none of the security features are enabled. Make sure you check the security controls and enable all of the features before allowing anyone access to the database.
Check the Patch Level: Check the patch level configuration in the database to determine if there are any vulnerabilities in the default settings. Also, perform a full assessment of the database to fix any existing vulnerabilities in the system before placing any data into the database.
Exclude Copying of the Database: Although you may have one chief IT administrator who is the primary gatekeeper to the database, there is no control over the data once the database has been copied. For this reason you should disallow database copying, because it represents an internal threat to database security.
Restrict Access: Restrict access to the database by specifically designating who is allowed administrator privileges. For a small business it is a good idea to delegate this responsibility to one IT administrator and then place certain restrictions on other users. In addition to restricting access, make sure the backups are stored in an encrypted format and restrict access to XML files. The files in XML format are files from a discontinued database.
Existing Databases: There are database discovery tools which identify existing databases that contain confidential information. The tools also monitor existing databases to ensure the information is stored in encrypted format. In addition to the new database, make sure you monitor all of the existing databases to ensure that information is encrypted, there are no vulnerabilities, and that there are no duplicates.
Shared Data: Sharing data becomes a concern when businesses have to train new employees and developers have to test new database applications.
In this instance, the IT administrator can perform what is called subsetting, which provides a separate type of restricted access with fake information substituted for the sensitive information. Subsetting a database basically allows developers and new employees to use the database for testing and training without exposing confidential or sensitive information.
Keep in mind that securing a database also requires a change in thinking on the part of the database administrator as well as the workers who have access privileges or restrictions to the database. A change in attitude ensures that everyone is on the same page with what is expected when it comes to keeping data secure.

9.0 DISASTER RECOVERY TECHNIQUES
Disaster recovery policy
When you are administering a SQL Server database, preparing to recover from potential disasters is important. A well-designed and tested backup and restore plan for your SQL Server backups is necessary for recovering your databases after a disaster. For more information, see Introduction to Backup and Restore Strategies in SQL Server. In addition, to make sure that all your systems and data can be quickly restored to regular operation if a natural disaster occurs, you must create a disaster recovery plan. When you create this plan, consider scenarios for different types of disasters that might affect your shop. These include natural disasters, such as a fire, and technical disasters, such as a two-disk failure in a RAID-5 array. When you create a disaster recovery plan, identify and prepare for all the steps that are required to respond to each type of disaster. Testing the recovery steps for each scenario is necessary. We recommend that you verify your disaster recovery plan through the simulation of a natural disaster.
When you are designing your backup and restore plan, you should consider your disaster recovery planning with regard to your particular environmental and business needs. For example, suppose a fire occurs and wipes out your 24-hour data center. Are you certain you can recover? How long will it take you to recover and have your system available? How much data loss can your users tolerate?
Ideally, your disaster recovery plan states how long recovery will take and the final database state the users can expect. For example, you might determine that after the acquisition of specified hardware, recovery will be completed in 48 hours, and data will be guaranteed only until the end of the previous week.
A disaster recovery plan can be structured in many different ways and can contain many types of information. Disaster recovery plan types include the following:
A plan to acquire hardware.
A communication plan.
A list of people to be contacted if a disaster occurs.
Instructions for contacting the people involved in the response to the disaster.
Information about who owns the administration of the plan.
A checklist of required tasks for each recovery scenario. To help you review how disaster recovery progressed, initial each task as it is completed, and indicate the time when it finished on the checklist.
Formulating a detailed recovery plan is the main aim of the entire IT disaster recovery planning project. It is in these plans that you will set out the detailed steps needed to recover your IT systems to a state in which they can support the business after a disaster.
But before you can generate that detailed recovery plan, you'll need to perform a risk assessment (RA) and/or business impact analysis (BIA) to identify the IT services that support the organisation's critical business activities.
organisation’s critical business activities. Then, you’ll need to security, staff access procedures, ID badges and the location of
establish recovery time objectives (RTOs) and recovery point the alternate space relative to the primary site.
objectives (RPOs).
Once this work is out of the way, you’re ready to move on to Technology. You’ll need to consider access to equipment space
developing disaster recovery strategies, followed by the actual that is properly configured for IT systems, with raised floors,
plans. Here we’ll explain how to write a disaster recovery plan for example; suitable heating, ventilation and air conditioning
as well as how to develop disaster recovery strategies. (HVAC) for IT systems; sufficient primary electrical power;
Developing DR strategies suitable voice and data infrastructure; the distance of the
Translating strategies into plans alternate technology area from the primary site; provision for
Incident response staffing at an alternate technology site; availability of failover
DR plan structure (to a backup system) and failback (return to normal operations)
Developing DR strategies technologies to facilitate recovery; support for legacy systems;
Regarding disaster recovery strategies, ISO/IEC 27031, the and physical and information security capabilities at the
global standard for IT disaster recovery, states, “Strategies alternate site.
should define the approaches to implement the required Data. Areas to look at include timely backup of critical data to a
resilience so that the principles of incident prevention, secure storage area in accordance with RTO/RPO
detection, response, recovery and restoration are put in place.” requirements, method(s) of data storage (disk, tape, optical,
Strategies define what you plan to do when responding to an etc), connectivity and bandwidth requirements to ensure all
incident, while plans describe how you will do it. critical data can be backed up in accordance with RTO/RPO
Once you have identified your critical systems, RTOs, RPOs, etc, time scales, data protection capabilities at the alternate
create a table, as shown below, to help you formulate the storage site, and availability of technical support from qualified
disaster recovery strategies you will use to protect them. third-party service providers.
People. This involves availability of staff/contractors, training Suppliers. You’ll need to identify and contract with primary and
needs of staff/contractors, duplication of critical skills so there alternate suppliers for all critical systems and processes, and
can be a primary and at least one backup person, available even the sourcing of people. Key areas where alternate
documentation to be used by staff, and follow-up to ensure suppliers will be important include hardware (such as servers,
staff and contractor retention of knowledge. racks, etc), power (such as batteries, universal power supplies,
power protection, etc), networks (voice and data network
Physical facilities. Areas to look at are availability of alternate services), repair and replacement of components, and multiple
work areas within the same site, at a different company delivery firms (FedEx, UPS, etc).
location, at a third-party-provided location, at employees’
homes or at a transportable work facility. Then consider site Policies and procedures. Define policies for IT disaster recovery
and have them approved by senior management. Then define
57
Finally, be sure to obtain management sign-off for your strategies. Be prepared to demonstrate that your strategies align with the organisation's business goals and business continuity strategies.

Translating disaster recovery strategies into DR plans
Once your disaster recovery strategies have been developed, you're ready to translate them into disaster recovery plans. Let's take Table 1 and recast it into Table 2, seen below. Here we can see the critical system and associated threat, the response strategy and (new) response action steps, as well as the recovery strategy and (new) recovery action steps. This approach can help you quickly drill down and define high-level action steps.

Developing DR plans
DR plans provide a step-by-step process for responding to a disruptive event. Procedures should ensure an easy-to-use and repeatable process for recovering damaged IT assets and returning them to normal operation as quickly as possible. If staff relocation to a third-party hot site or other alternate space is necessary, procedures must be developed for those activities.
When developing your IT DR plans, be sure to review the global standards ISO/IEC 24762 for disaster recovery and ISO/IEC 27035 (formerly ISO 18044) for incident response activities.

Incident response
In addition to using the strategies previously developed, IT disaster recovery plans should form part of an incident response process that addresses the initial stages of the incident and the steps to be taken. This process can be seen as a timeline, such as in Figure 2, in which incident response actions precede disaster recovery actions.

The DR plan structure
The following section details the elements in a DR plan in the sequence defined by ISO 27031 and ISO 24762.
Important: Best-in-class DR plans should begin with a few pages that summarise key action steps (such as where to assemble employees if forced to evacuate the building) and lists of key contacts and their contact information, for ease of authorising and launching the plan.
1. Introduction. Following the initial emergency pages, DR plans have an introduction that includes the purpose and scope of the plan. This section should specify who has approved the plan, who is authorised to activate it, and a list of linkages to other relevant plans and documents.
2. Roles and responsibilities. The next section should define the roles and responsibilities of DR recovery team members, their contact details, spending limits (for example, if equipment has to be purchased) and the limits of their authority in a disaster situation.
3. Incident response. During the incident response process, we typically become aware of an out-of-normal situation (such as being alerted by various system-level alarms), quickly assess the situation (and any damage) to make an early determination of its severity, attempt to contain the incident and bring it under control, and notify management and other key stakeholders.
4. Plan activation. Based on the findings from incident response activities, the next step is to determine if disaster recovery plans should be launched, and which ones in particular should be invoked. If DR plans are to be invoked, incident response activities can be scaled back or terminated, depending on the incident, allowing for launch of the DR plans. This section defines the criteria for launching the plan, what data is needed and who makes the determination. Included within this part of the plan should be assembly areas for staff (primary and alternates), procedures for notifying and activating DR team members, and procedures for standing down the plan if management determines the DR plan response is not needed.
5. Document history. A section on plan document dates and revisions is essential, and should include dates of revisions, what was revised and who approved the revisions. This can be located at the front of the plan document.
7. Appendixes. Located at the end of the plan, these can include systems inventories, application inventories, network asset inventories, contracts and service-level agreements, supplier contact data, and any additional documentation that will facilitate recovery.

Further activities
Once your DR plans have been completed, they are ready to be exercised. This process will determine whether they will recover and restore IT assets as planned.
In parallel to these activities are three additional ones: creating employee awareness, training and records management. These are essential in that they ensure employees are fully aware of DR plans and their responsibilities in a disaster, and that DR team members have been trained in their roles and responsibilities as defined in the plans. And since DR planning generates a significant amount of documentation, records management (and change management) activities should also be initiated. If your organisation already has records management and change management programmes, use them in your DR planning.
Consistent State: In a valid state, with the information contained satisfying user consistency constraints. Varies depending on the database and users.
Crash: A failure of a system that is covered by a recovery technique.
Catastrophe: A failure of a system not covered by a recovery technique.
Possible Levels of Recovery:
Recovery to the correct state.
Recovery to a checkpointed (past) correct state.
Recovery to a possible previous state.
Backup/current version: Present files form the current version of the database. Files containing previous values form a consistent backup version. (2,3)
Multiple copies: Multiple active copies of each file are maintained during normal operation of the database. In cases of failure, comparison between the versions can be used to find a consistent version. (6)
Careful replacement: Nothing is updated in place, with the original only being deleted after the operation is complete. (2,6)
(Parentheses and numbers are used to indicate which levels from above are supported by each technique.)
Combinations of two techniques can be used to offer similar protection against different kinds of failures. The techniques above, when implemented, force changes to:
o The way data is structured (4,5,6).
o The way data is updated and manipulated (7).
o Nothing (available as utilities) (1,2,3).

Examples and bits of wisdom:
o Original Multics system: all disk files updated or created by the user are copied when the user signs off. All newly created or modified files not previously dumped are copied to tapes once per hour. High reliability, but very high overhead. Changed to a system using a mix of incremental dumping, full checkpointing, and salvage programs.
o Several other systems maintain backup copies of data through the paging system (keep backups in the swap space).
o Use of buffers is dangerous for consistency.
o Intention lists: specify the audit trail before it actually occurs.
o Recovery among interacting processes is hard. You can either prevent the interaction or synchronize with respect to recovery.
o Error detection is difficult, and can be costly.

Backup types

Full backups

A full backup is exactly what the name implies. It is a full copy of your entire data set. Although full backups arguably provide the best protection, most organizations only use them on a periodic basis because they are time consuming, and often require a large number of tapes or disks.

Incremental backup

Because full backups are so time consuming, incremental backups were introduced as a way of decreasing the amount of time that it takes to do a backup. Incremental backups only back up the data that has changed since the previous backup. For example, suppose that you created a full backup on Monday, and used incremental backups for the rest of the week. Tuesday's backup would only contain the data that has changed since Monday. Wednesday's backup would only contain the data that has changed since Tuesday.

The primary disadvantage to incremental backups is that they can be time-consuming to restore. Going back to my previous example, suppose that you wanted to restore the backup from Wednesday. To do so, you would have to first restore Monday's full backup. After that, you would have to restore Tuesday's tape, followed by Wednesday's. If any of the tapes happen to be missing or damaged, then you will not be able to perform the full restoration.

Differential backups

A differential backup is similar to an incremental backup in that it starts with a full backup, and subsequent backups only contain data that has changed. The difference is that while an incremental backup only includes the data that has changed since the previous backup, a differential backup contains all of the data that has changed since the last full backup.

Suppose for example that you wanted to create a full backup on Monday and differential backups for the rest of the week. Tuesday's backup would contain all of the data that has changed since Monday.
It would therefore be identical to an incremental backup at this point. On Wednesday, however, the differential backup would back up any data that had changed since Monday.

The advantage that differential backups have over incremental is shorter restore times. Restoring a differential backup never requires more than two tape sets. Incremental backups, on the other hand, may require a great number of tape sets. Of course the tradeoff is that, as time progresses, a differential backup tape can grow to contain much more data than an incremental backup tape.

Synthetic full backup

A synthetic full backup is a variation of an incremental backup. Like any other incremental backup, the actual backup process involves taking a full backup, followed by a series of incremental backups. But synthetic backups take things one step further. What makes a synthetic backup different from an incremental backup is that the backup server actually produces full backups. It does this by combining the existing full backup with the data from the incremental backups. The end result is a full backup that is indistinguishable from a full backup that has been created in the traditional way.

As you can imagine, the primary advantage to synthetic full backups is greatly reduced restore times. Restoring a synthetic full backup doesn't require the backup operator to restore multiple tape sets as an incremental backup does. Synthetic full backups provide all of the advantages of a true full backup, but offer the decreased backup times and decreased bandwidth usage of an incremental backup.

Incremental-forever backup

Incremental-forever backups are often used by disk-to-disk-to-tape backup systems. The basic idea is that, like an incremental backup, an incremental-forever backup begins by taking a full backup of the data set. After that point, only incremental backups are taken.

What makes an incremental-forever backup different from a normal incremental backup is the availability of data. As you will recall, restoring an incremental backup requires the tape containing the full backup, and every subsequent backup up to the backup that you want to restore. While this is also true for an incremental-forever backup, the backup server typically stores all of the backup sets on either a large disk array or in a tape library. It automates the restoration process so that you don't have to figure out which tape sets need to be restored. In essence, the process of restoring the incremental data becomes completely transparent and mimics the process of restoring a full backup.

Full Backup

Full backup is a method of backup where all the files and folders selected for the backup will be backed up. When subsequent backups are run, the entire list of files and folders will be backed up again. The advantage of this backup is that restores are fast and easy, as the complete list of files is stored each time. The disadvantage is that each backup run is time consuming, as the entire list of files is copied again. Also, full backups take up a lot more storage space when compared to incremental or differential backups.

Incremental backup

Incremental backup is a backup of all changes made since the last backup. With incremental backups, one full backup is done first and subsequent backup runs are just the changes made since the last backup. The result is a much faster backup than a full backup for each backup run. Storage space used is much less than for a full backup and less than with differential backups. Restores are slower than with a full backup and a differential backup.
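To make the difference between the policies concrete, here is a minimal, illustrative sketch (not from the source) of how a backup tool might decide which files to copy, assuming change detection by file modification time; the function name, timestamps and directory are hypothetical.

    import os
    import time

    def files_to_back_up(root, policy, last_backup_time, last_full_time):
        # policy: "full", "incremental" or "differential"
        # last_backup_time: time of the most recent backup of any kind
        # last_full_time:   time of the most recent full backup
        selected = []
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                mtime = os.path.getmtime(path)
                if policy == "full":
                    selected.append(path)                  # copy everything, every run
                elif policy == "incremental" and mtime > last_backup_time:
                    selected.append(path)                  # changed since the last backup of any kind
                elif policy == "differential" and mtime > last_full_time:
                    selected.append(path)                  # changed since the last full backup
        return selected

    # Example run: a full backup on "Monday", then a run two days later.
    monday = time.time() - 2 * 86400
    tuesday = time.time() - 1 * 86400
    print(len(files_to_back_up(".", "incremental", last_backup_time=tuesday, last_full_time=monday)))
    print(len(files_to_back_up(".", "differential", last_backup_time=tuesday, last_full_time=monday)))

Restoring reverses the trade-off described above: an incremental chain needs the full backup plus every incremental since it, while a differential restore needs only the full backup plus the latest differential.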
Differential backup

Differential backup is a backup of all changes made since the last full backup. With differential backups, one full backup is done first and subsequent backup runs are the changes made since the last full backup. The result is a much faster backup than a full backup for each backup run. Storage space used is much less than for a full backup but more than with incremental backups. Restores are slower than with a full backup but usually faster than with incremental backups.

Mirror Backup

Mirror backups are, as the name suggests, a mirror of the source being backed up. With mirror backups, when a file in the source is deleted, that file is eventually also deleted in the mirror backup. Because of this, mirror backups should be used with caution, as a file that is deleted by accident or through a virus may also cause the mirror backups to be deleted as well.

Full PC Backup or Full Computer Backup

In this backup, it is not the individual files that are backed up but entire images of the hard drives of the computer that is backed up. With the full PC backup, you can restore the computer hard drives to their exact state when the backup was done. With the full PC backup, not only can the work documents, pictures, videos and audio files be restored, but the operating system, hardware drivers, system files, registry, programs, emails etc. can also be restored.

Local Backup

Local backups are any kind of backup where the storage medium is kept close at hand or in the same building as the source. It could be a backup done on a second internal hard drive, an attached external hard drive, CD/DVD-ROM or Network Attached Storage (NAS). Local backups protect digital content from hard drive failures and virus attacks. They also provide protection from accidental mistakes or deletes. Since the backups are always close at hand they are fast and convenient to restore.

Offsite Backup

When the backup storage media is kept at a different geographic location from the source, this is known as an offsite backup. The backup may be done locally at first, but once the storage medium is brought to another location, it becomes an offsite backup. Examples of offsite backup include taking the backup media or hard drive home, to another office building or to a bank safe deposit box.

Besides the same protection offered by local backups, offsite backups provide additional protection from theft, fire, floods and other natural disasters. Putting the backup media in the next room from the source would not be considered an offsite backup, as the backup does not offer protection from theft, fire, floods and other natural disasters.

Online Backup

These are backups that are ongoing or done continuously or frequently to a storage medium that is always connected to the source being backed up. Typically the storage medium is located offsite and connected to the backup source by a network or Internet connection. It does not involve human intervention to plug in drives and storage media for backups to run. Many commercial data centres now offer this as a subscription service to consumers. The storage data centres are located away from the source being backed up and the data is sent from the source to the storage data centre securely over the Internet.

Remote Backup

Remote backups are a form of offsite backup, with the difference being that you can access, restore or administer the backups while located at your source location or another location. You do not need to be physically present at the backup storage facility to access the backups. For example, putting your backup hard drive at your bank safe deposit box would not be considered a remote backup. You cannot administer
it without making a trip to the bank. Online backups are usually considered remote backups as well.

Cloud Backup

This term is often used interchangeably with online backup and remote backup. It is where data is backed up to a service or storage facility connected over the Internet. With the proper login credentials, that backup can then be accessed or restored from any other computer with Internet access.

FTP Backup

This is a kind of backup where the backup is done via FTP (File Transfer Protocol) over the Internet to an FTP server. Typically the FTP server is located in a commercial data centre away from the source data being backed up. When the FTP server is located at a different location, this is another form of offsite backup.

What backup type is best for you?

As with any backup, it is important to consider which backup type is best suited to your own organization's needs. Ask yourself the following questions:
1. What does your service-level agreement dictate in regard to recovery time?
2. What are the policies regarding storing backup tapes offsite? If backups are shipped offsite, incremental backups are a bad idea because you have to get all the tapes back before you can begin a restoration.
3. What types of backups does your backup application support?

As you can see, synthetic full backups and incremental-forever backups go a long way toward modernizing the backup process, but it's important to make sure you choose the best backup type for your organization's data.

Log files

A transaction log (also transaction journal, database log, binary log or audit trail) is a history of actions executed by a database management system to guarantee ACID properties over crashes or hardware failures. Physically, a log is a file listing changes to the database, stored in a stable storage format.

If, after a start, the database is found in an inconsistent state or has not been shut down properly, the database management system reviews the database logs for uncommitted transactions and rolls back the changes made by these transactions. Additionally, all transactions that are already committed but whose changes were not yet materialized in the database are re-applied. Both are done to ensure atomicity and durability of transactions.

Checkpoint

A checkpoint writes the current in-memory modified pages (known as dirty pages) and transaction log information from memory to disk and, also, records information about the transaction log. The Database Engine supports several types of checkpoints: automatic, indirect, manual, and internal.

Recovery strategies (Backward and Forward)

Recovery is needed when a database instance that has failed is restarted or a surviving database instance takes over a failed one. In roll-backward recovery, the active transactions at the time of failure are aborted and the resources allocated for those transactions are released. In roll-forward recovery, the updates recorded in the redo log are transferred to the database so that they are not lost.

Mirroring
Database mirroring is the creation and maintenance of redundant copies of a database. The purpose is to ensure continuous data availability and minimize or avoid downtime that might otherwise result from data corruption or loss, or from a situation when the operation of a network is partially compromised. Redundancy also ensures that at least one viable copy of a database will always remain accessible during system upgrades.

Database mirroring is used by Microsoft SQL Server, a relational database management system (RDBMS) designed for the enterprise environment. Two copies of a single database reside on different computers called server instances, usually in physical locations separated by some distance. The principal (or primary) server instance provides the database to clients. The mirror (or secondary) server instance acts as a standby that can take over in case of a problem with the principal server instance.

If 100-percent accuracy is required, database mirroring requires that the mirror server instance always stay current; in other words, the system must immediately copy every change in the principal's content to the mirror and vice-versa. In this mode, known as synchronous operation, the mirror is called a hot standby. While database mirroring can also work when the content is not fully synchronized, some data loss may occur if one of the server instances fails or becomes inaccessible. In this mode, called asynchronous operation, the mirror is called a warm standby.

1. Introduction to Collisions

In Implementing Referential Integrity and Shared Business Logic I discuss the referential integrity challenges that result from there being an object schema that is mapped to a data schema, something that I like to call cross-schema referential integrity problems. With respect to collisions things are a little simpler; we only need to worry about the issues with ensuring the consistency of entities within the system of record. The system of record is the location where the official version of an entity is located. This is often data stored within a relational database, although other representations, such as an XML structure or an object, are also viable.

A collision is said to occur when two activities, which may or may not be full-fledged transactions, attempt to change entities within a system of record. There are three fundamental ways (Celko 1999) that two activities can interfere with one another:

1. Dirty read. Activity 1 (A1) reads an entity from the system of record and then updates the system of record but does not commit the change (for example, the change hasn't been finalized). Activity 2 (A2) reads the entity, unknowingly making a copy of the uncommitted version. A1 rolls back (aborts) the changes, restoring the entity to the original state that A1 found it in. A2 now has a version of the entity that was never committed and therefore is not considered to have actually existed.

2. Non-repeatable read. A1 reads an entity from the system of record, making a copy of it. A2 deletes the entity from the system of record. A1 now has a copy of an entity that does not officially exist.

3. Phantom read. A1 retrieves a collection of entities from the system of record, making copies of them, based on some sort of search criteria such as "all customers with first name Bill." A2 then creates new entities, which would have met the search criteria (for example, inserts "Bill Klassen" into the database), saving them to the system of record. If A1 reapplies the search criteria it gets a different result set.

Collisions will occur the more that data is allowed to go stale in a cache and the more concurrent users/threads you have.
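Purely as an illustration (not from the source), the dirty-read interference above can be simulated in a few lines of Python with an in-memory "system of record" and no concurrency control at all; the account name and values are hypothetical.

    # Shared "system of record" with no concurrency control.
    system_of_record = {"account_42": {"balance": 100}}

    # Activity 1 updates the record but has not yet committed the change.
    a1_copy = dict(system_of_record["account_42"])
    a1_copy["balance"] -= 100
    system_of_record["account_42"] = a1_copy              # uncommitted write

    # Activity 2 reads the entity, unknowingly copying the uncommitted version.
    a2_copy = dict(system_of_record["account_42"])
    print("A2 sees balance:", a2_copy["balance"])         # 0, a value that was never committed

    # Activity 1 rolls back, restoring the original state.
    system_of_record["account_42"] = {"balance": 100}
    print("Official balance:", system_of_record["account_42"]["balance"])  # 100
    # A2 is now holding a version of the entity that officially never existed.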
2. Locking Strategies

So what can you do? First, you can take a pessimistic locking approach that avoids collisions but reduces system performance. Second, you can use an optimistic locking strategy that enables you to detect collisions so you can resolve them. Third, you can take an overly optimistic locking strategy that ignores the issue completely.

2.1 Pessimistic Locking

Pessimistic locking is an approach where an entity is locked in the database for the entire time that it is in application memory (often in the form of an object). A lock either limits or prevents other users from working with the entity in the database. A write lock indicates that the holder of the lock intends to update the entity and disallows anyone from reading, updating, or deleting the entity. A read lock indicates that the holder of the lock does not want the entity to change while they hold the lock, allowing others to read the entity but not update or delete it. The scope of a lock might be the entire database, a table, a collection of rows, or a single row. These types of locks are called database locks, table locks, page locks, and row locks respectively.

The advantages of pessimistic locking are that it is easy to implement and guarantees that your changes to the database are made consistently and safely. The primary disadvantage is that this approach isn't scalable. When a system has many users, or when the transactions involve a greater number of entities, or when transactions are long lived, then the chance of having to wait for a lock to be released increases. Therefore this limits the practical number of simultaneous users that your system can support.

2.2 Optimistic Locking

For example, if you're working with the Wayne Miller Customer object while I work with the John Berg object, we won't collide. When this is the case optimistic locking becomes a viable concurrency control strategy. The idea is that you accept the fact that collisions occur infrequently, and instead of trying to prevent them you simply choose to detect them and then resolve the collision when it does occur.

Figure 1 depicts the logic for updating an object when optimistic locking is used. The application reads the object into memory. To do this a read lock is obtained on the data, the data is read into memory, and the lock is released. At this point in time the row(s) may be marked to facilitate detection of a collision (more on this later). The application then manipulates the object until the point that it needs to be updated. The application then obtains a write lock on the data and reads the original source back so as to determine if there's been a collision. The application determines that there has not been a collision, so it updates the data and unlocks it. Had a collision been detected, e.g. the data had been updated by another process after it had originally been read into memory, then the collision would need to be resolved.

Figure 1. Updating an object following an optimistic locking approach.
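A minimal sketch of the logic shown in Figure 1, assuming a hypothetical customer table that carries a version column in SQLite; the collision check is simply the WHERE clause that only matches the version read earlier. This is an illustration of the general approach, not the article's own code.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, version INTEGER)")
    conn.execute("INSERT INTO customer VALUES (1, 'Wayne Miller', 1)")
    conn.commit()

    # Read the object into memory, remembering the version it had at read time.
    name, version_read = conn.execute(
        "SELECT name, version FROM customer WHERE id = 1").fetchone()

    # ... the application works with the object, then writes it back.
    # The UPDATE succeeds only if nobody else changed the row in the meantime.
    cur = conn.execute(
        "UPDATE customer SET name = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (name, 1, version_read))
    conn.commit()

    if cur.rowcount == 0:
        print("Collision detected: the row changed since it was read; re-read and resolve.")
    else:
        print("Update applied; no collision.")

Marking the row could equally be done with a timestamp or a checksum of the original values instead of a version counter.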
Optimistic locking does not try to prevent the collisions from occurring. Instead, it aims to detect these collisions and resolve them on the chance occasions when they occur. [2]

Pessimistic locking provides a guarantee that database changes are made safely. However, it becomes less viable as the number of simultaneous users or the number of entities involved in a transaction increases, because the potential for having to wait for a lock to release will increase. [2]

Optimistic locking can alleviate the problem of waiting for locks to release, but then users have the potential to experience collisions when attempting to update the database.

Deadlock: when two transactions each wait for a lock that the other holds, both are stuck in an endless cycle, and since both actions cannot be satisfied, deadlock occurs.

Livelock:

Livelock is a special case of resource starvation. A livelock is similar to a deadlock, except that the states of the processes involved constantly change with regard to one another while never progressing. The general definition only states that a specific process is not progressing. For example, the system keeps selecting the same transaction for rollback, causing the transaction to never finish executing. Another livelock situation can come about when the system is deciding which transaction gets a lock and which waits in a conflict situation.
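The mutual wait that produces a deadlock (as opposed to a livelock) can be illustrated with a small, hypothetical sketch: two threads stand in for two transactions, each taking a row lock and then asking for the lock the other already holds; the timeout stands in for a DBMS deadlock detector.

    import threading
    import time

    lock_a = threading.Lock()   # say, a row lock on account A
    lock_b = threading.Lock()   # say, a row lock on account B

    def transaction(first, second, name):
        with first:
            time.sleep(0.1)     # ensure both transactions hold their first lock
            if second.acquire(timeout=1):
                second.release()
                print(name, "finished without blocking")
            else:
                print(name, "is waiting on a lock its peer holds - the deadlock situation")

    t1 = threading.Thread(target=transaction, args=(lock_a, lock_b, "T1"))
    t2 = threading.Thread(target=transaction, args=(lock_b, lock_a, "T2"))
    t1.start(); t2.start()
    t1.join(); t2.join()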
and acquire a new one. Another problem occurs when a transaction tries to write a data item which has been read by a younger transaction. This is called a late write. This means that the data item has been read by another transaction since the start time of the transaction that is altering it. The solution for this problem is the same as for the late read problem. The timestamp must be rolled back and a new one acquired. [3]

Adhering to the rules of the basic timestamping process allows the transactions to be serialized and a chronological schedule of transactions can then be created. Timestamping may not be practical in the case of larger databases with high levels of transactions. A large amount of storage space would have to be dedicated to storing the timestamps in these cases. [4]

Transaction

The basic unit of work in a DBMS.

Properties of a transaction:
o ATOMICITY
o CONSISTENCY
o INDEPENDENCE
o DURABILITY

Atomicity is the "all or nothing" property; a transaction is an indivisible unit of work.
Consistency: transactions transform the DB from one consistent state to another consistent state.
Independence: transactions execute independently of one another, i.e. the partial effect of one transaction is not visible to other transactions.
Durability (aka Persistence): the effects of a successfully completed (i.e. committed) transaction are permanently recorded in the DB and cannot be undone.

Example Transaction

Funds transfer:
begin transaction T1
  read balance1
  balance1 = balance1 - 100
  if balance1 < 0
    then print "insufficient funds"
    abort T1
  end
  write balance1
  read balance2
  balance2 = balance2 + 100
  write balance2
commit T1

Discussing the Example

The effect of the abort is to roll back the transaction and undo the changes it has made on the DB. In this example, the transaction was not written to the DB prior to the abort and so no undo is necessary.

Problems with Concurrency

Concurrent transactions can cause three kinds of database problems:
o Lost Update
o Violation of Integrity Constraints
o Inconsistent Retrieval

Lost Update: apparently successful updates can be overwritten by other transactions.

Begin transaction T1
  read balance [ 100 ]
  balance = balance - 100
  if balance < 0
    print "insufficient funds"
    abort T1
  end
  write balance [ 0 ]
Initial Balance = 100

Begin transaction T2
  read balance [ 100 ]
  balance = balance + 100
  write balance [ 200 ]
commit T2

Violation of Integrity Constraints

Begin transaction T3
  read schedule where date = 4/4/01
  read surgeon where surgeon.name = schedule.surgeon and surgeon.operation = "Appendectomy"
  if not found then abort T3
  schedule.operation = "Appendectomy"
commit T3

Begin transaction T4
  read schedule where date = 4/4/01
  read surgeon where surgeon.name = 'Tom'
  if not found then abort T4
  schedule.surgeon = 'Tom'
commit T4

Inconsistent Retrieval (Dirty Reads)

Most concurrency control systems focus on the transactions which update the DB, since they are the only ones which can corrupt the DB. If transactions are allowed to read the partial results of incomplete transactions, they can obtain an inconsistent view of the DB (dirty or unrepeatable reads).

Inconsistent Retrieval (Dirty Reads)

T1
begin transaction T1
  read BalanceX
  BalanceX = BalanceX - 100
  if BalanceX < 0 then
  begin
    print 'insufficient funds'
    abort T1
  end
  write BalanceX
  read BalanceY
  BalanceY = BalanceY + 100
  write BalanceY
commit T1

T2
begin transaction T2
  read BalanceX
  read BalanceY
commit T2

Concurrency Control

Schedules and Serialisation

Order in a schedule is VERY important.
S = [R1(x), R2(x), W1(x), W2(x)]
So, is it O.K. to do reads before or after writes? e.g. the Lost Update problem.

Conflicting Operations

If two transactions only read a data item, they do not conflict and order is not important.
If two transactions either read or write completely separate data items, they do not conflict and order is not important.
If one transaction writes a data item and another transaction reads or writes the same data item, the order of execution is important.

Serial Schedule

What is a serial schedule?
S = [R1(X), W1(X), R2(X), W2(X), R3(X)]

Serialisable Schedule

What is a serialisable schedule?
S = [R1(x), R2(x), W1(x), R3(x), W2(x)]
Is this serialisable?
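One way to answer "is this serialisable?" mechanically is to build the precedence graph of conflicting operations and test it for cycles: a schedule is conflict serialisable exactly when the graph is acyclic. The sketch below is an illustration (not from the notes) using the same R1(x)/W1(x) notation.

    def conflict_serialisable(schedule):
        # schedule: list of (op, txn, item) tuples, e.g. ("R", 1, "x") for R1(x)
        edges = {}
        for i, (op1, t1, x1) in enumerate(schedule):
            for op2, t2, x2 in schedule[i + 1:]:
                if t1 != t2 and x1 == x2 and "W" in (op1, op2):
                    edges.setdefault(t1, set()).add(t2)   # t1 must come before t2

        # Conflict serialisable iff the precedence graph has no cycle.
        def has_cycle(node, visiting, done):
            visiting.add(node)
            for nxt in edges.get(node, ()):
                if nxt in visiting:
                    return True
                if nxt not in done and has_cycle(nxt, visiting, done):
                    return True
            visiting.remove(node)
            done.add(node)
            return False

        done = set()
        return not any(t not in done and has_cycle(t, set(), done) for t in list(edges))

    # [R1(x), W1(x), R2(x), W2(x)] -> True  (equivalent to running T1 then T2)
    # [R1(x), R2(x), W1(x), W2(x)] -> False (the lost update pattern)
    print(conflict_serialisable([("R", 1, "x"), ("W", 1, "x"), ("R", 2, "x"), ("W", 2, "x")]))
    print(conflict_serialisable([("R", 1, "x"), ("R", 2, "x"), ("W", 1, "x"), ("W", 2, "x")]))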
General Solution

Constrained Write Rule: transactions should always read before they write.

Rules for Equivalence of Schedules

Each read operation must read the same values in both schedules; this effectively means that those values must have been produced by the same write operation in both schedules.
The final state of the database must be the same in both schedules; thus the final write operation on each data item is the same in both schedules.

Try these...
[R1(x), W1(x), R2(x), W2(x)]
[R1(x), R2(x), W1(x), W2(x)]
[R1(x), R3(y), R2(x), W2(z), W2(y), W1(x), R2(y), W1(z)]

Conflicting Operations

Read operations cannot conflict with one another, thus the order of read operations does not matter.
i.e. [R1(x), R2(x)] = [R2(x), R1(x)]
but
[R1(x), W1(x), R2(x)] != [R1(x), R2(x), W1(x)]

Conflicting Operations

In terms of schedule equivalence, it is the ordering of CONFLICTING operators which must be the same in both schedules.

The conflict between read and write operations is called a read-write conflict, and the conflict between two writes is called a write-write conflict.

Concurrency Control Techniques

There are three basic concurrency control techniques:
Locking Methods
Timestamp Methods
Optimistic Methods

1. Network Connections: Databases can support multiple network connections over multiple ports. I have two recommendations here. First, to reduce complexity and avoid possible inconsistency with network connection settings, I advise keeping the number of listeners to a minimum: one or two. Second, as many automated database attacks go directly to default network ports, I recommend moving listeners to non-standard port numbers. This will annoy application developers and complicate their setup somewhat, but more importantly it will both help stop automated attacks and highlight connection attempts to the default ports, which then indicate either misconfiguration or hardwired attacks.

2. Network Facilities: Some databases use add-on modules to support network connections, and like the database itself they are not secure out of the box. Worse, many vulnerability assessment tools omit the network from the scan. Verify that the network facility itself is set up properly, that administrative access requires a password, and that the password is not stored in clear text on the file system.

3. Transport Protocols: Databases support multiple transport protocols. While features such as named pipes are still supported, they are open to spoofing and hijacking. I recommend that you pick a single reliable protocol such as TCP/IP, and disable the rest to prevent insecure connections.

4. Private Communication: Use SSL. If the database contains sensitive data, use SSL. This is especially true for databases in remote or virtual environments. The path between the user or application and the database is not guaranteed to be safe, so use SSL to ensure privacy. If you have never set up SSL before,
get some help – otherwise connecting applications can choose to ignore SSL.

5. External Procedure Lockdown: All database platforms have external procedures that are very handy for performing database administration. They enable DBAs to run OS commands, or to run database functions from the OS. These procedures are also a favorite of attackers – once they have hacked either an OS or a database, stored procedures (if enabled) make it trivial to leverage that access into a compromise of the other half. This one is not optional. If you are part of a small IT organization and responsible for both IT and database administration, it will make your day-to-day job just a little harder.

Checking these connection methods can be completed in under an hour, and enables you to close off the most commonly used avenues for attack and privilege escalation.

A little more advanced:

1. Basic Connection Checks: Many companies, as part of their security policy, do not allow ad hoc connections to production databases. Handy administrative tools like Quest's Toad are not allowed because they do not enforce change control processes. If you are worried about this issue, you can write a login trigger that detects the application, user, and source location of inbound connections – and then terminates unauthorized sessions.

2. Trusted Connections & Service Accounts: All database platforms offer some form of trusted connections. The intention is to allow the calling application to verify user credentials, and then pass the credentials or verification token through the service account to the database. The problem is that if the calling application or server has been compromised, all the permissions granted to the calling application – and possibly all the permissions assigned to any user of the connection – are available to an attacker. You should review these trust relationships and remove them for high-risk applications.

10.0 DATABASE ADMINISTRATION

Purpose of database administration

Database administration refers to the whole set of activities performed by a database administrator to ensure that a database is always available as needed. Other closely related tasks and roles are database security, database monitoring and troubleshooting, and planning for future growth.

Throughout the history of Information Resource Management, there have been questions surrounding the necessity for multiple disciplines within the IRM domain. Many organizations do not recognize the essential differences between Data Administration and Database Administration. As a result, there exists much confusion over the roles of Data Administration and Database Administration, and their respective responsibilities. Each discipline is necessary for the proper management of the corporate resource of information, but these activities should never be combined in one person or sub-group. Each discipline requires different skills, training and talents; therefore, most people do not make a successful transition from one discipline to the other. Data Administration and its sub-disciplines (Data Modeling, Data Definitions, Planning and Analysis) is a relative newcomer to the field of data processing. It is only within the last 10-15 years that the industry has given serious consideration to the logical management and control of
information as a corporate resource. There is a lack of understanding of the purpose and objectives of Data Administration even among experienced data processing professionals.

Following is a chart of the major responsibilities of Data Administration and Database Administration:

Data Administration – Logical Design
Perform business requirements gathering
Analyze requirements
Model business based on requirements (conceptual and logical)
Define and enforce standards and conventions (definition, naming, abbreviation)
Conduct data definition sessions with users
Manage and administer meta data repository and Data Administration CASE (modeling) tools
Assist Database Administration in creating physical tables from logical models

Database Administration – Physical Design / Operational
Define required parameters for database definition
Analyze data volume and space requirements
Perform database tuning and parameter enhancements
Execute database backups and recoveries
Monitor database space requirements
Verify integrity of data in databases
Coordinate the transformation of logical structures to properly performing physical structures

Perhaps more than any other of the discrete disciplines within IS, Data Administration requires a concrete grasp of the real business the company is in, not just the technical aspects of interaction with a computer. Frequently, a DBA or systems programmer is arguably portable from one industry to another, with minimal retraining as long as the technology remains constant. A DA, on the other hand, has much to learn in an unfamiliar industry to be truly effective. Having an impact on data design and information management requires an understanding of the goals, objectives and tactics of the organization and its core industry (insurance, pharmaceuticals, banking, etc.). Logical Modeling is part of the Data Administration function, and is a full-time responsibility for those involved in a major development or enhancement project. It is frequently augmented by other data administration functions, such as developing data element definitions and managing the models and associated items in a meta data repository. One role of data administration is to advocate the planning and coordination of the information resource across related applications and business areas. By doing so, the amount of data sharing can be maximized, and the amount of design and data redundancy can be minimized.

One way data administrators (also called "data analysts") can assist in making data sharable and consistent across applications is to use the techniques of logical data modeling. Logical data design is a specialty that requires its own specialists. Developers and database administrators are not trained in logical data modeling, and should not be expected to perform this specialized task. The overall objective of Data Administration is to plan, document, manage and control the information resources of an entire organization. The main objective of Data Administration is to integrate and manage corporate-wide information resources. This integration can be achieved by a combination of refined skills and techniques, and proper use of Data Administration tools such as a meta data repository.

Management of database activity
Managing database structure
Managing DBMS structure
STRUCTURE OF DBMS

A DBMS (Database Management System) acts as an interface between the user and the database. The user requests the DBMS to perform various operations (insert, delete, update and retrieval) on the database. The components of the DBMS perform these requested operations on the database and provide necessary data to the users. The various components of a DBMS are described below:

1. DDL Compiler - The Data Description Language compiler processes schema definitions specified in the DDL. It includes metadata information such as the names of the files, data items, storage details of each file, mapping information, constraints, etc.

2. DML Compiler and Query Optimizer - The DML commands such as insert, update, delete and retrieve from the application program are sent to the DML compiler for compilation into object code for database access. The object code is then optimized in the best way to execute a query by the query optimizer and then sent to the data manager.

3. Data Manager - The Data Manager is the central software component of the DBMS, also known as the Database Control System.

The main functions of the Data Manager are:
• Convert operations in users' queries, coming from the application programs or from the combination of the DML Compiler and Query Optimizer (known as the Query Processor), from the user's logical view to the physical file system.
• Control DBMS information access that is stored on disk.
• Control the handling of buffers in main memory.
• Enforce constraints to maintain consistency and integrity of the data.
• Synchronize the simultaneous operations performed by concurrent users.
• Control the backup and recovery operations.

4. Data Dictionary - The Data Dictionary is a repository of descriptions of the data in the database. It contains information about:
• Data - names of the tables, names of attributes of each table, lengths of attributes, and number of rows in each table.
• Relationships between database transactions and the data items referenced by them, which is useful in determining which transactions are affected when certain data definitions are changed.
• Constraints on data, i.e. range of values permitted.
• Detailed information on physical database design, such as storage structure, access paths, files and record sizes.
• Access authorization - the description of database users, their responsibilities and their access rights.
• Usage statistics such as frequency of queries and transactions.

The data dictionary is used to actually control data integrity, database operation and accuracy. It may be used as an important part of the DBMS.

Importance of Data Dictionary

A Data Dictionary is necessary in databases for the following reasons:
• It improves the control of the DBA over the information system and the users' understanding of the use of the system.
• It helps in documenting the database design process by storing documentation of the result of every design phase and design decisions.
• It helps in searching the views on the database and the definitions of those views.
• It provides great assistance in producing a report of which data elements (i.e. data values) are used in all the programs.
• It promotes data independence, i.e. with addition or modification of structures in the database, application programs are not affected.

5. Data Files - These contain the data portion of the database.

6. Compiled DML - The DML compiler converts the high-level queries into low-level file access commands known as compiled DML.

7. End Users - They are already discussed in a previous section.

Database economics and control

1. Degree. This is the number of entities involved in the relationship and is usually 2 (a binary relationship). Unary relationships also exist, where only 1 entity is involved - a person is married to another person, or an employee manages other employees.

(Remember - 'entity' and 'instance of an entity' are not the same thing. John Smith is a Customer. The entity is Customer, John is an instance of an entity.)

Ternary relationships exist (3 entities are involved) as do quaternary (4) or n-ary (n). There are a number of strategies for resolving these which we will look at some other time.

2. Cardinality. This specifies the number of each entity that is involved in the relationship. There are 3 types of cardinality for binary and unary relationships:
One to one (1:1). For example, 1 man is married to 1 woman.
One to many (1:m). For example, 1 manager manages many employees; each employee is managed by 1 manager.
Many to many (m:n). For example, each student takes many modules; each module is taken by many students.

How many is many? It doesn't matter! If it's 0, 10 or 100, the way you implement the relationship is the same.

3. Optionality. Each relationship can be optional or mandatory for each entity. This gives three types for binary relationships:
Optional for both entities
Optional for one entity, mandatory for the other
Mandatory for both entities

What is a data administrator?

A data administrator (also known as a database administration manager, data architect, or information center manager) is a high-level function responsible for the overall management of data resources in an organization. In order to perform its duties, the DA must know a good deal of system analysis and programming.

These are the functions of a data administrator (not to be confused with database administrator functions):

1. Data policies, procedures, standards
2. Planning - development of the organization's IT strategy, enterprise model, cost/benefit model, design of the database environment, and administration plan

3. Data conflict (ownership) resolution

4. Data analysis - define and model data requirements, business rules, operational requirements, and maintain the corporate data dictionary

5. Internal marketing of DA concepts

6. Managing the data repository

What is a database administrator?

Database administration is more of an operational or technical level function responsible for physical database design, security enforcement, and database performance. Tasks include maintaining the data dictionary, monitoring performance, and enforcing organizational standards and security.

What is a database steward?

A database steward is an administrative function responsible for managing data quality and assuring that organizational applications meet the enterprise goals. It is a connection between IT and business units. Data quality issues include security and disaster recovery, personnel controls, physical access controls, maintenance controls, and data protection and privacy. For example, in order to increase security the database steward can have control over who can gain access to the database by assigning specific privileges to users.

Now that you have an idea of the different responsibilities involved in maintaining a database, we can list and describe the functions of a database administrator.

What are the functions of a database administrator?

1. Selection of hardware and software
Keep up with current technological trends
Predict future changes
Emphasis on established off-the-shelf products

2. Managing data security and privacy
Protection of data against accidental or intentional loss, destruction, or misuse
Firewalls
Establishment of user privileges
Complicated by use of distributed systems such as Internet access and client/server technology

How many major threats to database security can you think of?
1. Accidental loss due to human error or software/hardware error.
2. Theft and fraud that could come from hackers or disgruntled employees.
3. Improper data access to personal or confidential data.
4. Loss of data integrity.
5. Loss of data availability through sabotage, a virus, or a worm.

3. Managing Data Integrity
Integrity controls protect data from unauthorized use
Data consistency
Maintaining data relationships
Domains - set allowable values
Assertions - enforce database conditions

4. Data backup
We must assume that a database will eventually fail
Establish procedures:
o how often should the data be backed up?
o what data should be backed up more frequently?
o who is responsible for the backups?

Backup facilities
o automatic dump - facility that produces a backup copy of the entire database
o periodic backup - done on a periodic basis, such as nightly or weekly
o cold backup - database is shut down during backup
o hot backup - a selected portion of the database is shut down and backed up at a given time
o backups stored in a secure, off-site location

5. Database recovery
Application of proven strategies for reinstallation of the database after a crash
Recovery facilities include backup, journalizing, checkpoint, and recovery manager

If there are backup facilities, are there also journalizing, checkpoint, and recovery facilities? Yes

Journalizing facilities include:
o audit trail of transactions and database updates
o transaction log, which records essential data for each transaction processed against the database
o database change log, which shows images of updated data. The log stores a copy of the image before and after modification.

Checkpoint facilities:
o when the DBMS refuses to accept a new transaction, the system is in a quiet state
o database and transactions are synchronized
o allows the recovery manager to resume processing from a short period instead of repeating the entire day

Recovery and Restart Procedures
o switch - mirrored databases
o restore/rerun - reprocess transactions against the backup
o transaction integrity - commit or abort all transaction changes
o backward recovery (rollback) - apply before images
o forward recovery (roll forward) - apply after images (preferable to restore/rerun)

6. Tuning database performance
Set installation parameters / upgrade DBMS
Monitor memory and CPU usage
Input/output contention
o use striping
o distribution of heavily accessed files
Application tuning by modifying SQL code in applications

7. Improving query processing performance

Are there any shared administration functions? Yes

These are shared administration functions:

1. Database design
DA is responsible for logical design
DBA is responsible for the external model design (subschemas), the physical design (construction), and for designing integrity controls

2. Database implementation
DBA
o establish security controls
o supervise database loading
o specify test procedures
o develop programming standards
o establish backup/recovery procedures
Both
o specify access policies
o user training

3. Operations and maintenance
DBA
o monitor database performance
o tune and reorganize databases as needed
o enforce standards and procedures
Both
o support users

4. Growth and change
Both
o implement change-control procedures
o plan for growth and change
o evaluate new technologies

New functions
1. Data warehouse administration

The Database Management System

The DBMS is software that manages all access to the data. Functions of a DBMS include:
a language for data definition (DDL), so databases, tables etc. can be created
a language for data manipulation (DML), so data records can be inserted, updated and selected from the tables
Optimisation of queries
Data security and integrity
Data recovery and concurrency
Data Dictionary
Performance monitoring
statistical and analysis routines useful for monitoring performance
report writers that provide a formatted copy of selected data
graphic subsystems - data is often best shown graphically
application generators that 'automatically' produce data entry screens, reports and a menu system for end users to use

• How should an e-business enterprise store, access, and distribute data & information about their internal operations & external environment?
• What roles do database management, data administration, and data planning play in managing data as a business resource?
• What are the advantages of a database management approach to organizing, accessing, and managing an organization's data resources?
• What is the role of a database management system in an e-business information system?
• Databases of information about a firm's internal operations were formerly the only databases that were considered to be important to a business. What other kinds of databases are important for a business today?
• What are the benefits and limitations of the relational database model for business applications?
• Why is the object-oriented database model gaining acceptance for developing applications and managing the hypermedia databases at business websites?
• How have the Internet, intranets, extranets, and the World Wide Web affected the types and uses of data resources available to business end users?

Two-phase locking (2PL) is a concurrency control method that guarantees serializability. It is also the name of the resulting set of database transaction schedules (histories). The protocol utilizes locks, applied by a transaction to data, which may block (interpreted as signals to stop) other transactions from accessing the same data during the transaction's life.

By the 2PL protocol, locks are applied and removed in two phases:
1. Expanding phase: locks are acquired and no locks are released.
2. Shrinking phase: locks are released and no locks are acquired.

Two types of locks are utilized by the basic protocol: Shared and Exclusive locks. Refinements of the basic protocol may utilize more lock types. Using locks that block processes, 2PL may be subject to deadlocks that result from the mutual blocking of two or more transactions.
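A minimal sketch of the two-phase discipline on the transaction side, assuming a hypothetical lock table of exclusive locks only (shared locks are omitted for brevity): once the transaction has released anything, any further acquisition is refused.

    import threading

    class TwoPhaseTransaction:
        # Toy enforcement of the 2PL rule: every acquisition (growing phase)
        # must happen before any release (shrinking phase).
        def __init__(self, lock_table):
            self.lock_table = lock_table   # item name -> threading.Lock
            self.held = []
            self.shrinking = False

        def lock(self, item):
            if self.shrinking:
                raise RuntimeError("2PL violation: cannot acquire after releasing")
            self.lock_table[item].acquire()   # growing phase
            self.held.append(item)

        def unlock_all(self):
            self.shrinking = True             # shrinking phase begins
            for item in self.held:
                self.lock_table[item].release()
            self.held.clear()

    locks = {"x": threading.Lock(), "y": threading.Lock()}
    t1 = TwoPhaseTransaction(locks)
    t1.lock("x")       # growing
    t1.lock("y")       # growing
    # ... read and write x and y here ...
    t1.unlock_all()    # shrinking: release everything, acquire nothing more

Releasing every lock only at the end, as here, is the strict variant of 2PL; the basic protocol only requires that no acquisition follows a release.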