DBMS - III
Course Introduction
This is an advanced course on DBMS, and you are presumed to have successfully
completed the earlier courses.
In this course, the material comes in two blocks of three units each.
The first block is all about managing large, concurrent database systems. When
very large databases are operated upon by a number of users who keep manipulating
the data, many consistency and integrity problems can arise. Unfortunately, these
problems can neither be predicted beforehand nor easily simulated. Hence several
precautions have to be taken to ensure that such disasters do not occur. Also, since
these users often operate from remote places, the effects of their system or transaction
failures can be disastrous. In this block, we discuss an analytical way of studying such
systems and methods of ensuring that such errors do not occur. Basically, we discuss the
concept of “transactions” and how to make these transactions interact with the database
so that they do not hurt the accuracy and integrity of the database values. We also briefly
discuss how to recover from system crashes, software failures and other such disasters
without seriously affecting database performance.
The first unit discusses the formal ways of transaction handling, why concurrency
control is needed and what possible errors may creep into an uncontrolled environment.
This discussion leads to the concepts of system recovery, the creation of system logs and
the desirable properties of transactions. The concept of serializability is also discussed.
The second unit discusses the various concurrency control techniques, the concept
of system locks – wherein a data item becomes the exclusive property of a transaction for
some time – and the resultant problem of deadlocks. We also discuss time stamps,
wherein each transaction bears a tag indicating when it came into the system; this
helps in concurrency control and recovery processes.
The third unit discusses the database recovery techniques based on various
concepts of data logs, the use of checkpoints, shadow paging etc., with the various options
available for single-user and multi-user systems. The block ends with a brief discussion
of some of the commonly used data security and authorization methods designed to
maintain the security and integrity of databases.
The second block is all about data warehousing and data mining, Internet
databases, and advanced topics in database management systems.
The fourth unit introduces two very important branches of database technology,
which are going to play a significant role in the years to come: data warehousing and
data mining. Data warehousing can be seen as a process that requires a variety of
activities to precede it; we introduce the key concepts related to data warehousing.
Data mining may be thought of as an activity that draws knowledge from an existing data
warehouse. Data mining, the extraction of hidden predictive information from large
databases, is a powerful new technology with great potential to help companies focus on
the most important information in their data warehouses. Data mining tools predict future
trends and behaviors, allowing businesses to make proactive, knowledge-driven
decisions.
The fifth unit introduces Internet databases. The World Wide Web (WWW, or
Web) is a distributed information system based on hypertext. The Web makes it possible
to access a file anywhere on the Internet. A file is identified by a universal resource
locator (URL), which is nothing but a pointer to the document. HTML is a simple language
used to describe a document. It is also called a markup language because HTML
works by augmenting regular text with 'marks' that hold special meaning for a Web
browser handling the document. Many Internet users today have home pages on the Web;
such pages often contain information about the users and their lives. We also introduce
the Extensible Markup Language (XML), a markup language that was developed to
remedy the shortcomings of HTML.
Unit - 1
TRANSACTION PROCESSING CONCEPTS
Structure
1.0 Introduction
1.1 Objectives
1.2 Transaction and system preliminaries
1.3 A typical multiuser system
1.4 The need for concurrency control
1.4.1 The lost update problem
1.4.2 The temporary update (Dirty read) problem
1.4.3 The Incorrect Summary Problem
1.4.4 Unrepeatable read
1.5 The concept of failures and recovery
1.6 Transaction States and additional operations
1.6.1 The concept of system log
1.6.2 Commit Point of a Transaction
1.7 Desirable Transaction properties. (ACID properties)
1.8.The Concept of Schedules
1 1.8.1.Schedule (History of transaction)
2 1.8.2.Schedules and Recoverability
1.9.Serializability
1.9.1 Testing for conflict serializability of a schedule
1.9.2.View equivalence and view serializability
1.9.3.Uses of serializability
1.10. Summary
1.11. Review Questions & Answers
1.0 Introduction
You are then introduced to the concept of a system log, which is a running history of
system updates. The concept of the commit point of a transaction is also introduced.
1.1 Objectives
1.2 Transaction and system preliminaries
The concept of a transaction has been devised as a convenient and precise way of
describing the various logical units that form a database system. We have transaction
systems, which are systems that operate on very large databases on which several users
(sometimes running into hundreds) operate concurrently – i.e. they manipulate the
database through transactions. There are several such systems presently in operation in
our country as well – consider the railway reservation system, wherein thousands of
stations, each with a number of computers, operate on a huge database containing the
reservation details of all trains of our country for the next several days. There are many
other such systems, like airline reservation systems, distance banking systems, stock
market systems etc. In all these cases, apart from the accuracy and integrity of the data
provided by the database (note that money is involved in almost all the cases – either
directly or indirectly), the systems should provide instant availability and fast response
to these hundreds of concurrent users. In this block, we discuss the concept of a
transaction, the problems involved in controlling concurrently operated systems and
several other related concepts. We repeat – a transaction is a logical operation on a
database, and users intend to operate with these logical units, trying either to get
information from the database or, in some cases, to modify it. Before we look into the
problem of concurrency, we view the concept of multiuser systems from another point of
view – the view of the database designer.
A multiuser database system implies multiprogramming, but the converse is not true.
Several users may be operating simultaneously, but not all of them may be operating on
the database simultaneously.
Now, before we see what problems can arise because of concurrency, we see what
operations can be done on the database. Such operations can be single-line commands or
can be a set of commands meant to be executed sequentially. Those operations are
invariably delimited by the “begin transaction” and “end transaction” statements, and the
implication is that all operations in between them belong to the given transaction.
Another concept is the “granularity” of the transaction. Assume each field in a
database is named. The smallest such named item of the database can be called a field of
a record. The unit on which we operate can be one such “grain” or a number of such
grains collectively defining some data unit. However, in this course, unless specified
otherwise, we use “single grain” operations, but without loss of generality. To
facilitate discussions, we presume a database package in which the following operations
are available.
i) read_tr(X): This operation reads the item X and stores it into an assigned
variable. The name of the variable into which it is read can be anything, but
we will give it the same name X, so that confusion is avoided; i.e.
whenever this command is executed, the system reads the required element
from the database and stores it into a program variable called X.
ii) write_tr(X): This writes the value of the program variable currently stored in
X into a database item called X.
Once a read_tr(X) is encountered, the system will have to perform the
following operations:
1. Find the address of the block on the disk where X is stored.
2. Copy that block into a buffer in the memory.
3. Copy it into a variable (of the program) called X.
A write_tr(X) performs the converse sequence of operations:
1. Find the address of the disk block where the database variable X is stored.
2. Copy the block into a buffer in the memory.
3. Copy the value of X from the program variable into this block.
4. Store this updated block back to the disk.
Normally, however, operation (4) is not performed every time a write_tr is
executed. It would be wasteful to keep writing back to the disk every time.
So the system maintains one or more buffers in memory which keep getting updated
during the operations, and these updated buffers are moved on to the disk at regular
intervals. This saves a lot of computational time, but is at the heart of some of the
problems of concurrency that we will have to encounter.
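The following is a minimal Python sketch of how such buffered read_tr and write_tr operations might look. It is purely illustrative: the names DiskStore, BufferPool, read_tr, write_tr and flush are assumptions made for this example, not an actual DBMS interface.

    # Illustrative sketch: a buffer pool that delays writing blocks back to disk.
    class DiskStore:
        def __init__(self):
            self.blocks = {}                  # block_id -> dict of item values

        def read_block(self, block_id):
            return dict(self.blocks.get(block_id, {}))

        def write_block(self, block_id, block):
            self.blocks[block_id] = dict(block)

    class BufferPool:
        def __init__(self, disk, block_of):
            self.disk = disk
            self.block_of = block_of          # maps an item name to its block id
            self.buffers = {}                 # cached (possibly updated) blocks

        def _fetch(self, item):
            blk = self.block_of(item)         # 1. find the address of the block
            if blk not in self.buffers:       # 2. copy that block into a memory buffer
                self.buffers[blk] = self.disk.read_block(blk)
            return self.buffers[blk]

        def read_tr(self, item):
            return self._fetch(item)[item]    # 3. copy it into a program variable

        def write_tr(self, item, value):
            self._fetch(item)[item] = value   # 3. update only the buffered copy

        def flush(self):                      # 4. done later, at regular intervals
            for blk, block in self.buffers.items():
                self.disk.write_block(blk, block)

The delayed flush() is exactly the "save computational time" step discussed above, and also the reason a crash between a buffered update and the flush can cause trouble.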
Suppose A wants to book 8 seats. Since the number of seats he wants (say Y)
is less than the available seats, the program can allot him the seats, change the number of
available seats (X) to X – Y and can even give him the seat numbers that have been booked
for him.
The problem is that a similar operation can be performed by B also. Suppose he
needs 7 seats. So he gets his seven seats, changes the value of X to 3 (10 – 7) and gets
his reservation.
The problem is noticed only when these buffered blocks are returned to the main database
(the disk in the above case).
Before we can analyse these problems, we look at the problem from a more
technical view.
1.4.1 The lost update problem: This problem occurs when two transactions that access
the same database items have their operations interleaved in such a way as to make the
value of some database item incorrect. Suppose the transactions T1 and T2 are submitted
at (approximately) the same time. Because of interleaving, each operation is executed for
some period of time and then the control is passed on to the other transaction, and this
sequence continues. Because of the delay in the updates, this creates a problem. This is
what happened in the previous example. Let the transactions be called TA and TB.
    TA                          TB
    read_tr(X)
                                read_tr(X)
    X = X - NA
                                X = X - NB
                                write_tr(X)
    write_tr(X)
                                              (time flows downward)
Note that the problem occurred because the transaction TB failed to take into account
the update made by TA, i.e. TB lost the effect of TA. Similarly, since TA did its writing
later on, the update made by TB was overwritten and lost.
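To make the lost update concrete, here is a small Python sketch (an illustration, not from the text) that replays the interleaving above step by step, using the figures of the booking example (10 seats, A wants 8, B wants 7); read_tr and write_tr are the assumed helpers from the earlier sketch idea.

    db = {"X": 10}                 # 10 seats available

    def read_tr(item):             # read the item into a program variable
        return db[item]

    def write_tr(item, value):     # write the program variable back to the database
        db[item] = value

    xa = read_tr("X")              # TA reads X -> 10
    xb = read_tr("X")              # TB reads X -> 10 (TA has not written yet)
    xa = xa - 8                    # TA books NA = 8 seats, computes 2
    xb = xb - 7                    # TB books NB = 7 seats, computes 3
    write_tr("X", xb)              # TB writes 3
    write_tr("X", xa)              # TA writes 2 -- TB's update is lost
    print(db["X"])                 # 2: only TA's booking is reflected,
                                   # although 15 seats were sold out of 10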
1.4.2 The temporary update (Dirty read) problem: This happens when a transaction TA
updates a data item, but later on (for some reason) the transaction fails. It could be due to
a system failure or any other operational reason, or the system may later notice that the
operation should not have been done and cancels it. To be fair, the system also ensures
that the original value is restored.
But in the meanwhile, another transaction TB has accessed the data, and since it
has no indication as to what happens later on, it makes use of this data and goes ahead.
Once the original value is restored by TA, the values generated by TB are obviously
invalid.
    TA                          TB
    read_tr(X)
    X = X - N
    write_tr(X)
                                read_tr(X)
                                X = X - N
                                write_tr(X)
    Failure
    X = X + N
    write_tr(X)
                                              (time flows downward)
1.4.3 The Incorrect Summary Problem: Consider two concurrent transactions, again
called TA and TB. TB is calculating a summary (an average, a standard deviation or some
such operation) by accessing all elements of a database (note that it is not updating any of
them; it only reads them and uses the resultant data to calculate some values). In
the meanwhile, TA is updating these values. Since the operations are interleaved, TB, for
some of its computations, will be using data that has not yet been updated, whereas for the
others it will be using the updated data. This is called the incorrect summary problem.
    TA                          TB
                                Sum = 0
                                read_tr(A)
                                Sum = Sum + A
    read_tr(X)
    X = X - N
    write_tr(X)
                                read_tr(X)
                                Sum = Sum + X
                                read_tr(Y)
                                Sum = Sum + Y
    read_tr(Y)
    Y = Y - N
    write_tr(Y)
In the above example, TA updates both X and Y. But since it first updates X and then Y,
and the operations are so interleaved that transaction TB uses both of them in between
TA's operations, TB ends up using the old value of Y with the new value of X. In the
process, the sum obtained refers neither to the old set of values nor to the new set of
values.
1.4.4 Unrepeatable read: This can happen when an item is read by a transaction twice
(in quick succession) but the item has been changed in the meanwhile, though the
transaction has no reason to expect such a change. Consider the case of a reservation
system, where a passenger gets the reservation details and, before he decides on the
reservation, the value is updated at the request of some other passenger at another place.
1.5 The concept of failures and recovery
No database operation can be immune to the system on which it operates
(both the hardware and the software, including the operating system). The system
should ensure that any transaction submitted to it is terminated in one of the following
ways:
a) All the operations listed in the transaction are completed, the changes
are recorded permanently back to the database, and the database is
informed that the operations are complete.
b) In case the transaction has failed to achieve its desired objective, the
system should ensure that no change whatsoever is reflected onto the
database. Any intermediate changes made to the database are restored
to their original values before calling off the transaction and
intimating the same to the database.
In the second case, we say the system should be able to “Recover” from the
failure. Failures can occur in a variety of ways.
i) A system crash: A hardware, software or network error can make the
completion of the transaction impossible.
ii) A transaction or system error: The transaction submitted may be faulty –
e.g. it creates a situation of division by zero or produces a negative number
which cannot be handled (for example, in a reservation system, a negative
number of seats conveys no meaning). In such cases, the system simply
discontinues the transaction by reporting an error.
iii) User interruption: Some programs provide for the user to interrupt during
execution. If the user changes his mind during execution (but before the
transactions are complete), he may opt out of the operation.
iv) Local exceptions: Certain conditions during operation may force the
system to raise what are known as "exceptions". For example, a bank
account holder may not have sufficient balance for some transaction to be
done, or special instructions might have been given in a bank transaction
that prevent further continuation of the process. In all such cases, the
transactions are terminated.
v) Concurrency control enforcement: In certain cases, when concurrency control
constraints are violated, the enforcement mechanism simply aborts the process,
to restart it later.
The other reasons can be physical problems like theft, fire etc. or system problems
like disk failure, viruses etc. In all such cases of failure, a recovery mechanism must
be in place.
Though the read_tr and write_tr operations described above are the most fundamental
operations, they are seldom sufficient. Though most operations on databases comprise
only the read and write operations, the system needs several additional operations for its
purposes. One simple example is the concept of recovery discussed in the previous
section. If the system is to recover from a crash or any other catastrophe, it should
first be able to keep track of the transactions – when they start, when they terminate and
when they abort. Hence the following operations come into the picture.
i) Begin trans: This marks the beginning of an execution process.
ii) End trans: This marks the end of an execution process.
iii) Commit trans: This indicates that the transaction has been successful and the
changes brought about by the transaction may be incorporated into the database
and will not be undone at a later date.
iv) Rollback: This indicates that the transaction is unsuccessful (for whatever
reason) and the changes made to the database, if any, by the transaction
need to be undone.
Most systems also keep track of the present status of all the transactions at the present
instant of time (note that in a real multiprogramming environment, more than one
transaction may be in various stages of execution). The system should not only be able to
keep a tag on the present status of the transactions, but should also know what the
next possibilities for a transaction to proceed are and, in case of a failure, how to roll it
back. The whole concept takes the form of a state transition diagram. A simple state
transition diagram, in view of what we have seen so far, can appear as follows:
[State transition diagram of a transaction: Begin transaction takes it to the Active state,
where Read/Write operations occur; End transaction takes it to the Partially committed
state; Commit takes it to the Committed state and then Terminate to the Terminated state;
a Failure from the Active or Partially committed state takes it to the Failed state, from
where Abort leads to the Terminated state.]
The arrows indicate how the state of a transaction can change to a next state. A
transaction is in the active state immediately after the beginning of execution; there it
performs its read and write operations. When it ends, it moves to the partially committed
state, where the system protocols ensure that a system failure at this juncture does not
make erroneous recordings onto the database. Once this is done, the system "commits"
itself to the results and the transaction enters the committed state. Once in the committed
state, a transaction automatically proceeds to the terminated state.
The transaction may also fail due to a variety of reasons discussed in a previous
section. Once it fails, the system may have to take up error control exercises like rolling
back the effects of the previous write operations of the transaction. Once this is
completed, the transaction enters the terminated state and passes out of the system.
A failed transaction may be restarted later – either by the intervention of the user
or automatically.
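One compact way to capture the state transition diagram in code is as a table of allowed transitions. The sketch below is illustrative only; the state and event names are assumptions chosen to match the diagram above.

    # Allowed state transitions of a transaction, following the diagram above.
    TRANSITIONS = {
        ("active", "read/write"): "active",
        ("active", "end transaction"): "partially committed",
        ("active", "failure"): "failed",
        ("partially committed", "commit"): "committed",
        ("partially committed", "failure"): "failed",
        ("failed", "abort"): "terminated",
        ("committed", "terminate"): "terminated",
    }

    def next_state(state, event):
        try:
            return TRANSITIONS[(state, event)]
        except KeyError:
            raise ValueError(f"illegal event {event!r} in state {state!r}")

    # Example: a successful run of a transaction.
    s = "active"
    for ev in ["read/write", "end transaction", "commit", "terminate"]:
        s = next_state(s, ev)
    print(s)    # terminated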
1.6.1 The concept of system log: The system log keeps a record of all transaction
operations that affect the values of the database items; it is referred to when the
system is trying to recover from failures. The log information is kept on the disk, so
that it is not likely to be affected by the normal system crashes, power failures etc.
(Otherwise, if the disk also crashes when the system crashes, the entire concept
fails.) The log is also periodically backed up onto removable devices (like tape) and is
kept in archives.
The question is, what type of data or information needs to be logged into the
system log?
Let T refer to a unique transaction-id, generated automatically whenever a new
transaction is encountered; this can be used to uniquely identify the transaction. Then
the following entries are made with respect to the transaction T.
i) [start_trans, T]: Denotes that T has started execution.
ii) [write_tr, T, X, old, new]: Denotes that the transaction T has changed the old
value of the data item X to a new value.
iii) [read_tr, T, X]: Denotes that the transaction T has read the value of X from
the database.
iv) [commit, T]: Denotes that T has executed successfully and confirms that its
effects can be permanently committed to the database.
v) [abort, T]: Denotes that T has been aborted.
These entries are not exhaustive. In some cases, certain modifications to their purpose and
format are made to suit special needs.
(Note that though we have been saying that the logs are primarily useful for recovery
from errors, they are almost universally used for other purposes like reporting, auditing
etc.)
The two commonly used operations are "undo" and "redo". In undo, if the
transaction fails before the permanent data can be written back into the database, the log
details can be used to trace back the updates sequentially and return the items to their old
values. Similarly, if the transaction fails just before the commit operation is complete,
one need not report a transaction failure: one can use the old and new values of all the
write operations in the log and ensure that the same are entered onto the database (redo).
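As a rough illustration of how the [write_tr, T, X, old, new] entries support undo and redo, consider the following sketch. The record layout, the dictionary database and the function names are assumptions made for this example only.

    # Each log record: (op, T, X, old, new); start/commit/abort records carry
    # fewer fields and are padded with None here for simplicity.
    log = [
        ("start", "T1", None, None, None),
        ("write", "T1", "X", 100, 92),
        ("write", "T1", "Y", 50, 58),
        # no commit record: T1 failed before reaching its commit point
    ]

    def undo(db, log, tid):
        """Trace the log backwards and restore the old values written by tid."""
        for op, t, item, old, new in reversed(log):
            if t == tid and op == "write":
                db[item] = old

    def redo(db, log, tid):
        """Replay the new values of a committed transaction, front to back."""
        for op, t, item, old, new in log:
            if t == tid and op == "write":
                db[item] = new

    db = {"X": 92, "Y": 58}
    undo(db, log, "T1")
    print(db)   # {'X': 100, 'Y': 50} -- T1's partial effects are rolled back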
1.6.2 Commit Point of a Transaction:
The next question to be tackled is when one should commit to the results of a
transaction. Note that unless a transaction is committed, its operations do not get
reflected in the database. We say a transaction reaches a "commit point" when all
operations that access the database have been successfully executed and the effects of all
such operations have been included in the log. Once a transaction T reaches a commit
point, the transaction is said to be committed – i.e. the changes that the transaction sought
to make in the database are assumed to have been recorded into the database. The
transaction indicates this state by writing a [commit, T] record into its log. At this point,
the log contains a complete sequence of the changes brought about by the transaction to the
database and has the capacity both to undo them (in case of a crash) and to redo them (if a
doubt arises as to whether the modifications have actually been recorded onto the database).
Before we close this discussion on logs, one small clarification. The records of
the log are on the disk (secondary memory). When a log record is to be written, a
secondary device access has to be made, which slows down the system operations. So,
normally, a copy of the most recent log records is kept in main memory and the updates
are made there; at regular intervals, these are copied back to the disk. In case of a
system crash, only those records that have been written onto the disk will survive. Thus,
when a transaction reaches the commit stage, all the records must be forcefully written
back to the disk, and only then is the commit executed. This concept is called 'forceful
writing' of the log file.
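The "forceful writing" rule can be stated in a few lines of code: before the commit is acted upon, any log records still sitting in memory are flushed to disk. The buffer names and functions below are assumptions made purely for illustration.

    # Sketch of the force-log-at-commit discipline.
    log_buffer = []          # most recent log records, kept in main memory
    log_on_disk = []         # records that have survived onto the disk

    def log_append(record):
        log_buffer.append(record)          # cheap: memory only

    def force_log():
        log_on_disk.extend(log_buffer)     # flush everything still buffered
        log_buffer.clear()

    def commit(tid):
        log_append(("commit", tid))
        force_log()                        # commit takes effect only after the flush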
ii) Consistency preservation: A transaction is said to be consistency preserving if
its complete execution takes the database from one consistent state to another.
We shall elaborate slightly on this. In a steady state, a database is expected to be
consistent, i.e. there are no anomalies in the values of the items. For example,
if a database stores N values and also their sum, the database is said to be
consistent if the addition of these N values actually leads to the value of the
sum. This is the normal case.
Now consider the situation when a few of these N values are being changed.
Immediately after one or more values are changed, the database becomes inconsistent: the
sum value no longer corresponds to the actual sum. Only after all the updates are done
and the new sum is calculated does the system become consistent again.
A transaction should always ensure that once it starts operating on a database, the
values are made consistent before the transaction ends.
iii) Isolation: Every transaction should appear as if it is being executed in
isolation. Though, in a practical sense, a large number of such transactions
keep executing concurrently, no transaction should get affected by the
operation of other transactions; only then is it possible to operate on the
transactions accurately.
iv) Durability: The changes effected to the database by the transaction should be
permanent – they should not vanish once the transaction completes, and they
should also not be lost due to any failures at later stages.
Now how does one enforce these desirable properties on the transactions? The
atomicity concept is taken care of while designing and implementing the transaction. If,
however, a transaction fails even before it can complete its assigned task, the recovery
software should be able to undo the partial effects inflicted by the transaction onto the
database.
The preservation of consistency is normally considered the duty of the database
programmer. A "consistent state" of a database is a state which satisfies the constraints
specified by the schema. Other external constraints may also be included to make the
rules more effective. The database programmer writes his programs in such a way that a
transaction enters the database only when it is in a consistent state and also leaves it
in the same or another consistent state. This, of course, implies that no other transaction
"interferes" with the actions of the transaction in question.
This leads us to the next concept, isolation: every transaction goes about
doing its job without being bogged down by any other transaction which may also be
working on the same database. One simple mechanism to ensure this is to make sure that
no transaction makes its partial updates available to the other transactions until the
commit state is reached. This also eliminates the temporary update problem. However,
this has been found to be inadequate to take care of several other problems, and most
database transactions today come with several levels of isolation. A transaction is said to
have level 0 isolation if it does not overwrite the dirty reads of higher-level
transactions (level 0 is the lowest level of isolation). A transaction has level 1 isolation
if it does not lose any updates. At level 2, the transaction neither loses updates nor has
any dirty reads. At level 3, the highest level of isolation, a transaction has no lost
updates, has no dirty reads and, in addition, has repeatable reads.
some other operation Tj1 (of a transaction Tj) may be interleaved between them.
In short, a schedule lists the sequence of operations on the database in the same
order in which they were effected in the first place.
    read_tr(X)      transaction 1
    read_tr(Y)      transaction 2
    write_tr(Y)     transaction 2
    read_tr(Y)      transaction 1
    write_tr(X)     transaction 1
    abort           transaction 1
For example, r1(X) and w2(X) conflict, as do w1(X) and r2(X), because in each
pair one transaction writes an item that the other accesses; w1(Y) and w2(Y)
conflict because both of them try to write the same item.
But r1(X) and w2(Y) do not conflict, because the read and the write are on
different data items, and r1(X) and r2(X) do not conflict, because both are only
reading the same data item, which they can do without any conflict.
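The conflict test can be written directly from the definition: two operations conflict when they belong to different transactions, touch the same item, and at least one of them is a write. A small sketch, with the (kind, transaction, item) tuple notation assumed for the example:

    # An operation is represented as (kind, transaction_id, item), e.g. ("r", 1, "X").
    def conflicts(op1, op2):
        kind1, t1, item1 = op1
        kind2, t2, item2 = op2
        return t1 != t2 and item1 == item2 and "w" in (kind1, kind2)

    print(conflicts(("r", 1, "X"), ("w", 2, "X")))   # True
    print(conflicts(("w", 1, "Y"), ("w", 2, "Y")))   # True
    print(conflicts(("r", 1, "X"), ("r", 2, "X")))   # False (both only read)
    print(conflicts(("r", 1, "X"), ("w", 2, "Y")))   # False (different items)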
A schedule S of the transactions T1, T2 … Tn is said to be a complete schedule if the
following conditions are satisfied:
i) The operations listed in S are exactly the same operations as those of T1, T2 … Tn,
including the commit or abort operations. Each transaction is terminated by
either a commit or an abort operation.
ii) The operations of any transaction Ti appear in the schedule in the same order
in which they appear in the transaction.
iii) Whenever there are conflicting operations, one of the two will occur before the
other in the schedule.
A "partial order" of the schedule is said to occur if the first two conditions of the
complete schedule are satisfied but, for non-conflicting operations, the schedule does not
indicate which should appear first.
This is acceptable because non-conflicting operations can anyway be executed in any
order without affecting the actual outcome.
those operations in S that belong to committed transactions, i.e. transactions Ti whose
commit operation Ci is in S.
Put in simpler terms, since uncommitted operations do not get reflected in the actual
outcome of the schedule, only those transactions that have completed their commit
operations contribute to the set, and this schedule is good enough in most cases.
The concept is a simple one. Suppose the transaction T reads an item X from the
database, completes its operations (based on this and other values) and commits its
values, i.e. the output values of T become permanent values of the database.
But suppose this value of X was written by another transaction T' (before it was read by
T), and T' aborts after T has committed. What happens? The values committed by T are no
longer valid, because the basis of these values (namely X) itself has been changed.
Obviously T also needs to be rolled back (if possible), leading to other rollbacks and so
on.
The other aspect to note is that in a recoverable schedule, no committed
transaction ever needs to be rolled back. But it is possible that a cascading rollback
scheme may have to be effected, in which an uncommitted transaction has to be rolled back
because it read a value written by a transaction which later aborted. Such
cascading rollbacks can be very time consuming because, at any instant of time, a large
number of uncommitted transactions may be operating. Thus, it is desirable to have
"cascadeless" schedules, which avoid cascading rollbacks.
This can be achieved by ensuring that transactions read only those values which are
written by committed transactions, so that there is no fear of any aborted or failed
transaction later on. If the schedule has a sequence wherein a transaction T1 has to read a
value X written by an uncommitted transaction T2, then the sequence is altered so that the
reading is postponed till T2 either commits or aborts.
The third type of schedule is the "strict schedule", which, as the name suggests, is highly
restrictive in nature. Here, transactions are allowed neither to read nor to write an item X
until the last transaction that wrote X has committed or aborted. Note that strict
schedules largely simplify the recovery process, but in many cases it may not be
possible to devise strict schedules.
It may be noted that recoverable schedules, cascadeless schedules and strict schedules
are each more stringent than their predecessor. The added stringency facilitates the
recovery process, but the processing may get delayed, or it may even become impossible
to construct such a schedule.
1.9 Serializability
[Two serial schedules of T1 and T2: in Schedule A, all the operations of T1 are executed
before those of T2; in Schedule B, all the operations of T2 are executed before those of T1.]
These can now be termed serial schedules, since the entire sequence of operations of
one transaction is completed before the next transaction is started.
In the interleaved mode, the operations of T1 are mixed with the operations of T2. This
can be done in a number of ways. Two such sequences are given below:
Schedule C (interleaved, non-serial):
    T1                          T2
    read_tr(X)
    X = X + N
                                read_tr(X)
                                X = X + P
    write_tr(X)
    read_tr(Y)
                                write_tr(X)
    Y = Y + N
    write_tr(Y)
Schedule D (interleaved, non-serial):
    T1                          T2
    read_tr(X)
    X = X + N
    write_tr(X)
                                read_tr(X)
                                X = X + P
                                write_tr(X)
    read_tr(Y)
    Y = Y + N
    write_tr(Y)
Formally, a schedule S is serial if, for every transaction T in the schedule, all the operations
of T are executed consecutively; otherwise it is called non-serial. In a serial
(non-interleaved) schedule, if the transactions are independent, one can also presume that the
schedule will be correct, since each transaction commits or aborts before the next
transaction begins. As long as the transactions individually are error free, such a
sequence of events is guaranteed to give correct results.
However, once the operations are interleaved so that the problems cited above are
overcome, then, unless the interleaving sequence is well thought out, all the problems that we
encountered at the beginning of this block can reappear. Hence, a methodology is
to be adopted to find out which of the interleaved schedules give correct results and
which do not.
The simplest and most obvious method to conclude that two such schedules
are equivalent is to compare their results. If they produce the same results, then they can
be considered equivalent; i.e. if two schedules are "result equivalent", they can be
considered equivalent. But such an oversimplification is full of problems. Two
sequences may produce the same results for one or even a large number of initial
values and still not be equivalent. Consider the following two sequences:
S1 S2
read_tr(X) read_tr(X)
X=X+X X=X*X
write_tr(X) Write_tr(X)
For a value X = 2, both produce the same result. Can we conclude that they are equivalent?
Though this may look like a simplistic example, with some imagination one can always
come up with more sophisticated examples wherein the "bugs" of treating them as
equivalent are less obvious. But the point still holds – result equivalence cannot mean
schedule equivalence. A more refined method of finding equivalence is available. It is
called "conflict equivalence". Two schedules are said to be conflict equivalent if the
order of any two conflicting operations is the same in both schedules (note that
conflicting operations essentially belong to two different transactions, access
the same data item, and at least one of them is a write_tr(X) operation). If two such
conflicting operations appear in different orders in the two schedules, then it is obvious
that they can produce two different databases in the end and hence they are not equivalent.
1.9.1 Testing for conflict serializability of a schedule: The test builds a "precedence
graph" from the schedule S, as shown in the sketch after the following steps.
1. For each transaction Ti participating in the schedule S, create a node labeled
Ti in the precedence graph.
2. For each case in S where Tj executes a read_tr(X) after Ti executes a write_tr(X),
create an edge from Ti to Tj in the precedence graph.
3. For each case in S where Tj executes a write_tr(X) after Ti executes a read_tr(X),
create an edge from Ti to Tj in the graph.
4. For each case in S where Tj executes a write_tr(X) after Ti executes a write_tr(X),
create an edge from Ti to Tj in the graph.
5. The schedule S is serialisable if and only if there are no cycles in the graph.
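These five steps translate almost directly into code. The sketch below builds the edge set from a schedule written as a list of (kind, transaction, item) operations and then looks for a cycle; the representation and function names are assumptions made for the illustration.

    def precedence_graph(schedule):
        """schedule: list of ("r" or "w", transaction_id, item) in execution order."""
        nodes = {t for _, t, _ in schedule}
        edges = set()
        for i, (k1, t1, x1) in enumerate(schedule):
            for k2, t2, x2 in schedule[i + 1:]:
                if t1 != t2 and x1 == x2 and "w" in (k1, k2):
                    edges.add((t1, t2))            # conflicting pair: edge Ti -> Tj
        return nodes, edges

    def has_cycle(nodes, edges):
        adj = {n: [b for a, b in edges if a == n] for n in nodes}
        state = {n: 0 for n in nodes}              # 0 unseen, 1 in progress, 2 done
        def visit(n):
            state[n] = 1
            for m in adj[n]:
                if state[m] == 1 or (state[m] == 0 and visit(m)):
                    return True
            state[n] = 2
            return False
        return any(state[n] == 0 and visit(n) for n in nodes)

    # Schedule C of the previous section: edges in both directions, hence a cycle
    # and hence not serializable.
    schedule_c = [("r", 1, "X"), ("r", 2, "X"), ("w", 1, "X"),
                  ("r", 1, "Y"), ("w", 2, "X"), ("w", 1, "Y")]
    nodes, edges = precedence_graph(schedule_c)
    print(edges, has_cycle(nodes, edges))          # edges 1->2 and 2->1, True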
If we apply this method to draw the precedence graphs for the four schedules (A, B, C
and D) of section 1.8, we get the following precedence graphs.
[Precedence graphs for the four schedules: Schedule A gives a single edge T1 → T2
(labelled with the item X); Schedule B gives a single edge T2 → T1; Schedule C gives
edges between T1 and T2 in both directions, i.e. a cycle, and is therefore not
serializable; Schedule D gives an acyclic graph and is therefore serializable.]
Apart from conflict equivalence of schedules and conflict serializability, another,
more restrictive, equivalence definition has been used with reasonable success in the
context of serializability. This is called view serializability.
Two schedules S and S1 are said to be "view equivalent" if the following conditions are
satisfied:
i) The same set of transactions participates in S and S1, and S and S1 include the same
operations of those transactions.
ii) If a transaction Ti reads a value of an item X in S that was written by a transaction Tj
(or reads the original value of X), then Ti must read the value of X under the same
condition in S1.
iii) For each data item X, if the last write operation on X in S is performed by a
transaction Ti, then the same transaction must perform the last write on X in S1.
The idea behind view equivalence is that, as long as each read operation of a transaction
reads the result of the same write operation in both schedules, the write operations of
each transaction will produce the same results. Hence, the read operations are said to see
the same "view" of both schedules. It can easily be verified that when S or S1 operates
independently on a database with the same initial state, they produce the same end state.
A schedule S is said to be view serializable if it is view equivalent to a serial schedule.
1.9.3 Uses of serializability:
utilization and the ability to cater to a larger number of concurrent users) with the
guarantee of correctness.
But all is not well yet. The scheduling process is done by the operating system
routines after taking into account various factors like the system load, the time of
transaction submission, the priority of the process with reference to other processes and
a large number of other factors. Also, since a very large number of interleaving
combinations are possible, it is extremely difficult to determine beforehand the manner
in which the transactions are interleaved. In other words, getting the various schedules
itself is difficult, let alone testing them for serializability.
Hence, instead of generating the schedules, checking them for serializability and
then using them, most DBMS protocols use a more practical method: they impose
restrictions on the transactions themselves. These restrictions, when followed by
every participating transaction, automatically ensure serializability in all
schedules that are created by these participating transactions.
1.10 Summary
The unit began with a discussion of transactions and their role in database
updates. The transaction, which is a logical way of describing a basic database
operation, is handy in analyzing various database problems. We noted that a transaction
basically does two operations – a read_tr(X) and a write_tr(X) – though other operations
are added later for various other purposes.
It was noted that in order to maintain system efficiency, and also for other
practical reasons, it is essential that concurrent operations are done on the database. This
in turn leads to various problems – like the lost update problem, the temporary update
problem, the incorrect summary problem etc.
Further, using these concepts, we were able to talk about a "schedule" of
a set of transactions and also about methods of analyzing the recoverability properties of
schedules, by finding out whether a schedule is "serializable" or not. Different
methods of testing serializability, and their effect on the recoverability or otherwise
of the system, were discussed.
6. What are the four desirable properties of transactions, commonly called the ACID properties?
7. What is a schedule?
8. What is a serializable schedule?
9. State how a precedence graph helps in deciding serializability?
10. What is rollback?
Answers
Unit 2
CONCURRENCY CONTROL TECHNIQUES
Structure:
2.0 Introduction
2.1.Objectives
2.2 Locking techniques for concurrency control
2.3 Types of locks and their uses
2.3.1 Binary locks
2.4 Shared/Exclusive locks
2.5 Conversion Locks
2.6 Deadlock and Starvation
2.6.1 Deadlock prevention protocols
2.6.2 Deadlock detection & timeouts
2.6.3 Starvation
2.7 Concurrency control based on Time Stamp ordering
2.7.1 The Concept of time stamps
2.7.2 An algorithm for ordering the time stamp
2.7.3 The concept of basic time stamp ordering
2.7.4 Strict Time Stamp Ordering
2.8 Multiversion concurrency control techniques
2.8.1 Multiversion Technique based on timestamp ordering
2.8.2 Multiversion two phase locking certify locks
2.9 Summary
2.10 Review Questions & Answers
2.0. Introduction
In this unit, you are introduced to the concept of locks. A lock is just that –
you can lock an item so that only you can access it. This concept becomes
useful in read and write operations, so that data that is currently being written is not
accessed by any other transaction until the writing process is complete. The transaction
writing the data simply locks up the item and releases it only after its operations are
complete – possibly after it has committed itself to the new value.
We discuss the binary lock, which can either lock or unlock the item.
There is also a system of shared/exclusive locks, in which a read-locked item can
be shared by other transactions in the read mode, while a write lock remains exclusive.
Then there is also the concept of two-phase locking, to ensure that serializability is
maintained by way of locking.
You are also introduced to the concept of time stamps. Each transaction
carries a value indicating when it came into the system. This can help in various
operations of concurrency control, recoverability etc. By ordering the schedules in terms
of their time stamps, it is possible to ensure serializability. We see the various algorithms
that can do this ordering.
2.1 Objectives
When you complete this unit, you will be able to understand,
2.2 Locking techniques for concurrency control
Many of the important techniques for concurrency control make use of the
concept of a lock. A lock is a variable associated with a data item that describes
the status of the item with respect to the possible operations that can be done on it.
Normally, every data item is associated with a unique lock. Locks are used as a
method of synchronizing the access to database items by the transactions that are
operating concurrently. Such controls, when implemented properly, can overcome
many of the problems of concurrent operations listed earlier. However, the locks
themselves may create a few problems, which we shall see in some detail in
subsequent sections.
2.3.1 Binary locks: A binary lock can have two states or values (1 or 0); one of them
indicates that the item is locked and the other that it is unlocked. For example, if we
presume that 1 indicates the lock is on and 0 indicates it is open, then if the lock of item X,
denoted lock(X), is 1, a read_tr(X) cannot access the item as long as the lock's value
continues to be 1.
The concept works like this. The item X can be accessed only when it is free to be
used by the transactions. If, say, its current value is being modified, then X cannot be
(in fact, should not be) accessed till the modification is complete. The simple mechanism
is to lock access to X as long as the process of modification is on, and to unlock it for use
by the other transactions only when the modifications are complete.
So we need two operations: lockitem(X), which locks the item, and unlockitem(X),
which opens the lock. Any transaction that wants to make use of the data item first
checks the lock status of X by lockitem(X). If the item X is already locked (lock
status = 1), the transaction will have to wait. Once the status becomes 0, the transaction
accesses the item and locks it (makes its status 1). When the transaction has completed
using the item, it issues an unlockitem(X) command, which again sets the status to 0, so
that other transactions can access the item.
Notice that the binary lock essentially produces a "mutually exclusive" type of
situation for the data item, so that only one transaction can access it at a time. These
operations can be written as an algorithm as follows:
Lockitem(X):
Start: if Lock(X) = 0              /* item is unlocked */
           then Lock(X) ← 1        /* lock it */
       else
           {
             wait (until Lock(X) = 0 and
                   the lock manager wakes up the transaction);
             go to Start
           }

Unlockitem(X):
       Lock(X) ← 0;                /* unlock the item */
       if any transactions are waiting,
           wakeup one of the waiting transactions
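In a real implementation the wait/wakeup pair is typically built on an operating-system primitive such as a condition variable. Below is a rough Python equivalent of the two algorithms; the class name BinaryLockTable and its methods are assumptions made for this sketch.

    import threading

    class BinaryLockTable:
        def __init__(self):
            self._cv = threading.Condition()
            self._locked = {}                    # item -> 0 (free) or 1 (locked)

        def lock_item(self, x):
            with self._cv:                       # whole operation is indivisible
                while self._locked.get(x, 0) == 1:
                    self._cv.wait()              # wait until woken by the lock manager
                self._locked[x] = 1              # lock it

        def unlock_item(self, x):
            with self._cv:
                self._locked[x] = 0              # unlock the item
                self._cv.notify()                # wake up one waiting transaction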
The only restriction on the use of binary locks is that the lock and unlock operations
should be implemented as indivisible units (also called "critical sections" in operating
systems terminology). That means no interleaving should be allowed once a lock or
unlock operation is started, until the operation is completed. Otherwise, if a transaction
locks a unit and gets interleaved with many other transactions, the locked unit may
become unavailable for long times to come, with catastrophic results.
To make use of the binary lock schemes, every transaction should follow certain
protocols:
1. A transaction T must issue the operation lockitem(X), before issuing a
readtr(X) or writetr(X).
2. A transaction T must issue the operation unlockitem(X) after all readtr(X) and
write_tr(X) operations are complete on X.
3. A transaction T will not issue a lockitem(X) operation if it already holds the
lock on X (i.e. if it had issued the lockitem(X) in the immediate previous
instance)
4. A transaction T will not issue an unlockitem(X) operation unless it holds the
lock on X.
Between the lockitem(X) and unlockitem(X) operations, the item X is held
exclusively by the transaction T, and hence no other transaction can operate on X;
thus many of the problems discussed earlier are prevented.
While the operation of the binary lock scheme appears satisfactory, it suffers from
a serious drawback: once a transaction holds a lock (has issued a lock operation), no
other transaction can access the data item. But in large concurrent systems, this can
become a disadvantage. It is obvious that more than one transaction should not be allowed
to write into X simultaneously, and that while one transaction is writing into it no other
transaction should be reading it; but no harm is done if several transactions are allowed
to read the item simultaneously. This would save the time of all these transactions,
without in any way affecting the performance.
This concept gave rise to the idea of shared/exclusive locks. When only read
operations are being performed, the data item can be shared by several transactions;
it is only when a transaction wants to write into the item that the lock must be exclusive.
Hence the shared/exclusive lock is also sometimes called a multiple-mode lock. A read
lock is a shared lock (which can be held by several transactions), whereas a write lock is
an exclusive lock. So we need to think of three operations: readlock, writelock and
unlock. The algorithms can be as follows:
Readlock(X):
Start: if Lock(X) = "unlocked"
           then { Lock(X) ← "read-locked"; no_of_reads(X) ← 1 }
       else if Lock(X) = "read-locked"
           then no_of_reads(X) ← no_of_reads(X) + 1
       else { wait (until Lock(X) = "unlocked" and the lock manager
              wakes up the transaction); go to Start }
End;

Writelock(X):
Start: if Lock(X) = "unlocked"
           then Lock(X) ← "write-locked"
       else { wait (until Lock(X) = "unlocked" and the lock manager
              wakes up the transaction); go to Start }
End;
The unlock operation:
Unlockitem(X):
       if Lock(X) = "write-locked"
           then { Lock(X) ← "unlocked";
                  wakeup one of the waiting transactions, if any }
       else if Lock(X) = "read-locked"
           then { no_of_reads(X) ← no_of_reads(X) – 1;
                  if no_of_reads(X) = 0
                      then { Lock(X) ← "unlocked";
                             wakeup one of the waiting transactions, if any }
                }
The algorithms are fairly straightforward, except that during the unlocking
operation, if a number of read locks are held, then all of them have to be released before
the item itself becomes unlocked.
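For comparison, here is a rough Python sketch of one shared/exclusive lock entry, again using a condition variable; the class and method names are assumptions made only for this illustration.

    import threading

    class SharedExclusiveLock:
        """One lock entry: effectively 'unlocked', 'read-locked' or 'write-locked'."""
        def __init__(self):
            self._cv = threading.Condition()
            self._readers = 0                    # plays the role of no_of_reads(X)
            self._writer = False

        def read_lock(self):
            with self._cv:
                while self._writer:              # wait while the item is write-locked
                    self._cv.wait()
                self._readers += 1               # read lock is shared

        def write_lock(self):
            with self._cv:
                while self._writer or self._readers > 0:
                    self._cv.wait()              # exclusive: wait for all locks to clear
                self._writer = True

        def unlock(self):
            with self._cv:
                if self._writer:
                    self._writer = False
                elif self._readers > 0:
                    self._readers -= 1           # last reader actually frees the item
                self._cv.notify_all()            # wake up the waiting transactions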
To ensure smooth operation of the shared / exclusive locking system, the system
must enforce the following rules:
1. A transaction T must issue the operation readlock(X) or writelock(X) before any
readtr(X) operation is performed on X.
2. A transaction T must issue the operation writelock(X) before any writetr(X)
operation is performed on X.
3. A transaction T must issue the operation unlock(X) after all readtr(X) and
writetr(X) operations are completed on X.
4. A transaction T will not issue a readlock(X) operation if it already holds a
readlock or writelock on X.
5. A transaction T will not issue a writelock(X) operation if it already holds a
readlock or writelock on X.
Before we close this section, it should be noted that the use of binary or shared/exclusive
locks does not by itself guarantee serializability. This is because, in certain combinations
of situations, a lock-holding transaction may end up unlocking the item too early. This can
happen for a variety of reasons, including a situation wherein a transaction feels it
no longer needs a particular data item and hence unlocks it, but may indirectly
write into it at a later time (through some other item). This would result in ineffective
locking performance and the serializability is lost. To guarantee serializability, the
protocol of two-phase locking is to be implemented, which we will see in the next
section.
2.5 Two-phase locking:
A transaction is said to follow the two-phase locking protocol if all its locking
operations (readlock and writelock) precede its first unlock operation. The transaction
thus has a growing (first) phase, in which it acquires locks, and a shrinking (second)
phase, in which it releases them, as in the following example:
readlock(Y)
readtr(Y)                Phase I
writelock(X)
-----------------------------------
unlock(Y)
readtr(X)                Phase II
X = X + Y
writetr(X)
unlock(X)
Two-phase locking, though it provides serializability, has a disadvantage. Since
the locks are not released immediately after the use of an item is over, but are retained till
all the other needed locks are also acquired, the desired amount of interleaving may not
be achieved. Worse, while a transaction T may be holding an item X, even though it is not
using it, just to satisfy the two-phase locking protocol, another transaction T1 may
genuinely need the item but will be unable to get it till T releases it. This is the price
that is to be paid for the guaranteed serializability provided by the two-phase locking
system.
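One way to enforce the two-phase rule mechanically is to refuse any lock request made after the first unlock. The sketch below does exactly that, reusing the BinaryLockTable from the earlier illustration; all names here are assumptions, not a standard API.

    class TwoPhaseTransaction:
        """Enforces: no lock may be acquired after the first unlock."""
        def __init__(self, lock_table):
            self.lock_table = lock_table
            self.held = set()
            self.shrinking = False               # becomes True at the first unlock

        def lock(self, item):
            if self.shrinking:
                raise RuntimeError("two-phase rule violated: lock after unlock")
            self.lock_table.lock_item(item)      # growing phase
            self.held.add(item)

        def unlock(self, item):
            self.shrinking = True                # shrinking phase begins
            self.lock_table.unlock_item(item)
            self.held.discard(item)

        def unlock_all(self):                    # typically done at commit or abort
            for item in list(self.held):
                self.unlock(item)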
2.6 Deadlock and Starvation:
    T11                         T21
    readlock(Y)
    readtr(Y)
                                readlock(X)
                                readtr(X)
    writelock(X)    (waits for T21 to release X)
                                writelock(Y)    (waits for T11 to release Y)

[The status (wait-for) graph: an edge from T11 to T21 and an edge from T21 to T11,
forming a cycle.]
A deadlock occurs when each of the transactions involved is waiting to get one more item
while not releasing the items already held by it. One solution is to develop a protocol
wherein a transaction first obtains locks on all the items that it needs before it begins
execution; i.e. if it cannot get any one or more of the items, it does not hold the other
items either, so that those items can be useful to any other transaction that may be
needing them. This method, though it prevents deadlocks, further limits the prospects of
concurrency.
A better way to deal with deadlocks is to identify a deadlock when it occurs and
then take some decision. The transaction involved in the deadlock may be blocked or
aborted, or the transaction may preempt and abort the other transaction involved. In a
typical case, the concept of a transaction time stamp TS(T) is used. Based on when the
transaction was started (given by the time stamp: the larger the value of TS, the younger
the transaction), two methods of deadlock prevention are devised. Suppose a transaction
Ti requests an item that is currently locked by Tj:
i) Wait-die: If Ti is older than Tj (TS(Ti) < TS(Tj)), Ti is allowed to wait; otherwise
(Ti is younger) Ti is aborted ("dies") and restarted later with the same time stamp.
ii) Wound-wait: If Ti is older than Tj, then Tj is aborted (Ti "wounds" Tj) and restarted
later with the same time stamp; otherwise Ti is allowed to wait.
It may be noted that in both cases, the younger transaction gets aborted, but
the actual method of aborting is different. Both these methods can be proved to be
deadlock free, because no cycles of waiting, as seen earlier, are possible with these
arrangements.
There is another class of protocols that do not require any time stamps. They
include the "no waiting" algorithm and the "cautious waiting" algorithm. In the no-waiting
algorithm, if a transaction cannot get a lock, it is aborted immediately (no waiting) and
restarted at a later time. But since there is no guarantee that the new situation is deadlock
free, it may have to be aborted again. This may lead to a situation where a transaction
ends up getting aborted repeatedly.
To overcome this problem, the cautious waiting algorithm was proposed. Here,
suppose the transaction Ti tries to lock an item X but cannot get it, since X is already
locked by another transaction Tj. The solution is as follows: if Tj is not blocked
(not waiting for some other locked item), then Ti is blocked and allowed to wait;
otherwise Ti is aborted. This method not only reduces repeated aborting, but can also be
proved to be deadlock free, since, of Ti and Tj, only one is blocked, after ensuring that
the other is not blocked.
The second method of dealing with deadlocks is to detect deadlocks as and when
they happen. The basic problem with the previously suggested protocols is that they assume
that we know what is happening in the system – which transaction is waiting for which
item, and so on. But in a typical case of concurrent operations, the situation is fairly
complex and it may not be possible to predict the behavior of the transactions.
In such cases, the easier method is to take on deadlocks as and when they happen
and try to resolve them. A simple way to detect a deadlock is to maintain a "wait-for"
graph. One node in the graph is created for each executing transaction. Whenever a
transaction Ti is waiting to lock an item X which is currently held by Tj, an edge (Ti → Tj)
is created in the graph. When Tj releases X, this edge is dropped. It is easy to see that
whenever there is a deadlock situation, a cycle is formed in the wait-for graph, so that
suitable corrective action can be taken. Once a deadlock has been detected, the
transaction to be aborted has to be chosen. This is called "victim selection", and
generally newer transactions are selected for victimization.
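A wait-for graph with cycle detection fits in a few lines of code. The sketch below is illustrative only; the class name and methods are assumptions, and it uses the deadlock example of the previous section.

    # Sketch of a wait-for graph: edge Ti -> Tj means Ti waits for an item held by Tj.
    class WaitForGraph:
        def __init__(self):
            self.edges = set()

        def add_wait(self, ti, tj):
            self.edges.add((ti, tj))

        def remove_waits_on(self, tj):           # called when Tj releases its items
            self.edges = {(a, b) for (a, b) in self.edges if b != tj}

        def has_deadlock(self):
            adj = {}
            for a, b in self.edges:
                adj.setdefault(a, []).append(b)
            visiting, done = set(), set()
            def dfs(n):
                visiting.add(n)
                for m in adj.get(n, []):
                    if m in visiting or (m not in done and dfs(m)):
                        return True
                visiting.discard(n)
                done.add(n)
                return False
            return any(n not in done and dfs(n) for n in list(adj))

    g = WaitForGraph()
    g.add_wait("T11", "T21")      # T11 waits for X, held by T21
    g.add_wait("T21", "T11")      # T21 waits for Y, held by T11
    print(g.has_deadlock())       # True -- a victim must be selected and aborted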
Another easy method of dealing with deadlocks is the use of timeouts. Whenever
a transaction is made to wait for a period longer than a predefined limit, the system
assumes that a deadlock has occurred and aborts the transaction. This method is simple
and has low overheads, but may end up aborting a transaction even when there is no
deadlock.
2.6.3 Starvation:
The other side effect of locking is starvation, which happens when a transaction
cannot proceed for indefinitely long periods though the other transactions in the system
continue normally. This may happen if the waiting scheme for locked items is
unfair, i.e. if some transactions may never be able to get the items because one or
other of the higher-priority transactions is continuously using them. Then the low priority
transaction will be forced to "starve" for want of resources.
2.7.2 An algorithm for ordering the time stamps: The basic concept is to order the
transactions based on their time stamps. A schedule made of such transactions is then
serializable. This concept is called time stamp ordering (TO). The algorithm should
ensure that, whenever a data item is accessed by conflicting operations in the schedule,
the data is available to them in the serializability order. To achieve this, the algorithm
uses two time stamp values for each item:
1. read_TS(X): This indicates the largest time stamp among the transactions that
have successfully read the item X. Note that the largest time stamp actually refers
to the youngest of the transactions in the set (that have read X).
2. write_TS(X): This indicates the largest time stamp among all the transactions that
have successfully written the item X. Note that the largest time stamp actually
refers to the youngest transaction that has written X.
The above two values are often referred to as the "read time stamp" and "write time
stamp" of the item X.
2.7.3 The concept of basic time stamp ordering: Whenever a transaction T tries to read or
write an item X, the algorithm compares the time stamp of T with the read time stamp or
the write time stamp of the item X, as the case may be. This is done to ensure that T does
not violate the order of the time stamps. The violation can come in the following ways:
1. Transaction T is trying to write X
a) If read_TS(X) > TS(T) or write_TS(X) > TS(T), then abort and roll back
T and reject the operation. In plain words, if a transaction younger than T
has already read or written X, the time stamp ordering is violated and
hence T is to be aborted; all the values written by T so far need to be
rolled back, which may also involve cascaded rollbacks.
b) Otherwise (i.e. if neither of the above conditions holds), execute the
write_tr(X) operation and set write_TS(X) to TS(T), i.e. allow the operation
and set the write time stamp of X to that of T, since T is the latest
transaction to have accessed X.
2. Transaction T is trying to read X
a) If write_TS(X) > TS(T), then abort and roll back T and reject the operation,
since a transaction younger than T has already written X.
b) If write_TS(X) <= TS(T), execute read_tr(X) and set read_TS(X) to the
larger of the two values, namely TS(T) and the current read_TS(X).
This algorithm ensures proper ordering and also avoids deadlocks, by penalizing the older
transaction when it tries to overhaul an operation already done by a younger transaction.
Of course, the aborted transaction will be reintroduced later with a "new" time stamp.
However, in the absence of any other monitoring protocol, the algorithm may create
starvation in the case of some transactions.
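The two rules above can be collected into a small checker. This is only a sketch of basic timestamp ordering, with assumed data structures, and it ignores the actual rollback and restart of aborted transactions (time stamps are assumed to be positive numbers).

    class BasicTO:
        def __init__(self):
            self.read_ts = {}    # item -> largest TS of a transaction that read it
            self.write_ts = {}   # item -> largest TS of a transaction that wrote it

        def write(self, ts, item):
            if self.read_ts.get(item, 0) > ts or self.write_ts.get(item, 0) > ts:
                return "abort"                  # a younger transaction got there first
            self.write_ts[item] = ts
            return "ok"

        def read(self, ts, item):
            if self.write_ts.get(item, 0) > ts:
                return "abort"                  # already written by a younger transaction
            self.read_ts[item] = max(self.read_ts.get(item, 0), ts)
            return "ok"

    to = BasicTO()
    print(to.read(5, "X"))     # ok    (a transaction with TS 5 reads X)
    print(to.write(3, "X"))    # abort (an older transaction tries to write after a younger read)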
2.7.4 Strict Time Stamp Ordering: This variation of the time stamp ordering algorithm
ensures that the schedules are both "strict" (so that recoverability is enhanced) and
serializable. In this case, any transaction T that tries to read or write an item X such that
write_TS(X) < TS(T) is made to wait until the transaction T' that originally wrote into X
(hence whose time stamp matches the write time stamp of X, i.e. TS(T') = write_TS(X)) is
committed or aborted. This algorithm also does not cause any deadlock, since T waits for
T' only if TS(T) > TS(T').
2.8 Multiversion concurrency control techniques: Whenever a transaction writes a data
item, the new value of the item is made available, as is the older version. Normally the
transactions are given access to the newer version, but in case of conflicts the policy is to
allow the "older" transaction to have access to the "older" version of the item.
The obvious drawback of this technique is that more storage is required to
maintain the different versions. But in many cases, this may not be a major drawback,
since most database applications continue to retain the older versions anyway, for the
purposes of recovery or for historical purposes.
2.8.1 Multiversion technique based on timestamp ordering: In this method, several
versions X1, X2, … Xk of each data item X are maintained, and for each version its value
and its read and write timestamps are kept. Whenever a transaction T writes into X, a new
version Xk+1 is created, with both write_TS(Xk+1) and read_TS(Xk+1) being set to TS(T).
Whenever a transaction T reads X, the read_TS(Xi) of the version Xi that it reads is set to
the larger of the two values read_TS(Xi) and TS(T).
To ensure serializability, the following rules are adopted.
i) If T issues a write_tr(X) operation, find the version Xi with the highest write_TS(Xi)
that is less than or equal to TS(T). If read_TS(Xi) > TS(T), then abort and roll back T;
else create a new version Xk of X with read_TS(Xk) = write_TS(Xk) = TS(T).
In plain words, if the version with the highest write timestamp not exceeding that of T
has already been read by a transaction younger than T, then we have no option but to
abort T and roll back all its effects; otherwise a new version of X is created with its read
and write timestamps initialised to that of T.
ii) If a transaction T issues a read_tr(X) operation, find the version Xi with the highest
write_TS(Xi) that is less than or equal to TS(T); return the value of Xi to T and
set the value of read_TS(Xi) to the larger of TS(T) and the current read_TS(Xi).
This only means: find the latest version of X that T is eligible to read, and
return its value to T. Since T has now read the value, find out whether it is the
youngest transaction to read X by comparing its timestamp with the current read timestamp
of that version. If T is younger (if its timestamp is higher), store its timestamp as that of
the youngest reader of the version; else retain the earlier value.
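The two rules can be sketched as follows; the version list layout and the class name are assumptions for the illustration, and the rollback of an aborted transaction is again left out.

    class MultiversionTO:
        def __init__(self, initial):
            # each item keeps a list of versions: (value, write_ts, read_ts)
            self.versions = {x: [(v, 0, 0)] for x, v in initial.items()}

        def _latest_eligible(self, item, ts):
            """Index of the version with the highest write_ts that is <= TS(T)."""
            best = None
            for i, (_, wts, _) in enumerate(self.versions[item]):
                if wts <= ts and (best is None or wts > self.versions[item][best][1]):
                    best = i
            return best

        def read(self, ts, item):
            i = self._latest_eligible(item, ts)
            value, wts, rts = self.versions[item][i]
            self.versions[item][i] = (value, wts, max(rts, ts))   # record youngest reader
            return value

        def write(self, ts, item, value):
            i = self._latest_eligible(item, ts)
            _, wts, rts = self.versions[item][i]
            if rts > ts:
                return "abort"                 # that version was read by a younger transaction
            self.versions[item].append((value, ts, ts))           # create a new version
            return "ok"

    mv = MultiversionTO({"X": 100})
    print(mv.read(5, "X"))        # 100, read by a transaction with TS 5
    print(mv.write(3, "X", 90))   # abort: the version was already read by a younger transaction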
readlock, it will not be granted, because the action has now shifted to the first-row,
second-column element. In the modified (multimode) locking system, the concept is
extended by adding one more row and column to the table:
                Read    Write   Certify
    Read        Yes     Yes     No
    Write       Yes     No      No
    Certify     No      No      No
The multimode locking system works on the following lines. When one of the
transactions has obtained a write lock for a data item, the other transactions may still be
provided with read locks for the item. To ensure this, two versions of X are
maintained: X(old) is a version which has been written and committed by a previous
transaction. When a transaction T wants a write lock, a new version X(new) is created
and handed over to T for writing. While T continues to hold the lock on X(new), other
transactions can continue to use X(old) under read locks.
Once T is ready to commit it should get exclusive “certify” locks on all items it
wants to commit by writing. Note that “write lock” is no more an exclusive lock under
our new scheme of things, since while one transaction is holding a write lock on X,
one/more other transactions may be holding the read locks of the same X. To provide
certify lock, the system waits till all other read locks are cleared on the item. Note that
this process has to repeat on all items that T wants to commit.
Once all these items are under the certify lock of the transaction, it can commit to
it’s values. From now on, the X(new) become X(old) and X(new) values will be created
only if another T wants a write lock on X. This scheme avoids cascading rollbacks. But
since a transaction will have to get exclusive certify rights on all items, before it can
commit, a delay in the commit operation is inevitable. This may also leads to
complexities like dead locks and starvation.
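A minimal sketch of the compatibility test implied by the table above is given below. The
COMPATIBLE dictionary and the can_grant helper are names invented for illustration; lock
queuing, upgrades and the handling of the X(old)/X(new) versions are deliberately left out.

# Lock compatibility matrix for the multimode (read / write / certify) scheme.
# COMPATIBLE[held][requested] is True when the requested lock can be granted
# while another transaction already holds 'held' on the same item.
COMPATIBLE = {
    "read":    {"read": True,  "write": True,  "certify": False},
    "write":   {"read": True,  "write": False, "certify": False},
    "certify": {"read": False, "write": False, "certify": False},
}

def can_grant(held_locks, requested):
    """held_locks: lock modes currently held by other transactions on the item."""
    return all(COMPATIBLE[h][requested] for h in held_locks)

# A certify lock is granted only after every read lock on the item is released.
print(can_grant(["read", "write"], "certify"))   # False
print(can_grant([], "certify"))                  # True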
2.9 Summary
This unit introduced you to two very important concepts of concurrency control –
namely the locking techniques and the timestamp technique. In the locking technique, the
data item currently needed by a transaction is kept locked until the transaction completes
its use of it, possibly till the transaction either commits or aborts. This ensures that the
other transactions do not access or update the data erroneously. It can be
implemented very easily by introducing a binary bit: 1 indicating that the item is locked and 0
indicating that it is available. Any transaction that needs a locked item will simply have to wait.
Obviously this introduces time delays. Some delays can be reduced by noting that a write-
locked data item can be simultaneously read-locked by other transactions. This concept
leads to the use of shared locks. It was also shown that locking can be used to ensure
serializability. But when different transactions keep different items locked with them,
situations of deadlock and starvation may crop up. Various methods of identifying
deadlocks and breaking them (mostly by penalizing one of the participating transactions)
were discussed.
We also looked into the concept of timestamps – wherein each transaction bears a
stamp which indicates when it came into the system. This can be used to ensure
serializability – by ordering the transactions based on their timestamps – and we saw several
such algorithms. The timestamps can also be used in association with the system log to
ensure that rollback operations proceed satisfactorily.
7. What is a wait-for graph? What is its use?
8. What is starvation?
9. What is a timestamp?
10. What is multiversion concurrency control?
Answers
Unit 3
DATABASE RECOVERY TECHNIQUES
Structure
3.0 Introduction
3.1 Objectives
3.2 Concept of recovery
3.2.1 The role of the operating system in recovery:
3.3 Write ahead logging
3.4 Role of check points in recovery
3.5 Recovery techniques based on Deferred Update:
3.6 An algorithm for recovery using the deferred update in a single user environment
3.7 Deferred update with Concurrent execution
3.8 Recovery techniques on immediate update
3.8.1 A typical UNDO/REDO algorithm for an immediate update single user environment
3.8.2 The UNDO/REDO recovery based on immediate update with concurrent
execution:
3.9 Shadow paging
3.10 Backup and Recovery in the case of catastrophic failures
3.11 Some aspects of database security and authorisation
3.12 Summary
3.13 Review Questions & Answers
3.0 Introduction
In this unit, you are introduced to some of the database recovery techniques. You
are introduced to the concept of caching of disk blocks and the mode of operation of
these cached elements to aid the recovery process. The concept of "in place updating"
(wherein the data is updated at its original disk location) as compared to shadowing (where a
new location is used) will be discussed.
The actual recovery process depends on whether the system uses write-ahead
logging or not. Also, the updated data may be written back to the disk even before the
transaction commits (which is called a "steal" approach) or the system may wait till the commit
operation takes place (which is a "no steal" approach). Further, you are introduced to the concept of
check pointing, which does a lot to improve the efficiency of the roll back operation.
Based on these concepts, we write simple algorithms that do the roll back operation for
single user and multiuser systems.
Finally, we look into the preliminaries of database security and access control.
The types of privileges that the DBA can provide at the discretionary level and also the
concept of level wise security mechanism are discussed.
3.1 Objectives
When you complete this unit, you will be able to understand,
2. In situations where the database is not damaged but has lost consistency because
of transaction failures etc., the method is to retrace the steps from the state of the
crash (which has created the inconsistency) until the previously encountered state of
consistency is reached. The method normally involves undoing certain operations
and restoring previous values using the log.
In general two broad categories of these retracing operations can be identified. As
we have seen previously, most often, the transactions do not update the database
as and when they complete the operation. So, if a transaction fails or the system
crashes before the commit operation, those values need not be retraced. So no
“undo” operation is needed. However, if one is still interested in getting the
results out of the transactions, then a “Redo” operation will have to be taken up.
Hence, this type of retracing is often called the NO-UNDO/REDO algorithm. The
whole concept works only when the system is operating in a "deferred update"
mode.
However, this may not always be the case. In certain situations, where the system
is working in the "immediate update" mode, the transactions keep updating the
database without waiting for the commit operation. In such cases, the updates will
normally have reached the disk as well. Hence, if the system fails while such
immediate updates are being made, it becomes necessary to undo the operations
already recorded on the disk, using the log. This will help us reach the previous
consistent state. From there onwards, the transactions will have to be redone.
Hence, this method of recovery is often termed the UNDO/REDO algorithm.
3.2.1 The role of the operating system in recovery: In most cases, the operating system
functions play a critical role in the process of recovery. Most often the system maintains
a copy of some parts of the DBMS (called pages) in a fast memory called the cache.
Whenever data is to be updated, the system first checks whether the required record is
available in the cache. If so, the corresponding record in the cache is updated. Since the
cache size is normally limited, it cannot hold the entire DBMS, but holds only a few
pages. When a data item located in a page that is not currently in the cache is to be updated,
the page has to be brought into the cache. To do this, some page of the cache may have to be
written back to the disk to make room for the new page.
When a new page is brought into the cache, each record in it is associated with a bit, called
the "dirty bit". This indicates whether the record has been modified or not. Initially its value
is 0, and when and if the record is modified by a transaction, the bit is set to 1. Note that when the
page is written back to the disk, only those records whose dirty bits are 1 need to be
updated. (This of course implies "in-place writing", i.e. the page is sent back to its
original location on the disk, where the not-yet-updated data is still in place. Otherwise, the
entire page would need to be rewritten at a new location on the disk.)
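The dirty-bit bookkeeping can be sketched as follows. The CachedPage class and the disk
object with a write method are assumptions made purely for illustration; page replacement
policy and record layout are ignored.

# Illustrative cache page with per-record dirty bits (in-place writing assumed).
class CachedPage:
    def __init__(self, page_no, records):
        self.page_no = page_no
        self.records = dict(records)                 # record_id -> value
        self.dirty = {rid: 0 for rid in records}     # 0 = unmodified, 1 = modified

    def update(self, rid, value):
        self.records[rid] = value
        self.dirty[rid] = 1                          # mark the record as dirty

    def flush(self, disk):
        # Only records whose dirty bit is 1 need to be written back in place.
        for rid, bit in self.dirty.items():
            if bit:
                disk.write(self.page_no, rid, self.records[rid])  # hypothetical disk API
                self.dirty[rid] = 0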
In some cases, a "shadowing" concept is used, wherein the updated page is written
elsewhere on the disk, so that both the previous and the updated versions are available on
the disk.
When in-place updating is being used, it is necessary to maintain a log for recovery
purposes. Normally, before the updated value is written onto the disk, the earlier value
(called the before image, BFIM) is noted down elsewhere on the disk for recovery
purposes. This process of recording entries before the update is called "write-ahead
logging" (the log is written ahead of the update). It is to be noted that the type of logging
also depends on the type of recovery. If a NO-UNDO/REDO type of recovery is being used,
then only the new values which could not be written back before the crash need to be logged.
But in the UNDO/REDO type, the before images as well as the new values that were computed
but could not be written back need to be logged.
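As a rough sketch, write-ahead logging simply forces the log record carrying the before
image to the disk before the in-place update is applied. The log and disk objects and their
methods below are placeholders invented for this illustration, not a real DBMS API.

# Write-ahead logging sketch: the before image (BFIM) reaches the log on disk
# before the data item itself is updated in place.
def wal_update(log, disk, txn_id, item, new_value):
    old_value = disk.read(item)                      # BFIM
    log.append((txn_id, item, old_value, new_value)) # record before and after images
    log.force()                                      # the log record hits the disk first
    disk.write(item, new_value)                      # only now update the item in place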
Two other update mechanisms need brief mention. In one approach, the cache pages updated
by a transaction cannot be written back to the disk by the DBMS manager until
the transaction commits. If the system strictly follows this approach, it is called a
"no steal" approach. However, in some cases, the protocol allows the writing of an
updated buffer back to the disk even before the transaction commits. This may be done,
for example, when some other transaction is in need of the results. This is called the
"steal" approach.
Secondly, if all pages updated by a transaction are written back to the disk as soon as the
transaction commits, it is a "force" approach; otherwise it is called a "no force" approach.
Most protocols make use of steal/no-force strategies, so that there is no urgency of
writing the buffers back to the disk the moment a transaction commits.
However, just the before image (BFIM) and after image (AFIM) values may not be
sufficient for successful recovery. A number of lists, including the list of active
transactions (those that have started operating but have not yet committed), committed
transactions and aborted transactions, need to be maintained to avoid a brute-force
method of recovery.
A "checkpoint", as the name suggests, indicates that everything is fine up to that
point. In a log, when a checkpoint is encountered, it indicates that all updates up to that
point have been written back to the database on the disk. Any further crash or system failure
will have to take care only of the data appearing beyond this point. Put another way, all
transactions that have their commit entries in the log before this point need no rolling
back.
The recovery manager of the DBMS will decide at what intervals checkpoints need to
be inserted (in turn, at what intervals data is to be written back to the disk). It can be
either after specific periods of time (say M minutes) or after a specific number of transactions (t
transactions). When the protocol decides to checkpoint, it does the following:
a) Suspend all transaction executions temporarily.
b) Force write all memory buffers to the disk.
c) Insert a check point in the log and force write the log to the disk.
d) Resume the execution of transactions.
The force writing need not refer only to the modified data items; it can include the
various lists and other auxiliary information indicated previously.
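Steps a) to d) above can be summarised in a small sketch. The dbms object and its methods
are assumptions made for illustration only; a real recovery manager would interleave this
with buffer management and logging code.

# Sketch of the (non-fuzzy) checkpointing steps a) - d).
def take_checkpoint(dbms):
    dbms.suspend_transactions()        # a) temporarily halt transaction execution
    dbms.flush_all_buffers()           # b) force-write modified buffers to the disk
    dbms.log.append("CHECKPOINT")      # c) insert a checkpoint record ...
    dbms.log.force()                   #    ... and force-write the log to the disk
    dbms.resume_transactions()         # d) let the transactions continue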
However, the force writing of all the data pages may take some time, and it would be
wasteful to halt all transactions until then. A better way is to make use of "fuzzy
checkpointing", wherein the checkpoint is inserted and, while the buffers are being
written back (beginning from the previous checkpoint), the transactions are allowed to
resume. This way the I/O time is saved. Until all data up to the new checkpoint is written
back, the previous checkpoint is held valid for recovery purposes.
However, in practice, many transactions are very long and it is dangerous to hold all
their updates in the buffer, since the buffers can run out of space and may need a page
replacement. To avoid situations wherein a page is removed inadvertently, a
simple two-pronged protocol is used:
1. A transaction cannot change the DBMS values on the disk until it commits.
2. A transaction does not reach its commit stage until all its updated values are written
onto the log and the log itself is force-written onto the disk.
Notice that in case of failures, recovery is by the NO-UNDO/REDO technique, since all
the required data will be in the log if the system fails after a transaction commits.
3.6 An algorithm for recovery using the deferred update in a single user
environment.
In a single user environment, the algorithm is a straight application of the REDO
procedure. It uses two lists of transactions: the transactions committed since the last
checkpoint and the transactions active at the time of the crash. Apply REDO
to all write_tr operations of the committed transactions from the log, and let the active
transactions run again.
The assumption is that the REDO operations are "idempotent", i.e. the operations
produce the same results irrespective of the number of times they are redone, provided
they start from the same initial state. This is essential to ensure that the recovery
operation does not produce a result that is different from the case where no crash had
occurred at all.
(Though this may look like a trivial constraint, students may verify for themselves that not
all DBMS applications satisfy this condition.)
Also, since there was only one transaction active (because it is a single user system) and
it had not updated the buffer yet, all that remains to be done is to restart this transaction.
3.7 Deferred update with Concurrent execution
To simplify matters, we presume that we are talking of strict and serializable
schedules, i.e. there is strict two-phase locking and the locks remain effective till the
transactions commit. In such a scenario, an algorithm for recovery could be
as follows:
Use two lists: the list of transactions T committed since the last checkpoint and the list
of active transactions T'. REDO all the write operations of committed transactions in the
order in which they were written into the log. The active transactions are simply
cancelled and resubmitted.
Note that once we put the strict serializability conditions, the recovery process does
not vary too much from the single user system.
Note that in the actual process, a given item X may be updated a number of times,
either by the same transaction or by different transactions at different times. What is
important to the user is its final value. However, the above algorithm simply updates the
value every time an update is found in the log. This can be made more efficient in the
following manner. Instead of starting from the checkpoint and proceeding towards the
time of the crash, traverse the log from the time of the crash backwards. Whenever an
item is encountered for the first time in this backward scan, update it and note that its value
has been restored. Any further (earlier) updates of the same item can be ignored.
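The backward-scan refinement just described might look as follows in outline. The log
record layout (transaction, item, new value) and the committed set are assumptions of this
sketch, not a prescribed format.

# Deferred-update (NO-UNDO/REDO) recovery with a single backward scan of the log.
def recover_deferred(log, committed, database):
    """log: list of (txn, item, new_value) write records, oldest first.
       committed: set of transactions committed since the last checkpoint."""
    restored = set()
    for txn, item, new_value in reversed(log):       # scan from the crash backwards
        if txn in committed and item not in restored:
            database[item] = new_value               # only the latest value matters
            restored.add(item)
    # Transactions that were still active are simply cancelled and resubmitted.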
This method, though it guarantees correct recovery, has some drawbacks. Since the
items remain locked with the transactions until the transactions commit, the concurrent
execution efficiency comes down. Also, a lot of buffer space is wasted to hold the values
till the transactions commit. The number of such values can be large when long
transactions are working in concurrent mode, and they delay the commit operation of one
another.
3.8 Recovery techniques based on immediate update
In this case, as a general rule, the update operation is accompanied by writing onto the log
(on the disk), using a write-ahead logging protocol.
This helps in undoing the update operations whenever a transaction fails; the
rolling back can be done by using the data on the log. Further, if the transaction is not
allowed to commit until all its updates have reached the database on the disk, there is no
need to redo any of its operations after a failure. This concept
is called the UNDO/NO-REDO recovery algorithm. On the other hand, if a
transaction is allowed to commit before all its values are written to the database, then a
general UNDO/REDO type of recovery algorithm is necessary.
3.8.1 A typical UNDO/REDO algorithm for an immediate update single user environment
Here, at the time of failure, the changes envisaged by the transaction may have
already been recorded in the database. These must be undone. A typical procedure for
recovery follows these lines:
a) The system maintains two lists: the list of transactions committed since the
last checkpoint and the list of active transactions (in fact only one active transaction,
because it is a single user system).
b) In case of failure, undo all the write_tr operations of the active transaction, by
using the information on the log, using the UNDO procedure.
c) For undoing a write_tr(X) operation, examine the corresponding log entry
write_tr(T, X, old_value, new_value) and set the value of X to old_value. The
undoing must be done in the reverse of the order in which the operations were
written onto the log.
d) REDO the write_tr operations of the committed transactions from the log, in the
order in which they were written in the log, using the REDO procedure.
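Steps a) to d) can be put together in a compact sketch. The log record layout shown is an
assumption of this illustration, and the sets of committed and active transactions are
passed in explicitly.

# UNDO/REDO recovery for immediate update in a single-user environment.
def recover_immediate(log, committed, active, database):
    """log: list of (txn, item, old_value, new_value) records, oldest first."""
    # UNDO the writes of the (single) active transaction, newest first.
    for txn, item, old_value, new_value in reversed(log):
        if txn in active:
            database[item] = old_value
    # REDO the writes of committed transactions in the original log order.
    for txn, item, old_value, new_value in log:
        if txn in committed:
            database[item] = new_value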
3.8.2 The UNDO/REDO recovery based on immediate update with concurrent
execution:
In the concurrent execution scenario, the process becomes slightly more complex. In
the following algorithm, we presume that the log includes checkpoints and that the
concurrency protocol uses strict schedules, i.e. the schedule does not allow a transaction
to read or write an item until the transaction that wrote the item previously has
committed. Hence, cascading rollbacks are avoided. However, deadlocks
can force aborts and UNDO operations. A simple procedure is as follows:
a) Use the two lists maintained by the system: the list of transactions committed since
the last checkpoint and the list of active transactions.
b) UNDO all write_tr(X) operations of the active transactions, which have not yet
committed, using the UNDO procedure. The undoing must be in
the reverse of the order in which the operations were written in the log.
c) REDO all write_tr(X) operations of the committed transactions from the log, in
the order in which they were written into the log.
Normally, the process of redoing the write_tr(X) operations begins at the end of the
log and proceeds backwards, so that when an item X has been written into more than once in
the log, only the latest entry is applied, as discussed in a previous section.
3.9 Shadow paging
It is not always necessary that the original database is updated by overwriting the
previous values. As discussed in an earlier section, we can make multiple versions of the
data items, whenever a new update is made. The concept of shadow paging illustrates
this:
(Figure: shadow paging - the database pages, with the current directory pointing to the new
versions of the updated pages - Page 2 (new), Page 5 (new) and Page 7 (new) - while the
shadow directory continues to point to the corresponding old pages.)
In a typical case, the database is divided into pages and only those pages that need
updating are brought into the main memory (or cache, as the case may be). A shadow
directory holds pointers to these pages. Whenever an update is done, a new block of the
page is created (indicated by the suffix "(new)" in the figure) and the updated values are
included there. Note that the new pages are created in the order of the updates and not
in the serial order of the pages. A current directory holds pointers to these new pages.
For all practical purposes, these are the "valid pages" and they are written back to the
database at regular intervals.
Now, if any roll back is to be done, the only operation to be done is to discard the
current directory and treat the shadow directory as the valid directory.
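The essence of shadow paging, commit by switching directories and roll back by discarding
the current directory, can be sketched as below. The ShadowPager class, its dictionary-based
page table and the integer page numbers are all assumptions made only for illustration.

# Shadow paging sketch: the shadow directory is never touched until commit.
class ShadowPager:
    def __init__(self, pages):
        self.pages = dict(pages)             # page_no (int) -> page contents on disk
        self.shadow = {p: p for p in pages}  # committed directory
        self.current = dict(self.shadow)     # working directory

    def update(self, page_no, contents):
        new_block = max(self.pages) + 1      # the new version is written elsewhere
        self.pages[new_block] = contents
        self.current[page_no] = new_block    # only the current directory changes

    def commit(self):
        self.shadow = dict(self.current)     # the new pages become the valid ones

    def rollback(self):
        self.current = dict(self.shadow)     # simply discard the current directory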
One difficulty is that the new, updated pages are kept at unrelated locations and
hence the concept of a "contiguous" database is lost. More importantly, what happens
when the "new" pages are discarded as a part of an UNDO strategy? These blocks form
"garbage" in the system. (The same thing happens when a transaction commits: the new
pages become valid pages, while the old pages become garbage.) A mechanism to
systematically identify all such pages and reclaim them becomes essential.
All the methods discussed so far presume one condition - that the system failure is
not catastrophic, i.e. the log, the shadow directory etc. stored on the disk are
immune from failure and are available for the UNDO/REDO operations. But what
happens when the disk also crashes?
However, even this may become a laborious process. So, often the logs are also
copied and kept as backup. Note that the size of the logs can be much smaller than the
actual size of the database. Hence, between two scheduled database backups, several log
backups can be taken and stored separately.
In case of failures, the backup restores the situation as it was when the last
backup was taken. The log backups taken since then can be used to redo the changes done up
to the time the last log backup was taken (not up to the time of the crash). From then on, of
course, the transactions will have to be run again.
unauthorized access of the system by outsiders. This comes under the purview of the
security systems.
Another type of security enforced is "statistical database security". Often, large
databases are used to provide statistical information about various aspects like, say,
income levels, qualifications, health conditions etc. These are derived by collecting a
large number of individual data items. A person who is doing the statistical analysis may be
allowed access to the "statistical data", which is aggregated data, but he should not be
allowed access to individual data; i.e. he may know, for example, the average income
level of a region, but cannot verify the income level of a particular individual. This
problem is more often encountered in government and quasi-government organizations
and is studied under the concept of "statistical database security".
It may be noted that in all these cases, the role of the DBA becomes critical. He
normally logs into the system under a DBA account or a superuser account, which
provides full capabilities to manage the database, ordinarily not available to the other
users. Under the superuser account, he can manage the following aspects regarding
security.
By including in the log entries details of the user name and account number under which
each transaction was created or executed, one can have a record of the accesses and other
usage made by each user. This concept becomes useful in follow-up
actions, including legal examinations, especially in sensitive and high security
installations.
Another concept is the creation of "views". While a database record may have a
large number of fields, a particular user may be authorized to have information only
about certain fields. In such cases, whenever he requests the data item, a "view" is
created for him of the data item, which includes only those fields which he is authorized
to access. He may not even know that there are many other fields in the records.
The concept of views becomes very important when large databases, which cater
to the needs of various types of users are being maintained. Every user can have and
operate upon his view of the database, without being bogged down by the details. It also
makes the security maintenance operations convenient.
3.12 Summary
We started with the concept of and need for recovery techniques. We saw how the
operating system uses cache memory and how this concept can be used to recover databases.
The two concepts of in-place updating and shadowing, and how the roll back is to be done
in each case, were discussed.
Definitions and details of the steal/no-steal approach, the force/no-force approach etc.
were given. We also saw the mechanism of introducing checkpoints, how they help in
the recovery process and the various trade-offs. Simple algorithms for the actual recovery
operation were described.
The last section described the need for database security, the various methods of
providing it by access control, and the role of the DBA.
3.13 Review Questions
Answers
1. The updating is postponed until after the transaction reaches its commit point.
2. It is a fast memory between the main memory and the system.
3. It is a directory entry which tells us whether or not a particular cache buffer is
modified.
4. The buffers write the updates back to the original location on the disk.
5. The protocol allows the writing of an updated buffer on to the disk even before
the commit operation.
6. It is a record to indicate the point up to which the log has been updated; any
roll back need not proceed beyond this point.
7. It is a mechanism wherein updated data is written into separate buffers and a
"shadow directory" keeps track of these buffers.
8. By using the logs stored on removable devices like a tape.
9. The data and users are divided into different levels and their security policy
automatically gets defined.
10. It is an account by getting into which the DBA can change the security parameters
like privileges and security levels.
Block Summary
In this block, we learn about transaction and transaction processing systems. The
concept of transaction provides us a mechanism for describing the logical operations of a
database processing. What we are essentially looking at are huge databases which are
used by hundreds of users (many of them concurrently). In such cases, several
complexities may arise when the same data unit is being accessed by different users at
different points of time for different purposes (reading or writing). Also, not all
transactions (which for the time being can be assumed to be sequences of operations)
succeed all the time. Hence, the data that is being used by one transaction might have
been updated by another transaction earlier. While the first transaction may go ahead
with the data it has procured from the database, the transaction which updated
the database may want to revert to the earlier value. In such a case, what
happens to the computations undertaken based on the updated data value?
Reference Book:
Unit 4
Data Warehousing and Data Mining
Structure
4.0 Introduction
4.1 Objectives
4.2 Concepts of Data Warehousing
4.2.1 Data Warehousing Terminology and Definitions
4.2.2 Characteristics of Data Warehouses
4.2.3 Data Modeling for Data Warehouses
4.2.4 How to Build a Data Warehouse
4.2.5 Typical Functionality of Data Warehouses
4.3 Data Mining
4.3.1 The Foundations of Data Mining
4.3.2 An Overview of Data Mining Technology
4.3.3 Profitable applications of Data Mining
4.4 Summary
4.5 Answers to Model Questions
4.0 Introduction
There is no doubt that corporate data has grown both in volume and complexity
during the last 15 years. In the 1980's, it was not uncommon for businesses to work with
data in megabytes and gigabytes. Today, that is the size of one PC hard drive.
Contemporary corporate systems manage data in measures of terabytes and petabytes.
This trend towards increased information storage is clearly not reversing. Also increasing
processing power (advances in hardware) and sophistication of analytical tools and
techniques have resulted in the development of data warehouses. These data warehouses
provide storage, functionality, and responsiveness to queries beyond the capabilities of
transaction-oriented databases. Accompanying this ever-increasing power has come a
great demand to improve the data access performance of databases. Traditional databases
balance the requirement of data access with the need to ensure integrity of data.
In modern organizations, users of data are often completely removed from the data
sources. Many people only need read-access to data, but still need a very rapid access to
a larger volume of data than can conveniently be downloaded to the desktop. Often such
data comes from multiple databases. Because many of the analyses performed are
recurrent and predictable, software vendors and systems support staff have begun to
design systems to support these functions.
The market for such support has been growing rapidly since the mid-1990s. As
managers and middle-level users became increasingly aware of the growing
sophistication of the analytic capabilities of these database systems, they looked
increasingly for more sophisticated support for the key organizational decisions which
are to be taken during their daily activities.
4.1 Objectives
• Concepts of Data warehousing
• Characteristics of Data warehousing
• Building a Data warehouse
• Concepts of Data mining
• Foundations of Data mining
• Overview of Data mining technology
Data warehousing is characterized as " a subject-oriented, integrated, nonvolatile,
time-variant collection of data in support of management's decisions". Data warehouses
provide access to data for complex analysis, knowledge discovery, and decision making.
databases from different data models and sometimes files acquired from independent
systems and platforms.
(Figure: the data warehouse and its metadata, with client applications such as EIS and data
mining tools drawing on it, and back flushing of cleaned data to the sources.)
• Also supports flexible reporting of Data
Because data warehouses encompass large volumes of data, they are generally an order of
magnitude (sometimes two orders of magnitude) larger than the source databases. The large
volume of data, likely to be in terabytes or petabytes, is an issue that has been dealt with
through data marts, enterprise-wide data warehouses, virtual data warehouses, central data
warehouses, and distributed data warehouses.
• Central Data Warehouses: Central Data Warehouses are what most people
think of when they are first introduced to the concept of a data warehouse. The
central data warehouse is a single physical database that contains all of the
data for a specific functional area, department, division, or enterprise. Central
Data Warehouses are often selected where there is a common need for
informational data and there are large numbers of end-users already connected
to a central computer or network. A Central Data Warehouse may contain data
for any specific period of time. Usually, Central Data Warehouses contain data
from multiple operational systems. Central Data Warehouses are real: the data
stored in the data warehouse is accessible from one place and must be loaded
and maintained on a regular basis. Normally, data warehouses are built around
advanced RDBMSs or some form of multi-dimensional informational database
server.
sold for a particular time period. Products sold could be shown as rows, with sales
revenues obtained for each region comprising the columns. Adding a time dimension,
such as an organization's fiscal quarters, would produce a three-dimensional matrix,
which could be represented using a data cube.
Consider a three-dimensional data cube that organizes product sales data by fiscal
quarter and sales region. Each cell could contain data for a specific product, a specific
fiscal quarter, and a specific region. By including additional dimensions, a data
hypercube could be produced, although data of more than three dimensions is difficult to
visualize or to present graphically. The data can be queried directly in any
combination of dimensions, bypassing complex database queries. Tools exist for viewing
data according to the user's choice of dimensions. Changing from one dimensional
hierarchy (orientation) to another is easily accomplished in a data cube by a technique
called pivoting (also called rotation). In this technique the data cube can
be thought of as being rotated to show a different orientation of the axes, as the user needs. For
example, you might pivot the data cube to show regional sales revenues as rows, the fiscal
quarter revenue totals as columns, and the company's products in the third dimension.
Hence, this technique is equivalent to having a separate regional sales table for each product,
where each table shows quarterly sales for that product, region by region.
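Pivoting is essentially a re-grouping of the same cells. A tiny illustration follows, using a
plain dictionary keyed by (product, region, quarter) instead of a real OLAP engine; the
sales figures are invented purely for the example.

# A toy "data cube": sales keyed by (product, region, quarter).
sales = {
    ("PenDrive", "South", "Q1"): 120, ("PenDrive", "South", "Q2"): 150,
    ("PenDrive", "North", "Q1"):  80, ("Printer",  "North", "Q1"):  40,
}

def pivot(cube, rows, cols):
    """Re-orient the cube: rows/cols are positions in the (product, region, quarter) key."""
    table = {}
    for key, value in cube.items():
        inner = table.setdefault(key[rows], {})
        inner[key[cols]] = inner.get(key[cols], 0) + value   # roll up the third dimension
    return table

# Regions as rows, quarters as columns (the product dimension is rolled up).
print(pivot(sales, rows=1, cols=2))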
contains some measured or observed variable(s) and identifies them with pointers to
dimension tables. The fact table contains the data and the dimensions identify each tuple
in that data. Two common multidimensional schemas associated with the
multidimensional storage model are the star schema and the snowflake schema. The star
schema consists of a fact table with a single table for each dimension. The snowflake
schema is a variation on the star schema in which the dimension tables are organized into
a hierarchy by normalizing them. Some installations normalize their data warehouses up to
the third normal form so that they can access the data warehouse to the finest level of
detail, depending on the information required.
Data warehouse storage also makes use of indexing techniques to support high-
performance data access. A technique called bitmap indexing constructs a bit vector for
each value in a domain (column) being indexed. It works very well for domains of low
cardinality. A 1 bit is placed in the jth position of the vector if the jth row contains
the value being indexed. For example, imagine an inventory of 50,000 cars with a
bitmap index on car size. If there are two car sizes, economy and compact, then there
will be two bit vectors, each containing 50,000 bits. Bitmap indexing can provide
considerable input/output and storage space advantages in low-cardinality domains. With
bit vectors, a bitmap index can provide dramatic improvements in comparison,
aggregation, and join performance. In a star schema, dimensional data can be indexed to
tuples in the fact table by join indexing. Join indexes are traditional indexes that
maintain relationships between primary key and foreign key values. They relate the
values of a dimension of a star schema to rows in the fact table. For example, consider a
sales fact table that has city and fiscal quarter as dimensions. If there is a join index on
city, then for each city the join index maintains the tuple IDs of the fact-table tuples
containing that city. Join indexes may involve multiple dimensions.
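A bitmap index over a low-cardinality column can be sketched in a few lines. The column
values and the helper name below are invented for illustration.

# Bitmap index sketch: one bit vector per distinct value of a low-cardinality column.
def build_bitmap_index(column):
    index = {}
    for j, value in enumerate(column):
        index.setdefault(value, [0] * len(column))[j] = 1   # 1 bit in the jth position
    return index

car_sizes = ["economy", "compact", "economy", "economy", "compact"]
index = build_bitmap_index(car_sizes)
print(index["economy"])   # [1, 0, 1, 1, 0]

# Boolean selections reduce to bit-wise operations on the vectors.
both = [a & b for a, b in zip(index["economy"], index["compact"])]   # AND -> all zeros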
Data warehouse storage can facilitate access to summary data by taking further
advantage of the nonvolatility of data warehouses and a degree of predictability of the
analyses that will be performed using them. Two approaches have been used for this
purpose:
(1) Smaller tables which include summary data such as quarterly sales or revenue
by product line, and
(2) Encoding of the level (e.g., daily sales, weekly sales, quarterly, and annual sales) into
existing tables.
• The extract step is the first step of getting data into the data warehouse
environment. Extracting means reading and understanding the source data,
and copying the parts that are needed. The data must be extracted from
multiple, heterogeneous sources, for example, databases or other data feeds
such as those containing financial market data or environmental data.
• Data must be formatted for consistency within the warehouse. Names,
meanings, and domains of data from unrelated sources must be reconciled.
For instance, subsidiary companies of a large corporation may have different
fiscal calendars with quarters ending on different dates, making it difficult to
aggregate financial data by quarter. Various credit cards may report their
transactions differently, making it difficult to compute all credit sales. These
format inconsistencies must be resolved.
• The data must be cleaned to ensure validity. Data cleaning is an involved and
complex process that has been identified as the largest labor-demanding
component of data warehouse construction. For input data, cleaning must
occur before the data are loaded into the warehouse. There is nothing about
cleaning data that is specific to data warehousing and that could not be applied
to a host database. Cleaning the data involves correcting misspellings, resolving
domain conflicts, dealing with missing data elements, and parsing data into
standard formats. However, since input data must be examined and formatted
consistently, data warehouse builders should take this opportunity to check for
validity and quality. Recognizing erroneous and incomplete data is difficult to
automate, and cleaning that requires automatic error correction can be even
tougher. Some aspects, such as domain checking, are easily coded into data
cleaning routines, but automatic recognition of other data problems can be
more challenging. After such problems have been taken care of, similar data
from different sources must be coordinated for loading into the warehouse.
As data managers in the organization discover that their data are being cleaned
for input into the warehouse, they will likely want to upgrade their data with
the cleaned data. The process of returning cleaned data to the source is called
backflushing.
• The data must be fitted into the data model of the warehouse. Data from the
various sources must be installed in the data model of the warehouse. Data
may have to be converted from relational, object-oriented, or legacy databases
(network and/or hierarchical) to a multidimensional model.
• The data must be loaded or populated into the warehouse. The large volume
of data in the warehouse makes loading the data a significant task. Monitoring
tools for loads as well as methods to recover from incomplete or incorrect
loads are required. With the huge volume of data in the warehouse,
incremental updating is usually the only feasible approach.
Although adequate time can be devoted initially to constructing the warehouse,
the large volume of data in the warehouse generally makes it impossible to simply reload
the warehouse in its entirety later on. Alternatives include selective (partial) refreshing of
data and separate warehouse versions. When the warehouse uses an incremental data
refreshing mechanism, data may need to be periodically purged; for example, a
warehouse that maintains data on the previous twenty business quarters may periodically
purge its data each year.
Business metadata includes the relevant business rules and the organizational details
involved in supporting the warehouse.
For a distributed warehouse, all the issues of distributed databases are relevant, for
example, partitioning, communications, replication, and consistency concerns. A
distributed architecture can provide benefits particularly important to warehouse
performance, such as improved load balancing, scalability of performance, and higher
availability. A single replicated metadata repository would reside at each
distribution site.
The plan behind the federated warehouse is like that of the federated database: a
decentralized confederation of autonomous data warehouses, each with its own metadata
repository. Given the magnitude of the challenge inherent to data warehouses, it is likely
that such federations will consist of smaller-scale components, such as data marts. Large
organizations may choose to federate data marts rather than build huge data warehouses.
spreadsheet applications (e.g., MS Excel) as well as for OLAP application programs.
These offer preprogrammed functionalities such as the following:
Because data warehouses are free from the restrictions of the
transactional environment, there is an increased efficiency in query processing. Among
the tools and techniques used are query transformation, index intersection and union,
special ROLAP (relational OLAP) and MOLAP (multidimensional OLAP) functions,
SQL extensions, advanced join methods, and intelligent scanning.
Improved performance has also been obtained with parallel processing. Parallel
server architectures include symmetric multiprocessor (SMP), cluster, and massively
parallel processing (MPP), and combinations of these.
analysis can be performed by advanced spreadsheets, by sophisticated statistical analysis
software, or by custom-written programs. Techniques such as lagging, moving averages,
and regression analysis are also commonly employed. Artificial intelligence techniques,
which may include genetic algorithms and neural networks, are used for classification
and are employed to discover knowledge from the data warehouse that may be
unexpected or difficult to specify in queries.
Data Warehousing and Database Views. Some people have considered data
warehouses to be an extension of database views. Materialized views have been suggested
as one way of meeting requirements for improved access to data, and they have been
explored for their performance enhancement. Views, however, provide only a subset of
the functions and capabilities of data warehouses. Views and data warehouses are alike
in that they are both read-only extracts from databases and both are subject-oriented. However,
data warehouses are different from views in the following ways:
Check your progress 1
True False
True False
True False
Question 4: Roll-up display moves up the hierarchy, grouping into larger units
along a dimension.
True False
Question 5: For a distributed warehouse, all the issues of distributed databases are
irrelevant.
True False
Question 6: Data warehouses can be indexed to optimize performance.
True False
Data mining, the extraction of hidden predictive information from large databases,
is a powerful new technology with great potential to help companies focus on the most
important information in their data warehouses. Data mining tools predict future trends
and behaviors, allowing businesses to make proactive, knowledge-driven decisions. The
automated, prospective analyses offered by data mining move beyond the analyses of past
events provided by retrospective tools typical of decision support systems. Data mining
tools can answer business questions that traditionally were too time consuming to
resolve. They scour databases for hidden patterns, finding predictive information that
experts may miss because it lies outside their expectations.
Over the last three and a half decades, many organizations have generated a large
amount of machine-readable data in the form of files and databases. These data were
collected due to traditional database operations. To process this data, we have the
database technology available to us that supports query languages like SQL (Structured
Query Language). The problem with SQL is that it is a structured language that assumes
the user is aware of the database schema. The description of the database is called the
database schema, which is specified during the database design and is not expected to
change frequently. SQL supports operations of relational algebra that allow a user to
select from tables (rows and columns of data) or join related information from tables
based on common fields.
Most companies already collect and refine massive quantities of data. Data mining
techniques can be implemented rapidly on existing software and hardware platforms to
enhance the value of existing information resources, and can be integrated with new
products and systems as they are brought on-line. When implemented on high
performance client/server or parallel processing computers, data mining tools can analyze
massive databases to deliver answers to questions such as, "Which clients are most likely
to respond to my next promotional mailing, and why?".
In the previous section we looked into how data warehousing technology affords
functionality such as consolidation, aggregation, and summarization of data. It
lets us view the same information along multiple dimensions.
In this section, we will focus our attention on yet another very popular area of
interest known as data mining. As the term connotes, data mining refers to the mining or
discovery of new information in terms of patterns or rules from vast amounts of data. To
be practically useful, data mining must be carried out efficiently on large files and
databases.
Data mining techniques are the result of a long process of research and product
development. This evolution began when business data was first stored on computers,
continued with improvements in data access, and more recently, generated technologies
that allow users to navigate through their data in real time. Data mining takes this
evolutionary process beyond retrospective data access and navigation to prospective and
proactive information delivery. Data mining is ready for application in the business
community because it is supported by three technologies that are now sufficiently mature:
engines can now be met in a cost-effective manner with parallel multiprocessor computer
technology. Data mining algorithms embody techniques that have existed for at least 15
years, but have only recently been implemented as mature, reliable, understandable tools
that consistently outperform older statistical methods.
In the evolution from business data to business information, each new step has
built upon the previous one. For example, dynamic data access is critical for drill-through
in data navigation applications, and the ability to store large databases is critical to data
mining.
The core components of data mining technology have been under development for
decades, in research areas such as statistics, artificial intelligence, and machine learning.
Today, the maturity of these techniques, coupled with high-performance relational
database engines and broad data integration efforts, makes these technologies practical for
current data warehouse environments.
Data mining as part of the Knowledge Discovery Process. Knowledge
Discovery in Databases, frequently abbreviated as KDD, typically encompasses more
than data mining. The knowledge discovery process, or KDD process for short, roughly
comprises six different phases: data selection, data cleaning, enrichment, data
transformation or encoding, data mining, and the reporting and display of the discovered
information. The raw data first undergoes a data selection step, in which we identify the
target dataset and the relevant attributes. In the data cleaning phase, we remove noise,
transform field values to common units, generate new fields through combination of
existing fields, and bring the data into the relational schema that is used as input to the
data mining activity. The data enrichment and the transformation or encoding phases are
then carried out on this data. In the data mining step, the actual patterns required are
extracted. In the last phase the reports are generated and displayed.
All the above six phases can be easily explained with the help of an example.
Consider a transaction database maintained by a specialty consumer goods retailer.
Suppose the client data includes customer name, zip code, phone number, date of
purchase, item-code, price, quantity, and total amount. A variety of new knowledge can
be discovered by KDD processing on this client database. During data selection, data
about specific items or categories of items, or from stores in a specific region or area of
the country, may be selected. The data cleaning process then may correct invalid zip
codes or eliminate records with incorrect phone prefixes. Enrichment typically enhances
the data with additional sources of information. For example, given the client names and
phone numbers, the store may be able to produce other data about age, income, and credit
rating and append them to each record. Data transformation and encoding may be done
to reduce the amount of data. For instance, item codes may be grouped in terms of
product categories into audio, video, supplies, electronic gadgets, camera, accessories,
and so on. Zip codes may be aggregated into geographic regions, incomes may be
divided into a number of ranges, and so on. If data mining is based on an existing
warehouse for this retail store chain, we would expect that the cleaning has already been
applied. It is only after such preprocessing that data mining techniques are used to mine
different rules and patterns. For example, the result of mining may be to discover:
We can see that many possibilities exist for discovering new knowledge about
buying patterns, relating factors such as age, income-group, place of residence, to what
and how much the customers purchase. This information can then be utilized to plan
additional store locations based on demographics, to run store promotions, to combine
items in advertisements, or to plan seasonal marketing strategies. As this retail-store
example shows, data mining must be preceded by significant data
preparation before it can yield useful information that can directly influence business
decisions; only then can those decisions be implemented.
The results obtained from data mining may be reported in a variety of formats,
such as listings, graphical outputs, summary tables, or information visualizations.
Data Mining Goals and Knowledge Discovery. Broadly, the goals of data mining
fall into the following classes: prediction, identification, classification, and optimization.
• Prediction--Data mining can show how certain attributes within the data will
behave in the future. Examples of predictive data mining include the analysis
of buying transactions to predict what consumers will buy under certain
discounts, how much sales volume a store would generate in a given period,
and whether deleting a product line or adding a new product would yield more
profits. In such applications, business logic is coupled with data mining.
used to encode the data appropriately before subjecting it to further data
mining.
• Optimization--A final goal of data mining may be to optimize
the use of limited resources such as time, space, money, or materials and to
maximize output variables such as sales or profits under a given set of
constraints.
(1) Association rules--These rules correlate the presence of a set of items with
another range of values for another set of variables (a small counting sketch for
such rules appears after this list).
Examples:
a. When a customer buys a pen, he is likely to buy an inkbottle.
b. When a female retail shopper buys a handbag, she is likely to buy shoes.
(2) Classification hierarchies-- Here the goal is to work from an existing set of
events or transactions to create a hierarchy of classes.
Examples:
(1) A population may be divided into six ranges of credit worthiness based on a
history of previous credit transactions.
(2) A model may be developed for the factors that determine the desirability of a
location for a particular store on a 1-10 scale.
(3) Mutual funds may be classified based on performance data using
characteristics such as their growth, income, and stability in the market.
(3) Sequential patterns--A sequence of events or actions is sought.
Example:
If a patient underwent cardiac bypass surgery for blocked arteries and an
aneurysm, and later developed high blood urea within a year of surgery, he or she
is likely to suffer from kidney failure in the near future. Detection of sequential
patterns is equivalent to detecting associations among events with certain temporal
relationships.
(4) Patterns within time series--Similarities can be detected within positions of a
time series. The following examples use stock market price data as a time series:
(1) Stocks of a utility company X Power and a financial company Y Securities may
show the same pattern during 1996 in terms of closing stock price.
(2) Two products may show the same selling pattern in the summer season but a
different one in the winter season.
(5) Clustering--A given population of events or items can be partitioned into sets of
similar elements.
Examples:
(1) An entire population of treatment data on a particular disease may be divided into
groups based on the similarity of side effects produced.
(2) The web accesses made by a collection of users against a set of documents (say,
in a digital library) may be analyzed in terms of the keywords of documents to
reveal clusters or categories of users.
For most applications in data mining, the desired knowledge is a
combination of the above types.
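As promised under item (1) above, here is a small counting sketch for an association rule of
the form "pen implies inkbottle". The transaction data and thresholds are invented for
illustration, and real systems use algorithms such as Apriori rather than this brute-force scan.

# Brute-force support/confidence computation for a single association rule A => B.
transactions = [
    {"pen", "inkbottle", "notebook"},
    {"pen", "inkbottle"},
    {"pen", "stapler"},
    {"handbag", "shoes"},
]

def rule_stats(transactions, lhs, rhs):
    n = len(transactions)
    has_lhs = sum(1 for t in transactions if lhs in t)
    has_both = sum(1 for t in transactions if lhs in t and rhs in t)
    support = has_both / n            # fraction of transactions containing both items
    confidence = has_both / has_lhs   # of those containing lhs, how many also contain rhs
    return support, confidence

print(rule_stats(transactions, "pen", "inkbottle"))   # (0.5, 0.666...)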
• Rule induction: The extraction of useful if-then rules from data based on
statistical significance.
Many of these technologies have been in use for more than a decade in specialized
analysis tools that work with relatively small volumes of data. These capabilities are now
evolving to integrate directly with industry-standard data warehouse and OLAP
platforms.
4.3.3 Profitable Applications of Data Mining
Two critical factors for success with data mining are: a large, well-integrated data
warehouse and a well-defined understanding of the business process within which data
mining is to be applied (such as customer prospecting, retention, campaign management,
and so on).
A credit card company can leverage its vast warehouse of customer transaction
data to identify customers most likely to be interested in a new credit product. Using a
small test mailing, the attributes of customers with an affinity for the product can be
identified. Recent projects have indicated more than a 20-fold decrease in costs for
targeted mailing campaigns over conventional approaches.
A pharmaceutical company can analyze its recent sales force activity and their
results to improve targeting of high-value physicians and determine which marketing
activities will have the greatest impact in the next few months. The data needs to include
competitor market activity as well as information about the local health care systems. The
results can be distributed to the sales force via a wide-area network that enables the
representatives to review the recommendations from the perspective of the key attributes
in the decision process. The ongoing, dynamic analysis of the data warehouse allows best
practices from throughout the organization to be applied in specific sales situations.
A large consumer package goods company can apply data mining to improve its
sales process to retailers. Data from consumer panels, shipments, and competitor activity
can be applied to understand the reasons for brand and store switching. Through this
analysis, the manufacturer can select promotional strategies that best reach their target
customer segments.
A diversified transportation company with a large direct sales force can apply data
mining to identify the best prospects for its services. Using data mining to analyze its
own customer experience, this company can build a unique segmentation identifying the
attributes of high-value prospects. Applying this segmentation to a general business
database such as those provided by Dun & Bradstreet can yield a prioritized list of
prospects by region.
Each of these examples has a clear common ground. They leverage the knowledge
about customers implicit in a data warehouse to reduce costs and improve the value of
customer relationships. These organizations can now focus their efforts on the most
important (profitable) customers and prospects, and design targeted marketing strategies
to best reach them.
True False
True False
Question 3: KDD means Knowledge Discovery and Data mining.
True False
Question 4: The results obtained from data mining may be reported in a variety of
formats, such as listings, graphic outputs, summary tables, or visualizations.
True False
True False
Question 6: Rule induction means the extraction of useful if-then rules from data
based on statistical significance.
True False
4.4 Summary
accomplish. We also discussed the characteristics of data warehouses, building a data
warehouse and also typical functionality of data warehouses.
Data mining may be thought as an activity that draws knowledge from an existing
data warehouse. Data mining, the extraction of hidden predictive information from large
databases, is a powerful new technology with great potential to help companies focus on
the most important information in their data warehouses. Data mining tools predict future
trends and behaviors, allowing businesses to make proactive, knowledge-driven
decisions. The automated, prospective analyses offered by data mining move beyond the
analyses of past events provided by retrospective tools typical of decision support
systems. Data mining tools can answer business questions that traditionally were too time
consuming to resolve. They scour databases for hidden patterns, finding predictive
information that experts may miss because it lies outside their expectations. The six
phases of the knowledge discovery process associated with data mining were also discussed.
Progress 1
1. True
2. False
3. True
4. True
5. False
6. True
Progress 2
1. True
2. True
3. False
4. True
5. True
6. True
-------------------
Unit 5
Internet Databases
Structure
5.0 Introduction
5.1 Objectives
5.2 The World Wide Web
5.3 Introduction to HTML
5.4 Databases and the World Wide Web
5.5 Architecture
5.6 Application Servers and Server-Side Java
5.7 Beyond HTML is it XML?
5.8 XML-QL: Querying XML Data
5.9 Search engines
5.9.1 Search Tools and Methods
5.10 Summary
5.11 Answers to Model Questions
5.0 Introduction
protocol, or it can create a new thread for a Java servlet. JavaBeans and Java Server
Pages are Java based technologies that assist in creating and managing programs
designed to be invoked by a Web server.
5.1 Objectives
5.2 The World Wide Web
The World Wide Web (WWW, or Web) is a distributed information system based
on hypertext. Documents stored on the Web can be of several types. One of the most
common types of documents is the hypertext document. These hypertext documents are
formatted according to HTML (Hyper Text Markup Language). HTML is based on
SGML (Standard Generalized Markup Language). HTML documents contain text, font
specifications and other formatting instructions. Links to other documents can also be
associated with them, and images can be referenced with appropriate image formatting
instructions. Formatted text with images is visually much more appealing than plain text
alone. Thus the user will see formatted text along with images on the Web.
The Web makes it possible to access a file anywhere on the Internet. A file is
identified by a universal resource locator (URL). These are nothing but pointers to
documents. The following is an example of a URL.
http://www.nie.ac.in/topic/dbbook/advanceddbtopics.html
The first part of the URL indicates how the document is to be accessed: "http" indicates that the document is to be accessed using the Hyper Text Transfer Protocol, which is a protocol used to transfer HTML documents. The second
part gives the unique name of a machine on the Internet. The rest of the URL is the path
name of the file on the machine.
Many organizations today maintain a Web site. The World Wide Web, or Web, is the collection of Web sites that are accessible over the Internet.
An HTML link contains a URL, which identifies the site containing the linked
file. When a user clicks on a link, the Web browser connects to the Web server at the
destination Web site using a protocol called HTTP and submits the link's URL.
When the browser receives a file from a Web server, it checks the file type by examining
the extension of the file name. It displays the file according to the file's type and if
necessary calls an application program to handle the file. For example, a file ending in
.txt denotes an unformatted text file, which the Web browser displays by interpreting the
individual ASCII characters in the file. More sophisticated document structures can be
encoded in HTML, which has become a standard way of structuring Web pages for
display. As another example, a file ending in .doc denotes a Microsoft Word document
and the Web browser displays the file by invoking Microsoft Word.
Thus, URLs provide a globally unique name for each document that can be accessed from the Web. Since URLs are in human readable form, a user can type them directly to reach a desired document, instead of navigating down a path from a predefined location. Since they include an Internet machine name, they are global in scope, and people can use them to create links across machines.
5.3 Introduction to HTML
The following sample HTML document -- in particular, the portion enclosed in <BODY> … </BODY> -- contains information about three books. Data about each
book is represented as an unordered list (UL) whose entries are marked with the LI tag.
HTML defines the set of valid tags as well as the meaning of the tags. For example,
HTML specifies that the tag <TITLE> is a valid tag that denotes the title of the
document. As another example, the tag <UL> always denotes an unordered list.
<HTML>
<HEAD><TITLE>Some important books on DBMS</TITLE></HEAD>
<BODY>
DBMS:
<UL>
<LI>Author: Raghu Ramakrishnan</LI>
<LI>Title: Database Management Systems</LI>
<LI> Published in 2000</LI>
<LI> Published by McGraw Hill</LI>
<LI> Softcover</LI>
</UL>
<UL>
<LI>Author: Elmasri Navathe</LI>
<LI>Title: Fundamentals of Database Systems</LI>
<LI> Published in 2000</LI>
<LI> Published by Addison Wesley</LI>
<LI> Softcover</LI>
</UL>
<UL>
<LI>Author: Silberschatz, Korth and Sudarshan</LI>
<LI>Title: Database System Concepts</LI>
<LI> Published in 2000</LI>
<LI> Published by McGraw Hill</LI>
<LI> Softcover</LI>
</UL>
</BODY>
</HTML>
Audio, video, and even programs written in Java, a highly portable language, can be included in HTML documents. When a user retrieves such a document using a suitable browser, images in the document are displayed, audio and video clips are played, and embedded programs are executed on the user's machine; the result is a rich multimedia presentation. The ease with which HTML documents can be created -- there are now visual editors that automatically generate HTML -- and accessed using Internet browsers has fueled the explosive growth of the Web.
5.4 Databases and the World Wide Web
Many Internet users today have home pages on the Web; such pages often contain information about the users' work and personal lives. Many companies use the Web for day-to-day transactions, so interfacing databases to the World Wide Web is very important. The
Web is the cornerstone of electronic commerce, abbreviated as E-commerce. Many
organizations offer products through their web sites, and customers can place orders by
visiting a Web site. For such applications a URL must identify more than just a file,
however rich the contents of the file; a URL must provide an entry point to services
available on the Web site. It is common for a URL to include a form that users can fill in
to describe what they want. If the requested URL identifies a form, the Web server
returns the form to the browser, which displays the form to the user. After the user fills in
the form, the form is returned to the Web server, and the information filled by the user
can be used as parameters to a program executing at the same site as the Web server.
The use of a Web browser to invoke a program at a remote site leads us to the role
of databases on the Web: The invoked program can generate a request to a database
system. This capability allows us to easily place a database on a computer network,
and make services that rely upon database access available over the Web. This leads to a
new and rapidly growing source of concurrent requests to a DBMS, and with thousands
of concurrent users routinely accessing popular Web sites, new levels of scalability and
robustness are required.
The diversity of information available on the Web, its distributed nature, and the
new uses that it is being put to lead to challenges for DBMSs that go beyond simply
improved performance in traditional functionality. For instance, we require support for
queries that are run periodically or continuously and that access data from several
distributed sources. As an example, a user may want to be notified whenever a new item
meeting some criteria is offered for sale at one of several Web sites that he uses. Given many such user profiles, how can we efficiently monitor them and notify users promptly as the items they are interested in become available? As another instance of a new class of problems, the emergence of the XML (Extensible Markup Language) standard for describing data leads to challenges in managing and querying XML data.
Question 1: The Web makes it possible to access a file anywhere on the Internet.
True False
Question 2: An HTML link contains a URL, which identifies the site containing the
linked file.
True False
Question 3: Audio, video, and programs cannot be included in HTML documents.
True False
Question 4: The Web is the cornerstone of electronic commerce, abbreviated as
E-commerce.
True False
Question 5: The use of a Web browser to invoke a program at a remote site leads us
to the role of databases on the Web.
True False
True False
5.5 Architecture
As an example, consider the sample page shown in the following figure. This
Web page contains a form where a user can fill in the name of an author. If the user
presses the 'Send it' button, the Perl script 'lookdbms_books.cgi' mentioned in the figure is
executed as a separate process. The CGI protocol defines how the communication
between the form and the script is performed.
<HTML>
<HEAD>
<TITLE>
The DBMS Book store</TITLE>
</HEAD>
<BODY>
<FORM action="lookdbms_books.cgi" method=post>
Type an author name:
<INPUT type="text" name="authorName" size=35 maxlength=50>
<INPUT type="submit" value="Sent it">
<INPUT type="reset" value="Clear form">
</FORM>
</BODY>
</HTML>
Figure below illustrates the processes created when the CGI protocol is invoked.
(Figure: a Web browser communicates with the Web server over HTTP; each request causes the server to start a separate CGI application process (Process 1, Process 2, ...), for example a C++ program, which accesses the DBMS through an interface such as JDBC.)
5.6 Application Servers and Server-Side Java
In the previous section, we discussed how the CGI protocol can be used to dynamically assemble Web pages whose content is computed on demand. However, since each page request results in the creation of a new process, this solution does not scale well when a large number of simultaneous requests arrive. This performance problem led to the development of specialized programs called
application servers. An application server has prewritten threads or processes and thus
avoids the startup cost of creating a new process for each request made. Application
servers have evolved into flexible middle tier packages that provide many different
functions in addition to eliminating the process-creation overhead:
• Session management: Often users engage in business processes that take
several steps to complete. Users expect the system to maintain continuity during a
session, and several session identifiers such as cookies, URL extensions, and hidden
fields in HTML forms can be used to identify a session. Application servers provide
functionality to detect when a session starts and when it ends and to keep track of the
sessions of individual users. This is called session management.
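As a small illustration of session management, the sketch below uses the HttpSession interface of the Java servlet API (servlets themselves are described later in this unit). The class name and the "itemCount" attribute are purely illustrative and not part of any particular product.

import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.servlet.http.HttpSession;

// Illustrative sketch: counting the items a user has added during one session.
public class CartServlet extends HttpServlet {
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        // getSession(true) returns the current session or creates a new one;
        // the application server typically identifies the session with a cookie.
        HttpSession session = req.getSession(true);
        Integer count = (Integer) session.getAttribute("itemCount");
        int items = (count == null) ? 1 : count + 1;
        session.setAttribute("itemCount", items);
        resp.setContentType("text/html");
        resp.getWriter().println("<HTML><BODY>Items added in this session: "
                                 + items + "</BODY></HTML>");
    }
}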
(Figure: the Web server forwards requests to an application server, which maintains a pool of servlets and connects through JDBC/ODBC to back-end systems such as DBMS 1, DBMS 2, and a C++ application.)
There are many different technologies for server-side processing; the Java Servlet API and Java Server Pages (JSP) are important among them. Servlets are small
programs that execute on the server side of a web connection. Just as applets dynamically
extend functionality of a Web browser, servlets dynamically extend the functionality of a
Web server.
The Java Servlet API allows Web developers to extend the functionality of a
Web server by writing small Java programs called servlets that interact with the Web
server through a well-defined API. A servlet consists mostly of business logic and
routines to format relatively small datasets into HTML. Java servlets are executed in
their own threads. Servlets can continue to execute even after the client request that led
to their invocation is completed and can thus maintain persistent information between
requests. The Web server or application server can manage a pool of Servlet threads, as
shown in the above figure, and can therefore avoid the overhead of process creation for
each request. Since servlets are written in Java, they are portable between Web servers
and thus allow platform-independent development of server-side applications.
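The following minimal servlet, with illustrative class and parameter names, shows the pattern: when a form field named authorName is submitted (here with the GET method), the Web server invokes doGet in one of its pooled threads, and the servlet formats a small result as HTML. The database lookup itself is only indicated in a comment, since the JDBC details depend on the application.

import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Illustrative sketch of a servlet behind an author-search form.
public class AuthorSearchServlet extends HttpServlet {
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        String author = req.getParameter("authorName"); // field name from the HTML form
        resp.setContentType("text/html");
        PrintWriter out = resp.getWriter();
        out.println("<HTML><BODY>");
        out.println("<P>Search results for author: " + author + "</P>");
        // A real servlet would now run a JDBC query against the DBMS
        // and format each matching book as an <LI> entry.
        out.println("</BODY></HTML>");
    }
}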
Java Server Pages (JSP) are another platform-independent alternative for generating dynamic content on the server side. A JSP page is essentially an HTML page with embedded Java code, which the Web server translates into a servlet behind the scenes. While servlets are very flexible and powerful, slight modifications, for example in the appearance of the output page, require the developer to change the servlet and recompile the changes.
JSP is designed to separate application logic from the appearance of the Web page, while
at the same time simplifying and increasing the speed of the development process. JSP
separates content from presentation by using special HTML tags inside a Web page to
generate dynamic content. The Web server interprets these tags and replaces them with
dynamic content before returning the page to the browser.
Together, JSP technology and servlets provide an attractive alternative to other types of dynamic Web scripting/programming: they offer platform independence,
enhanced performance, ease of administration, extensibility into the enterprise and most
importantly, ease of use.
5.7 Beyond HTML is it XML?
XML emerged from the confluence of two other technologies, SGML and HTML. The Standard Generalized Markup Language (SGML) is a metalanguage that allows the definition of data and document interchange languages such as HTML. SGML is complex and requires sophisticated programs to exploit its full potential. XML was developed to have much of the power of SGML while remaining relatively simple. Nonetheless, XML, like SGML, allows the definition of new document markup languages. Although XML does not prevent a user from designing tags that encode the
display of the data in a Web browser, there is a style language for XML called Extensible
Style Language (XSL). XSL is a standard way of describing how an XML document that
adheres to a certain vocabulary of tags should be displayed.
An XML document contains (or is made of) the following building blocks:
Elements
Attributes
Entity references
Comments
Document type declarations (DTDs)
Elements, also called tags, are the primary building blocks of an XML document.
The start of the content of an element ELM is marked with <ELM>, which is called the
start tag, and the end of the content is marked with </ELM>, called the end tag.
Elements must be properly nested. Start tags that appear inside the content of other tags
must have a corresponding end tag.
Entities are shortcuts for portions of common text or the content of external files
and the usage of an entity in the XML document is called an entity reference. Wherever
an entity reference appears in the document, it is textually replaced by its content. Entity
references start with a '&' and end with a ';'.
We can insert comments anywhere in an XML document. Comments start with <!-- and end with -->. Comments can contain arbitrary text except the string --.
In XML, we can define our own markup language. A DTD is a set of rules that
allows us to specify our own set of elements, attributes and entities. Thus, a DTD is
basically a grammar that indicates what tags are allowed, in what order they can appear,
and how they can be nested.
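The small document below, loosely modeled on the book listings used earlier in this unit, brings these building blocks together: a document type declaration with a DTD, elements, an attribute, an entity reference, and a comment. The element and entity names are illustrative.

<?xml version="1.0"?>
<!DOCTYPE BOOKLIST [
  <!ELEMENT BOOKLIST (BOOK*)>
  <!ELEMENT BOOK (AUTHOR, TITLE)>
  <!ELEMENT AUTHOR (#PCDATA)>
  <!ELEMENT TITLE (#PCDATA)>
  <!ATTLIST BOOK year CDATA #REQUIRED>
  <!ENTITY mgh "McGraw Hill">
]>
<!-- A small list of DBMS books -->
<BOOKLIST>
  <BOOK year="2000">
    <AUTHOR>Raghu Ramakrishnan</AUTHOR>
    <TITLE>Database Management Systems, published by &mgh;</TITLE>
  </BOOK>
</BOOKLIST>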
5.8 XML-QL: Querying XML Data
Given that data is encoded in a way that reflects its structure in XML documents, we have the opportunity to use a high-level language that exploits this structure to conveniently retrieve data from within such documents. Such a language would bring XML data management much closer to database management than the text-oriented paradigm of HTML documents. Such a language would also allow us to easily translate XML data between different DTDs, as is required for integrating data from multiple data sources.
One specific query language for XML, called XML-QL, has strong similarities to several query languages that have been developed in the database community. Many relational and object-relational database system vendors are currently looking into support for XML in their database engines. Several vendors of object-oriented database management systems already offer database engines that can store XML data whose contents can be accessed through graphical user interfaces, server-side Java extensions, or by means of XML-QL queries.
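To give a flavour of XML-QL, the query sketched below (adapted from the style used in the published XML-QL proposal; the file URL and tag names are assumptions for illustration) binds the variables $a and $t to author-title pairs in a document using the BOOKLIST vocabulary above and constructs a new XML result from them.

WHERE   <BOOK>
          <AUTHOR> $a </AUTHOR>
          <TITLE>  $t </TITLE>
        </BOOK> IN "www.nie.ac.in/topic/dbbook/booklist.xml"
CONSTRUCT
        <RESULT>
          <AUTHOR> $a </AUTHOR>
          <TITLE>  $t </TITLE>
        </RESULT>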
5.9 Search engines
The World Wide Web, also known as WWW and the Web, comprises a vast
collection of documents stored in computers all over the world. These specialized
computers are linked to form part of a worldwide communication system called the
Internet. When you conduct a search, you direct your computer’s browser to go to Web
sites where documents are stored and retrieve the requested information for display on
your screen. The Internet is the communication system by which the information travels.
A search engine might well be called a search engine service or a search service.
As such, it consists of three components:
• Spider: A program that traverses the Web from link to link, identifying and reading pages (a minimal sketch of this component follows the list)
• Index: A database containing a copy of each Web page gathered by the spider
• Search and retrieval mechanism: Technology that enables users to query the index and that returns results in ranked order
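The following Java fragment is a minimal sketch of the spider component only: it fetches a page, extracts the absolute links it finds, and queues them for later visits. A real spider would also respect robots.txt, limit its request rate, parse HTML properly, and feed the page text to the index; none of that is shown here, and the start URL is illustrative.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal spider sketch: fetch a page, pull out absolute href links, enqueue them.
public class TinySpider {
    private static final Pattern HREF =
            Pattern.compile("href=\"(http[^\"]+)\"", Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) throws Exception {
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add("http://www.nie.ac.in/");        // illustrative start page
        Set<String> seen = new HashSet<>(frontier);
        HttpClient client = HttpClient.newHttpClient();
        for (int i = 0; i < 10 && !frontier.isEmpty(); i++) {   // visit at most 10 pages
            String url = frontier.poll();
            String page = client.send(HttpRequest.newBuilder(URI.create(url)).build(),
                                      HttpResponse.BodyHandlers.ofString()).body();
            // In a real search service the page text would now be added to the index.
            Matcher m = HREF.matcher(page);
            while (m.find()) {
                if (seen.add(m.group(1))) {
                    frontier.add(m.group(1));    // remember the link for a later visit
                }
            }
        }
        System.out.println("Pages discovered so far: " + seen.size());
    }
}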
A search begins at a selected search tool’s Web site, reached by means of its
address or URL. Each tool’s Web site comprises a store of information called a database.
This database has links to other databases at other Web sites, and the other Web sites have
links to still other Web sites, and so on and so on. Thus, each search tool has extended
search capabilities by means of a worldwide system of links.
5.9.1 Search Tools and Methods
There are essentially four types of search tools, each of which has its own search method. The following paragraphs describe these search tools.
1. A directory tool organizes Web sites into subject categories, and a search proceeds by following a path of increasingly specific subjects; this is known as a subject search.
Tips: Choose a subject search when you want general information on a subject or
topic. Often, you can find links in the references provided that will lead to specific
information you want.
Advantage: It is easy to use. Also, information placed in its database is reviewed and
indexed first by skilled persons to ensure its value.
Disadvantage: Because directory reviewing and indexing are so time consuming, the number of sites reviewed is limited. Thus, directory databases are comparatively small and
their updating frequency is relatively low. Also, descriptive information about each site is
limited and general.
2. A search engine tool searches for information through use of keywords and responds
with a list of references or hits. The search method it employs is known as a keyword
search.
Tip: Choose a keyword search to obtain specific information, since its extensive
database is likely to contain the information sought.
Advantage: Its information content or database is substantially larger and more current
than that of a directory search tool.
Disadvantage: Not very exacting in the way it indexes and retrieves information in its
database, which makes finding relevant documents more difficult.
3. A directory with search engine uses both the subject and keyword search methods
interactively as described above. In the directory search part, the search follows the
directory path through increasingly more specific subject matter. At each stop along the
path, a search engine option is provided to enable the searcher to convert to a keyword
search. The subject and keyword search is thus said to be coordinated. The further down
the path the keyword search is made, the narrower is the search field and the fewer and
more relevant the hits.
Tip: Use when you are uncertain whether a subject or keyword search will provide the
best results.
Advantages: Ability to narrow the search field to obtain better results.
Disadvantages: This search method may not succeed for difficult searches.
4. A metasearch tool submits the keyword query to several search engines at the same time and combines the hits they return.
Tip: Use to speed up the search process and to avoid redundant hits.
Advantage: Tolerant of imprecise search questions and provides fewer hits of likely
greater relevance.
Disadvantage: Not as effective as a search engine for difficult searches.
Examples: Dogpile, Mamma, Metacrawler, SavvySearch
A Little About Some Search Engines
Just a word about some of the recommended search engines…in the Internet world
About Google
About Yahoo!
• Although it is probably the oldest, best-known and most visited search site,
most do not realize that it is NOT a search engine!
• Yahoo! is primarily a web directory; it is based upon user submissions, not a
true search engine
• it uses Google as its search engine
• covers in excess of 1 million sites organized into 14 main categories
• pioneered the trend among search companies to become one-stop information
“portals”
• according to a PC World rating of search sites, the portal features of Yahoo!
distract you from getting information quickly
About AltaVista
• the largest search engine on the web, covers in excess of 150 million web pages
in its database
• displays 10 hits per screen ranked by relevance to your keywords; brief site
description included
About InfoSeek
• 7 out of 10 hits listed were on target
• provided the fewest broken links (about 3 out of 100 hits listed)
• provided virtually no duplicates
• has established a relationship with Disney's GO Network to deliver handpicked content in the form of a "Best of the Net" recommendations feature
• also available is “InfoSeek Desktop” which adds a convenient button to the
desktop taskbar to enable you to search the web from whatever application you
are using
The following is the general procedure for those just starting to learn the search process. Conduct searches to become familiar with each of the four types of search tools described above:
Question 1: CGI means common gateway interface, which is a protocol.
True False
Question 2: The Web server delivers static HTML or XML pages directly to the
client i.e. to a Web browser.
True False
True False
True False
Question 5: XML emerged from the confluence of two technologies, SGML and
HTML.
True False
True False
5.10 Summary
In this unit we discussed about the Internet databases. The World Wide Web
(WWW, or Web) is a distributed information system based on hypertext. The Web makes
it possible to access a file anywhere on the Internet. A file is identified by a universal
resource locator (URL). These are nothing but pointers to documents. HTML is a simple
language used to describe a document. It is also called a markup language because
HTML works by augmenting regular text with 'marks' that hold special meaning for a
Web browser handling the document.
Many Internet users today have home pages on the Web; such pages often contain information about the users' work and personal lives. The use of a Web browser to invoke a program
at a remote site leads us to the role of databases on the Web. The execution of business
logic at the Web server's site, or server-side processing, has become a standard model for
implementing more complicated business processes on the Internet. There are many
different technologies for server-side processing. The Java Servlet API and Java Server
Pages (JSP) are important among them.
Progress 2
1. True
2. True
3. True
4. False
5. True
6. True
---------------------------
UNIT 6
Emerging Database Technologies
Structure
6.0 Introduction
6.1 Objectives
6.2 SQL3 Object Model
6.3 Mobile Databases
6.3.1 Mobile Computing Architecture
6.3.2 Types of Data used in Mobile Computing Applications
6.4 Main Memory Databases
6.5 Multimedia Databases
6.5.1 Multimedia database Applications
6.6 Geographic information systems
6.7 Temporal and sequence databases
6.8 Information visualization
6.9 Genome Data management
6.9.1 Biological Science and genetics
6.9.2 The Genome Database
6.10 Digital Libraries
6.11 Summary
6.12 Answers to Model questions
6.0 Introduction
In the previous unit we discussed about Internet databases. In this unit we will
discuss about the emerging technologies in databases.
Relational database systems support a small, fixed collection of data types, which
has proven adequate for traditional application domains such as administrative data
processing. In many application domains, however, much more complex kinds of data must be handled. Keeping this in view, ANSI and ISO SQL standardization committees
have for some time been adding features to the SQL specification to support object-
oriented data management. The current version of SQL in progress including these
extensions is often referred to as "SQL3".
Mobile databases are one more emerging technology in the database area. Recent
advances in wireless technology have led to mobile computing, a new dimension in data
communication and data processing. Also availability of portable computers and wireless
communications has created a new breed of nomadic database users. The mobile
computing environment will provide database applications with useful aspects of wireless
technology.
The price of main memory is now low enough that we can buy enough main
memory to hold the entire database for many applications. This leads to the concept of
main memory databases.
In an object-relational DBMS, users can define ADTs (abstract data types) with
appropriate methods, which is an improvement over an RDBMS. Nonetheless,
supporting just ADTs falls short of what is required to deal with very large collections of
multimedia objects, including audio, images, free text, text marked up in HTML or
variants, sequence data, and videos. We need database systems that store data such as
image, video and audio data. Multimedia databases are growing in importance.
Geographic information systems (GIS) are used to collect, model, store, and
analyze information describing physical properties of the geographical world.
Currently available DBMSs provide little support for queries over ordered
collections of records, or sequences, and over temporal data. Such queries can be easily
expressed and often efficiently executed by systems that support query languages
designed for sequences. Temporal and Sequence databases is another emerging
technology in databases.
Digital libraries are an important and active research area. Conceptually, a digital
library is an analog of a traditional library -- a large collection of information sources in various media -- coupled with the advantages of digital technologies.
6.1 Objectives
In this unit, we introduce the following emerging database technologies:
• SQL3 Object Model
• Mobile Databases
• Main Memory Databases
• Multimedia Databases
• Geographic Information Systems
• Temporal and sequence databases
• Information visualization
• Genome Data management
• Digital Libraries
6.2 SQL3 Object Model
ANSI and ISO SQL standardization committees have for some time been adding
features to the SQL specification to support object-oriented data management. The
current version of SQL in progress including these extensions is often referred to as
"SQL3". SQL3 object facilities primarily involve extensions to SQL's type facilities;
however, extensions to SQL table facilities can also be considered relevant. Additional
facilities include control structures to make SQL a computationally complete language
for creating, managing, and querying persistent object-like data structures. The added
facilities are intended to be upward compatible with the current SQL92 standard. This section concentrates primarily on the SQL3 extensions relevant to object modeling; however, numerous other enhancements have been made in SQL as well. In addition, it should be noted that SQL3 continues to undergo development, and thus the description given here does not necessarily represent the final, approved language specifications.
The parts of SQL3 that provide the primary basis for supporting object-oriented
structures are:
• user-defined types, including abstract data types (ADTs)
• type constructors for row types and reference types
• type constructors for collection types (sets, lists, and multisets)
• user-defined functions and procedures
• support for large objects (BLOBs and CLOBs)
One of the basic ideas behind the object facilities is that, in addition to the normal
built-in types defined by SQL, user-defined types may also be defined. These types may
be used in the same way as built-in types. For example, columns in relational tables may
be defined as taking values of user-defined types, as well as built-in types. A user-defined
abstract data type (ADT) definition encapsulates attributes and operations in a single
entity. In SQL3, an abstract data type (ADT) is defined by specifying a set of declarations
of the stored attributes that represent the value of the ADT, the operations that define the
equality and ordering relationships of the ADT, and the operations that define the
behavior (and any virtual attributes) of the ADT. Operations are implemented by
procedures called routines. ADTs can also be defined as subtypes of other ADTs. A
subtype inherits the structure and behavior of its supertypes (multiple inheritance is
supported). Instances of ADTs can be persistently stored in the database only by storing
them in columns of tables.
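As a sketch only, the statements below show what an ADT definition and its use as a column type might look like; the syntax follows the SQL:1999 form of the SQL3 proposals, and the type, attribute, and method names are invented for illustration (actual drafts and products differ in detail).

-- Sketch in the spirit of SQL3 / SQL:1999 structured types (syntax varies by product).
CREATE TYPE address_t AS (
    street   VARCHAR(40),
    city     VARCHAR(25),
    pincode  CHAR(6)
) NOT FINAL;

CREATE TYPE person_t AS (
    name     VARCHAR(35),
    address  address_t           -- a user-defined type used like a built-in type
) NOT FINAL
  METHOD age_on(d DATE) RETURNS INTEGER;

-- Instances of person_t are stored persistently by using the type as a column type.
CREATE TABLE customers (
    custno   INTEGER PRIMARY KEY,
    info     person_t
);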
A value of a reference type can be stored in one table and used as a direct reference ("pointer") to a specific row in another table,
just as an object identifier in other object models allows one object to directly reference
another object. The same reference type value can be stored in multiple rows, thus
allowing the referenced row to be "shared" by those rows.
Collection types for sets, lists, and multisets have also been defined. Using these
types, columns of tables can contain sets, lists, or multisets, in addition to individual
values.
Tables have also been enhanced with a subtable facility. A table can be declared as
a subtable of one or more supertables (it is then a direct subtable of these supertables),
using an UNDER clause associated with the table definition. When a subtable is defined,
the subtable inherits every column from its supertables, and may also define columns of
its own. The subtable facility is completely independent from the ADT subtype facility.
The BLOB (Binary Large Object) and CLOB (Character Large Object) types have
been defined to support very large objects. Instances of these types are stored directly in
the database (rather than being maintained in external files).
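Continuing the sketch (again with invented names and SQL:1999-style syntax that individual products vary on), the statements below illustrate a typed table with a system-generated reference, a subtable declared with UNDER, a collection-valued column, and large-object columns.

-- A typed table whose rows can be referenced through a system-generated identifier.
CREATE TABLE persons OF person_t
    REF IS person_oid SYSTEM GENERATED;

-- A subtype and a subtable; the subtable inherits every column of its supertable.
CREATE TYPE student_t UNDER person_t AS (rollno INTEGER) NOT FINAL;
CREATE TABLE students OF student_t UNDER persons;

CREATE TABLE theses (
    written_by  REF(person_t) SCOPE persons,  -- "pointer" to a row of persons
    keywords    VARCHAR(30) ARRAY[10],        -- collection-valued column
    fulltext    CLOB(10M),                    -- character large object
    cover_scan  BLOB(100M)                    -- binary large object
);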
6.3 Mobile Databases
Recent advances in portable computers and wireless communications have given rise to mobile computing, whose characteristics have several novel properties that affect basic assumptions in many components of a DBMS, including the query engine, transaction manager, and recovery manager. Mobile database access is especially useful to geographically dispersed organizations. Typical examples are weather reporting services, taxi dispatchers, and traffic police; it is also very useful in financial market reporting and information brokering applications.
• Users are connected through a wireless link whose bandwidth is ten times less than
Ethernet and 100 times less than ATM networks. Communication costs are therefore
significantly higher in proportion to I/O and CPU costs.
• User's locations are constantly changing, and mobile computers have a limited battery
life. Therefore, the true communication costs reflect connection time and battery
usage in addition to bytes transferred, and change constantly depending on location.
Data is frequently replicated to minimize the cost of accessing it from different
locations.
• As a user moves around, data could be accessed from multiple database servers
within a single transaction. The likelihood of losing connections is also much greater
than in a traditional network. Centralized transaction management may therefore be
impractical, especially if some data is resident at the mobile computers.
6.3.1 Mobile Computing Architecture
The general architecture of a mobile computing platform is a distributed architecture in which fixed hosts and base stations are interconnected through a high-speed wired network.
Base stations are equipped with wireless interfaces and can communicate with mobile
units to support data access.
Mobile Units (MU) (or hosts) and base stations communicate through wireless
channels having bandwidths significantly lower than those of a wired network. Mobile
units are battery powered portable computers that move freely in a geographic mobility
domain, an area that is restricted by the limited bandwidth of wireless communication
channels. To manage the mobility of units, the entire geographic mobility domain is
divided into smaller domains called cells. The mobile computing discipline requires that
the movement of mobile units be unrestricted within the geographic mobility domain, while maintaining information access contiguity; that is, the movement of a mobile unit across cell boundaries has no effect on the data retrieval process.
The mobile computing platform can be described using a client-server architecture. That
means we may sometimes refer to a mobile unit as a client or sometimes as a user, and
the base stations as servers. Each cell is managed by a base station, which contains
transmitters and receivers for responding to the information processing needs of clients
located in the cell. Clients and servers communicate through wireless channels.
Applications which run on mobile units (or hosts) have different data
requirements. Users of these types of applications either engage in office activities or
personal communications, or they may simply receive updates on frequently changing
information around the world.
These applications can be grouped into two categories:
Vertical applications
Horizontal applications
In vertical applications, users access data within a specific, predefined cell, and access is denied to users outside of that cell. For example, users can obtain information on the location of nearby hotels, doctors, or emergency centers within a cell, or parking availability data at an airport cell.
6.3.2 Types of Data used in Mobile Computing Applications
Data used in the above applications may be classified into three categories:
1. Private data: A single user owns this type of data and manages it. No other user may
access this data.
2. Public data: This data can be used by anyone who has permission to read it.
Only one source updates this type of data. Examples include stock prices or weather
bulletins.
3. Shared data: This data is accessed both in read and write modes by groups of users.
Examples are inventory data for products in a company, which can be updated to
maintain the current status of inventory.
6.4 Main Memory Databases
The price of main memory has fallen to the point where many database systems now have several gigabytes of main memory. This shift prompts a
reexamination of some basic DBMS design decisions, since disk accesses no longer
dominate processing time for a memory resident database:
• Main memory does not survive system crashes, and so we still have to implement
logging and recovery to ensure transaction atomicity and durability. Log records
must be written to stable storage at commit time, and this process could become a
bottleneck. To minimize this problem, rather than commit each transaction as it completes, we can collect completed transactions and commit them in batches; this is called group commit. Recovery algorithms can also be optimized
since pages rarely have to be written out to make room for other pages.
• A new criterion must be considered while optimizing queries, namely the amount of
space required to execute a plan. It is important to minimize the space overhead
because exceeding available physical memory would lead to swapping pages to disk
(through the operating system's virtual memory mechanisms), greatly slowing down
execution.
• Page-oriented data structures become less important (since pages are no longer the
unit of data retrieval), and clustering is not important (since the cost of accessing any
region of main memory is uniform.)
If applications are allowed direct access to data structures in memory, corrupting the data is a real possibility due to such open access, and hence error detection is a priority. Similarly,
process failures must be detected and recovered from, so that the system is not brought to
a halt by an errant process. The fact that data is memory resident can be utilized to
significantly improve performance over disk databases, where accesses have to go
through a buffer manager, and where dirty pages may have to be written to disk at any
time to make space for new pages.
6.5 Multimedia Databases
In an object-relational DBMS, users can define ADTs (abstract data types) with
appropriate methods, which is an improvement over an RDBMS. Nonetheless,
supporting just ADTs falls short of what is required to deal with very large collections of
multimedia objects, including audio, images, free text, text marked up in HTML or
variants, sequence data, and videos. Industrial applications such as collaborative
development of engineering designs also require multimedia database management, and
are being addressed by several vendors. Currently the following types of multimedia data
are available on the systems.
• Images: Includes photographs, still images, and drawings; these are encoded in standard formats such as bitmap, JPEG, and MPEG.
• Structured audio: A sequence of audio components comprising note, tone, duration, etc.
• Audio: Sampled data from recordings, stored as a string of bits in digitized form. Analog recordings are typically converted into digital form before storage.
6.5.1 Multimedia database Applications
Multimedia data may be stored, delivered, and utilized in many different ways.
Applications may be categorized based on their data management characteristics. The
following are some of the applications and challenges in this area:
such objects. Some more examples include repositories of satellite images,
engineering drawings and design, space photographs, and radiology scanned pictures.
All of the above application areas present major challenges for the design of
multimedia database systems.
medical records of patients, and other publishing material. All these types of data involve maintaining multimedia data.
True False
True False
Question 3: In vertical applications of mobile database users access data within a
specific cell, and access is denied to users outside of that cell.
True False
Question 4: Modern CPUs also have very large address spaces, due to 64-bit
addressing.
True False
Question 5: Multimedia data includes audio, images, free text, text marked up in
HTML or variants, sequence data, and videos.
True False
True False
6.6 Geographic information systems
Geographic information systems (GIS) are used to collect, model, store, and
analyze information describing physical properties of the geographical world. Geographic
Information Systems (GIS) contain spatial information about villages, cities, states,
countries, streets, roads, highways, lakes, rivers, hills, and other geographical features,
and support applications to combine such spatial information with non-spatial data. It
also contains nonspatial data, such as census counts, economic data, and sales or
marketing information. Spatial data is stored in either raster or vector formats. In
addition, there is often a temporal dimension, as when we measure rainfall at several
locations over time. An important issue with spatial data sets is how to integrate data
from multiple sources, since each source may record data using a different coordinate
system to identify locations.
Now let us consider how spatial data in a GIS is analyzed. Spatial information is
most naturally thought of as being overlaid on maps. Typical queries include "What
cities lies between X and Y " and "What is the shortest route from city X to Y " These
kinds of queries can be addressed using the techniques available. An emerging
application is in-vehicle navigation aids. With Global positioning Systems (GPS)
technology, a car's location can be pinpointed, and by accessing a database of local maps,
a driver can receive directions from his or her current location to a desired destination;
this application also involves mobile database access!.
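As an indication of how such spatial queries can be written, the sketch below uses OGC-style spatial functions in the form offered by the PostGIS extension; the cities table, its geometry column, and the coordinates are assumptions for illustration rather than part of any specific GIS discussed above.

-- Hypothetical schema: cities(name VARCHAR, location GEOMETRY(Point, 4326)).
-- Cities within 50 km of a given point, nearest first (PostGIS-style functions).
SELECT   c.name,
         ST_Distance(c.location::geography,
                     ST_SetSRID(ST_MakePoint(76.64, 12.31), 4326)::geography) AS metres
FROM     cities c
WHERE    ST_DWithin(c.location::geography,
                    ST_SetSRID(ST_MakePoint(76.64, 12.31), 4326)::geography, 50000)
ORDER BY metres;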
6.7 Temporal and sequence databases
Currently available DBMSs provide little support for queries over ordered
collections of records, or sequences, and over temporal data. Typical sequence queries
include "Find the weekly moving average of the BSE index" and "Find the first five
consecutively increasing temperature readings" (from a trace of temperature
observations). Such queries can be easily expressed and often efficiently executed by
systems that support query languages designed for sequences. Some commercial SQL
systems now support such SQL extensions.
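For instance, the "weekly moving average" query mentioned above can be written with the window functions that several SQL systems now provide; the bse_index table and its columns are hypothetical.

-- Hypothetical table: bse_index(trading_date DATE, closing_value DECIMAL(10,2)).
-- Moving average of the closing value over the last five trading days.
SELECT   trading_date,
         AVG(closing_value) OVER (ORDER BY trading_date
                                  ROWS BETWEEN 4 PRECEDING AND CURRENT ROW)
             AS weekly_moving_avg
FROM     bse_index
ORDER BY trading_date;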
The first example given is also a temporal query. However, temporal queries
involve more than just record ordering. For example, consider the following query: "Find
the longest interval in which the same person managed two different departments." If the
period during which a given person managed a department is indicated by two fields from
and to, we have to reason about a collection of intervals, rather than a sequence of
records. Further, temporal queries require the DBMS to be aware of the anomalies
associated with calendars (such as leap years). Temporal extensions are likely to be
incorporated in future versions of the SQL standard.
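Without temporal extensions, the interval reasoning has to be coded by hand. The sketch below, against a hypothetical manages(person, dept, from_date, to_date) table, finds overlapping periods in which the same person managed two different departments, longest overlap first; GREATEST, LEAST, and date arithmetic behave as in several common SQL dialects, though not all.

-- Hypothetical table: manages(person, dept, from_date, to_date).
SELECT   m1.person,
         GREATEST(m1.from_date, m2.from_date) AS overlap_start,
         LEAST(m1.to_date, m2.to_date)        AS overlap_end
FROM     manages m1
JOIN     manages m2
  ON     m1.person = m2.person
 AND     m1.dept < m2.dept                  -- two different departments, each pair once
 AND     m1.from_date <= m2.to_date         -- the two periods overlap
 AND     m2.from_date <= m1.to_date
ORDER BY LEAST(m1.to_date, m2.to_date) - GREATEST(m1.from_date, m2.from_date) DESC;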
A distinct and important class of sequence data consists of DNA sequences, which
are being generated at a rapid pace by the biological community. These are in fact closer
to sequences of characters in text than to time sequences as in the above examples. The
field of biological information management and analysis has become very popular in
recent years, and is called bioinformatics. Biological data, such as DNA sequence data, is
characterized by complex structure and numerous relationships among data elements,
many overlapping and incomplete or erroneous data fragments (because experimentally
collected data from several groups, often working on related problems, is stored in the
databases), a need to frequently change the database schema itself as new kinds of
relationships in the data are discovered, and the need to maintain several versions of data
for archival and reference.
6.8 Information visualization
As computers become faster and main memory becomes cheaper, it
becomes increasingly feasible to create visual presentations of data, rather than just text-
based reports, which we used to generate earlier. Data visualization makes it easier for
users to understand the information in large complex datasets. The challenge here is to
make it easy for users to develop visual presentation of their data and to interactively
query such presentations. Although a number of data visualization tools are available,
efficient visualization of large datasets presents many challenges.
The need for visualization is especially important in the context of decision
support; when confronted with large quantities of high-dimensional data and various kinds
of data summaries produced by using analysis tools such as SQL, OLAP, and data mining
algorithms, the information can be overwhelming. Visualization of the data, together
with the generated summaries, can be a powerful way to sift through this information and
spot interesting trends or patterns. The human eye, after all, is very good at finding
patterns. A good framework for data mining must combine analytic tools to process data,
and bring out latent anomalies or trends, with a visualization environment in which a user
can notice these patterns and interactively drill down to the original data for further
analysis.
6.9 Genome Data management
6.9.1 Biological Science and genetics
Genetics has emerged as an ideal field for the application of information technology: it involves the management of enormous amounts of data and the seeking out of relationships in that information. The study of genetics can be
divided into three branches: Mendelian genetics, molecular genetics, and population genetics.
The first discovery occurred in 1869, when Friedrich Miescher discovered
nuclein and its primary component, deoxyribonucleic acid (DNA). In subsequent
research DNA and a related compound, ribonucleic acid (RNA), were found to be
composed of nucleotides (a sugar, a phosphate, and a base, which combined to form
nucleic acid) linked into long polymers via the sugar and phosphate.
The second discovery was the demonstration in 1944 by Oswald Avery that DNA
was indeed the molecular substance carrying genetic information. Genes were thus
shown to be composed of chains of nucleic acids arranged linearly on chromosomes and
to serve three primary functions:
6.9.2 The Genome Database (GDB)
The Genome Database was created in 1989. It is a catalog of
human gene mapping data. There is a process that associates a piece of information with
a particular location on the human genome. The GDB system is built around the SYBASE RDBMS. SYBASE is a commercial relational database management system, and its data
are modeled using standard Entity-Relationship methods. GDB distributes a Database
Access Toolkit, to improve data integrity and to simplify the programming for application
writers.
6.10 Digital Libraries
Digital libraries are an important and active research area. Conceptually, a digital library is an analog of a traditional library -- a large collection of information sources in various media -- coupled with the advantages of digital technologies. However, digital
libraries differ from their traditional counterparts in significant ways: storage is digital,
remote access is quick and easy, and materials are copied from a master version.
Furthermore, keeping extra copies on hand is easy and is not hampered by budget and
storage restrictions, which are major problems in traditional libraries. Thus, digital
technologies overcome many of the physical and economic limitations of traditional
libraries.
The introduction to the April 1995 Communications of the ACM special issue on
digital libraries describes them as the "opportunity . . . to fulfill the age-old dream of
every human being: gaining ready access to humanity's store of information". We defined
a database quite broadly as a "collection of related data." Unlike the related data in a
database, a digital library encompasses a multitude of sources, many unrelated.
Logically, databases can be components of digital libraries.
The magnitude of these data collections as well as their diversity and multiple
formats provide challenges for research in this area. The future progression of the
development of digital libraries is likely to move from the present technology of retrieval
of information via the Internet, through Net searches of indexed information in
repositories, to a time of information correlation and analysis by intelligent networks.
Various techniques for collecting information, storing it, and organizing it to support
informational requirements learned in the decades of design and implementation of
databases will provide the baseline for development of approaches appropriate for digital
libraries. Search, retrieval, and processing of the many forms of digital information will
make use of the lessons learnt from database operations carried out already on those
forms of information.
True False
Question 2: Currently available DBMSs provide maximum support for queries over
ordered collections of records, or sequences, and over temporal data.
True False
True False
Question 4: Genetics has emerged as an ideal field for the application of information
technology.
True False
Question 5: Schemas in biological databases do not change at a rapid pace.
True False
Question 6: In digital libraries storage is digital, remote access is quick and easy.
True False
6.11 Summary
Relational databases have been in use for over two and a half decades. A large
portion of the applications of the relational databases have been in the commercial world,
supporting such tasks as transaction processing for insurance sectors, banks, stock
exchanges, reservations for a variety of businesses, and inventory and payroll for almost all
companies. In this unit we discussed the emerging database technologies, which have
become increasingly important in recent years. The SQL3 data model, mobile databases,
multimedia databases, main memory databases, geographic information systems,
temporal and sequence databases, information visualization, genome data management
and digital libraries are among the new technology trends.
Progress 2
1. True
2. False
3. True
4. True
5. False
6. True
-------------------------
Block Summary
In this block, we learned about data warehousing and data mining concepts. We also discussed Internet databases. Further, we touched upon various emerging database technologies.
The World Wide Web (WWW, or Web) is a distributed information system based
on hypertext. The Web makes it possible to access a file anywhere on the Internet. Many
Internet users today have home pages on the Web; such pages often contain information about the users' work and personal lives. This leads to Internet databases.
There are many emerging database technologies, which have become increasingly
important in recent years. The SQL3 data model, mobile databases, multimedia databases,
main memory databases, geographic information systems, temporal and sequence
databases, information visualization, genome data management and digital libraries are
among the new technology trends.
Bibliography