Advanced Database Module
Institute of Technology
Department of Information Systems
Table of Contents
Chapter One ................................................................................................................................................ 1
Transaction Management and Concurrency Control.............................................................................. 1
1.1. Transaction.................................................................................................................................... 1
1.1.1. Evaluating Transaction Results ............................................................................................. 1
1.1.2. Transaction Properties........................................................................................................... 2
1.1.3. Transaction Management with SQL ..................................................................................... 7
1.1.4. Transaction Log .................................................................................................................... 7
1.2. Concurrency Control ..................................................................................................................... 8
1.3. Problems of Concurrent Sharing ................................................................................................. 19
1.4. Concept of Serializability............................................................................................................ 23
1.4.1. Types of serializability ........................................................................................................ 24
1.5. Database Recovery...................................................................................................................... 26
1.6. Transaction and Recovery ........................................................................................................... 30
1.6.1. Transaction .......................................................................................................................... 30
1.6.2. Recovery ............................................................................................................................. 31
1.6.3. Recovery techniques and facilities ...................................................................................... 31
Chapter two ............................................................................................................................................... 34
Query Processing and Optimization........................................................................................................ 34
2.1. Overview ..................................................................................................................................... 34
2.2. Query Processing Steps ............................................................................................................... 34
2.3. Query Decomposition ................................................................................................................. 36
2.4. Optimization Process .................................................................................................................. 37
2.4.1. Top-K Optimization ............................................................................................................ 37
2.4.2. Join Minimization ............................................................................................................... 38
2.4.3. Multi Query Optimization and Shared Scans...................................................................... 39
2.4.4. Parametric Query Optimization .......................................................................................... 40
2.5. Approaches to Query Optimization ............................................................................................ 40
2.5.1. Exhaustive Search Optimization ......................................................................................... 40
2.5.2. Heuristic Based Optimization ............................................................................................. 41
2.6. Transformation Rules.................................................................................................................. 41
2.7. Implementing relational Operators ............................................................................................. 45
2.7.1. Relational Algebra .............................................................................................................. 45
2.8. Pipelining .................................................................................................................................... 52
2.8.1. Pipelining vs. Materialization ............................................................................................. 53
Chapter 3 ................................................................................................................................................... 55
Database Integrity, Security and Recovery ............................................................................................ 55
3.1. Integrity ....................................................................................................................................... 55
3.1.1. Types of Data Integrity ......................................................................................................... 56
3.1.2. Integrity Constraints ............................................................................................................ 57
3.2. Security ....................................................................................................................................... 60
3.3. Database threats .......................................................................................................................... 61
3.3.1. Threats in a Database .......................................................................................................... 62
3.3.2. Measures of Control ............................................................................................................ 63
3.4. Identification and Authentication................................................................................................ 63
3.5. Categories of Control .................................................................................................................. 64
3.6. Data Encryption .......................................................................................................................... 67
3.6.1. Symmetric and Asymmetric Encryption ............................................................................. 69
Chapter Four ............................................................................................................................................. 74
Distributed Database ................................................................................................................................ 74
4.1. Distributed Database overview .................................................................................................. 74
4.2. Components of Distributed DBMS and types............................................................................. 75
4.2.1. Types of DDBS: .................................................................................................................. 75
4.2.2. DDB Components ............................................................................................................... 76
4.2.3. DDBMS Functions.............................................................................................................. 76
4.3. Distributed Database Design ....................................................................................................... 77
4.3.1. Data Fragmentation ............................................................................................................. 77
4.4. Data Replication.......................................................................................................................... 78
4.5. Data Allocation ........................................................................................................................... 79
4.6. Query Processing and Optimization in Distributed Databases ................................................... 80
4.6.1. Distributed Query Processing ............................................................................................. 80
4.6.2. Data Transfer Costs of Distributed Query Processing ........................................................ 81
4.6.3. Distributed Query Processing Using Semijoin.................................................................... 83
4.7. Query and Update Decomposition .............................................................................................. 85
4.8. Distributed Database Transparency Features .............................................................................. 88
4.8.1. Distribution Transparency................................................................................................... 88
4.8.2. Transaction Transparency ................................................................................................... 89
4.9. Performance Transparency and Query Optimization .................................................................. 91
4.9.1. Distributed Concurrency Control ........................................................................................ 93
4.10. The Effect of a Premature COMMIT ...................................................................................... 93
4.10.1. Two-Phase Commit Protocol .............................................................................................. 94
4.10.2. Phases of Two-Phase Commit Protocol .............................................................. 94
4.11. Distributed Transaction Management and Recovery ............................................................ 95
4.12. Operating System Support for Transaction Management ....................................................... 97
Chapter 5 ................................................................................................................................................. 100
Object Oriented DBMS .......................................................................................................................... 100
5.1. Object Oriented Concepts ......................................................................................................... 100
5.2. Drawbacks of relational DBMS ................................................................................................ 104
5.3. OO Data modeling and E-R diagramming ................................................................................ 108
5.3.1. E-R Model ......................................................................................................................... 108
5.4. Object Oriented Model.............................................................................................................. 109
5.5. Objects and Attributes............................................................................................................... 110
5.6. Characteristics of Object ........................................................................................................... 110
5.6.1. Object Identity................................................................................................................... 112
Chapter 6 ................................................................................................................................................. 114
Data warehousing and Data Mining Techniques ................................................................................. 114
6.1. Data Warehousing ..................................................................................................................... 114
6.1.1. Introduction ....................................................................................................................... 115
6.1.2. Database & data warehouse: Differences.......................................................................... 115
6.1.3. Benefits ............................................................................................................................. 116
6.2. Online Transaction Processing (OLTP) and Data Warehousing............................................... 118
6.3. Data Mining .............................................................................................................................. 119
6.3.1. Introduction ....................................................................................................................... 119
6.4. Data Mining Techniques ........................................................................................................... 122
List of figures
Figure 1:States of Transactions ..................................................................................................................... 6
Figure 2:Pre-claiming Lock Protocol .......................................................................................................... 13
Figure 3:Two-phase locking (2PL) ............................................................................................................. 14
Figure 4:Strict Two-phase locking (Strict-2PL).......................................................................................... 16
Figure 5:Precedence Graph for TS Ordering .............................................................................................. 17
Figure 6: Steps in query processing ............................................................................................................ 35
Figure 7: Types of relational operation ....................................................................................................... 46
Figure 8: Types of constraints ..................................................................................................................... 57
Figure 9: Data encryption process............................................................................................................... 68
Figure 10: Communication network ........................................................................................................... 76
Figure 11: Data replication ......................................................................................................................... 79
Figure 12: Accesses data at a single remote site ......................................................................................... 89
Figure 13: Distributed transaction............................................................................................................... 90
Figure 14: Another Distributed Request ..................................................................................................... 91
Figure 15: The Effect of a Premature COMMIT ........................................................................................ 93
Figure 16:E-R Model ................................................................................................................................ 108
Figure 17: Object Oriented data model with object and attributes ............................................................ 112
Figure 18: Current evolution of Decision Support Systems...................................................................... 114
List of Tables
Chapter One
Transaction Management and Concurrency Control
1.1. Transaction
A transaction is the execution of a sequence of one or more operations (e.g., SQL queries) on a
shared database to perform some higher level function. They are the basic unit of change in a
DBMS.
Example: Move $100 from Abebe’s bank account to his bookie’s account:
1. Check whether Abebe has $100.
2. Deduct $100 from Abebe’s account.
3. Add $100 to the bookie’s account.
1.1.2. Transaction Properties
¤ Atomicity
All operations of a transaction must be completed
If not, the transaction is aborted
It states that either all operations of the transaction take place at once or, if not, the
transaction is aborted.
There is no midway, i.e., the transaction cannot occur partially. Each transaction
is treated as one unit and either run to completion or is not executed at all.
Abort: If a transaction aborts then all the changes made are not visible.
Commit: If a transaction commits then all the changes made are visible.
Example: Let's assume the following transaction T consists of two parts, T1 and T2. A contains Rs 600
and B contains Rs 300, and T transfers Rs 100 from account A to account B.
T1                T2
Read(A)
A := A - 100
Write(A)
                  Read(B)
                  B := B + 100
                  Write(B)
After completion of the transaction, A consists of Rs 500 and B consists of Rs 400. If the
transaction T fails after the completion of transaction T1 but before completion of transaction T2,
then the amount will be deducted from A but not added to B. This shows the inconsistent database
state. In order to ensure correctness of database state, the transaction must be executed in entirety.
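To make the all-or-nothing idea concrete, here is a minimal sketch in Python (the dictionary, the account names and the helper function are invented for illustration; a real DBMS achieves this with logging and recovery, not an in-memory copy):

# Sketch of atomic (all-or-nothing) execution of the transfer above.
def transfer(db, source, target, amount):
    working = dict(db)              # work on a private copy of the data
    working[source] -= amount       # part T1: debit the source account
    if working[source] < 0:
        raise ValueError("insufficient funds")   # any failure aborts the whole transaction
    working[target] += amount       # part T2: credit the target account
    db.update(working)              # commit: both changes become visible together

db = {"A": 600, "B": 300}
try:
    transfer(db, "A", "B", 100)     # success: A = 500, B = 400
except ValueError:
    pass                            # abort: db is left completely unchanged
print(db)

If the function raises anywhere before the final update, the shared dictionary never sees a partial change, which is exactly the behaviour the atomicity property demands.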
¤ Consistency
Permanence of database’s consistent state
The integrity constraints are maintained so that the database is consistent before and after
the transaction.
The execution of a transaction will leave a database in either its prior stable state or a new
stable state.
The consistency property of the database states that every transaction sees a consistent
database instance.
The transaction is used to transform the database from one consistent state to another
consistent state.
For example: the total amount must be the same before and after the transaction, so that
the database remains consistent. In the case where T1 completes but T2 fails, an
inconsistency will occur.
¤ Isolation
Data used during transaction cannot be used by second transaction until the first is
completed
It shows that the data which is used at the time of execution of a transaction cannot be
used by the second transaction until the first one is completed.
In isolation, if the transaction T1 is being executed and using the data item X, then that
data item can't be accessed by any other transaction T2 until the transaction T1 ends.
The concurrency control subsystem of the DBMS enforces the isolation property.
¤ Durability
Ensures that once transactions are committed, they cannot be undone or lost
The durability property is used to indicate the permanence of the database's consistent
state. It states that once a transaction has committed, its changes are permanent.
They cannot be lost by the erroneous operation of a faulty transaction or by the system
failure. When a transaction is completed, then the database reaches a state known as the
consistent state. That consistent state cannot be lost, even in the event of a system's
failure.
The recovery subsystem of the DBMS is responsible for enforcing the durability property.
¤ Serializability
Ensures that the schedule for the concurrent execution of several transactions should
yield consistent results.
When multiple transactions are being executed by the operating system in a multiprogramming
environment, there is a possibility that instructions of one transaction are interleaved with those
of some other transaction.
Serial Schedule − It is a schedule in which transactions are aligned in such a way that
one transaction is executed first. When the first transaction completes its cycle, then the
next transaction is executed. Transactions are ordered one after the other. This type of
schedule is called a serial schedule, as transactions are executed in a serial manner.
This execution does no harm if two transactions are mutually independent and working on
different segments of data; but in case these two transactions are working on the same data, then
the results may vary.
This ever-varying result may bring the database to an inconsistent state. To resolve this problem,
we allow parallel execution of a transaction schedule, if its transactions are either serializable or
have some equivalence relation among them.
Equivalence Schedules:- An equivalence schedule can be of the following types:
Result Equivalence:
If two schedules produce the same result after execution, they are said to be result equivalent. They
may yield the same result for some value and different results for another set of values. That's why
this equivalence is not generally considered significant.
View Equivalence
Two schedules are said to be view equivalent if the transactions in both schedules perform similar
actions in a similar manner.
For example −
If T reads the initial data in S1, then it also reads the initial data in S2.
If T reads the value written by J in S1, then it also reads the value written by J in S2.
If T performs the final write on the data value in S1, then it also performs the final write on the
data value in S2.
Conflict Equivalence
Two schedules having multiple transactions with conflicting operations are said to be conflict
equivalent if and only if both schedules contain the same set of transactions and the order of every
pair of conflicting operations is the same in both schedules.
Note − View equivalent schedules are view serializable and conflict equivalent schedules are
conflict serializable. All conflict serializable schedules are view serializable too.
States of Transactions
Active − In this state, the transaction is being executed. This is the initial state of every
transaction.
Failed − A transaction is said to be in a failed state if any of the checks made by the
database recovery system fails. A failed transaction can no longer proceed further.
Aborted − If any of the checks fails and the transaction has reached a failed state, then the
recovery manager rolls back all its write operations on the database to bring the database
back to its original state where it was prior to the execution of the transaction. Transactions
in this state are called aborted.
The database recovery module can select one of two operations after a transaction aborts: re-start the transaction (with the same or a new transaction identifier), or kill the transaction.
1.2. Concurrency Control
Concurrency Control is the management procedure that is required for controlling concurrent
execution of the operations that take place on a database. But before knowing about concurrency
control, we should know about concurrent execution.
Concurrent Execution in DBMS
In a multi-user system, multiple users can access and use the same database at one time,
which is known as concurrent execution of the database. It means that the same database
is accessed and operated on simultaneously by different users on a multi-user system.
While working on the database transactions, there occurs the requirement of using the
database by multiple users for performing different operations, and in that case, concurrent
execution of the database is performed.
The thing is that the simultaneous execution that is performed should be done in an
interleaved manner, and no operation should affect the other executing operations, thus
maintaining the consistency of the database. Thus, on making the concurrent execution of
the transaction operations, there occur several challenging problems that need to be solved.
In a database transaction, the two main operations are READ and WRITE. These operations need
to be managed during the concurrent execution of transactions, because if they are interleaved in an
uncontrolled manner the data may become inconsistent. The following problems occur with
concurrent execution of the operations:
Lost Update Problem (W-W Conflict)
This problem occurs when two different database transactions perform their read/write operations
on the same database item in an interleaved manner (i.e., concurrent execution) in a way that makes
the value of the item incorrect, leaving the database inconsistent.
For example:
Consider the below diagram where two transactions TX and TY, are performed on the same account
A where the balance of account A is $300.
At time t1, transaction TX reads the value of account A, i.e., $300 (only read).
At time t2, transaction TX deducts $50 from account A that becomes $250 (only deducted
and not updated/write).
Alternately, at time t3, transaction TY reads the value of account A that will be $300 only
because TX didn't update the value yet.
At time t4, transaction TY adds $100 to account A that becomes $400 (only added but not
updated/write).
At time t6, transaction TX writes the value of account A that will be updated as $250 only,
as TY didn't update the value yet.
Similarly, at time t7, transaction TY writes the values of account A, so it will write as done
at time t4 that will be $400. It means the value written by TX is lost, i.e., $250 is lost.
Hence data becomes incorrect, and database sets to inconsistent.
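The same interleaving can be replayed step by step in a small illustrative simulation (the variable names are invented; a real DBMS interleaves operations on buffer pages rather than on Python variables):

# Replaying the t1..t7 interleaving above: each transaction works on its own
# local copy of A and writes it back, so TX's update is overwritten (lost).
A = 300

tx_local = A          # t1: TX reads A (300)
tx_local -= 50        # t2: TX deducts 50 locally (250), not yet written
ty_local = A          # t3: TY reads A, still 300 (TX has not written)
ty_local += 100       # t4: TY adds 100 locally (400)
A = tx_local          # t6: TX writes 250
A = ty_local          # t7: TY writes 400 -- TX's update (250) is lost
print(A)              # 400, instead of the expected 350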
Dirty Read Problem (W-R Conflict)
The dirty read problem occurs when one transaction updates an item of the database, the transaction
then fails, and before the data gets rolled back, the updated database item is accessed by another
transaction. This creates a write-read conflict between the two transactions.
For example:
Consider two transactions TX and TY in the below diagram performing read/write operations on
account A where the available balance in account A is $300:
But the value for account A remains $350 for transaction TY as committed, which is the
dirty read and therefore known as the Dirty Read Problem.
Unrepeatable Read Problem (R-W Conflict)
Also known as the Inconsistent Retrievals Problem, it occurs when, within a single transaction, two
different values are read for the same database item.
For example:
Consider two transactions, TX and TY, performing the read/write operations on account A, having
an available balance = $300. The diagram is shown below:
At time t1, transaction TX reads the value from account A, i.e., $300.
At time t2, transaction TY reads the value from account A, i.e., $300.
At time t3, transaction TY updates the value of account A by adding $100 to the available
balance, and then it becomes $400.
At time t4, transaction TY writes the updated value, i.e., $400.
After that, at time t5, transaction TX reads the available value of account A, and that will
be read as $400.
It means that within the same transaction TX, it reads two different values of account A,
i.e., $ 300 initially, and after updating made by transaction TY, it reads $400. It is an
unrepeatable read and is therefore known as the Unrepeatable read problem.
Thus, in order to maintain consistency in the database and avoid such problems that take place in
concurrent execution, management is needed, and that is where the concept of Concurrency
Control comes into role.
Concurrency Control
Concurrency Control is the working concept that is required for controlling and managing the
concurrent execution of database operations and thus avoiding the inconsistencies in the database.
Thus, for maintaining the concurrency of the database, we have the concurrency control protocols.
Lock-Based Protocol
In this type of protocol, a transaction cannot read or write data until it acquires an appropriate
lock on it. There are two types of lock:
Shared lock:
It is also known as a Read-only lock. With a shared lock, the data item can only be read by the
transaction.
It can be shared between transactions because a transaction that holds only a shared lock
cannot update the data item.
Exclusive lock:
With an exclusive lock, the data item can be both read and written by the transaction.
This lock is exclusive: while it is held, multiple transactions cannot modify the same data
simultaneously.
There are four types of lock protocols available:
1. Simplistic lock protocol
It is the simplest way of locking data during a transaction. Simplistic lock-based protocols require
all transactions to obtain a lock on the data before an insert, delete or update on it. The data item
is unlocked after the transaction completes.
2. Pre-claiming lock protocol
Pre-claiming lock protocols evaluate the transaction to list all the data items on which it
needs locks.
Before initiating execution of the transaction, it requests the DBMS for locks on all
those data items.
If all the locks are granted, this protocol allows the transaction to begin. When the
transaction is completed, it releases all the locks.
If all the locks are not granted, the transaction rolls back and
waits until all the locks are granted.
3. Two-phase locking (2PL)
The two-phase locking protocol divides the execution phase of the transaction into three
parts.
In the first part, when the execution of the transaction starts, it seeks permission for the
lock it requires.
In the second part, the transaction acquires all the locks. The third phase is started as soon
as the transaction releases its first lock.
In the third phase, the transaction cannot demand any new locks. It only releases the
acquired locks.
Growing phase: In the growing phase, a new lock on the data item may be acquired by the
transaction, but none can be released.
Shrinking phase: In the shrinking phase, existing lock held by the transaction may be released,
but no new locks can be acquired.
In the below example, if lock conversion is allowed then the following phase can happen:
Example:
Table 5: Unlocking and locking work with 2-PL
The following shows how unlocking and locking work with 2-PL for two transactions T1 and T2.
4. Strict Two-phase locking (Strict-2PL)
The first phase of Strict-2PL is similar to 2PL. In the first phase, after acquiring all the
locks, the transaction continues to execute normally.
The only difference between 2PL and Strict-2PL is that Strict-2PL does not release a lock
immediately after using it.
Strict-2PL waits until the whole transaction commits and then releases all the locks at
once.
Thus the Strict-2PL protocol does not have a shrinking phase of gradual lock release.
Timestamp Ordering Protocol
Let's assume there are two transactions T1 and T2. Suppose transaction T1 entered the system at
time 007 and transaction T2 entered the system at time 009. T1 has the higher priority, so it
executes first, as it entered the system first.
The timestamp ordering protocol also maintains the timestamp of the last 'read' and 'write'
operation on each data item.
1. Whenever a transaction Ti issues a Read(X) operation: if TS(Ti) < W-timestamp(X), the operation is rejected and Ti is rolled back; otherwise it is executed and R-timestamp(X) is set to max(R-timestamp(X), TS(Ti)).
2. Whenever a transaction Ti issues a Write(X) operation: if TS(Ti) < R-timestamp(X) or TS(Ti) < W-timestamp(X), the operation is rejected and Ti is rolled back; otherwise it is executed and W-timestamp(X) is set to TS(Ti).
The TS protocol ensures freedom from deadlock, since no transaction ever waits.
However, the schedule may not be recoverable and may not even be cascade-free.
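A minimal sketch of these basic timestamp-ordering checks (the class and method names are invented; a real scheduler must also restart rolled-back transactions with new timestamps):

# Sketch of basic timestamp ordering: every item keeps the timestamps of the
# youngest transactions that read and wrote it; an operation that arrives
# "too late" forces its transaction to roll back (no waiting, hence no deadlock).
class TimestampScheduler:
    def __init__(self):
        self.read_ts = {}    # item -> largest TS that has read it
        self.write_ts = {}   # item -> largest TS that has written it

    def read(self, ts, item):
        if ts < self.write_ts.get(item, 0):
            return "rollback"                    # a younger transaction already wrote the item
        self.read_ts[item] = max(self.read_ts.get(item, 0), ts)
        return "ok"

    def write(self, ts, item):
        if ts < self.read_ts.get(item, 0) or ts < self.write_ts.get(item, 0):
            return "rollback"                    # a younger transaction already read/wrote the item
        self.write_ts[item] = ts
        return "ok"

s = TimestampScheduler()
print(s.read(7, "A"))    # T1 (TS = 007) reads A  -> ok
print(s.write(9, "A"))   # T2 (TS = 009) writes A -> ok
print(s.write(7, "A"))   # T1 writes A after T2   -> rollback (T1 is too old)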
Validation-Based Protocol
The validation-based protocol is also known as the optimistic concurrency control technique. In
this protocol, the transaction is executed in the following three phases:
1. Read phase: In this phase, the transaction T is read and executed. It is used to read the
value of various data items and stores them in temporary local variables. It can perform all
the write operations on temporary variables without an update to the actual database.
2. Validation phase: In this phase, the temporary variable value will be validated against the
actual data to see if it violates the serializability.
3. Write phase: If the transaction passes validation, then the temporary results are written
to the database or system; otherwise, the transaction is rolled back.
Validation (Ti): It contains the time when Ti finishes its read phase and starts its validation phase.
This protocol is used to determine the time stamp for the transaction for serialization using
the time stamp of the validation phase, as it is the actual phase which determines if the
transaction will commit or rollback.
Hence TS(T) = validation(T).
The serializability is determined during the validation process. It can't be decided in
advance.
While executing transactions, this protocol allows a greater degree of concurrency and
fewer conflicts.
Thus it results in fewer transaction rollbacks.
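A simplified sketch of this optimistic scheme, assuming backward validation of the read set against the write sets of transactions that committed after this one started (the class and helper names are invented for illustration):

# Sketch of optimistic (validation-based) execution: a transaction records its
# read set and buffers its writes; at validation time it is checked against the
# write sets of transactions that committed while it was running.
class OptimisticTxn:
    def __init__(self, start_tn):
        self.start_tn = start_tn          # commit-counter value when this transaction started
        self.read_set, self.writes = set(), {}

    def read(self, db, item):
        self.read_set.add(item)
        return self.writes.get(item, db[item])

    def write(self, item, value):
        self.writes[item] = value         # buffered locally during the read phase

def validate_and_commit(txn, db, committed):
    # committed: list of (commit_tn, write_set) for already committed transactions
    for tn, wset in committed:
        if tn > txn.start_tn and wset & txn.read_set:
            return False                  # conflict detected: roll back and restart
    db.update(txn.writes)                 # write phase
    committed.append((len(committed) + 1, set(txn.writes)))
    return True

db = {"A": 300}
committed = []
t = OptimisticTxn(start_tn=0)
t.write("A", t.read(db, "A") + 100)
print(validate_and_commit(t, db, committed), db)   # True {'A': 400}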
1.3. Problems of Concurrent Sharing
When multiple transactions execute concurrently in an uncontrolled or unrestricted manner, then
it might lead to several problems. These problems are commonly referred to as concurrency
problems in a database environment. The five concurrency problems that can occur in the
database are:
Temporary Update Problem
Incorrect Summary Problem
Lost Update Problem
Unrepeatable Read Problem
Phantom Read Problem
These are explained below.
Temporary Update (Dirty Read) Problem:
In the above example, if transaction 1 fails for some reason, then X will revert to its previous
value. But transaction 2 has already read the incorrect value of X.
Incorrect Summary Problem:
Consider a situation, where one transaction is applying the aggregate function on some records
while another transaction is updating these records. The aggregate function may calculate some
values before the values have been updated and others after they are updated.
Example:
Table 7: Incorrect Summary Problem
In the above example, transaction 2 is calculating the sum of some records while transaction 1
is updating them. Therefore the aggregate function may calculate some values before they have
been updated and others after they have been updated.
Lost Update Problem:
In the lost update problem, an update done to a data item by a transaction is lost as it is
overwritten by the update done by another transaction.
Example:
Table 8: Lost Update Problem
In the above example, transaction 1 changes the value of X but it gets overwritten by the update
done by transaction 2 on X. Therefore, the update done by transaction 1 is lost.
Unrepeatable Read Problem:
The unrepeatable read problem occurs when two or more read operations of the same transaction read
different values of the same variable.
Example:
Table 9: Unrepeatable Read Problem
In the above example, once transaction 2 reads the variable X, a write operation in transaction 1
changes the value of the variable X.
Thus, when another read operation is performed by transaction 2, it reads the new value of X
which was updated by transaction 1.
Phantom Read Problem:
The phantom read problem occurs when a transaction reads a variable once but when it tries to
read that same variable again, an error occurs saying that the variable does not exist.
In the above example, once transaction 2 reads the variable X, transaction 1 deletes the variable
X without transaction 2’s knowledge. Thus, when transaction 2 tries to read X, it is not able to
do it.
1.4. Concept of Serializability
Example
Let's take two transactions T1 and T2. If both transactions are performed without interfering with
each other, the schedule is called a serial schedule, and it can be represented as follows −
T1              T2
READ1(A)
WRITE1(A)
READ1(B)
C1
                READ2(B)
                WRITE2(B)
                READ2(B)
                C2
Non-serial schedule − When the operations of transactions T1 and T2 overlap (are interleaved), the schedule is non-serial.
Example
Consider the following example −
Table 12: Non serial schedule
T1              T2
READ1(A)
WRITE1(A)
                READ2(B)
                WRITE2(B)
READ1(B)
WRITE1(B)
READ1(B)
1.4.1. Types of serializability
View Serializability
A schedule is view serializable if it is view equivalent to a serial schedule. If a schedule is
conflict serializable, then it is also view serializable. A schedule that is view serializable but not
conflict serializable contains blind writes.
View Equivalent
Two schedules S1 and S2 are said to be view equivalent if they satisfy the following conditions:
1. Initial Read
An initial read of both schedules must be the same. Suppose two schedule S1 and S2. In schedule
S1, if a transaction T1 is reading the data item A, then in S2, transaction T1 should also read A.
Above two schedules are view equivalent because Initial read operation in S1 is done by T1 and
in S2 it is also done by T1.
2. Updated Read
In schedule S1, if Ti is reading A which is updated by Tj then in S2 also, Ti should read A which
is updated by Tj.
Above two schedules are not view equal because, in S1, T3 is reading A updated by T2 and in S2,
T3 is reading A updated by T1.
3. Final Write
A final write must be the same between both the schedules. In schedule S1, if a transaction T1
updates A at last then in S2, final writes operations should also be done by T1.
The above two schedules are view equivalent because the final write operation in S1 is done by T3
and in S2 the final write operation is also done by T3.
Conflict serializability
It orders any conflicting operations in the same way as some serial execution. A pair of operations
is said to conflict if they operate on the same data item and one of them is a write operation.
That means
Readi(x) Readj(x) - non-conflicting read-read operation
Readi(x) Writej(x) - conflicting read-write operation
Writei(x) Readj(x) - conflicting write-read operation
Writei(x) Writej(x) - conflicting write-write operation
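A common way to test conflict serializability is to build a precedence graph with an edge Ti → Tj for every conflicting pair in which Ti's operation comes first, and then check the graph for cycles. A small sketch (the schedule encoding as tuples is invented for illustration):

# Sketch: build a precedence graph from a schedule given as a list of
# (transaction, operation, item) triples; the schedule is conflict
# serializable iff the graph has no cycle.
def precedence_edges(schedule):
    edges = set()
    for i, (ti, op_i, x) in enumerate(schedule):
        for tj, op_j, y in schedule[i + 1:]:
            if ti != tj and x == y and "W" in (op_i, op_j):
                edges.add((ti, tj))       # conflicting pair: ti must precede tj
    return edges

def has_cycle(edges):
    nodes = {n for e in edges for n in e}
    def dfs(n, path):
        path.add(n)
        for a, b in edges:
            if a == n and (b in path or dfs(b, path)):
                return True
        path.discard(n)
        return False
    return any(dfs(n, set()) for n in nodes)

# Non-serial schedule W1(A) R2(A) W2(A) R1(A): edges T1->T2 and T2->T1, a cycle.
s = [("T1", "W", "A"), ("T2", "R", "A"), ("T2", "W", "A"), ("T1", "R", "A")]
print("conflict serializable:", not has_cycle(precedence_edges(s)))   # False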
1.5. Database Recovery
¤ Database recovery is the process of restoring the database to a correct (consistent) state in
the event of a failure. In other words, it is the process of restoring the database to the most
recent consistent state that existed shortly before the time of system failure.
¤ The failure may be the result of a system crash due to hardware or software errors, a media
failure such as head crash, or a software error in the application such as a logical error in
the program that is accessing the database.
¤ Recovery restores a database from a given state, usually inconsistent, to a previously
consistent state.
¤ DBMS is a highly complex system with hundreds of transactions being executed every
second. The durability and robustness of a DBMS depends on its complex architecture and
its underlying hardware and system software. If it fails or crashes amid transactions, it is
expected that the system would follow some sort of algorithm or techniques to recover lost
data.
Failure Classification
To see where the problem has occurred, we generalize a failure into various
categories, as follows
Transaction failure
A transaction has to abort when it fails to execute or when it reaches a point from
where it can’t go any further. This is called transaction failure where only a few
transactions or processes are hurt.
System errors − Where the database system itself terminates an active transaction because
the DBMS is not able to execute it, or it has to stop because of some system condition.
For example, in case of deadlock or resource unavailability, the system aborts an active
transaction.
System Crash
There are problems − external to the system − that may cause the system to stop abruptly and cause
the system to crash. For example, interruptions in power supply may cause the failure of underlying
hardware or software failure. Examples may include operating system errors.
Disk Failure
In early days of technology evolution, it was a common problem where hard-disk drives or
storage drives used to fail frequently.
Disk failures include formation of bad sectors, unreachability to the disk, disk head crash or any
other failure, which destroys all or a part of disk storage.
Storage Structure
We have already described the storage system. In brief, the storage structure can be divided into
two categories:
Volatile storage − As the name suggests, a volatile storage cannot survive system crashes.
Volatile storage devices are placed very close to the CPU; normally they are embedded
onto the chipset itself. For example, main memory and cache memory are examples of
volatile storage. They are fast but can store only a small amount of information.
Non-volatile storage − These memories are made to survive system crashes. They are
huge in data storage capacity, but slower in accessibility. Examples may include hard-
disks, magnetic tapes, flash memory, and non-volatile (battery backed up) RAM.
When a DBMS recovers from a crash, it should do the following:
It should check the states of all the transactions which were being executed.
A transaction may be in the middle of some operation; the DBMS must ensure the
atomicity of the transaction in this case.
It should check whether the transaction can be completed now or it needs to be rolled back.
There are two types of techniques, which can help a DBMS in recovering as well as maintaining
the atomicity of a transaction:-
Maintaining the logs of each transaction, and writing them onto some stable storage before
actually modifying the database.
Maintaining shadow paging, where the changes are done on a volatile memory, and later,
the actual database is updated.
Log-based Recovery
Log is a sequence of records, which maintains the records of actions performed by a transaction.
It is important that the logs are written prior to the actual modification and stored on a stable
storage media, which is failsafe. Log-based recovery works as follows −
When a transaction enters the system and starts execution, it writes a log entry about it:
<Tn, Start>
When the transaction finishes its last statement, it writes the log entry:
<Tn, Commit>
Deferred database modification − All logs are written on to the stable storage and the
database is updated when a transaction commits.
Checkpoint
Keeping and maintaining logs in real time and in real environment may fill out all the memory
space available in the system. As time passes, the log file may grow too big to be handled at all.
Checkpoint is a mechanism where all the previous logs are removed from the system and stored
permanently in a storage disk. Checkpoint declares a point before which the DBMS was in
consistent state, and all the transactions were committed.
Recovery
When a system with concurrent transactions crashes and recovers, it behaves in the following
manner:-
The recovery system reads the logs backwards from the end to the last checkpoint.
If the recovery system sees a log with <Tn, Start> and <Tn, Commit> or just <Tn, Commit>,
it puts the transaction in the redo-list.
If the recovery system sees a log with <Tn, Start> but no commit or abort log found, it puts
the transaction in undo-list.
All the transactions in the undo-list are then undone and their logs are removed. All the
transactions in the redo-list and their previous logs are removed and then redone before saving
their logs.
1.6.2. Recovery
¤ Database systems, like any other computer system, are subject to failures but the data stored
in it must be available as and when required.
¤ When a database fails it must possess the facilities for fast recovery. It must also have
atomicity i.e. either transactions are completed successfully and committed (the effect is
recorded permanently in the database) or the transaction should have no effect on the
database. There are both automatic and non-automatic ways for both, backing up of data and
recovery from any failure situations.
¤ The techniques used to recover the lost data due to system crash, transaction errors, viruses,
catastrophic failure, incorrect commands execution etc. are database recovery techniques. So
to prevent data loss recovery techniques based on deferred update and immediate update or
backing up data can be used.
¤ Recovery techniques are heavily dependent upon the existence of a special file known as
a system log. It contains information about the start and end of each transaction and any
updates which occur in the transaction. The log keeps track of all transaction operations that
affect the values of database items. This information is needed to recover from transaction
failure.
The log is kept on disk. The main types of log entries are:
start_transaction(T): This log entry records that transaction T starts execution.
read_item(T, X): This log entry records that transaction T reads the value of database
item X.
write_item(T, X, old_value, new_value): This log entry records that transaction T
changes the value of the database item X from old_value to new_value. The old value
is sometimes known as a before an image of X, and the new value is known as an
afterimage of X.
commit(T): This log entry records that transaction T has completed all accesses to the
database successfully and its effect can be committed (recorded permanently) to the
database.
abort(T): This records that transaction T has been aborted.
checkpoint: Checkpoint is a mechanism where all the previous logs are removed from
the system and stored permanently in a storage disk. Checkpoint declares a point before which
the DBMS was in consistent state, and all the transactions were committed.
A transaction T reaches its commit point when all its operations that access the database have
been executed successfully i.e. the transaction has reached the point at which it will
not abort (terminate without completing). Once committed, the transaction is permanently
recorded in the database. Commitment always involves writing a commit entry to the log and
writing the log to disk. At the time of a system crash, the log is searched backwards for all
transactions T that have written a start_transaction(T) entry into the log but have not written a
commit(T) entry yet; these transactions may have to be rolled back to undo their effect on the
database during the recovery process.
Undoing – If a transaction crashes, then the recovery manager may undo transactions i.e.
reverse the operations of a transaction. This involves examining a transaction for the log
entry write_item(T, x, old_value, new_value) and setting the value of item x in the
database to old_value. There are two major techniques for recovery from non-catastrophic
transaction failures: deferred updates and immediate updates.
Deferred update – This technique does not physically update the database on disk until
a transaction has reached its commit point. Before reaching commit, all transaction
updates are recorded in the local transaction workspace. If a transaction fails before
reaching its commit point, it will not have changed the database in any way so UNDO is
not needed. It may be necessary to REDO the effect of the operations that are recorded
in the local transaction workspace, because their effect may not yet have been written in
the database. Hence, a deferred update is also known as the No-undo/redo algorithm
Immediate update – In the immediate update, the database may be updated by some
operations of a transaction before the transaction reaches its commit point. However,
these operations are recorded in a log on disk before they are applied to the database,
making recovery still possible. If a transaction fails to reach its commit point, the effect
of its operation must be undone i.e. the transaction must be rolled back hence we require
both undo and redo. This technique is known as undo/redo algorithm.
Caching/Buffering – In this technique, one or more disk pages that include the data items to be
updated are cached into main memory buffers and then updated in memory before being written
back to disk. A collection of in-memory buffers called the DBMS cache is kept under
control of DBMS for holding these buffers. A directory is used to keep track of which
database items are in the buffer. A dirty bit is associated with each buffer, which is 0 if
the buffer is not modified else 1 if modified.
Shadow paging – It provides atomicity and durability. A directory with n entries is
constructed, where the ith entry points to the ith database page on disk. When a
transaction begins executing, the current directory is copied into a shadow directory. When
a page is to be modified, a new page is allocated in which the changes are made, and when
the transaction is ready to become durable, all directory entries that referred to the original
page are updated to refer to the new replacement page.
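A toy sketch of the idea with in-memory "pages" (the class and method names are invented; real shadow paging works on disk pages with a persistent directory root):

# Sketch: the shadow directory preserves the state at transaction start; writes
# go to freshly allocated pages via the current directory; commit simply keeps
# the current directory, while abort restores the shadow directory.
class ShadowPagedStore:
    def __init__(self, values):
        self.pages = list(values)                 # page pool
        self.current = list(range(len(values)))   # directory: slot -> page number
        self.shadow = None

    def begin(self):
        self.shadow = list(self.current)          # copy the directory, not the pages

    def write(self, slot, value):
        self.pages.append(value)                  # allocate a new page for the change
        self.current[slot] = len(self.pages) - 1  # only the current directory is updated

    def read(self, slot):
        return self.pages[self.current[slot]]

    def commit(self):
        self.shadow = None                        # current directory becomes the durable one

    def abort(self):
        self.current = self.shadow                # the original pages are still intact

db = ShadowPagedStore([600, 300])
db.begin(); db.write(0, 500); db.abort()
print(db.read(0))    # 600 -- the aborted change never touched the original page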
Some of the backup techniques are as follows:
Full database backup – In this, the full database, including the data and the database
meta-information needed to restore the whole database (including full-text catalogs), is
backed up at predefined intervals.
Differential backup – It stores only the data changes that have occurred since the last full
database backup. When the same data has changed many times since the last full database
backup, a differential backup stores the most recent version of the changed data. To restore
from it, a full database backup must be restored first.
Transaction log backup – In this, all events that have occurred in the database, i.e., a
record of every single statement executed, are backed up. It is the backup of the transaction
log entries and contains all transactions that have happened to the database. Through this, the
database can be recovered to a specific point in time.
Chapter two
Query Processing and Optimization
2.1. Overview
All database systems must be able to respond to requests for information from the user i.e. process
queries. Obtaining the desired information from a database system in a predictable and reliable
fashion is the scientific art of Query Processing. Getting these results back in a timely manner
deals with the technique of Query Optimization. Query Processing is the activity performed in
extracting data from the database.
¤ Query processing includes certain activities for data retrieval. Initially, the user query is
given in a high-level database language such as SQL.
¤ It then gets translated into expressions that can be used at the physical level of the file
system. After this, the actual evaluation of the query and a variety of query-optimizing
transformations take place.
¤ Thus, before processing a query, the system needs a representation of the query that it can
work with internally, in addition to the human-readable and understandable form. SQL, or
Structured Query Language, is the most suitable choice for humans, but it is not perfectly
suitable for the internal representation of the query within the system. Relational algebra is
well suited for the internal representation of a query.
The translation process in query processing is similar to parsing a query. When a user executes
any query, the parser in the system checks the syntax of the query, verifies the name of the relation
in the database, the tuples, and finally the required attribute values, in order to generate the internal
form of the query. The parser creates a tree of the query, known as the 'parse tree', and then
translates it into a relational algebra form. In doing so, it also replaces all uses of views appearing
in the query.
Thus, we can understand the working of query processing from the diagram below (Figure 6).
Suppose a user executes a query. As we have learned, there are various methods of extracting data
from the database. In SQL, suppose a user wants to fetch the records of the employees whose salary
is greater than 10000. For doing this, the following query is written:
select emp_name from Employee where salary > 10000;
Thus, to make the system understand the user query, it needs to be translated into the form of
relational algebra. We can bring this query into relational algebra form as:
π emp_name (σ salary>10000 (Employee))
π emp_name (σ salary>10000 (π emp_name, salary (Employee)))
After translating the given query, we can execute each relational algebra operation by using
different algorithms. So, in this way, a query processing begins its working.
Evaluation
For this, in addition to the relational algebra translation, it is required to annotate the translated
relational algebra expression with the instructions that specify how to evaluate each operation.
Thus, after translating the user query, the system builds and executes a query evaluation plan.
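As an illustration of what such an annotated plan amounts to, the Employee query above can be pictured as a small tree of operators, each evaluated by a chosen algorithm (the table contents and the generator-based operator functions are invented for this sketch):

# Sketch of an evaluation plan for: select emp_name from Employee where salary > 10000
# Each node is annotated with the algorithm it uses (here: full scan, in-memory filter).
employee = [("Abebe", 12000), ("Sara", 9000), ("Kebede", 15000)]

def seq_scan(table):                      # access path chosen by the optimizer
    yield from table

def select(rows, predicate):              # sigma: salary > 10000
    return (r for r in rows if predicate(r))

def project(rows, columns):               # pi: keep only emp_name
    return ([r[c] for c in columns] for r in rows)

plan = project(select(seq_scan(employee), lambda r: r[1] > 10000), [0])
print(list(plan))                         # [['Abebe'], ['Kebede']]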
2.4. Optimization Process
The cost of query evaluation can vary for different types of queries. The system is
responsible for constructing the evaluation plan, so the user does not need to write the
query in its most efficient form.
Usually, a database system generates an efficient query evaluation plan that minimizes
its cost. This type of task, performed by the database system, is known as Query
Optimization.
For optimizing a query, the query optimizer should have an estimated cost analysis of each
operation. It is because the overall operation cost depends on the memory allocations to
several operations, execution costs, and so on.
Finally, after selecting an evaluation plan, the system evaluates the query and produces the
output of the query.
¤ There are various topics related to query optimization, discussed below.
2.4.1. Top-K Optimization
A database system is used for fetching data from it. Some user queries request results sorted on
certain attributes and require only the top K results for some K. Some queries specify the bound K,
for example with a limit K clause, which retrieves only the top K results; other queries do not
specify the bound at all. For such queries, the optimizer can be given a hint indicating that only the
top K results need to be retrieved, even if the query would generate more results. When the value
of K is small, a query evaluation plan that produces the entire set of results, sorts it, and then
returns the top K is ineffective and inefficient, since it is likely to discard most of the computed
intermediate results. Therefore, we use several methods to optimize such top-K queries.
Using pipelined query evaluation plans for producing the results in sorted order.
Estimating the highest value on the sorted attributes that will appear in the top K result and
introducing the selection predicates used for eliminating the larger values.
Anyhow, some extra tuples may be generated beyond the top-K results; such tuples are discarded.
If too few tuples are generated and the top K results are not reached, then we need to execute the
query again with a changed selection condition.
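A small sketch of why the top-K hint matters: keeping only K candidates while streaming (for example with a bounded heap) avoids materializing and sorting the whole result (the data and the value of K are invented):

import heapq

# 1,000,000 computed result rows; the user only wants the 10 smallest values.
rows = ((i * 7919) % 1000003 for i in range(1_000_000))

# Naive plan: materialize everything, sort, then cut -- most of the work is thrown away.
# top10 = sorted(rows)[:10]

# Top-K aware plan: stream the rows and keep only the current best 10 candidates.
top10 = heapq.nsmallest(10, rows)
print(top10)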
2.4.2. Join Minimization
Different join operations are used for processing a given user query. When queries are generated
through views, computing the query may require joining more relations than are actually needed.
To resolve such cases, we drop the unnecessary relations from the join. This type of solution is
known as Join Minimization. We have discussed only one such case; there are many more similar
cases in which join minimization can be applied.
Optimization of Updates
An update query is used to make changes in the already persisted data. An update query often
involves subqueries in the SET as well as the WHERE clause. So, while optimizing the update, both
these subqueries must also be included. For example, if a user wants to update the score to 97 for
the student in a student table whose roll_no is 102, the following update query will be used:
update student set score = 97 where roll_no = 102;
However, if the update involves a selection on the updated column, we need to handle such
updates carefully. If the update is done during a selection performed by an index scan, an updated
tuple may be re-inserted into the index ahead of the scan and encountered again. Also, several
problems can arise in the updating of subqueries whose result is affected by the update.
Halloween Problem
The problem was named so because it was first identified on Halloween Day at IBM. The problem
of an update that affects the execution of the very query associated with that update is known as
the Halloween problem. We can avoid this problem by breaking up the execution plan and
executing the following steps:
Creating a list of affected tuples
At last, updating the tuples and indices.
Thus, following these steps increases the execution cost of the query evaluation plan.
We can optimize update plans by checking whether the Halloween problem can occur; if it cannot
occur, the update is performed during the processing of the query, which reduces the update
overheads. For example, the Halloween problem cannot occur if the index attributes are not
affected by the updates. If they are affected but the updates only decrease the value, then even if
the index is scanned in increasing order, the scan will not encounter the updated tuples again. In
such cases, the index can be updated even while the query is being executed, which reduces the
overall cost and leads to an optimized update.
Another method of optimizing update queries that result in a large number of updates is to collect
all the updates as a batch. After collecting them, the batched updates are applied separately to each
affected index. Before applying a batch of updates to an index, the batch is sorted in index order
for that index; such sorting greatly reduces the amount of random I/O needed to update the indices.
Such optimization of updates is performed in most database systems.
2.4.3. Multi Query Optimization and Shared Scans
We can understand multi-query optimization as follows: when the user submits a batch of queries,
the query optimizer exploits common subexpressions between the different queries, evaluating
them once and reusing them whenever required. For complex queries too, we can exploit such
subexpressions, which consequently reduces the cost of the query evaluation plan. So, we need to
optimize the subexpressions shared by different queries. One way of doing this is the elimination
of common subexpressions, known as common subexpression elimination. This method optimizes
the subexpressions by computing and storing their result, and then reusing the result whenever the
subexpression occurs again. Only a few databases
39 | P a g e
perform the exploitation of common subexpressions among the evaluation plans, which are
selected for each of the batches of queries.
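As an illustration only (the orders and customers tables and their columns are assumed here, not taken from the text), two queries in a batch may share the same join-and-selection subexpression, which a multi-query optimizer can evaluate once and reuse:

-- Both queries share the subexpression "2023 orders joined with customers".
SELECT c.region, SUM(o.amount)
FROM orders o JOIN customers c ON o.cust_id = c.cust_id
WHERE o.order_year = 2023
GROUP BY c.region;

SELECT c.segment, COUNT(*)
FROM orders o JOIN customers c ON o.cust_id = c.cust_id
WHERE o.order_year = 2023
GROUP BY c.segment;
-- A multi-query optimizer may compute the shared join/selection once,
-- store the result, and reuse it for both aggregations.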
Some database systems implement another form of multi-query optimization known as sharing of relation scans between queries. A shared scan works as follows: the relation is not read repeatedly from disk; instead, its data is read from disk only once for all the queries that need to scan the relation, and the data is then pipelined to each of those queries. Shared-scan optimization is useful when multiple queries perform a scan on the same large relation, such as a fact table.
In the parametric query optimization method, a query is optimized without specifying values for its parameters. The optimizer outputs several plans, each optimal for some range of parameter values, and stores this set of alternative plans. At run time, when the actual parameter values are known, the cheapest plan in the stored set is selected. Such selection takes far less time than re-optimizing the query, so the optimizer produces an optimized, cost-effective plan for the given parameters.
In exhaustive search-based techniques, all possible query plans for a query are initially generated and then the best plan is selected. Although these techniques find the best solution, they have exponential time and space complexity owing to the large solution space; the dynamic programming technique is one example.
Heuristic-based optimization uses rule-based approaches for query optimization. These algorithms have polynomial time and space complexity, which is lower than the exponential complexity of exhaustive search-based algorithms; however, they do not necessarily produce the best query plan.
Perform select and project operations before join operations. This is done by moving the
select and project operations down the query tree. This reduces the number of tuples
available for join.
Perform the most restrictive select/project operations first, before the other operations.
Avoid cross-product operations, since they result in very large intermediate tables.
The first step of the optimizer is to generate expressions that are logically equivalent to the given expression. To implement this step, we use equivalence rules, which describe how to transform the given expression into a logically equivalent expression.
A query can be expressed in different ways, each with a different cost. To express a query efficiently, instead of working only with the given expression, we create alternative, equivalent expressions. Two relational-algebra expressions are equivalent if both expressions produce the same set of tuples on every legal database instance, where a legal database instance is one that satisfies all the integrity constraints specified in the database schema. The order of the generated tuples may differ between the two expressions; they are still considered equivalent as long as they produce the same set of tuples.
Equivalence Rules
An equivalence rule says that expressions of two forms are equivalent because both produce the same output on any legal database instance. This means we can replace an expression of the first form with one of the second form, and vice versa. The optimizer of the query-evaluation plan uses such equivalence rules to transform expressions into logically equivalent ones.
The optimizer uses various equivalence rules on relational-algebra expressions to transform them. For describing each rule, we use the following symbols: θ, θ0, θ1, θ2, ... denote selection or join predicates; L1, L2, L3, L4 denote lists of attributes; and E, E1, E2, E3 denote relational-algebra expressions.
Rule 1: Cascade of σ
This rule states that a conjunctive selection operation can be deconstructed into a sequence of individual selections. Such a transformation is known as a cascade of σ.
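In the standard notation (a sketch, with the predicates θ1 and θ2 written as subscripts of σ, and E a relational-algebra expression), the rule is:
σθ1 ∧ θ2 (E) = σθ1 (σθ2 (E))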
The theta-join operation is commutative:
E1 ⋈θ E2 = E2 ⋈θ E1 (θ appears as a subscript of the join symbol)
However, in the case of the theta join, the equivalence rule does not hold if the order of attributes is considered. Natural join is a special case of theta join, and natural join is also commutative.
Rule 3: Cascade of ∏
This rule states that in a sequence of projection operations, only the final projection is needed; the other projections can be omitted. Such a transformation is referred to as a cascade of ∏.
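In the standard notation (a sketch, with the attribute lists L1 ⊆ L2 ⊆ ... ⊆ Ln written as subscripts of ∏), the rule is:
∏L1 (∏L2 (... (∏Ln (E)) ...)) = ∏L1 (E)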
Rule 4: We can combine the selections with Cartesian products as well as theta joins
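In the standard notation (a sketch; the predicates appear as subscripts), the two forms of this rule are:
a) σθ1 (E1 × E2) = E1 ⋈θ1 E2
b) σθ1 (E1 ⋈θ2 E2) = E1 ⋈θ1 ∧ θ2 E2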
Theta joins are associative provided that θ2 involves attributes from E2 and E3 only. Any of these conditions may be empty; hence, the Cartesian product operation is also associative.
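A standard way to write the associativity rules (a sketch; the join predicates appear as subscripts, and θ2 involves attributes of E2 and E3 only) is:
(E1 ⋈ E2) ⋈ E3 = E1 ⋈ (E2 ⋈ E3) (natural join)
(E1 ⋈θ1 E2) ⋈θ2 ∧ θ3 E3 = E1 ⋈θ1 ∧ θ3 (E2 ⋈θ2 E3) (theta join)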
The selection operation distributes over the theta-join operation under the two following conditions:
a) When all attributes in the selection condition θ0 involve only the attributes of one of the expressions being joined.
b) When the selection condition θ1 involves only the attributes of E1, and θ2 involves only the attributes of E2.
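In the standard notation (a sketch; the predicates appear as subscripts), the two cases are:
a) σθ0 (E1 ⋈θ E2) = (σθ0 (E1)) ⋈θ E2, where θ0 involves only the attributes of E1
b) σθ1 ∧ θ2 (E1 ⋈θ E2) = (σθ1 (E1)) ⋈θ (σθ2 (E2))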
The projection operation distributes over the theta-join operation under the two following conditions:
a) The join condition θ involves only attributes in L1 ∪ L2 of E1 and E2.
b) Consider a join E1 ⋈ E2, where E1 and E2 have attribute sets L1 and L2, respectively. Let L3 be the attributes of E1 that are involved in the join condition θ but are not in L1 ∪ L2, and similarly let L4 be the attributes of E2 that are involved only in the join condition θ and not in L1 ∪ L2.
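In the standard notation (a sketch; the attribute lists appear as subscripts of ∏), the two cases give the following expressions:
a) ∏L1 ∪ L2 (E1 ⋈θ E2) = (∏L1 (E1)) ⋈θ (∏L2 (E2))
b) ∏L1 ∪ L2 (E1 ⋈θ E2) = ∏L1 ∪ L2 ((∏L1 ∪ L3 (E1)) ⋈θ (∏L2 ∪ L4 (E2)))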
The union and intersection operations are commutative:
E1 ∪ E2 = E2 ∪ E1
E1 ∩ E2 = E2 ∩ E1
However, the set difference operation is not commutative.
Rule 10: Distribution of selection operation on the intersection, union, and set difference
operations.
The expression below shows the distribution performed over the set difference operation. We can similarly distribute the selection operation over ∪ and ∩ by replacing − with ∪ or ∩. Further, we get:
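In the standard notation (a sketch, with P the selection predicate):
σP (E1 − E2) = σP (E1) − σP (E2), which also equals σP (E1) − E2
and, replacing − with ∪ or ∩:
σP (E1 ∪ E2) = σP (E1) ∪ σP (E2)
σP (E1 ∩ E2) = σP (E1) ∩ σP (E2)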
Rule 11: Distribution of the projection operation over the union operation.
This rule states that we can distribute the projection operation over the union operation for the given expressions.
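In the standard notation (a sketch, with L an attribute list that is valid for both E1 and E2):
∏L (E1 ∪ E2) = (∏L (E1)) ∪ (∏L (E2))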
Relational algebra is a procedural query language. It gives a step by step process to obtain the
result of the query. It uses operators to perform queries.
Types of Relational operation
1. Select Operation:
The select operation selects tuples that satisfy a given predicate. It is denoted by sigma (σ).
1. Notation: σ p(r)
Where: σ is the selection operator, r is a relation, and p is a propositional logic formula (the selection predicate), which may use connectives such as AND, OR, and NOT.
For example: LOAN Relation
Input:
1. σ BRANCH_NAME="perryride" (LOAN)
Output:
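For comparison, the same selection can be written in SQL (a sketch, assuming a LOAN table with a BRANCH_NAME column as in the example above):
SELECT * FROM LOAN WHERE BRANCH_NAME = 'perryride';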
2. Project Operation:
This operation shows the list of those attributes that we wish to appear in the result; the rest of the attributes are eliminated from the table.
It is denoted by ∏.
1. Notation: ∏ A1, A2, ..., An (r)
Where A1, A2, ..., An are attribute names of the relation r.
Input:
1. ∏ NAME, CITY (CUSTOMER)
Output:
NAME CITY
Jones Harrison
Smith Rye
Hays Harrison
Curry Rye
Johnson Brooklyn
Brooks Brooklyn
3. Union Operation:
Suppose there are two relations R and S. The union operation contains all the tuples that are either in R or in S or in both R and S.
It eliminates duplicate tuples. It is denoted by ∪.
1. Notation: R ∪ S
4. Set Intersection:
Suppose there are two relations R and S. The set intersection operation contains all tuples that are in both R and S.
It is denoted by ∩.
1. Notation: R ∩ S
Input:
1. ∏ CUSTOMER_NAME (BORROW) ∩ ∏ CUSTOMER_NAME (DEPOSITOR)
Output: CUSTOMER_NAME
Smith
Jones
5. Set Difference:
Suppose there are two relations R and S. The set difference operation contains all tuples that are in R but not in S.
It is denoted by minus (−).
1. Notation: R - S
Input:
1. ∏ CUSTOMER_NAME (BORROW) - ∏ CUSTOMER_NAME (DEPOSITOR)
Output:
CUSTOMER_NAME
Jackson
Hayes
Willians
Curry
6. Cartesian product
The Cartesian product is used to combine each row in one table with each row in the other
table. It is also known as a cross product.
It is denoted by X.
Notation: E X D
Example:
EMPLOYEE
EMP_ID EMP_NAME EMP_DEPT
1 Smith A
2 Harry C
3 John B
DEPARTMENT
DEPT_NO DEPT_NAME
A Marketing
B Sales
C Legal
Input:
EMPLOYEE X DEPARTMENT
Output:
EMP_ID EMP_NAME EMP_DEPT DEPT_NO DEPT_NAME
1 Smith A A Marketing
1 Smith A B Sales
1 Smith A C Legal
2 Harry C A Marketing
2 Harry C B Sales
2 Harry C C Legal
3 John B A Marketing
3 John B B Sales
3 John B C Legal
7. Rename Operation:
The rename operation is used to rename the output relation. It is denoted by rho (ρ).
Example: We can use the rename operator to rename STUDENT relation to STUDENT1.
1. ρ(STUDENT1, STUDENT)
2.8. Pipelining
Pipelining improves the efficiency of query evaluation by reducing the number of temporary files produced. We reduce the construction of temporary files by merging multiple operations into a pipeline: the result of one operation is passed to the next operation for its execution, and the chain continues until all operations are completed and we get the final output of the expression. This type of evaluation process is known as pipelined evaluation.
Advantages of Pipeline
It reduces the cost of query evaluation by eliminating the cost of reading and writing the
temporary relations, unlike the materialization process.
If we combine the root operator of a query-evaluation plan in a pipeline with its inputs, query results start being generated quickly. This benefits users, as they can view the results of their queries as soon as the outputs are generated; otherwise, users would have to wait a long time before seeing any query results.
Although both methods are used for evaluating multiple operations of an expression, there are a few differences between them, described below.
Pipelining:
It does not use any temporary relations for storing the results of the evaluated operations.
It is a more efficient way of query evaluation, as it quickly generates the results.
It requires memory buffers at a high rate for generating outputs; insufficient memory buffers will cause thrashing.
It optimizes the cost of query evaluation, as it does not include the cost of reading and writing the temporary storage.

Materialization:
It uses temporary relations for storing the results of the evaluated operations, so it needs more temporary files and I/O.
It is less efficient, as it takes time to generate the query results.
It does not have any higher requirements for memory buffers for query evaluation.
The overall cost includes the cost of the operations plus the cost of reading and writing results to temporary storage.
Chapter 3
Database Integrity, Security and Recovery
3.1. Integrity
The term data integrity refers to the accuracy and consistency of data. When creating databases,
attention needs to be given to data integrity and how to maintain it. A good database will enforce
data integrity whenever possible.
For example, a user could accidentally try to enter a phone number into a date field. If the system
enforces data integrity, it will prevent the user from making these mistakes.
Maintaining data integrity means making sure the data remains intact and unchanged throughout
its entire life cycle. This includes the capture of the data, storage, updates, transfers, backups, etc.
Every time data is processed there’s a risk that it could get corrupted (whether accidentally or
maliciously).
A fire sweeps through the building, burning the database computer to a cinder.
The regular backups of the database have been failing for the past two months…
It’s not hard to think of many more scenarios where data integrity is at risk. Many of these risks
can be addressed from within the database itself (for example, through the use of data types and constraints on each column, encryption, etc.), while others can be addressed through other
features of the DBMS (such as regular backups – and testing that the backups do actually restore
the database as expected).
Some of these require other (non-database related) factors to be present, such as an offsite backup
location, a properly functioning IT network, proper training, security policies, etc.
In the database world, data integrity is often placed into the following types:
Entity integrity
Referential integrity
Domain integrity
User-defined integrity
Entity Integrity
Entity integrity defines each row to be unique within its table. No two rows can be the same.
To achieve this, a primary key can be defined. The primary key field contains a unique identifier
– no two rows can contain the same unique identifier.
Referential Integrity
Referential integrity is concerned with relationships. When two or more tables have a
relationship, we have to ensure that the foreign key value matches the primary key value at all
times. We don’t want to have a situation where a foreign key value has no matching primary key
value in the primary table. This would result in an orphaned record.
Enforcing referential integrity prevents actions such as:
Adding records to a related table if there is no associated record in the primary table.
Changing values in a primary table that result in orphaned records in a related table.
Deleting records from a primary table if there are matching related records.
Domain Integrity
Domain integrity concerns the validity of entries for a given column. Selecting the appropriate
data type for a column is the first step in maintaining domain integrity. Other steps could include setting up appropriate constraints and rules to define the data format and/or restricting the range of possible values.
User-Defined Integrity
User-defined integrity allows the user to apply business rules to the database that aren’t covered
by any of the other three data integrity types.
Types of Constraints
1. Domain constraints
Domain constraints define the valid set of values for an attribute. The data types of a domain include string, character, integer, time, date, currency, etc. The value of the attribute must come from the corresponding domain.
Example:
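A minimal sketch in SQL (the PERSON table and its columns are assumed for illustration; the data type and the CHECK constraint restrict values to the valid domain):
CREATE TABLE PERSON (
  P_ID  INT,
  NAME  VARCHAR(50),
  AGE   INT CHECK (AGE >= 18)   -- AGE values must come from the valid domain
);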
2. Entity integrity constraints
The entity integrity constraint states that a primary key value cannot be null. This is because the primary key value is used to identify individual rows in a relation, and if the primary key had a null value, we could not identify those rows.
A table can contain a null value other than the primary key field.
Example:
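A minimal sketch in SQL (the EMPLOYEE table is assumed for illustration; the primary key column can never be NULL, while other columns may contain NULL):
CREATE TABLE EMPLOYEE (
  EMP_ID   INT PRIMARY KEY,   -- cannot be NULL; identifies each row
  EMP_NAME VARCHAR(50)        -- may contain NULL values
);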
3. Referential Integrity Constraints
A referential integrity constraint is specified between two tables: if a foreign key in table 1 refers to the primary key of table 2, then every value of the foreign key in table 1 must either be null or be available in table 2.
Example:
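A minimal sketch in SQL (the DEPARTMENT and EMP tables are assumed for illustration; DEPT_NO in EMP must be NULL or match an existing DEPARTMENT row):
CREATE TABLE DEPARTMENT (
  DEPT_NO   INT PRIMARY KEY,
  DEPT_NAME VARCHAR(30)
);

CREATE TABLE EMP (
  EMP_ID  INT PRIMARY KEY,
  DEPT_NO INT,
  FOREIGN KEY (DEPT_NO) REFERENCES DEPARTMENT (DEPT_NO)
);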
4. Key constraints
Keys are attributes that uniquely identify an entity within its entity set. An entity set can have multiple keys, but one of them will be the primary key. A primary key must contain a unique value and cannot contain a null value in the relational table.
Example:
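A minimal sketch in SQL (the ACCOUNT table is assumed; ACCOUNT_NO is the primary key and EMAIL is an alternate candidate key declared UNIQUE):
CREATE TABLE ACCOUNT (
  ACCOUNT_NO INT PRIMARY KEY,     -- primary key: unique and NOT NULL
  EMAIL      VARCHAR(50) UNIQUE   -- alternate key: values must be unique
);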
3.2. Security
Database security refers to the range of tools, controls, and measures designed to establish and preserve database confidentiality, integrity, and availability. This section focuses primarily on confidentiality, since it is the element that is compromised in most data breaches.
Database security is a complex and challenging endeavor that involves all aspects of information
security technologies and practices. It’s also naturally at odds with database usability. The more
accessible and usable the database, the more vulnerable it is to security threats; the more
invulnerable the database is to threats, the more difficult it is to access and use.
Why is it important?
By definition, a data breach is a failure to maintain the confidentiality of data in a database. How
much harm a data breach inflicts on your enterprise depends on a number of consequences or
factors:
Business continuity (or lack thereof): Some businesses cannot continue to operate until a breach is resolved.
Fines or penalties for non-compliance: The financial impact of failing to comply with global regulations such as the Sarbanes-Oxley Act (SOX) or the Payment Card Industry Data Security Standard (PCI DSS), industry-specific data privacy regulations such as HIPAA, or regional data privacy regulations such as Europe's General Data Protection Regulation (GDPR) can be devastating, with fines in the worst cases exceeding several million dollars per violation.
Database security begins with physical security for the systems that host the database management system (DBMS). A database management system is not safe from intrusion, corruption, or destruction by people who have physical access to the computers. Once physical security has been established, the database must be protected from unauthorized access by authorized users as well as by unauthorized users. There are three main objectives when designing a secure database system, and anything that prevents the database management system from achieving these goals is considered a threat to database security.
Integrity loss − Integrity loss occurs when unacceptable operations are performed upon
the database either accidentally or maliciously. This may happen while creating, inserting,
updating or deleting data. It results in corrupted data leading to incorrect decisions.
Secrecy: Data should not be disclosed to unauthorized users. For example, a student
should not be allowed to see and change other student grades.
Denial of service attack: This attack makes a database server much slower or even completely unavailable to users. Although a DoS attack does not result in the disclosure or loss of database information, it can cost the victims much time and money.
Sniff attack: To support e-commerce and take advantage of distributed systems, databases are designed in a client-server mode. Attackers can use sniffer software to monitor data streams and acquire confidential information, for example the credit card number of a customer.
Spoofing attack: Attackers forge a legal web application to access the database, and then retrieve data from the database and use it for fraudulent transactions. The most common spoofing attacks are TCP spoofing, used to forge IP addresses, and DNS spoofing, used to forge the mapping between an IP address and a DNS name.
Trojan horse: A Trojan horse is a malicious program that embeds itself into the system. It can modify the database and reside in the operating system.
3.3.2. Measures of Control
The measures of control can be broadly divided into the following categories −
Access Control − Access control includes security mechanisms in a database management
system to protect against unauthorized access. A user can gain access to the database after
clearing the login process through only valid user accounts. Each user account is password
protected.
Flow Control − Distributed systems encompass a lot of data flow from one site to another
and also within a site. Flow control prevents data from being transferred in such a way that
it can be accessed by unauthorized agents. A flow policy lists out the channels through
which information can flow. It also defines security classes for data as well as transactions.
Data Encryption − Data encryption refers to encoding data when sensitive data must be communicated over public channels. Even if an unauthorized agent gains access to the data, he cannot understand it, since it is in an incomprehensible format.
RAID: Redundant Array of Independent Disks, which protects against data loss due to disk failure.
Authentication: Access to the database is a matter of authentication. It provides the guidelines for how the database is accessed, and every access should be monitored.
Backup: Backups should be taken regularly so that, in case of any disaster, organizations can retrieve their data.
Anonymous access
Basic password authentication
The settings in the database ACLs work together with the "Maximum Internet name & password" setting for each database to control the level of access that web browser users have to a database on the Sametime server.
Generally, administrators should not need to change the "Maximum Internet name & password" settings for databases on the Sametime server; the default settings should function adequately in most cases.
Authentication
Access rights
Integrity constraints
¤ Authentication
In a distributed database system, authentication is the process through which only legitimate users
can gain access to the data resources.
Controlling access to the client computer − At this level, user access is restricted while logging in to the client computer that provides the user interface to the database server. The most common method is a username/password combination; however, more sophisticated methods like biometric authentication may be used for high-security data.
¤ Access Rights
A user's access rights refer to the privileges that the user is given regarding DBMS operations, such as the rights to create a table, drop a table, add/delete/update tuples in a table, or query the table.
In distributed environments, since there are a large number of tables and an even larger number of users, it is not feasible to assign individual access rights to each user. So, the DDBMS defines certain roles. A role is a construct with certain privileges within a database system. Once the different roles are defined, each individual user is assigned one of these roles. A hierarchy of roles is often defined according to the organization's hierarchy of authority and responsibility.
For example, the following SQL statements create a role "Accountant" and then assigns this role
to user "ABC".
CREATE ROLE ACCOUNTANT;
GRANT ACCOUNTANT TO ABC;
COMMIT;
For example, let us consider that a table "HOSTEL" has three fields - the hostel number, hostel
name and capacity. The hostel number should start with capital letter "H" and cannot be NULL,
and the capacity should not be more than 150. The following SQL command can be used for data
definition −
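A sketch of such a data definition (the column names are assumed; the CHECK constraints enforce the rules stated above):
CREATE TABLE HOSTEL (
  H_NO     VARCHAR(5) NOT NULL CHECK (H_NO LIKE 'H%'),
  H_NAME   VARCHAR(30),
  CAPACITY INT CHECK (CAPACITY <= 150)
);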
Entity integrity control enforces rules so that each tuple can be uniquely identified. The entity integrity constraint states that no two tuples in a table can have identical values for primary keys and that no field which is a part of the primary key can have a NULL value.
For example, in the above hostel table, the hostel number can be assigned as the primary key
through the following SQL statement (ignoring the checks) −
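One way to express this is sketched below (restating the table without the CHECK constraints, as indicated):
CREATE TABLE HOSTEL (
  H_NO     VARCHAR(5) PRIMARY KEY,   -- entity integrity: unique and NOT NULL
  H_NAME   VARCHAR(30),
  CAPACITY INT
);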
For example, let us consider a student table where a student may opt to live in a hostel. To include
this, the primary key of hostel table should be included as a foreign key in the student table. The
following SQL statement incorporates this −
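A sketch of the student table (the column names are assumed; S_HOSTEL is the foreign key referencing the hostel table):
CREATE TABLE STUDENT (
  S_ROLL   INT PRIMARY KEY,
  S_NAME   VARCHAR(30),
  S_HOSTEL VARCHAR(5),
  FOREIGN KEY (S_HOSTEL) REFERENCES HOSTEL (H_NO)
);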
3.6. Encryption
Encryption helps us to secure data that we send, receive, and store. This can include text messages saved on our cell phones, logs stored on our fitness watches, and banking details sent through our online accounts.
Encryption is the process of scrambling readable text so that only the individual who has the secret access code, or decryption key, can read it. It helps provide data security for sensitive information.
A large volume of personal information is handled electronically and maintained in the cloud or on servers with an ongoing connection to the web. It is nearly impossible to do business of any kind without our personal data ending up in an organization's networked computer system, which is why it is important to know how to help keep that information private.
Encryption is the procedure of taking ordinary text, such as a text message or email, and scrambling it into an unreadable format known as "cipher text." This helps protect digital information that is either stored on computer systems or transmitted through a network such as the internet.
The cipher text is converted back to its real form when the intended recipient accesses the message, a process known as decryption. A "secret" encryption key, a string of algorithms that scrambles and unscrambles data back to a readable form, must be used by both the sender and the receiver to get the code.
3.6.1. Symmetric and Asymmetric Encryption
An encryption key is a sequence of numbers used to encrypt and decrypt data. Encryption keys are constructed with algorithms, and each key is random and unique.
Symmetric encryption and asymmetric encryption are the two kinds of encryption schemes. Symmetric encryption uses a single secret key that both the sender and the receiver must share, while asymmetric encryption uses a pair of keys: a public key for encryption and a private key for decryption.
¤ Types of Encryption
There are various types of encryption, and every encryption type is created as per the needs of the
professionals and keeping the security specifications in mind. The most common encryption types
are as follows.
¤ Data Encryption Standard (DES)
The Data Encryption Standard is an example of low-level encryption. The U.S. government established the standard in 1977. Due to advances in technology and reductions in hardware costs, DES is now largely obsolete for securing confidential data.
¤ Triple DES
Triple DES applies the DES encryption three times: it first encrypts the data, then decrypts it, and then encrypts it again. It improves on the original DES standard, which came to be considered too weak a form of encryption for sensitive data.
¤ RSA
RSA takes its name from the family-name initials of three computer scientists. It uses a strong and popular algorithm for encryption. Because of its key length, RSA is popular and widely used for secure data transmission.
¤ Advanced Encryption Standard (AES)
The Advanced Encryption Standard has been the U.S. government standard since 2002, and AES is used worldwide.
¤ Two-Fish
Twofish is regarded as one of the fastest encryption algorithms and is free for anyone to use. It can be used in both hardware and software.
Most legitimate websites use what is known as "secure sockets layer" (SSL), which is a procedure for encrypting data when it is sent to and from a website. It prevents attackers from accessing the information while it is in transit.
To confirm that we are performing encrypted online transactions safely, look for the padlock icon in the URL bar and the "s" in "https".
We store confidential information or submit it online. It is a good idea to check that sites use SSL whenever we use the internet to perform tasks such as making transactions, filing our taxes, renewing our driver's license, or doing other personal business.
Our job requires it. Our workplace may have encryption protocols, or it may be subject to regulations that require encryption. In these cases, encryption is a must.
¤ Why encryption matters
There are several reasons to use encryption in our day-to-day life:
1. Encryption helps protect our privacy online by translating sensitive information into "for your eyes only" messages intended only for the parties that need them, and no one else. We should make sure our emails are sent over an encrypted network, or that each message is itself encrypted; most email clients offer an encryption option in their Settings menu, and if we check our email with a web browser, we should take a moment to ensure that SSL encryption is available.
2. Cybercrime is a global industry, often run by international organizations. Many of the large-scale data thefts we read about in the news show that cybercriminals really do steal personal information for financial gain.
3. Regulations demand it
The Health Insurance Portability and Accountability Act (HIPAA) requires healthcare providers to implement security features that help protect patients' sensitive health information online. Retailers must comply with the Fair Credit Practices Act (FCPA) and similar regulations that help protect consumers. Encryption helps businesses stay compliant with regulatory guidelines and requirements, and it also helps protect their customers' valuable data.
Encryption is intended to secure our data, but it is also possible to use encryption against us.
Targeted ransomware, for example, is a cybercrime that can impact organizations, including
government agencies, of all sizes. Also, ransomware can attack individual users of computers.
¤ How do attacks involving ransomware occur?
Attackers deploy ransomware in an attempt to encrypt various devices, including computers and servers. The attackers then demand a ransom before they will provide a key to decrypt the encrypted data. Ransomware attacks on government departments can shut down services, making it impossible, for example, to obtain a permit, obtain a marriage license, or pay a tax bill.
Targeted attacks mostly hit large organizations, but individuals can also experience ransomware attacks. Some practices we should always keep in mind to stay safe from such attacks:
Install and use trusted security software on all of our devices, including our cell phones.
Keep our security software up to date; it will help protect our devices against cyberattacks.
Update our operating system and other software regularly; updates fix security vulnerabilities.
Avoid reflexively opening email attachments, because email is one of the main methods for distributing ransomware.
Be wary of any email attachment that advises us to enable macros to view its content; if macros are enabled, macro malware can infect multiple files.
Back up our data to an external hard drive. If we become the victim of a ransomware attack, we will likely be able to restore our files once the malware has been cleaned up.
Consider using cloud services, which can help mitigate a ransomware infection, since many cloud providers retain previous versions of files, allowing us to 'roll back' to an unencrypted form. Do not pay any ransom: we might pay in the hope of getting our files back, but there is no assurance that cybercriminals will release our data. Encryption is important for protecting our confidential personal details, but it can also be used against us in ransomware attacks, so it is wise to take steps that let us reap the benefits while preventing the harm.
¤ How is encrypted data deciphered?
With the support of a key, an algorithm, a decoder or something similar, the intended recipient of
the encrypted data will decrypt it. If the data and the encryption process are in the digital domain,
the intended user may use the necessary decryption tool to access the information they need.
For decryption purposes, the item used can be referred to as the key, cipher or algorithm. We will
find specific details about each of them below.
Cipher: The word cipher refers to an algorithm primarily used for the purposes of encryption. A
cipher consists of a series of successive steps at the end of which it decrypts the encrypted
information. Two major types of ciphers exist: stream ciphers and block ciphers.
Algorithm: Algorithms are the procedures that the encryption process follows. Various types of algorithms are used to decrypt encrypted files and data; some of these include Blowfish, Triple DES, and RSA. In addition to algorithms and ciphers, it is possible to use brute force to decode an encrypted text.
Chapter Four
Distributed Database
4.1. Distributed Database overview
Advantages:
Modules can be added to the distributed database without affecting other modules.
Improved communications.
Economics: it costs less to create a network of smaller computers with the power of a single large computer.
Improved availability: a fault in one database system will only affect one fragment, instead of the entire database.
Processor independence.

Disadvantages:
Increased complexity and a more extensive infrastructure mean extra labour costs.
Security: remote database fragments must be secured, and the infrastructure must also be secured (e.g., by encrypting the network links between remote sites).
Lack of standards.
Increased storage requirements.
Difficult to maintain integrity: enforcing integrity over a network may require too many networking resources to be feasible.
a) Homogeneous DDBMSs: integrate only one type of centralized DBMS over a network; every site runs the same DBMS software.
¤ Fully heterogeneous DDBMS: Support different DBMSs that may even support different data
models (relational, hierarchical, or network) running under different computer systems, such
as mainframes and microcomputers
The components of a distributed database system include:
Computer workstations
Network hardware and software
Communications media
Transaction processor (or, application processor, or transaction manager): Software
component found in each computer that requests data
Data processor or data manager: Software component residing on each computer that stores
and retrieves data located at the site, may be a centralized DBMS
DDBMS must perform all the functions of a centralized DBMS, and must handle all necessary
functions imposed by the distribution of data and processing
Provide network - wide concurrency control and recovery procedures
Provide data translation in heterogeneous systems
Vertically Fragmented Table Contents (fragments stored at Site 1 and Site 2)
Replication Scenarios:
Partitioned data allocation - Database is divided into several disjointed parts (fragments)
and stored at several sites
Replicated data allocation - Copies of one or more database fragments are stored at
several sites
Data distribution over a computer network is achieved through data partition, data
replication, or a combination of both
Query Mapping. The input query on distributed data is specified formally using a query
language. It is then translated into an algebraic query on global relations. This translation is done
by referring to the global conceptual schema and does not take into account the actual distribution
and replication of data. Hence, this translation is largely identical to the one performed in a
centralized DBMS. It is first normalized, analyzed for semantic errors, simplified, and finally
restructured into an algebraic query.
Local Query Optimization. This stage is common to all sites in the DDB. The techniques
are similar to those used in centralized systems.
The first three stages discussed above are performed at a central control site, while the last stage
is performed locally.
We discussed the issues involved in processing and optimizing a query in a centralized DBMS in Chapter Two. In a distributed system, several additional factors further complicate query processing. The first is the cost of transferring data over the network. This data includes intermediate files
that are transferred to other sites for further processing, as well as the final result files that may
have to be transferred to the site where the query result is needed. Although these costs may not
be very high if the sites are connected via a high-performance local area network, they become
quite significant in other types of networks. Hence, DDBMS query optimization algorithms
consider the goal of reducing the amount of data transfer as an optimization criterion in choosing
a distributed query execution strategy.
For each employee, retrieve the employee name and the name of the department for which the
employee works. This can be stated as follows in the relational algebra:
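A sketch of the query in relational algebra, together with the site statistics that are consistent with the transfer sizes used below (EMPLOYEE: 10,000 records of 100 bytes each, 1,000,000 bytes in total, stored at site 1; DEPARTMENT: 100 records of 35 bytes each, 3,500 bytes in total, stored at site 2):
Q: πFname, Lname, Dname (EMPLOYEE ⋈Dno=Dnumber DEPARTMENT)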
The result of this query will include 10,000 records, assuming that every employee is related to a
department. Suppose that each record in the query result is 40 bytes long.
The query is submitted at a distinct site 3, which is called the result site because the query result
is needed there. Neither the EMPLOYEE nor the DEPARTMENT relations reside at site 3. There
are three simple strategies for executing this distributed query:
1. Transfer both the EMPLOYEE and the DEPARTMENT relations to the result site, and
perform the join at site 3. In this case, a total of 1,000,000 + 3,500 = 1,003,500 bytes must
be transferred.
2. Transfer the EMPLOYEE relation to site 2, execute the join at site 2, and send the result
to site 3. The size of the query result is 40 * 10,000 = 400,000 bytes, so 400,000 +
1,000,000 = 1,400,000 bytes must be transferred.
3. Transfer the DEPARTMENT relation to site 1, execute the join at site 1, and send the
result to site 3. In this case, 400,000 + 3,500 = 403,500 bytes must be transferred.
If minimizing the amount of data transfer is our optimization criterion, we should choose strategy 3. Now consider another query Q′: For each department, retrieve the department name and the name of the department manager. This can be stated as follows in the relational algebra:
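A sketch of this query in relational algebra (attribute names from the same assumed schema):
Q′: πFname, Lname, Dname (DEPARTMENT ⋈Mgr_ssn=Ssn EMPLOYEE)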
Again, suppose that the query is submitted at site 3. The same three strategies for executing query Q apply to Q′, except that the result of Q′ includes only 100 records, assuming that each department has a manager:
1. Transfer both the EMPLOYEE and the DEPARTMENT relations to the result site, and
perform the join at site 3. In this case, a total of 1,000,000 + 3,500 = 1,003,500 bytes must
be transferred.
2. Transfer the EMPLOYEE relation to site 2, execute the join at site 2, and send the result
to site 3. The size of the query result is 40 * 100 = 4,000 bytes, so 4,000 + 1,000,000 =
1,004,000 bytes must be transferred.
3. Transfer the DEPARTMENT relation to site 1, execute the join at site 1, and send the
result to site 3. In this case, 4,000 + 3,500 = 7,500 bytes must be transferred.
Again, we would choose strategy 3—this time by an overwhelming margin over strategies 1 and
2. The preceding three strategies are the most obvious ones for the case where the result site (site
3) is different from all the sites that contain files involved in the query (sites 1 and 2). However,
suppose that the result site is site 2; then we have two simple strategies:
1. Transfer the EMPLOYEE relation to site 2, execute the query, and present the result to the user at site 2. Here, the same number of bytes, 1,000,000, must be transferred for both Q and Q′.
2. Transfer the DEPARTMENT relation to site 1, execute the query at site 1, and send the result back to site 2. In this case, 400,000 + 3,500 = 403,500 bytes must be transferred for Q and 4,000 + 3,500 = 7,500 bytes for Q′.
A more complex strategy, which sometimes works better than these simple strategies, uses an
operation called semijoin. We introduce this operation and discuss distributed execution using
semijoins next.
The idea behind distributed query processing using the semijoin operation is to reduce the number
of tuples in a relation before transferring it to another site. Intuitively, the idea is to send the joining
column of one relation R to the site where the other relation S is located; this column is then joined
with S. Following that, the join attributes, along with the attributes required in the result, are
projected out and shipped back to the original site and joined with R. Hence, only the joining
column of R is transferred in one direction, and a subset of S with no extraneous tuples or attributes
is transferred in the other direction. If only a small fraction of the tuples in S participate in the join,
this can be quite an efficient solution to minimizing data transfer.
Project the join attributes of DEPARTMENT at site 2, and transfer them to site 1. For Q, we transfer F = πDnumber(DEPARTMENT), whose size is 4 * 100 = 400 bytes, whereas, for Q′, we transfer F′ = πMgr_ssn(DEPARTMENT), whose size is 9 * 100 = 900 bytes.
Join the transferred file with the EMPLOYEE relation at site 1, and transfer the required attributes from the resulting file to site 2. For Q, we transfer R = πDno, Fname, Lname(F ⋈Dnumber=Dno EMPLOYEE), whose size is 34 * 10,000 = 340,000 bytes, whereas, for Q′, we transfer R′ = πMgr_ssn, Fname, Lname(F′ ⋈Mgr_ssn=Ssn EMPLOYEE), whose size is 39 * 100 = 3,900 bytes.
Execute the query by joining the transferred file R or R′ with DEPARTMENT, and present the result to the user at site 2.
Using this strategy, we transfer 340,400 bytes for Q and 4,800 bytes for Q′. We limited the EMPLOYEE attributes and tuples transmitted to site 2 in step 2 to only those that will actually be joined with a DEPARTMENT tuple in step 3. For query Q, this turned out to include all EMPLOYEE tuples, so little improvement was achieved. However, for Q′ only 100 out of the 10,000 EMPLOYEE tuples were needed.
The semijoin operation was devised to formalize this strategy. A semijoin operation R ⋉A=B S, where A and B are domain-compatible attributes of R and S, respectively, produces the same result as the relational algebra expression πR(R ⋈A=B S). In a distributed environment where R and S reside at different sites, the semijoin is typically implemented by first transferring F = πB(S) to the site where R resides and then joining F with R, thus leading to the strategy discussed here.
4.7. Query and Update Decomposition
In a DDBMS with no distribution transparency, the user phrases a query directly in terms of
specific fragments. For example, consider another query Q: Retrieve the names and hours per week for each employee who works on some project controlled by department 5, which is specified on the distributed database where the fragments are stored at sites 2 and 3 and the full relations are stored at site 1, as in our earlier example. A user who submits such a query must specify whether it references the PROJS_5 and WORKS_ON_5 fragments at site 2 or the PROJECT and WORKS_ON relations at site 1. The user must also maintain the consistency of replicated data items when updating a DDBMS with no replication transparency.
On the other hand, a DDBMS that supports full distribution, fragmentation, and replication
transparency allows the user to specify a query or update request on the schema in Figure 3.5 just
as though the DBMS were centralized. For updates, the DDBMS is responsible for
maintaining consistency among replicated items by using one of the distributed concurrency control algorithms. For queries, a query decomposition module
must break up or decompose a query into subqueries that can be executed at the individual sites.
Additionally, a strategy for combining the results of the subqueries to form the query result must
be generated. Whenever the DDBMS determines that an item referenced in the query is replicated,
it must choose or materialize a particular replica during query execution.
To determine which replicas include the data items referenced in a query, the DDBMS refers to
the fragmentation, replication, and distribution information stored in the DDBMS catalog. For
vertical fragmentation, the attribute list for each fragment is kept in the catalog. For horizontal
fragmentation, a condition, sometimes called a guard, is kept for each fragment. This is basically
a selection condition that specifies which tuples exist in the fragment; it is called a guard
because only tuples that satisfy this condition are permitted to be stored in the fragment. For
mixed fragments, both the attribute list and the guard condition are kept in the catalog.
In our earlier example, the guard conditions for fragments at site 1 are TRUE (all tuples), and the
attribute lists are * (all attributes). For the fragments at sites 2 and 3, we have the guard conditions and attribute lists shown below. When the DDBMS decomposes an update request, it can determine
which fragments must be updated by examining their guard conditions. For example, a user request
to insert a new EMPLOYEE tuple <‘Alex’, ‘B’, ‘Coleman’, ‘345671239’, ‘22-APR-64’, ‘3306
Sandstone, Houston, TX’, M, 33000, ‘987654321’, 4> would be decomposed by the DDBMS into
two insert requests: the first inserts the preceding tuple in the EMPLOYEE fragment at site 1, and
the second inserts the projected tuple <‘Alex’, ‘B’, ‘Coleman’, ‘345671239’, 33000,
‘987654321’, 4> in the EMPD4 fragment at site 3.
For query decomposition, the DDBMS can determine which fragments may contain the required
tuples by comparing the query condition with the guard conditions, which are listed below for the fragments at sites 2 and 3:
(a) EMPD5
attribute list: Fname, Minit, Lname, Ssn, Salary, Super_ssn, Dno guard condition: Dno=5
DEP5
attribute list: * (all attributes Dname, Dnumber, Mgr_ssn, Mgr_start_date) guard condition:
Dnumber=5
DEP5_LOCS
PROJS5
attribute list: * (all attributes Pname, Pnumber, Plocation, Dnum) guard condition: Dnum=5
WORKS_ON5
(b) EMPD4
attribute list: Fname, Minit, Lname, Ssn, Salary, Super_ssn, Dno guard condition: Dno=4
DEP4
attribute list: * (all attributes Dname, Dnumber, Mgr_ssn, Mgr_start_date) guard condition:
Dnumber=4
DEP4_LOCS
PROJS4
attribute list: * (all attributes Pname, Pnumber, Plocation, Dnum) guard condition: Dnum=4
WORKS_ON4
attribute list: * (all attributes Essn, Pno, Hours) guard condition: Essn IN (πSsn (EMPD4))
For example, consider the query Q: Retrieve the names and hours per week for each
employee who works on some project controlled by department 5. This can be specified in SQL
on the schema:
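A sketch of the SQL form of this query (attribute names from the assumed schema above):
SELECT Fname, Lname, Hours
FROM EMPLOYEE, WORKS_ON, PROJECT
WHERE Dnum = 5 AND Pnumber = Pno AND Essn = Ssn;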
Suppose that the query is submitted at site 2, which is where the query result will be needed. The
DDBMS can determine from the guard condition on PROJS5 and WORKS_ON5 that all tuples
satisfying the conditions (Dnum = 5 AND Pnumber = Pno) reside at site 2. Hence, it may
decompose the query into the following relational algebra subqueries:
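One plausible decomposition, consistent with the description that follows (T1 is evaluated at site 2 and its Essn column is shipped to site 1; T2 is evaluated at site 1 and shipped back to site 2):
T1 = πEssn (PROJS5 ⋈Pnumber=Pno WORKS_ON5)
T2 = πEssn, Fname, Lname (T1 ⋈Essn=Ssn EMPLOYEE)
The final result is then computed at site 2 by joining T2 with WORKS_ON5 and projecting Fname, Lname, and Hours.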
This decomposition can be used to execute the query by using a semijoin strategy. The DDBMS
knows from the guard conditions that PROJS5 contains exactly those tuples satisfying (Dnum =
5) and that WORKS_ON5 contains all tuples to be joined with PROJS5; hence, subquery T1 can
be executed at site 2, and the projected column Essn can be sent to site 1. Subquery T2 can then be
executed at site 1, and the result can be sent back to site 2, where the final query result is calculated
and displayed to the user. An alternative strategy would be to send the query Q itself to site 1,
which includes all the database tuples, where it would be executed locally and from which the
result would be sent back to site 2. The query optimizer would estimate the costs of both strategies
and would choose the one with the lower cost estimate.
Distribution transparency
Transaction transparency
Performance transparency
This refers to freedom for the user from the operational details of the network. Allows
management of a physically dispersed database as though it were a centralized database
Fragmentation transparency
Location transparency
Local mapping transparency
Table 20: A Summary of Transparency Features
Replica transparency: DDBMS’s ability to hide the existence of multiple copies of data
from the user. Copies of data may be stored at multiple sites for better availability,
performance, and reliability.
Transaction transparency ensures that database transactions will maintain the distributed database's integrity and consistency. A distributed transaction accesses data stored at more than one location. Each transaction is divided into a number of subtransactions, one for each site that has to be accessed, and the DDBMS must ensure the indivisibility of both the global transaction and each of the subtransactions. We should distinguish between distributed requests and distributed transactions:
Remote request: Lets a single SQL statement access data to be processed by a single remote
database processor
Distributed transaction: Can update or request data from several different remote sites on a network; it allows a transaction to reference several different (local or remote) DP sites.
Distributed request: Lets a single SQL statement reference data located at several different local
or remote DP sites
Figure 14: Another Distributed Request
Which location to use.
The distributed query processor (DQP) produces an execution strategy optimized with respect to some cost function.
Typically, the costs associated with a distributed request include:
I/O cost;
CPU cost;
Communication cost.
The objective of the query optimization routine is to minimize the total cost associated with the execution of a request.
SELECT S.S_id
Colour = ‘red’ ;
5. restrict Parts to tuples containing red parts, move the result to A and process there
6. think of other possibilities …
There is an extra dimension added by the site where the query was issued.
Costs associated with a request are a function of the:
Access time (I/O) cost
Communication cost
CPU time cost
Must provide distribution transparency as well as replica transparency
Query optimization techniques:
o Manual (performed by the end user or programmer) or automatic (performed by the DDBMS)
o Static (takes place at compilation time) or dynamic (takes place at run time)
o Statistically based (uses information about the database such as its size, number of records, and average access time; performed by the DDBMS) or rule-based (based on a set of user-defined rules to determine the best query access strategy; defined by the end user)
Multisite, multiple-process operations are much more likely to create data inconsistencies and
deadlocked transactions than are single-site systems
4.10.1. Two-Phase Commit Protocol
Distributed databases make it possible for a transaction to access data at several sites
The objective of the 2PC protocol is to ensure that all nodes commit their part of the
transaction.
Final COMMIT must not be issued until all sites have committed their parts of the
transaction
Two-phase commit protocol requires:
DO-UNDO-REDO protocol: used by DP to roll back and/or roll forward transactions
with the help of transaction log entries.
DO performs the operation and writes the before and after values in the transaction log.
UNDO reverses the operation, using the transaction log entry written by the DO operation.
REDO redoes the operation, using the log entries.
Write-ahead protocol: forces the log entry to be written to permanent storage before the actual operation takes place. Each individual DP's transaction log entry must be written before the database fragment is actually updated.
The 2PC protocol defines operations between two types of nodes:
Coordinator: the TP at the site where the transaction is executed.
Subordinates or cohorts: the DPs at the sites where data affected by the transaction is located.
Phase 1: Preparation
Phase 2: the Final COMMIT
The coordinator broadcasts a COMMIT message to all subordinates and waits for replies.
Each subordinate receives the message, then updates the database using the DO protocol.
The subordinates reply with a COMMITTED or NOT COMMITTED message to the coordinator.
If one or more subordinates did not commit, the coordinator sends an ABORT message.
For WRITE operations, it ensures that updates are visible across all sites containing copies
(replicas) of the data item. For ABORT operations, the manager ensures that no effects of the
transaction are reflected in any site of the distributed database. For COMMIT operations, it
ensures that the effects of a write are persistently recorded on all databases containing copies of
the data item. Atomic termination (COMMIT/ ABORT) of distributed transactions is commonly
implemented using the two-phase commit protocol.
The transaction manager passes to the concurrency controller the database operation and
associated information. The controller is responsible for acquisition and release of associated
locks. If the transaction requires access to a locked resource, it is delayed until the lock is acquired.
Once the lock is acquired, the operation is sent to the runtime processor, which handles the actual
execution of the database operation. Once the operation is completed, locks are released and the
transaction manager is updated with the result of the operation.
We described the two-phase commit protocol (2PC), which requires a global recovery manager, or coordinator, to maintain information needed for recovery, in addition to the local recovery managers and the information they maintain (log, tables). The two-phase commit protocol has
certain drawbacks that led to the development of the three-phase commit protocol, which we
discuss next.
The biggest drawback of 2PC is that it is a blocking protocol. Failure of the coordinator blocks all
participating sites, causing them to wait until the coordinator recovers. This can cause performance
degradation, especially if participants are holding locks to shared resources. Another problematic
scenario is when both the coordinator and a participant that has committed crash together. In the
two-phase commit protocol, a participant has no way to ensure that all participants got the commit
message in the second phase. Hence once a decision to commit has been made by the coordinator
in the first phase, participants will commit their transactions in the second phase independent of
receipt of a global commit message by other participants. Thus, in the situation that both the
coordinator and a committed participant crash together, the result of the transaction becomes
uncertain or nondeterministic. Since the transaction has already been committed by one
participant, it cannot be aborted on recovery by the coordinator. Also, the transaction cannot be
optimistically committed on recovery since the original vote of the coordinator may have been to
abort.
These problems are solved by the three-phase commit (3PC) protocol, which essentially divides
the second commit phase into two sub phases called prepare-to-commit and commit. The
prepare-to-commit phase is used to communicate the result of the vote phase to all participants. If
all participants vote yes, then the coordinator instructs them to move into the prepare-to-commit
state. The commit sub phase is identical to its two-phase counterpart. Now, if the coordinator
crashes during this sub phase, another participant can see the transaction through to completion. It
can simply ask a crashed participant if it received a prepare-to-commit message. If it did not, then
it safely assumes to abort. Thus the state of the protocol can be recovered irrespective of which
participant crashes. Also, by limiting the time required for a transaction to commit or abort to a
maximum time-out period, the protocol ensures that a transaction attempting to commit via 3PC
releases locks on time-out.
The main idea is to limit the wait time for participants who have committed and are waiting for a
global commit or abort from the coordinator. When a participant receives a precommit message,
it knows that the rest of the participants have voted to commit. If a precommit message has not
been received, then the participant will abort and release all locks.
The following are the main benefits of operating system (OS)-supported transaction management:
1. Typically, DBMSs use their own semaphores to guarantee mutually exclusive access to
shared resources. Since these semaphores are implemented in user space at the level of the
DBMS application software, the OS has no knowledge about them. Hence if the OS
deactivates a DBMS process holding a lock, other DBMS processes wanting this lock
resource get queued. Such a situation can cause serious performance degradation. OS-level
knowledge of semaphores can help eliminate such situations.
2. Specialized hardware support for locking can be exploited to reduce associated costs. This
can be of great importance, since locking is one of the most common DBMS operations.
3. Providing a set of common transaction support operations through the kernel allows
application developers to focus on adding new features to their products as opposed to
reimplementing the common functionality for each application. For example, if different
DDBMSs are to coexist on the same machine and they chose the two-phase commit
protocol, then it is more beneficial to have this protocol implemented as part of the kernel
so that the DDBMS developers can focus more on adding new features to their products.
Let us consider the following scenario to analyze how a transaction failure may occur. Suppose we have two persons, X and Y. X sends a message to Y and expects a response, but Y is unable to receive it.
The following are some of the issues with this circumstance:
One of the most famous methods of Transaction Recovery is the “Two-Phase Commit
Protocol”. The coordinator and the subordinate are the two types of nodes that the Two-Phase
Commit Protocol uses to accomplish its procedures. The coordinator’s process is linked to the user
app, and communication channels between the subordinates and the coordinator are formed.
As the name implies, the two-phase commit protocol contains two stages. The first is the
PREPARE phase, in which the transaction’s coordinator delivers a PREPARE message. The
second is the decision-making phase, in which the coordinator sends a COMMIT message if
all of the nodes can complete the transaction, or an ABORT message if at least one subordinate node
cannot. 2PC may be performed in three ways: Centralized 2PC, Linear 2PC, and Distributed 2PC.
Centralized 2PC: Communication in centralized 2PC is limited to the coordinator’s process, and
no communication between subordinates is permitted. The coordinator is in charge of sending
the PREPARE message to the subordinates, and once all of the subordinates’ votes have been
received and evaluated, the coordinator decides whether to abort or commit. There are two
phases to this method:
The First Phase: When a user desires to COMMIT a transaction during this phase,
the coordinator sends a PREPARE message to all subordinates. When a subordinate
gets the PREPARE message, it either records a PREPARE log and sends a YES
VOTE and enters the PREPARED state if the subordinate is willing to COMMIT;
or it creates an abort record and sends a NO VOTE if the subordinate is not willing
to COMMIT. Because it knows the coordinator will issue an abort, a subordinate
transmitting a NO VOTE does not need to enter a PREPARED state. In this
situation, the NO VOTE functions as a veto since only one NO VOTE is required
to cancel the transaction.
Second Phase: After the coordinator has reached a decision, it must communicate
that decision to the subordinates. If COMMIT is chosen, the coordinator enters the
committing state and sends a COMMIT message to all subordinates notifying them
of the choice. When the subordinates get the COMMIT message, they go into the
committing state and send the coordinator an acknowledge (ACK) message. The
transaction is completed when the coordinator gets the ACK messages. If the
coordinator, on the other hand, makes an ABORT decision, it sends an ABORT
message to all subordinates. In this case, the coordinator does not need to send an
ABORT message to the NO VOTE subordinate(s).
Linear 2PC: In linear 2PC, subordinates can communicate with one another. The sites
are numbered 1 to N, with site 1 being the coordinator, and the PREPARE message is
propagated sequentially from site to site. As a result, the transaction takes longer to complete than
with the centralized or distributed approaches. Finally, it is node N that sends out the global COMMIT.
Distributed 2PC: All of the nodes of a distributed 2PC communicate with one another. Unlike the
other 2PC techniques, this variant does not require an explicit second phase. However, in order to
know when every node has cast its vote, each node must hold a list of all participating nodes.
The distributed 2PC starts when the coordinator delivers a PREPARE message to all participating
nodes. When a participant receives the PREPARE message, it transmits its vote to all other
participants. As a result, each node keeps track of every transaction’s participants and can
determine the global decision on its own.
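The following is a minimal, assumed sketch (not taken from the module) of the centralized 2PC decision logic described above; the class, method, and message names are illustrative choices, not a standard API:

# Illustrative sketch of centralized 2PC, assuming in-memory "participants"
# that expose prepare(), commit(), and abort() methods. All names are assumptions.
class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "INITIAL"

    def prepare(self):
        # Phase 1: write a PREPARE or abort log record (omitted) and cast a vote.
        if self.can_commit:
            self.state = "PREPARED"
            return "YES"
        self.state = "ABORTED"
        return "NO"

    def commit(self):
        self.state = "COMMITTED"      # Phase 2: act on the coordinator's decision

    def abort(self):
        self.state = "ABORTED"


def coordinator_2pc(participants):
    """Centralized 2PC: collect votes, then broadcast COMMIT or ABORT."""
    votes = [p.prepare() for p in participants]       # Phase 1: PREPARE
    if all(v == "YES" for v in votes):                # Phase 2: decision
        for p in participants:
            p.commit()
        return "COMMIT"
    for p in participants:
        if p.state == "PREPARED":                     # NO voters have already aborted
            p.abort()
    return "ABORT"


# One NO vote acts as a veto: the global decision is ABORT.
nodes = [Participant("A"), Participant("B", can_commit=False)]
print(coordinator_2pc(nodes))   # -> ABORT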
Chapter 5
Object Oriented DBMS
5.1. Object Oriented Concepts
The term object-oriented—abbreviated by OO or O-O—has its origins in OO programming
languages, or OOPLs. Today OO concepts are applied in the areas of databases, software
engineering, knowledge bases, artificial intelligence, and computer systems in general. An object
typically has two components: state (value) and behavior (operations).
Objects in an OOPL exist only during program execution and are hence called transient objects.
An OO database can extend the existence of objects so that they are stored permanently, and hence
the objects persist beyond program termination and can be retrieved later and shared by other
programs. In other words, OO databases store persistent objects permanently on secondary storage,
and allow the sharing of these objects among multiple programs and applications. This requires
the incorporation of other well-known features of database management systems, such as indexing
mechanisms, concurrency control, and recovery. An OO database system interfaces with one or
more OO programming languages to provide persistent and shared object capabilities.
One goal of OO databases is to maintain a direct correspondence between real-world and database
objects so that objects do not lose their integrity and identity and can easily be identified and
operated upon. Hence, OO databases provide a unique system-generated object identifier (OID)
for each object. We can compare this with the relational model where each relation must have a
primary key attribute whose value identifies each tuple uniquely. In the relational model, if the
value of the primary key is changed, the tuple will have a new identity, even though it may still
represent the same real-world object. Alternatively, a real-world object may have different names
for key attributes in different relations, making it difficult to ascertain that the keys represent the
same object (for example, the object identifier may be represented as EMP_ID in one relation and
as SSN in another).
Another feature of OO databases is that objects may have an object structure of arbitrary
complexity in order to contain all of the necessary information that describes the object. In
contrast, in traditional database systems, information about a complex object is often scattered
over many relations or records, leading to loss of direct correspondence between a real-world
object and its database representation.
Applications of OO databases: there are many fields in which it is believed that the OO model
can be used to overcome some of the limitations of relational technology, particularly where
the use of complex data types and the need for high performance are essential. These
applications include:
Computer-aided design and manufacturing (CAD/CAM)
Computer-integrated manufacturing (CIM)
Computer-aided software engineering (CASE)
Geographic information systems (GIS)
Many applications in science and medicine
Document storage and retrieval
The key concepts of the object-oriented model are:
Object
Class
Polymorphism
Inheritance
Encapsulation
Objects
Objects represent real-world entities and concepts, tangible as well as intangible things;
for example, a person, a drama, or a license.
Every object has a unique identifier (OID). The value of an OID is not visible to the
external user, but is used internally by the system to identify each object uniquely and to
create and manage inter-object references. The OID can be assigned to program variables
of the appropriate type when needed. An OID is:
1. System generated
2. Never changed during the lifetime of the object
Object Structure:
Loosely speaking, an object corresponds to an entity in the E-R model.
The object-oriented paradigm is based on encapsulating the code and data related to an object
into a single unit.
An object has:
A set of variables that contain the data for the object; the value of each variable is
itself an object.
A set of messages to which the object responds; each message may have zero,
one, or more parameters.
A set of methods, each of which is a body of code that implements a message; a
method returns a value as the response to the message.
Objects are categorized by their type or class.
An object is an instance of a type or class.
Class
Class Definition Example:
class employee {
/* Variables */
string name;
string address;
date start_date;   /* 'date' is assumed to be a user-defined date type */
int salary;
/* Messages */
int annual_salary();
string get_name();
string get_address();
int employment_length();
};
With strict encapsulation, methods to read and set the other variables are also needed.
Methods are defined separately, e.g.:
int employment_length() { return today() - start_date; }
void set_address(string new_address) { address = new_address; }
Polymorphism
Polymorphism is the capability of an object to take multiple forms. This ability allows the same
program code to work with different data types. Both a car and a motorcycle are able to brake, but
the mechanism is different. In this example, the action brake is polymorphic: the result changes
depending on which vehicle performs it.
Inheritance
Inheritance creates a hierarchical relationship between related classes while making parts of code
reusable. A newly defined type inherits all of the existing class's fields and methods and can further
extend them. The existing class is the parent class, while the child class extends the parent.
For example, a parent class called Vehicle will have child classes Car and Bike. Both child
classes inherit information from the parent class and extend the parent class with new information
depending on the vehicle type.
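As an illustrative sketch (an assumed example, not from the module), the Vehicle, Car, and Bike classes above can be written as follows, including the polymorphic brake action from the previous section:

# Illustrative sketch of inheritance and polymorphism using the Vehicle example above.
class Vehicle:
    def __init__(self, wheels):
        self.wheels = wheels          # shared (inherited) field

    def brake(self):
        return "generic braking"      # overridden by child classes

class Car(Vehicle):
    def __init__(self):
        super().__init__(wheels=4)    # extends the parent with its own details

    def brake(self):
        return "hydraulic disc brakes on four wheels"

class Bike(Vehicle):
    def __init__(self):
        super().__init__(wheels=2)

    def brake(self):
        return "hand-lever brakes on two wheels"

# The same call behaves differently depending on which vehicle performs it.
for v in (Car(), Bike()):
    print(v.wheels, v.brake())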
Encapsulation
Encapsulation is the ability to group data and mechanisms into a single object to provide access
protection. Through this process, pieces of information and details of how an object works
are hidden, resulting in data and function security. Classes interact with each other through
methods without the need to know how particular methods work. As an example, a car has
descriptive characteristics and actions. You can change the color of a car, yet the model or make
are examples of properties that cannot change. A class encapsulates all the car information into
one entity, where some elements are modifiable while some are not.
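A minimal sketch of the encapsulation idea above (an assumed example): the colour can be changed through a method, while the make and model are hidden and fixed:

# Illustrative sketch of encapsulation using the car example above.
class EncapsulatedCar:
    def __init__(self, make, model, color):
        self.__make = make        # name-mangled ("hidden") attributes
        self.__model = model
        self.__color = color

    def repaint(self, new_color):
        self.__color = new_color  # modifiable only through this method

    def describe(self):
        return f"{self.__make} {self.__model}, {self.__color}"

c = EncapsulatedCar("Toyota", "Corolla", "red")
c.repaint("blue")
print(c.describe())               # the make and model cannot be changed from outside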
One disadvantage of relational databases is the expense of setting up and maintaining the
database system. In order to set up a relational database, you generally need to purchase
special software. If you are not a programmer, you can use any number of products to set
up a relational database, but it takes time to enter all the information and configure the
program. If your company is large and you need a more robust database, you will need to
hire a programmer to create a relational database using Structured Query Language (SQL)
and a database administrator to maintain the database once it is built. Regardless of what
data you use, you will have to either import it from other sources such as text files or Excel
spreadsheets, or have the data entered at the keyboard. No matter the size of your company,
if you store legally confidential or protected information in your database, such as health
information, social security numbers, or credit card numbers, you will also have to secure
your data against unauthorized access in order to meet regulatory standards.
Abundance of Information
Structured Limits
Some relational databases place limits on field lengths. When you design the database, you
have to specify the maximum amount of data a field can hold. Some names or values are
longer than the allowed field length, and truncating them can lead to data loss.
Isolated Databases
Complex relational database systems can lead to these databases becoming "islands of
information" where the information cannot be shared easily from one large system to another.
Often, in big firms or institutions, you find that relational databases grew up differently in separate
divisions. For example, maybe the hospital billing department used one database while the
hospital personnel department used a different database. Getting those databases to "talk" to each
other can be a large, and expensive, undertaking, yet in a complex hospital system, all the databases
need to be involved for good patient and employee care.
OODBMS definitions
Components of Object-Oriented Data Model:
The OODBMS is based on three major components, namely: Object structure, Object
classes, and Object identity. These are explained below.
1. Object Structure:
The structure of an object refers to the properties that the object is made up of. These
properties of an object are referred to as attributes. Thus, an object is a real-world entity
with certain attributes that make up the object structure. Also, an object encapsulates
data and code into a single unit, which in turn provides data abstraction by hiding the
implementation details from the user.
1. Messages –
A message provides an interface or acts as a communication medium between an
object and the outside world. A message can be of two types:
Read-only message: If the invoked method does not change the value of a
variable, then the invoking message is said to be a read-only message.
Update message: If the invoked method changes the value of a variable, then the
invoking message is said to be an update message.
2. Methods –
When a message is passed then the body of code that is executed is known as a
method. Whenever a method is executed, it returns a value as output. A method can
be of two types:
Read-only method: When the value of a variable is not affected by a method,
it is known as a read-only method.
Update method: When the value of a variable is changed by a method, it is
known as an update method.
3. Variables –
They store the data of an object. The data stored in the variables makes objects
distinguishable from one another.
2. Object Classes:
An object, which is a real-world entity, is an instance of a class. Hence, we first need to
define a class, and then objects are created which differ in the values they store but
share the same class definition. The objects in turn respond to various messages and hold
the variables stored in them.
Example –
class CLERK
{ //variables
string name;
string address;
int id;
int salary;
//methods
string get_name();
string get_address();
int annual_salary();
};
In the above example, CLERK is a class that holds the object's variables and messages (methods).
The concept of encapsulation, that is, data or information hiding, is also supported by the
object-oriented data model. This data model also provides the facility of abstract data
types (ADTs) in addition to the built-in data types such as char, int, and float. ADTs are
user-defined data types that hold values within them and can also have methods attached to them.
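For illustration only (an assumed example, not from the module), a user-defined ADT holding a value with methods attached might look like this:

# Illustrative sketch of a user-defined abstract data type (ADT) with attached methods.
class Money:
    def __init__(self, amount, currency):
        self.amount = amount
        self.currency = currency

    def add(self, other):
        # Method attached to the ADT; only meaningful for matching currencies.
        if other.currency != self.currency:
            raise ValueError("currency mismatch")
        return Money(self.amount + other.amount, self.currency)

    def __str__(self):
        return f"{self.amount:.2f} {self.currency}"

salary = Money(1500.0, "ETB").add(Money(250.0, "ETB"))
print(salary)   # -> 1750.00 ETB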
Thus, an OODBMS provides numerous facilities to its users, both built-in and user-defined.
It incorporates the properties of an object-oriented data model into a database management
system, and supports programming concepts such as classes and objects, along with
encapsulation, inheritance, and user-defined ADTs (abstract data types).
The ER model is used to represent real-life scenarios as entities. The properties of these entities
are their attributes in the ER diagram, and their connections are shown in the form of
relationships.
An ER model is generally considered a top-down approach to data design.
(Figure: an example E-R model diagram.)
Advantages of E-R Model
The E-R diagram is very easy to understand as it has clearly defined entities and
the relations between them.
Advantages of Object Oriented Model
Due to inheritance, the data types can be reused in different objects. This reduces
the cost of maintaining the same data in multiple locations.
The object oriented model is quite flexible in most cases.
It is easier to extend the design in Object Oriented Model.
Disadvantages of Object Oriented Model
Objects
An object consists of an entity and attributes which describe the state of a real-world object,
together with the actions associated with that object.
Object name: The name is used to refer to different objects in the program.
Object identifier: This is the system-generated identifier which is assigned
when a new object is created.
Structure of object: The structure defines how the object is constructed using a
type constructor. In an object-oriented database, the state of a complex object can be
constructed from other objects by using a certain type constructor. The formal
way of representing an object is (i, c, v), where 'i' is the object identifier, 'c' is the type
constructor, and 'v' is the current value of the object.
Transient object: In an OOPL, objects which are present only at the time of
execution are called transient objects.
o For example: variables in an OOPL
Persistent objects: An object which exists even after the program has completely
executed (or terminated) is called a persistent object. Object-oriented databases
can store such objects in secondary memory.
Attributes
Attributes are the properties of the objects in the system.
Example: An Employee object can have the attributes 'name', 'address', and 'id' with assigned values as:
Attribute    Value
Name         Abebe
Address      Hossana
Id           07
Types of Attributes
The three types of attributes are as follows:
1. Simple attributes
Attributes of a primitive data type such as integer, string, or real, which take
literal values.
Example: 'Id' is a simple attribute and its value is 07.
2. Complex attributes
Attributes which consist of collections of, or references to, multiple other objects are
called complex attributes.
Example: A collection of Employees consists of many employee names.
3. Reference attributes
Attributes that represent a relationship between objects and consist of a value or a
collection of values are called reference attributes.
Example: 'Manager' is a reference to a Staff object.
Example
An Example of the Object Oriented data model with object and attributes –
Figure 17: Object Oriented data model with object and attributes
Shape, Circle, Rectangle and Triangle are all objects in this model.
Circle has the attributes Center and Radius.
Rectangle has the attributes Length and Breadth.
Every object has a unique identity. In an object-oriented system, an OID is assigned to the
object when it is created.
In an RDBMS, identity is value based: a primary key is used to provide uniqueness for each
tuple in a relation. The primary key is unique only for that relation, not for the entire
system, and because it is chosen from the attributes of the relation, identity depends on the
object's state.
In an OODBMS, the OID is system generated and independent of the object's state; internally
it may be implemented as a reference or pointer.
Properties of OID
1. Uniqueness: No two objects in the system share the same OID; it is generated
automatically by the system.
2. Invariant: An OID cannot be changed throughout the object's entire lifetime.
3. Invisible: The OID is not visible to the user.
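As a hypothetical illustration (not part of the module), these three properties can be mimicked in application code with a system-generated, immutable, hidden identifier:

# Illustrative sketch of the three OID properties: unique, invariant, invisible.
import uuid

class PersistentObject:
    def __init__(self):
        self.__oid = uuid.uuid4()   # system generated, effectively unique, never reassigned

    def same_object(self, other):
        # Identity comparison happens internally; the OID itself is never exposed.
        return self.__oid == other.__oid

a, b = PersistentObject(), PersistentObject()
print(a.same_object(a), a.same_object(b))   # -> True False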
Chapter 6
Data warehousing and Data Mining Techniques
6.1. Data Warehousing
A data warehouse is a relational database management system responsible for the collection
and storage of data to support management decision making and problem solving.
A data warehouse takes data from the operational databases and creates a layer optimized for,
and dedicated to, analytics.
It enables managers and other business professionals to undertake data mining, online
analytical processing (OLAP), market research, and decision support; it is the current evolution
of Decision Support Systems (DSSs). A data warehouse maintains a copy of information from
the source transaction systems.
6.1.1. Introduction
Data warehouse
The tables and joins are simple since they are denormalized.
It is usually a very large, read-mostly database.
It holds a huge copy of the data, kept for analytical storage and analysis.
• The data warehouse and operational environments are separated. Data warehouse
receives its data from operational databases.
– Data warehouse environment is characterized by read-only transactions to
very large data sets.
– Operational environment is characterized by numerous update transactions to
a few data entities at a time.
– Data warehouse contains historical data over a long time horizon.
• Ultimately, information is created from data warehouses. Such information becomes the
basis for rational decision making.
• The data found in a data warehouse is analyzed to discover previously unknown data
characteristics, relationships, dependencies, or trends.
Data warehouses provide access to data for complex analysis, knowledge discovery, and
decision making and support high-performance demands on an organization's data and
information. The construction of data warehouses involves data cleaning, data integration and
data transformation and can be viewed as an important preprocessing step for data mining.
Database
Handles CRUD operations on frequently used data.
It is basically any system which keeps data in a table format.
Used for OLTP and as a source for data warehousing.
The tables and joins are complex since they are normalized.
Entity-relationship modeling techniques are used for relational database management
system design.
Optimized for write operations.
Performance is lower for analytical queries.
In comparison to traditional databases, data warehouses generally contain very large amounts of
data from multiple sources that may include databases from different data models and sometimes
files acquired from independent systems and platforms.
6.1.3. Benefits
Implementation of the warehouse is an important and challenging consideration that should
not be underestimated. The building of an enterprise-wide warehouse in a large organization is
a major undertaking, potentially taking years from conceptualization to implementation.
Because of the difficulty and amount of lead time required for such an undertaking, the
widespread development and deployment of data marts may provide an attractive alternative,
especially to those organizations with urgent needs for OLAP, DSS, and/or data mining
support.
The administration of a data warehouse is an intensive enterprise, proportional to the size and
complexity of the warehouse. An organization that attempts to administer a data warehouse
must realistically understand the complex nature of its administration. Although designed for
read-access, a data warehouse is no more a static structure than any of its information sources.
Source databases can be expected to evolve. The warehouse's schema and acquisition
component must be expected to be updated to handle these evolutions.
A significant issue in data warehousing is the quality control of data. Both quality and
consistency of data are major concerns. Although the data passes through a cleaning function
during acquisition, quality and consistency remain significant issues for the database
administrator. Melding data from heterogeneous and disparate sources is a major challenge
given differences in naming, domain definitions, identification numbers, and the like. Every
time a source database changes, the data warehouse administrator must consider the possible
interactions with other elements of the warehouse.
Administration of a data warehouse will require far broader skills than are needed for
traditional database administration. A team of highly skilled technical experts with overlapping
areas of expertise will likely be needed, rather than a single individual. Like database
administration, data warehouse administration is only partly technical; a large part of the
responsibility requires working effectively with all the members of the organization with an
interest in the data warehouse. However difficult that can be at times for database
administrators, it is that much more challenging for data warehouse administrators, as the scope
of their responsibilities is considerably broader.
Design of the management function and selection of the management team for a data
warehouse are crucial. Managing the data warehouse in a large organization will surely be a
major task. Many commercial tools are already available to support management functions.
Effective data warehouse management will certainly be a team function, requiring a wide set
of technical skills, careful coordination, and effective leadership. Just as we must prepare for
the evolution of the warehouse, we must also recognize that the skills of the management team
will, of necessity, evolve with it.
Generally, a data warehouse is a collection of decision support technologies aimed at enabling the
knowledge worker (executive, manager, analyst) to make better and faster decisions.
6.3. Data Mining
6.3.1. Introduction
Data mining is part of the knowledge discovery process. Knowledge Discovery in Databases,
frequently abbreviated as KDD, typically encompasses more than data mining. The knowledge
discovery process comprises six phases: data selection, data cleansing, enrichment, data
transformation or encoding, data mining, and the reporting and display of the discovered
information.
During data selection, data about specific items or categories of items, or from stores in a specific
region or area of the country, may be selected.
The data cleansing process then may correct invalid zip codes or eliminate records with incorrect
phone prefixes. Enrichment typically enhances the data with additional sources of information.
For example, given the client names and phone numbers, the store may purchase other data about
age, income, and credit rating and append them to each record.
Data transformation and encoding may be done to reduce the amount of data. For instance, item
codes may be grouped in terms of product categories into audio, video, supplies, electronic
gadgets, camera, accessories, and so on. Zip codes may be aggregated into geographic regions;
incomes may be divided into ten ranges, and so on. We showed a step called cleaning as a
precursor to data warehouse creation. If data mining is based on an existing warehouse for this
retail store chain, we would expect that the cleaning has already been applied. It is only after such
preprocessing that data mining techniques are used to mine different rules and patterns; for
example, the result of mining may be the discovery of association rules or buying patterns in the
store's data.
Data mining is a technology that uses various techniques to discover hidden knowledge from
heterogeneous and distributed historical data stored in large databases, warehouses, and other
massive information repositories, so as to find patterns in data that are:
valid: not only represent current state, but also hold on new data with some certainty
novel: non-obvious to the system that are generated as new facts
useful: should be possible to act on the item or problem
understandable: humans should be able to interpret the pattern
Simply stated, data mining refers to extracting or “mining” knowledge from large amounts of data.
Thus, data mining should have been more appropriately named “knowledge mining from data,”
which is unfortunately somewhat long. “Knowledge mining,” a shorter term may not reflect the
emphasis on mining from large amounts of data. Thus, such a misnomer that carries both “data”
and “mining” became a popular choice. Many other terms carry a similar or slightly different
meaning to data mining, such as knowledge mining from data, knowledge extraction, data/pattern
analysis, data archaeology, and data dredging.
Many people treat data mining as a synonym for another popularly used term, Knowledge
Discovery from Data, or KDD. Data mining requires collecting a great amount of data (available in
data warehouses or databases) to achieve the intended objective.
The goals of data mining fall into the following classes: prediction, identification, classification,
and optimization.
¤ Prediction: Data mining can show how certain attributes within the data will behave in the
future. Examples of predictive data mining include the analysis of buying transactions to
predict what consumers will buy under certain discounts, how much sales volume a store would
generate in a given period, and whether deleting a product line would yield more profits. In
such applications, business logic is used coupled with data mining. In a scientific context,
certain seismic wave patterns may predict an earthquake with high probability.
¤ Identification: Data patterns can be used to identify the existence of an item, an event, or an
activity. For example, intruders trying to break a system may be identified by the programs
executed, files accessed, and CPU time per session. In biological applications, existence of a
gene may be identified by certain sequences of nucleotide symbols in the DNA sequence. The
area known as authentication is a form of identification. It ascertains whether a user is indeed
a specific user or one from an authorized class; it involves a comparison of parameters or
images or signals against a database.
¤ Classification: Data mining can partition the data so that different classes or categories can be
identified based on combinations of parameters. For example, customers in a supermarket can
be categorized into discount-seeking shoppers, shoppers in a rush, loyal regular shoppers, and
infrequent shoppers. This classification may be used in different analyses of customer buying
transactions as a post-mining activity. Sometimes classification based on common domain
knowledge is used as an input to decompose the mining problem and make it simpler. For
instance, health foods, party foods, or school lunch foods are distinct categories in the
supermarket business. It makes sense to analyze relationships within and across categories as
separate problems. Such categorization may be used to encode the data appropriately before
subjecting it to further data mining.
¤ Optimization: One eventual goal of data mining may be to optimize the use of limited
resources such as time, space, money, or materials and to maximize output variables such as
sales or profits under a given set of constraints. As such, this goal of data mining resembles the
objective function used in operations research problems that deal with optimization under
constraints.
The term data mining is currently used in a very broad sense. In some situations it includes
statistical analysis and constrained optimization as well as machine learning. There is no sharp line
separating data mining from these disciplines
Data mining is applied in many domains; for example:
Manufacturing: Applications involve optimization of resources such as machines,
manpower, and materials; optimal design of manufacturing processes, shop-floor layouts,
and product design, such as for automobiles based on customer requirements.
A. Data Preparation
Some data have problems of their own that need to be cleaned:
1. Data cleaning: fills in missing values, smooths noisy data, identifies or removes outliers,
and resolves inconsistencies.
2. Data integration: integration of data from multiple sources, such as databases, data
warehouses, or files. It combines data from multiple sources (databases, data warehouses,
files, and sometimes non-electronic sources) into a coherent store. Because different
sources are used, data that is fine on its own may become problematic when we want to
integrate it. Some of the issues are naming conflicts, redundant or inconsistent values, and
data represented at different levels of granularity.
3. Data reduction: obtains a reduced representation of the data set that is much smaller in
volume, yet produces the same (or almost the same) analytical results. Why data reduction? A
database or data warehouse may store terabytes of data, and complex data analysis may take a
very long time to run on the complete data set. The main strategies are: dimensionality
reduction (select the best attributes or remove unimportant attributes); size (numerosity)
reduction (reduce data volume by choosing alternative, smaller forms of data representation);
and data compression (a technology that reduces the size of large files so that smaller files take
less memory space and are faster to transfer over a network or the Internet).
4. Data transformation: a function that maps the entire set of values of a given attribute to a
new set of replacement values such that each old value can be identified with one of the new
values. Methods for data transformation include:
Normalization: values are scaled to fall within a smaller, specified range, e.g.
min-max normalization or z-score normalization.
Discretization: reduces data size by dividing the range of a continuous attribute into
intervals; interval labels can then be used to replace actual data values. Discretization can
be performed recursively on an attribute using methods such as binning (divide values into
intervals) and concept hierarchy climbing (organize concepts, i.e., attribute values,
hierarchically). A small illustrative sketch of these transformations follows this list.
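The following is a minimal, assumed sketch (not from the module) of min-max normalization, z-score normalization, equal-width binning, and simple random sampling for data reduction; the sample values are invented for illustration:

# Illustrative sketch of the preprocessing steps above: sampling (reduction),
# min-max and z-score normalization, and equal-width binning (discretization).
import random
import statistics

values = [12.0, 15.0, 14.0, 10.0, 18.0, 20.0, 11.0, 16.0]

# Size (numerosity) reduction: keep a random sample of the data.
sample = random.sample(values, k=4)

# Min-max normalization: rescale values into the range [0, 1].
lo, hi = min(values), max(values)
min_max = [(v - lo) / (hi - lo) for v in values]

# Z-score normalization: centre on the mean, scale by the standard deviation.
mean, std = statistics.mean(values), statistics.stdev(values)
z_scores = [(v - mean) / std for v in values]

# Discretization by equal-width binning into 3 intervals.
width = (hi - lo) / 3
bins = [min(int((v - lo) / width), 2) for v in values]   # bin labels 0, 1, 2

print(sample, min_max, z_scores, bins, sep="\n")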
B. Classification (which is also called supervised learning) maps data into predefined groups or
classes to enhance the prediction process. It predicts categorical class labels (discrete or nominal):
it constructs a model based on a training set and the values (class labels) of a classifying
attribute, and uses the model to classify new data.
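As a hedged illustration (an assumed, minimal example rather than a full classifier), a one-nearest-neighbour rule shows how a model built from labelled training data can assign a class to new data; the feature values and class labels are invented:

# Illustrative sketch of classification: a 1-nearest-neighbour rule on toy data.
# Each training example is ((feature1, feature2), class_label).
training = [((1.0, 1.0), "discount-seeker"),
            ((1.2, 0.8), "discount-seeker"),
            ((5.0, 5.5), "loyal-regular"),
            ((4.8, 5.1), "loyal-regular")]

def classify(point):
    """Assign the class label of the closest training example."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    _, label = min(training, key=lambda ex: dist(ex[0], point))
    return label

print(classify((1.1, 0.9)))   # -> discount-seeker
print(classify((5.2, 5.0)))   # -> loyal-regular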
C. Clustering (which is also called unsupervised learning) is a data mining (machine
learning) technique that finds similarities between data according to the characteristics found
in the data and groups similar data objects into one cluster. It is used to find appropriate
groupings of elements for a set of data. Unlike classification, clustering is a kind of undirected
knowledge discovery or unsupervised learning; that is, there is no target field, and the
relationships among the data are identified by a bottom-up approach.
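For illustration only (an assumed sketch, not a production implementation), a tiny k-means procedure shows how similar values are grouped into clusters without any target field; the data points and parameters are invented:

# Illustrative sketch of clustering: a tiny k-means on one-dimensional toy data.
def kmeans_1d(points, k=2, iterations=10):
    """Group points into k clusters around evolving centroids."""
    centroids = points[:k]                      # naive initialisation
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                        # assign each point to the nearest centroid
            i = min(range(k), key=lambda c: abs(p - centroids[c]))
            clusters[i].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]   # recompute centroids
    return clusters

data = [1.0, 1.2, 0.8, 8.0, 8.5, 7.9]
print(kmeans_1d(data, k=2))   # -> roughly [1.0, 1.2, 0.8] and [8.0, 8.5, 7.9]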
D. Association rules (also known as market basket analysis) discover interesting
associations between attributes contained in a database. Based on frequency counts of the
items occurring together in events, an association rule states that if item X is part of an event,
then with a certain percentage (confidence) item Y is also part of the event. Pattern discovery
attempts to uncover hidden linkages between data items.
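A minimal, assumed sketch of the frequency-count idea above: computing the support and confidence of a rule X -> Y over a set of transactions (the transaction data is invented for illustration):

# Illustrative sketch of an association rule X -> Y via support and confidence counts.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "beer"},
    {"milk", "bread", "beer"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(x, y):
    """Of the transactions containing X, the fraction that also contain Y."""
    return support(x | y) / support(x)

# Rule {milk} -> {bread}: how often is bread bought when milk is bought?
print(support({"milk", "bread"}))       # -> 0.75
print(confidence({"milk"}, {"bread"}))  # -> 1.0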