
Werabe University

Institute of Technology
Department of Information Systems

Module on Advanced Database Systems

Table of Contents
Chapter One ................................................................................................................................................ 1
Transaction Management and Concurrency Control.............................................................................. 1
1.1. Transaction.................................................................................................................................... 1
1.1.1. Evaluating Transaction Results ............................................................................................. 1
1.1.2. Transaction Properties........................................................................................................... 2
1.1.3. Transaction Management with SQL ..................................................................................... 7
1.1.4. Transaction Log .................................................................................................................... 7
1.2. Concurrency Control ..................................................................................................................... 8
1.3. Problems of Concurrent Sharing ................................................................................................. 19
1.4. Concept of Serializability............................................................................................................ 23
1.4.1. Types of serializability ........................................................................................................ 24
1.5. Database Recovery...................................................................................................................... 26
1.6. Transaction and Recovery ........................................................................................................... 30
1.6.1. Transaction .......................................................................................................................... 30
1.6.2. Recovery ............................................................................................................................. 31
1.6.3. Recovery techniques and facilities ...................................................................................... 31
Chapter two ............................................................................................................................................... 34
Query Processing and Optimization........................................................................................................ 34
2.1. Overview ..................................................................................................................................... 34
2.2. Query Processing Steps ............................................................................................................... 34
2.3. Query Decomposition ................................................................................................................. 36
2.4. Optimization Process .................................................................................................................. 37
2.4.1. Top-K Optimization ............................................................................................................ 37
2.4.2. Join Minimization ............................................................................................................... 38
2.4.3. Multi Query Optimization and Shared Scans...................................................................... 39
2.4.4. Parametric Query Optimization .......................................................................................... 40
2.5. Approaches to Query Optimization ............................................................................................ 40
2.5.1. Exhaustive Search Optimization ......................................................................................... 40
2.5.2. Heuristic Based Optimization ............................................................................................. 41
2.6. Transformation Rules.................................................................................................................. 41
2.7. Implementing relational Operators ............................................................................................. 45
2.7.1. Relational Algebra .............................................................................................................. 45
2.8. Pipelining .................................................................................................................................... 52

2.8.1. Pipelining vs. Materialization ............................................................................................. 53
Chapter 3 ................................................................................................................................................... 55
Database Integrity, Security and Recovery ............................................................................................ 55
3.1. Integrity ....................................................................................................................................... 55
3.1.1. Types of Data Integrity ......................................................................................................... 56
3.1.2. Integrity Constraints ............................................................................................................ 57
3.2. Security ....................................................................................................................................... 60
3.3. Database threats .......................................................................................................................... 61
3.3.1. Threats in a Database .......................................................................................................... 62
3.3.2. Measures of Control ............................................................................................................ 63
3.4. Identification and Authentication................................................................................................ 63
3.5. Categories of Control .................................................................................................................. 64
3.6. Data Encryption .......................................................................................................................... 67
3.6.1. Symmetric and Asymmetric Encryption ............................................................................. 69
Chapter Four ............................................................................................................................................. 74
Distributed Database ................................................................................................................................ 74
4.1. Distributed Database overview .................................................................................................. 74
4.2. Components of Distributed DBMS and types............................................................................. 75
4.2.1. Types of DDBS: .................................................................................................................. 75
4.2.2. DDB Components ............................................................................................................... 76
4.2.3. DDBMS Functions.............................................................................................................. 76
4.3. Distributed Database Design ....................................................................................................... 77
4.3.1. Data Fragmentation ............................................................................................................. 77
4.4. Data Replication.......................................................................................................................... 78
4.5. Data Allocation ........................................................................................................................... 79
4.6. Query Processing and Optimization in Distributed Databases ................................................... 80
4.6.1. Distributed Query Processing ............................................................................................. 80
4.6.2. Data Transfer Costs of Distributed Query Processing ........................................................ 81
4.6.3. Distributed Query Processing Using Semijoin.................................................................... 83
4.7. Query and Update Decomposition .............................................................................................. 85
4.8. Distributed Database Transparency Features .............................................................................. 88
4.8.1. Distribution Transparency................................................................................................... 88
4.8.2. Transaction Transparency ................................................................................................... 89
4.9. Performance Transparency and Query Optimization .................................................................. 91
4.9.1. Distributed Concurrency Control ........................................................................................ 93

4.10. The Effect of a Premature COMMIT ...................................................................................... 93
4.10.1. Two-Phase Commit Protocol .............................................................................................. 94
4.10.2. Phases of Two-Phase Commit Protocol .............................................................................. 94
4.11. Distributed Transaction Management and Recovery ............................................................ 95
4.12. Operating System Support for Transaction Management ....................................................... 97
Chapter 5 ................................................................................................................................................. 100
Object Oriented DBMS .......................................................................................................................... 100
5.1. Object Oriented Concepts ......................................................................................................... 100
5.2. Drawbacks of relational DBMS ................................................................................................ 104
5.3. OO Data modeling and E-R diagramming ................................................................................ 108
5.3.1. E-R Model ......................................................................................................................... 108
5.4. Object Oriented Model.............................................................................................................. 109
5.5. Objects and Attributes............................................................................................................... 110
5.6. Characteristics of Object ........................................................................................................... 110
5.6.1. Object Identity................................................................................................................... 112
Chapter 6 ................................................................................................................................................. 114
Data warehousing and Data Mining Techniques ................................................................................. 114
6.1. Data Warehousing ..................................................................................................................... 114
6.1.1. Introduction ....................................................................................................................... 115
6.1.2. Database & data warehouse: Differences.......................................................................... 115
6.1.3. Benefits ............................................................................................................................. 116
6.2. Online Transaction Processing (OLTP) and Data Warehousing............................................... 118
6.3. Data Mining .............................................................................................................................. 119
6.3.1. Introduction ....................................................................................................................... 119
6.4. Data Mining Techniques ........................................................................................................... 122

List of figures
Figure 1:States of Transactions ..................................................................................................................... 6
Figure 2:Pre-claiming Lock Protocol .......................................................................................................... 13
Figure 3:Two-phase locking (2PL) ............................................................................................................. 14
Figure 4:Strict Two-phase locking (Strict-2PL).......................................................................................... 16
Figure 5:Precedence Graph for TS Ordering .............................................................................................. 17
Figure 6: Steps in query processing ............................................................................................................ 35
Figure 7: Types of relational operation ....................................................................................................... 46
Figure 8: Types of constraints ..................................................................................................................... 57
Figure 9: Data encryption process............................................................................................................... 68
Figure 10: Communication network ........................................................................................................... 76
Figure 11: Data replication ......................................................................................................................... 79
Figure 12: Accesses data at a single remote site ......................................................................................... 89
Figure 13: Distributed transaction............................................................................................................... 90
Figure 14: Another Distributed Request ..................................................................................................... 91
Figure 15: The Effect of a Premature COMMIT ........................................................................................ 93
Figure 16:E-R Model ................................................................................................................................ 108
Figure 17: Object Oriented data model with object and attributes ............................................................ 112
Figure 18: Current evolution of Decision Support Systems...................................................................... 114

List of Tables

Table 1: Transaction T consisting of T1 and T2 ........................................................................................... 2


Table 2:Concurrent execution ....................................................................................................................... 9
Table 3:Dirty read problem ......................................................................................................................... 10
Table 4:Unrepeatable Read Problem .......................................................................................................... 11
Table 5:Unlocking and locking work with 2-PL ......................................................................................... 15
Table 6: Temporary Update Problem .......................................................................................................... 19
Table 7: Incorrect Summary Problem ......................................................................................................... 20
Table 8: Lost Update Problem .................................................................................................................... 21
Table 9: Unrepeatable Read Problem ......................................................................................................... 22
Table 10: Phantom Read Problem............................................................................................................... 22
Table 11: Concept of Serializability ........................................................................................................... 23
Table 12: Non serial schedule ..................................................................................................................... 24
Table 13: Customer Relation ...................................................................................................................... 48
Table 14: Domain constraints ..................................................................................................................... 58
Table 15: Entity integrity constraints ......................................................................................................... 58
Table 16: Referential Integrity Constraints ................................................................................................. 59
Table 17: Referential Integrity Constraints ................................................................................................. 59
Table 18: Key constraints ........................................................................................................................... 59
Table 19: Advantage and Disadvantage of Distributed Database ........................................................ 74
Table 20: A Summary of Transparency Features........................................................................................ 89

Chapter One
Transaction Management and Concurrency Control
1.1. Transaction

A transaction is the execution of a sequence of one or more operations (e.g., SQL queries) on a
shared database to perform some higher level function. They are the basic unit of change in a
DBMS.
Example: Move $100 from Abebe’s bank account to his bookie’s account
1. Check whether Abebe has $100.

2. Deduct $100 from his account.

3. Add $100 to his bookie’s account.


A transaction is a logical unit of work that must be entirely completed or aborted. It may consist of:
 a single SELECT statement
 a series of related UPDATE statements
 a series of INSERT statements
 a combination of SELECT, UPDATE, and INSERT statements
Consistent database state: a state in which all data integrity constraints are satisfied. A transaction
must begin with the database in a known consistent state in order to preserve consistency.

1.1.1. Evaluating Transaction Results

 Not all transactions update the database

 SQL code that only reads the database still represents a transaction, because it accesses the database
 Improper or incomplete transactions can have a devastating effect on database integrity
 Users can define enforceable constraints based on business rules
 Other integrity rules are automatically enforced by the DBMS

1|Page
1.1.2. Transaction Properties

¤ Atomicity
 All operations of a transaction must be completed
 If not, the transaction is aborted

 It states that all operations of the transaction take place as a single unit; if not, the
transaction is aborted.
 There is no midway, i.e., the transaction cannot occur partially. Each transaction
is treated as one unit and either runs to completion or is not executed at all.

Atomicity involves the following two operations:

Abort: If a transaction aborts then all the changes made are not visible.

Commit: If a transaction commits then all the changes made are visible.

Example: Let's assume the following transaction T, consisting of T1 and T2. Account A holds Rs 600
and account B holds Rs 300. T transfers Rs 100 from account A to account B.

Table 1: Transaction T consisting of T1 and T2

T1                 T2

Read(A)            Read(B)
A := A - 100       B := B + 100
Write(A)           Write(B)

After completion of the transaction, A consists of Rs 500 and B consists of Rs 400. If the
transaction T fails after the completion of transaction T1 but before completion of transaction T2,
then the amount will be deducted from A but not added to B. This shows the inconsistent database
state. In order to ensure the correctness of the database state, the transaction must be executed in its entirety.
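To make the role of atomicity concrete, the following small Python sketch simulates this transfer. It is only an illustration (the account names, amounts, and the simulated failure point are assumptions for the example, not part of any DBMS): if a failure occurs after T1 but before T2, the change is rolled back so the total of Rs 900 is preserved.

# Minimal sketch: why the transfer must be applied as one unit.
accounts = {"A": 600, "B": 300}            # Rs 600 in A, Rs 300 in B

def transfer_without_atomicity(amount, fail_midway=False):
    accounts["A"] -= amount                # T1: debit A
    if fail_midway:
        raise RuntimeError("crash")        # failure after T1, before T2
    accounts["B"] += amount                # T2: credit B

def transfer_atomically(amount, fail_midway=False):
    snapshot = dict(accounts)              # remember the consistent state
    try:
        transfer_without_atomicity(amount, fail_midway)
    except RuntimeError:
        accounts.clear()
        accounts.update(snapshot)          # roll back: restore the old state

transfer_atomically(100)                          # succeeds: A = 500, B = 400
transfer_atomically(100, fail_midway=True)        # fails: rolled back, nothing changes
print(accounts, "total =", accounts["A"] + accounts["B"])   # total is still 900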

¤ Consistency
 Permanence of database’s consistent state

 The integrity constraints are maintained so that the database is consistent before and after
the transaction.
 The execution of a transaction will leave a database in either its prior stable state or a new
stable state.
 The consistency property states that every transaction sees a consistent database
instance.
 The transaction is used to transform the database from one consistent state to another
consistent state.

For example: The total amount must be the same before and after the transaction.

1. Total before T occurs = 600+300=900


2. Total after T occurs= 500+400=900

Therefore, the database is consistent. In the case when T1 is completed but T2 fails, then
inconsistency will occur.

¤ Isolation

 Data used during transaction cannot be used by second transaction until the first is
completed
 It shows that the data which is used at the time of execution of a transaction cannot be
used by the second transaction until the first one is completed.
 In isolation, if the transaction T1 is being executed and using the data item X, then that
data item can't be accessed by any other transaction T2 until the transaction T1 ends.
 The concurrency control subsystem of the DBMS enforces the isolation property.

¤ Durability
 Ensures that once transactions are committed, they cannot be undone or lost
 The durability property indicates the permanence of the database's consistent
state. It states that once a transaction commits, its changes are permanent.

 They cannot be lost by the erroneous operation of a faulty transaction or by the system
failure. When a transaction is completed, then the database reaches a state known as the
consistent state. That consistent state cannot be lost, even in the event of a system's
failure.
 The recovery subsystem of the DBMS is responsible for enforcing the durability property

¤ Serializability
 Ensures that the schedule for the concurrent execution of several transactions should
yield consistent results.

When multiple transactions are being executed by the operating system in a multiprogramming
environment, the instructions of one transaction may be interleaved with those of another
transaction.

 Schedule − A chronological execution sequence of a transaction is called a schedule. A
schedule can have many transactions in it, each comprising a number of
instructions/tasks.

 Serial Schedule − It is a schedule in which transactions are aligned in such a way that
one transaction is executed first. When the first transaction completes its cycle, then the
next transaction is executed. Transactions are ordered one after the other. This type of
schedule is called a serial schedule, as transactions are executed in a serial manner.

In a multi-transaction environment, serial schedules are considered a benchmark. The
execution sequence of the instructions within a transaction cannot be changed, but the
instructions of two different transactions can be interleaved in any order.

This execution does no harm if two transactions are mutually independent and working on
different segments of data; but in case these two transactions are working on the same data, then
the results may vary.

This ever-varying result may bring the database to an inconsistent state. To resolve this problem,
we allow parallel execution of a transaction schedule, if its transactions are either serializable or
have some equivalence relation among them.

Equivalence of Schedules:- Schedule equivalence can be of the following types:
Result Equivalence:

If two schedules produce the same result after execution, they are said to be result equivalent. They
may yield the same result for some value and different results for another set of values. That's why
this equivalence is not generally considered significant.

View Equivalence

Two schedules are view equivalent if the transactions in both schedules perform similar
actions in a similar manner.

For example −

If T reads the initial data in S1, then it also reads the initial data in S2.

If T reads the value written by J in S1, then it also reads the value written by J in S2.

If T performs the final write on the data value in S1, then it also performs the final write on the
data value in S2.

Conflict Equivalence

Two operations are said to be conflicting if they have the following properties −

They belong to different transactions.

They access the same data item.

At least one of them is a "write" operation.

Two schedules having multiple transactions with conflicting operations are said to be conflict
equivalent if and only if −

Both the schedules contain the same set of Transactions.

The order of conflicting pairs of operations is maintained in both schedules.

Note − View equivalent schedules are view serializable and conflict equivalent schedules are
conflict serializable. All conflict serializable schedules are view serializable too.

States of Transactions

A transaction in a database can be in one of the following states −

Figure 1:States of Transactions

 Active − In this state, the transaction is being executed. This is the initial state of every
transaction.

 Partially Committed − When a transaction executes its final operation, it is said to be in
a partially committed state.

 Failed − A transaction is said to be in a failed state if any of the checks made by the
database recovery system fails. A failed transaction can no longer proceed further.

 Aborted − If any of the checks fails and the transaction has reached a failed state, then the
recovery manager rolls back all its write operations on the database to bring the database
back to its original state where it was prior to the execution of the transaction. Transactions
in this state are called aborted.

The database recovery module can select one of the two operations after a transaction aborts −

 Re-start the transaction

 Kill the transaction

 Committed − If a transaction executes all its operations successfully, it is said to be
committed. All its effects are now permanently established on the database system.

1.1.3. Transaction Management with SQL

¤ SQL statements that provide transaction support


 COMMIT:- A COMMIT command in Structured Query Language (SQL) is a transaction
command that is used to save all changes made by a particular transaction in a
relational database management system since the last COMMIT or ROLLBACK
command.
 ROLLBACK:- a ROLLBACK is an operation that returns the database to some previous
consistent state
¤ Transaction sequence must continue until:
 COMMIT statement is reached
 ROLLBACK statement is reached
 End of program is reached
 Program is abnormally terminated
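As an illustration of how these commands are used in practice, the sketch below drives COMMIT and ROLLBACK from Python's built-in sqlite3 module. The table name and rows are invented for the example; the same pattern applies to other relational DBMSs.

import sqlite3

# A small in-memory database with an illustrative account table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO account VALUES ('A', 600), ('B', 300)")
conn.commit()                       # make the initial state permanent

try:
    # The transaction: a series of related UPDATE statements.
    conn.execute("UPDATE account SET balance = balance - 100 WHERE name = 'A'")
    conn.execute("UPDATE account SET balance = balance + 100 WHERE name = 'B'")
    conn.commit()                   # COMMIT: save all changes since the last COMMIT/ROLLBACK
except sqlite3.Error:
    conn.rollback()                 # ROLLBACK: return the database to its previous state

print(conn.execute("SELECT name, balance FROM account").fetchall())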

1.1.4. Transaction Log

¤ Keeps track of all transactions that update the database


¤ DBMS uses the information stored in a log for:
 Recovery requirement triggered by a ROLLBACK statement
 A program’s abnormal termination
 A system failure

1.2. Concurrency Control
Concurrency Control is the management procedure that is required for controlling concurrent
execution of the operations that take place on a database. But before knowing about concurrency
control, we should know about concurrent execution.
Concurrent Execution in DBMS
 In a multi-user system, multiple users can access and use the same database at the same time,
which is known as concurrent execution of the database. It means that the same database
is accessed simultaneously by different users on a multi-user system.
 While working on the database transactions, there occurs the requirement of using the
database by multiple users for performing different operations, and in that case, concurrent
execution of the database is performed.
 The simultaneous execution should be performed in an interleaved manner, and no
operation should interfere with the other executing operations, thus maintaining the
consistency of the database. However, concurrent execution of transaction operations gives
rise to several challenging problems that need to be solved.

Problems with Concurrent Execution

In a database transaction, the two main operations are READ and WRITE. These operations need
to be managed carefully in the concurrent execution of transactions, because if they are
interleaved in an uncontrolled manner, the data may become inconsistent. So,
the following problems occur with the concurrent execution of operations:

Problem 1: Lost Update Problems (W - W Conflict)

The problem occurs when two different database transactions perform the read/write operations
on the same database items in an interleaved manner (i.e., concurrent execution) that makes the
values of the items incorrect hence making the database inconsistent.

For example:

Consider the below diagram where two transactions TX and TY, are performed on the same account
A where the balance of account A is $300.

Table 2:Concurrent execution

 At time t1, transaction TX reads the value of account A, i.e., $300 (only read).
 At time t2, transaction TX deducts $50 from account A that becomes $250 (only deducted
and not updated/write).
 Alternatively, at time t3, transaction TY reads the value of account A, which will still be $300
because TX hasn't written its update yet.
 At time t4, transaction TY adds $100 to account A that becomes $400 (only added but not
updated/write).
 At time t6, transaction TX writes the value of account A that will be updated as $250 only,
as TY didn't update the value yet.
 Similarly, at time t7, transaction TY writes the values of account A, so it will write as done
at time t4 that will be $400. It means the value written by TX is lost, i.e., $250 is lost.

Hence the data becomes incorrect, and the database is left in an inconsistent state.
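The interleaving in the table above can be mirrored in a few lines of Python. The sketch below is a deterministic simulation (the step order is hard-coded to follow times t1–t7; no real concurrency is involved) showing how TX's update is overwritten and lost.

# Sketch of the lost update (W-W conflict), mirroring the interleaving above.
A = 300                         # balance of account A

tx_value = A                    # t1: TX reads A              -> 300
tx_value = tx_value - 50        # t2: TX deducts $50 locally  -> 250
ty_value = A                    # t3: TY reads A (TX has not written yet) -> 300
ty_value = ty_value + 100       # t4: TY adds $100 locally    -> 400
A = tx_value                    # t6: TX writes A             -> 250
A = ty_value                    # t7: TY writes A             -> 400, TX's value ($250) is lost

print("final balance:", A)      # 400, instead of the correct 300 - 50 + 100 = 350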

Dirty Read Problems (W-R Conflict)

The dirty read problem occurs when one transaction updates an item of the database and then
fails, and before the data is rolled back, the updated database item is accessed by
another transaction. This creates a Write-Read conflict between the two transactions.

For example:

Consider two transactions TX and TY in the below diagram performing read/write operations on
account A where the available balance in account A is $300:

Table 3:Dirty read problem

 At time t1, transaction TX reads the value of account A, i.e., $300.


 At time t2, transaction TX adds $50 to account A that becomes $350.
 At time t3, transaction TX writes the updated value in account A, i.e., $350.
 Then at time t4, transaction TY reads account A that will be read as $350.
 Then at time t5, transaction TX rolls back due to a server problem, and the value changes back
to $300 (as initially).

 But the value of account A remains $350 for transaction TY, even though the change was never
committed; reading this uncommitted data is a dirty read, and this is therefore known as the Dirty Read Problem.
Unrepeatable Read Problem (R-W Conflict)

Also known as the Inconsistent Retrievals Problem, it occurs when, within a single transaction, two different
values are read for the same database item.

For example:

Consider two transactions, TX and TY, performing the read/write operations on account A, having
an available balance = $300. The diagram is shown below:

Table 4:Unrepeatable Read Problem

 At time t1, transaction TX reads the value from account A, i.e., $300.
 At time t2, transaction TY reads the value from account A, i.e., $300.
 At time t3, transaction TY updates the value of account A by adding $100 to the available
balance, and then it becomes $400.
 At time t4, transaction TY writes the updated value, i.e., $400.
 After that, at time t5, transaction TX reads the available value of account A, and that will
be read as $400.

 It means that within the same transaction TX, two different values of account A are read,
i.e., $300 initially and, after the update made by transaction TY, $400. This is an
unrepeatable read and is therefore known as the Unrepeatable Read problem.

Thus, in order to maintain consistency in the database and avoid such problems in
concurrent execution, management is needed, and that is where the concept of Concurrency
Control comes into play.

Concurrency Control

Concurrency Control is the working concept that is required for controlling and managing the
concurrent execution of database operations and thus avoiding the inconsistencies in the database.
Thus, for maintaining the concurrency of the database, we have the concurrency control protocols.

Concurrency Control Protocols

The concurrency control protocols ensure the atomicity, consistency, isolation,
durability and serializability of the concurrent execution of database transactions. Therefore,
these protocols are categorized as:

o Lock Based Concurrency Control Protocol


o Time Stamp Concurrency Control Protocol
o Validation Based Concurrency Control Protocol

Lock-Based Protocol

In this type of protocol, a transaction cannot read or write data until it acquires an appropriate
lock on it. There are two types of locks:

Shared lock:

 It is also known as a Read-only lock. With a shared lock, the data item can only be read by the
transaction.
 It can be shared between transactions, because a transaction holding a shared lock
can't update the data item.

Exclusive lock:

 With an exclusive lock, the data item can be both read and written by the transaction.
 This lock is exclusive: only one transaction can hold it on a data item at a time, so multiple
transactions cannot modify the same data simultaneously.
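A lock manager enforcing these two modes can be sketched as a table mapping each data item to its lock mode and holders. The Python sketch below is a minimal, single-threaded illustration of the shared/exclusive compatibility rules (it simply refuses conflicting requests; real lock managers also queue waiting transactions and handle deadlocks).

class SimpleLockManager:
    """Minimal shared (S) / exclusive (X) lock table: grant or refuse, no queuing."""
    def __init__(self):
        self.locks = {}                        # item -> (mode, set of holders)

    def acquire(self, tx, item, mode):
        entry = self.locks.get(item)
        if entry is None:                      # unlocked: any request is granted
            self.locks[item] = (mode, {tx})
            return True
        held_mode, holders = entry
        if mode == "S" and held_mode == "S":   # shared locks are compatible
            holders.add(tx)
            return True
        if holders == {tx}:                    # sole holder may re-request or upgrade
            self.locks[item] = (mode if mode == "X" else held_mode, holders)
            return True
        return False                           # exclusive conflicts with locks held by others

    def release(self, tx, item):
        mode, holders = self.locks.get(item, (None, set()))
        holders.discard(tx)
        if not holders:
            self.locks.pop(item, None)

lm = SimpleLockManager()
print(lm.acquire("T1", "A", "S"))   # True: read-only lock
print(lm.acquire("T2", "A", "S"))   # True: shared lock can be shared
print(lm.acquire("T2", "A", "X"))   # False: T1 also holds a shared lock on A
lm.release("T1", "A")
print(lm.acquire("T2", "A", "X"))   # True: T2 is now the only holder and may upgrade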
There are four types of lock protocols available:
1. Simplistic lock protocol

It is the simplest way of locking data during a transaction. Simplistic lock-based protocols require
every transaction to obtain a lock on the data before it inserts, deletes or updates it. The data item is
unlocked after the transaction completes.

2. Pre-claiming Lock Protocol

 Pre-claiming Lock Protocols evaluate the transaction to list all the data items on which they
need locks.
 Before initiating execution of the transaction, it requests the DBMS for locks on all
those data items.
 If all the locks are granted, then this protocol allows the transaction to begin. When the
transaction is completed, it releases all the locks.
 If all the locks are not granted, then the transaction rolls back and
waits until all the locks are granted.

Figure 2:Pre-claiming Lock Protocol

3. Two-phase locking (2PL)

 The two-phase locking protocol divides the execution phase of the transaction into three
parts.
 In the first part, when the execution of the transaction starts, it seeks permission for the
lock it requires.
 In the second part, the transaction acquires all the locks. The third phase is started as soon
as the transaction releases its first lock.
 In the third phase, the transaction cannot demand any new locks. It only releases the
acquired locks.

Figure 3:Two-phase locking (2PL)

There are two phases of 2PL:

Growing phase: In the growing phase, a new lock on the data item may be acquired by the
transaction, but none can be released.

Shrinking phase: In the shrinking phase, existing locks held by the transaction may be released,
but no new locks can be acquired.

If lock conversion is allowed, then the following rules apply:

1. Upgrading of a lock (from S(a) to X(a)) is allowed in the growing phase.

2. Downgrading of a lock (from X(a) to S(a)) must be done in the shrinking phase.

Example:
Table 5:Unlocking and locking work with 2-PL

The following shows how locking and unlocking work with 2-PL.

Transaction T1:

 Growing phase: from step 1-3


 Shrinking phase: from step 5-7
 Lock point: at 3

Transaction T2:

 Growing phase: from step 2-6


 Shrinking phase: from step 8-9
 Lock point: at 6
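The growing/shrinking discipline illustrated in Table 5 can be checked mechanically: every lock request must come before the first unlock. The Python sketch below does exactly that for a sequence of lock/unlock steps; the item names and the example sequences are assumptions chosen to mirror the table.

# Illustrative 2PL check for one transaction: all acquires precede all releases.
def run_two_phase(operations):
    """operations: list of ('lock' or 'unlock', item) steps in order."""
    shrinking = False
    held = set()
    for action, item in operations:
        if action == "lock":
            if shrinking:
                raise RuntimeError("2PL violated: lock requested after first unlock")
            held.add(item)                 # growing phase
        else:
            shrinking = True               # first unlock: lock point has been passed
            held.discard(item)             # shrinking phase
    return "schedule obeys 2PL"

t1 = [("lock", "A"), ("lock", "B"),        # growing phase
      ("unlock", "A"), ("unlock", "B")]    # shrinking phase
print(run_two_phase(t1))

bad = [("lock", "A"), ("unlock", "A"), ("lock", "B")]   # acquires after releasing: not 2PL
try:
    run_two_phase(bad)
except RuntimeError as e:
    print(e)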

4. Strict Two-phase locking (Strict-2PL)

 The first phase of Strict-2PL is similar to 2PL. In the first phase, after acquiring all the
locks, the transaction continues to execute normally.
 The only difference between 2PL and strict 2PL is that Strict-2PL does not release a lock
after using it.
 Strict-2PL waits until the whole transaction commits, and then it releases all the locks at
once.
 The Strict-2PL protocol does not have a shrinking phase of gradual lock release.

Figure 4:Strict Two-phase locking (Strict-2PL)

Unlike 2PL, it does not suffer from cascading aborts.

Timestamp Ordering Protocol


 The Timestamp Ordering Protocol is used to order the transactions based on their
Timestamps. The order of transaction is nothing but the ascending order of the transaction
creation.
 The older transaction has a higher priority, which is why it executes first. To determine the
timestamp of the transaction, this protocol uses system time or a logical counter.
 The lock-based protocol is used to manage the order between conflicting pairs among
transactions at the execution time. But Timestamp based protocols start working as soon
as a transaction is created.

 Let's assume there are two transactions T1 and T2. Suppose transaction T1 entered
the system at time 007 and transaction T2 entered the system at time 009. T1 has the
higher priority, so it executes first because it entered the system first.
 The timestamp ordering protocol also maintains the timestamp of the last 'read' and 'write'
operation on each data item.

Basic Timestamp ordering protocol works as follows:

1. Check the following condition whenever a transaction Ti issues a Read (X) operation:

 If W_TS(X) > TS(Ti), then the operation is rejected.

 If W_TS(X) <= TS(Ti), then the operation is executed.
 The read timestamp R_TS(X) is updated to max(R_TS(X), TS(Ti)).

2. Check the following condition whenever a transaction Ti issues a Write(X) operation:

 If TS(Ti) < R_TS(X), then the operation is rejected.

 If TS(Ti) < W_TS(X), then the operation is rejected and Ti is rolled back; otherwise the
operation is executed.

where:

TS(Ti) denotes the timestamp of the transaction Ti.

R_TS(X) denotes the Read time-stamp of data-item X.

W_TS(X) denotes the Write time-stamp of data-item X.
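The read and write tests above translate almost directly into code. The sketch below is a simplified, single-item illustration of basic timestamp ordering in Python (the data structures and the timestamps 007/009 are assumptions taken from the earlier example; a real scheduler would also handle rollback and multiple items).

# Basic timestamp ordering checks for one data item X (illustrative only).
read_ts = {"X": 0}     # R_TS(X)
write_ts = {"X": 0}    # W_TS(X)

def read_item(ts_ti, x):
    if write_ts[x] > ts_ti:                # a younger transaction already wrote X
        return "rejected: roll back Ti"
    read_ts[x] = max(read_ts[x], ts_ti)    # execute and update R_TS(X)
    return "read executed"

def write_item(ts_ti, x):
    if ts_ti < read_ts[x] or ts_ti < write_ts[x]:   # a younger transaction read or wrote X
        return "rejected: roll back Ti"
    write_ts[x] = ts_ti                    # execute and update W_TS(X)
    return "write executed"

print(read_item(7, "X"))     # T1 with TS = 007 reads X: executed
print(write_item(9, "X"))    # T2 with TS = 009 writes X: executed, W_TS(X) = 9
print(write_item(7, "X"))    # T1 then tries to write X: rejected, since TS(T1) < W_TS(X)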

Advantages and Disadvantages of TO protocol:

 The TO protocol ensures serializability, since its precedence graph is as follows:

Figure 5:Precedence Graph for TS Ordering

 The TO protocol ensures freedom from deadlock, since no transaction ever waits.
 But the schedule may not be recoverable and may not even be cascade-free.

Validation Based Protocol

The validation-based protocol is also known as the optimistic concurrency control technique. In the
validation-based protocol, the transaction is executed in the following three phases:

1. Read phase: In this phase, transaction T reads the values of various data items and stores
them in temporary local variables. It can perform all the write operations on temporary
variables without updating the actual database.
2. Validation phase: In this phase, the temporary variable values are validated against the
actual data to see whether they violate serializability.
3. Write phase: If the transaction passes validation, then the temporary results are written
to the database or system; otherwise the transaction is rolled back.

Here each phase has the following different timestamps:

Start(Ti): It contains the time when Ti started its execution.

Validation (Ti): It contains the time when Ti finishes its read phase and starts its validation phase.

Finish(Ti): It contains the time when Ti finishes its write phase.

 This protocol uses the timestamp of the validation phase as the timestamp of the transaction
for serialization, since the validation phase is the phase that actually determines whether the
transaction will commit or roll back.
 Hence TS(T) = validation(T).
 The serializability is determined during the validation process. It can't be decided in
advance.

 While executing transactions, it allows a greater degree of concurrency and also a smaller
number of conflicts.
 Thus it results in transactions with fewer rollbacks.
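The three phases can be sketched as follows. This is a structural illustration only: writes are buffered in a local dictionary during the read phase, and the validation check shown (no item touched by the transaction was written after the transaction started) is a simple criterion assumed for the example, not the exact test of any particular DBMS.

import time

database = {"X": 10, "Y": 20}
last_write_time = {"X": 0.0, "Y": 0.0}    # when each item was last written

def run_optimistic(tx_writes):
    start = time.time()                   # Start(Ti)
    local = {}                            # temporary local variables (read phase)
    for item, new_value in tx_writes:
        local[item] = new_value           # write only to the local workspace
    # Validation phase: reject if any touched item was written after we started.
    validation = time.time()              # Validation(Ti); TS(T) = Validation(T)
    if any(last_write_time[item] > start for item in local):
        return "rolled back"
    # Write phase: apply the buffered updates to the actual database.
    for item, value in local.items():
        database[item] = value
        last_write_time[item] = time.time()   # Finish(Ti)
    return "committed"

print(run_optimistic([("X", 15)]))        # committed: no conflicting write occurred
print(database)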

1.3. Problems of Concurrent Sharing
When multiple transactions execute concurrently in an uncontrolled or unrestricted manner, then
it might lead to several problems. These problems are commonly referred to as concurrency
problems in a database environment. The five concurrency problems that can occur in the
database are:
 Temporary Update Problem
 Incorrect Summary Problem
 Lost Update Problem
 Unrepeatable Read Problem
 Phantom Read Problem
These are explained below.

Temporary Update Problem:


Temporary update or dirty read problem occurs when one transaction updates an item and fails.
But the updated item is used by another transaction before the item is changed or reverted back
to its last value.
Example:

Table 6: Temporary Update Problem

In the above example, if transaction 1 fails for some reason then X will revert back to its previous
value. But transaction 2 has already read the incorrect value of X.
Incorrect Summary Problem:
Consider a situation, where one transaction is applying the aggregate function on some records
while another transaction is updating these records. The aggregate function may calculate some
values before the values have been updated and others after they are updated.
Example:
Table 7: Incorrect Summary Problem

In the above example, transaction 2 is calculating the sum of some records while transaction 1
is updating them. Therefore the aggregate function may calculate some values before they have
been updated and others after they have been updated.

Lost Update Problem:
In the lost update problem, an update done to a data item by a transaction is lost as it is
overwritten by the update done by another transaction.
Example:
Table 8: Lost Update Problem

In the above example, transaction 1 changes the value of X but it gets overwritten by the update
done by transaction 2 on X. Therefore, the update done by transaction 1 is lost.

Unrepeatable Read Problem:
The unrepeatable read problem occurs when two or more read operations of the same transaction read
different values of the same variable.
Example:
Table 9: Unrepeatable Read Problem

In the above example, once transaction 2 reads the variable X, a write operation in transaction 1
changes the value of the variable X.

Thus, when another read operation is performed by transaction 2, it reads the new value of X
which was updated by transaction 1.
Phantom Read Problem:
The phantom read problem occurs when a transaction reads a variable once but when it tries to
read that same variable again, an error occurs saying that the variable does not exist.

Example: Table 10: Phantom Read Problem

In the above example, once transaction 2 reads the variable X, transaction 1 deletes the variable
X without transaction 2’s knowledge. Thus, when transaction 2 tries to read X, it is not able to
do it.

1.4. Concept of Serializability


A schedule is serializable if it is equivalent to a serial schedule. A concurrent schedule must produce
the same result as if the transactions were executed serially, i.e., one after another. It refers to the sequence
of actions such as read, write, abort and commit being performed in an order equivalent to a serial one.

Example
Let’s take two transactions T1 and T2,

If both transactions are performed without interfering with each other, it is called a serial schedule.
It can be represented as follows −

Table 11: Concept of Serializability

T1                    T2

READ1(A)
WRITE1(A)
READ1(B)
C1
                      READ2(B)
                      WRITE2(B)
                      READ2(B)
                      C2

Non-serial schedule − When the operations of transactions T1 and T2 are interleaved (overlap) with each other.
Example
Consider the following example −
Table 12: Non serial schedule

T1                    T2

READ1(A)
WRITE1(A)
                      READ2(B)
                      WRITE2(B)
READ1(B)
WRITE1(B)
READ1(B)

1.4.1. Types of serializability

There are two types of serializability −


¤ View serializability
A schedule is view serializable if it is view equivalent to a serial schedule.
The conditions it must satisfy are as follows:
 If a transaction reads the initial value of A in one schedule, it also reads the initial value of A in the other.
 If a transaction reads a value of A written by another transaction in one schedule, it reads the value
written by that same transaction in the other.
 The transaction that performs the final write on A must be the same in both schedules.

A schedule is view serializable if it is view equivalent to a serial schedule. If a schedule is
conflict serializable, then it is also view serializable.

A view serializable schedule that is not conflict serializable contains blind writes.
View Equivalence
Two schedules S1 and S2 are said to be view equivalent if they satisfy the following conditions:

1. Initial Read
The initial read of both schedules must be the same. Suppose two schedules S1 and S2: if a
transaction T1 reads the data item A first in S1, then in S2, transaction T1 should also perform the
initial read of A. For example, two schedules are view equivalent with respect to this rule if the
initial read operation on A in S1 is done by T1 and in S2 it is also done by T1.
2. Updated Read
In schedule S1, if Ti reads A which was updated by Tj, then in S2 also, Ti should read A as
updated by Tj. For example, two schedules are not view equal if, in S1, T3 reads A updated by T2
while in S2, T3 reads A updated by T1.

3. Final Write
The final write must be the same in both schedules. In schedule S1, if a transaction T1
performs the final update on A, then in S2 the final write operation should also be done by T1.
For example, two schedules are view equal if the final write operation in S1 is done by T3 and in
S2 the final write operation is also done by T3.
¤ Conflict serializability
A schedule is conflict serializable if it orders conflicting operations in the same way as some serial execution.
A pair of operations is said to conflict if they operate on the same data item and at least one of them is a write operation.
That means
 Readi(x) Readj(x) − non-conflicting read-read operation.
 Readi(x) Writej(x) − conflicting read-write operation.
 Writei(x) Readj(x) − conflicting write-read operation.
 Writei(x) Writej(x) − conflicting write-write operation.
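These conflict rules lead to a practical test: build a precedence graph with an edge Ti → Tj whenever an operation of Ti conflicts with a later operation of Tj; the schedule is conflict serializable exactly when this graph has no cycle. The Python sketch below implements that test; the example schedule at the end is an assumption, not one of the tables above.

# Conflict serializability test via a precedence graph.
def is_conflict_serializable(schedule):
    """schedule: list of (transaction, 'R' or 'W', item) in execution order."""
    edges = set()
    for i, (ti, op_i, x_i) in enumerate(schedule):
        for tj, op_j, x_j in schedule[i + 1:]:
            if ti != tj and x_i == x_j and "W" in (op_i, op_j):
                edges.add((ti, tj))        # Ti must precede Tj in any equivalent serial order

    # The schedule is conflict serializable iff the precedence graph is acyclic.
    nodes = {t for t, _, _ in schedule}
    visiting, done = set(), set()

    def has_cycle(node):
        visiting.add(node)
        for a, b in edges:
            if a == node:
                if b in visiting or (b not in done and has_cycle(b)):
                    return True
        visiting.discard(node)
        done.add(node)
        return False

    return not any(has_cycle(n) for n in nodes if n not in done)

# Example (assumed): T1 and T2 write the same item in an interleaved way -> cycle -> not serializable.
s = [("T1", "R", "A"), ("T2", "W", "A"), ("T1", "W", "A")]
print(is_conflict_serializable(s))   # False: edges T1 -> T2 (R-W) and T2 -> T1 (W-W) form a cycle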

1.5. Database Recovery

¤ Database recovery is the process of restoring the database to a correct (consistent) state in
the event of a failure. In other words, it is the process of restoring the database to the most
recent consistent state that existed shortly before the time of system failure.
¤ The failure may be the result of a system crash due to hardware or software errors, a media
failure such as head crash, or a software error in the application such as a logical error in
the program that is accessing the database.
¤ Recovery restores a database from a given state, usually inconsistent, to a previously
consistent state

¤ DBMS is a highly complex system with hundreds of transactions being executed every
second. The durability and robustness of a DBMS depends on its complex architecture and
its underlying hardware and system software. If it fails or crashes amid transactions, it is
expected that the system would follow some sort of algorithm or techniques to recover lost
data.

Failure Classification
 To see where the problem has occurred, we generalize a failure into various
categories, as follows

Transaction failure
 A transaction has to abort when it fails to execute or when it reaches a point from
where it can't go any further. This is called a transaction failure, where only a few
transactions or processes are affected.

Reasons for a transaction failure could be:-


 Logical errors − Where a transaction cannot complete because it has some code error or
any internal error condition.

 System errors − Where the database system itself terminates an active transaction because
the DBMS is not able to execute it, or it has to stop because of some system condition.
For example, in case of deadlock or resource unavailability, the system aborts an active
transaction.

System Crash
There are problems − external to the system − that may cause the system to stop abruptly and cause
the system to crash. For example, interruptions in power supply may cause the failure of underlying
hardware or software failure. Examples may include operating system errors.

Disk Failure
In early days of technology evolution, it was a common problem where hard-disk drives or
storage drives used to fail frequently.

Disk failures include the formation of bad sectors, the disk becoming unreachable, a disk head crash or any
other failure which destroys all or a part of the disk storage.

Storage Structure
We have already described the storage system. In brief, the storage structure can be divided into
two categories:

 Volatile storage − As the name suggests, a volatile storage cannot survive system crashes.
Volatile storage devices are placed very close to the CPU; normally they are embedded
onto the chipset itself. For example, main memory and cache memory are examples of
volatile storage. They are fast but can store only a small amount of information.

 Non-volatile storage − These memories are made to survive system crashes. They are
huge in data storage capacity, but slower in accessibility. Examples may include hard-
disks, magnetic tapes, flash memory, and non-volatile (battery backed up) RAM.

Recovery and Atomicity


When a system crashes, it may have several transactions being executed and various files opened
for them to modify the data items. Transactions are made of various operations, which are atomic
in nature. But according to ACID properties of DBMS, atomicity of transactions as a whole must
be maintained, that is, either all the operations are executed or none.

When a DBMS recovers from a crash, it should maintain the following −

 It should check the states of all the transactions, which were being executed.

 A transaction may be in the middle of some operation; the DBMS must ensure the
atomicity of the transaction in this case.

 It should check whether the transaction can be completed now or it needs to be rolled back.

 No transactions would be allowed to leave the DBMS in an inconsistent state.

There are two types of techniques, which can help a DBMS in recovering as well as maintaining
the atomicity of a transaction:-

 Maintaining the logs of each transaction, and writing them onto some stable storage before
actually modifying the database.

 Maintaining shadow paging, where the changes are done on a volatile memory, and later,
the actual database is updated.

Log-based Recovery
Log is a sequence of records, which maintains the records of actions performed by a transaction.
It is important that the logs are written prior to the actual modification and stored on a stable
storage media, which is failsafe. Log-based recovery works as follows −

 The log file is kept on a stable storage media.

 When a transaction enters the system and starts execution, it writes a log about it.

<Tn, Start>

 When the transaction modifies an item X, it write logs as follows −

<Tn, X, V1, V2>

It records that Tn has changed the value of X from V1 to V2.

 When the transaction finishes, it logs:

<Tn, commit>

The database can be modified using two approaches:

 Deferred database modification − All logs are written on to the stable storage and the
database is updated when a transaction commits.

 Immediate database modification − Each log follows an actual database modification.


That is, the database is modified immediately after every operation.
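The deferred approach can be illustrated with a small write-ahead log in Python. The record layout and the helper names below are simplified assumptions for the example: every change is appended to the log first, and the database itself is only touched when the transaction commits.

# Simplified deferred-modification logging: log first, apply at commit.
database = {"X": 100}
log = []                                  # kept, conceptually, on stable storage

def begin(tn):
    log.append((tn, "start"))             # <Tn, Start>

def write(tn, item, new_value):
    old_value = database[item]
    log.append((tn, item, old_value, new_value))   # <Tn, X, V1, V2> written before any change

def commit(tn):
    log.append((tn, "commit"))            # <Tn, Commit>
    for record in log:                    # apply this transaction's deferred updates
        if len(record) == 4 and record[0] == tn:
            _, item, _, new_value = record
            database[item] = new_value

begin("T1")
write("T1", "X", 150)                     # the database still shows X = 100 at this point
commit("T1")                              # now the update is applied: X = 150
print(database, log)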

Recovery with Concurrent Transactions


When more than one transaction are being executed in parallel, the logs are interleaved. At the
time of recovery, it would become hard for the recovery system to backtrack all logs, and then
start recovering. To ease this situation, most modern DBMS use the concept of 'checkpoints'.

Checkpoint
Keeping and maintaining logs in real time and in a real environment may fill up all the memory
space available in the system. As time passes, the log file may grow too big to be handled at all.
A checkpoint is a mechanism where all the previous logs are removed from the system and stored

permanently on a storage disk. A checkpoint declares a point before which the DBMS was in a
consistent state, and all the transactions were committed.

Recovery
When a system with concurrent transactions crashes and recovers, it behaves in the following
manner:-

 The recovery system reads the logs backwards from the end to the last checkpoint.

 It maintains two lists, an undo-list and a redo-list.

 If the recovery system sees a log with <Tn, Start> and <Tn, Commit> or just <Tn, Commit>,
it puts the transaction in the redo-list.

 If the recovery system sees a log with <Tn, Start> but no commit or abort log found, it puts
the transaction in undo-list.

All the transactions in the undo-list are then undone and their logs are removed. For the
transactions in the redo-list, their previous logs are removed and the transactions are then redone,
after which their logs are saved again.
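The backward scan that builds these two lists can be sketched as follows. The log format reuses the <Tn, Start>/<Tn, Commit> records described earlier; the sample log and the position of the checkpoint are assumptions for the example.

# Build undo/redo lists by reading the log backwards to the last checkpoint.
log = [
    ("checkpoint",),
    ("T1", "start"), ("T1", "commit"),
    ("T2", "start"),                      # no commit or abort: must be undone
    ("T3", "start"), ("T3", "commit"),
]

def recover(log):
    undo_list, redo_list = set(), set()
    for record in reversed(log):          # read the logs backwards ...
        if record[0] == "checkpoint":
            break                         # ... down to the last checkpoint
        tn, action = record
        if action == "commit":
            redo_list.add(tn)             # <Tn, Start> ... <Tn, Commit>: redo
        elif action == "start" and tn not in redo_list:
            undo_list.add(tn)             # start seen but no commit: undo
    return undo_list, redo_list

undo, redo = recover(log)
print("undo:", undo)   # {'T2'}
print("redo:", redo)   # {'T1', 'T3'}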

1.6. Transaction and Recovery


1.6.1. Transaction

 A transaction is a ‘logical unit of work’ on a database


• Each transaction does something in the database
• No part of it alone achieves anything of use or interest
 Transactions are the unit of recovery, consistency, and integrity as well

1.6.2. Recovery

¤ Transactions should be durable, but we cannot prevent all sorts of failures:


 System crashes
 Power failures
 Disk crashes
 User mistakes
¤ Prevention is better than cure
 Reliable OS
 Security
 UPS and surge protectors
 RAID arrays
 Can’t protect against everything though

1.6.3. Recovery techniques and facilities

¤ Database systems, like any other computer system, are subject to failures but the data stored
in it must be available as and when required.
¤ When a database fails it must possess the facilities for fast recovery. It must also have
atomicity i.e. either transactions are completed successfully and committed (the effect is
recorded permanently in the database) or the transaction should have no effect on the
database. There are both automatic and non-automatic ways for both, backing up of data and
recovery from any failure situations.
¤ The techniques used to recover the lost data due to system crash, transaction errors, viruses,
catastrophic failure, incorrect commands execution etc. are database recovery techniques. So
to prevent data loss recovery techniques based on deferred update and immediate update or
backing up data can be used.
¤ Recovery techniques are heavily dependent upon the existence of a special file known as
a system log. It contains information about the start and end of each transaction and any
updates which occur in the transaction. The log keeps track of all transaction operations that
affect the values of database items. This information is needed to recover from transaction
failure.
 The log is kept on disk.
 start_transaction(T): This log entry records that transaction T starts its
execution.
 read_item(T, X): This log entry records that transaction T reads the value of database
item X.

 write_item(T, X, old_value, new_value): This log entry records that transaction T
changes the value of the database item X from old_value to new_value. The old value
is sometimes known as the before-image of X, and the new value is known as the
after-image of X.
 commit(T): This log entry records that transaction T has completed all accesses to the
database successfully and its effect can be committed (recorded permanently) to the
database.
 abort(T): This records that transaction T has been aborted.
 checkpoint: This log entry records that a checkpoint has been taken. As described earlier,
a checkpoint declares a point before which the DBMS was in a consistent state and all the
transactions had been committed.
A transaction T reaches its commit point when all its operations that access the database have
been executed successfully, i.e. the transaction has reached the point at which it will
not abort (terminate without completing). Once committed, the transaction is permanently
recorded in the database. Commitment always involves writing a commit entry to the log and
writing the log to disk. At the time of a system crash, the log is searched backwards for all
transactions T that have written a start_transaction(T) entry but have not written a
commit(T) entry yet; these transactions may have to be rolled back to undo their effect on the
database during the recovery process.
 Undoing – If a transaction crashes, the recovery manager may undo the transaction, i.e.
reverse its operations. This involves examining the log for each entry
write_item(T, X, old_value, new_value) of the transaction and setting the value of item X in the
database back to old_value. There are two major techniques for recovery from non-catastrophic
transaction failures: deferred update and immediate update.
 Deferred update – This technique does not physically update the database on disk until
a transaction has reached its commit point. Before reaching commit, all transaction
updates are recorded in the local transaction workspace. If a transaction fails before
reaching its commit point, it will not have changed the database in any way so UNDO is
not needed. It may be necessary to REDO the effect of the operations that are recorded
in the local transaction workspace, because their effect may not yet have been written in
the database. Hence, a deferred update is also known as the No-undo/redo algorithm

 Immediate update – In the immediate update, the database may be updated by some
operations of a transaction before the transaction reaches its commit point. However,
these operations are recorded in a log on disk before they are applied to the database,
making recovery still possible. If a transaction fails to reach its commit point, the effect
of its operation must be undone i.e. the transaction must be rolled back hence we require
both undo and redo. This technique is known as undo/redo algorithm.
 Caching/Buffering – In this technique, one or more disk pages that include the data items to be updated
are cached into main memory buffers and then updated in memory before being written
back to disk. A collection of in-memory buffers, called the DBMS cache, is kept under the
control of the DBMS for holding these buffers. A directory is used to keep track of which
database items are in the buffers. A dirty bit is associated with each buffer: it is 0 if
the buffer has not been modified and 1 if it has.
 Shadow paging – It provides atomicity and durability. A directory with n entries is
constructed, where the ith entry points to the ith database page on disk. When a
transaction begins executing, the current directory is copied into a shadow directory. When
a page is to be modified, a shadow page is allocated, the changes are made in it, and when
the transaction is ready to become durable, all entries that referred to the original page are
updated to refer to the new replacement page.
Some of the backup techniques are as follows:

 Full database backup – The full database, including the data and the metadata needed to
restore the whole database (such as full-text catalogs), is backed up on a predefined schedule.
 Differential backup – It stores only the data changes that have occurred since the last full
database backup. When the same data has changed many times since the last full database
backup, a differential backup stores only the most recent version of the changed data. To
restore it, the full database backup must be restored first.
 Transaction log backup – In this, all events that have occurred in the database, i.e. a
record of every statement executed, are backed up. It is a backup of the transaction log
entries and contains all transactions that have happened to the database. Through this, the
database can be recovered to a specific point in time.
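As a concrete sketch of these three backup types, using Microsoft SQL Server's T-SQL syntax (the database name and file paths are hypothetical, and the log backup assumes the full recovery model):

-- full database backup
BACKUP DATABASE University TO DISK = 'D:\backup\university_full.bak';
-- differential backup: only changes since the last full backup
BACKUP DATABASE University TO DISK = 'D:\backup\university_diff.bak' WITH DIFFERENTIAL;
-- transaction log backup: enables point-in-time recovery
BACKUP LOG University TO DISK = 'D:\backup\university_log.trn';

A point-in-time restore would then apply the full backup, the most recent differential backup, and the subsequent log backups in order.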

Chapter two
Query Processing and Optimization
2.1. Overview
All database systems must be able to respond to requests for information from the user i.e. process
queries. Obtaining the desired information from a database system in a predictable and reliable
fashion is the scientific art of Query Processing. Getting these results back in a timely manner
deals with the technique of Query Optimization. Query Processing is the activity performed in
extracting data from the database.

2.2. Query Processing Steps


Query processing involves several steps for fetching data from the database. The steps
involved are:

1. Parsing and translation


2. Optimization
3. Evaluation

The query processing works in the following way:

Parsing and Translation

¤ Query processing includes certain activities for data retrieval. Initially, the user query is
given in a high-level database language such as SQL.
¤ It is then translated into expressions that can be used at the physical level of the file
system. After this, the actual evaluation of the query and a variety of query-optimizing
transformations take place.
¤ Thus, before processing a query, the system needs to translate it into a form suitable for
internal use. SQL, or Structured Query Language, is the most suitable choice for humans, but it
is not suitable as the internal representation of the query within the system. Relational algebra
is well suited for the internal representation of a query.

The translation process in query processing is similar to parsing a query. When a user executes
a query, the parser in the system checks the syntax of the query and verifies the names of the relations
in the database, the tuples, and the required attribute values in order to generate the internal form of
the query. The parser creates a tree of the query, known as a 'parse tree', which is then translated
into relational algebra. In doing so, it also replaces all uses of views in the query with their
definitions.

The working of query processing can be understood from the diagram described below.

Figure 6: Steps in query processing

Suppose a user executes a query. As we have learned, there are various methods of extracting
data from the database. Suppose that, in SQL, a user wants to fetch the records of the employees whose
salary is greater than 10000. The following query is used:

select emp_name from Employee where salary > 10000;

To make the system understand the user query, it needs to be translated into relational algebra.
This query can be written in relational algebra as:

π emp_name (σ salary>10000 (Employee))

After translating the given query, we can execute each relational algebra operation using
different algorithms. In this way, query processing begins its work.

Evaluation

In addition to translating the query into relational algebra, it is required to annotate the translated
relational algebra expression with instructions specifying how to evaluate each
operation. Thus, after translating the user query, the system executes a query evaluation plan.

Query Evaluation Plan


 In order to fully evaluate a query, the system needs to construct a query evaluation plan.
 The annotations in the evaluation plan may refer to the algorithms to be used for the
particular index or the specific operations.
 Such relational algebra with annotations is referred to as Evaluation Primitives. The
evaluation primitives carry the instructions needed for the evaluation of the operation.
 Thus, a query evaluation plan defines a sequence of primitive operations used for
evaluating a query. The query evaluation plan is also referred to as the query execution
plan.
 A query execution engine is responsible for generating the output of the given query. It
takes the query execution plan, executes it, and finally makes the output for the user query.
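As an illustration, most relational systems can display the query execution plan they have chosen. For example, in PostgreSQL or MySQL (the exact keyword and output format vary between systems), the plan for the employee query shown earlier can be inspected with:

EXPLAIN SELECT emp_name FROM Employee WHERE salary > 10000;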

2.3. Query Decomposition


Query decomposition is the first phase of query processing. The primary targets of query
decomposition are to transform a high-level query into a relational algebra query and to check that
the query is syntactically and semantically correct. The typical stages of query decomposition are
analysis, normalization, semantic analysis, simplification, and query restructuring.

2.4. Optimization Process

 The cost of query evaluation can vary for different types of queries. Although the
system is responsible for constructing the evaluation plan, the user does not need to write
the query efficiently.
 Usually, a database system generates an efficient query evaluation plan that minimizes
its cost. This task, performed by the database system, is known as Query
Optimization.
 For optimizing a query, the query optimizer should have an estimated cost analysis of each
operation. It is because the overall operation cost depends on the memory allocations to
several operations, execution costs, and so on.
Finally, after selecting an evaluation plan, the system evaluates the query and produces the
output of the query.
¤ Several further topics arise in query optimization, discussed below.

2.4.1. Top-K Optimization

A database system is used for fetching data. Some user queries access results sorted on certain
attributes and require only the top K results for some K. Some systems support a bound K, or limit K,
clause that retrieves only the top K results; for queries where such a clause is not available, the
optimizer can be given a hint indicating that only the top K results of the query are needed, even if
the query would generate more results. When K is small, a query optimization plan that produces the
entire result set, sorts it, and then returns the top K results is inefficient, since it is likely to
discard most of the computed intermediate results. Therefore, several methods are used to optimize such
top-k queries.
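For instance, a top-K request can be stated directly in SQL against the Employee relation used earlier (the limiting clause varies between systems, e.g. LIMIT or TOP):

SELECT emp_name, salary
FROM Employee
ORDER BY salary DESC
FETCH FIRST 10 ROWS ONLY;

The optimization methods below aim to produce such a result without computing and sorting the entire relation.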

Two such methods are:

 Using pipelined query evaluation plans for producing the results in sorted order.
 Estimating the highest value on the sorted attributes that will appear in the top K result and
introducing the selection predicates used for eliminating the larger values.

With the second method, some extra tuples may be generated beyond the top-K results; such tuples are
discarded. If too few tuples are generated to reach the top K results, the selection condition must be
relaxed and the query executed again.

2.4.2. Join Minimization

Different join operations are used for processing a given user query. When queries
are generated through views, computing the query may require joining more relations than are actually
needed. To resolve such cases, the unnecessary relations are dropped from the join. This type of
solution is known as Join Minimization. Only one such case is discussed here; there are many more
similar cases where join minimization can be applied.

Optimization of Updates

An update query is used to make changes to already persisted data. An update query often
involves subqueries in the SET as well as the WHERE clause, so while optimizing the update, these
subqueries must also be taken into account. For example, if a user wants to update the score to 97 for
the student in the student table whose roll_no is 102, the following update query is used:

update student set score = 97 where roll_no = 102

However, if the update involves a selection on the updated column, such updates need to be handled
carefully. If the update is performed while the selection is being evaluated by an index scan, an
updated tuple may be re-inserted into the index ahead of the scan and encountered again. Several
problems can also arise when a subquery in the update is affected by the update itself.
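A sketch of such a case, reusing the student table from the example above (note that some systems restrict referencing the updated table inside a subquery):

update student set score = score + 5
where score < (select avg(score) from student);

Here the WHERE clause selects on the very column being updated, which is exactly the situation that gives rise to the Halloween problem described next.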

Halloween Problem

The problem was named so because it was first identified on Halloween Day at IBM. An update that
affects the execution of the query associated with that update is known as the Halloween
problem. We can avoid this problem by breaking up the execution plan and executing the
following steps:

 Executing the queries that define the update first

 Creating a list of affected tuples
 At last, updating the tuples and indices.

Thus, following these steps increases the execution cost of the query evaluation plan.

We can optimize update plans by checking whether the Halloween problem can occur; if it cannot
occur, the update can be performed during the processing of the query, which reduces the update
overhead. For example, the Halloween problem cannot occur if the index attributes are not affected
by the updates. It also cannot occur if the updates decrease the attribute value while the index is
scanned in increasing order, since in that case the scan will not encounter the updated tuples again.
In such cases, the index can be updated even while the query is being executed, which reduces the
overall cost and leads to an optimized update.

Another method of optimizing update queries that result in a large number of updates is to collect
all the updates as a batch. After collecting them, these updates are applied separately to each
affected index. Before applying an update batch to an index, the batch is sorted in that index's
order. Such sorting of the batch greatly reduces the amount of random I/O needed to update the
indices.

Therefore, we can perform such optimization of updates in most of the database systems.

2.4.3. Multi Query Optimization and Shared Scans

We can understand multi-query optimization as follows: the user submits a batch of queries, and the
query optimizer exploits common subexpressions between the different queries in order to evaluate
them once and reuse the results wherever required. For complex queries, too, we can exploit common
subexpressions, which consequently reduces the cost of the query evaluation plan. So we need to
optimize the subexpressions shared by different queries. One such optimization is the elimination of
common subexpressions, known as Common subexpression elimination. This method optimizes the
subexpressions by computing and storing their results and reusing them whenever the subexpressions
occur again. Only a few databases perform this exploitation of common subexpressions among the
evaluation plans selected for each batch of queries.
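A small sketch of a shared subexpression, using the EMPLOYEE and DEPARTMENT tables shown later in Section 2.7: both queries below contain the join of EMPLOYEE and DEPARTMENT, and a multi-query optimizer may evaluate that join once and reuse the result.

SELECT D.DEPT_NAME, COUNT(*)
FROM EMPLOYEE E JOIN DEPARTMENT D ON E.EMP_DEPT = D.DEPT_NO
GROUP BY D.DEPT_NAME;

SELECT E.EMP_NAME, D.DEPT_NAME
FROM EMPLOYEE E JOIN DEPARTMENT D ON E.EMP_DEPT = D.DEPT_NO;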

In some database systems, another form of multi-query optimization is implemented, known as
sharing of relation scans between queries. The working of a shared scan can be understood from
the following steps:

 It does not read the relation repeatedly from disk.
 It reads the data from disk only once, rather than once for every query that needs to scan
the relation.
 Finally, it pipelines the data to each of the queries.

Such a method of shared-scan optimization is useful when multiple queries perform a scan on a
fact table or a single large relation.

2.4.4. Parametric Query Optimization

In the parametric query optimization method, query optimization is performed without specifying
the parameter values. The optimizer outputs several optimal plans for different parameter values;
a plan is output only if it is optimal for some possible parameter values. The optimizer then
stores this set of alternative plans, and later the cheapest plan is found and selected. Such
selection takes much less time than re-optimization. In this way, the optimizer handles the
parameters and leads to an optimized and cost-effective output.

2.5. Approaches to Query Optimization


Among the approaches for query optimization, exhaustive search and heuristics-based algorithms
are mostly used.

2.5.1. Exhaustive Search Optimization

In these techniques, all possible query plans for a query are initially generated and then the best
plan is selected. Though these techniques provide the best solution, they have exponential time and
space complexity owing to the large solution space. The dynamic programming technique is an
example.

2.5.2. Heuristic Based Optimization

Heuristic based optimization uses rule-based optimization approaches for query optimization.
These algorithms have polynomial time and space complexity, which is lower than the
exponential complexity of exhaustive search-based algorithms. However, these algorithms do not
necessarily produce the best query plan.

Some of the common heuristic rules are −

 Perform select and project operations before join operations. This is done by moving the
select and project operations down the query tree. This reduces the number of tuples
available for join.

 Perform the most restrictive select/project operations at first before the other operations.

 Avoid cross-product operations, since they result in very large intermediate tables.

2.6. Transformation Rules

The first step of the optimizer is to generate expressions that are logically equivalent to
the given expression. For this step, equivalence rules are used; they describe how to transform
the generated expression into a logically equivalent expression.

A query can be expressed in different ways, with different costs. To express a query efficiently,
we learn to create alternative, equivalent expressions of the given expression instead of working
only with the given expression. Two relational-algebra expressions are equivalent if both
expressions produce the same set of tuples on every legal database instance. A legal database
instance is one that satisfies all the integrity constraints specified in the database schema.
The order of the generated tuples may differ between the two expressions; they are considered
equivalent as long as they produce the same set of tuples.

Equivalence Rules

The equivalence rule says that expressions of two forms are the same or equivalent because both
expressions produce the same outputs on any legal database instance. It means that we can possibly
replace the expression of the first form with that of the second form and replace the expression of
the second form with an expression of the first form. Thus, the optimizer of the query-evaluation
plan uses such an equivalence rule or method for transforming expressions into the logically
equivalent one.

The optimizer uses various equivalence rules on relational-algebra expressions for transforming
the relational expressions. For describing each rule, we will use the following symbols:


θ, θ1, θ2 … : Used for denoting the predicates.

L1, L2, L3 … : Used for denoting the list of attributes.

E, E1, E2 …. : Represents the relational-algebra expressions.

Let's discuss a number of equivalence rules:

Rule 1: Cascade of σ

This rule states the deconstruction of the conjunctive selection operations into a sequence of
individual selections. Such a transformation is known as a cascade of σ.

σθ1 ᴧ θ 2 (E) = σθ1 (σθ2 (E))

Rule 2: Commutative Rule

a) This rule states that selections operations are commutative.

σθ1 (σθ2 (E)) = σ θ2 (σθ1 (E))

b) Theta Join (θ) is commutative.

E1 ⋈θ E2 = E2 ⋈θ E1 (θ appears as a subscript of the join symbol)

However, in the case of theta join, the equivalence rule does not work if the order of attributes is
considered. Natural join is a special case of Theta join, and natural join is also commutative.


Rule 3: Cascade of ∏

This rule states that in a sequence of projection operations only the outermost projection is needed,
and the other projections can be omitted. Such a transformation is referred to as a cascade of ∏.

∏L1 (∏L2 (. . . (∏Ln (E)) . . . )) = ∏L1 (E)

Rule 4: We can combine selections with Cartesian products as well as theta joins.

1. σθ (E1 x E2) = E1 ⋈θ E2


2. σθ1 (E1 ⋈ θ2 E2) = E1 ⋈ θ1ᴧθ2 E2

Rule 5: Associative Rule

a) This rule states that natural join operations are associative.

(E1 ⋈ E2) ⋈ E3 = E1 ⋈ (E2 ⋈ E3)

b) Theta joins are associative for the following expression:

(E1 ⋈ θ1 E2) ⋈ θ2ᴧθ3 E3 = E1 ⋈ θ1ᴧθ3 (E2 ⋈ θ2 E3)

In the theta associativity rule, θ2 involves attributes from E2 and E3 only. Any of these conditions
may be empty; hence it follows that the Cartesian product operation is also associative.

Rule 6: Distribution of the Selection operation over the Theta join.

Under two following conditions, the selection operation gets distributed over the theta-join
operation:

a) When all the attributes in the selection condition θ0 involve only the attributes of one of the
expressions (say E1) being joined.

σθ0 (E1 ⋈ θ E2) = (σθ0 (E1)) ⋈ θ E2

b) When the selection condition θ1 involves the attributes of E1 only, and θ2 includes the attributes
of E2 only.

σθ1ᴧθ2 (E1 ⋈ θ E2) = (σθ1 (E1)) ⋈ θ (σθ2 (E2))
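As a small illustration of case (a), assume that salary is an attribute of the Employee relation only; then a selection on salary can be pushed below a join with Department:

σ salary>10000 (Employee ⋈ Department) = (σ salary>10000 (Employee)) ⋈ Department

Pushing the selection below the join reduces the size of the join input, which is exactly the heuristic mentioned in Section 2.5.2.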

Rule 7: Distribution of the projection operation over the theta join.

Under the two following conditions, the projection operation gets distributed over the theta-join
operation:

a) Assume that the join condition θ involves only attributes in L1 υ L2 of E1 and E2. Then we get
the following expression:

∏L1υL2 (E1 ⋈ θ E2) = (∏L1 (E1)) ⋈ θ (∏L2 (E2))

b) Consider a join E1 ⋈θ E2, where the expressions E1 and E2 have sets of attributes L1 and L2.
Let L3 be the attributes of E1 that are involved in the join condition θ but are not in L1 υ L2, and
similarly let L4 be the attributes of E2 that are involved only in the join condition θ and not in
L1 υ L2. Then we get the following expression:

∏L1υL2 (E1 ⋈ θ E2) = ∏L1υL2 ((∏L1υL3 (E1)) ⋈ θ ((∏L2υL4 (E2)))

Rule 8: The union and intersection set operations are commutative.

E1 υ E2 = E2 υ E1

E1 ꓵ E2 = E2 ꓵ E1

However, set difference operations are not commutative.

Rule 9: The union and intersection set operations are associative.

(E1 υ E2) υ E3 = E1 υ (E2 υ E3)

(E1 ꓵ E2) ꓵ E3 = E1 ꓵ (E2 ꓵ E3)

Rule 10: Distribution of selection operation on the intersection, union, and set difference
operations.

The below expression shows the distribution performed over the set difference operation.

σp (E1 − E2) = σp(E1) − σp(E2)

We can similarly distribute the selection operation over υ and ꓵ by replacing − with υ or ꓵ. We also have:

σp (E1 − E2) = σp(E1) − E2

Rule 11: Distribution of the projection operation over the union operation.

This rule states that we can distribute the projection operation on the union operation for the given
expressions.

∏L (E1 υ E2) = (∏L (E1)) υ (∏L (E2))

2.7. Implementing relational Operators


2.7.1. Relational Algebra

Relational algebra is a procedural query language. It gives a step-by-step process to obtain the
result of a query. It uses operators to perform queries.

Types of Relational operation

Figure 7: Types of relational operation

1. Select Operation:

 The select operation selects tuples that satisfy a given predicate.


 It is denoted by sigma (σ).

1. Notation: σ p(r)

Where:

σ is used for the selection predicate,

r is used for the relation, and
p is used as a propositional logic formula which may use connectors like AND, OR, and NOT,
and relational operators like =, ≠, ≥, <, >, ≤.

For example: LOAN Relation

BRANCH_NAME LOAN_NO AMOUNT

Downtown L-17 1000

Redwood L-23 2000

Perryride L-15 1500

Downtown L-14 1500

Mianus L-13 500

Roundhill L-11 900

Perryride L-16 1300

Input:

1. σ BRANCH_NAME="Perryride" (LOAN)

Output:

BRANCH_NAME LOAN_NO AMOUNT

Perryride L-15 1500

Perryride L-16 1300
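The corresponding SQL query over the LOAN table is:

SELECT * FROM LOAN WHERE BRANCH_NAME = 'Perryride';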

2. Project Operation:

 This operation shows the list of those attributes that we wish to appear in the result. The rest
of the attributes are eliminated from the table.

 It is denoted by ∏.

Notation: ∏ A1, A2, …, An (r)

Where

A1, A2, …, An are attribute names of relation r.

Example: CUSTOMER RELATION

Table 13: Customer Relation

NAME STREET CITY

Jones Main Harrison

Smith North Rye

Hays Main Harrison

Curry North Rye

Johnson Alma Brooklyn

Brooks Senator Brooklyn

Input

1. ∏ NAME, CITY (CUSTOMER)

Output:

NAME CITY

Jones Harrison

Smith Rye

Hays Harrison

Curry Rye

Johnson Brooklyn

Brooks Brooklyn
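The corresponding SQL query is shown below; DISTINCT is used because, unlike a plain SQL SELECT, the relational-algebra projection removes duplicate rows:

SELECT DISTINCT NAME, CITY FROM CUSTOMER;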

3. Union Operation:

 Suppose there are two relations R and S. The union operation contains all the tuples that are
either in R or in S or in both R and S.
 It eliminates duplicate tuples. It is denoted by 𝖴.

1. Notation: R 𝖴 S

A union operation must satisfy the following conditions:

 R and S must have the same number of attributes.


 Duplicate tuples are eliminated automatically.
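In SQL this corresponds to the UNION operator; for example, using the DEPOSITOR and BORROW tables referred to in the following operations:

SELECT CUSTOMER_NAME FROM DEPOSITOR
UNION
SELECT CUSTOMER_NAME FROM BORROW;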

4. Set Intersection:

 Suppose there are two relations R and S. The set intersection operation contains all tuples that
are in both R and S.
 It is denoted by intersection ∩.

1. Notation: R ∩ S

Example: Using the above DEPOSITOR table and BORROW table

Input:

1. ∏ CUSTOMER_NAME (BORROW) ∩ ∏ CUSTOMER_NAME (DEPOSITOR)

Output: CUSTOMER_NAME

Smith

Jones

5. Set Difference:

 Suppose there are two relations R and S. The set difference operation contains all
tuples that are in R but not in S.
 It is denoted by minus (−).

1. Notation: R - S

Example: Using the above DEPOSITOR table and BORROW table

Input:

1. ∏ CUSTOMER_NAME (BORROW) - ∏ CUSTOMER_NAME (DEPOSITOR)

Output:

CUSTOMER_NAME

Jackson

Hayes

Willians

Curry

6. Cartesian product

 The Cartesian product is used to combine each row in one table with each row in the other
table. It is also known as a cross product.
 It is denoted by X.

Notation: E X D
Example:

EMPLOYEE

EMP_ID EMP_NAME EMP_DEPT

1 Smith A

2 Harry C

3 John B

DEPARTMENT

DEPT_NO DEPT_NAME

A Marketing

B Sales

C Legal

Input:

EMPLOYEE X DEPARTMENT

Output:

EMP_ID EMP_NAME EMP_DEPT DEPT_NO DEPT_NAME

1 Smith A A Marketing

1 Smith A B Sales

1 Smith A C Legal

2 Harry C A Marketing

2 Harry C B Sales

2 Harry C C Legal

3 John B A Marketing

3 John B B Sales

3 John B C Legal
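In SQL the same result can be obtained with a cross join (or simply by listing both tables in the FROM clause):

SELECT * FROM EMPLOYEE CROSS JOIN DEPARTMENT;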

7. Rename Operation:

The rename operation is used to rename the output relation. It is denoted by rho (ρ).

Example: We can use the rename operator to rename STUDENT relation to STUDENT1.

1. ρ(STUDENT1, STUDENT)
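The closest SQL analogue of the rename operator is a table alias (or a view); for example, assuming the STUDENT table exists:

SELECT * FROM STUDENT AS STUDENT1;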

2.8. Pipelining

Pipelining helps in improving the efficiency of query evaluation by reducing the number of
temporary files produced. We reduce the construction of temporary files by merging multiple
operations into a pipeline. The result of one currently executing operation is passed to the next
operation for its execution, and the chain continues until all operations are completed and we get
the final output of the expression. Such an evaluation process is known as Pipelined Evaluation.

Advantages of Pipeline

There are following advantages of creating a pipelining of operations:

 It reduces the cost of query evaluation by eliminating the cost of reading and writing
temporary relations, unlike the materialization process.
 If we combine the root operator of a query evaluation plan in a pipeline with its inputs,
generating query results becomes quick. This is beneficial for users, as they can view the
results of their queries as soon as the outputs are generated; otherwise, they would have to
wait a long time to see any query results.

2.8.1. Pipelining vs. Materialization

Although both methods are used for evaluating multiple operations of an expression, there are a few
differences between them, summarized below.

 Pipelining is a modern approach to evaluating multiple operations; materialization is a
traditional approach.
 Pipelining does not use any temporary relations for storing the results of the evaluated
operations; materialization uses temporary relations to store them, so it needs more temporary
files and I/O.
 Pipelining is a more efficient way of query evaluation, as it generates the results quickly;
materialization is less efficient, as it takes time to generate the query results.
 Pipelining requires memory buffers at a high rate for generating outputs, and insufficient
memory buffers will cause thrashing; materialization does not have such high memory-buffer
requirements for query evaluation.
 Pipelining performs poorly if thrashing occurs; no thrashing occurs in materialization, so in
such cases materialization gives better performance.
 Pipelining optimizes the cost of query evaluation, as it does not include the cost of reading
and writing temporary storage; in materialization, the overall cost includes the cost of the
operations plus the cost of reading and writing results to temporary storage.

Chapter 3
Database Integrity, Security and Recovery
3.1. Integrity
The term data integrity refers to the accuracy and consistency of data. When creating databases,
attention needs to be given to data integrity and how to maintain it. A good database will enforce
data integrity whenever possible.

For example, a user could accidentally try to enter a phone number into a date field. If the system
enforces data integrity, it will prevent the user from making these mistakes.

Maintaining data integrity means making sure the data remains intact and unchanged throughout
its entire life cycle. This includes the capture of the data, storage, updates, transfers, backups, etc.
Every time data is processed there’s a risk that it could get corrupted (whether accidentally or
maliciously).

Risks to Data Integrity


Some more examples of where data integrity is at risk:

 A user tries to enter a date outside an acceptable range.


 A user tries to enter a phone number in the wrong format.
 A bug in an application attempts to delete the wrong record.
 While transferring data between two databases, the developer accidentally tries to insert
the data into the wrong table.
 While transferring data between two databases, the network went down.
 A user tries to delete a record in a table, but another table is referencing that record as part
of a relationship.
 A user tries to update a primary key value when there’s already a foreign key in a related
table pointing to that value.
 A developer forgets that he’s on a production system and starts entering test data directly
into the database.
 A hacker manages to steal all user passwords from the database.
 A hacker hacks into the network and drops the database (i.e. deletes it and all its data).

 A fire sweeps through the building, burning the database computer to a cinder.
 The regular backups of the database have been failing for the past two months…

It’s not hard to think of many more scenarios where data integrity is at risk. Many of these risks
can be addressed from within the database itself (through the use of data types and constraints
against each column for example, encryption, etc), while others can be addressed through other
features of the DBMS (such as regular backups – and testing that the backups do actually restore
the database as expected).

Some of these require other (non-database related) factors to be present, such as an offsite backup
location, a properly functioning IT network, proper training, security policies, etc.

3.1.1. Types of Data Integrity

In the database world, data integrity is often placed into the following types:

 Entity integrity
 Referential integrity
 Domain integrity
 User-defined integrity

Entity Integrity
Entity integrity defines each row to be unique within its table. No two rows can be the same.

To achieve this, a primary key can be defined. The primary key field contains a unique identifier
– no two rows can contain the same unique identifier.

Referential Integrity
Referential integrity is concerned with relationships. When two or more tables have a
relationship, we have to ensure that the foreign key value matches the primary key value at all
times. We don’t want to have a situation where a foreign key value has no matching primary key
value in the primary table. This would result in an orphaned record.

So referential integrity will prevent users from:

 Adding records to a related table if there is no associated record in the primary table.
 Changing values in a primary table that result in orphaned records in a related table.
 Deleting records from a primary table if there are matching related records.

Domain Integrity
Domain integrity concerns the validity of entries for a given column. Selecting the appropriate
data type for a column is the first step in maintaining domain integrity. Other steps could include,
setting up appropriate constraints and rules to define the data format and/or restricting the range
of possible values.

User-Defined Integrity
User-defined integrity allows the user to apply business rules to the database that aren’t covered
by any of the other three data integrity types.

3.1.2. Integrity Constraints


 Integrity constraints are a set of rules. It is used to maintain the quality of information.
 Integrity constraints ensure that the data insertion, updating, and other processes have to
be performed in such a way that data integrity is not affected.
 Thus, integrity constraint is used to guard against accidental damage to the database.

Types of Constraints

Figure 8: Types of constraints

1. Domain constraints

 Domain constraints can be defined as the definition of a valid set of values for an attribute.
 The data type of domain includes string, character, integer, time, date, currency, etc. The
value of the attribute must be available in the corresponding domain.

Example:

Table 14: Domain constraints
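A minimal SQL sketch of domain constraints (the table and column names are hypothetical): the declared data type restricts the domain of each column, and a CHECK clause narrows it further.

CREATE TABLE ENROLLMENT (
  roll_no  INTEGER,
  semester INTEGER CHECK (semester BETWEEN 1 AND 8),
  grade    CHAR(2)
);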

2. Entity integrity constraints

 The entity integrity constraint states that a primary key value can't be null.
 This is because the primary key value is used to identify individual rows in a relation, and if
the primary key had a null value, we could not identify those rows.
 A table can contain null values in fields other than the primary key.

Example:

Table 15: Entity integrity constraints

3. Referential Integrity Constraints

 A referential integrity constraint is specified between two tables.


 In the Referential integrity constraints, if a foreign key in Table 1 refers to the Primary Key
of Table 2, then every value of the Foreign Key in Table 1 must be null or be available in
Table 2.

Example:

Table 16: Referential Integrity Constraints

4. Key constraints

 Keys are attributes used to uniquely identify an entity within its entity set.
 An entity set can have multiple keys, but one of them is chosen as the primary key. A
primary key value must be unique and cannot be null in the relational table.

Example:

Table 18: Key constraints
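A minimal SQL sketch of key constraints (the table is hypothetical): both roll_no and email are candidate keys; roll_no is chosen as the primary key, while email is declared unique.

CREATE TABLE STUDENT_ACCOUNT (
  roll_no INTEGER PRIMARY KEY,
  email   VARCHAR(50) UNIQUE,
  name    VARCHAR(30)
);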

3.2. Security

Database security refers to the range of tools, controls, and measures designed to establish and
preserve database confidentiality, integrity, and availability. This section focuses primarily on
confidentiality, since it is the element that is compromised in most data breaches.

Database security must address and protect the following:

 The data in the database


 The database management system (DBMS)
 Any associated applications
 The physical database server and/or the virtual database server and the underlying
hardware
 The computing and/or network infrastructure used to access the database

Database security is a complex and challenging endeavor that involves all aspects of information
security technologies and practices. It’s also naturally at odds with database usability. The more
accessible and usable the database, the more vulnerable it is to security threats; the more
invulnerable the database is to threats, the more difficult it is to access and use.

Why is it important?

By definition, a data breach is a failure to maintain the confidentiality of data in a database. How
much harm a data breach inflicts on your enterprise depends on a number of consequences or
factors:

 Compromised intellectual property: Your intellectual property—trade secrets,


inventions, proprietary practices—may be critical to your ability to maintain a
competitive advantage in your market. If that intellectual property is stolen or exposed,
your competitive advantage may be difficult or impossible to maintain or recover.
 Damage to brand reputation: Customers or partners may be unwilling to buy your
products or services (or do business with your company) if they don’t feel they can trust
you to protect your data or theirs.

 Business continuity (or lack thereof): Some business cannot continue to operate until
a breach is resolved.
 Fines or penalties for non-compliance: The financial impact of failing to comply with
global regulations such as the Sarbanes-Oxley Act (SOX) or the Payment Card Industry
Data Security Standard (PCI DSS), industry-specific data privacy regulations such as
HIPAA, or regional data privacy regulations such as Europe's General Data Protection
Regulation (GDPR) can be devastating, with fines in the worst cases exceeding several
million dollars per violation.

 Costs of repairing breaches and notifying customers: In addition to the cost of


communicating a breach to customers, a breached organization must pay for forensic and
investigative activities, crisis management, triage, repair of the affected systems, and
more.
 Database security includes a variety of measures used to secure database management
systems from malicious cyber-attacks and illegitimate use. Database security programs are
designed to protect not only the data within the database, but also the data management
system itself, and every application that accesses it, from misuse, damage, and intrusion.

 Database security encompasses tools, processes, and methodologies which establish


security inside a database environment.

3.3. Database threats


Data security is an imperative aspect of any database system. It is of particular importance in
distributed systems because of large number of users, fragmented and replicated data, multiple
sites and distributed control.

Database security begins with physical security for the systems that host the database
management system (DBMS). Database Management system is not safe from intrusion,
corruption, or destruction by people who have physical access to the computers. Once physical
security has been established, the database must be protected from unauthorized access by authorized
users as well as by unauthorized users. There are three main objectives when designing a secure
database system, and anything that prevents a database management system from achieving these
goals is considered a threat to database security.

3.3.1. Threats in a Database

 Availability loss − Availability loss refers to the non-availability of database objects to

legitimate users.

 Integrity loss − Integrity loss occurs when unacceptable operations are performed upon
the database either accidentally or maliciously. This may happen while creating, inserting,
updating or deleting data. It results in corrupted data leading to incorrect decisions.

 Confidentiality loss − Confidentiality loss occurs due to unauthorized or unintentional


disclosure of confidential information. It may result in illegal actions, security threats and
loss in public confidence.

 Secrecy: Data should not be disclosed to unauthorized users. For example, a student
should not be allowed to see or change other students' grades.
 Denial of service attack: This attack makes a database server much slower or even
unavailable to users altogether. A DoS attack does not result in the disclosure or loss of
database information, but it can cost the victims much time and money.
 Sniff attack: To accommodate e-commerce and take advantage of distributed systems,
databases are designed in a client-server mode. Attackers can use sniffer software to monitor
data streams and acquire confidential information, for example the credit card
number of a customer.
 Spoofing attack: Attackers forge a legal web application to access the database, and then
retrieve data from it and use it for fraudulent transactions. The most common spoofing
attacks are TCP spoofing, used to forge IP addresses, and DNS spoofing, used to forge the
mapping between an IP address and a DNS name.
 Trojan Horse: It is a malicious program that embeds itself into the system. It can modify the
database and reside in the operating system.

3.3.2. Measures of Control
The measures of control can be broadly divided into the following categories −
 Access Control − Access control includes security mechanisms in a database management
system to protect against unauthorized access. A user can gain access to the database after
clearing the login process through only valid user accounts. Each user account is password
protected.

 Flow Control − Distributed systems encompass a lot of data flow from one site to another
and also within a site. Flow control prevents data from being transferred in such a way that
it can be accessed by unauthorized agents. A flow policy lists out the channels through
which information can flow. It also defines security classes for data as well as transactions.

 Data Encryption − Data encryption refers to coding data when sensitive data is to be
communicated over public channels. Even if an unauthorized agent gains access of the
data, he cannot understand it since it is in an incomprehensible format.

 RAID: Redundant Array of Independent Disks which protect against data loss due to disk
failure.
 Authentication: Access to the database is a matter of authentication. It provides the
guidelines how the database is accessed. Every access should be monitored.
 Backup: At every instant, backup should be done. In case of any disaster, Organizations
can retrieve their data.

3.4. Identification and Authentication


Identification and authentication is the process of determining the name of a user and verifying
that users are who they say they are. You can use database Access Control Lists (ACLs) to control
access to individual databases on the server. For each database on the server, you can set the ACL
to allow:

 Anonymous access
 Basic password authentication

The settings in the database ACLs work together with the "Maximum Internet name & password"
setting for each database to control the level of access that web browser users have to a database
on the Sametime server.

Using database ACLs


The database ACL defines user access to the content of the database. Before you set up basic
password authentication or anonymous access to a database, you should be familiar with how to
add users to a database ACL and the available settings within the ACL. For more information, see:

 Adding a name to a database ACL


 Database ACL settings

Maximum Internet name & password setting


The "Maximum Internet name & password" setting on the Advanced panel of each database ACL
specifies the maximum level of access to the database that is allowed for web browser clients. This
setting overrides individual levels set in the ACL.

Generally, administrators should not need to change the "Maximum Internet name & password"
settings for databases on the Sametime server. The default settings should function adequately in
most cases.

3.5. Categories of Control


Database control refers to the task of enforcing regulations so as to provide correct data to
authentic users and applications of a database. In order that correct data is available to users, all
data should conform to the integrity constraints defined in the database. Besides, data should be
screened away from unauthorized users so as to maintain security and privacy of the database.
Database control is one of the primary tasks of the database administrator (DBA).

The three dimensions of database control are −

 Authentication
 Access rights
 Integrity constraints

¤ Authentication
In a distributed database system, authentication is the process through which only legitimate users
can gain access to the data resources.

Authentication can be enforced in two levels −

 Controlling Access to Client Computer − At this level, user access is restricted while
login to the client computer that provides user-interface to the database server. The most
common method is a username/password combination. However, more sophisticated
methods like biometric authentication may be used for high security data.

 Controlling Access to the Database Software − At this level, the database


software/administrator assigns some credentials to the user. The user gains access to the
database using these credentials. One of the methods is to create a login account within
the database server.

¤ Access Rights
A user’s access rights refers to the privileges that the user is given regarding DBMS operations
such as the rights to create a table, drop a table, add/delete/update tuples in a table or query upon
the table.

In distributed environments, since there is a large number of tables and an even larger number of users,
it is not feasible to assign individual access rights to users. So, the DDBMS defines certain roles. A
role is a construct with certain privileges within a database system. Once the different roles are
defined, the individual users are assigned one of these roles. Often a hierarchy of roles is defined
according to the organization's hierarchy of authority and responsibility.

For example, the following SQL statements create a role "Accountant" and then assigns this role
to user "ABC".

CREATE ROLE ACCOUNTANT;


GRANT SELECT, INSERT, UPDATE ON EMP_SAL TO ACCOUNTANT;
GRANT INSERT, UPDATE, DELETE ON TENDER TO ACCOUNTANT;
GRANT INSERT, SELECT ON EXPENSE TO ACCOUNTANT;
COMMIT;

GRANT ACCOUNTANT TO ABC;
COMMIT;

¤ Semantic Integrity Control


Semantic integrity control defines and enforces the integrity constraints of the database system.

The integrity constraints are as follows −

 Data type integrity constraint

 Entity integrity constraint

 Referential integrity constraint

¤ Data Type Integrity Constraint


A data type constraint restricts the range of values and the type of operations that can be applied
to the field with the specified data type.

For example, let us consider that a table "HOSTEL" has three fields - the hostel number, hostel
name and capacity. The hostel number should start with capital letter "H" and cannot be NULL,
and the capacity should not be more than 150. The following SQL command can be used for data
definition −

CREATE TABLE HOSTEL (


H_NO VARCHAR2(5) NOT NULL,
H_NAME VARCHAR2(15),
CAPACITY INTEGER,
CHECK ( H_NO LIKE 'H%'),
CHECK ( CAPACITY <= 150)
);
¤ Entity Integrity Control
Entity integrity control enforces the rules so that each tuple can be uniquely identified from other
tuples. For this a primary key is defined. A primary key is a set of minimal fields that can uniquely

identify a tuple. Entity integrity constraint states that no two tuples in a table can have identical
values for primary keys and that no field which is a part of the primary key can have NULL value.

For example, in the above hostel table, the hostel number can be assigned as the primary key
through the following SQL statement (ignoring the checks) −

CREATE TABLE HOSTEL (


H_NO VARCHAR2(5) PRIMARY KEY,
H_NAME VARCHAR2(15),
CAPACITY INTEGER
);
¤ Referential Integrity Constraint
Referential integrity constraint lays down the rules of foreign keys. A foreign key is a field in a
data table that is the primary key of a related table. The referential integrity constraint lays down
the rule that the value of the foreign key field should either be among the values of the primary
key of the referenced table or be entirely NULL.

For example, let us consider a student table where a student may opt to live in a hostel. To include
this, the primary key of hostel table should be included as a foreign key in the student table. The
following SQL statement incorporates this −

CREATE TABLE STUDENT (


S_ROLL INTEGER PRIMARY KEY,
S_NAME VARCHAR2(25) NOT NULL,
S_COURSE VARCHAR2(10),
S_HOSTEL VARCHAR2(5) REFERENCES HOSTEL
);

3.6. Data Encryption

Encryption helps us secure data that we send, receive, and store. That data can include text messages
saved on our cell phone, logs stored on our fitness watch, and banking details sent through our
online account.

Encryption scrambles readable text so that only the individual who has the secret access code,
or decryption key, can read it, helping to provide data security for sensitive information.

A large volume of personal information is handled electronically and maintained in the cloud or
on servers connected to the web on an ongoing basis. It is almost impossible to carry on any
business without some of our personal data ending up in an organization's networked systems,
which is why it is crucial to know how to help keep that information private.

¤ How does it work?

Encryption is the procedure of taking ordinary text, such as a text message or email, and scrambling
it into an unreadable format known as "cipher text." It helps protect digital information that is
either stored on computer systems or transmitted through a network such as the internet.

The cipher text is converted back to its original form when the intended recipient accesses the
message; this is known as decryption. A secret encryption key, used by algorithms to scramble and
unscramble the information back into a readable form, must be available to both the sender and the
receiver.

Figure 9: Data encryption process

3.6.1. Symmetric and Asymmetric Encryption
An encryption key is the sequence of numbers used to encrypt and decrypt data. Encryption keys are
constructed with algorithms; each key is random and unique.

Symmetric encryption and asymmetric encryption are the two main kinds of encryption schemes. Here
is how they differ.

 Symmetric encryption encrypts and decrypts information using a single key or password (a
small database example follows below).

 Asymmetric encryption uses two keys for encryption and decryption: a public key, which
is shared among multiple users, encrypts the data, and a private key, which is not shared,
decrypts it.
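As a small sketch of symmetric encryption inside a database, assuming Microsoft SQL Server's ENCRYPTBYPASSPHRASE and DECRYPTBYPASSPHRASE functions (the passphrase and text are placeholders):

DECLARE @cipher VARBINARY(200) = ENCRYPTBYPASSPHRASE('MySecretPassphrase', 'Sensitive text');
SELECT @cipher AS cipher_text,
       CAST(DECRYPTBYPASSPHRASE('MySecretPassphrase', @cipher) AS VARCHAR(100)) AS plain_text;

The same passphrase is used for both operations, which is the single-key property of symmetric encryption.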

¤ Types of Encryption

There are various types of encryption, and every encryption type is created as per the needs of the
professionals and keeping the security specifications in mind. The most common encryption types
are as follows.

¤ Data Encryption Standard (DES)

The Data Encryption Standard is an example of low-level encryption. The U.S. government
established the standard in 1977. Due to advances in technology and decreases in hardware costs,
DES is essentially obsolete for protecting confidential data.

¤ Triple DES

Triple DES runs the DES encryption three times: it first encrypts the data, then decrypts it, and
then encrypts it again. It strengthens the original DES standard, which came to be regarded as too
weak a form of encryption for sensitive data.

¤ RSA

RSA takes its name from the initials of the three computer scientists who created it. It uses a
strong and popular algorithm for encryption. Because of its key length, RSA is popular and widely
used for secure data transmission.

¤ Advanced Encryption Standard (AES)

The Advanced Encryption Standard has been the U.S. government standard since 2002. AES is
used worldwide.

¤ Two-Fish

Twofish is considered one of the fastest encryption algorithms and is free for anyone
to use. It can be used in both hardware and software.

¤ Using encryption via SSL

Most legitimate sites use what is known as "secure sockets layer" (SSL), which is a procedure for
encrypting data when it is sent to and from a website. It prevents attackers from accessing the
data while it is in transit.

To confirm that an online transaction is encrypted, look for the padlock icon in the URL bar
and the "s" in "https".

Accessing sites using SSL is a good idea if:

 We store confidential information or submit it online. Checking that sites use SSL is a
good idea whenever we use the internet to perform tasks such as making
transactions, filing our taxes, renewing our driver's license, or doing other personal
business.
 Our job requires it. Our workplace may have encryption protocols, or it may be subject to
regulations that require encryption. In these cases, encryption is a must.

¤ Why encryption matters?

There are following reasons to use the encryption in our day-to-day life. That are:

1. Internet privacy concerns are real

Encryption helps protect our privacy online by translating sensitive information into "for your
eyes only" messages intended only for the parties who need them, and no one else. We should
make sure our emails are sent over an encrypted network or are themselves encrypted. Most email
clients offer an encryption option in their settings menu, and if we check our email with a web
browser, we should take a moment to ensure that SSL encryption is available.

2. Hacking is big business

Cybercrime is a global business, often run by multinational outfits. Many of the large-scale
data thefts we read about in the news demonstrate that cybercriminals really do steal personal
information for financial gain.

3. Regulations demand it

The Health Insurance Portability and Accountability Act (HIPAA) requires healthcare providers
to implement security features that help protect patients' confidential health information online.

Retailers must comply with the Fair Credit Practices Act (FCPA) and related regulations that help
protect consumers. Encryption helps companies stay compliant with regulatory guidelines and
requirements. It also helps protect their clients' valuable data.

¤ How ransomware uses encryption to commit cybercrimes?

Encryption is intended to secure our data, but it can also be used against us. Targeted
ransomware, for example, is a cybercrime that can impact organizations of all sizes, including
government agencies. Ransomware can also attack individual computer users.

¤ How do attacks involving ransomware occur?

Attackers deploy ransomware in an attempt to encrypt various devices, including computers and
servers. The attackers demand a ransom before they will provide a key to decrypt the encrypted
data. Ransomware attacks on government departments can shut down services, making it
impossible, for example, to obtain a permit, obtain a marriage license, or pay a tax bill.

Targeted attacks mostly target large organizations, but we can also experience ransomware attacks.

Some ways to keep ourselves safe from such attacks:

 Install and use trusted security software on all of our devices, including our cell phone.
 Keep our security software up to date. It will help protect our devices against
cyberattacks.
 Update our operating system and other software. This patches security
vulnerabilities.
 Avoid reflexively opening email attachments, because email is one of the main methods
for delivering ransomware.
 Be wary of any email attachment that advises us to enable macros to view its content.
If macros are enabled, macro malware can infect multiple files.
 Back up our data to an external hard drive. If we are the victim of a ransomware attack,
we will likely be able to restore our files once the malware has been cleaned up.

Consider making use of cloud services. They can help limit the damage of a ransomware infection, since several cloud providers retain previous versions of files, enabling us to 'roll back' to an unencrypted copy. And don't pay any ransom: we might pay in the hope of getting our files back, but there is no assurance that cybercriminals will actually release our data.

To help protect our confidential personal details, encryption is important. But it can be used against
us in the event of ransomware attacks. Taking steps to help us reap the benefits and prevent the
damage is wise.

¤ How is encrypted data deciphered?

With the support of a key, an algorithm, a decoder or something similar, the intended recipient of
the encrypted data will decrypt it. If the data and the encryption process are in the digital domain,
the intended user may use the necessary decryption tool to access the information they need.

For decryption purposes, the item used can be referred to as the key, cipher or algorithm. We will
find specific details about each of them below.

Cipher: The word cipher refers to an algorithm primarily used for the purposes of encryption. A
cipher consists of a series of successive steps at the end of which it decrypts the encrypted
information. Two major types of ciphers exist: stream ciphers and block ciphers.

Algorithm: Algorithms are the step-by-step procedures that encryption and decryption processes follow. Various algorithms are used to encrypt and decrypt files and data; examples include Blowfish, Triple DES, and RSA. In addition to algorithms and ciphers, brute force can be used to decode an encrypted text.
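As a concrete (if simplified) illustration of a key, a cipher, and the encrypt/decrypt cycle, the short Python sketch below uses a symmetric cipher. It assumes the third-party cryptography package is installed; the message text and variable names are purely illustrative.

# A minimal sketch of symmetric encryption and decryption, assuming the
# third-party "cryptography" package is installed (pip install cryptography).
from cryptography.fernet import Fernet

# The key must be shared secretly between sender and receiver.
key = Fernet.generate_key()
cipher = Fernet(key)

plaintext = b"Card number: 4111-1111-1111-1111"   # illustrative sensitive data
token = cipher.encrypt(plaintext)                 # unreadable ciphertext
print(token)

# Only a holder of the same key can turn the ciphertext back into plaintext.
recovered = Fernet(key).decrypt(token)
assert recovered == plaintext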

Chapter Four

Distributed Database
4.1. Distributed Database overview

¤ Distributed Database: A single logical database that is spread physically across computers in multiple locations that are connected by a data communications link.
¤ Distributed database is a collection of connected sites:

 each site is a DB in its own right


 has its own DBMS and its own users
 operations can be performed locally as if the DB was not distributed
 the sites collaborate (transparently from the user’s point of view)
 the union of all DBs = the DB of the whole organisation (institution)
¤ Schema: A structure that contains descriptions of the database objects. (Tables, views,
constraints…..)
• Local schema describes database objects on its own site only.
• Global schema describes database objects on all network nodes.
¤ Distributed database management system (DDBMS)
 A software system that permits the management of the distributed database and makes the
distribution transparent to the users.

Table 19: Advantages and Disadvantages of Distributed Databases

DDBMS Advantages:

 Reflects organizational structure (data are located near the "greatest demand" site)
 Faster data access
 Improved performance (data is located near the site of greatest demand)
 Modularity — systems can be modified, added and removed from the distributed database without affecting other modules
 Improved communications
 Economics - it costs less to create a network of smaller computers with the power of a single large computer
 Improved availability — a fault in one database system will only affect one fragment, instead of the entire database
 Processor independence

DDBMS Disadvantages:

 Complexity of management and control: extra work must be done to ensure the transparency of the system and to maintain multiple disparate systems, instead of one big one
 Increased complexity and a more extensive infrastructure means extra labour costs
 Security: remote database fragments must be secured, and the infrastructure must also be secured (eg: by encrypting the network links between remote sites)
 Lack of standards
 Increased storage requirements
 Difficult to maintain integrity: enforcing integrity over a network may require too much networking resources to be feasible

4.2. Components of Distributed DBMS and types


4.2.1. Types of DDBS:

a) Homogeneous DDBMSs

 Integrate only one type of centralized DBMS over a network


 Data is distributed across all the nodes.
 Same DBMS at each node.
 All data is managed by the distributed DBMS (no exclusively local data.)
 All access is through one global schema.
 The global schema is the union of all the local schemas.
b) Heterogeneous DDBMSs

 Integrate different types of centralized DBMSs over a network


 Data distributed across all the nodes.
 Local access is done using the local DBMS and schema.
 Remote access is done using the global schema.

¤ Fully heterogeneous DDBMS: Support different DBMSs that may even support different data
models (relational, hierarchical, or network) running under different computer systems, such
as mainframes and microcomputers

4.2.2. DDB Components

DDBS must include (at least) the following components:

 Computer workstations
 Network hardware and software
 Communications media
 Transaction processor (or, application processor, or transaction manager): Software
component found in each computer that requests data
 Data processor or data manager: Software component residing on each computer that stores
and retrieves data located at the site, may be a centralized DBMS

Figure 10: Communication network

4.2.3. DDBMS Functions:

DDBMS must perform all the functions of a centralized DBMS, and must handle all necessary
functions imposed by the distribution of data and processing

 Must perform these additional functions transparently to the end user


 Provide the user interface needed for location transparency
 Locate the data - directs queries to proper site(s)
 Process queries - local, remote, compound (global)

 Provide network-wide concurrency control and recovery procedures
 Provide data translation in heterogeneous systems

4.3. Distributed Database Design

DDBMS Design Strategies:

 Data fragmentation: How to partition the database into fragments


 Data replication: Which fragments to replicate
 Data allocation: Where to locate those fragments and replicas

4.3.1. Data Fragmentation

 Breaks single object into two or more segments or fragments


 Each fragment can be stored at any site over a computer network
Information about data fragmentation is stored in the distributed data catalog (DDC), from
which it is accessed by the TP to process user requests

Data Fragmentation Strategies

 Horizontal fragmentation: Division of a relation into subsets (fragments) of tuples


(rows):

FRAGMENT Emp INTO
Lo_Emp AT SITE ‘London’ WHERE Dept_id = ‘Sales’
Le_Emp AT SITE ‘Leeds’ WHERE Dept_id = ‘Dev’;

 Vertical fragmentation: Division of a relation into attributes (column) subsets


 Mixed fragmentation: Combination of horizontal and vertical strategies

 Vertically fragmented table contents: one attribute subset of the relation is stored at Site 1 and the remaining attributes at Site 2, with the primary key repeated in both fragments so that the tuples can be rejoined.
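A rough illustration of the three strategies is sketched below. The tiny Emp relation, its rows, and the site assignments are invented for the example; a real DDBMS would record the split in the distributed data catalog rather than in program code.

# Illustrative only: a tiny Emp "relation" as a list of dictionaries.
emp = [
    {"Emp_id": 1, "Name": "Abebe",  "Dept_id": "Sales", "Salary": 5000},
    {"Emp_id": 2, "Name": "Hana",   "Dept_id": "Dev",   "Salary": 7000},
    {"Emp_id": 3, "Name": "Kebede", "Dept_id": "Sales", "Salary": 6000},
]

# Horizontal fragmentation: subsets of rows, here selected by Dept_id.
lo_emp = [t for t in emp if t["Dept_id"] == "Sales"]   # stored at 'London'
le_emp = [t for t in emp if t["Dept_id"] == "Dev"]     # stored at 'Leeds'

# Vertical fragmentation: subsets of columns; the key is repeated in every
# fragment so the original rows can be rejoined.
site1 = [{"Emp_id": t["Emp_id"], "Name": t["Name"]} for t in emp]
site2 = [{"Emp_id": t["Emp_id"], "Salary": t["Salary"]} for t in emp]

# Mixed fragmentation: a vertical split applied to a horizontal fragment.
mixed = [{"Emp_id": t["Emp_id"], "Salary": t["Salary"]} for t in lo_emp]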

4.4. Data Replication

 Storage of data copies at multiple sites served by a computer network


 Fragment copies can be stored at several sites to serve specific information requirements
 Can enhance data availability and response time
 Can help to reduce communication and total query costs
 Updating distributed copies:
 Primary copy scheme:
 one copy is designated primary copy (unique), or
 primary copies can exist at different sites (distributed)
 An update is logically complete once the primary copy has been updated
 the site holding the primary copy has to propagate the updates to the other sites; this has to be done before COMMIT (to preserve the ACID properties)
 in some DDBMSs, update propagation is only guaranteed to happen at some future time
 Synchronous Replication:
 All copies of a modified relation (fragment) must be updated before the modifying
transaction commits.
 Data distribution is made transparent to users.
 Asynchronous Replication:
 Copies of a modified relation are only periodically updated; different copies may get out
of synch in the meantime.
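The difference between the two schemes can be sketched roughly as follows. The classes and method names are hypothetical, and the sketch ignores locking, logging, and the commit protocol itself; it only shows when the replicas are refreshed relative to the primary copy.

# Hypothetical sketch: primary-copy replication, synchronous vs asynchronous.
class Replica:
    def __init__(self, site):
        self.site, self.data = site, {}

class PrimaryCopy:
    def __init__(self, replicas):
        self.data, self.replicas, self.pending = {}, replicas, []

    def update_sync(self, key, value):
        # Synchronous: every copy is refreshed before the update "commits".
        self.data[key] = value
        for r in self.replicas:
            r.data[key] = value
        return "COMMIT"

    def update_async(self, key, value):
        # Asynchronous: the primary commits first; the copies catch up later,
        # so they may be temporarily out of synch with the primary.
        self.data[key] = value
        self.pending.append((key, value))
        return "COMMIT"

    def propagate(self):
        # Periodic propagation of queued updates to every replica.
        for key, value in self.pending:
            for r in self.replicas:
                r.data[key] = value
        self.pending.clear()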

Figure 11: Data replication

Replication Scenarios:

 Fully replicated database:


o Stores multiple copies of each database fragment at multiple sites
o Can be impractical due to amount of overhead
 Partially replicated database:
o Stores multiple copies of some database fragments at multiple sites
o Most DDBMSs are able to handle the partially replicated database well
 Unreplicated database:
o Stores each database fragment at a single site
o No duplicate database fragments

4.5. Data Allocation


Deciding where to locate data. Allocation strategies:

 Centralized data allocation - Entire database is stored at one site

 Partitioned data allocation - Database is divided into several disjointed parts (fragments)
and stored at several sites
 Replicated data allocation - Copies of one or more database fragments are stored at
several sites
 Data distribution over a computer network is achieved through data partition, data
replication, or a combination of both

4.6. Query Processing and Optimization in Distributed Databases


4.6.1. Distributed Query Processing

A distributed database query is processed in stages as follows:

Query Mapping. The input query on distributed data is specified formally using a query
language. It is then translated into an algebraic query on global relations. This translation is done
by referring to the global conceptual schema and does not take into account the actual distribution
and replication of data. Hence, this translation is largely identical to the one performed in a
centralized DBMS. It is first normalized, analyzed for semantic errors, simplified, and finally
restructured into an algebraic query.

Localization. In a distributed database, fragmentation results in relations being stored in


separate sites, with some fragments possibly being replicated. This stage maps the distributed
query on the global schema to separate queries on individual fragments using data distribution and
replication information.

Global Query Optimization. Optimization consists of selecting a strategy from a list of


candidates that is closest to optimal. A list of candidate queries can be obtained by permuting the
ordering of operations within a fragment query generated by the previous stage. Time is the
preferred unit for measuring cost. The total cost is a weighted combination of costs such as CPU
cost, I/O costs, and communication costs. Since DDBs are connected by a network, often the
communication costs over the network are the most significant. This is especially true when the
sites are connected through a wide area network (WAN).

Local Query Optimization. This stage is common to all sites in the DDB. The techniques
are similar to those used in centralized systems.

The first three stages discussed above are performed at a central control site, while the last stage
is performed locally.

4.6.2. Data Transfer Costs of Distributed Query Processing

We discussed the issues involved in processing and optimizing a query in a centralized DBMS in Chapter Two. In a distributed system, several additional factors further complicate query processing. The first is the cost of transferring data over the network. This data includes intermediate files
that are transferred to other sites for further processing, as well as the final result files that may
have to be transferred to the site where the query result is needed. Although these costs may not
be very high if the sites are connected via a high-performance local area network, they become
quite significant in other types of networks. Hence, DDBMS query optimization algorithms
consider the goal of reducing the amount of data transfer as an optimization criterion in choosing
a distributed query execution strategy.

We illustrate this with two simple sample queries. Suppose that the EMPLOYEE and DEPARTMENT relations are distributed at two sites, with EMPLOYEE (10,000 records, each 100 bytes long) stored at site 1 and DEPARTMENT (100 records, each 35 bytes long) stored at site 2. We will assume in this example that neither relation is fragmented, so the size of the EMPLOYEE relation is 100 * 10,000 = 1,000,000 bytes and the size of the DEPARTMENT relation is 35 * 100 = 3,500 bytes. Consider the query Q:

For each employee, retrieve the employee name and the name of the department for which the employee works. This can be stated as follows in the relational algebra:

Q: πFname, Lname, Dname(EMPLOYEE ⋈Dno=Dnumber DEPARTMENT)
The result of this query will include 10,000 records, assuming that every employee is related to a
department. Suppose that each record in the query result is 40 bytes long.

The query is submitted at a distinct site 3, which is called the result site because the query result
is needed there. Neither the EMPLOYEE nor the DEPARTMENT relations reside at site 3. There
are three simple strategies for executing this distributed query:

1. Transfer both the EMPLOYEE and the DEPARTMENT relations to the result site, and
perform the join at site 3. In this case, a total of 1,000,000 + 3,500 = 1,003,500 bytes must
be transferred.
2. Transfer the EMPLOYEE relation to site 2, execute the join at site 2, and send the result
to site 3. The size of the query result is 40 * 10,000 = 400,000 bytes, so 400,000 +
1,000,000 = 1,400,000 bytes must be transferred.
3. Transfer the DEPARTMENT relation to site 1, execute the join at site 1, and send the
result to site 3. In this case, 400,000 + 3,500 = 403,500 bytes must be transferred.

If minimizing the amount of data transfer is our optimization criterion, we should choose strategy 3. Now consider another query Q′: For each department, retrieve the department name and the name of the department manager. This can be stated as follows in the relational algebra:

Q′: πDname, Fname, Lname(DEPARTMENT ⋈Mgr_ssn=Ssn EMPLOYEE)
Again, suppose that the query is submitted at site 3. The same three strategies for executing query Q apply to Q′, except that the result of Q′ includes only 100 records, assuming that each department has a manager:

1. Transfer both the EMPLOYEE and the DEPARTMENT relations to the result site, and
perform the join at site 3. In this case, a total of 1,000,000 + 3,500 = 1,003,500 bytes must
be transferred.
2. Transfer the EMPLOYEE relation to site 2, execute the join at site 2, and send the result
to site 3. The size of the query result is 40 * 100 = 4,000 bytes, so 4,000 + 1,000,000 =
1,004,000 bytes must be transferred.
3. Transfer the DEPARTMENT relation to site 1, execute the join at site 1, and send the
result to site 3. In this case, 4,000 + 3,500 = 7,500 bytes must be transferred.

Again, we would choose strategy 3—this time by an overwhelming margin over strategies 1 and
2. The preceding three strategies are the most obvious ones for the case where the result site (site
3) is different from all the sites that contain files involved in the query (sites 1 and 2). However,
suppose that the result site is site 2; then we have two simple strategies:

1. Transfer the EMPLOYEE relation to site 2, execute the query, and present the result to the user at site 2. Here, the same number of bytes—1,000,000—must be transferred for both Q and Q′.
2. Transfer the DEPARTMENT relation to site 1, execute the query at site 1, and send the result back to site 2. In this case 400,000 + 3,500 = 403,500 bytes must be transferred for Q and 4,000 + 3,500 = 7,500 bytes for Q′.
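The byte counts in these strategies follow directly from the record and result sizes. As a quick check, the short Python sketch below reproduces the arithmetic for Q and Q′ when the result is needed at site 3; it only mirrors the numbers in the text and is not a real DDBMS cost model.

# Reproducing the transfer-cost arithmetic for Q and Q' (result needed at site 3).
EMP_SIZE = 100 * 10_000        # 1,000,000 bytes: EMPLOYEE stored at site 1
DEPT_SIZE = 35 * 100           # 3,500 bytes: DEPARTMENT stored at site 2
RESULT_Q = 40 * 10_000         # 400,000 bytes: one 40-byte row per employee
RESULT_Q_PRIME = 40 * 100      # 4,000 bytes: one 40-byte row per department

for name, result in (("Q", RESULT_Q), ("Q'", RESULT_Q_PRIME)):
    strategies = {
        "1: ship EMPLOYEE and DEPARTMENT to site 3": EMP_SIZE + DEPT_SIZE,
        "2: ship EMPLOYEE to site 2, result to site 3": EMP_SIZE + result,
        "3: ship DEPARTMENT to site 1, result to site 3": DEPT_SIZE + result,
    }
    best = min(strategies, key=strategies.get)
    print(name, strategies, "-> choose strategy", best[0])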

A more complex strategy, which sometimes works better than these simple strategies, uses an
operation called semijoin. We introduce this operation and discuss distributed execution using
semijoins next.

4.6.3. Distributed Query Processing Using Semijoin

The idea behind distributed query processing using the semijoin operation is to reduce the number
of tuples in a relation before transferring it to another site. Intuitively, the idea is to send the joining
column of one relation R to the site where the other relation S is located; this column is then joined
with S. Following that, the join attributes, along with the attributes required in the result, are

projected out and shipped back to the original site and joined with R. Hence, only the joining
column of R is transferred in one direction, and a subset of S with no extraneous tuples or attributes
is transferred in the other direction. If only a small fraction of the tuples in S participate in the join,
this can be quite an efficient solution to minimizing data transfer.

To illustrate this, consider the following strategy for executing Q or Q′:

Project the join attributes of DEPARTMENT at site 2, and transfer them to site 1. For Q, we transfer F = πDnumber(DEPARTMENT), whose size is 4 * 100 = 400 bytes, whereas, for Q′, we transfer F′ = πMgr_ssn(DEPARTMENT), whose size is 9 * 100 = 900 bytes.

Join the transferred file with the EMPLOYEE relation at site 1, and transfer the required attributes from the resulting file to site 2. For Q, we transfer R = πDno, Fname, Lname(F ⋈Dnumber=Dno EMPLOYEE), whose size is 34 * 10,000 = 340,000 bytes, whereas, for Q′, we transfer R′ = πMgr_ssn, Fname, Lname(F′ ⋈Mgr_ssn=Ssn EMPLOYEE), whose size is 39 * 100 = 3,900 bytes.

Execute the query by joining the transferred file R or R′ with DEPARTMENT, and present the result to the user at site 2.

Using this strategy, we transfer 340,400 bytes for Q and 4,800 bytes for Q′. We limited the EMPLOYEE attributes and tuples transmitted to site 2 in step 2 to only those that will actually be joined with a DEPARTMENT tuple in step 3. For query Q, this turned out to include all EMPLOYEE tuples, so little improvement was achieved. However, for Q′ only 100 out of the 10,000 EMPLOYEE tuples were needed.

The semijoin operation was devised to formalize this strategy. A semijoin operation R ⋉A=B S, where A and B are domain-compatible attributes of R and S, respectively, produces the same result as the relational algebra expression πR(R ⋈A=B S). In a distributed environment where R and S reside at different sites, the semijoin is typically implemented by first transferring F = πB(S) to the site where R resides and then joining F with R, thus leading to the strategy discussed here.

Notice that the semijoin operation is not commutative; that is, R ⋉A=B S is not, in general, the same as S ⋉B=A R.
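A toy simulation of this semijoin strategy is sketched below. The sample EMPLOYEE and DEPARTMENT rows are invented, and the byte counts are ignored; the point is only the order of the data movements between the two sites.

# Toy semijoin sketch: reduce EMPLOYEE before shipping data between sites.
employee = [   # held at site 1 (invented sample rows)
    {"Ssn": "111", "Fname": "Abebe", "Lname": "Kebede",  "Dno": 5},
    {"Ssn": "222", "Fname": "Hana",  "Lname": "Tesfaye", "Dno": 4},
]
department = [  # held at site 2
    {"Dnumber": 5, "Dname": "Research"},
]

# Step 1: project the join column of DEPARTMENT and send it to site 1.
f = {d["Dnumber"] for d in department}              # F = pi_Dnumber(DEPARTMENT)

# Step 2: at site 1, keep only the EMPLOYEE tuples that will join,
# project the needed attributes, and ship the reduced file back to site 2.
r = [{"Dno": e["Dno"], "Fname": e["Fname"], "Lname": e["Lname"]}
     for e in employee if e["Dno"] in f]

# Step 3: at site 2, join the reduced file with DEPARTMENT for the final result.
result = [{**t, "Dname": d["Dname"]}
          for t in r for d in department if d["Dnumber"] == t["Dno"]]
print(result)   # [{'Dno': 5, 'Fname': 'Abebe', 'Lname': 'Kebede', 'Dname': 'Research'}]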

4.7. Query and Update Decomposition

In a DDBMS with no distribution transparency, the user phrases a query directly in terms of specific fragments. For example, consider another query Q: Retrieve the names and hours per week for each employee who works on some project controlled by department 5. This query is specified on a distributed database where fragments are stored at sites 2 and 3 and the full relations are stored at site 1, as in our earlier example. A user who submits such a query must specify whether it references the PROJS_5 and WORKS_ON_5 fragments at site 2 or the PROJECT and WORKS_ON relations at site 1. The user must also maintain consistency of replicated data items when updating a DDBMS with no replication transparency.

On the other hand, a DDBMS that supports full distribution, fragmentation, and replication transparency allows the user to specify a query or update request on the global schema just as though the DBMS were centralized. For updates, the DDBMS is responsible for maintaining consistency among replicated items by using one of the distributed concurrency control algorithms. For queries, a query decomposition module
must break up or decompose a query into subqueries that can be executed at the individual sites.
Additionally, a strategy for combining the results of the subqueries to form the query result must
be generated. Whenever the DDBMS determines that an item referenced in the query is replicated,
it must choose or materialize a particular replica during query execution.

To determine which replicas include the data items referenced in a query, the DDBMS refers to
the fragmentation, replication, and distribution information stored in the DDBMS catalog. For
vertical fragmentation, the attribute list for each fragment is kept in the catalog. For horizontal
fragmentation, a condition, some-times called a guard, is kept for each fragment. This is basically
a selection condition that specifies which tuples exist in the fragment; it is called a guard
because only tuples that satisfy this condition are permitted to be stored in the fragment. For
mixed fragments, both the attribute list and the guard condition are kept in the catalog.

In our earlier example, the guard conditions for the fragments at site 1 are TRUE (all tuples), and the attribute lists are * (all attributes). For the fragments at sites 2 and 3, we have the guard conditions and attribute lists listed below. When the DDBMS decomposes an update request, it can determine
which fragments must be updated by examining their guard conditions. For example, a user request
to insert a new EMPLOYEE tuple <‘Alex’, ‘B’, ‘Coleman’, ‘345671239’, ‘22-APR-64’, ‘3306
Sandstone, Houston, TX’, M, 33000, ‘987654321’, 4> would be decomposed by the DDBMS into
two insert requests: the first inserts the preceding tuple in the EMPLOYEE fragment at site 1, and
the second inserts the projected tuple <‘Alex’, ‘B’, ‘Coleman’, ‘345671239’, 33000,
‘987654321’, 4> in the EMPD4 fragment at site 3.

For query decomposition, the DDBMS can determine which fragments may contain the required tuples by comparing the query condition with the guard conditions. The guard conditions and attribute lists for the fragments at sites 2 and 3 are as follows:

(a) Site 2 fragments:

EMPD5
attribute list: Fname, Minit, Lname, Ssn, Salary, Super_ssn, Dno
guard condition: Dno=5

DEP5
attribute list: * (all attributes: Dname, Dnumber, Mgr_ssn, Mgr_start_date)
guard condition: Dnumber=5

DEP5_LOCS
attribute list: * (all attributes: Dnumber, Location)
guard condition: Dnumber=5

PROJS5
attribute list: * (all attributes: Pname, Pnumber, Plocation, Dnum)
guard condition: Dnum=5

WORKS_ON5
attribute list: * (all attributes: Essn, Pno, Hours)
guard condition: Essn IN (πSsn (EMPD5)) OR Pno IN (πPnumber (PROJS5))

(b) Site 3 fragments:

EMPD4
attribute list: Fname, Minit, Lname, Ssn, Salary, Super_ssn, Dno
guard condition: Dno=4

DEP4
attribute list: * (all attributes: Dname, Dnumber, Mgr_ssn, Mgr_start_date)
guard condition: Dnumber=4

DEP4_LOCS
attribute list: * (all attributes: Dnumber, Location)
guard condition: Dnumber=4

PROJS4
attribute list: * (all attributes: Pname, Pnumber, Plocation, Dnum)
guard condition: Dnum=4

WORKS_ON4
attribute list: * (all attributes: Essn, Pno, Hours)
guard condition: Essn IN (πSsn (EMPD4)) OR Pno IN (πPnumber (PROJS4))

For example, consider again the query Q: Retrieve the names and hours per week for each employee who works on some project controlled by department 5. This can be specified in SQL on the schema as:

SELECT Fname, Lname, Hours
FROM EMPLOYEE, PROJECT, WORKS_ON
WHERE Dnum=5 AND Pnumber=Pno AND Essn=Ssn;

Suppose that the query is submitted at site 2, which is where the query result will be needed. The DDBMS can determine from the guard conditions on PROJS5 and WORKS_ON5 that all tuples satisfying the conditions (Dnum = 5 AND Pnumber = Pno) reside at site 2. Hence, it may decompose the query into the following relational algebra subqueries:

T1 ← πEssn(PROJS5 ⋈Pnumber=Pno WORKS_ON5)
T2 ← πEssn, Fname, Lname(T1 ⋈Essn=Ssn EMPLOYEE)

This decomposition can be used to execute the query by using a semijoin strategy. The DDBMS
knows from the guard conditions that PROJS5 contains exactly those tuples satisfying (Dnum =

5) and that WORKS_ON5 contains all tuples to be joined with PROJS5; hence, subquery T1 can
be executed at site 2, and the projected column Essn can be sent to site 1. Subquery T2 can then be
executed at site 1, and the result can be sent back to site 2, where the final query result is calculated
and displayed to the user. An alternative strategy would be to send the query Q itself to site 1,
which includes all the database tuples, where it would be executed locally and from which the
result would be sent back to site 2. The query optimizer would estimate the costs of both strategies
and would choose the one with the lower cost estimate.
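The way guard conditions drive update decomposition can be sketched as follows. The guards mirror the EMPLOYEE, EMPD5, and EMPD4 fragments above, while the dictionary-based routing function is an invented stand-in for the DDBMS catalog lookup.

# Hypothetical sketch: route an EMPLOYEE insert to every fragment whose
# guard condition the new tuple satisfies (plus the full copy at site 1).
guards = {
    "EMPLOYEE@site1": lambda t: True,           # guard TRUE, all attributes
    "EMPD5@site2":    lambda t: t["Dno"] == 5,  # guard condition Dno=5
    "EMPD4@site3":    lambda t: t["Dno"] == 4,  # guard condition Dno=4
}

def decompose_insert(new_tuple):
    """Return the fragments that must receive an insert for this tuple."""
    return [frag for frag, guard in guards.items() if guard(new_tuple)]

new_emp = {"Fname": "Alex", "Lname": "Coleman", "Ssn": "345671239",
           "Salary": 33000, "Super_ssn": "987654321", "Dno": 4}
print(decompose_insert(new_emp))    # ['EMPLOYEE@site1', 'EMPD4@site3']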

4.8. Distributed Database Transparency Features


Transparency features allow the end user to see the distributed database as though it were a single, centralized database. Transparency features include:

 Distribution transparency
 Transaction transparency
 Performance transparency

4.8.1. Distribution Transparency

This refers to freedom for the user from the operational details of the network. Allows
management of a physically dispersed database as though it were a centralized database

Three levels of distribution transparency are recognized:

 Fragmentation transparency
 Location transparency
 Local mapping transparency

Table 20: A Summary of Transparency Features

If the SQL statement requires:

Fragment Name   Location Name   Then the DBMS supports         Level of Distribution Transparency
Yes             Yes             Local Mapping                  Low
Yes             No              Location Transparency          Medium
No              No              Fragmentation Transparency     High

 Replica transparency: DDBMS’s ability to hide the existence of multiple copies of data
from the user. Copies of data may be stored at multiple sites for better availability,
performance, and reliability.

4.8.2. Transaction Transparency

Transaction transparency ensures that database transactions will maintain the distributed database's integrity and consistency. A distributed transaction accesses data stored at more than one location. Each transaction is divided into a number of subtransactions, one for each site that has to be accessed, and the DDBMS must ensure the indivisibility of both the global transaction and each of its subtransactions. We should distinguish between remote requests, distributed transactions, and distributed requests:

Remote request: Lets a single SQL statement access data to be processed by a single remote
database processor

Figure 12: Accesses data at a single remote site

Distributed transaction:

A distributed transaction can update or request data from several different remote sites on a network; it allows a transaction to reference several different (local or remote) DP sites.

Figure 13: Distributed transaction

Distributed request: Lets a single SQL statement reference data located at several different local
or remote DP sites

Figure 14: Another Distributed Request


4.9. Performance Transparency and Query Optimization

 DDBMS must perform as if it were a centralized DBMS


o DDBMS should not suffer any performance degradation due to the distributed architecture.
o DDBMS should determine the most cost-effective strategy to execute a request.
 Distributed Query Processor (DQP) maps data request into ordered sequence of operations
on local databases.
 Must consider fragmentation, replication, and allocation schemas
 DQP has to decide
 which fragment to access
 which copy of a fragment to use

 Which location to use.
 DQP produces execution strategy optimized with respect to some cost function.
 Typically, costs associated with a distributed request include
 I/O cost;
 CPU cost
 Communication cost.
Objective of query optimization routine is to minimize total cost associated with the execution
of a request.

Example of Query optimization transparency:

Site A: Suppliers ( S_id, City ) 10,000 tuples

Contracts ( S_id, P_id ) 1,000,000 tuples

Site B: Parts (P_id, Colour ) 100,000 tuples

SELECT S.S_id
FROM Suppliers S, Contracts C, Parts P
WHERE S.S_id = C.S_id AND P.P_id = C.P_id AND
      City = ‘London’ AND Colour = ‘red’;

 Possible evaluation procedures:


1. move relation Parts to site A and evaluate the query at A
2. move relations Suppliers and Contracts to B and evaluate at B
3. join Suppliers with Contracts at A, restrict the tuples for suppliers from London, and for
each of these tuples check at site B to see whether the corresponding part is red
4. join Suppliers with Contracts at A, restrict the tuples for suppliers from London, transfer them to B and complete the processing there

5. restrict Parts to tuples containing red parts, move the result to A and process there
6. think of other possibilities …
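To see why the choice of plan matters, the rough Python sketch below estimates the bytes shipped by a few of the candidate plans above. Only the tuple counts come from the example; the bytes-per-tuple figure and the fraction of red parts are invented assumptions, so the numbers are illustrative only.

# Rough, assumption-laden estimate of data shipped by some candidate plans.
SUPPLIERS, CONTRACTS, PARTS = 10_000, 1_000_000, 100_000   # tuple counts from the example
ROW = 40              # assumed bytes per tuple (not given in the example)
RED_FRACTION = 0.1    # assumed fraction of parts that are red

plans = {
    "1: move Parts to site A":               PARTS * ROW,
    "2: move Suppliers and Contracts to B":  (SUPPLIERS + CONTRACTS) * ROW,
    "5: move only the red Parts to site A":  int(PARTS * RED_FRACTION) * ROW,
}
for plan, cost in sorted(plans.items(), key=lambda p: p[1]):
    print(f"{plan}: ~{cost:,} bytes shipped")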

 there is an extra dimension added by the site where the query was issued
 Costs associated with a request are a function of the:
 Access time (I/O) cost
 Communication cost
 CPU time cost
 Must provide distribution transparency as well as replica transparency
 Query optimization techniques:
o Manual (performed by the end user or programmer) or automatic (performed by the DDBMS)
o Static (takes place at compilation time) or dynamic (takes place at run time)
o Statistically based (uses information about the DB such as size, number of records, and average access time; performed by the DDBMS) or rule-based (based on a set of user-defined rules to determine the best query access strategy; supplied by the end user)

4.9.1. Distributed Concurrency Control

Multisite, multiple-process operations are much more likely to create data inconsistencies and
deadlocked transactions than are single-site systems

4.10. The Effect of a Premature COMMIT

Figure 15: The Effect of a Premature COMMIT

4.10.1. Two-Phase Commit Protocol

 Distributed databases make it possible for a transaction to access data at several sites
 The objective of the 2PC protocol is to ensure that all nodes commit their part of the
transaction.
 Final COMMIT must not be issued until all sites have committed their parts of the
transaction
 Two-phase commit protocol requires:
 DO-UNDO-REDO protocol: used by DP to roll back and/or roll forward transactions
with the help of transaction log entries.
 DO perform the operation and writes the before and after values in the
transaction log.
 UNDO reverses the operation, using the Transaction log entry written by DO
operation
 REDO redoes the operation, using log entries
 Write-ahead protocol: forces the log entry to be written to permanent storage before the actual operation takes place. Each individual DP's transaction log entry must be written before the database fragment is actually updated.
 The 2PC protocol defines operations between two types of nodes:
 Coordinator: the TP in the site where the transaction is executed
 Subordinates or cohorts: DPs in sites where data affected by the transaction is located.

4.10.2. Phases of Two-Phase Commit Protocol

Phase 1: Preparation

 The coordinator sends a PREPARE TO COMMIT message to all subordinates.


 The subordinates receive the message, write a transaction log entry, and send a reply (prepared / not prepared) to the coordinator.
 The coordinator makes sure that all nodes are ready to commit, or it aborts the action.

Phase 2: The Final COMMIT

 The coordinator broadcasts a COMMIT message to all subordinates and waits for replies.
 Each subordinate receives the message, then updates the database using the DO protocol.
 The subordinates reply with a COMMITTED or NOT COMMITTED message to the coordinator.
 If one or more subordinates did not commit, the coordinator sends an ABORT message, forcing the subordinates to UNDO their changes.
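As a compressed sketch of these two phases, the Python fragment below models only the message flow between a coordinator and its subordinates. It is a toy, with no transaction log, write-ahead rule, time-outs, or failure handling, and the class and function names are invented for the illustration.

# Minimal two-phase commit sketch: message flow only, no logs or time-outs.
class Subordinate:
    def __init__(self, name, can_commit=True):
        self.name, self.can_commit, self.state = name, can_commit, "ACTIVE"

    def prepare(self):                      # Phase 1: vote on PREPARE TO COMMIT
        self.state = "PREPARED" if self.can_commit else "ABORTED"
        return self.can_commit

    def finish(self, decision):             # Phase 2: apply the global decision
        self.state = decision
        return "ACK"

def two_phase_commit(coordinator_log, subordinates):
    votes = [s.prepare() for s in subordinates]        # Phase 1: collect votes
    decision = "COMMIT" if all(votes) else "ABORT"     # coordinator decides
    coordinator_log.append(decision)
    acks = [s.finish(decision) for s in subordinates]  # Phase 2: broadcast + ACKs
    return decision, acks

log = []
print(two_phase_commit(log, [Subordinate("DP1"), Subordinate("DP2", can_commit=False)]))
# ('ABORT', ['ACK', 'ACK'])  -- one NO vote is enough to abort the whole transaction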

4.11. Distributed Transaction Management and Recovery


The site where the transaction originated can temporarily assume the role of global transaction
manager and coordinate the execution of database operations with transaction managers across
multiple sites. Transaction managers export their functionality as an interface to the application
programs. The manager stores bookkeeping information related to each transaction, such as a
unique identifier, originating site, name, and so on. For READ operations, it returns a local copy
if valid and available.

For WRITE operations, it ensures that updates are visible across all sites containing copies
(replicas) of the data item. For ABORT operations, the manager ensures that no effects of the
transaction are reflected in any site of the distributed database. For COMMIT operations, it
ensures that the effects of a write are persistently recorded on all databases containing copies of
the data item. Atomic termination (COMMIT/ ABORT) of distributed transactions is commonly
implemented using the two-phase commit protocol.

The transaction manager passes to the concurrency controller the database operation and
associated information. The controller is responsible for acquisition and release of associated
locks. If the transaction requires access to a locked resource, it is delayed until the lock is acquired.
Once the lock is acquired, the operation is sent to the runtime processor, which handles the actual
execution of the database operation. Once the operation is completed, locks are released and the
transaction manager is updated with the result of the operation.

1. Two-Phase Commit Protocol

We described the two-phase commit protocol (2PC) above, which requires a global recovery manager, or coordinator, to maintain the information needed for recovery, in addition to the local recovery managers and the information they maintain (logs, tables). The two-phase commit protocol has certain drawbacks that led to the development of the three-phase commit protocol, which we discuss next.

2. Three-Phase Commit Protocol

The biggest drawback of 2PC is that it is a blocking protocol. Failure of the coordinator blocks all
participating sites, causing them to wait until the coordinator recovers. This can cause performance
degradation, especially if participants are holding locks to shared resources. Another problematic
scenario is when both the coordinator and a participant that has committed crash together. In the
two-phase commit protocol, a participant has no way to ensure that all participants got the commit
message in the second phase. Hence once a decision to commit has been made by the coordinator
in the first phase, participants will commit their transactions in the second phase independent of
receipt of a global commit message by other participants. Thus, in the situation that both the
coordinator and a committed participant crash together, the result of the transaction becomes
uncertain or nondeterministic. Since the transaction has already been committed by one
participant, it cannot be aborted on recovery by the coordinator. Also, the transaction cannot be
optimistically committed on recovery since the original vote of the coordinator may have been to
abort.

These problems are solved by the three-phase commit (3PC) protocol, which essentially divides
the second commit phase into two sub phases called prepare-to-commit and commit. The
prepare-to-commit phase is used to communicate the result of the vote phase to all participants. If
all participants vote yes, then the coordinator instructs them to move into the prepare-to-commit
state. The commit sub phase is identical to its two-phase counterpart. Now, if the coordinator
crashes during this sub phase, another participant can see the transaction through to completion. It
can simply ask a crashed participant if it received a prepare-to-commit message. If it did not, then
it safely assumes to abort. Thus the state of the protocol can be recovered irrespective of which
participant crashes. Also, by limiting the time required for a transaction to commit or abort to a
maximum time-out period, the protocol ensures that a transaction attempting to commit via 3PC
releases locks on time-out.

The main idea is to limit the wait time for participants who have committed and are waiting for a
global commit or abort from the coordinator. When a participant receives a precommit message,

it knows that the rest of the participants have voted to commit. If a precommit message has not
been received, then the participant will abort and release all locks.

4.12. Operating System Support for Transaction Management

The following are the main benefits of operating system (OS)-supported transaction management:

1. Typically, DBMSs use their own semaphores to guarantee mutually exclusive access to
shared resources. Since these semaphores are implemented in user space at the level of the
DBMS application software, the OS has no knowledge about them. Hence if the OS
deactivates a DBMS process holding a lock, other DBMS processes wanting this lock
resource get queued. Such a situation can cause serious performance degradation. OS-level
knowledge of semaphores can help eliminate such situations.
2. Specialized hardware support for locking can be exploited to reduce associated costs. This
can be of great importance, since locking is one of the most common DBMS operations.
3. Providing a set of common transaction support operations through the kernel allows
application developers to focus on adding new features to their products as opposed to
reimplementing the common functionality for each application. For example, if different
DDBMSs are to coexist on the same machine and they chose the two-phase commit
protocol, then it is more beneficial to have this protocol implemented as part of the kernel
so that the DDBMS developers can focus more on adding new features to their products.

Transactions may be performed effectively using distributed transaction processing. However, there are instances in which a transaction may fail for a variety of causes: system failure, hardware failure, network errors, inaccurate or invalid data, and application problems are all probable causes. Transaction failures are impossible to avoid, so these failures must be handled by the distributed transaction system. When such failures arise, one must be able to identify and correct them. This procedure is called transaction recovery. In distributed databases, recovery is the most difficult procedure; it is extremely difficult to recover a communication network system that has failed.

Let us consider the following scenario to analyze how a transaction failure may occur. Suppose we have two sites, A and B. Site A sends a message to site B and expects a response, but site B is unable to receive it.

The following are some of the issues with this circumstance:

 The message was not delivered due to a network problem.
 The reply sent by site B was not delivered to site A.
 Site B has failed.

As a result, locating the source of a problem in a big communication network is extremely challenging.
Distributed commit in the network is another major issue that can wreak havoc on a distributed
database’s recovery.

One of the most famous methods of Transaction Recovery is the “Two-Phase Commit
Protocol”. The coordinator and the subordinate are the two types of nodes that the Two-Phase
Commit Protocol uses to accomplish its procedures. The coordinator’s process is linked to the user
app, and communication channels between the subordinates and the coordinator are formed.

The two-phase commit protocol contains two stages, as the name implies. The first step is the
PREPARE phase, in which the transaction’s coordinator delivers a PREPARE message. The
second step is the decision-making phase, in which the coordinator sends a COMMIT message if
all of the nodes can complete the transaction, or an abort message if at least one subordinate node
cannot. Centralized 2PC, Linear 2PC, and Distributed 2PC are all ways that may be used to
perform the 2PC.

 Centralized 2 PC: Contact in the Centralized 2PC is limited to the coordinator’s process, and
no communication between subordinates is permitted. The coordinator is in charge of sending
the PREPARE message to the subordinates, and once all of the subordinates’ votes have been
received and analyzed, the coordinator chooses whether to abort or commit. There are two
stages to this method:
 The First Phase: When a user desires to COMMIT a transaction during this phase,
the coordinator sends a PREPARE message to all subordinates. When a subordinate
gets the PREPARE message, it either records a PREPARE log and sends a YES
VOTE and enters the PREPARED state if the subordinate is willing to COMMIT;
or it creates an abort record and sends a NO VOTE if the subordinate is not willing
to COMMIT. Because it knows the coordinator will issue an abort, a subordinate
transmitting a NO VOTE does not need to enter a PREPARED state. In this

situation, the NO VOTE functions as a veto since only one NO VOTE is required
to cancel the transaction.
 Second Phase: After the coordinator has reached a decision, it must communicate
that decision to the subordinates. If COMMIT is chosen, the coordinator enters the
committing state and sends a COMMIT message to all subordinates notifying them
of the choice. When the subordinates get the COMMIT message, they go into the
committing state and send the coordinator an acknowledge (ACK) message. The
transaction is completed when the coordinator gets the ACK messages. If the
coordinator, on the other hand, makes an ABORT decision, it sends an ABORT
message to all subordinates. In this case, the coordinator does not need to send an
ABORT message to the NO VOTE subordinate(s).
 Linear 2PC: Subordinates in the linear 2PC can communicate with one another. The sites are numbered 1 to N, with site 1 being the coordinator, and the PREPARE message is propagated in a sequential manner. As a result, the transaction takes longer to complete than with the centralized or distributed approaches. Finally, it is node N that sends out the global COMMIT.
 Distributed 2 PC: All of the nodes of a distributed 2PC interact with one another. Unlike other
2PC techniques, this procedure does not require the second phase. Furthermore, in order to
know that each node has put in its vote, each node must hold a list of all participating nodes.
When the coordinator delivers a PREPARE message to all participating nodes, the distributed
2PC gets started. When a participant receives the PREPARE message, it transmits his or her
vote to all other participants. As a result, each node keeps track of every transaction’s
participants.

Chapter 5
Object Oriented DBMS
5.1. Object Oriented Concepts
The term object-oriented—abbreviated by OO or O-O—has its origins in OO programming
languages, or OOPLs. Today OO concepts are applied in the areas of databases, software
engineering, knowledge bases, artificial intelligence, and computer systems in general. An object
typically has two components: state (value) and behavior (operations).

Objects in an OOPL exist only during program execution and are hence called transient objects.
An OO database can extend the existence of objects so that they are stored permanently, and hence
the objects persist beyond program termination and can be retrieved later and shared by other
programs. In other words, OO databases store persistent objects permanently on secondary storage,
and allow the sharing of these objects among multiple programs and applications. This requires
the incorporation of other well-known features of database management systems, such as indexing
mechanisms, concurrency control, and recovery. An OO database system interfaces with one or
more OO programming languages to provide persistent and shared object capabilities.

One goal of OO databases is to maintain a direct correspondence between real-world and database
objects so that objects do not lose their integrity and identity and can easily be identified and
operated upon. Hence, OO databases provide a unique system-generated object identifier (OID)
for each object. We can compare this with the relational model where each relation must have a
primary key attribute whose value identifies each tuple uniquely. In the relational model, if the
value of the primary key is changed, the tuple will have a new identity, even though it may still
represent the same real-world object. Alternatively, a real-world object may have different names
for key attributes in different relations, making it difficult to ascertain that the keys represent the
same object (for example, the object identifier may be represented as EMP_ID in one relation and
as SSN in another).
Another feature of OO databases is that objects may have an object structure of arbitrary
complexity in order to contain all of the necessary information that describes the object. In
contrast, in traditional database systems, information about a complex object is often scattered

over many relations or records, leading to loss of direct correspondence between a real-world
object and its database representation.
Applications for OO databases, there are many fields where it is believed that the OO model
can be used to overcome some of the limitations of Relational technology, where the use of
complex data types and the need for high performance are essential. These applications
include:
 Computer-aided design and manufacturing (CAD/CAM)
 Computer-integrated manufacturing (CIM)
 Computer-aided software engineering (CASE)
 Geographic information systems (GIS)
 Many applications in science and medicine
 Document storage and retrieval

Object-oriented databases closely relate to object-oriented programming concepts. The main


ideas of object-oriented programming are:

 Object
 Class
 Polymorphism
 Inheritance
 Encapsulation

Objects

Objects represent real-world entities and concepts, both tangible and intangible; for example, a person, a drama, or a license.

 Every object has a unique identifier (OID). The value of an OID is not visible to the
external user, but is used internally by the system to identify each object uniquely and to
create and manage inter-object references. The OID can be assigned to program variables
of the appropriate type when needed.
1. System generated
2. Never changes in the lifetime of the object
Object Structure:

 Loosely speaking, an object corresponds to an entity in the E-R model.
 The object-oriented paradigm is based on encapsulating the code and data related to an object into a single unit.
 An object has:
 A set of variables that contain the data for the object. The value of each variable is itself an object.
 A set of messages to which the object responds; each message may have zero, one, or more parameters.
 A set of methods, each of which is a body of code implementing a message; a method returns a value as the response to the message.
 Objects are categorized by their type or class.
 An object is an instance of a type or class.
Class

 Similar objects are grouped into a class.
 Each individual object is called an instance of its class.
 All objects in a class have the same variables (with the same types), the same message interface, and the same methods; they may differ only in the values assigned to their variables.
 e.g., group objects for people into a person class
 Classes are analogous to entity sets in the E-R model.

Class Definition Example

class employee {
    /* Variables */
    string name;
    string address;
    date start-date;
    int salary;
    /* Messages */
    int annual-salary();
    string get-name();
    string get-address();
    int set-address(string new-address);
    int employment-length();
};

 Methods to read and set the other variables are also needed with strict encapsulation
 Methods are defined separately
   E.g. int employment-length() { return today() - start-date; }
        int set-address(string new-address) { address = new-address; }
 Polymorphism

Polymorphism is the capability of an object to take multiple forms. This ability allows the same program code to work with different data types. Both a car and a motorcycle are able to brake, but the mechanism is different. In this example, the action brake is polymorphic: the result changes depending on which vehicle performs it.

Inheritance

Inheritance creates a hierarchical relationship between related classes while making parts of code
reusable. Defining new types inherits all the existing class fields and methods plus further extends
them. The existing class is the parent class, while the child class extends the parent.

For example, a parent class called Vehicle will have child classes Car and Bike. Both child
classes inherit information from the parent class and extend the parent class with new information
depending on the vehicle type.

Encapsulation

Encapsulation is the ability to group data and mechanisms into a single object to provide access
protection. Through this process, pieces of information and details of how an object works
are hidden, resulting in data and function security. Classes interact with each other through
methods without the need to know how particular methods work. As an example, a car has

descriptive characteristics and actions. You can change the color of a car, yet the model or make
are examples of properties that cannot change. A class encapsulates all the car information into
one entity, where some elements are modifiable while some are not.
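The three ideas above can be shown together in a few lines. The sketch below reuses the Vehicle, Car, and Bike examples from the text in Python; the attribute names and the hidden VIN field are invented for the illustration.

# Illustrative only: inheritance, polymorphism and encapsulation in one sketch.
class Vehicle:                       # parent class
    def __init__(self, colour):
        self.colour = colour         # modifiable property
        self.__vin = "hidden-123"    # encapsulated: name-mangled, not meant for outside use

    def brake(self):                 # polymorphic action, redefined by the children
        return "generic braking"

class Car(Vehicle):                  # child classes inherit fields and methods
    def brake(self):
        return "braking with four disc brakes"

class Bike(Vehicle):
    def brake(self):
        return "braking with two rim brakes"

for v in (Car("red"), Bike("blue")):
    print(v.colour, "->", v.brake())     # same call, different behaviour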

5.2. Drawbacks of relational DBMS


Relational databases are widely used in many industries to store financial records, keep track
of inventory and to keep records on employees. In a relational database, information is stored
in tables (often called relations) which help organize and structure data. Even though they
are widely used, relational databases have some drawbacks
 Cost

One disadvantage of relational databases is the expense of setting up and maintaining the database system. In order to set up a relational database, you generally need to purchase
special software. If you are not a programmer, you can use any number of products to set
up a relational database. It does take time to enter in all the information and set up the
program. If your company is large and you need a more robust database, you will need to
hire a programmer to create a relational database using Structured Query Language (SQL)
and a database administrator to maintain the database once it is built. Regardless of what
data you use, you will have to either import it from other data like text files or Excel
spreadsheets, or have the data entered at the keyboard. No matter the size of your company,
if you store legally confidential or protected information in your database such as health
information, social security numbers or credit card numbers, you will also have to secure
your data against unauthorized access in order to meet regulatory standards.
 Abundance of Information

Advances in the complexity of information cause another drawback to relational databases.


Relational databases are made for organizing data by common characteristics. Complex
images, numbers, designs and multimedia products defy easy categorization leading the
way for a new type of database called object-relational database management systems.
These systems are designed to handle the more complex applications and have the ability
to be scalable.

 Structured Limits

Some relational databases have limits on field lengths. When you design the database, you have to specify the maximum amount of data that can fit into a field. Some names or values turn out to be longer than the field allows, and this can lead to data loss.

 Isolated Databases

Complex relational database systems can lead to these databases becoming "islands of
information" where the information cannot be shared easily from one large system to another.
Often, with big firms or institutions, you find relational databases grew in separate divisions
differently. For example, maybe the hospital billing department used one database while the
hospital personnel department used a different database. Getting those databases to "talk" to each
other can be a large, and expensive, undertaking, yet in a complex hospital system, all the databases
need to be involved for good patient and employee care.

OODBMS definitions

An object-oriented database management system (OODBMS) is a database management system


that supports the creation and modeling of data as objects. OODBMS also includes support for
classes of objects and the inheritance of class properties, and incorporates methods, subclasses and
their objects. Most of the object databases also offer some kind of query language, permitting
objects to be found through a declarative programming approach.

An object-oriented database management system (OODBMS) applies concepts of object-


oriented programming, and applies them to the management of persistent objects on behalf of
multiple users, with capabilities for security, integrity, recovery and contention management. An
OODBMS is based on the principles of “objects,” namely abstract data types, classes, inheritance
mechanisms, polymorphism, dynamic binding and message passing.

The ODBMS, which is an abbreviation for object-oriented database management system, is a data model in which data is stored in the form of objects, which are instances of classes. These classes and objects together make up an object-oriented data model.

Components of Object-Oriented Data Model:
The OODBMS is based on three major components, namely: Object structure, Object
classes, and Object identity. These are explained below.

1. Object Structure:
The structure of an object refers to the properties that an object is made up of. These
properties of an object are referred to as an attribute. Thus, an object is a real-world entity
with certain attributes that makes up the object structure. Also, an object encapsulates the
data code into a single unit which in turn provides data abstraction by hiding the
implementation details from the user.

The object structure is further composed of three types of components: Messages,


Methods, and Variables. These are explained below.

1. Messages –
A message provides an interface or acts as a communication medium between an
object and the outside world. A message can be of two types:
 Read-only message: If the invoked method does not change the value of a
variable, then the invoking message is said to be a read-only message.
 Update message: If the invoked method changes the value of a variable, then the
invoking message is said to be an update message.
2. Methods –
When a message is passed then the body of code that is executed is known as a
method. Whenever a method is executed, it returns a value as output. A method can
be of two types:
 Read-only method: When the value of a variable is not affected by a method,
then it is known as the read-only method.
 Update-method: When the value of a variable change by a method, then it is
known as an update method.
3. Variables –
It stores the data of an object. The data stored in the variables makes the object
distinguishable from one another.

2. Object Classes:
An object which is a real-world entity is an instance of a class. Hence first we need to
define a class and then the objects are made which differ in the values they store but
share the same class definition. The objects in turn correspond to various messages and
variables stored in them.

Example –

class CLERK {
    //variables
    char name;
    string address;
    int id;
    int salary;
    //methods
    char get_name();
    string get_address();
    int annual_salary();
};

In the above example, we can see that CLERK is a class that holds the object's variables and messages.

An OODBMS also supports inheritance in an extensive manner as in a database there may


be many classes with similar methods, variables and messages. Thus, the concept of the
class hierarchy is maintained to depict the similarities among various classes.

The concept of encapsulation that is the data or information hiding is also supported by an
object-oriented data model. And this data model also provides the facility of abstract data
types apart from the built-in data types like char, int, float. ADT’s are the user-defined data
types that hold the values within them and can also have methods attached to them.

Thus, OODBMS provides numerous facilities to its users, both built-in and user-defined.
It incorporates the properties of an object-oriented data model with a database management
system, and supports the concept of programming paradigms like classes and objects along
with the support for other concepts like encapsulation, inheritance, and the user-defined
ADT’s (abstract data types).

5.3. OO Data modeling and E-R diagramming


5.3.1. E-R Model

ER model is used to represent real life scenarios as entities. The properties of these entities
are their attributes in the ER diagram and their connections are shown in the form of
relationships.
An ER model is generally considered a top-down approach to data design.
An example of ER model is –

Figure 16:E-R Model

Advantages of the E-R model

 The data requirements are easily understandable using an E-R model, as it utilises clear diagrams.
 The E-R model can be easily converted into a relational database.
 The E-R diagram is very easy to understand, as it has clearly defined entities and the relations between them.

Disadvantages of the E-R model

 There is no data manipulation language available for the E-R model, as it is a largely abstract concept.
 There are no standard notations for an E-R model; it depends on each individual designer how they design it.

5.4. Object Oriented Model


The object-oriented data model is based on real-life scenarios. In this model, the scenarios are represented as objects. Objects with similar functionality are grouped together and linked to other objects.

An example of the object-oriented data model is shown below.
Advantages of Object Oriented Model

 Due to inheritance, the data types can be reused in different objects. This reduces
the cost of maintaining the same data in multiple locations.
 The object oriented model is quite flexible in most cases.
 It is easier to extend the design in Object Oriented Model.
Disadvantages of Object Oriented Model

 It is not practically implemented in database systems, as it is mostly a theoretical approach.
 This model can be quite complicated to create and understand.

5.5. Objects and Attributes

Objects

An object consists of an entity and attributes which describe the state of the real-world object and the actions associated with that object.

5.6. Characteristics of Object

Some important characteristics of an object are:

 Object name: The name is used to refer to different objects in the program.
 Object identifier: This is the system-generated identifier which is assigned when a new object is created.
 Structure of object: The structure defines how the object is constructed using a constructor. In an object-oriented database, the state of a complex object can be constructed from other objects by using a type constructor. Formally, an object is represented as a triple (i, c, v), where 'i' is the object identifier, 'c' is the type constructor, and 'v' is the current value of the object.
 Transient object: In an OOPL, objects which exist only during program execution are called transient objects.
o For example: variables in an OOPL.
 Persistent objects: An object which continues to exist even after the program has terminated is called a persistent object. Object-oriented databases can store such objects in secondary memory.
Attributes
Attributes are nothing but the properties of objects in the system.
Example: An Employee object can have the attributes 'Name', 'Address', and 'Id' with assigned values such as:

Attribute    Value
Name         Abebe
Address      Hossana
Id           07

Types of Attributes
The three types of attributes are as follows:
1. Simple attributes
Attributes of a primitive data type, such as integer, string, or real, which take a literal value.
Example: 'Id' is a simple attribute whose value is 07.
2. Complex attributes
Attributes which consist of collections of other objects are called complex attributes.
Example: a collection of Employees consists of many Employee objects.
3. Reference attributes
Attributes that represent a relationship between objects and consist of a value or a collection of values are called reference attributes.
Example: Manager is a reference to a Staff object.
A short sketch showing the three kinds of attributes follows.
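As a minimal sketch only, the Department and Staff classes below are hypothetical and are used just to show how the three kinds of attributes could appear in a class definition:

#include <string>
#include <vector>

// A small class used only so that the attribute examples below are complete.
class Staff {
public:
    std::string name;
};

class Department {
public:
    // Simple attributes: primitive values such as an integer or a string.
    int         id = 0;
    std::string name;

    // Complex attribute: a collection of other objects (the employees).
    std::vector<Staff> employees;

    // Reference attribute: a relationship to another object (the manager).
    Staff* manager = nullptr;
};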
Example
An Example of the Object Oriented data model with object and attributes –

Figure 17: Object Oriented data model with object and attributes

 Shape, Circle, Rectangle and Triangle are all objects in this model.
 Circle has the attributes Center and Radius.
 Rectangle has the attributes Length and Breadth.

5.6.1. Object Identity

 Every object has a unique identity. In an object-oriented system, an OID is assigned to an object when it is created.
 In an RDBMS, identity is value-based: a primary key is used to provide uniqueness of each tuple in a relation. The primary key is unique only for that relation, not for the entire system, and because it is chosen from the attributes of the relation, it depends on the object's state.
 In an OODBMS, the OID is not value-based; it may be implemented as a variable name or a pointer and is independent of the object's state.
Properties of OID
1. Uniqueness: no two objects in the system can have the same OID; it is generated automatically by the system.
2. Invariant: the OID of an object cannot be changed throughout the object's entire lifetime.
3. Invisible: the OID is not visible to the user.
A minimal sketch of how these properties could be realized follows.
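The sketch below is only an illustration of the three properties; the ObjectBase class and its counter are hypothetical and do not correspond to a real OODBMS API.

#include <atomic>

// Hypothetical base class: every object receives a system-generated OID.
class ObjectBase {
    static std::atomic<long> next_oid;  // system-wide counter used to generate OIDs
    const long oid;                     // invariant: cannot change during the object's lifetime
protected:
    ObjectBase() : oid(++next_oid) {}   // uniqueness: assigned automatically at creation
public:
    // Invisible: the OID itself is never exposed; only identity comparison is offered.
    bool same_object_as(const ObjectBase& other) const { return oid == other.oid; }
};

std::atomic<long> ObjectBase::next_oid{0};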
Chapter 6
Data warehousing and Data Mining Techniques
6.1. Data Warehousing
A data warehouse is a relational database management system responsible for the collection
and storage of data to support management decision making and problem solving.

A data warehouse takes data from the organization's databases and creates a layer optimized for, and dedicated to, analytics.

It enables managers and other business professionals to undertake data mining, online analytical processing, market research, and decision support; it is the current evolution of Decision Support Systems (DSSs). A data warehouse maintains a copy of information from the source transaction systems.

Figure 18: Current evolution of Decision Support Systems
6.1.1. Introduction

• A data warehouse is an integrated, subject-oriented, time-variant, non-volatile database that provides support for decision making.
 Integrated: a centralized, consolidated database that integrates data derived from the entire organization.
 Subject-oriented: the data warehouse contains data organized by topics.
 Time-variant: in contrast to an operational database, which focuses on current transactions, the data warehouse represents the flow of data through time; it contains data that reflect what happened last week, last month, in the past five years, and so on.
 Nonvolatile: once data enter the data warehouse, they are never removed, because the data in the warehouse represent the company's entire history.
• Because data is added all the time, the warehouse keeps growing. Several types of applications are supported: OLAP, DSS, and data mining. OLAP is user-driven: the analyst generates a hypothesis and uses OLAP to verify it, e.g., “people with high debt are bad credit risks.” A data mining tool, by contrast, generates the hypothesis itself and performs the exploration, e.g., it finds the risk factors for granting credit.

6.1.2. Database & data warehouse: Differences

Data warehouse

 Used for OLAP (online analytical processing).
 The tables and joins are simple since they are denormalized.
 Data modeling techniques are used for data warehouse design.
 Optimized for read operations.
 Performance is high for analytical queries.
 It is usually a very large database.
 A data warehouse holds huge amounts of data kept as an analytical (storage) copy for analysis.
• The data warehouse and operational environments are separated. Data warehouse
receives its data from operational databases.
– Data warehouse environment is characterized by read-only transactions to
very large data sets.
– Operational environment is characterized by numerous update transactions to
a few data entities at a time.
– Data warehouse contains historical data over a long time horizon.
• Ultimately, information is created from data warehouses, and such information becomes the basis for rational decision making.
• The data found in a data warehouse are analyzed to discover previously unknown data characteristics, relationships, dependencies, or trends.
Data warehouses provide access to data for complex analysis, knowledge discovery, and
decision making and support high-performance demands on an organization's data and
information. The construction of data warehouses involves data cleaning, data integration and
data transformation and can be viewed as an important preprocessing step for data mining.
Database
 A database supports CRUD operations on frequently used data.
 It is basically any system which keeps data in a table format.
 Used for OLTP and as a source for data warehousing.
 The tables and joins are complex since they are normalized.
 Entity-relationship modeling techniques are used for relational database design.
 Optimized for write operations.
 Performance is low for analytical queries.
In comparison to traditional databases, data warehouses generally contain very large amounts of data from multiple sources that may include databases from different data models and sometimes files acquired from independent systems and platforms.

6.1.3. Benefits

Some significant operational issues arise with data warehousing: construction, administration, and quality control. Project management of the design, construction, and implementation of the warehouse is an important and challenging consideration that should
not be underestimated. The building of an enterprise-wide warehouse in a large organization is
a major undertaking, potentially taking years from conceptualization to implementation.
Because of the difficulty and amount of lead time required for such an undertaking, the
widespread development and deployment of data marts may provide an attractive alternative,
especially to those organizations with urgent needs for OLAP , DSS, and/or data mining
support.

The administration of a data warehouse is an intensive enterprise, proportional to the size and
complexity of the warehouse. An organization that attempts to administer a data warehouse
must realistically understand the complex nature of its administration. Although designed for
read-access, a data warehouse is no more a static structure than any of its information sources.
Source databases can be expected to evolve. The warehouse's schema and acquisition
component must be expected to be updated to handle these evolutions.

A significant issue in data warehousing is the quality control of data. Both quality and
consistency of data are major concerns. Although the data passes through a cleaning function during acquisition, quality and consistency remain significant issues for the database
administrator. Melding data from heterogeneous and disparate sources is a major challenge
given differences in naming, domain definitions, identification numbers, and the like. Every
time a source database changes, the data warehouse administrator must consider the possible
interactions with other elements of the warehouse.

Administration of a data warehouse will require far broader skills than are needed for
traditional database administration. A team of highly skilled technical experts with overlapping
areas of expertise will likely be needed, rather than a single individual. Like database
administration, data warehouse administration is only partly technical; a large part of the
responsibility requires working effectively with all the members of the organization with an
interest in the data warehouse. However difficult that can be at times for database
administrators, it is that much more challenging for data warehouse administrators, as the scope
of their responsibilities is considerably broader.
Design of the management function and selection of the management team for a data warehouse are crucial. Managing the data warehouse in a large organization will surely be a
major task. Many commercial tools are already available to support management functions.
Effective data warehouse management will certainly be a team function, requiring a wide set
of technical skills, careful coordination, and effective leadership. Just as we must prepare for
the evolution of the warehouse, we must also recognize that the skills of the management team
will, of necessity, evolve with it.

6.2. Online Transaction Processing (OLTP) and Data Warehousing


Traditional databases support on-line transaction processing (OLTP), which includes insertions, updates, and deletions, while also supporting information query requirements. Traditional relational databases are optimized to process queries that may touch a small part of the database and transactions that deal with insertions or updates of a few tuples per relation. Thus, they cannot be optimized for OLAP, DSS, or data mining. By contrast, data
warehouses are designed precisely to support efficient extraction, processing, and presentation
for analytic and decision-making purposes.

In contrast to multi-databases, which provide access to disjoint and usually heterogeneous databases, a data warehouse is frequently a store of integrated data from multiple sources,
processed for storage in a multidimensional model. Unlike most transactional databases, data
warehouses typically support time-series and trend analysis, both of which require more
historical data than are generally maintained in transactional databases. Compared with
transactional databases, data warehouses are nonvolatile. That means that information in the data
warehouse changes far less often and may be regarded as non-real-time with periodic updating.
In transactional systems, transactions are the unit and are the agent of change to the database; by
contrast, data warehouse information is much more coarse grained and is refreshed according to
a careful choice of refresh policy, usually incremental. Warehouse updates are handled by the
warehouse's acquisition component that provides all required preprocessing.

Generally, a data warehouse is a collection of decision support technologies aimed at enabling the knowledge worker (executive, manager, analyst) to make better and faster decisions.
6.3. Data Mining
6.3.1. Introduction

Data mining is part of the knowledge discovery process. Knowledge Discovery in Databases, frequently abbreviated as KDD, typically encompasses more than data mining. The knowledge discovery process comprises six phases: data selection, data cleansing, enrichment, data transformation or encoding, data mining, and the reporting and display of the discovered information.

As an example, consider a transaction database maintained by a specialty consumer goods retailer. Suppose the client data include a customer name, zip code, phone number, date of purchase, item code, price, quantity, and total amount. A variety of new knowledge can be discovered by KDD processing on this client database.

During data selection, data about specific items or categories of items, or from stores in a specific
region or area of the country, may be selected.
The data cleansing process then may correct invalid zip codes or eliminate records with incorrect
phone prefixes. Enrichment typically enhances the data with additional sources of information.
For example, given the client names and phone numbers, the store may purchase other data about
age, income, and credit rating and append them to each record.

Data transformation and encoding may be done to reduce the amount of data. For instance, item
codes may be grouped in terms of product categories into audio, video, supplies, electronic
gadgets, camera, accessories, and so on. Zip codes may be aggregated into geographic regions;
incomes may be divided into ten ranges, and so on. We showed a step called cleaning as a
precursor to the data warehouse creation. If data mining is based on an existing warehouse for this
retail store chain, we would expect that the cleaning has already been applied. It is only after such
preprocessing that data mining techniques are used to mine different rules and patterns; the paragraphs that follow describe the kinds of knowledge that such mining may discover.

Data mining is a technology that uses various techniques to discover hidden knowledge from heterogeneous and distributed historical data stored in large databases, warehouses, and other massive information repositories, so as to find patterns in data that are:

 valid: the patterns not only represent the current state but also hold on new data with some certainty
 novel: non-obvious to the system; they are generated as new facts
 useful: it should be possible to act on the item or problem
 understandable: humans should be able to interpret the pattern
Simply stated, data mining refers to extracting or “mining” knowledge from large amounts of data.
Thus, data mining should have been more appropriately named “knowledge mining from data,”
which is unfortunately somewhat long. “Knowledge mining,” a shorter term, may not reflect the
emphasis on mining from large amounts of data. Thus, such a misnomer that carries both “data”
and “mining” became a popular choice. Many other terms carry a similar or slightly different
meaning to data mining, such as knowledge mining from data, knowledge extraction, data/pattern
analysis, data archaeology, and data dredging.

Many people treat data mining as a synonym for another popularly used term, Knowledge
Discovery from Data, or KDD. Data mining requires collecting a great amount of data (available in
data warehouses or databases) to achieve the intended objective.

The goals of data mining fall into the following classes: prediction, identification, classification,
and optimization.

¤ Prediction: Data mining can show how certain attributes within the data will behave in the
future. Examples of predictive data mining include the analysis of buying transactions to
predict what consumers will buy under certain discounts, how much sales volume a store would
generate in a given period, and whether deleting a product line would yield more profits. In
such applications, business logic is used coupled with data mining. In a scientific context,
certain seismic wave patterns may predict an earthquake with high probability.

¤ Identification: Data patterns can be used to identify the existence of an item, an event, or an
activity. For example, intruders trying to break a system may be identified by the programs
executed, files accessed, and CPU time per session. In biological applications, existence of a
gene may be identified by certain sequences of nucleotide symbols in the DNA sequence. The
area known as authentication is a form of identification. It ascertains whether a user is indeed
a specific user or one from an authorized class; it involves a comparison of parameters or
images or signals against a database.
¤ Classification: Data mining can partition the data so that different classes or categories can be
identified based on combinations of parameters. For example, customers in a supermarket can
be categorized into discount-seeking shoppers, shoppers in a rush, loyal regular shoppers, and
infrequent shoppers. This classification may be used in different analyses of customer buying
transactions as a post-mining activity. Sometimes classification based on common domain
knowledge is used as an input to decompose the mining problem and make it simpler. For
instance, health foods, party foods, or school lunch foods are distinct categories in the
supermarket business. It makes sense to analyze relationships within and across categories as
separate problems. Such categorization may be used to encode the data appropriately before
subjecting it to further data mining.

¤ Optimization: One eventual goal of data mining may be to optimize the use of limited
resources such as time, space, money, or materials and to maximize output variables such as
sales or profits under a given set of constraints. As such, this goal of data mining resembles the
objective function used in operations research problems that deals with optimization under
constraints.

The term data mining is currently used in a very broad sense. In some situations it includes
statistical analysis and constrained optimization as well as machine learning. There is no sharp line
separating data mining from these disciplines.

¤ Applications of Data Mining:
Data mining technologies can be applied to a large variety of decision-making contexts in business. In particular, areas of significant payoffs are expected to include the following:

 Marketing: Applications include analysis of consumer behavior based on buying patterns; determination of marketing strategies including advertising, store location, and targeted mailing; segmentation of customers, stores, or products; and design of catalogs, store layouts, and advertising campaigns.

 Finance: Applications include analysis of creditworthiness of clients, segmentation of account receivables, performance analysis of finance investments like stocks, bonds, and mutual funds; evaluation of financing options; and fraud detection.

 Manufacturing: Applications involve optimization of resources like machines,
manpower, and materials; optimal design of manufacturing processes, shop-floor layouts,
and product design, such as for automobiles based on customer requirements.

 Health Care: Applications include analysis of the effectiveness of certain treatments; optimization of processes within a hospital; relating patient wellness data with doctor qualifications; and analyzing the side effects of drugs.

6.4. Data Mining Techniques

A. Data Preparation
Some data have problems of their own that need to be cleaned:

o Outliers: misleading data that do not fit most of the data/facts
o Missing data: attribute values might be absent and need to be replaced with estimates
o Irrelevant data: attributes in the database that might not be of interest to the DM task being developed
o Noisy data: attribute values that might be invalid or incorrect, e.g. typographical errors
o Inconsistent data, duplicate data, etc.
Major tasks in the data preparation process
1. Data cleaning: get rid of bad data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies. Problems addressed include incomplete data, missing data, and noisy data (noisy: containing noise, errors, or outliers, e.g., Salary = “−10” (an error) or “green” recorded as “rgreen”). A small sketch of simple cleaning appears after this list.

2. Data integration: integration of data from multiple sources, such as databases, data warehouses, or files, into a coherent store. Because different sources are used (databases, data warehouses, files, and sometimes non-electronic sources), data that are fine on their own may become problematic when we want to integrate them. Some of the issues are:

 Different formats and structures

 Conflicting and redundant data
 Data at different levels

3. Data reduction: obtains a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results. Why data reduction? A database or data warehouse may store terabytes of data, and complex data analysis may take a very long time to run on the complete data set. Common approaches are:
 Dimensionality reduction: select the best attributes or remove unimportant attributes.
 Size reduction: reduce the data volume by choosing alternative, smaller forms of data representation.
 Data compression: a technology that reduces the size of large files so that the smaller files take less memory space and are faster to transfer over a network or the Internet.

4. Data transformation: a function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values. Methods for data transformation include:
 Normalization: values are scaled to fall within a smaller, specified range, for example min-max normalization or z-score normalization.
 Discretization: reduces data size by dividing the range of a continuous attribute into intervals; interval labels can then be used to replace actual data values. Discretization can be performed recursively on an attribute using methods such as binning (divide values into intervals) and concept hierarchy climbing (organize concepts, i.e., attribute values, hierarchically).
Two small sketches follow: one for data cleaning and one for normalization and binning.
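The first sketch illustrates step 1 (data cleaning) under stated assumptions: the salary values, the mean-imputation strategy, and the 2-standard-deviation outlier rule are all invented for illustration.

#include <cmath>
#include <iostream>
#include <optional>
#include <vector>

int main() {
    // A salary attribute with two missing values (represented as std::nullopt).
    std::vector<std::optional<double>> salary =
        {5200, 4800, std::nullopt, 5000, 90000, std::nullopt, 5100};

    // Compute the mean over the values that are present.
    double sum = 0; int n = 0;
    for (auto& v : salary) if (v) { sum += *v; ++n; }
    double mean = sum / n;

    // Fill in missing values with the mean (one simple imputation strategy).
    for (auto& v : salary) if (!v) v = mean;

    // Flag values further than 2 standard deviations from the mean as possible outliers.
    double sq = 0;
    for (auto& v : salary) sq += (*v - mean) * (*v - mean);
    double sd = std::sqrt(sq / salary.size());
    for (auto& v : salary)
        if (std::fabs(*v - mean) > 2 * sd)
            std::cout << "possible outlier: " << *v << '\n';
}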
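The second sketch illustrates step 4 (data transformation): min-max normalization, z-score normalization, and equal-width binning. The sample values and the choice of 4 bins are assumptions made only for the example.

#include <algorithm>
#include <cmath>
#include <iostream>
#include <vector>

int main() {
    std::vector<double> v = {12, 15, 18, 22, 30, 45, 60, 75};

    double lo = *std::min_element(v.begin(), v.end());
    double hi = *std::max_element(v.begin(), v.end());

    // Min-max normalization: x' = (x - min) / (max - min), scaled into [0, 1].
    for (double x : v)
        std::cout << "min-max(" << x << ") = " << (x - lo) / (hi - lo) << '\n';

    // z-score normalization: x' = (x - mean) / standard deviation.
    double mean = 0;
    for (double x : v) mean += x;
    mean /= v.size();
    double sd = 0;
    for (double x : v) sd += (x - mean) * (x - mean);
    sd = std::sqrt(sd / v.size());
    for (double x : v)
        std::cout << "z-score(" << x << ") = " << (x - mean) / sd << '\n';

    // Equal-width binning (discretization): divide [min, max] into 4 intervals
    // and replace each value by the label of the interval it falls into.
    const int bins = 4;
    double width = (hi - lo) / bins;
    for (double x : v) {
        int label = std::min(bins - 1, static_cast<int>((x - lo) / width));
        std::cout << x << " -> bin " << label << '\n';
    }
}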

B. Classification (which is also called supervised learning) maps data into predefined groups or classes to enhance the prediction process. It predicts categorical class labels (discrete or nominal): a model is constructed from a training set and the values (class labels) of a classifying attribute, and the model is then used to classify new data. A minimal sketch follows.
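As a sketch only (a one-nearest-neighbour rule on a single numeric attribute; the training data and the income attribute are invented for illustration):

#include <cmath>
#include <iostream>
#include <string>
#include <vector>

struct Example {
    double      income;  // predictor attribute
    std::string label;   // class label, known for the training set
};

// Classify a new value by copying the label of the closest training example.
std::string classify(const std::vector<Example>& training, double income) {
    const Example* best = &training.front();
    for (const Example& e : training)
        if (std::fabs(e.income - income) < std::fabs(best->income - income))
            best = &e;
    return best->label;
}

int main() {
    std::vector<Example> training = {
        {1200, "bad credit risk"}, {1500, "bad credit risk"},
        {4800, "good credit risk"}, {5200, "good credit risk"}};
    std::cout << classify(training, 4500) << '\n';  // prints "good credit risk"
}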

C. Clustering (which is also called unsupervised learning) is a data mining (machine learning) technique that finds similarities between data according to the characteristics found in the data and groups similar data objects into one cluster. It is used to find appropriate groupings of elements for a set of data. Unlike classification, clustering is a kind of undirected knowledge discovery or unsupervised learning: there is no target field, and the relationships among the data are identified by a bottom-up approach. A small sketch follows.
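A sketch of one common clustering technique (a few iterations of k-means on one-dimensional data; k = 2, the starting centres, and the data points are assumptions made for the example):

#include <cmath>
#include <iostream>
#include <vector>

int main() {
    std::vector<double> points  = {1.0, 1.2, 0.8, 8.0, 8.3, 7.9};
    std::vector<double> centers = {0.0, 5.0};            // initial guesses for k = 2 clusters
    std::vector<int>    assignment(points.size(), 0);

    for (int iter = 0; iter < 10; ++iter) {
        // Assignment step: attach each point to its nearest centre.
        for (std::size_t i = 0; i < points.size(); ++i)
            assignment[i] =
                std::fabs(points[i] - centers[0]) <= std::fabs(points[i] - centers[1]) ? 0 : 1;

        // Update step: move each centre to the mean of the points assigned to it.
        for (int c = 0; c < 2; ++c) {
            double sum = 0; int n = 0;
            for (std::size_t i = 0; i < points.size(); ++i)
                if (assignment[i] == c) { sum += points[i]; ++n; }
            if (n > 0) centers[c] = sum / n;
        }
    }

    for (std::size_t i = 0; i < points.size(); ++i)
        std::cout << points[i] << " -> cluster " << assignment[i] << '\n';
}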

D. Association rules (also known as market basket analysis) discover interesting associations between attributes contained in a database. Based on frequency counts of the items that occur together in events, an association rule states that if item X is part of an event, then with a certain percentage item Y is also part of the event. Pattern discovery attempts to discover hidden linkages between data items. A minimal sketch of the underlying counts follows.
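The sketch below shows how the support and confidence of a rule such as X => Y can be computed from frequency counts; the transactions and item names are invented.

#include <iostream>
#include <set>
#include <string>
#include <vector>

int main() {
    // Each transaction is the set of items bought together in one basket.
    std::vector<std::set<std::string>> transactions = {
        {"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"}, {"milk"}};

    int count_x = 0, count_xy = 0;
    for (const auto& t : transactions) {
        bool has_x = t.count("bread") > 0;   // X = bread
        bool has_y = t.count("milk") > 0;    // Y = milk
        if (has_x)          ++count_x;
        if (has_x && has_y) ++count_xy;
    }

    // Support of {bread, milk}: fraction of all transactions containing both items.
    double support = static_cast<double>(count_xy) / transactions.size();
    // Confidence of bread => milk: of the transactions containing bread,
    // the fraction that also contain milk.
    double confidence = static_cast<double>(count_xy) / count_x;

    std::cout << "support = " << support << ", confidence = " << confidence << '\n';
}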