Advanced DB Lecture All in One PDF
Chapter one
Transaction processing and concurrency management
What is a transaction?
A Transaction is a mechanism for applying the desired
modifications/operations to a database. It is evident in real life that the
final database instance after a successful manipulation of the content of
the database is the most up-to-date copy of the database.
A transaction is an action, or series of actions, carried out by a single user or
application program, which accesses or changes the contents of the database
(i.e., a logical unit of work on the database).
A transaction could be a whole program, part/module of a program or a
single command.
Changes made in real time to a database are called transactions. Examples
include ATM transactions, credit card approvals, flight reservations, hotel
check-in, phone calls, supermarket scanning, academic registration and
billing.
A transaction could be composed of one or more database and
non-database operations.
A transaction transforms the database from one consistent state to another,
although consistency may be violated during the transaction.
A database transaction is a unit of interaction with database management
system or similar system that is treated in a coherent and reliable way
independent of other transactions.
Transaction processing system
A system that manages transactions and controls their access to a DBMS is
called a TP monitor. A transaction processing system (TPS) generally consists
of a TP monitor, one or more DBMSs, and a set of application programs
containing transactions.
In database field, a transaction is a group of logical operations that must
all succeed or fail as a group. Systems dedicated to supporting such
operations are known as transaction processing systems.
Advanced Database Systems Lecture Note
What we are interested in is the online transaction, which is the interaction
between the users of a database system and the shared data stored in the
database. This transaction program contains the steps involved in the business
transaction.
The DBMS has no inherent way of knowing the logical grouping of operations
that are supposed to be done together. Hence, it must provide the users a means
to logically group their operations. Key words such as BEGIN TRANSACTION,
COMMIT and ROLLBACK (or their equivalents) are available in many data
manipulation languages to delimit transactions.
A single transaction might require several queries, each reading and/or writing
information in the database. When this happens it is usually important to be sure
that the database is not left with only some of the queries carried out. For
example, when doing a money transfer, if the money was debited from one
account, it is important that it also be credited to the depositing account. Also,
transactions should not interfere with each other.
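The money transfer described above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table name account, the function transfer and the balances are assumptions made for the example, not part of the lecture. The debit and the credit are grouped between BEGIN and COMMIT, and ROLLBACK undoes the debit if anything goes wrong.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from account `src` to `dst` as one all-or-nothing unit."""
    try:
        conn.execute("BEGIN")  # start of the logical group of operations
        conn.execute("UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src))
        (balance,) = conn.execute("SELECT balance FROM account WHERE id = ?", (src,)).fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")
        conn.execute("UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst))
        conn.execute("COMMIT")  # both updates become permanent together
        return True
    except Exception:
        conn.execute("ROLLBACK")  # the debit is undone, nothing is half-applied
        return False

# isolation_level=None puts sqlite3 in autocommit mode so that we issue BEGIN ourselves
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE account (id TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO account VALUES ('A', 100), ('B', 50)")

ok = transfer(conn, "A", "B", 30)       # succeeds
failed = transfer(conn, "A", "B", 500)  # would overdraw A, so it is rolled back
balances = dict(conn.execute("SELECT id, balance FROM account"))
```

After the failed transfer the balances are the same as after the first one, which is the behaviour the paragraph asks for.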
A: Atomicity
C: Consistency
I: Isolation
D: Durability
Atomicity
It is the all-or-none property.
Every transaction should be considered as an atomic process which cannot be
subdivided into smaller tasks. Due to this property, just like an atom which exists
or does not exist, a transaction has only two states: Done or Never Started.
Done - a transaction must complete successfully and its effect
should be visible in the database.
Never Started - If a transaction fails during execution then all its
modifications must be undone to bring back the database to
the last consistent state, i.e., remove the effect of failed
transaction.
No state between Done and Never Started
Consistency
If the transaction code is correct then a transaction, at the end of its execution,
must leave the database consistent. A transaction should transform a database
from one previous consistent state to another consistent state.
Isolation
A transaction must execute without interference from other concurrent
transactions and its intermediate or partial modifications to data must not be
visible to other transactions.
Durability
The effect of a completed transaction must persist in the database, i.e., its updates
must be available to other transactions immediately after the end of its execution,
and it should not be affected by failures after the completion of the
transaction.
State of a Transaction
A transaction is an atomic operation from the user's perspective. But it has a
collection of operations and it can have a number of states during its execution.
[Figure: transaction state diagram. Labels: Start, Modify, Ok to Commit, Commit, Database Modified, Abort, End of Transaction; transitions marked "No Error" and "System Detects Error".]
Most SQL statements seem to be very short and easy to execute. But the reverse
is true if you consider each as a one-command transaction. A database
system interprets a transaction not as an application program but as a logical
sequence of low-level read and write operations (referred to as primitives).
In serial transaction execution, a transaction being executed does not interfere with
the execution of any other transaction.
Good things about serial execution
Correct execution, i.e., if the input is correct then the output will be
correct.
Fast execution, since all the resources are available to the active transaction.
The worst thing about serial execution is its very inefficient resource utilization, i.e.,
reduced parallelism.
Time | T1                    | T2
     | read (X)   {X = 10}   |
     | X := X+N   {X = 11}   |
     | write (X)  {X = 11}   |
     | read (Y)   {Y = 6}    |
     | Y := Y+N   {Y = 7}    |
     | write (Y)  {Y = 7}    |
     |                       | read (X)   {X = 11}
     |                       | X := X+N   {X = 12}
     |                       | write (X)
Final values of X, Y at the end of T1 and T2: X = 12 and Y = 7.
Thus we can witness that in serial execution of transaction, if we have two transactions
Ti and Ti+1, then Ti+1 will only be executed after the completion of Ti.
Time | T1                    | T2
     | read (X)   {X = 10}   |
     |                       | read (X)   {X = 10}
     | X := X+N   {X = 11}   |
     |                       | X := X+N   {X = 11}
     | write (X)  {X = 11}   |
     |                       | write (X)  {X = 11}
     | read (Y)   {Y = 6}    |
     | Y := Y+N   {Y = 7}    |
     | write (Y)  {Y = 7}    |
Final values at the end of T1 and T2: X = 11 and Y = 7. This improves
resource utilization but unfortunately gives an incorrect result.
The correct value of X is 12, but in concurrent execution X = 11, which is
incorrect. The reason for this error is the incorrect sharing of X by T1 and T2.
In serial execution T2 read the value of X written by T1 (i.e., 11), but in
concurrent execution T2 read the same value of X (i.e., 10) as T1 did, and the
update made by T1 was overwritten by T2's update.
This is why the final value of X is one less than what is produced by
serial execution.
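The two executions can be replayed with a small simulation. It is a sketch, not a DBMS: the function run and the step encoding are assumptions made for the example, N is taken to be 1 as in the tables, and the order of the interleaved steps assumes that T1 and T2 strictly alternate on X. Each transaction works on its own private copy of an item between read and write.

```python
def run(schedule, db, n=1):
    """Execute (transaction, operation, item) steps against the shared dict `db`."""
    local = {}  # private copies: (transaction, item) -> value
    for txn, op, item in schedule:
        if op == "read":
            local[(txn, item)] = db[item]
        elif op == "add":
            local[(txn, item)] += n
        elif op == "write":
            db[item] = local[(txn, item)]
    return db

serial = [
    ("T1", "read", "X"), ("T1", "add", "X"), ("T1", "write", "X"),
    ("T1", "read", "Y"), ("T1", "add", "Y"), ("T1", "write", "Y"),
    ("T2", "read", "X"), ("T2", "add", "X"), ("T2", "write", "X"),
]
interleaved = [
    ("T1", "read", "X"), ("T2", "read", "X"),
    ("T1", "add", "X"), ("T2", "add", "X"),
    ("T1", "write", "X"), ("T2", "write", "X"),  # T2 overwrites T1's update
    ("T1", "read", "Y"), ("T1", "add", "Y"), ("T1", "write", "Y"),
]

serial_result = run(serial, {"X": 10, "Y": 6})
interleaved_result = run(interleaved, {"X": 10, "Y": 6})
```

The serial run ends with X = 12 and the interleaved run with X = 11, reproducing the lost update.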
Every transaction is correct when executed alone, but this would not
guarantee that the interleaving of operations from these transactions will
produce a correct result.
In the above case, if done one after the other (serially) then we have
no problem.
If the execution is T1 followed by T2 then A=190
If the execution is T2 followed by T1 then A=190
Serializability
In any transaction processing system that implements concurrent
processing, there is a concept called a schedule, which
determines the execution sequence of the operations of different
transactions.
Schedule: a time-ordered sequence of the important actions taken by
one or more transactions. A schedule represents the chronological order in which
instructions are executed in the system.
Serialization
Objective of serialization is to find schedules that allow transactions
to execute concurrently without interfering with one another.
If two transactions only read data, order is not important.
Example1:
R1(x) W2(x) W1(x)
Example2:
R1(x)W2(x)W1(x)W3(x)
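The notes do not say here what the two schedules are meant to show. Assuming they are used to test conflict serializability, one common check is to build a precedence graph (an edge Ti -> Tj whenever an operation of Ti conflicts with a later operation of Tj) and look for a cycle. The sketch below does that; the function names are made up for the example.

```python
import re

def parse(schedule):
    """Turn 'R1(x) W2(x)' into [('R', 1, 'x'), ('W', 2, 'x')]."""
    return [(op, int(t), item)
            for op, t, item in re.findall(r"([RW])(\d+)\((\w+)\)", schedule)]

def precedence_edges(schedule):
    """Edge (i, j): Ti has an operation that conflicts with a later one of Tj."""
    ops = parse(schedule)
    edges = set()
    for i, (op1, t1, x1) in enumerate(ops):
        for op2, t2, x2 in ops[i + 1:]:
            # two operations conflict if they come from different transactions,
            # touch the same item, and at least one of them is a write
            if t1 != t2 and x1 == x2 and "W" in (op1, op2):
                edges.add((t1, t2))
    return edges

def has_cycle(edges):
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)

    def reachable(start, target, seen):
        for nxt in graph.get(start, ()):
            if nxt == target:
                return True
            if nxt not in seen and reachable(nxt, target, seen | {nxt}):
                return True
        return False

    return any(reachable(node, node, {node}) for node in graph)

def conflict_serializable(schedule):
    return not has_cycle(precedence_edges(schedule))

example1 = conflict_serializable("R1(x) W2(x) W1(x)")
example2 = conflict_serializable("R1(x)W2(x)W1(x)W3(x)")
```

Under this assumption both examples contain the cycle T1 -> T2 -> T1 and are therefore not conflict serializable.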
Locking and timestamping are pessimistic (conservative) approaches: they delay
transactions in case they conflict with other transactions.
Locking Method
The locking method is a mechanism for preventing simultaneous access on
a shared resource for a critical operation
Types of a Lock
Shared lock: A Read operation does not change the value of a data item.
Hence a data item can be read by two different transactions
simultaneously under shared lock mode. So only to read a
data item T1 will do: Shared lock (X), then Read (X), and finally
Unlock (X).
Exclusive lock: A write operation changes the value of the data item.
Hence two write operations from two different transactions
or a write from T1 and a read from T2 are not allowed. A
data item can be modified only under Exclusive lock. To
modify a data item T1 will do: Exclusive lock (X), then Write
(X) and finally Unlock (X).
When these locks are applied, then a transaction must behave in a special
way. This special behavior of a transaction is referred to as well-formed.
Example: T1 and T2 are two transactions. They are executed under locking
as follows. T1 locks A in exclusive mode. When T2 wants to lock A, it finds
it locked by T1, so T2 waits for T1 to unlock A. When A is released,
T2 locks A and begins execution.
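The compatibility rule for shared and exclusive locks, and the T1/T2 example above, can be sketched as a tiny lock table. The class LockTable and its methods are hypothetical names for this illustration; a refused request stands for "the transaction must wait", and lock upgrades are left out.

```python
class LockTable:
    """Grants shared ('S') and exclusive ('X') locks on data items."""

    def __init__(self):
        self.locks = {}  # item -> (mode, set of transactions holding it)

    def request(self, txn, item, mode):
        held = self.locks.get(item)
        if held is None:
            self.locks[item] = (mode, {txn})
            return True
        held_mode, holders = held
        if mode == "S" and held_mode == "S":
            holders.add(txn)  # any number of readers may share the item
            return True
        return False  # every combination involving 'X' conflicts: wait

    def unlock(self, txn, item):
        mode, holders = self.locks[item]
        holders.discard(txn)
        if not holders:
            del self.locks[item]

table = LockTable()
t1_gets_a = table.request("T1", "A", "X")  # T1 locks A in exclusive mode
t2_gets_a = table.request("T2", "A", "X")  # T2 finds A locked and must wait
table.unlock("T1", "A")
t2_retry = table.request("T2", "A", "X")   # A is released, so T2 locks it

both_read_b = (table.request("T1", "B", "S"), table.request("T2", "B", "S"))
```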
Suppose a lock on a data item is applied, the data item is processed and it
is unlocked immediately after reading/writing is completed as follows.
Initial values of A = 10 and B = 20.
The final result of the two transactions under the two types of
execution (serial and concurrent) is not the same. This indicates that the
above method of locking and unlocking is not correct. Although
locking and unlocking data items in this way increases the concurrency
of execution, it violates the isolation and atomicity of transactions. Immediate
unlocking is not trustworthy. Thus, to preserve consistency we have to use
another approach to locking: the two-phase locking scheme.
Only one way to break deadlock: abort one or more of the transactions in
the deadlock.
Deadlock should be transparent to user, so DBMS should restart
transaction(s).
Timeout
The deadlock detection could be done using the technique of TIMEOUT.
Every transaction will be given a time to wait in case of deadlock. If a
transaction waits for the predefined period of time in idle mode, the DBMS
will assume that deadlock occurred and it will abort and restart the
transaction.
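A minimal sketch of the timeout idea, using a lock from Python's threading module in place of a database lock. The function name and the 0.05 second limit are assumptions for the example. Note that a timeout only suspects a deadlock: a transaction that is merely waiting on a slow one would be aborted as well.

```python
import threading

def lock_with_timeout(lock, wait_seconds):
    """Wait at most `wait_seconds` for the lock, then give up."""
    if lock.acquire(timeout=wait_seconds):
        return "granted"
    return "abort and restart"  # the DBMS assumes a deadlock has occurred

item_lock = threading.Lock()
first = lock_with_timeout(item_lock, 0.05)   # the lock is free
second = lock_with_timeout(item_lock, 0.05)  # still held, so the wait times out
```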
Time-stamping Method
Again each data item will have a timestamp for Read and Write.
WTS(A) which denotes the largest timestamp of any transaction
that successfully executed Write(A)
RTS(A) which denotes the largest timestamp of any transaction
that successfully executed Read(A)
These timestamps are updated whenever a new Read (A) or Write
(A) instruction is executed.
Read/write proceeds only if last update on that data item was carried out
by an older transaction. Otherwise, transaction requesting read/write is
restarted and given a new timestamp.
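The read and write checks can be sketched as below. This follows the usual basic timestamp ordering rules, which is an assumption about the variant the notes intend; the class Item and the functions read and write are illustrative names. A return value of False means the requesting transaction is restarted with a new timestamp.

```python
class Item:
    def __init__(self):
        self.rts = 0  # RTS: largest timestamp of a transaction that read the item
        self.wts = 0  # WTS: largest timestamp of a transaction that wrote the item

def read(item, ts):
    if ts < item.wts:  # a younger transaction has already written the item
        return False
    item.rts = max(item.rts, ts)
    return True

def write(item, ts):
    if ts < item.rts or ts < item.wts:  # a younger transaction already used it
        return False
    item.wts = ts
    return True

a = Item()
results = [
    write(a, 5),  # accepted, WTS(A) becomes 5
    read(a, 3),   # rejected, A was last written by a younger transaction
    read(a, 7),   # accepted, RTS(A) becomes 7
    write(a, 6),  # rejected, A was already read by a younger transaction
    write(a, 8),  # accepted, WTS(A) becomes 8
]
```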
Optimistic Technique
Locking and assigning and checking timestamp values may be
unnecessary for some transactions
Assumes that conflict is rare.
When transaction reaches the level of executing commit, a
check is performed to determine whether conflict has occurred.
If there is a conflict, transaction is rolled back and restarted.
Based on assumption that conflict is rare and more efficient to
let transactions proceed without delays to ensure serializability.
Three phases:
1. Read
2. Validation
3. Write
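The three phases can be sketched as follows. The validation shown here compares version counters of the items a transaction has read, which is one simple way to detect a conflict and is an assumption of this example rather than the scheme prescribed by the notes. The class OptimisticDB and its methods are hypothetical.

```python
class OptimisticDB:
    def __init__(self, data):
        self.data = dict(data)
        self.versions = {key: 0 for key in data}  # bumped on every committed write

    def begin(self):
        return {"reads": {}, "writes": {}}

    def read(self, txn, key):
        # Read phase: note which version of the item was seen
        if key in txn["writes"]:
            return txn["writes"][key]
        txn["reads"][key] = self.versions[key]
        return self.data[key]

    def write(self, txn, key, value):
        txn["writes"][key] = value  # kept in the local workspace only

    def commit(self, txn):
        # Validation phase: has anything we read been changed since?
        for key, seen in txn["reads"].items():
            if self.versions[key] != seen:
                return False  # conflict: roll back and restart
        # Write phase: apply the buffered updates to the database
        for key, value in txn["writes"].items():
            self.data[key] = value
            self.versions[key] = self.versions.get(key, 0) + 1
        return True

db = OptimisticDB({"X": 10})
t1, t2 = db.begin(), db.begin()
db.write(t1, "X", db.read(t1, "X") + 1)
db.write(t2, "X", db.read(t2, "X") + 1)
t1_ok = db.commit(t1)  # no conflict yet
t2_ok = db.commit(t2)  # X changed after T2 read it
```

No transaction is delayed while running; the cost of a conflict is paid only at commit, by the transaction that fails validation.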
Granularity is the size of the data items chosen as the unit of protection
by a concurrency control protocol. See figure below.
It could be:
The entire database
A file
A page (a section of physical disk in which relations are
stored)(sometimes also called a block)
A record
A field value of a record
Granularity affects the performance of the system. Since locking
prevents access to the data, the size of the locked unit determines how much data
other transactions are kept from accessing. If the entire database is locked,
consistency is well maintained but the system performs poorly.
If a single data item is locked, consistency may be at risk but
concurrent processing and performance are enhanced. Thus, as one goes from
the entire database to a single value, performance and concurrent processing
improve, but consistency is at risk and a good concurrency control
mechanism and strategy is needed.
[Figure: granularity hierarchy. Database at the top; below it Record1, Record2, Record3; below a record Field1, Field2, Field3.]
Transaction Subsystem
Refer to the figure on page 4 (transaction sub system)
Transaction manager
Scheduler
Recovery manager
Buffer manager
Database Recovery
Database recovery is the process of restoring database to a correct state in
the event of a failure.
A database recovery is the process of eliminating the effects of a failure
from the database.
Recovery, in database systems terminology, is called restoring the last
consistent state of the data items.
1. Isolating the database from other users. Occasionally, you may need
to drop and re-create the database to continue the recovery.
One can recover databases after three basic types of problems: user error,
software failure, and hardware failure.
Example:
The initial value of A=100, B=200 and C=300
The Required state after the execution of T1 is A=500, B=800 and C=700
Thus S1= (100,200,300)
S2= (500,800,700)
Transaction (T1)
Time | Operation
1    | A=A+200
2    | B=B-100
3    | C=C-200
     | Failure
4    | A=A+200
5    | B=B+700
6    | C=C+600
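The arithmetic of the example can be checked with a few lines. The helper run_until is a name invented for this sketch. It shows that the state reached at the point of failure is neither S1 nor S2, which is why recovery has to bring the database back to S1.

```python
initial = {"A": 100, "B": 200, "C": 300}  # S1
steps = [
    ("A", +200), ("B", -100), ("C", -200),  # times 1 to 3
    ("A", +200), ("B", +700), ("C", +600),  # times 4 to 6
]

def run_until(state, steps, fail_after=None):
    """Apply the steps in order, stopping after time `fail_after` if it is given."""
    state = dict(state)
    for time, (item, delta) in enumerate(steps, start=1):
        state[item] += delta
        if time == fail_after:
            break
    return state

at_failure = run_until(initial, steps, fail_after=3)  # inconsistent state
completed = run_until(initial, steps)                 # S2
```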
DBMS starts at time t0, but fails at time tf. Assume data for
transactions T2 and T3 have been written to secondary storage.
T1 and T6 have to be undone. In absence of any other information,
recovery manager has to redo T2, T3, T4, and T5.
tc is the checkpoint time by the DBMS
Recovery Facilities
DBMS should provide following facilities to assist with recovery:
o Backup mechanism: that makes periodic backup copies of
database.
o Logging facility: that keeps track of current state of
transactions and database changes.
o Checkpoint facility: that enables updates to database in
progress to be made permanent.
o Recovery manager: This allows the DBMS to restore the
database to a consistent state following a failure.
Restoring the database means transforming the state of the database to the
immediate good state before the failure. To do this, the change made on
the database should be preserved. Such kind of information is stored in a
system log or transaction log file.
Log File
Contains information about all updates to database:
o Transaction records.
o Checkpoint records.
Often used for other purposes (for example, auditing).
Transaction records contain:
o Transaction identifier.
o Type of log record, (transaction start, insert, update, delete,
abort, commit).
o Identifier of data item affected by database action (insert,
delete, and update operations).
o Before-image of data item.
o After-image of data item.
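A transaction record with the fields listed above might look like the sketch below, together with a simplified recovery pass that redoes committed updates from their after-images and undoes the rest from their before-images. The names LogRecord and recover are assumptions; only update records on integer values are handled, and checkpoints are ignored.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LogRecord:
    txn_id: str                   # transaction identifier
    kind: str                     # "start", "update", "commit" or "abort"
    item: Optional[str] = None    # identifier of the data item affected
    before: Optional[int] = None  # before-image
    after: Optional[int] = None   # after-image

def recover(db, log):
    """Redo updates of committed transactions, undo updates of all others."""
    committed = {r.txn_id for r in log if r.kind == "commit"}
    for r in log:  # redo in log order
        if r.kind == "update" and r.txn_id in committed:
            db[r.item] = r.after
    for r in reversed(log):  # undo in reverse log order
        if r.kind == "update" and r.txn_id not in committed:
            db[r.item] = r.before
    return db

log = [
    LogRecord("T1", "start"),
    LogRecord("T1", "update", "A", before=100, after=300),
    LogRecord("T1", "commit"),
    LogRecord("T2", "start"),
    LogRecord("T2", "update", "B", before=200, after=100),
]
# State on disk at the crash: T1's update was not yet written, T2's was
recovered = recover({"A": 100, "B": 100}, log)
```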
Check pointing
Checkpoint: is a point of synchronization between database and log file.
All buffers are force-written to secondary storage.
Recovery Techniques
Damage to the database could be either physical, which will
result in the loss of the data stored, or just inconsistency of the database
state after the failure. For each we can have a recovery mechanism:
1. If database has been damaged:
Need to restore last backup copy of database and reapply
updates of committed transactions using log file.
Extensive damage/catastrophic failure: physical media
failure; the database is restored by using the backup copy and by
re-executing the committed transactions from the log up to
the time of failure.
Deferred Update
Updates are not written to the database until after a transaction has
reached its commit point.
If transaction fails before commit, it will not have modified database
and so no undoing of changes required.
May be necessary to redo updates of committed transactions as their
effect may not have reached database.
If a transaction aborts, ignore the log records for it, and do nothing
with a transaction having "transaction start" and "transaction
abort" log records.
A transaction first modifies all its data items and then writes all its
updates to the final copy of the database. No change is going to be
recorded on the database before commit. The changes will be made
only on the local transaction workplace. Update on the actual
database is made after commit and after the change is recorded on
the log. Since there is no need to perform undo operation it is also
called NO-UNDO/REDO Algorithm
The redo operations are made in the order they were written to log.
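A small sketch of deferred update (NO-UNDO/REDO). The class DeferredUpdateDB and the flag crash_before_apply are invented for this illustration; the flag simulates a failure after the commit record has been logged but before the database itself has been updated.

```python
class DeferredUpdateDB:
    def __init__(self, data):
        self.data = dict(data)  # the stored database
        self.log = []           # log entries in the order they were written

    def begin(self, txn_id):
        self.log.append((txn_id, "start"))
        return {"id": txn_id, "workspace": {}}

    def write(self, txn, item, value):
        txn["workspace"][item] = value  # change the local workspace only
        self.log.append((txn["id"], "write", item, value))

    def commit(self, txn, crash_before_apply=False):
        self.log.append((txn["id"], "commit"))  # commit point reached
        if not crash_before_apply:
            self.data.update(txn["workspace"])  # only now touch the database

    def redo(self):
        """After a failure, reapply writes of committed transactions in log order."""
        committed = {entry[0] for entry in self.log if entry[1] == "commit"}
        for entry in self.log:
            if entry[1] == "write" and entry[0] in committed:
                _, _, item, value = entry
                self.data[item] = value

db = DeferredUpdateDB({"A": 100, "B": 200})
t1 = db.begin("T1")
db.write(t1, "A", 300)
db.commit(t1, crash_before_apply=True)  # committed, but the update never reached the database
t2 = db.begin("T2")
db.write(t2, "B", 999)                  # T2 never commits
state_after_crash = dict(db.data)
db.redo()
```

Nothing has to be undone for T2 because its write never left the workspace; only T1's committed write is redone.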
Shadow Paging
Maintain two page tables during life of a transaction: current page
table and shadow page table.
When transaction starts, the two page tables are the same.
Shadow page table is never changed thereafter and is used to restore
database in event of failure.
During transaction, current page table records all updates to
database.
When transaction completes, current page table becomes shadow
page table.
No log record management
However, it has an overhead of managing pages i.e. page
replacement issues have to be handled.
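The mechanism can be sketched with two lists acting as page tables. The class ShadowPagedDB is a hypothetical, in-memory illustration: a write places the new page content in a fresh slot and redirects only the current page table, so the shadow table still points at the old pages. Old slots are never reclaimed here, which is the page management overhead mentioned above.

```python
class ShadowPagedDB:
    def __init__(self, pages):
        self.disk = list(pages)                # page slots; new versions are appended
        self.shadow = list(range(len(pages)))  # shadow page table: page no -> slot
        self.current = list(self.shadow)       # current page table

    def begin(self):
        self.current = list(self.shadow)       # both tables start out identical

    def write(self, page_no, value):
        self.disk.append(value)                # the old page is left untouched
        self.current[page_no] = len(self.disk) - 1

    def read(self, page_no):
        return self.disk[self.current[page_no]]

    def commit(self):
        self.shadow = list(self.current)       # current table becomes the shadow table

    def abort(self):
        self.current = list(self.shadow)       # restore from the shadow table, no log needed

db = ShadowPagedDB(["a", "b"])
db.begin()
db.write(0, "a2")
db.abort()
after_abort = db.read(0)

db.begin()
db.write(1, "b2")
db.commit()
after_commit = db.read(1)
```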
Chapter Two
Query Processing and Optimization
DB system features
Crash Recovery
Integrity Checking
Security
Concurrency Control
Query Processing and Optimization
File Organization and Optimization
Query Processing
The aim of query processing is to find information in one or more databases and
deliver it to the user quickly and efficiently. Traditional techniques work well for
databases with standard, single-site relational structures, but databases
containing more complex and diverse types of data demand new query
processing and optimization techniques.
Consider the following query from two relations; staff and branch
Assume
i. One record is accessed at a time, n staff , m branches, x non-manager,
and y non-Addis branches for some integers n,m,x,y.
ii. intermediate results are stored on disk
iii. ignore the cost of writing the final result because it is the same for all the
expressions
Then, this high level SQL can be transformed into the following three
equivalent low level relational algebra expressions.
σ_((position='manager') ∧ (City='Addis') ∧ (staff.branchNo=branch.branchNo)) (Staff × Branch)
Analysis:
i. read each tuple from the two relations: n+m reads
ii. create a table of the Cartesian product: n×m writes
iii. test each tuple of step 2: n×m reads
Total No. of Disk access: 2(n×m) + n + m
or
σ_((position='manager') ∧ (City='Addis')) (Staff ⋈_(staff.branchNo=branch.branchNo) Branch)
Analysis:
i. read each tuple from the two relations: n+m reads
ii. create a table of the Join: n writes
iii. test each tuple of step 2: n reads
Total No. of Disk access: 3n + m
or
(σ_(position='manager') (Staff)) ⋈_(staff.branchNo=branch.branchNo) (σ_(City='Addis') (Branch))
Analysis:
i. test each tuple from the two relations: n+m reads
ii. create the "manager_Staff" and "addis_Branch" relations: (n-x) + (m-y) writes
iii. create a join of the two relations of step 2: (n-x) + (m-y) reads
Total No. of Disk access: 2(n-x) + 2(m-y) + n + m
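To see how far apart the three totals are, the formulas can be evaluated for sample sizes. The figures used here (1000 staff, 50 branches, 950 non-managers, 45 non-Addis branches) are invented for illustration only, and the function names are hypothetical.

```python
def cost_product_then_select(n, m):
    # n+m reads, n*m writes for the Cartesian product, n*m reads to test it
    return (n + m) + (n * m) + (n * m)

def cost_join_then_select(n, m):
    # n+m reads, n writes for the join, n reads to test it
    return (n + m) + n + n

def cost_select_then_join(n, m, x, y):
    # n+m reads, then the reduced relations are written once and read once
    kept = (n - x) + (m - y)
    return (n + m) + kept + kept

n, m, x, y = 1000, 50, 950, 45  # hypothetical sizes
costs = (
    cost_product_then_select(n, m),
    cost_join_then_select(n, m),
    cost_select_then_join(n, m, x, y),
)
```

With these assumed sizes the third expression, which selects before joining, needs far fewer disk accesses than the other two.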
Query Decomposition
Query decomposition is the process of transforming a high level query into a
relational algebra query and checking that the query is syntactically and
semantically correct. Query decomposition consists of parsing and validation.
There could be tons of tricks (not only in storage and query processing, but also
in concurrency control, recovery, etc.) Different tricks may work better in
different usage scenarios. Same tricks get used over and over again in different
applications
Query processing: Execute transactions on behalf of this query and print the
result. Steps in query processing:
Query Optimization
What is wrong with the ordinary query?
Everyone wants the performance of their database to be optimal. In
particular, there is often a requirement for a specific query or object that is
query based, to run faster.
Problem of query optimization is to find the sequence of steps that produces
the answer to user request in the most efficient manner, given the database
structure.
The performance of a query is affected by the tables or queries that underlie
the query and by the complexity of the query.
When data/workload characteristics change
The best navigation strategy changes
The best way of organizing the data changes
Query optimizers are one of the main means by which modern database systems
achieve their performance advantages. Given a request for data manipulation or
retrieval, an optimizer will choose an optimal plan for evaluating the request
from among the manifold alternative strategies, i.e., there are many ways
(access paths) for accessing the desired file/record. The optimizer tries to select the
most efficient (cheapest) access path for accessing the data. The DBMS is responsible
for picking the best execution strategy based on various considerations.
Query optimizers were already among the largest and most complex modules of
database systems.
Method 1
a. Load next record of r in RAM.
b. Load all records of s, one at a time and concatenate with r.
c. All records of r concatenated?
NO: goto a.
YES: exit (the result in RAM or on disk).
Performance: Too many accesses.
Method 2: Improvement
a. Load as many blocks of r as possible leaving room for one block of s.
b. Run through the s file completely one block at a time.
Performance: Reduces the number of times s blocks are loaded by a factor
equal to the number of r records that can fit in main memory.
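The difference between the two methods can be sketched by counting how often the s file has to be scanned. The function names are assumptions, and for simplicity memory capacity is expressed in r records rather than blocks.

```python
import math

def s_scans_method1(r_records):
    """Method 1: s is scanned once for every single record of r."""
    return r_records

def s_scans_method2(r_records, r_records_in_memory):
    """Method 2: s is scanned once for every memory-load of r records."""
    return math.ceil(r_records / r_records_in_memory)

def product_method2(r, s, chunk):
    """Concatenate every r record with every s record, loading r in chunks."""
    out = []
    for start in range(0, len(r), chunk):
        loaded = r[start:start + chunk]  # as much of r as fits in memory
        for s_rec in s:                  # one complete pass over s per chunk
            for r_rec in loaded:
                out.append(r_rec + s_rec)
    return out

r = [(1,), (2,), (3,)]
s = [("a",), ("b",)]
pairs = product_method2(r, s, chunk=2)
```

With 1000 records in r and room for 100 of them in memory, Method 1 scans s 1000 times and Method 2 only 10 times.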
π_L1(π_L2(π_L3(π_L4(R)))) = π_L1(R)
(R ⋈_c1 S) = (S ⋈_c1 R)
6. Commutativity of SELECTION with THETA JOIN
a. If the predicate c1 involves only attributes of one of the relations (R)
being joined, then the Selection and Join operations commute
S))
π_L1(S ∪ R) = π_L1(S) ∪ π_L1(R)
Main Heuristic
The main heuristic is to first apply operations that reduce the size (the cardinality
and/or the degree) of the intermediate relation. That is:
a. Perform SELECTION as early as possible: that will reduce the
cardinality (number of tuples) of the relation.
b. Perform PROJECTION as early as possible: that will reduce the
degree (number of attributes) of the relation.
Both a and b will be accomplished by placing the
SELECT and PROJECT operations as far down the tree as
possible.
c. SELECT and JOIN operations with most restrictive conditions
resulting with smallest absolute size should be executed before
other similar operations. This is achieved by reordering the
nodes with JOIN
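The effect of performing SELECTION early can be seen on a toy example. The relations, their contents and the function names below are invented for illustration; the point is only the size of the intermediate relation in each strategy.

```python
from itertools import product

# Hypothetical data: (emp_id, birth_year) and (emp_id, proj_id)
employees = [("E1", 1960), ("E2", 1970), ("E3", 1950), ("E4", 1980)]
works_on = [("E1", "P1"), ("E2", "P1"), ("E3", "P2"), ("E4", "P2")]

def late_selection():
    """Build the full Cartesian product first, then apply every condition."""
    intermediate = list(product(employees, works_on))
    result = [e[0] for e, w in intermediate
              if e[0] == w[0] and e[1] < 1965 and w[1] == "P1"]
    return len(intermediate), result

def early_selection():
    """Apply the single-relation conditions first, then combine."""
    old = [e for e in employees if e[1] < 1965]
    on_p1 = [w for w in works_on if w[1] == "P1"]
    intermediate = list(product(old, on_p1))
    result = [e[0] for e, w in intermediate if e[0] == w[0]]
    return len(intermediate), result

late_size, late_result = late_selection()
early_size, early_result = early_selection()
```

Both strategies return the same answer, but the intermediate relation shrinks from 16 tuples to 4.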
Example: consider the following schemas and the query, where the EMPLOYEE
and the PROJECT relations are related by the WORKS_ON relation.
Query: The manager of the company working on road construction would like
to view employees name born before January 1 1965 who are working
on the project named Ring Road.
π_<FName, LName> (σ_((DoB<Jan 1 1965) ∧ (WEmpID=EEmpID) ∧ (PProjID=WProjID) ∧ (PName='Ring Road')) (EMPLOYEE × WORKS_ON × PROJECT))
[Figure: initial query tree. Node labels: projection <FName, LName>; Cartesian product (X); relations PROJECT, WORKS_ON, EMPLOYEE.]
By applying the first step (cascading the selection) we will come up with the
following structure.
σ_(DoB<Jan 1 1965) (σ_(WEmpID=EEmpID) (σ_(PProjID=WProjID) (σ_(PName='Ring Road') (EMPLOYEE × WORKS_ON × PROJECT))))
By applying the second step it can be seen that some conditions have attributes
that belong to a single relation (DoB belongs to EMPLOYEE and PName belongs
to PROJECT); thus the selection operation can be commuted with the Cartesian
operation. Then, since the condition WEmpID=EEmpID is based on the EMPLOYEE and
WORKS_ON relations, the selection with this condition can be cascaded.
[Figure: query tree. Node labels recovered: <FName, LName>, (PProjID=WProjID), PROJECT, X, EMPLOYEE.]
[Figure: query tree. Node labels recovered: <FName, LName>, (WEmpID=EEmpID), EMPLOYEE, X, PROJECT.]
Using the fourth step, perform the Cartesian operations together with the subsequent
Selection operations.
[Figure: query tree. Node labels recovered: <FName, LName>, (WEmpID=EEmpID), (DoB<Jan 1 1965), (PProjID=WProjID), EMPLOYEE, (PName='Ring Road'), WORKS_ON, PROJECT.]
[Figure: query tree with projections pushed down. Node labels recovered: <FName, LName>, (WEmpID=EEmpID), <WEmpID>, (PProjID=WProjID), <FName,LName,EEmpID>, (DoB<Jan 1 1965), WORKS_ON, <PProjID>, EMPLOYEE, (PName='Ring Road'), PROJECT.]
The main idea is to minimize the cost of processing a query. The cost function is
comprised of:
I/O cost + CPU processing cost + communication cost + Storage cost
The DBMS will use information stored in the system catalogue for the purpose of
estimating cost. The main target of query optimization is to minimize the size
of the intermediate relation. The size affects the cost of:
Disk Access
Data Transportation
Storage space in the Primary Memory
Writing on Disk
The statistics in the system catalogue used for cost estimation purpose are:
Cardinality of a relation: the number of tuples contained in a relation
currently (r)
Degree of a relation: number of attributes of a relation
Number of tuples on a relation that can be stored in one block of memory
Total number of blocks used by a relation
Number of distinct values of an attribute (d)
Selection Cardinality of an attribute (S): that is average number of records
that will satisfy an equality condition S=r/d
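Two of these statistics can be computed directly. The sketch below assumes that attribute values are spread evenly (S = r/d is an average) and uses made-up figures; the function names are hypothetical.

```python
def selection_cardinality(r, d):
    """Average number of tuples that satisfy an equality condition: S = r/d."""
    return r / d

def blocks_needed(r, tuples_per_block):
    """Total number of blocks occupied by r tuples (ceiling division)."""
    return -(-r // tuples_per_block)

s = selection_cardinality(r=10_000, d=50)          # 10,000 tuples, 50 distinct values
b = blocks_needed(r=10_000, tuples_per_block=40)   # 40 tuples fit in one block
```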
By using the above information one could calculate the cost of executing a query
and select the best strategy, which is the one with the minimum cost of processing.
The costs of query execution can be calculated for the following major processes
during processing.
1. Access Cost of Secondary Storage
Data is going to be accessed from secondary storage, as a query will
need some part of the data stored in the database. The disk access cost can
again be analyzed in terms of:
Searching
Reading, and
Writing, data blocks used to store some portion of a relation.
The disk access cost will vary depending on the file organization used and
the access method implemented for the file organization. In addition to
the file organization, the data allocation scheme, whether the data is
stored contiguously or in scattered manner, will affect the disk access cost.
2. Storage Cost
While processing a query, as any query would be composed of many
database operations, there could be one or more intermediate results before
reaching the final output. These intermediate results should be stored in
primary memory for further processing. The bigger the intermediate relation,
the larger the memory requirement, which will have impact on the limited
available space. This will be considered as a cost of storage.
3. Computation Cost
Query is composed of many operations. The operations could be database
operations like reading and writing to a disk, or mathematical and other
operations like:
Searching
Sorting
Merging
Computation on field values
4. Communication Cost
In most database systems the database resides on one station and various
queries originate from different terminals. This affects the
performance of the system, adding cost to query processing. Thus, the cost of
transporting data between the database site and the terminal from which the
query originates should be analyzed.
Pipelining
If a query would like to extract supervisors with salary greater than 2000, the
relational algebra representation of the query will be:
1. Approach One
R1 = σ_(Position='Supervisor') (Employee)
R2 = σ_(Salary>2000) (R1)
2. Approach Two
One can select a single tuple from the relation Employee and perform both tests
in a pipeline and create the final relation at once. This is what is called
PIPELINING
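The two approaches can be contrasted with Python lists and generators. The employee data and the function names are invented for the example. In the first function the intermediate relation is built in full before the second test runs; in the second, each tuple flows through both tests and no intermediate relation is stored.

```python
employees = [  # hypothetical data
    {"name": "Abebe", "position": "Supervisor", "salary": 2500},
    {"name": "Sara", "position": "Supervisor", "salary": 1800},
    {"name": "Kebede", "position": "Manager", "salary": 3000},
    {"name": "Hana", "position": "Supervisor", "salary": 2200},
]

def materialized(rows):
    """Approach one: build the intermediate relation, then filter it again."""
    r1 = [e for e in rows if e["position"] == "Supervisor"]  # stored in full
    r2 = [e for e in r1 if e["salary"] > 2000]
    return r2, len(r1)

def pipelined(rows):
    """Approach two: pass each tuple through both tests in one sweep."""
    supervisors = (e for e in rows if e["position"] == "Supervisor")  # lazy, nothing stored
    return [e for e in supervisors if e["salary"] > 2000]

result_one, intermediate_size = materialized(employees)
result_two = pipelined(employees)
```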
Chapter three
Database Integrity and Security
Topics:
Database Integrity Rules
A good database security management system has not only the following
characteristics: data independence, shared access, minimal redundancy,
data consistency, and data integrity but also the following characteristics:
privacy, integrity, and availability.
Privacy signifies that an unauthorized user cannot disclose data
Integrity ensures that an unauthorized user cannot modify data
Availability ensures that data be made available to the authorized
user unfailingly
Copyright ensures the native rights of individuals as a creator of
information.
Validity ensures activities to be accountable by law.
There are certain security policy issues that we should recognize, where
we should consider administrative control policies, decide which security
features offered by the DBMS is used to implement the system, decide
whether the focus of security administration is left with DBA and whether
it is centralized or decentralized. Besides, one should decide on ownership
of shared data as well.
When we talk about the levels of security protection, it may start from
organization & administrative security, physical & personnel security,
communication security and Information systems security
Database security and integrity is about protecting the database from being
inconsistent and being disrupted. We can also call it database misuse.
Likewise, even though there are various threats that could be categorized
in this group,
Intentional misuse could be:
Unauthorized reading of data
Unauthorized modification of data or
Unauthorized destruction of data
These policies
should be known by the system: should be encoded in the system
should be remembered: should be saved somewhere (the catalogue)
Examples of threats:
Using another person's means of access
Unauthorized amendment/modification or copying of data
Program alteration
Inadequate policies and procedures that allow a mix of
confidential and normal output
Wire-tapping
Illegal entry by hacker
Blackmail
Theft of data, programs, and equipment
Failure of security mechanisms, giving greater access than
normal
Staff shortages or strikes
Inadequate staff training
Viewing and disclosing unauthorized data
Electronic interference and radiation
Data corruption owing to power loss or surge
Fire (electrical fault, lightning strike, arson), flood, bomb
Physical damage to equipment
Breaking cables or disconnection of cables
Introduction of viruses
An organization deploying a database system needs to
identify the types of threat it may be subjected to and initiate
Views
A view is the dynamic result of one or more relational
operations operating on the base relations to produce another
relation
A view is a virtual relation that does not actually exist in the
database, but is produced upon request by a particular user
The view mechanism provides a powerful and flexible
security mechanism by hiding parts of the database from
certain users
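A short sketch of a view that hides a column, using Python's sqlite3 module. The table staff, the view staff_public and the data are assumptions for the example. SQLite has no GRANT statement, so the step of giving certain users access to the view only, which a multi-user DBMS would add, is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (staff_no TEXT PRIMARY KEY, name TEXT, salary INTEGER)")
conn.execute("INSERT INTO staff VALUES ('S1', 'Abebe', 5000), ('S2', 'Sara', 7000)")

# The view exposes staff numbers and names but hides the salary column
conn.execute("CREATE VIEW staff_public AS SELECT staff_no, name FROM staff")

cursor = conn.execute("SELECT * FROM staff_public ORDER BY staff_no")
columns = [d[0] for d in cursor.description]
rows = cursor.fetchall()
```

The view stores no data of its own; its rows are produced from the base relation each time it is queried.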
Integrity
Integrity constraints contribute to maintaining a secure
database system by preventing data from becoming invalid
and hence giving misleading or incorrect results
Domain Integrity: setting the allowed set of values
Entity integrity: demanding Primary key values not to assume
a NULL value
Referential integrity: enforcing Foreign Key values to have a
value that already exists in the corresponding Candidate Key
attribute(s) or be NULL.
Key constraints: the rules the Relational Data Model has on
different kinds of Key.
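The four kinds of constraint listed above can be sketched with Python's built-in sqlite3 module. The department and employee tables, the age range and the helper try_insert are hypothetical, chosen only to trigger each constraint once. Note that SQLite enforces foreign keys only after PRAGMA foreign_keys = ON, and that NOT NULL is written explicitly on the primary keys because SQLite does not imply it for a TEXT primary key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled
conn.executescript("""
    CREATE TABLE department (
        dept_no TEXT NOT NULL PRIMARY KEY,              -- entity integrity
        name    TEXT NOT NULL UNIQUE                    -- key constraint (candidate key)
    );
    CREATE TABLE employee (
        emp_no  TEXT NOT NULL PRIMARY KEY,              -- entity integrity
        age     INTEGER CHECK (age BETWEEN 18 AND 65),  -- domain integrity
        dept_no TEXT REFERENCES department(dept_no)     -- referential integrity (NULL allowed)
    );
    INSERT INTO department VALUES ('D1', 'Sales');
""")


def try_insert(sql, params):
    """Return 'ok' if the row is accepted, 'rejected' if a constraint is violated."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return "ok"
    except sqlite3.IntegrityError:
        return "rejected"


emp_sql = "INSERT INTO employee VALUES (?, ?, ?)"
results = {
    "valid row": try_insert(emp_sql, ("E1", 30, "D1")),
    "null foreign key": try_insert(emp_sql, ("E2", 40, None)),
    "age out of domain": try_insert(emp_sql, ("E3", 12, "D1")),
    "null primary key": try_insert(emp_sql, (None, 30, "D1")),
    "dangling reference": try_insert(emp_sql, ("E4", 30, "D9")),
    "duplicate candidate key": try_insert(
        "INSERT INTO department VALUES (?, ?)", ("D2", "Sales")
    ),
}
for case, outcome in results.items():
    print(f"{case}: {outcome}")
```

Only the first two rows are accepted: a NULL foreign key is allowed, while each of the other rows violates exactly one of the constraints.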
Encryption
Authorization may not be sufficient to protect data in database
systems, especially when there is a situation where data should be
moved from one location to the other using network facilities.
Types of Cryptosystems
Any database access request will have the following three major
components
1. Requested Operation: what kind of operation is requested
by a specific query?
2. Requested Object: on which resource or data of the database
is the operation sought to be applied?
3. Requesting User: who is the user requesting the operation
on the specified object?
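The three components of an access request can be sketched as a small Python data structure checked against a privilege table. The names AccessRequest, PRIVILEGES and is_authorized, as well as the users and operations in the table, are hypothetical; a real DBMS keeps this information in its catalogue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessRequest:
    user: str       # requesting user
    operation: str  # requested operation
    obj: str        # requested object


# Hypothetical privilege table: (user, object) -> operations the user may perform.
PRIVILEGES = {
    ("clerk", "EMPLOYEE"): {"SELECT"},
    ("manager", "EMPLOYEE"): {"SELECT", "UPDATE"},
}


def is_authorized(request: AccessRequest) -> bool:
    """Allow the request only if the operation is listed for this user and object."""
    allowed = PRIVILEGES.get((request.user, request.obj), set())
    return request.operation in allowed


print(is_authorized(AccessRequest("clerk", "SELECT", "EMPLOYEE")))
print(is_authorized(AccessRequest("clerk", "UPDATE", "EMPLOYEE")))
print(is_authorized(AccessRequest("manager", "UPDATE", "EMPLOYEE")))
```

A request for which no entry exists in the table is denied, so access is closed by default in this sketch.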
Authentication
All users of the database will have different access levels and
permissions for different data objects, and authentication is the
process of checking whether the user is the one holding the
privilege for the access level.
It is the process of checking that users are who they say they are.
Each user is given a unique identifier, which is used by the
operating system to determine who they are
Thus the system will check whether the user with a specific
username and password is trying to use the resource.
Associated with each identifier is a password, chosen by the
user and known to the operating system, which must be
supplied to enable the operating system to authenticate who
the user claims to be
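A minimal sketch of identifier-plus-password authentication in Python is given below. It assumes, beyond what the notes state, that the system stores a salted hash of the password rather than the password itself; the user name, the password and the names make_record, USERS and authenticate are hypothetical.

```python
import hashlib
import hmac
import os
from typing import Tuple

ITERATIONS = 100_000  # assumed work factor for this illustration


def make_record(password: str) -> Tuple[bytes, bytes]:
    """Return (salt, digest) to store for a user instead of the plain password."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


# Hypothetical user table: unique identifier -> stored (salt, digest).
USERS = {"abebe": make_record("s3cret")}


def authenticate(username: str, password: str) -> bool:
    """Check that the supplied password matches the record for this identifier."""
    record = USERS.get(username)
    if record is None:
        return False
    salt, expected = record
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return hmac.compare_digest(candidate, expected)  # constant-time comparison


print(authenticate("abebe", "s3cret"))
print(authenticate("abebe", "wrong"))
```

Authentication only establishes who the user is; which operations that user may perform is decided separately by authorization.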
Authorization/Privilege
Authorization refers to the process that determines the mode in
which a particular (previously authenticated) client is allowed to
access a specific resource controlled by a server.
Most of the time, authorization is implemented by using Views.
Views are named virtual relations containing part of one or more
base relations, creating a customized/personalized view for
different users.
Views are used to hide data that a user need not see.
Different users, depending on the power of the user, can have one
or the combination of the above forms of authorization on
different data objects.
Chapter Four
o Local DBMS
o Distributed DDBMS
o Global System Catalog(GSC)
o Data communication (DC)
Functions of a DDBMS
A DDBMS has the following functionality.
Extended communication services: to provide
access to remote sites.
Extended data dictionary: to store data distribution
details, hence the need for a global system catalog.
Distributed query processing: query optimization and
remote data access.
Extended security: access control to distributed
data.
Extended concurrency control: to maintain
consistency of replicated data.
Extended recovery services: to handle failures of individual
sites and of the communication lines.
Issues in DDBMS
Replication:
o System maintains multiple copies of similar data
(identical data)
o Stored in different sites, for faster retrieval and fault
tolerance.
o Duplicate copies of the tables can be kept on each system
(replicated). With this option, updates to the tables can
become involved (of course the copies of the tables can be
read-only).
o Advantage: Availability, Increased parallelism (if only
reading)
o Disadvantage: increased overhead of update
Fragmentation:
o Relation is partitioned into several fragments stored in
distinct sites
The partitioning could be vertical, horizontal or
both.
o Horizontal Fragmentation
Systems can share the responsibility of storing
information from a single table with individual
systems storing groups of rows
Performed by the Selection Operation
The whole content of the relation is reconstructed
using the UNION operation
o Vertical Fragmentation
Systems can share the responsibility of storing
particular attributes of a table.
Needs an attribute with the tuple number (the primary key
value is repeated in every fragment).
Performed by the Projection Operation
The whole content of the relation is reconstructed
using the Natural JOIN operation using the
attribute with Tuple number (primary key values)
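Both kinds of fragmentation and their reconstruction can be sketched in Python, with a relation represented as a list of dictionaries. The EMPLOYEE rows and the helper functions select, project, union and natural_join are hypothetical stand-ins for the relational operations named above.

```python
# A hypothetical EMPLOYEE relation; emp_no is the primary key.
EMPLOYEE = [
    {"emp_no": 1, "name": "Abebe", "city": "Addis", "salary": 9000},
    {"emp_no": 2, "name": "Sara", "city": "Adama", "salary": 12000},
    {"emp_no": 3, "name": "Kebede", "city": "Addis", "salary": 7000},
]


def select(relation, predicate):
    """Selection: keep the rows that satisfy the predicate."""
    return [row for row in relation if predicate(row)]


def project(relation, attributes):
    """Projection: keep only the named attributes of every row."""
    return [{a: row[a] for a in attributes} for row in relation]


def union(r1, r2):
    """Union: all rows of both relations, without duplicates."""
    result = list(r1)
    result.extend(row for row in r2 if row not in result)
    return result


def natural_join(r1, r2, key):
    """Natural join of two relations on a shared key attribute."""
    index = {row[key]: row for row in r2}
    return [{**row, **index[row[key]]} for row in r1 if row[key] in index]


# Horizontal fragmentation: groups of rows, rebuilt with UNION.
h_addis = select(EMPLOYEE, lambda r: r["city"] == "Addis")
h_other = select(EMPLOYEE, lambda r: r["city"] != "Addis")
rebuilt_h = sorted(union(h_addis, h_other), key=lambda r: r["emp_no"])

# Vertical fragmentation: groups of attributes, each repeating the primary
# key, rebuilt with a natural JOIN on that key.
v_personal = project(EMPLOYEE, ["emp_no", "name", "city"])
v_payroll = project(EMPLOYEE, ["emp_no", "salary"])
rebuilt_v = natural_join(v_personal, v_payroll, "emp_no")

print(rebuilt_h == EMPLOYEE, rebuilt_v == EMPLOYEE)
```

If emp_no were left out of either vertical fragment, the join would have nothing to match on, which is why the primary key must be repeated.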
Data transparency:
The degree to which a system user may remain unaware of the
details of how and where the data items are stored in a
distributed system.
Why DDBMS/Advantages
1. Many existing systems
Maybe you have no choice.
Possibly there are many different existing systems, of
possibly different kinds (Oracle, Informix, others),
that need to be used together.
2. Data sharing and distributed control:
A user at one site may be able to access data that is available at
another site.
Each site can retain some degree of control over local data.
We will have local as well as global database administrators.
3. Reliability and availability of data
If one site fails the rest can continue operation as long as
the transaction does not demand data from the failed system,
or the data is replicated at other sites.
4. Speedup of query processing
If a query involves data from several sites, it may be possible
to split the query into sub-queries that can be executed at
several sites in parallel.
A query can be sent to the least heavily loaded sites.
5. Expansion (Scalability)
In a distributed environment you can easily expand by
adding more machines to the network.
Disadvantages of DDBMS
1. Software Development Cost
It is difficult to install, and thus costly.
2. Greater Potential for Bugs
Parallel processing may endanger the correctness of algorithms.
3. Increased Processing Overhead
Exchange of messages between sites: high communication
latency
Due to the communication protocols involved
4. Communication problems
5. Increased Complexity and Data Inconsistency
Problems
Since clients can read and modify closely related data stored
in different database instances concurrently.
Query Processing
There are different strategies to process a specific query, which in
turn increase the performance of the system by minimizing
processing time and cost. In addition to the cost estimates we have
for a centralized database (disk access, relation size, etc), we have to
consider the following in distributed query processing:
For the case of fragmentation, update works more like the centralized
database, but reconstruction of the whole relation will require
accessing data from all sites containing part of the relation.
Let the distributed database have three sites (S1, S2, and S3), and let two
relations, EMPLOYEE and DEPARTMENT, be located at S1 and S2
respectively without any fragmentation. A query is initiated
from S3 to retrieve employees together with the department they are
working in: First name (15 bytes long), Last name (15 bytes long)
and Department name (10 bytes long), a total of 40 bytes per
result record.
Let:
For EMPLOYEE we have the following information
1. 10,000 records
2. each record is 100 bytes long
For DEPARTMENT we have the following information
1. 100 records
2. each record is 35 bytes long
There are three ways of executing this query:
1. Transfer DEPARTMENT and EMPLOYEE to S3 and perform
the join there: needs transfer of 10,000*100+100*35=1,003,500
bytes.
2. Transfer DEPARTMENT to S1, perform the join there, which
will produce 40*10,000 = 400,000 bytes, and transfer the result to S3.
We need 3,500+400,000=403,500 bytes to be transferred.
3. Transfer EMPLOYEE to S2, perform the join there, which
will produce 40*10,000 = 400,000 bytes, and transfer the result to S3.
We need 1,000,000+400,000=1,400,000 bytes to be transferred.
Then one can select the strategy that will reduce the data transfer cost
for this specific query. Other steps of optimization may also be
included to make the processing more efficient by reducing the size of
the relations using projection.
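The transfer cost of each strategy can be recomputed from the relation sizes with a few lines of Python. The sketch assumes, as the example does, that the join result has one 40-byte record per employee; the variable names are illustrative only.

```python
# Sizes taken from the example.
EMPLOYEE_RECORDS, EMPLOYEE_RECORD_BYTES = 10_000, 100
DEPARTMENT_RECORDS, DEPARTMENT_RECORD_BYTES = 100, 35
RESULT_RECORD_BYTES = 40  # first name (15) + last name (15) + department name (10)

employee_bytes = EMPLOYEE_RECORDS * EMPLOYEE_RECORD_BYTES        # whole EMPLOYEE relation
department_bytes = DEPARTMENT_RECORDS * DEPARTMENT_RECORD_BYTES  # whole DEPARTMENT relation
result_bytes = EMPLOYEE_RECORDS * RESULT_RECORD_BYTES            # assumed: one result row per employee

# Bytes that must cross the network under each strategy.
costs = {
    "ship both relations to S3 and join there": employee_bytes + department_bytes,
    "ship DEPARTMENT to S1, join, ship result to S3": department_bytes + result_bytes,
    "ship EMPLOYEE to S2, join, ship result to S3": employee_bytes + result_bytes,
}

for strategy, cost in costs.items():
    print(f"{strategy}: {cost:,} bytes")

best = min(costs, key=costs.get)
print("cheapest:", best)
```

Shipping the small DEPARTMENT relation to the site that holds EMPLOYEE moves the fewest bytes, because the large relation never crosses the network in full.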
Transaction Management
Transaction is a logical unit of work constituted by one or more
operations executed by a single user. A transaction begins with the
user's first executable query statement and ends when it is committed
or rolled back.
o For example, the following query accesses data from the dept
table in the Addis schema (the site) of the remote sales database:
Remote query
select client_nm
from [email protected];
Distributed query
select project_name, student_nm
from [email protected] i, student s
where s.stu_id = i.stu_id;
Remote Update
Distributed Update
1. Non-Replicated Scheme
o No data is replicated in the system
o All sites will maintain a local lock manager (local lock and
unlock)
o If site Si needs a lock on data in site Sj, it sends a message to the
lock manager of site Sj and the locking will be handled by
site Sj
o All locking and unlocking is handled by the local lock
manager of the site at which the data object resides.
o It is simple to implement
o It needs three message transfers:
  o To request a lock
  o To notify grant of lock
  o To request unlock
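The three message transfers of the non-replicated scheme can be sketched as a small Python simulation. The classes LocalLockManager and Network are hypothetical, only exclusive locks are modelled, and a refused request is simply answered with False instead of making the requester wait.

```python
class LocalLockManager:
    """Lock manager for the data items stored at one site (exclusive locks only)."""

    def __init__(self, site):
        self.site = site
        self.locks = {}  # data item -> transaction currently holding the lock

    def lock(self, item, txn):
        holder = self.locks.get(item)
        if holder is not None and holder != txn:
            return False  # item is held by another transaction
        self.locks[item] = txn
        return True

    def unlock(self, item, txn):
        if self.locks.get(item) == txn:
            del self.locks[item]


class Network:
    """Delivers lock messages between sites and records every message sent."""

    def __init__(self, managers):
        self.managers = {m.site: m for m in managers}
        self.messages = []  # (from_site, to_site, kind)

    def remote_lock(self, from_site, to_site, item, txn):
        self.messages.append((from_site, to_site, "lock request"))
        granted = self.managers[to_site].lock(item, txn)
        if granted:
            self.messages.append((to_site, from_site, "lock granted"))
        return granted

    def remote_unlock(self, from_site, to_site, item, txn):
        self.messages.append((from_site, to_site, "unlock request"))
        self.managers[to_site].unlock(item, txn)


net = Network([LocalLockManager("S1"), LocalLockManager("S2")])

# Transaction T1 at site S1 locks item x, which resides at site S2.
granted = net.remote_lock("S1", "S2", "x", "T1")
net.remote_unlock("S1", "S2", "x", "T1")

message_kinds = [kind for _, _, kind in net.messages]
print(granted, message_kinds)
```

Because each item exists at exactly one site, a single lock manager decides every conflict on that item, which is what keeps this scheme simple.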
In general there are three varieties of the 2PL (Two phase locking)
protocol in the DDBMS environment. Implementing the basic 2PL in
distributed systems assumes that data is distributed across multiple
machines.
Centralized 2PL:
A single site is responsible for granting and releasing locks
Each site's transaction manager communicates with this
centralized lock manager and with its own local data manager
Has only one lock manager for the entire DDBS.
Primary 2PL:
Each replicated data item is assigned a primary copy; the lock
manager at the site of that primary copy is responsible for granting and
releasing locks (distributed locking), and updates are propagated
as soon as possible to the slave copies
Distributes the lock managers to a number of sites
Distributed 2PL:
Assumes data is completely replicated
The schedulers (lock managers) at each site are responsible for
granting and releasing locks as well as forwarding operations to
the local data manager.
Distributes the lock managers to every site in the DDBS.
Has higher communication overhead than the others
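The difference between the three varieties can be sketched in Python by listing which lock managers a transaction must contact to write-lock one data item. The site names, the replica placement, the choice of primary copies and the function lock_sites are hypothetical. For distributed 2PL the sketch assumes that a write must lock every copy of the item, which is one common reading of the scheme.

```python
# Hypothetical placement of replicated data items.
REPLICAS = {"x": ["S1", "S2", "S3"], "y": ["S2", "S3"]}
PRIMARY = {"x": "S1", "y": "S2"}  # site holding the primary copy of each item
CENTRAL_SITE = "S1"               # site of the single lock manager in centralized 2PL


def lock_sites(item, scheme):
    """Return the sites whose lock managers must grant a write lock on item."""
    if scheme == "centralized":
        return [CENTRAL_SITE]          # one lock manager for the whole DDBS
    if scheme == "primary":
        return [PRIMARY[item]]         # lock manager of the item's primary copy
    if scheme == "distributed":
        return list(REPLICAS[item])    # assumed: every site holding a copy
    raise ValueError(f"unknown scheme: {scheme}")


for scheme in ("centralized", "primary", "distributed"):
    print(scheme, lock_sites("x", scheme), lock_sites("y", scheme))
```

The number of sites contacted grows with the number of copies only in the distributed variety, which is where its higher communication overhead comes from.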
Chapter Five
Introduction to Object-Oriented
Database Systems
Object Orientation
• Object Orientation
• Set of design and development principles
• Based on autonomous computer structures known as
objects
• Software is constructed from standard reusable
components
• OO Contribution areas
• Programming Languages
• Graphical User Interfaces
• Databases
• Design
• Operating Systems
Evolution of OO Concepts
• Concepts stem from object-oriented programming
languages (OOPLs)
• Ada, ALGOL, LISP, SIMULA
• OOPLs goals
• Easy-to-use development environment
• Powerful modeling tools for development
• Decrease in development time
• Make reusable code
• OO Attributes
• Data set not passive
• Data and procedures bound together
• Objects can act on themselves
5. Limited operations
Relational model has fixed operations on the data
Does not allow additional/new operations
7. Impedance mismatch
Mixing of different programming paradigms
Mismatch between the languages used
OO Concepts
• Object is a uniquely identifiable entity that contains both the
attributes that describe the state of the 'real world' object and
the actions that are associated with it.
• OODBMS can manage complex, highly interrelated
information.
• Abstract representation of a real-world entity
• Unique identity
• Embedded properties
• Ability to interact with other objects and self
OID (Object Identity)
• Each object is unique in OO systems
• Unique to object
• Not a primary key (a PK is unique only within a relation; a PK is
selected from the attributes, making it dependent on the state)
• Is invariant (will not change)
• Independent of values of attributes (two objects can have the
same state but will have different OIDs)
• Is invisible to users
• Entity integrity is enforced
• Relationship: embedding the OID of one object into the other
(embed the OID for a branch into an employee object)
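The difference between object identity and object state can be sketched in Python. Python's built-in id() is used here only as a stand-in for an OID: it is unique among live objects in one running program, whereas an OODBMS assigns persistent identifiers. The Branch and Employee classes are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Branch:
    branch_no: str
    city: str


@dataclass
class Employee:
    name: str
    salary: float
    branch: Optional[Branch] = None  # relationship: a reference to a Branch object


# Two objects with the same state are still two different objects.
b1 = Branch("B001", "Addis")
b2 = Branch("B001", "Addis")
same_state = b1 == b2     # dataclass equality compares attribute values
same_identity = b1 is b2  # identity comparison

# Identity does not change when the state changes.
emp = Employee("Abebe", 9000.0, branch=b1)
identity_before = id(emp)
emp.salary = 9500.0
identity_unchanged = id(emp) == identity_before

# The employee holds a reference to the branch object, not a copy of its values.
b1.city = "Adama"
city_seen_from_employee = emp.branch.city

print(same_state, same_identity, identity_unchanged, city_seen_from_employee)
```

A primary key behaves like the == comparison here, since it is derived from attribute values, while an OID behaves like the identity comparison.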
Attributes
• Called instance variables
• Domain
Object state
• Object values at any given time
• Values of attributes at any given point in time.
Methods
Messages
• Means by which objects communicate
• Request from one object to the other to activate one of its
methods
• Invokes method/calls method to be applied
• Sent to object from real world or other object
• Notation: Object.Method
• Eg: StaffObject.updatesalary(salary)
Classes
• Blueprint for defining similar objects
• Objects that have similar attributes and respond to the same
messages are grouped together
• Defined only once and used by many objects
• Collection of similar objects
• Shares attributes and structure
• Objects in a class are called instances of the class
• Eg: Class Definition : defining the class BRANCH
BRANCH
Attributes
branchNo
street
city
postcode
Methods
Print()
getPostCode()
numberofStaff()
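The BRANCH class can be sketched in Python as follows. The attribute and method names are adapted to Python naming conventions, the staff list and the sample values are hypothetical, and Print() is rendered as print_details, which returns the text instead of printing it.

```python
class Branch:
    """Blueprint shared by all branch objects (the instances of the class)."""

    def __init__(self, branch_no, street, city, postcode):
        # Attributes (instance variables): together they form the object state.
        self.branch_no = branch_no
        self.street = street
        self.city = city
        self.postcode = postcode
        self.staff = []  # hypothetical: names of staff working at this branch

    # Methods: the behaviour every instance of the class responds to.
    def print_details(self):
        return f"{self.branch_no}: {self.street}, {self.city} {self.postcode}"

    def get_postcode(self):
        return self.postcode

    def number_of_staff(self):
        return len(self.staff)


# The class is defined once and used by many objects.
addis = Branch("B001", "Bole Road", "Addis Ababa", "1000")
adama = Branch("B002", "Main Street", "Adama", "1888")
addis.staff.append("Abebe")

# Sending a message in Object.Method notation invokes the method on that object.
addis_postcode = addis.get_postcode()
staff_counts = (addis.number_of_staff(), adama.number_of_staff())
print(addis.print_details(), addis_postcode, staff_counts)
```

Both objects share the structure and methods defined by the class, but each keeps its own attribute values.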
Object Characteristics
Class Hierarchy
• Super class
• Subclass
Inheritance
• Ability of object to inherit the data structure and behavior
of classes above it
• Single inheritance – class has one immediate super class
• Multiple – class has more than one immediate super class
Method Overriding
• Method redefined at subclass
Polymorphism
• Allows different objects to respond to same message in
different ways. i.e. Use the same method differently.
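Inheritance, method overriding and polymorphism can be shown together in a short Python sketch. The Staff and Manager classes, their attributes and the pay rule are hypothetical.

```python
class Staff:
    """Super class."""

    def __init__(self, name, salary):
        self.name = name
        self.salary = salary  # yearly salary

    def monthly_pay(self):
        return self.salary / 12

    def describe(self):
        return f"{self.name} earns {self.monthly_pay():.2f} per month"


class Manager(Staff):
    """Subclass: inherits the data structure and behaviour of Staff (single inheritance)."""

    def __init__(self, name, salary, bonus):
        super().__init__(name, salary)
        self.bonus = bonus  # yearly bonus

    def monthly_pay(self):
        # Method overriding: the inherited method is redefined at the subclass.
        return (self.salary + self.bonus) / 12


# Polymorphism: the same message is sent to every object, and each object
# responds according to its own class.
staff_members = [Staff("Sara", 12000), Manager("Abebe", 24000, 1200)]
pays = [member.monthly_pay() for member in staff_members]
descriptions = [member.describe() for member in staff_members]
print(pays)
print(descriptions)
```

Manager does not define describe, so that method is inherited unchanged, yet it picks up the overridden monthly_pay when it is called on a Manager object.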
Object Classification
• Simple
Only single-valued attributes
No attributes refer to other objects
• Composite
At least one multi-valued attribute
No attributes refer to other object
• Compound
At least one attribute that references other object
• Hybrid
Repeating group of attributes
At least one refers to other object
OODBMS Advantages
• More semantic information
• Support for complex objects
• Extensibility of data types (user defined data types)
• May improve performance with efficient caching
• Versioning
• Polymorphism: one operation shared by many objects and each
acting differently
• Reusability
• Inheritance speeds development and application: defining new
objects in terms of previously defined objects (Incremental
Definition)
• Potential to integrate DBMSs into single environment
• Relationship between objects is represented explicitly
supporting both navigational and associative access to
information.
OODBMS Disadvantages
• Strong opposition from the established RDBMSs
• Lack of theoretical foundation
• No standard
• No single data model
• Throwback to old pointer systems
• Lack of standard ad hoc query language
• Lack of business data design and management tools
• Steep learning curve
• Low market presence
• Lack of compatibility between different OODBMSs
And therefore, the code to make objects persistent and to read objects back
from the database depends on the strategy chosen. In all the approaches,
some semantics (such as which class is the super class and which is the
subclass) is lost, and hence we have to build that semantics into each
application, which leads to duplication of code and potential
inconsistencies.
Chapter Six
Data warehousing
Data warehouse is an integrated, subject-oriented, time-variant,
non-volatile database that provides support for
decision making.
Integrated: a centralized, consolidated database that
integrates data derived from the entire organization.
Consolidates data from multiple and diverse
sources with diverse formats.
Helps managers to better understand the
company's operations.
Subject-oriented: the data warehouse contains data
organized by topics, e.g. Sales, Marketing, Finance,
Customer, etc.
Time-variant: in contrast to operational data, which
focus on current transactions, the warehouse data
represent the flow of data through time.
The data warehouse contains data that reflect what
happened last week, last month, over the past five years, and
so on.
Snapshots of data in the organization at different
points in time.
Non-volatile: once data enter the data warehouse,
they are never changed, because the data in the
warehouse represent the company's entire history, not
only current operational data.
Data Mining
In simple definition, data mining is the extraction or 'mining' of
knowledge from large amounts of data. "Mining of gold from
sand or soil is not sand or soil mining but GOLD mining".
Similarly, data mining is actually knowledge mining or
knowledge discovery that is useful for decision making.
Most organizations, these days, are in a data-rich but
information-poor situation.
Data mining tools predict future trends and behaviors, allowing
businesses to make proactive, knowledge-driven decisions.
The availability of large amounts of data, coupled with the need
for powerful data analysis tools, motivates the development of
data mining tools.
Data mining tools can answer business questions that
traditionally were too time-consuming to resolve.
Evolution of Data Mining:
Data collection → data stored and retrieved → more
information need
Similar terminologies:
• Knowledge Mining
• Knowledge extraction
• Knowledge discovery
• Pattern analysis
• Data archaeology
108