Unit 4 Notes
Transaction States
Active: The initial state; the transaction stays in this state while it is executing
Partially committed: After the final statement has been executed.
Failed: After the discovery that normal execution can no longer proceed.
Aborted: After the transaction has been rolled back and the database restored to its state prior to the start
of the transaction.
Two options exist after a transaction has been aborted:
i. restart the transaction, or ii. kill the transaction
Committed: After successful completion. A transaction is said to have terminated if it has either
committed or aborted.
ACID Properties
To preserve the integrity of data, the database system must ensure:
Atomicity: Either all operations of the transaction are properly reflected in the database or none are.
Consistency: Execution of a transaction in isolation preserves the consistency of the database.
Isolation: Although multiple transactions may execute concurrently, each transaction must be unaware
of other concurrently executing transactions. Intermediate transaction results must be hidden from other
concurrently executed transactions.
i. That is, for every pair of transactions Ti and Tj, it appears to Ti that either Tj finished execution
before Ti started, or Tj started execution after Ti finished.
Durability: After a transaction completes successfully, the changes it has made to the database persist,
even if there are system failures.
Durability requirement
Once the user has been notified that the transaction has completed (i.e., the transfer of the $50 has taken
place), the updates to the database by the transaction must persist even if there are software or hardware
failures.
Transaction to transfer $50 from account A to account B:
1. read(A)
2. A := A – 50
3. write(A)
4. read(B)
5. B := B + 50
6. write(B)
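As a concrete sketch, here is the same transfer in Python using the standard sqlite3 module; the account table, its contents, and the connection setup are assumptions made for this example:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("A", 1000), ("B", 2000)])
conn.commit()

try:
    # steps 1-3: read A, subtract 50, write A
    conn.execute("UPDATE account SET balance = balance - 50 WHERE name = 'A'")
    # steps 4-6: read B, add 50, write B
    conn.execute("UPDATE account SET balance = balance + 50 WHERE name = 'B'")
    conn.commit()    # on success, both updates become durable together
except Exception:
    conn.rollback()  # on failure, neither update is applied (atomicity)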
Consistency requirement in the above example:
The sum of A and B is unchanged by the execution of the transaction
In general, consistency requirements include
A transaction must see a consistent database.
During transaction execution the database may be temporarily inconsistent.
When the transaction completes successfully the database must be consistent
Isolation requirement:
If between steps 3 and 6, another transaction T2 is allowed to access the partially updated database, it
will see an inconsistent database (the sum A + B will be less than it should be).
T1                          T2
1. read(A)
2. A := A – 50
3. write(A)
                            read(A)
                            read(B)
                            print(A+B)
4. read(B)
5. B := B + 50
6. write(B)
Isolation can be ensured by running transactions serially, that is, one after the other. However, executing
multiple transactions concurrently has significant benefits.
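The inconsistent read can be simulated in plain Python; the variables below are illustrative stand-ins for the database items A and B:

# shared "database" state
A, B = 1000, 2000

# T1, steps 1-3: debit A
local_a = A        # read(A)
local_a -= 50      # A := A - 50
A = local_a        # write(A)

# T2 runs here, between T1's steps 3 and 6
print(A + B)       # 2950: the sum is 50 short of the true 3000

# T1, steps 4-6: credit B
local_b = B        # read(B)
local_b += 50      # B := B + 50
B = local_b        # write(B)

print(A + B)       # 3000: consistency is restored once T1 completes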
Schedules
Schedule: A sequence of instructions that specifies the chronological order in which the instructions of
concurrent transactions are executed.
A schedule for a set of transactions must consist of all instructions of those transactions, and it
must preserve the order in which the instructions appear in each individual transaction.
A transaction that successfully completes its execution will have a commit instruction as the last statement;
by default, a transaction is assumed to execute a commit instruction as its last step.
A transaction that fails to successfully complete its execution will have an abort instruction as the last
statement.
A schedule is the order in which the operations of multiple transactions appear for execution.
Serial schedules are always consistent.
Non-serial schedules are not always consistent.
Serial Schedules-
In serial schedules, all the transactions execute serially one after the other. When one transaction
executes, no other transaction is allowed to execute.
Characteristics-
Serial schedules are always-
Consistent
Recoverable
Cascadeless
Strict
Example-01:
[Schedule diagram omitted.] In this schedule,
There are two transactions T1 and T2 executing serially one after the other.
Transaction T1 executes first.
After T1 completes its execution, transaction T2 executes.
So, this schedule is an example of a Serial Schedule.
Non-Serial Schedules-
In non-serial schedules, multiple transactions execute concurrently, and the operations of the
transactions are interleaved.
Characteristics-
Non-serial schedules are NOT always-
Consistent
Recoverable
Cascadeless
Strict
Example-01:
[Schedule diagrams 1.1, 1.2 and 2 omitted.]
In Schedules 1.1, 1.2 and 2, the sum A + B is preserved.
Serializability
Some non-serial schedules may lead to inconsistency of the database.
Serializability is a concept that helps to identify which non-serial schedules are correct and will maintain
the consistency of the database.
Serializable Schedules-
If a given non-serial schedule of ‘n’ transactions is equivalent to some serial schedule of ‘n’ transactions,
then it is called a serializable schedule.
Characteristics-
Serializable schedules are always-
Consistent
Recoverable
Cascadeless
Strict
Serial Schedules Vs Serializable Schedules-
Serial Schedules: lead to less resource utilization and CPU throughput; they are less efficient (due to the above reason).
Serializable Schedules: improve both resource utilization and CPU throughput; they are always better than serial schedules (due to the above reason).
Types of Serializability-
1. Conflict Serializability
2. View Serializability
Conflict Serializability-
If a given non-serial schedule can be converted into a serial schedule by swapping its non-conflicting
operations, then it is called a conflict serializable schedule.
Conflicting Operations-
Two operations are called conflicting operations if all the following conditions hold true for them-
1. Both the operations belong to different transactions.
2. Both the operations are on the same data item.
3. At least one of the two operations is a write operation.
Example-
[Example schedule showing a pair of conflicting operations omitted.]
Checking Whether a Schedule is Conflict Serializable Or Not-
Step-01:
List all the conflicting operations and determine the dependency between the transactions.
Step-02:
Start creating a precedence graph by drawing one node for each transaction.
Step-03:
Draw an edge for each conflict pair such that if Xi(V) and Yj(V) form a conflict pair, then draw an edge
from Ti to Tj.
This ensures that Ti gets executed before Tj.
Step-04:
Check if there is any cycle in the precedence graph. If there is no cycle, the schedule is conflict
serializable; otherwise, it is not.
Problem-01:
Check whether the given schedule S is conflict serializable or not. If yes, then determine all the possible
serialized schedules-
[Schedule S omitted.]
Solution-
Step-01:
List all the conflicting operations and determine the dependency between the transactions-
Step-02:
Draw the precedence graph. [Precedence graph omitted.]
All the possible topological orderings of the above precedence graph will be the possible serialized
schedules.
The topological orderings can be found by performing the Topological Sort of the above precedence
graph.
After performing the topological sort, the possible serialized schedules are-
1. T1 → T3 → T4 → T2
2. T1 → T4 → T3 → T2
3. T4 → T1 → T3 → T2
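The whole procedure (Steps 01 to 04 above) can be sketched in Python. The (transaction, operation, data item) schedule encoding, the function names, and the small example schedule are illustrative assumptions; the example is not the omitted schedule S of Problem-01:

from itertools import permutations

def precedence_graph(schedule):
    # Edge Ti -> Tj for every conflict pair: different transactions,
    # same data item, at least one write, Ti's operation occurring first.
    txns = {t for t, _, _ in schedule}
    edges = set()
    for i, (ti, oi, xi) in enumerate(schedule):
        for tj, oj, xj in schedule[i + 1:]:
            if ti != tj and xi == xj and "W" in (oi, oj):
                edges.add((ti, tj))
    return txns, edges

def serial_orders(txns, edges):
    # Every topological ordering of the graph is a possible serialized schedule.
    return [p for p in permutations(sorted(txns))
            if all(p.index(a) < p.index(b) for a, b in edges)]

txns, edges = precedence_graph([
    ("T1", "R", "A"), ("T2", "R", "A"), ("T1", "W", "A"), ("T3", "R", "A"),
])
print(edges)                       # e.g. {('T2', 'T1'), ('T1', 'T3')}
print(serial_orders(txns, edges))  # [('T2', 'T1', 'T3')]; an empty list means a cycle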
View Serializability-
If a given schedule is found to be view equivalent to some serial schedule, then it is called a view
serializable schedule.
Consider two schedules S1 and S2 each consisting of two transactions T1 and T2.
Schedules S1 and S2 are called view equivalent if the following three conditions hold true for them-
Condition-01:
For each data item X, if transaction Ti reads X from the database initially in schedule S1, then in schedule
S2 also, Ti must perform the initial read of X from the database.
Thumb Rule
“Initial readers must be same for all the data items”.
Condition-02:
If transaction Ti reads a data item that has been updated by the transaction Tj in schedule S1, then in
schedule S2 also, transaction Ti must read the same data item that has been updated by the transaction
Tj.
Thumb Rule
“Write-read sequence must be same.”.
Condition-03:
For each data item X, if X has been updated at last by transaction Ti in schedule S1, then in schedule S2
also, X must be updated at last by transaction Ti.
Thumb Rule
“Final writers must be same for all the data items”.
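The three conditions can be sketched as Python checks; the (transaction, operation, item) schedule encoding and the helper names are assumptions for illustration:

def initial_readers(schedule):
    # Condition-01: the first transaction to read each data item
    first = {}
    for t, op, x in schedule:
        if op == "R" and x not in first:
            first[x] = t
    return first

def write_read_pairs(schedule):
    # Condition-02: (writer, reader, item) for each read of an item
    # last written by another transaction
    pairs, last_writer = set(), {}
    for t, op, x in schedule:
        if op == "W":
            last_writer[x] = t
        elif op == "R" and x in last_writer and last_writer[x] != t:
            pairs.add((last_writer[x], t, x))
    return pairs

def final_writers(schedule):
    # Condition-03: the last transaction to write each data item
    last = {}
    for t, op, x in schedule:
        if op == "W":
            last[x] = t
    return last

def view_equivalent(s1, s2):
    return (initial_readers(s1) == initial_readers(s2)
            and write_read_pairs(s1) == write_read_pairs(s2)
            and final_writers(s1) == final_writers(s2))

s_serial = [("T1", "R", "A"), ("T1", "W", "A"), ("T2", "R", "A"), ("T2", "W", "A")]
s_other  = [("T1", "R", "A"), ("T2", "R", "A"), ("T1", "W", "A"), ("T2", "W", "A")]
print(view_equivalent(s_serial, s_other))  # False: the write-read pairs differ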
Checking Whether a Schedule is View Serializable Or Not-
Method-01:
If the given schedule is conflict serializable, then it is surely view serializable. Stop and report your
answer.
If the given schedule is not conflict serializable, then it may or may not be view serializable. Go and check
using other methods.
Thumb Rules
All conflict serializable schedules are view serializable.
All view serializable schedules may or may not be conflict serializable.
Method-02:
If there does not exist any blind write, then the schedule is surely not view serializable. Stop and report
your answer.
If there exists any blind write, then the schedule may or may not be view serializable. Go and check using
other methods.
Thumb Rule
No blind write means the schedule is surely not view serializable.
Method-03:
In this method, try to find a view equivalent serial schedule directly by checking the above three
conditions.
Problem-01:
[Problem statement and schedule S omitted.]
Solution-
Step-01:
List all the conflicting operations and determine the dependency between the transactions-
Now, the precedence graph for these dependencies contains a cycle. [Details omitted.]
Since the given schedule S is not conflict serializable, it may or may not be view serializable.
To check whether S is view serializable or not, let us use another method.
Let us check for blind writes.
[The remainder of the worked solution is omitted.]
Non-Serializable Schedules-
A non-serial schedule which is not serializable is called a non-serializable schedule. Such a schedule is
not guaranteed to produce the same effect as some serial schedule.
Characteristics-
Non-serializable schedules-
may or may not be consistent
may or may not be recoverable
Irrecoverable Schedules-
If in a schedule, a transaction performs a dirty read operation from an uncommitted transaction and
commits before the transaction from which it has read the value, then such a schedule is known as an
Irrecoverable Schedule.
Example-
[Example schedule and its explanation omitted.]
Recoverable Schedules-
If in a schedule, a transaction performs a dirty read operation from an uncommitted transaction and its
commit operation is delayed till the uncommitted transaction either commits or rolls back, then such a
schedule is known as a Recoverable Schedule.
Here,
The commit operation of the transaction that performs the dirty read is delayed.
This ensures that it still has a chance to recover if the uncommitted transaction fails later.
Example-
[Example schedule and its explanation omitted.]
Checking Whether a Schedule is Recoverable or Irrecoverable-
Method-01:
If the given schedule is conflict serializable, then it is surely recoverable. Stop and report your answer.
If the given schedule is not conflict serializable, then it may or may not be recoverable. Go and check
using other methods.
Thumb Rules
All conflict serializable schedules are recoverable.
All recoverable schedules may or may not be conflict serializable.
Method-02:
If there does not exist any dirty read operation, then the schedule is surely recoverable. Stop and report
your answer.
If there exists any dirty read operation, then the schedule may or may not be recoverable.
If there exists a dirty read operation, then follow the following cases-
Case-01:
If the commit operation of the transaction performing the dirty read occurs before the commit or abort
operation of the transaction which updated the value, then the schedule is irrecoverable.
Case-02:
If the commit operation of the transaction performing the dirty read is delayed till the commit or abort
operation of the transaction which updated the value, then the schedule is recoverable.
Thumb Rule
No dirty read means a recoverable schedule.
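Both cases can be sketched as a small Python classifier. The schedule encoding, with "C" marking a commit and None as its data item, is an assumption, and abort operations are ignored for simplicity:

def classify(schedule):
    # A dirty read: Ti reads an item last written by a still-uncommitted Tj.
    last_writer, committed, dirty = {}, set(), []
    for t, op, x in schedule:
        if op == "W":
            last_writer[x] = t
        elif op == "R":
            w = last_writer.get(x)
            if w is not None and w != t and w not in committed:
                dirty.append((t, w))          # (reader, writer) at risk
        elif op == "C":
            committed.add(t)
            for reader, writer in dirty:
                # reader commits while the writer is still uncommitted: Case-01
                if reader == t and writer not in committed:
                    return "irrecoverable"
    return "recoverable"

# T2 dirty-reads A from uncommitted T1 and commits first
s = [("T1", "W", "A"), ("T2", "R", "A"), ("T2", "C", None), ("T1", "C", None)]
print(classify(s))   # irrecoverable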
Types of Recoverable Schedules-
A recoverable schedule may further be any one of the following three kinds-
1. Cascading Schedule
2. Cascadeless Schedule
3. Strict Schedule
Cascading Schedule-
If in a schedule, failure of one transaction causes several other dependent transactions to rollback or
abort, then such a schedule is called as a Cascading Schedule or Cascading Rollback or Cascading Abort.
It simply leads to the wastage of CPU time.
Example-
[Example omitted: transaction T1 fails, and transactions T2, T3 and T4, which have dirty-read its
updates, are rolled back in cascade.]
NOTE-
If transactions T2, T3 and T4 had committed before the failure of transaction T1, then the schedule
would have been irrecoverable.
Cascadeless Schedule-
If in a schedule, a transaction is not allowed to read a data item until the last transaction that has written
it is committed or aborted, then such a schedule is called as a Cascadeless Schedule.
In other words, a cascadeless schedule allows only committed read operations.
Example-
[Example schedule omitted.]
NOTE-
A cascadeless schedule allows only committed read operations; however, it allows uncommitted write
operations.
Example-
[Example schedule omitted.]
Strict Schedule-
If in a schedule, a transaction is neither allowed to read nor write a data item until the last transaction that
has written it is committed or aborted, then such a schedule is called as a Strict Schedule.
In other words, a strict schedule allows only committed read and write operations.
Example-
[Example schedule omitted.]
Remember-
Strict schedules are more strict than cascadeless schedules.
All strict schedules are cascadeless schedules.
Not all cascadeless schedules are strict schedules.
Equivalence of Schedules-
In DBMS, schedules may have the following three different kinds of equivalence relations among them-
1. Result Equivalence
2. Conflict Equivalence
3. View Equivalence
If any two schedules generate the same result after their execution, then they are called result
equivalent schedules.
This equivalence relation is considered of least significance.
This is because some schedules might produce the same results for some set of values and different
results for another set of values.
If any two schedules satisfy the following two conditions, then they are called conflict equivalent
schedules-
1. The set of transactions present in both schedules is the same.
2. The order of every pair of conflicting operations is the same in both schedules.
Problem-01:
Let X = 2 and Y = 5. [The schedules to be compared are omitted.]
Failure Classification
Transaction failure:
Logical errors: transaction cannot complete due to some internal error condition
System errors: the database system must terminate an active transaction due to an error condition (e.g.,
deadlock)
System crash: a power failure or other hardware or software failure causes the system to crash.
Disk failure: a head crash or similar disk failure destroys all or part of disk storage.
Recoverability from Transaction Failure
To ensure atomicity despite failures, we first output information describing the modifications to stable
storage without modifying the database itself.
We study two approaches:
1. log-based recovery, and
2. shadow paging
Log-Based Recovery
A log is kept on stable storage.
The log is a sequence of log records, and maintains a record of update activities on the database.
Each and every log record is given a unique id called log sequence number.
A log record is written for each of the following actions:
1. Updating a page
2. Commit
3. Abort
4. End
5. Undoing an update
Every log record contains the following fields:
PrevLSN | TransID | Type | PageID | Length | Offset | Before-Image | After-Image
Other special log records exist to record significant events during transaction processing, such as the start
of a transaction and the commit or abort of a transaction. We denote the various types of log records as:
<Ti start>: transaction Ti has started.
<Ti, Xj, V1, V2>: transaction Ti has performed a write on data item Xj, which had value V1 before the
write and value V2 after it.
<Ti commit>: transaction Ti has committed.
<Ti abort>: transaction Ti has aborted.
Whenever a transaction performs a write, it is essential that the log record for that write be created before
the database is modified. Once a log record exists, we can output the modification to the database when
that is desirable. We also have the ability to undo a modification that has already been output to the
database, by using the old-value field in the log records.
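A minimal sketch of this write-ahead rule in Python, with in-memory structures standing in for the log on stable storage and for the database; all names are illustrative:

log, db = [], {"A": 1000}

def t_write(txn, item, new_value):
    # write-ahead rule: append the log record BEFORE touching the database
    log.append((txn, item, db[item], new_value))   # <Ti, Xj, old value, new value>
    db[item] = new_value

def undo(txn):
    # restore old values by scanning the log backwards
    for t, item, old, new in reversed(log):
        if t == txn:
            db[item] = old

t_write("T0", "A", 950)
undo("T0")
print(db["A"])   # 1000: the modification is undone via the old-value field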
For log records to be useful for recovery from system and disk failures, the log must reside on stable storage.
However, since the log contains a complete record of all database activity, the volume of data stored in the
log may become unreasonably large.
The immediate-update technique allows database modifications to be output to the database while the
transaction is still in the active state. These modifications are called uncommitted modifications. In the
event of a crash or transaction failure, the system must use the old-value field of the log records to
restore the modified data items.
Transactions T0 and T1 are executed one after the other in the order T0 followed by T1. The portion of
the log containing the relevant information concerning these two transactions appears in the following,
Portion of the system log corresponding to T0 and T1
< T0 start >
< T0, A, 1000, 950 >
< T0, B, 2000, 2050 >
< T0 commit >
< T1 start >
< T1, C, 700, 600 >
< T1 commit >
Checkpoints
When a system failure occurs, we must consult the log to determine those transactions that need to be
redone and those that need to be undone. Rather than reprocessing the entire log, which is time-
consuming and largely unnecessary, we can use checkpoints:
Output onto stable storage all the log records currently residing in main memory.
Output to the disk all modified buffer blocks.
Output onto stable storage a log record, <checkpoint>.
The recovery system reads the logs backwards from the end to the last checkpoint.
It maintains two lists, an undo-list and a redo-list.
If the recovery system sees a log with <Tn, Start> and <Tn, Commit> or just <Tn, Commit>, it puts
the transaction in the redo-list.
If the recovery system sees a log with <Tn, Start> but no commit or abort log found, it puts the
transaction in undo-list.
All the transactions in the undo-list are then undone, and all the transactions in the redo-list are then
redone using their log records.
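A simplified sketch of this backward scan in Python, assuming log records are (type, transaction) tuples, no transaction is active when the checkpoint is taken, and abort records are ignored:

def build_lists(log):
    # Scan backwards from the end to the last checkpoint, classifying
    # transactions into a redo-list and an undo-list.
    redo, undo, seen_commit, placed = [], [], set(), set()
    for kind, txn in reversed(log):
        if kind == "checkpoint":
            break
        if kind == "commit":
            seen_commit.add(txn)
        elif kind == "start":
            (redo if txn in seen_commit else undo).append(txn)
            placed.add(txn)
    # a commit seen after the checkpoint without its start record: still redo
    redo += [t for t in seen_commit if t not in placed]
    return redo, undo

log = [
    ("start", "T1"), ("commit", "T1"),
    ("checkpoint", None),
    ("start", "T2"), ("start", "T3"), ("commit", "T2"),
]
print(build_lists(log))   # (['T2'], ['T3']): redo T2, undo T3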
Shadow Paging
Shadow paging is an alternative to log-based recovery techniques, which has both advantages and
disadvantages. It may require fewer disk accesses, but it is hard to extend paging to allow multiple
concurrent transactions. The paging is very similar to paging schemes used by the operating system for
memory management.
The idea is to maintain two page tables during the life of a transaction: the current page table and
the shadow page table. When the transaction starts, both tables are identical. The shadow page is never
changed during the life of the transaction. The current page is updated with each write operation. Each
table entry points to a page on the disk. When the transaction is committed, the shadow page entry
becomes a copy of the current page table entry and the disk block with the old data is released. If the
shadow is stored in nonvolatile memory and a system crash occurs, then the shadow page table is copied
to the current page table. This guarantees that the shadow page table will point to the database pages
corresponding to the state of the database prior to any transaction that was active at the time of the crash,
making aborts automatic.
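A toy sketch of the two page tables in Python, with dicts standing in for the page tables and the disk; all names and values are illustrative:

# page tables map page numbers to "disk blocks" (here, plain values)
disk = {0: "old page 0", 1: "old page 1"}
shadow = {0: 0, 1: 1}       # shadow page table: never changed during the transaction
current = dict(shadow)      # current page table: starts out identical

def write_page(page_no, data):
    # copy-on-write: the update goes to a fresh block, so the shadow
    # table still points at the old block and aborts are automatic
    new_block = max(disk) + 1
    disk[new_block] = data
    current[page_no] = new_block

write_page(0, "new page 0")
print(disk[shadow[0]])      # 'old page 0': the pre-transaction state, kept for recovery
print(disk[current[0]])     # 'new page 0': the uncommitted update

shadow = dict(current)      # commit: install the current table as the new shadow table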
Shadow paging has the following drawbacks:
1. Commit overhead. The commit of a single transaction using shadow paging requires multiple
blocks to be output -- the current page table, the actual data and the disk address of the current
page table. Log-based schemes need to output only the log records.
2. Data fragmentation. Shadow paging causes database pages to change locations (therefore, they are
no longer contiguous).
3. Garbage collection. Each time that a transaction commits, the database pages containing the old
version of data changed by the transactions must become inaccessible. Such pages are considered
to be garbage since they are not part of the free space and do not contain any usable information.
Periodically it is necessary to find all of the garbage pages and add them to the list of free pages.
This process is called garbage collection and imposes additional overhead and complexity on the
system.
Deadlock Handling
System is deadlocked if there is a set of transactions such that every transaction in the set is waiting for
another transaction in the set.
Deadlock prevention protocols ensure that the system will never enter into a deadlock state. Some
prevention strategies:
1. Require that each transaction locks all its data items before it begins execution (predeclaration).
2. Impose partial ordering of all data items and require that a transaction can lock data items only in the
order specified by the partial order (graph-based protocol).
Following schemes use transaction timestamps for the sake of deadlock prevention alone.
wait-die scheme - non-preemptive
1. older transaction may wait for younger one to release data item. Younger transactions never wait for
older ones; they are rolled back instead.
2. a transaction may die several times before acquiring the needed data item
wound-wait scheme – preemptive
1. older transaction wounds (forces rollback) of younger transaction instead of waiting for it. Younger
transactions may wait for older ones.
2. may be fewer rollbacks than wait-die scheme.
Deadlock prevention
Both in wait-die and in wound-wait schemes, a rolled back transaction is restarted with its original
timestamp. Older transactions thus have precedence over newer ones, and starvation is hence avoided.
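Both schemes reduce to a timestamp comparison, sketched below in Python; a smaller timestamp means an older transaction, and the function names are illustrative:

def wait_die(requester_ts, holder_ts):
    # non-preemptive: an older requester waits; a younger requester dies
    return "wait" if requester_ts < holder_ts else "roll back requester"

def wound_wait(requester_ts, holder_ts):
    # preemptive: an older requester wounds the holder; a younger requester waits
    return "roll back holder" if requester_ts < holder_ts else "wait"

# a transaction with timestamp 5 requests an item held by one with timestamp 9
print(wait_die(5, 9))     # wait             (requester is older)
print(wound_wait(5, 9))   # roll back holder (requester is older)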
Deadlock Detection
Deadlocks can be described as a wait-for graph, which consists of a pair G = (V,E), V is a set of vertices (all
the transactions in the system) E is a set of edges; each element is an ordered pair Ti →Tj.
If Ti → Tj is in E, then there is a directed edge from Ti to Tj, implying that Ti is waiting for Tj to release a data
item.
When Ti requests a data item currently being held by Tj, then the edge Ti → Tj is inserted in the wait-for
graph. This edge is removed only when Tj is no longer holding a data item needed by Ti.
The system is in a deadlock state if and only if the wait-for graph has a cycle. Must invoke a deadlock-
detection algorithm periodically to look for cycles.
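Deadlock detection is then just cycle detection on the wait-for graph. A minimal DFS sketch in Python, with the graph encoded (as an assumption) as a dict mapping each transaction to the transactions it waits for:

def has_cycle(graph):
    # colors: 0 = unvisited, 1 = on the current DFS path, 2 = finished
    color = {}
    def dfs(t):
        color[t] = 1
        for u in graph.get(t, []):
            c = color.get(u, 0)
            if c == 1 or (c == 0 and dfs(u)):
                return True   # back edge: some Ti transitively waits on itself
        color[t] = 2
        return False
    return any(color.get(t, 0) == 0 and dfs(t) for t in graph)

# T1 waits for T2 and T2 waits for T1: deadlock
print(has_cycle({"T1": ["T2"], "T2": ["T1"]}))   # True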
Deadlock Recovery
When deadlock is detected:
Some transaction will have to be rolled back (made a victim) to break the deadlock. Select as victim the
transaction that will incur minimum cost.
Rollback -- determine how far to roll back the transaction:
Total rollback: abort the transaction and then restart it.
It is more effective to roll back the transaction only as far as necessary to break the deadlock.
Starvation happens if the same transaction is always chosen as victim. Include the number of rollbacks in
the cost factor to avoid starvation.
Design Issues
There are 2 ways in which data can be stored on different sites. These are:
1. Replication
In this approach, the entire relation is stored redundantly at 2 or more sites. If the entire database is
available at all sites, it is a fully redundant database. Hence, in replication, systems maintain copies of data.
This is advantageous as it increases the availability of data at different sites. Also, now query requests can
be processed in parallel.
However, it has certain disadvantages as well. Data needs to be constantly updated. Any change made at
one site needs to be recorded at every site where that relation is stored, or else it may lead to inconsistency. This
is a lot of overhead. Also, concurrency control becomes way more complex as concurrent access now
needs to be checked over a number of sites.
2. Fragmentation
In this approach, the relations are fragmented (i.e., they’re divided into smaller parts) and each of the
fragments is stored in different sites where they’re required. It must be made sure that the fragments are
such that they can be used to reconstruct the original relation (i.e., there is no loss of data).
Fragmentation is advantageous: since it doesn't create copies of data, consistency is not a problem.
Fragmentation of relations can be done in two ways:
Horizontal fragmentation – Splitting by rows – The relation is fragmented into groups of tuples so
that each tuple is assigned to at least one fragment.
Vertical fragmentation – Splitting by columns – The schema of the relation is divided into smaller
schemas. Each fragment must contain a common candidate key so as to ensure lossless join.
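A toy illustration in Python of both fragmentation styles and their lossless reconstruction; the relation, its columns, and the values are made up for this example:

# a relation as a list of rows
employee = [
    {"id": 1, "name": "Asha", "dept": "Sales"},
    {"id": 2, "name": "Ravi", "dept": "HR"},
]

# horizontal fragmentation: split by rows (selection); reconstruct by union
frag_sales = [r for r in employee if r["dept"] == "Sales"]
frag_rest  = [r for r in employee if r["dept"] != "Sales"]
assert sorted(frag_sales + frag_rest, key=lambda r: r["id"]) == employee

# vertical fragmentation: split by columns (projection), each fragment
# keeping the candidate key "id"; reconstruct by joining on the key
frag_a = [{"id": r["id"], "name": r["name"]} for r in employee]
frag_b = [{"id": r["id"], "dept": r["dept"]} for r in employee]
joined = [{**a, **b} for a in frag_a for b in frag_b if a["id"] == b["id"]]
assert sorted(joined, key=lambda r: r["id"]) == employee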
Directory System
A Derby database contains dictionary objects such as tables, columns, indexes, and jar files. A Derby
database can also store its own configuration information.
Note: An in-memory database does not use the file system, but the size limits listed in the table later in
this topic still apply. For some limits, the maximum value is determined by the available main memory
instead of the available disk space and file system limitations.
log directory
Contains files that make up the database transaction log, used internally for data recovery (not
the same thing as the error log).
seg0 directory
Contains one file for each user table, system table, and index (known as conglomerates).
service.properties file
A text file with internal configuration information.
tmp directory
(might not exist.) A temporary directory used by Derby for large sorts and deferred updates and
deletes. Sorts are used by a variety of SQL statements. For databases on read-only media, you
might need to set a property to change the location of this directory. See
"Creating Derby Databases for Read-Only Use".
jar directory
(might not exist.) A directory in which jar files are stored when you use database class loading.
The following figure shows the files and directories in the Derby database directories that are used by
the Derby software. [Figure omitted.]
Derby imposes relatively few limitations on the number and size of databases and database objects. The
following table shows some size limitations of Derby databases and database objects. [Table omitted.]