IARE DBMS Lecture Notes
(Autonomous)
Dundigal, Hyderabad – 500 043
LECTURE NOTES
ON
BRANCH/YEAR : CSE/II
SEMESTER : IV
PREPARED BY
UNIT-I
----------------------------------------------------------------------------------------------------------------
Basic definitions
DBMS Applications
View of Data
Data models
Relational model
E-R model
ER diagram
Database languages
DBMS Architecture
UNIT-I
CONCEPTUAL MODELLING
What Is a DBMS?
A Database Management System (DBMS) is a software package designed to interact with end-users and other applications, and to store and manage databases. A general-purpose DBMS allows the definition, creation, querying, update, and administration of databases.
A database is a very large, integrated collection of data. It models a real-world enterprise, with entities (e.g., students, courses) and relationships (e.g., Madonna is taking CS564).
A DBMS contains information about a particular enterprise:
Collection of interrelated data
Set of programs to access the data
An environment that is both convenient and efficient to use
Why Use a DBMS?
A database management system stores, organizes and manages a large amount of information
within a single software application. It manages data efficiently and allows users to perform
multiple tasks with ease.
Shift from computation to information: at the “low end”, the scramble to webspace (a mess!); at the “high end”, scientific applications.
Datasets are increasing in diversity and volume: digital libraries, interactive video, the Human Genome project, the EOS project ... the need for DBMSs is exploding.
DBMS touches most of CS: OS, languages, theory, AI, multimedia, logic.
In the early days, database applications were built directly on top of file systems. A DBMS
provides users with a systematic way to create, retrieve, update and manage data. It is a
middleware between the databases which store all the data and the users or applications which
need to interact with that stored database. A DBMS can limit what data the end user sees, as
well as how that end user can view the data, providing many views of a single database
schema.
Drawbacks of using file systems to store data:
Data redundancy and inconsistency – multiple file formats, duplication of information in different files
Difficulty in accessing data.
Need to write a new program to carry out each new task.
Data isolation — multiple files and formats
Integrity problems
Hard to add new constraints or change existing ones
Atomicity of updates
Failures may leave database in an inconsistent state with partial updates carried out
Example: Transfer of funds from one account to another should either complete or not
happen at all
Concurrent access by multiple users
Concurrent access needed for performance
Uncontrolled concurrent accesses can lead to inconsistencies
Example: Two people reading a balance and updating it at the same time
Security problems
Hard to provide user access to some, but not all, data
Database systems offer solutions to all the above problems
A file processing system is a collection of programs that store and manage files on a computer hard disk. A database management system, on the other hand, is a collection of programs that enables users to create and maintain a database. A file processing system has more data redundancy; there is less data redundancy in a DBMS.
Application must stage large datasets between main memory and secondary
storage (e.g., buffering, page-oriented access, 32-bit addressing, etc.)
Special code for different queries
Must protect data from inconsistency due to multiple concurrent users
Crash recovery
Security and access control
View of Data
A database system is a collection of interrelated data and a set of programs that allow users to access and modify these data. A main task of a database system is to provide an abstract view of the data, i.e., to hide certain details of storage from the users.
Data Abstraction:
The major purpose of a DBMS is to provide users with an abstract view of the data, i.e., the system hides certain details of how the data are stored and maintained. Since database system users are not computer trained, developers hide the complexity from users through three levels of abstraction, to simplify the user's interaction with the system.
Levels of Abstraction
Physical level of data abstraction: Describes how a record (e.g., customer) is stored. This
is the lowest level of abstraction which describes how data are actually stored.
Logical level of data abstraction: The next highest level of abstraction, which describes what data are stored in the database and what relationships exist among them.
type customer = record
    customer_id : string;
    customer_name : string;
    customer_street : string;
    customer_city : string;
end;
View level of data abstraction: The highest level of abstraction; it provides a security mechanism to prevent users from accessing certain parts of the database. Application programs hide details of data types. Views can also hide information (such as an employee's salary) for security purposes and to simplify the interaction with the system.
Summary
Schemas and instances are similar to types and variables in programming languages. The database changes over time as information is inserted or deleted.
Instance – the actual content of the database at a particular point in time analogous to the value
of a variable is called an instance of the database.
Schema – the logical structure of the database called the database schema. Schema is of three
types: Physical schema, logical schema and view schema.
Example: The database consists of information about a set of customers and accounts and the relationship between them (analogous to type information of a variable in a program).
Physical schema: Database design at the physical level is called physical schema. How the
data stored in blocks of storage is described at this level.
Logical schema: database design at the logical level Instances and schemas, programmers and
database administrators work at this level, at this level data can be described as certain types of
data records gets stored in data structures, however the internal details such as implementation
of data structure is hidden at this level.
View schema: Design of database at view level is called view schema. This generally describes
end user interaction with database systems.
Physical Data Independence – The ability to modify the physical schema without changing
the logical schema.
In general, the interfaces between the various levels and components should be well defined so that changes in some parts do not seriously influence others.
Conceptual schema:
Course_info(cid:string,enrollment:integer)
Data Independence:
Physical data independence: Protection from changes in physical structure of data.
1980s:
1990s:
– Emergence of Web commerce
2000s:
Data Models:
A data model is a collection of tools for describing:
– Data
– Data relationships
– Data semantics
– Data constraints
– Relational model
– Entity-Relationship data model (mainly for database design)
– Object-based data models (Object-oriented and Object-relational)
– Semi structured data model (XML)
– Other older models:
Network model
Hierarchical model
Every relation has a schema, which describes the columns, or fields.
Relational model: The relational model uses a collection of tables to represent both data and
relationships among those data. Each table has multiple columns with unique name.
– It is example of record based model
– These models structure the database in fixed-format records of several types.
– Each table contains records of particular type
– Each record type defines fixed number of fields, or attributes.
– The columns of the table correspond to attributes of the record type.
The relational data model is the most widely used data model and majority of current database
systems are based on relational model.
Entity-relationship model: The E-R model is based on a perception of real world that
consists of basic objects called entities and relationships among these objects. An entity is a
‘thing’ or ‘object’ in the real world, E-R model is widely used in database design.
– What information about these entities and relationships should we store in the database?
A database schema in the ER model can be represented pictorially (ER diagrams).
ER Model:
Entity: Real-world object distinguishable from other objects. An entity is described (in
DB) using a set of attributes.
– All entities in an entity set have the same set of attributes. (Until we consider
ISA hierarchies, anyway!)
– An n-ary relationship set R relates n entity sets E1 ... En; each relationship in R involves entities e1 ∈ E1, ..., en ∈ En
Same entity set could participate in different relationship sets, or in different “roles” in
same set.
Modeling:
A database can be modeled as:
– a collection of entities,
– relationships among entities.
Entities and Entity Sets:
An entity set is a set of entities of the same type that share the same properties.
Attributes:
Attribute types:
– Single-valued and multi-valued attributes. Example: a multivalued attribute such as phone_numbers.
Composite Attributes
Express the number of entities to which another entity can be associated via a relationship
set.
For a binary relationship set the mapping cardinality must be one of the following
types:
–One to one
–One to many
–Many to one
–Many to many
Mapping Cardinalities:
Note: Some elements in A and B may not be mapped to any elements in the other set
Relationships and Relationship Sets
A relationship is an association among several entities; a relationship set is a set of relationships of the same type among entity sets.
– Example: the depositor relationship set between entity sets customer and account.
Relationship sets that involve two entity sets are binary (or degree two). Generally,
most relationship sets in a database system are binary.
Relationships between more than two entity sets are rare. Most relationships are binary.
Weak Entities
A weak entity can be identified uniquely only by considering the primary key of another
(owner) entity.
Owner entity set and weak entity set must participate in a one-to-many relationship set (one owner, many weak entities).
Weak entity set must have total participation in this identifying relationship set.
An entity set that does not have a primary key is referred to as a weak entity set.
The existence of a weak entity set depends on the existence of a identifying entity set
it must relate to the identifying entity set via a total, one-to-many relationship
The discriminator (or partial key) of a weak entity set is the set of attributes that distinguishes among all the entities of the weak entity set.
The primary key of a weak entity set is formed by the primary key of the strong entity
set on which the weak entity set is existence dependent, plus the weak entity set’s
discriminator.
We depict a weak entity set by double rectangles.
Note: the primary key of the strong entity set is not explicitly stored with the weak entity set, since it is implicit in the identifying relationship.
If loan_number were explicitly stored, payment could be made a strong entity, but then
the relationship between payment and loan would be duplicated by an implicit
relationship defined by the attribute loan_number common to payment and loan
More Weak Entity Set Examples
If a weak entity set stored the owner's key as an ordinary attribute (e.g., a course_number attribute), it could be made a strong entity; but then the relationship with course would be implicit in the course_number attribute.
Aggregation
However, some works_on relationships may not correspond to any manages relationships.
So we can’t discard the works_on relationship
Design choices:
Constraints in the ER Model:
Depends upon the use we want to make of address information, and the semantics of the
data:
If we have several addresses per employee, address must be an entity (since attributes cannot
be set-valued).
If the structure (city, street, etc.) is important, e.g., we want to retrieve employees in a given
city, address must be modeled as an entity (since attribute values are atomic).
An example in the other direction: a ternary relation Contracts relates entity sets Parts, Departments and Suppliers, and has descriptive attribute qty. No combination of binary relationships is an adequate substitute:
S “can-supply” P, D “needs” P, and D “deals-with” S does not imply that D has agreed to buy P from S.
– How do we record qty?
The Relational Model
A relation is made up of two parts: an instance (a table, with rows and columns, where the number of rows is the cardinality and the number of fields is the degree or arity) and a schema.
– Schema : specifies name of relation, plus name and type of each column. E.G.
Students (sid: string, name: string, login: string, age: integer, gpa: real).
Can think of a relation as a set of rows or tuples (i.e., all rows are distinct).
Example Instance of Students Relation
Relational Query Languages A major strength of the relational model: supports simple,
powerful querying of data.
Queries can be written intuitively, and the DBMS is responsible for efficient evaluation.
– Allows the optimizer to extensively re-order operations, and still ensure that the answer does
not change.
Creating Relations in SQL
Creates the Students relation. Observe that the type of each field is specified, and enforced
by the DBMS whenever tuples are added or modified.
As another example, the Enrolled table holds information about courses that students take.
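The CREATE TABLE statements being discussed are not reproduced in the notes; a sketch consistent with the Students and Enrolled schemas used here (the field sizes are illustrative assumptions):
CREATE TABLE Students (sid CHAR(20),
                       name CHAR(30),
                       login CHAR(20),
                       age INTEGER,
                       gpa REAL)
CREATE TABLE Enrolled (sid CHAR(20),
                       cid CHAR(20),
                       grade CHAR(2))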
IC: condition that must be true for any instance of the database; e.g., domain constraints.
DBMS should not allow illegal instances.
If the DBMS checks ICs, stored data is more faithful to real-world meaning.
No two distinct tuples can have the same values in all key fields, and this is not true for any subset of the key (i.e., a key is minimal).
– If there’s >1 key for a relation, one of the keys is chosen (by DBA) to be the
primary key.
E.g., sid is a key for Students. (What about name?) The set {sid, gpa} is a superkey.
Possibly many candidate keys (specified using UNIQUE), one of which is chosen as the
primary key.
Foreign Keys, Referential Integrity
Foreign key : Set of fields in one relation that is used to `refer’ to a tuple in another
relation. (Must correspond to primary key of the second relation.) Like a `logical
pointer’.
E.g. sid is a foreign key referring to Students:
Foreign Keys in SQL
Only students listed in the Students relation should be allowed to enroll for courses.
Consider Students and Enrolled; sid in Enrolled is a foreign key that references
Students.
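In SQL this is declared as part of CREATE TABLE; a sketch:
CREATE TABLE Enrolled (sid CHAR(20),
                       cid CHAR(20),
                       grade CHAR(2),
                       PRIMARY KEY (sid, cid),
                       FOREIGN KEY (sid) REFERENCES Students)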
What should be done if a Students tuple is deleted while Enrolled tuples refer to it? Options include:
– Also delete all Enrolled tuples that refer to it (CASCADE).
– Disallow the deletion of the Students tuple (reject it!).
– In SQL, also: set sid in Enrolled tuples that refer to it to a special value null, or to a default value: SET NULL / SET DEFAULT (sets the foreign key value of the referencing tuple).
ICs are based upon the semantics of the real-world enterprise that is being described in the
database relations.
We can check a database instance to see if an IC is violated, but we can NEVER infer that an IC is true by looking at an instance.
– From example, we know name is not a key, but the assertion that sid is a key is
given to us.
Key and foreign key ICs are the most common; more general ICs supported too.
Data Control Language (DCL): It is used to control privilege in database. To perform any
operations like creating tables, view and modifying we need privileges which are of two types.
System: privileges for performing operations such as creating sessions and tables.
Object: any command or query that works on tables comes under object privileges.
CONNECTING TO ORACLE:
Provide roles:
Provide privileges:
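The notes list these headings without statements; hedged sketches of typical Oracle-style commands (the role and user names are made up for illustration):
GRANT CONNECT, RESOURCE TO student_user;
GRANT SELECT, INSERT, UPDATE ON Student TO student_user;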
DDL: Data Definition Language
All DDL commands are auto-committed. That means it saves all the changes permanently in
the database.
CREATE command:
create is a DDL command used to create a table or a database.
Creating a database
To create a database in an RDBMS, the create command is used. Following is the Syntax,
Create database database-name;
Example for creating database
Create database Test;
The above command will create a database named Test.
Creating a table
create command is also used to create a table. We can specify the names and datatypes of the various columns along with it. Following is the Syntax,
create table table-name
(
    column-name1 datatype1,
    column-name2 datatype2,
    column-name3 datatype3,
    column-name4 datatype4
);
Create table command will tell the database system to create a new table with given table
name and column information.
Example for creating table
Create table Student(id int, name varchar(50), age int);
The above command will create a new table Student in database system with 3 columns,
namely id, name and age.
ALTER command
alter command is used for alteration of table structures. There are various uses
of alter command, such as,
to add a column to existing table
to rename any existing column
to change datatype of any column or to modify its size.
alter is also used to drop a column.
alter command can add a new column to an existing table with default values. Following is the
Syntax,
alter table table_name add (column_name datatype default data);
To rename a column
Using alter command you can rename an existing column. Following is the Syntax,
alter table table-name rename column old-column-name to new-column-name;
Here is an Example for this,
alter table Student rename column address to Location;
The above command will rename address column to Location.
To drop a column
alter command is also used to drop columns also. Following is the Syntax,
alter table table-name drop(column-name);
Here is an Example for this,
alter table Student drop(address);
The above command will drop address column from the Student table
The truncate command is different from the delete command. The delete command deletes rows from a table, whereas the truncate command re-initializes the table (like a newly created table).
For example, if you have a table with 10 rows and an auto_increment primary key, and you use the delete command to delete all the rows, the rows are deleted but the primary key counter is not reset; the next inserted row will get the key 11. But in the case of the truncate command, the primary key is re-initialized.
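The truncate syntax is not shown above; it follows the same pattern as the other commands:
truncate table table-name;
For example, truncate table Student; removes all rows from the Student table and re-initializes it.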
drop command
drop query completely removes a table from database. This command will also destroy the
table structure. Following is its Syntax,
drop table table-name;
Here is an Example explaining it.
drop table Student;
The above query will delete the Student table completely. It can also be used on Databases.
For Example, to drop a database,
drop database Test;
The above query will drop a database named Test from the system.
rename query
rename command is used to rename a table. Following is its Syntax,
rename table old-table-name to new-table-name;
Here is an Example explaining it.
rename table Student to Student-record;
The above query will rename Student table to Student-record.
DML COMMANDS:
INSERT command
Insert command is used to insert data into a table. Following is its general syntax,
insert into table_name values(data1,data2,…….);
Let's see an example. Consider a table Student with fields s_id, name and age, containing the row:

s_id   name    age
101    Adam    15

Inserting values for only some columns (e.g., s_id 102, name Alex) leaves the remaining column to take null or its default value:

s_id   name    age
101    Adam    15
102    Alex
Suppose the age column of student table has default value of 14.
Also, if you run the below query, it will insert default value into the age column, whatever the
default value may be.
INSERT into Student values(103,'Chris');
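A column list can also be given explicitly, so that only some fields are supplied; a sketch (assuming columns s_id and name):
INSERT into Student(s_id, name) values(104,'David');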
UPDATE command
Update command is used to update a row of a table. Following is its general syntax,
UPDATE table-name set column-name = value where condition;
Let's see an example,
update Student set age=18 where s_id=102;

s_id   name    age
101    Adam    15
102    Alex    18
103    Chris   14

Several columns can be updated at once; for example, update Student set name='Abhi', age=17 where s_id=103; gives:

s_id   name    age
101    Adam    15
102    Alex    18
103    Abhi    17
Delete command
The delete command is used to delete data from a table. It can also be used with a condition to delete a particular row. Following is its general syntax,
DELETE from table-name;
Example to Delete all Records from a Table
DELETE from Student;
The above command will delete all the records from Student table.
s_id   name    age
101    Adam    15
102    Alex    18
103    Abhi    17

To delete a particular row, add a condition; for example, DELETE from Student where s_id=103; leaves:

s_id   name    age
101    Adam    15
102    Alex    18
TCL command
Transaction Control Language (TCL) commands are used to manage transactions in the database. These are used to manage the changes made by DML statements. TCL also allows statements to be grouped together into logical transactions.
Commit command
The commit command is used to permanently save any transaction into the database.
Following is Commit command's syntax,
commit;
Rollback command
This command restores the database to the last committed state. It is also used with the savepoint command to jump to a savepoint within a transaction. Following is the rollback command's syntax,
rollback to savepoint-name;
Savepoint command
savepoint command is used to temporarily save a transaction so that you can rollback to
that point whenever necessary.
Following is savepoint command's syntax,
savepoint savepoint-name;
Consider a table class:

ID   NAME
1    abhi
2    adam
4    alex
Lets use some SQL queries on the above table and see the results.
INSERT into class values(5,'Rahul');
commit;
UPDATE class set name='abhijit' where id='5';
savepoint A;
INSERT into class values(6,'Chris');
savepoint B;
INSERT into class values(7,'Bravo');
savepoint C;
SELECT * from class;
The resultant table will look like:

ID   NAME
1    abhi
2    adam
4    alex
5    abhijit
6    Chris
7    Bravo
Now rollback to savepoint B
rollback to B;
SELECT * from class;
The resultant table will look like:

ID   NAME
1    abhi
2    adam
4    alex
5    abhijit
6    Chris
Now rollback to savepoint A
rollback to A;
SELECT * from class;
The result table will look like:

ID   NAME
1    abhi
2    adam
4    alex
5    abhijit
DCL command
DCL commands (GRANT and REVOKE) are used to control privileges, as described earlier. Beyond interactive SQL, there are also:
– Language extensions to allow embedded SQL
– Application program interfaces (e.g., ODBC/JDBC) which allow SQL queries to be sent to a database
Customer:
Example: Find the balances of all accounts held by the customer with customer-id 192-83-7465.
SQL> select account.balance
from depositor, account
where depositor.customer_id = '192-83-7465' and
      depositor.account_number = account.account_number;
Database Architecture:
The architecture of a database system is greatly influenced by the underlying computer system on which the database is running:
Centralized
Client-server
Distributed
Overall System Structure
Transaction Management:
A database system is partitioned into modules that deal with each of the responsibilities of the
overall system. The functional components of the database system are
– Storage management
– Query processing
– Transaction processing
Storage Management
Storage manager is a program module that provides the interface between the low-level
data stored in the database and the application programs and queries submitted to the
system.
Issues:
– Storage access
– File organization
– Indexing and hashing
Query Processing
Alternative ways of evaluating a given query:
– Equivalent expressions
– Different algorithms for each operation
Cost difference between a good and a bad way of evaluating a query can be enormous
– Depends critically on statistical information about relations which the database must
maintain
– Need to estimate statistics for intermediate results to compute cost of complex
expressions
Database Users
Users are differentiated by the way they expect to interact with the system
Specialized users – write specialized database applications that do not fit into the traditional data-processing framework
Naïve users – invoke one of the permanent application programs that have been written
previously
– Examples, people accessing database over the web, bank tellers, clerical staff
Database Administrator
–Backing up data
–Database tuning.
UNIT-II
Relational Approach
-----------------------------------------------------------------------------------------------------
Relational Algebra
Operations
Query examples
Relational Calculus
Basic operations: selection (σ), projection (π), cross-product (×), set-difference (−), union (∪).
Additional operations: intersection, join, division, renaming (not essential, but very useful).
Deletes attributes that are not in projection list.
Schema of result contains exactly the fields in the projection list, with the same names that
they had in the (only) input relation.
– Note: real systems typically don’t do duplicate elimination unless the user
explicitly asks for it. (Why not?)
Selects rows that satisfy selection condition.
Result relation can be the input for another relational algebra operation! (Operator
composition.)
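As a small worked illustration of composition (assuming a Sailors instance S2 with fields sid, sname, rating, age):
π_sname(σ_rating>8(S2))
Here σ_rating>8(S2) first selects the rows with rating above 8, and the projection then keeps only the sname field of those rows.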
Set Operations:
All of these operations take two input relations, which must be union-compatible: they have the same number of fields, and corresponding fields have the same type.
Cross-Product
Result schema has one field per field of S1 and R1, with field names `inherited’ if
possible.
Condition Join:
Equi-Join: A special case of condition join where the condition c contains only
equalities.
Result schema similar to cross-product, but only one copy of fields for which equality is
specified.
Find names of sailors who’ve reserved boat #103
Solution 1:
Information about boat color only available in Boats; so need an extra join:
Can identify all red or green boats, then find sailors who’ve reserved one of these boats:
Previous approach won’t work! Must identify sailors who’ve reserved red boats, sailors
who’ve reserved green boats, then find the intersection (note that sid is a key for Sailors):
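The algebra expressions themselves are not reproduced in the notes; sketches using the standard Sailors (sid, sname, rating, age), Boats (bid, bname, color) and Reserves (sid, bid, day) schema would be:
Solution 1 (sailors who've reserved boat #103):
π_sname((σ_bid=103 Reserves) ⋈ Sailors)
Sailors who've reserved a red boat (extra join through Boats):
π_sname((σ_color='red' Boats) ⋈ Reserves ⋈ Sailors)
Sailors who've reserved a red and a green boat (intersection on sid):
ρ(Tempred, π_sid((σ_color='red' Boats) ⋈ Reserves))
ρ(Tempgreen, π_sid((σ_color='green' Boats) ⋈ Reserves))
π_sname((Tempred ∩ Tempgreen) ⋈ Sailors)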
Relational Calculus:
Comes in two flavors: Tuple relational calculus (TRC) and Domain relational calculus
(DRC).
Calculus has variables, constants, comparison ops, logical connectives and quantifiers.
Expressions in the calculus are called formulas. An answer tuple is essentially an assignment
of constants to variables that make the formula evaluate to true.
TRC Formulas
Composite expressions:
(∀t)(F), (∃t)(F), where F is a formula and t is a tuple variable.
Free Variables
Obtain the rollNo, name of all girl students in the Maths Dept
{s.rollNo,s.name | student(s) ^ s.sex=‘F’ ^ (∃ d)(department(d) ^ d.name=‘Maths’ ^ d.deptId = s.deptNo)}
student (rollNo, name, degree, year, sex, deptNo, advisor)
department (deptId, name, hod, phone)
Get the names of departments that have no girl students:
{d.name | department(d) ^ ¬(∃s)(student(s) ^ s.sex =‘F’ ^ s.deptNo = d.deptId)}
Get the names of courses enrolled in by the student named Mahesh:
{c.name | course(c) ^ (∃s)(∃e)(student(s) ^ enrollment(e) ^ s.name = ‘Mahesh’ ^ s.rollNo = e.rollNo ^ c.courseId = e.courseId)}
Get the names of students who have scored ‘S’ in all subjects they have enrolled.
Assume that every student is enrolled in at least one course.
{s.name | student(s) ^ (∀e)(( enrollment(e) ^ e.rollNo = s.rollNo) → e.grade =‘S’)}
Get the names of students who have taken at least one course taught by their advisor
{s.name | student(s) ^ (∃e)(∃t)(enrollment(e) ^ teaching(t) ^ e.courseId = t.courseId ^ e.rollNo = s.rollNo ^ t.empId = s.advisor)}
DRC Formulas
Atomic formula:
– ⟨x1, x2, ..., xn⟩ ∈ Rname, or X op Y, or X op constant
– op is one of <, >, =, ≤, ≥, ≠
Formula:
– an atomic formula, or
– ¬p, p ∧ q, p ∨ q, where p and q are formulas, or
– ∃X(p(X)) or ∀X(p(X)), where X is a domain variable
The condition ensures that the domain variables I, N, T and A are bound to fields of the same
Sailors tuple.
• The term to the left of `|’ (which should be read as such that) says that every tuple
that satisfies T>7 is in the answer.
– Find sailors who are older than 18 or have a rating under 9, and are called ‘Joe’.
Note the use of ∃ to find a tuple in Reserves that `joins with’ the Sailors tuple under consideration.
Observe how the parentheses control the scope of each quantifier’s binding.
This may look cumbersome, but with a good user interface, it is very intuitive. (MS Access,
QBE)
• Find all sailors I such that, for each 3-tuple ⟨B, BN, C⟩, either it is not a tuple in Boats or there is a tuple in Reserves showing that sailor I has reserved it.
It is possible to write syntactically correct calculus queries that have an infinite number of
answers! Such queries are called unsafe.
– e.g., {S | ¬(S ∈ Sailors)}
It is known that every query that can be expressed in relational algebra can be expressed as a
safe query in DRC / TRC; the converse is also true.
Relational Completeness: Query language (e.g., SQL) can express every query that is
expressible in relational algebra/calculus.
UNIT-III
----------------------------------------------------------------------------------------------------------------
Functional Dependencies
Normal Forms
Decompositions
The Form of a Basic SQL Queries:
History
IBM Sequel language developed as part of System R project at the IBM San Jose
Research Laboratory
– SQL-86
– SQL-89
– SQL-92
– SQL:2003
Commercial systems offer most, if not all, SQL-92 features, plus varying feature sets from
later standards and special proprietary features.
Integrity constraints
create table r (A1 D1 , A2 D2, ..., An Dn ,
(integrity-constraint1),
...,
(integrity-constraintk))
Example:
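The example itself is missing here; a sketch using the bank schema that appears later in these notes:
create table branch
    (branch_name char(15),
     branch_city char(30),
     assets integer,
     primary key (branch_name))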
real, double precision. Floating point and double-precision floating point numbers, with
machine-dependent precision.
not null
primary key (A1, ..., An )
The drop table command deletes all information about the dropped relation from the
database.
The alter table command is used to add attributes to an existing relation:
alter table r add A D;
where A is the name of the attribute to be added to relation r and D is the domain of A.
– All tuples in the relation are assigned null as the value for the new attribute.
The alter table command can also be used to drop attributes of a relation:
alter table r drop A;
Basic Query Structure
A typical SQL query has the form
select A1, A2, ..., An
from r1, r2, ..., rm
where P
– Ai represents an attribute, ri a relation, and P is a predicate.
This query is equivalent to the relational algebra expression π_{A1, ..., An}(σ_P(r1 × ... × rm)). The result of an SQL query is a relation.
The select clause lists the attributes desired in the result of a query. For example, the query
select branch_name
from loan
corresponds to the relational algebra projection π_branch_name(loan).
NOTE: SQL names are case insensitive (i.e., you may use upper- or lower-case letters.)
To force the elimination of duplicates, insert the keyword distinct after select.
Find the names of all branches in the loan relation, and remove duplicates:
select distinct branch_name
from loan
The select clause can contain arithmetic expressions involving the operation, +, –, *, and /, and
operating on constants or attributes of tuples.
E.g.: select loan_number, branch_name, amount * 100 from loan
The where clause specifies conditions that the result must satisfy
To find all loan numbers for loans made at the Perryridge branch with loan amounts greater than $1200:
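Using the bank schema, the statement would be:
select loan_number
from loan
where branch_name = 'Perryridge' and amount > 1200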
Comparison results can be combined using the logical connectives and, or, and not.
The from clause lists the relations involved in the query; it corresponds to the Cartesian product. E.g., to compute borrower × loan:
select * from borrower, loan
old-name as new-name
E.g. Find the name, loan number and loan amount of all customers; rename the column
name loan_number as loan_id.
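A sketch of the statement:
select customer_name, borrower.loan_number as loan_id, amount
from borrower, loan
where borrower.loan_number = loan.loan_number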
Tuple Variables
Tuple variables are defined in the from clause via the use of the as clause.
Find the customer names and their loan numbers and amount for all customers having a
loan at some branch.
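A sketch using tuple variables T and S:
select customer_name, T.loan_number, S.amount
from borrower as T, loan as S
where T.loan_number = S.loan_number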
We will use these instances of the Sailors and Reserves relations in our examples.
If the key for the Reserves relation contained only the attributes sid and bid, how would the
semantics differ?
relation-list A list of relation names (possibly with a range-variable after each name).
DISTINCT is an optional keyword indicating that the answer should not contain duplicates.
Default is that duplicates are not eliminated!
This strategy is probably the least efficient way to compute a query! An optimizer will find
more efficient strategies to compute the same answers.
A Note on Range Variables
Really needed only if the same relation appears twice in the FROM clause. The previous
query can also be written as:
Would adding DISTINCT to this query make a difference?
What is the effect of replacing S.sid by S.sname in the SELECT clause? Would adding
DISTINCT to this variant of the query make a difference?
Illustrates use of arithmetic expressions and string pattern matching: Find triples (of ages of
sailors and two fields defined by expressions) for sailors whose names begin and end with B
and contain at least three characters.
LIKE is used for string matching. `_’ stands for any one character and `%’ stands for 0
or more arbitrary characters.
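A sketch of such a query over the Sailors relation (the names age1 and age2 are illustrative):
SELECT S.age, S.age - 5 AS age1, 2 * S.age AS age2
FROM Sailors S
WHERE S.sname LIKE 'B_%B'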
String Operations
Find the names of all customers whose street includes the substring “Main”.
select customer_name
from customer
where customer_street like '% Main%'
– concatenation (using “||”)
List in alphabetic order the names of all customers having a loan in Perryridge branch
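A sketch of the statement:
select distinct customer_name
from borrower, loan
where borrower.loan_number = loan.loan_number and branch_name = 'Perryridge'
order by customer_name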
We may specify desc for descending order or asc for ascending order, for each attribute;
ascending order is the default.
Duplicates
In relations with duplicates, SQL can define how many copies of tuples appear in the
result.
Multiset versions of some of the relational algebra operators – given multiset relations r1 and r2:
1. σ_θ(r1): If there are c1 copies of tuple t1 in r1, and t1 satisfies selection σ_θ, then there are c1 copies of t1 in σ_θ(r1).
2. Π_A(r1): For each copy of tuple t1 in r1, there is a copy of tuple Π_A(t1) in Π_A(r1), where Π_A(t1) denotes the projection of the single tuple t1.
3. r1 × r2: If there are c1 copies of tuple t1 in r1 and c2 copies of tuple t2 in r2, there are c1 × c2 copies of the tuple t1.t2 in r1 × r2.
Example: suppose multiset relations r1(A, B) = {(1, a), (2, a)} and r2(C) = {(2), (3), (3)}. Then Π_B(r1) would be {(a), (a)}, while Π_B(r1) × r2 would be {(a,2), (a,2), (a,3), (a,3), (a,3), (a,3)}.
Nested Queries:
A very powerful feature of SQL: a WHERE clause can itself contain an SQL query!
(Actually, so can FROM and HAVING clauses.)
To understand semantics of nested queries, think of a nested loops evaluation: For each
Sailors tuple, check the qualification by computing the subquery.
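A representative nested query over the Sailors/Reserves schema (names of sailors who have reserved boat 103):
SELECT S.sname
FROM Sailors S
WHERE S.sid IN (SELECT R.sid
                FROM Reserves R
                WHERE R.bid = 103)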
EXISTS is another set comparison operator, like IN.
If UNIQUE is used, and * is replaced by R.bid, finds sailors with at most one reservation
for boat #103. (UNIQUE checks for duplicate tuples; * denotes all attributes. Why do we have
to replace * by R.bid?)
Illustrates why, in general, subquery must be re-computed for each Sailors tuple.
A common use of subqueries is to perform tests for set membership, set comparisons, and
set cardinality.
The set operations union, intersect, and except operate on relations and correspond to the
relational algebra operations
Each of the above operations automatically eliminates duplicates; to retain all duplicates
use the corresponding multiset versions union all, intersect all and except all.
Suppose a tuple occurs m times in r and n times in s. Then it occurs:
– m + n times in r union all s
– min(m, n) times in r intersect all s
– max(0, m − n) times in r except all s
We’ve already seen IN, EXISTS and UNIQUE. Can also use NOT IN, NOT EXISTS and
NOT UNIQUE.
Find sailors whose rating is greater than that of some sailor called Horatio:
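A sketch of the statement:
SELECT *
FROM Sailors S
WHERE S.rating > ANY (SELECT S2.rating
                      FROM Sailors S2
                      WHERE S2.sname = 'Horatio')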
To find names (not sid’s) of Sailors who’ve reserved both red and green boats, just
replace S.sid by S.sname in SELECT clause. (What about INTERSECT query?)
Division in SQL
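The query is not reproduced in the notes; the classic division example (find sailors who have reserved all boats), written with nested NOT EXISTS, is:
SELECT S.sname
FROM Sailors S
WHERE NOT EXISTS (SELECT B.bid
                  FROM Boats B
                  WHERE NOT EXISTS (SELECT R.bid
                                    FROM Reserves R
                                    WHERE R.bid = B.bid
                                      AND R.sid = S.sid))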
Aggregate Operators:
These functions operate on the multiset of values of a column of a relation, and return a
value
Aggregate Operators examples
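Representative examples over the Sailors relation:
SELECT COUNT(*) FROM Sailors S
SELECT AVG(S.age) FROM Sailors S WHERE S.rating = 10
SELECT COUNT(DISTINCT S.rating) FROM Sailors S WHERE S.sname = 'Bob'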
Motivation for Grouping
Consider: Find the age of the youngest sailor for each rating level.
– In general, we don’t know how many rating levels exist, and what the rating values for these
levels are!
– Suppose we know that rating values go from 1 to 10; we can write 10 queries
that look like this (!):
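Each of the 10 queries would look like the following, with the placeholder i replaced by a rating value from 1 to 10:
SELECT MIN(S.age)
FROM Sailors S
WHERE S.rating = i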
The target-list contains (i) attribute names (ii) terms with aggregate operations (e.g., MIN
(S.age)).
– The attribute list (i) must be a subset of grouping-list. Intuitively, each answer
tuple corresponds to a group, and these attributes must have a single value per group. (A group
is a set of tuples that have the same value for all attributes in grouping-list.)
Conceptual Evaluation
The cross-product of relation-list is computed, tuples that fail qualification are discarded,
`unnecessary’ fields are deleted, and the remaining tuples are partitioned into groups by the
value of attributes in grouping-list.
The group-qualification is then applied to eliminate some groups. Expressions in
group-qualification must have a single value per group!
Find the age of the youngest sailor with age ≥ 18, for each rating with at least 2 such sailors.
• Find the age of the youngest sailor with age ≥ 18, for each rating with at least 2 such sailors and with every sailor under 60.
• Find the age of the youngest sailor with age ≥ 18, for each rating with at least 2 sailors between 18 and 60.
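The first of these, written as a GROUP BY / HAVING query:
SELECT S.rating, MIN(S.age) AS minage
FROM Sailors S
WHERE S.age >= 18
GROUP BY S.rating
HAVING COUNT(*) > 1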
For each red boat, find the number of reservations for this boat Grouping over a join of
three relations.
What do we get if we remove B.color=‘red’ from the WHERE clause and add a HAVING
clause with this condition?
Find the age of the youngest sailor with age > 18, for each rating with at least 2 sailors (of any age).
Compare this with the query where we considered only ratings with at least 2 sailors over 18!
Find those ratings for which the average age is the minimum over all ratings
Find the names of all branches where the average account balance is more than $1,200.
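Using the bank schema, the statement is:
select branch_name, avg (balance)
from account
group by branch_name
having avg (balance) > 1200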
Null Values:
Field values in a tuple are sometimes unknown (e.g., a rating has not been assigned) or
inapplicable (e.g., no spouse’s name).
– Is rating > 8 true or false when rating is equal to null? What about AND, OR and NOT?
It is possible for tuples to have a null value, denoted by null, for some of their attributes
The predicate is null can be used to check for null values.
– Example: Find all loan number which appear in the loan relation with null
values for amount.
select loan_number
from loan
where amount is null
Logical Connectives: AND, OR, NOT
Total all loan amounts
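The statement:
select sum (amount)
from loan
(The result is null if there are no non-null amounts.)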
All aggregate operations except count(*) ignore tuples with null values on the
aggregated attributes.
“Some” Construct
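The example under this heading is missing; a standard one using the bank schema (find all branches that have greater assets than some branch located in Brooklyn):
select branch_name
from branch
where assets > some (select assets
                     from branch
                     where branch_city = 'Brooklyn')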
“All” Construct
Find the names of all branches that have greater assets than all branches located in
Brooklyn.
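The statement:
select branch_name
from branch
where assets > all (select assets
                    from branch
                    where branch_city = 'Brooklyn')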
“Exists” Construct
Find all customers who have an account at all branches located in Brooklyn.
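One way to write this, using except inside not exists (a sketch over the bank schema):
select distinct S.customer_name
from depositor as S
where not exists (
    (select branch_name
     from branch
     where branch_city = 'Brooklyn')
    except
    (select R.branch_name
     from depositor as T, account as R
     where T.account_number = R.account_number
       and S.customer_name = T.customer_name))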
The unique construct tests whether a subquery has any duplicate tuples in its result.
Find all customers who have at most one account at the Perryridge branch.
select T.customer_name
from depositor as T
where unique (
select R.customer_name
from account, depositor as R
where T.customer_name = R.customer_name and
R.account_number = account.account_number and
account.branch_name = 'Perryridge')
Example Query
Find all customers who have at least two accounts at the Perryridge branch.
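This uses not unique, the negation of the construct above:
select distinct T.customer_name
from depositor as T
where not unique (
    select R.customer_name
    from account, depositor as R
    where T.customer_name = R.customer_name and
          R.account_number = account.account_number and
          account.branch_name = 'Perryridge')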
Example Query
Delete the record of all accounts with balances below the average at the bank.
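The statement:
delete from account
where balance < (select avg (balance)
                 from account)
Note that as tuples are deleted the average changes; SQL first computes avg(balance) and finds all tuples to delete, and only then deletes them.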
Provide as a gift for all loan customers of the Perryridge branch, a $200 savings
account. Let the loan number serve as the account number for the new savings account
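A sketch of the two insert statements:
insert into account
    select loan_number, branch_name, 200
    from loan
    where branch_name = 'Perryridge';
insert into depositor
    select customer_name, loan_number
    from loan, borrower
    where branch_name = 'Perryridge'
      and loan.loan_number = borrower.loan_number;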
The select from where statement is evaluated fully before any of its results are inserted
–Motivation: insert into table1 select * from table1
Increase all accounts with balances over $10,000 by 6%, all other accounts receive 5%.
update account
set balance = balance * 1.06
where balance > 10000
update account
set balance = balance * 1.05
where balance <= 10000
The order of the two statements is important: if it were reversed, an account just under $10,000 could receive both raises.
Same query as before: increase all accounts with balances over $10,000 by 6%, all other accounts receive 5% – written as a single update using a case statement:
update account
set balance = case
when balance <= 10000 then balance *1.05
else balance * 1.06
end
Joined Relations
Join operations take two relations and return as a result another relation.
These additional operations are typically used as subquery expressions in the from
clause
Join condition – defines which tuples in the two relations match, and what attributes are
present in the result of the join.
Join type – defines how tuples in each relation that do not match any tuple in the other relation (based on the join condition) are treated.
Relation loan
Joined Relations – Examples
Natural join can get into trouble if two relations have an attribute with the same name that should not be equated.
Solution: rename the attributes, or specify the join condition explicitly (e.g., with an on or using clause).
Find the average account balance of those branches where the average account balance is greater than $1200.
select branch_name, avg_balance
from (select branch_name, avg (balance)
      from account
      group by branch_name) as branch_avg (branch_name, avg_balance)
where avg_balance > 1200
Note that we do not need to use the having clause, since we compute the temporary
(view) relation branch_avg in the from clause, and the attributes of branch_avg can be used
directly in the where clause.
Types of IC’s: Domain constraints, primary key constraints, foreign key constraints,
general constraints.
General Constraints
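General constraints are useful when ICs more complex than keys are involved; a sketch using a table constraint (following the Sailors schema used elsewhere in these notes):
CREATE TABLE Sailors (sid INTEGER,
                      sname CHAR(10),
                      rating INTEGER,
                      age REAL,
                      PRIMARY KEY (sid),
                      CHECK (rating >= 1 AND rating <= 10))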
Triggers and Active Databases:
Trigger: procedure that starts automatically if specified changes occur to the DBMS
Three parts:
– Event (activates the trigger)
– Condition (tests whether the trigger should run)
– Action (what happens if the trigger runs)
Example: a trigger that copies newly inserted young sailors into a separate table:
CREATE TRIGGER youngSailorUpdate
AFTER INSERT ON Sailors
REFERENCING NEW TABLE NewSailors
FOR EACH STATEMENT
    INSERT INTO YoungSailors(sid, name, age, rating)
    SELECT sid, name, age, rating
    FROM NewSailors N
    WHERE N.age <= 18
Logical DB Design:
Relationship sets are mapped to tables containing:
– the attributes of the relationship set, and
– keys for each participating entity set (as foreign keys).
• Each dept has at most one manager, according to the key constraint on Manages.
Since each department has a unique manager, we could instead combine Manages and
Departments.
Views and Security
Views can be used to present necessary information (or a summary), while hiding details in the underlying relation(s).
– Given YoungStudents, but not Students or Enrolled, we can find students s who are enrolled, but not the cid's of the courses they are enrolled in.
View Definition
A relation that is not of the conceptual model but is made visible to a user as a “virtual
relation” is called a view.
A view is defined using the create view statement, which has the form
create view v as <query expression>
where <query expression> is any legal SQL expression. The view name is represented by v.
Once a view is defined, the view name can be used to refer to the virtual relation that the
view generates.
Example Queries
Uses of Views
– Consider a user who needs to know a customer's name, loan number and branch name
–Define a view
(create view cust_loan_data as
select customer_name, borrower.loan_number,
branch_name from borrower, loan
where borrower.loan_number = loan.loan_number )
–Grant the user permission to read cust_loan_data, but not borrower or loan
Predefined queries to make writing of other queries easier
Processing of Views: when a view is created, the query expression is stored in the database along with the view name; the expression is substituted into any query that uses the view.
View Expansion
Let view v1 be defined by an expression e1 that may itself contain uses of view relations.
repeat
    Find any view relation vi in e1
    Replace the view relation vi by the expression defining vi
until no more view relations are present in e1
As long as the view definitions are not recursive, this loop will terminate
With Clause
The with clause provides a way of defining a temporary view whose definition is available only to the query in which the with clause occurs.
Find all branches where the total account deposit is greater than the average of the total account deposits at all branches.
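A sketch of the query:
with branch_total (branch_name, value) as
    (select branch_name, sum (balance)
     from account
     group by branch_name),
branch_total_avg (value) as
    (select avg (value)
     from branch_total)
select branch_name
from branch_total, branch_total_avg
where branch_total.value >= branch_total_avg.value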
Update of a View
Create a view of all loan data in the loan relation, hiding the amount attribute
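A sketch of the view, and of an insertion through it:
create view loan_branch as
    select loan_number, branch_name
    from loan
insert into loan_branch values ('L-37', 'Perryridge')
The insertion is represented in the loan relation by the tuple ('L-37', 'Perryridge', null), with the hidden amount attribute set to null.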
drop table Students destroys the relation Students: the schema information and the tuples are deleted.
Views
A view is just a relation, but we store a definition, rather than a set of tuples.
Introduction To Schema Refinement:
Main refinement technique: decomposition (replacing ABCD with, say, AB and BCD, or ACD and ABD).
Storing the same information redundantly, that is, in more than one place within a database, can lead to several problems: redundant storage, and update, insertion and deletion anomalies.
Consider a relation obtained by translating a variant of the Hourly Emps entity set
Ex: Hourly Emps(ssn, name, lot, rating, hourly wages, hours worked)
The key for Hourly_Emps is ssn. In addition, suppose that the hourly_wages attribute is determined by the rating attribute. That is, for a given rating value, there is only one hourly_wages value.
Decompositions:
Functional dependencies (ICs) can be used to identify such situations and to suggest refinements. The essential idea is that many problems arising from redundancy can be addressed by replacing a relation with a collection of smaller relations.
Each of the smaller relations contains a subset of the attributes of the original relation. We refer to this process as decomposition of the larger relation into the smaller relations.
We can deal with the redundancy in Hourly_Emps by decomposing it into two relations:

Wages:
rating   hourly_wages
8        10
5        7

Hourly_Emps2:
ssn           name        lot   rating   hours_worked
123-22-3666   Attishoo    48    8        40
231-31-5368   Smiley      22    8        30
131-24-3650   Smethurst   35    5        30
434-26-3751   Guldu       35    5        32
612-67-4134   Madayan     35    8        40
Unless we are careful, decomposing a relation schema can create more problems than it
solves.
To help with the first question, several normal forms have been proposed for relations. If a relation schema is in one of these normal forms, we know that certain kinds of problems cannot arise.
A functional dependency X → Y holds over relation R if, for every allowable instance r of R:
– i.e., given two tuples in r, if the X values agree, then the Y values must also agree. (X and Y are sets of attributes.)
– Given some allowable instance r1 of R, we can check if it violates some FD f, but we cannot tell if f holds over R!
Notation: We will denote a relation schema by listing the attributes, e.g., SNLRWH. Sometimes, we will refer to all attributes of a relation by using the relation name (e.g., Hourly_Emps for SNLRWH).
Suppose that we have entity sets Parts, Suppliers, and Departments, as well as a
relationship set Contracts that involves all of them. We refer to the schema for
Contracts as CQPSD. A contract with contract id C specifies that a supplier S will supply some quantity Q of a part P to a department D.
We might have a policy that a department purchases at most one part from any given
supplier.
Thus, if there are several contracts between the same supplier and department,
we know that the same part must be involved in all of them. This constraint is an FD,
DS → P.
– Reflexivity: If Y ⊆ X, then X → Y
– Augmentation: If X → Y, then XZ → YZ for any Z
– Transitivity: If X → Y and Y → Z, then X → Z
These are sound and complete inference rules for FDs!
– Example: Contracts(cid, sid, jid, did, pid, qty, value), denoted CSJDPQV, with FDs: C is the key (C → CSJDPQV), a project purchases each part using a single contract (JP → C), and a department purchases at most one part from any supplier (SD → P).
• JP → C and C → CSJDPQV imply JP → CSJDPQV
• SD → P implies SDJ → JP
Computing the closure of a set of FDs can be expensive. (Size of closure is exponential in #
attrs!)
An efficient check for whether a given FD X → Y is implied:
– Compute the attribute closure of X (denoted X+) with respect to F
– Check if Y is in X+
Closure of a Set of FDs
The set of all FDs implied by a given set F of FDs is called the closure of F and is
denoted as F+.
An important question is how we can infer, or compute, the closure of a given set F of
FDs.
The following three rules, called Armstrong's Axioms, can be applied repeatedly to infer all FDs implied by a set F of FDs.
Armstrong's Axioms are sound in that they generate only FDs in F+ when applied to a set
F of FDs.
They are complete in that repeated application of these rules will generate all FDs in the
closure F+.
These additional rules are not essential; their soundness can be proved using
Armstrong's Axioms.
Attribute Closure
If we just want to check whether a given dependency, say X → Y, is in the closure of a set F of FDs, we can do so efficiently without computing F+. We first compute the attribute closure X+ with respect to F, which is the set of attributes A such that X → A can be inferred using the Armstrong Axioms. The algorithm:
closure = X;
repeat until there is no change: {
    if there is an FD U → V in F such that U ⊆ closure,
    then set closure = closure ∪ V
}
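As a worked illustration with the Hourly_Emps FDs F = {S → SNLRWH, R → W}: to check whether R → W holds, start with closure = {R}; the FD R → W applies (since {R} ⊆ closure), giving closure = {R, W}; no other FD's left-hand side is contained in {R, W}, so R+ = {R, W}, W ∈ R+, and the FD is implied.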
Normal Forms:
The normal forms based on FDs are first normal form (1NF), second normal form (2NF), third normal form (3NF), and Boyce-Codd normal form (BCNF).
These forms have increasingly restrictive requirements: Every relation in BCNF is also in
3NF,
every relation in 3NF is also in 2NF, and every relation in 2NF is in 1NF.
A relation
is in first normal form if every field contains only atomic values, that is, not lists or
sets.
Although some of the newer database systems are relaxing this requirement, we will assume that fields contain atomic values.
Normal Forms
Returning to the issue of schema refinement, the first question to ask is whether any
refinement is needed!
If a relation is in a certain normal form (BCNF, 3NF etc.), it is known that certain kinds of problems are avoided or minimized. For example, suppose the FD A → B holds over a relation:
• Given A, B: several tuples could have the same A value, and if so, they'll all have the same B value – redundancy!
A relation R is in 2NF if and only if it is in 1NF and every nonkey column depends on a whole key, not on a proper subset of a key; that is, all nonprime attributes of R must be fully functionally dependent on a whole key of the relation, not on a part of the key.
Example of partial (part-of-key) dependencies: SSN → ENAME, PNO → PNAME.
A relation R is in 3NF if and only if it is in 2NF and no nonkey column depends on another nonkey column.
Example of such a transitive dependency: STATE → TAX (nonkey → nonkey).
In other words, R is in BCNF if the only non-trivial FDs that hold over R are key
constraints.
– If we are shown two tuples that agree upon the X value, we cannot infer the A value in one
tuple from the A value in the other.
BCNF:
Relation R with FDs F is in BCNF if, for all X → A in F+: A ∈ X (a trivial FD), or X contains a key for R.
Properties of Decompositions:
Suppose relation R contains attributes A1 ... An. A decomposition of R replaces R by two or more relations such that:
– Each new relation scheme contains a subset of the attributes of R (and no attributes that do not appear in R), and
– Every attribute of R appears as an attribute of at least one of the new relations.
Intuitively, decomposing R means we will store instances of the relation schemes produced
by the decomposition, instead of instances of R.
Example Decomposition
– The second FD causes a violation of 3NF; W values are repeatedly associated with R values. The easiest way to fix this is to create a relation RW to store these associations, and to remove W from the main schema:
The information to be stored consists of SNLRWH tuples. If we just store the projections of
these tuples onto SNLRH and RW, are there any potential problems that we should be aware
of?
Problems with Decompositions
There are three potential problems to consider:
– Some queries become more expensive.
– Given instances of the decomposed relations, we may not be able to reconstruct the corresponding instance of the original relation!
– Checking some dependencies may require joining the instances of the decomposed relations.
Decomposition of R into X and Y is lossless-join w.r.t. a set of FDs F if, for every instance r that satisfies F:
– π_X(r) ⋈ π_Y(r) = r
– It is always true that r ⊆ π_X(r) ⋈ π_Y(r); in general, the other direction does not hold! If it does, the decomposition is lossless-join.
It is essential that all decompositions used to deal with redundancy be lossless! (Avoids
Problem (2).)
Dependency Preserving Decomposition
– If R is decomposed into X, Y and Z, and we enforce the FDs that hold on X, on Y and on Z,
then all FDs that were given to hold on R must also hold. (Avoids Problem (3).)
Let F_X denote the projection of F onto X: the set of FDs in the closure F+ that involve only attributes in X. The decomposition of R into X and Y is dependency preserving if (F_X ∪ F_Y)+ = F+
– i.e., if we consider only dependencies in the closure F+ that can be checked in X without considering Y, and in Y without considering X, these imply all dependencies in F+.
Decomposition into BCNF: suppose relation R has FDs F, and X → Y violates BCNF. One approach is to decompose R into R − Y and XY.
– Repeated application of this idea will give us a collection of relations that are in BCNF;
lossless join decomposition, and guaranteed to terminate.
In general, several dependencies may cause violation of BCNF. The order in which we “deal with” them could lead to very different sets of relations!
– e.g., CSZ with FDs CS → Z and Z → C
The decomposition of CSJDPQV into SDP, JS and CJDQV is not dependency preserving (w.r.t. the FDs JP → C, SD → P and J → S).
– However, it is a lossless join decomposition.
– In this case, adding JPC to the collection of relations gives us a dependency preserving decomposition.
Obviously, the algorithm for lossless join decomposition into BCNF can be used to obtain a lossless join decomposition into 3NF (typically, we can stop earlier).
– The problem is that XY may violate 3NF! E.g., consider the addition of CJP to `preserve' JP → C. What if we also have J → C?
Refinement: Instead of the given set of FDs F, use a minimal cover for F.
Consider the Hourly Emps relation again. The constraint that attribute ssn is a key can be
expressed as an FD:
{ssn} → {ssn, name, lot, rating, hourly_wages, hours_worked}
For brevity, we will write this FD as S -> SNLRWH, using a single letter to denote each
attribute
In addition, the constraint that the hourly_wages attribute is determined by the rating attribute is an FD: R → W.
The previous example illustrated how FDs can help to refine the subjective decisions made during ER design, but one could argue that the best possible ER diagram would have led to the same final set of relations.
Our next example shows how FD information can lead to a set of relations that we are unlikely to arrive at through ER design alone; in particular, it shows that attributes can easily be associated with the `wrong' entity set during ER design.
The ER diagram shows a relationship set called Works In that is similar to the Works In
relationship set
Using the key constraint, we can translate this ER diagram into two relations:
Identifying Entity Sets
Let Reserves contain attributes S, B, and D as before, indicating that sailor S has a reservation for boat B on day D.
In addition, let there be an attribute C denoting the credit card to which the reservation is
charged.
Suppose that every sailor uses a unique credit card for reservations. This constraint is
expressed by the FD S -> C. This constraint indicates that in relation Reserves, we store the
credit card number
Multivalued Dependencies:
Suppose that we have a relation with attributes course, teacher, and book, which we
denote as CTB.
The meaning of a tuple is that teacher T can teach course C, and book B is a recommended text for the course.
There are no FDs; the key is CTB. However, the recommended texts for a course are independent of the instructor.
There is redundancy. The fact that Green can teach Physics101 is recorded once per
recommended text for the course. Similarly, the fact that Optics is a text for Physics101
is recorded once per potential teacher.
Let R be a relation schema and let X and Y be subsets of the attributes of R. Intuitively, the multivalued dependency X →→ Y holds over R if, in every legal instance of R, the set of Y values associated with a given X value is independent of the values of the remaining attributes.
The redundancy in this example is due to the constraint that the texts for a course are independent of the instructors, which cannot be expressed in terms of FDs. We should model this situation using two binary relationship sets, Instructors with attributes CT and Text with attributes CB.
Because these are two essentially independent relationships, modeling them with a single ternary relationship set is inappropriate.
Two additional inference rules for MVDs:
– MVD Augmentation: If X →→ Y and W ⊇ Z, then WX →→ YZ.
– MVD Transitivity: If X →→ Y and Y →→ Z, then X →→ (Z − Y).
R is said to be in fourth normal form (4NF) if, for every MVD X →→ Y that holds over R, one of the following statements is true:
– Y ⊆ X or XY = R (the MVD is trivial), or
– X is a superkey.
Join Dependencies:
A join dependency (JD) ∞{R1, ..., Rn} is said to hold over a relation R if R1, ..., Rn is a lossless-join decomposition of R. An MVD X →→ Y over R can be expressed as the join dependency ∞{XY, X(R − Y)}.
As an example, in the CTB relation, the MVD C →→ T can be expressed as the join dependency ∞{CT, CB}.
Unlike FDs and MVDs, there is no set of sound and complete inference rules for JDs.
A relation schema R is said to be in fifth normal form (5NF) if, for every JD ∞{R1, ..., Rn} that holds over R, one of the following is true:
– Ri = R for some i, or
– The JD is implied by the set of those FDs over R in which the left side is a key for R.
The following result, also due to Date and Fagin, identifies conditions (again, detected using only FD information) under which we can safely ignore JD information:
If a relation schema is in 3NF and each of its keys consists of a single attribute, it is also in 5NF.
Inclusion Dependencies:
MVDs and JDs can be used to guide database design, as we have seen, although they are less common than FDs and harder to recognize and reason about.
In contrast, inclusion dependencies are very intuitive and quite common; however, they typically have little influence on schema design. The main point to bear in mind is that we should not split groups of attributes that participate in an inclusion dependency.
Most inclusion dependencies in practice are key-based, that is, involve only keys.
UNIT-IV
Transaction Management
----------------------------------------------------------------------------------------------------------------
ACID Properties
Need for concurrency control
Transaction and its properties
Schedule and Recoverability
Serializability and schedules
Concurrency control
Types of Locks
Two phase locking
Deadlock
Time stamp based concurrency control
Recovery Techniques
Immediate update
Deferred update
Shadow paging
ACID Properties
Consistency:
Execution of a transaction in isolation (that is, with no other transaction executing
concurrently) preserves the consistency of the database. This is typically the responsibility of
the application programmer who codes the transactions.
Atomicity:
Either all operations of the transaction are reflected properly in the database, or none are.
Clearly lack of atomicity will lead to inconsistency in the database.
Isolation:
When multiple transactions execute concurrently, it should be the case that, for every pair of
transactions Ti and Tj , it appears to Ti that either Tj finished execution before Ti started, or Tj
started execution after Ti finished. Thus, each transaction is unaware of other transactions
executing concurrently with it. The user view of a transaction system requires the isolation
property, and the property that concurrent schedules take the system from one consistent state
to another. These requirements are satisfied by ensuring that only serializable schedules of
individually consistency preserving transactions are allowed.
Durability:
After a transaction completes successfully, the changes it has made to the database persist, even
if there are system failures.
Transactions access data using two operations:
Read(X): transfers the data item X from the database to a local buffer of the transaction.
Write(X): transfers the data item X from the local buffer of the transaction back to the database.
In a real database system, the write operation is temporarily applied in memory and written to disk later.
Example:
Bank transactions like credit, debit or transfer of amount from one account to another or
updates on same account.
Let Ti be a transaction that transfers 5000 from account A to account B. Initially in account
A 10000 and account B 20000 balance existed. This can be represented as:
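The transaction itself can be written as:
Ti: read(A);
    A := A - 5000;
    write(A);
    read(B);
    B := B + 5000;
    write(B)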
Implicit integrity constraints
e.g., the sum of balances of all accounts, minus the sum of loan amounts, must equal the value of cash-in-hand.
A transaction must see a consistent database. During transaction execution the database
may be temporarily inconsistent. When the transaction completes successfully the database
must be consistent. Erroneous transaction logic can lead to inconsistency.
Atomicity: All operations in the transaction should be executed without any failure. Before execution of transaction Ti, accounts A and B have initial values 10000 and 20000. Suppose that during the transfer transaction a failure occurs (due to a power failure, or a hardware or software error). If the failure happens after write(A) and before write(B), then the values of A and B are 5000 and 20000, and the system has destroyed 5000 as a result of the failure. The sum A + B before and after the transaction is no longer the same, which leads to inconsistency.
Durability:
The durability property guarantees that, once the transaction completes successfully, all the
updates on the database must be persistent, even if there is a failure after the transaction
completes.
Ensuring durability is the responsibility of the recovery management component. Once the user has been notified about the successful completion of a transaction, it must be the case that
Initially before Transaction:
A=10000 and B=20000
A+B =10000+20000=30000
After Transaction (transfer of 5000 from A to B)
A=10000-5000=5000
Suppose a failure occurs at this point (before B is updated)
Now A+B=5000+20000=25000.
Hence, the sum of the database contents before and after is not the same: 30000 versus 25000.
no system failure will result in a loss of data corresponding to the transfer of funds.
Isolation:
Isolation can be ensured trivially by running transactions serially that is, one after the other.
However, executing multiple transactions concurrently has significant benefits, as we will see later. Uncontrolled concurrent operation of multiple transactions can lead to an inconsistent state. Ensuring isolation is the responsibility of the concurrency control component.
If Ti and Tj are two transactions executed concurrently, their operations may be interleaved in an undesirable way, resulting in an inconsistent state.
Transaction State:
A transaction must be in one of the following states:
Active
Partially committed
Failed
Aborted
Committed
Active State:
The initial state of the transaction while it is executing.
Partially Committed:
After the final statement of the transaction has been executed.
Failed:
The transaction no longer proceed with normal execution, then it is in failed state.
Aborted:
After the transaction has been rolled back and the database has been restored to the prior to
the state of the transaction. Two options after it has been aborted:
Restart the transaction can be done only if no internal logical error
Kill the transaction
Committed: After successful completion of the transaction.
Schedules: A schedule is a sequence of instructions that specifies the chronological order in which the instructions of concurrent transactions are executed. A schedule for a set of transactions must consist of all instructions of those transactions, and must preserve the order in which the instructions appear in each individual transaction.
A transaction that successfully completes its execution will have a commit instruction as its last statement; by default, a transaction is assumed to execute a commit instruction as its last step.
A transaction that fails to successfully complete its execution will have an abort instruction as its last statement.
Concurrent executions:
A transaction-processing system allows multiple transactions to run concurrently. This can lead to problems such as inconsistency of the data, and ensuring consistency of concurrent executions requires additional work to make them serializable. Concurrent execution is nevertheless allowed, for two major reasons:
Improved throughput and resource utilization.
Reduced waiting time.
Schedule 1
A serial schedule in which T1 is followed by T2.
Schedule 2
A serial schedule in which T2 is followed by T1.
Schedules 1 and 2 are serial schedules: each consists of a series of instructions from various transactions, where the instructions belonging to a single transaction appear together. Schedule 3 is an example of a concurrent schedule, in which the two transactions T1 and T2 run concurrently: the OS may execute a part of T1, switch to the second transaction T2, then switch back to the first transaction for some time, and so on. That is, CPU time is shared among all the transactions.
Schedule 3
Let T1 and T2 be the transactions defined previously. The following schedule, Schedule 4, is not a serial schedule and does not preserve consistency:
Schedule 4
    T1                    T2
    read(A)
    A := A - 50
                          read(A)
                          temp := A * 0.1
                          A := A - temp
                          write(A)
                          read(B)
    write(A)
    read(B)
    B := B + 50
    write(B)
                          B := B + temp
                          write(B)
In Schedule 4, the CPU slices time between the transactions in a different way. The final values are A = 950 and B = 2100, so A + B = 3050, which differs from the sum of 3000 before the transactions; this schedule therefore leaves the database in an inconsistent state.
Schedule 3
In the above schedule, the write(A) of T1 conflicts with the read(A) of T2. However, the write(A) of T2 does not conflict with the read(B) of T1, because the two operations do not refer to the same data item.
    T1                    T2
    read(A)
    write(A)
                          read(A)
    read(B)
                          write(A)
    write(B)
                          read(B)
                          write(B)
Schedule 5 – Schedule 3 after swapping a pair of instructions
    T1                    T2
    read(A)
    write(A)
    read(B)
    write(B)
                          read(A)
                          write(A)
                          read(B)
                          write(B)
Schedule 6 – A serial schedule equivalent to Schedule 3
Conflicting Instructions
Instructions li and lj of transactions Ti and Tj respectively conflict if and only if there exists some item Q accessed by both li and lj, and at least one of these instructions wrote Q.
li = read(Q), lj = read(Q): li and lj do not conflict.
li = read(Q), lj = write(Q): they conflict.
li = write(Q), lj = read(Q): they conflict.
li = write(Q), lj = write(Q): they conflict.
Intuitively, a conflict between li and lj forces a (logical) temporal order between them. If li and lj are consecutive in a schedule and they do not conflict, their results would remain the same even if they had been interchanged in the schedule.
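The conflict test is mechanical enough to state in code. The following is a minimal sketch, not any DBMS's API; operations are modeled here as (transaction, action, item) triples, an assumption made purely for illustration.

    # Two operations conflict iff they come from different transactions,
    # touch the same data item, and at least one of them is a write.
    def conflicts(op1, op2):
        t1, action1, item1 = op1          # e.g. ("T1", "write", "Q")
        t2, action2, item2 = op2
        return (t1 != t2 and item1 == item2
                and "write" in (action1, action2))

    print(conflicts(("T1", "read", "Q"), ("T2", "read", "Q")))   # False
    print(conflicts(("T1", "read", "Q"), ("T2", "write", "Q")))  # True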
Conflict Serializability
If a schedule S can be transformed into a schedule S´ by a series of swaps of non-
conflicting instructions, we say that S and S´ are conflict equivalent.
We say that a schedule S is conflict serializable if it is conflict equivalent to a serial
schedule
Schedule 3 can be transformed into Schedule 6, a serial schedule where T2 follows T1, by a series of swaps of non-conflicting instructions; therefore, Schedule 3 is conflict serializable.
View Serializability:
Let S and S´ be two schedules with the same set of transactions. S and S´ are view
equivalent if the following three conditions are met, for each data item Q,
If in schedule S, transaction Ti reads the initial value of Q, then in schedule S’ also
transaction Ti must read the initial value of Q.
If in schedule S transaction Ti executes read(Q), and that value was produced by transaction
Tj (if any), then in schedule S’ also transaction Ti must read the value of Q that was produced by the
same write(Q) operation of transaction Tj .
The transaction (if any) that performs the final write(Q) operation in schedule S must also
perform the final write(Q) operation in schedule S’.
As can be seen, view equivalence is also based purely on reads and writes alone.
A schedule S is view serializable if it is view equivalent to a serial schedule.
Every conflict serializable schedule is also view serializable.
Below is a schedule which is view serializable but not conflict serializable. (Which serial schedule is the above equivalent to?)
Every view serializable schedule that is not conflict serializable has blind writes.
Other Notions of Serializability
The schedule below produces the same outcome as the serial schedule <T1, T5>, yet is not conflict equivalent or view equivalent to it. Determining such equivalence requires analysis of operations other than read and write.
Recoverability:
Recoverable schedule — if a transaction Tj reads a data item previously written by a
transaction Ti , then the commit operation of Ti appears before the commit operation of
Tj.
The following schedule (Schedule 11) is not recoverable if T9 commits immediately after the read:
If T8 should abort, T9 would have read (and possibly shown to the user) an inconsistent database state. Hence, the database must ensure that schedules are recoverable.
Cascading Rollbacks:
Cascading rollback – a single transaction failure leads to a series of transaction
rollbacks. Consider the following schedule where none of the transactions has yet
committed (so the schedule is recoverable)
Implementation of Isolation:
Schedules must be conflict or view serializable, and recoverable, for the sake of
database consistency, and preferably cascadeless.
A policy in which only one transaction can execute at a time generates serial schedules,
but provides a poor degree of concurrency.
Concurrency-control schemes trade off between the amount of concurrency they allow
and the amount of overhead that they incur.
Some schemes allow only conflict-serializable schedules to be generated, while others
allow view-serializable schedules that are not conflict-serializable.
Testing for Serializability:
To test a schedule for conflict serializability, construct a precedence graph with a vertex for each transaction and an edge from Ti to Tj whenever an operation of Ti conflicts with a later operation of Tj; the schedule is conflict serializable if and only if this graph is acyclic. Simple cycle-detection algorithms take time quadratic in the number of vertices in the graph.
(Better algorithms take order n + e, where n is the number of vertices and e is the number of edges.)
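As a sketch of this test, again modeling operations as (transaction, action, item) triples (an assumption of this example, not a standard representation):

    # Build a precedence graph from a schedule given as a list of
    # (txn, action, item) triples in execution order, then test for a cycle.
    def precedence_graph(schedule):
        edges = set()
        for i, (ti, ai, qi) in enumerate(schedule):
            for tj, aj, qj in schedule[i + 1:]:
                if ti != tj and qi == qj and "write" in (ai, aj):
                    edges.add((ti, tj))      # Ti must precede Tj
        return edges

    def has_cycle(edges):
        graph = {}
        for u, v in edges:
            graph.setdefault(u, []).append(v)
        visiting, done = set(), set()
        def dfs(u):
            visiting.add(u)
            for v in graph.get(u, []):
                if v in visiting or (v not in done and dfs(v)):
                    return True
            visiting.discard(u)
            done.add(u)
            return False
        return any(dfs(u) for u in list(graph) if u not in done)

    # Schedule 3 from above: the only edge is T1 -> T2, so no cycle is
    # found and the schedule is conflict serializable.
    s3 = [("T1", "read", "A"), ("T1", "write", "A"), ("T2", "read", "A"),
          ("T2", "write", "A"), ("T1", "read", "B"), ("T1", "write", "B"),
          ("T2", "read", "B"), ("T2", "write", "B")]
    print(has_cycle(precedence_graph(s3)))   # False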
Concurrency-control protocols generally do not examine the precedence graph as it is being created; instead, a protocol imposes a discipline that avoids nonserializable schedules.
Different concurrency control protocols provide different tradeoffs between the
amount of concurrency they allow and the amount of overhead that they incur.
Tests for serializability help us understand why a concurrency control protocol is
correct.
Weak Levels of Consistency
Some applications are willing to live with weak levels of consistency, allowing
schedules that are not serializable
o E.g., a read-only transaction that wants to get an approximate total balance of all accounts.
E.g. database statistics computed for query optimization can be approximate (why?)
In SQL these weaker levels of consistency appear as isolation levels:
Serializable: the default.
Repeatable read: only committed records may be read, and repeated reads of the same record must return the same value. However, a transaction may not be serializable: it may find some records inserted by another transaction but not find others.
Read committed: only committed records may be read, but successive reads of a record may return different (but committed) values.
Read uncommitted: even uncommitted records may be read.
Transaction Definition in SQL:
The data manipulation language must include a construct for specifying the set of actions that comprise a transaction. In SQL, a transaction begins implicitly. A transaction in SQL ends by:
Commit work: commits the current transaction and begins a new one.
Rollback work: causes the current transaction to abort.
In almost all database systems, by default, every SQL statement also commits implicitly if it executes successfully. This implicit commit can be turned off by a database directive, e.g., in JDBC, connection.setAutoCommit(false);
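The same pattern can be shown with Python's built-in sqlite3 module, whose DB-API connections likewise begin a transaction implicitly and make nothing durable until an explicit commit. A minimal sketch (the table and column names are invented for this example):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE account(name TEXT PRIMARY KEY, balance INT)")
    conn.execute("INSERT INTO account VALUES ('A', 10000), ('B', 20000)")
    conn.commit()

    try:
        # Both updates belong to one implicit transaction.
        conn.execute("UPDATE account SET balance = balance - 5000 WHERE name = 'A'")
        conn.execute("UPDATE account SET balance = balance + 5000 WHERE name = 'B'")
        conn.commit()       # like COMMIT WORK: the transfer becomes durable
    except Exception:
        conn.rollback()     # like ROLLBACK WORK: neither update survives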
Types of Locks
There are various modes to lock data items. They are
Shared (S): if a transaction Ti has a shared-mode lock on data item Q, then Ti can read but not write Q. The lock-S(Q) instruction requests a lock in shared mode.
Exclusive (X): if a transaction Ti has obtained an exclusive-mode lock on data item Q, then Ti can both read and write Q. The lock-X(Q) instruction requests a lock in exclusive mode.
A lock is a mechanism to control concurrent access to a data item. Lock requests are made to
concurrency-control manager. Transaction can proceed only after request is granted.
Lock-compatibility matrix (the standard matrix; an S lock is compatible only with other S locks):
                 S        X
        S       true     false
        X       false    false
A transaction may be granted a lock on an item if the requested lock is compatible with the locks already held on the item by other transactions. Any number of transactions can hold shared locks on an item, but if any transaction holds an exclusive lock on the item, no other transaction may hold any lock on it. If a lock cannot be granted, the requesting transaction is made to wait until all incompatible locks held by other transactions have been released; the lock is then granted.
Example of a transaction performing locking:
Locking as above is not sufficient to guarantee serializability — if A and B get updated in-
between the read of A and B, the displayed sum would be wrong.
A locking protocol is a set of rules followed by all transactions while requesting and releasing
locks. Locking protocols restrict the set of possible schedules. Consider the partial schedule
Neither T3 nor T4 can make progress — executing lock-S(B) causes T4 to wait for T3 to release
its lock on B, while executing lock-X(A) causes T3 to wait for T4 to release its lock on A. Such
a situation is called a deadlock. To handle a deadlock one of T3 or T4 must be rolled back and
its locks released. The potential for deadlock exists in most locking protocols. Deadlocks are a
necessary evil.
Starvation is also possible if the concurrency-control manager is badly designed. For example, a transaction may wait forever for an X-lock on an item while a sequence of other transactions request and are granted S-locks on the same item, or the same transaction may be repeatedly rolled back due to deadlocks. The concurrency-control manager can be designed to prevent starvation.
Two-Phase Locking Protocol
This protocol ensures conflict-serializable schedules.
Phase 1: Growing Phase
transaction may obtain locks
transaction may not release locks
Phase 2: Shrinking Phase
transaction may release locks
transaction may not obtain locks
The protocol assures serializability. It can be proved that the transactions can be
serialized in the order of their lock points (i.e. the point where a transaction acquired its
final lock).
Two-phase locking does not ensure freedom from deadlocks, and cascading rollback is possible under it. To avoid cascading rollback, a modified protocol called strict two-phase locking is followed: a transaction must hold all its exclusive locks until it commits or aborts. This protocol assures serializability but still relies on the programmer to insert the various locking instructions.
Automatic Acquisition of Locks :
A transaction Ti issues the standard read/write instructions, without explicit locking calls.
The operation read(D) is processed as:
    if Ti has a lock on D
        then read(D)
    else begin
        if necessary, wait until no other transaction has a lock-X on D;
        grant Ti a lock-S on D;
        read(D)
    end
The operation write(D) is processed as:
    if Ti has a lock-X on D
        then write(D)
    else begin
        if necessary, wait until no other transaction has any lock on D;
        if Ti has a lock-S on D
            then upgrade the lock on D to lock-X
        else grant Ti a lock-X on D;
        write(D)
    end
All locks are released after commit or abort.
Implementation of Locking:
A lock manager can be implemented as a separate process to which transactions send lock and
unlock requests
The lock manager replies to a lock request by sending a lock-grant message (or a message asking the transaction to roll back, in case of a deadlock). The requesting transaction waits until its request is answered.
The lock manager maintains a data-structure called a lock table to record granted locks and
pending requests
The lock table is usually implemented as an in-memory hash table indexed on the name of
the data item being locked
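A toy sketch of such a lock table follows: a dict keyed on the item name, holding granted locks and a FIFO queue of waiting requests. The compatibility check mirrors the matrix above; lock upgrades and deadlock detection are deliberately left out of this sketch.

    from collections import defaultdict, deque

    COMPATIBLE = {("S", "S"): True, ("S", "X"): False,
                  ("X", "S"): False, ("X", "X"): False}

    class LockTable:
        def __init__(self):
            self.granted = defaultdict(dict)    # item -> {txn: mode}
            self.waiting = defaultdict(deque)   # item -> queue of (txn, mode)

        def request(self, txn, item, mode):
            held = self.granted[item]
            if all(COMPATIBLE[(m, mode)] for t, m in held.items() if t != txn):
                held[txn] = mode                # grant immediately
                return "granted"
            self.waiting[item].append((txn, mode))
            return "wait"

        def release(self, txn, item):
            self.granted[item].pop(txn, None)
            # Wake compatible waiters strictly in FIFO order.
            while self.waiting[item]:
                t, m = self.waiting[item][0]
                if all(COMPATIBLE[(hm, m)] for hm in self.granted[item].values()):
                    self.waiting[item].popleft()
                    self.granted[item][t] = m
                else:
                    break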
Deadlock: A deadlock is a condition wherein two or more tasks are each waiting for the other to finish, but none of the tasks is willing to give up the resources that the other tasks need. In this situation no task ever finishes, and all remain in a waiting state forever.
Timestamp-Based Protocols:
Each transaction is issued a timestamp when it enters the system. If an old transaction Ti has
time-stamp TS(Ti), a new transaction Tj is assigned time-stamp TS(Tj) such that TS(Ti)
<TS(Tj). The protocol manages concurrent execution such that the time-stamps determine the
serializability order. In order to assure such behavior, the protocol maintains for each data Q
two timestamp values:
W-timestamp(Q) is the largest time-stamp of any transaction that executed write(Q) successfully.
R-timestamp(Q) is the largest time-stamp of any transaction that executed read(Q) successfully.
The timestamp-ordering protocol ensures that any conflicting read and write operations are executed in timestamp order.
Suppose a transaction Ti issues a read(Q):
o If TS(Ti) < W-timestamp(Q), then Ti needs to read a value of Q that was already overwritten. Hence, the read operation is rejected, and Ti is rolled back.
o If TS(Ti) ≥ W-timestamp(Q), then the read operation is executed, and R-timestamp(Q) is set to max(R-timestamp(Q), TS(Ti)).
Suppose a transaction Ti issues a write(Q):
o If TS(Ti) < R-timestamp(Q), then the value of Q that Ti is producing was needed previously, and the system assumed that the value would never be produced. Hence, the write operation is rejected, and Ti is rolled back.
o If TS(Ti) < W-timestamp(Q), then Ti is attempting to write an obsolete value of Q. Hence, this write operation is rejected, and Ti is rolled back.
o Otherwise, the write operation is executed, and W-timestamp(Q) is set to TS(Ti).
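The two rules translate directly into code. A hedged sketch, keeping the per-item R-/W-timestamps in plain dictionaries (the names are invented for this example):

    # Timestamp-ordering checks: each returns "ok" or "rollback".
    R_ts, W_ts = {}, {}      # per-item read/write timestamps, default 0

    def read(ti_ts, q):
        if ti_ts < W_ts.get(q, 0):    # Q was already overwritten: too late
            return "rollback"
        R_ts[q] = max(R_ts.get(q, 0), ti_ts)
        return "ok"

    def write(ti_ts, q):
        if ti_ts < R_ts.get(q, 0):    # a younger txn already read the old Q
            return "rollback"
        if ti_ts < W_ts.get(q, 0):    # attempting to write an obsolete value
            return "rollback"
        W_ts[q] = ti_ts
        return "ok"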
Recovery Techniques
To see where a problem has occurred, we generalize failures into various categories: transaction failure (a logical error or a system error), system crash, and disk failure. A transaction that modifies several items may fail after one of these modifications has been made but before all of them are made.
To ensure atomicity despite failures, we first output information describing the modifications to stable storage, without modifying the database itself. Two approaches for recovery are log-based recovery and shadow paging. Assume (initially) that transactions run serially, that is, one after the other.
Recovery Algorithms
Recovery algorithms are techniques to ensure database consistency and transaction atomicity
and durability despite failures. Recovery algorithms have two parts:
Actions taken during normal transaction processing to ensure enough information exists
to recover from failures
Actions taken after a failure to recover the database contents to a state that ensures
atomicity, consistency and durability.
Log-Based Recovery:
A log is kept on stable storage. The log is a sequence of log records and maintains a record of update activities on the database. An update log record has four fields:
Transaction identifier: unique identifier of the transaction that performed the write operation.
Data-item identifier: unique identifier of the data item written.
Old value: value of the item prior to the write.
New value: value of the item after the write.
The various log records are:
<Ti start>: transaction Ti has started.
<Ti, X, V1, V2>: written before Ti executes write(X), where V1 is the value of X before the write and V2 is the value to be written to X; it notes that Ti has performed a write on data item X, which had value V1 before the write and will have value V2 after it.
<Ti commit>: transaction Ti has committed.
<Ti abort>: transaction Ti has aborted.
Immediate Database Modification
The immediate database modification scheme allows database updates of an uncommitted transaction to be made as the writes are issued. Since undoing may be needed, update log records must contain both the old value and the new value, and the update log record must be written before the database item is written. We assume that the log record is output directly to stable storage; this can be extended to postpone log-record output, so long as, prior to the execution of an output(B) operation for a data block B, all log records corresponding to items in B are flushed to stable storage.
Output of updated blocks can take place at any time, before or after transaction commit, and the order in which blocks are output can be different from the order in which they are written.
Recovery procedure has two operations instead of one:
undo(Ti) restores the value of all data items updated by Ti to their old values, going backwards
from the last log record for Ti
redo(Ti) sets the value of all data items updated by Ti to the new values, going forward from
the first log record for Ti
Both operations must be idempotent, i.e., even if the operation is executed multiple times the
effect is the same as if it is executed once. Needed since operations may get re-executed during
recovery.
When recovering after failure:
Transaction Ti needs to be undone if the log contains the record
<Ti start>, but does not contain the record <Ti commit>.
Transaction Ti needs to be redone if the log contains both the record <Ti start> and the record
<Ti commit>.
Undo operations are performed first, then redo operations.
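A compact sketch of this recovery pass (log records are modeled as Python tuples mirroring the record formats above; an illustration, not a production algorithm):

    # Log records: ("start", T), ("commit", T), ("update", T, X, old, new).
    def recover(log, db):
        started   = {r[1] for r in log if r[0] == "start"}
        committed = {r[1] for r in log if r[0] == "commit"}
        undo_set  = started - committed
        # Undo first: scan backwards, restoring old values of uncommitted txns.
        for rec in reversed(log):
            if rec[0] == "update" and rec[1] in undo_set:
                _, t, x, old, new = rec
                db[x] = old
        # Then redo: scan forwards, re-applying new values of committed txns.
        for rec in log:
            if rec[0] == "update" and rec[1] in committed:
                _, t, x, old, new = rec
                db[x] = new

    db  = {"A": 950, "B": 2000}                     # crash left T0 half done
    log = [("start", "T0"), ("update", "T0", "A", 1000, 950)]
    recover(log, db)                                # no <T0 commit>: undo it
    print(db)                                       # {'A': 1000, 'B': 2000}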
Example: Immediate Database Modification
Crashes can occur while the transaction is executing the original updates, or while recovery action is being taken. Consider transactions T0 and T1, where T0 executes before T1:
    T0: read(A)              T1: read(C)
        A := A - 50              C := C - 100
        write(A)                 write(C)
        read(B)
        B := B + 50
        write(B)
Let accounts A, B and C initially hold 1000, 2000 and 700 respectively. Below we show the log as it appears at three instants of time; the recovery actions in each case are:
undo(T0): B is restored to 2000 and A to 1000.
undo(T1) and redo(T0): C is restored to 700, and then A and B are set to 950 and 2050 respectively.
redo(T0) and redo(T1): A and B are set to 950 and 2050 respectively; then C is set to 600.
Deferred Database Modification:
Under deferred database modification, writes are deferred until the transaction partially commits; finally, the log records are read and used to actually execute the previously deferred writes. During recovery after a crash, a transaction needs to be redone if and only if both <Ti start> and <Ti commit> are in the log. Redoing a transaction Ti, redo(Ti), sets the value of all data items updated by the transaction to the new values.
Crashes can occur while the transaction is executing the original updates, or while recovery action is being taken. Consider again the transactions T0 and T1, where T0 executes before T1:
    T0: read(A)              T1: read(C)
        A := A - 50              C := C - 100
        write(A)                 write(C)
        read(B)
        B := B + 50
        write(B)
Let accounts A, B and C initially hold 1000, 2000 and 700 respectively. The log entries of the two transactions (note that deferred modification logs only the new values) are:
<T0 start>
<T0, A, 950>
<T0, B, 2050>
<T0 commit>
<T1 start>
<T1, C, 600>
<T1 commit>
Shadow Paging:
Shadow paging maintains two page tables during the lifetime of a transaction: the current page table and the shadow page table. To start with, both page tables are identical. Only the current page table is used for data-item accesses during execution of the transaction. Whenever any page is about to be written for the first time:
A copy of this page is made onto an unused page.
The current page table is then made to point to the copy.
The update is performed on the copy.
To commit a transaction:
Flush all modified pages in main memory to disk.
Output the current page table to disk.
Make the current page table the new shadow page table, as follows:
keep a pointer to the shadow page table at a fixed (known) location on disk;
to make the current page table the new shadow page table, simply update the pointer to point to the current page table on disk.
Once the pointer to the shadow page table has been written, the transaction is committed. No recovery is needed after a crash: new transactions can start right away, using the shadow page table. Pages not pointed to from the current/shadow page table should be freed (garbage collected).
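The copy-on-write and pointer-swap steps fit in a few lines. A hedged sketch, with the "disk" pages and page tables modeled as plain dictionaries:

    pages   = {0: "old A", 1: "old B"}     # simulated disk pages
    shadow  = {"A": 0, "B": 1}             # shadow page table (on disk)
    current = dict(shadow)                 # current page table for the txn
    next_free = 2

    def write_page(name, data):
        global next_free
        if current[name] == shadow[name]:  # first write: copy on write
            current[name] = next_free      # point the current table at the copy
            next_free += 1
        pages[current[name]] = data        # the original page is untouched

    write_page("A", "new A")
    # Commit: flush pages, then atomically swing the on-disk root pointer.
    shadow = current                       # one pointer update = commit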
Advantages of shadow-paging over log-based schemes
no overhead of writing log records
recovery is trivial
Disadvantages:
Copying the entire page table is very expensive. This can be reduced by using a page table structured like a B+-tree: there is no need to copy the entire tree, only the paths in the tree that lead to updated leaf nodes.
Commit overhead is high even with the above extension: every updated page and the page table must be flushed.
Data gets fragmented (related pages get separated on disk).
After every transaction completion, the database pages containing old versions of modified data need to be garbage collected.
It is hard to extend the algorithm to allow transactions to run concurrently; it is easier to extend log-based schemes.
UNIT-V
Data Storage and Query Processing
Record storage and primary file organization
Secondary storage devices
Operations on files
Heap File
Sorted files
Hashing techniques
Index structures for files
Different types of indexes
B tree and B+ tree
Query processing
Record storage and primary file organization
Storage Hierarchy
Fixed-length records:
Modification to the simple layout: do not allow records to cross block boundaries.
Deletion of record i: alternatives:
move records i + 1, . . ., n to i, . . ., n – 1
move record n to i
do not move records, but link all free records on a free list
Variable-Length Records
Sequential – store records in sequential order, based on the value of the search key of each
record
Hashing – a hash function computed on some attribute of each record; the result specifies
in which block of the file the record should be placed
Records of each relation may be stored in a separate file. In a multitable clustering file
organization records of several different relations can be stored in the same file
Motivation: store related records on the same block to minimize I/O
Operations on files
Indexes are data structures that allow us to find the record ids of records with given values in
index search key fields
Architecture: Buffer manager stages pages from external storage to main memory buffer pool.
File and index layers make calls to the buffer manager.
Alternative File Organizations:
Many alternatives exist, each ideal for some situations, and not so good in others:
Heap (random order) files: Suitable when typical access is a file scan retrieving all records.
Sorted Files: Best if records must be retrieved in some order, or only a 'range' of records is needed.
Indexes: Data structures to organize records via trees or hashing.
Like sorted files, they speed up searches for a subset of records, based on values in certain
(“search key”) fields. Updates are much faster than in sorted files.
Primary and secondary Indexes:
Primary vs. secondary: If search key contains primary key, then called primary index.
Unique index: Search key contains a candidate key.
Clustered and unclustered:
If the order of data records is the same as, or 'close to', the order of data entries, then the index is called a clustered index.
Alternative 1 implies clustered; in practice, clustered also implies Alternative 1 (since sorted files are rare).
A file can be clustered on at most one search key.
The cost of retrieving data records through an index varies greatly based on whether the index is clustered or not!
Clustered vs. Unclustered Index
Suppose that Alternative (2) is used for data entries, and that the data records are stored
in a Heap file.
To build clustered index, first sort the Heap file (with some free space on
each page for future inserts).
[Figure: clustered vs. unclustered index; in both cases the index entries direct the search for data entries]
Overflow pages may be needed for inserts. (Thus, the order of data records is 'close to', but not identical to, the sort order.)
Index Data Structures:
An index on a file speeds up selections on the search key fields for the index.
Any subset of the fields of a relation can be the search key for an index on the
relation.
Search key is not the same as key (minimal set of fields that uniquely identify a record in a
relation).
[Figure: tree-structured index with non-leaf pages above leaf pages; each index entry has the form <P0, K1, P1, K2, P2, ..., Km, Pm>]
An index contains a collection of data entries, and supports efficient retrieval of all data
entries k* with a given key value k.
Given data entry k*, we can find record with key k in at most one disk I/O.
(Details soon …)
B+ Tree Indexes
Example B+ Tree
–Inserts and deletes locate the affected data entry in a leaf, and the change sometimes bubbles up the tree.
Hash-Based Indexing:
Hash-Based Indexes
Good for equality selections.
Index is a collection of buckets.
– Bucket = primary page plus zero or more overflow pages.
– Buckets contain data entries.
Hashing function h: h(r) = bucket in which (data entry for) record r belongs. h looks at
the search key fields of r. No need for “index entries” in this scheme.
Alternatives for Data Entry k* in Index
In a data entry k* we can store:
– Data record with key value k, or
– <k, rid of data record with search key value k>, or
– <k, list of rids of data records with search key k>
Choice of alternative for data entries is orthogonal to the indexing technique used to
locate data entries with a given key value k.
Tree Based Indexing:
–Examples of indexing techniques: B+ trees, hash-based structures
–Typically, index contains auxiliary information that directs searches to the desired
data entries
Alternative 1:
–If this is used, index structure is a file organization for data records (instead of a Heap file
or sorted file).
–At most one index on a given collection of data records can use Alternative 1. (Otherwise, data records
are duplicated, leading to redundant storage and potential inconsistency.)
–If data records are very large, # of pages containing data entries is high.
Implies size of auxiliary information in the index is also large, typically.
Alternatives 2 and 3:
–Data entries typically much smaller than data records. So, better than
Alternative 1 with large data records, especially if search keys are small.
(Portion of index structure used to direct search, which depends on size of data
entries, is much smaller than with Alternative 1.)
–Alternative 3 more compact than Alternative 2, but leads to variable sized data entries
even if search keys are of fixed length.
Cost Model for Our Analysis
We ignore CPU costs, for simplicity:
– B: The number of data pages
– R: Number of records per page
– D: (Average) time to read or write disk page
– Measuring number of page I/O’s ignores gains of pre-fetching a sequence of
pages; thus, even I/O cost is only approximated.
– Average-case analysis; based on several simplistic assumptions.
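As a quick worked example under this model (the numbers B = 100, R = 100 and D = 15 ms are invented for illustration):

    import math

    B, R, D = 100, 100, 0.015            # pages, records per page, seconds per I/O
    heap_scan       = B * D              # read every page: 1.5 s
    heap_equality   = 0.5 * B * D        # scan half the file on average: 0.75 s
    sorted_equality = D * math.log2(B)   # binary search: about 0.1 s
    print(heap_scan, heap_equality, sorted_equality)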
Comparison of File Organizations:
Heap files (random order; insert at eof)
Sorted files, sorted on <age, sal>
Clustered B+ tree file, Alternative (1), search key <age, sal>
Heap file with unclustered B + tree index on search key <age, sal>
Heap file with unclustered hash index on search key <age, sal>
Operations to Compare
Scan: Fetch all records from disk
Equality search
Range selection
Insert a record
Delete a record
Assumptions in Our Analysis
Heap Files:
– Equality selection on key; exactly one match.
Sorted Files:
– Files compacted after deletions.
Indexes:
– Alt (2), (3): data entry size = 10% size of record
– Hash: No overflow buckets.
• 80% page occupancy => File size = 1.25 data size
– Tree: 67% occupancy (this is typical).
• Implies file size = 1.5 data size
Scans:
– Leaf levels of a tree-index are chained.
– Index data-entries plus actual file scanned for unclustered indexes.
Range searches:
We use tree indexes to restrict the set of data records fetched, but ignore hash
indexes.
Choice of Indexes
What indexes should we create?
Which relations should have indexes? What field(s) should be the search key?
Should we build several indexes?
For each index, what kind of an index should it be?
Clustered? Hash/tree?
One approach: Consider the most important queries in turn. Consider the best plan using
the current indexes, and see if a better plan is possible with an additional index. If so,
create it.
– Obviously, this implies that we must understand how a DBMS evaluates queries
and creates query evaluation plans!
– For now, we discuss simple 1-table queries.
Before creating an index, must also consider the impact on updates in the workload!
– Trade-off: Indexes can make queries go faster, updates slower. Require disk
space, too.
Examples of composite search keys
[Figure: data entries for indexes on composite keys over the data records (name, age, sal) = (bob, 12, 10), (cal, 11, 80), (joe, 12, 20), (sue, 13, 75), sorted by name:
    <age, sal>: 11,80   12,10   12,20   13,75
    <sal, age>: 10,12   20,12   75,13   80,11
    <age>:      11   12   12   13
    <sal>:      10   20   75   80
The data entries in each index are sorted by its search key.]
Index-Only Plans
A number of queries can be answered without retrieving any tuples from one or more of the
relations involved if a suitable index is available.
Summary
Many alternative file organizations exist, each appropriate in some situation.
If selection queries are frequent, sorting the file or building an index is important.
Hash-based indexes only good for equality search.
Sorted files and tree-based indexes best for range search; also good for equality search.
(Files rarely kept sorted in practice; B+ tree index is better.)
Index is a collection of data entries plus a way to quickly find entries with given key values.
Data entries can be actual data records, <key, rid> pairs, or <key, rid-list> pairs.
– Choice orthogonal to indexing technique used to locate data entries with a given
key value.
Can have several indexes on a given file of data records, each with a different search key.
Indexes can be classified as clustered vs. unclustered, primary vs. secondary, and dense vs.
sparse. Differences have important consequences for utility/performance.
As for any index, 3 alternatives for data entries k*:
Data record with key value k
<k, rid of data record with search key value k>
<k, list of rids of data records with search key k>
Choice is orthogonal to the indexing technique used to locate data entries k*.
An index entry consists of a search-key value and a pointer.
Index files are typically much smaller than the original file
Two basic kinds of indices:
Ordered indices: search keys are stored in sorted order
Hash indices: search keys are distributed uniformly across “buckets” using a “hash function”.
Index Evaluation Metrics
Access types supported efficiently. E.g., records with a specified value in the attribute or
records with an attribute value falling in a specified range of values.
Access time
Insertion time
Deletion time
Space overhead
Ordered indices: In an ordered index, index entries are stored sorted on the search key value.
E.g., author catalog in library.
Primary index: in a sequentially ordered file, the index whose search key specifies the
sequential order of the file. Also called clustering index. The search key of a primary index is
usually but not necessarily the primary key.
Secondary index: an index whose search key specifies an order different from the
sequential order of the file. Also called non-clustering index. Index-sequential file:
ordered sequential file with a primary index.
Hash Function:
A bucket is a unit of storage containing one or more records (a bucket is typically a disk
block). In a hash file organization we obtain the bucket of a record directly from its search-
key value using a hash function.
Hash function h is a function from the set of all search-key values K to the set of all
bucket addresses B.
Hash function is used to locate records for access, insertion as well as deletion.
Records with different search-key values may be mapped to the same bucket; thus entire
bucket has to be searched sequentially to locate a record. Example:
Hash Indices:
Hashing can be used not only for file organization, but also for index-structure creation.
A hash index organizes the search keys, with their associated record pointers, into a hash
file structure.
Strictly speaking, hash indices are always secondary indices
if the file itself is organized using hashing, a separate primary hash index on it
using the same search-key is unnecessary.
However, we use the term hash index to refer to both secondary index structures
and hash organized files.
Hash Based Indexing:
Bucket: A hash file stores data in bucket format; a bucket is considered a unit of storage. A bucket typically stores one complete disk block, which in turn can store one or more records.
Hash Function: A hash function h is a mapping function that maps the set of all search-keys K to the addresses where actual records are placed; that is, it is a function from search keys to bucket addresses.
Static Hashing:
In static hashing, when a search-key value is provided, the hash function always computes the same address. For example, if a mod-4 hash function is used, it generates only 4 values; the output address is always the same for a given key. The number of buckets provided remains the same at all times.
[Image: Static Hashing]
Operation:
Insertion: When a record is required to be entered using static hash, the hash function h,
computes the bucket address for search key K, where the record will be stored.
Bucket address = h(K)
Search: When a record needs to be retrieved the same hash function can be used to
retrieve the address of bucket where the data is stored.
Delete: This is simply search followed by deletion operation.
Bucket Overflow:
The condition of bucket-overflow is known as collision. This is a fatal state for any static hash
function. In this case overflow chaining can be used.
Overflow Chaining: When buckets are full, a new bucket is allocated for the same
hash result and is linked after the previous one. This mechanism is called Closed
Hashing.
[Image: Overflow chaining]
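A toy sketch of static hashing with overflow chaining, using mod-4 as the (static) hash function; the bucket pages are modeled as small Python lists:

    N_BUCKETS, CAPACITY = 4, 2
    buckets = [[[]] for _ in range(N_BUCKETS)]   # bucket -> list of pages

    def h(key):
        return key % N_BUCKETS                   # static: never changes

    def insert(key, record):
        pages = buckets[h(key)]
        if len(pages[-1]) == CAPACITY:           # bucket overflow (collision)
            pages.append([])                     # chain a new overflow page
        pages[-1].append((key, record))

    def search(key):
        # The whole chain is scanned: different keys can share a bucket.
        return [r for page in buckets[h(key)] for k, r in page if k == key]

    for k in (3, 7, 11, 15):                     # all hash to bucket 3
        insert(k, "rec%d" % k)
    print(search(7))                             # ['rec7']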
Linear Probing: When the hash function generates an address at which data is already stored, the next free bucket is allocated to it. This mechanism is called Open Hashing.
For a hash function to work efficiently and effectively, the distribution of keys over buckets should be uniform and random.
Extendable Hashing (Dynamic Hashing):
The problem with static hashing is that it does not expand or shrink dynamically as the size of the database grows or shrinks. Dynamic hashing provides a mechanism in which data buckets are added and removed dynamically and on demand; dynamic hashing is also known as extended hashing. In dynamic hashing, the hash function is made to produce a large number of values, of which only a few are used initially.
[Image: Dynamic Hashing]
Organization
The prefix of the entire hash value is taken as the hash index; only a portion of the hash value is used for computing bucket addresses. Every hash index has a depth value, which tells how many bits are used for computing bucket addresses; n bits can address 2^n buckets. When all these bits are consumed, that is, when all buckets are full, the depth value is increased by one and twice as many buckets are allocated.
Operation
Querying: Look at the depth value of hash index and use those bits to compute the
bucket address.
Update: Perform a query as above and update data.
Deletion: Perform a query to locate desired data and delete data.
Insertion: compute the address of the bucket.
    If the bucket is already full:
        add more buckets,
        add an additional bit to the hash value,
        and re-compute the hash function.
    Else:
        add the data to the bucket.
If all buckets are full, perform the remedies of static hashing. (A code sketch of this insertion rule appears after the next paragraph.)
Hashing is not favorable when the data is organized in some ordering and queries require a range of data. When data is discrete and random, hashing performs best.
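The directory-doubling idea can be sketched as follows. This is a simplified illustration: per-bucket local depths are omitted, so every overflow doubles the directory and redistributes all entries, which is correct but less efficient than real extendible hashing.

    CAPACITY = 2

    class ExtendibleHash:
        def __init__(self):
            self.depth = 1
            self.directory = [[] for _ in range(2 ** self.depth)]

        def _bucket(self, key):
            # Use the low-order `depth` bits of the hash value as the index.
            return self.directory[hash(key) % (2 ** self.depth)]

        def insert(self, key, record):
            if len(self._bucket(key)) >= CAPACITY:
                # Consume one more bit: double the directory, redistribute.
                self.depth += 1
                old = [kv for b in self.directory for kv in b]
                self.directory = [[] for _ in range(2 ** self.depth)]
                for k, r in old:
                    self._bucket(k).append((k, r))
            self._bucket(key).append((key, record))

        def lookup(self, key):
            return [r for k, r in self._bucket(key) if k == key]

    eh = ExtendibleHash()
    for i in range(10):
        eh.insert(i, "rec%d" % i)
    print(eh.lookup(7))      # ['rec7']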
Hashing algorithms and their implementation are more complex than indexing; in exchange, all hash operations are done in (expected) constant time.
Extendable Vs. Linear Hashing:
Benefits of extendable hashing:
hash performance doesn’t degrade with growth of file
minimal space overhead
Disadvantages of extendable hashing:
extra level of indirection (bucket address table) to find desired record
bucket address table may itself become very big (larger than memory)
o need a tree structure to locate the desired record in the structure!
Changing size of bucket address table is an expensive operation
Linear hashing: is an alternative mechanism which avoids these disadvantages at the
possible cost of more bucket overflows
B tree and B+ tree
B+-tree indices are an alternative to indexed-sequential files.
Disadvantage of indexed-sequential files
Performance degrades as file grows, since many overflow blocks get created.
Example of B+Tree:
B+-tree properties:
All paths from the root to a leaf are of the same length.
Each node that is not a root or a leaf has between ⌈n/2⌉ and n children.
A leaf node has between ⌈(n–1)/2⌉ and n–1 values.
Special cases: If the root is not a leaf, it has at least 2 children.
If the root is a leaf (that is, there are no other nodes in the tree), it can have between 0 and (n–1) values.
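These properties make search a short root-to-leaf walk. A hedged sketch, with nodes modeled as plain Python objects (real implementations store page pointers, not object references):

    import bisect

    class Node:
        def __init__(self, keys, children=None, leaf=False):
            self.keys, self.children, self.leaf = keys, children or [], leaf

    def bptree_search(node, key):
        while not node.leaf:
            # Follow the child whose key range contains `key`
            # (equal keys go to the right subtree here).
            i = bisect.bisect_right(node.keys, key)
            node = node.children[i]
        return key in node.keys     # all search keys appear in the leaves

    leaf1 = Node([5, 10], leaf=True)
    leaf2 = Node([15, 20], leaf=True)
    root  = Node([15], [leaf1, leaf2])
    print(bptree_search(root, 15))  # True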
B+-Tree Node Structure
Typical node:
Pn points to next leaf node in search-key order
Example of B+-tree File Organization
Good space utilization is important since records use more space than pointers.
To improve space utilization, involve more sibling nodes in redistribution during splits and merges.
Involving 2 siblings in redistribution (to avoid a split or merge where possible) results in each node having at least ⌊2n/3⌋ entries.
B-Tree Index Files
Similar to B+-tree, but B-tree allows search-key values to appear only once; eliminates
redundant storage of search keys.
Search keys in nonleaf nodes appear nowhere else in the B-tree; an additional pointer
field for each search key in a nonleaf node must be included.
[Figure: generalized B-tree leaf node, and a nonleaf node in which the pointers Bi are the bucket or file-record pointers]
B-tree indexing:
Advantages of B-Tree indices:
May use fewer tree nodes than a corresponding B+-tree.
Sometimes possible to find a search-key value before reaching a leaf node.
Disadvantages of B-Tree indices:
Only a small fraction of all search-key values are found early.
Non-leaf nodes are larger, so fan-out is reduced; thus, B-trees typically have greater depth than the corresponding B+-tree.
Insertion and deletion are more complicated than in B+-trees.
Implementation is harder than for B+-trees.
Typically, the advantages of B-trees do not outweigh the disadvantages.
Query processing
Basic steps in Query Processing:
Parsing and translation
Optimization
Evaluation
E.g., σ_salary<75000(Π_salary(instructor)) is equivalent to
Π_salary(σ_salary<75000(instructor))
Each relational algebra operation can be evaluated using one of several different algorithms
Correspondingly, a relational-algebra expression can be evaluated in many ways.
Annotated expression specifying detailed evaluation strategy is called an evaluation-plan.
E.g., we can use an index on salary to find instructors with salary < 75000, or we can perform a complete relation scan and discard instructors with salary ≥ 75000.
Query Optimization: Amongst all equivalent evaluation plans choose the one with lowest
cost. Cost is estimated using statistical information from the database catalog. e.g. number
of tuples in each relation, size of tuples, etc.
Measures of Query Cost
Cost is generally measured as the total elapsed time for answering a query. Many factors contribute to time cost: disk accesses, CPU time, or even network communication. Typically disk access is the predominant cost, and it is also relatively easy to estimate.