DBMS Total Notes
Course Outcomes:
● Gain knowledge of fundamentals of DBMS, database design and normal forms
● Master the basics of SQL for retrieval and management of data.
● Be acquainted with the basics of transaction processing and concurrency control.
● Familiarity with database storage structures and access techniques
UNIT - I
Database System Applications: A Historical Perspective, File Systems versus a DBMS, the Data
Model, Levels of Abstraction in a DBMS, Data Independence, Structure of a DBMS
Introduction to Database Design: Database Design and ER Diagrams, Entities, Attributes, and
Entity Sets, Relationships and Relationship Sets, Additional Features of the ER Model, Conceptual
Design With the ER Model
UNIT - II
Introduction to the Relational Model: Integrity constraint over relations, enforcing integrity
constraints, querying relational data, logical database design, introduction to views, destroying/altering
tables and views.
Relational Algebra, Tuple relational Calculus, Domain relational calculus.
UNIT - III
SQL: QUERIES, CONSTRAINTS, TRIGGERS: form of basic SQL query, UNION, INTERSECT, and
EXCEPT, Nested Queries, aggregation operators, NULL values, complex integrity constraints in SQL,
triggers and active databases.
Schema Refinement: Problems caused by redundancy, decompositions, problems related to
decomposition, reasoning about functional dependencies, First, Second, Third normal forms, BCNF,
lossless join decomposition, multivalued dependencies, Fourth normal form, Fifth normal form.
UNIT - IV
Transaction Concept, Transaction State, Implementation of Atomicity and Durability, Concurrent
Executions, Serializability, Recoverability, Implementation of Isolation, Testing for serializability, Lock
Based Protocols, Timestamp Based Protocols, Validation- Based Protocols, Multiple Granularity,
Recovery and Atomicity, Log–Based Recovery, Recovery with Concurrent Transactions.
UNIT - V
Data on External Storage, File Organization and Indexing, Cluster Indexes, Primary and Secondary
Indexes, Index data Structures, Hash Based Indexing, Tree based Indexing, Comparison of File
Organizations, Indexes- Intuitions for tree Indexes, Indexed Sequential Access Methods (ISAM), B+
Trees: A Dynamic Index Structure.
TEXT BOOKS:
1. Database System Concepts, Silberschatz, Korth, McGraw-Hill, 5th Edition.
2. Database Management Systems, Raghu Ramakrishnan, Johannes Gehrke, Tata McGraw-Hill, 3rd Edition.
UNIT - I
Data:
Facts that can be recorded.
Data is in raw or unprocessed form.
You cannot take decisions based on raw data alone.
Data by itself has no meaning.
Example(s):
Text, numbers, images, videos, etc.
25, Karthik, Karimnagar.
Processing: Data becomes information only after it is processed.
Information: Processed data that carries meaning and supports decision making.
Example:
Various types of databases exist: traditional databases (text and numbers), multimedia
databases (audio, video, movies, speech), Geographic Information Systems (satellite
images), and real-time databases.
Data warehouse: A data warehouse is a kind of database in which the data is very
large and historical.
Example: Data stored about the past 100 years of a company's stock market rates.
***Good decisions require good information that is derived from raw facts (Data).
Database Management Systems: A DBMS is the software system that allows the user to define
and maintain the database and provides controlled access to the database, or
DBMS is a collection of interrelated data and a set of programs to access data. The
primary goal of a DBMS is to provide a way to store and retrieve database information
that is both convenient and efficient.
Summary:
Functionalities:
1. Define: Specifying the data type, structure and constraints for the data to be
stored.
2. Construct: Process of storing data on some storage medium.
3. Manipulate: Querying the database to retrieve specific data, updating database
and generating reports.
4. Share: Allows multiple users and programs to access the database concurrently.
Database System Application:
A Historical Perspective:
1960s:
Charles Bachman designed the first general-purpose DBMS, called the Integrated
Data Store (IDS).
IDS formed the basis for the network data model.
Late 1960s:
IBM developed the Information Management System (IMS), which formed the basis for the hierarchical data model.
1970:
Edgar F. Codd proposed the relational data model.
1980s:
The relational model consolidated its position as the dominant DBMS paradigm,
and database systems continued to gain widespread use.
The SQL query language for relational databases, developed as part of IBM's
System R project, is now the standard query language.
Several vendors (e.g., IBM's DB2, Oracle 8, Informix) extended their systems
with the ability to store new data types such as images, videos and to ask more
complex queries.
Specialized systems have been developed by numerous vendors for creating data
warehouses.
Present era:
DBMSs have entered the Internet age. While the first generation of websites stored
their data in operating system files, the use of a database accessed through a web
browser is now widespread.
Queries are generated through web-accessible forms and answers are formatted
using a markup language such as HTML.
File Systems vs a DBMS
1. Data Redundancy:
There is no mechanism to validate the insertion of duplicate data in the system. Any user
can enter any data: the file system validates neither the kind of data being entered nor
whether the same data already exists in the same file. Duplicate data
is undesirable because it wastes space and leads to confusion and
misleading results. When duplicate records exist in a file and we need to update or
delete a record, we might end up updating or deleting only one of the copies, leaving the
other copy in the file.
2. Data inconsistency:
For example, the student and student_report files both store a student's address, and there was a
change request for one particular student's address. The program searched only the student
file and updated the address correctly. Another program prints the
student's report and mails it to the address mentioned in the student_report file. There is a
mismatch with the actual address, and the report is sent to the old address. This mismatch between
different copies of the same data is called data inconsistency.
3. Data Isolation:
Imagine we have to generate a single report for a student studying in a particular
class: his study report, his library book details and his hostel information. All of this
information is stored in different files. How do we get all these details in one report?
We have to write a program. But before writing the program, the programmer should
find out which files hold the information needed, what the format of each file is, and how
to search data in each file. Once all this analysis is done, he writes a program. If
only 2-3 files are involved, programming is fairly simple; if many
files are involved, it requires a lot of effort from the programmer. Since all the data
is isolated in different files, programming becomes difficult.
4. Security:
Each file can be password protected. But what if we have to give access to only few
records in the file? For Example, user has to be given access to view only their bank
account details in the file. This is very difficult in the file system.
Data Models:
1. Hierarchical Model
2. Network Model
3. Entity – Relationship Model
4. Relational Model
Hierarchical Model:
This database model organizes data into a tree-like-structure, with a single root, to which all the
other data is linked. The hierarchy starts from the Root data, and expands like a tree, adding
child nodes to the parent nodes. In this model, a child node will only have a single parent node.
This model efficiently describes many real-world relationships like index of a book, recipes etc.
It has one-to-many relationship between two different types of data, for example, one department
can have many courses, many professors and many students.
Network Model:
This is an extension of the Hierarchical model. In this model, data is organized more like a graph,
and nodes are allowed to have more than one parent node.
In this database model, data is more interconnected as more relationships are established, and
hence accessing the data is also easier and faster. This
database model was used to map many-to-many data relationships.
Entity-Relationship (E-R) Model:
E-R Models are defined to represent the relationships in pictorial form to make it easier for
different stakeholders to understand.
This model is good to design a database, which can then be turned into tables in relational model.
Let's take an example, If we have to design a School Database, then Student will be an entity
with attributes name, age, address etc. As Address is generally complex, it can be another
entity with attributes street name, pincode, city etc, and there will be a relationship between
them.
To simplify user's interaction with the system, the complexity is hidden from the database users
through several levels of abstraction.
View Level:
Example: If we have a login-id and password in a university system, then as a student, we can
view our marks, attendance, fee structure, etc. But the faculty of the university will have a
different view of the database, such as details of the students they teach.
Conceptual (Logical) Level:
Example: Let us take an example where we use the relational model for storing the data. We
have to store the data of a student, the columns in the student table will be student_name, age,
mail_id, roll_no etc. We have to define all these at this level while we are creating the database.
Though the data is stored in the database but the structure of the tables like the student table,
teacher table, books table, etc are defined here in the conceptual level or logical level. Also, how
the tables are related to each other is defined here. Overall, we can say that we are creating a
blueprint of the data at the conceptual level.
Physical Level:
It tells the actual location of the data that is being stored by the user. The Database
Administrators (DBA) decides that which data should be kept at which particular disk drive, how
the data has to be fragmented, where it has to be stored etc. They decide if the data has to be
centralized or distributed. Though we see the data in the form of tables at the view level, the data
at this level is actually stored as files on disk.
Data Independence
Data independence is the ability to modify a schema definition at one level without affecting the
schema definition at the next higher level, or the capacity to change the schema at one level
without affecting the schema at another level.
Physical Data Independence refers to the characteristic of changing the physical level without
affecting the logical level or conceptual level. Using this property we can easily change the
storage device of the database without affecting the logical schema.
Due to physical data independence, changes at the physical level (such as a change of storage device or file organization) will not affect the conceptual layer.
Logical Data Independence:
It refers to the capacity to change the logical level without affecting the external or
view level. This also helps in separating the logical level from the view level. If we do any
changes in the logical level then the user view of the data remains unaffected. The changes in
the logical level are required whenever there is a change in the logical structure of the
database.
Due to logical data independence, changes at the logical level (such as adding or removing an attribute) will not affect the external layer.
Database Users:
Application programmers:
As the name shows, application programmers are the ones who write application programs that
use the database. These application programs are written in programming languages like PHP,
Java, etc., and are built according to user requirements.
Retrieving information, creating new information and changing existing information are done by
these application programs.
Sophisticated Users:
They are database developers who write SQL queries to select/insert/delete/update data. They do
not use any application or program to access the database. They directly interact with the
database by means of a query language like SQL. These users are typically scientists, engineers and
analysts who thoroughly study SQL and the DBMS to apply the concepts to their requirements.
Database Administrators:
The life cycle of database starts from designing, implementing to administration of it. A database
for any kind of requirement needs to be designed perfectly so that it should work without any
issues. Once all the design is complete, it needs to be installed. Once this step is complete, users
start using the database. The database grows as the data grows. When the
database becomes huge, its performance comes down, and accessing the data
becomes a challenge. Unused space accumulates in the database, making it unnecessarily
large. This administration and maintenance of the database is taken care of by the database administrator
(DBA). A DBA has many responsibilities; a well-performing database is in the hands of the DBA.
Installation and upgradation:
The DBA is responsible for installing a new DBMS server for new projects. He is also
responsible for upgrading these servers as new versions come onto the market or as
requirements change. If an upgrade of an existing server fails, he should be
able to revert the changes back to the older version, keeping the DBMS
working.
Performance tuning:
Since the database is huge and has lots of tables, data, constraints and indices, there
will be variations in performance from time to time. Also, because of design
issues or data growth, the database may not work as expected. It is the responsibility of the
DBA to tune the database performance. He is responsible for making sure all queries and
programs execute in a fraction of a second.
Migration:
Sometimes users of Oracle would like to shift to SQL Server or MySQL. It is the
responsibility of the DBA to make sure that the migration happens without any failure and there
is no data loss.
Backup and recovery:
Proper backup and recovery procedures need to be developed and maintained by the DBA.
This is one of the main responsibilities of the DBA. Data/objects should be
backed up regularly so that, if there is any crash, they can be recovered without much
effort and data loss.
Security:
DBA is responsible for creating various database users and roles, and giving them
different levels of access rights.
Documentation:
The DBA should properly document all his activities so that, if he quits or a new
DBA comes in, the newcomer can understand the database without much effort. He
should maintain records of all installation, backup, recovery and security procedures, and
keep various reports about database performance.
Structure of a DBMS:
DDL Interpreter: It interprets DDL statements and records the definitions in tables containing metadata.
DML Compiler: The DML commands such as insert, update, delete, retrieve from the
application program are sent to the DML compiler for compilation into object code for database
access. The object code is then optimized by the query optimizer to determine the best way to execute the query,
and then sent to the data manager.
Query Evaluation Engine: It executes low-level instructions generated by the DML compiler.
Buffer Manager: It is responsible for retrieving data from disk storage into main memory. It
enables the database to handle data sizes that are much larger than the size of main memory.
File Manager: It manages the allocation of space on the disk storage and the data structure used
to represent information stored on disk.
Authorization and Integrity Manager: Checks the authority of users to access data and
satisfaction of the integrity constraints.
Transaction Manager: It ensures that the database remains in a consistent state despite the
system failures and that concurrent transaction execution proceeds without conflicting.
Data dictionary: Data Dictionary, which stores metadata about the database, in particular the
schema of the database such as names of the tables, names of attributes of each table, length of
attributes, and number of rows in each table.
Indices: An index is a small table having two columns: the first column contains a copy
of the primary or candidate key of a table, and the second column contains a set of pointers holding the
addresses of the disk blocks where the corresponding records are stored.
These statistics provide the optimizer with information about the state of the tables that will be
accessed by the SQL statement that is being optimized. The types of statistical information
stored in the system catalog include:
Information about tables including the total number of rows, information about
compression, and total number of pages.
Information about columns including number of discrete values for the column and the
distribution range of values stored in the column.
Information about table spaces including the number of active pages.
Current status of the index including whether an index exists or not, the organization of
the index (number of leaf pages and number of levels), the number of discrete values for
the index key, and whether the index is clustered.
Information about the table space and partitions.
1. Entity sets
2. Attributes
3. Relationship sets
Types of Entities:
Entity set: Entity set is a group of similar entities that share the same properties i.e
it represents schema / structure.
Symbol: a rectangle labelled with the entity set name, e.g., Student.
Attributes: Attributes are the units that describe the characteristics / properties of
entities.
Types of Attributes:
Simple attribute: an attribute that cannot be divided into smaller parts.
Example: DOB of a Student.
Composite attribute: an attribute that can be divided into smaller sub-parts.
Example: Name of a Student, composed of F_name and L_name (other attributes shown in the diagram: Roll_No).
Single attribute: Single attribute can have only one value at an instance of time.
Represented by oval
Example: DOB of a Student.
Multi-valued attribute: Multi-valued attribute can have more than one value at an
instance of time.
Represented by double oval.
Example: phone number of a Student (a student may have more than one phone number).
Stored attribute: Stored attribute is an attribute which are physically stored in the
database.
Example: DOB of a Student, stored in the database.
Derived Attribute: Derived attribute are the attributes that do not exist in the
physical database, but their values are derived from other attributes present in the
database.
Represented by dotted oval.
Example: Age of a Student, derived from the stored attribute DOB.
Relationship set: A relationship is an association among two or more entities; a relationship set is a set of similar relationships. A relationship set may have descriptive attributes.
Example: Teacher teaches Student, with descriptive attribute Since.
Case Study:
[Figure 1: instance diagram showing employees E1-E6 associated with departments D1-D4; ER diagram: Employee (N) - Works_for - (1) Department]
1. Participation constraints
Total participation: It specifies that each entity in the entity set must participate in at least one
relationship instance in that relationship set. It is also called mandatory participation.
Total participation is represented using a double line between the entity set and the
relationship set.
Example:
Partial participation: It specifies that each entity in the entity set may or may not
participate in the relationship instance in that relationship set. It is also called as
optional participation.
Partial participation is represented using a single line between the entity set and
relationship set.
[Figure: Employee (N) - Works_for - (1) Department, drawn with a single line to show partial participation]
One to one
One to many
Many to one
Many to many
1. One to one: An entity in A is associated with at most one entity in B, and an entity in B is associated with at most one entity in A.
Example: Every dept should have a HOD, only one employee manages a
dept, and an employee can manage only one dept.
[Figure: instance diagram of employees E1-E6 and departments D1-D4; Employee (N) - Works_for - (1) Department]
[Figure (one-to-many): Dept (1) - Has - (N) Employee]
4. Many to many: An entity in A is associated with any number (zero or
more) of entities in B, and an entity in B is associated with any number (zero
or more) of entities in A.
Example: Every emp is supported to work at least one project and he can
work on many projects, a project is supported to have many emp and a
project should have at least one emp.
[Figure: instance diagram of employees E1-E6 and projects P1-P4; Employee (M) - Works_on - (N) Project]
Strong entity: An entity type is strong if its existence does not depend on some
other entity type. Such entity is called Strong Entity.
Representation
Example:
Weak Entity:
A weak entity is an entity set that does not have sufficient attributes for Unique
Identification of its records. Simply a weak entity is nothing but an entity which
does not have a key attribute.
It contains a partial key called as discriminator which helps in identifying a
group of entities from the entity set.
Discriminator is represented by underlining with a dashed line.
Representation
Example:
Specialization:
It is the process of dividing a higher-level entity set into two or more lower-level entity sets based on their distinguishing characteristics (a top-down approach).
If we have a Person entity type with attributes such as Name, Phone_no and
Address, and suppose we are making software for a shop, then the Person can be
specialized into lower-level entity types such as Customer and Employee.
Generalization:
It is the process of extracting shared characteristics from two or more classes and
combining them into a generalized super class, shared characteristics can be
attributes.
Reverse of Specialization
Bottom – up approach.
We can see that the three attributes i.e. Name, Phone, and Address are common
here. When we generalize these two entities, we form a new higher-level entity
type Person. The new entity type formed is a generalized entity. We can see in the
below E-R diagram that after generalization the new generalized entity Person
contains only the common attributes i.e. Name, Phone, and Address. Employee
entity contains only the specialized attribute like Employee_id and Salary.
Similarly, the Customer entity type contains only specialized attributes like
Customer_id, Credit, and Email. So from this example, it is also clear that when
we go from bottom to top, it is a Generalization and when we go from top to
bottom it is Specialization. Hence, we can say that Generalization is the reverse of
Specialization.
relationship Representation:
Example:
Drawbacks:
After aggregation
While identifying the attributes of an entity set, it is sometimes not clear whether a
property should be modeled as an attribute or as an entity set. For example,
consider adding address information to the Employees entity set. One option is to
use an attribute address. This option is appropriate if we need to record only one
address per employee, an alternate is to create an entity set called Addresses and to
record associations between employees and address using a relationship (say,
Has_address). This more complex alternative is necessary in two situations: if we need to
record more than one address per employee, or if we want to capture the structure of the
address itself (e.g., city, street) for use in queries.
Relational Model
The Relational Model was proposed by E. F. Codd in 1970. At that time, most database
systems were based on one of two older data models (the hierarchical model and the network
model).
The relational model was designed to model data in the form of relations or tables. After
designing the conceptual model of database using ER diagram, we need to convert the
conceptual model in the relational model which can be implemented using any RDBMS
languages like MySQL, Oracle etc.
A database is a collection of one or more relations, where each relation is a table with
rows and columns.
Advantages:
Relation: The main construct for representing data in the relational model is a relation .
A Relation consists of a relation schema and relation instance.
Relation schema: Relation schema specifies / represents the name of the relation, name
of each field (or column or attribute), and the domain of each field.
The set of permitted values for each attribute is called its domain, or: a domain is referred
to in a relation schema by the attribute name and has a set of associated values.
Students (sid: string, name: string, login: string, age: integer, gpa: real)
Here field name sid has a domain named string. The set of values associated with
domain string is the set of all character strings.
Relation instance: An instance of a relation is a set of tuples (rows) in which each tuple has the same number of fields as the relation schema.
Example:
An integrity constraint (IC) is a condition on a database schema and restricts the data
that can be stored in an instance of the database.
Integrity constraints are a set of rules. It is used to maintain the quality of information.
Integrity constraints ensure that changes (update deletion, insertion) made to the
database by authorized users do not result in a loss of data consistency. Thus, integrity
constraints guard against accidental damage to the database.
1. Domain constraints
2. General constraints
3. Key constraints
4. Foreign key constraints.
Domain constraints: Domain integrity means defining a valid set of values for
an attribute. The values that appear in a column must be drawn from the domain
associated with that column. The domain of a field is essentially the type of that field.
Example:
Constraints are the rules that are to be followed while entering data into columns of the
database table.
Constraints ensure that data entered by the user into columns must be within the criteria
specified by the condition
For example, if you want to maintain only unique IDs in the employee table or if you
want to enter only age less than 18 in the student table etc.
NOT NULL
DEFAULT
CHECK
1. NOT NULL: Once not null is applied to a particular column, you cannot enter null
values to that column and restricted to maintain only some proper value other than null.
Example:
Example:
CREATE TABLE test (ID int NOT NULL, name char(50),age int, phone
varchar (50));
DESC test;
INSERT INTO test values(100,"Deepak",30,"9966554744");
SELECT * FROM test;
INSERT INTO test VALUES (NULL,"Raju",40,"9966554744");//Error
Example:
3. CHECK:
Suppose in real-time if you want to give access to an application only if the age
entered by the user is greater than 18 this is done at the back-end by using a
check constraint.
Check constraint ensures that the data entered by the user for that column is
within the range of values or possible values specified.
Example:
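A minimal MySQL-style sketch of a CHECK constraint; the table name voter and its columns are illustrative assumptions (note that MySQL enforces CHECK only from version 8.0.16):
CREATE TABLE voter (id int NOT NULL, name char(50), age int CHECK (age >= 18));
DESC voter;
INSERT INTO voter VALUES (1,"Deepak",30);
INSERT INTO voter VALUES (2,"Raju",15); // Error: CHECK constraint on age is violated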
SUPER KEY: A super key is combination of single or multiple keys which uniquely
identifies rows in a table.
Note: For any table we have at least one super key which is the combination of all the
attributes of the table.
Example:
{ID}: As no two students will have the same Id, it will help to uniquely access the
student details; hence it is a super key.
{NAME,ID}: Even if more than one student has the same name then the
combination of name and id will help to recognize the record, as student id will
break the tie and this combination is a super key.
{PHONE}: As no two students will have the same phone number, a phone number
will help to uniquely access the student details and hence phone number is a super
key.
No Physical representation in database.
A set of minimal attribute(s) that can identify each tuple uniquely in the given relation is
called a candidate key; or, a candidate key is a minimal super key.
Example:
A primary key is a candidate key that the database designer selects while designing the
database. Or
A primary key is a column / field or a set of columns / fields that uniquely identifies each
row in the table. The primary key follows these rules:
A primary key must contain unique values. If the primary key consists of
multiple columns, the combination of values in these columns must be unique.
A primary key column cannot have NULL values. Any attempt to insert or update
NULL to primary key columns will result in an error.
A table can have only one primary key.
Primary key column often has the AUTO_INCREMENT attribute that automatically
generates a sequential integer whenever you insert a new row into the table.
It is recommended to use INT or BIGINT data type for the primary key column.
Syntax:
CREATE TABLE table_name (column1 data_type PRIMARY KEY,
column2 data_type….);
Example:
Syntax:
Example:
CREATE TABLE student1 (sid int, name varchar(50),address varchar (50)
, PRIMARY KEY(sid));
DESC student1;
INSERT INTO student1 VALUES (1,"Deepak","HZB");
SELECT * FROM student1;
Syntax:
CREATE TABLE table_name (column1 data_type, column2 data_type, CONSTRAINT
constraint_name PRIMARY KEY (column1));
Example:
Syntax:
ALTER TABLE table_name ADD PRIMARY KEY (column_name);
Example:
Syntax:
ALTER TABLE table_name ADD CONSTRAINT constraint_name PRIMARY KEY
(column_name);
Example:
Syntax:
Example:
CREATE TABLE student6(name char(50),fname char(50),PRIMARY
KEY (name,fname));
DESC student6;
INSERT INTO student6 VALUES ("Raju","Rajireddy");
SELECT * FROM student6;
INSERT INTO student6 VALUES ("Raju","Rajaiah");
SELECT * FROM student6;
INSERT INTO student6 VALUES ("Karthik","Rajireddy");
SELECT * FROM student6;
INSERT INTO student6 VALUES ("Raju","Rajaiah");// Error(Duplicate Entry)
Note: The following fails because a table can have only one primary key:
CREATE TABLE student7(rollno int PRIMARY KEY, aadhar int PRIMARY KEY); // Error: multiple primary keys defined
Syntax:
Example:
Candidate keys that are left unimplemented or unused after implementing the primary key
are called as alternate keys. Or
Example:
Candidate keys:
{Emp_id}, {Emp_emailid}
These are the candidate keys we concluded from the above attributes. Now, we have to
choose one primary key, which is the most appropriate of the two, and that is Emp_id. So
the primary key is Emp_id. The remaining candidate key is Emp_emailid;
therefore, Emp_emailid is the alternate key.
A key comprising of multiple attributes and not just a single attribute is called as a
composite key. It is also known as compound key.
Any key such as super key, primary key, candidate key etc. can be called composite
key if it has more than one attributes.
If a single column alone fails to serve as a primary key, then a
combination of columns can help to uniquely access a record from the table; such
keys are nothing but composite keys.
Example:
cust_Id order_Id product_code product_count
C01 O001 P007 23
C02 O123 P007 19
C02 O123 P230 82
C01 O001 P890 42
None of these columns alone can play a role of key in this table.
Based on this, it is safe to assume that the key should be having more than one attributes:
Key in above table: {cust_id, product_code}.
This is a composite key as it is made up of more than one attributes.
Sometimes we need to maintain only unique data in the column of a database table,
this is possible by using a unique constraint.
Unique constraint ensures that all values in a column are unique.
Unique key accepts NULL.
Unique key is a single field or combination of fields that ensures that all values going to be
stored in the column will be unique. It means a column cannot store duplicate
values.
It allows us to use more than one column with UNIQUE constraint in a table.
Example:
Syntax:
Case 1:
Case 2:
Syntax:
Case 3:
Example:
CREATE TABLE student2 (sid int UNIQUE KEY, aadhar int UNIQUE KEY,
name varchar(50));
DESC student2;
INSERT INTO student2 VALUES (101,998563256,"Mahesh");
SELECT * FROM student2;
INSERT INTO student2 values (101,986532147,"Karthik");// Error (Duplicate sid)
INSERT INTO student2 values(102,998563256,"Karthik");// Error (Duplicate aadhar)
INSERT INTO student2 values(NULL,998513256,"Karthik");// NULL Accepted for sid
SELECT * FROM student2;
INSERT INTO student2 values(105,NULL,"Karun"); //NULL Accepted for Aadhar
SELECT * FROM student2;
INSERT INTO student2 values (NULL,NULL,"Kamal"); //NULL Accepted for sid and Aadhar
SELECT * FROM student2;
Example:
CREATE TABLE student3 (sid int,aadhar int,UNIQUE(sid,aadhar));
DESC student3;
INSERT INTO student3 VALUES (101,1234);
SELECT * FROM student3;
INSERT INTO student3 VALUES(101,1235);
SELECT * FROM student3;
INSERT INTO student3 VALUES(102,1234);
SELECT * FROM student3;
INSERT INTO student3 VALUES(101,1234); // Duplicate entry
Statement Syntax:
Example:
DROP UNIQUE KEY
Syntax:
Example:
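Both the syntax and an example are sketched below. MySQL stores a UNIQUE constraint as a unique index, so it is removed with DROP INDEX; the index name sid assumes MySQL's default of naming a single-column unique index after the column (using the student2 table above):
ALTER TABLE table_name DROP INDEX index_name;
ALTER TABLE student2 DROP INDEX sid;
SHOW INDEX FROM student2;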
In simple words, a foreign key in one table is used to point to the primary
key of another table.
A foreign key is used to create a relation between two or more tables.
Example:
Primary key
Foreign key
One table should contain primary key and other table contains foreign key.
A common column in both tables.
The common column's data type must be the same in both tables.
Syntax: Following is the basic syntax used for defining a foreign key using
CREATE TABLE OR ALTER TABLE statement
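The general form is sketched here (standard SQL; the identifiers are placeholders):
CREATE TABLE child_table (column1 data_type, FOREIGN KEY (column1) REFERENCES parent_table (parent_column));
ALTER TABLE child_table ADD CONSTRAINT constraint_name FOREIGN KEY (column1) REFERENCES parent_table (parent_column);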
Example:
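A hedged sketch that creates the activity table referred to by the DROP FOREIGN KEY statement below; the parent table person and the column names are assumptions, while the constraint name fk_ac comes from the notes:
CREATE TABLE person (pid int PRIMARY KEY, pname varchar(50));
CREATE TABLE activity (aid int PRIMARY KEY, pid int,
CONSTRAINT fk_ac FOREIGN KEY (pid) REFERENCES person(pid));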
Example:
SHOW CREATE TABLE activity;
ALTER TABLE activity DROP FOREIGN KEY fk_ac;
SHOW CREATE TABLE activity;
ICs are specified when a relation is created and enforced when a relation is
modified. The impact of domain, PRIMARY KEY, and UNIQUE constraints is
straightforward: If an insert, delete, or update command causes a violation, it is
rejected. Every potential IC violation is generally checked at the end of each SQL
statement execution.
Example:
CREATE TABLE students (sid int PRIMARY KEY,name varchar(50),login
varchar(50),age int);
The following insertion violates the primary key constraint because there is
already a tuple with the sid 53899, and it will be rejected by the DBMS:
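The insert statement itself is not shown in the notes; a sketch consistent with the students table above (the name, login and age values are illustrative, and a row with sid 53899 is assumed to exist already):
INSERT INTO students VALUES (53899,"Smith","smith@ee",18); // Error: duplicate entry for primary key sid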
The following insertion violates the constraint that the primary key cannot contain
null:
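Again the statement is not shown; a sketch with illustrative values:
INSERT INTO students VALUES (NULL,"Mike","mike@ee",17); // Error: the primary key column sid cannot be NULL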
Deletion does not cause a violation of domain, primary key or unique key
constraints. However, an update can cause violations, similar to an insertion.
This update violates the primary key constraint because there is already a tuple
with sid 5000.
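A sketch of such an update (the sid in the WHERE clause is illustrative; a row with sid 5000 is assumed to exist already):
UPDATE students SET sid = 5000 WHERE sid = 53899; // Error: duplicate entry 5000 for primary key sid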
The impact of foreign key constraints is more complex because SQL sometimes
tries to rectify a foreign key constraint violation instead of simply rejecting the
change.
ON DELETE CASCADE:
DROP TABLE employee;
DROP TABLE department;
CREATE TABLE department (dno int PRIMARY KEY,dname
varchar(50));
DESC department;
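A hedged sketch of re-creating the employee table with ON DELETE CASCADE, so that deleting a department automatically deletes its employees; the employee column names and sample rows are assumptions:
CREATE TABLE employee (eno int PRIMARY KEY, ename varchar(50), dno int,
FOREIGN KEY (dno) REFERENCES department(dno) ON DELETE CASCADE);
INSERT INTO department VALUES (1,"CSE");
INSERT INTO employee VALUES (101,"Ravi",1);
DELETE FROM department WHERE dno = 1; // the employee row with dno = 1 is removed automatically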
A relational database query (query, for short) is a question about the data, and the
answer consists of a new relation containing the result. A query language is a
specialized language for writing queries.
SQL is the most popular commercial query language for a relational DBMS. We
now present some SQL examples that illustrate how easily relations can be
queried.
The symbol ‘*’ means that we retain all fields of the relation.
4. Write a query to display details of employee who belongs to cse
department? Ans: SELECT * FROM employee where dept='cse';
Where clause is used to filter records, to extract only those records that
fulfill a specified condition.
6. Find the names of the employees who have salary greater than 4000.
Ans: SELECT name FROM employee WHERE salary > 4000;
First step of any relational database design is to make ER Diagram for it and then
convert it into relational Model.
It is very convenient to design the database using the ER Model by creating an ER
diagram and later on converting it into relational model.
Relational Model represents how data is stored in database in the form of table.
Not all the ER Model constraints and components can be directly transformed into
relational model, but an approximate schema can be derived.
Consider we have entity STUDENT in ER diagram with attributes id, name, address
and age.
A table with name Student will be created in relational model, which will have 4
columns, id, name, age, address and id will be the primary key for this table.
CREATE TABLE student (id int PRIMARY KEY, name varchar (50),age int, address
char(50));
In the relational model we do not represent a composite attribute as it is; instead we take only its
component (leaf-level) attributes.
The attributes of a relation includes the simple attributes of an entity set.
[Figure: Employee entity with key eno, attribute ename, and composite attribute salary made up of basic, DA, HRA and TA]
CREATE TABLE employee (eno int PRIMARY KEY, ename varchar(50),TA int,
DA int, HRA int, basic int);
[Figure: Employee entity with key eno, attribute ename and multivalued attribute Phone_no; mapped to two relations, Employee and Employee_phone]
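A sketch of the usual SQL mapping for a multivalued attribute: a separate relation holding the key of the entity plus the multivalued attribute (the exact column names and types are assumptions):
CREATE TABLE employee (eno int PRIMARY KEY, ename varchar(50));
CREATE TABLE employee_phone (eno int, phone_no varchar(15),
PRIMARY KEY (eno, phone_no),
FOREIGN KEY (eno) REFERENCES employee(eno));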
A Relationship set, like an entity set, is mapped to a relation in the relational model.
The attributes of this relation include the primary key attributes of each participating entity set
(as foreign keys) and the descriptive attributes of the relationship set.
[Figure: Employee (eno, ename, salary) - relationship with attribute since - Department (did, dname)]
1:1
When a row in a table is related to only one row in another table and vice versa, we
say that is a one to one relationship. This relationship can be created using Primary
key-Unique foreign key constraints.
Here, two tables will be required. Either combines ‘R’ with ‘A’ or ‘B’
Way-01:
1. AR ( a1 , a2 , b1 )
2. B ( b1 , b2 )
Way-02:
1. A ( a1 , a2 )
2. BR ( a1 , b1 , b2 )
Example:
[Figure: Student (1) - (1) Scholarship relationship, with attribute sch_year; the sample STUDENT TABLE and SCHOLARSHIP TABLE instances are not reproduced]
This is where a row from one table can have multiple matching rows in another table
this relationship is defined as a one to many relationships. This type of relationship
can be created using Primary key-Foreign key relationship.
1. A ( a1 , a2 )
2. BR ( a1 , b1 , b2 )
Example:
Country
customer address
The following SQL statements capture the preceding information.
customer (cid, cname): (1, Sravan), (2, Sravanthi)
address (cid, Addr_id, city, country): (1, 34, Huzurabad, India), (1, 46, Karimnagar, India), (2, 55, Hyderabad, India)
CREATE TABLE customer (cid int PRIMARY KEY, cname varchar(50));
CREATE TABLE address (cid int, addr_id int PRIMARY KEY, city varchar
(50) , country varchar(50), FOREIGN KEY(cid) REFERENCES
customer(cid));
A row from one table can have multiple matching rows in another table, and a row in
the other table can also have multiple matching rows in the first table this relationship
is defined as a many to many relationship. This type of relationship can be created
using a third table called “Junction table” or “Bridging table”. This Junction or
Bridging table can be assumed as a place where attributes of the relationships between
two lists of entities are stored.
This kind of Relationship, allows a junction or bridging table as a connection for the
two tables.
1. A ( a1 , a2 )
2. R ( a1 , b1 )
3. B ( b1 , b2 )
Example:
[Figure: Student (sid, sname) - (N) takes (M) - Course (cid, cname); sample takes row: (2, 102)]
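A sketch of the junction-table mapping for this M:N relationship (data types are assumptions; column names follow the figure):
CREATE TABLE student (sid int PRIMARY KEY, sname varchar(50));
CREATE TABLE course (cid int PRIMARY KEY, cname varchar(50));
CREATE TABLE takes (sid int, cid int,
PRIMARY KEY (sid, cid),
FOREIGN KEY (sid) REFERENCES student(sid),
FOREIGN KEY (cid) REFERENCES course(cid));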
Because of the total participation constraint, foreign key acquires NOT NULL
constraint i.e. now foreign key cannot be null.
Example:
[Figures: Student (1) - Can_have - (1) Scholarship, with attribute sch_year, shown with and without the total participation (double line) notation]
CREATE TABLE student (sid int PRIMARY KEY, sname varchar(50),schid int
UNIQUE KEY NOT NULL, schtype varchar(50),schyear date );
1. A ( a1 , a2 )
2. BR ( a1 , b1 , b2 )
Example:
Employee Dependent
View: Views are Virtual Relations or virtual tables, through which a selective
portion of the data from one or more relations (or tables) can be seen.
A view is created using the CREATE VIEW command.
Views provide a level of security in the database because the view can restrict
users to only specified columns and specified rows in a table. For example, if you
have a company with hundreds of employees in several Departments, you could
give the secretary of each department a view of only certain attributes and for the
employees that belong only to that secretary’s department
Syntax:
CREATE VIEW view_name AS
SELECT column_list
FROM table_name [where condition];
Example:
CREATE table student (rollno int,sname varchar(50),gender
char(50),gmail varchar(50),DOB date,password varchar(50));
DESC student;
INSERT INTO student VALUES
(601,"karthik","M","[email protected]",'1990-05-18',"123456"),
(602,"raju","M","[email protected]",'1990-04-21',"996655"),
(603,"rajitha","F","[email protected]",'1990-02-09',"111111");
SELECT * FROM student;
Example:
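A hedged example of a view over the student table created above, exposing only non-sensitive columns (the view name student_public is an assumption):
CREATE VIEW student_public AS
SELECT rollno, sname, gender
FROM student;
SELECT * FROM student_public;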
DROP VIEW:
We can drop the existing VIEW by using the DROP VIEW statement.
Syntax:
DROP VIEW view_name;
Example:
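Continuing the student_public sketch above:
DROP VIEW student_public;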
Advantages:
Views improve security of the database by showing only intended data to
authorized users. They hide sensitive data.
Views make life easy as you do not have to write complex queries time and again.
Query Language:
In simple words, a Language which is used to store and retrieve data from database
is known as query language. For example – SQL
For example, let's take a real-world example to understand procedural and non-procedural
languages: you are asking your younger brother to make a cup of tea. If you are just
telling him to make tea and not telling him the process, then it is a non-procedural
language; however, if you are telling him the step-by-step process (switch on the
stove, boil the water, add milk, etc.), then it is a procedural language.
Unary operators: SELECT (σ), PROJECT (π), RENAME (ρ)
Set theory operations: UNION (∪), INTERSECTION (∩), DIFFERENCE (-)
Binary operations: JOIN, DIVISION
1. Select Operator (σ): Select Operator is denoted by sigma (σ) and it is used to
find the tuples (or rows) in a relation (or table) which satisfy the given condition.
Notation: σp(r), where p is the selection predicate and r is the relation.
Example: σ age > 15 (Student)
This will fetch the tuples (rows) from the table Student for which age is greater
than 15.
Output:
ID NAME AGE
1 Raju 17
2 Ravi 19
You can also combine conditions using and, or, etc. For example,
σ gender = 'Male' ∧ age > 17 (Student)
will return tuples (rows) from the table Student with information about male
students of age more than 17 (assuming the Student table also has a Gender attribute).
2. Project Operator (π): The project operator is denoted by pi (π) and is used to select the desired columns (attributes) from a relation.
Example: π name (Student)
Output:
Name
Raju
Ravi
3. Rename Operator (ρ): used to rename a relation and/or its attributes.
Notation:
First way: ρ(C(old_name → new_name, or position → new_name), relation_name)
Second way: ρ S(B1, B2, ..., Bn)(R), or ρ S(R), or ρ B1, B2, ..., Bn(R)
where the symbol ρ (rho) denotes the RENAME operator, S is the new
relation name, and B1, B2, ..., Bn are the new attribute names. The first expression
renames both the relation and its attributes, the second renames the relation only,
and the third renames the attributes only. If the attributes of R are (A1, A2, ..., An) in
that order, then each Ai is renamed as Bi.
Example:
ρ(C(1 → sid1, 5 → sid2), R)
4. UNION (∪): Let R and S be two relations; R ∪ S returns a relation instance
containing all tuples that occur in either relation instance R or relation instance S
(or both). R and S must be union-compatible, and the schema of the result is
defined to be identical to the schema of R.
Two relation instances are said to be union-compatible if the following
conditions hold:
they have the same number of the fields, and
Corresponding fields, taken in order from left to right, have the same domains.
Note that field names are not used in defining union-compatible.
It eliminates the duplicate tuples.
Notation: R ∪ S
Example:
Student Employee
Rollno Name
101 Sagar
102 Santhosh
103 Sagarika
104 Sandeep
5. INTERSECTION (∩):
R ∩ S returns a relation instance containing all tuples that occur in both R and S.
The relations R and S must be union-compatible, and the schema of the result is
defined to be identical to the schema of R.
It is denoted by intersection ∩.
It eliminates the duplicate tuples.
Example:
Student Employee
Rollno Name
103 Sagarika
Example:
Student_list Failed_list
Cartesian product (cross product, X):
            R1    R2    R1 X R2
Attributes  x     y     x + y attributes
Tuples      n1    n2    n1 * n2 tuples
Example:
Student Course_details
id Name id course
Number of columns in Student
101 Ravi table are 2 102 Java
Number of column 102in course_details
Rajesh are 2 103 DBMS
Number of tuples 103 Rathan
in students are 3 105 C
Number of tuples in course_details are 3
So Number of columns in Student X Course_details are 2 + 2 = 4
Number of tuples (rows) in Student X Course_details is 3 * 3 = 9
Student X Course_details
id Name id Course
101 Ravi 102 Java
101 Ravi 103 DBMS
101 Ravi 105 C
102 Rajesh 102 Java
102 Rajesh 103 DBMS
102 Rajesh 105 C
103 Rathan 102 Java
103 Rathan 103 DBMS
103 Rathan 105 C
The join operation is one of the most useful operations in relational algebra and the
most commonly used way to combine information from two or more relations.
Join can be defined as a cross-product followed by selections and projections.
Types of Joins:
1. Inner Joins
2. Outer Joins
Theta join allows you to merge two tables based on the condition represented by
theta. Theta joins work for all comparison operators. It is denoted by symbol θ.
Syntax:
A ⋈θ B or A ⋈c B (A and B are two relations; θ or c is the join condition)
Example:
Mobile Laptop
Answer:
Equi Join:
Equi join is based on matched data as per the equality condition. The equi join uses
the comparison operator (=).
Mobile Laptop
Example:
Customer Order
OUTER JOIN:
In an outer join, along with tuples that satisfy the matching criteria, we also include
some or all tuples that do not match the criteria.
Example:
Student Marks Student Marks
In a full outer join, all tuples from both relations are included in the result,
irrespective of the matching condition.
Division (÷): R(x) = R1(z) ÷ R2(y). The result contains all tuples over attributes x that appear in R1(z) in combination
with every tuple from R2(y), where z = x ∪ y.
[Figure: division example with relations Students, Subjects and Submission]
Query: Find the names of sailors who have reserved boat 103 (relations Sailors and Reserves).
Temp1 ← σ bid=103 (Reserves)
sid bid day
22 103 10/8/98
31 103 11/6/98
74 103 9/8/98
Temp2 ← Temp1 ⋈ Sailors
π sname (Temp2)
sname
Dustin
Lubber
Horatio
Note: There are of course several ways to express Q1 in relational algebra. Here
is another:
πsname(σbid=103(Reserves⋈ Sailors))
Example (set difference over Sailors and Reserves):
π sid (Sailors) = {22, 29, 31, 32, 58, 64, 74, 85, 95}
π sid (Reserves) = {22, 31, 64}
π sid (Sailors) - π sid (Reserves) = {29, 32, 58, 74, 85, 95}, i.e., the sids of sailors who have not reserved any boat.
Tuple relational calculus is used for selecting those tuples that satisfy the given
condition.
A tuple relational calculus query has the form
{t | P(t)}
where t is the tuple variable (the resulting tuples) and
P(t) is the predicate, i.e., the condition used to fetch t.
Thus, it generates the set of all tuples t such that the predicate P(t) is true for t.
A free variable is a variable that has no limitations.
For example, the x in this function is a free
variable.
f(x) = 3x - 1
Why is it a free variable? It's a free variable because you don't see any limitations
put on it. It can equal a 1 or a -1, a 10 or even -1,000. Also, this function depends
on this x variable. So, the x here is a free variable.
A bound variable, on the other hand, is a variable whose values are limited, for example
by a quantifier or a summation. Consider the summation of (3x - 1) for x = 1 to 4.
Here, the x is the bound variable. The expression specifies that the value of x
goes from 1 to 4 in this summation. Because x is a bound variable with its
values already chosen, the expression isn't dependent on this variable.
Example (TRC): Find sailors who have reserved a red or a green boat:
{S | S ∈ Sailors ∧
∃R ∈ Reserves (S.sid = R.sid ∧
∃B ∈ Boats (B.bid = R.bid ∧
(B.color = 'Red' ∨ B.color = 'Green')))}
Domain Relational Calculus: In this, filtering is done based on the domain of the
attributes and not based on the tuple values.
A DRC query has the form {<x1, x2, ..., xn> | P(<x1, x2, ..., xn>)}, where each xi
is either a domain variable or a constant and P(<x1, x2, ..., xn>) denotes a DRC
formula whose only free variables are the variables among the xi, 1 ≤ i ≤ n. The
result of this query is the set of all tuples (x1, x2, ..., xn) for which the formula
evaluates to true.
A DRC formula is defined in a manner very similar to the definition of a TRC
formula. The main difference is that the variables are now domain variables. Let op
denote an operator in the set {<, >, =, ≤, ≥, ≠} and let X and Y be domain variables.
An atomic formula in DRC is one of the following:
<x1, x2, ..., xn> ∈ Rel, where Rel is a relation with n attributes; each xi, 1 ≤ i
≤ n, is either a variable or a constant.
X op Y
X op constant, or constant op X.
We will present a number of sample queries using the following table definitions:
Queries:
CREATE TABLE sailors (sid int PRIMARY KEY, sname varchar(50),
rating int, age int);
CREATE TABLE boats(bid int PRIMARY KEY,bname varchar(50),color
varchar(50));
CREATE TABLE reserves (sid int,bid int,day date,PRIMARY KEY
(sid,bid,day) ,FOREIGN KEY(sid) REFERENCES sailors(sid), FOREIGN
KEY(bid) REFERENCES boats(bid));
Example:
Output:
Output:
Example:
Output:
Q: Find the names of sailors who have reserved boat number 103.
Output:
Output:
Sailors
UNION:
Union is an operator that allows us to combine the results of two or more
SELECT queries into a single result set. By default it
removes duplicate rows from the result set. MySQL always uses the
column names of the first SELECT statement as the column names of the result
set (output).
The number and order of the columns should be the same in all tables that
you are going to use.
The data type must be compatible with the corresponding positions of each
select query.
The column name selected in the different SELECT queries must be in the
same order.
Syntax:
SELECT expression1, expression2, expression_n
FROM tables
[WHERE conditions]
UNION
SELECT expression1, expression2, expression_n
FROM tables
[WHERE conditions];
Q: Find the sid’s and names of sailors who have reserved a red or a green boat.
A: SELECT S.sid,S.sname
FROM Sailors S, Reserves R, Boats B
WHERE S.sid = R.sid AND R.bid = B.bid AND B.color = 'red'
union
SELECT S2.sid,S2.sname
FROM Sailors S2, Boats B2, Reserves R2
WHERE S2.sid = R2.sid AND R2.bid = B2.bid AND B2.color = 'green';
INTERSECT:
The INTERSECT operator is used to fetch the records that are in common between
two SELECT statements or data sets. If a record exists in one query and not in the
other, it will be omitted from the INTERSECT results.
Syntax:
SELECT expression1, expression2, ... expression_n
FROM tables
[WHERE conditions]
INTERSECT
SELECT expression1, expression2, ... expression_n
FROM tables
[WHERE conditions];
The number and order of the columns should be the same in all tables that
you are going to use.
The data type must be compatible with the corresponding positions of each
select query.
The column name selected in the different SELECT queries must be in the
same order.
Q: Find the sids and names of sailors who have reserved both a red and a green boat.
A: SELECT S.sid, S.sname FROM Sailors S, Reserves R, Boats B
WHERE S.sid = R.sid AND R.bid = B.bid AND B.color = 'red'
INTERSECT
SELECT S2.sid, S2.sname FROM Sailors S2, Reserves R2, Boats B2
WHERE S2.sid = R2.sid AND R2.bid = B2.bid AND B2.color = 'green';
Output:
EXCEPT:
The EXCEPT clause in SQL is widely used to filter records from more than one
table. This statement first combines the two SELECT statements and returns
records from the first SELECT query, which isn’t present in the second SELECT
query's result. In other words, it retrieves all rows from the first SELECT query
while deleting redundant rows from the second.
This statement behaves the same as the minus (set difference) operator does in
mathematics.
Syntax:
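The syntax mirrors that of UNION and INTERSECT, and a hedged example follows; note that EXCEPT is available in MySQL only from version 8.0.31 (older versions emulate it with NOT IN):
SELECT expression1, expression2, ... expression_n FROM tables [WHERE conditions]
EXCEPT
SELECT expression1, expression2, ... expression_n FROM tables [WHERE conditions];
Q: Find the sids of sailors who have reserved a red boat but not a green boat.
SELECT S.sid FROM Sailors S, Reserves R, Boats B
WHERE S.sid = R.sid AND R.bid = B.bid AND B.color = 'red'
EXCEPT
SELECT S2.sid FROM Sailors S2, Reserves R2, Boats B2
WHERE S2.sid = R2.sid AND R2.bid = B2.bid AND B2.color = 'green';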
As per the execution process of a sub-query, it is again classified in two ways:
1. Non-correlated (nested) queries: the inner query is executed first, and its result is used by the outer query.
2. Correlated queries.
In correlated queries, the outer query is executed first and the inner query is executed once for each row of the outer query,
i.e., the inner query always depends on the result of the outer query.
Output:
(Q) Find the names of sailors who have reserved a red boat.
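The answer is not given in the notes; a sketch that mirrors the NOT IN query below, using nested IN sub-queries:
A: SELECT S.sname FROM sailors S WHERE S.sid IN
(SELECT R.sid FROM reserves R WHERE R.bid IN
(SELECT B.bid FROM boats B WHERE B.color="red"));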
(Q) Find the names of sailors who have not reserved a red boat.
A: SELECT S.sname from sailors S WHERE S.sid NOT IN
(SELECT R.sid FROM reserves R where R.bid IN
(SELECT B.bid FROM boats B WHERE B.color="red"));
Output:
Example:
(Q) Find the names of sailors who have reserved boat number 103
SELECT S.sname FROM Sailors S WHERE
EXISTS (SELECT * FROM Reserves R WHERE R.bid = 103 AND R.sid = S.sid)
Important Points:
GROUP BY clause is used with the SELECT statement.
In the query, GROUP BY clause is placed after the WHERE clause.
In the query, GROUP BY clause is placed before ORDER BY clause if
used any.
Syntax:
SELECT column1, function_name(column2)
FROM table_name
WHERE condition
GROUP BY column1, column2
ORDER BY column1, column2;
Examples:
Q: What is the highest and lowest salary where each dept is paying.
A: SELECT dept,MAX(sal),MIN(sal) FROM employees GROUP BY dept;
Output:
Q: Display the number of male and female faculty from each Department.
A: SELECT dept,gender,COUNT(*) FROM employees GROUP BY dept, gender;
Output:
Having Clause:
The HAVING clause is like WHERE but operates on grouped records returned by
a GROUP BY. HAVING applies to summarized group records, whereas WHERE
applies to individual records. Only the groups that meet the HAVING criteria will
be returned.
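A hedged example built on the employees table used in the GROUP BY queries above (the salary threshold is illustrative):
SELECT dept, AVG(sal)
FROM employees
GROUP BY dept
HAVING AVG(sal) > 40000;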
OUTER JOIN:
In an outer join, along with tuples that satisfy the matching criteria, we also include
some or all tuples that do not match the criteria.
Output:
Left outer join: this operation keeps all tuples of the left relation; if no matching tuple is found in the
right relation, the attributes of the right relation in the join result are filled with null values.
Right outer join: this operation keeps all tuples of the right relation. If no matching tuple is found in the
left relation, then the attributes of the left relation in the join result are filled with null values.
In a full outer join, all tuples from both relations are included in the result,
irrespective of the matching condition.
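A hedged MySQL sketch using the sailors and reserves tables defined earlier; MySQL supports LEFT and RIGHT OUTER JOIN directly, while FULL OUTER JOIN is usually emulated with a UNION of the two:
SELECT S.sid, S.sname, R.bid
FROM sailors S LEFT OUTER JOIN reserves R ON S.sid = R.sid; // sailors with no reservation appear with a NULL bid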
John is the marketing officer in a company. When a new customer data is entered
into the company’s database he has to send the welcome message to each new
customer. If it is one or two customers John can do it manually, but what if the
count is more than a thousand? Well in such scenario triggers come in handy.
Thus, now John can easily create a trigger which will automatically send a
welcome email to the new customers once their data is entered into the database.
Important points:
Triggers can be made to insert, update and delete statements in SQL. We have two
types of triggers:
Syntax:
CREATE [OR REPLACE ] TRIGGER trigger_name
{BEFORE | AFTER | INSTEAD OF }
{INSERT [OR] | UPDATE [OR] | DELETE}
[OF col_name]
ON table_name
[REFERENCING OLD AS o NEW AS n]
[FOR EACH ROW]
WHEN (condition)
DECLARE
Declaration-statements
BEGIN
Executable-statements
EXCEPTION
Exception-handling-statements
END;
BEFORE INSERT:
AFTER INSERT:
DROP TABLE student;
CREATE TABLE student(id int,name char(30),gmail
char(20),pwd char(20));
INSERT INTO student
values(101,"Deepak","[email protected]","123456");
SELECT * FROM student;
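A minimal MySQL-style sketch of a BEFORE INSERT trigger on the student table above; the trigger name and the action it performs are illustrative assumptions:
DELIMITER //
CREATE TRIGGER student_before_insert
BEFORE INSERT ON student
FOR EACH ROW
BEGIN
SET NEW.name = UPPER(NEW.name); -- example action: store names in upper case
END //
DELIMITER ;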
BEFORE UPDATE:
1. Schema Refinement:
The Schema Refinement is the process that re-defines the schema of a relation. The
best technique of schema refinement is decomposition.
Normalization or Schema Refinement is a technique of organizing the data in
the database. It is a systematic approach of decomposing tables to eliminate data
redundancy and undesirable characteristics like Insertion, Update and Deletion
Anomalies.
Redundancy refers to repetition of same data or duplicate copies of same data
stored in different locations.
Three types of redundancy:
1. File level
2. Entire record level
3. Few attribute have redundancy.
Here all the data is stored in a single table which causes redundancy of data or say
anomalies as SID and Sname are repeated once for same CID. Let us discuss
anomalies one by one.
1. Insertion anomalies: It may not be possible to store some information unless
some other information is stored as well.
2. redundant storage: some information is stored repeatedly
3. Update anomalies: If one copy of redundant data is updated, then
inconsistency is created unless all redundant copies of data are updated.
4. Deletion anomalies: It may not be possible to delete some information
without losing some other information as well.
Problem in updation / updation anomaly – If there is updation in the fee from
5000 to 7000, then we have to update FEE column in all the rows, else data will
become inconsistent.
Insertion Anomaly and Deletion Anomaly- These anomalies exist only due to
redundancy, otherwise they do not exist.
Insertion Anomalies: A new course, say C4 with course name DB, is introduced, but no
student has taken the DB subject yet.
Because of inserting some data, we are forced to insert some other dummy data.
Deletion Anomaly:
Deleting student S3 also deletes the course information.
Because of deleting some data, we are forced to lose some other useful data.
Problems related to decomposition:
1. Do we need to decompose a relation?
2. That problems (if any) does a given decomposition cause?
Properties of Decomposition:
Consider there is a relation R which is decomposed into sub relations R1, R2 , …. , Rn.
This decomposition is called lossless join decomposition when the join of the sub
relations results in the same relation R that was decomposed.
For lossless join decomposition, we must always have R1 ⋈ R2 ⋈ ... ⋈ Rn = R.
Example:
Consider the following relation R ( A , B , C ):
A B C
1 2 1
2 5 3
3 3 3
R(A, B, C)
Consider this relation is decomposed into two sub relations R 1 (A, B) and R2 (B, C)
R1(A, B): (1, 2), (2, 5), (3, 3)
R2(B, C): (2, 1), (5, 3), (3, 3)
Now, let us check whether this decomposition is lossless or not.
For lossless decomposition, we must have R1 ⋈ R2 = R.
Now, if we perform the natural join (⋈) of the sub relations R1 and R2 , we get
A B C
1 2 1
2 5 3
3 3 3
R1 ⋈ R2 = R(A, B, C); since the join gives back exactly R, the decomposition is lossless.
Example - 2:
R (A, B, C)
A B C
1 1 1
2 1 2
3 2 1
4 3 2
R1 (A, R2 (B, C)
B)
A B B C
1 1 1 1
2 1 1 2
3 2 2 1
4 3 3 2
Here, R1 ⋈ R2 contains extra (spurious) tuples such as (1, 1, 2) and (2, 1, 1) that were not in R, so R1 ⋈ R2 ≠ R and the decomposition is lossy.
Functional dependencies
Functional dependency is a relationship that exists when one attribute uniquely
determines another attribute.
A functional dependency X → Y holds true in a relation if any two tuples having
the same value of attribute X also have the same value of attribute Y.
i. e if X and Y are the two sets of attributes in a relation R
where X ⊆ R, Y ⊆ R
Then, for a functional dependency to exist from X to Y,
if t1.X=t2.X then t1.Y=t2.Y where t1,t2 are tuples and X,Y are attributes.
Example:
Example:
A B C D E
a 2 3 4 5
2 a 3 4 5
a 2 3 6 5
a 2 3 6 6
1. Reflexivity Rule
If Y is a subset of X, then X determines Y.
If Y ⊆ X then X → Y
2. Augmentation Rule
In augmentation, if X determines Y, then XZ determines YZ for any Z.
If X → Y then XZ → YZ
Example:
For R (ABCD), if A → B then AC → BC
3. Transitivity Rule
In the transitive rule, if X determines Y and Y determine Z, then X must also
determine Z.
If X → Y and Y → Z then X → Z
Additional rules:
1. Union Rule
Union rule says, if X determines Y and X determines Z, then X must also
determine Y and Z.
If X→ Y and X→Z then X→YZ
2. Composition
If W → Z, X → Y, then WX → ZY.
3. Decomposition
If X→YZ then X→Y and X→Z
Example – 1:
We are given the relation R(A, B, C, D, E). This means that the table R has five
columns: A, B, C, D, and E. We are also given the set of functional dependencies:
{A->B, B->C, C->D, D->E}.
What is {A} +?
First, we add A to {A}+.
What columns can be determined given A? We have A -> B, so we can
determine B. Therefore, {A}+ is now {A, B}.
What columns can be determined given A and B? We have B -> C in the
functional dependencies, so we can determine C. Therefore, {A}+ is now
{A, B, C}.
Now, we have A, B, and C. What other columns can we determine? Well,
we have C -> D, so we can add D to {A}+.
Now, we have A, B, C, and D. Can we add anything else to it? Yes, since D
-> E, we can add E to {A}+.
We have used all of the columns in R and we have all used all functional
dependencies. {A}+ = {A, B, C, D, E}.
A+ = { A , B , C , D , E , F , G }
Example: R(ABCDEFG) with A → B, BC → DE, AEG → G.
(AC)+ = {A, C}            (using reflexivity)
      = {A, B, C}          (using A → B)
      = {A, B, C, D, E}    (using BC → DE)
Example: R(ABCDE) with A → BC, CD → E, B → D, E → A.
B+ = {B}                   (using reflexivity)
   = {B, D}                (using B → D)
Example: R(ABCDEF) with
AB → C
BC → AD
D → E
CF → B
(AB)+ = {A, B}             (using reflexivity)
      = {A, B, C}          (using AB → C)
      = {A, B, C, D}       (using BC → AD)
      = {A, B, C, D, E}    (using D → E)
Example: R(ABCDEF) with A → B, C → F, E → A, EC → D.
Step 1: Find the essential attributes, i.e., the attributes that never appear on the right-hand side of any functional dependency; they must be part of every candidate key. Here, the essential attributes are C and E.
Step 2:
Now,
We will check if the essential attributes together can determine all
remaining non-essential attributes.
To check, we find the closure of CE.
{ CE }+ ={C,E}
= { C , E , F } ( Using C → F )
= { A , C , E , F } ( Using E → A )
= { A , C , D , E , F } ( Using EC → D )
= { A , B , C , D , E , F } ( Using A → B )
We conclude that CE can determine all the attributes of the given relation.
So, CE is the only possible candidate key of the relation.
More Examples:
R(ABCD)
AB→CD
D→A
Candidate keys are AB,BD
R(ABCD)
AB→CD
C→A
D→B
Candidate keys are AB, AD, BC,DC
R(WXYZ)
Z→W
Y→XZ
WX→Y
Candidate keys are Y, WX,ZX
R(ABCD)
A→B
B→C
C→A
Candidate keys are AD, BD, CD
R(ABCDE)
AB→C
C→D
D→E
A→B
C→A
Candidate keys are A,C
R(ABCDE)
A→D
AB→C
B→E
D→C
E→A
Candidate key is B
R(ABCDEF)
AB→C
DC→AE
E→F
Candidate key is ABD,BDC
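The candidate keys listed above can be checked mechanically: a candidate key is a minimal set of attributes whose closure is the entire relation. The brute-force sketch below (Python) reuses the closure idea and verifies the last example, R(ABCDEF) with AB→C, DC→AE, E→F; it is only practical for small hand-worked examples.

```python
from itertools import combinations

def closure(attrs, fds):
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def candidate_keys(attributes, fds):
    keys = []
    for size in range(1, len(attributes) + 1):        # smallest sets first, so keys stay minimal
        for combo in combinations(sorted(attributes), size):
            if closure(set(combo), fds) == attributes:
                if not any(set(k) <= set(combo) for k in keys):   # skip supersets of found keys
                    keys.append(combo)
    return keys

R = {'A', 'B', 'C', 'D', 'E', 'F'}
fds = [({'A', 'B'}, {'C'}), ({'D', 'C'}, {'A', 'E'}), ({'E'}, {'F'})]
print(candidate_keys(R, fds))     # [('A', 'B', 'D'), ('B', 'C', 'D')]
```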
Relation is in 1NF
Partial Dependency
A partial dependency is a dependency in which only a few attributes (a part) of the candidate key
determine non-prime attribute(s).
Or
A partial dependency is a dependency where a portion of the candidate key or
incomplete candidate key determines non-prime attribute(s).
In other words,
X → a is called a partial dependency if and only if-
1. X is a proper subset of some candidate key and
2. a is a non-prime attribute or non – key attribute
If any one condition fails, then it will not be a partial dependency.
Example - 2:
Consider a relation- R (A, B, C, D, E, F) with functional dependencies-
A → B
B → C
C → D
D → E
The possible candidate key for this relation is
AF
From here,
Prime attributes = {A, F}
Non-prime attributes = { B, C, D, E}
Now, if we observe the given dependencies,
there is a partial dependency.
This is because an incomplete part of the candidate key (attribute A alone)
determines the non-prime attribute B.
Thus, we conclude that the given relation is not in 2NF.
Example – 3:
Consider a relation- R (A, B, C, D) with functional dependencies-
AB → CD
C→A
D→B
The possible candidate keys for this relation are
AB, AD, BC, CD
Since every attribute is a prime attribute, no proper subset of a candidate key determines a non-prime attribute; there is no partial dependency, so the relation is in 2NF.
Example - 4:
Consider a relation- R (A, B, C, D) with functional dependencies-
AB → D
B→C
The possible candidate key is AB. Here B → C is a partial dependency (B is a proper subset of the candidate key AB and C is a non-prime attribute), so the relation is not in 2NF.
Transitive dependency: When a non-prime attribute determines another non-prime
attribute, it is called a transitive dependency.
Example-
Consider a relation- R ( A , B , C , D , E ) with functional dependencies-
A → BC
CD → E
B → D
E → A
The possible candidate keys for this relation are-
A, E, CD, BC
From here,
Prime attributes = { A , B , C , D , E }
There are no non-prime attributes
Now,
It is clear that there are no non-prime attributes in the relation; all the attributes
are prime. Hence no non-prime attribute can determine another non-prime attribute,
so there is no transitive dependency and the relation is in 3NF.
Example:
Consider a relation- R ( A , B , C ) with the functional dependencies-
A → B
B → C
C → A
The possible candidate keys for this relation are- A, B, C
Transaction Concept
Transaction:
Collections of operations that form a single logical unit of work are called
transactions.
Operations in a transaction typically consist of read and write operations on data items, ending with a commit (or an abort).
Transaction Management
Transaction Management ensures that the database remains in a consistent
(correct) state despite system failures (e.g power failures, and operating system
crashes) and transaction failures.
Atomicity:
This property ensures that either the transaction occurs completely or it does
not occur at all.
In other words, it ensures that no transaction occurs partially.
That is why, it is also referred to as “All or nothing rule“.
It is the responsibility of Transaction Control Manager to ensure atomicity
of the transactions.
Example:
Consider a transaction that transfers an amount from account X to account Y by first
executing write(X) and then write(Y). If the transaction fails after write(X) but before
write(Y), then the amount has been deducted from X but not added to Y. This results in an
inconsistent database state. Therefore, the transaction must be executed in its totality
in order to ensure correctness of the database state.
Isolation:
This property ensures that multiple transactions can occur concurrently without
leading to the inconsistency of database state. Transactions occur independently
without interference. Changes occurring in a particular transaction will not be
visible to any other transaction until that particular change in that transaction is
written to memory or has been committed. This property ensures that concurrent execution
of transactions results in a state that is equivalent to a state that would have been
achieved had these transactions been executed serially in some order.
Let X= 500, Y = 500.
T1 T2
Read(X) Read(X)
X: =X*100 Read(Y)
Write(X) Z: = X + Y
Read(Y) Write(Z)
Y: = Y – 50
Write(Y)
Durability:
This property ensures that once the transaction has completed execution, the
updates and modifications to the database are stored in and written to disk and they
persist even if a system failure occurs. These updates now become permanent and
are stored in non-volatile memory.
The effects of the transaction, thus, are never lost.
The ACID properties, in totality, provide a mechanism to ensure consistency of a
database in a way such that each transaction is a group of operations that acts a
single unit, produces consistent results, acts in isolation from other operations and
updates that it makes are durably stored.
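As a small, hedged illustration of atomicity and durability in practice, the sketch below uses Python's built-in sqlite3 module: either both updates of a transfer are committed together, or an error rolls both back. The database file, table, and column names are invented for this example.

```python
import sqlite3

conn = sqlite3.connect("bank_demo.db")       # hypothetical example database
conn.execute("CREATE TABLE IF NOT EXISTS account(name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT OR REPLACE INTO account VALUES ('X', 5000), ('Y', 10000)")
conn.commit()

try:
    # One transaction: both writes succeed together or not at all (atomicity).
    conn.execute("UPDATE account SET balance = balance - 1000 WHERE name = 'X'")
    conn.execute("UPDATE account SET balance = balance + 1000 WHERE name = 'Y'")
    conn.commit()                            # durability: the changes persist after commit
except sqlite3.Error:
    conn.rollback()                          # atomicity: undo the partial transfer
finally:
    conn.close()
```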
Active State
As we have discussed in the DBMS transaction introduction, a transaction is a
sequence of operations. If a transaction is in execution, then it is said to be in the active state.
Terminated State
This is the last state in the life cycle of a transaction. After entering the committed
state or aborted state, the transaction finally enters a terminated state, where
its life cycle finally comes to an end.
Implementation of Atomicity and Durability (shadow-copy scheme):
First, the operating system is asked to make sure that all pages of the new
copy of the database have been written out to disk. (Unix systems use the
fsync command for this purpose.)
After the operating system has written all the pages to disk, the database
system updates the pointer db-pointer to point to the new copy of the
database; the new copy then becomes the current copy of the database. The
old copy of the database is then deleted.
Figure below depicts the scheme, showing the database state before and after the
update.
If the transaction fails at any time before db-pointer is updated, the old
contents of the database are not affected.
We can abort the transaction by just deleting the new copy of the database.
Once the transaction has been committed, all the updates that it performed
are in the database pointed to by db pointer.
Thus, either all updates of the transaction are reflected, or none of the effects
are reflected, regardless of transaction failure.
Concurrent Execution
Transaction-processing systems usually allow multiple transactions to run
concurrently. Allowing multiple transactions to update data concurrently causes
several complications with consistency of the data.
Ensuring consistency in spite of concurrent execution of transactions requires extra
work; it is far easier to insist that transactions run serially—that is, one at a time,
each starting only after the previous one has completed. However, there are two good
reasons for allowing concurrency:
Improved throughput and resource utilization:
While one transaction is waiting for an I/O operation on disk to complete, another
transaction can be using the CPU, so more transactions can be completed in a given
amount of time (higher throughput).
Correspondingly, the processor and disk utilization also increase; in other
words, the processor and disk spend less time idle, or not performing any
useful work.
Reduced waiting time:
There may be a mix of transactions running on a system, some short and
some long.
If transactions run serially, a short transaction may have to wait for a
preceding long transaction to complete, which can lead to unpredictable
delays in running a transaction.
If the transactions are operating on different parts of the database, it is better
to let them run concurrently, sharing the CPU cycles and disk accesses
among them.
Concurrent execution reduces the unpredictable delays in running
transactions.
Moreover, it also reduces the average response time: the average time for a
transaction to be completed after it has been submitted.
The idea behind using concurrent execution in a database is essentially the same as
the idea behind using multi programming in an operating system.
The database system must control the interaction among the concurrent
transactions to prevent them from destroying the consistency of the database. It is
achieved using concurrency-control schemes.
Schedule:
The order in which the operations of multiple transactions appear for execution is
called as a schedule.
It represents the order in which instructions of a transaction are executed.
Schedules may be classified as serial and non-serial schedules.
Serial Schedules-
In serial schedules, all the transactions execute one after the other; the next transaction
starts only after the previous transaction has completed its execution. Serial schedules
are always consistent.
Example:
In this schedule,
There are two transactions T1 and T2 executing serially one after the other.
Transaction T1 executes first.
After T1 completes its execution, transaction T2 executes.
So, this schedule is an example of a Serial Schedule.
Non – serial / parallel schedule:
In a non-serial schedule, the operations of multiple transactions are interleaved; execution
switches to another transaction before the current transaction has completed.
Non-serial schedules are NOT always consistent.
Example:
In this schedule,
There are two transactions T1 and T2 executing concurrently.
The operations of T1 and T2 are interleaved.
So, this schedule is an example of a Non-Serial Schedule.
1. Dirty Read Problem (W–R Conflict)
Reading a value written by a transaction that has not yet committed is called a dirty read.
It becomes problematic only when the uncommitted transaction fails and rolls
back later due to some reason.
Example:
Here,
1. T1 reads the value of A.
2. T1 updates the value of A in the buffer.
3. T2 reads the value of A from the buffer.
4. T2 writes the updated the value of A.
5. T2 commits.
6. T1 fails in later stages and rolls back.
2. Unrepeatable Read Problem (R–R Conflict)
This problem occurs when a transaction gets to read different values of the same
variable in its different read operations, even though it has not updated the value itself.
Example:
Here,
1. T1 reads the value of X (= 10 say).
2. T2 reads the value of X (= 10).
3. T1 updates the value of X (from 10 to 15 say) in the buffer.
4. T2 again reads the value of X (but = 15).
In this example,
T2 gets to read a different value of X in its second reading.
T2 wonders how the value of X got changed because according to it, it is
running in isolation.
3. Lost Update Problem (W – W Conflict)
This problem occurs when multiple transactions execute concurrently and updates
from one or more transactions get lost.
Example:
Here,
1. T1 reads the value of A (= 10 say).
2. T1 updates the value of A (= 15 say) in the buffer.
3. T2 does a blind write A = 25 (write without read) in the buffer.
4. T2 commits.
5. When T1 commits, it writes A = 15 to the database, so the update A = 25 made by T2 is lost.
NOTE
This problem occurs whenever there is a write-write conflict.
In write-write conflict, there are two writes one by each transaction on the
same data item without any read in the middle.
4. Phantom Read Problem
This problem occurs when a transaction reads some variable from the buffer
and, when it reads the same variable later, finds that the variable no longer exists.
Example:
Here,
1. T1 reads X.
2. T2 reads X.
3. T1 deletes X.
4. T2 tries reading X but does not find it.
In this example,
T2 finds that there does not exist any variable X when it tries reading X
again.
T2 wonders who deleted the variable X because according to it, it is running
in isolation.
Serializability
Some non-serial schedules may lead to inconsistency of the database.
Serializability is a concept that helps to identify which non-serial schedules are
correct and will maintain the consistency of the database.
Serializable Schedules
If a given non-serial schedule of ‘n’ transactions is equivalent to some serial
schedule of ‘n’ transactions, then it is called as a Serializable schedule.
Types of Serializability
Serializability is mainly of two types-
1. Conflict Serializability
2. View Serializability
Conflict Serializability
If a given non-serial schedule can be converted into a serial schedule by swapping
its non-conflicting operations, then it is called as a conflict Serializable schedule.
Conflicting Operations-
Two operations are called conflicting operations if all the following conditions
hold true for them:
1. Both the operations belong to different transactions.
2. Both the operations operate on the same data item.
3. At least one of the operations is a write operation.
Case 1 (No conflict):  Ti: R(A), Tj: R(B)  → After swapping: Tj: R(B), Ti: R(A)  (different data items, both reads)
Case 2 (No conflict):  Ti: R(A), Tj: R(A)  → After swapping: Tj: R(A), Ti: R(A)  (same data item, but both are reads)
Case 3 (Conflict):     Ti: R(A), Tj: W(A)  → After swapping: Tj: W(A), Ti: R(A)  (Ti is affected: it now reads the value written by Tj)
Case 4 (Conflict):     Ti: W(A), Tj: R(A)  → After swapping: Tj: R(A), Ti: W(A)  (Tj is affected: it no longer reads the value written by Ti)
Case 5 (Conflict):     Ti: W(A), Tj: W(A)  → After swapping: Tj: W(A), Ti: W(A)  (the final value in the database is affected)
Step-01:
List all the conflicting operations in the schedule.
Step-02:
Start creating a precedence graph by drawing one node for each transaction.
Step-03:
Draw an edge for each conflict pair such that if Xi(V) and Yj(V) form a
conflicting pair, then draw an edge from Ti to Tj.
This ensures that Ti gets executed before Tj.
Step-04:
Check if there is any cycle formed in the graph.
If there is no cycle found, then the schedule is conflict Serializable otherwise
not.
NOTE
By performing the Topological Sort of the Directed Acyclic Graph so
obtained, the corresponding serial schedule(s) can be found.
Such schedules can be more than 1.
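The precedence-graph test can be automated. The sketch below (Python) builds the graph from a list of (transaction, operation, data item) triples using the conflict definition above and detects cycles with a depth-first search; the encoded schedule is S1 from Example 1, assuming the column assignment shown there.

```python
def conflict_serializable(schedule):
    """schedule: list of (txn, op, item) in execution order, op is 'R' or 'W'."""
    txns = {t for t, _, _ in schedule}
    edges = {t: set() for t in txns}
    for i, (ti, op_i, x) in enumerate(schedule):
        for tj, op_j, y in schedule[i + 1:]:
            # conflicting pair: different transactions, same item, at least one write
            if ti != tj and x == y and 'W' in (op_i, op_j):
                edges[ti].add(tj)              # Ti must come before Tj

    # cycle detection by depth-first search with node colouring
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {t: WHITE for t in txns}
    def has_cycle(t):
        colour[t] = GREY
        for u in edges[t]:
            if colour[u] == GREY or (colour[u] == WHITE and has_cycle(u)):
                return True
        colour[t] = BLACK
        return False
    return not any(colour[t] == WHITE and has_cycle(t) for t in txns)

S1 = [('T1','R','X'), ('T3','R','Y'), ('T3','R','X'), ('T2','R','Y'), ('T2','R','Z'),
      ('T3','W','Y'), ('T2','W','Z'), ('T1','R','Z'), ('T1','W','X'), ('T1','W','Z')]
print(conflict_serializable(S1))   # True: edges T2->T3, T3->T1, T2->T1 form no cycle
```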
Example -1:
S1
Time   T1      T2      T3
t1     R(X)
t2                     R(Y)
t3                     R(X)
t4             R(Y)
t5             R(Z)
t6                     W(Y)
t7             W(Z)
t8     R(Z)
t9     W(X)
t10    W(Z)
In the above schedule, there are three transactions: T1, T2, and T3. So, the
precedence graph contains three vertices.
To draw the edges between these nodes or vertices, follow the below steps:
Step1: At time t1, there is no conflicting operation for read(X) of Transaction T1.
Step2: At time t2, there is no conflicting operation for read(Y) of Transaction T3.
Step3: At time t3, there exists a conflicting operation Write(X) in transaction T1
for read(X) of Transaction T3. So, draw an edge from T3 to T1.
Step4: At time t4, there exists a conflicting operation Write(Y) in transaction T3
for read(Y) of Transaction T2. So, draw an edge from T2 to T3.
Step5: At time t5, there exists a conflicting operation Write (Z) in transaction T1
for read (Z) of Transaction T2. So, draw an edge from T2 to T1.
After all the steps, the precedence graph will be ready, and it does not contain any
cycle or loop, so the above schedule S1 is conflict Serializable. And it is equivalent
to a serial schedule. Above schedule S1 is transformed into the serial schedule by
using the following steps:
Step1: Check the vertex in the precedence graph where indegree=0. So, take the
vertex T2 from the graph and remove it from the graph.
Step 2: Again check for a vertex with indegree = 0 in the remaining precedence graph. So,
take the vertex T3 from the graph and remove it, placing it after T2 in the serial
order (T2 → T3).
Step3: And at last, take the vertex T1 and connect with T3.
Example – 2:
List all the conflicting operations and determine the dependency between the
transactions
Example – 3:
List all the conflicting operations and determine the dependency between the
transactions
Practice Example:
View Serializability
A schedule is called view serializable if it is view equivalent to some serial schedule.
Two schedules S1 and S2 are view equivalent if they satisfy the following three conditions:
Condition-01:
Initial Read
The initial read of both the schedules must be in the same transaction.
Suppose two schedule S1 and S2. In schedule S1, if a transaction T1 is
reading the data item A, then in S2, transaction T1 should also read A.
Note: “Initial readers must be same for all the data items”.
Condition-02:
Updated Read
In schedule S1, if the transaction T i is reading the data item A which is updated by
transaction Tj, then in schedule S2 also, Ti should read data item A which is
updated by Tj.
Note: “Intermediate Write-read sequence must be same.”
Condition-03:
Final Write
A final write must be the same in both the schedules.
Suppose in schedule S1, if a transaction T1 updates A in the last, then in
S2 final write operation should also be done by transaction T1.
Note: “Final writers must be same for all the data items”.
Example:
Schedule S1:
Time   Transaction T1   Transaction T2
t1     Read(X)
t2     Write(X)
t3                      Read(X)
t4                      Write(X)
t5     Read(Y)
t6     Write(Y)
t7                      Read(Y)
t8                      Write(Y)
Schedule S2 (serial): T1 executes completely (Read(X), Write(X), Read(Y), Write(Y)) followed by T2 (Read(X), Write(X), Read(Y), Write(Y)).
Note: S2 is the serial schedule of S1. If we can prove that both the schedule are
view equivalent, then we can say that S1 schedule is a view serializable schedule.
Now, check the three conditions of view serializability for this example:
1. Initial Read
In S1 schedule, T1 transaction first reads the data item X. In Schedule S2 also
transaction T1 first reads the data item X.
Now, check for Y. In schedule S1, T1 transaction first reads the data item Y. In
schedule S2 also the first read operation on data item Y is performed by T1.
We checked for both data items X and Y, and the initial read condition is satisfied
in schedule S1 & S2.
2. Updated Read
In Schedule S1, transaction T2 reads the value of X, which is written by
transaction T1. In Schedule S2, the same transaction T2 reads the data item X
after T1 updates it.
Now check for Y. In Schedule S1, transaction T2 reads the value of Y, which is
written by T1. In S2, the same transaction T2 reads the value of data item Y after
T1 writes it.
The update read condition is also satisfied for both the schedules S1 and S2.
3. Final Write
In schedule S1, the final write operation on data item X is done by transaction
T2. In schedule S2 also transaction T2 performs the final write operation on X.
Now, check for data item Y. In schedule S1, the final write operation on Y is done
by T2 transaction. In schedule S2, a final write operation on Y is done by T2.
We checked for both data items X and Y, and the final write condition is also
satisfied for both the schedule S1 & S2.
Conclusion: Hence, all the three conditions are satisfied in this example, which
means Schedule S1 and S2 are view equivalent. Also, it is proved that schedule S2
is the serial schedule of S1. Thus we can say that the S1 schedule is a view
serializable schedule.
Recoverability
So far, we have studied what schedules are acceptable from the viewpoint of consis
-tency of the database, assuming implicitly that there are no transaction failures.
We now address the effect of transaction failures during concurrent execution.
If a transaction Ti fails, for whatever reason, we need to undo the effect of this
transaction to ensure the atomicity property of the transaction. In a system that
allows concurrent execution, it is necessary also to ensure that any transaction Tj
that is dependent on Ti (that is, Tj has read data written by Ti) is also aborted. To
achieve this surety, we need to place restrictions on the type of schedules permitted
in the system.
Now, we address the issue of what schedules are acceptable from the viewpoint of
recovery from transaction failure.
Irrecoverable Schedules:
If in a schedule, a transaction performs a dirty read operation from an uncommitted
transaction and commits before the transaction from which it has read the value,
then such a schedule is called an irrecoverable schedule. If the uncommitted transaction
later fails and rolls back, the reading transaction has already committed and cannot be
undone, so the database cannot recover.
Example
Here,
Recoverable Schedules
If in a schedule, a transaction performs a dirty read operation from an uncommitted
transaction and its commit operation is delayed until the uncommitted transaction either
commits or rolls back, then such a schedule is called a recoverable schedule.
Here,
The commit operation of the transaction that performs the dirty read is
delayed.
This ensures that it still has a chance to recover if the uncommitted
transaction fails later.
Example
Here,
Important points:
Types of Recoverable Schedules
A recoverable schedule may be any one of these kinds:
1. Cascading Schedule
2. Cascadeless Schedule
3. Strict Schedule
1. Cascading Schedule:
If in a schedule, the failure of one transaction causes several other dependent
transactions to roll back, then such a schedule is called a cascading schedule
(cascading rollback).
Example:
Here,
Transaction T2 depends on transaction T1.
Transaction T3 depends on transaction T2.
Transaction T4 depends on transaction T3.
In this schedule,
The failure of transaction T1 causes the transaction T2 to rollback.
The rollback of transaction T2 causes the transaction T3 to rollback.
The rollback of transaction T3 causes the transaction T4 to rollback.
2. Cascadeless Schedule:
If in a schedule, a transaction is not allowed to read a data item until the last
transaction that has written it is committed or aborted, such a schedule is called as
a Cascadeless Schedule.
In other words,
Cascadeless schedule allows only committed read operations.
Therefore, it avoids cascading roll back and thus saves CPU time.
Example:
Note:
Cascadeless schedule allows only committed read operations.
However, it allows uncommitted write operations.
3. Strict Schedule:
If in a schedule, a transaction is neither allowed to read nor write a data item until
the last transaction that has written it is committed or aborted, such a schedule is
called as a Strict Schedule.
CONCURRENCY CONTROL:
In a multi-user system, multiple transactions run in parallel and may try to access the
same data. If one transaction already has access to a data item and another transaction
tries to modify that data at the same time, the database can be left in an erroneous state.
Such conflicting operations, where two transactions simultaneously access the same data
item, are the source of the problems described earlier, and concurrency control is needed
to prevent them.
Concurrency Control Protocols
Different concurrency control protocols offer different benefits between the
amount of concurrency they allow and the amount of overhead that they impose.
Following are the Concurrency Control techniques in DBMS:
Lock-Based Protocols
Two Phase Locking Protocol
Timestamp-Based Protocols
Validation-Based Protocols
Lock-Based Protocols
A lock is a data variable which is associated with a data item. This lock signifies
that operations that can be performed on the data item. Locks in DBMS help
synchronize access to the database items by concurrent transactions.
Shared lock: Under a shared lock, a transaction can perform only read operations; any other
transaction can also obtain a shared lock on the same data item at the same time. Denoted
by lock-S(Q).
Exclusive lock: A transaction may acquire an exclusive lock on a data item in order
to both read and write it. The lock is exclusive in the sense that no other transaction
can acquire any kind of lock (either shared or exclusive) on that same data
item. Denoted by lock-X(Q).
The relationship between Shared and Exclusive locks can be represented by the
following table, known as the Lock Compatibility Matrix:
                  Shared (S)        Exclusive (X)
Shared (S)        compatible        not compatible
Exclusive (X)     not compatible    not compatible
A lock is granted only if it is compatible with all locks currently held on the data item;
otherwise the requesting transaction waits.
Lock-X (A); (Exclusive Lock, we want to both read A’s value and modify it)
Read A
A = A – 100
Write A
Unlock (A) (Unlocking A after the modification is done)
Lock-X (B) (Exclusive Lock, we want to both read B’s value and modify it)
Read B
B = B + 100
Write B
Unlock (B) (Unlocking B after the modification is done)
Let us see how these locking mechanisms help us to create error free schedules.
You should remember that in the previous chapter we discussed an example of an
erroneous schedule:
Transaction T1          Transaction T2
Read(A)
A = A - 100
                        Read(A)
                        Temp = A * 0.1
                        Read(C)
                        C = C + Temp
                        Write(C)
Write(A)
Read(B)
B = B + 100
Write(B)
We detected the error based on common sense only that the Context Switching is
being performed before the new value has been updated in A. T2 reads the old
value of A, and thus deposits a wrong amount in C. Had we used the locking
mechanism, this error could never have occurred.
Let us rewrite the schedule using the locks.
Transaction T1          Transaction T2          Concurrency control manager
Lock-X(A)
                                                grant-X(A, T1)
Read A
A = A - 100
Write A
Unlock(A)
                        Lock-S(A)
                                                grant-S(A, T2)
                        Read A
                        Temp = A * 0.1
                        Unlock(A)
                        Lock-X(C)
                                                grant-X(C, T2)
                        Read(C)
                        C = C + Temp
                        Write C
                        Unlock(C)
Lock-X(B)
                                                grant-X(B, T1)
Read B
B = B + 100
Write B
Unlock(B)
Simple locking of this kind may not be sufficient to produce only serializable schedules,
as the following example shows.
Transaction T1          Transaction T2
Lock-X(A)
R(A)
W(A)
Unlock(A)
                        Lock-S(B)
                        R(B)
                        Unlock(B)
Lock-X(B)
R(B)
W(B)
Unlock(B)
                        Lock-S(A)
                        R(A)
                        Unlock(A)
Here T2 reads B before T1 writes it, but reads A after T1 writes it, so the schedule is
not serializable even though locks were used.
Deadlock:
T1                      T2
Lock-X(A)
                        Lock-X(B)
Lock-X(B)  (waits for T2)
                        Lock-X(A)  (waits for T1)
Each transaction waits for a lock held by the other, so neither can proceed: this is a deadlock.
Starvation
T1            T2                   T3                  T4
Lock-S(A)
              Lock-X(A)  (waits)
                                   Lock-S(A)  (granted)
                                                       Lock-S(A)  (granted)
T2 keeps waiting for its exclusive lock on A while new shared locks keep being granted:
this is starvation.
Two Phase Locking Protocol
1. Growing Phase: New locks on data items may be acquired but none can be
released.
2. Shrinking Phase: Existing locks may be released but no new locks can be
acquired.
In this example, the transaction acquires all of the locks it needs until it reaches its
locked point. (In this example, the transaction requires two locks.) When the
locked point is reached, the data are modified to conform to the transaction’s
requirements. Finally, the transaction is completed as it releases all of the locks it
acquired in the first phase.
Note – If lock conversion is allowed, then upgrading of a lock (from S(A) to X(A)) is
allowed in the growing phase, and downgrading of a lock (from X(A) to S(A)) must
be done in the shrinking phase.
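A transaction obeys two-phase locking when, in its own sequence of actions, no lock request appears after its first unlock. The following sketch (Python, with an invented action format) only checks that discipline; it is not a lock manager.

```python
def follows_2pl(actions):
    """actions: list of ('lock', item) / ('unlock', item) issued by ONE transaction,
    in order. Returns True if no lock is acquired after the first unlock."""
    shrinking = False
    for kind, _item in actions:
        if kind == 'unlock':
            shrinking = True                 # the shrinking phase has begun
        elif kind == 'lock' and shrinking:
            return False                     # a lock after an unlock violates 2PL
    return True

t1 = [('lock', 'A'), ('lock', 'B'), ('unlock', 'A'), ('unlock', 'B')]
t2 = [('lock', 'A'), ('unlock', 'A'), ('lock', 'B'), ('unlock', 'B')]
print(follows_2pl(t1))   # True  (all locks acquired before any unlock)
print(follows_2pl(t2))   # False (locks B after unlocking A)
```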
Example:
T1                      T2
Lock-S(A)
Read(A)
Lock-S(B)
Read(B)
                        Lock-S(A)
                        Read(A)
                        Unlock(A)
Unlock(A)
Unlock(B)
In this example, read locks for both A and B were acquired. Since both
transactions did nothing but read, this is easily identifiable as a serializable
schedule.
Unserializable schedule example:
Timestamp Based protocols
Basic idea of time stamping is to decide the order between the transactions before
they enter into the system, so that in case of conflict during execution, we can
resolve the conflict using ordering.
Time stamp:
With each transaction Ti in the system, we associate a unique fixed timestamp, denoted
by TS(Ti). This timestamp is assigned by the database system before the transaction Ti
starts execution. If a transaction Ti has been assigned timestamp TS(Ti), and a new
transaction Tj enters the system, then TS(Ti) < TS(Tj).
There are two simple methods for implementing this scheme:
1. Use the value of the system clock as the timestamp; that is, a transaction's
timestamp is equal to the value of the clock when the transaction enters the system.
2. Use a logical counter that is incremented after a new timestamp has been
assigned; that is, a transaction's timestamp is equal to the value of the counter
when the transaction enters the system.
To implement this scheme, we associate with each data item Q two timestamp
values:
W-timestamp (WTS) (Q) denotes the largest / last / latest timestamp of any
transaction that executed write (Q) successfully.
R-timestamp RTS (Q) denotes the largest / last / latest timestamp of any
transaction that executed read(Q) successfully.
These timestamps are updated whenever a new read (Q) or write(Q) instruction is
executed.
Timestamp - ordering Protocol:
The timestamp-ordering protocol ensures that any conflicting read and write
operations are executed in timestamp order. This protocol operates as follows:
1. Suppose transaction Ti issues read(Q):
   If TS(Ti) < W-timestamp(Q), then Ti needs to read a value of Q that has already been
   overwritten; the read is rejected and Ti is rolled back.
   Otherwise, the system executes the read operation and sets R-timestamp(Q) =
   max(R-timestamp(Q), TS(Ti)).
2. Suppose transaction Ti issues write(Q):
   If TS(Ti) < R-timestamp(Q) or TS(Ti) < W-timestamp(Q), then the write is rejected and
   Ti is rolled back.
   Otherwise, the system executes the write operation and sets W-timestamp(Q) = TS(Ti).
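The read and write rules above translate directly into code. This sketch (Python) keeps an R-timestamp and a W-timestamp per data item and raises an exception when the issuing transaction must be rolled back; the data item names and timestamp values are invented.

```python
class RollbackError(Exception):
    pass

rts, wts = {}, {}          # R-timestamp(Q) and W-timestamp(Q), defaulting to 0

def read(ts, q):
    if ts < wts.get(q, 0):                 # Ti would read an already-overwritten value
        raise RollbackError(f"roll back transaction with TS={ts} on read({q})")
    rts[q] = max(rts.get(q, 0), ts)        # remember the latest successful reader

def write(ts, q):
    if ts < rts.get(q, 0) or ts < wts.get(q, 0):   # a later transaction already read/wrote Q
        raise RollbackError(f"roll back transaction with TS={ts} on write({q})")
    wts[q] = ts                            # remember the latest successful writer

write(5, 'Q')      # transaction with TS=5 writes Q
read(7, 'Q')       # transaction with TS=7 reads Q: allowed
try:
    write(6, 'Q')  # transaction with TS=6 writes Q after TS=7 read it: rejected
except RollbackError as e:
    print(e)
```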
Example:
Validation Based Protocols
In Validation based Protocol, the local copies of the transaction data are updated
rather than the data itself, which results in less interference while execution of the
transaction.
Validation based Protocol is also known as Optimistic Concurrency Control, since each
transaction executes fully in the hope that all will go well during validation. Each
transaction Ti executes in three phases:
i. Read phase: Ti reads the values of committed data items and performs all its writes
on temporary local copies, without updating the actual database.
ii. Validation phase: Ti performs a validation test to determine whether its local
writes can be copied to the database without violating serializability.
iii. Write phase: If Ti is validated, the updates are applied to the database;
otherwise, Ti is rolled back.
214
Validation Test for Transaction Tj
If for all Ti with TS(Ti) < TS(Tj), one of the following conditions holds:
1. finish(Ti) < start(Tj), or
2. start(Tj) < finish(Ti) < validation(Tj) and the set of data items written by Ti
does not intersect with the set of data items read by Tj,
then validation succeeds and Tj can be committed. Otherwise, validation fails and Tj is
aborted.
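The validation test itself is a small predicate over the start, validation, and finish times and the read/write sets. The sketch below (Python) encodes the two conditions; the transaction records are invented for illustration.

```python
def validate(tj, older):
    """tj, and each entry of older, is a dict with keys:
    'start', 'validation', 'finish', 'read_set', 'write_set'."""
    for ti in older:                                   # all Ti with TS(Ti) < TS(Tj)
        if ti['finish'] < tj['start']:
            continue                                   # condition 1: Ti finished before Tj started
        if (tj['start'] < ti['finish'] < tj['validation']
                and not (ti['write_set'] & tj['read_set'])):
            continue                                   # condition 2: overlap, but no write/read intersection
        return False                                   # neither condition holds: abort Tj
    return True                                        # Tj passes validation and may commit

ti = {'start': 1, 'validation': 3, 'finish': 4,
      'read_set': {'A'}, 'write_set': {'B'}}
tj = {'start': 2, 'validation': 6, 'finish': 7,
      'read_set': {'A'}, 'write_set': {'A'}}
print(validate(tj, [ti]))   # True: Ti finishes during Tj but writes only B, which Tj did not read
```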
Example:
Case 1:
Ti(TS=9:30) Tj(TS=9:37)
Finish(Ti)
Start(Tj)
Case 2:
Ti(TS(2:30)) Tj(TS(2:37))
Start(Tj), 2:37
Finish(Ti),2:39
Validation(Tj), 2:40
Example:
It is a validated schedule
Advantages:
1. Avoid Cascading roll backs.
2. Avoid deadlock
Disadvantages:
1. Starvation
Failure Classification
To find that where the problem has occurred, we generalize a failure into the
following categories:
1. Transaction failure
2. System crash
3. Disk failure
1. Transaction failure
The transaction failure occurs when a transaction fails to execute or reaches a
point from where it cannot go any further. If a transaction or process fails in the
middle of execution, this is called a transaction failure.
Reasons for a transaction failure could be -
1. Logical errors: If a transaction cannot complete due to some code
error or an internal error condition, then the logical error occurs.
2. System errors: These occur when the DBMS itself terminates an active
transaction because the database system is not able to execute it. For
example, the system aborts an active transaction in case of deadlock
or resource unavailability.
2. System Crash
System failure can occur due to power failure or other hardware or software
failure. Example: Operating system error.
In the system crash, non-volatile storage is assumed not to be corrupted.
3. Disk Failure
It occurs when a hard-disk drive or storage drive fails. This was a
common problem in the early days of technology evolution.
Disk failure occurs due to the formation of bad sectors, a disk head crash,
unreachability of the disk, or any other failure that destroys all or part of
disk storage.
Recovery and atomicity:
Consider banking system
Account X Account Y
Initial amount = 5000 Initial amount = 10000
Transfer 1000 from X to Y
System Crash occurred after amount deducted from account X but before adding
to account Y
Memory content were lost because of system crash
Re - execute Ti: This procedure will result in the value of X becoming 3000 rather
than 4000, Thus, the system enters an inconsistent state.
Do not re - execute Ti: The current system state has values of 4000 and 10000 for
X and Y, respectively. Thus, the system enters an inconsistent state.
In either case, the DB is left in inconsistent state, and thus simple recovery
schemes do not work.
Scheme to achieve the recovery from transaction failure is log based recovery.
Structure of a log record:
<Ti, start>       — transaction Ti has started.
<Ti, Xj, V1, V2>  — transaction Ti has performed a write on data item Xj; V1 is the value
                    before the write (old value) and V2 is the value after the write (new value).
<Ti, commit>      — transaction Ti has committed.
<Ti, abort>       — transaction Ti has aborted.
Log-based recovery works in two modes:
1. Deferred database modification
2. Immediate database modification
1. Deferred database modification:
Example:
Consider Transaction T0. It transfers Rs. 50 from account A to account B.
Original value of account A and B are Rs. 1000 and Rs. 2000 respectively.
This transaction is defined as:
T0: read(A)
    A := A - 50
    write(A)
    read(B)
    B := B + 50
    write(B)
T1: read(C)
    C := C - 100
    write(C)
In deferred database modification, the log records contain only the new values:
<T0, start>
<T0, A, 950>
<T0, B, 2050>
<T0, commit>
<T1, start>
<T1, C, 600>
<T1, commit>
The below fig shows the log that results from the complete execution of T0 and T1.
Log Database
< T0, start>
< T0, A, 950>
< T0, B, 2050>
< T0, commit>
A = 950
B = 2050
<T1, start>
<T1, C, 600>
<T1, commit>
C = 600
Case 1: Crash occurs just after the log record for the write(B) operation.
Log contents after the crash are:
<T0, start>
<T0, A, 950>
<T0, B, 2050>
Since the log contains no <T0, commit> record, no redo is performed; the values of
accounts A and B remain 1000 and 2000, because the deferred updates were never applied.
Case 2: Crash occurs just after the log record for the write(C) operation.
Log contents after the crash are:
< T0, start>
< T0, A, 950>
< T0, B, 2050>
< T0, commit>
<T1, start>
<T1, C, 600>
When the system comes back, it finds <T0, start> and <T0, commit>, but there is no
<T1, commit> for <T1, start>. Hence, the system can execute redo(T0) but not redo(T1).
Hence, the value of account C remains unchanged.
Case 3: Crash occur just after the log record <T1, commit>
Log contents after the crash are:
<T0, start>
<T0, A, 950>
<T0, B, 2050>
<T0, commit>
<T1, start>
<T1, C, 600>
<T1, commit>
When the system comes back it can execute redo(T0) and redo(T1) operations.
2. Immediate database modification:
<T0, start>
<T0, A, 1000, 950>
<T0, B, 2000, 2050>
<T0, commit>
<T1, start>
<T1, C, 700, 600>
<T1, commit>
The order in which output took place to both database system and log as a result of
execution of T0 and T1 is
Log Database
< T0, start>
< T0, A,1000,950>
< T0, B, 2000, 2050>
A = 950
B = 2050
< T0, commit>
<T1, start>
<T1, C, 700, 600>
C = 600
<T1, commit>
Using the log, the system can handle failure. Two recovery procedures are
undo(Ti) restores the value of all data items updated by transaction Ti, to the
old values
redo(Ti) sets the values of all data items updated by transaction T i to the new
values.
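A minimal sketch of undo and redo for the immediate-modification log format is shown below (Python). Log records are tuples mirroring the <Ti, start>, <Ti, X, old, new>, and <Ti, commit> entries used above, and the in-memory dictionary stands in for the database.

```python
db = {'A': 1000, 'B': 2000, 'C': 700}

# Immediate-modification log: ('start', T), ('write', T, item, old, new), ('commit', T)
log = [('start', 'T0'), ('write', 'T0', 'A', 1000, 950), ('write', 'T0', 'B', 2000, 2050),
       ('commit', 'T0'),
       ('start', 'T1'), ('write', 'T1', 'C', 700, 600)]        # T1 has not committed

def undo(txn):
    # scan the log backward, restoring the old values written by txn
    for rec in reversed(log):
        if rec[0] == 'write' and rec[1] == txn:
            _, _, item, old, _new = rec
            db[item] = old

def redo(txn):
    # scan the log forward, re-applying the new values written by txn
    for rec in log:
        if rec[0] == 'write' and rec[1] == txn:
            _, _, item, _old, new = rec
            db[item] = new

# After a crash: redo committed transactions, undo uncommitted ones.
committed = {r[1] for r in log if r[0] == 'commit'}
started = {r[1] for r in log if r[0] == 'start'}
for t in started:
    (redo if t in committed else undo)(t)
print(db)     # {'A': 950, 'B': 2050, 'C': 700}  -- T0 redone, T1 undone
```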
Case 1: Failure occur just after the log record for write(B) operation
< T0, start>
< T0, A,1000,950>
< T0, B, 2000, 2050>
A = 950
B = 2050
Log contains < T0, start> but does not contain < T0, commit>. Hence undo T0 is
executed: the values of account A and B are restored to 1000 and 2000 resp.
Case 2: Failure occurs just after the log record for write (C) operation.
< T0, start>
< T0, A,1000,950>
< T0, B, 2000, 2050>
A = 950
B = 2050
< T0, commit>
<T1, start>
<T1, C, 700, 600>
C = 600
The log contains <T0, start> and <T0, commit>. Hence, redo(T0) is executed and the
values of accounts A and B are set to the new values 950 and 2050
respectively. But the log doesn't contain <T1, commit> for <T1, start>; hence, the
value of account C is restored to its old value by the undo(T1) operation, so the
value of C is 700.
Case 3: Crash occur just after the log record <T1, commit>
Log Database
< T0, start>
< T0, A,1000,950>
< T0, B, 2000, 2050>
A = 950
B = 2050
< T0, commit>
<T1, start>
<T1, C, 700, 600>
C = 600
<T1, commit>
The log contains <T0, start>, <T0, commit> and <T1, start>, <T1, commit>. Hence, the
recovery system executes the redo(T0) and redo(T1) operations. The values of accounts
A, B and C are Rs 950, Rs 2050 and Rs 600 respectively.
Recovery with Concurrent Transaction
Until now, we considered recovery in an environment where only a single trans-
action at a time is executing. We now discuss how we can modify and extend the
log-based recovery scheme to deal with multiple concurrent transactions.
Regardless of the number of concurrent transactions, the system has a single disk
buffer and a single log.
Recovery with Concurrent Transactions means to recover schedule when multiple
transaction executed if any transaction fail.
Transaction Rollback
We roll back a failed transaction, Ti, by using the log. The system scans the log
backward; for every log record of the form <Ti, Xj , V1, V2> found in the log, the
system restores the data item Xj to its old value V1. Scanning of the log terminates
when the log record <Ti, start> is found.
Scanning the log backward is important, since a transaction may have updated a
data item more than once. As an illustration, consider the pair of log records
<Ti, A, 10, 20>
<Ti, A, 20, 30>
The log records represent a modification of data item A by Ti, followed by another
modification of A by Ti. Scanning the log backward sets A correctly to 10. If the
log were scanned in the forward direction, A would be set to 20, which is incorrect.
Checkpoint
Let’s understand how a checkpoint works in DBMS with the help of the following
diagram.
1. The recovery system reads the log files from the end back towards the start
(from TXN4 to TXN1).
2. The recovery system maintains two lists: a redo-list and an undo-list.
3. A transaction goes into the redo list if the recovery system sees a log with
both <TXN N, Start> and <TXN N, Commit>, or with only <TXN N, Commit>. All
transactions in the redo list are redone before their log records are discarded.
4. For example: in the log, transactions TXN2 and TXN3 have both <TXN N, Start> and
<TXN N, Commit> records, while TXN1 has only a <TXN N, Commit> record because it
started before the checkpoint and committed after the checkpoint was crossed. So
TXN1, TXN2, and TXN3 go into the redo list.
5. A transaction goes into the undo list if the recovery system sees a log with
<TXN N, Start> but finds no commit or abort record for it. All transactions in
the undo list are undone and their log records are removed.
6. For example, transaction TXN4 will be put into the undo list because it has
not completed.
** When a checkpoint is reached, the committed updates are written to the database and the
corresponding log records can be removed; the system then continues towards the next
checkpoint and repeats the same process (update the database, remove the log).
** A checkpoint is like a bookmark in the log.
Restart recovery:
When the system recovers from a crash, it constructs two lists.
The undo-list consists of transactions to be undone, and the redo-list consists
of transaction to be redone.
The system constructs the two lists as follows: Initially, they are both empty.
The system scans the log backward, examining each record, until it finds the
first <checkpoint> record.
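Continuing that idea, the sketch below (Python) scans a log backward until the first <checkpoint L> record and splits the transactions it finds into a redo-list and an undo-list; the log contents and the checkpoint's list of active transactions are invented for illustration.

```python
def build_lists(log):
    """log: newest record last. Records: ('start', T), ('commit', T),
    ('checkpoint', [active transactions]); write records are ignored here."""
    redo, undo = [], []
    for rec in reversed(log):                          # scan backward from the end of the log
        if rec[0] == 'commit':
            redo.append(rec[1])
        elif rec[0] == 'start' and rec[1] not in redo:
            undo.append(rec[1])
        elif rec[0] == 'checkpoint':
            # transactions active at the checkpoint that never committed must be undone
            for t in rec[1]:
                if t not in redo and t not in undo:
                    undo.append(t)
            break                                      # stop at the first checkpoint found
    return redo, undo

log = [('start', 'T1'), ('checkpoint', ['T1']),
       ('start', 'T2'), ('commit', 'T1'),
       ('start', 'T3'), ('commit', 'T2')]
print(build_lists(log))    # (['T2', 'T1'], ['T3'])
```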
UNIT - V
A DBMS stores the data on external storage because the amount of data is very
huge, must persist across program executions and has to be fetched into main
memory when DBMS processes the data.
The unit of information for reading data from disk, or writing data to disk, is a
page. The size of a page is 4KB or 8KB.
Each record in a file has a unique identifier called a record id, or rid for short. A
rid has the property that we can identify the disk address of the page containing the
record by using the rid.
Buffer Manager:
Data is read into memory for processing, and written to disk for persistent storage,
by a layer of software called the buffer manager.
Disk Space Manager:
Space on disk is managed by the disk space manager. When the files and access
methods layer needs additional space to hold new records in a file, it asks the disk
space manager to allocate an additional disk page for the file.
Magnetic Disks:
Magnetic disks support direct access (transfers the block of data between the
memory and peripheral devices of the system, without the participation of the
processor) to a desired location and are widely used for database applications. A
DBMS provides seamless access to data on disk; applications need not worry about
whether data is in main memory or disk.
Spindle: A typical HDD design consists of a spindle, which is a motor that holds
the platters.
Cylinder: The collection of all the tracks that are of the same distance, from the
edge of the platter, is called a cylinder.
Read / Write Head: The data on a hard drive platter is read by read-write heads
mounted on read-write arms. The read-write arm is also known as the actuator.
Arm assembly: The heads, each on a separate read-write arm, are controlled by a common
arm assembly which moves all heads simultaneously from one cylinder to another.
Tracks: Each platter is broken into thousands of tightly packed concentric circles,
known as tracks. These tracks resemble the structure of annual rings of a tree. All
the information stored on the hard disk is recorded in tracks. Starting from zero at
the outer side of the platter, the number of tracks goes on increasing to the inner
side. Each track can hold a large amount of data counting to thousands of bytes.
Sectors: Each track is further broken down into smaller units called sectors. As
sector is the basic unit of data storage on a hard disk, each track has the same
number of sectors, which means that the sectors are packed much closer together
on tracks near the center of the disk. A single track typically can have thousands of
sectors. The data size of a sector is always a power of two, and is almost always
either 512 or 4096 bytes.
A database consists of a huge amount of data. The data is grouped within a table in
RDBMS, and each table has related records. A user can see that the data is stored
in form of tables, but in actual this huge amount of data is stored in physical
memory in form of files.
File Organization:
File Organization refers to the logical relationships among various records that
constitute the file, particularly with respect to the means of identification and
access to any specific record. In simple terms, Storing the files in certain order is
called file Organization.
Indexing:
The main goal of designing the database is faster access to any data in the database
and quicker insert/delete/update to any data. When a database is very huge, even a
smallest transaction will take time to perform the action. In order to reduce the
time spent in transactions, Indexes are used. Indexes are similar to book catalogues
in library or even like an index in a book.
Indexing is a data structure technique which allows you to quickly retrieve records
from a database file. An Index is a small table having only two columns. The first
column comprises a copy of the primary or candidate key of a table. Its second
column contains a set of pointers for holding the address of the disk block where
that specific key value stored.
An index
Takes a search key as input
Efficiently returns a collection of matching records.
Primary Index:
A primary index is an ordered file of fixed-length records with two fields. The first
field is the same as the primary key of the data file, and the second field points to
the specific data block containing that key. In a primary index, there is always a
one-to-one relationship between the entries in the index table and the blocks they point to.
The primary Indexing in DBMS is also further divided into two types.
Dense Index
Sparse Index
Dense Index: In a dense index, an index record is created for every search key value in
the database. This helps you to search faster but needs more space to store index
records. In this indexing method, each index record contains a search key value and a
pointer to the actual record on the disk.
Sparse Index: It is an index in which index records appear for only some of the values in the file. A sparse
Index helps you to resolve the issues of dense Indexing in DBMS. In this method
of indexing technique, a range of index columns stores the same data block
address, and when data needs to be retrieved, the block address will be fetched.
However, sparse Index stores index records for only some search-key values. It
needs less space, less maintenance overhead for insertion, and deletions but It is
slower compared to the dense Index for locating records.
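To make the dense/sparse distinction concrete, here is an illustrative sketch (Python) over a sorted file of records: the dense index holds one entry per search-key value, while the sparse index keeps only the first key of each block and scans within the chosen block. The data and block size are invented.

```python
import bisect

# Sorted data file, split into blocks of 3 records: (key, payload)
blocks = [[(5, 'a'), (10, 'b'), (15, 'c')],
          [(20, 'd'), (25, 'e'), (30, 'f')],
          [(35, 'g'), (40, 'h'), (45, 'i')]]

# Dense index: every search-key value -> (block number, offset in block)
dense = {key: (b, i) for b, blk in enumerate(blocks) for i, (key, _) in enumerate(blk)}

# Sparse index: only the first key of each block
sparse_keys = [blk[0][0] for blk in blocks]            # [5, 20, 35]

def lookup_dense(key):
    b, i = dense[key]
    return blocks[b][i]

def lookup_sparse(key):
    b = bisect.bisect_right(sparse_keys, key) - 1      # largest anchor key <= search key
    for k, payload in blocks[b]:                       # scan inside the chosen block
        if k == key:
            return (k, payload)
    return None

print(lookup_dense(25))    # (25, 'e')
print(lookup_sparse(40))   # (40, 'h')
```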
Clustered Indexing
Clustering index is defined on an ordered data file. The data file is ordered on a
non-key field. In some cases, the index is created on non-primary key columns
which may not be unique for each record. In such cases, in order to identify the
records faster, we will group two or more columns together to get the unique
values and create index out of them. This method is known as clustering index.
Basically, records with similar characteristics are grouped together and indexes are
created for these groups. Clustered index sorted according to first name (Search
key).
For example, students studying in each semester are grouped together. i.e. 1st
Semester students, 2nd semester students, 3rd semester students etc are grouped.
Example:
Hashing is the technique of the database management system, which directly finds
the specific data location on the disk without using the concept of index structure.
In the database systems, data is stored at the blocks whose data address is produced
by the hash function. That location of memory where hash files stored these records
is called as data bucket or data block.
Key: A key in the Database Management system (DBMS) is a field or set of fields
that helps the relational database users to uniquely identify the row/records of the
database table.
Hash function: This mapping function matches all the set of search keys to those
addresses where the actual records are located. It is an easy mathematics function.
Linear Probing: It is a concept in which the next available block of data is used for
inserting the new record instead of overwriting the older data block.
Quadratic Probing: It is a method that helps users to determine or find the address
of a new data bucket.
Bucket Overflow: When a record is inserted and the address generated by the hash
function is not empty (data already exists at that address), the situation is called
bucket overflow.
1. Static Hashing
2. Dynamic Hashing
Static Hashing: In the static hashing, the resultant data bucket address will always
remain the same.
Therefore, if you generate an address for say Student_ID = 10 using hashing function
mod(3), the resultant bucket address will always be 1. So, you will not see any
change in the bucket address.
Therefore, in this static hashing method, the number of data buckets in memory
always remains constant.
Searching: When you need to retrieve a record, the same hash function is used to
compute the address of the bucket where the data is stored.
Deleting a record: Using the hash function, you first fetch the record that is to be
deleted, and then you remove the record from that address in memory.
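A minimal sketch of static hashing with the mod(3) hash function mentioned above, using linear probing to the next bucket when a bucket overflows (Python; the bucket capacity and records are invented):

```python
NUM_BUCKETS = 3
BUCKET_CAPACITY = 2
buckets = [[] for _ in range(NUM_BUCKETS)]     # fixed number of buckets: static hashing

def h(key):
    return key % NUM_BUCKETS                   # e.g. Student_ID = 10 -> bucket 1

def insert(key, record):
    b = h(key)
    for _ in range(NUM_BUCKETS):               # linear probing on bucket overflow
        if len(buckets[b]) < BUCKET_CAPACITY:
            buckets[b].append((key, record))
            return
        b = (b + 1) % NUM_BUCKETS
    raise RuntimeError("hash file is full")

def search(key):
    b = h(key)
    for _ in range(NUM_BUCKETS):
        for k, rec in buckets[b]:
            if k == key:
                return rec
        b = (b + 1) % NUM_BUCKETS
    return None

for sid in (10, 13, 16, 22):
    insert(sid, f"student-{sid}")
print(h(10), search(10))     # 1 student-10
```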
DYNAMIC HASHING TECHNIQUES
EXTENDIBLE HASHING
In static hashing, performance degrades with overflow pages. This problem,
however, can be overcome by a simple idea: use a directory of pointers to buckets,
and double the number of buckets by doubling just the directory and
splitting only the bucket that overflowed. This is the central concept of extendible
hashing.
It is a dynamic hashing method wherein directories and buckets are used to hash
data. It is an aggressively flexible method in which the hash function also
experiences dynamic changes.
Basic structure of Extendible hashing:
Directories: the directory is an array of pointers to buckets; the number of directory
entries is 2 raised to the power of the global depth.
Global Depth: the number of least-significant bits of the hashed key used to index the directory.
Local Depth: the number of bits actually used by a particular bucket; comparing it
with the global depth is used to decide the action to be performed in case an overflow
occurs. Local depth is always less than or equal to the global depth.
Bucket Splitting: When the number of elements in a bucket exceeds a
particular size, then the bucket is split into two parts.
Directory Expansion: Directory Expansion Takes place when a bucket
overflows. Directory Expansion is performed when the local depth of the
overflowing bucket is equal to the global depth.
16 10000
4 00100
6 00110
22 10110
24 11000
10 01010
31 11111
7 00111
9 01001
20 10100
26 11010
Initially, the global-depth and local-depth is always 1. Thus, the hashing frame looks
like this:
Inserting 16: The binary format of 16 is 10000 and global-depth is 1. The hash
function returns 1 LSB of 10000 which is 0. Hence, 16 is mapped to the
directory with id=0.
Inserting 4 and 6: Both 4(100) and 6(110) have 0 in their LSB. Hence, they are
hashed as follows:
Inserting 22: The binary form of 22 is 10110. Its LSB is 0. The bucket pointed by
directory 0 is already full. Hence, Over Flow occurs.
Since Local Depth = Global Depth, the bucket splits and directory expansion takes
place. Also, rehashing of numbers present in the overflowing bucket takes place
after the split. And, since the global depth is incremented by 1, now, the global
depth is 2. Hence, 16,4,6,22 are now rehashed w.r.t 2 LSBs.[
16(10000),4(100),6(110),22(10110) ]
Notice that the bucket which was underflow has remained untouched. But,
since the number of directories has doubled, we now have 2 directories 01 and
11 pointing to the same bucket. This is because the local-depth of the bucket
has remained 1. And, any bucket having a local depth less than the global
depth is pointed-to by more than one directory.
Inserting 24 and 10: 24(11000) and 10 (1010) can be hashed based on directories
with id 00 and 10. Here, we encounter no overflow condition.
Inserting 31, 7, 9: All of these elements [31(11111), 7(111), 9(1001)] have either 01
or 11 in their LSBs. Hence, they are mapped on the bucket
pointed out by 01 and 11. We do not encounter any overflow
condition here.
Inserting 20: Insertion of data element 20 (10100) will again cause the overflow
problem.
20 is inserted in bucket pointed out by 00. Since the local depth of the bucket =
global-depth, directory expansion (doubling) takes place along with bucket splitting.
Elements present in overflowing bucket are rehashed with the new global depth.
Now, the new Hash table looks like this:
Inserting 26: Global depth is 3. Hence, 3 LSBs of 26(11010) are considered.
Therefore 26 best fits in the bucket pointed out by directory 010.
The bucket overflows, and, since the local depth of bucket < Global depth (2<3),
directories are not doubled but, only the bucket is split and elements are rehashed.
Finally, the output of hashing the given list of numbers is obtained.
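The directory-doubling and bucket-splitting behaviour traced above can be reproduced with a small sketch (Python). It indexes the directory by the least-significant bits of the key and uses a bucket capacity of 3 to match the worked example; this is only an illustration of the idea, not production code.

```python
class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.items = []

class ExtendibleHash:
    def __init__(self, capacity=3):
        self.capacity = capacity
        self.global_depth = 1
        self.directory = [Bucket(1), Bucket(1)]    # directory index = last global_depth bits

    def _dir_index(self, key):
        return key & ((1 << self.global_depth) - 1)

    def insert(self, key):
        bucket = self.directory[self._dir_index(key)]
        if len(bucket.items) < self.capacity:
            bucket.items.append(key)
            return
        # Overflow: double the directory only if local depth == global depth.
        if bucket.local_depth == self.global_depth:
            self.directory += self.directory       # directory doubling
            self.global_depth += 1
        # Split the overflowing bucket and repoint half of its directory entries.
        bucket.local_depth += 1
        new_bucket = Bucket(bucket.local_depth)
        for i in range(len(self.directory)):
            if self.directory[i] is bucket and (i >> (bucket.local_depth - 1)) & 1:
                self.directory[i] = new_bucket
        old_items, bucket.items = bucket.items, []
        for k in old_items + [key]:                # rehash the old keys plus the new key
            self.insert(k)

eh = ExtendibleHash(capacity=3)
for k in [16, 4, 6, 22, 24, 10, 31, 7, 9, 20, 26]:
    eh.insert(k)
print("global depth:", eh.global_depth)            # 3, as in the worked example
for i, b in enumerate(eh.directory):
    print(format(i, '03b'), b.items, "(local depth", b.local_depth, ")")
```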
LINEAR HASHING
The scheme utilizes a family of hash functions h0, h1, h2, ..., with the property
that each function's range is twice that of its predecessor. That is, if hi maps a data
entry into one of M buckets, hi+1 maps a data entry into one of 2M buckets. Such a
family is typically obtained by choosing a hash function h and an initial number N
of buckets, and defining hi(value) = h(value) mod (2^i * N).
The idea is best understood in terms of rounds of splitting. During round number
Level, only hash functions hLevel and hLevel+1 are in use. The buckets in the file
at the beginning of the round are split, one by one from the first to the last bucket,
thereby doubling the number of buckets. At any given point within a round,
therefore, we have buckets that have been split, buckets that are yet to be split, and
buckets created by splits in this round.
Consider how we search for a data entry with a given search key value. We apply
hash function hLevel, and if this leads us to one of the unsplit buckets, we simply
look there. If it leads us to one of the split buckets, the entry may be there or it may
have been moved to the new bucket created earlier in this round by splitting this
bucket; to determine which of the two buckets contains the entry, we apply
hLevel+1.
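The hash-function family and the two-function lookup rule can be sketched as follows (Python). N is the initial number of buckets; the variables level and next_to_split describing the current round are illustrative.

```python
N = 4                      # initial number of buckets
level = 0                  # current round number
next_to_split = 1          # buckets 0 .. next_to_split-1 have already been split this round

def h(value):
    return value           # stand-in for the base hash function h

def h_i(i, value):
    return h(value) % (2 ** i * N)     # h_i(value) = h(value) mod (2^i * N)

def bucket_for(value):
    b = h_i(level, value)
    if b < next_to_split:              # this bucket was already split in the current round,
        b = h_i(level + 1, value)      # so h_{level+1} decides between it and its split image
    return b

for v in (3, 7, 11, 4):
    print(v, "-> bucket", bucket_for(v))
```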
Tree Based Indexing
1. ISAM (Indexed sequential access method)
- Static Index Structure
2. B+ Trees
- Dynamic Index Structure
B+ Tree: B+ tree is dynamic index structure i.e the height of the tree grows and
contracts as records are added and deleted.
A B+ tree is also known as balanced tree in which every path from the root of the
tree to a leaf is of the same length.
Leaf node of B+ trees are linked, so doing a linear search of all keys will require
just one pass through all the leaf nodes.
B+ tree combines features of ISAM and B trees. It contains index pages and data
pages. The data pages always appear as leaf nodes in the tree. The root node and
intermediate nodes are always index pages. These features are similar to ISAM;
however, unlike ISAM, overflow pages are not used in B+ trees.
For order M
Maximum number of keys per node =M-1
Minimum Number of keys per node= ceil (M/2)-1
Maximum number of pointers / children per node=M
Minimum number of pointers / children per node=ceil (M/2)
Insertion in a B+ tree:
Case 1: Overflow in leaf node
1. Split the leaf node into two nodes.
2. The first node contains ceil((m-1)/2) values.
3. The second node contains the remaining values.
4. Copy the smallest search key value from the second node to the
parent node. (Right biased)
Case 2: Overflow in non-leaf node
1. Split the non leaf node into two nodes.
2. First node contains ceil (m/2)-1 values.
3. Move the smallest among remaining to the parent.
4. Second node contains the remaining keys.
Example:
Problem: Insert the following key values 6, 16, 26, 36, 46 on a B+ tree with order = 3.
Solution:
Step 1: The order is 3, so a node can hold at most 2 search key values. As
insertion happens only at a leaf node in a B+ tree, insert search key values 6 and 16 in
increasing order in the node. Below is the illustration of the same:
Step 2: We cannot insert 26 in the same node as it causes an overflow in the leaf
node, We have to split the leaf node according to the rules. First part contains
ceil((3-1)/2) values i.e., only 6. The second node contains the remaining values i.e.,
16 and 26. Then also copy the smallest search key value from the second node to
the parent node i.e., 16 to the parent node. Below is the illustration of the same:
Step 3: Now the next value is 36 that is to be inserted after 26 but in that node, it
causes an overflow again in that leaf node. Again follow the above steps to split
the node. First part contains ceil ((3-1)/2) values i.e., only 16. The second node
contains the remaining values i.e., 26 and 36. Then also copy the smallest search
key value from the second node to the parent node i.e., 26 to the parent node.
Below is the illustration of the same:
Deletion in B+ tree:
Before going through the steps below, one must know these facts about a B+ tree
of degree m.
1. A node can have a maximum of m children.
2. A node can contain a maximum of m - 1 keys.
3. A node should have a minimum of ⌈m/2⌉ children.
4. A node (except root node) should contain a minimum of ⌈m/2⌉ - 1 keys.
While deleting a key, we have to take care of the keys present in the internal
nodes (i.e. indexes) as well because the values are redundant in a B+ tree.
Search the key to be deleted then follow the following steps.
Case I
The key to be deleted is present only at the leaf node not in the indexes (or internal
nodes). There are two cases for it:
1. There is more than the minimum number of keys in the node. Simply
delete the key.
M=3,
Max. Children=3
Min. Children=ceil (3/2)=2
Max.Keys=m-1=2
Min.keys=ceil (3/2)-1=1
2. There is an exact minimum number of keys in the node. Delete the key and
borrow a key from the immediate sibling. Add the median key of the
sibling node to the parent.
Case II
The key to be deleted is present in the internal nodes as well. Then we have to
remove them from the internal nodes as well. There are the following cases for this
situation.
1. If there is more than the minimum number of keys in the node, simply
delete the key from the leaf node and delete the key from the internal node
as well. Fill the empty space in the internal node with the inorder successor.
Deleting 45
2. If there are an exact minimum number of keys in the node, then delete
the key and borrow a key from its immediate sibling (through the parent).
Fill the empty space created in the index (internal node) with the borrowed
key.
Deleting 35
3. This case is similar to Case II(1) but here, empty space is generated
above the immediate parent node.
After deleting the key, merge the empty space with its sibling.
Fill the empty space in the grandparent node with the inorder successor.
Case III
In this case, the height of the tree gets shrinked. It is a little complicated.Deleting
55 from the tree below leads to this condition. It can be understood in the
illustrations below.
ISAM Trees
Indexed Sequential Access Method (ISAM) trees are static index structures.
Example ISAM tree (from the figure):
Root:            40
Non-leaf pages:  20  33          51  63
Leaf pages:      10* 15* | 20* 27* | 33* 37* | 40* 46* | 51* 55* | 63* 97*
An ISAM file therefore has three kinds of pages: non-leaf (index) pages, primary leaf
pages, and overflow pages chained to the primary pages.
ISAM File Creation
To create an ISAM file, all leaf pages are allocated sequentially and sorted on the
search key value; the non-leaf (index) pages are subsequently allocated on top of them.
ISAM: Searching for Entries
Search begins at the root, and key comparisons direct it to a leaf page, which is then
scanned (along with its overflow chain, if any) for the entry.
ISAM: Inserting Entries
The appropriate leaf page is determined as for a search, and the entry is inserted there,
with overflow pages added if necessary. Using the tree above:
Insert 23*: 23 belongs in the leaf page holding 20* and 27*; that page is full, so 23* is
placed in an overflow page chained to it.
Insert 48*: 48 belongs in the leaf page holding 40* and 46*; an overflow page holding 48*
is added to that page.
Insert 41*: 41 also belongs with 40* and 46*, so it is placed in the same overflow page as 48*.
Insert 42*: 42 belongs there too; since that overflow page is full, a second overflow
page holding 42* is chained after it.
Note that the index (non-leaf) pages never change: only overflow pages are added.
ISAM: Deleting Entries
The appropriate page is determined as for a search, and the entry is deleted, with ONLY
overflow pages removed when they become empty; primary leaf pages and index pages are
never removed, because the structure is static.
For example, deleting 42* empties its overflow page, so that page is removed. Deleting
51* removes the entry from its primary leaf page, leaving only 55* there; note that the
key 51 still appears in the index level, which is acceptable because index entries only
direct searches and need not correspond to existing data entries. After these deletions
the leaf level contains:
10* 15* | 20* 27* | 33* 37* | 40* 46* | 55* | 63* 97*
File Organization
The File is a collection of records. Using the primary key, we can access the
records. The type and frequency of access can be determined by the type of
file organization which was used for a given set of records.
File organization is a logical relationship among various records. This
method defines how file records are mapped onto disk blocks.
Sequential File Organization:
Insertion of a new record:
Suppose we have records R1, R3 and so on up to R9 and R8 stored in a sequence
(each record is simply a row of the table). Suppose we want to insert a new
record R2 in the sequence; then it will be placed at the end of the file.
Heap File Organization:
It is the simplest and most basic type of organization. It works with data
blocks. In heap file organization, the records are inserted at the file's end.
When the records are inserted, it doesn't require any sorting or ordering of
the records.
When the data block is full, the new record is stored in some other block.
This new data block need not to be the very next data block, but it can select
any data block in the memory to store new records. The heap file is also
known as an unordered file.
In the file, every record has a unique id, and every page in a file is of the
same size. It is the DBMS responsibility to store and manage the new
records.
Insertion of a new record
Suppose we have five records R1, R3, R6, R4 and R5 in a heap and we want to insert a
new record R2. If data block 3 is full, then R2 will be inserted into any data block
selected by the DBMS, let's say data block 1.
If we want to search, update or delete the data in heap file organization, then we
need to traverse the data from staring of the file till we get the requested record.
If the database is very large then searching, updating or deleting of record will be
time-consuming because there is no sorting or ordering of records. In the heap file
organization, we need to check all the data until we get the requested record.
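A minimal Python sketch of this behaviour (the HeapFile class, the block size and the record format are illustrative assumptions):

BLOCK_SIZE = 3  # records per data block (small, for illustration)

class HeapFile:
    """Unordered (heap) file: a record goes into any block that has room."""
    def __init__(self):
        self.blocks = [[]]

    def insert(self, record):
        # Use any block with free space; otherwise allocate a new block.
        for block in self.blocks:
            if len(block) < BLOCK_SIZE:
                block.append(record)
                return
        self.blocks.append([record])

    def search(self, key):
        # No ordering, so every block is scanned from the start of the file.
        for block in self.blocks:
            for record in block:
                if record[0] == key:
                    return record
        return None

f = HeapFile()
for rec in [("R1",), ("R3",), ("R6",), ("R4",), ("R5",), ("R2",)]:
    f.insert(rec)
print(f.search("R2"))   # found only after a linear scan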
Hash File Organization:
In hash file organization, a hash function is applied to the hash key column(s) to generate the address of the data block where a record is stored. When a record has to be retrieved using the hash key columns, the address is generated and the whole record is fetched from that address. In the same way, when a new record has to be inserted, the address is generated from the hash key and the record is inserted directly at that location; the same process applies to delete and update.
With this method there is no need to search or sort the entire file; each record is stored at a location computed from its key rather than in any particular order.
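A minimal Python sketch of the idea (the bucket count and the use of Python's built-in hash function are illustrative assumptions):

NUM_BUCKETS = 4

class HashFile:
    """Records are placed in a bucket computed directly from the hash key."""
    def __init__(self):
        self.buckets = [[] for _ in range(NUM_BUCKETS)]

    def _address(self, key):
        return hash(key) % NUM_BUCKETS   # generated address of the bucket

    def insert(self, key, record):
        self.buckets[self._address(key)].append((key, record))

    def search(self, key):
        # Only the one computed bucket is examined: no full-file scan.
        for k, record in self.buckets[self._address(key)]:
            if k == key:
                return record
        return None

    def delete(self, key):
        b = self.buckets[self._address(key)]
        b[:] = [(k, r) for k, r in b if k != key]

hf = HashFile()
hf.insert(101, "Raju")
hf.insert(103, "Ravi")
print(hf.search(103))   # "Ravi", fetched via the hashed address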
B+ Tree File Organization
B+ tree file organization uses a tree structure for storing and accessing the records of a file. It is an enhancement of ISAM (Indexed Sequential Access Method) and is designed for data that does not fit in the system's main memory.
This file organization uses the concept of a key index: records are sorted on the primary key, and the index stores, for each key value, the address of the corresponding record in the file.
A B+ tree is similar to a binary search tree, except that a node can have more than two children. All records are stored at the leaf nodes, while the intermediate nodes act only as pointers to the leaf nodes that hold the records; intermediate nodes do not contain any record data.
As an example, consider a B+ tree in which 30 is the root (main) node. Below it is an intermediate layer of nodes that store the addresses of the leaf nodes, not the actual records. Only the leaf nodes contain the records, kept in sorted order: 10, 15, 22, 26, 28, 33, 34, 38 and 40. Because all the leaf nodes are sorted, records can be searched easily.
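A full B+ tree implementation is fairly long, but the essential layout (records only in sorted, linked leaves; intermediate nodes holding only keys and child pointers) can be sketched as follows; the node classes and the chosen separator keys are simplifying assumptions:

import bisect

class BPlusLeaf:
    def __init__(self, keys):
        self.keys = sorted(keys)   # actual record keys live only at the leaves
        self.next = None           # leaves are chained for sorted scans

class BPlusIndexNode:
    """Intermediate node: separator keys and child pointers, no records."""
    def __init__(self, keys, children):
        self.keys = keys
        self.children = children

def search(node, key):
    # Walk down index nodes until a leaf is reached, then look in the leaf.
    while isinstance(node, BPlusIndexNode):
        node = node.children[bisect.bisect_right(node.keys, key)]
    return key in node.keys

def range_scan(first_leaf, lo, hi):
    # The linked, sorted leaf level makes range queries cheap.
    leaf, out = first_leaf, []
    while leaf is not None:
        out.extend(k for k in leaf.keys if lo <= k <= hi)
        leaf = leaf.next
    return out

# Leaves holding the sorted values from the example above: 10 ... 40
l1, l2, l3 = BPlusLeaf([10, 15, 22]), BPlusLeaf([26, 28, 33]), BPlusLeaf([34, 38, 40])
l1.next, l2.next = l2, l3
root = BPlusIndexNode([26, 34], [l1, l2, l3])   # index level over the leaves
print(search(root, 28), search(root, 29))       # True False
print(range_scan(l1, 15, 34))                   # [15, 22, 26, 28, 33, 34]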
Clustered File Organization
In clustered file organization, two or more related tables (their records) are stored within the same file. The related column shared by the tables in the cluster is called the cluster key, and it is used to map the two tables together. This method minimizes the cost of accessing and searching related records, because they are combined and available in a single cluster.
Example:
Suppose we have two tables, Student and Subject, shown below; the two tables are related to each other through the Subject_ID column.
Student
Student_ID Student_Name Student_Age Subject_ID
101 Raju 20 C01
102 Ramesh 20 C04
103 Ravi 21 C01
104 Rajesh 22 C02
105 Ranjith 21 C03
106 Ravinder 20 C04
107 Rahul 20 C03
108 Rudra 21 C04
Subject
Subject_ID Subject_Name
C01 Math
C02 Java
C03 C
C04 DBMS
Therefore, the Student and Subject tables can be combined using a join on the cluster key and stored together as the following cluster file.
Student + Subject
(Cluster key: Subject_ID)
Subject_ID Subject_Name Student_ID Student_Name Student_Age
C01 Math 101 Raju 20
103 Ravi 21
C02 Java 104 Rajesh 22
C03 C 105 Ranjith 21
107 Rahul 20
C04 DBMS 102 Ramesh 20
106 Ravinder 20
108 Rudra 21
Insert, update and delete operations on a record can be performed directly, because the data is sorted on the cluster key that is used for searching and accessing. In the table above (Student + Subject), the cluster key is Subject_ID.
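A rough Python sketch of how a cluster on Subject_ID keeps the related Subject and Student rows together; the dictionary is an in-memory stand-in for physical co-location in one file:

# Subject rows and Student rows, clustered on the shared Subject_ID key.
subjects = {"C01": "Math", "C02": "Java", "C03": "C", "C04": "DBMS"}
students = [
    (101, "Raju", 20, "C01"), (102, "Ramesh", 20, "C04"),
    (103, "Ravi", 21, "C01"), (104, "Rajesh", 22, "C02"),
    (105, "Ranjith", 21, "C03"), (106, "Ravinder", 20, "C04"),
    (107, "Rahul", 20, "C03"), (108, "Rudra", 21, "C04"),
]

# Build the cluster: each cluster-key value stores its subject row together
# with all student rows that reference it.
cluster = {sid: {"subject": name, "students": []} for sid, name in subjects.items()}
for stu_id, stu_name, age, subj in students:
    cluster[subj]["students"].append((stu_id, stu_name, age))

# Fetching everything about one subject now touches a single cluster entry,
# instead of scanning and joining two separate files.
print(cluster["C01"])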
Comparison of file organizations (summary of the comparison table):
● Sequential file organization – well suited to report generation, statistical calculations and similar workloads.
● Hash file organization – suitable for online transaction processing, where many transactions fetch individual records by key. It is inefficient when multiple hash keys are present or when a frequently updated column is used as the hash key, and searching a range of data (or partial key values) through a hash is inefficient.
● B+ tree file organization – any of the columns can be used as the key column; searching a range of data and partial data is efficient; performance does not degrade with inserts, deletes or updates; the structure grows and shrinks with the data; it works well on secondary storage devices and hence reduces disk I/O; since all data is at the leaf nodes, searching is easy; and the data at the leaf level is kept sorted in a sequentially linked list.
● Cluster file organization – suitable for 1:M mappings between related tables.
Indexes and Performance Tuning
An index is a copy of selected columns of a table, organised so that it can be searched very efficiently. Indexing adds some overhead in the form of additional writes and storage space needed to maintain the index data structure, but its whole purpose, in whichever of the available forms it is implemented, is to improve the lookup mechanism: it must speed up data matching by reducing the time taken to find rows that match the query value.
Let us look at how an index is actually stored, and why lookups become faster once an index is created.
Index entries are themselves "rows", containing the indexed column(s) and some kind of pointer into the base table data (for example a row id or page address). When an index is used to fetch a row, the index is walked until the entry (or entries) of interest is found, and the base table is then looked up to fetch the actual row data.
When a row is inserted, a corresponding entry is written to the index, and when a row is deleted, its index entry is removed. This keeps the index in sync with the table data, so lookups remain fast and read-efficient.
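A tiny Python sketch of this idea, using a dictionary as the index and list positions as the row pointers (purely illustrative; not how any particular DBMS stores indexes):

# Base table: student rows stored in arbitrary (heap) order.
students = [
    (103, "Ravi", 21),
    (101, "Raju", 20),
    (108, "Rudra", 21),
]

# Index on Student_Name: each index entry holds the indexed value plus a
# "pointer" (here the row position) back into the base table.
name_index = {}
for pos, row in enumerate(students):
    name_index.setdefault(row[1], []).append(pos)

def lookup(name):
    # Walk the index first, then fetch the actual row data from the base table.
    return [students[pos] for pos in name_index.get(name, [])]

def insert(row):
    # Every insert also writes a matching index entry, keeping both in sync.
    students.append(row)
    name_index.setdefault(row[1], []).append(len(students) - 1)

insert((105, "Ranjith", 21))
print(lookup("Ranjith"))   # found via the index, without scanning the table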
Indexing is implemented using specific architectures and approaches. Two architectures worth mentioning are:
Clustered index:
1. A clustered index is similar to a telephone directory, where the data itself is arranged in the order of the index key (for example, by first name and last name).
2. A table can have only one clustered index; however, that one clustered index can include multiple columns, just as a telephone directory is ordered by more than one name.
Non-clustered index:
1. A non-clustered index is similar to the index at the back of a textbook: the data is stored in one place and the index is stored in another place.
2. The index has pointers to the storage location of the data.
3. Since non-clustered indexes are stored separately from the data, a table can have more than one non-clustered index.
4. Within the index itself, entries are stored in ascending or descending order of the index key.
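A small Python sketch of the difference (illustrative only): with a clustered index the table rows themselves are kept in key order, while a non-clustered index is a separate sorted list of (key, pointer) entries that refers back to unordered rows.

import bisect

# Clustered: the data rows themselves are stored sorted on the key (Student_ID),
# so a binary search runs directly over the table.
clustered_rows = sorted(
    [(104, "Rajesh"), (101, "Raju"), (103, "Ravi"), (108, "Rudra")]
)

def clustered_lookup(key):
    i = bisect.bisect_left(clustered_rows, (key,))
    if i < len(clustered_rows) and clustered_rows[i][0] == key:
        return clustered_rows[i]
    return None

# Non-clustered: rows stay in arbitrary order; a separate sorted index of
# (key, row position) pairs points back into the table. A table can have
# several of these on different columns.
heap_rows = [(104, "Rajesh"), (101, "Raju"), (103, "Ravi"), (108, "Rudra")]
name_index = sorted((name, pos) for pos, (_, name) in enumerate(heap_rows))

def nonclustered_lookup(name):
    i = bisect.bisect_left(name_index, (name,))
    if i < len(name_index) and name_index[i][0] == name:
        return heap_rows[name_index[i][1]]
    return None

print(clustered_lookup(103))         # (103, 'Ravi')
print(nonclustered_lookup("Rudra"))  # (108, 'Rudra')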