DBMS File
1.1 Overview
1.2 Introduction
1.3 DBMS vs. File Systems
1.4 DBMS Architecture
1.5 Database Users & Database schema
1.6 Database Language
1.7 Data Independence
1.8 Assignment
1.9 Test/Quiz
SECTION 2: KEYS
SECTION 5: NORMALIZATION:
5.1 What is Normalization
5.2 Functional Dependency
5.3 Inference Rule
5.4 Normal forms
5.5 Relational Decomposition
5.6 Multi valued Dependency
5.7 Join Dependency
5.8 Assignment
5.9 Test/Quiz
SECTION 8: RAID:
8.1 levels
8.2 Assignment
8.3 Test/Quiz
A database-management system (DBMS) is a collection of interrelated data and a set of programs to access
those data.
The ultimate purpose of a Database Management system is to store and transform data into
information to support making decisions.
DBMS provides the following functions:
Concurrency: multiple users can access data in the same database at the same time.
Security: security rules to determine access rights of users
Backup and recovery: processes to back-up the data regularly and recover data if a problem
occurs
Integrity: database structure and rules improve the integrity of the data
Data descriptions: a data dictionary provides a description of the data
Popular DBMS software:
MySQL
Microsoft Access
Oracle
PostgreSQL
dBASE
FoxPro
SQLite
IBM DB2
LibreOffice Base
MariaDB
Microsoft SQL Server etc.
History of DBMS
1960 – Charles Bachman designed the first DBMS.
1970 – Edgar F. Codd introduced the relational model of data (IBM's Information Management System, IMS, a hierarchical DBMS, had appeared a few years earlier).
1976 – Peter Chen coined and defined the Entity-Relationship model, also known as the ER model.
1980 – The relational model becomes a widely accepted database component.
1985 – Object-oriented DBMSs develop.
1990 – Incorporation of object orientation in relational DBMSs.
1991 – Microsoft ships MS Access, a personal DBMS that displaces all other personal DBMS products.
1995 – First Internet database applications.
1997 – XML applied to database processing; many vendors begin to integrate XML into DBMS products.
Types of DBMS
There are 4 major types of DBMS.
Hierarchical –
In the hierarchical database management system (hierarchical DBMS) model, data is stored in
parent-child relationship nodes. In a hierarchical database, besides actual data, records also
contain information about their groups of parent/child relationships. Data is stored as a
collection of fields in which each field contains only one value; every individual record has only
one parent, and a parent can have one or more children. To retrieve a field's data, we need to
traverse each tree until the record is found.
The hierarchical database system structure was developed by IBM in the early 1960s.
Example: The IBM Information Management System (IMS) and Windows Registry are two popular
examples of hierarchical databases.
Network DBMS –
The network database structure was invented by Charles Bachman. Network database management
systems (network DBMSs) use a network structure to create relationships between entities, and they
support many-to-many relations.
The network database is similar to the hierarchical database, but it is more flexible, since a child record can have more than one parent.
Example: Integrated Data Store (IDS), IDMS (Integrated Database Management System), Raima
Database Manager
Relational DBMS –
In relational databases, the relationship between data files is relational. Data is stored in tabular
form, in columns and rows. Each column of a table represents an attribute and each row in a table
represents a record. Each field in a table represents a data value. The relational model depicts the
relation between two or more tables. Structured Query Language (SQL) is the language used to query
an RDBMS, including inserting, updating, deleting, and searching records. RDBMSs are the most
popular type of database in the world.
Example: Oracle, SQL Server, MySQL, SQLite, and IBM DB2.
Object-Oriented DBMS –
Object-oriented databases were created in the early 1980s. An object-oriented database combines
database functionality with object-oriented programming concepts, extending languages such as C++ and Java. It
adds database functionality to object programming languages, so object developers can write
complete database applications with a modest amount of additional effort.
Example: Some Object-Oriented Databases were designed to work with OOP languages such as
Delphi, Ruby, C++, Java, and Python.
Application of DBMS –
Banking:- For customer information, payments, deposits, loans etc….
Airlines:- For reservations and schedule information.
Universities:- For students and faculty information, course registrations, colleges and grades.
Telecommunication:- It helps to keep call records, generate monthly bills, maintain balances, etc.
Finance:- For storing information about stock, sales, and purchases of financial instruments
like stocks and bonds.
Sales:- Use for storing customer, product and sales information.
Manufacturing:- It is used for supply chain management and for tracking the production of
items and inventory status in warehouses.
Characteristics of DBMS –
Advantages of DBMS –
Disadvantages of DBMS –
The centralization of resources increases the vulnerability of the system.
The cost of the hardware and software of a DBMS is quite high.
The use of the same program at the same time by many users sometimes leads to the loss of data.
Any accidental failure of a component may cause loss of valuable data.
Because a DBMS provides many functionalities, it is a large piece of software that requires a lot of space and
memory to run efficiently.
A DBMS requires regular updates; it should be kept up to date with the current
scenario.
A DBMS can perform poorly for small-scale firms, as its speed may be slow.
DBA (Database Administrator)
File System vs. DBMS:
1. File system: software that manages the data files in a computer system. DBMS: software to create and manage the database.
2. File system: helps to store a collection of raw data files on the hard disk. DBMS: helps to easily store, retrieve, and manipulate data in a database.
8. File system: backup and recovery are not possible. DBMS: has a sophisticated backup and recovery system.
9. File system: handles data on a small scale. DBMS: handles data on a large scale.
10. File system: tasks such as storing, retrieving, and searching are done manually, so it is difficult to manage data. DBMS: operations such as updating and searching are easier because it supports SQL queries.
1.4 DBMS Architecture
1. One-tier architecture –
In one-tier architecture, the database is directly available to the user, who works directly on the DBMS.
[Diagram: USER → DATABASE]
2. Two-tier architecture –
In two-tier architecture, the Database system is present at the server machine and the DBMS
application is present at the client machine, these two machines are connected with each other
through a reliable network as shown in the above diagram.
[Diagram: USER → Application (client machine) → Database System (server machine)]
3. Three-tier architecture –
3-tier schema is an extension of the 2-tier architecture. 3-tier architecture has the following layers-
This DBMS architecture contains an Application layer between the user and the DBMS, which is
responsible for communicating the user’s request to the DBMS system and send the response from
the DBMS to the user.
The application layer (business logic layer) also processes functional logic, constraint, and rules
before passing data to the user or down to the DBMS
Three-tier architecture is the most popular DBMS architecture.
[Diagram: USER → Client (Application Client) → Application Server → Database Server]
1.5 Database users & Schemas
Database Users:-
Application Programmers – They are the developers who interact with the database by means of
DML queries. These DML queries are written in application programs like C, C++, JAVA, Pascal, etc.
These programs are written to meet the users' requirements. Retrieving
information, creating new information and changing existing information is done by these application
programs.
End Users – End users are those who access the database from the terminal end. They use the
developed applications and they don’t have any knowledge about the design and working of the
database. These are the second class of users, and their main goal is just to get their task done.
The main types of end-users are discussed below.
Naïve Users – Any user who does not have any knowledge about the database can be in this
category. Their task is to just use the developed application and get the desired results. For example,
Clerical staff in any bank is a naïve user. They don’t have any DBMS knowledge but they still use the
database and perform their given task.
Stand-alone Users – These are those users whose job is basically to maintain personal databases
by using a ready-made program package that provides easy to use menu-based or graphics-based
interfaces. An example is the user of a tax package that stores a variety of personal
financial data for tax purposes. These users become very proficient in using a specific software
package.
Specialized Users – These are sophisticated users writing special database application programs.
These may be CADD systems, knowledge-based and expert systems, complex data systems
(audio/video), etc.
Sophisticated Users – These users basically include engineers, scientists, business analytics and
others who thoroughly familiarize themselves with the facilities of the DBMS in order to implement
their application to meet their complex requirements.
Database Schema:-
The data which is stored in the database at a particular moment of time is called an instance of the
database. Database systems comprise complex data structures. Thus, to make the system
efficient for retrieval of data and reduce the complexity of the users, developers use the method of
Data Abstraction. A database schema defines its entities and the relationship among them.
There are mainly three levels of data abstraction:
1. Internal Level: The internal schema defines the physical storage structure of the database. It
defines how the data will be stored in secondary storage. It is also called the Physical
Database Schema. It never deals with physical devices. Instead, internal schema views a
physical device as a collection of physical pages.
2. Conceptual or Logical Level: The conceptual schema describes the structure of
the whole database for the community of users. This logical level comes between the user
level and the physical storage view. However, there is only a single conceptual view of a single
database.
3. External or View level: An external schema describes the part of the database which a
specific user is interested in. An external view is just the content of the database as it is seen
by a particular user. For example, a user from the sales department will see only
sales-related data.
Database Instance
A database schema is the skeleton of the database. It is designed when the database doesn't yet exist. Once
the database is operational, it is very difficult to make any changes to it. A database schema does not contain
any data or information.
A database instance is a state of an operational database, with data, at a given point in time. A DBMS ensures that
every instance (state) is a valid state, by diligently following all the validations, constraints, and conditions
that the database designers have imposed.
1.6 Database Languages
DDL (Data Definition Language) – DDL statements are used to define or modify the database structure or schema.
Tasks performed by DDL:-
CREATE – It is used to create the database or its objects (like tables, functions, views)
ALTER – It is used to alter the structure of the database.
DROP – It is used to delete objects from the database.
RENAME – It is used to rename an object existing in the database.
TRUNCATE – It is used to remove all records from a table; the space allocated for
the records is also released.
COMMENT – It is used to add comments to the data dictionary.
DQL (Data Query Language) – DQL statements (SELECT) are used for performing queries on the data within database objects.
The purpose of a DQL command is to fetch relations from the database based on the query passed to it.
DML (Data Manipulation Language) – DML is an abbreviation of Data Manipulation Language. It is used to retrieve,
modify, add, and delete data in the database.
TCL (Transaction Control Language) – TCL commands deal with the transaction within the
database.
Task Perform by TCL:-
COMMIT– Commits a Transaction.
ROLLBACK– Rollbacks a transaction in case any error occurs.
SAVEPOINT– Sets a savepoint within a transaction.
SET TRANSACTION– Specify characteristics for the transaction.
DCL (Data Control Language) – DCL includes commands such as GRANT and REVOKE which
mainly deals with the rights, permissions and other controls of the database system.
Tasks performed by DCL:-
GRANT – gives users access privileges to the database.
REVOKE – withdraws the access privileges given with the GRANT command.
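A minimal sketch showing one statement from each category (table and column names are illustrative; exact syntax varies slightly across DBMS products):
-- DDL: define the schema
CREATE TABLE Country (Country_id CHAR(3) PRIMARY KEY, Country_name VARCHAR(50));
-- DML: change the data
INSERT INTO Country VALUES ('C01', 'India');
UPDATE Country SET Country_name = 'Bharat' WHERE Country_id = 'C01';
-- DQL: query the data
SELECT Country_name FROM Country WHERE Country_id = 'C01';
-- TCL: end the transaction
COMMIT;
-- DCL: manage privileges (some_user is a hypothetical account)
GRANT SELECT ON Country TO some_user;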
1.7 Data Independence
Data independence means that upper levels are unaffected by changes to lower levels. It helps you to improve the
quality of the data.
Conventional data processing does not provide data independence in application programs: any
change in the information, layouts, or arrangements requires a change in the application
programs as well.
Physical Level:
Physical data independence is the ability to change the internal (physical) schema without changing the conceptual schema.
Due to physical independence, changes at the physical level, such as using a new storage device, changing the file organization, or adding new indexes, will not affect the conceptual layer.
Logical/Conceptual Level:
Logical Data Independence is the ability to change the conceptual scheme without changing
1. External views
2. External API or programs
Logical data independence refers to the characteristic of being able to change the conceptual
schema without having to change the external schema.
Logical data independence is used to separate the external level from the conceptual view.
If we do any changes in the conceptual view of the data, then the user view of the data would
not be affected.
Logical data independence occurs at the user interface level.
Due to logical independence, changes at the conceptual level, such as adding, modifying, or deleting an attribute or relationship, will not affect the external layer.
Section 2: Keys:
2.1 Super Key
The set of attributes that can uniquely identify a tuple is known as Super Key.
Adding zero or more attributes to the candidate key generates the super key.
A candidate key is a super key but vice versa is not true.
For example, in a STUDENT relation with attributes (st_Id, st_Number, st_Name), possible super keys are:
{st_Id}
{st_Number}
{st_Id, st_Number}
{st_Id, st_Name}
{st_Number, st_Name}
2.2 Primary key:
A primary key, also called a primary keyword, is a key in a relational database that is unique for each
record.
A relation (table) can have one and only one primary key.
Primary keys typically appear as columns in relational database tables.
They allow you to find the relation between two tables.
For Example: In the above-given example, student ID is a primary key because it uniquely identifies a Student
record. In this table, no other Student can have the same Student ID, And Multiple students can enroll in the
same college but a single student cannot enroll in multiple colleges at a time.
2.3 Foreign Key:
A foreign key acts as a cross-reference between tables: it references the primary key of another table.
A correct definition of a foreign key: foreign keys are the columns of a table that point to the
candidate key of another table.
For example:
Table: Country
Country_Id Country_Name
01 India
02 China
03 Nepal
Table: State
State_Id Country_Id State_Name
1 01 Madhya Pradesh
2 01 Uttar Pradesh
3 02 Fujian
4 02 Beijing
5 03 Mahakali
In the above example, there are two tables: one for countries and the other containing a number of
states.
In the Country table, Country_id is the primary key, whereas Country_id is used as a foreign key in the State
table. With the help of the foreign key, we can easily access records of the Country table from the State table.
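As a sketch, the two tables above could be declared as follows, with Country_id as the foreign key (data types are assumptions, and syntax varies slightly by DBMS):
CREATE TABLE Country (
  Country_id   CHAR(2) PRIMARY KEY,
  Country_name VARCHAR(50)
);
CREATE TABLE State (
  State_id   INT PRIMARY KEY,
  Country_id CHAR(2),
  State_name VARCHAR(50),
  FOREIGN KEY (Country_id) REFERENCES Country (Country_id)  -- points to the candidate key of Country
);
An attempt to insert a State row with a Country_id not present in Country would then be rejected.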
2.4 Composite Key:
A primary key consisting of two or more attributes is called a composite key.
It is a combination of two or more columns.
For Example:
Here, the composite key consists of StudentID and StudentEnrollNo: the table has two attributes as its primary
key.
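A sketch of how such a composite key could be declared (the StudentName column and the data types are assumptions):
CREATE TABLE Student (
  StudentID       INT,
  StudentEnrollNo INT,
  StudentName     VARCHAR(50),
  PRIMARY KEY (StudentID, StudentEnrollNo)  -- composite key: both columns together identify a row
);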
2.5 Unique Key:
A unique key is a little like a primary key, but it can accept one null value and it cannot have duplicate values.
The unique key and primary key both provide a guarantee for uniqueness for a column or a set of
columns.
There is an automatically defined unique key constraint within a primary key constraint.
There may be many unique key constraints for one table, but only one PRIMARY KEY constraint for
one table.
The UNIQUE constraint ensures that all values in a column are different.
1. The primary key will not accept NULL values whereas the unique key can accept one NULL value.
2. A table can have only one primary key whereas there can be multiple unique keys on a table.
3. A clustered index is automatically created when a primary key is defined whereas a unique key generates
a non-clustered index.
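A sketch illustrating the difference (the Email column is a hypothetical example; how many NULLs a UNIQUE column may hold varies by DBMS):
CREATE TABLE Student (
  StudentID INT PRIMARY KEY,      -- primary key: unique, no NULLs, one per table
  Email     VARCHAR(100) UNIQUE,  -- unique key: no duplicates, but a NULL is accepted
  Name      VARCHAR(50)
);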
2.6 Alternate Key
An alternate key is a key associated with one or more columns whose values uniquely identify every row in
the table, but which is not the primary key.
As we have seen in the candidate key guide that a table can have multiple candidate keys. Among these
candidate keys, only one key gets selected as the primary key, the remaining keys are known as alternative
or secondary keys.
For Example:
Where the primary key for a table may be the student ID, the alternate key might combine the first, middle, and
last names of the student. Each alternate key can generate a unique index or a unique constraint in a target
database.
From the above example, studentID and StudentEnrollNo both can be a primary key because they both can
give a unique record separately.
We can also consider StudentID as the primary key; the other candidate columns then become alternate keys.
2.7 Candidate Key:
The value of a candidate key is unique and non-null for every tuple.
There can be more than one candidate key in a relation.
A super key with no redundant attribute is known as a candidate key.
Candidate keys are selected from the set of super keys.
The candidate key can be simple (having only one attribute) or composite as well. For Example, {country_id,
State_id} is a composite candidate key for relation Country_State
Table: Country
Country_Id Country_Name
01 India
02 China
03 Nepal
Table: Country_State
State_Id Country_Id State_Name
1 01 Madhya Pradesh
2 01 Uttar Pradesh
3 02 Fujian
4 02 Beijing
5 03 Mahakali
SECTION 3: ER MODEL:
3.1 ER Model
An Entity-relationship model (ER model) describes the structure of a database with the help of a diagram,
which is known as an Entity Relationship Diagram (ER Diagram). An ER model is a design or blueprint of a
database that can later be implemented as a database. It also develops a very simple and easy to design view
of data. It defines the conceptual view of a database.
An ER diagram shows the relationship among entity sets. An entity set is a group of similar entities and these
entities can have attributes. In terms of DBMS, an entity is a table or attribute of a table in the database, so by
showing a relationship among tables and their attributes, the ER diagram shows the complete logical structure
of a database.
[Example ER diagram omitted.]
Component of ER Diagram
3.2 Symbols Used in ER
Entity – An entity can be a real-world object, either animate or inanimate, that can be easily identifiable. An
entity set is a collection of similar types of entities. An entity set may contain entities with attributes sharing
similar values. An entity may be any object, class, person or place. In the ER diagram, an entity can be
represented as rectangles.
Weak Entity – An entity that depends on another entity is called a weak entity. The weak entity doesn't contain
any key attribute of its own, and it depends on a
strong entity to ensure its existence. Unlike a strong entity, a weak entity does not have a
primary key; it has a partial discriminator key instead. The weak entity is represented by a double rectangle.
The relation between one strong and one weak entity is represented by a double diamond.
A strong entity is not dependent on any other entity in a schema. A strong entity always has a primary key.
The strong entity is represented by a single rectangle. The relationship between two strong entities is represented by a
single diamond. Various strong entities together make up a strong entity set.
2. Attribute – Attributes are the properties of entities. Attributes are represented by means of ellipses. Every
ellipse represents one attribute and is directly connected to its entity (rectangle).
An attribute can be of many types, here are different types of attributes defined in the ER database
model:
Key Attribute – The key attribute is used to represent the main characteristics of an entity. It represents a
primary key. The key attribute is represented by an ellipse with the text underlined.
Composite Attribute – Composite attributes are made of more than one simple attribute. For example, a
student’s complete name may have first_name and last_name. The composite attribute is represented by an
ellipse, and those ellipses are connected with an ellipse.
Multivalued Attribute – Multi-value attributes may contain more than one value. For example, a person can
have more than one phone number, email_address, etc. The double oval is used to represent the multivalued
attribute.
Derived Attribute – Derived attributes are the attributes that do not exist in the physical database, but their
values are derived from other attributes present in the database. For example, age can be derived
from date_of_birth. It is represented by a dashed ellipse.
3.3 Cardinality
Mapping Constraints
Cardinality defines the number of entities in one entity set which can be associated with the number of
entities of the other set via a relationship set. In simple words, it refers to the relationship one table can have
with the other table.
Notations used for Cardinality:
Relationship –
Types of Relationships:
One to One – When only one instance of an entity is associated with the relationship, it is marked as ‘1:1’.A
one-to-one relationship can be used for security purposes, to divide a large table, and various other specific
purposes.
One to Many – When a single instance of an entity is associated with more than one instance of another entity
then it is called one to many relationships. This is the most common relationship type. One-to-Many
relationships can also be viewed as Many-to-One relationships, depending on which way you look at it.
Many to One – More than one entity from entity set A can be associated with at most one entity of entity set B,
however an entity from entity set B can be associated with more than one entity from entity set A.
Many to Many- Entity from A can be associated with more than one entity from B and vice versa. A many-to-
many relationship could be thought of as two one-to-many relationships, linked by an intermediary table.
In generalization, two or more entities combine to form a new higher-level entity. The higher-level entity can
also combine with other lower-level entities to make a further higher-level entity.
It is a Bottom up Approach.
It is a reverse process of Specialization.
It’s more like Super-class and Sub-class system.
Sub-classes are combined to form a super-class.
Examples:
DBMS Specialization:
Specialization is a designing procedure that proceeds in a top-down manner.
Examples:
Generalization helps in reducing the size of the schema whereas specialization is just the opposite: it increases
the number of entities, thereby increasing the size of the schema.
Generalization is always applied to the group of entities whereas, specialization is always applied on a
single entity.
Generalization results in a formation of a single entity whereas, Specialization results in the formation of
multiple new entities.
Generalization is a bottom-up approach, whereas Specialization is a Top-down approach.
DBMS Aggregation:
The ER model helps in database design with utmost efficiency. One of the major limitations of the ER
model is its inability to represent a relationship among relationships.
A ternary relationship can be represented using the ER model, but a lot of redundancy will
arise. Hence, the concept of aggregation is used to remove these redundancies.
In aggregation, the relation between two entities is treated as a single entity. In aggregation, relationship with
its corresponding entities is aggregated into a higher level entity.
Example:
3.5 Convert ER model into table:
The database can be represented using the notations, and these notations can be reduced to a collection of
tables.
In the database, every entity set or relationship set can be represented in tabular form.
There are some points for converting the ER diagram to the table:
o Entity type becomes a table.
In the given ER diagram, LECTURE, STUDENT, SUBJECT and COURSE forms individual tables.
In the STUDENT entity, STUDENT_NAME and STUDENT_ID form the columns of the STUDENT table. Similarly,
COURSE_NAME and COURSE_ID form the columns of the COURSE table, and so on.
In the given ER diagram, COURSE_ID, STUDENT_ID, SUBJECT_ID, and LECTURE_ID are the key attributes
of the entities.
In the student table, a hobby is a multivalued attribute. So it is not possible to represent multiple values in a
single column of STUDENT table. Hence we create a table STUD_HOBBY with column name STUDENT_ID
and HOBBY. Using both columns, we create a composite key.
In the given ER diagram, student address is a composite attribute. It contains CITY, PIN, DOOR#, STREET,
and STATE. In the STUDENT table, these attributes are merged as individual columns.
In the STUDENT table, Age is the derived attribute. It can be calculated at any point of time by calculating the
difference between current date and Date of Birth.
Using these rules, you can convert the ER diagram to tables and columns and assign the mapping between
the tables. Table structure for the given ER diagram is as below:
TABLE STRUCTURE:
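Since the table-structure figure is not reproduced here, the following is a sketch of the resulting SQL, applying the rules above (data types and the DOOR_NO spelling are assumptions):
CREATE TABLE STUDENT (
  STUDENT_ID    INT PRIMARY KEY,
  STUDENT_NAME  VARCHAR(50),
  DATE_OF_BIRTH DATE,                     -- Age is derived from this, so it is not stored
  CITY VARCHAR(30), PIN CHAR(6), DOOR_NO VARCHAR(10),
  STREET VARCHAR(30), STATE VARCHAR(30)   -- composite address merged as individual columns
);
CREATE TABLE STUD_HOBBY (
  STUDENT_ID INT,
  HOBBY      VARCHAR(30),
  PRIMARY KEY (STUDENT_ID, HOBBY)         -- multivalued attribute moved to its own table with a composite key
);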
SECTION 4: RELATIONAL MODEL:
The relational model represents a database as a collection of relations (tables): each row (tuple) is a record and each column (attribute) is a property.
[Sample relation omitted.]
Relational instance: In the relational database system, a relation instance is represented by a finite set of
tuples. Relation instances do not have duplicate tuples.
Relation schema: A relation schema describes the structure of the relation: the name of
the relation (the name of the table), its attributes, and their names and types.
Relation key: A relation key is an attribute that can uniquely identify a particular
tuple (row) in a relation (table).
Example for Student Relation:
S_Name S_Roll S_Mobile S_Address
Amit 1011 0987654327 Delhi
Sumit 1012 7896547484 Gwalior
Ram 1013 9456788343 Gurgaon
Shyam 1014 8764678934 Noida
In the given table, S_Name, S_Roll, S_Mobile, and S_Address are the attributes.
The instance of schema STUDENT has 4 tuples; for example,
t3 = <Ram, 1013, 9456788343, Gurgaon>
Properties of Relations:
The name of the relation is distinct from all other relations.
Each relation cell contains exactly one atomic (single) value
Each attribute has a distinct name
The order of attributes has no significance
There are no duplicate tuples
The order of tuples has no significance
4.2 CONSTRAINTS:
Constraints impose limits on the data, or on the type of data, that can be inserted, updated, or deleted in a table.
• This ensures the accuracy and reliability of the data in the database.
• It could be either on a column level or a table level.
• The column level constraints are applied only to one column, whereas the table level constraints are applied
to the whole table.
• Relational constraints are the restrictions imposed on the database contents and operations.
• Integrity constraints are a set of rules. It is used to maintain the quality of information.
• Thus, integrity constraint is used to guard against accidental damage to the database.
1. Domain Constraints:
Domain constraints require each attribute value to be an atomic value from the attribute's declared domain (data type).
For example, if the age attribute is declared as an integer, the value 'A' is not allowed, since only integer values can be taken by the age attribute.
2. Tuple Uniqueness Constraint:
A tuple uniqueness constraint specifies that all the tuples of a relation must be unique.
Example 1: This relation satisfies the tuple uniqueness constraint, since all the tuples here are unique.
Example 2:
Country_id Country_name Country_population
This relation does not satisfy the tuple uniqueness constraint since here all the tuples are not unique.
3. Key Constraints:
Keys are the attributes that uniquely identify an entity within its entity set.
An entity set can have multiple keys, but out of which one key will be the primary key.
All the values of the primary key must be unique.
The value of the primary key must not be null.
Example:
This relation does not satisfy the key constraint as here all the values of the primary key are not unique.
4. Entity Integrity Constraint:
The entity integrity constraint states that the primary key value can't be null.
This is because the primary key value is used to identify individual rows in relation and if the primary key has a
null value, then we can’t identify those rows.
A table can contain a null value other than the primary key field.
Example 1:
This relation does not satisfy the entity integrity constraint as here the primary key contains a NULL value.
Ex-2:
[Table omitted: a relation whose primary key column contains no NULL values.]
This relation satisfies the entity integrity constraint as here the primary key does not contain a NULL value.
5. Referential Integrity Constraint:
A referential integrity constraint is specified between two tables: a foreign key in the referencing table must either be null or match a primary key value of the referenced table. A table can contain a null value in fields other than the primary key field.
Example:
Consider the following two relations- ‘Country’ and ‘State’.
Here, relation ‘State’ references the relation ‘Country’.
Country_id Country_name
C01 India
C02 Nepal
Codd's Twelve Rules:
These rules can be applied to any database system that manages stored data using only its relational
capabilities.
1. Rule 0: The Foundation Rule. Any system claimed to be a relational database management system must be able to manage databases
entirely through its relational capabilities.
2. Rule 1: The Information Rule. All information in an RDBMS is represented logically in just one way - by values in tables. In other words, all
information (including metadata) is to be represented as stored data in cells of tables. The rows and columns have to be strictly
unordered.
3. Rule 2: The Guaranteed Access Rule. Every single data element (value) is guaranteed to be accessible logically with a
combination of table name, primary key (row value), and attribute name (column value). No other means, such as pointers, can be
used to access data.
4. Rule 3: Systematic Treatment of Nulls. Null has several meanings: it can mean missing data, not applicable, or no
value. It should be handled consistently. Also, a primary key must never be null, and an expression on null must give null.
5. Rule 4: Active Online Catalog Rule. The structure of the database must be stored in an online catalog which can be queried by
authorized users.
6. Rule 5: Powerful and Well-Structured Language Rule. The system must support at least one relational
language that:
Has a linear syntax
Can be used both interactively and within application programs
Supports data definition operations (including view definitions), data manipulation operations
(update as well as retrieval), security and integrity constraints, and transaction management
operations (begin, commit, and rollback).
7. Rule 6: View Updating Rule. Different views created for various purposes should be automatically
updatable by the system.
8. Rule 7: Relational-Level Operations Rule. A database must support high-level insertion, updating,
and deletion. These operations must not be limited to a single row; the system must also support union, intersection, and
minus operations to yield sets of data records.
9. Rule 8: Physical Data Independence Rule. Any modification in the physical location of a table
should not force modification at the application level.
10. Rule 9: Logical Data Independence Rule. Changes to the logical level (tables, columns, rows, and
so on) must not require a change to applications based on that structure. Logical data independence is more
difficult to achieve than physical data independence.
11. Rule 10: Integrity Independence Rule. The database should be able to enforce its own integrity
rather than relying on other programs. Key and check constraints, triggers, etc. should be stored in the data dictionary.
This also makes the RDBMS independent of the front end.
12. Rule 11: Distribution Independence Rule. The distribution of data over various locations
should not be visible to end users.
13. Rule 12: Nonsubversion Rule. If low-level access is allowed to the system, it should not be able to
subvert or bypass integrity rules to change the data.
Relational Algebra:
Following are the important relational algebra operators-
1. Selection Operator (σ): It selects tuples that satisfy the given predicate from a relation. In other words,
the SELECT operation is used for selecting a subset of the tuples according to a given selection
condition.
Notation: σp(r)
Example:
σCountry_name=”India”(COUNTRY)
Output: the tuple(s) of COUNTRY whose Country_name is 'India'.
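For comparison, a rough SQL equivalent of this selection (assuming the Country table used in the examples below):
SELECT * FROM Country WHERE Country_name = 'India';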
2. Projection Operator(π):
Projection Operator (π) is a unary operator in relational algebra that performs a projection operation.
It displays the columns of a relation or table based on the specified attributes.
Project operator in relational algebra is similar to the Select statement in SQL.
Syntax:
π column_name1, column_name2, ..., column_nameN (table_name)
Example:
Table: Country
Input:
π Country_name, Country_population (Country)
Output:
Country_name Country_population
India 4000
Nepal 4567
China 4324
Bhutan 5675
Pakistan 5623
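For comparison, a rough SQL equivalent (DISTINCT approximates the duplicate elimination that π performs):
SELECT DISTINCT Country_name, Country_population FROM Country;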
3. Rename Operator (ρ): The rename (ρ) operation can be used to rename a relation or an attribute of
a relation. It is denoted by rho (ρ). The results of relational algebra expressions are also relations, but without any name.
Syntax:
ρ x (E) [where the result of expression E is saved with the name x]
or
ρ (Relation_New, Relation_Old)
Example:
Table: Country
Input:
ρ (Country_pop, π Country_population (Country))
Output:
Country_population
4000
4567
4324
5675
5623
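SQL has no direct ρ operator; a rough analogue is a table alias on a derived table:
SELECT * FROM (SELECT Country_population FROM Country) AS Country_pop;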
4. Union Operator (∪): This operation is used to fetch data from two relations (tables) or temporary
relations (results of other operations).
For this operation to work, the relations (tables) specified should have the same number of attributes (columns) and the same
attribute domains. Duplicate tuples are automatically eliminated from the result.
Example:
Table 1: Course
Table 2: Student
Student_id Student_name Student_age
C01 Amit 22
C05 Neha 33
C03 Rahul 32
C04 Ravi 25
Input:
Π Student_name (Course) U π Student_name(Student)
Output:
Student_name
Amit
Rahul
Ravi
Note: As you can see there are no duplicate names present in the output even though we had few
common names in both the tables, also in the COURSE table we had the duplicate name itself.
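For comparison, a rough SQL equivalent (SQL's UNION likewise removes duplicates):
SELECT Student_name FROM Course
UNION
SELECT Student_name FROM Student;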
5. Cartesian Product(×):This is used to combine data from two different relations(tables) into one and fetch data from
the combined relation.
It is denoted by ×.
Syntax: A × B
Example:
Table 1: Student
S_id S_name S_dep
S01 Sumit A
S02 Honey C
S03 Harry B
Table 2: Department:
Dep_id Dep_name
A CS
B Mechanical
C Sales
Input:
Student × Department
Output (first three of the 3 × 3 = 9 rows):
S_id S_name S_dep Dep_id Dep_name
S01 Sumit A A CS
S02 Honey C A CS
S03 Harry B A CS
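For comparison, a rough SQL equivalent (see also the CROSS JOIN section later):
SELECT * FROM Student CROSS JOIN Department;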
6. Minus/Set-Difference (–): The result of the set-difference operation is the set of tuples that are present in one relation but
not in the other.
It is denoted by the – symbol.
Syntax: A – B
Input:
Finds all the tuples that are present in A but not in B.
π author(Books) – π author(Articles)
Output − Provides the name of authors who have written books but not articles.
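For comparison, a rough SQL equivalent (standard SQL uses EXCEPT; Oracle calls the same operator MINUS):
SELECT author FROM Books
EXCEPT
SELECT author FROM Articles;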
7. Intersection (∩): An intersection is denoted by the symbol ∩. It defines a relation consisting of the set of all tuples that are in both A
and B. However, A and B must be union-compatible.
Syntax: A ∩ B
Table 1: Depositor
Cust_name Acc_number
Lara A-01
Harry A-02
Potter A-04
Smith A-10
Table 2: Borrower
Cust_name Loan_number
John L-05
Harry L-012
Potter L-07
Shiv L-11
Input:
π Cust_name (Borrower) ∩ π Cust_name (Depositor)
Output:
Cust_name
Harry
Potter
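For comparison, a rough SQL equivalent:
SELECT Cust_name FROM Borrower
INTERSECT
SELECT Cust_name FROM Depositor;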
Relational Calculus:
In a procedural language (e.g., relational algebra), query specification involves giving a step-by-step process of obtaining the query result;
this is difficult for non-experts.
In a non-procedural language (e.g., relational calculus), query specification involves giving the logical conditions the result is required to satisfy;
this is easy for non-experts.
Tuple Relational Calculus: A query is expressed as {t | P(t)}, the set of all tuples t for which the predicate P is true.
Example:
Table-1: Student
Resulting relation: [omitted]
Domain Relational Calculus:In domain relational calculus the records are filtered based on the domain of the
attributes and not based on the tuple values.
Domain relational calculus uses the same operators as tuple calculus. It uses logical connectives ∧ (and), ∨ (or) and ┓
(not).
It uses Existential (∃) and Universal Quantifiers (∀) to bind the variable.
A query is of the form {<x1, x2, ..., xn> | P(x1, x2, ..., xn)}, where x1, x2, ..., xn represent domain variables and P represents a formula
composed of atoms, as in tuple relational calculus.
Atomic formulas:
<x1, x2, ..., xn> ∈ r, where r is a relation on n attributes and x1, x2, ..., xn are domain values or domain constants.
x Θ c, where x is a domain variable, Θ is a comparison operator, and c is a constant in the domain of the attribute
for which x is a domain variable.
Example:
Output:
Student_name Student_age
Neha 33
Rahul 32
Join Operations
Join is a binary operation which allows you to combine join product and selection in one single statement. The goal of
creating a join condition is that it helps you to combine the data from multiple join tables.
1. Inner joins:
Theta join
Equi join
Natural Join
2. Outer joins: Left Outer Join, Right Outer Join, Full Outer Join
Inner join:This is the most common type of join and is similar to AND operation. It combines the results of one or more tables and
displays the results when all the filter conditions are met.
Returns records that have matching values in both tables.
Syntax:
Select *
FROM table1
INNER JOIN table2
ON table1.column_name = table2.column_name;
Theta join:A theta join is a join that links tables based on a relationship other than equality between two columns.
A theta join could use any operator other than the “equal” operator.
Basically it is also known as Generic join.
It is same as Equi join but it allows all other operators like >, <, >= etc..
Equi join:In equi join, tables are merged on the basis of common attributes. For whatever the Join type (Inner, outer
etc..), if we use only the equality operator(=) then we can say that the join is EQUI join.
Natural join:The join involves an Equality test, and thus is often described as an Equi-join.
The natural join will remove the duplicate attributes. No comparison operator is used in Natural Join.
Syntax:
Select *
FROM table1
NATURAL JOIN table2;
Outer join: All the joins given above are inner joins; an inner join keeps only the tuples satisfying the given condition or
matching attributes. In an outer join, all the tuples of a relation are kept in the resulting relation, based
on the type of join.
An Outer Join does not require each record in the two joined tables to have a matching record.
Syntax:
Select *
FROM table1,table2
WHERE conditions [ + ];
Left Outer join: Keeps all rows from the left-hand table; if there is no matching row in the right
table, it returns NULL values.
Syntax:
Select *
FROM table1
LEFT OUTER JOIN table2
ON table1.column_name = table2.column_name;
Right Outer join: Keeps all rows from the right-hand table; if there is no matching row in the left
table, it returns NULL values.
Syntax:
Select *
FROM table1
RIGHT OUTER JOIN table2
ON table1.column_name = table2.column_name;
Full Outer join: A full outer join combines the results of both the left outer join and the right outer join. It keeps rows from
both tables, returning a combined row from either table when the conditions are met and NULL
values when there is no match.
Syntax:
Select *
FROM table1
FULL OUTER JOIN table2
ON table1.column_name = table2.column_name;
Self join: A self join is a special form of equi join or inner join in which a table is joined against itself.
Syntax:
SELECT a.column_name, b.column_name ...
FROM table1 a, table1 b
WHERE a.column_field = b.column_field;
Cross join: The CROSS JOIN produces a result set which is the number of rows in the first table multiplied by the
number of rows in the second table, if no WHERE clause is used along with CROSS JOIN. This kind of result is called
as Cartesian Product.
Syntax:
Select *
FROM table1
CROSS JOIN table2;
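As a concrete sketch, joining the Student and Department tables from the Cartesian-product example (assuming S_dep holds Dep_id values):
SELECT s.S_name, d.Dep_name
FROM Student s
INNER JOIN Department d ON s.S_dep = d.Dep_id;
This pairs each student with his or her department name: (Sumit, CS), (Honey, Sales), (Harry, Mechanical).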
Section 5: Normalization
5.1 What is Normalization
Normalization is the process of organizing data in a database to minimize redundancy and undesirable anomalies. The normal forms are:
1NF A relation will be in 1NF if every attribute contains only atomic (single) values.
2NF A relation will be in 2NF if it is in 1NF and all non-key attributes are fully functionally
dependent on the primary key.
3NF A relation will be in 3NF if it is in 2NF and no non-prime attribute is transitively dependent on the primary key.
BCNF A relation will be in BCNF if it is in 3NF and, for every functional dependency X → Y, X is a super key.
4NF A relation will be in 4NF if it is in Boyce-Codd normal form and has no multi-valued
dependency.
5NF A relation is in 5NF if it is in 4NF, does not contain any join dependency, and joining
is lossless.
5.2 Functional Dependency
Suppose there is a relation R with two attributes X and Y, where the value of X uniquely determines the value of Y.
This relationship is written as:
X → Y
Here Y is functionally dependent on X;
X is the determinant set and
Y is the dependent attribute.
Example:-
We have <Country >table with two attributes cid and cname
cid cname
C01 India
C02 Pakistan
Therefore, the functional dependency between cid and cname can be written as follows; cname is functionally dependent
on cid:
cid -> cname
Example:- We are considering the same <Country> table to understand the concept of trivial dependency.
A functional dependency A → B is trivial if B is a subset of A; for example, {cid, cname} -> cid is a trivial dependency.
5.3 Inference Rules:
5.3.1 Reflexive Rule:
If B is a subset of A, then A determines B. This property is trivial.
If A ⊇ B then A → B
5.3.2 Augmentation Rule:
If A → B holds, then augmenting both sides with a set of attributes C preserves the dependency.
If A → B then AC → BC
5.3.3 Transitive Rule:
As with the transitive rule in algebra, if A → B holds and B → C holds, then
A → C also holds (A → B is read as "A functionally determines B").
If A → B and B → C then A → C
5.3.4 Additive/Union Rule:
If A → B holds and A → C holds, then A → BC also holds.
If A → B and A → C then A → BC
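A short worked derivation using these rules: suppose a relation R(A, B, C, D) has the dependencies A → B, A → C, and B → D. Then:
A → BC (union rule applied to A → B and A → C)
A → D (transitive rule applied to A → B and B → D)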
5.4 Normal Forms
First Normal Form (1NF):
It states that an attribute of a table cannot hold multiple values; it must hold only single-valued
attributes.
First normal form disallows multi-valued attributes, composite attributes, and their combinations.
[Example omitted: the decomposition of a Student table into 1NF.]
Second Normal Form (2NF):
Example:- Suppose a school wants to store the data of teachers and the subjects they teach. They
create a table that looks like this: since a teacher can teach more than one subject, the table can have
multiple rows for the same teacher.
The table is in 1 NF because each attribute has atomic values. However, it is not in 2NF because non prime
attribute Teacher_age is dependent on Teacher_id alone which is a proper subset of candidate key. This
violates the rule for 2NF as the rule says “no non-prime attribute is dependent on the proper subset of any
candidate key of the table”.
To make the table comply with 2NF, we can break it into two tables like this:
Teacher_id Teacher_age
101 38
102 38
103 40
Teacher_id Subject
101 Maths
101 Physics
102 Biology
103 Physics
103 Chemistry
Now the tables comply with Second normal form (2NF).
Third Normal Form (3NF):
A relation is in 3NF if it is in 2NF and has no transitive functional dependency. By transitive functional dependency, we mean the following
relationship in the table: A functionally determines B, and B functionally determines C; in this case, C is transitively dependent on A via B.
Example:
In the table above, [Book ID] determines [Genre ID], and [Genre ID] determines [Genre Type]. Therefore, [Book ID] determines [Genre
Type] via [Genre ID]; we have a transitive functional dependency, and this structure does not satisfy third normal form.
To bring this table to third normal form, we split the table into two as follows:
Now all non-key attributes are fully functionally dependent only on the primary key. In [TABLE_BOOK], both [Genre ID] and [Price] are
only dependent on [Book ID]. In [TABLE_GENRE], [Genre Type] is only dependent on [Genre ID].
Fourth Normal Form (4NF):
A relation is in 4NF if it is in BCNF and has no multi-valued dependency. For a dependency A → B, if for a single value of A multiple
values of B exist, then the relation has a multi-valued dependency.
STUDENT
STU_ID COURSE HOBBY
21 Computer Dancing
21 Math Singing
34 Chemistry Dancing
74 Biology Cricket
59 Physics Hockey
The given STUDENT table is in 3NF, but COURSE and HOBBY are two independent entities. Hence, there is no relationship
between COURSE and HOBBY.
So to make the above table into 4NF, we can decompose it into two tables:
STUDENT_COURSE
STU_ID COURSE
21 Computer
21 Math
34 Chemistry
74 Biology
59 Physics
STUDENT_HOBBY
STU_ID HOBBY
21 Dancing
21 Singing
34 Dancing
74 Cricket
59 Hockey
Fifth Normal Form (5NF):
Example:
SUBJECT LECTURER SEMESTER
Computer Anshika sem1
Computer John sem1
Math John sem1
Chemistry Praveen sem1
Math Akash sem2
In the above table, John takes both Computer and Math classes for Semester 1, but he doesn't take the Math class for Semester 2. In this
case, a combination of all these fields is required to identify valid data.
Suppose we add a new Semester as sem3 but do not know about the subject and who will be taking that subject so we
leave Lecturer and Subject as NULL. But all three columns together act as the primary key, so we can't leave the other two
columns blank.
So to make the above table into 5NF, we can decompose it into three relations P1, P2 & P3:
P1
SEMESTER SUBJECT
sem1 Computer
sem1 Math
sem1 Chemistry
sem2 Math
P2
SUBJECT LECTURER
Computer Anshika
Computer John
Math John
Math Akash
Chemistry Praveen
P3
SEMESTER LECTURER
sem1 Anshika
sem1 John
sem2 Akash
sem1 Praveen
5.5 Relational Decomposition:
1. Lossless decomposition-
Lossless decomposition ensures that no information is lost: natural-joining the decomposed relations reproduces the original relation, e.g.,
Employee ⋈ Department reconstructs the original relation.
2. Dependency Preserving-
Dependency is an important constraint on the database.
Every dependency must be satisfied by at least one decomposed table.
This property allows checking updates without computing the natural join of the database structures.
If we decompose a relation R into relations R1 and R2, All dependencies of R either must be a part of R1 or R2 or
must be derivable from combination of FD’s of R1 and R2.
For Example, A relation R (A, B, C, D) with FD set{A->BC} is decomposed into R1(ABC) and R2(AD) which is
dependency preserving because FD A->BC is a part of R1(ABC).
5.6 Multivalued Dependency:
Example:-
Suppose there is a bike manufacturing company which produces two colors (red and blue) of each model every year.
Here the columns BIKE_COLOR and MANUF_YEAR are dependent on BIKE_MODEL and independent of each other.
In this case, these two columns are said to be multivalued dependent on BIKE_MODEL. These
dependencies are written as:
BIKE_MODEL →→ BIKE_COLOR
BIKE_MODEL →→ MANUF_YEAR
5.7 Join Dependency:
A join dependency exists if a relation can be recreated by joining its projections. Consider a <Country> relation with attributes Country_id, Country_name, and Country_population. It can be decomposed into the following three tables; therefore it is not in 5NF:
<Country_name>
Country_id Country_name
C01 India
C02 Pakistan
C03 Afganistan
<Country_population>
Country_id Country_population
C01 3000
C02 4000
C03 4087
<Namepopulation>
Country_name Country_population
India 3000
Pakistan 4000
Afganistan 4087
Inclusion Dependency:
An Inclusion Dependency is a statement of the form that some columns of a relation
are contained in other columns. A foreign key constraint is an example of inclusion dependency.
The foreign key column(s) of the referring relation must be contained in the primary key column(s) of the referenced relation.
In inclusion dependency, we should not split groups of attributes that participate in an inclusion
dependency.
In practice, most inclusion dependencies are key-based, that is, they involve only keys.
Example: State.Country_id ⊆ Country.Country_id (every Country_id value appearing in State must also appear in Country).
SECTION 6: TRANSACTIONS:
A transaction is a set of logically related operations executed as a single unit of work. Consider transferring 100 units from account A to account B:
Read (A)
A=A-100
Write (A)
Read(B) → If the transaction fails here, then the system will be inconsistent, as 100 units have been debited from
account A but not added to account B.
B = B + 100
Write (B)
To remove this partial-execution problem, we increase the level of atomicity and bundle all the
instructions of a logical operation into a unit called a transaction.
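As a sketch, the same transfer written as a single SQL transaction (the Account table and column names are assumptions; BEGIN/START syntax varies by DBMS):
BEGIN;
UPDATE Account SET balance = balance - 100 WHERE acc_no = 'A';
UPDATE Account SET balance = balance + 100 WHERE acc_no = 'B';
COMMIT;
If either UPDATE fails, issuing ROLLBACK undoes the partial work, so account A is never debited without account B being credited.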
Transaction Operations:
Following are the main operations of transaction:
Read(X): Read operation is used to read the value of X from the database and stores it in a buffer in
main memory.
Write(X): Write operation is used to write the value back to the database from the buffer.
1. Atomicity
The atomicity property of a transaction requires that all operations of a transaction be completed, if not, the
transaction is aborted. A transaction is treated as single, individual logical unit of work.
"A" stands for atomicity it states that either all the instructions participating in a transaction will execute
or none. Atomicity is guaranteed by transaction management component.
2. Consistency
"C" stands for consistency it states that if a database is consistent before the execution of a transaction that if
must remains consistent after execution of a transaction.
If the transaction fails, the database must be returned to the state it was in prior to the execution of the failed
transaction.
Note: If atomicity, isolation, durability holds well then consistency holds well automatically.
3. Isolation
Isolation property of a transaction means that the data used during the execution of a transaction cannot be
used by a second transaction until the first one is completed.
Isolation means that whether a transaction runs in isolation or concurrently with other transactions, the result must be the
same. The concurrency control component takes care of isolation.
4. Durability
When a transaction is completed, the database reaches a consistent state and that state cannot be lost, even in
the event of system's failure.
Durability means that the work done by a successful transaction must remain in the system. Even in case of any
hardware or software failure.
Note: Recovery management component takes care of durability.
[Diagram: transaction state transitions. Active → Partially Committed → Committed; on system failure or abort, Active/Partially Committed → Failed → Aborted, with all changes rolled back.]
Active state
o The active state is the first state of every transaction. In this state, the transaction is being
executed.
o For example: Insertion or deletion or updating a record is done here. But all the records are still
not saved to the database.
Partially committed
o In the partially committed state, a transaction executes its final operation, but the data is still not
saved to the database.
o In the total mark calculation example, a final display of the total marks step is executed in this
state.
Committed
A transaction is said to be in a committed state if it executes all its operations successfully. In this state,
all the effects are now permanently saved on the database system.
Failed state
o If any of the checks made by the database recovery system fails, then the transaction is said to be
in the failed state.
o In the example of total mark calculation, if the database is not able to fire a query to fetch the
marks, then the transaction will fail to execute.
Aborted
o If any of the checks fail and the transaction has reached a failed state then the database recovery
system will make sure that the database is in its previous consistent state. If not then it will abort
or roll back the transaction to bring the database into a consistent state.
o If the transaction fails in the middle, then all the operations executed so far are
rolled back to restore the previous consistent state.
o After aborting the transaction, the database recovery module will select one of the two operations:
1. Re-start the transaction
2. Kill the transaction
Types of Transaction:
Based on Actions
Two-step
Restricted
Action model
Based on Structure
Flat or simple transactions: A flat transaction consists of a sequence of primitive operations executed between begin
and end operations.
Nested transactions: A transaction that contains other transactions.
Workflow
Schedule:
A schedule is the chronological order in which the operations of multiple transactions are executed. Types of schedule:
1. Serial Schedule
2. Non-serial Schedule
3. Serializable Schedule
1.Serial Schedule
The serial schedule is a type of schedule where one transaction is executed completely before starting
another transaction. In the serial schedule, when the first transaction completes its cycle, then the
next transaction is executed.
For example: Suppose there are two transactions T1 and T2 which have some operations. If it has no
interleaving of operations, then there are the following two possible outcomes:
Execute all the operations of T1 followed by all the operations of T2.
Execute all the operations of T2 followed by all the operations of T1.
In the given (a) figure, Schedule A shows the serial schedule where T1 followed by T2.
In the given (b) figure, Schedule B shows the serial schedule where T2 followed by T1.
2. Non-serial Schedule
If interleaving of operations is allowed, then the schedule is non-serial.
It contains many possible orders in which the system can execute the individual operations of the
transactions.
In the given figure (c) and (d), Schedule C and Schedule D are the non-serial schedules. It has
interleaving of operations.
3. Serializable schedule
The serializability of schedules is used to find non-serial schedules that allow the transaction to
execute concurrently without interfering with one another.
It identifies which schedules are correct when executions of the transaction have interleaving of their
operations.
A non-serial schedule will be serializable if its result is equal to the result of its transactions executed
serially.
Figures to explain these schedules:
a) Schedule A (serial: T1 followed by T2)
T1: Read (A); A := A - N; Write (A); Read (B); B := B + N; Write (B);
T2: Read (A); A := A + M; Write (A);
b) Schedule B (serial: T2 followed by T1)
T2: Read (A); A := A + M; Write (A);
T1: Read (A); A := A - N; Write (A); Read (B); B := B + N; Write (B);
c) Schedule C (non-serial, interleaved)
T1: Read (A); A := A - N;
T2: Read (A); A := A + M;
T1: Write (A); Read (B);
T2: Write (A);
T1: B := B + N; Write (B);
d) Schedule D (non-serial, interleaved)
T1: Read (A); A := A - N; Write (A);
T2: Read (A); A := A + M; Write (A);
T1: Read (B); B := B + N; Write (B);
Here,
Schedule A and Schedule B are serial schedules.
Schedule C and Schedule D are non-serial schedules.
Serializability is of two types:
1. Conflict serializability
2. View serializability
When multiple transactions are running concurrently, there is a possibility that the
database may be left in an inconsistent state. Serializability is the concept that helps us check
which schedules are serializable: a serializable schedule is one that always leaves the database in a
consistent state.
Conflict Serializability
It is one of the types of serializability, used to check whether a non-serial schedule is conflict
serializable or not.
A schedule is called conflict serializable if we can convert it into a serial schedule after swapping its non-
conflicting operations.
Conflicting operations:-
Two operations are said to be in conflict if they satisfy all the following three conditions:
1. They belong to different transactions.
2. They operate on the same data item.
3. At least one of them is a write operation.
Example:
Operation W(X) of transaction T1 and operation R(X) of transaction T2 are conflicting operations,
because they satisfy all three conditions: they belong to different transactions,
they work on the same data item, and one of the operations is a write operation.
By contrast, two read operations on the same data item, such as R(X) of T1 and R(X) of T2, are non-conflicting.
Two schedules are said to be conflict Equivalent if one schedule can be converted into other schedule after
swapping non-conflicting operations.
Let's check whether a schedule is conflict serializable or not: if a schedule is conflict equivalent to a serial
schedule, then it is a conflict serializable schedule. Let's take a few examples of schedules.
View Serializability
View serializability is a process to find out whether a given schedule is view serializable or not.
o A schedule will be view serializable if it is view equivalent to a serial schedule.
o If a schedule is conflict serializable, then it will also be view serializable.
o A view serializable schedule which is not conflict serializable contains blind writes.
View Equivalent
Two schedules S1 and S2 are said to be view equivalent if they satisfy the following conditions:
1. Initial Read
An initial read of both schedules must be the same. Suppose two schedule S1 and S2. In schedule S1, if a
transaction T1 is reading the data item A, then in S2, transaction T1 should also read A.
Schedule s1:
T1: Read (A)
T2: Write (A)
Schedule s2:
T2: Write (A)
T1: Read (A)
The above two schedules are view equivalent because the initial read operation in S1 is done by T1, and in S2 it is
also done by T1.
2. Updated Read
In schedule S1, if Ti is reading A which is updated by Tj then in S2 also, Ti should read A which is updated
by Tj.
Schedule S1:
T1: Write (A)
T2: Write (A)
T3: Read (A)
Schedule S2:
T2: Write (A)
T1: Write (A)
T3: Read (A)
The above two schedules are not view equivalent because, in S1, T3 is reading A updated by T2, while in S2, T3 is
reading A updated by T1.
3. Final Write
A final write must be the same between both the schedules. In schedule S1, if a transaction T1 updates A
at last then in S2, final writes operations should also be done by T1.
Schedule S1:
T1: Write (A)
T2: Read (A)
T3: Write (A)
Schedule S2:
T2: Read (A)
T1: Write (A)
T3: Write (A)
The above two schedules are view equivalent with respect to the final write, because the final write operation in S1 is done by T3 and in S2 the final
write operation is also done by T3.
Example:
T1: Read (A)
T2: Write (A)
T1: Write (A)
T3: Write (A)
Schedule S
With 3 transactions, the total number of possible serial schedules = 3! = 6:
S1 = <T1 T2 T3>
S2 = <T1 T3 T2>
S3 = <T2 T3 T1>
S4 = <T2 T1 T3>
S5 = <T3 T1 T2>
S6 = <T3 T2 T1>
Taking the first serial schedule S1:
T1: Read (A); Write (A)
T2: Write (A)
T3: Write (A)
Schedule S1
In both schedules S and S1, there is no read except the initial read that's why we don't need to check that
condition.
The initial read operation in S is done by T1 and in S1, it is also done by T1.
The final write operation in S is done by T3 and in S1, it is also done by T3. So, S and S1 are view
Equivalent.
The first schedule S1 satisfies all three conditions, so we don't need to check another schedule.
Hence, schedule S is view serializable, and it is view equivalent to the serial schedule S1: T1 → T2 → T3.
Recoverability:
Sometimes a transaction may not execute completely due to a software
issue, system crash, or hardware failure. In that case, the failed transaction has to be rolled back. But some
other transaction may also have used a value produced by the failed transaction.
We need to address the effect of transaction failures on concurrently running transactions.
Recoverable schedule:- If a transaction Tj reads a data item previously written by a transaction Ti, then the
commit operation of Ti must appear before the commit operation of Tj.
T1: Read (A); Write (A)
T2: Read (A)
T1: Read (B)
The above schedule is not recoverable if T2 commits immediately after its read.
If T1 should abort, T2 would have read (and possibly shown to the user) an inconsistent database state. Hence
database must ensure that schedules are recoverable.
T1 T2 T3
Read (A)
Read (B)
Write (A)
Read (A)
Write (A)
Read (A)
If T1 fails, T2 and T3 must also be rolled back (a cascading rollback).
Cascadeless schedules: schedules in which cascading rollbacks cannot occur. For each pair of transactions Ti and Tj such that Tj reads a data item previously written by Ti, the commit operation of Ti appears before the read operation of Tj.
Failure Classification
To find that where the problem has occurred, we generalize a failure into the following categories:
1. Transaction failure
2. System crash
3. Disk failure
1. Transaction failure
A transaction failure occurs when a transaction fails to execute or reaches a point from which it cannot proceed any further. If only a few transactions or processes are affected, it is called a transaction failure.
1. Logical errors: a transaction cannot complete due to a code error or an internal
error condition.
2. System errors: the DBMS itself terminates an active transaction because the
database system is unable to execute it. For example, the system aborts an active
transaction in case of deadlock or resource unavailability.
2. System Crash
o System failure can occur due to power failure or other hardware or software
failure. Example: Operating system error.
3. Disk Failure
o It occurs when hard-disk drives or storage drives fail. This was a common
problem in the early days of technology evolution.
o Disk failure occurs due to the formation of bad sectors, a disk head crash, unreachability
of the disk, or any other failure that destroys all or part of disk storage.
Concurrency control
In a multiprogramming environment where multiple transactions can be executed simultaneously, it
is highly important to control the concurrency of transactions. We have concurrency control protocols
to ensure atomicity, isolation, and serializability of concurrent transactions.
Concurrency control protocols can be broadly divided into two categories −
Lock-based Protocols
Database systems equipped with lock-based protocols use a mechanism by which any transaction
cannot read or write data until it acquires an appropriate lock on it. Locks are of two kinds −
Binary Locks − A lock on a data item can be in two states; it is either locked or unlocked.
Shared/exclusive − This type of locking mechanism differentiates the locks based on their uses. If a lock is
acquired on a data item to perform a write operation, it is an exclusive lock. Allowing more than one
transaction to write on the same data item would lead the database into an inconsistent state. Read locks
are shared because no data value is being changed.
There are four types of lock protocols available −
Simplistic Lock Protocol
Simplistic lock-based protocols allow transactions to obtain a lock on every object before a 'write'
operation is performed. Transactions may unlock the data item after completing the ‘write’ operation.
Pre-claiming Lock Protocol
Pre-claiming protocols evaluate their operations and create a list of data items on which they need
locks. Before initiating an execution, the transaction requests the system for all the locks it needs
beforehand. If all the locks are granted, the transaction executes and releases all the locks when all
its operations are over. If all the locks are not granted, the transaction rolls back and waits until all
the locks are granted.
Two-Phase Locking (2PL)
Two-phase locking has two phases: a growing phase, in which the transaction acquires all its
locks, and a shrinking phase, in which the locks held by the transaction are released.
To claim an exclusive (write) lock, a transaction must first acquire a shared (read) lock and then
upgrade it to an exclusive lock.
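A minimal sketch (my own, not from the text) of the two-phase rule itself: once a transaction releases any lock, it may never acquire another. The class and exception names are illustrative.

```python
class TwoPhaseLockingError(Exception):
    pass

class Transaction2PL:
    """Toy transaction that enforces the two-phase rule: the first unlock
    starts the shrinking phase, after which no new lock may be acquired."""
    def __init__(self, name):
        self.name = name
        self.held = set()
        self.shrinking = False

    def lock(self, item):
        if self.shrinking:
            raise TwoPhaseLockingError("cannot acquire a lock in the shrinking phase")
        self.held.add(item)          # growing phase

    def unlock(self, item):
        self.shrinking = True        # shrinking phase begins
        self.held.discard(item)

t = Transaction2PL("T1")
t.lock("A"); t.lock("B")   # growing phase
t.unlock("A")              # shrinking phase begins
# t.lock("C")              # would raise TwoPhaseLockingError
```

Strict 2PL, described next, simply delays every unlock until the commit point.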
Strict Two-Phase Locking
The first phase of Strict-2PL is same as 2PL. After acquiring all the locks in the first phase, the
transaction continues to execute normally. But in contrast to 2PL, Strict-2PL does not release a lock
after using it. Strict-2PL holds all the locks until the commit point and releases all the locks at a time.
Strict-2PL does not have cascading abort as 2PL does.
Timestamp-based Protocols
The most commonly used concurrency protocol is the timestamp based protocol. This protocol uses
either system time or logical counter as a timestamp.
Lock-based protocols manage the order between the conflicting pairs among transactions at the time
of execution, whereas timestamp-based protocols start working as soon as a transaction is created.
Every transaction has a timestamp associated with it, and the ordering is determined by the age of
the transaction. A transaction created at 0002 clock time would be older than all other transactions
that come after it. For example, any transaction 'y' entering the system at 0004 is two seconds
younger and the priority would be given to the older one.
In addition, every data item is given the latest read and write-timestamp. This lets the system know
when the last ‘read and write’ operation was performed on the data item.
Timestamp ordering rules can be modified to make the schedule view serializable: instead of rolling back the transaction Ti, the obsolete 'write' operation itself is ignored (this is known as Thomas' write rule).
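A minimal sketch (my own, not from the text) of the basic timestamp-ordering checks, with a flag for Thomas' write rule. The names and the per-item timestamp table are illustrative.

```python
class Abort(Exception):
    pass

# Per-item timestamps: item -> [read_ts, write_ts], both initially 0.
ts = {"X": [0, 0]}

def read(item, tx_ts):
    r, w = ts[item]
    if tx_ts < w:                 # T would read a value written "in its future"
        raise Abort
    ts[item][0] = max(r, tx_ts)   # record the latest read timestamp

def write(item, tx_ts, thomas=True):
    r, w = ts[item]
    if tx_ts < r:                 # a younger transaction already read the item
        raise Abort
    if tx_ts < w:                 # an obsolete write
        if thomas:
            return                # Thomas' write rule: ignore instead of abort
        raise Abort
    ts[item][1] = tx_ts

write("X", tx_ts=5)
write("X", tx_ts=3)               # older write arrives late: silently ignored
print(ts["X"])                    # [0, 5]
```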
Deadlock in DBMS
A deadlock is a condition where two or more transactions are waiting indefinitely for one another to give
up locks. Deadlock is said to be one of the most feared complications in DBMS as no task ever gets
finished and is in waiting state forever.
For example: In the student table, transaction T1 holds a lock on some rows and needs to update some
rows in the grade table. Simultaneously, transaction T2 holds locks on some rows in the grade table and
needs to update the rows in the Student table held by Transaction T1.
Now, the main problem arises. Now Transaction T1 is waiting for T2 to release its lock and similarly,
transaction T2 is waiting for T1 to release its lock. All activities come to a halt state and remain at a
standstill. It will remain in a standstill until the DBMS detects the deadlock and aborts one of the
transactions.
Deadlock Avoidance
o When a database gets stuck in a deadlock state, it is better to avoid the deadlock in the first place than to abort or restart transactions afterwards, since that wastes time and resources.
o A deadlock avoidance mechanism is used to detect any deadlock situation in advance. A method like
the "wait-for graph" is used for detecting deadlock situations, but this method is suitable only for
smaller databases. For larger databases, the deadlock prevention method can be used.
Deadlock Detection
In a database, when a transaction waits indefinitely to obtain a lock, the DBMS should detect
whether the transaction is involved in a deadlock or not. The lock manager maintains a wait-for graph
to detect deadlock cycles in the database.
Wait for Graph
o This is a suitable method for deadlock detection. A graph is created based on the
transactions and their locks. If the created graph contains a cycle (closed loop), then there is a deadlock.
o The wait-for graph is maintained by the system for every transaction that is waiting for
data held by another. The system keeps checking the graph for cycles.
For the scenario above, the wait-for graph would contain an edge from T1 to T2 and an edge from T2 to T1, which forms a cycle; a detection sketch follows.
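A minimal sketch (my own, not from the text) of cycle detection on a wait-for graph, using a standard depth-first search with three node colors. The function name and graph encoding are illustrative.

```python
def has_deadlock(wait_for):
    """wait_for maps a transaction to the transactions it is waiting on.
    A cycle in this graph means a deadlock."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {t: WHITE for t in wait_for}
    def visit(t):
        color[t] = GRAY                          # on the current DFS path
        for u in wait_for.get(t, ()):
            if color.get(u, WHITE) == GRAY:      # back edge: cycle found
                return True
            if color.get(u, WHITE) == WHITE and visit(u):
                return True
        color[t] = BLACK                         # fully explored
        return False
    return any(color[t] == WHITE and visit(t) for t in list(color))

# T1 waits for T2 and T2 waits for T1: the classic deadlock above.
print(has_deadlock({"T1": ["T2"], "T2": ["T1"]}))   # True
print(has_deadlock({"T1": ["T2"], "T2": []}))       # False
```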
Deadlock Prevention
o Deadlock prevention method is suitable for a large database. If the resources are allocated in such
a way that deadlock never occurs, then the deadlock can be prevented.
o The database management system analyzes the operations of a transaction to determine whether
they can create a deadlock situation. If they can, the DBMS never allows that transaction to be
executed.
Wait-Die scheme
In this scheme, if a transaction requests a resource that is already held with a conflicting lock by
another transaction, the DBMS simply checks the timestamps of both transactions and allows the older
transaction to wait until the resource is available.
Let's assume there are two transactions Ti and Tj, and let TS(T) be the timestamp of any transaction T. Suppose Tj holds a lock on a resource and Ti requests it. The DBMS then performs one of the following actions:
1. If TS(Ti) < TS(Tj), i.e., Ti is the older transaction, then Ti is allowed to wait until the data item is
available: an older transaction waiting for a resource locked by a younger one simply waits.
2. If TS(Ti) > TS(Tj), i.e., Ti is the younger transaction, then Ti is killed (it "dies") and is restarted
later after a random delay, but with the same timestamp.
Wound wait scheme
o In the wound-wait scheme, if an older transaction requests a resource held by a younger
transaction, the older transaction wounds the younger one: it forces the younger transaction to abort
and release the resource. After a small delay, the younger transaction is restarted with the same
timestamp.
o If a younger transaction requests a resource held by an older transaction, the younger transaction
is asked to wait until the older one releases it. A decision-function sketch for both schemes follows.
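A minimal sketch (my own, not from the text) comparing the two decisions; smaller timestamp means older. The function names are illustrative.

```python
def wait_die(requester_ts, holder_ts):
    # Older requester may wait; younger requester dies.
    return "wait" if requester_ts < holder_ts else "die (restart with same timestamp)"

def wound_wait(requester_ts, holder_ts):
    # Older requester wounds (aborts) the younger holder; younger requester waits.
    return "wound holder" if requester_ts < holder_ts else "wait"

print(wait_die(1, 2))     # older requester  -> wait
print(wait_die(2, 1))     # younger requester -> die
print(wound_wait(1, 2))   # older requester  -> wound holder
print(wound_wait(2, 1))   # younger requester -> wait
```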
Starvation:-
Starvation (or livelock) is the situation in which a transaction has to wait an indefinite period of time to acquire a
lock.
Reasons for starvation:
The waiting scheme for locked items is unfair (e.g., a priority queue).
Victim selection (the same transaction is repeatedly selected as the victim).
Resource leaks.
Denial-of-service attacks.
Problem: Starvation is best explained with an example. Suppose there are three transactions T1,
T2, and T3 in a database trying to acquire a lock on data item 'I'. The scheduler grants the lock to
T1 (perhaps due to some priority), and the other two transactions wait. As soon as the execution of T1 is
over, another transaction T4 arrives and requests a lock on data item I. This time the scheduler grants the lock to
T4, and T2 and T3 have to wait again. If new transactions keep requesting the lock in this way, T2 and T3 may have
to wait for an indefinite period of time, which leads to starvation.
Solution of Starvation:
1. Increasing priority –
Starvation occurs when a transaction has to wait an indefinite time, so we can increase the priority
of that particular transaction. The drawback of this solution is that other transactions may have to
wait even longer until the highest-priority transaction completes.
2. Modifying victim selection –
If a transaction has been a victim of repeated selections, the algorithm can be modified by raising
its priority over other transactions.
3. Fair scheduling –
A fair scheduling approach, i.e., FCFS (first come, first served), can be adopted, in which transactions
acquire a lock on an item in the order in which they requested it.
4. Timestamp-based schemes –
Wait-die and wound-wait, described above, use the timestamp ordering mechanism of transactions.
Section 7: INDEXING & HASHING
INDEXING:
We know that data is stored in the form of records. Every record has a key field, which helps it to be
recognized uniquely.
An index is a physical structure that contains pointers to the data.
Indexing is a way to optimize the performance of a database by minimizing the number of disk accesses required when a
query is processed.
It is a data structure technique which is used to quickly locate and access the data in a database.
Indexing in database systems is similar to what we see in books.
When a database is very large, even the smallest transaction takes time to perform. Indexes are used to reduce the
time spent in such transactions.
Users cannot see the indexes; they are just used to speed up queries. Effective indexes are one of the best ways to improve
performance in a database application.
Advantages of Indexing:
They make it possible to quickly retrieve (fetch) data.
They can be used for sorting.
Their use in queries usually results in much better performance.
Unique indexes guarantee uniquely identifiable records in the database.
Disadvantages of Indexing:
They decrease performance on inserts, updates, and deletes.
They take up space (this increases with the number of fields used and the length of the fields).
To perform indexing, the database management system needs a primary key on the table with a unique value.
You are not allowed to partition an index-organized table.
You can't build other indexes on already indexed data.
You can't sort data in the leaf nodes, as the value of the primary key determines the order.
Types of Indexing:
(Figure: overall structure of indexing.)
Ordered Indices:
The indices are usually sorted to make searching faster. The indices which are sorted are known as ordered
indices. These are generally fast and a more traditional type of storing mechanism.
Imagine we have a student table with thousands of records, each 10 bytes long, with IDs starting
from 1, 2, 3, and so on, and we have to search for the student with ID 572. In a normal database with no
index, the disk blocks are searched from the beginning until record 572 is reached, so the DBMS will reach this record only after
reading 571*10 = 5710 bytes. But if we have an index on the ID column, then the address of each record is
stored in the index as (1, 200), (2, 201), … (572, 771), and so on; one can imagine it as a smaller table with
an index column and an address column. Now, searching for record ID 572 through the index
traverses only 571*2 = 1142 bytes, which is far less than before, and the entry found there points directly to the record on disk. Hence
retrieving the record from the disk becomes faster. In most cases these indexes are kept sorted to
make searching faster; sorted indexes are called ordered indices.
Primary Index:
If the index is created on the primary key of the table, it is called primary indexing. Since primary
keys are unique to each record and have a 1:1 relation with the records, it is much easier to fetch a
record using them.
As primary keys are stored in sorted order, the performance of the searching operation is quite efficient.
The primary index will be categorized into 2 parts: Dense index and Sparse index.
Dense index:
The dense index contains an index record for every search-key value in the data file, which makes searching
faster.
In this approach, the number of records in the index table is the same as the number of records in the main table,
so it needs more space to store the index itself. Each index record holds the search-key value and a pointer to the
actual record on the disk.
Sparse Index:
In a sparse index, index records appear only for a few items in the data file; each entry points to a block.
Instead of pointing to every record in the main table, the index points to records in the main table with gaps.
In this method of indexing, a range of key values shares the same data-block address, and when data is to be retrieved,
the block is scanned linearly until the requested data is found.
Example:
As you can see, the data has been divided into several blocks, each containing a fixed number of records (in our case, 10).
The pointer in the index table points to the first record of each data block, which is known as the anchor record.
If you are searching for roll 14, the index is first searched to find the highest entry that is smaller than or equal to 14; we
get 11. The pointer leads us to roll 11, from where a short sequential search finds roll 14. A lookup sketch follows.
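A minimal sketch (my own, not from the text) of that anchor-record lookup, using a binary search over the sparse entries; the index contents are illustrative stand-ins for the figure.

```python
import bisect

# Sparse index: one (anchor_key, block_no) entry per block of the sorted data file.
index = [(1, 0), (11, 1), (21, 2), (31, 3)]   # anchor record of each 10-record block

def locate_block(key):
    """Find the block whose anchor key is the highest one <= key; a short
    sequential scan inside that block then finds the record itself."""
    keys = [k for k, _ in index]
    pos = bisect.bisect_right(keys, key) - 1
    return index[pos][1]

print(locate_block(14))   # block 1 (anchor 11), matching the roll-14 example
```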
Clustering Index:
In some cases, the index is created on non-primary-key columns, which may not be unique for each record. In such cases,
in order to identify the records faster, we group two or more columns together to get unique values and create an
index out of them.
Example: suppose a company has several employees in each department. If we use a clustering index,
all employees belonging to the same Dept_ID are considered to be within a single cluster, and the index pointers point to
the cluster as a whole. Here Dept_ID is a non-unique key.
The previous scheme is a little confusing because one disk block is shared by records belonging to different clusters.
Using a separate disk block for each cluster is a better technique.
Secondary Index:
In sparse indexing, as the size of the table grows, the size of the mapping also grows. These mappings are usually kept in
primary memory so that address lookup is fast; the secondary memory is then searched for the actual data based
on the address obtained from the mapping. If the mapping grows too large, fetching the address itself becomes slow,
and the sparse index is no longer efficient. To overcome this problem, secondary indexing is introduced.
A non-clustered (secondary) index just tells us where the data lies, i.e., it gives us a list of virtual pointers or references to
the locations where the data is actually stored.
Data is not physically stored in the order of the index; instead, data pointers are present in the leaf nodes.
Example:
Multilevel Indexing:
Multilevel indexing is used when a primary index does not fit in memory.
To reduce the number of disk accesses to index records, the primary index is kept on disk as a sequential file
and a sparse index is constructed on that file.
A multi-level index breaks the index down into several smaller indices, making the outermost
level so small that it can be saved in a single disk block, which can easily be accommodated anywhere in
main memory.
HASHING
In DBMS, hashing is a technique to directly search the location of desired data on the disk without using index
structure.
Data is stored in the form of data blocks whose addresses are generated by applying a hash function; the
memory location where these records are stored is known as a data block or data bucket.
Hashing uses hash functions with search keys as parameters to generate the address of a data record.
Here, are the situations in the DBMS where you need to apply the Hashing method:
For a huge database structure, it is impractical to search through all the index levels and then
reach the destination data block to get the desired data.
Hashing method is used to index and retrieve items in a database as it is faster to search that specific item
using the shorter hashed key instead of using its original value.
Hashing is an ideal method to calculate the direct location of a data record on the disk without using index
structure.
It is also a helpful technique for implementing dictionaries.
Data bucket – Data buckets are memory locations where the records are stored. It is also known as Unit Of
Storage.
Key: A DBMS key is an attribute or set of an attribute which helps you to identify a row(tuple) in a
relation(table). This allows you to find the relationship between two tables.
Hash function: a hash function h is a mapping function that maps the set of all search keys K to the addresses
where the actual records are placed.
Linear probing – linear probing uses a fixed interval between probes. In this method, the next available data
block is used to enter the new record, instead of overwriting the older record.
Quadratic probing – it helps determine the new bucket address by adding intervals between
probes, using the successive outputs of a quadratic polynomial added to the starting value given by the original
computation.
Hash index – the address of the data block. A hash function can range from a simple mathematical function to
a complex mathematical function.
Double hashing – double hashing is a computer programming method used in hash tables to resolve
hash collisions.
Bucket overflow: the condition of bucket overflow is called a collision. This is a fatal state for any static hash
function.
There are two types of hashing techniques:
1. Static hashing
2. Dynamic hashing
Static Hashing
In the static hashing, the resultant data bucket address will always remain the same.
Therefore, if you generate an address for say Student_ID = 10 using hashing function mod(3), the resultant
bucket address will always be 1. So, you will not see any change in the bucket address.
Therefore, in this static hashing method, the number of data buckets in memory always remains constant.
Inserting a record: when a new record needs to be inserted into the table, an address is generated for
the new record using its hash key, and the record is stored at that
location.
Searching: When you need to retrieve the record, the same hash function should be helpful to retrieve the
address of the bucket where data should be stored.
Delete a record: using the hash function, you can first fetch the record you want to delete, then
remove the record from that address in memory.
Update a record: to update a record, we first search for it using the hash function, and then the data record is
updated.
Operation
Insertion − When a record is required to be entered using static hash, the hash function h computes
the bucket address for search key K, where the record will be stored.
Bucket address = h (K)
Search − When a record needs to be retrieved, the same hash function can be used to retrieve the
address of the bucket where the data is stored.
Delete − This is simply a search followed by a deletion operation.
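A minimal sketch (my own, not from the text) of these operations with the mod(3) hash from the example above; the bucket layout and names are illustrative.

```python
buckets = {i: [] for i in range(3)}   # fixed number of buckets in static hashing

def h(key):
    return key % 3                    # the mod(3) hash function from the text

def insert(key, record):
    buckets[h(key)].append((key, record))

def search(key):
    return [r for k, r in buckets[h(key)] if k == key]

def delete(key):
    buckets[h(key)] = [(k, r) for k, r in buckets[h(key)] if k != key]

insert(10, "student-10")
print(h(10), search(10))   # Student_ID = 10 always hashes to bucket 1
```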
Bucket overflow in static hashing is handled in one of two ways:
1. Open hashing
2. Close hashing
Open Hashing
In the open hashing method, instead of overwriting the older record, the next available data block is used to enter the
new record. This method is also known as linear probing.
Example: suppose R3 is a new record that needs to be inserted, and the hash function generates address
112 for R3. But the generated address is already full, so the system searches the next available data bucket,
113, and assigns R3 to it.
Close Hashing
In the close hashing method, when a bucket is full, a new bucket is allocated for the same hash result and is
linked after the previous one.
Example: suppose R3 is a new record that needs to be inserted into the table, and the hash function
generates address 110 for it. But this bucket is full. In this case, a new bucket is
allocated at the end of bucket 110 and is linked to it. A sketch of both overflow strategies follows.
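A minimal sketch (my own, not from the text) contrasting the two strategies, following this text's usage of the terms: open hashing probes the next free slot, close hashing chains an overflow list. Names and the table size are illustrative.

```python
SIZE = 5
table = [None] * SIZE                # open hashing: one flat array of slots

def insert_open(key, record):
    # Linear probing: if the home slot is taken, try the next slot, and so on.
    for step in range(SIZE):
        slot = (key + step) % SIZE
        if table[slot] is None:
            table[slot] = (key, record)
            return slot
    raise RuntimeError("table full")

chains = [[] for _ in range(SIZE)]   # close hashing: overflow records are chained

def insert_closed(key, record):
    # A full home bucket simply grows a linked overflow chain.
    chains[key % SIZE].append((key, record))

print(insert_open(2, "R1"))   # slot 2
print(insert_open(7, "R2"))   # 7 also hashes to 2, so it lands in slot 3
insert_closed(2, "R1"); insert_closed(7, "R2")
print(chains[2])              # both records chained in bucket 2
```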
Dynamic Hashing
The problem with static hashing is that it does not expand or shrink dynamically as the size of the database
grows or shrinks.
Dynamic hashing offers a mechanism in which data buckets are added and removed dynamically and on
demand. In this hashing, the hash function helps you to create a large number of values.
o The dynamic hashing method is used to overcome the problems of static hashing like bucket overflow.
o In this method, data buckets grow or shrink as the records increases or decreases. This method is also known
as Extendable hashing method.
o This method makes hashing dynamic, i.e., it allows insertion or deletion without resulting in poor performance.
For example:
Consider the following grouping of keys into buckets, depending on the prefix of their hash address:
The last two bits of 2 and 4 are 00. So it will go into bucket B0. The last two bits of 5 and 6 are 01, so it will go
into bucket B1. The last two bits of 1 and 3 are 10, so it will go into bucket B2. The last two bits of 7 are 11, so
it will go into B3.
Insert key 9 with hash address 10001 into the above structure:
o Key 9 has hash address 10001, whose last two bits are 01, so it must go into bucket B1. But bucket B1 is full, so it will get split.
o The split separates 5 and 9 from 6: the last three bits of 5 and 9 are 001, so they stay in bucket B1, and the
last three bits of 6 are 101, so it goes into a new bucket B5.
o Keys 2 and 4 are still in B0. The records in B0 are pointed to by the 000 and 100 directory entries, because the last two bits of both
entries are 00.
o Keys 1 and 3 are still in B2. The records in B2 are pointed to by the 010 and 110 entries, because the last two bits of both
entries are 10.
o Key 7 is still in B3. The record in B3 is pointed to by the 111 and 011 entries, because the last two bits of both entries
are 11.
What is Collision?
A hash collision occurs when the hashes of two or more items in the data set map to the
same location in the hash table.
There are two technique which you can use to avoid a hash collision:
1. Rehashing: this method invokes a secondary hash function, which is applied repeatedly until an empty slot
is found where the record can be placed.
2. Chaining: the chaining method builds a linked list of items whose keys hash to the same value. This method
requires an extra link field in each table position.
Hashing is not favorable when the data is organized in some ordering and queries require a range of data;
when data is discrete and random, hashing performs best.
Hashing algorithms can be more complex to implement than indexing, but hash operations are done in constant time.
B+ Tree
The B+ tree is a balanced search tree in which each node can have multiple children; it follows a multi-level index format.
In the B+ tree, leaf nodes hold the actual data pointers, and the tree ensures that all leaf nodes remain at
the same height.
In the B+ tree, the leaf nodes are connected by a linked list. Therefore, a B+ tree can support random access as
well as sequential access.
Structure of B+ tree:
o In the B+ tree, every leaf node is at equal distance from the root node. The B+ tree is of the order n where n is
fixed for every B+ tree.
o It contains an internal node and leaf node.
Internal node
o An internal node of the B+ tree can contain at least n/2 child pointers, except the root node.
o At most, an internal node of the tree contains n pointers.
Leaf node
o A leaf node of the B+ tree can contain at least n/2 record pointers and n/2 key values.
o At most, a leaf node contains n record pointers and n key values.
o Every leaf node of the B+ tree contains one block pointer P to point to the next leaf node.
Searching a record in B+ Tree
Suppose we have to search for 55 in the B+ tree structure below. First we look in the intermediary node,
which directs us to the leaf node that can contain the record for 55.
In the intermediary node, we find the branch between the 50 and 75 keys, which redirects us to the
third leaf node. There the DBMS performs a sequential search to find 55. A search sketch follows.
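A minimal sketch (my own, not from the text) of that descend-then-scan search; the node shape and the tiny two-leaf tree are illustrative, not the figure's exact tree.

```python
import bisect

class Node:
    def __init__(self, keys, children=None, records=None, next_leaf=None):
        self.keys = keys
        self.children = children    # internal node: len(children) == len(keys) + 1
        self.records = records      # leaf node: one record per key
        self.next_leaf = next_leaf  # leaf chain enabling sequential access

def search(node, key):
    while node.children is not None:                  # descend internal nodes
        node = node.children[bisect.bisect_right(node.keys, key)]
    for k, r in zip(node.keys, node.records):         # sequential scan in the leaf
        if k == key:
            return r
    return None

leaf1 = Node([25, 40], records=["r25", "r40"])
leaf2 = Node([50, 55, 70], records=["r50", "r55", "r70"])
leaf1.next_leaf = leaf2
root = Node([50], children=[leaf1, leaf2])
print(search(root, 55))   # "r55": branch to the right of 50, then scan the leaf
```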
B+ Tree Insertion
Suppose we want to insert the record 60 into the structure below. It should go into the 3rd leaf node, after 55. The
tree is balanced and that leaf node is already full, so we cannot insert 60 there.
In this case, we have to split the leaf node so that 60 can be inserted into the tree without affecting the fill factor,
balance, and order.
The 3rd leaf node would hold the values (50, 55, 60, 65, 70), and its branch key in the parent is 50. We split the leaf node
in the middle so that the balance is not altered, grouping (50, 55) and (60, 65, 70) into two leaf
nodes.
If these two have to be leaf nodes, the intermediate node cannot branch from 50 alone. It must have 60 added to it,
along with a pointer to the new leaf node.
This is how we insert an entry when there is overflow. In the normal scenario, it is very easy to find the node
where the new key fits and place it in that leaf node.
B+ Tree Deletion
Suppose we want to delete 60 from the above example. In this case, we have to remove 60 from the
intermediate node as well as from the 4th leaf node. If we simply remove it, the
tree will no longer satisfy the rules of the B+ tree, so we must rearrange the nodes to keep the tree balanced.
After deleting node 60 from above B+ tree and re-arranging the nodes, it will show as follows:
Section 8: RAID
RAID (Redundant Array of Independent Disks) combines multiple physical disks into one logical unit for performance, redundancy, or both.
RAID 0:-
RAID 0 stripes data blocks evenly across the disks, with no mirroring and no parity.
Blocks striped, no mirror, no parity.
Minimum 2 disks.
Advantages:
RAID 0 offers great performance, both in read and write operations. There is no overhead caused by parity
controls.
All storage capacity is used, there is no overhead.
The technology is easy to implement.
Disadvantages:
RAID 0 is not fault-tolerant. If one drive fails, all data in the RAID 0 array are lost. It should not be used for
mission-critical systems.
RAID 1:-
RAID 1 writes and reads identical data to pairs of drives. This process is often called data mirroring,
and its primary function is to provide redundancy. If any of the disks in the array fails, the system
can still access data from the remaining disk(s). Once you replace the faulty disk with a new one, the
data is copied to it from the functioning disk(s) to rebuild the array. RAID 1 is the easiest way to
create failover storage.
Blocks Mirrored, No stripe, No parity
Minimum 2 disks.
Good performance (no striping, no parity).
Excellent redundancy ( as blocks are mirrored )
Business use: Standard application servers where data redundancy and availability is important.
Disk 1 Disk 2
1 1
4 4
7 7
Advantages
RAID 1 offers excellent read speed and a write speed comparable to that of a single drive.
In case a drive fails, data does not have to be rebuilt; it just has to be copied to the replacement drive.
RAID 1 is a very simple technology.
Disadvantages
The main disadvantage is that the effective storage capacity is only half of the total drive capacity because all
data get written twice.
Software RAID 1 solutions do not always allow a hot swap of a failed drive. That means the failed drive can
only be replaced after powering down the computer it is attached to. For servers that are used simultaneously
by many people, this may not be acceptable. Such systems typically use hardware controllers that do support
hot swapping.
RAID 5:-
RAID 5 stripes data blocks across multiple disks like RAID 0; however, it also stores parity
information (a small amount of data that can accurately describe larger amounts of data), which is used
to recover the data in case of disk failure.
This level offers both speed (data is accessed from multiple disks) and redundancy, as parity data is
distributed across all of the disks. If any of the disks in the array fails, data is recreated from the
remaining distributed data and parity blocks. It uses the equivalent of one disk's capacity for storing
parity information (approximately one-third in a three-disk array).
Blocks Striped, Distributed Parity
Minimum 3 disks.
Good performance ( as blocks are striped ).
Good redundancy ( distributed parity ).
The most cost-effective option providing both performance and redundancy; use it for databases that are heavily
read-oriented. Write operations will be slow.
Ideal use: File storage servers and application servers.
Advantages
Read data transactions are very fast while write data transactions are somewhat slower (due to the parity that
has to be calculated).
If a drive fails, you still have access to all data, even while the failed drive is being replaced and the storage
controller rebuilds the data on the new drive.
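The rebuild works because the parity block is a byte-wise XOR of the data blocks in a stripe, so XOR-ing the surviving blocks regenerates the missing one. A minimal sketch (my own, not from the text), with invented two-byte blocks:

```python
from functools import reduce

def xor_parity(blocks):
    """Parity block = byte-wise XOR of the data blocks in a stripe."""
    return bytes(reduce(lambda a, b: a ^ b, group) for group in zip(*blocks))

d1, d2 = b"\x0f\xf0", b"\x33\x55"   # two data blocks in one stripe
p = xor_parity([d1, d2])            # parity stored on the third disk

# If the disk holding d2 fails, XOR of the survivors reconstructs it.
recovered = xor_parity([d1, p])
print(recovered == d2)              # True
```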
Disadvantages
Drive failures reduce throughput while the array is degraded, and rebuilding a large array after a failure can take
a long time, during which a second failure would lose data. RAID 5 is also more complex than simple mirroring.
RAID 10:-
RAID 10 combines the mirroring of RAID 1 with the striping of RAID 0. In other words, it combines
the redundancy of RAID 1 with the increased performance of RAID 0. It is best suited for
environments where both high performance and security are required.
Blocks Mirrored, and Blocks Striped.
Minimum 4 disks.
This is also called a “stripe of mirrors”.
Excellent redundancy ( as blocks are mirrored)
Excellent performance ( as blocks are striped)
If you can afford the cost, this is the best option for any mission-critical applications (especially databases).
Ideal use: Highly utilized database servers/ servers performing a lot of write
operations.
Advantages
If something goes wrong with one of the disks in a RAID 10 configuration, the rebuild time is very fast since all
that is needed is copying all the data from the surviving mirror to a new drive. This can take as little as 30
minutes for drives of 1 TB.
Disadvantages
Half of the storage capacity goes to mirroring, so compared to large RAID 5 or RAID 6 arrays, this is an
expensive way to have redundancy.
RAID levels 2, 3, 4, and 7 are theoretically defined but rarely used in practice.
RAID 2
It is similar to RAID 5, but instead of block-level striping with parity, striping occurs at the bit level.
RAID 2 is seldom deployed because the costs to implement it are usually prohibitive (a typical setup
requires 10 disks), and it gives poor performance with some disk I/O operations.
RAID 3
It is also similar to RAID 5, except this solution requires a dedicated parity drive.
RAID 3 is seldom used except in the most specialized database or processing environments,
which can benefit from it.
RAID 4:-
It is a configuration in which disk striping happens at the block level with a dedicated parity drive, rather than at the bit level.
RAID 6:-
It is used frequently in enterprises.
It is identical to RAID 5, except that it is an even more robust solution because it uses one more parity
block than RAID 5 (double parity).
Two disks can die and the system will still be operational.
RAID 7 :-
It is a proprietary level of RAID owned by the now-defunct Storage Computer Corporation.
RAID 0+1:-
It is often confused with RAID 10 (which is RAID 1+0), but the two are not the same.
RAID 0+1 is a mirrored array whose segments are RAID 0 arrays.
It is implemented in specific infrastructures requiring high performance but not a high level of
scalability.
Advantages of RAID:-
Performance, resiliency, and cost are among the major benefits of RAID.
By putting multiple hard drives together, RAID can improve on the work of a single hard drive and,
depending on how it is configured, can increase computer speed and reliability after a crash.
RAID can also lower costs by using lower-priced disks in large numbers.
Servers commonly make use of RAID technology.
Disadvantages of RAID:-
RAID is not a substitute for backups: it cannot protect against accidental deletion, corruption, or site-wide failures.
Some levels (such as RAID 0) provide no redundancy at all, mirroring halves the usable capacity, and rebuilding a
failed drive in a large array can take a long time.
DBMS vs. RDBMS
DBMS: stores data in either a navigational or hierarchical form. RDBMS: uses a tabular structure where the headers are the column names and the rows contain the corresponding values.
DBMS: supports a single user only. RDBMS: supports multiple users.
DBMS: the data may not be stored following the ACID model, which can introduce inconsistencies in the database. RDBMS: relational databases are harder to construct, but they are consistent and well structured; they obey ACID (Atomicity, Consistency, Isolation, Durability).
DBMS: low software and hardware needs. RDBMS: higher hardware and software needs.
DBMS: no relationship between data. RDBMS: data is stored in tables that are related to each other with the help of foreign keys.
DBMS: there is no security. RDBMS: multiple levels of security; log files are created at the OS, command, and object level.
DBMS: data redundancy is common in this model. RDBMS: keys and indexes do not allow data redundancy.
DBMS: does not support client-server architecture. RDBMS: supports client-server architecture.
DBMS: deals with small quantities of data. RDBMS: deals with large amounts of data.
DBMS: does not support distributed databases. RDBMS: supports distributed databases.
DBMS: examples are a file system, XML, the Windows Registry, etc. RDBMS: examples are MySQL, Oracle, SQL Server, etc.
Magnetic disks and magnetic tapes are used to store data in an RDBMS. The disk space analyser maintains records of the
available space and used space on the disk.
Memory Hierarchy
The computer system uses various types of memory to achieve faster execution of processes.
1. Primary memory
Primary memory (cache and main memory) is directly accessible by the CPU; it is fast but volatile and limited in size.
2. Secondary memory
Secondary memory or storage is used to store data in computer system. The secondary storage is relatively
slower than cache or main memory.
Example: Magnetic tape, hard disk, CD, DVD etc.
An index is a small table having two columns: the first column contains the primary key of the table, and the
second column contains a set of pointers holding the disk addresses where each key
(value) can be found.
Indexes are very useful for improving search operations in a DBMS.
The types of indexes are discussed as follows:
1. Function-based indexes
A function-based index computes the value of an expression involving one or more columns of the table and
stores it in the index. The expression can be an arithmetic expression or a SQL function. A function-based index
cannot contain null values.
2. Bitmap indexes
Bitmap indexes work well for low-cardinality columns (columns with few unique values) in tables.
For example: Boolean data, which has only two values, true or false.
Bitmap indexes are very useful in data warehouse applications for joining large fact tables.
3. Domain indexes
Domain index is used to create index type schema object and an application specific index. It is used for
indexing data in application specific domain.
4. Clusters
Clustering in DBMS is designed for high availability of data. Clustering is applied to tables that are
repeatedly used by users.
For example: when there are many employees in a department, we can create an index on a non-unique key,
such as Dept_ID. With this, all employees belonging to the same department are considered to be within the
same cluster.
File organization is used to describe the way in which the records are stored in terms of blocks, and the blocks
are placed on the storage medium.
1. Physical backup
Physical backups are backups of the physical files used in storing and recovering the database, such as
data files, control files, archived redo logs, and log files. A physical backup is a copy of the files storing database
information to some other location, such as a disk or offline storage like magnetic tape. Physical backups are the
foundation of the recovery mechanism in the database; they provide minute details about transactions
and modifications to the database.
2. Logical backup
A logical backup contains logical data extracted from the database. It includes backups of logical objects like
views, procedures, functions, tables, etc. It is a useful supplement to physical backups in many circumstances,
but it is not sufficient protection against data loss without physical backups, because a logical backup provides
only structural information.
DATABASE RECOVERY –
What is recovery?
Recovery is the process of restoring a database to the correct state in the event of a failure. It ensures that the
database is reliable and remains in consistent state in case of a failure.
There are two main techniques of recovery:
1. Rolling forward (reapplying the changes of committed transactions)
2. Rolling back (undoing the changes of uncommitted transactions)
Log-Based Recovery
Logs are sequences of records that record the actions performed by transactions. In
log-based recovery, the log of each transaction is maintained in some stable storage; if any failure
occurs, the database can be recovered from it. The log contains information
about the transaction being executed, the values that have been modified, and the transaction state, all
stored in the order of execution.
Example:
Assume a transaction that modifies the address of an employee. The following logs are written for this
transaction:
Log 1: The transaction has started. Log: <Tn START>
Log 2: The address is changed from 'old value' to 'new value'. Log: <Tn, Address, 'old value', 'new value'>
Log 3: The transaction is completed; the log indicates the end of the transaction. Log: <Tn COMMIT>
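A minimal sketch (my own, not from the text) of how such a log drives recovery: redo the writes of committed transactions, then undo the writes of uncommitted ones in reverse order. The record format and values are illustrative.

```python
# Each log record: ("START", T) | ("WRITE", T, item, old, new) | ("COMMIT", T).
log = [
    ("START", "T1"),
    ("WRITE", "T1", "Address", "old street", "new street"),
    ("COMMIT", "T1"),
    ("START", "T2"),
    ("WRITE", "T2", "Salary", 100, 120),     # T2 never commits
]

db = {"Address": "new street", "Salary": 120}   # state on disk after the crash

committed = {rec[1] for rec in log if rec[0] == "COMMIT"}
for rec in log:                                  # redo committed transactions
    if rec[0] == "WRITE" and rec[1] in committed:
        db[rec[2]] = rec[4]
for rec in reversed(log):                        # undo uncommitted ones
    if rec[0] == "WRITE" and rec[1] not in committed:
        db[rec[2]] = rec[3]

print(db)   # {'Address': 'new street', 'Salary': 100}
```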
Recovery with Concurrent Transaction
When two transactions are executed in parallel, the logs are interleaved. It would become difficult
for the recovery system to return all logs to a previous point and then start recovering.
To overcome this situation 'Checkpoint' is used.
Checkpoint
A checkpoint declares a point before which all log records have been written to permanent storage and the
database is in a consistent state. During recovery, only the log records written after the most recent checkpoint
need to be processed; earlier records can be ignored or discarded.
SQL vs. NO-SQL
SQL: SQL databases are primarily called RDBMS or relational databases. NoSQL: NoSQL databases are primarily called non-relational or distributed databases.
SQL: fixed, static, predefined schema. NoSQL: dynamic schema.
SQL: developed in the 1970s to deal with the issues of flat-file storage. NoSQL: developed in the late 2000s to overcome the issues and limitations of SQL databases.
SQL: table-based databases. NoSQL: can be document-based, key-value pairs, or graph databases.
SQL: vertically scalable. NoSQL: horizontally scalable.
SQL: uses a powerful language, "Structured Query Language", to define and manipulate the data. NoSQL: collections of documents are used to query the data; it is also called unstructured query language and varies from database to database.
SQL: should be used when data validity is paramount. NoSQL: use it when it is more important to have fast data than strictly correct data.
SQL: not best suited for hierarchical data storage. NoSQL: best suited for hierarchical data storage.
SQL: ACID (Atomicity, Consistency, Isolation, Durability) is the standard for RDBMS. NoSQL: BASE (Basically Available, Soft state, Eventually consistent) is the model of many NoSQL systems.
SQL: MySQL, Oracle, SQLite, PostgreSQL, MS-SQL, etc. are examples of SQL databases. NoSQL: MongoDB, BigTable, Redis, RavenDB, Cassandra, HBase, Neo4j, CouchDB, etc. are examples of NoSQL databases.
Clustered vs. Non-Clustered Index
Clustered: a clustered index sorts the data rows in the table on their key values. Non-clustered: a non-clustered index stores the data at one location and the indexes at another location.
Clustered: there is only one clustered index per table. Non-clustered: a single table can have many non-clustered indexes, since the index is stored separately from the data.
Clustered: defined on the ordering field of the table. Non-clustered: defined on a non-ordering field of the table.
Clustered: actually describes the order in which records are physically stored on disk, which is why you can have only one; it is essentially a sorted copy of the data in the indexed columns. Non-clustered: defines a logical order that does not match the physical order on disk.
Clustered: faster to read than non-clustered, as data is physically stored in index order. Non-clustered: quicker for insert and update operations than a clustered index.
Clustered: usually made on the primary key. Non-clustered: can be made on any key.
Clustered: the leaf nodes of a clustered index contain the data pages. Non-clustered: the leaf nodes do not contain the data pages; instead, they contain index rows.
SECTION 9: DBMS STORAGE & FILE STRUCTURE:
9.1 DBMS Storage & File Structure
9.2 DBMS Backup & Recovery
9.3 DBMS vs. RDBMS
9.4 SQL vs. NO-SQL
9.5 Clustered vs. Non-Clustered Index
1. The storage structure which does not survive system crashes is ______
A. Volatile storage
B. Non-volatile storage
C. Stable storage
D. Dynamic storage
Answer: A
Explanation: Volatile storage is computer memory that requires power to maintain the stored information;
its contents are lost when the power is interrupted.
2. Storage devices like tertiary storage and magnetic disks come under
A. Volatile storage
B. Non-volatile storage
C. Stable storage
D. Dynamic storage
Answer: b
Explanation: Information residing in nonvolatile storage survives system crashes.
4. The unit of storage that can store one or more records in a hash file organization is
A. Buckets
B. Disk pages
C. Blocks
D. Nodes
Answer: a
Explanation: Buckets are used to store one or more records in a hash file organization.
5. A ______ file system is software that enables multiple computers to share file storage while maintaining
consistent space allocation and file content.
A. Storage
B. Tertiary
C. Secondary
D. Cluster
Answer: d
Explanation: With a cluster file system, the failure of a computer in the cluster does not make the file
system unavailable.
8. Which of the following is the process of selecting the data storage and data access characteristics of the
database?
A. Logical database design
B. Physical database design
C. Testing and performance tuning
D. Evaluation and selecting
Answer: b
Explanation: Physical database design is the process of selecting the data storage and data access
characteristics of the database.
10. The process of saving information onto secondary storage devices is referred to as
A. Backing up
B. Restoring
C. Writing
D. Reading
Answer: c
Explanation: The information is written into the secondary storage device.
11. The file organization which allows us to read records that would satisfy the join condition by using one block
read is
A. Heap file organization
B. Sequential file organization
C. Clustering file organization
D. Hash file organization
Answer: c
Explanation: In a clustering file organization, records of two or more relations that are frequently joined are
stored in the same blocks, so the join condition can be satisfied with one block read.
12. Which of the following is a correct feature of a distributed database?
A. Is always connected to the internet
B. Always requires more than three machines
C. Users see the data in one global schema.
D. Have to specify the physical location of the data when an update is done
Answer: c
Explanation: Users see the data in one global schema.
13. Each tablespace in an Oracle database consists of one or more files called
A. Files
B. name space
C. datafiles
D. PFILE
Answer: c
Explanation: A data file is a computer file which stores data for use by a computer application or system.
14. The management information system (MIS) structure with one main computer system is called a
A. Hierarchical MIS structure
B. Distributed MIS structure
C. Centralized MIS structure
D. Decentralized MIS structure
Answer: c
Explanation: Structure of MIS may be understood by looking at the physical components of the
information system in an organization.
15. A top-to-bottom relationship among the items in a database is established by a
A. Hierarchical schema
B. Network schema
C. Relational schema
D. All of the mentioned
Answer: a
Explanation: A hierarchical database model is a data model in which the data is organized into a tree-like
structure. The structure allows representing information using parent/child relationships.
16. Choose the RDBMS which supports full-fledged client server application development
A. dBase V
B. Oracle 7.1
C. FoxPro 2.1
D. Ingress
Answer: b
Explanation: RDBMS is Relational Database Management System.
17. Which of the following is an approach to standardizing the storage of data?
A. MIS
B. Structured programming
C. CODASYL specification
D. None of the mentioned
Answer: c
Explanation: CODASYL is an acronym for “Conference on Data Systems Languages”.
18. The highest level in the hierarchy of data organization is called
A. Data bank
B. Data base
C. Data file
D. Data record
Answer: b
Explanation: Database is a collection of all tables which contains the data in form of fields.
19. What is the purpose of index in sql server
A. To enhance the query performance
B. To provide an index to a record
C. To perform fast searches
D. All of the mentioned
Answer: D
Explanation: A database index is a data structure that improves the speed of data retrieval operations on a
database table at the cost of additional writes.
20. How many types of indexes are there in sql server?
A. 1
B. 2
C. 3
D. 4
Answer: B
21. How does a non-clustered index point to the data?
A. It never points to anything
B. It points to a data row
C. It is used for pointing data rows containing key values
D. None of the mentioned
Answer: C
22. Which one is true about clustered index?
A. Clustered index is not associated with table
B. Clustered index is built by default on unique key columns
C. Clustered index is not built on unique key columns
D. None of the mentioned
Answer: B
23. What is true about indexes?
A. Indexes enhance performance even if the table is updated frequently
B. It makes it harder for SQL Server engines to work on indexes which have large keys
C. It doesn't make it harder for SQL Server engines to work on indexes which have large keys
D. None of the mentioned
Answer: B
24. Does index take space in the disk?
A. It stores memory as and when required
B. Yes, Indexes are stored on disk
C. Indexes are never stored on disk
D. Indexes take no space
Answer: B
25. If an index is _________________, the metadata and statistics continue to exist
A. Disabled
B. Dropped
C. Altered
D. Both A and B
Answer: A
26. A clustering index is defined on the fields which are of type