
DATABASE MANAGEMENT SYSTEMS

UNIT - I
Database System Applications: A Historical Perspective, File Systems versus a DBMS, the Data
Model, Levels of Abstraction in a DBMS, Data Independence, Structure of a DBMS
Introduction to Database Design: Database Design and ER Diagrams, Entities, Attributes, and
Entity Sets, Relationships and Relationship Sets, Additional Features of the ER Model, Conceptual
Design With the ER Model

Introduction
Data - Data is a collection of facts and figures, e.g., name, class, etc. As the volume of collected data grows day by day, it needs to be stored safely on a device or in software.
Data is a collection of distinct, small units of information. It can take a variety of forms, such as text, numbers, media, and bytes, and it can be stored on paper or in electronic memory.
Based on the way the data is organized, it can be classified into three types: structured data, semi-structured data, and unstructured data.
Structured data is data whose elements are addressable for effective analysis. It is organized into a formatted repository, typically a database, and covers all data that can be stored in an SQL database as tables with rows and columns. Such data has relational keys and can easily be mapped into pre-designed fields, which makes it the simplest kind of data to process and manage. Example: relational data.
Semi-structured data is information that does not reside in a relational database but has some organizational properties that make it easier to analyze. With some processing it can be stored in a relational database, though this can be very hard for some kinds of semi-structured data. Example: XML data.
Unstructured data is data that is not organized in a predefined manner and does not have a predefined data model, so it is not a good fit for a mainstream relational database. Unstructured data is stored and managed on alternative platforms; it is increasingly prevalent in IT systems and is used by organizations in a variety of business intelligence and analytics applications. Examples: Word documents, PDFs, plain text, media logs.
https://www.geeksforgeeks.org/difference-between-structured-semi-structured-and-unstructured-data/
Information - Information is data that has been processed so that it carries meaning; a processed record is information. Example: a table of pass/fail results.
Database – It is an organized collection of data and information or interrelated data collected at one
place.
We can organize data into tables, rows, columns, and index it to make it easier to find relevant
information.
The main purpose of a database is to handle a large amount of information by storing, retrieving, and managing data.
There are many databases available like MySQL, Sybase, Oracle, MongoDB, Informix, PostgreSQL,
SQL Server, etc.
Modern databases are managed by the database management system (DBMS).
SQL or Structured Query Language is used to operate on the data stored in a database.
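For example, a few basic SQL statements on a hypothetical Student table (the table and its columns are assumed here purely for illustration) look like this:

-- create a table (DDL)
CREATE TABLE Student (
    roll_no INT PRIMARY KEY,
    name    VARCHAR(50),
    class   VARCHAR(10)
);
-- insert, query, and update data (DML)
INSERT INTO Student (roll_no, name, class) VALUES (1, 'Shyam', 'CSE-A');
SELECT name, class FROM Student WHERE roll_no = 1;
UPDATE Student SET class = 'CSE-B' WHERE roll_no = 1;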
History of DBMS –
1960 – Charles Bachman designed the first DBMS.
1970 – E. F. Codd proposed the relational model (IBM's hierarchical Information Management System, IMS, had already appeared in the late 1960s).
1976 – Peter Chen coined and defined the Entity-Relationship model, also known as the ER model.
1980 – The relational model becomes widely accepted.
1985 – Object-oriented DBMSs are developed.
1990 – Object-orientation is incorporated into relational DBMSs.
1991 – Microsoft ships MS Access, a personal DBMS that displaces other personal DBMS products.
1995 – First Internet database applications appear.
1997 – XML is applied to database processing; many vendors begin to integrate XML into DBMS products.
https://www.ilearnlot.com/database-management-system-dbms-history/64955/

Evolution of Databases - The database has completed a journey of more than 50 years of evolution, from flat-file systems to relational and object-relational systems. It has gone through several generations.
File-Based - File-based databases were introduced in 1968. In file-based databases, data was maintained in flat files. These file systems handle single or multiple files and are not very efficient. The functionalities of a file-based data management system are as follows −
>A file-based system helps in basic data management for any user.
>The data stored in the file-based system should remain consistent. Any transactions
done in the file-based system should not alter the consistency property.
>The file-based system should not allow any illegal or potentially hazardous operations
to occur on the data.
>The file-based system should allow concurrent access by different processes and this
should be carefully coordinated.
>The file-based system should make sure that the data is uniformly structured and stored
so it is easier to access it.
Advantages
>The file-based system is not complicated and is simpler to use.
>Because of this simplicity, the system is quite inexpensive.
>Because the file-based system is simple and cheap, it is normally suitable for home users and owners of small businesses.
>Since the file-based system is used by smaller organisations or individual users, it stores a comparatively smaller amount of data. Hence, the data can be accessed faster and more easily.
Disadvantages
>The file-based system is limited to a smaller size and cannot store large amounts of data.
>The system is relatively uncomplicated, but this means it cannot support complicated queries, data recovery, etc.
>There may be redundant data in the file-based system, as it does not have a complex mechanism to get rid of it.
>The data is not very secure in a file-based system and may be corrupted or destroyed.
>The data files in the file-based system may be stored across multiple locations.
>Consequently, it is difficult to share the data easily with multiple users.
https://www.tutorialspoint.com/File-based-Data-Management-System
Hierarchical Data Model - 1968-1980 was the era of the hierarchical database. The most prominent hierarchical database model was IBM's first DBMS, called IMS (Information Management System). In this model, files are related in a parent/child manner. Like the file system, this model also had limitations such as complex implementation, lack of structural independence, and the inability to easily handle many-to-many relationships. (In a diagram of the hierarchical data model, each object/record is drawn as a small circle in a tree structure.)

Network data model - Charles Bachman developed the first DBMS at Honeywell, called Integrated Data Store (IDS). It was developed in the early 1960s and was standardized in 1971 by the CODASYL group (Conference on Data Systems Languages). In this model, files are related as owners and members, similar to a common network model.
The network data model identified the following components:
Network schema (database organization)
Sub-schema (views of the database per user)
Data management language (procedural)
This model also had some limitations, such as system complexity and being difficult to design and maintain.
Relational Database - 1970 - Present: This is the era of the relational database and database management. In 1970, the relational model was proposed by E. F. Codd. The relational database model has two main terminologies, instance and schema. An instance is a table with rows and columns. The schema specifies the structure, such as the name of the relation and the name and type of each column. This model uses mathematical concepts such as set theory and predicate logic. The first Internet database application was created in 1995. During the era of the relational database, many more models were introduced, such as the object-oriented model and the object-relational model.
Relational database properties (the ACID transaction properties plus supporting features)
Atomicity
Consistency
Isolation
Durability
Concurrency control
Query processing
Cloud database - A cloud database lets you store, manage, and retrieve structured and unstructured data via a cloud platform. The data is accessible over the Internet. Cloud databases are also called Database as a Service (DBaaS) because they are offered as a managed service.
Some popular cloud options are:
AWS (Amazon Web Services)
Snowflake Computing
Oracle Database Cloud Services
Microsoft SQL Server
Google Cloud Spanner
Advantages of cloud database
Lower costs
Automated
Increased accessibility
NoSQL Database - A NoSQL database is an approach to database design that can accommodate a wide variety of data models. NoSQL stands for "not only SQL." It is an alternative to traditional relational databases, in which data is placed in tables and the data schema is carefully designed before the database is built. NoSQL databases are useful for large sets of distributed data.
Some examples of NoSQL database systems, with their categories, are:
MongoDB, CouchDB, Cloudant (document-based)
Memcached, Redis, Coherence (key-value store)
HBase, Bigtable, Accumulo (tabular/wide-column)
Neo4j, Amazon Neptune (graph)
Advantage of NoSQL
High Scalability
High Availability
Disadvantage of NoSQL
Open source
Management challenge
GUI is not available
Backup
The Object-Oriented Databases - Object-oriented databases contain data in the form of objects and classes. Objects are real-world entities, and classes are collections of objects. An object-oriented database combines relational model features with object-oriented principles. It is an alternative implementation to the relational model. Object-oriented databases follow the rules of object-oriented programming. An object-oriented database management system is a hybrid application.
Object-oriented programming properties
Objects
Classes
Inheritance
Polymorphism
Encapsulation
Abstraction
Graph Databases - A graph database is a NoSQL database that represents data as a graph. It contains nodes and edges: a node represents an entity, and each edge represents a relationship between two nodes. Every node in a graph database has a unique identifier. Graph databases are beneficial for searching relationships between data because they make the relationships between relevant data explicit. They are very useful when the database contains complex relationships and a dynamic schema. They are used, for example, in supply chain management and in identifying the source of IP telephony calls.
DBMS (Database Management System) - A database management system is software that is used to store and retrieve data from a database. Oracle, MySQL, etc. are some popular DBMS tools.
DBMS provides an interface to perform various operations such as creation, deletion, and modification.
DBMS allows users to create their databases as per their requirements.
DBMS accepts requests from applications and provides specific data through the operating system.
DBMS contains a group of programs that act according to user instructions.
It provides security to the database.
Advantage of DBMS
Controls redundancy
Data sharing
Backup
Multiple user interfaces
Disadvantage of DBMS
Size
Cost
Complexity
RDBMS (Relational Database Management System) - RDBMS stands for 'Relational Database Management System.' Data in an RDBMS is represented as tables containing rows and columns. RDBMS is based on the relational model, which was introduced by E. F. Codd.
A relational database contains the following components:
Table
Record/ Tuple
Field/Column name /Attribute
Instance
Schema
Keys
An RDBMS is a tabular DBMS that maintains the security, integrity, accuracy, and consistency of the
data.
https://www.javatpoint.com/what-is-database

File Systems versus a DBMS


The main differences between a file system and a DBMS, basis by basis:
Structure - A file system is software that manages and organizes the files on a storage medium within a computer; a DBMS is software for managing the database.
Data Redundancy - Redundant data can be present in a file system; in a DBMS there is no redundant data.
Backup and Recovery - A file system does not provide backup and recovery of data if it is lost; a DBMS provides backup and recovery of data even if it is lost.
Query Processing - There is no efficient query processing in a file system; efficient query processing is available in a DBMS.
Consistency - There is less data consistency in a file system; there is more data consistency in a DBMS because of the process of normalization.
Complexity - A file system is less complex; a DBMS is more complex to handle than a file system.
Security Constraints - A file system provides less security; a DBMS has more security mechanisms than a file system.
Cost - A file system is less expensive; a DBMS has a comparatively higher cost than a file system.
Data Independence - There is no data independence in a file system; in a DBMS, data independence exists.
User Access - In a file system, only one user can access data at a time; in a DBMS, multiple users can access data at a time.
Meaning - With a file system, the user has to write procedures for managing data; with a DBMS, the user is not required to write procedures.
Sharing - In a file system, data is distributed across many files, so sharing is not easy; in a DBMS, sharing is easy due to its centralized nature.
Data Abstraction - A file system exposes the details of data storage and representation; a DBMS hides the internal details of the database.
Integrity Constraints - Integrity constraints are difficult to implement in a file system; they are easy to implement in a DBMS.
Example - File system: COBOL, C++ (file handling); DBMS: Oracle, SQL Server.

https://www.geeksforgeeks.org/difference-between-file-system-and-dbms/

The Data Model


A data model in DBMS is a set of concepts and rules that are used to describe and organize the data in a
database.
It defines the structure, relationships, and constraints of the data, and provides a way to access and
manipulate the data.
Different data models are used to represent different types of data and relationships, and each has its own
set of advantages and disadvantages.
11 types of data models in DBMS
Relational Model - In a relational model, each table, also known as a relation, represents a
specific entity, such as a customer or an order. Each row in the table represents an instance of that
entity and each column represents an attribute or property of that entity. The tables are related to
each other through keys.
For example, a simple relational model for a retail store can have the following tables:
Customers: (columns for customer ID, name, address, and phone number)
Products: (columns for product ID, name, description, and price)
Orders: (columns for order ID, customer ID, and order date)
Order_items: (columns for order ID, product ID, and quantity)
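A minimal SQL sketch of this retail-store schema (the table and column names follow the example above; the data types and constraints are assumptions for illustration):

CREATE TABLE Customers (
    customer_id INT PRIMARY KEY,
    name        VARCHAR(100),
    address     VARCHAR(200),
    phone       VARCHAR(20)
);
CREATE TABLE Products (
    product_id  INT PRIMARY KEY,
    name        VARCHAR(100),
    description VARCHAR(500),
    price       DECIMAL(10,2)
);
CREATE TABLE Orders (
    order_id    INT PRIMARY KEY,
    customer_id INT REFERENCES Customers(customer_id),  -- key relating Orders to Customers
    order_date  DATE
);
CREATE TABLE Order_items (
    order_id   INT REFERENCES Orders(order_id),
    product_id INT REFERENCES Products(product_id),
    quantity   INT,
    PRIMARY KEY (order_id, product_id)                  -- one row per product per order
);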
Flat Data Model - A flat data model, also known as a flat file model, is a type of data model that
stores data in a single table or file, with no nested or related structures. Each record in the file
represents an instance of an entity or concept, and each field within the record represents an
attribute or property of that entity or concept.
A flat data model is a simple and basic way to store data, and it is suitable for small and simple
databases where the relationships between the data are not complex.
It's important to note that the flat data model is not widely used in modern DBMSs, and it is not recommended for large or complex databases because it cannot handle relationships between the data and is not efficient for data retrieval and manipulation.
For example, a simple flat data model for a retail store can include a single file with the following
fields: Order ID, Customer name, Customer address, Customer phone number, Product name,
Product description, Product price, Order date, Quantity.
Entity-Relationship Model - The entity-relationship (ER) model is a data model that describes
the relationships between entities in a database. It is used to represent the structure and constraints
of a database in a conceptual way, independent of any specific DBMS or data model. The ER
model is often used during the early stages of database design to define the requirements and
constraints of a database and to communicate them to stakeholders.
In the ER model, entities are represented as rectangles and relationships are represented as
diamond shapes. Entities are objects or concepts that have a distinct identity and properties, such
as a customer or a product. Relationships are the associations between entities, such as a customer
placing an order.
The ER model is used to design and understand the structure of a database, it’s used as a blueprint
for the physical database design and it’s also used to communicate the requirements of the
database to the stakeholders.
For example, a simple ER model for a retail store with the following entities:
Customer
Product
Order
and the following relationships:
A customer places one or more orders
An order is placed by one customer
An order contains one or more products
A product is contained in one or more orders
Network Model - The network model is a type of data model that represents data as a collection of records and relationships. It is similar to the hierarchical model in that records are connected by parent-child (owner-member) links; however, whereas the hierarchical model only allows one-to-many relationships, the network model forms a graph and also allows many-to-many relationships.
In the network model, records are represented as nodes and relationships are represented as links
between nodes. Each node has a unique identifier, also called a record number, and can have
multiple links to other nodes. The relationships are defined by link types and link sets, which
specify the type of relationship and the number of links between nodes.
The network model is mainly used in legacy systems and is not as popular as the relational model,
it has some advantages over the hierarchical model, such as more flexibility in defining
relationships and it allows for many-to-many relationships. However, it’s more complex than the
relational model and it’s not as efficient for data retrieval and manipulation.
For example, a simple network model for a retail store might include the following records:
Customers: with fields for customer ID, name, address, and phone number.
Products: with fields for product ID, name, description, and price.
Orders: with fields for order ID, customer ID, and order date.
Hierarchical Model - A hierarchical data model is a way of organizing data in a tree-like
structure, where each record is a parent or child of one or more other records. This structure is
similar to an organizational chart, where each record is a parent of one or more child records, and
each child record is a parent of one or more child records.
In a hierarchical data model, each record has a unique identifier, also called a primary key, and
one or more fields that store the data for that record. The relationships between records are
defined by pointers, which are fields that contain the primary key of the parent or child record.
Hierarchical data models were widely used in the past, but they have been largely replaced by other types of data models, such as the relational model, because they have limitations: they support only one-to-many relationships and they are not flexible for data retrieval and manipulation. They are also not efficient for large and complex databases.
For example, a simple hierarchical data model for a retail store might include the following
records:
Customers: with fields for customer ID, name, address, and phone number.
Products: with fields for product ID, name, description, and price.
Orders: with fields for order ID, customer ID, product ID, and order date.
Object-oriented Data Model – The object-oriented data model (OODM) is a data model that
represents data as objects, which are instances of classes. Classes are templates that define the
properties and methods of the objects, and objects are instances of those classes with specific
values for those properties.
In an object-oriented data model, each object has a unique identity, a set of properties (or
attributes), and a set of methods (or behaviors). Objects also have a class, which is a template that
defines the properties and methods of the object. Classes can also be organized into a class
hierarchy, where a class can inherit properties and methods from a parent class.
The object-oriented data model is widely used in object-oriented programming languages and in
applications that handle complex and dynamic data, such as video games, virtual reality, and
simulations. This model allows for the encapsulation of data and behavior, inheritance and
polymorphism which are the key features of object-oriented programming. Additionally, it’s a
powerful model for data manipulation, but it’s not well suited for handling large amounts of data.
For example, a simple object-oriented data model for a retail store might include the following
classes:
Customer: with properties for customer ID, name, address, and phone number, and
methods for placing an order and viewing their order history.
Product: with properties for product ID, name, description, and price, and methods for
viewing product details and checking availability.
Order: with properties for order ID, customer ID, product ID, and order date, and
methods for viewing order details and calculating the total cost.
Object Relation Model - The Object-Relational Model (ORM) is a data model that combines the
features of the object-oriented data model and the relational data model. ORM is a way to work
with relational databases using an object-oriented programming language, like Java or C#.
In an ORM, data is represented using classes and objects, like in an object-oriented data model,
but the data is stored in a relational database. The ORM provides an abstraction layer between the
object-oriented code and the relational database, allowing developers to work with the data as if it
were in-memory objects, rather than dealing with the complexities of SQL and database schemas.
ORM is widely used in modern web and mobile application development, as it allows developers to work with databases using their preferred programming languages and it simplifies database access and management. However, it can also cause performance issues if it is not implemented correctly.
For example, a simple ORM for a retail store might include the following classes:
Customer: with properties for customer ID, name, address, and phone number, and
methods for placing an order and viewing their order history.
Product: with properties for product ID, name, description, and price, and methods for
viewing product details and checking availability.
Order: with properties for order ID, customer ID, product ID, and order date, and
methods for viewing order details and calculating the total cost.
Record Based Model - A Record-Based Model is a data model that represents data as a
collection of records, where each record is a collection of fields. Each field has a unique name
and a specific data type, and each record has a unique identifier, also known as a primary key.
In a Record-Based Model, the data is stored in a flat file, which is a file that contains all the
records of the database, with no hierarchical or relational structure. The records are stored in a
specific order, typically in the order they were added to the database.
For example, a simple Record-Based Model for a retail store might include the following fields:
Customer: with fields for customer ID, name, address, and phone number.
Product: with fields for product ID, name, description, and price.
Order: with fields for order ID, customer ID, product ID, and order date.
Semi-structured Model - A Semi-structured Data Model is a data model that allows for the
storage and management of data that has some inherent structure, but also allows for the presence
of unstructured or semi-structured data. In this kind of data model, the data can be of any type,
and it can be organized in a variety of ways, as opposed to the strict structure of traditional data
models like relational or hierarchical models.
Semi-structured data models are becoming increasingly popular in big data and NoSQL databases
as they allow for the storage and management of large amounts of data in a flexible and scalable
way.
An example of semi-structured data is an XML document. XML is a markup language that allows
for the creation of custom tags, which can be used to specify the structure of the data. However,
the data within the tags can be of any type, and the structure of the tags does not have to be the
same for every document.
For example, in social media platforms like Twitter, the data is semi-structured because the tweets
have some structure (e.g. text, hashtags, mentions, etc.) but users can use any text format to
express themselves.
Associative Model - An Associative Data Model, also known as the Object-Associative Data
Model, is a data model that combines the features of the object-oriented data model and the
relational data model. It is a hybrid data model that allows the representation of data in the form
of objects and relationships between those objects, similar to the Entity-Relationship model.
In the Associative Data Model, data is represented using classes and objects, like in an object-
oriented data model, but the relationships between the objects are represented using associations,
which are similar to the relationships in a relational data model. The associations are defined by
the developer and can be one-to-one, one-to-many, or many-to-many.
For example, a simple Associative Data Model for a retail store might include the following
classes:
Customer: with properties for customer ID, name, address, and phone number, and
associations to the orders they have placed.
Product: with properties for product ID, name, description, and price, and associations to
the orders in which it appears.
Order: with properties for order ID, customer ID, product ID, and order date, and
associations to the customer and product objects.
In this example, the Associative Data Model allows developers to work with the data in a more
natural way, as they can navigate the associations between the objects rather than dealing with the
complexities of SQL and database schemas. It allows for a more intuitive representation of data
and relationships, but it’s not as widely used as other models like RDBMS.
Context Data Model - A Context Data Model is a data model that captures the context in which
data is created, stored, and used. The context includes information about the data itself, as well as
information about the environment in which the data is generated, stored, and used.
In a Context Data Model, data is represented as a set of entities, where each entity is a collection
of attributes. The attributes of an entity describe the properties of the data, such as its type,
format, and size. The context of the data is also represented as attributes of the entity, such as the
time and location of data creation, the user who created it, and the applications and systems that
use it.
The Context Data Model is still not widely used, it’s more of a theoretical concept that could be
used in specific scenarios where understanding and capturing the context of the data is crucial.
It’s one of the newer data models that is still being researched and developed.
For example, a simple Context Data Model for a retail store might include the following entities:
Product: with attributes for product ID, name, description, price, and category.
Order: with attributes for order ID, customer ID, product ID, and order date.
Customer: with attributes for customer ID, name, address, phone number, and email.
https://databasetown.com/data-models-in-dbms-with-examples/

Levels of Abstraction in a DBMS


Data abstraction is the process of hiding irrelevant or unwanted details of the data from the end user.
For example, a person who goes to buy a pair of shoes cares only about the size, style, and price, not about how the shop stores and tracks its stock; similarly, a database user sees only the data relevant to them.
In DBMS, there are three levels of data abstraction, which are as follows:
Physical or Internal Level - The physical or internal level is the lowest level of data abstraction in the database management system. It is the level that defines how data is actually stored in the database and the methods used to access that data. It describes complex data structures in detail, so it is very complex to understand, which is why it is kept hidden from the end user.
Database Administrators (DBAs) decide how to arrange data and where to store it. The Database Administrator (DBA) is the person whose role is to manage the data in the database at the physical or internal level. At this level, the raw data is stored in detail on storage devices such as hard drives in a data center.
Logical or Conceptual Level - The logical or conceptual level is the intermediate or next level of data
abstraction. It explains what data is going to be stored in the database and what the relationship is
between them.
It describes the structure of the entire database in the form of tables. The logical or conceptual level is less complex than the physical level. With the help of the logical level, Database Administrators (DBAs) abstract away the raw data present at the physical level.
View or External Level - View or External Level is the highest level of data abstraction. There are
different views at this level that define the parts of the overall data of the database. This level is for the
end-user interaction; at this level, end users can access the data based on their queries.
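To make the levels concrete, here is a hedged SQL sketch (the Student table and the ExamResult view are assumptions for illustration, not part of any standard): tables belong to the conceptual level, views expose an external level tailored to particular users, and the physical level stays hidden behind both.

-- conceptual/logical level: the full table definition
CREATE TABLE Student (
    roll_no INT PRIMARY KEY,
    name    VARCHAR(50),
    marks   INT,
    phone   VARCHAR(20)
);
-- external/view level: an exam-cell user sees only roll numbers and marks
CREATE VIEW ExamResult AS
    SELECT roll_no, marks FROM Student;
-- the internal level (how rows are laid out in files, pages, and indexes)
-- is decided by the DBMS/DBA and never appears in these statements.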
Advantages of data abstraction in DBMS
Users can easily access the data based on their queries.
It provides security to the data stored in the database.
Database systems work efficiently because of data abstraction.
https://www.javatpoint.com/data-abstraction-in-dbms

Data Independence
Data Independence is defined as a property of a DBMS that lets you change the database schema at one level of the database system without requiring a change to the schema at the next higher level. Data independence keeps data separated from all the programs that make use of it.
You can use this stored data for computing and presentation. In many systems, data independence is an
essential function for components of the system.

A database schema is the logical representation of a database


The schema does not physically contain the data itself; instead, it gives information about the shape of
data and how it can be related to other tables or models.
Type of Schema
External/View Schema - The view level design of a database is known as view schema. This
schema generally describes the end-user interaction with the database systems.
Conceptual/Logical Schema - The logical database schema specifies all the logical constraints that need to be applied to the stored data. It defines the views, integrity constraints, and tables. Here, the term integrity constraints refers to the set of rules used by the DBMS (Database Management System) to maintain data quality when inserting and updating data. The logical schema represents how the data is stored in the form of tables and how the attributes of a table are linked together.
Internal/Physical Schema - A physical schema specifies how the data is stored physically on a
storage system or disk storage in the form of Files and Indices. Designing a database at the
physical level is called a physical schema.
Types of Data Independence
Physical Data Independence - Physical data independence helps you to separate conceptual levels from
the internal/physical levels. It allows you to provide a logical description of the database without the need
to specify physical structures. Compared to Logical Independence, it is easy to achieve physical data
independence.
With physical independence, you can easily change the physical storage structures or devices without affecting the conceptual schema. Any change made is absorbed by the mapping between the conceptual and internal levels. Physical data independence is achieved through the presence of the internal level of the database and the mapping (transformation) from the conceptual level of the database to the internal level.
Examples of changes under Physical Data Independence
Due to Physical independence, any of the below change will not affect the conceptual layer.
Using a new storage device like Hard Drive or Magnetic Tapes
Modifying the file organization technique in the Database
Switching to different data structures.
Changing the access method.
Modifying indexes.
Changes to compression techniques or hashing algorithms.
Change of Location of Database from say C drive to D Drive
Logical Data Independence - Logical data independence is the ability to change the conceptual schema without changing
External views
External APIs or programs
Any change made is absorbed by the mapping between the external and conceptual levels.
Compared to physical data independence, logical data independence is challenging to achieve.
Examples of changes under Logical Data Independence
Due to Logical independence, any of the below change will not affect the external layer.
Add/Modify/Delete a new attribute, entity or relationship is possible without a rewrite of existing
application programs
Merging two records into one
Breaking an existing record into two or more records
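A small SQL sketch of both kinds of independence (the Student table and StudentContact view are assumptions carried over from the earlier illustration): a physical-level change such as adding an index, and a logical-level change such as adding a column, both leave the existing external view untouched.

-- external view used by an application
CREATE VIEW StudentContact AS
    SELECT roll_no, name, phone FROM Student;
-- physical data independence: adding an index changes only the internal level;
-- the conceptual schema and the view are unaffected
CREATE INDEX idx_student_name ON Student(name);
-- logical data independence: adding a new attribute changes the conceptual schema,
-- but the mapping absorbs it and the view (and programs using it) keep working
ALTER TABLE Student ADD COLUMN email VARCHAR(100);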

Importance of Data Independence


Helps you to improve the quality of the data
Database system maintenance becomes affordable
Enforcement of standards and improvement in database security
You don't need to alter data structures in application programs
Permits developers to focus on the general structure of the database rather than worrying about the internal implementation
It allows the data to remain consistent and undamaged
Database inconsistency is vastly reduced
Modifications can easily be made at the physical level when needed to improve the performance of the system
Logical Data Independence vs Physical Data Independence:
Concern - Logical data independence is mainly concerned with the structure of the data or changes to the data definition; physical data independence is mainly concerned with how the data is stored.
Retrieval - With logical changes, retrieval is difficult because it depends mainly on the logical structure of the data; with physical changes, retrieval is easy.
Ease of achievement - Compared to physical data independence, logical data independence is difficult to achieve; compared to logical data independence, physical data independence is easy to achieve.
Application programs - You may need to change the application program if new fields are added to or deleted from the database; a change at the physical level usually does not require a change at the application program level.
Modifications - Modification at the logical level is significant whenever the logical structure of the database changes; modifications made at the internal level may or may not be needed to improve performance.
Schema - Logical data independence is concerned with the conceptual schema; physical data independence is concerned with the internal schema.
Example - Logical: add/modify/delete an attribute; Physical: change of compression techniques, hashing algorithms, storage devices, etc.
https://www.javatpoint.com/database-schema
https://www.guru99.com/dbms-data-independence.html

Structure of a DBMS
Structure of Database Management System is also referred to as Overall System Structure or Database
Architecture.
Database Management System (DBMS) is software that allows access to data stored in a database and
provides an easy and effective method of –
Defining the information
Storing the information
Manipulating the information
Protecting the information from system crashes or data theft
Differentiating access permissions for different users
Data Theft: When somebody steals the information stored on databases, and servers, this process is
known as Data Theft.
The database system is divided into three components:
Query Processor
Storage Manager
Disk Storage
Query Processor - It interprets the requests (queries) received from the end user via an application program into instructions. It also executes the user request received from the DML compiler.
The Query Processor contains the following components –
DML Compiler - It compiles DML statements into low-level instructions (machine language) so that they can be executed.
DDL Interpreter - It processes DDL statements into a set of tables containing metadata (data about data).
Embedded DML Pre-compiler - It processes DML statements embedded in an application program into procedural calls.
Query Optimizer - It chooses an efficient evaluation plan for the instructions generated by the DML compiler before they are executed.
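For instance (the statements below are illustrative only), the components above handle different kinds of SQL:

-- handled by the DDL interpreter: creates metadata entries in the data dictionary
CREATE TABLE Dept (dno INT PRIMARY KEY, dname VARCHAR(50));
-- handled by the DML compiler and query optimizer: compiled into low-level
-- instructions and an efficient evaluation plan before execution
SELECT dname FROM Dept WHERE dno = 10;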
Storage Manager: The Storage Manager is a program that provides an interface between the data stored in the database and the queries received. It is also known as the Database Control System. It maintains the consistency and integrity of the database by applying constraints and executing DCL statements. It is responsible for updating, storing, deleting, and retrieving data in the database.
It contains the following components –
Authorization Manager: It ensures role-based access control, i.e., it checks whether a particular user is privileged to perform the requested operation or not.
Integrity Manager: It checks the integrity constraints when the database is modified.
Transaction Manager: It controls concurrent access by scheduling the operations of the transactions it receives, ensuring that the database remains in a consistent state before and after the execution of each transaction.
File Manager: It manages the file space and the data structures used to represent information in the database.
Buffer Manager: It is responsible for cache memory and the transfer of data between secondary storage and main memory.
Disk Storage: It contains the following components –
Data Files: They store the data.
Data Dictionary: It contains information about the structure of every database object. It is the repository of metadata.
Indices: They provide faster retrieval of data items.

Introduction to Database Design: Database Design


Database design can be generally defined as a collection of tasks or processes that support the design, development, implementation, and maintenance of an enterprise data management system.
Designing a proper database reduces maintenance costs, improves data consistency, and saves disk storage space.
The designer should follow the constraints and decide how the elements correlate and what kind of data must be stored.

Why is Database Design important?


The importance of database design is as follows:
1. Database designs provide the blueprint of how the data is going to be stored in a system. A proper database design strongly affects the overall performance of any application.
2. The design principles defined for a database give a clear idea of the behavior of any application and how its requests are processed.
3. Another reason to emphasize database design is that a proper database design meets all the requirements of the users.
4. Lastly, the processing time of an application is greatly reduced if the constraints of designing a highly efficient database are properly implemented.

Life Cycle

Requirement Analysis
First of all, planning has to be done to establish the basic requirements of the project under which the design of the database will be taken forward. This stage can be broken down as follows:

Planning - This stage is concerned with planning the entire DDLC (Database Development Life
Cycle). The strategic considerations are taken into account before proceeding.
System definition - This stage covers the boundaries and scopes of the proper database after
planning.
Database Designing
The next step involves designing the database considering the user-based requirements and splitting them into various models so that heavy dependence on a single aspect is not imposed. Therefore, a model-centric approach is taken, and that is where the logical and physical models play a crucial role.
Physical Model - The physical model is concerned with the practices and implementations of the
logical model.
Logical Model - This stage is primarily concerned with developing a model based on the
proposed requirements. The entire model is designed on paper without any implementation or
adopting DBMS considerations.
Implementation
The last step covers the implementation methods and checking out the behavior that matches our
requirements. It is ensured with continuous integration testing of the database with different data sets and
conversion of data into machine understandable language. The manipulation of data is primarily focused
on these steps where queries are made to run and check if the application is designed satisfactorily or not.
Data conversion and loading - This section is used to import and convert data from the old to
the new system.
Testing - This stage is concerned with error identification in the newly implemented system.
Testing is a crucial step because it checks the database directly and compares the requirement
specifications.
Database Design Process
The process of designing a database carries various conceptual approaches that are needed to be kept in
mind. An ideal and well-structured database design must be able to:
Save disk space by eliminating redundant data.
Maintains data integrity and accuracy.
Provides data access in useful ways.
Comparing Logical and Physical data models.
https://www.javatpoint.com/database-design

Introduction to Database Design: ER Diagrams


ER model stands for an Entity-Relationship model. It is a high-level data model. This model is used to
define the data elements and relationship for a specified system.
It develops a conceptual design for the database and provides a very simple, easy-to-understand view of the data.
In ER modeling, the database structure is portrayed as a diagram called an entity-relationship diagram.
For example, suppose we are designing a college database. In this database, the student will be an entity with attributes like address, name, id, age, etc. The address can be another entity with attributes like city, street name, pin code, etc., and there will be a relationship between them.
Component of ER Diagram

Entity - An entity can be any place, person, object, or class. In an ER diagram, an entity is portrayed as a rectangle.
Weak Entity - A weak entity is one which is dependent on another entity. It is portrayed by a double rectangle.
Attribute - An attribute is used to describe a property of an entity. An ellipse is used to represent an attribute. For example, id, age, contact number, name, etc. can be attributes of a student.
Key Attribute - The key attribute is used to represent the main characteristic of an entity. It represents a primary key. The key attribute is represented by an ellipse with the text underlined.
Composite Attribute - An attribute that is composed of many other attributes is known as a composite attribute. The composite attribute is represented by an ellipse, and its component attributes are ellipses connected to it.
Multivalued Attribute - An attribute that can have more than one value is known as a multivalued attribute. A double oval is used to represent a multivalued attribute. For example, a student can have more than one phone number.
Derived Attribute - An attribute that can be derived from another attribute is known as a derived attribute. It is represented by a dashed ellipse. For example, a person's age changes over time and can be derived from another attribute like date of birth.
Relationship
A diamond-shaped box represents a relationship. All the entities (rectangle-shaped) participating in a relationship are connected to it using lines.

There are four types of relationships. These are:

One-to-one: When a single instance of one entity is associated with a single instance of another entity through the relationship, it is termed '1:1'.

One-to-many: When a single instance of the entity on the left is associated with more than one instance of the entity on the right, it is termed '1:N'.

Many-to-one: When more than one instance of the entity on the left is associated with a single instance of the entity on the right, it is termed 'N:1'.

Many-to-many: When more than one instance of the entity on the left and more than one instance of the entity on the right can be linked with the relationship, it is termed an 'N:N' relationship.

Participation Constraints
Total Participation: Each entity is involved in the relationship. Total participation is represented by
double lines.
Partial participation: Not all entities are involved in the relationship. Partial participation is represented
by single lines.
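As a hedged sketch (the entity and column names below are assumptions), these cardinalities map onto relational tables roughly as follows: a 1:N relationship becomes a foreign key on the "many" side, while an N:N relationship needs a separate table for the relationship set.

-- 1:N (one department has many employees): foreign key on the many side
CREATE TABLE Department (dept_id INT PRIMARY KEY, dname VARCHAR(50));
CREATE TABLE Employee (
    emp_id  INT PRIMARY KEY,
    ename   VARCHAR(50),
    dept_id INT REFERENCES Department(dept_id)   -- the Works_At relationship
);
-- N:N (a student enrolls in many courses, a course has many students):
-- a separate table holds the relationship set
CREATE TABLE Student (roll_no INT PRIMARY KEY, sname VARCHAR(50));
CREATE TABLE Course  (course_no VARCHAR(10) PRIMARY KEY, title VARCHAR(100));
CREATE TABLE Enrolls (
    roll_no   INT REFERENCES Student(roll_no),
    course_no VARCHAR(10) REFERENCES Course(course_no),
    PRIMARY KEY (roll_no, course_no)
);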

https://www.javatpoint.com/dbms-er-model-concept
https://www.tutorialspoint.com/dbms/er_diagram_representation.htm
Entities
The role of the entity is the representation and management of data.
An entity is referred to as an object or thing that exists in the real world. For example, customer, car, pen,
etc.
Entities are stored in the database, and they should be distinguishable, i.e., they should be easily
identifiable from the group.
For extracting data from the database, each entity must be unique in its own way so that it becomes easy to differentiate between them. Distinct and uniquely identifiable data is known as an entity.
An entity has some attributes which depict the entity's characteristics.
For example, an entity "Student" has attributes such as "Student_roll_no", "Student_name",
"Student_subject", and "Student_marks".
Kinds of Entity:
There are two kinds of entities, which are as follows:
Tangible Entity: A tangible entity in DBMS is one that represents a physical object we can touch or see. In simple words, an entity that has a physical existence in the real world is called a tangible entity. For example, a table in a database may represent tangible entities such as colleges, bank lockers, mobiles, cars, watches, pens, paintings, etc.
Intangible Entity: It is an entity in DBMS, which is a non-physical object that we cannot see or
touch. In simple words, an entity that does not have any physical existence in the real world is
known as an intangible entity. For example, a bank account logically exists, but we cannot see or
touch it.
Entity Type:
A collection of entities with general characteristics is known as an entity type.
For example, a database of a corporate company has entity types such as employees, departments, etc. In
DBMS, every entity type contains a set of attributes that explain the entity.
The Employee entity type can have attributes such as name, age, address, phone number, and salary.
The Department entity type can have attributes such as name, number, and location in the department.
Kinds of Entity Type
There are two kinds of entity type, which are as follows:
Strong Entity Type: It is an entity that has its own existence and is independent. The entity
relationship diagram represents a strong entity type with the help of a single rectangle. Below is
the ERD of the strong entity type:
Weak Entity Type: It is an entity that does not have its own existence and relies on a strong
entity for its existence. The Entity Relationship Diagram represents the weak entity type using
double rectangles. Below is the ERD of the weak entity type:

In the above example, "Address" is a weak entity type with attributes such as House No., City, Location,
and State.
The relationship between a strong and a weak entity type is known as an identifying relationship.
Using a double diamond, the Entity-Relationship Diagram represents a relationship between the strong
and the weak entity type.
Entity Relationship Diagram for strong and weak entity types
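A small SQL sketch of this identifying relationship (the owning Employee entity and the exact column names are assumptions based on the example above): the weak entity has no key of its own, so its primary key combines its partial key with the key of the owning strong entity.

-- strong (owner) entity
CREATE TABLE Employee (
    emp_id INT PRIMARY KEY,
    name   VARCHAR(50)
);
-- weak entity: an Address exists only for a particular employee
CREATE TABLE Address (
    emp_id   INT REFERENCES Employee(emp_id),  -- identifying relationship to the owner
    house_no VARCHAR(20),                      -- partial key of the weak entity
    city     VARCHAR(50),
    location VARCHAR(50),
    state    VARCHAR(50),
    PRIMARY KEY (emp_id, house_no)             -- owner key + partial key
);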

https://www.javatpoint.com/entity-in-dbms

Attributes
In DBMS we have entities, and each entity has some properties that describe it; these properties are called attributes. In relational databases we have tables, and each column corresponds to an attribute of an entity, so all the entries in that column must strictly follow the domain of that attribute. Attributes define the characteristic properties of an entity.
An ellipse is used to represent an attribute, for instance, contact number, ID, age, etc.
Simple Attribute: It is also known as atomic attributes. When an attribute cannot be divided
further, then it is called a simple attribute.
For example, in a student table, the branch attribute cannot be further divided. It is called a simple
or atomic attribute because it contains only a single value that cannot be broken further.
Composite Attribute: Composite attributes are those that are made up of the composition of
more than one attribute. When any attribute can be divided further into more sub-attributes, then
that attribute is called a composite attribute.
For example, in a student table, we have attributes of student names that can be further broken
down into first name, middle name, and last name. So the student name will be a composite
attribute.
Another example from a personal detail table would be the attribute of address. The address can
be divided into a street, area, district, and state.
Single-valued Attribute: Those attributes which can have exactly one value are known as single
valued attributes. They contain singular values, so more than one value is not allowed.
For example, the DOB of a student can be a single valued attribute. Another example is gender
because one person can have only one gender.
Multi-valued Attribute: Those attributes which can have more than one entry or which contain
more than one value are called multi valued attributes.
In the Entity Relationship (ER) diagram, we represent the multi valued attribute by double oval
representation.
For example, one person can have more than one phone number, so that it would be a multi
valued attribute. Another example is the hobbies of a person because one can have more than one
hobby.
Derived Attribute: A derived attribute is one whose value can be computed from other (stored) attributes. We can perform calculations on stored attributes to obtain derived attributes.
For example, the age of a student is a derived attribute because it can be obtained from the student's DOB.
Another example is work experience, which can be obtained from the date of joining of an employee.
In the ER diagram, we represent derived attributes by a dotted (dashed) oval shape.
Complex Attribute: If an attribute combines the properties of multivalued and composite attributes, it is called a complex attribute. That is, if an attribute is made up of more than one attribute and can also have more than one value, it is a complex attribute.
For example, if a person has more than one office and each office has an address made up of a street number and a city, then the address is a composite attribute and the offices are a multivalued attribute; combining them gives a complex attribute.
Key Attribute: An attribute that uniquely identifies each tuple in the relational table is called a key attribute.
For example, the roll number of a student is a key attribute.
In the example below, we have an ER diagram for a table named Employee with several attributes:
Department is a single-valued attribute that can have only one value.
Name is a composite attribute because it is made up of first name, middle name, and last name.
Work experience is a derived attribute, represented by a dotted oval; it can be obtained from another attribute, date of joining.
Phone number is a multivalued attribute, represented by a double oval, because one employee can have more than one phone number.
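A hedged sketch of how these attribute kinds typically end up in SQL tables (the Employee columns here are assumptions based on the example above): composite attributes are split into separate columns, multivalued attributes move to their own table, and derived attributes are computed at query time rather than stored.

CREATE TABLE Employee (
    emp_id          INT PRIMARY KEY,
    first_name      VARCHAR(50),   -- composite attribute Name split into
    middle_name     VARCHAR(50),   -- its component parts
    last_name       VARCHAR(50),
    department      VARCHAR(50),   -- single-valued attribute
    date_of_joining DATE           -- stored attribute used to derive work experience
);
-- multivalued attribute: one employee, many phone numbers
CREATE TABLE EmployeePhone (
    emp_id INT REFERENCES Employee(emp_id),
    phone  VARCHAR(20),
    PRIMARY KEY (emp_id, phone)
);
-- derived attribute computed when queried, not stored
-- (exact date arithmetic syntax varies by DBMS)
SELECT emp_id, (CURRENT_DATE - date_of_joining) AS work_experience_days
FROM Employee;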
https://www.javatpoint.com/attributes-in-dbms

Entity Sets
An entity set is a group of entities of the same entity type.
For example, an entity set of students, an entity set of motorbikes, an entity set of smartphones, an entity set of customers, etc.

Entity sets can be classified into two types:


Strong Entity Set: In a DBMS, a strong entity set consists of a primary key. For example, an
entity of motorbikes with the attributes, motorbike's registration number, motorbike's name,
motorbike's model, and motorbike's colour.
Below is the representation of a strong entity set in tabular form:

Example of Entity Relationship Diagram representation of the above strong entity set:
Weak Entity Set: In a DBMS, a weak entity set does not contain a primary key. For example, An
entity of smartphones with its attributes, phone's name, phone's colour, and phone's RAM.
Below is the representation of a weak entity set in tabular form:

Example of Entity Relationship Diagram representation of the above weak entity set:

https://www.javatpoint.com/entity-in-dbms

Keys
Keys are one of the basic requirements of a relational database model. They are widely used to identify tuples (rows) uniquely in a table. We also use keys to set up relations amongst various columns and tables of a relational database.
Types of Keys
Candidate Key: The minimal set of attributes that can uniquely identify a tuple is known as a candidate key. For example, STUD_NO in the STUDENT relation.
It is a minimal super key, i.e., a super key with no redundant attributes.
It must contain unique values.
In SQL, a candidate key other than the primary key may contain NULL values.
Every table must have at least one candidate key.
A table can have multiple candidate keys but only one primary key (the primary key cannot have a NULL value, so a candidate key that allows NULL values cannot be chosen as the primary key).
There can be more than one candidate key in a relation.
Example:
STUD_NO is the candidate key for relation STUDENT.
Table STUDENT
STUD_NO SNAME ADDRESS PHONE
1 Shyam Delhi 123456789
2 Rakesh Kolkata 223365796
3 Suraj Delhi 175468965
The candidate key can be simple (having only one attribute) or composite as well.
Example:
{STUD_NO, COURSE_NO} is a composite
candidate key for relation STUDENT_COURSE.
Table STUDENT_COURSE
STUD_NO TEACHER_NO COURSE_NO
1 001 C001
2 056 C005

Primary Key: There can be more than one candidate key in a relation, out of which one is chosen as the primary key. For example, STUD_NO as well as STUD_PHONE are candidate keys for the relation STUDENT, but STUD_NO can be chosen as the primary key (only one out of many candidate keys).
It is a unique key.
It identifies exactly one tuple (record) at a time.
It has no duplicate values; all of its values are unique.
It cannot be NULL.
A primary key is not necessarily a single column; more than one column can together form the primary key of a table.
Example:
STUDENT table -> Student(STUD_NO, SNAME, ADDRESS, PHONE) , STUD_NO is a primary key
Table STUDENT
STUD_NO SNAME ADDRESS PHONE
1 Shyam Delhi 123456789
2 Rakesh Kolkata 223365796
3 Suraj Delhi 175468965
Super Key: The set of attributes that can uniquely identify a tuple is known as Super Key. For Example,
STUD_NO, (STUD_NO, STUD_NAME), etc. A super key is a group of single or multiple keys that
identifies rows in a table. It supports NULL values.
Adding zero or more attributes to the candidate key generates the super key.
A candidate key is a super key but vice versa is not true.
Example:
Consider the table shown above.
STUD_NO+PHONE is a super key.

Alternate Key: A candidate key other than the primary key is called an alternate key.
All candidate keys which are not chosen as the primary key are alternate keys.
It is also called a secondary key.
Like any candidate key, an alternate key uniquely identifies records, even though it was not selected as the primary key.
Example:
Consider the STUDENT table shown above.
STUD_NO as well as PHONE are candidate keys for the relation STUDENT; if STUD_NO is chosen as the primary key, PHONE is an alternate key (only one out of many candidate keys becomes the primary key).

Foreign Key: If an attribute can only take the values which are present as values of some other attribute,
it will be a foreign key to the attribute to which it refers. The relation which is being referenced is called
referenced relation and the corresponding attribute is called referenced attribute the relation which refers
to the referenced relation is called referencing relation and the corresponding attribute is called
referencing attribute. The referenced attribute of the referenced relation should be its primary key.
It is a key that acts as a primary key in one table and as a secondary (foreign) key in another table.
It combines two or more relations (tables) at a time.
It acts as a cross-reference between the tables.
For example, DNO is a primary key in the DEPT table and a non-key attribute in EMP.
Example:
Refer Table STUDENT shown above.
STUD_NO in STUDENT_COURSE is a foreign key to STUD_NO in STUDENT relation.
Table STUDENT_COURSE
STUD_NO TEACHER_NO COURSE_NO
1 001 C001
2 056 C005
It may be worth noting that, unlike the primary key of any given relation, a foreign key can be NULL as
well as contain duplicate values, i.e. it need not follow the uniqueness constraint. For Example,
STUD_NO in the STUDENT_COURSE relation is not unique. It has been repeated for the first and third
tuples. However, the STUD_NO in STUDENT relation is a primary key and it needs to be always unique,
and it cannot be null.

Composite Key: Sometimes, a table might not have a single column/attribute that uniquely identifies all
the records of a table. To uniquely identify rows of a table, a combination of two or more
columns/attributes can be used. A poorly chosen combination can still allow duplicate values, so we need
to find a set of attributes that is guaranteed to identify rows in the table uniquely.
It can serve as the primary key when no single attribute uniquely identifies the rows of a table.
Two or more attributes are used together to make a composite key.
Different combinations of attributes may differ in how reliably they identify the rows
uniquely.
Example:
FULLNAME + DOB can be combined together to access the details of a student.
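As an illustration, the keys discussed above could be declared in SQL roughly as follows (the column data types are assumptions made for this sketch):
-- STUD_NO is the primary key; PHONE is another candidate key kept UNIQUE (an alternate key)
CREATE TABLE STUDENT (
    STUD_NO INT PRIMARY KEY,
    SNAME   VARCHAR(30),
    ADDRESS VARCHAR(50),
    PHONE   VARCHAR(15) UNIQUE
);
-- (STUD_NO, COURSE_NO) together form a composite primary key;
-- STUD_NO is also a foreign key referencing STUDENT
CREATE TABLE STUDENT_COURSE (
    STUD_NO    INT,
    TEACHER_NO VARCHAR(10),
    COURSE_NO  VARCHAR(10),
    PRIMARY KEY (STUD_NO, COURSE_NO),
    FOREIGN KEY (STUD_NO) REFERENCES STUDENT(STUD_NO)
);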

https://www.geeksforgeeks.org/types-of-keys-in-relational-model-candidate-super-primary-alternate-and-
foreign/
Relationships and Relationship Sets
Relationship
The association among entities is called a relationship. For example, an employee works_at a
department, a student enrolls in a course. Here, Works_at and Enrolls are called relationships.
A diamond-shaped box represents relationships. All the entities (rectangle-shaped) participating
in a relationship get connected using a line.

There are four types of relationships. These are:


One-to-one: When a single instance of an entity is associated with a single instance of another
entity, it is termed as '1:1'.

One-to-many: When one instance of an entity is associated with more than one instance of
another entity, it is termed as '1:N'.

Many-to-one: When more than one instance of an entity is associated with a single instance of
another entity, it is termed as 'N:1'.

Many-to-many: When more than one instance of an entity on the left and more than one
instance of an entity on the right can be linked with the relationship, then it is termed as 'N:N'
relationship.

Relationship set
A relationship set is a set of relationships of the same type.
Like entities, a relationship too can have attributes. These attributes are called descriptive attributes.
Degree of a Relationship Set: The number of entities sets that participate in a relationship set is termed
as the degree of that relationship set. Thus,
Degree of a relationship set = Number of entities sets participating in a relationship set
Binary = degree 2
Ternary = degree 3
n-ary = degree n
On the basis of degree of a relationship set, a relationship set can be classified into the following types-
Unary relationship set - Unary relationship set is a relationship set where only one entity set
participates in a relationship set.

Binary relationship set - Binary relationship set is a relationship set where two entity sets
participate in a relationship set. Example- Student is enrolled in a Course

Ternary relationship set - Ternary relationship set is a relationship set where three entity sets
participate in a relationship set
N-ary relationship set - N-ary relationship set is a relationship set where ‘n’ entity sets
participate in a relationship set.
Cardinality - Cardinality defines the number of entities in one entity set, which can be associated with
the number of entities of other set via relationship set.
One-to-one − One entity from entity set A can be associated with at most one entity of entity set B and
vice versa.

One-to-many − One entity from entity set A can be associated with more than one entity of entity set B;
however, an entity from entity set B can be associated with at most one entity of entity set A.

Many-to-one − More than one entity from entity set A can be associated with at most one entity of
entity set B; however, an entity from entity set B can be associated with more than one entity from entity
set A.

Many-to-many − One entity from A can be associated with more than one entity from B and vice versa.

https://www.javatpoint.com/dbms-er-model-concept
https://www.gatevidyalay.com/relationship-sets/
https://dbmswithsuman.blogspot.com/p/the-er-model-defines-conceptual-view-of.html
Additional Features of the ER Model
Generalization Specialization and Aggregation in DBMS are abstraction mechanisms used to model
information. The abstraction is the mechanism used to hide the superfluous details of a set of objects.
For example, vehicle is an abstraction that includes the types car, jeep and bus.
Generalization – Generalization is the process of extracting common properties from a set of entities and
creating a generalized entity from them. It is a bottom-up approach in which two or more entities can be
generalized to a higher-level entity if they have some attributes in common.
For Example, STUDENT and FACULTY can be generalized to a higher level entity called PERSON as
shown below. In this case, common attributes like P_NAME, P_ADD become part of higher entity
(PERSON) and specialized attributes like S_FEE become part of specialized entity (STUDENT).

Specialization – In specialization, an entity is divided into sub-entities based on their characteristics. It is


a top-down approach where higher level entity is specialized into two or more lower-level entities.
For Example, EMPLOYEE entity in an Employee management system can be specialized into
DEVELOPER, TESTER etc. as shown below. In this case, common attributes like E_NAME, E_SAL etc.
become part of higher entity (EMPLOYEE) and specialized attributes like TES_TYPE become part of
specialized entity (TESTER).

Aggregation – An ER diagram is not capable of representing relationship between an entity and a


relationship which may be required in some scenarios. In those cases, a relationship with its
corresponding entities is aggregated into a higher level entity. Aggregation is an abstraction through
which we can represent relationships as higher level entity sets.
For Example, Employee working for a project may require some machinery. So, REQUIRE relationship
is needed between relationship WORKS_FOR and entity MACHINERY. Using aggregation,
WORKS_FOR relationship with its entities EMPLOYEE and PROJECT is aggregated into single entity
and relationship REQUIRE is created between aggregated entity and MACHINERY.

https://dbmswithsuman.blogspot.com/p/generalization-generalization-bottom-up.html
https://www.geeksforgeeks.org/generalization-specialization-and-aggregation-in-er-model/

Conceptual Design with the ER Model


Conceptual database design phase starts with the formation of a conceptual data model of the enterprise
that is entirely independent of implementation details such as the target DBMS, use of application
programs, programming languages used, hardware platform, performance issues, or any other physical
deliberations.

The first step in conceptual database design is to build one (or more) conceptual data model of the data
requirements of the enterprise. A conceptual data model comprises the following elements:
entity types
types of relationship
attributes and the various attribute domains
primary keys and alternate keys
integrity constraints
The conceptual data model is maintained by documentation, including ER diagrams and a data dictionary,
which is produced throughout the development of the model.
ER Diagram for Hotel
ER Diagram for Library
UNIT II
Introduction to the Relational Model: Integrity constraint over relations, enforcing integrity
constraints, querying relational data, logical data base design, introduction to views,
destroying/altering tables and views. Relational Algebra, Tuple relational Calculus, Domain
relational calculus.

Relational Model
The relational model represents how data is stored in Relational Databases. A relational database consists
of a collection of tables, each of which is assigned a unique name.
Table Student
ROLL_NO NAME ADDRESS PHONE AGE
1 RAM DELHI 9455123451 18
2 RAMESH GURGAON 9652431543 18
3 SUJIT ROHTAK 9156253131 20
4 SURESH DELHI 18
Important Terminologies:
Attribute: Attributes are the properties that define an entity. e.g.; ROLL_NO, NAME, ADDRESS
Relation Schema: A relation schema defines the structure of the relation and represents the name of the
relation with its attributes. e.g.; STUDENT (ROLL_NO, NAME, ADDRESS, PHONE, and AGE) is the
relation schema for STUDENT. If a schema has more than 1 relation, it is called Relational Schema.
Tuple: Each row in the relation is known as a tuple. The above relation contains 4 tuples, one of which
is shown as:
1 RAM DELHI 9455123451 18
Relation Instance: The set of tuples of a relation at a particular instance of time is called a relation
instance. Table 1 shows the relation instance of STUDENT at a particular time. It can change whenever
there is an insertion, deletion, or update in the database.
Degree: The number of attributes in the relation is known as the degree of the relation.
The STUDENT relation defined above has degree 5.
Cardinality: The number of tuples in a relation is known as cardinality. The STUDENT relation defined
above has cardinality 4.
Column: The column represents the set of values for a particular attribute. The column ROLL_NO is
extracted from the relation STUDENT.
Roll No
1
2
3
4
NULL Values: The value which is not known or unavailable is called a NULL value. It is represented
by blank space. e.g.; PHONE of STUDENT having ROLL_NO 4 is NULL.
Relation Key: These are basically the keys that are used to identify the rows uniquely or also help in
identifying tables. These are of the following types.
Primary Key
Candidate Key
Super Key
Foreign Key
Alternate Key
Composite Key
Constraints in Relational Model
While designing the Relational Model, we define some conditions which must hold for data present in the
database are called Constraints. These constraints are checked before performing any operation (insertion,
deletion, and updation ) in the database. If there is a violation of any of the constraints, the operation will
fail.
• Domain Constraints - These are attribute-level constraints. An attribute can only take values that
lie inside the domain range. e.g.; If a constraint AGE>0 is applied to STUDENT relation,
inserting a negative value of AGE will result in failure.
• Key Integrity - Every relation in the database should have at least one set of attributes that
defines a tuple uniquely. Those set of attributes is called keys. e.g.; ROLL_NO in STUDENT is
key. No two students can have the same roll number. So a key has two properties:
It should be unique for all tuples.
It can’t have NULL values.
• Referential Integrity - When one attribute of a relation can only take values from another
attribute of the same relation or any other relation, it is called referential integrity. Let us suppose
we have 2 relations
Table Student
ROLL_NO NAME ADDRESS PHONE AGE BRANCH_CODE
1 RAM DELHI 9455123451 18 CS
2 RAMESH GURGAON 9652431543 18 CS
3 SUJIT ROHTAK 9156253131 20 ECE
4 SURESH DELHI 18 IT
Table Branch
BRANCH_CODE BRANCH_NAME
CS COMPUTER SCIENCE
IT INFORMATION TECHNOLOGY
ECE ELECTRONICS AND COMMUNICATION ENGINEERING
CV CIVIL ENGINEERING
BRANCH_CODE of STUDENT can only take the values which are present in BRANCH_CODE of
BRANCH which is called referential integrity constraint. The relation which is referencing another
relation is called REFERENCING RELATION (STUDENT in this case) and the relation to which other
relations refer is called REFERENCED RELATION (BRANCH in this case).
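A minimal SQL sketch of this referential integrity constraint (the column types are assumed for illustration):
CREATE TABLE BRANCH (
    BRANCH_CODE VARCHAR(5) PRIMARY KEY,
    BRANCH_NAME VARCHAR(60)
);
CREATE TABLE STUDENT (
    ROLL_NO     INT PRIMARY KEY,
    NAME        VARCHAR(30),
    ADDRESS     VARCHAR(50),
    PHONE       VARCHAR(15),
    AGE         INT CHECK (AGE > 0),                           -- domain constraint
    BRANCH_CODE VARCHAR(5),
    FOREIGN KEY (BRANCH_CODE) REFERENCES BRANCH(BRANCH_CODE)   -- referential integrity
);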
Advantages of the Relational Model
• Simple model: Relational Model is simple and easy to use in comparison to other languages.
• Flexible: The relational model is more flexible than many other data models.
• Secure: The relational model provides better security than many other data models.
• Data Accuracy: Data is more accurate in the relational data model.
• Data Integrity: The integrity of the data is maintained in the relational model.
• Operations can be Applied Easily: It is better to perform operations in the relational model.
Disadvantages of the Relational Model
• Relational Database Model is not very good for large databases.
• Sometimes, it becomes difficult to find the relation between tables.
• Because of the complex structure, the response time for queries is high.
Characteristics of the Relational Model
• Data is represented in rows and columns called relations.
• Data is stored in tables having relationships between them called the Relational model.
• The relational model supports the operations like Data definition, Data manipulation, and
Transaction management.
• Each column has a distinct name and they are representing attributes.
• Each row represents a single entity.
https://www.geeksforgeeks.org/relational-model-in-dbms/

Integrity Constraints over Relations


For any stored data if we want to preserve the consistency and correctness, a relational DBMS typically
imposes one or more data integrity constraints. These constraints restrict the data values which can be
inserted into the database or created by a database update.
Data Integrity Constraints
There are different types of data integrity constraints that are commonly found in relational databases,
including the following −
• Required data − Some columns in a database must contain a valid data value in each row; they
are not allowed to contain NULL values. In the sample database, every order has an
associated customer who placed the order. The DBMS can be asked to prevent NULL
values in this column.
• Validity checking − Every column in a database has a domain, a set of data values which
are legal for that column. The DBMS can be asked to prevent any other data values in these
columns.
• Entity integrity − The primary key of a table contains a unique value in each row that is
different from the values in all other rows. Duplicate values are illegal because they are not
allowing the database to differentiate one entity from another. The DBMS can be asked to
enforce this unique values constraint.
• Referential integrity − A foreign key in a relational database links each row in the child
table containing the foreign key to the row of the parent table containing the matching
primary key value. The DBMS can be asked to enforce this foreign key/primary key
constraint.
• Other data relationships − The real-world situation which is modeled by a database often
has additional constraints which govern the legal data values that may appear in the
database. The DBMS can be asked to check modifications to the tables to make sure that their
values satisfy these constraints.
• Business rules − Updates to a database can be constrained by business rules governing
the real-world transactions which are represented by the updates.
• Consistency − Many real-world transactions cause multiple updates to a database. The
DBMS can be asked to enforce this type of consistency rule or to support applications that
implement such rules.
https://www.tutorialspoint.com/what-are-integrity-constraints-over-the-relation-in-dbms

Enforcing Integrity Constraints in DBMS


Introduction
Integrity constraints are rules that specify the conditions that must be met for the data in a database to be
considered valid. These constraints help to ensure the accuracy and consistency of the data by limiting the
values that can be entered for a particular attribute and specifying the relationships between entities in the
database.
There are several types of integrity constraints that can be enforced in a DBMS:
Domain constraints: These constraints specify the values that can be assigned to an attribute in a
database. For example, a domain constraint might specify that the values for an "age" attribute
must be integers between 0 and 120.
Participation constraints: These constraints specify the relationship between entities in a
database. For example, a participation constraint might specify that every employee must be
assigned to a department.
Entity integrity constraints: These constraints specify rules for the primary key of an entity. For
example, an entity integrity constraint might specify that the primary key cannot be null.
Referential integrity constraints: These constraints specify rules for foreign keys in a database.
For example, a referential integrity constraint might specify that a foreign key value must match
the value of the primary key in another table.
User-defined constraints: These constraints are defined by the database administrator and can be
used to specify custom rules for the data in a database.
Ways to Enforce Integrity Constraints
There are several ways to enforce integrity constraints in a DBMS:
Declarative referential integrity: This method involves specifying the integrity constraints at the
time of database design and allowing the DBMS to enforce them automatically.
Triggers: A trigger is a special type of stored procedure that is executed automatically by the DBMS
when certain events occur (such as inserting, updating, or deleting data). Triggers can be used to
enforce integrity constraints by checking for and rejecting invalid data.
Stored procedures: A stored procedure is a pre-defined set of SQL statements that can be executed
as a single unit. Stored procedures can be used to enforce integrity constraints by performing checks
on the data before it is inserted, updated, or deleted.
Application-level code: Integrity constraints can also be enforced at the application level by writing
code to check for and reject invalid data before it is entered into the database.
It is important to carefully consider the appropriate method for enforcing integrity constraints in a DBMS
in order to ensure the accuracy and consistency of the data.
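As a small sketch of the first two approaches (declarative constraints and a trigger), assuming MySQL-style syntax and hypothetical EMPLOYEE/DEPT tables (client delimiter handling omitted):
-- Declarative enforcement: the DBMS checks these rules automatically
CREATE TABLE EMPLOYEE (
    EMP_ID  INT PRIMARY KEY,                          -- entity integrity
    AGE     INT CHECK (AGE BETWEEN 0 AND 120),        -- domain constraint
    SALARY  DECIMAL(10,2),
    DEPT_NO INT NOT NULL,                             -- participation / required data
    FOREIGN KEY (DEPT_NO) REFERENCES DEPT(DEPT_NO)    -- referential integrity
);
-- Trigger-based enforcement: reject an invalid salary before it is inserted
CREATE TRIGGER check_salary
BEFORE INSERT ON EMPLOYEE
FOR EACH ROW
BEGIN
    IF NEW.SALARY < 0 THEN
        SIGNAL SQLSTATE '45000' SET MESSAGE_TEXT = 'Salary cannot be negative';
    END IF;
END;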
https://www.tutorialandexample.com/enforcing-integrity-constraints-in-dbms

Querying Relational Data


Query - A database query is a request for data from a database. The request should come in a database
table or a combination of tables using a code known as the query language. This way, the system can
understand and process the query accordingly.
How Does Query Work?
Let’s say that you want to buy a Dairy Milk at a general store. You make a request by saying, “Can I
have a Dairy Milk?”. The shopkeeper will understand the meaning of your request and give you the
ordered item.
A query works the same way – it adds meaning to the code, allowing the system to understand and
execute actions accordingly. Be it SQL or any other query language, both the user and the database
can exchange information as long as they use the same language.
Meanwhile, a well-designed database stores data in multiple tables. They consist of columns that
hold the data’s attributes, along with rows or records of information. A query then helps retrieve
data from different tables, arrange them, and display them according to the commands.
A query can either be a select, an action, or a combination of both. Select queries can retrieve
information from data sources, and action queries work for data manipulation, for example, to add,
change or delete data.
Advanced users can also use query commands to perform various programming tasks, from creating
MySQL users and granting permissions to changing WordPress URLs in MySQL databases.
Below are some of the most common query commands along with their functions:
SELECT – fetch data from the database. It’s one of the most popular commands, as every
request begins with a select query.
AND – combine multiple filter conditions within a query.
CREATE TABLE – build different tables and specify the name of each column within.
ORDER BY – sort data results either numerically or alphabetically.
SUM – summarize data from a specific column.
UPDATE – modify existing rows in a table.
INSERT – add new data or rows to an existing table.
WHERE – filter data and get its value based on a set condition.
Advantages if querying:
Review data from multiple tables simultaneously.
Filter records containing only certain fields and of certain criteria.
Automate data management tasks and perform calculations.
These query commands are covered in detail in Unit 3.
https://www.hostinger.com/tutorials/what-is-a-
query#:~:text=Below%20are%20some%20of%20the%20most%20common%20query,modify%20existin
g%20rows%20in%20a%20table.%20More%20items

Logical Database Design


A Logical Database is a special type of ABAP (Advanced Business Application Programming) program
that is used to retrieve data from various tables whose data is interrelated to each other. Also, a logical
database provides a read-only view of data.
Structure Of Logical Database
A Logical database uses only a hierarchical structure of tables i.e. Data is organized in a Tree-like
Structure and the data is stored as records that are connected to each other through edges (Links).
Logical Database contains Open SQL statements which are used to read data from the database. The
logical database reads the data, stores it in the program if required, and passes it line by line
to the application program.

Features of Logical Database


• We can select only that type of Data that we need.
• Data Authentication is done in order to maintain security.
• Logical Database uses hierarchical Structure due to this data integrity is maintained.
Goal Of Logical Database
The goal of Logical Database is to create well-structured tables that reflect the need of the user. The
tables of the Logical database store data in a non-redundant manner and foreign keys will be used in
tables so that relationships among tables and entities will be supported.
Tasks of Logical Database
• With the help of the Logical database, we will read the same data from multiple programs.
• A logical database defines the same user interface for multiple programs.
• Logical Database ensures the Authorization checks for the centralized sensitive database.
• With the help of a Logical Database, Performance is improved. Like in Logical Database we
will use joins instead of multiple SELECT statements, which will improve response time and
this will increase the Performance of Logical Database.
Data View Of Logical Database
Logical Database provides a particular view of Logical Database tables. A logical database is
appropriately used when the structure of the Database is Large. It is convenient to use flow i.e
• SELECT
• READ
• PROCESS
• DISPLAY
In order to work with databases efficiently. The data of the Logical Database is hierarchical in nature.
The tables are linked to each other in a Foreign Key relationship.
Diagrammatically, the Data View of Logical Database is shown as:

Points To Remember
• Tables must have Foreign Key Relationship.
• A logical Database consists of logically related tables that are arranged in a hierarchical
manner used for reading or retrieving Data.
• Logical Database consist of three main elements:
• Structure of Database
• Selections of Data from Database
• Database Program
• If we want to improve the access time on data, then we use VIEWS in Logical Database.
Advantages Of Logical Database
• In a Logical database, we can select meaningful data from a large amount of data.
• Logical Database consists of Central Authorization which checks for Database Accesses is
Authenticated or not.
• In this Coding, the part is less required to retrieve data from the database as compared to
Other Databases.
• Access performance of reading data from the hierarchical structure of the Database is good.
• Easy to understand user interfaces.
• Logical Database first runs check functions which verify that user input is complete,
correct, and plausible.
Disadvantages Of Logical Database
• Logical Database takes more time when the required data is at the last level, because if the
required table is at the lowest level of the hierarchy, all upper-level tables have to be read first,
which slows down the performance.
• In a Logical Database the ENDGET command doesn’t exist; due to this, the code block associated
with an event ends with the next event statement.
https://www.geeksforgeeks.org/logical-database/

Introduction to views
Destroying/Altering Views
Views in SQL are a kind of virtual table. A view also has rows and columns, just as in a real table in
the database. We can create a view by selecting fields from one or more tables present in the database.
A view can either have all the rows of a table or specific rows based on certain conditions.
StudentDetails

StudentMarks

CREATING VIEWS
We can create View using CREATE VIEW statement. A View can be created from a single table or
multiple tables.
Syntax:
CREATE VIEW view_name AS
SELECT column1, column2.....
FROM table_name
WHERE condition;
view_name: Name for the View
table_name: Name of the table
condition: Condition to select rows
Examples:
Creating View from a single table
CREATE VIEW DetailsView AS SELECT NAME, ADDRESS FROM StudentDetails WHERE S_ID
< 5;
SELECT * FROM DetailsView;
Output:
CREATE VIEW StudentNames AS SELECT S_ID, NAME FROM StudentDetails ORDER BY
NAME;
SELECT * FROM StudentNames;
Output:

Creating View from multiple tables


CREATE VIEW MarksView AS
SELECT StudentDetails.NAME, StudentDetails.ADDRESS, StudentMarks.MARKS
FROM StudentDetails, StudentMarks
WHERE StudentDetails.NAME = StudentMarks.NAME;
SELECT * FROM MarksView;
Output:

LISTING ALL VIEWS IN A DATABASE


We can list Views using the SHOW FULL TABLES statement or using the information_schema table.
Syntax: Using SHOW FULL TABLES
Ex: show full tables where table_type like "%VIEW";
Syntax: Using information_schema
Ex: select * from information_schema.views
where table_schema = "database_name";
Ex: select table_schema,table_name,view_definition
from information_schema.views
where table_schema = "database_name";

DELETING VIEWS
SQL allows us to delete an existing View. We can delete or drop a View using the DROP statement.
Syntax: DROP VIEW view_name;
Ex: DROP VIEW MarksView;

UPDATING VIEWS
There are certain conditions that need to be satisfied to update a view. If any one of these conditions
is not met, then we will not be allowed to update the view.
• The SELECT statement which is used to create the view should not include GROUP BY clause
or ORDER BY clause.
• The SELECT statement should not have the DISTINCT keyword.
• The View should have all NOT NULL values.
• The view should not be created using nested queries or complex queries.
• The view should be created from a single table. If the view is created using multiple tables then
we will not be allowed to update the view.

CREATE OR REPLACE VIEW


Syntax:
CREATE OR REPLACE VIEW view_name AS
SELECT column1,column2,..
FROM table_name
WHERE condition;
Ex:
CREATE OR REPLACE VIEW MarksView AS
SELECT StudentDetails.NAME, StudentDetails.ADDRESS, StudentMarks.MARKS,
StudentMarks.AGE
FROM StudentDetails, StudentMarks
WHERE StudentDetails.NAME = StudentMarks.NAME;
SELECT * FROM MarksView;
Output:

Inserting a row in a view


Syntax:
INSERT INTO view_name(column1, column2 , column3,..)
VALUES(value1, value2, value3..);
view_name: Name of the View
Ex:
INSERT INTO DetailsView(NAME, ADDRESS)
VALUES("Suresh","Gurgaon");
SELECT * FROM DetailsView;
Output:

Deleting a row from a View


Syntax:
DELETE FROM view_name WHERE condition;
view_name:Name of view from where we want to delete rows
condition: Condition to select rows
Ex:
DELETE FROM DetailsView
WHERE NAME="Suresh";
SELECT * FROM DetailsView;
Output:
WITH CHECK OPTION
The With check option clause in SQL is a very useful clause for views. It is applicable to an updatable
view. If the view is not updatable, then there is no meaning of including this clause in the CREATE
VIEW statement.
• The WITH CHECK OPTION clause is used to prevent the insertion of rows in the view
where the condition in the WHERE clause in CREATE VIEW statement is not satisfied.
• If we have used the WITH CHECK OPTION clause in the CREATE VIEW statement, and
if the UPDATE or INSERT clause does not satisfy the conditions then they will return an
error.
Ex:
CREATE VIEW SampleView AS SELECT S_ID, NAME FROM StudentDetails
WHERE NAME IS NOT NULL WITH CHECK OPTION;
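For instance, with SampleView above, an insert whose row would not satisfy the view's WHERE clause is rejected (a hypothetical example):
-- Fails: NAME is NULL, so the new row would not be visible through the view
INSERT INTO SampleView (S_ID, NAME) VALUES (7, NULL);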
Uses of a View
• Restricting data access – Views provide an additional level of table security by restricting
access to a predetermined set of rows and columns of a table.
• Hiding data complexity – A view can hide the complexity that exists in multiple tables join.
• Simplify commands for the user – Views allow the user to select information from multiple
tables without requiring the users to actually know how to perform a join.
• Store complex queries – Views can be used to store complex queries.
• Rename Columns – Views can also be used to rename the columns without affecting the base
tables provided the number of columns in view must match the number of columns specified in
select statement. Thus, renaming helps to hide the names of the columns of the base tables.
• Multiple view facility – Different views can be created on the same table for different users.
https://www.geeksforgeeks.org/sql-views/

Relational Algebra
Relational algebra is a procedural query language. It gives a step by step process to obtain the result of the
query. It uses operators to perform queries.
Types of Relational operation
Select Operation:
• The select operation selects tuples that satisfy a given predicate.
• It is denoted by sigma (σ).
• Notation: σ p(r), where σ denotes selection, r is a relation, and p is a
propositional logic formula which may use connectives like AND, OR and NOT.
• The predicate p may use relational operators like =, ≠, ≥, <, >, ≤.
Ex: LOAN Relation
BRANCH_NAME LOAN_NO AMOUNT
Downtown L-17 1000
Redwood L-23 2000
Perryride L-15 1500
Downtown L-14 1500
Mianus L-13 500
Roundhill L-11 900
Perryride L-16 1300
Input: σ BRANCH_NAME="perryride" (LOAN)
Output:
BRANCH_NAME LOAN_NO AMOUNT
Perryride L-15 1500
Project Operation:
• This operation shows the list of those attributes that we wish to appear in the result. Rest of the
attributes are eliminated from the table.
• It is denoted by ∏.
• Notation: ∏ A1, A2, ..., An (r), where A1, A2, ..., An are attribute names of relation r.
Ex: CUSTOMER RELATION
NAME STREET CITY
Jones Main Harrison
Smith North Rye
Hays Main Harrison
Curry North Rye
Johnson Alma Brooklyn
Brooks Senator Brooklyn

Input: ∏ NAME, CITY (CUSTOMER)


Output:
NAME CITY
Jones Harrison
Smith Rye
Hays Harrison
Curry Rye
Johnson Brooklyn
Brooks Brooklyn
Union Operation:
• Suppose there are two relations R and S. The union operation contains all the tuples that are either in
R or S or both in R & S.
• It eliminates the duplicate tuples. It is denoted by ∪.
• Notation: R ∪ S
• A union operation must hold the following conditions:
o R and S must have the same number of attributes with compatible domains.
o Duplicate tuples are eliminated automatically.
Ex: DEPOSITOR RELATION
CUSTOMER_NAME ACCOUNT_NO
Johnson A-101
Smith A-121
Mayes A-321
Turner A-176
Johnson A-273
Jones A-472
Lindsay A-284
BORROW RELATION
CUSTOMER_NAME LOAN_NO
Jones L-17
Smith L-23
Hayes L-15
Jackson L-14
Curry L-93
Smith L-11
Williams L-17
Input: ∏ CUSTOMER_NAME (BORROW) ∪ ∏ CUSTOMER_NAME (DEPOSITOR)
Output:
CUSTOMER_NAME
Johnson
Smith
Hayes
Turner
Jones
Lindsay
Jackson
Curry
Williams
Mayes
Set Intersection:
• Suppose there are two relations R and S. The set intersection operation contains all tuples that are in
both R & S.
• It is denoted by intersection ∩.
• Notation: R ∩ S
Ex: Using the above DEPOSITOR table and BORROW table
Input: ∏ CUSTOMER_NAME (BORROW) ∩ ∏ CUSTOMER_NAME (DEPOSITOR)
Output:
CUSTOMER_NAME
Smith
Jones
Set Difference:
• Suppose there are two relations R and S. The set difference operation contains all tuples that are in
R but not in S.
• It is denoted by minus (-).
• Notation: R - S
Ex: Using the above DEPOSITOR table and BORROW table
Input: ∏ CUSTOMER_NAME (BORROW) - ∏ CUSTOMER_NAME (DEPOSITOR)
Output:
CUSTOMER_NAME
Jackson
Hayes
Williams
Curry
Cartesian product:
• The Cartesian product is used to combine each row in one table with each row in the other table. It
is also known as a cross product.
• It is denoted by X.
• Notation: E X D
Ex: EMPLOYEE
EMP_ID EMP_NAME EMP_DEPT
1 Smith A
2 Harry C
3 John B
DEPARTMENT
DEPT_NO DEPT_NAME
A Marketing
B Sales
C Legal
Input: EMPLOYEE X DEPARTMENT
Output:
EMP_ID EMP_NAME EMP_DEPT DEPT_NO DEPT_NAME
1 Smith A A Marketing
1 Smith A B Sales
1 Smith A C Legal
2 Harry C A Marketing
2 Harry C B Sales
2 Harry C C Legal
3 John B A Marketing
3 John B B Sales
3 John B C Legal
Rename Operation:
The rename operation is used to rename the output relation. It is denoted by rho (ρ).
Ex: ρ(STUDENT1, STUDENT)
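For reference, the relational algebra examples above can be written in SQL as follows (a sketch assuming the same tables; MINUS is Oracle syntax, standard SQL uses EXCEPT, and DISTINCT mimics the set semantics of projection):
-- Selection: σ BRANCH_NAME="perryride" (LOAN)
SELECT * FROM LOAN WHERE BRANCH_NAME = 'Perryride';
-- Projection: ∏ NAME, CITY (CUSTOMER)
SELECT DISTINCT NAME, CITY FROM CUSTOMER;
-- Union, intersection and difference of customer names
SELECT CUSTOMER_NAME FROM BORROW UNION SELECT CUSTOMER_NAME FROM DEPOSITOR;
SELECT CUSTOMER_NAME FROM BORROW INTERSECT SELECT CUSTOMER_NAME FROM DEPOSITOR;
SELECT CUSTOMER_NAME FROM BORROW MINUS SELECT CUSTOMER_NAME FROM DEPOSITOR;
-- Cartesian product: EMPLOYEE X DEPARTMENT
SELECT * FROM EMPLOYEE CROSS JOIN DEPARTMENT;
-- Rename: ρ(STUDENT1, STUDENT)
SELECT * FROM STUDENT AS STUDENT1;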
https://www.javatpoint.com/dbms-relational-algebra

Relational Calculus
As Relational Algebra is a procedural query language, Relational Calculus is a non-procedural query
language. It basically deals with the end result: it specifies what to retrieve but not how to
retrieve it.
There are two types of Relational Calculus
1. Tuple Relational Calculus(TRC)
2. Domain Relational Calculus(DRC)
https://www.geeksforgeeks.org/introduction-of-relational-algebra-in-dbms/

Tuple Relational Calculus


• Tuple Relational Calculus (TRC) is a non-procedural query language used in relational database
management systems (RDBMS) to retrieve data from tables. TRC is based on the concept of tuples,
which are ordered sets of attribute values that represent a single row or record in a database table.
• TRC is a declarative language, meaning that it specifies what data is required from the database,
rather than how to retrieve it. TRC queries are expressed as logical formulas that describe the
desired tuples.
Syntax: { t | P(t) }
where t is a tuple variable and P(t) is a logical formula that describes the conditions that the tuples in the
result must satisfy. The curly braces {} are used to indicate that the expression is a set of tuples.
Consider a relation Employees(EmployeeID, Name, Salary, DepartmentID).
To retrieve all employees who earn more than $50,000 per year, we can use the following
TRC query: { t | Employees(t) ∧ t.Salary > 50000 }
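For comparison, the same request written in SQL (assuming the Employees relation above) would be:
SELECT * FROM Employees WHERE Salary > 50000;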
• TRC can also be used to perform more complex queries, such as joins and nested queries, by using
additional logical operators and expressions.
• TRC is a powerful query language, it can be more difficult to write and understand than other SQL-
based query languages, such as Structured Query Language (SQL).
• It is useful in certain applications, such as in the formal verification of database schemas and in
academic research.
• TRC is a non-procedural query language, unlike relational algebra. Tuple Calculus provides only
the description of the query but it does not provide the methods to solve it. Thus, it explains what
to do but not how to do it.
Table Loan
Loan number Branch name Amount
L33 ABC 10000
L35 DEF 15000
L49 GHI 9000
L98 DEF 65000
Query: Find the loan number, branch, and amount of loans greater than or equal to 10000 amount.
{t| t ∈ loan ∧ t[amount]>=10000}
Loan number Branch name Amount
L33 ABC 10000
L35 DEF 15000
L98 DEF 65000

Query: Find the loan number for each loan of an amount greater or equal to 10000.
{t| ∃ s ∈ loan(t[loan number] = s[loan number] ∧ s[amount]>=10000)}
Loan number
L33
L35
L98
https://www.geeksforgeeks.org/tuple-relational-calculus-trc-in-dbms/

Domain Relational Calculus


Domain Relational Calculus is a non-procedural query language equivalent in power to Tuple
Relational Calculus. Domain Relational Calculus provides only the description of the query but it does
not provide the methods to solve it. In Domain Relational Calculus, a query is expressed as,
{ < x1, x2, x3, ..., xn > | P (x1, x2, x3, ..., xn ) }
where, < x1, x2, x3, …, xn > represents resulting domains variables and P (x1, x2, x3, …, xn ) represents
the condition or formula equivalent to the Predicate calculus.
Predicate Calculus Formula:
1. Set of all comparison operators
2. Set of connectives like and, or, not
3. Set of quantifiers
Table Loan
Loan number Branch name Amount
L01 Main 200
L03 Main 150
L10 Sub 90
L08 Main 60
Query: Find the loan number, branch, and amount of loans greater than or equal to 100.
{≺l, b, a≻ | ≺l, b, a≻ ∈ loan ∧ (a ≥ 100)}
Loan number Branch name Amount
L01 Main 200
L03 Main 150
Query: Find the loan number for each loan of an amount greater than or equal to 150.
{≺l≻ | ∃ b, a (≺l, b, a≻ ∈ loan ∧ (a ≥ 150))}
Loan number
L01
L03
https://www.geeksforgeeks.org/domain-relational-calculus-in-dbms/
UNIT 3

SQL: QUERIES, CONSTRAINTS, TRIGGERS: form of basic SQL query, UNION, INTERSECT,
and EXCEPT, Nested Queries, aggregation operators, NULL values, complex integrity constraints
in SQL, triggers and active data bases.
Schema Refinement: Problems caused by redundancy, decompositions, problems related to
decomposition, reasoning about functional dependencies, FIRST, SECOND, THIRD normal forms,
BCNF, lossless join decomposition, multi-valued dependencies, FOURTH normal form, FIFTH normal
form.

The basic form of an SQL query:


SELECT [DISTINCT] column_name1, column_name2, ...
FROM table_name
[WHERE condition]
[GROUP BY column_list]
[HAVING condition]
[ORDER BY column_list];
• SELECT specifies which columns are to appear in the output; DISTINCT eliminates duplicates
• FROM specifies the tables to be used
• WHERE filters the rows according to the condition. The WHERE condition is a boolean
combination (using AND, OR, and NOT) of conditions of the form expression op expression,
where op is one of the comparison operators (=, <>, <, <=, >, >=)
• GROUP BY forms groups of rows with the same column value
• HAVING filters the groups
• ORDER BY sorts the output
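A complete query combining these clauses might look like this (a sketch assuming an emp table with deptno and sal columns):
SELECT deptno, AVG(sal) AS avg_sal
FROM emp
WHERE sal > 1000
GROUP BY deptno
HAVING AVG(sal) > 2000
ORDER BY avg_sal DESC;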

Set Operations:
The SQL Set operations are used to combine two or more SQL SELECT statements.
• Union
• Union All
• Except (Minus)
• Intersect
Union
• The SQL Union operation is used to combine the results of two or more SQL SELECT queries.
• In the union operation, the number of columns and their data types must be the same in both the tables on
which the UNION operation is being applied.
• The union operation eliminates the duplicate rows from its result set.
• Syntax :
SELECT column_name FROM table1
UNION
SELECT column_name FROM table2;
• Example:
First_Table
ID NAME
1 Jack
2 Harry
3 Jackson
Second_Table
ID NAME
3 Jackson
4 Stephan
5 David
Result Table: SELECT * FROM First_Table UNION SELECT * FROM Second_Table
ID NAME
1 Jack
2 Harry
3 Jackson
4 Stephan
5 David
UnionAll
• Union All operation is equal to the Union operation. It returns the set without removing
duplicates and without sorting the data.
• Syntax :
SELECT column_name FROM table1
UNION ALL
SELECT column_name FROM table2;
• Example:
SELECT * FROM First_Table UNION ALL SELECT * FROM Second_Table
ID NAME
1 Jack
2 Harry
3 Jackson
3 Jackson
4 Stephan
5 David

Intersect
• It is used to combine two SELECT statements. The Intersect operation returns the common rows
from both the SELECT statements.
• In the Intersect operation, the number of columns and their data types must be the same.
• It has no duplicates and it arranges the data in ascending order by default.
• Syntax :
SELECT column_name FROM table1
INTERSECT
SELECT column_name FROM table2;
• Example:
SELECT * FROM First_Table INTERSECT SELECT * FROM Second_Table
ID NAME
3 Jackson
Minus
• It combines the result of two SELECT statements. The Minus operator is used to display the rows which
are present in the first query but absent in the second query.
• It has no duplicates and the data is arranged in ascending order by default.
• Syntax :
SELECT column_name FROM table1
MINUS
SELECT column_name FROM table2;
• Example:
SELECT * FROM First_Table MINUS SELECT * FROM Second_Table
ID NAME
1 Jack
2 Harry
https://www.javatpoint.com/dbms-sql-set-operation

SQL Queries/Commands
• SQL commands are instructions. It is used to communicate with the database. It is also used to
perform specific tasks, functions, and queries of data.
• SQL can perform various tasks like create a table, add data to tables, drop the table, modify the
table, set permission for users.
• There are five types of SQL commands: DDL, DML, DCL, TCL, and DQL.
Data Definition Language (DDL)
• DDL changes the structure of the table like creating a table, deleting a table, altering a table, etc.
• All the command of DDL are auto-committed that means it permanently save all the changes in
the database.
• Here are some commands that come under DDL:
o CREATE - It is used to create a new table in the database
Syntax: CREATE TABLE TABLE_NAME (COLUMN_NAME DATATYPES[,....]);
Ex:
CREATE TABLE EMPLOYEE(Name VARCHAR2(20), Email VARCHAR2(100), DO
B DATE);
o ALTER - It is used to alter the structure of the database. This change could be either to
modify the characteristics of an existing attribute or probably to add a new attribute.
Syntax:
🟂 To add a new column in the table
ALTER TABLE table_name ADD column_name COLUMN-definition;
🟂 To modify existing column in the table:
ALTER TABLE table_name MODIFY(column_definitions....);
Ex:
🟂 ALTER TABLE STU_DETAILS ADD(ADDRESS VARCHAR2(20));
🟂 ALTER TABLE STU_DETAILS MODIFY (NAME VARCHAR2(20));
o DROP - It is used to delete both the structure and record stored in the table.
Syntax: DROP TABLE table_name;
Ex: DROP TABLE EMPLOYEE;
o TRUNCATE - It is used to delete all the rows from the table and free the space
containing the table.
Syntax: TRUNCATE TABLE table_name;
Ex: TRUNCATE TABLE EMPLOYEE;
Data Manipulation Language
• DML commands are used to modify the database. It is responsible for all form of changes in the
database.
• The command of DML is not auto-committed that means it can't permanently save all the changes
in the database. They can be rollback.
• Here are some commands that come under DML:
o INSERT - The INSERT statement is a SQL query. It is used to insert data into the row of a table.
Syntax: INSERT INTO TABLE_NAME (col1, col2, col3, .... colN) VALUES (value1, value2, value3, .... valueN);
Or
INSERT INTO TABLE_NAME VALUES (value1, value2, value3, .... valueN);
Ex: INSERT INTO javatpoint (Author, Subject) VALUES ("Sonoo", "DBMS");
o UPDATE - This command is used to update or modify the value of a column in the table.
Syntax: UPDATE table_name SET [column_name1 = value1,..column_nameN = valueN] [WHERE CONDITION]
For example: UPDATE students SET User_Name = 'Sonoo' WHERE Student_Id = '3'
o DELETE - It is used to remove one or more row from a table.
Syntax: DELETE FROM table_name [WHERE condition];
For example: DELETE FROM javatpoint WHERE Author="Sonoo";
Data Control Language
• DCL commands are used to grant and take back authority from any database user.
• Here are some commands that come under DCL:
o Grant - It is used to give user access privileges to a database.
Ex: GRANT SELECT, UPDATE ON MY_TABLE TO SOME_USER, ANOTHER_USER;
o Revoke - It is used to take back permissions from the user.
Ex: REVOKE SELECT, UPDATE ON MY_TABLE FROM USER1, USER2;
Transaction Control Language
• TCL commands can only be used with DML commands like INSERT, DELETE and UPDATE.
• These operations are automatically committed in the database, which is why they cannot be used while
creating tables or dropping them.
• Here are some commands that come under TCL:
o COMMIT - Commit command is used to save all the transactions to the database.
Syntax: COMMIT;
Ex: DELETE FROM CUSTOMERS WHERE AGE = 25;
COMMIT;
o ROLLBACK – Rollback command is used to undo transactions that have not already been
saved to the database.
Syntax: ROLLBACK;
Ex: DELETE FROM CUSTOMERS WHERE AGE = 25; ROLLBACK;
o SAVEPOINT – It is used to roll the transaction back to a certain point without rolling back
the entire transaction.
Syntax: SAVEPOINT SAVEPOINT_NAME;
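A short usage sketch of SAVEPOINT with ROLLBACK (table and savepoint names are illustrative):
DELETE FROM CUSTOMERS WHERE AGE = 25;
SAVEPOINT after_delete;
UPDATE CUSTOMERS SET AGE = 30 WHERE AGE = 27;
ROLLBACK TO after_delete;   -- undoes the UPDATE but keeps the DELETE
COMMIT;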
Data Query Language
• DQL is used to fetch the data from the database.
• It uses only one command:
o SELECT - This is the same as the projection operation of relational algebra. It is used to
select the attribute based on the condition described by WHERE clause.
Syntax: SELECT expressions FROM TABLES WHERE conditions;
Ex: SELECT emp_name FROM employee WHERE age > 20;
https://www.javatpoint.com/dbms-sql-command

NESTED QUERIES
A nested query is a query that has another query embedded within it. The embedded query is called a
subquery.
Inner or sub query returns a value which is used by the outer query.
A subquery typically appears within the WHERE clause of a query. It can sometimes appear in the
FROM clause or HAVING clause.
Types of subqueries:-
• Single row subquery - These return only one row from the inner select statement. They use
only single-row operators (>, =, <, <=, >=).
Ex:- select ename, sal, job from emp where sal > (select sal from emp where empno=7566);
• Multiple row subquery - Subqueries that return more than one row are called multiple row
subqueries. In this case multiple row operators are used.
o IN - equal to any value in the list.
o ANY - compares the value to each value returned by the subquery.
🟂 <ANY means less than the maximum value returned.
🟂 >ANY means greater than the minimum value returned.
Ex:- select empno, ename, job from emp where sal < any (select sal from emp where job='clerk');
• Multiple column subquery - In this the subquery returns multiple columns.
Ex:- select ename, deptno from emp where (empno, deptno) in (select empno, deptno from emp
where sal>1200);
• Inline subquery - In this the subquery may appear in the SELECT list or in the FROM clause.
Ex:- Select ename, sal, deptno from (select ename, sal, deptno, mgr, hiredate from emp);
• Correlated subquery - In this, information from the outer select participates as a condition in the inner select.
Ex:- select deptno, ename, sal from emp x where sal > (select avg(sal) from emp
where x.deptno = deptno) order by deptno;
Here the outer query is executed first and it passes the value of deptno to the inner query; then the inner
query is executed and gives the result to the outer query.
Ex:
Student Table
create table student(id number(10), name varchar2(20), classID number(10), marks varchar2(20));
Insert into student values(1,'pinky',3,2.4);
Insert into student values(2,'bob',3,1.44);
Insert into student values(3,'Jam',1,3.24);
Insert into student values(4,'lucky',2,2.67);
Insert into student values(5,'ram',2,4.56);
select * from student;
Id Name classID Marks
1 Pinky 3 2.4
2 Bob 3 1.44
3 Jam 1 3.24
4 Lucky 2 2.67
5 Ram 2 4.56

Teacher Table
Create table teacher(id number(10), name varchar2(20), subject varchar2(10), classID number(10), salary number(30));
Insert into teacher values(1,'bhanu','computer',3,5000);
Insert into teacher values(2,'rekha','science',1,5000);
Insert into teacher values(3,'siri','social',NULL,4500);
Insert into teacher values(4,'kittu','maths',2,5500);
select * from teacher;
Id Name Subject classID Salary
1 Bhanu Computer 3 5000
2 Rekha Science 1 5000
3 Siri Social NULL 4500
4 Kittu Maths 2 5500

Class Table
Create table class(id number(10), grade number(10), teacherID number(10), noofstudents number(10));
insert into class values(1,8,2,20);
insert into class values(2,9,3,40);
insert into class values(3,10,1,38);
select * from class;
Id Grade teacherID No.ofstudents
1 8 2 20
2 9 3 40
3 10 1 38

Examples:

1. Select AVG(noofstudents) from class where teacherID IN(
Select id from teacher Where subject='science' OR subject='maths');
Output - 20.0
2. SELECT * FROM student WHERE classID = (
SELECT id FROM class WHERE noofstudents = (
SELECT MAX(noofstudents) FROM class));
Output - 4|lucky |2|2.67
5|ram |2|4.56
https://www.tutorialspoint.com/explain-about-nested-queries-in-dbms
Aggregation Operators
SQL aggregation function is used to perform the calculations on multiple rows of a single column of a
table. It returns a single value.
It is also used to summarize the data.
Types of SQL Aggregation Function:
o COUNT - COUNT function is used to Count the number of rows in a database table. It can work
on both numeric and non-numeric data types.
COUNT function uses the COUNT(*) that returns the count of all the rows in a specified table.
COUNT(*) considers duplicate and Null.
Syntax - COUNT(*) or COUNT( [ALL|DISTINCT] expression )
Table: PRODUCT_MAST
PRODUCT COMPANY QTY RATE COST
Item1 Com1 2 10 20
Item2 Com2 3 25 75
Item3 Com1 2 30 60
Item4 Com3 5 10 50
Item5 Com2 2 20 40
Item6 Com1 3 25 75
Item7 Com1 5 30 150
Item8 Com1 3 10 30
Item9 Com2 2 25 50
Item10 Com3 4 30 120
Ex:
SELECT COUNT(*) FROM PRODUCT_MAST; Output: 10
SELECT COUNT(*) FROM PRODUCT_MAST WHERE RATE>=20; Output: 7
COUNT() with DISTINCT:
SELECT COUNT(DISTINCT COMPANY) FROM PRODUCT_MAST; Output: 3
COUNT() with GROUP BY:
SELECT COMPANY, COUNT(*) FROM PRODUCT_MAST GROUP BY COMPANY;
Output:
Com1 5
Com2 3
Com3 2
COUNT() with HAVING:
SELECT COMPANY, COUNT(*) FROM PRODUCT_MAST GROUP BY COMPANY
HAVING COUNT(*)>2;
Output:
Com1 5
Com2 3
o SUM - Sum function is used to calculate the sum of all selected columns. It works on numeric
fields only.
Syntax - SUM() or SUM( [ALL|DISTINCT] expression )
Ex:
SELECT SUM(COST) FROM PRODUCT_MAST; Output: 670
SUM() with WHERE:
SELECT SUM(COST) FROM PRODUCT_MAST WHERE QTY>3; Output: 320
SUM() with GROUP BY:
SELECT COMPANY, SUM(COST) FROM PRODUCT_MAST WHERE QTY>3 GROUP BY COMPANY;
Output:
Com1 150
Com3 170
SUM() with HAVING:
SELECT COMPANY, SUM(COST) FROM PRODUCT_MAST GROUP BY COMPANY
HAVING SUM(COST)>=170;
Output:
Com1 335
Com3 170
o AVG - The AVG function is used to calculate the average value of the numeric type. AVG
function returns the average of all non-Null values.
Syntax - AVG() or AVG( [ALL|DISTINCT] expression )
Ex:
SELECT AVG(COST) FROM PRODUCT_MAST; Output: 67.00
o MAX - MAX function is used to find the maximum value of a certain column. This function
determines the largest value of all selected values of a column.
Syntax - MAX() or MAX( [ALL|DISTINCT] expression )
Ex:
SELECT MAX(RATE) FROM PRODUCT_MAST; Output: 30
o MIN - MIN function is used to find the minimum value of a certain column. This function
determines the smallest value of all selected values of a column.
Syntax - MIN() or MIN( [ALL|DISTINCT] expression )
Ex:
SELECT MIN(RATE) FROM PRODUCT_MAST; Output: 10
https://www.javatpoint.com/dbms-sql-aggregate-function

NULL values
A field with a NULL value is a field with no value.
If a field in a table is optional, it is possible to insert a new record or update a record without adding a
value to this field. Then, the field will be saved with a NULL value.
How to Test for NULL Values
It is not possible to test for NULL values with comparison operators, such as =, <, or
<>. We will have to use the IS NULL and IS NOT NULL operators instead.
o The IS NULL Operator - The IS NULL operator is used to test for empty values (NULL values).
Syntax: SELECT column_names FROM table_name WHERE column_name IS NULL; Ex:
SELECT CustomerName, ContactName, Address FROM Customers WHERE Address IS
NULL;
o The IS NOT NULL Operator - The IS NOT NULL operator is used to test for non-empty values
(NOT NULL values).
Syntax: SELECT column_names FROM table_name WHERE column_name IS NOT
NULL; Ex: SELECT CustomerName, ContactName, Address FROM Customers WHERE
Address IS NOT NULL;
https://www.w3schools.com/sql/sql_null_values.asp

Complex Integrity Constraints in SQL


Integrity constraints are a set of rules. It is used to maintain the quality of information. Integrity
constraints ensure that the data insertion, updating, and other processes have to be performed in such a
way that data integrity is not affected.
Thus, integrity constraints are used to guard against accidental damage to the database.
Types of Integrity Constraint
o Domain constraints - Domain constraints can be defined as the definition of a valid set of values
for an attribute.
The data type of domain includes string, character, integer, time, date, currency, etc. The value of
the attribute must be available in the corresponding domain.

o Entity integrity constraints - The entity integrity constraint states that primary key value can't be
null.
This is because the primary key value is used to identify individual rows in relation and if the
primary key has a null value, then we can't identify those rows.
A table can contain a null value other than the primary key field.
o Referential Integrity Constraints - A referential integrity constraint is specified between two
tables.
In the Referential integrity constraints, if a foreign key in Table 1 refers to the Primary Key of
Table 2, then every value of the Foreign Key in Table 1 must be null or be available in Table 2.

o Key constraints - Keys are the set of attributes used to identify an entity within its entity set
uniquely.
An entity set can have multiple keys, but out of them one key will be the primary key. A primary
key must contain a unique value and cannot be null in the relational table.
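A small sketch combining these constraint types in SQL (table and column names are illustrative; CHECK support varies across database systems):
CREATE TABLE ACCOUNT (
    ACC_NO   INT PRIMARY KEY,                                          -- entity integrity / key constraint
    CUST_ID  INT NOT NULL,                                             -- required data
    ACC_TYPE VARCHAR(10) CHECK (ACC_TYPE IN ('SAVINGS', 'CURRENT')),   -- domain constraint
    BALANCE  DECIMAL(12,2) CHECK (BALANCE >= 0),                       -- domain / business rule
    FOREIGN KEY (CUST_ID) REFERENCES CUSTOMER(CUST_ID)                 -- referential integrity
);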
https://www.javatpoint.com/dbms-integrity-constraints
https://www.geeksforgeeks.org/sql-constraints/

Triggers
A Trigger in Structured Query Language is a set of procedural statements which are executed
automatically when there is any response to certain events on the particular table in the database. Triggers
are used to protect the data integrity in the database.
In SQL, this concept is the same as the trigger in real life. For example, when we pull the gun trigger, the
bullet is fired.
In Structured Query Language, triggers are called only either before or after the below
events: INSERT Event: This event is called when the new row is entered in the table.
UPDATE Event: This event is called when the existing record is changed or modified in the
table.
DELETE Event: This event is called when the existing record is removed from the table.
Types of Triggers in SQL
AFTER INSERT Trigger - This trigger is invoked after the insertion of data in the table.
AFTER UPDATE Trigger - This trigger is invoked in SQL after the modification of the data in
the table.
AFTER DELETE Trigger - This trigger is invoked after deleting the data from the table.
BEFORE INSERT Trigger - This trigger is invoked before the inserting the record in the table.
BEFORE UPDATE Trigger - This trigger is invoked before the updating the record in the table.
BEFORE DELETE Trigger - This trigger is invoked before deleting the record from the table.
Syntax of Trigger in SQL
CREATE TRIGGER Trigger_Name
[ BEFORE | AFTER ] [ Insert | Update | Delete]
ON [Table_Name]
[ FOR EACH ROW | FOR EACH COLUMN ]
AS
Set of SQL Statement
Example:
Student_Trigger table
CREATE TABLE Student_Trigger
(
Student_RollNo INT NOT NULL PRIMARY KEY,
Student_FirstName Varchar (100),
Student_EnglishMarks INT,
Student_PhysicsMarks INT,
Student_ChemistryMarks INT,
Student_MathsMarks INT,
Student_TotalMarks INT,
Student_Percentage INT );
The following query creates a trigger that fires before the insertion of a student record in the table:
CREATE TRIGGER Student_Table_Marks
BEFORE INSERT
ON
Student_Trigger
FOR EACH ROW
SET new.Student_TotalMarks = new.Student_EnglishMarks + new.Student_PhysicsMarks +
new.Student_ChemistryMarks + new.Student_MathsMarks,
new.Student_Percentage = ( new.Student_TotalMarks / 400) * 100;
The following query inserts the record into Student_Trigger table:
INSERT INTO Student_Trigger (Student_RollNo, Student_FirstName, Student_EnglishMarks,
Student_PhysicsMarks, Student_ChemistryMarks, Student_MathsMarks, Student_TotalMarks,
Student_Percentage) VALUES ( 201, 'Surya', 88, 75, 69, 92, 0, 0);
To check the output of the above INSERT statement, you have to type the following SELECT
statement:
SELECT * FROM Student_Trigger;
Student_RollNo  Student_FirstName  Student_EnglishMarks  Student_PhysicsMarks  Student_ChemistryMarks  Student_MathsMarks  Student_TotalMarks  Student_Percentage
201             Surya              88                    75                    69                      92                  324                 81
Advantages of Triggers in SQL
Triggers provide an alternative way of maintaining data and referential integrity in the tables.
Triggers help in executing scheduled tasks because they are called automatically.
They catch errors in the database layer of various businesses.
They allow the database users to validate values before inserting and updating.
Disadvantages of Triggers in SQL
They are not compiled.
It is difficult to find and debug errors in triggers.
If complex code is used in a trigger, it makes the application run slower.
Triggers increase the load on the database system.
https://www.javatpoint.com/trigger-in-sql
Active Databases
An active Database is a database consisting of a set of triggers. These databases are very difficult to be
maintained because of the complexity that arises in understanding the effect of these triggers. In such
database, DBMS initially verifies whether the particular trigger specified in the statement that modifies
the database is activated or not, prior to executing the statement. If the trigger is active then DBMS
executes the condition part and then executes the action part only if the specified condition is evaluated to
true. It is possible for a single statement to activate more than one trigger. In such a situation, the DBMS processes the triggers in an arbitrary order. The execution of the action part of a trigger may activate other triggers, or even the same trigger that initiated the action. A trigger that activates itself is called a 'recursive trigger'. The DBMS executes such chains of triggers in some pre-defined manner, but this makes the overall behaviour harder to understand.

Features of Active Database:


1. It possesses all the concepts of a conventional database, i.e., data modelling facilities, query language, etc.
2. It supports all the functions of a traditional database like data definition, data manipulation,
storage management etc.
3. It supports definition and management of ECA rules.
4. It detects event occurrence.
5. It must be able to evaluate conditions and to execute actions.
6. It means that it has to implement rule execution.
Advantages:
1. Enhances traditional database functionalities with powerful rule-processing capabilities.
2. Enables a uniform and centralized description of the business rules relevant to the information system.
3. Avoids redundancy of checking and repair operations.
4. Suitable platform for building large and efficient knowledge base and expert systems.
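An ECA (event-condition-action) rule in an active database is usually realized as a trigger. The sketch below uses SQLite-style trigger syntax; the Stock and Reorder_Request tables, their columns, and the threshold of 10 are assumptions introduced purely for illustration.
CREATE TRIGGER Low_Stock_Reorder
AFTER UPDATE ON Stock              -- Event: a row of Stock has been updated
FOR EACH ROW
WHEN NEW.quantity < 10             -- Condition: the new quantity has fallen below the threshold
BEGIN
  -- Action: record a reorder request for the affected item
  INSERT INTO Reorder_Request (item_id, requested_on)
  VALUES (NEW.item_id, CURRENT_DATE);
END;
The action could itself fire further triggers defined on Reorder_Request, which is exactly the chaining (and possible recursion) of triggers described above.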
https://www.geeksforgeeks.org/active-databases/
UNIT 3

SQL: QUERIES, CONSTRAINTS, TRIGGERS: form of basic SQL query, UNION, INTERSECT,
and EXCEPT, Nested Queries, aggregation operators, NULL values, complex integrity constraints in
SQL, triggers and active data bases.
Schema Refinement: Problems caused by redundancy, decompositions, problems related to
decomposition, reasoning about functional dependencies, FIRST, SECOND, THIRD
normal forms, BCNF, lossless join decomposition, multi-valued dependencies, FOURTH
normal form, FIFTH normal form.

Schema refinement
Schema refinement is just a fancy term for saying polishing tables. It is the last step before
considering physical design/tuning with typical workloads:
1) Requirement analysis : user needs
2) Conceptual design : high-level description, often using E/R diagrams
3) Logical design : from graphs to tables (relational schema)
4) Schema refinement : checking tables for redundancies and anomalies
Let’s see an example of redundancies and anomalies. Consider the following table where the
client’s name is the primary key.

The table presents information on employees (sales reps) and their clients. If we want to insert data, we notice that:
each row requires an entry in the client field
we can't insert data for newly hired sales reps until they've been assigned to one or more clients
if sales reps are in a training process, even if they've already been hired, they can't actually join the database because they need to have a delegated client… unless "dummy" clients are created.
If we want to update data, we notice that:
the sales rep's name is repeated for each client
what if, for a given client, we misspelled the name of the sales rep as Crosby instead of Cosby… how can we edit that without affecting all the sales reps actually called Crosby?
If we want to delete data, what if Mary doesn't have a client anymore because she's taking a year off? We are forced to either:
create a dummy client
incorrectly show her with a client she no longer handles
delete Mary's record (even though she's still an employee)
Notice we cannot have "null" as a client, since primary key fields cannot store null.
When we deal with schema refinement, we often notice that the main problem is redundancy. In order to identify schemas with such problems, we'll introduce the notion of functional dependencies: a relationship that exists when one attribute uniquely determines another attribute. A functional dependency is simply a new type of constraint between two attributes.
Say that R is a relation with attributes X and Y; we say that there is a functional dependency X -> Y when Y is functionally dependent on X (where X is the determinant set and Y is the dependent attribute).
http://blog.dancrisan.com/intro-to-database-systems-schema-refinement-functional-dependencies

Problems caused by redundancy


Redundancy - Data redundancy means the occurrence of duplicate copies of similar data. It is
done intentionally to keep the same piece of data at different places, or it occurs accidentally.
In DBMS, when the same data is stored in different tables, it causes data redundancy.
Sometimes, it is done on purpose for recovery or backup of data, faster access of data, or
updating data easily. Redundant data costs extra money, demands higher storage capacity, and
requires extra effort to keep all the files up to date.
Sometimes, unintentional duplicity of data causes a problem for the database to work properly, or
it may become harder for the end user to access data. Redundant data unnecessarily occupy space
in the database to save identical copies, which leads to space constraints, which is one of the
major problems.
Ex: Student table that contains data such as "Student_id", "Name", "Course", "Session", "Fee",
and "Department". As you can see, some data is repeated in the table, which causes
redundancy.
Student_id Name Course Session Fee Department

101 Devi B. Tech 2022 90,000 CS

102 Sona B. Tech 2022 90,000 CS

103 Varun B. Tech 2022 90,000 CS

104 Satish B. Tech 2022 90,000 CS

105 Amisha B. Tech 2022 90,000 CS

Problems that are caused due to redundancy in the database


Redundancy in DBMS gives rise to anomalies, and we will study it further. In a database
management system, the problems that occur while working on data include inserting, deleting,
and updating data in the database.
student_id student_name student_age dept_id dept_name dept_head

1 Shiva 19 104 Information Technology Jaspreet Kaur

2 Khushi 18 102 Electronics Avni Singh

3 Harsh 19 104 Information Technology Jaspreet Kaur

Insertion Anomaly: Insertion anomaly arises when you are trying to insert some data
into the database, but you are not able to insert it. Example: If you want to add the details
of the student in the above table, then you must know the details of the department;
otherwise, you will not be able to add the details because student details are dependent on
department details.
Deletion Anomaly: Deletion anomaly arises when you delete some data from the
database, but some unrelated data is also deleted; that is, there will be a loss of data due
to deletion anomaly. Example: If we want to delete the student detail, which has
student_id 2, we will also lose the unrelated data, i.e., department_id 102, from the above
table.
Updating Anomaly: An update anomaly arises when you update some data in the
database, but the data is partially updated, which causes data inconsistency. Example: If
we want to update the details of dept_head from Jaspreet Kaur to Ankit Goyal for
Dept_id 104, then we have to update it everywhere else; otherwise, the data will get
partially updated, which causes data inconsistency.
Advantages
Provides Data Security
Provides Data Reliability
Create Data Backup
Disadvantages
Data corruption
Wastage of storage
High cost
Ways to reduce data redundancy
Database Normalization: We can normalize the data using the normalization method. In
this method, the data is broken down into pieces, which means a large table is divided
into two or more small tables to remove redundancy. Normalization removes insert
anomaly, update anomaly, and delete anomaly.
Deleting Unused Data: It is important to remove redundant data from the database as it
generates data redundancy in the DBMS. It is a good practice to remove unwanted data to
reduce redundancy.
Master Data: The data administrator shares master data across multiple systems.
Although it does not remove data redundancy, but it updates the redundant data whenever
the data is changed.
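As an illustration of the normalization approach listed above, the Student table from this section could be split so that the repeated Course/Session/Fee/Department values are stored only once. This is only a sketch; the Course table and its Course_Id key are assumptions introduced for the example.
CREATE TABLE Course (
    Course_Id INT PRIMARY KEY,
    Course_Name VARCHAR(50),      -- e.g. 'B. Tech'
    Session INT,                  -- e.g. 2022
    Fee DECIMAL(10, 2),           -- e.g. 90000
    Department VARCHAR(20)        -- e.g. 'CS'
);

CREATE TABLE Student (
    Student_Id INT PRIMARY KEY,
    Name VARCHAR(100),
    Course_Id INT,
    FOREIGN KEY (Course_Id) REFERENCES Course (Course_Id)
);
-- The course details now appear once in Course instead of being repeated
-- in every Student row, removing the redundancy shown in the table above.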
https://www.javatpoint.com/redundancy-in-dbms

Decompositions
Problems related to decomposition
When a relation in the relational model is not in appropriate normal form then the decomposition
of a relation is required.
In a database, it breaks the table into multiple tables.
If the relation has no proper decomposition, then it may lead to problems like loss of
information.
Decomposition is used to eliminate some of the problems of bad design like anomalies,
inconsistencies, and redundancy.
Types of decomposition
Lossless Decomposition - If the information is not lost from the relation that is decomposed, then
the decomposition will be lossless.
The lossless decomposition guarantees that the join of relations will result in the same relation as
it was decomposed.
The relation is said to be lossless decomposition if natural joins of all the decomposition give the
original relation.

Example: EMPLOYEE_DEPARTMENT table


EMP_ID EMP_NAME EMP_AGE EMP_CITY DEPT_ID DEPT_NAME

22 Denim 28 Mumbai 827 Sales

33 Alina 25 Delhi 438 Marketing

46 Stephan 30 Bangalore 869 Finance

52 Katherine 36 Mumbai 575 Production

60 Jack 40 Noida 678 Testing

The above relation is decomposed into two relations, EMPLOYEE and DEPARTMENT.

EMPLOYEE table
EMP_ID EMP_NAME EMP_AGE EMP_CITY
22 Denim 28 Mumbai

33 Alina 25 Delhi

46 Stephan 30 Bangalore

52 Katherine 36 Mumbai

60 Jack 40 Noida

DEPARTMENT table
DEPT_ID EMP_ID DEPT_NAME

827 22 Sales

438 33 Marketing

869 46 Finance

575 52 Production

678 60 Testing

When these two relations are joined on the common column "EMP_ID", the resultant relation will look like:
EMP_ID EMP_NAME EMP_AGE EMP_CITY DEPT_ID DEPT_NAME

22 Denim 28 Mumbai 827 Sales

33 Alina 25 Delhi 438 Marketing

46 Stephan 30 Bangalore 869 Finance

52 Katherine 36 Mumbai 575 Production

60 Jack 40 Noida 678 Testing

Hence, the decomposition is Lossless join decomposition.
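A rough SQL sketch of this decomposition, and of reconstructing the original relation with a join on the common attribute (the data types are assumptions):
CREATE TABLE EMPLOYEE (
    EMP_ID INT PRIMARY KEY,
    EMP_NAME VARCHAR(50),
    EMP_AGE INT,
    EMP_CITY VARCHAR(50)
);

CREATE TABLE DEPARTMENT (
    DEPT_ID INT PRIMARY KEY,
    EMP_ID INT REFERENCES EMPLOYEE (EMP_ID),
    DEPT_NAME VARCHAR(50)
);

-- Joining on the common attribute EMP_ID reproduces the original
-- EMPLOYEE_DEPARTMENT relation, with no spurious tuples:
SELECT E.EMP_ID, E.EMP_NAME, E.EMP_AGE, E.EMP_CITY, D.DEPT_ID, D.DEPT_NAME
FROM EMPLOYEE E
JOIN DEPARTMENT D ON D.EMP_ID = E.EMP_ID;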


Dependency Preserving
It is an important constraint of the database.
In the dependency preservation, at least one decomposed table must satisfy every dependency.
If a relation R is decomposed into relation R1 and R2, then the dependencies of R either must be
a part of R1 or R2 or must be derivable from the combination of functional dependencies of R1
and R2.
For example, suppose there is a relation R (A, B, C, D) with functional dependency set (A→BC).
The relational R is decomposed into R1(ABC) and R2(AD) which is dependency preserving
because FD A→BC is a part of relation R1(ABC).
Issues of decomposition in DBMS?
There are many problems regarding the decomposition in DBMS are:
Redundant Storage - The same information gets stored in multiple places, which can confuse programmers and wastes space in the system.
Insertion Anomalies - It may not be possible to store some information unless some other, unrelated information is stored along with it.
Deletion Anomalies - It may not be possible to delete some information without also losing some other, unrelated information.
https://www.javatpoint.com/dbms-relational-decomposition
https://whatisdbms.com/decomposition-in-dbms/#google_vignette

Functional dependencies
In a relational database management, functional dependency is a concept that specifies the
relationship between two sets of attributes where one attribute determines the value of another
attribute. It is denoted as X → Y, where the attribute set on the left side of the arrow, X is called
Determinant, and Y is called the Dependent.
roll_no  name  dept_name  dept_building

42 abc CO A4

43 pqr IT A3

44 xyz CO A4

45 xyz IT A3

46 mno EC B2

47 jkl ME B2

From the above table we can conclude some valid functional dependencies:
roll_no → {name, dept_name, dept_building}: roll_no can determine the values of the fields name, dept_name and dept_building, hence this is a valid functional dependency.
roll_no → dept_name: since roll_no can determine the whole set {name, dept_name, dept_building}, it can also determine its subset dept_name.
dept_name → dept_building: dept_name can identify dept_building accurately, since departments with different dept_name values also have different dept_building values.
More valid functional dependencies: roll_no → name, {roll_no, name} → {dept_name, dept_building}, etc.
Here are some invalid functional dependencies:
name → dept_name Students with the same name can have different dept_name, hence this is
not a valid functional dependency.
dept_building → dept_name There can be multiple departments in the same building. Example,
in the above table departments ME and EC are in the same building B2, hence dept_building →
dept_name is an invalid functional dependency.
More invalid functional dependencies: name → roll_no, {name, dept_name} → roll_no,
dept_building → roll_no, etc.
Armstrong’s axioms/properties of functional dependencies:
Reflexivity: If Y is a subset of X, then X→Y holds by reflexivity rule
Example, {roll_no, name} → name is valid.
Augmentation: If X → Y is a valid dependency, then XZ → YZ is also valid by the
augmentation rule.
Example, {roll_no, name} → dept_building is valid, hence {roll_no, name, dept_name} →
{dept_building, dept_name} is also valid.
Transitivity: If X → Y and Y → Z are both valid dependencies, then X→Z is also valid by the
Transitivity rule.
Example, roll_no → dept_name & dept_name → dept_building, then roll_no → dept_building is
also valid.
Types of Functional Dependencies in DBMS
Trivial functional dependency - In Trivial Functional Dependency, a dependent is always a
subset of the determinant. i.e. If X → Y and Y is the subset of X, then it is called trivial
functional dependency
Example:
roll_no  name  age

42 abc 17

43 pqr 18

44 xyz 18

Here, {roll_no, name} → name is a trivial functional dependency, since the dependent name is a
subset of determinant set {roll_no, name}. Similarly, roll_no → roll_no is also an example of
trivial functional dependency.
Non-Trivial functional dependency - In Non-trivial functional dependency, the dependent is
strictly not a subset of the determinant. i.e. If X → Y and Y is not a subset of X, then it is called
Non-trivial functional dependency.
Example:
roll_no  name  age

42 abc 17

43 pqr 18

44 xyz 18

Here, roll_no → name is a non-trivial functional dependency, since the dependent name is not a
subset of determinant roll_no. Similarly, {roll_no, name} → age is also a non-trivial functional
dependency, since age is not a subset of {roll_no, name}
Multivalued functional dependency - In Multivalued functional dependency, entities of the
dependent set are not dependent on each other. i.e. If a → {b, c} and there exists no functional
dependency between b and c, then it is called a multivalued functional dependency. For
example,
roll_no  name  age

42 abc 17

43 pqr 18

44 xyz 18

45 abc 19

Here, roll_no → {name, age} is a multivalued functional dependency, since the dependents name
& age are not dependent on each other(i.e. name → age or age → name doesn’t exist !)
Transitive functional dependency - In transitive functional dependency, dependent is indirectly
dependent on determinant. i.e. If a → b & b → c, then according to axiom of transitivity, a → c.
This is a transitive functional dependency.
For example,
enrol_no  name  dept  building_no
42 abc CO 4

43 pqr EC 2

44 xyz IT 1

45 abc EC 2

Here, enrol_no → dept and dept → building_no. Hence, according to the axiom of transitivity,
enrol_no → building_no is a valid functional dependency. This is an indirect functional
dependency, hence called Transitive functional dependency.
Fully Functional Dependency - In a full functional dependency, an attribute (or set of attributes) is determined by the whole of another set of attributes, not by any proper subset of it. If a relation R has attributes X, Y, Z with the dependencies X->Y and X->Z, and Y and Z do not depend on any proper subset of X, those dependencies are fully functional.
Partial Functional Dependency - In a partial functional dependency, a non-key attribute depends on only a part of the composite key rather than on the whole key. If a relation R has attributes X, Y, Z where {X, Y} is the composite key and Z is a non-key attribute, then X->Z is a partial functional dependency in RDBMS.
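A functional dependency is a property of the schema, not of one particular instance, but it is easy to check whether a candidate dependency such as dept_name → dept_building is violated by the data currently stored. A hedged sketch (the table name student is an assumption):
-- Returns a row only if some dept_name is associated with more than one
-- dept_building, i.e. only if dept_name -> dept_building does NOT hold in the data.
SELECT dept_name
FROM student
GROUP BY dept_name
HAVING COUNT(DISTINCT dept_building) > 1;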
Advantages
Data Normalization
Query Optimization
Consistency of Data
Data Quality Improvement
https://www.google.com/amp/s/www.geeksforgeeks.org/types-of-functional-dependencies-in
dbms/amp/

Reasoning about functional dependencies


Functional dependencies are an essential concept in database management systems (DBMS) that
help ensure data integrity and optimize database design. Here are 5-10 key points about
functional dependencies:
Definition: Functional dependency is a relationship between two sets of attributes in a database
table. It states that for a given set of values in one set of attributes (the determinant), there is a
unique set of values in another set of attributes (the dependent). In other words, the value of the
determinant determines the value of the dependent.
Notation: Functional dependencies are denoted using arrow notation: A → B, where A is the
determinant set of attributes, and B is the dependent set of attributes.
Transitivity: Functional dependencies are transitive, meaning that if A → B and B → C hold,
then A → C also holds. For example, if "Employee_ID → Department_ID" and "Department_ID
→ Department_Name" are both functional dependencies, then "Employee_ID →
Department_Name" is implied.
Closure of Attributes: The closure of attributes with respect to a set of functional dependencies
is the set of all attributes that are functionally dependent on the given set. It helps in determining
the superkeys and candidate keys of a relation.
Keys and Superkeys: A superkey is a set of one or more attributes that uniquely identifies tuples
in a relation. A key is a minimal superkey, i.e., a superkey from which we cannot remove any
attributes while still maintaining uniqueness.
Determining Keys: Functional dependencies play a crucial role in determining keys. If a set of
attributes can determine all other attributes in a relation, it is a candidate key. A relation can have
multiple candidate keys, and the one chosen as the primary key becomes the main unique
identifier for the tuples.
Normalization: Functional dependencies are used to normalize the database schema, which
involves breaking down large tables into smaller, well-structured ones to reduce data redundancy
and improve data integrity.
First Normal Form (1NF): Each attribute in a relation must be atomic (indivisible), and each
tuple must be unique. Functional dependencies help in ensuring that each attribute contains only
one value and no duplicate tuples exist.
Boyce-Codd Normal Form (BCNF): A relation is in BCNF if, for every non-trivial functional dependency A → B, the determinant A is a superkey. BCNF ensures that every determinant is a candidate key, so no anomalies arise from dependencies on non-key attributes.
Lossless Decomposition: When decomposing a relation into multiple smaller relations during
normalization, functional dependencies help guarantee that we can join these smaller relations
back together without losing any information.
Understanding functional dependencies is crucial for proper database design, normalization, and
maintaining data integrity, ensuring that the database structure remains efficient and reliable.

Normalization
Types of normal forms
A large database defined as a single relation may result in data duplication. This repetition of
data may result in:
Making relations very large.
It isn't easy to maintain and update data as it would involve searching many records in
relation.
Wastage and poor utilization of disk space and resources.
The likelihood of errors and inconsistencies increases.
So to handle these problems, we should analyze and decompose the relations with redundant data into smaller, simpler, and well-structured relations that satisfy desirable properties.
Normalization is a process of decomposing the relations into relations with fewer attributes.
Normalization
Normalization is the process of organizing the data in the database.
Normalization is used to minimize the redundancy from a relation or set of relations. It is
also used to eliminate undesirable characteristics like Insertion, Update, and Deletion
Anomalies.
Normalization divides the larger table into smaller and links them using relationships.
The normal form is used to reduce redundancy from the database table.
The main reason for normalizing the relations is removing these anomalies. Failure to eliminate
anomalies leads to data redundancy and can cause data integrity and other problems as the
database grows. Normalization consists of a series of guidelines that helps to guide you in
creating a good database structure.
Data modification anomalies can be categorized into three types:
Insertion Anomaly: Insertion Anomaly refers to when one cannot insert a new tuple into a
relationship due to lack of data.
Deletion Anomaly: The delete anomaly refers to the situation where the deletion of data results
in the unintended loss of some other important data.
Updatation Anomaly: The update anomaly is when an update of a single data value requires
multiple rows of data to be updated.
Types of Normal Forms:
1NF A relation is in 1NF if it contains an atomic value.
2NF A relation will be in 2NF if it is in 1NF and all non-key attributes are fully functional
dependent on the primary key.
3NF A relation will be in 3NF if it is in 2NF and no transitive dependency exists.
BCNF A stronger definition of 3NF is known as Boyce-Codd normal form.
4NF A relation will be in 4NF if it is in Boyce-Codd normal form and has no multi-valued dependency.
5NF A relation is in 5NF if it is in 4NF, does not contain any join dependency, and joining is lossless.
Advantages
Normalization helps to minimize data redundancy.
Greater overall database organization.
Data consistency within the database.
Much more flexible database design.
Enforces the concept of relational integrity.
Disadvantages
You cannot start building the database before knowing what the user needs.
The performance degrades when normalizing the relations to higher normal forms, i.e., 4NF, 5NF.
It is very time-consuming and difficult to normalize relations of a higher degree.
Careless decomposition may lead to a bad database design, leading to serious problems.
First Normal Form (1NF)
A relation will be 1NF if it contains an atomic value.
It states that an attribute of a table cannot hold multiple values. It must hold only single-valued
attribute.
First normal form disallows the multi-valued attribute, composite attribute, and their
combinations.
Example: Relation EMPLOYEE is not in 1NF because of multi-valued attribute
EMP_PHONE. EMPLOYEE table:
EMP_ID  EMP_NAME  EMP_PHONE                EMP_STATE
14      John      7272826385, 9064738238   UP
20      Harry     8574783832               Bihar
12      Sam       7390372389, 8589830302   Punjab

The decomposition of the EMPLOYEE table into 1NF has been shown below:
EMP_ID  EMP_NAME  EMP_PHONE  EMP_STATE

14 John 7272826385 UP

14 John 9064738238 UP

20 Harry 8574783832 Bihar

12 Sam 7390372389 Punjab

12 Sam 8589830302 Punjab
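A minimal sketch of a table definition matching this 1NF design, in which each phone number is a separate, atomic value; the composite primary key is an assumption made for the example:
CREATE TABLE EMPLOYEE_1NF (
    EMP_ID INT,
    EMP_NAME VARCHAR(50),
    EMP_PHONE VARCHAR(15),            -- one atomic phone number per row
    EMP_STATE VARCHAR(30),
    PRIMARY KEY (EMP_ID, EMP_PHONE)   -- an employee appears once per phone number
);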

Second Normal Form (2NF)


In 2NF, the relation must be in 1NF.
In the second normal form, all non-key attributes are fully functionally dependent on the primary key.
Example: Let's assume, a school can store the data of teachers and the subjects they teach. In a
school, a teacher can teach more than one subject.
TEACHER table
TEACHER_ID  SUBJECT  TEACHER_AGE

25 Chemistry 30

25 Biology 30

47 English 35

83 Math 38

83 Computer 38

In the given table, the non-prime attribute TEACHER_AGE is dependent on TEACHER_ID, which is a proper subset of the candidate key {TEACHER_ID, SUBJECT}. That's why it violates the rule for 2NF.
To convert the given table into 2NF, we decompose it into two tables:
TEACHER_DETAIL table:
TEACHER_ID  TEACHER_AGE
25          30
47          35
83          38

TEACHER_SUBJECT table:
TEACHER_ID  SUBJECT
25          Chemistry
25          Biology
47          English
83          Math
83          Computer

Third Normal Form (3NF)

A relation will be in 3NF if it is in 2NF and does not contain any transitive dependency for non-prime attributes.
3NF is used to reduce the data duplication. It is also used to achieve the data integrity. If there
is no transitive dependency for non-prime attributes, then the relation must be in third normal
form.
A relation is in third normal form if it holds at least one of the following conditions for every non-trivial functional dependency X → Y:
X is a super key.
Y is a prime attribute, i.e., each element of Y is part of some candidate key.
Example:
EMPLOYEE_DETAIL table:
EMP_ID EMP_NAME EMP_ZIP EMP_STATE EMP_CITY

222 Harry 201010 UP Noida

333 Stephan 02228 US Boston

444 Lan 60007 US Chicago

555 Katharine 06389 UK Norwich

666 John 462007 MP Bhopal

Super key:{EMP_ID}, {EMP_ID, EMP_NAME}, {EMP_ID, EMP_NAME, EMP_ZIP}....so on


Candidate key: {EMP_ID}
Non-prime attributes: In the given table, all attributes except EMP_ID are non-prime. Here, EMP_STATE and EMP_CITY depend on EMP_ZIP, and EMP_ZIP depends on EMP_ID. The non-prime attributes (EMP_STATE, EMP_CITY) are transitively dependent on the super key (EMP_ID). This violates the rule of third normal form.
That's why we need to move EMP_CITY and EMP_STATE to a new EMPLOYEE_ZIP table, with EMP_ZIP as the primary key.
EMPLOYEE table:
EMP_ID  EMP_NAME   EMP_ZIP
222     Harry      201010
333     Stephan    02228
444     Lan        60007
555     Katharine  06389
666     John       462007

EMPLOYEE_ZIP table:
EMP_ZIP  EMP_STATE  EMP_CITY
201010   UP         Noida
02228    US         Boston
60007    US         Chicago
06389    UK         Norwich
462007   MP         Bhopal
Boyce Codd normal form (BCNF)
BCNF is the advanced version of 3NF. It is stricter than 3NF.
A table is in BCNF if, for every functional dependency X → Y, X is a super key of the table. For BCNF, the table should be in 3NF, and for every FD, the left-hand side must be a super key.
Example: Let's assume there is a company where employees work in more than one department.
EMPLOYEE table:
EMP_ID  EMP_COUNTRY  EMP_DEPT    DEPT_TYPE  EMP_DEPT_NO
264     India        Designing   D394       283
264     India        Testing     D394       300
364     UK           Stores      D283       232
364     UK           Developing  D283       549

In the above table Functional dependencies are as follows:


EMP_ID → EMP_COUNTRY
EMP_DEPT → {DEPT_TYPE, EMP_DEPT_NO}
Candidate key: {EMP-ID, EMP-DEPT}
The table is not in BCNF because neither EMP_DEPT nor EMP_ID alone are
keys. To convert the given table into BCNF, we decompose it into three tables:
EMP_COUNTRY table:
EMP_ID  EMP_COUNTRY
264     India
364     UK

EMP_DEPT table:
EMP_DEPT    DEPT_TYPE  EMP_DEPT_NO
Designing   D394       283
Testing     D394       300
Stores      D283       232
Developing  D283       549

EMP_DEPT_MAPPING table:
EMP_ID  EMP_DEPT
264     Designing
264     Testing
364     Stores
364     Developing

Functional dependencies:
EMP_ID → EMP_COUNTRY
EMP_DEPT → {DEPT_TYPE, EMP_DEPT_NO}
Candidate keys:
For the first table: EMP_ID
For the second table: EMP_DEPT
For the third table: {EMP_ID, EMP_DEPT}
Now, this is in BCNF because left side part of both the functional dependencies is a key.
Fourth normal form (4NF)
A relation will be in 4NF if it is in Boyce Codd normal form and has no multi-valued
dependency.
For a dependency A →→ B, if multiple values of B exist for a single value of A, then the relation has a multi-valued dependency.
Example: STUDENT
STU_ID  COURSE     HOBBY
21      Computer   Dancing
21      Math       Singing
34      Chemistry  Dancing
74      Biology    Cricket
59      Physics    Hockey

The given STUDENT table is in 3NF, but COURSE and HOBBY are two independent entities. Hence, there is no relationship between COURSE and HOBBY.
In the STUDENT relation, a student with STU_ID, 21 contains two courses, Computer and Math
and two hobbies, Dancing and Singing. So there is a Multi-valued dependency on STU_ID,
which leads to unnecessary repetition of data.
So to make the above table into 4NF, we can decompose it into two tables:
STUDENT_COURSE table:
STU_ID COURSE

21 Computer

21 Math

34 Chemistry

74 Biology

59 Physics

STUDENT_HOBBY table:
STU_ID  HOBBY
21      Dancing
21      Singing
34      Dancing
74      Cricket
59      Hockey

Fifth normal form (5NF)
A relation is in 5NF if it is in 4NF, does not contain any join dependency, and joining is lossless.
lossless.
5NF is satisfied when all the tables are broken into as many tables as possible in order to avoid
redundancy.
5NF is also known as Project-join normal form (PJ/NF).
Example
SUBJECT  LECTURER  SEMESTER

Computer Anshika Semester 1

Computer John Semester 1

Math John Semester 1

Math Akash Semester 2

Chemistry Praveen Semester 1

In the above table, John takes both Computer and Math classes in Semester 1, but he doesn't take the Math class in Semester 2. In this case, the combination of all these fields is required to identify valid data.
Suppose we add a new semester, Semester 3, but do not know the subject and who will be taking that subject, so we leave Lecturer and Subject as NULL. But all three columns together act as the primary key, so we can't leave the other two columns blank.
So to make the above table into 5NF, we can decompose it into three relations P1, P2 &
P3:

P1 table:
SUBJECT    LECTURER
Computer   Anshika
Computer   John
Math       John
Math       Akash
Chemistry  Praveen

P2 table:
SEMESTER    SUBJECT
Semester 1  Computer
Semester 1  Math
Semester 1  Chemistry
Semester 2  Math

P3 table:
SEMESTER    LECTURER
Semester 1  Anshika
Semester 1  John
Semester 2  Akash
Semester 1  Praveen

https://www.javatpoint.com/dbms-normalization
https://www.javatpoint.com/dbms-first-normal-form
https://www.javatpoint.com/dbms-second-normal-form
https://www.javatpoint.com/dbms-third-normal-form
https://www.javatpoint.com/dbms-boyce-codd-normal-form
https://www.javatpoint.com/dbms-forth-normal-form
https://www.javatpoint.com/dbms-fifth-normal-form

Lossless join decomposition


The original relation and the relation reconstructed by joining the decomposed relations must contain the same set of tuples; if the number of tuples is increased or decreased, it is a lossy join decomposition.
Lossless join decomposition ensures that we never get a situation where spurious tuples are generated in the relation: for every value of the join attributes, there is a unique tuple in one of the relations.
Lossless join decomposition is a decomposition of a relation R into relations R1, R2 such that if
we perform a natural join of relation R1 and R2, it will return the original relation R. This is
effective in removing redundancy from databases while preserving the original data. In other
words by lossless decomposition, it becomes feasible to reconstruct the relation R from
decomposed tables R1 and R2 by using Joins.
Decompositions performed to reach 1NF, 2NF, 3NF and BCNF can always be made lossless. In a lossless decomposition, we select the common attribute, and the criterion for selecting a common attribute is that it must be a candidate key or super key in either relation R1 or R2, or in both.
Decomposition of a relation R into R1 and R2 is a lossless-join decomposition if at least one of the following functional dependencies is in F+ (the closure of the functional dependencies):
R1 ∩ R2 → R1, or
R1 ∩ R2 → R2
Example:
Employee (Employee_Id, Ename, Salary, Department_Id, Dname)
can be decomposed using lossless decomposition as:
Employee_desc (Employee_Id, Ename, Salary, Department_Id)
Department_desc (Department_Id, Dname)
Alternatively, the following decomposition would be lossy, since joining these tables on a common attribute is not possible, so the original data cannot be recovered:
Employee_desc (Employee_Id, Ename, Salary)
Department_desc (Department_Id, Dname)
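As a practical sanity check, the original relation can be compared with the natural join of the decomposed relations; for a lossless-join decomposition the two results contain exactly the same tuples. A rough sketch using the Employee example above (table and column names as assumed there):
-- Number of tuples in the original relation
SELECT COUNT(*) FROM Employee;

-- Number of tuples after rejoining the lossless decomposition;
-- the counts (and the tuples themselves) should match the original.
SELECT COUNT(*)
FROM Employee_desc E
JOIN Department_desc D ON D.Department_Id = E.Department_Id;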
https://www.geeksforgeeks.org/lossless-decomposition-in-dbms/

Multi valued dependencies


MVD or multivalued dependency means that for a single value of attribute ‘a’ multiple values of
attribute ‘b’ exist. We write it as,
a →→ b
It is read as: b is multi-valued dependent on a (a multi-determines b).
Suppose a person named Geeks is working on 2 projects, Microsoft (MS) and Oracle, and has 2 hobbies, namely Reading and Music. This can be expressed in tabular format in the following way:

name   project  hobby
Geeks  MS       Reading
Geeks  MS       Music
Geeks  Oracle   Reading
Geeks  Oracle   Music

Project and Hobby are multivalued attributes, as they have more than one value for a single person, i.e., Geeks.
Multi Valued Dependency (MVD) :
We can say that multivalued dependency exists if the following conditions are
met. Conditions for MVD :
An attribute a multi-determines another attribute b in a relation r(R) if, for every pair of tuples t1 and t2 in r such that
t1[a] = t2[a]
there exist tuples t3 and t4 in r such that:
t1[a] = t2[a] = t3[a] = t4[a]
t1[b] = t3[b]; t2[b] = t4[b]
t1[c] = t4[c]; t2[c] = t3[c], where c stands for the remaining attributes R - (a ∪ b)
Then multivalued (MVD) dependency exists.
To check the MVD in given table, we apply the conditions stated above and we check it with the
values in the given table.

Condition-1 for MVD –


t1[a] = t2[a] = t3[a] = t4[a]
Finding from table,
t1[a] = t2[a] = t3[a] = t4[a] = Geeks
So, condition 1 is Satisfied.
Condition-2 for MVD –
t1[b] = t3[b]
And
t2[b] = t4[b]
Finding from table,
t1[b] = t3[b] = MS
And
t2[b] = t4[b] = Oracle
So, condition 2 is Satisfied.
Condition-3 for MVD –
For the remaining attributes c ∈ R - (a ∪ b), where R is the set of attributes in the relational table:
t1[c] = t4[c]
And
t2[c] = t3[c]
Finding from table,
t1[c] = t4[c] = Reading
And
t2[c] = t3[c] = Music
So, condition 3 is Satisfied.
All conditions are satisfied, therefore,
a→→b
According to the table, taking a = name and b = project, we get
name →→ project
and taking the remaining attribute c = hobby (a →→ c), we get
name →→ hobby
Hence, we know that MVD exists in the above table and it can be stated by,
name → → project
name → → hobby
https://www.geeksforgeeks.org/multivalued-dependency-mvd-in-dbms/
UNIT - IV
Transaction Concept, Transaction State, Implementation of Atomicity and Durability,
Concurrent Executions, Serializability, Recoverability, Implementation of Isolation, Testing
for serializability, Lock Based Protocols, Timestamp Based Protocols, Validation- Based
Protocols, Multiple Granularity, Recovery and Atomicity, Log–Based Recovery, Recovery
with Concurrent Transactions.

Transaction Concept
Transaction - Any logical work or set of works that are done on the data of a database is known
as a transaction. Logical work can be inserting a new value in the current database, deleting
existing values, or updating the current values in the database.
For example, adding a new member to the database of a team is a transaction. To complete a
transaction, we have to follow some steps which make a transaction successful. For example, withdrawing cash from an ATM is a transaction, and it can be done in the following steps:
Initialization of transaction
Inserting the ATM card into the machine
Choosing the language
Choosing the account type
Entering the cash amount
Entering the pin
Collecting the cash
Aborting the transaction
So, in the same way, we have three steps in the DBMS for a transaction, which are the following:
Read Data
Write Data
Commit
We can understand the above three states by an example. Let suppose we have two accounts,
account1, and account2, with an initial amount as 1000Rs. each. If we want to transfer Rs.500 from
account1 to account2, then we will commit the transaction.
All the account details are in secondary memory so that they will be brought into primary
memory for the transaction.
Now we will read the data of account1 and deduct the Rs.500 from the account1. Now,
account1 contains Rs.500.
Now we will read the data of the account2 and add Rs.500 to it. Now, account2 will have
Rs.1500.
In the end, we will use the commit command, which indicates that the transaction has
been successful, and we can store the changes in secondary memory.
If, in any case, there is a failure before the commit command, then the system will be back
into its previous state, and no changes will be there.
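A minimal SQL sketch of this transfer written as one transaction; the accounts table and its columns are assumptions used only for illustration:
BEGIN;                                                        -- start the transaction
UPDATE accounts SET balance = balance - 500 WHERE id = 1;     -- deduct Rs.500 from account1
UPDATE accounts SET balance = balance + 500 WHERE id = 2;     -- add Rs.500 to account2
COMMIT;                                                       -- make both changes permanent
-- If a failure occurs before COMMIT, the DBMS rolls the transaction back
-- and both accounts keep their original balances.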
During the complete process of a transaction, there are a lot of states which are described below:
Active State: When the transaction is going well without any error, then this is called an active
state. If all the operations are good, then it goes to a partially committed state, and if it fails, then
it enters into a failed state.
Partially Committed State: All the changes in the database after the read and write operation
needs to be reflected in permanent memory or database. So, a partially committed state system
enters into a committed state for the permanent changes, and if there is any error, then it enters into
a failed state.
Failed State: If there is any error in hardware or software which makes the system fail, then it
enters into the failed state. In the failed state, all the changes are discarded, and the system gets its
previous state which was consistent.
Aborted State: If there is any failure during the execution, then the system goes from failed to an
aborted state. From an aborted state, the transaction will start its execution from a fresh start or
from a newly active state.
Committed State: If the execution of a transaction is successful, the changes are made into the
main memory and stored in the database permanently, which is called the committed state.
Terminated State: If the transaction is in the aborted state(failure) or committed state(success),
then the execution stops, and it is called the terminated state.

Properties of Transaction
There are four properties of a transaction that should be maintained during the transaction.
Atomicity: It means either a transaction will take place, or it will fail. There will not be any
middle state like partial completion.
Atomicity involves the following two operations:
Abort: If a transaction aborts then all the changes made are not visible.
Commit: If a transaction commits then all the changes made are visible.
Example: Let's assume that the following transaction T consists of T1 and T2. A consists of Rs 600 and B consists of Rs 300, and we transfer Rs 100 from account A to account B.

T1             T2
Read(A)        Read(B)
A := A - 100   B := B + 100
Write(A)       Write(B)

After completion of the transaction, A consists of Rs 500 and B consists of Rs 400.

If the transaction T fails after the completion of transaction T1 but before completion of transaction
T2, then the amount will be deducted from A but not added to B. This shows the inconsistent database
state. In order to ensure correctness of database state, the transaction must be executed in entirety.

Consistency: The database should be consistent before and after the transaction. Correctness and
integrity constraints should be maintained during the transaction. The integrity constraints are
maintained so that the database is consistent before and after the transaction. The execution of a transaction will leave the database in either its prior stable state or a new stable state. The consistency property states that every transaction sees a consistent database instance. The transaction is used to transform the database from one consistent state to another consistent state.
For example: The total amount must be maintained before or after the transaction.
Total before T occurs = 600+300=900
Total after T occurs= 500+400=900
Therefore, the database is consistent. In the case when T1 is completed but T2 fails, then inconsistency
will occur.
Isolation: This property means multiple transactions can occur at the same time without affecting
each other. If one transaction is occurring, then it should not bring any changes in the data for the
other transaction, which is occurring concurrently. It shows that the data which is used at the time
of execution of a transaction cannot be used by the second transaction until the first one is
completed. In isolation, if the transaction T1 is being executed and is using the data item X, then that data item can't be accessed by any other transaction T2 until the transaction T1 ends. The concurrency control subsystem of the DBMS enforces the isolation property.
Durability: It means if there is a successful transaction, then all changes should be permanent, so
if there is any system failure, we will be able to retrieve the updated data. The durability property
is used to indicate the persistence of the database's consistent state. It states that the changes made by the transaction are permanent. They cannot be lost by the erroneous operation of a faulty transaction or by a system failure. When a transaction is completed, the database reaches a state known as the consistent state. That consistent state cannot be lost, even in the event of a system failure. The recovery subsystem of the DBMS has the responsibility for the Durability property.
https://www.javatpoint.com/transactions-in-dbms
https://www.javatpoint.com/dbms-transaction-property
https://www.geeksforgeeks.org/acid-properties-in-dbms/
Transaction State
Active state - The active state is the first state of every transaction. In this state, the transaction is
being executed.
For example: Insertion or deletion or updating a record is done here. But all the records are still
not saved to the database.
Partially committed - In the partially committed state, a transaction executes its final operation,
but the data is still not saved to the database.
In the total mark calculation example, a final display of the total marks step is executed in this
state.
Committed - A transaction is said to be in a committed state if it executes all its operations
successfully. In this state, all the effects are now permanently saved on the database system.
Failed state - If any of the checks made by the database recovery system fails, then the
transaction is said to be in the failed state.
In the example of total mark calculation, if the database is not able to fire a query to fetch the
marks, then the transaction will fail to execute.
Aborted - If any of the checks fail and the transaction has reached a failed state then the database
recovery system will make sure that the database is in its previous consistent state. If not then it
will abort or roll back the transaction to bring the database into a consistent state. If the
transaction fails in the middle of the transaction then before executing the transaction, all the
executed transactions are rolled back to its consistent state.
After aborting the transaction, the database recovery module will select one of the two
operations: Re-start the transaction
Kill the transaction
https://www.javatpoint.com/dbms-states-of-transaction

Implementation of Atomicity and Durability


Atomicity and durability are two important concepts in database management systems (DBMS)
that ensure the consistency and reliability of data.
Atomicity: One of the key characteristics of transactions in database management systems
(DBMS) is atomicity, which guarantees that every operation within a transaction is handled as a
single, indivisible unit of work.
Importance:
A key characteristic of transactions in database management systems is atomicity (DBMS). It
makes sure that every action taken as part of a transaction is handled as a single, indivisible item
of labor that can either be completed in full or not at all.
Even in the case of mistakes, failures, or crashes, atomicity ensures that the database maintains
consistency. The following are some of the reasons why atomicity is essential in DBMS:
Consistency: Atomicity ensures that the database remains in a consistent state at all times. All
changes made by a transaction are rolled back if it is interrupted or fails for any other reason,
returning the database to its initial state. By doing this, the database's consistency and data
integrity are maintained.
Recovery: Atomicity guarantees that, in the event of a system failure or crash, the database
can be restored to a consistent state. All changes made by a transaction are undone if it is
interrupted or fails, and the database is then reset to its initial state using the undo log. This
guarantees that, even in the event of failure, the database may be restored to a consistent
condition.
Concurrency: Atomicity makes assurance that transactions can run simultaneously
without affecting one another. Each transaction is carried out independently of the others,
and its modifications are kept separate. This guarantees that numerous users can access the
database concurrently without resulting in conflicts or inconsistent data.
Reliability: Even in the face of mistakes or failures, atomicity makes the guarantee that
the database is trustworthy. By ensuring that transactions are atomic, the database remains
consistent and reliable, even in the event of system failures, crashes, or errors.
Implementation of Atomicity:
Here are some common techniques used to implement atomicity in DBMS:
Undo Log: An undo log is a mechanism used to keep track of the changes made by a transaction before it is
committed to the database. If a transaction fails, the undo log is used to undo the changes
made by the transaction, effectively rolling back the transaction. By doing this, the
database is guaranteed to remain in a consistent condition.
Redo Log: A redo log is a mechanism used to keep track of the changes made by a
transaction after it is committed to the database. If a system failure occurs after a transaction
is committed but before its changes are written to disk, the redo log can be used to redo the
changes and ensure that the database is consistent.
Two-Phase Commit: Two-phase commit is a protocol used to ensure that all nodes in a
distributed system commit or abort a transaction together. This ensures that the transaction
is executed atomically across all nodes and that the database remains consistent across the
entire system.
Locking: Locking is a mechanism used to prevent multiple transactions from accessing the
same data concurrently. By ensuring that only one transaction can edit a specific piece of
data at once, locking helps to avoid conflicts and maintain the consistency of the database.
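From the application's point of view, the undo log described above is what makes ROLLBACK (and SAVEPOINT) possible. A hedged sketch; the accounts table is an assumption, and savepoint support varies slightly between systems:
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
SAVEPOINT after_debit;                                        -- a point we can roll back to
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
-- If the second update fails, its changes alone can be undone...
ROLLBACK TO SAVEPOINT after_debit;
-- ...or the whole transaction can be abandoned, with the undo log
-- restoring both balances to their original values:
ROLLBACK;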
Durability: One of the key characteristics of transactions in database management systems
(DBMS) is durability, which guarantees that changes made by a transaction once it has been
committed are permanently kept in the database and will not be lost even in the case of a system
failure or catastrophe.
Importance:
Durability is a critical property of transactions in database management systems (DBMS) that
ensures that once a transaction is committed, its changes are permanently stored in the database
and will not be lost, even in the event of a system failure or crash.
The following are some of the reasons why durability is essential in DBMS:
Data Integrity: Durability ensures that the data in the database remains consistent and accurate, even in the event of a system failure or crash. It guarantees that committed transactions are durable and will be recovered without data loss or corruption.
Reliability: Durability guarantees that the database will continue to be dependable despite faults or failures. In the event of system problems, crashes, or failures, the database is kept consistent and trustworthy by making sure that committed transactions are durable.
Recovery: Durability guarantees that, in the event of a system failure or crash, the database can be restored to a consistent state. If a committed transaction's changes are lost from memory due to a failure, they can be recovered from the redo log or other backup storage.
Availability: Durability ensures that the data in the database is always available for access
by users, even in the event of a system failure or crash. It ensures that committed
transactions are always retained in the database and are not lost in the event of a system
crash.
Implementation of Durability:
Here are some common techniques used to implement durability in DBMS:
Write-Ahead Logging: Write-ahead logging is a mechanism used to ensure that changes made by a
transaction are recorded in the redo log before they are written to the database. This
makes sure that the changes are permanent and that they can be restored from the redo
log in the event of a system failure.
Checkpointing: Checkpointing is a technique used to periodically write the database state
to disk to ensure that changes made by committed transactions are permanently stored.
Checkpointing aids in minimizing the amount of work required for database recovery.
Redundant storage: Redundant storage is a technique used to store multiple copies of the
database or its parts, such as the redo log, on separate disks or systems. This ensures that
even in the event of a disk or system failure, the data can be recovered from the redundant
storage.
RAID: In order to increase performance and reliability, a technology called RAID
(Redundant Array of Inexpensive Disks) is used to integrate several drives into a single
logical unit. RAID can be used to implement redundancy and ensure that data is durable
even in the event of a disk failure.
Techniques used by DBMS to Implement Atomicity and Durability:
Transactions: Transactions are used to group related operations that need to be executed
atomically. They are either committed, in which case all their changes become permanent, or rolled
back, in which case none of their changes are made permanent.
Logging: Logging is a technique that involves recording all changes made to the database in a
separate file called a log. The log is used to recover the database in case of a failure. Write-ahead
logging is a common technique that guarantees that data is written to the log before it is written to
the database.
Shadow Paging: Shadow paging is a technique that involves making a copy of the database before
any changes are made. The copy is used to provide a consistent view of the database in case of
failure. The modifications are made to the original database after a transaction has been committed.
Backup and Recovery: In order to guarantee that the database can be recovered to a consistent
state in the event of a failure, backup and recovery procedures are used. This involves making
regular backups of the database and keeping track of changes made to the database since the last
backup.
https://www.javatpoint.com/implementation-of-atomicity-and-durability-in-dbms

Concurrent Executions
Concurrency Control is the management procedure that is required for controlling concurrent
execution of the operations that take place on a database.
But before knowing about concurrency control, we should know about concurrent
execution. Concurrent Execution in DBMS
In a multi-user system, multiple users can access and use the same database at one time, which is
known as the concurrent execution of the database. It means that the same database is executed
simultaneously on a multi-user system by different users.
While working on the database transactions, there occurs the requirement of using the database by
multiple users for performing different operations, and in that case, concurrent execution of the
database is performed.
The thing is that the simultaneous execution that is performed should be done in an interleaved
manner, and no operation should affect the other executing operations, thus maintaining the
consistency of the database. Thus, on making the concurrent execution of the transaction
operations, there occur several challenging problems that need to be solved.
Problems with Concurrent Execution
In a database transaction, the two main operations are READ and WRITE operations. So, there is
a need to manage these two operations in the concurrent execution of the transactions as if these
operations are not performed in an interleaved manner, and the data may become inconsistent. So,
the following problems occur with the Concurrent Execution of the operations:
Problem 1: Lost Update Problems (W - W Conflict)
The problem occurs when two different database transactions perform the read/write operations on
the same database items in an interleaved manner (i.e., concurrent execution) that makes the values
of the items incorrect hence making the database inconsistent.
For example:
Consider the below diagram where two transactions TX and TY, are performed on the same
account A where the balance of account A is $300.
At time t1, transaction TX reads the value of account A, i.e., $300 (only read). At time t2,
transaction TX deducts $50 from account A that becomes $250 (only deducted and not
updated/write).
Alternately, at time t3, transaction TY reads the value of account A that will be $300 only
because TX didn't update the value yet.
At time t4, transaction TY adds $100 to account A that becomes $400 (only added but not
updated/write).
At time t6, transaction TX writes the value of account A that will be updated as $250
only, as TY didn't update the value yet.
Similarly, at time t7, transaction TY writes the values of account A, so it will write as
done at time t4 that will be $400. It means the value written by TX is lost, i.e., $250 is lost.
Hence data becomes incorrect, and database sets to inconsistent.
Dirty Read Problems (W-R Conflict)
The dirty read problem occurs when one transaction updates an item of the database, and
somehow the transaction fails, and before the data gets rollback, the updated database item is
accessed by another transaction. There comes the Read-Write Conflict between both
transactions. For example:
Consider two transactions TX and TY in the below diagram performing read/write operations on
account A where the available balance in account A is $300:

At time t1, transaction TX reads the value of account A, i.e., $300.


At time t2, transaction TX adds $50 to account A that becomes $350.
At time t3, transaction TX writes the updated value in account A, i.e.,
$350. Then at time t4, transaction TY reads account A that will be read as
$350.
Then at time t5, transaction TX rollbacks due to server problem, and the value changes back
to $300 (as initially).
But the value for account A remains $350 for transaction TY as committed, which is the dirty
read and therefore known as the Dirty Read Problem.
Unrepeatable Read Problem (W-R Conflict)
Also known as Inconsistent Retrievals Problem that occurs when in a transaction, two different
values are read for the same database item.
For example:
Consider two transactions, TX and TY, performing read/write operations on account A, which has an available balance of $300:
At time t1, transaction TX reads the value of account A, i.e., $300.
At time t2, transaction TY reads the value of account A, i.e., $300.
At time t3, transaction TY updates the value of account A by adding $100 to the available balance, which becomes $400.
At time t4, transaction TY writes the updated value, i.e., $400.
After that, at time t5, transaction TX reads the value of account A again and now sees $400.
Within the same transaction TX, two different values of account A have been read: $300 initially and $400 after the update made by TY. This is an unrepeatable read and is therefore known as the Unrepeatable Read Problem.
Thus, in order to maintain consistency in the database and avoid such problems during concurrent execution, some form of management is needed, and that is where the concept of Concurrency Control comes into play.
Concurrency Control
Concurrency Control is the mechanism required for controlling and managing the concurrent execution of database operations, and thus for avoiding inconsistencies in the database. To maintain concurrency in the database, we use concurrency control protocols.
Concurrency Control Protocols
Concurrency control protocols ensure the isolation and serializability of concurrently executing database transactions (and, together with recovery, support atomicity and durability). These protocols are categorized as:
Lock Based Concurrency Control Protocol
Time Stamp Concurrency Control Protocol
Validation Based Concurrency Control Protocol
Lock Based Concurrency Control Protocol - In this type of protocol, a transaction cannot read or write data until it acquires an appropriate lock on it. There are two types of locks:
Shared lock: Also known as a read-only lock. With a shared lock, the data item can only be read by the transaction. It can be shared between transactions because, while holding only a shared lock, a transaction cannot update the data item.
Exclusive lock: With an exclusive lock, the data item can be both read and written by the transaction. The lock is exclusive, so multiple transactions cannot modify the same data item simultaneously.
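A compact way to picture shared and exclusive locks is a compatibility check: a requested lock is granted only if it is compatible with every lock already held on the item by other transactions. The sketch below is an illustrative Python fragment (the lock table and function names are assumptions, not the API of any real DBMS):

# Compatibility of shared (S) and exclusive (X) locks: S is compatible
# only with S; X is compatible with nothing.
COMPATIBLE = {("S", "S"): True, ("S", "X"): False,
              ("X", "S"): False, ("X", "X"): False}

held_locks = {}   # data item -> list of (transaction, mode)

def can_grant(item, txn, mode):
    """A lock request is granted if every lock held by *other*
    transactions on the item is compatible with the requested mode."""
    return all(COMPATIBLE[(held_mode, mode)]
               for holder, held_mode in held_locks.get(item, [])
               if holder != txn)

def lock(item, txn, mode):
    if can_grant(item, txn, mode):
        held_locks.setdefault(item, []).append((txn, mode))
        return True
    return False          # in a real DBMS the transaction would wait

print(lock("A", "T1", "S"))   # True  -- first shared lock
print(lock("A", "T2", "S"))   # True  -- shared locks can coexist
print(lock("A", "T3", "X"))   # False -- exclusive conflicts with S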
There are four types of lock protocols available:
Simplistic lock protocol - It is the simplest way of locking data during a transaction. Simplistic lock-based protocols require every transaction to obtain a lock on the data before it inserts, deletes or updates it. The data item is unlocked after the transaction completes.
Pre-claiming Lock Protocol - Pre-claiming lock protocols evaluate the transaction to list all the data items on which it needs locks. Before starting execution, the transaction requests the DBMS for locks on all of those data items. If all the locks are granted, the protocol allows the transaction to begin; when the transaction completes, it releases all its locks. If any lock is not granted, the transaction rolls back and waits until all the locks are granted.
Two-phase locking (2PL) - The two-phase locking protocol divides the execution of a transaction into three parts. In the first part, as the transaction begins executing, it seeks permission for the locks it requires. In the second part, the transaction acquires all the locks. The third part starts as soon as the transaction releases its first lock; from this point on, the transaction cannot demand any new locks, it can only release the locks it has acquired.
There are two phases of 2PL:
Growing phase: In the growing phase, a new lock on the data item may be acquired by the
transaction, but none can be released.
Shrinking phase: In the shrinking phase, existing locks held by the transaction may be released, but no new locks can be acquired.
If lock conversion is allowed, then the following rules apply:
Upgrading a lock (from S(a) to X(a)) is allowed only in the growing phase.
Downgrading a lock (from X(a) to S(a)) must be done in the shrinking phase.
Example:
The following way shows how unlocking and locking work with 2-PL.
Transaction T1:
Growing phase: from step 1-3
Shrinking phase: from step 5-7
Lock point: at 3
Transaction T2:
Growing phase: from step 2-6
Shrinking phase: from step 8-9
Lock point: at 6
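The two-phase rule shown in the example above can be enforced with very little code: once a transaction has released any lock, it has passed its lock point and may not acquire a new one. The following Python sketch (class and method names are assumptions for illustration; lock compatibility and waiting are omitted) captures just that rule:

# Minimal sketch of the two-phase rule: no new lock may be acquired
# after the first unlock (the lock point has been passed).
class TwoPhaseTransaction:
    def __init__(self, name):
        self.name = name
        self.locks = set()
        self.shrinking = False     # becomes True after the first unlock

    def acquire(self, item):
        if self.shrinking:
            raise RuntimeError(f"{self.name}: 2PL violation - "
                               "cannot lock after unlocking")
        self.locks.add(item)

    def release(self, item):
        self.shrinking = True      # growing phase is over
        self.locks.discard(item)

t1 = TwoPhaseTransaction("T1")
t1.acquire("A")
t1.acquire("B")      # growing phase
t1.release("A")      # shrinking phase begins (lock point passed)
# t1.acquire("C")    # would raise: new locks are not allowed any more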
Strict Two-phase locking (Strict-2PL) - The first phase of Strict-2PL is the same as in 2PL: the transaction keeps acquiring locks as it executes. The difference is that Strict-2PL does not release a lock immediately after using it; it waits until the whole transaction commits and only then releases all its locks at once. Strict-2PL therefore has no gradual shrinking phase of lock release, and unlike basic 2PL it does not suffer from cascading aborts.
Time Stamp Concurrency Control Protocol - The Timestamp Ordering Protocol is used to order
the transactions based on their Timestamps. The order of transaction is nothing but the ascending
order of the transaction creation.
An older transaction has higher priority, so it is executed first. To determine the timestamp of a transaction, this protocol uses system time or a logical counter. Lock-based protocols manage the order between conflicting pairs of transactions at execution time, whereas timestamp-based protocols start working as soon as a transaction is created.
For example, suppose there are two transactions T1 and T2. Transaction T1 entered the system at time 007 and transaction T2 entered the system at time 009. T1 has the higher priority, so it executes first, as it entered the system first.
The timestamp ordering protocol also maintains the timestamps of the last 'read' and 'write' operations on each data item.
The basic timestamp ordering protocol works as follows:
1. Whenever a transaction Ti issues a Read(X) operation, check the following condition:
If W_TS(X) > TS(Ti), then the operation is rejected and Ti is rolled back.
If W_TS(X) <= TS(Ti), then the operation is executed and the read timestamp of X is updated.
2. Whenever a transaction Ti issues a Write(X) operation, check the following conditions:
If TS(Ti) < R_TS(X), then the operation is rejected and Ti is rolled back.
If TS(Ti) < W_TS(X), then the operation is rejected and Ti is rolled back; otherwise the operation is executed and the write timestamp of X is updated.
Here TS(Ti) denotes the timestamp of transaction Ti, R_TS(X) denotes the read timestamp of data item X, and W_TS(X) denotes the write timestamp of data item X.
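These two rules translate almost directly into code. The sketch below (Python, illustrative; the dictionaries and return strings are assumptions) keeps R_TS and W_TS per data item and applies exactly the checks above:

# Basic timestamp ordering checks for Read(X) and Write(X).
r_ts = {}   # data item -> largest timestamp that has read it
w_ts = {}   # data item -> largest timestamp that has written it

def read(item, ts_ti):
    if w_ts.get(item, 0) > ts_ti:          # a younger txn already wrote X
        return "reject and roll back Ti"
    r_ts[item] = max(r_ts.get(item, 0), ts_ti)
    return "execute read"

def write(item, ts_ti):
    if ts_ti < r_ts.get(item, 0):          # a younger txn already read X
        return "reject and roll back Ti"
    if ts_ti < w_ts.get(item, 0):          # a younger txn already wrote X
        return "reject and roll back Ti"
    w_ts[item] = ts_ti
    return "execute write"

print(read("X", 7))    # execute read  (W_TS(X) = 0 <= 7)
print(write("X", 9))   # execute write (no younger reader or writer)
print(write("X", 7))   # rejected: TS(Ti) = 7 < W_TS(X) = 9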
Advantages and Disadvantages
The TO protocol ensures conflict serializability, because the precedence graph of any schedule it produces contains no cycles.
The TO protocol also ensures freedom from deadlock, since no transaction ever waits. However, the schedule may not be recoverable and may not even be cascadeless.
Validation Based Concurrency Control Protocol - This protocol is also known as the optimistic concurrency control technique. In the validation-based protocol, a transaction is executed in the following three phases:
Read phase: In this phase, transaction T reads the values of the various data items and stores them in temporary local variables. It performs all its write operations on these temporary variables, without updating the actual database.
Validation phase: In this phase, the values in the temporary variables are validated against the actual data to check whether the transaction would violate serializability.
Write phase: If the transaction passes validation, the temporary results are written to the database; otherwise the transaction is rolled back.
Each transaction is associated with the following timestamps:
Start(Ti): the time when Ti started its execution.
Validation(Ti): the time when Ti finished its read phase and started its validation phase.
Finish(Ti): the time when Ti finished its write phase.
This protocol uses the timestamp of the validation phase to determine the serialization order, because the validation phase is the phase that actually decides whether the transaction will commit or roll back.
Hence TS(T) = Validation(T).
Serializability is determined during the validation process and cannot be decided in advance. Because transactions execute optimistically, this approach allows a greater degree of concurrency with fewer conflicts, and therefore fewer rollbacks.
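The text above describes the three phases but not the validation test itself. A common formulation, used here only as an assumption for illustration, is that Ti passes validation if no transaction that validated earlier and overlapped with Ti wrote an item that Ti read:

# Illustrative validation test for optimistic concurrency control.
validated = []    # (write_set, finish_time) of already-validated transactions

def validate(read_set, start_time):
    """Return True if the validating transaction may enter its write phase."""
    for write_set, finish_time in validated:
        # Only transactions still running after we started can conflict with us.
        if finish_time > start_time and write_set & read_set:
            return False              # conflict: roll back and restart
    return True

# An earlier transaction wrote {C} and finished at time 3.
validated.append(({"C"}, 3))
print(validate({"A", "B"}, 1))        # True: no overlap on data items
validated.append(({"A"}, 4))
print(validate({"A", "B"}, 2))        # False: conflict on A, roll back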
https://www.javatpoint.com/dbms-concurrency-control
DBMS Lock based Protocol - javatpoint
DBMS Timestamp Ordering Protocol - javatpoint
DBMS Validation based Protocol - javatpoint
Serializability
Serializability is a property of the system that describes how different processes operate on shared data. A system is called serializable if the result of executing a set of transactions concurrently is equivalent to the result of executing them one after another, i.e., with no overlapping execution. In a DBMS, while data is being written or read, the DBMS can stop all other processes from accessing that data.
The most restrictive way of enforcing serializability is two-phase locking (2PL). In the first phase, the data objects are locked before the operation is executed; when the transaction has completed, the locks on the data objects are released. This process guarantees that there are no conflicting operations and that every transaction views the database in a consistent state.
Two-phase locking thus provides a strong guarantee against conflicts in the database. Serializability, then, is the system property that describes how different processes operate on shared data: a DBMS enforces it by locking data during the execution of other processes, and it ensures that the final result is equivalent to some sequential execution of the transactions.
Serializable Schedule
A serializable schedule is one in which the sequence of read and write operations does not violate the serializability property: the effect of executing the transactions concurrently is the same as that of some serial schedule of the same transactions.
There are several algorithms available to check the serializability of a schedule. One of the most important is the conflict serializability test, which checks for potential conflicts: a conflict occurs when two transactions access the same data item and at least one of them writes it. If there is no conflict, serializability is guaranteed; if a conflict occurs, the schedule may or may not be serializable.
Another approach is to check for mutual dependencies between two transactions, which arise when the transactions access the same data in conflicting ways. If there are no mutual dependencies, the schedule is guaranteed to be serializable; if mutual dependencies exist, it may or may not be serializable.
We can also check serializability using the precedence graph algorithm. A precedence relationship exists when one transaction must precede another for the schedule to be valid. If the graph contains no cycles, the serializability of the schedule is guaranteed; if it contains cycles, the schedule is not conflict serializable.
Types of Serializability
Conflict Serializability - Conflict serializability is concerned with conflicting operations, i.e., operations on the same data item that must be executed in a particular order to maintain the consistency of the database. Two operations are said to conflict if all of the following conditions hold:
The operations belong to different transactions.
Both operations access the same data item.
At least one of the operations is a write operation.
A schedule is conflict serializable if it can be transformed into a serial schedule by swapping non-conflicting operations. For example, consider an order table and a customer table, where one customer can have multiple orders but each order belongs to exactly one customer. If two transactions execute concurrently and one inserts an order for the first customer while the other inserts an order for the second customer, the operations do not conflict, and no inconsistency arises in the database.
View Serializability - View serializability requires that each transaction produce the same results as it would under some proper sequential (serial) execution of the same transactions. Like conflict serializability, it aims to prevent inconsistency in the database, but it is a weaker condition: it allows the user to view the database consistently even in some schedules that are not conflict serializable.
To understand view serializability, consider two schedules S1 and S2 created from two transactions T1 and T2. For the two schedules to be view equivalent, they must satisfy the following three conditions.
The first condition is that both schedules must involve the same set of transactions, and each transaction that reads the initial value of a data item in schedule S1 must also read the initial value of that data item in schedule S2.
The second condition is that the read operations must read the same values in both schedules: if a transaction reads a data item that was written by another transaction in schedule S1, it must read the value written by that same transaction in schedule S2.
The third and final condition concerns the final write: for each data item, the transaction that performs the final write on that item in schedule S1 must also perform the final write on it in schedule S2. For example, if transaction T1 performs the final write on data item A in schedule S1, then T1 must also perform the final write on A in schedule S2; otherwise the schedules are not view equivalent.
Testing of Serializability
If several transactions are executed concurrently, the task of a serializability test is to determine whether the resulting schedule is equivalent to some serial ordering of those transactions.
Suppose there are two users, Sona and Archita, and each executes two transactions: transactions T1 and T2 are executed by Sona, and T3 and T4 are executed by Archita.
Suppose transaction T1 reads and writes data items A and B, transaction T2 reads and writes data item B, transaction T3 reads and writes data items C and D, and transaction T4 reads and writes data item D. The transactions issue the following operations:
T1: Read A → Write A → Read B → Write B
T2: Read B → Write B
T3: Read C → Write C → Read D → Write D
T4: Read D → Write D
Let's first discuss why this schedule is not conflict serializable.
For a schedule to be conflict serializable, all conflicting operations must be orderable consistently. In the example above, Transaction 1 (T1) and Transaction 2 (T2) both read and write data item B, so their operations on B conflict with each other, and interleaving them can create a cycle of dependencies. Therefore, the given schedule is not conflict serializable.
However, there is another, weaker notion called view serializability, which this example does satisfy. View serializability only requires that the transactions read the same values and produce the same final writes as in some serial order. In our example, Transaction 2 (T2) cannot see any updates made by Transaction 4 (T4) because they do not share any common data items. Therefore, the schedule is view serializable.
It is important to note that conflict serializability is a stronger property than view serializability, because it requires that all potentially conflicting operations be ordered consistently. View serializability only requires that the transactions read the same values and produce the same final writes as in some serial order, regardless of whether individual operations conflict.
Every conflict serializable schedule is also view serializable, and either property is sufficient to ensure the correctness of concurrent transactions in a database management system.
Advantages
Predictable execution: Under serializable execution, transactions behave as if they run one at a time, so there are no surprises: all variables are updated as expected, and there is no data loss or corruption.
Easier to reason about and debug: Since each transaction appears to execute alone, it is much easier to reason about each thread of work in the database, which makes debugging easier; we do not have to worry about subtle concurrency effects.
Reduced costs: The serializable property can reduce the hardware cost needed for the smooth operation of the database, and it can also reduce the development cost of the software.
Increased performance: In some cases, serializable executions can perform better than their non-serializable counterparts, since they allow the developer to optimize code for performance.
Testing for serializability
A serialization (precedence) graph is used to test the serializability of a schedule.
Assume a schedule S. For S, we construct a graph known as the precedence graph. This graph is a pair G = (V, E), where V is a set of vertices and E is a set of edges. The set of vertices contains all the transactions participating in the schedule. The set of edges contains all edges Ti → Tj for which one of the following three conditions holds:
Create an edge Ti → Tj if Ti executes write(Q) before Tj executes read(Q).
Create an edge Ti → Tj if Ti executes read(Q) before Tj executes write(Q).
Create an edge Ti → Tj if Ti executes write(Q) before Tj executes write(Q).
If a precedence graph contains a single edge Ti → Tj, then all the instructions of Ti are
executed before the first instruction of Tj is executed.
If a precedence graph for schedule S contains a cycle, then S is non-serializable. If the precedence
graph has no cycle, then S is known as serializable.
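This test is easy to automate: build the edges from conflicting pairs of operations and then search the graph for a cycle. The Python sketch below (the representation of a schedule as (transaction, operation, item) triples is an assumption for illustration) does exactly that with a depth-first search:

# Build a precedence graph from a schedule and test it for cycles.
def precedence_graph(schedule):
    edges = set()
    for i, (ti, op_i, x) in enumerate(schedule):
        for tj, op_j, y in schedule[i + 1:]:
            # conflicting pair: same item, different txns, at least one write
            if x == y and ti != tj and ("W" in (op_i, op_j)):
                edges.add((ti, tj))
    return edges

def has_cycle(edges):
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
    visited, on_stack = set(), set()

    def dfs(node):
        visited.add(node)
        on_stack.add(node)
        for nxt in graph.get(node, ()):
            if nxt in on_stack or (nxt not in visited and dfs(nxt)):
                return True
        on_stack.discard(node)
        return False

    return any(dfs(n) for n in graph if n not in visited)

# T1 writes A before T2 reads A, and T2 writes B before T1 reads B: a cycle.
s = [("T1", "W", "A"), ("T2", "R", "A"), ("T2", "W", "B"), ("T1", "R", "B")]
edges = precedence_graph(s)
print(edges)                            # e.g. {('T1', 'T2'), ('T2', 'T1')}
print("serializable" if not has_cycle(edges) else "not serializable")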
For example:
Explanation:
Read(A): In T1, no subsequent writes to A, so no new edges
Read(B): In T2, no subsequent writes to B, so no new edges
Read(C): In T3, no subsequent writes to C, so no new edges
Write(B): B is subsequently read by T3, so add edge T2 → T3
Write(C): C is subsequently read by T1, so add edge T3 → T1
Write(A): A is subsequently read by T2, so add edge T1 → T2
Write(A): In T2, no subsequent reads to A, so no new edges
Write(C): In T1, no subsequent reads to C, so no new edges
Write(B): In T3, no subsequent reads to B, so no new edges
The precedence graph for schedule S1 contains a cycle (T1 → T2 → T3 → T1), which is why schedule S1 is not serializable.
Explanation:
Read(A): In T4, no subsequent writes to A, so no new edges.
Read(C): In T4, no subsequent writes to C, so no new edges.
Write(A): A is subsequently read by T5, so add edge T4 → T5.
Read(B): In T5, no subsequent writes to B, so no new edges.
Write(C): C is subsequently read by T6, so add edge T4 → T6.
Write(B): B is subsequently read by T6, so add edge T5 → T6.
Write(C): In T6, no subsequent reads to C, so no new edges.
Write(A): In T5, no subsequent reads to A, so no new edges.
Write(B): In T6, no subsequent reads to B, so no new edges.
The precedence graph for schedule S2 contains no cycle, which is why schedule S2 is serializable.
https://www.javatpoint.com/serializability-in-dbms
https://www.javatpoint.com/dbms-testing-of-serializability
Recoverability
Sometimes a transaction does not execute completely, due to a software issue, a system crash or a hardware failure. In that case, the failed transaction has to be rolled back. But other transactions may already have used values produced by the failed transaction, so those transactions have to be rolled back as well.
The schedule in Table 1 has two transactions. T1 reads and writes the value of A, and that value is then read and written by T2. T2 commits, but later T1 fails. Due to the failure, we have to roll back T1. T2 should also be rolled back because it read a value written by T1, but T2 cannot be rolled back because it has already committed. This type of schedule is known as an irrecoverable schedule.
Irrecoverable schedule: A schedule is irrecoverable if Tj reads a value updated by Ti and Tj commits before Ti commits.
Table 2 shows a schedule with two transactions. Transaction T1 reads and writes A, and that value is then read and written by transaction T2. Later, T1 fails, so we have to roll back T1. T2 must also be rolled back because it has read the value written by T1. Since T2 has not committed before T1 commits, we can still roll back transaction T2 as well. This schedule is therefore recoverable with cascading rollback.
Recoverable with cascading rollback: A schedule is recoverable with cascading rollback if Tj reads a value updated by Ti and the commit of Tj is delayed until after the commit of Ti.
Table 3 shows a schedule with two transactions in which transaction T1 reads and writes A and commits, and only then is that value read and written by T2. This is a cascadeless recoverable schedule.
https://www.javatpoint.com/dbms-recoverability-of-schedule
Implementation of Isolation
Isolation is a database-level characteristic that governs how and when modifications are made, as
well as whether they are visible to other users, systems, and other databases. One of the purposes
of isolation is to allow many transactions to run concurrently without interfering with their
execution.
Isolation is a need for database transactional properties. It is the third ACID (Atomicity,
Consistency, Isolation, and Durability) standard property that ensures data consistency and
accuracy.
ACID:
To maintain database consistency, "ACID properties" are followed before and after a transaction.
Atomicity: The term atomicity refers to the notion that data is kept atomic: any operation performed on the data must be completed entirely or not at all, and it must not be left half done. When working within a transaction, operations should be carried out completely rather than partially.
If any of the operations remains unfinished, the transaction is aborted. For example, if another operation with a higher priority interrupts the current operation, the current operation is terminated and aborted.
Consistency: This property ensures that the database moves from one consistent state to another. For example, in a train reservation system, the sum of the remaining seats and the seats already reserved by users must always equal the total number of seats in the train. Each transaction ends with a consistency check to make sure nothing has gone wrong.
Durability: The term durability refers to the idea that once an operation is successfully completed, its effects remain on disk permanently. The database should be resilient enough to keep its committed data even if the system malfunctions or crashes. The recovery manager is responsible for guaranteeing the durability of the database in the event of a failure. Every time we make a change, we must use the COMMIT command to make the values permanent.
Isolation: Isolation refers to a state of separation. The isolation property of a DBMS ensures that several transactions can take place simultaneously without the intermediate data of one transaction affecting another. In other words, an operation on the second state of the database starts only after the operation on the first state is finished.
Phenomena Defining Isolation Level:
A transaction that reads data that has not yet been committed is said to have performed a "Dirty Read". Imagine that Transaction 1 modifies a row and leaves it uncommitted, and Transaction 2 then reads the modified row. If Transaction 1 rolls back the change, Transaction 2 will have read data that was never intended to exist.
Non Repeatable Read occurs when a transaction reads the same row twice and receives a different
value each time. Assume that transaction T1 reads data. Because of concurrency, another
transaction, T2, modifies and commits the same data. Transaction T1 will get a different value if
it reads the same data a second time.
When two identical queries are run, but the rows returned by the two are different, this phenomenon
is known as a "Phantom Read." Assume transaction T1 receives a collection of records that meet
some search criteria. Transaction T2 now creates some new data that fit the transaction T1 search
criteria. Transaction T1 will acquire a different set of rows if it re-executes the statement that reads
the rows.
Levels of Isolation:
Isolation is divided into four stages. The ability of users to access the same data concurrently is
constrained by higher isolation. The greater the isolation degree, the more system resources are
required, and the greater the likelihood that database transactions would block one another.
"Serializable," the highest level, denotes that one transaction must be completed before
another can start.
Repeatable Reads allow transactions to be accessed after they have begun, even if they
have not completed. This level enables phantom reads or the awareness of inserted or
deleted rows even when changes to existing rows are not readable.
Read Committed allows you access to information only after it has been committed to
the database.
Read Uncommitted is the lowest level of isolation, allowing access to data before
modifications are performed.
The lower the isolation level, the more prone users are to read phenomena such as uncommitted dependencies, often known as dirty reads, where data is read from a row that has been modified by another user but has not yet been committed to the database.
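The relationship between the four levels and the read phenomena can be summarised in a small lookup table. The sketch below reflects the standard SQL isolation levels in general, not the exact behaviour of any particular database product:

# Which read phenomena each standard isolation level still permits.
PHENOMENA_ALLOWED = {
    "READ UNCOMMITTED": {"dirty read", "non-repeatable read", "phantom read"},
    "READ COMMITTED":   {"non-repeatable read", "phantom read"},
    "REPEATABLE READ":  {"phantom read"},
    "SERIALIZABLE":     set(),
}

def is_safe_against(level, phenomenon):
    return phenomenon not in PHENOMENA_ALLOWED[level]

print(is_safe_against("READ COMMITTED", "dirty read"))      # True
print(is_safe_against("REPEATABLE READ", "phantom read"))   # False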
https://www.javatpoint.com/isolation-in-dbms
Multiple Granularity
Granularity: the size of the data item that can be locked.
Multiple Granularity - Multiple granularity can be defined as hierarchically breaking up the database into blocks that can be locked. The multiple granularity protocol enhances concurrency and reduces lock overhead. It keeps track of what to lock and how to lock, and makes it easy to decide whether to lock or unlock a data item. Such a hierarchy can be represented graphically as a tree.
For example: Consider a tree which has four levels of nodes.
The first level or higher level shows the entire database.
The second level represents nodes of type area; the database at the higher level consists of exactly these areas. Each area has child nodes called files, and no file can be present in more than one area. Finally, each file contains child nodes called records; a file contains exactly the records that are its child nodes, and no record is present in more than one file.
Hence, the levels of the tree starting from the top level are as follows:
Database
Area
File
Record
In this example, the highest level is the entire database, and the levels below it are area, file and record. There are three additional lock modes with multiple granularity:
Intention-shared (IS): explicit locking at a lower level of the tree, but only with shared locks.
Intention-exclusive (IX): explicit locking at a lower level with exclusive or shared locks.
Shared & intention-exclusive (SIX): the node is locked in shared mode, and some lower-level node is locked in exclusive mode by the same transaction.
Multiple granularity uses the intention lock modes to ensure serializability. It requires that, when a transaction attempts to lock a node, the transaction must follow these rules:
Transaction T1 must follow the lock-compatibility matrix.
Transaction T1 must lock the root of the tree first, and it can lock it in any mode.
Transaction T1 can lock a node in S or IS mode only if it currently holds the parent of that node in IS or IX mode.
Transaction T1 can lock a node in X, SIX or IX mode only if it currently holds the parent of that node in IX or SIX mode.
Transaction T1 can lock a node only if it has not previously unlocked any node (locking follows the two-phase rule).
Transaction T1 can unlock a node only if none of that node's children are currently locked by T1.
Observe that in multiple-granularity, the locks are acquired in top-down order, and locks must be
released in bottom-up order.
If transaction T1 reads record Ra9 in file Fa, then T1 needs to lock the database, area A1 and file Fa in IS mode, and finally lock Ra9 in S mode.
If transaction T2 modifies record Ra9 in file Fa, then it can do so after locking the database, area A1 and file Fa in IX mode, and finally locking Ra9 in X mode.
If transaction T3 reads all the records in file Fa, then T3 needs to lock the database and area A1 in IS mode, and finally lock Fa in S mode.
If transaction T4 reads the entire database, then T4 only needs to lock the database in S mode.
https://www.javatpoint.com/dbms-multiple-granularity
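The examples above follow one pattern: intention locks (IS for reads, IX for writes) on every ancestor from the database down to the file, and then an S or X lock on the record itself. A small Python sketch of that pattern (the node names and helper function are illustrative assumptions):

# Locks required along the granularity hierarchy for a record operation.
def locks_for(path, operation):
    """path: list of nodes from the root down to the record."""
    intent = "IS" if operation == "read" else "IX"
    leaf = "S" if operation == "read" else "X"
    return [(node, intent) for node in path[:-1]] + [(path[-1], leaf)]

# T1 reads record Ra9 in file Fa of area A1:
print(locks_for(["DB", "A1", "Fa", "Ra9"], "read"))
# [('DB', 'IS'), ('A1', 'IS'), ('Fa', 'IS'), ('Ra9', 'S')]

# T2 modifies record Ra9 in file Fa of area A1:
print(locks_for(["DB", "A1", "Fa", "Ra9"], "write"))
# [('DB', 'IX'), ('A1', 'IX'), ('Fa', 'IX'), ('Ra9', 'X')]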
Recovery and Atomicity
A database could fail for any of the following reasons:
System breakdowns occur as a result of hardware or software issues in the system.
Transaction failures arise when a certain process dealing with data updates cannot be
completed.
Disk crashes may occur as a result of the system's failure to read the disc.
Physical damage includes issues such as power outages or natural disasters.
Even if the database system fails, the data in the database must be recoverable to the state it was in prior to the failure. In such situations, database recovery procedures in DBMS are employed to retrieve the data.
The recovery procedures in DBMS ensure the database's atomicity and durability. If a system
crashes in the middle of a transaction and all of its data is lost, it is not regarded as durable. If
just a portion of the data is updated during the transaction, it is not considered atomic. Data
recovery procedures in DBMS make sure that the data is always recoverable to protect the
durability property and that its state is retained to protect the atomic property. The procedures
listed below are used to recover data from a DBMS,
Recovery based on logs.
Recovery through Deferred Update
Immediate Recovery via Immediate Update
The atomicity attribute of DBMS safeguards the data state. If a data modification is performed,
the operation must be completed entirely, or the data's state must be maintained as if the
manipulation never occurred. This characteristic may be impacted by DBMS failure brought on
by transactions, but DBMS recovery methods will protect it.
Log-Based Recovery
Every DBMS has its own system logs, which record every system activity and include timestamps
for the event's timing. Databases manage several log files for operations such as errors, queries,
and other database updates. The log is saved in the following file formats:
[start transaction, T] represents the start of the execution of transaction T.
[write item, T, X, old value, new value] indicates that transaction T changes the value of variable X from the old value to the new value.
[read item, T, X] indicates that transaction T reads the value of X.
[commit, T] signifies that the modifications made by T have been committed to the database and can no longer be changed by the transaction; no errors should occur after the database has been committed.
[abort, T] indicates that transaction T has been cancelled.
We may utilize these logs to see how the state of the data changes during a transaction and
recover it to the prior or new state.
Data recovery activities can be carried out if the transaction fails and the data is in a partial state.
We can also use SQL instructions to mark the transaction's status and recover our data to that state.
To accomplish this, run the following commands:
The SAVEPOINT command is used to save the current state of data in a transaction. The syntax
of this command is, SAVEPOINT save_point_name;
The ROLLBACK command restores the data state to the save point provided by the command.
The command's syntax is, ROLLBACK TO save_point_name;
Database recovery methods in DBMS that rely on the transaction log include the deferred update and the immediate update techniques.
With a deferred update, the database's state of the data is not altered right away once a transaction
is completed; instead, the changes are recorded in the log file, and the database's state is updated
as soon as the commit is complete.
The database is directly updated at every transaction in the immediate update, and a log file
detailing the old and new values is also preserved.
Deferred Update vs. Immediate Update:
Deferred update: Changes to data are not applied immediately during the transaction; the log file records the changes that will be made. This approach employs buffering and caching. When the system fails, it takes longer to restore the data. When a rollback is performed, the log records are simply discarded and no changes need to be undone in the database.
Immediate update: A modification is made in the database as soon as the transaction performs it, and the log file records both the old and the new values. Shadow paging is used in this technique. A large number of I/O operations are performed during the transaction to manage the logs. When a rollback is executed, the log records are used to restore the data to its previous state.
Backup Techniques
A backup is a copy of the database's current state that is kept in another location. This backup is
beneficial in the event that the system is destroyed due to natural disasters or physical harm. The
database can be restored to the state it was in at the time of the backup using these backups. Many
backup techniques are used, including the following ones:
1. Immediate backups are copies saved on devices such as hard drives or other storage. When a disk fails or a technical error occurs, this copy can be used to retrieve the data.
2. An archive backup is a copy of the database kept on a large storage system or in the cloud at a different location. In the event that a natural calamity affects the system, these backups are used to retrieve the data.
Transaction Logs
Transaction logs are used to maintain track of all transactions that have updated the data in the
database. The following steps are taken to recover data from transaction logs. The recovery manager scans all log files for transactions that have a start-transaction record but no commit record.
The above-mentioned transactions are rolled back to the previous state using the rollback
command and the logs.
Shadow Paging
In shadow paging, the database is divided into n pages, each of which corresponds to a fixed-size block of disk memory. Shadow pages, which are replicas of the original pages, are created as well.
The database state is copied to the shadow pages at the start of a transaction. Only the current database pages are changed during the transaction, not the shadow pages. The shadow pages are brought up to date only when the transaction reaches the commit step; the updates are made so that if the i-th page of the database has been changed, the i-th shadow page is updated correspondingly.
In the event of a system failure, recovery is carried out by comparing the database's actual pages with its shadow pages.
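Shadow paging can be pictured as keeping two page tables, with commit being nothing more than switching from the shadow table to the current table. The sketch below is a simplified in-memory illustration under that assumption, not an actual disk implementation:

# Simplified shadow-paging illustration: pages are updated via a copy of
# the page table; commit atomically switches to the new table.
pages = {0: "A=300", 1: "B=500"}        # "disk" pages
shadow_table = {0: 0, 1: 1}             # page number -> physical page
current_table = dict(shadow_table)      # working copy for the transaction

def write_page(page_no, new_content):
    new_physical = max(pages) + 1       # write to a fresh physical page
    pages[new_physical] = new_content
    current_table[page_no] = new_physical   # shadow table is untouched

def commit():
    global shadow_table
    shadow_table = dict(current_table)  # the atomic "pointer switch"

write_page(0, "A=250")
# If the system crashed here, recovery would use shadow_table -> page 0 = "A=300"
commit()
print(pages[shadow_table[0]])           # A=250 after commit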
https://www.javatpoint.com/recovery-and-atomicity-in-dbms
Log–Based Recovery
The log is a sequence of records. The log of each transaction is maintained in stable storage so that, if any failure occurs, the database can be recovered from it.
Every operation performed on the database is recorded in the log, and the log record must be written to stable storage before the corresponding change is actually applied to the database.
Let's assume there is a transaction to modify the City of a student. The following logs are written
for this transaction.
When the transaction is initiated, then it writes 'start' log. <Tn, Start>
When the transaction modifies the City from 'Noida' to 'Bangalore', then another log is written to
the file. <Tn, City, 'Noida', 'Bangalore' >
When the transaction is finished, then it writes another log to indicate the end of the transaction.
<Tn, Commit>
There are two approaches to modifying the database:
Deferred database modification: The deferred modification technique is used when the transaction does not modify the database until it has committed. In this method, all the logs are created and stored in stable storage, and the database is updated only when the transaction commits.
Immediate database modification: The immediate modification technique is used when the database is modified while the transaction is still active. In this technique, the database is modified immediately after every operation, i.e., it follows the actual database modification.
Recovery using log records
When the system crashes, it consults the log to find which transactions need to be undone and which need to be redone.
If the log contains both the record <Ti, Start> and the record <Ti, Commit>, then transaction Ti needs to be redone.
If the log contains the record <Ti, Start> but contains neither <Ti, Commit> nor <Ti, Abort>, then transaction Ti needs to be undone.
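This redo/undo decision is just a scan over the log. The sketch below assumes the log is represented as a list of (transaction, record type) pairs, which is an illustrative simplification of the <Ti, ...> records above:

# Decide which transactions to redo and which to undo from a log.
log = [("T1", "start"), ("T1", "commit"),
       ("T2", "start"),                      # T2 never committed or aborted
       ("T3", "start"), ("T3", "abort")]

started, finished, committed = set(), set(), set()
for txn, record in log:
    if record == "start":
        started.add(txn)
    elif record == "commit":
        committed.add(txn)
        finished.add(txn)
    elif record == "abort":
        finished.add(txn)

redo = committed                       # has both <Ti, Start> and <Ti, Commit>
undo = started - finished              # started but neither committed nor aborted

print("redo:", redo)                   # {'T1'}
print("undo:", undo)                   # {'T2'}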
https://www.javatpoint.com/dbms-log-based-recovery
Recovery with Concurrent Transactions
Whenever more than one transaction is executed, their log records are interleaved. At recovery time it would be very difficult for the recovery system to backtrack through all the interleaved logs and then start recovering, so most DBMSs use the concept of a 'checkpoint' to ease this situation.
Concurrency control means that multiple transactions can be executed at the same time, which is what causes the logs to be interleaved; the order of execution of those transactions must therefore be taken into account during recovery.
Recovery with concurrent transactions can be done in the following four ways:
Interaction with concurrency control
Transaction rollback
Checkpoints
Restart recovery
Interaction with concurrency control: In this scheme, the recovery scheme depends greatly on the concurrency control scheme that is used. To roll back a failed transaction, we must undo the updates performed by that transaction.
Transaction rollback: In this scheme, we roll back a failed transaction by using the log. The system scans the log backwards for the failed transaction, and for every log record of that transaction found in the log, the system restores the data item to its old value.
Checkpoints: A checkpoint is the process of saving a snapshot of the application's state so that the system can restart from that point in case of failure. A checkpoint is a point in time at which a record is written onto the database from the buffers; it shortens the recovery process.
When a checkpoint is reached, the transactions up to that point are reflected in the database, and the log records before that point can be removed from the log file. The log file is then filled with the new transaction steps until the next checkpoint, and so on.
The checkpoint is used to declare a point before which the DBMS was in a consistent state and all transactions had been committed.
https://www.javatpoint.com/dbms-recovery-concurrent-transaction
https://www.geeksforgeeks.org/recovery-with-concurrent-transactions/
UNIT - V
Data on External Storage, File Organization and Indexing, Cluster Indexes, Primary and
Secondary Indexes, Index data Structures, Hash Based Indexing, Tree base Indexing,
Comparison of File Organizations, Indexes and Performance Tuning, Intuitions for tree
Indexes, Indexed Sequential Access Methods (ISAM), B+ Trees: A Dynamic Index
Structure
Data on External Storage
A database system provides a unified, high-level view of the stored data. Underneath, however, the data is stored as bits and bytes on different storage devices.
Types of Data Storage
For storing data, there are different types of storage options available. These storage types differ from one another in speed, cost and accessibility. The following types of storage devices are used for storing data:
Primary Storage - This is the storage area that offers the quickest access to the stored data. Primary storage is also known as volatile storage, because this type of memory does not store data permanently: as soon as the system suffers a power cut or a crash, the data is lost.
Main Memory: Main memory holds the data that the system is currently operating on and handles every instruction of the computer. It can store gigabytes of data, but it is generally too small (and too volatile) to hold an entire database. Main memory loses its whole content if the system shuts down because of a power failure or for other reasons.
Cache: The cache is one of the costliest storage media, but it is also the fastest. It is a tiny storage area that is usually managed by the computer hardware. When designing algorithms and query processors for data structures, designers take cache effects into account.
Secondary Storage - Secondary storage is also called online storage. It is the storage area that allows the user to save and store data permanently. This type of memory does not lose data due to a power failure or system crash, which is why it is also called non-volatile storage. Some commonly used secondary storage media, available in almost every computer system, are:
Flash Memory: Flash memory stores data in devices such as USB (Universal Serial Bus) keys, which are plugged into the USB slots of a computer system. These USB keys help transfer data to a computer system, and they vary in storage capacity. Unlike main memory, the stored data is not lost due to a power cut or other failures. This type of storage is commonly used in server systems for caching frequently used data, which leads to high performance, and it can hold larger amounts of data than main memory.
Magnetic Disk Storage: This type of storage media is also known as online storage
media. A magnetic disk is used for storing the data for a long time. It is capable of storing
an entire database. It is the responsibility of the computer system to make availability of
the data from a disk to the main memory for further accessing. Also, if the system
performs any operation on the data, the modified data should be written back to the disk. A major strength of a magnetic disk is that its data is not affected by a system crash or failure; however, a disk failure itself can easily ruin or destroy the stored data.
Tertiary Storage – It is the storage type that is external from the computer system. It has the
slowest speed. But it is capable of storing a large amount of data. It is also known as Offline
storage. Tertiary storage is generally used for data backup. There are following tertiary storage
devices available:
Optical Storage: An optical storage can store megabytes or gigabytes of data. A
Compact Disk (CD) can store 700 megabytes of data with a playtime of around 80
minutes. On the other hand, a Digital Video Disk or a DVD can store 4.7 or 8.5 gigabytes
of data on each side of the disk.
Tape Storage: Tape is a cheaper storage medium than disks. Generally, tapes are used for archiving or backing up data. They provide slow access because data is accessed sequentially from the start; thus, tape storage is also known as sequential-access storage. Disk storage, by contrast, is known as direct-access storage because data can be accessed directly from any location on the disk.
Storage Hierarchy
Besides the above, various other storage devices reside in the computer system. These storage
media are organized on the basis of data accessing speed, cost per unit of data to buy the
medium, and by medium's reliability. Thus, we can create a hierarchy of storage media on the
basis of its cost and speed.
Arranging the storage media described above in a hierarchy according to speed and cost gives the storage hierarchy. In this hierarchy, the higher levels are expensive but fast; moving down, the cost per bit decreases while the access time increases. The storage media from main memory upwards are volatile, while everything below main memory is non-volatile.
https://www.javatpoint.com/storage-system-in-dbms
File Organization and Indexing
File Organization - The File is a collection of records. Using the primary key, we can access the
records. The type and frequency of access can be determined by the type of file organization
which was used for a given set of records.
File organization is a logical relationship among various records. This method defines how file
records are mapped onto disk blocks.
File organization is used to describe the way in which the records are stored in terms of blocks,
and the blocks are placed on the storage medium.
The first approach to mapping the database to files is to use several files and store only fixed-length records of a single type in any given file. An alternative approach is to structure the files so that they can accommodate records of multiple lengths.
Files of fixed-length records are easier to implement than files of variable-length records.
Objectives of file organization
Records can be selected as quickly as possible.
Insert, delete and update transactions on the records should be quick and easy.
Duplicate records should not be introduced as a result of insert, update or delete operations.
Records should be stored efficiently, with minimal storage cost.
Types of file organization:
Sequential file organization - This method is the easiest method for file organization. In
this method, files are stored sequentially.
Pile File Method: It is a quite simple method. In this method, we store the record in a
sequence, i.e., one after another. Here, the record will be inserted in the order in which
they are inserted into tables. In case of updating or deleting of any record, the record will
be searched in the memory blocks. When it is found, then it will be marked for deleting,
and the new record is inserted.
Insertion of the new record:
Suppose we have records R1, R3, and so on up to R9 and R8 stored in sequence (each record is simply a row of the table). If we want to insert a new record R2 into the sequence, it is placed at the end of the file.
Sorted File Method: In this method, the new record is always inserted at the file's end,
and then it will sort the sequence in ascending or descending order. Sorting of records is
based on any primary key or any other key. In the case of modification of any record, it
will update the record and then sort the file, and lastly, the updated record is placed in the
right place.
Insertion of the new record:
Suppose there is a pre-existing sorted sequence of records R1, R3, and so on up to R6 and R7. If a new record R2 has to be inserted into the sequence, it is first appended at the end of the file, and then the sequence is sorted.
Pros of sequential file organization
It is a fast and efficient method for handling huge amounts of data.
Files can easily be stored on cheaper storage media such as magnetic tapes.
It is simple in design and requires little effort to store the data.
It is used when most of the records have to be accessed, such as for grade calculation of students or generating salary slips.
It is used for report generation and statistical calculations.
Cons of sequential file organization
It wastes time, because we cannot jump directly to a required record but have to move through the records sequentially.
The sorted file method takes extra time and space for sorting the records.
Heap file organization - It is the simplest and most basic type of organization. It works
with data blocks. In heap file organization, the records are inserted at the file's end. When
the records are inserted, it doesn't require the sorting and ordering of records. When the
data block is full, the new record is stored in some other block. This new data block need
not to be the very next data block, but it can select any data block in the memory to store
new records. The heap file is also known as an unordered file. In the file, every record has
a unique id, and every page in a file is of the same size. It is the DBMS responsibility to
store and manage the new records.
Insertion of a new record
Suppose we have five records R1, R3, R6, R4 and R5 in a heap and we want to insert a new record R2. If data block 3 is full, then R2 is inserted into any data block selected by the DBMS, say data block 1.
If we want to search, update or delete data in a heap file organization, we need to traverse the data from the start of the file until we find the requested record. If the database is very large, searching, updating or deleting a record is time consuming, because there is no sorting or ordering of the records: we need to check all the data until we reach the requested record.
Pros of Heap file organization
It is a very good method of file organization for bulk insertion. If there is a large number
of data which needs to load into the database at a time, then this method is best suited.
In case of a small database, fetching and retrieving of records is faster than the
sequential record.
Cons of Heap file organization
This method is inefficient for large databases because it takes time to search for or modify a record.
Hash file organization - Hash File Organization uses the computation of hash function
on some fields of the records. The hash function's output determines the location of disk
block where the records are to be placed.
When a record has to be retrieved using the hash key columns, the address is generated from the hash key and the whole record is fetched from that address. In the same way, when a new record has to be inserted, the address is generated using the hash key and the record is stored directly at that address. The same process applies to delete and update operations.
In this method, there is no need to search or sort the entire file; each record is stored at a seemingly random location in memory determined by the hash function.
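The placement rule described above, i.e. hash the key and use the result as the block address, can be sketched in a few lines. The number of blocks and the hash function here are assumptions chosen only for illustration:

# Hash file organization sketch: the hash of the key decides the block.
NUM_BLOCKS = 8
blocks = {i: [] for i in range(NUM_BLOCKS)}

def block_for(key):
    return hash(key) % NUM_BLOCKS       # address generated from the hash key

def insert(key, record):
    blocks[block_for(key)].append((key, record))

def lookup(key):
    # Only one block has to be examined; no search over the whole file.
    for k, record in blocks[block_for(key)]:
        if k == key:
            return record
    return None

insert(101, "student A")
insert(205, "student B")
print(lookup(205))                      # "student B"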
B+ tree file organization - B+ tree file organization is an advanced form of the indexed sequential access method. It uses a tree-like structure to store records in a file, with the same key-index concept, where the primary key is used to sort the records. For each primary key an index value is generated and mapped to the record. The B+ tree is similar to a binary search tree (BST), but a node can have more than two children. In this method, all the records are stored only in the leaf nodes; intermediate nodes act as pointers to the leaf nodes and do not contain any records.
The example B+ tree shows that:
There is one root node of the tree, containing the key 25.
There is an intermediate layer of nodes that do not store the actual records; they only hold pointers to the leaf nodes.
The nodes to the left of the root hold keys smaller than the root and the nodes to the right hold keys larger than the root, i.e., 15 and 30 respectively.
The leaf level contains only the actual values, i.e., 10, 12, 17, 20, 24, 27 and 29.
Searching for any record is easy because all the leaf nodes are at the same (balanced) level.
Any record can be reached by traversing a single path from the root and accessed easily.
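Searching the tree shown above descends from the root through the pointer-only intermediate nodes to a leaf, where all the values live. The sketch below hard-codes a tiny tree with the same keys (25 at the root, 15 and 30 below it) purely to illustrate that search path; it is not a full B+ tree implementation:

# Minimal B+ tree search sketch. Internal nodes hold keys and children;
# only leaf nodes hold the actual values.
class Node:
    def __init__(self, keys, children=None, values=None):
        self.keys = keys
        self.children = children     # None for leaf nodes
        self.values = values         # None for internal nodes

def search(node, key):
    while node.children is not None:             # descend to a leaf
        i = 0
        while i < len(node.keys) and key >= node.keys[i]:
            i += 1
        node = node.children[i]
    return key in node.values                    # all records live in leaves

# Tree from the example: root 25, intermediate keys 15 and 30,
# leaf values 10, 12, 17, 20, 24, 27, 29.
leaves = [Node([], values=[10, 12]), Node([], values=[17, 20, 24]),
          Node([], values=[27, 29]), Node([], values=[])]
left = Node([15], children=[leaves[0], leaves[1]])
right = Node([30], children=[leaves[2], leaves[3]])
root = Node([25], children=[left, right])

print(search(root, 17))    # True
print(search(root, 21))    # False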
Pros of B+ tree file organization
In this method, searching becomes very easy, as all the records are stored only in the leaf nodes and are linked together as a sorted sequential list.
Traversing through the tree structure is easier and faster.
The size of the B+ tree has no restrictions, so the number of records can increase
or decrease and the B+ tree structure can also grow or shrink.
It is a balanced tree structure, and any insert/update/delete does not affect the
performance of tree.
Cons of B+ tree file organization
This method is less efficient for static tables, where the data rarely changes and the overhead of a dynamic structure is not needed.
Indexed sequential access method (ISAM) - ISAM method is an advanced sequential
file organization. In this method, records are stored in the file using the primary key. An
index value is generated for each primary key and mapped with the record. This index
contains the address of the record in the file.
If any record has to be retrieved based on its index value, the address of the corresponding data block is fetched and the record is retrieved from memory.
Pros of ISAM:
In this method, each record has the address of its data block, searching a record in
a huge database is quick and easy.
This method supports range retrieval and partial retrieval of records. Since the
index is based on the primary key values, we can retrieve the data for the given
range of value. In the same way, the partial value can also be easily searched, i.e.,
the student name starting with 'JA' can be easily searched.
Cons of ISAM
This method requires extra space on the disk to store the index values.
When new records are inserted, these files have to be reconstructed to maintain the sequence.
When a record is deleted, the space used by it needs to be released; otherwise the performance of the database will slow down.
Cluster file organization - When records of two or more tables are stored in the same file, it is known as a cluster. These files hold two or more tables in the same data block, and the key attributes used to map the tables together are stored only once. This method reduces the cost of searching for related records in different files. Cluster file organization is used when there is a frequent need to join tables on the same condition, and such joins return only a few records from both tables. In the example, we retrieve records for particular departments only; this method cannot be used efficiently to retrieve the records of every department at once.
In this method, we can directly insert, update or delete any record. Data is sorted on the key used for searching, and the cluster key is the key on which the joining of the tables is performed.
Types of Cluster file organization:
Indexed Clusters: In an indexed cluster, records are grouped based on the cluster key and stored together. The EMPLOYEE and DEPARTMENT relationship above is an example of an indexed cluster, where all records are grouped based on the cluster key DEP_ID.
Hash Clusters: It is similar to the indexed cluster. In hash cluster, instead of storing the
records based on the cluster key, we generate the value of the hash key for the cluster key
and store the records with the same hash key value.
Pros of Cluster file organization
The cluster file organization is used when there is a frequent request for joining
the tables with same joining condition.
It provides the efficient result when there is a 1:M mapping between the
tables.
Cons of Cluster file organization
This method has low performance for very large databases.
If the joining condition changes, this method cannot be used; traversing the file with a different join condition takes a lot of time.
This method is not suitable for tables with a 1:1 relationship.
https://www.javatpoint.com/dbms-file-organization
https://www.javatpoint.com/dbms-sequential-file-organization
https://www.javatpoint.com/dbms-heap-file-organization
https://www.javatpoint.com/dbms-hash-file-organization
https://www.javatpoint.com/dbms-b-plus-file-organization
https://www.javatpoint.com/dbms-indexed-sequential-access-method
https://www.javatpoint.com/dbms-cluster-file-organization

Cluster Indexes
Primary and Secondary Indexes
Indexing is used to optimize the performance of a database by minimizing the number of disk
accesses required when a query is processed.
The index is a type of data structure. It is used to locate and access the data in a database table
quickly.
Index structure: Indexes can be created using some database columns.

The first column of the index is the search key, which contains a copy of the primary key or candidate key of the table. These key values are stored in sorted order so that the corresponding data can be accessed easily.
The second column of the index is the data reference. It contains a set of pointers holding the address of the disk block where the value of the particular key can be found.
Indexing Methods
Ordered indices - The indices are usually kept sorted to make searching faster. Indices that are sorted are known as ordered indices.
Example: Suppose we have an employee table with thousands of records, each of which is 10 bytes long. If the IDs start with 1, 2, 3, ... and we have to search for the employee with ID 543, then in a database with no index we have to scan the disk blocks from the start until we reach 543; the DBMS will find the record only after reading 543*10 = 5430 bytes. With an index (assuming each index entry occupies 2 bytes), the DBMS searches the index and finds the record after reading only 542*2 = 1084 bytes, which is far less than in the previous case.
Primary Index - If the index is created on the basis of the primary key of the table, then it is known as primary indexing. These primary keys are unique to each record, and there is a 1:1 relation between index entries and records.
As primary keys are stored in sorted order, the performance of the searching operation is quite efficient.
The primary index can be classified into two types: Dense index and Sparse index.
Dense index - The dense index contains an index record for every search key value in the data file. It makes searching faster.
In this, the number of records in the index table is the same as the number of records in the main table.
It needs more space to store the index records themselves. Each index record holds the search key value and a pointer to the actual record on the disk.

Sparse index
In the data file, an index record appears only for some of the items. Each entry points to a block. Instead of pointing to every record in the main table, the index points to records in the main table at intervals, leaving gaps between indexed entries.
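The difference between the two can be sketched as follows; the block size, keys and row values are assumptions made purely for illustration, with a dense index mapping every key to a record position and a sparse index mapping only the first key of each block.

```python
# Sketch contrasting a dense index (one entry per record) with a sparse index
# (one entry per block). Block size and data are illustrative assumptions.
from bisect import bisect_right

BLOCK_SIZE = 3
table = [(k, f"row-{k}") for k in (5, 8, 12, 17, 21, 30, 34, 40, 41)]  # sorted on key

# Dense index: every search key appears, pointing at the record position.
dense = {key: pos for pos, (key, _) in enumerate(table)}

# Sparse index: only the first key of each block appears, pointing at the block.
sparse = [(table[i][0], i // BLOCK_SIZE) for i in range(0, len(table), BLOCK_SIZE)]

def sparse_lookup(key):
    """Find the last sparse entry <= key, then scan that block sequentially."""
    keys = [k for k, _ in sparse]
    block = sparse[max(bisect_right(keys, key) - 1, 0)][1]
    start = block * BLOCK_SIZE
    for k, row in table[start:start + BLOCK_SIZE]:
        if k == key:
            return row
    return None

print(table[dense[17]])      # direct hit via the dense index
print(sparse_lookup(17))     # block jump + short sequential scan via the sparse index
```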

Clustering Index
A clustered index can be defined as an ordered data file. Sometimes the index is created on non-primary key columns, which may not be unique for each record. In this case, to identify the records faster, we group two or more columns together to get a unique value and create an index on them. This method is called a clustering index. Records that have similar characteristics are grouped together, and indexes are created for these groups.
Example: suppose a company contains several employees in each department. Suppose we use a
clustering index, where all employees which belong to the same Dept_ID are considered within a
single cluster, and index pointers point to the cluster as a whole. Here Dept_Id is a non-unique
key.

Secondary Index
In sparse indexing, as the size of the table grows, the size of the mapping also grows. These mappings are usually kept in primary memory so that address fetches are fast; the actual data is then read from secondary memory using the address obtained from the mapping. If the mapping itself grows too large, fetching an address becomes slow and the sparse index is no longer efficient. To overcome this problem, secondary indexing is introduced. In secondary indexing, another level of indexing is added to reduce the size of the mapping. Initially, large ranges of the column values are selected so that the mapping size of the first level stays small; each range is then divided into smaller ranges. The mapping of the first level is stored in primary memory, so that address fetches are fast. The mapping of the second level and the actual data are stored in secondary memory (the hard disk).
For example:
To find the record with roll number 111 in the diagram, first search the first-level index for the largest entry that is smaller than or equal to 111; this gives 100. Then, in the second-level index for that range, again take the largest entry less than or equal to 111, which gives 110. Using the address stored with 110, go to the data block and scan the records sequentially until 111 is found. This is how a search is performed in this method; inserting, updating and deleting are done in the same manner.
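A hedged sketch of this two-level lookup is given below, reusing the numbers from the example (roll 111, first-level entry 100, second-level entry 110); the segment names, block addresses and index contents are invented to mirror the description, not taken from any real system.

```python
# Sketch of a two-level (secondary) index lookup, following the example above.
from bisect import bisect_right

def floor_entry(entries, key):
    """Largest entry whose key is <= the search key."""
    keys = [k for k, _ in entries]
    return entries[bisect_right(keys, key) - 1]

# First level (kept in primary memory): key -> second-level segment.
first_level = [(1, "seg-A"), (100, "seg-B")]
# Second level (on disk): per segment, key -> data block address.
second_level = {"seg-B": [(100, "blk-7"), (110, "blk-8")]}
# Data blocks (on disk): address -> records, sorted on the key.
data_blocks = {"blk-8": [(110, "..."), (111, "record 111"), (112, "...")]}

def search(roll):
    _, segment = floor_entry(first_level, roll)                # max entry <= 111 -> 100
    _, block_addr = floor_entry(second_level[segment], roll)   # -> 110 / blk-8
    for key, rec in data_blocks[block_addr]:                   # sequential scan
        if key == roll:
            return rec
    return None

print(search(111))   # 'record 111'
```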
Primary Index vs Secondary Index
Definition: A primary index is an index on a set of fields that includes the unique primary key and is guaranteed not to contain duplicates. A secondary index is an index that is not a primary index and may contain duplicates.
Order: The primary index requires the rows in the data blocks to be ordered on the index key. A secondary index does not require the rows to be ordered on the index key.
Number of indexes: There is only one primary index per table, but there can be multiple secondary indexes.
Duplicates: There are no duplicates in the primary index, whereas there can be duplicates in the secondary indexes.
https://www.javatpoint.com/indexing-in-dbms
https://pediaa.com/what-is-the-difference-between-primary-and-secondary-index/

Index data Structures


In a database management system (DBMS), an index is a data structure that improves the speed of data retrieval operations on a database table. Indexes provide a way to quickly locate rows based on the values of one or more columns. Without indexes, the DBMS would need to scan the entire table to find the desired data, which can be very inefficient for large tables. Indexes work similarly to the index in a book: they provide a way to look up information more quickly. When you create an index on a column or a set of columns, the DBMS creates a separate data structure that stores the indexed column's values along with references to the corresponding rows in the table. This allows the DBMS to perform lookups and queries more efficiently.
There are different types of index data structures commonly used in DBMS:
B-Tree Index: This is the most common type of index used in most relational database systems. B-Trees are balanced trees that store data in sorted order, allowing for efficient range queries, point queries, and insertions. They are well suited for disk-based storage.
Hash Index: Hash indexes use a hash function to map index keys to locations in the index. They are effective for point queries but less efficient for range queries or ordered retrieval.
Bitmap Index: Bitmap indexes use a bitmap for each unique value in the indexed column, representing whether a row contains that value or not. They are useful for low-cardinality columns (columns with few distinct values) and are efficient for certain types of queries, such as boolean operations.
Sparse Index: Sparse indexes are used in databases that have a lot of null values. Instead of indexing every row, they index only the non-null values, which reduces the size of the index and speeds up lookups.
Clustered Index: In databases like SQL Server and InnoDB in MySQL, the clustered index
determines the physical order of data rows in the table. A table can have only one clustered
index, and it greatly affects the way data is stored on disk.
Non-Clustered Index: Non-clustered indexes are separate structures from the actual table data.
They include the indexed column's values and a reference to the corresponding rows in the table.
A table can have multiple non-clustered indexes.
Covering Index: A covering index includes all the columns needed for a query, so the DBMS
doesn't need to access the actual table to retrieve the required data. This can significantly
improve query performance.
Indexing involves a trade-off between query performance and the overhead of maintaining the
index during data modifications (inserts, updates, and deletes). While indexes can greatly speed
up read operations, they can slightly slow down write operations due to the additional index
maintenance overhead.
Hash Based Indexing
In a huge database structure, it is very inefficient to search all the index values and reach the
desired data. Hashing technique is used to calculate the direct location of a data record on the
disk without using index structure.
In this technique, data is stored at the data blocks whose address is generated by using the
hashing function. The memory location where these records are stored is known as data bucket
or data blocks.
The hash function can use any column value to generate the address, but most of the time it uses the primary key to generate the address of the data block. A hash function can be anything from a simple mathematical function to a complex one. We can even use the primary key itself as the address of the data block; in that case, each row is stored in the data block whose address is the same as its primary key value.

When the primary key itself is used as the address, the data block addresses are the same as the primary key values. The hash function can also be a simple mathematical function such as mod, exponential, cos, sin, etc. Suppose we use a mod(5) hash function to determine the address of the data block. Applying mod(5) to the primary keys generates 3, 3, 1, 4 and 2 respectively, and the records are stored at those data block addresses.
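The bucket computation can be sketched as follows; the EMP_ID values are assumptions chosen so that mod(5) produces the addresses 3, 3, 1, 4 and 2 mentioned above.

```python
# Minimal sketch of hash-based placement with a mod(5) hash function; the
# EMP_ID values are invented so that mod(5) yields 3, 3, 1, 4 and 2.
NUM_BUCKETS = 5
buckets = {i: [] for i in range(NUM_BUCKETS)}

def bucket_address(emp_id):
    # The hash function: primary key mod number of buckets.
    return emp_id % NUM_BUCKETS

for emp_id in (103, 108, 106, 104, 102):
    buckets[bucket_address(emp_id)].append(emp_id)

print(bucket_address(103))   # 3 -> record 103 is stored in data bucket 3
print(buckets)
```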
Types of Hashing:
Static Hashing - In static hashing, the resultant data bucket address will always be the same. That means that if we generate an address for EMP_ID = 103 using the hash function mod(5), it will always result in the same bucket address, 3. Here, there will be no change in the bucket address.
Hence in this static hashing, the number of data buckets in memory remains constant throughout.
In this example, we will have five data buckets in the memory used to store the data.

Operations of Static Hashing


Searching a record - When a record needs to be searched, then the same hash function
retrieves the address of the bucket where the data is stored.
Insert a Record - When a new record is inserted into the table, an address is generated for the new record based on the hash key, and the record is stored at that location.
Delete a Record - To delete a record, we will first fetch the record which is supposed to
be deleted. Then we will delete the records for that address in memory.
Update a Record - To update a record, we will first search it using a hash function, and
then the data record is updated.
If we want to insert a new record into the file but the data bucket address generated by the hash function is not empty (data already exists at that address), the situation is known as bucket overflow. This is a critical situation in static hashing. To overcome it, several methods are commonly used, as follows:
Open Hashing - When a hash function generates an address at which data is already stored, the next available bucket is allocated to the record. This mechanism is called Linear Probing. For example: suppose R3 is a new record that needs to be inserted, and the hash function generates address 112 for R3. The bucket at that address is already full, so the system searches for the next available data bucket, 113, and assigns R3 to it.
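A minimal sketch of linear probing is given below, using the bucket addresses from the example (112 already full, so the record lands in 113); the bucket layout and record names are illustrative assumptions.

```python
# Sketch of open hashing / linear probing: if the hashed bucket is occupied,
# the next free bucket is used. Addresses 110-115 are taken from the example.
buckets = {addr: None for addr in range(110, 116)}
buckets[110] = "R1"
buckets[112] = "R2"          # bucket 112 is already full

def insert_linear_probe(record, home_addr):
    addr = home_addr
    while buckets[addr] is not None:     # probe the following buckets
        addr += 1                        # assumes a free bucket exists further on
    buckets[addr] = record
    return addr

print(insert_linear_probe("R3", 112))    # R3 lands in the next free bucket, 113
```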

Close Hashing - When a bucket is full, a new data bucket is allocated for the same hash result and is linked after the previous one. This mechanism is known as Overflow Chaining. For example: suppose R3 is a new record that needs to be inserted into the table, and the hash function generates address 110 for it. That bucket is already full, so a new overflow bucket is allocated at the end of bucket 110 and linked to it, and R3 is stored there.
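Overflow chaining can be sketched in the same spirit; here each bucket address maps to a chain of small buckets, and a full bucket simply gets a new overflow bucket linked after it. The bucket capacity of one record is an assumption made to keep the example tiny.

```python
# Sketch of close hashing / overflow chaining: a full bucket gets an overflow
# bucket linked after it, modeled here as a list of buckets per bucket address.
BUCKET_CAPACITY = 1
chains = {110: [["R1"]]}                 # bucket 110 already holds R1

def insert_with_chaining(record, addr):
    chain = chains.setdefault(addr, [[]])
    if len(chain[-1]) >= BUCKET_CAPACITY:
        chain.append([])                  # allocate a new overflow bucket
    chain[-1].append(record)

insert_with_chaining("R3", 110)
print(chains[110])                        # [['R1'], ['R3']] -> overflow chain
```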

Dynamic Hashing - The dynamic hashing method is used to overcome the problems of static
hashing like bucket overflow.
In this method, data buckets grow or shrink as the number of records increases or decreases. This method is also known as the extendible hashing method.
This method makes hashing dynamic, i.e., it allows insertion or deletion without resulting in poor
performance.
How to search a key
First, calculate the hash address of the key.
Check how many bits are used in the directory; this number of bits is called i.
Take the least significant i bits of the hash address. This gives an index into the directory.
Now, using this index, go to the directory and find the bucket address where the record might be.
How to insert a new record
Firstly, follow the same procedure as retrieval, ending up in some bucket. If there is still space in that bucket, place the record in it.
If the bucket is full, split the bucket and redistribute the records.
For example:
Consider the following grouping of keys into buckets, depending on the last bits of their hash address:

The last two bits of the hash addresses of 2 and 4 are 00, so they go into bucket B0. The last two bits of 5 and 6 are 01, so they go into bucket B1. The last two bits of 1 and 3 are 10, so they go into bucket B2. The last two bits of 7 are 11, so it goes into B3.

Insert key 9 with hash address 10001 into the above structure:
Since the last two bits of key 9's hash address (10001) are 01, it must go into bucket B1. But bucket B1 is full, so it will be split.
The split separates 5 and 9 from 6: the last three bits of the hash addresses of 5 and 9 are 001, so they go into bucket B1, while the last three bits of 6 are 101, so it goes into the new bucket B5.
Keys 2 and 4 are still in B0. The records in B0 are pointed to by the 000 and 100 directory entries, because the last two bits of both entries are 00.
Keys 1 and 3 are still in B2. The records in B2 are pointed to by the 010 and 110 directory entries, because the last two bits of both entries are 10.
Key 7 is still in B3. The record in B3 is pointed to by the 111 and 011 directory entries, because the last two bits of both entries are 11.
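The directory lookup used above can be sketched as follows; the 5-bit hash addresses and the two-bit directory are assumptions that mirror the worked example, not part of any specific implementation.

```python
# Sketch of the directory lookup in extendible (dynamic) hashing: the last i
# bits of the hash address select the directory entry.
def directory_index(hash_address: str, i: int) -> str:
    """Return the last i bits of the binary hash address."""
    return hash_address[-i:]

directory = {"00": "B0", "01": "B1", "10": "B2", "11": "B3"}   # i = 2

print(directory_index("10001", 2))            # '01' -> key 9 goes to bucket B1
print(directory[directory_index("10001", 2)])
# After B1 splits, i grows to 3 and '001' vs '101' separate keys 5, 9 from 6.
```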

Advantages of dynamic hashing


In this method, the performance does not decrease as the data grows in the system. It
simply increases the size of memory to accommodate the data.
In this method, memory is well utilized as it grows and shrinks with the data. There will not be any memory lying unused.
This method is good for the dynamic database where data grows and shrinks
frequently.
Disadvantages of dynamic hashing
In this method, if the data size increases, the number of buckets also increases, and these bucket addresses are maintained in the bucket address table. Because the data addresses keep changing as buckets grow and shrink, maintaining the bucket address table becomes tedious when there is a huge increase in data.
The bucket overflow situation can still occur, but it takes longer to reach than in static hashing.
https://www.javatpoint.com/dbms-hashing
https://www.javatpoint.com/dbms-static-hashing
https://www.javatpoint.com/dbms-dynamic-hashing

Tree base Indexing


Tree-based indexing is a common approach used in database management systems to efficiently
organize and retrieve data from a database table. It involves creating a hierarchical data structure
(usually a tree) that allows for quick access to rows based on the values of indexed columns. The
two most prevalent types of tree-based indexing are B-Tree (Balanced Tree) and B+Tree
(Balanced Plus Tree).
B-Tree (Balanced Tree): B-Trees are self-balancing tree structures that maintain sorted data. They are commonly used in file systems and databases to index data.
Each node in a B-Tree can have multiple keys and pointers to child nodes. The tree is kept
balanced by redistributing keys when nodes become too full or too empty. B-Trees are
designed for disk-based storage systems and are suitable for both point queries and range
queries.
They are used for both clustered and non-clustered indexes in various database systems.
B+Tree (Balanced Plus Tree): B+Trees are an extension of B-Trees and are widely used for indexing in most modern database systems.
Like B-Trees, B+Trees are self-balancing and maintain sorted data.
In a B+Tree, keys are stored in the leaf nodes, and leaf nodes are linked together in a linked list.
Non-leaf nodes in a B+Tree only contain pointers to child nodes, making the tree more
compact. B+Trees are optimized for disk-based storage systems and work well with sequential
I/O operations.
They are particularly suitable for range queries due to the linked list structure of leaf nodes,
which enables efficient range scans.
Both B-Trees and B+Trees have logarithmic height, meaning the number of levels in the tree
grows slowly as the number of elements (rows) increases. This property ensures efficient lookup
times, as the number of nodes to traverse remains manageable even for large datasets. Tree-
based indexes improve query performance by reducing the number of disk accesses required to
locate specific rows. When a query involves filtering or searching based on indexed columns,
the DBMS can use the index to navigate the tree structure and quickly locate the desired rows.
This significantly speeds up data retrieval compared to a full table scan.

Comparison of File Organizations


Heap File - Description: Random placement of records within the file; no specific order. Advantages: Simple insertion, no need for sorting. Disadvantages: Slow for retrieval and range queries.

Sequential File - Description: Records are stored in order based on a designated search key. Advantages: Good for range queries, sequential access. Disadvantages: Slow for insertion and updating.

Indexed Sequential File - Description: Similar to a sequential file, but includes an index for quicker access. Advantages: Improved retrieval with the index. Disadvantages: Slower insertion due to index maintenance.

B-Tree File - Description: Records are organized using a B-Tree data structure. Advantages: Efficient for point and range queries. Disadvantages: Overhead of index maintenance.

B+Tree File - Description: Extension of the B-Tree, optimized for disk-based systems. Advantages: Efficient for range queries. Disadvantages: Overhead of index maintenance.

Hash File - Description: Records are distributed among a fixed number of buckets using a hash function. Advantages: Very fast for point queries. Disadvantages: Not suitable for range queries.

Clustered File - Description: Records are physically stored in the same order as a specified clustering key. Advantages: Efficient for specific queries. Disadvantages: May require reorganization for new queries.

Partitioned File - Description: Data is divided into partitions based on a range of values in a partitioning key. Advantages: Parallel processing for queries. Disadvantages: Uneven distribution may lead to imbalance.

Indexes and Performance Tuning


Indexes and performance tuning are crucial aspects of database management systems (DBMS) to
optimize query execution and overall system performance.
Indexes - Indexes are data structures that accelerate data retrieval by providing a quick way to
locate rows in a database table. They allow the DBMS to locate rows based on the values in one
or more indexed columns.
Here's how indexes impact performance:
Faster Data Retrieval
Reduced I/O
Query Optimization
Trade-offs
Performance Tuning - Performance tuning involves optimizing the database and queries to
achieve better overall system performance.
Key strategies:
Indexing Strategy: Choose the right columns to index based on the types of queries your
application performs most frequently. Over-indexing can lead to increased overhead, while
under-indexing can result in slow query performance.
Query Optimization: Craft efficient SQL queries by using appropriate joins,
aggregations, and filtering conditions. Analyze query execution plans to identify performance
bottlenecks.
Normalization and Denormalization: Properly normalize your database to minimize
redundancy and data anomalies. However, consider denormalization for frequently queried tables
to reduce join operations.
Caching: Implement caching mechanisms to store frequently accessed data in memory, reducing the need for repeated disk reads (a small sketch follows this list).
Partitioning: For large tables, consider partitioning the data into smaller, manageable
chunks. This improves both query and maintenance performance.
Hardware Optimization: Configure the database server's hardware parameters, such as memory allocation and CPU usage, to match the workload and maximize performance.
Query and Index Statistics: Regularly update statistics on tables and indexes. The DBMS uses these statistics to make informed decisions about query execution plans.
Monitoring and Profiling: Continuously monitor the database's performance using tools and collect metrics. Identify and address performance issues as they arise.
Database Maintenance: Perform routine maintenance tasks like index rebuilding, data purging, and database backups to ensure optimal performance.
Connection Pooling: Use connection pooling to efficiently manage database connections
and reduce the overhead of opening and closing connections.
Parallelism and Concurrency: Utilize parallel processing and proper concurrency
control mechanisms to make efficient use of multi-core processors and allow multiple users to
access the database concurrently.
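As one concrete illustration of the caching strategy listed above, the sketch below memoizes query results in application memory so that repeated identical queries do not touch the database again; run_query() and the SQL text are hypothetical stand-ins, not the API of any particular DBMS.

```python
# Illustrative query-result cache (memoization) for the "Caching" strategy above.
# run_query() is a hypothetical stand-in for a real database call.
cache = {}

def run_query(sql: str):
    print(f"hitting the database for: {sql}")
    return [("row", 1)]                      # pretend result set

def cached_query(sql: str):
    if sql not in cache:                     # only the first call touches the DB
        cache[sql] = run_query(sql)
    return cache[sql]

cached_query("SELECT * FROM EMPLOYEE WHERE DEP_ID = 10")   # reads from the DB
cached_query("SELECT * FROM EMPLOYEE WHERE DEP_ID = 10")   # served from the cache
```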
Intuitions for tree Indexes
Hierarchy and Organization: a tree index arranges keys in a hierarchy of nodes, so a search narrows down the candidate rows level by level instead of scanning the whole table.
Balancing and Logarithmic Height: because the tree is kept balanced, its height grows only logarithmically with the number of rows, so lookups stay cheap even for large tables.
Efficient Data Retrieval: each search follows a single root-to-leaf path, so only a few nodes (disk pages) are read per lookup.
Application and Performance: tree indexes support both point and range queries and are used for clustered and non-clustered indexes in most database systems.

Indexed Sequential Access Methods (ISAM)


Indexed Sequential Access Method (ISAM) is a data access method used in computer systems
for managing and accessing data stored on disk or other storage devices. It combines the features
of sequential access and random access methods to provide efficient retrieval and update
operations.
ISAM organizes data into fixed-length blocks or records and stores them in sequential order on
the storage device. Each block is assigned a unique identifier called a record number or block
number. Additionally, ISAM maintains an index structure that allows for direct access to specific
records based on key values.
The index structure typically consists of an index file or index table, which contains key values
and corresponding pointers to the physical location of the records. The index can be organized in
various ways, such as a B-tree or a hash table, depending on the specific implementation. To
access a record using ISAM, the system performs a search operation on the index to locate the
appropriate block or blocks containing the desired record. Once the block is located, sequential
access is used within the block to retrieve or update the desired record.
Advantages:
Efficient retrieval
Sequential processing
Data integrity
Support for concurrent access
Limitations:
Fixed record length
Index maintenance
Limited flexibility
Lack of data independence
https://www.javatpoint.com/dbms-indexed-sequential-access-method
https://www.w3spoint.com/indexed-sequential-access-method-isam

B+ Trees: A Dynamic Index Structure


B+ Tree is a variation of the B-tree data structure. In a B+ tree, data pointers are stored only at the leaf nodes of the tree. In a B+ tree, the structure of a leaf node differs from the structure of the internal nodes. The leaf nodes have an entry for every value of the search field, along with a data pointer to the record (or to the block that contains this record). The leaf nodes of the B+ tree are linked together to provide ordered access on the search field to the records. Internal nodes of a B+ tree are used to guide the search, and some search field values from the leaf nodes are repeated in the internal nodes.
Features of B+ Trees
• Balanced: B+ Trees are self-balancing, which means that as data is added or removed from the
tree, it automatically adjusts itself to maintain a balanced structure. This ensures that the search
time remains relatively constant, regardless of the size of the tree.
• Multi-level: B+ Trees are multi-level data structures, with a root node at the top and one or more
levels of internal nodes below it. The leaf nodes at the bottom level contain the actual data.
• Ordered: B+ Trees maintain the order of the keys in the tree, which makes it easy to
perform range queries and other operations that require sorted data.
• Fan-out: B+ Trees have a high fan-out, which means that each node can have many child
nodes. This reduces the height of the tree and increases the efficiency of searching and
indexing operations.
• Cache-friendly: B+ Trees are designed to be cache-friendly, which means that they can take
advantage of the caching mechanisms in modern computer architectures to improve
performance.
• Disk-oriented: B+ Trees are often used for disk-based storage systems because they are
efficient at storing and retrieving data from disk.
Why Use B+ Tree?
1. B+ Trees are the best choice for storage systems with sluggish data access because they
minimize I/O operations while facilitating efficient disc access.
2. B+ Trees are a good choice for database systems and applications needing quick data
retrieval because of their balanced structure, which guarantees predictable performance for a
variety of activities and facilitates effective range-based queries.
Difference between B+ Tree and B Tree
Structure: A B+ tree has separate leaf nodes for data storage and internal nodes for indexing. A B tree stores both keys and data values in the same nodes.
Leaf Nodes: In a B+ tree, the leaf nodes form a linked list for efficient range-based queries. In a B tree, the leaf nodes do not form a linked list.
Order: A B+ tree has a higher order (more keys per node). A B tree has a lower order (fewer keys per node).
Key Duplication: A B+ tree typically allows key duplication, repeating keys in the leaf nodes. A B tree usually does not allow key duplication.
Disk Access: A B+ tree gives better disk access due to sequential reads along the linked leaf list. A B tree needs more disk I/O due to non-sequential reads through internal nodes.
Applications: B+ trees are used in database systems and file systems where range queries are common. B trees are used for in-memory data structures, databases and general-purpose use.
Performance: B+ trees give better performance for range queries and bulk data retrieval. B trees give balanced performance for search, insert, and delete operations.
Memory Usage: A B+ tree requires more memory for internal nodes. A B tree requires less memory, as keys and values are stored in the same node.
Implementation of B+ Tree
In order to implement dynamic multilevel indexing, B-tree and B+ tree are generally employed.
The drawback of the B-tree used for indexing, however, is that it stores the data pointer (a pointer
to the disk file block containing the key value), corresponding to a particular key value, along with
that key value in the node of a B-tree. This technique greatly reduces the number of entries that
can be packed into a node of a B-tree, thereby contributing to the increase in the number of levels
in the B-tree, hence increasing the search time of a record. B+ tree eliminates the above drawback
by storing data pointers only at the leaf nodes of the tree. Thus, the structure of the leaf nodes of a
B+ tree is quite different from the structure of the internal nodes of the B tree. It may be noted here
that, since data pointers are present only at the leaf nodes, the leaf nodes must necessarily store all
the key values along with their corresponding data pointers to the disk file block, in order to access
them.
Moreover, the leaf nodes are linked to providing ordered access to the records. The leaf nodes,
therefore form the first level of the index, with the internal nodes forming the other levels of a
multilevel index. Some of the key values of the leaf nodes also appear in the internal nodes, to
simply act as a medium to control the searching of a record. From the above discussion, it is
apparent that a B+ tree, unlike a B-tree, has two orders, ‘a’ and ‘b’, one for the internal nodes and
the other for the external (or leaf) nodes.
Structure of B+ Trees

B+ Trees contain two types of nodes:


• Internal Nodes: the non-leaf nodes of the tree. Apart from the root, each internal node holds at least ⌈a/2⌉ tree pointers.
• Leaf Nodes: the bottom-level nodes, which hold the key values together with their data pointers and a pointer to the next leaf.
The Structure of the Internal Nodes of a B+ Tree of Order ‘a’ is as Follows
1. Each internal node is of the form <P1, K1, P2, K2, ....., Pc-1, Kc-1, Pc>, where c <= a, each Pi is a tree pointer (i.e., it points to another node of the tree) and each Ki is a key value.
2. Within every internal node: K1 < K2 < .... < Kc-1.
3. For each search field value ‘X’ in the subtree pointed at by Pi, the following condition holds: Ki-1 < X <= Ki for 1 < i < c, and Ki-1 < X for i = c.
4. Each internal node has at most ‘a’ tree pointers.
5. The root node has at least two tree pointers, while the other internal nodes have at least ⌈a/2⌉ tree pointers each.
6. If an internal node has ‘c’ pointers, c <= a, then it has ‘c - 1’ key values.
The Structure of the Leaf Nodes of a B+ Tree of Order ‘b’ is as Follows
1. Each leaf node is of the form <<K1, D1>, <K2, D2>, ....., <Kc-1, Dc-1>, Pnext>, where c <= b, each Di is a data pointer (i.e., it points to the actual record on the disk whose key value is Ki, or to a disk file block containing that record), each Ki is a key value, and Pnext points to the next leaf node in the B+ tree.
2. Every leaf node has: K1 < K2 < .... < Kc-1, with c <= b.
3. Each leaf node has at least ⌈b/2⌉ values.
4. All leaf nodes are at the same level.
Using the Pnext pointer it is possible to traverse all the leaf nodes, just like a linked list, thereby achieving ordered access to the records stored on disk.
Searching a Record in B+ Trees

Suppose we have to find 58 in a B+ tree. We start at the root node and follow the tree pointers down towards the leaf node that might contain 58. In the example tree, 58 lies between the keys 50 and 70 in the internal nodes, so we follow the pointer between them, reach the corresponding leaf node and find 58 there. If the key is not present in that leaf node, a ‘record not found’ message is returned.
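A minimal sketch of this search in Python is shown below; it is an illustration of the idea, not a full B+ tree implementation. The node classes, key values and data pointers are assumptions invented for the example, and the descent follows the Ki-1 < X <= Ki rule from the structure section above.

```python
# A minimal, illustrative B+ tree search (a sketch, not a full implementation).
# Internal nodes hold keys and child pointers; leaves hold <key, data pointer>
# pairs plus a Pnext pointer. The tiny tree below is invented for the example.
from bisect import bisect_left

class Leaf:
    def __init__(self, entries):
        self.entries = dict(entries)   # Ki -> Di (data pointer / record)
        self.next = None               # Pnext: links the leaves in key order

class Internal:
    def __init__(self, keys, children):
        self.keys = keys               # K1 < K2 < ... guide the search
        self.children = children       # Pi: one more child than keys

def search(node, key):
    # Follow the rule Ki-1 < X <= Ki: pick the first key >= X and descend
    # into the child pointer to its left.
    while isinstance(node, Internal):
        node = node.children[bisect_left(node.keys, key)]
    return node.entries.get(key)       # None means "record not found"

leaf1 = Leaf([(10, "r10"), (50, "r50")])
leaf2 = Leaf([(58, "r58"), (70, "r70")])
leaf3 = Leaf([(90, "r90")])
leaf1.next, leaf2.next = leaf2, leaf3   # the leaf-level linked list
root = Internal([50, 70], [leaf1, leaf2, leaf3])

print(search(root, 58))   # 'r58'
print(search(root, 60))   # None -> record not found
```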
Insertion in B+ Trees
Every element in the tree has to be inserted into a leaf node. Therefore, it is necessary to go to the proper leaf node first.
Insert the key into the leaf node in increasing order if there is no overflow; if the leaf node overflows, it is split and the middle key is copied up into the parent node.
Deletion in B+ Trees
Deletion in B+ trees is not just deletion but a combined process of searching, deletion, and balancing. In the last step of the deletion process, it is mandatory to rebalance the B+ tree; otherwise, it would violate the properties of B+ trees.
Advantages of B+Trees
A B+ tree with ‘l’ levels can store more entries in its internal nodes compared to a B-tree having the same ‘l’ levels. This significantly improves the search time for any given key. Having fewer levels and the presence of Pnext pointers means that B+ trees are very quick and efficient at accessing records from disk.
Data stored in a B+ tree can be accessed both sequentially and directly.
It takes an equal number of disk accesses to fetch any record, since all data pointers are at the same leaf level.
B+ trees store some search keys redundantly: key values that appear in the internal nodes are repeated in the leaf nodes.
Disadvantages of B+Trees
The major drawback of the B-tree is the difficulty of traversing the keys sequentially; the B+ tree addresses this by retaining the rapid random access property of the B-tree while also allowing rapid sequential access through the linked leaf nodes. The cost is the extra space taken by the repeated search keys.
Application of B+ Trees
Multilevel Indexing
Faster operations on the tree (insertion, deletion, search)
Database indexing
https://www.geeksforgeeks.org/introduction-of-b-tree/
