RDBMS Material
In simple words, data can be facts related to any object under consideration. For example, your name, age, height and weight are some data related to you. A picture, an image, a file, a PDF, etc. can also be considered data.
Database:
A database is a collection of inter-related data which is used to retrieve, insert and delete data efficiently.
It is also used to organize the data in the form of tables, schemas, views, reports, etc.
A database management system is software which is used to manage the database. For example: MySQL, Oracle.
DBMS provides an interface to perform various operations like database creation,
storing data in it, updating data, creating a table in the database and a lot more.
It provides protection and security to the database. In the case of multiple users, it also
maintains data consistency.
Data Definition: It is used for creation, modification, and removal of definition that
defines the organization of data in the database.
Data Updation: It is used for the insertion, modification, and deletion of the actual
data in the database.
Data Retrieval: It is used to retrieve the data from the database which can be used by
applications for various purposes.
User Administration: It is used for registering and monitoring users, maintaining data integrity, enforcing data security, dealing with concurrency control, monitoring performance and recovering information corrupted by unexpected failures.
Types of DBMS
Hierarchical database
Network database
Relational database
Object-Oriented database
Hierarchical DBMS
In a hierarchical database, data is organized in a tree-like structure in which each child record has only one parent, so this model directly represents one-to-many relationships.
Network Model
The network database model allows each child to have multiple parents. It helps you to address the need to model more complex relationships, such as the orders/parts many-to-many relationship. In this model, entities are organized in a graph which can be accessed through several paths.
Relational Model
Relational DBMS is the most widely used DBMS model because it is one of the easiest to work with. This model is based on normalizing data into the rows and columns of tables. Data in the relational model is stored in fixed structures and manipulated using SQL.
Object-Oriented Model
In the object-oriented model, data is stored in the form of objects. Structures called classes hold the data within them. This model defines a database as a collection of objects which store both data member values and operations.
Data Abstraction is a process of hiding unwanted or irrelevant details from the end user. It
provides a different view and helps in achieving data independence which is used to enhance
the security of data.
Levels of abstraction for DBMS
Database systems include complex data structures. To reduce complexity for users in retrieving data and to make the system efficient, developers use levels of abstraction that hide irrelevant details from the users. Levels of abstraction simplify database design.
Mainly there are three levels of abstraction for DBMS, which are as follows −
The physical level is the lowest level of abstraction for a DBMS. It defines how the data is actually stored: the data structures used to store data and the access methods used by the database. In practice, developers or database application programmers decide how to store the data in the database.
The logical level is the intermediate, next higher level. It describes what data is stored in the database and what relationships exist among those data. It describes the entire data in the sense of what tables are created and what the links among those tables are.
The view level is the highest level. At the view level, there are different views, and every view defines only a part of the entire data. It simplifies interaction with the user and provides many views of the same database.
Application of DBMS
Manufacturing – It is used for the management of the supply chain and for tracking the production of items and the status of inventories in warehouses.
HR Management – For information about employees, salaries, payroll, deductions, generation of paychecks, etc.
Purpose of DBMS :
In a DBMS, data is transformed into information, information into knowledge, and knowledge into action, step by step.
Previously, the database applications were built directly on top of the file system.
Database Language
A DBMS has appropriate languages and interfaces to express database queries and
updates.
Database languages can be used to read, store and update the data in the database.
DDL stands for Data Definition Language. It is used to define database structure or
pattern.
It is used to create schema, tables, indexes, constraints, etc. in the database.
Using the DDL statements, you can create the skeleton of the database.
Data definition language is used to store the information of metadata like the number
of tables and schemas, their names, indexes, columns in each table, constraints, etc.
These commands are used to update the database schema that's why they come under Data
definition language.
DML stands for Data Manipulation Language. It is used for accessing and manipulating data
in a database. It handles user requests.
DCL stands for Data Control Language. It is used to control access to the data stored in the database by granting and revoking privileges.
The DCL execution is transactional. It also has rollback parameters.
The main DCL commands are GRANT and REVOKE.
TCL stands for Transaction Control Language. It is used to manage the changes made by DML statements, which can be grouped into a logical transaction. The main TCL commands are COMMIT, ROLLBACK and SAVEPOINT.
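As a quick illustration, here is a minimal sketch with one statement from each category (the table and user names are made up for the example):
CREATE TABLE Student (Roll_No INT PRIMARY KEY, Name VARCHAR(50));  -- DDL: define structure
INSERT INTO Student (Roll_No, Name) VALUES (1, 'Asha');            -- DML: manipulate data
GRANT SELECT ON Student TO report_user;                            -- DCL: control access
COMMIT;                                                            -- TCL: finalize the transaction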
The DBMS design depends upon its architecture. The basic client/server architecture is
used to deal with a large number of PCs, web servers, database servers and other
components that are connected with networks.
The client/server architecture consists of many PCs and a workstation which are
connected via the network.
DBMS architecture depends upon how users are connected to the database to get their
request done.
Database architecture can be seen as single-tier or multi-tier. But logically, database architecture is of two types: 2-tier architecture and 3-tier architecture.
2-Tier Architecture
o The 2-Tier architecture is the same as basic client-server. In the two-tier architecture, applications on the client end can directly communicate with the database at the server side. For this interaction, APIs like ODBC and JDBC are used.
o The user interfaces and application programs are run on the client-side.
o The server side is responsible for providing functionalities like query processing and transaction management.
o To communicate with the DBMS, the client-side application establishes a connection with the server side.
3-Tier Architecture
o The 3-Tier architecture contains another layer between the client and the server. In this architecture, the client can't directly communicate with the server.
o The application on the client-end interacts with an application server which further communicates with the
database system.
o The end user has no idea about the existence of the database beyond the application server, and the database has no idea about any other user beyond the application.
o The 3-Tier architecture is used in the case of large web applications.
Database users are the persons who interact with the database and take benefit of the database. They are differentiated into the following types:
1. Naive users: They are the unsophisticated users who interact with the system by using
permanent applications that already exist. Example: Online Library Management System,
ATMs (Automated Teller Machine), etc.
2. Application programmers: They are the computer professionals who interact with the system through DML. They write application programs.
3. Sophisticated users: They interact with the system by writing SQL queries directly
through the query processor without writing application programs.
4. Specialized users: They are also sophisticated users who write specialized database
applications that do not fit into the traditional data processing framework. Example:
Expert System, Knowledge Based System, etc.
DATABASE ADMINISTRATOR
A person who has such central control over the system is called a database administrator (DBA). The functions of a DBA include:
Schema definition. The DBA creates the original database schema by executing a set of data definition statements in the DDL.
ER Diagram / ER model
For example, Suppose we design a School Database. In this database, the student will be an
entity with attributes like address, name, id, age, etc. The address can be another entity with
attributes like city, street name, pin code, etc and there will be a relationship between them.
Component of ER Diagram
1. Entity:
An entity may be any object, class, person or place. In an ER diagram, an entity is represented as a rectangle.
a. Weak Entity
An entity that depends on another entity is called a weak entity. A weak entity doesn't contain any key attribute of its own and is represented by a double rectangle.
2. Attribute
An attribute is used to describe a property of an entity. An ellipse is used to represent an attribute.
For example, id, age, contact number, name, etc. can be attributes of a student.
a. Key Attribute
The key attribute is used to represent the main characteristics of an entity. It represents a
primary key. The key attribute is represented by an ellipse with the text underlined.
b. Composite Attribute
An attribute that is composed of many other attributes is known as a composite attribute. The composite attribute is represented by an ellipse, and its component attributes are represented by ellipses connected to it.
c. Multivalued Attribute
An attribute that can have more than one value is known as a multivalued attribute. A double oval is used to represent a multivalued attribute.
For example, a student can have more than one phone number.
d. Derived Attribute
An attribute that can be derived from another attribute is known as a derived attribute. It is represented by a dashed ellipse.
For example, A person's age changes over time and can be derived from another attribute
like Date of birth.
3. Relationship
Types of relationship
a. One-to-One Relationship
When only one instance of an entity is associated with only one instance of another entity through the relationship, it is known as a one-to-one relationship.
For example, a female can marry one male, and a male can marry one female.
b. One-to-many relationship
When only one instance of the entity on the left, and more than one instance of an entity on
the right associates with the relationship then this is known as a One-to-Many relationship.
For example, a scientist can invent many inventions, but each invention is made by only one specific scientist.
c. Many-to-one relationship
When more than one instance of the entity on the left, and only one instance of an entity on
the right associates with the relationship then it is known as a Many-to-One relationship.
For example, a student enrolls for only one course, but a course can have many students.
d. Many-to-many relationship
When more than one instance of the entity on the left, and more than one instance of an
entity on the right associates with the relationship then it is known as a Many-to-Many
relationship.
For example, an employee can be assigned to many projects, and a project can have many employees.
Structural Constraints :
Structural Constraints are also called Structural properties of a database management
system (DBMS). Cardinality Ratios and Participation Constraints taken together are called
Structural Constraints.
The structural constraints are represented by min-max notation. This is a pair of numbers (min, max) that appears on the connecting line between an entity and its relationship.
Weak Entity
A weak entity cannot be used independently, as it is dependent on a strong entity type known as its owner entity. The relationship that connects the weak entity to its owner entity is called the identifying relationship.
In the ER diagram, both the weak entity and its corresponding relationship are represented
using a double line and the partial key is underlined with a dotted line.
For example, Dependent can be a weak entity that depends on the strong entity Employee via the relationship Depends on. There can be an employee without a dependent in the company, but there will be no record of a Dependent in the company systems unless the dependent is associated with an Employee.
EER is a high-level data model that incorporates the extensions to the original ER model.
Enhanced ER diagrams are high-level models that represent the requirements and complexities of complex databases.
Inheritance
All the above features of the ER model are used to create classes of objects in object-oriented programming. The details of entities are generally hidden from the user; this process is known as abstraction.
For example, the attributes of a Person class such as name, age, and gender can be inherited
by lower-level entities such as Student or Teacher.
Specialization is used to identify a subset of an entity set that shares some distinguishing characteristics. It is a top-down process; for example, a Vehicle entity can be specialized into Car, Truck or Motorcycle.
For example: In an Employee management system, EMPLOYEE entity can be
specialized as TESTER or DEVELOPER based on what role they play in the company.
Generalization is the process of combining two or more lower-level entities that share common attributes or properties into one generalized higher-level entity.
It is a bottom-up process; for example, consider three sub-entities Car, Truck and Motorcycle. These three entities can be generalized into one superclass named Vehicle.
Relational Model was proposed by E.F. Codd to model data in the form of relations
or tables.
After designing the conceptual model of Database using ER diagram, we need to
convert the conceptual model in the relational model which can be implemented using
any RDBMS languages like Oracle SQL, MySQL etc.
Relational data model is the primary data model, which is used widely around the world for
data storage and processing. This model is simple and it has all the properties and
capabilities required to process data with storage efficiency.
The relational model represents data as tables with columns and rows. Each row is known as a tuple, and each column of the table has a name, called an attribute.
Tables − In relational data model, relations are saved in the format of Tables. This format stores the
relation among entities. A table has rows and columns, where rows represents records and columns
represent the attributes.
Tuple − A single row of a table, which contains a single record for that relation is called a tuple.
Relation instance − A finite set of tuples in the relational database system represents relation
instance. Relation instances do not have duplicate tuples.
Relation schema − A relation schema describes the relation name (table name), attributes, and their
names.
Relation key − Each row has one or more attributes, known as relation key, which can identify the row
in the relation (table) uniquely.
Attribute domain − Every attribute has some pre-defined value scope, known as attribute domain.
In general, a relation schema consists of a list of attributes and their corresponding domains.
Constraints on the Relational Model
1. Constraints that are implicit in the data model are called implicit constraints.
2. Constraints that are directly applied in the schemas of the data model, by specifying them in the DDL (Data Definition Language), are called schema-based or explicit constraints.
3. Constraints that cannot be directly applied in the schemas of the data model are called application-based or semantic constraints.
1. Domain constraints
2. Key constraints
3. Referential integrity constraints
Relational integrity constraints in DBMS are conditions which must hold for a valid relation. These relational constraints are derived from the rules of the mini-world that the database represents.
Domain Constraints
Domain constraints are violated if an attribute value does not appear in the corresponding domain or is not of the appropriate data type.
Domain constraints specify that within each tuple, the value of each attribute must be an atomic value from that attribute's domain. The domain is specified as a data type, including standard data types such as integers, real numbers, characters, Booleans, variable-length strings, etc.
Example:
The example shown demonstrates creating a domain constraint such that CustomerName is
not NULL
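A minimal sketch of such a table definition (the surrounding columns are assumed for illustration):
CREATE TABLE Customer (
    CustomerID   INT,
    CustomerName VARCHAR(50) NOT NULL,   -- domain constraint: a value is required
    Age          INT CHECK (Age >= 0)    -- domain constraint: values restricted to a valid range
);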
Key Constraints
An attribute that can uniquely identify a tuple in a relation is called the key of the table. The
value of the attribute for different tuples in the relation has to be unique.
Example:
In the given table, CustomerID is a key attribute of the Customer table. There is a single key per customer; for example, CustomerID = 1 identifies only CustomerName = 'Google'.
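A sketch of how such a key constraint can be declared (column types assumed):
CREATE TABLE Customer (
    CustomerID   INT PRIMARY KEY,   -- key constraint: unique and never NULL
    CustomerName VARCHAR(50)
);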
Referential integrity constraints in DBMS are based on the concept of foreign keys. A foreign key is an attribute of a relation that refers to a key attribute of a different (or the same) relation. A referential integrity constraint states that if a relation refers to a key attribute of another relation, then that key value must exist in the referenced table.
Example:
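A minimal sketch (the tables and columns are assumed for illustration): every Employee row must reference a Dept_ID value that exists in Department.
CREATE TABLE Department (
    Dept_ID   INT PRIMARY KEY,
    Dept_Name VARCHAR(50)
);
CREATE TABLE Employee (
    Emp_ID  INT PRIMARY KEY,
    Dept_ID INT,
    FOREIGN KEY (Dept_ID) REFERENCES Department (Dept_ID)   -- referential integrity constraint
);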
Relational Languages
Relational database systems are expected to be equipped with a query language that
can assist its users to query the database instances.
There are two kinds of Query languages − Relational Algebra and Relational
Calculus.
Relational Algebra
Relational algebra is a procedural query language. It gives a step by step process to obtain the
result of the query. It uses operators to perform queries.
1. Select Operation (σ):
The select operation selects tuples from a relation that satisfy a given predicate.
Notation: σ p(r)
Where σ stands for the selection operator, p is the selection predicate, and r is the relation.
Input:
σ BRANCH_NAME="perryride" (LOAN)
Output:
The tuples of the LOAN relation whose BRANCH_NAME is "perryride".
2. Project Operation (∏):
This operation shows the list of those attributes that we wish to appear in the result. The rest of the attributes are eliminated from the table. It is denoted by ∏.
Notation: ∏ A1, A2, ..., An (r)
Where A1, A2, ..., An are attribute names of the relation r.
Input:
∏ NAME, CITY (CUSTOMER)
Output:
NAME CITY
Jones Harrison
Smith Rye
Hays Harrison
Curry Rye
Johnson Brooklyn
Brooks Brooklyn
3. Union Operation ∪ :
Suppose there are two relations R and S. The union operation contains all the tuples that are either in R or in S, or in both R and S.
It eliminates duplicate tuples. It is denoted by ∪.
Notation : R ∪ S
Example:
DEPOSITOR RELATION
CUSTOMER_NAME ACCOUNT_NO
Johnson A-101
Smith A-121
Mayes A-321
Turner A-176
Johnson A-273
Jones A-472
Lindsay A-284
BORROW RELATION
CUSTOMER_NAME LOAN_NO
Jones L-17
Smith L-23
Hayes L-15
Jackson L-14
Curry L-93
Smith L-11
Williams L-17
Input:
∏ CUSTOMER_NAME (DEPOSITOR) ∪ ∏ CUSTOMER_NAME (BORROW)
Output:
CUSTOMER_NAME
Johnson
Smith
Hayes
Turner
Jones
Lindsay
Jackson
Curry
Williams
4. Set Intersection ∩ :
Suppose there are two relations R and S. The set intersection operation contains all tuples that are in both R and S.
It is denoted by ∩.
Notation : R ∩ S
Input:
∏ CUSTOMER_NAME (DEPOSITOR) ∩ ∏ CUSTOMER_NAME (BORROW)
Output:
CUSTOMER_NAME
Smith
Jones
5. Set Difference - :
Suppose there are two relations R and S. The set difference operation contains all tuples that are in R but not in S.
It is denoted by minus (-).
Notation : R - S
Input:
∏ CUSTOMER_NAME (BORROW) - ∏ CUSTOMER_NAME (DEPOSITOR)
Output:
CUSTOMER_NAME
Jackson
Hayes
Williams
Curry
6. Cartesian product ( X ) :
The Cartesian product is used to combine each row in one table with each row in the
other table. It is also known as a cross product.
It is denoted by X.
Notation: E X D
Example:
EMPLOYEE
DEPARTMENT
DEPT_NO DEPT_NAME
A Marketing
B Sales
C Legal
Input:
EMPLOYEE X DEPARTMENT
Output:
7. Rename Operation (ρ):
The rename operation is used to rename the output relation. It is denoted by rho (ρ).
Example: We can use the rename operator to rename STUDENT relation to STUDENT1.
ρ(STUDENT1, STUDENT)
Note: Apart from these common operations, relational algebra can also be used for join operations.
Relational Calculus
The tuple relational calculus is specified to select the tuples in a relation. In TRC,
filtering variable uses the tuples of a relation.
The result of the relation can have one or more tuples.
Notation: { T | P(T) }
Where T is the resulting tuple and P(T) is the predicate used to fetch T.
For example:
{ T.name | AUTHOR(T) AND T.article = 'database' }
OUTPUT: This query selects the tuples from the AUTHOR relation. It returns a tuple with
'name' from Author who has written an article on 'database'.
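For comparison, the same query could be written in SQL roughly as follows (assuming an AUTHOR relation with name and article attributes):
SELECT name
FROM AUTHOR
WHERE article = 'database';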
TRC (Tuple Relation Calculus) can be quantified. In TRC, we can use Existential (∃) and
Universal Quantifiers (∀).
For example:
Output: This query will yield the same result as the previous one.
Notation : { a1, a2, a3, ..., an | P (a1, a2, a3, ... ,an)}
Where a1, a2, ..., an are domain variables (attributes) and P(a1, a2, ..., an) is the formula built over them.
For example: {< article, page, subject > | < article, page, subject > ∈ javatpoint ∧ subject = 'database'}
Output: This query will yield the article, page, and subject from the relation javatpoint, where the subject is 'database'.
Structured Query Language ( SQL )
Basics :
SQL stands for Structured Query Language. It is used for storing and managing data in a relational database management system (RDBMS).
It is a standard language for Relational Database System. It enables a user to create,
read, update and delete relational databases and tables.
All the RDBMS like MySQL, Informix, Oracle, MS Access and SQL Server use SQL as
their standard database language.
SQL allows users to query the database in a number of ways, using English-like
statements.
SQL process:
When an SQL command is executed against an RDBMS, the system figures out the best way to carry out the request, and the SQL engine determines how to interpret the task.
Various components are included in this process, such as the query dispatcher, optimization engine, classic query engine and SQL query engine.
All the non-SQL queries are handled by the classic query engine, but SQL query engine
won't handle logical files.
SQL Datatype
SQL Datatype is used to define the values that a column can contain.
Every column is required to have a name and data type in the database table.
Datatype of SQL
1. String Datatypes
There are three string datatypes, which are given below:
Datatype – Description
Char – It contains fixed-length non-Unicode characters, with a maximum length of 8,000 characters.
varchar – It contains variable-length non-Unicode characters, with a maximum length of 8,000 characters.
Text – It contains variable-length non-Unicode characters, with a maximum length of 2,147,483,647 characters.
2. Date and Time Datatypes
Datatype – Description
Date – It is used to store year, month, and day values.
Time – It is used to store hour, minute, and second values.
timestamp – It stores year, month, day, hour, minute, and second values.
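As an illustration, a table using these datatypes might be declared as follows (all names are made up):
CREATE TABLE EventLog (
    EventName VARCHAR(100),   -- variable-length string
    EventCode CHAR(8),        -- fixed-length string
    EventDate DATE,           -- year, month, day
    EventTime TIME,           -- hour, minute, second
    CreatedAt TIMESTAMP       -- full date and time
);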
The SQL set operations are used to combine the results of two or more SQL SELECT statements. There are four set operations:
1. Union
2. UnionAll
3. Intersect
4. Minus
1. Union
The SQL Union operation is used to combine the result of two or more SQL SELECT
queries.
In the union operation, the number of columns and their datatypes must be the same in both the tables on which the UNION operation is applied.
The union operation eliminates duplicate rows from its result set.
Syntax
1. SELECT column_name FROM table1
2. UNION
3. SELECT column_name FROM table2;
Example:
The First table:
ID NAME
1 Jack
2 Harry
3 Jackson
The Second table:
ID NAME
3 Jackson
4 Stephan
5 David
The union query and its result:
1. SELECT * FROM First
2. UNION
3. SELECT * FROM Second;
Output:
ID NAME
1 Jack
2 Harry
3 Jackson
4 Stephan
5 David
2. Union All
The Union All operation is similar to the Union operation, but it returns the result set without removing duplicates and without sorting the data.
Example: Using the above First and Second tables, the Union All query will be:
1. SELECT * FROM First
2. UNION ALL
3. SELECT * FROM Second;
ID NAME
1 Jack
2 Harry
3 Jackson
3 Jackson
4 Stephan
5 David
3. Intersect
It is used to combine two SELECT statements. The Intersect operation returns the
common rows from both the SELECT statements.
In the Intersect operation, the number of columns and their datatypes must be the same.
It has no duplicates, and it arranges the data in ascending order by default.
Syntax:
SELECT * FROM t_employees
INTERSECT
SELECT * FROM t2_employees;
Example: Using the above First and Second tables:
1. SELECT * FROM First
2. INTERSECT
3. SELECT * FROM Second;
Output:
ID NAME
3 Jackson
4. Minus
It combines the results of two SELECT statements. The Minus operator is used to display the rows which are present in the first query but absent in the second query.
It has no duplicates, and the data is arranged in ascending order by default.
Syntax:
1. SELECT column_name FROM table1
2. MINUS
3. SELECT column_name FROM table2;
Example: Using the above First and Second tables:
1. SELECT * FROM First
2. MINUS
3. SELECT * FROM Second;
Output:
ID NAME
1 Jack
2 Harry
SQL Aggregate Functions
1. COUNT FUNCTION
The COUNT function is used to count the number of rows in a database table. It can work on both numeric and non-numeric data types.
The COUNT function uses COUNT(*), which returns the count of all the rows in a specified table. COUNT(*) considers duplicates and NULLs.
Syntax
1. COUNT(*) or
2. COUNT( [ALL|DISTINCT] expression )
Sample table:
PRODUCT_MAST
1. SELECT COUNT(*)
2. FROM PRODUCT_MAST;
Output:
10
1. SELECT COUNT(*)
2. FROM PRODUCT_MAST
3. WHERE RATE >= 20;
Output:
Example: COUNT() with GROUP BY
1. SELECT COMPANY, COUNT(*)
2. FROM PRODUCT_MAST
3. GROUP BY COMPANY;
Output:
Com1 5
Com2 3
Com3 2
Example: COUNT() with HAVING
1. SELECT COMPANY, COUNT(*)
2. FROM PRODUCT_MAST
3. GROUP BY COMPANY
4. HAVING COUNT(*) > 2;
Output:
Com1 5
Com2 3
2. SUM Function
Sum function is used to calculate the sum of all selected columns. It works on numeric fields
only.
Syntax
1. SUM() or
2. SUM( [ALL|DISTINCT] expression )
Example: SUM()
1. SELECT SUM(COST)
2. FROM PRODUCT_MAST;
Output:
670
1. SELECT SUM(COST)
2. FROM PRODUCT_MAST
3. WHERE QTY>3;
Output:
320
Example: SUM() with WHERE and GROUP BY
1. SELECT COMPANY, SUM(COST)
2. FROM PRODUCT_MAST
3. WHERE QTY > 3
4. GROUP BY COMPANY;
Output:
Com1 150
Com2 170
Example: SUM() with HAVING
1. SELECT COMPANY, SUM(COST)
2. FROM PRODUCT_MAST
3. GROUP BY COMPANY
4. HAVING SUM(COST) >= 170;
Output:
Com1 335
Com3 170
3. AVG function
The AVG function is used to calculate the average value of the numeric type. AVG function
returns the average of all non-Null values.
Syntax
1. AVG() or
2. AVG( [ALL|DISTINCT] expression )
Example:
1. SELECT AVG(COST)
2. FROM PRODUCT_MAST;
Output:
67.00
4. MAX Function
MAX function is used to find the maximum value of a certain column. This function
determines the largest value of all selected values of a column.
Syntax
1. MAX() or
2. MAX( [ALL|DISTINCT] expression )
Example:
1. SELECT MAX(RATE)
2. FROM PRODUCT_MAST;
Output :
30
5. MIN Function
MIN function is used to find the minimum value of a certain column. This function
determines the smallest value of all selected values of a column.
Syntax
1. MIN() or
2. MIN( [ALL|DISTINCT] expression )
Example:
1. SELECT MIN(RATE)
2. FROM PRODUCT_MAST;
Output:
10
Null Values
The SQL NULL is the term used to represent a missing value. A NULL value in a table is a
value in a field that appears to be blank.
A field with a NULL value is a field with no value. It is very important to understand that a NULL value is different from a zero value or a field that contains spaces.
Syntax
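A minimal sketch of such a table definition, assuming the CUSTOMERS table used in the examples below:
CREATE TABLE CUSTOMERS (
    ID      INT          NOT NULL,
    NAME    VARCHAR(20)  NOT NULL,
    AGE     INT          NOT NULL,
    ADDRESS CHAR(25),               -- no NOT NULL, so this column may be NULL
    SALARY  DECIMAL(18, 2),         -- no NOT NULL, so this column may be NULL
    PRIMARY KEY (ID)
);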
Here, NOT NULL signifies that column should always accept an explicit value of the given
data type. There are two columns where we did not use NOT NULL, which means these
columns could be NULL.
A field with a NULL value is the one that has been left blank during the record creation.
Example
The NULL value can cause problems when selecting data, because comparing an unknown value to any other value always yields an unknown result, and such rows are not included in the results. You must use the IS NULL or IS NOT NULL operators to check for a NULL value.
Consider the following CUSTOMERS table having the records as shown below.
+----+----------+-----+-----------+----------+
| ID | NAME     | AGE | ADDRESS   | SALARY   |
+----+----------+-----+-----------+----------+
|  1 | Ramesh   |  32 | Ahmedabad |  2000.00 |
|  2 | Khilan   |  25 | Delhi     |  1500.00 |
|  3 | kaushik  |  23 | Kota      |  2000.00 |
|  4 | Chaitali |  25 | Mumbai    |  6500.00 |
|  5 | Hardik   |  27 | Bhopal    |  8500.00 |
|  6 | Komal    |  22 | MP        |          |
|  7 | Muffy    |  24 | Indore    |          |
+----+----------+-----+-----------+----------+
Now, following is the usage of the IS NOT NULL operator.
SQL> SELECT ID, NAME, AGE, ADDRESS, SALARY
     FROM CUSTOMERS
     WHERE SALARY IS NOT NULL;
+----+----------+-----+-----------+----------+
| ID | NAME | AGE | ADDRESS | SALARY |
+----+----------+-----+-----------+----------+
| 1 | Ramesh | 32 | Ahmedabad | 2000.00 |
| 2 | Khilan | 25 | Delhi | 1500.00 |
| 3 | kaushik | 23 | Kota | 2000.00 |
| 4 | Chaitali | 25 | Mumbai | 6500.00 |
| 5 | Hardik | 27 | Bhopal | 8500.00 |
+----+----------+-----+-----------+----------+
Now, following is the usage of the IS NULL operator.
SQL> SELECT ID, NAME, AGE, ADDRESS, SALARY
     FROM CUSTOMERS
     WHERE SALARY IS NULL;
+----+----------+-----+-----------+----------+
| ID | NAME | AGE | ADDRESS | SALARY |
+----+----------+-----+-----------+----------+
| 6 | Komal | 22 | MP | |
| 7 | Muffy | 24 | Indore | |
+----+----------+-----+-----------+----------+
Sub Queries
A subquery (also called an inner query or nested query) is a query within another SQL query, embedded within the WHERE clause.
A subquery is used to return data that will be used in the main query as a condition to further
restrict the data to be retrieved.
Subqueries can be used with the SELECT, INSERT, UPDATE, and DELETE statements
along with the operators like =, <, >, >=, <=, IN, BETWEEN, etc.
Subqueries are most frequently used with the SELECT statement. The basic syntax is as follows −
SELECT column_name [, column_name ]
FROM table1 [, table2 ]
WHERE column_name OPERATOR
   (SELECT column_name [, column_name ]
    FROM table1 [, table2 ]
    [WHERE]);
Example
+----+----------+-----+-----------+----------+
| ID | NAME | AGE | ADDRESS | SALARY |
+----+----------+-----+-----------+----------+
| 1 | Ramesh | 35 | Ahmedabad | 2000.00 |
| 2 | Khilan | 25 | Delhi | 1500.00 |
| 3 | kaushik | 23 | Kota | 2000.00 |
| 4 | Chaitali | 25 | Mumbai | 6500.00 |
| 5 | Hardik | 27 | Bhopal | 8500.00 |
| 6 | Komal | 22 | MP | 4500.00 |
| 7 | Muffy | 24 | Indore | 10000.00 |
+----+----------+-----+-----------+----------+
SQL> SELECT *
FROM CUSTOMERS
WHERE ID IN (SELECT ID
FROM CUSTOMERS
WHERE SALARY > 4500) ;
+----+----------+-----+---------+----------+
| ID | NAME | AGE | ADDRESS | SALARY |
+----+----------+-----+---------+----------+
| 4 | Chaitali | 25 | Mumbai | 6500.00 |
| 5 | Hardik | 27 | Bhopal | 8500.00 |
| 7 | Muffy | 24 | Indore | 10000.00 |
+----+----------+-----+---------+----------+
Subqueries also can be used with INSERT statements. The INSERT statement uses the data
returned from the subquery to insert into another table. The selected data in the subquery
can be modified with any of the character, date or number functions.
Example
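A sketch of such an insert, assuming a backup table CUSTOMERS_BKP with the same structure as CUSTOMERS:
SQL> INSERT INTO CUSTOMERS_BKP
     SELECT * FROM CUSTOMERS
     WHERE ID IN (SELECT ID FROM CUSTOMERS);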
The subquery can be used in conjunction with the UPDATE statement. Either single or
multiple columns in a table can be updated when using a subquery with the UPDATE
statement.
UPDATE table
SET column_name = new_value
[ WHERE OPERATOR [ VALUE ]
   (SELECT COLUMN_NAME
    FROM TABLE_NAME
    [ WHERE ]) ];
Example
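One sketch of such an update, again assuming the CUSTOMERS_BKP backup table: it reduces SALARY to one quarter for all customers whose AGE appears in the backup with AGE >= 27.
SQL> UPDATE CUSTOMERS
     SET SALARY = SALARY * 0.25
     WHERE AGE IN (SELECT AGE FROM CUSTOMERS_BKP WHERE AGE >= 27);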
This would impact two rows and finally CUSTOMERS table would have the following records.
+----+----------+-----+-----------+----------+
| ID | NAME | AGE | ADDRESS | SALARY |
+----+----------+-----+-----------+----------+
| 1 | Ramesh | 35 | Ahmedabad | 125.00 |
| 2 | Khilan | 25 | Delhi | 1500.00 |
| 3 | kaushik | 23 | Kota | 2000.00 |
| 4 | Chaitali | 25 | Mumbai | 6500.00 |
| 5 | Hardik | 27 | Bhopal | 2125.00 |
| 6 | Komal | 22 | MP | 4500.00 |
| 7 | Muffy | 24 | Indore | 10000.00 |
+----+----------+-----+-----------+----------+
Subqueries with the DELETE Statement
The subquery can be used in conjunction with the DELETE statement like with any other
statements mentioned above.
Example
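A sketch of such a delete, assuming the same CUSTOMERS_BKP backup table:
SQL> DELETE FROM CUSTOMERS
     WHERE AGE IN (SELECT AGE FROM CUSTOMERS_BKP WHERE AGE >= 27);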
This would impact two rows and finally the CUSTOMERS table would have the following
records.
+----+----------+-----+---------+----------+
| ID | NAME | AGE | ADDRESS | SALARY |
+----+----------+-----+---------+----------+
| 2 | Khilan | 25 | Delhi | 1500.00 |
| 3 | kaushik | 23 | Kota | 2000.00 |
| 4 | Chaitali | 25 | Mumbai | 6500.00 |
| 6 | Komal | 22 | MP | 4500.00 |
| 7 | Muffy | 24 | Indore | 10000.00 |
+----+----------+-----+---------+----------+
As the name shows, JOIN means to combine something. In the case of SQL, JOIN means "to combine two or more tables".
In SQL, the JOIN clause is used to combine records from two or more tables in a database.
1. INNER JOIN
2. LEFT JOIN
3. RIGHT JOIN
4. FULL JOIN
Sample Table
EMPLOYEE
PROJECT
1. INNER JOIN
In SQL, INNER JOIN selects records that have matching values in both tables as long as the
condition is satisfied. It returns the combination of all rows from both the tables where the
condition satisfies.
Syntax
SELECT table1.column1, table2.column2, ...
FROM table1
INNER JOIN table2
ON table1.matching_column = table2.matching_column;
Query (assuming the sample tables are joined on a shared EMP_ID column)
SELECT EMPLOYEE.EMP_NAME, PROJECT.DEPARTMENT
FROM EMPLOYEE
INNER JOIN PROJECT
ON PROJECT.EMP_ID = EMPLOYEE.EMP_ID;
Output
EMP_NAME DEPARTMENT
Angelina Testing
Robert Development
Christian Designing
Kristen Development
2. LEFT JOIN
The SQL LEFT JOIN returns all the rows from the left table and the matching rows from the right table. If there is no matching join value, it returns NULL.
Syntax
SELECT table1.column1, table2.column2, ...
FROM table1
LEFT JOIN table2
ON table1.matching_column = table2.matching_column;
Query (on the same assumed EMP_ID column)
SELECT EMPLOYEE.EMP_NAME, PROJECT.DEPARTMENT
FROM EMPLOYEE
LEFT JOIN PROJECT
ON PROJECT.EMP_ID = EMPLOYEE.EMP_ID;
Output
EMP_NAME DEPARTMENT
Angelina Testing
Robert Development
Christian Designing
Kristen Development
Russell NULL
Marry NULL
3. RIGHT JOIN
In SQL, RIGHT JOIN returns all the rows of the right table and the matched rows from the left table. If there is no match, it returns NULL.
Syntax
SELECT table1.column1, table2.column2, ...
FROM table1
RIGHT JOIN table2
ON table1.matching_column = table2.matching_column;
Query (on the same assumed EMP_ID column)
SELECT EMPLOYEE.EMP_NAME, PROJECT.DEPARTMENT
FROM EMPLOYEE
RIGHT JOIN PROJECT
ON PROJECT.EMP_ID = EMPLOYEE.EMP_ID;
Output
EMP_NAME DEPARTMENT
Angelina Testing
Robert Development
Christian Designing
Kristen Development
4. FULL JOIN
In SQL, FULL JOIN is the result of a combination of both left and right outer joins. The joined table has all the records from both tables and puts NULL in place of matches not found.
Syntax
SELECT table1.column1, table2.column2, ...
FROM table1
FULL JOIN table2
ON table1.matching_column = table2.matching_column;
Query (on the same assumed EMP_ID column)
SELECT EMPLOYEE.EMP_NAME, PROJECT.DEPARTMENT
FROM EMPLOYEE
FULL JOIN PROJECT
ON PROJECT.EMP_ID = EMPLOYEE.EMP_ID;
Output
EMP_NAME DEPARTMENT
Angelina Testing
Robert Development
Christian Designing
Kristen Development
Russell NULL
Marry NULL
What is Embedded SQL?
As we have seen in our previous tutorials, SQL is known as the Structured Query Language.
It is the language that we use to perform operations and transactions on the databases.
Some of the prominent examples of languages with which we embed SQL are as follows:
C++
Java
Python etc.
Dynamic SQL
Dynamic SQL is a programming technique that could be used to write SQL queries during
runtime. Dynamic SQL could be used to create general and flexible SQL queries.
To run a dynamic SQL statement, run the stored procedure sp_executesql as shown below:
EXEC sp_executesql N'SELECT statement';
Use prefix N with the sp_executesql to use dynamic SQL as a Unicode string.
1. Declare two variables, @var1 for holding the name of the table and @var2 for holding the dynamic SQL:
DECLARE
@var1 NVARCHAR(MAX),
@var2 NVARCHAR(MAX);
2. Set the value of the @var1 variable to the table name:
SET @var1 = N'geektable';
3. Create the dynamic SQL by adding the SELECT statement to the table name parameter:
SET @var2 = N'SELECT *
FROM ' + @var1;
4. Run the sp_executesql stored procedure with the @var2 parameter:
EXEC sp_executesql @var2;
Example –
SELECT *
from geek;
Table – Geek
ID NAME CITY
1 Khushi Jaipur
2 Neha Noida
3 Meera Delhi
DECLARE
@tab NVARCHAR(128),
@st NVARCHAR(MAX);
SET @tab = N'geektable';
SET @st = N'SELECT *
FROM ' + @tab;
EXEC sp_executesql @st;
Output:
ID NAME CITY
1 Khushi Jaipur
2 Neha Noida
3 Meera Delhi
Embedded SQL
Embedded SQL involves the placement of SQL language constructs in procedural language
code. Precompilation translates the embedded SQL into calls to Pro*Ada runtime library
procedures that handle the interaction between your program and the Oracle Server. After
precompilation, you simply compile the resulting source files using your standard, supported
Ada compiler, then build the application in the normal way. Pro*Ada supplies all the
necessary library procedures.
Embedded SQL is an ANSI and ISO standard. It supports an extended method of database
access above and beyond interactive SQL. The SQL statements that you can embed in an Ada
program are a superset of the SQL statements that are supported by interactive SQL, and
may have a slightly different syntax from the statements you would issue through an interactive SQL tool such as SQL*Plus.
Relational database design requires that we find a "good" collection of relation schemas. A bad design may lead to repetition of information and an inability to represent certain information.
Design Goals:
1. Avoid redundant data
2. Ensure that relationships among attributes are represented
3. Facilitate the checking of updates for violation of database integrity constraints.
Example: Consider the relation schema:
Lending-schema = (branch-name, branch-city, assets, customer-name, loan-number, amount)
Problems:
Redundancy:
- Data for branch-name, branch-city and assets are repeated for each loan that a branch makes
- Wastes space
- Complicates updating, introducing the possibility of inconsistency of the assets value
Null values:
- Cannot store information about a branch if no loans exist
- Can use null values, but they are difficult to handle
Relational Decomposition
When a relation in the relational model is not in appropriate normal form then the
decomposition of a relation is required.
In a database, it breaks the table into multiple tables.
If the relation has no proper decomposition, then it may lead to problems like loss of
information.
Decomposition is used to eliminate some of the problems of bad design like anomalies,
inconsistencies, and redundancy.
Types of Decomposition
Lossless Decomposition
If the information is not lost from the relation that is decomposed, then the
decomposition will be lossless.
The lossless decomposition guarantees that the join of relations will result in the same
relation as it was decomposed.
The relation is said to be lossless decomposition if natural joins of all the decomposition
give the original relation.
Example:
EMPLOYEE_DEPARTMENT table:
The above relation is decomposed into two relations EMPLOYEE and DEPARTMENT
EMPLOYEE table:
DEPARTMENT table
Employee ⋈ Department
Dependency Preserving
A decomposition is dependency-preserving if every functional dependency of the original relation can be checked within the decomposed relations, without joining them.
Functional Dependency
A functional dependency X → Y between two sets of attributes means that the value of X uniquely determines the value of Y.
For example,
If we know the value of a student's roll number, we can obtain the student's address, marks, etc. By this, we say that the student's address and marks are functionally dependent on the student's roll number.
Then,
Roll_Number → Student_Name
Similarly, in a relation (Roll_Number, Subject_Name, Paper_Hour), Paper_Hour is fully functionally dependent on the composite key {Roll_Number, Subject_Name}, since neither Roll_Number → Paper_Hour nor Subject_Name → Paper_Hour holds on its own.
Normalization
Normalization is the process of decomposing relations to remove redundancy and anomalies. A relation is in first normal form (1NF) if every attribute contains only atomic (indivisible) values.
EMPLOYEE table:
The decomposition of the EMPLOYEE table into 1NF has been shown below:
A relation will be in second normal form (2NF) if it is in 1NF and all non-key attributes are fully functionally dependent on the primary key.
Example: Let's assume a school stores the data of teachers and the subjects they teach. In a school, a teacher can teach more than one subject.
TEACHER table
TEACHER_DETAIL table:
TEACHER_ID TEACHER_AGE
25 30
47 35
83 38
TEACHER_SUBJECT table:
TEACHER_ID SUBJECT
25 Chemistry
25 Biology
47 English
83 Math
83 Computer
A relation will be in 3NF if it is in 2NF and does not contain any transitive dependency.
3NF is used to reduce the data duplication. It is also used to achieve the data integrity.
If there is no transitive dependency for non-prime attributes, then the relation is in third normal form.
A relation is in third normal form if it holds at least one of the following conditions for every non-trivial functional dependency X → Y:
1. X is a super key.
2. Y is a prime attribute, i.e., each element of Y is part of some candidate key.
Example:
EMPLOYEE_DETAIL table:
That's why we need to move EMP_CITY and EMP_STATE to a new EMPLOYEE_ZIP table, with EMP_ZIP as its primary key.
EMPLOYEE table:
EMPLOYEE_ZIP table:
A table is in Boyce-Codd normal form (BCNF) if, for every non-trivial functional dependency X → Y, X is a super key.
Example: Let's assume there is a company where employees work in more than one department.
EMPLOYEE table:
The table is not in BCNF because neither EMP_DEPT nor EMP_ID alone are keys.
To convert the given table into BCNF, we decompose it into three tables:
EMP_COUNTRY table:
EMP_ID EMP_COUNTRY
264 India
EMP_DEPT table:
DEPT_TYPE EMP_DEPT_NO
D394 283
D394 300
D283 232
D283 549
EMP_DEPT_MAPPING table:
EMP_ID EMP_DEPT
Functional dependencies:
1. EMP_ID → EMP_COUNTRY
2. EMP_DEPT → {DEPT_TYPE, EMP_DEPT_NO}
Candidate keys:
For the first table: EMP_ID
For the second table: EMP_DEPT
For the third table: {EMP_ID, EMP_DEPT}
Now, this is in BCNF because the left-side part of both the functional dependencies is a key.
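A minimal SQL sketch of this decomposition (column types are assumed for illustration):
CREATE TABLE EMP_COUNTRY (
    EMP_ID      INT PRIMARY KEY,
    EMP_COUNTRY VARCHAR(30)
);
CREATE TABLE EMP_DEPT (
    EMP_DEPT    VARCHAR(30) PRIMARY KEY,
    DEPT_TYPE   CHAR(4),
    EMP_DEPT_NO INT
);
CREATE TABLE EMP_DEPT_MAPPING (
    EMP_ID   INT REFERENCES EMP_COUNTRY (EMP_ID),
    EMP_DEPT VARCHAR(30) REFERENCES EMP_DEPT (EMP_DEPT),
    PRIMARY KEY (EMP_ID, EMP_DEPT)
);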
A relation will be in 4NF if it is in Boyce Codd normal form and has no multi-valued
dependency.
For a dependency A → B, if for a single value of A multiple values of B exist, then the relation has a multi-valued dependency (written A →→ B).
Example
STUDENT
The given STUDENT table is in 3NF, but COURSE and HOBBY are two independent entities; there is no relationship between COURSE and HOBBY.
In the STUDENT relation, the student with STU_ID 21 has two courses, Computer and Math, and two hobbies, Dancing and Singing. So there is a multi-valued dependency on STU_ID, which leads to unnecessary repetition of data. To bring the above table into 4NF, we can decompose it into two tables:
STUDENT_COURSE
STU_ID COURSE
21 Computer
21 Math
34 Chemistry
74 Biology
59 Physics
STUDENT_HOBBY
STU_ID HOBBY
21 Dancing
21 Singing
34 Dancing
74 Cricket
59 Hockey
A relation is in 5NF if it is in 4NF, contains no join dependency, and all joins are lossless.
5NF is satisfied when all the tables are broken into as many tables as possible in order
to avoid redundancy.
5NF is also known as Project-join normal form (PJ/NF).
Example
SUBJECT LECTURER SEMESTER
Computer Anshika Semester 1
Computer John Semester 1
Math John Semester 1
Math Akash Semester 2
Chemistry Praveen Semester 1
In the above table, John takes both the Computer and Math classes for Semester 1, but he doesn't take the Math class for Semester 2. In this case, the combination of all these fields is required to identify valid data.
Suppose we add a new semester, Semester 3, but do not yet know the subject or who will be taking that subject, so we leave Lecturer and Subject as NULL. But all three columns together act as the primary key, so we can't leave the other two columns blank.
So to make the above table into 5NF, we can decompose it into three relations P1, P2 & P3:
P1
SEMESTER SUBJECT
Semester 1 Computer
Semester 1 Math
Semester 1 Chemistry
Semester 2 Math
P2
SUBJECT LECTURER
Computer Anshika
Computer John
Math John
Math Akash
Chemistry Praveen
P3
SEMESTER LECTURER
Semester 1 Anshika
Semester 1 John
Semester 1 John
Semester 2 Akash
Semester 1 Praveen
Unit 4 : STORAGE AND FILE ORGANIZATION
Databases are stored in files that contain records. At the physical level, the actual data is stored in electromagnetic format on some device. These storage devices can generally be classified into three types:
Primary Storage − This category contains the memory storage that is directly
available to the CPU. The internal memory (registers), fast memory (cache), and main
memory ( RAM) of the CPU are directly accessible to the CPU since they are all placed
on the chipset of the motherboard or CPU. Usually, this storage is very small, ultra-
fast, and volatile. In order to maintain its condition, primary storage requires a
continuous power supply. All the data is lost in the event of a power failure.
Secondary Storage − Secondary storage devices are used for future use or as backup data storage. Secondary storage covers memory devices such as magnetic disks, optical disks (DVDs, CDs, etc.), disk drives, flash drives, and magnetic tapes that are not part of the CPU chipset or motherboard.
Tertiary Storage − Tertiary storage is used to store immense data volumes. These devices are the slowest in speed because they are external to the computer system. For the most part, they are used to back up an entire system. Optical disks and magnetic tapes are commonly used as tertiary storage.
Memory Hierarchy
A computer system has a well-established memory hierarchy. A CPU has direct access to its main memory and its built-in registers. The access time of main memory is obviously larger than the CPU cycle time, so cache memory is introduced to bridge this speed mismatch. Cache memory provides the fastest access time, and it holds the data that is most frequently accessed by the CPU.
The memory with the fastest access is the most expensive. Larger storage devices are slower and less expensive, but compared to CPU registers or cache memory, they can hold huge amounts of data.
Disks
Disks are online devices that can be accessed directly. Typical database applications need only a small portion of the database at a time for processing. Many files can be stored on a single disk storage device with little wasted space, and files can be accessed and opened practically instantaneously.
Magnetic disk is the secondary storage device used to support direct access to a desired
location.
The different parts that are present in magnetic disk or hard disk are explained below. All
these parts are helpful to read, write and store the data in the hard disk.
Disk blocks − The unit of data transfer between disk and main memory is a block. A disk block is a contiguous sequence of bytes.
Track − Blocks are arranged in concentric rings called tracks.
Sectors − A sector is the smallest unit of information that can be read from or written to the disk; a typical sector size is 512 bytes.
Platters − The surface of the platter is covered with a magnetic material. Information is
recorded on this surface. The set of all tracks with the same diameter is called a
cylinder. Typical platter diameters are 3.5 inches and 5.25 inches.
Read-write head − Each platter has a read-write head on both sides. It is used for
reading and writing the data on a platter.
Disk controller − A disk controller interfaces a disk driver to the computer.
The time to read or write a block varies depending on the location of the data. The performance of a hard disk can be calculated using the formula:
Disk access time = Seek time + Rotational delay + Transfer time
Here,
Seek time − The time to move the disk-head to the track on which a desired block is
located.
Rotational delay − It is the waiting time for the desired block to rotate under disk
head.
Transfer time − It is the time to read or write the data in the block once the head is
positioned.
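For illustration, with assumed values (an average seek time of 10 ms; a 7,200 rpm spindle, i.e. one rotation ≈ 8.33 ms, so the average rotational delay is about half of that, ≈ 4.17 ms; and a transfer time of 1 ms):
Disk access time ≈ 10 + 4.17 + 1 ≈ 15.17 ms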
RAID consists of an array of disks in which multiple disks are connected together to achieve
different goals. RAID levels define the use of disk arrays.
RAID 0
In this level, a striped array of disks is implemented. The data is broken down into blocks and
the blocks are distributed among disks. Each disk receives a block of data to write/read in
parallel. It enhances the speed and performance of the storage device. There is no parity and
backup in Level 0.
RAID 1
RAID 1 uses mirroring techniques. When data is sent to a RAID controller, it sends a copy of
data to all the disks in the array. RAID level 1 is also called mirroring and provides 100%
redundancy in case of a failure.
RAID 2
RAID 2 records Error Correction Code using Hamming distance for its data, striped on
different disks. Like level 0, each data bit in a word is recorded on a separate disk, and ECC codes of the data words are stored on a different set of disks. Due to its complex structure and
high cost, RAID 2 is not commercially available.
RAID 3
RAID 3 stripes the data onto multiple disks. The parity bit generated for the data word is stored on a different disk. This technique makes it possible to recover from single-disk failures.
RAID 4
In this level, an entire block of data is written onto data disks and then the parity is
generated and stored on a different disk. Note that level 3 uses byte-level striping, whereas
level 4 uses block-level striping. Both level 3 and level 4 require at least three disks to
implement RAID.
RAID 5
RAID 5 writes whole data blocks onto different disks, but the parity bits generated for data
block stripe are distributed among all the data disks rather than storing them on a different
dedicated disk.
RAID 6
RAID 6 is an extension of level 5. In this level, two independent parities are generated and
stored in distributed fashion among multiple disks. Two parities provide additional fault
tolerance. This level requires at least four disk drives to implement RAID.
File Organization
File – A file is a named collection of related information that is recorded on secondary storage such as magnetic disks, magnetic tapes and optical disks.
Various methods have been introduced to organize files. These methods have advantages and disadvantages on the basis of access or selection.
Some types of file organizations are:
- Sequential file organization
- Heap file organization
- Hash file organization
- B+ tree file organization
- Cluster file organization
Each of these file organizations is discussed below, along with the differences and advantages/disadvantages of each method.
The easiest method of file organization is the sequential method. In this method, the files are stored one after another in a sequential manner. There are two ways to implement this method:
Pile File Method – This method is quite simple, in which we store the records in a
sequence i.e one after other in the order in which they are inserted into the tables.
Sorted File Method – In this method, as the name itself suggests, whenever a new record has to be inserted, it is always inserted in a sorted (ascending or descending) position. Sorting of records may be based on the primary key or on any other key.
1. Insertion of a new record –
Let us assume that there is a preexisting sorted sequence of records R1, R3, and so on up to R7 and R8. Suppose a new record R2 has to be inserted in the sequence; then it will be inserted at the end of the file, and then the sequence will be sorted again.
Cons –
Time is wasted because we cannot jump directly to a particular record that is required; we have to move through the records in a sequential manner, which takes time.
The sorted file method is inefficient, as it takes time and space to sort the records.
Heap File Organization works with data blocks. In this method records are inserted at the
end of the file, into the data blocks. No Sorting or Ordering is required in this method. If a
data block is full, the new record is stored in some other block, Here the other data block need
not be the very next data block, but it can be any block in the memory. It is the responsibility
of DBMS to store and manage the new records.
Insertion of new record –
Suppose we have four records in the heap R1, R5, R6, R4 and R3 and suppose a new record R2
has to be inserted in the heap then, since the last data block i.e data block 3 is full it will be
inserted in any of the data blocks selected by the DBMS, lets say data block 1.
If we want to search, delete or update data in heap file Organization the we will traverse the
data from the beginning of the file till we get the requested record. Thus if the database is
very huge, searching, deleting or updating the record will take a lot of time.
Pros –
Fetching and retrieving records is faster than in sequential organization, but only in the case of small databases.
When there is a huge number of data needs to be loaded into the database at a time,
then this method of file Organization is best suited.
Cons –
This method is inefficient for large databases, as searching, deleting or updating a record requires traversing the file from the beginning.
Hashing is an efficient technique to directly search the location of desired data on the disk
without using index structure. Data is stored at the data blocks whose address is generated
by using hash function. The memory location where these records are stored is called as data
block or data bucket.
Data bucket – Data buckets are the memory locations where the records are stored.
These buckets are also considered as Unit Of Storage.
Hash Function – A hash function is a mapping function that maps the set of all search keys to actual record addresses. Generally, the hash function uses the primary key to generate the hash index, i.e. the address of the data block. A hash function can be any simple or complex mathematical function.
Hash Index – The prefix of an entire hash value is taken as a hash index. Every hash index has a depth value to signify how many bits are used for computing the hash function. These bits can address 2^n buckets. When all these bits are consumed, the depth value is increased linearly and twice as many buckets are allocated.
Static Hashing –
In static hashing, when a search-key value is provided, the hash function always computes the same address. For example, if we want to generate the address for STUDENT_ID = 104 using a mod(5) hash function, it always results in the same bucket address 4 (since 104 mod 5 = 4). The bucket address does not change, and hence the number of data buckets in memory remains constant throughout for static hashing.
Operations –
Insertion – When a new record is inserted into the table, the hash function h generates a bucket address for the new record based on its hash key K:
Bucket address = h(K)
Searching – When a record needs to be searched, the same hash function is used to retrieve the bucket address for the record. For example, if we want to retrieve the whole record for ID 104, and the hash function is mod(5) on that ID, the bucket address generated is 4. Then we go directly to address 4 and retrieve the whole record for ID 104. Here the ID acts as the hash key.
Deletion – If we want to delete a record, we first fetch the record which is supposed to be deleted using the hash function, and then we remove the record from that address in memory.
Updation – The data record that needs to be updated is first searched using hash
function, and then the data record is updated.
Now, suppose we want to insert some new records into the file, but the data bucket address generated by the hash function is not empty, or the data already exists in that address. This becomes a critical situation to handle; in static hashing it is called bucket overflow.
How will we insert data in this case?
There are several methods provided to overcome this situation. Some commonly used methods
are discussed below:
1. Open Hashing –
In the open hashing method, the next available data block is used to enter the new record, instead of overwriting the older one. This method is also called linear probing.
For example, D3 is a new record which needs to be inserted , the hash function
generates address as 105. But it is already full. So the system searches next available
data bucket, 123 and assigns D3 to it.
2. Closed hashing –
In the closed hashing method, a new data bucket is allocated with the same address and is linked after the full data bucket. This method is also known as overflow chaining.
For example, we have to insert a new record D3 into the tables. The static hash function generates the data bucket address 105, but this bucket is full and cannot store the new data. In this case, a new data bucket is added at the end of data bucket 105 and linked to it, and then the new record D3 is inserted into the new bucket.
o Quadratic probing :
Quadratic probing is very similar to open hashing (linear probing), except that the difference between the old and the new bucket addresses is not linear: a quadratic function is used to determine the new bucket address.
o Double Hashing :
Double Hashing is another method similar to linear probing. Here the difference
is fixed as in linear probing, but this fixed difference is calculated by using
another hash function. That’s why the name is double hashing.
Dynamic Hashing –
The drawback of static hashing is that it does not expand or shrink dynamically as the size of the database grows or shrinks. In dynamic hashing, data buckets grow or shrink (are added or removed dynamically) as the number of records increases or decreases. Dynamic hashing is also known as extended hashing.
In dynamic hashing, the hash function is made to produce a large number of values. For example, suppose there are three data records D1, D2 and D3, and the hash function generates the addresses 1001, 0101 and 1010 respectively. This method of storing considers only a part of this address, in particular only the first bit, to store the data. So it tries to load the three of them at the addresses 0 and 1.
But the problem is that no bucket address remains for D3. The buckets have to grow dynamically to accommodate D3, so the address is changed to have 2 bits rather than 1 bit, the existing data is updated to 2-bit addresses, and then it tries to accommodate D3.
A B+ tree, as the name suggests, uses a tree-like structure to store records in a file. It uses the concept of key indexing, where the primary key is used to sort the records. For each primary key, an index value is generated and mapped to the record. The index of a record is the address of the record in the file.
A B+ tree is very similar to a binary search tree, with the only difference being that instead of just two children, a node can have more than two. All the information is stored in the leaf nodes, and the intermediate nodes act as pointers to the leaf nodes. The information in the leaf nodes always remains a sorted sequential linked list.
In the example tree, 56 is the root node, also called the main node of the tree.
The intermediate nodes just contain the addresses of leaf nodes; they do not contain any actual records.
Leaf nodes contain the actual records. All leaf nodes are balanced.
Pros –
Cons –
In cluster file organization, two or more related tables/records are stored within the same file, known as a cluster. These files have two or more tables in the same data block, and the key attributes which are used to map these tables together are stored only once.
Thus it lowers the cost of searching and retrieving various records in different files as they are now
combined and kept in a single cluster.
For example, suppose we have two relations, Employee and Department, that are related to each other. These tables are therefore allowed to be combined using a join operation and can be seen in a single cluster file.
If we have to insert, update or delete any record we can directly do so. Data is sorted based on the primary
key or the key with which searching is done. Cluster key is the key with which joining of the table is
performed.
Types of Cluster File Organization – There are two ways to implement this method:
1. Indexed Clusters –
In indexed clustering, the records are grouped based on the cluster key and stored together. The above-mentioned example of the Employee and Department relationship is an example of an indexed cluster, where the records are grouped based on the department ID.
2. Hash Clusters –
This is very similar to an indexed cluster, with the only difference being that instead of storing the records based on the cluster key, we generate a hash key value and store together the records with the same hash key value.
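As a sketch, Oracle-style SQL can declare such a cluster explicitly; the table, column and cluster names below are assumptions for illustration:
CREATE CLUSTER emp_dept (deptno NUMBER(3));      -- cluster keyed on the shared attribute
CREATE INDEX idx_emp_dept ON CLUSTER emp_dept;   -- makes it an indexed cluster
CREATE TABLE department (
    deptno NUMBER(3) PRIMARY KEY,
    dname  VARCHAR2(20)
) CLUSTER emp_dept (deptno);
CREATE TABLE employee (
    empno  NUMBER(5) PRIMARY KEY,
    ename  VARCHAR2(20),
    deptno NUMBER(3)
) CLUSTER emp_dept (deptno);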
Fixed-Length Records
Fixed-length records means setting a fixed length and storing the records into the file. If the record size exceeds the fixed size, the record is divided across more than one block. Due to the fixed size, the following two problems occur:
1. Partially storing subparts of the record in more than one block requires access to all the
blocks containing the subparts to read or write in it.
2. It is difficult to delete a record in such a file organization. It is because if the size of the
existing record is smaller than the block size, then another record or a part fills up the
block.
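A minimal sketch of fixed-length records in Python, assuming a hypothetical 24-byte layout of a 20-byte name plus a 4-byte integer salary; an in-memory buffer stands in for the disk file:

import io
import struct

RECORD = struct.Struct("<20si")                  # every record is 24 bytes

def write_record(f, slot, name, salary):
    f.seek(slot * RECORD.size)                   # slot number -> byte offset
    f.write(RECORD.pack(name.encode(), salary))  # short names are NUL-padded

def read_record(f, slot):
    f.seek(slot * RECORD.size)
    name, salary = RECORD.unpack(f.read(RECORD.size))
    return name.rstrip(b"\x00").decode(), salary

f = io.BytesIO()
write_record(f, 0, "Alice", 50000)
write_record(f, 1, "Bob", 42000)
print(read_record(f, 1))                         # ('Bob', 42000)

The fixed size is what makes slot-number arithmetic possible: record i always begins at byte offset i * 24.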
Variable-Length Records
Variable-length records are records that vary in size. Storing them requires the creation of
blocks of multiple sizes. Such variable-length records are kept in the database system in the
following way:
Slotted-page Structure
Storing variable-length records inside a block poses a problem, so such records are organized
in a slotted-page structure within the block. In the slotted-page structure, a header is
present at the start of each block. This header holds information such as the number of record
entries, the location of the end of the free space in the block, and an array whose entries
give the location and size of each record.
When a new record is to be inserted, it is placed at the end of the free space, which is kept
contiguous; the header also gains an entry recording the size and location of the newly
inserted record.
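A minimal slotted-page sketch in Python (the page size and record contents are hypothetical); the slot array plays the role of the block header described above:

class SlottedPage:
    def __init__(self, size=4096):
        self.data = bytearray(size)    # the block itself
        self.slots = []                # header: (location, size) per record
        self.free_end = size           # free space stays contiguous

    def insert(self, record: bytes):
        self.free_end -= len(record)                # records grow from the end
        self.data[self.free_end:self.free_end + len(record)] = record
        self.slots.append((self.free_end, len(record)))  # new header entry
        return len(self.slots) - 1                  # slot number of the record

    def read(self, slot):
        offset, length = self.slots[slot]
        return bytes(self.data[offset:offset + length])

page = SlottedPage()
s = page.insert(b"variable-length record")
print(page.read(s))   # b'variable-length record'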
So far we have learned about relations and their representation. A relational database system
maintains all information about a relation or table, from its schema to the constraints applied
to it, and all of this metadata is stored. In general, metadata refers to data about the data.
The structure that stores the relational schemas and the other metadata about the relations is
known as the Data Dictionary or System Catalog.
A data dictionary is like an A-to-Z dictionary of the relational database system, holding all
information about each relation in the database.
In addition to this, the system may also store some statistical and descriptive data
about the relations, such as the number of tuples in each relation.
A system may also store the storage organization of each relation, whether sequential,
hash, or heap, and it notes the location where each relation is stored:
If relations are stored in operating-system files, the data dictionary notes and
stores the names of those files.
If the database stores all relations in a single file, the data dictionary notes and
stores the blocks containing the records of each relation in a data structure similar
to a linked list.
Finally, it also stores information about each index on every relation: the name of the index, the name of the relation being indexed, the attributes on which the index is defined, and the type of index formed.
Query Processing is the activity of extracting data from the database. Query processing
takes several steps to fetch the data from the database. The steps involved are:
The translation step in query processing is essentially the parsing of a query. When a
user executes a query, the parser in the system generates the internal form of the
query: it checks the syntax of the query and verifies the names of the relations in the
database, the tuples, and finally the required attribute values.
The parser creates a tree of the query, known as a parse tree, which is then translated
into relational algebra. In doing so, it also replaces every use of a view in the query
with the view's definition.
Thus, the working of query processing can be understood from the below-described diagram:
Suppose a user executes a query. As we have seen, there are various methods of extracting
data from the database. Say the user wants to fetch the records of the employees whose
salary is greater than or equal to 10000. In SQL, assuming an Employee table with a salary
attribute, the following query could be used:
SELECT * FROM Employee WHERE salary >= 10000;
To make the system understand the user query, it then has to be translated in the form of
relational algebra, for example as:
σ salary >= 10000 (Employee)
After translating the given query, each relational algebra operation can be executed using
one of several algorithms. In this way, query processing begins its work.
Evaluation
In addition to the relational algebra translation, the translated expression must be
annotated with instructions that specify how each operation is to be evaluated. Thus,
after translating the user query, the system builds and executes a query evaluation
plan.
Query Evaluation Plan
In order to fully evaluate a query, the system needs to construct a query evaluation
plan.
The annotations in the evaluation plan may refer to the algorithms to be used for the
particular index or the specific operations.
Such relational algebra with annotations is referred to as Evaluation Primitives. The
evaluation primitives carry the instructions needed for the evaluation of the operation.
Thus, a query evaluation plan defines a sequence of primitive operations used for
evaluating a query. The query evaluation plan is also referred to as the query
execution plan.
A query execution engine is responsible for generating the output of the given
query. It takes the query execution plan, executes it, and finally makes the output for
the user query.
Optimization
The cost of evaluating a query can vary widely between different evaluation plans.
Although the system is responsible for constructing the evaluation plan, the user does
not need to write the query with efficiency in mind.
Usually, a database system generates an efficient query evaluation plan that
minimizes cost. This task, performed by the database system itself, is known as
Query Optimization.
For optimizing a query, the query optimizer needs an estimated cost for each
operation, because the overall cost of the operations depends on the memory allocated
to them, their execution costs, and so on.
Finally, after selecting an evaluation plan, the system evaluates the query and produces the
output of the query.
Transaction
A transaction is a set of logically related operations performed together as a single unit of work.
Example: Suppose an employee of a bank transfers Rs 800 from X's account to Y's account.
This small transaction consists of several low-level tasks, listed below (a runnable sketch
follows the two lists):
X's Account
1. Open_Account(X)
2. Old_Balance = X.balance
3. New_Balance = Old_Balance - 800
4. X.balance = New_Balance
5. Close_Account(X)
Y's Account
1. Open_Account(Y)
2. Old_Balance = Y.balance
3. New_Balance = Old_Balance + 800
4. Y.balance = New_Balance
5. Close_Account(Y)
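As a minimal sketch of these two halves running as one atomic unit, the following uses Python's built-in sqlite3 module (the table name and balances are hypothetical); the with-block commits both updates together or rolls both back on error:

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
con.executemany("INSERT INTO account VALUES (?, ?)", [("X", 1000), ("Y", 500)])
con.commit()

try:
    with con:  # one transaction: both updates commit, or neither does
        con.execute("UPDATE account SET balance = balance - 800 WHERE name = 'X'")
        con.execute("UPDATE account SET balance = balance + 800 WHERE name = 'Y'")
except sqlite3.Error:
    pass       # on failure the partial debit is rolled back automatically

print(con.execute("SELECT name, balance FROM account ORDER BY name").fetchall())
# [('X', 200), ('Y', 1300)]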
ACID Properties are used for maintaining the integrity of database during transaction
processing. ACID in DBMS stands for Atomicity, Consistency, Isolation, and Durability.
In the above diagram, it can be seen that after crediting $10, the amount is still $100 in
account B, so the transaction is not atomic.
The next image shows both the debit and the credit operations completing successfully;
that transaction is atomic.
When atomicity is lost, money can simply disappear or appear, which is a huge issue in
banking systems, so atomicity is a main focus there.
Consistency: Once the transaction is executed, the database should move from one
consistent state to another.
Example: In the transfer above, the sum of the balances of X and Y must be the same
before and after the transaction; if X loses Rs 800 but Y does not gain Rs 800, the
database is left inconsistent.
Isolation: Transactions should be executed in isolation from one another. During
concurrent execution, the intermediate results of simultaneously executing
transactions must not be made available to each other. (Isolation levels 0, 1, 2 and 3
grade how strictly this is enforced.)
Example: If two operations run concurrently on two different accounts, the value of
neither account should be affected by the other; each value should remain consistent.
As the below diagram shows, account A makes transactions T1 and T2 to accounts B and C,
but both execute independently without affecting each other. This is known as isolation.
Durability: After the successful completion of a transaction, the changes made to the
database must persist, even in the case of system failures.
States of Transactions
State – Description
Active state – In this state the transaction is being executed; this is the initial state of every transaction.
Partially committed state – When the final operation of the transaction has executed, it enters this state; its changes still live in main memory.
Committed state – When all changes have been made permanent in the database, the transaction is committed.
Failed state – If any check made by the database recovery system fails, the transaction enters the failed state.
Aborted state – If the transaction has failed, its changes are rolled back; afterwards the transaction can be restarted or killed.
Concurrency Control is the management procedure that is required for controlling concurrent execution of
the operations that take place on a database.
But before knowing about concurrency control, we should know about concurrent execution.
In a multi-user system, multiple users can access and use the same database at one time,
which is known as concurrent execution of the database. It means that the same database
is accessed and worked on simultaneously by different users of a multi-user system.
While working with database transactions, the database is often needed by multiple users
performing different operations, and in that case concurrent execution of the database is
performed.
The point is that this simultaneous execution, done in an interleaved manner, must not let
any operation affect the other executing operations, so that the consistency of the
database is maintained. Concurrent execution of transaction operations therefore raises
several challenging problems that need to be solved.
In a database transaction, the two main operations are READ and WRITE. These operations need to be
managed during concurrent execution of the transactions, because if they are interleaved without
control, the data may become inconsistent. The following problems occur with concurrent execution
of the operations:
1. Lost Update Problem (Write-Write Conflict)
This problem occurs when two different database transactions perform read/write operations on the
same database item in an interleaved manner (i.e., concurrent execution) in a way that makes the
item's value incorrect, leaving the database inconsistent.
For example:
Consider the below diagram, where two transactions TX and TY are performed on the same account A,
whose balance is $300.
At time t1, transaction TX reads the value of account A, i.e., $300 (read only).
At time t2, transaction TX deducts $50 from account A, giving $250 (deducted only, not yet
written).
At time t3, transaction TY reads the value of account A, which is still $300, because TX has
not written its update yet.
At time t4, transaction TY adds $100 to account A, giving $400 (added only, not yet written).
At time t6, transaction TX writes its value of account A, which is updated to $250, as TY has
not written yet.
Finally, at time t7, transaction TY writes its value of account A, i.e., the $400 computed at
time t4. The value written by TX is overwritten: the $250 update is lost.
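The interleaving above can be replayed deterministically in a few lines of Python; the local variables stand for each transaction's private workspace:

balance = 300               # account A
tx_read = balance           # t1: TX reads $300
tx_local = tx_read - 50     # t2: TX computes $250, not yet written
ty_read = balance           # t3: TY still reads $300
ty_local = ty_read + 100    # t4: TY computes $400, not yet written
balance = tx_local          # t6: TX writes $250
balance = ty_local          # t7: TY writes $400, TX's update is lost
print(balance)              # 400 -- the $50 debit has vanished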
2. Dirty Read Problem (Write-Read Conflict)
The dirty read problem occurs when one transaction updates an item of the database and then
fails, and before the data is rolled back, the updated item is accessed by another transaction.
This creates a read-write conflict between the two transactions.
For example:
Consider two transactions TX and TY in the below diagram performing read/write operations on
account A, where the available balance in account A is $300. For instance, TX adds $50 and
writes $350 to account A; TY then reads $350; TX subsequently fails and rolls back to $300.
TY has now read a value that never validly existed, a dirty read.
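Replayed deterministically in Python (the $50 credit is a hypothetical update that TX later rolls back):

balance = 300            # committed value of account A
balance = balance + 50   # TX writes $350, but has not committed
ty_read = balance        # TY reads $350 -- a dirty read
balance = 300            # TX fails and rolls back to $300
print(ty_read, balance)  # TY acted on 350, a value that never committed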
3. Unrepeatable Read Problem (Read-Write Conflict)
Also known as the Inconsistent Retrievals Problem, this occurs when a transaction reads two
different values for the same database item within its lifetime.
For example:
Consider two transactions, TX and TY, performing the read/write operations on account A, having an
available balance = $300. The diagram is shown below:
At time t1, transaction TX reads the value from account A, i.e., $300.
At time t2, transaction TY reads the value from account A, i.e., $300.
At time t3, transaction TY updates the value of account A by adding $100 to the available
balance, and then it becomes $400.
At time t4, transaction TY writes the updated value, i.e., $400.
After that, at time t5, transaction TX again reads the available value of account A, and this
time it reads $400.
Within the same transaction TX, two different values of account A have been read: $300
initially, and $400 after the update made by transaction TY. This is an unrepeatable read,
and hence it is known as the Unrepeatable Read problem.
Thus, in order to maintain consistency in the database and avoid such problems during concurrent
execution, management is needed; that is where the concept of Concurrency Control comes into play.
Lock-Based Protocols
To attain consistency, isolation between transactions is the most important tool. Isolation
is achieved by preventing other transactions from performing read/write operations on a data
item while one transaction is using it; this is known as locking. Through lock-based
protocols, desired operations are allowed to proceed while conflicting operations are
blocked by locks.
Simplistic Lock Protocol: This protocol locks a data item against all other operations while
the data is being updated. Transactions may unlock the data item only after the write
operation has completed.
Two-phase Locking Protocol: This protocol divides a transaction's execution into two phases.
In the first, growing phase, the transaction only acquires locks and may not release any.
From the moment the transaction releases its first lock, the second, shrinking phase begins,
during which it only releases locks and may not acquire new ones.
Strict Two-Phase Locking Protocol: Strict 2PL is almost the same as 2PL. The only difference
is that strict 2PL does not release locks immediately after the execution of the operations;
it carries all its locks and releases them only when the commit is triggered. A small sketch
follows.
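A minimal strict-2PL flavoured sketch in Python, assuming one lock per data item (the item names and balances are hypothetical); every lock is acquired before any access and released only at the commit point:

import threading

locks = {"A": threading.Lock(), "B": threading.Lock()}
balances = {"A": 1000, "B": 500}

def transfer(amount):
    held = []
    for item in ("A", "B"):       # growing phase: take every lock up front
        locks[item].acquire()
        held.append(item)
    try:
        balances["A"] -= amount   # all reads/writes happen under the locks
        balances["B"] += amount   # the commit point is reached here
    finally:
        for item in held:         # locks are released only after the commit
            locks[item].release()

transfer(800)
print(balances)   # {'A': 200, 'B': 1300}

Acquiring the locks in a fixed global order ("A" before "B") is itself a simple deadlock-prevention measure.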
Deadlock Handling :
A deadlock is a condition where two or more transactions wait indefinitely for one another to give
up locks. Deadlock is said to be one of the most feared complications in a DBMS, as no task ever
finishes and every one remains waiting forever.
For example: In the student table, transaction T1 holds a lock on some rows and needs to update some rows
in the grade table. Simultaneously, transaction T2 holds locks on some rows in the grade table and needs to
update the rows in the Student table held by Transaction T1.
Now the main problem arises: transaction T1 waits for T2 to release its lock, and likewise
transaction T2 waits for T1 to release its lock. All activity comes to a halt and remains at a
standstill until the DBMS detects the deadlock and aborts one of the transactions.
Deadlock Avoidance
When a database can get stuck in a deadlock state, it is better to avoid the deadlock in the
first place than to abort and restart transactions, which wastes time and resources.
A deadlock avoidance mechanism is used to detect a potential deadlock situation in advance. A
method like the wait-for graph can be used, but it is suitable only for smaller databases; for
larger databases, a deadlock prevention method must be used.
Deadlock Detection
In a database, when a transaction waits indefinitely to obtain a lock, the DBMS should detect
whether the transaction is involved in a deadlock. The lock manager maintains a wait-for graph to
detect deadlock cycles in the database.
The wait-for graph is the suitable method for deadlock detection. In this method, a graph is
created based on the transactions and their locks; if the created graph contains a cycle (closed
loop), there is a deadlock.
The system maintains the wait-for graph for every transaction that is waiting for data held by
another, and it keeps checking whether the graph contains a cycle.
For the scenario above, the wait-for graph would contain the edges T1 → T2 and T2 → T1, which
form a cycle, so T1 and T2 are deadlocked.
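A minimal cycle check on such a wait-for graph in Python; an edge Ti -> Tj means Ti is waiting for a lock held by Tj:

def has_cycle(graph):
    visited, on_path = set(), set()

    def dfs(node):
        visited.add(node)
        on_path.add(node)                 # nodes on the current DFS path
        for nxt in graph.get(node, []):
            if nxt in on_path:            # back edge -> cycle -> deadlock
                return True
            if nxt not in visited and dfs(nxt):
                return True
        on_path.discard(node)
        return False

    return any(dfs(n) for n in graph if n not in visited)

wait_for = {"T1": ["T2"], "T2": ["T1"]}   # the scenario described above
print(has_cycle(wait_for))                # True -> abort one transaction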
Deadlock Prevention
The deadlock prevention method is suitable for large databases. If resources are allocated in
such a way that a deadlock can never occur, the deadlock is prevented.
The database management system analyzes the operations of a transaction to check whether they
can create a deadlock situation. If they can, the DBMS never allows that transaction to be
executed.
Wait-Die scheme
In this scheme, if a transaction requests a resource that is already held with a conflicting lock by
another transaction, the DBMS compares the timestamps of the two transactions and allows only the
older transaction to wait for the resource to become available.
Assume two transactions Ti and Tj, and let TS(T) be the timestamp of any transaction T. If Tj holds
a lock on some resource and Ti requests it, the DBMS performs the following actions:
1. If TS(Ti) < TS(Tj), i.e. the requesting transaction Ti is older than the holder Tj, then Ti is
allowed to wait until the data item is available. In other words, an older transaction waiting
for a resource locked by a younger transaction is allowed to wait.
2. If TS(Ti) > TS(Tj), i.e. the requesting transaction Ti is younger than the holder Tj, then Ti
is killed ("dies") and restarted later with a random delay, but with the same timestamp.
Wound wait scheme
In the wound-wait scheme, if an older transaction requests a resource held by a younger
transaction, the older transaction forces the younger one to abort ("wounds" it) and release
the resource. After a short delay, the younger transaction is restarted, but with the same
timestamp.
If the older transaction holds a resource that is requested by a younger transaction, then
the younger transaction is made to wait until the older one releases it.
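The two schemes differ only in who yields, so both fit in a few lines of Python (the timestamps are hypothetical; a smaller timestamp means an older transaction):

def wait_die(ts_requester, ts_holder):
    # Older requester waits; younger requester dies (is rolled back).
    return "WAIT" if ts_requester < ts_holder else "DIE"

def wound_wait(ts_requester, ts_holder):
    # Older requester wounds (aborts) the younger holder; younger waits.
    return "WOUND" if ts_requester < ts_holder else "WAIT"

print(wait_die(1, 2), wait_die(2, 1))       # WAIT DIE
print(wound_wait(1, 2), wound_wait(2, 1))   # WOUND WAIT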
Recovery Systems
Database systems, like any other computer systems, are subject to failures, but the data
stored in them must be available as and when required. When a database fails, it must have
facilities for fast recovery. It must also preserve atomicity: either a transaction completes
successfully and is committed (its effect is recorded permanently in the database), or it has
no effect on the database at all.
Recovery techniques depend heavily on the existence of a special file known as the system
log. It contains information about the start and end of each transaction and every update
that occurs within a transaction. The log keeps track of all transaction operations that
affect the values of database items; this information is needed to recover from transaction
failures.
The log is kept on disk. The main types of log entries are:
start_transaction(T): records that transaction T starts execution.
read_item(T, X): records that transaction T reads the value of database item X.
write_item(T, X, old_value, new_value): records that transaction T changes the value of
database item X from old_value to new_value. The old value is sometimes called the
before-image of X, and the new value the after-image of X.
commit(T): records that transaction T has completed all of its accesses to the database
successfully, so its effect can be committed (recorded permanently) in the database.
abort(T): records that transaction T has been aborted.
checkpoint: a checkpoint is a mechanism whereby all previous log records are flushed from
the system and stored permanently on disk. A checkpoint declares a point before which the
DBMS was in a consistent state and all transactions were committed.
A transaction T reaches its commit point when all of its operations that access the database
have executed successfully, i.e. the transaction has reached the point at which it will not
abort (terminate without completing). Once committed, the transaction is permanently
recorded in the database.
Commitment always involves writing a commit entry to the log and writing the log to disk. At
the time of a system crash, the log is searched backwards for all transactions T that have
written a start_transaction(T) entry but no commit(T) entry; these transactions may have to
be rolled back to undo their effect on the database during the recovery process.
Undoing – If a transaction crashes, the recovery manager may undo it, i.e. reverse its
operations. This involves examining the log for every write_item(T, X, old_value,
new_value) entry of the transaction and setting the value of item X in the database back
to old_value. There are two major techniques for recovery from non-catastrophic
transaction failures: deferred updates and immediate updates.
Deferred update – This technique does not physically update the database on disk until a
transaction has reached its commit point. Before commit, all of the transaction's updates
are recorded in its local transaction workspace. If a transaction fails before reaching
its commit point, it will not have changed the database in any way, so UNDO is not
needed. It may, however, be necessary to REDO the operations recorded in the local
transaction workspace, because their effect may not yet have been written to the
database. Hence, deferred update is also known as the NO-UNDO/REDO algorithm.
Immediate update – In the immediate update technique, the database may be updated by some
operations of a transaction before the transaction reaches its commit point. However,
these operations are recorded in the log on disk before they are applied to the database,
which keeps recovery possible. If a transaction fails before reaching its commit point,
the effects of its operations must be undone, i.e. the transaction must be rolled back,
so we require both undo and redo. This technique is known as the UNDO/REDO algorithm.
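A minimal sketch of the undo/redo idea in Python, assuming a hypothetical in-memory log of (action, transaction, item, old_value, new_value) tuples; committed writes are redone forwards, uncommitted ones undone backwards:

log = [
    ("start", "T1"), ("write", "T1", "X", 100, 50), ("commit", "T1"),
    ("start", "T2"), ("write", "T2", "Y", 200, 120),   # T2 never committed
]

def recover(log, db):
    committed = {e[1] for e in log if e[0] == "commit"}
    for entry in log:                        # REDO committed writes, forwards
        if entry[0] == "write" and entry[1] in committed:
            _, _, item, old, new = entry
            db[item] = new
    for entry in reversed(log):              # UNDO uncommitted writes, backwards
        if entry[0] == "write" and entry[1] not in committed:
            _, _, item, old, new = entry
            db[item] = old

db = {"X": 100, "Y": 120}   # Y still holds T2's uncommitted value after a crash
recover(log, db)
print(db)                    # {'X': 50, 'Y': 200}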
Caching/Buffering – In this technique, one or more disk pages that include the data items
to be updated are cached into main-memory buffers and updated in memory before being
written back to disk. The collection of in-memory buffers, called the DBMS cache, is kept
under the control of the DBMS to hold these pages. A directory keeps track of which
database items are in the buffers, and a dirty bit is associated with each buffer: 0 if
the buffer has not been modified, 1 if it has.
Differential backup – This stores only the data changes that have occurred since the last
full database backup. When the same data has changed many times since the last full
backup, a differential backup stores only the most recent version of the changed data. To
restore from it, we first need to restore the full database backup.
Transaction log backup – This backs up all events that have occurred in the database,
i.e. a record of every single statement executed. It is a backup of the transaction log
entries and contains all transactions that have happened to the database. Through it, the
database can be recovered to a specific point in time. It is even possible to perform a
backup from the transaction log when the data files are destroyed, so that not even a
single committed transaction is lost.