DBMS Series Part-2
DBMS Series Part-2
DATABASE MANAGEMENT
SYSTEM (DBMS)
-By Riti Kumari
LET’S START WITH DBMS :)
Query Optimization
Whenever we want to improve the efficiency of a query, make it execute faster and
consume fewer resources we use query optimization techniques.
1.Use Indexes Efficiently: Index the columns frequently used in WHERE, JOIN,
ORDER BY, and GROUP BY clauses to speed up data retrieval.
2.Select Only Necessary Columns : Avoid SELECT *, specify only the columns you
need in your query. This reduces the amount of data transferred and processed.
LET’S START WITH DBMS :)
Query Optimization
3. Optimize JOIN Operations : Choose the Right JOIN Type and use indexed
columns in Join conditions.
4. Partition Large Tables: For very large tables, consider partitioning them to
improve query performance. Partitioning allows the DBMS to scan only relevant
partitions, reducing I/O and improving response times.
Physical Storage
Storage device : For storing the data, there are different types of storage options
available.
Primary Storage -> Main memory and cache (fastest and expensive and as soon
as the system leads to a power cut or a crash, the data also get lost. )
Secondary Storage -> Flash Memory and Magnetic Disk Storage (non-
volatile)save and store data permanently.
Tertiary Storage-> Optical disk and Magnetic tape (used for data backup and
storing large data offline)
LET’S START WITH DBMS :)
Physical Storage And File Organization
Databases store data in files on disk. Each table or index may correspond to one or
more files. Each file has sequence of records.
Accessing data from RAM is much faster than accessing data from a hard disk or
SSD. This is because RAM is designed for high-speed read and write operations. Data
retrieval in RAM typically involves fetching data in nanoseconds.
RAM : Provides fast, volatile memory for quick data retrieval. Data is accessed via
addressable locations and cache memory enhances speed.
When a program or application needs to access data, it first checks if the data is
available in RAM. If the data is not in RAM, it must be loaded from a slower storage
device like a hard disk or SSD. Index files in a Database Management System
(DBMS) are stored on disk but are also managed in RAM to optimize performance
Query
RAM
CPU Hardisk
(Volatile)
(non-volatile)
LET’S START WITH DBMS :)
Physical Storage And File Organization
How file is organized?
Snam
Rollno SAge
e
Files are allocated on the
hard disk in contiguous
blocks to reduce
fragmentation.
DB HARDISK
No of records stored in each block= Size of block/size of record
Records can be stored in 2 ways-> sorted(seraching is fast) or
unsorted(searching is slow, insertion is fast)
LET’S START WITH DBMS :)
Indexing and its types
Why do we need indexing?
Imagine you have a 1,000-page textbook and you want to serach a topic “xyz”.
When indexing is not present-> You need to manually search for each term in
the book, which could be frustrating and time-consuming.
When indexing is present->The index allows you to quickly locate each topic.
You can look up "xyz," find that it's covered on pages 448-490.This makes it
more efficient and less time-consuming.
Indexes can be used to optimize queries that sort data (ORDER BY) or group
data (GROUP BY) : When an index is available on the columns being sorted or
grouped, the DBMS can retrieve the data in the desired order directly from the
index, eliminating the need for additional sorting operations.
We have certain indexing methods like Dense(index entry for all the records)
and Sparse(index entries for only a subset of records) index.
LET’S START WITH DBMS :)
Indexing and its types
Sparse Index: A sparse index is a type of database indexing technique where the
index does not include an entry for every single record in the database. Instead, it
contains entries for only some of the records, typically one entry per block (or
page) of data in the storage. This makes sparse indexes smaller and faster to
search through compared to dense indexes, which have an entry for every record.
EX: If a table has 1,000 rows divided into 100 blocks, and you create a sparse
index, the index might only have 100 entries, with each entry pointing to the first
record in each block.
Index table
RollNo Name
1 Ram
Search key data ref
2 Shyam B1
3 1 Ram
1 B1
4 2 Shyam
4 B2 5 B2
6 3 Raghu
Sparse Index
4 Riti
no of records in index disk
file=no of blocks 5 Raj
6 Rahul
Searching: When searching for a specific value, the database uses the sparse
index to quickly locate the block where the record might reside. It then searches
within the block to find the exact record.
Index table
1 RollNo Name
Search key data ref
2 B1
3 1 Ram
1 B1
record.
Use Cases:
Primary Indexing on Large Tables: Sparse indexes are ideal for primary
indexes on large tables where records are sequentially stored.
Range Queries: Sparse indexes can be effective for range queries where
the query needs to retrieve a range of records rather than a single record.
LET’S START WITH DBMS :)
Indexing and its types
Dense Index: A dense index is a type of indexing technique used in databases where
there is an index entry for every single record in the data file. This means that the index
contains all the search keys and corresponding pointers (or addresses) to the actual
data records, making it highly efficient for direct lookups.
EX : Consider a table with 1,000 rows, and you create a dense index on a non-primary
key column. The dense index will have no ofunique non key records, each pointing to a
specific row in the table.
These are useful in scenarios where random access to individual records is quiet
frequent .
18 Ram Age Name
Search key data ref 18. Shyam B1
19 18 Ram
18 B1
20 18 Shyam
20 B2
19 B1
21
19 Raghu
20 B2
Dense Index 20 Riti
21 B2
no of records in index file=no 20 Raj
When a query is run, the database engine checks the index to find the pointers
to the rows that contain the desired data. It then retrieves these rows directly,
without scanning the entire table.
LET’S START WITH DBMS :)
Indexing and its types
How indexing works?
B1 B2
It takes a search key as input and then returns a collection of all the matching
1-10 11-20
records. Now in index there are 2 columns, the first one stores a duplicate of the
key attribute/ non-key attribute(serach key) from table while the second one
stores pointer which hold the disk block address of the corresponding key-value.
LET’S START WITH DBMS :)
Indexing and its types
Indexing
Clustering Index
When key= non-key
order= sorted
Follows dense indexing
LET’S START WITH DBMS :)
Indexing and its types
Key data: Unique identifiers for each record. Often used to distinguish one
record from another.
Non-key data: All other data in the record besides the key.
The primary index ensures that the database can immediately locate the
row associated with a given UserID, making operations like SELECT,
UPDATE, and DELETE highly efficient.
LET’S START WITH DBMS :)
Cluster Index
The cluster index is generally on a non- key attribute. The data in the table is typically
ordered according to the order defined by the clustered index. Mostly each index entry
points directly to a row. It is mostly used where we are using GROUP BY. It physically
order records in a table based on the indexed columns.
Age Name
Search key data ref 18
18 B1
19 18 Ram
18 B1
20 18 Shyam
19 B1
20 B2
21 19 Raghu
20 B2
The clustered index keeps the rows in LastName order, reducing the need for the
database to perform additional sorting when executing queries.
LET’S START WITH DBMS :)
Multilevel Index
A multilevel index is an advanced indexing technique used in database management
systems to manage large indexes more efficiently. When a single-level index
becomes too large to fit into memory, multilevel indexing helps by breaking down
the index into multiple levels, reducing the number of disk I/O operations needed to
search through the index
First Level (Primary Index): The first level is the original index, where each entry
points to a block of data or a page in the table.
Second Level (Secondary Index): If the first level index is too large to fit in
memory, a second-level index is created. This index points to blocks of the first-
level index, effectively indexing the index itself.
LET’S START WITH DBMS :)
Multilevel Index
Eid Pointer Eid Data
E351 B6
Inner index
LET’S START WITH DBMS :)
Secondary Index
Secondary indexing speeds up searches for non-key columns in a database.
Unlike primary indexing, which uses the primary key, secondary indexing
focuses on other columns. It creates an index that maps column values to the
locations where the records are stored, making it faster to find specific records
without scanning the entire table.
The data is mostly unsorted and it can be performed on both key and non-key
attributes.
LET’S START WITH DBMS :)
Secondary Index
1. when serach key is key attribute and data is unsorted
.
.
.
.
.
.
.
Index table
LET’S START WITH DBMS :)
Secondary Index
2. when serach key is non- key attribute and data is unsorted
.
.
.
.
.
.
.
Index table
LET’S START WITH DBMS :)
Indexing and its types
How to create indexes
Non-Clustered index: Clustered index:
(On one column) Automatically created with the primary
CREATE INDEX index_name key. MySQL does not support explicit
ON table_name (column_name); creation of additional clustered indexes.
Understanding B-trees helps you understand why certain queries perform well or poorly based
on the structure of the index. For example, how a B-tree's balanced nature affects search times
or why range queries are efficient with B-tree indexes
B-trees provide the balanced, efficient structure that makes these types of indexes performant,
ensuring that operations like search, insert, delete, and update are done in logarithmic time
(O(log n)). This makes B-tree indexing crucial for optimizing database queries, whether in
clustered or non-clustered index scenarios.
LET’S START WITH DBMS :)
B trees(M-way tree)
B-trees are self-balancing tree data structures that maintain sorted data and allow
searches, sequential access, insertions, and deletions in logarithmic time. All leaf
nodes are at the same level.
key
key key
Binary tree
B-tree
Root node
LET’S START WITH DBMS :)
internal node
B trees
Dp-> record/data pointer(where the record is present
in secondary memory(disk) Leaf node
Tp-> block/tree pointer(links to the children nodes)
Structure of B-Tree:
Nodes: A B-tree is composed of nodes, each containing keys and pointers (references) to
child nodes. The keys within a node are sorted in ascending order.
Root, Internal Nodes, and Leaves:
keys Rp
1. The root node is the topmost node in the tree. Tp
2. Internal nodes contain keys and child pointers.
3. Leaf nodes contain keys and possibly pointers to records or other data.
In indexing, each key in a B-tree node typically represents a value or range of values, and
the associated pointer directs to a data block where records corresponding to the key(s)
can be found.
For example, in a database, the key might be a value in a column, and the pointer might
direct to the location of a row or a set of rows in a table.
Root node
LET’S START WITH DBMS :)
internal node
B trees
A B-tree of order x can have:
Leaf node
The max no. of children
For every node -> x i.e the order of tree.
The max no. of keys Ceiling-> upper value
For every node -> x-1 floor-> lower value
The min no of keys
a. root node -> 1 K1 K2
b. other nodes apart from root-> ceiling(m/2) -1
The min no. of children
a. root node -> 2
b. Leaf node ->0
c. internal node-> ceiling(m/2)
LET’S START WITH DBMS :) block size, block pointer size, and data pointer
size are given :
B trees
m×Pb+(m−1)×(K+Pd)≤B
Let's say you have the following:
Block size (B) = 1024 bytes B- Block size
Block pointer size (Pb) = 8 bytes m- Order of tree
Pb- Block pointer size
Data pointer size (Pd) = 12 bytes K- Key size
Key size (K) = 16 bytes Pd- Data pointer size
Find the order of the tree.
2. Insert the value into the leaf node in sorted order. If the leaf node has fewer than the maximum
allowed keys (order - 1), this step is simple.
3. If the leaf node contains the maximum number of keys after the insertion, it causes an overflow.
Split the Node:
Divide the node into two nodes. The middle key (median) is pushed up to the parent node.
The left half of the original node stays in place, while the right half forms a new node.
Insert the Median into the Parent:
If the parent node also overflows after this insertion, recursively split the parent node and
propagate the median up the tree.
4.If the root node overflows (which can happen if it already has the maximum number of keys), split it
into two nodes, and the median becomes the new root. This increases the height of the B-tree by one.
A B-tree of order x can have:
Insertion in B-tree happens from leaf node and values are also inserted
in sorted order. The element in left of root would be less than root and the
element in right would be greater than root
Step 1: Start at the root and recursively move down the tree to find the appropriate
leaf node where the new value should be inserted. Insert the value into the leaf
node in sorted order. If the leaf node has fewer than the maximum allowed keys
(order - 1), this step is simple.
1 4
Max children= order of tree=3
LET’S START WITH DBMS :) Max keys node can have=order-1=3-1=2
Insertion in B-TREE
Create a B-tree of Order 3 and insert values from 1,4,6,8,10,12,14,16
Step 2: If the leaf node contains the maximum number of keys after the insertion, it
causes an overflow.
Split the Node: Divide the node into two nodes. The middle key (median) is pushed up
to the parent node.
The left half of the original node stays in place, while the right half forms a new
node.
Insert the Median into the Parent:
If the parent node also overflows after this insertion, recursively split the parent
node and propagate the median up the tree.
4 4 (left- 1 , right- 6)
1 6
Max children= order of tree=3
LET’S START WITH DBMS :) Max keys node can have=order-1=3-1=2
Insertion in B-TREE
Create a B-tree of Order 3 and insert values from 1,4,6,8,10,12,14,16
4 (left- 1 , right- 6,8)
4
1 6 8
4 (left- 1 , right- 6)
4 8
8(right -10,12)
1 6 10 12
Max children= order of tree=3
LET’S START WITH DBMS :) Max keys node can have=order-1=3-1=2
Insertion in B-TREE
Create a B-tree of Order 3 and insert values from 1,4,6,8,10,12,14,16
Step 3:If the root node overflows (which can happen if it already has the maximum
number of keys), split it into two nodes, and the median becomes the new root. This
increases the height of the B-tree by one.
4 (left- 1 , right- 6)
8
8(left-4, right-12)
4 12 12(left -10,right -14)
1 6 10 14
maximum of (m-1)=3 keys
LET’S START WITH DBMS :) and minimum of (ceil(m/2)-1)=1 key
Deletion in B-TREE
Consider a B-tree of order 4
]
Step 1: Begin at the root and recursively move 40, 50
down the tree to find the node that contains
the key to be deleted. 10, 20 44, 47 52, 59
Step 2 :
Case 1: The Key is in a Leaf Node (Delete 10)
Simply remove the key from the leaf node. ]
If the node still has the minimum required number 40, 50
of keys (i.e., at least ceil(order/2) - 1 keys), the
deletion is complete. 20 44, 47 52, 59
maximum of (m-1)=3 keys
LET’S START WITH DBMS :) and minimum of (ceil(m/2)-1)=1 key
Deletion in B-TREE
Consider a B-tree of order 4 ]
40, 50
Step 2 :
Case 2: The Key is in a Leaf Node (Delete
20) 20 44, 47 52, 59
If the deletion causes the node to have
fewer than the minimum number of keys,
proceed to the Borrowing or Merging
step.
11 44, 47 52, 59
10 22,23 43 48 60
LET’S START WITH DBMS :)
B+ Tree
A B+ tree is an extension of the B-tree and is commonly used in databases and
file systems to maintain sorted data and allow for efficient insertion, deletion, and
search operations. B+ tree is a balanced tree, meaning all leaf nodes are at the
same level
The key difference between a B+ tree and a B-tree lies in how they store data
and how leaf nodes are structured.
1. In a B+ tree, all actual data (or references to data) are stored in the leaf nodes.
Internal nodes only store keys
leaf node
LET’S START WITH DBMS :)
B+ Tree
2. In a B+ tree, Leaf nodes are linked together in a linked list fashion, allowing for
efficient sequential access, one leaf node will have only 1 Bp.
leaf nodes
3. Internal nodes do not store data pointers, only keys and child pointers. This allows
more keys to be stored in each internal node, leading to a lower height and more
efficient operations.
internal nodes
LET’S START WITH DBMS :)
B+ Tree
Structure of a B+ Tree:
Root Node:
The top node of the B+ tree, which points to the first level of
internal nodes or directly to leaf nodes if the tree has only one
level.
Internal Nodes:
These nodes contain only keys and pointers to child nodes.
They guide the search process down to the correct leaf node.
Leaf Nodes:
These nodes contain keys and data pointers. Each leaf node
stores a pointer to the next leaf node, enabling quick traversal
of records.
LET’S START WITH DBMS :)
B+ Tree
Advantages :
1. B+ trees have a balanced structure, meaning all leaf nodes are at the same
level. This balance ensures that search operations require logarithmic time
relative to the number of keys, making it very efficient even for large
datasets.
2. The structure of the B+ tree allows for direct access to data. Since the
internal nodes contain only keys, searching for a specific value can be done
quickly by navigating through the tree down to the leaf node where the data
is stored.
B- Block size
LET’S START WITH DBMS :) m- Order of tree
B+ Tree Pb- Block pointer size
Order of B+ tree K- Key size
Pd- Data pointer size
Order of Leaf node :
leaf nodes
1xPb+M(k+Pd)<=B
B+ Tree Example: Imagine a library catalog where all book records are listed
in a sorted, linked list (leaf nodes), and the internal nodes only contain
pointers to guide you to the right part of the catalog.
LET’S START WITH DBMS :)
Difference between B and B+ Tree
Case B tree B+ tree
keys and associated pointers to data are all actual data is stored only in the leaf
Data Storage stored in all nodes (internal and leaf nodes.Internal nodes contain only keys
nodes) and pointers to child nodes
User Journey
The user sends a request by entering a URL or clicking a link. The browser
then translates the domain name into an IP address through DNS lookup,
establishes a connection with the server, sends an HTTP request, and
receives a response.
The browser then processes this response, rendering the webpage and
displaying it to the user. The process involves multiple layers of networking,
data transfer, and client-server interaction, all working together to deliver the
desired content to the user’s screen.
LET’S START WITH DBMS :)
Scaling in Databases
Advantages:
Easy to set up because it usually doesn't need changes to the app or
database structure.
Simple to manage since everything is kept on one server.
Disadvantages:
Limited by the hardware's maximum capacity.
Expensive, as high-end hardware can be very costly.
Single point of failure: if the server fails, the entire database goes offline.
LET’S START WITH DBMS :) add more instances
Scaling in Databases
Horizontal Scaling (Scaling Out):
Horizontal scaling involves spreading the database workload across multiple servers. This
can be done in different ways:
a. Database Replication:
Replication is the process of copying data from one main server (master) to one or more
backup servers (slaves).
Master-Slave Replication: The master server manages all write operations, while read
operations are distributed among the slave servers. This setup improves read
performance and provides redundancy.
Multi-Master Replication: Multiple servers handle both read and write operations,
increasing availability and write capacity. However, it also introduces challenges in
managing conflicts.
Shard
LET’S START WITH DBMS :) T1 T2 T3
Scaling in Databases
Divide the data between multiple tables
Horizontal Scaling (Scaling Out): in diff db instances
b. Database Sharding
Sharding involves partitioning the database into smaller, more manageable pieces
(shards), with each shard stored on a separate server.
Load Balancing: Distribute database queries across multiple servers using load
balancers to prevent any single server from becoming a bottleneck
Partitioning: Divide large tables into smaller partitions that can be queried
independently, improving query performance.
For example, a database administrator might create roles like DB_Read for
reading data and DB_Write for modifying data. Only people with the right
role can read or change the information.
LET’S START WITH DBMS :)
RBAC(Role-based access control )
Advantages:
Makes it easier to manage permissions, especially in big systems.
Lowers the risk of misuse by making sure users only have access to
what they need for their role.
Improves security and helps keep rules consistent across the system.
LET’S START WITH DBMS :)
Encryption
Types of Encryption
1. At-Rest Encryption
2. In-Transit Encryption
3. Column-Level Encryption
4. Full Disk Encryption:
LET’S START WITH DBMS :)
Encryption
Full Disk Encryption: Full disk encryption encrypts the entire disk
where the database is stored. This ensures that all data on the disk is
protected, not just the database but also any files, logs, or backups
stored on that disk.
Limits who can view or modify encrypted data to those with the appropriate
decryption keys.
Challenges
Encryption can slow down database operations because of the additional
processing required.
If decryption keys are lost, the data may become permanently inaccessible.
LET’S START WITH DBMS :)
Data masking techniques
For example : A bank needs to train its machine learning models to detect
fraudulent transactions. To protect customers' personal information, the
bank uses data masking to replace real account numbers and transaction
details with fake but realistic-looking data. This way, the models can be
trained effectively without risking exposure of actual customer data.
LET’S START WITH DBMS :)
Data masking techniques
Common Data Masking Techniques
Substitution:
Replacing sensitive data with random but realistic-looking values. For
example, swapping real names with random names from a list.
Shuffling:
Rearranging the values within a column so that the data remains realistic but
is not linked to the original records.
Example: A hospital has a list of patients with their birth dates. To protect
privacy, the hospital shuffles the birth dates among the patients. For instance,
Patient A's birth date of 01/15/1980 might be swapped with Patient B's birth
date of 07/22/1985.
LET’S START WITH DBMS :)
Data masking techniques
Common Data Masking Techniques
Encryption:
Encrypting data so that it is unreadable without the decryption key. This can be
used as a form of masking if the data doesn't need to be human-readable.
Nulling Out:
Replacing sensitive data with null values or placeholders like "XXXX". This
removes the original data but may reduce the usefulness of the dataset.
Number Variance:
Altering numeric data by adding or subtracting a random value within a certain
range. For example, changing salary figures slightly to mask the exact amounts
Example: A finance department needs to protect the exact sales figures but still
wants to share data for trend analysis. They slightly alter the figures by adding or
subtracting a small random amount. For example, a sales figure of $10,000
might be changed to $10,020 or $9,980.
LET’S START WITH DBMS :)
Data masking techniques
Common Data Masking Techniques
Tokenization:
Tokenization is a technique used to replace sensitive data with non-sensitive
equivalents, called "tokens." These tokens can be stored and processed without
exposing the original sensitive data, while maintaining a mapping to the original
values when needed.