
B+ Tree in DBMS

The document discusses B+ trees and B tree indexes. B+ trees consist of interior nodes and leaf nodes, with interior nodes containing only keys and pointers that lead to the leaf nodes. Leaf nodes store the actual data values in sorted order. This structure allows fast searching and retrieval of records. B tree indexes are similar but also store data in interior nodes, so a search can sometimes stop before reaching a leaf. Both structures split and rearrange nodes to stay balanced during insertion and deletion while keeping values sorted. The main difference is that B+ trees strictly separate interior pointer nodes from leaf data nodes, while B trees store data in all nodes.

Concepts of B+ Tree and B Tree index files in DBMS


Introduction
As seen in previous articles, a B+ tree is a (key, value) storage method with a tree-like structure. A B+ tree has one root, any number of levels of intermediary nodes (usually one) and the leaf nodes. All the actual records are stored in the leaf nodes. Intermediary nodes hold only keys and pointers that lead towards the leaf nodes; they do not hold any data. In the simple example that follows each node branches into just two nodes, although in general a node may have more branches. This is the basic structure of any B+ tree.

Consider the STUDENT table below. It can be stored in a B+ tree structure as shown below. We can observe that the tree divides the records in two and splits them into a left node and a right node. The left node holds all the values less than or equal to the root node, and the right node holds the values greater than the root node. The intermediary nodes at level 2 hold only pointers to the leaf nodes. The values shown in the intermediary nodes serve only as pointers to the next level. All the leaf nodes hold the actual records in sorted order.
If we have to search for any record, it is always found in a leaf node. Hence searching for any record takes the same time, because all the leaf nodes are equidistant from the root. The records within a leaf are also sorted, so the final step is a short sequential search that does not take much time.

Suppose a B+ tree has an order of n, where n is the number of branches (the tree structure above has 5 branches altogether, hence its order is 5). Then each intermediary node can have between ⌈n/2⌉ and n children, and each leaf node can hold between ⌈(n-1)/2⌉ and n-1 values. In our example above, n = 5, i.e. it has 5 branches from the root. An intermediary node can then have from 3 to 5 children, and a leaf node can hold from 2 to 4 values.

The main goals of a B+ tree are:

• Sorted intermediary and leaf nodes: Since it is a balanced tree, all nodes should be sorted.
• Fast traversal and quick search: One should be able to traverse the nodes very fast. That means that, to search for a particular record, we should be able to pass through the intermediary nodes very easily. This is achieved by sorting the pointers in the intermediary nodes and the records in the leaf nodes. Any record should also be fetched very quickly. This is achieved by keeping the tree balanced, with all the leaf nodes at the same distance from the root.
• No overflow pages: A B+ tree allows all the intermediary and leaf nodes to be partially filled, up to a percentage defined when the B+ tree is designed. This percentage up to which nodes are filled is called the fill factor. If a node reaches the fill factor limit, it is called an overflow page. If a node is too empty, it is called an underflow. In our example above, the intermediary node with 108 is an underflow, and the leaf nodes are completely rather than partially filled, so they count as overflow. An ideal B+ tree should have no overflow or underflow, except at the root node.
Searching a record in B+ Tree
Suppose we want to search for 65 in the B+ tree structure below. First we look in the intermediary node for the branch that leads to the leaf node that can contain the record for 65. Since 65 lies between 50 and 75, we follow the branch between those two keys and are directed to the third leaf node. There the DBMS performs a sequential search to find 65. Suppose that, instead of 65, we have to search for 60. What happens in this case? The search follows the same path to the third leaf node, but 60 is not found there, so the search reports that the record does not exist. No insert, update or delete is allowed while a search is in progress on the B+ tree.
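
The lookup described above can be sketched in Python. This is only a minimal illustration under stated assumptions: the tree has a single intermediary node, the separator keys (25, 50, 75) and the leaf contents are hypothetical values chosen to resemble the example, and `find_leaf` and `search` are illustrative names. The sketch also assumes that a key equal to a separator belongs to the leaf on the right of that separator.

```python
from bisect import bisect_right

# Hypothetical two-level B+ tree: separator keys in the intermediary node,
# sorted record keys in the leaves. Leaf i holds the keys k with
# separators[i-1] <= k < separators[i].
separators = [25, 50, 75]
leaves = [
    [10, 15, 20],
    [25, 30, 35, 40],
    [50, 55, 65, 70],
    [75, 80, 85, 90],
]

def find_leaf(key):
    """Return the index of the only leaf that may contain key."""
    return bisect_right(separators, key)

def search(key):
    """Return (leaf_index, position) if key is stored, else None."""
    leaf_index = find_leaf(key)
    # Sequential scan of the small, sorted leaf, as described in the text.
    for position, stored in enumerate(leaves[leaf_index]):
        if stored == key:
            return leaf_index, position
        if stored > key:
            break  # the leaf is sorted, so key cannot appear later
    return None
```

With these values, `search(65)` is directed to the third leaf (index 2) and finds the key there, while `search(60)` follows the same path and returns `None`.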

Insertion in B+ tree
Suppose we have to insert a record 60 into the structure below. It belongs in the 3rd leaf node, after 55. That leaf node is already full, so we cannot simply add the record there, yet it has to go there without affecting the fill factor, balance and order of the tree. The only option is to split the leaf node. But how do we split the nodes?

With 60 added, the 3rd leaf node would hold (50, 55, 60, 65, 70), and the key that currently leads to it from the intermediary node is 50. We split the leaf node in the middle so that the balance is not altered, grouping the values into two leaf nodes, (50, 55) and (60, 65, 70). With two leaf nodes, the intermediary node can no longer branch from 50 alone. The key 60 is added to it, together with a pointer to the new leaf node.
This is how a new entry is inserted when there is an overflow. In the normal case, we simply find the leaf node where the record fits and place it there.
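
The leaf split can be sketched as follows. This is a simplified illustration and not a complete B+ tree insert: `insert_into_leaf` is a hypothetical helper, a leaf is modeled as a sorted Python list, the capacity of 4 keys per leaf is an assumption, and updating the intermediary node with the returned separator is left to the caller.

```python
from bisect import insort

def insert_into_leaf(leaf, key, capacity):
    """Insert key into a sorted leaf (in place) and split it on overflow.

    Returns (left, right, separator). right and separator are None
    when the key fits and no split is needed.
    """
    insort(leaf, key)  # keep the leaf sorted
    if len(leaf) <= capacity:
        return leaf, None, None
    mid = len(leaf) // 2
    left, right = leaf[:mid], leaf[mid:]
    # The first key of the new right leaf is the key that the
    # intermediary node needs in order to point to the new leaf.
    return left, right, right[0]
```

Calling `insert_into_leaf([50, 55, 65, 70], 60, capacity=4)` reproduces the example: the leaf splits into (50, 55) and (60, 65, 70), and 60 is the key to add to the intermediary node.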

Delete in B+ tree
Suppose we have to delete 60 from the above example. What happens in this case? We have to remove 60 from the 4th leaf node and from the intermediary node as well. Once it is removed from the intermediary node, the tree no longer satisfies the B+ tree rules, so the nodes have to be rearranged to restore a balanced tree. After deleting 60 from the above B+ tree and rearranging the nodes, it will appear as below.

Suppose we have to delete 15 from the above tree. We traverse to the 1st leaf node and simply delete 15 from that node. No rearrangement is needed, as the tree remains balanced and 15 does not appear in the intermediary node.
B+ Tree Extensions
As the number of records in the database grows, the intermediary and leaf nodes need to be split and spread out to keep the tree balanced. This is called B+ tree extension. As the tree spreads out, searching for records becomes faster.

The main goal of creating a B+ tree is faster traversal of records. As the branches spread out, fewer disk I/O operations are needed to reach a record, and any record can be fetched in logarithmic time. Suppose we have K search key values, that is, the pointers in the intermediary nodes, and each node has up to n pointers. Then we can fetch any record in the B+ tree in at most log_(n/2)(K) node accesses.

Suppose each index entry takes 40 bytes and each disk block is 4 Kbytes. That means a node can hold about 100 entries (n = 100). Say we have 1 million search key values, which means we have 1 million intermediary pointers. Then a lookup accesses at most log_50(1,000,000) nodes, which rounds up to 4. Hence, assuming roughly one millisecond per node access, it costs only about 4 milliseconds to reach any record in the tree. Now we can see the advantage of extending the B+ tree into more intermediary nodes. As the intermediary nodes spread out more and more, fetching records in the B+ tree becomes more efficient.
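
The arithmetic can be checked with a few lines of Python. This is only a worked example: `max_node_accesses` is an illustrative name, and the block is assumed to hold exactly 100 entries of 40 bytes each (4000 bytes, roughly 4 Kbytes).

```python
import math

def max_node_accesses(search_keys, fanout):
    """Upper bound on nodes visited: ceil(log base ceil(fanout/2) of search_keys)."""
    min_children = math.ceil(fanout / 2)  # nodes are at least half full
    return math.ceil(math.log(search_keys) / math.log(min_children))

entry_bytes = 40
block_bytes = 4000  # assumption: "4 Kbytes" taken as 4000 bytes
fanout = block_bytes // entry_bytes  # n = 100 entries per node

accesses = max_node_accesses(1_000_000, fanout)  # log_50(1,000,000) is about 3.53
```

Since log_50(1,000,000) is about 3.53, `accesses` comes out as 4, which matches the figure in the text.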

Look at the two diagrams below to understand the difference that B+ tree extension makes.
B+ Tree index files
The B+ tree concept above is used to store records in secondary memory. Files whose records are stored this way are called B+ tree index files. Since the tree is balanced and sorted, all the leaf nodes are at the same distance from the root and only the leaf nodes hold the actual values, which makes searching for any record easy and quick. Even insertion and deletion in a B+ tree do not take much time. Hence a B+ tree is an efficient method of storing records.

Searching, inserting and deleting a record are done in the same way as seen above. Since it is a balanced tree, the DBMS first searches for the position of the record in the file and then fetches, inserts or deletes the record. If it finds that the tree would become unbalanced because of an insert, delete or update, it rearranges the nodes so that the definition of the B+ tree still holds.

Below is a simple example of how student details are stored in B+ tree index files.

Suppose we have a new student, Bryan. Where will he fit in the file? He fits in the 1st leaf node. Since this leaf node is not full, we can easily add him to the node.
But what happens if we want to insert another student, Ben, into this file? Some rearrangement of the nodes is needed to maintain the balance of the file.

The same thing happens when we perform a delete.

Benefits of B+ Tree index files

• As the file grows in the database, the performance remains the same. It does not degrade as it does in ISAM. This is because all the records are maintained in the leaf nodes and all the leaf nodes are equidistant from the root. In addition, if there is any overflow, the structure is reorganized automatically.
• Even though insertion and deletion are a little complicated, they can be done in a fraction of a second.
• Leaf nodes are allowed to be only partially (half) filled, since records are larger than pointers.

B Tree index Files


A B tree index file is similar to a B+ tree index file, but in the simplified form described here it follows binary search tree concepts: the root branches to only two nodes, and each intermediary node holds data as well. (In general, a B tree node can have more than two children.) The leaf nodes hold the lowest level of data. In this method, too, the records are sorted. Since all the intermediary nodes also hold records, a search does not always have to traverse down to a leaf node to reach the data. A simple B tree can be represented as below:
See the difference between this tree structure and the B+ tree for the same example above. Here there is no repetition of keys and there are no pointer-only nodes above the leaves. Records are stored in all the nodes. A record is inserted in the same way as in B+ tree index files, except that each node in this example branches to only two nodes. If there is not enough space in a node, the node is split and the records are stored in the resulting nodes.

Example of Simple Insert

Example of splitting the nodes while inserting


Difference between B Tree and B+ Tree Index Files
Compare the examples of B+ tree index files and B tree index files above. They are almost the same, with only a small difference between them, but this small difference has a large effect on database performance.

| | B Tree Index Files | B+ Tree Index Files |
|---|---|---|
| | In the simplified form shown here, this is a binary tree structure similar to a B+ tree, but each node has only two branches and each node holds some records. Hence there is no need to traverse down to a leaf node to get the data. | This is a balanced tree with intermediary nodes and leaf nodes. Intermediary nodes contain only pointers (addresses) to the leaf nodes. All leaf nodes hold records and all are at the same distance from the root. |
| | Its height is greater than its width. | Its width is usually greater than its height. |
| | The number of nodes at any intermediary level l is 2^l. Each of the intermediary nodes has only 2 sub nodes. | Each intermediary node can have n/2 to n children. Only the root node may have as few as 2 children. |
| | Even the leaf level l has 2^l nodes. Hence the total number of nodes in the B tree is 2^(l+1) - 1. | A leaf node stores (n-1)/2 to n-1 values. |
| | | As the number of intermediary nodes, and hence of leaf nodes, increases, i.e. as the B+ tree extends, the traversal cost grows only logarithmically, as log_(n/2)(K). |
| | Records are in sorted order. | Records are in sorted order. |
| Advantages | It might have fewer nodes than a B+ tree, as each node holds data. | It automatically adjusts the nodes to fit a new record. Similarly, it reorganizes the nodes on delete, if required. Hence the definition of the B+ tree is not altered. |
| | Since each node holds records, it may not be necessary to traverse down to a leaf node. | Reorganization of the nodes does not affect the performance of the file. Even after the rearrangement, all the records are still found in the leaf nodes and are all equidistant from the root. Neither the distance of the records from the root nor the time to traverse to a leaf node changes. |
| | | No file degradation problem. |
| | | Good space utilization, as intermediary nodes contain only pointers to the records and only leaf nodes contain records. Pointers need far less space than records. |
| | | Suitable for partial and range searches too. |
| | | Since all the leaf nodes are at an equal distance from the root, the time for an I/O fetch is much lower. Hence the performance of the tree also increases. |
| Disadvantages | If the tree is very big, we have to traverse most of the nodes to get the records. Only a few records can be fetched at the intermediary nodes or near the root. Hence this method might be slower. | Any rearrangement of nodes during insertion or deletion is an overhead. It takes a little effort, time and space. But this disadvantage can be ignored compared to the speed of traversal. |
| | Since each node holds data and can have only two child nodes, the tree does not spread out much. Its depth (height) increases as the number of records increases. As the height of the tree increases, the I/O also increases and hence the performance decreases. | |
| | Insertion and deletion of nodes involve rearrangements as in a B+ tree, but they are more complicated because the binary nodes have to be balanced. | |
| | Implementation of a B tree is a little more difficult than that of a B+ tree. | |
| | These disadvantages cannot be ignored, as they strongly affect the performance of the file. | |

B+ Tree indexing
This is the standard index in a database, where the primary key or the most frequently used search key column of the table is used for indexing. It has the same features as discussed above and is therefore efficient in retrieving data. These indexes can be stored in different forms in a B+ tree. Depending on the way they are organized, there are 4 types of B+ tree indexes.

• Index-organized tables: Here the data itself acts as the index and the whole record is stored in the B+ tree index file.
• Descending indexes: Here the index key is stored in descending order in the B+ tree files.
• Reverse key indexes: In this method of indexing, the index key column value is stored in reverse order. For example, say an index is created on STD_ID in the STUDENT table. Suppose STD_ID has the values 100, 101, 102 and 103. Then the reverse key index values would be 001, 101, 201 and 301 respectively.
• B+ tree cluster index: Here the cluster key of the table is used in the index. Thus, each index entry in this method points to a set of records with the same cluster key.
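
The reverse key example above can be reproduced with a small sketch. It simply reverses the decimal digits, as in the STD_ID example; `reverse_key` and its `width` parameter are illustrative names, and an actual DBMS may reverse the stored bytes of the key rather than its digits.

```python
def reverse_key(value, width=3):
    """Return the digits of value in reverse order, zero-padded to width."""
    return str(value).zfill(width)[::-1]

std_ids = [100, 101, 102, 103]
index_keys = [reverse_key(std_id) for std_id in std_ids]
# index_keys is ["001", "101", "201", "301"]
```

Consecutive IDs end up with very different index keys, so new entries are spread across different leaf nodes of the index instead of all landing in the last one.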
File Organization in DBMS | Set 2
Prerequisite – Hashing Data Structure
In a database management system, when we want to retrieve a particular piece of data, it becomes very inefficient to search through all the index values to reach the desired data. In this situation, the hashing technique comes into the picture.
Hashing is an efficient technique for directly computing the location of the desired data on the disk without using an index structure. Data is stored in data blocks whose addresses are generated by a hash function. The memory location where these records are stored is called a data block or data bucket.

Hash File Organization:

• Data bucket: Data buckets are the memory locations where the records are stored. These buckets are also considered the unit of storage.
• Hash function: A hash function is a mapping function that maps the set of all search keys to actual record addresses. Generally, the hash function uses the primary key to generate the hash index, that is, the address of the data block. A hash function can be anything from a simple to a complex mathematical function.
• Hash index: The prefix of an entire hash value is taken as the hash index. Every hash index has a depth value that signifies how many bits are used for computing the hash function. With a depth of n, these bits can address 2^n buckets. When all these bits are consumed, the depth value is increased by one and twice as many buckets are allocated.

The diagram below depicts how a hash function works:


Hashing is further divided into two subcategories:

Static Hashing –

In static hashing, when a search-key value is provided, the hash function always computes the same address. For example, if we generate an address for STUDENT_ID = 104 using the mod (5) hash function, it always results in the same bucket address, 4. The bucket address never changes. Hence the number of data buckets in memory remains constant throughout in static hashing.

Operations:

• Insertion: When a new record is inserted into the table, the hash function h generates a bucket address for the new record based on its hash key K.
Bucket address = h(K)
• Searching: When a record needs to be searched for, the same hash function is used to retrieve the bucket address of the record. For example, if we want to retrieve the whole record for ID 104, and the hash function is mod (5) on that ID, the bucket address generated is 4. We then go directly to address 4 and retrieve the whole record for ID 104. Here the ID acts as the hash key.
• Deletion: If we want to delete a record, we first use the hash function to fetch the record which is to be deleted. Then we remove the record at that address in memory.
• Updation: The data record that needs to be updated is first located using the hash function, and then the data record is updated.
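
The four operations can be sketched in Python as follows. This is a minimal in-memory illustration, not a disk-based implementation: the bucket count of 5 matches the mod (5) example, each bucket is modeled as a dictionary (so bucket overflow is deliberately ignored here), and the function names are illustrative.

```python
NUM_BUCKETS = 5  # fixed for the lifetime of the file (static hashing)

def h(key):
    """Hash function from the example: key mod 5."""
    return key % NUM_BUCKETS

# One dictionary per bucket, mapping hash key -> record.
buckets = [dict() for _ in range(NUM_BUCKETS)]

def insert(key, record):
    buckets[h(key)][key] = record

def search(key):
    """Return the record stored for key, or None if it is absent."""
    return buckets[h(key)].get(key)

def update(key, record):
    """Replace the record for key; return False if key is absent."""
    bucket = buckets[h(key)]
    if key not in bucket:
        return False
    bucket[key] = record
    return True

def delete(key):
    """Remove the record for key; return False if key is absent."""
    return buckets[h(key)].pop(key, None) is not None
```

For ID 104, `h(104)` is always 4, so `insert`, `search`, `update` and `delete` all go straight to bucket 4 without scanning the other buckets.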

Now suppose we want to insert a new record into the file, but the data bucket address generated by the hash function is not empty, that is, data already exists at that address. This is a critical situation to handle. In static hashing it is called bucket overflow.
How will we insert data in this case?
There are several methods to overcome this situation. Some commonly used methods are discussed below:

1. Open Hashing:
In the open hashing method, the next available data block is used to enter the new record, instead of overwriting the older one. This method is also called linear probing.

For example, D3 is a new record which needs to be inserted, and the hash function generates the address 105. But that bucket is already full. So the system searches for the next available data bucket, 123, and assigns D3 to it.

2. Closed Hashing:
In the closed hashing method, a new data bucket is allocated with the same address and is linked after the full data bucket. This method is also known as overflow chaining.
For example, we have to insert a new record D3 into the table. The static hash function generates the data bucket address 105. But this bucket is too full to store the new data. In this case a new data bucket is added at the end of data bucket 105 and is linked to it. Then the new record D3 is inserted into the new bucket.
o Quadratic probing:
Quadratic probing is very similar to open hashing or linear probing. The only difference is that in linear probing the difference between the old and new bucket addresses is linear, whereas here a quadratic function is used to determine the new bucket address.
o Double hashing:
Double hashing is another method similar to linear probing. Here the difference between probes is fixed, as in linear probing, but this fixed difference is calculated using a second hash function. That is why it is called double hashing.
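
The overflow-handling methods above can be compared in a short sketch. Everything here is illustrative rather than taken from a particular DBMS: the table size of 7, the primary hash `key % size`, the second hash `1 + key % (size - 1)` used for double hashing, and the `Bucket` class used for overflow chaining are all assumptions made for the example.

```python
def linear(key, i, size):
    """i-th probe: step 1 slot at a time from the home address."""
    return (key % size + i) % size

def quadratic(key, i, size):
    """i-th probe: offset grows as i squared."""
    return (key % size + i * i) % size

def double_hashing(key, i, size):
    """i-th probe: fixed step, computed by a second hash function."""
    step = 1 + key % (size - 1)  # assumed second hash; never zero
    return (key % size + i * step) % size

def insert(table, key, probe):
    """Place key in the first free slot along its probe sequence."""
    size = len(table)
    for i in range(size):
        slot = probe(key, i, size)
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("no free slot found along the probe sequence")

class Bucket:
    """Overflow chaining: a full bucket links to a new overflow bucket."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.records = []
        self.overflow = None  # next bucket in the chain, if any

    def add(self, record):
        if len(self.records) < self.capacity:
            self.records.append(record)
            return
        if self.overflow is None:
            self.overflow = Bucket(self.capacity)
        self.overflow.add(record)
```

Inserting the keys 10, 17 and 24 (which all hash to slot 3 in a table of size 7) into a fresh table places them in slots 3, 4, 5 with `linear`, slots 3, 4, 0 with `quadratic`, and slots 3, 2, 4 with `double_hashing`. With a `Bucket` of capacity 2, a third record goes into a linked overflow bucket instead of another slot.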

Dynamic Hashing –

The drawback of static hashing is that it does not expand or shrink dynamically as the size of the database grows or shrinks. In dynamic hashing, data buckets are added or removed dynamically as the number of records increases or decreases. Dynamic hashing is also known as extendible hashing.

In dynamic hashing, the hash function is made to produce a large number of values. For example, there are three data records D1, D2 and D3, and the hash function generates the three addresses 1001, 0101 and 1010 respectively. This method of storing considers only part of the address, initially only the first bit, to store the data. So it tries to load the three records at addresses 0 and 1.

The problem is that, with only two addresses, no bucket address remains for D3. The set of buckets has to grow dynamically to accommodate D3. So the address is changed to use 2 bits rather than 1 bit, the existing data is updated to use 2-bit addresses, and then the system tries to accommodate D3.
Hashing Data Structure
Hashing is an important data structure technique that uses a special function, called the hash function, to map a given key to a particular value (a position in a table) for faster access to elements. The efficiency of the mapping depends on the efficiency of the hash function used.
Let a hash function H(x) map the value x to the index x % 10 in an array. For example, if the list of values is [11, 12, 13, 14, 15], they will be stored at positions {1, 2, 3, 4, 5} of the array or hash table respectively.
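
The H(x) example can be written out directly. This is a minimal sketch; the table of 10 slots follows from the `x % 10` hash, and collisions are not handled here.

```python
def H(x):
    """Hash function from the example: index = x % 10."""
    return x % 10

table = [None] * 10  # one slot per possible index 0..9
for value in [11, 12, 13, 14, 15]:
    table[H(value)] = value

positions = [H(value) for value in [11, 12, 13, 14, 15]]
```

Here `positions` is `[1, 2, 3, 4, 5]`, and slot 0 and slots 6 to 9 of `table` remain empty.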

Suppose we want to design a system for storing employee records keyed by phone number, and we want the following queries to be performed efficiently:
1. Insert a phone number and corresponding information.
2. Search a phone number and fetch the information.
3. Delete a phone number and related information.
We can think of using the following data structures to maintain information about
different phone numbers.
1. Array of phone numbers and records.
2. Linked List of phone numbers and records.
3. Balanced binary search tree with phone numbers as keys.
4. Direct Access Table.
For arrays and linked lists, we need to search in a linear fashion, which can be costly in practice. If we use arrays and keep the data sorted, then a phone number can be searched for in O(log n) time using binary search, but insert and delete operations become costly as we have to maintain the sorted order.
With a balanced binary search tree, we get moderate search, insert and delete times. All of these operations can be guaranteed to take O(log n) time.
Another solution that one can think of is to use a direct access table, where we make a big array and use phone numbers as indexes into the array. An entry in the array is NIL if the phone number is not present; otherwise the array entry stores a pointer to the records corresponding to the phone number. In terms of time complexity this solution is the best of all, since we can do all operations in O(1) time. For example, to insert a phone number, we create a record with the details of the given phone number, use the phone number as the index and store the pointer to the created record in the table.
This solution has many practical limitations. The first problem is that the extra space required is huge. For example, if a phone number has n digits, we need O(m * 10^n) space for the table, where m is the size of a pointer to a record. Another problem is that an integer in a programming language may not be able to store n digits.
Due to the above limitations, a direct access table cannot always be used. Hashing is the solution that can be used in almost all such situations, and in practice it performs extremely well compared to the above data structures such as arrays, linked lists and balanced BSTs. With hashing we get O(1) search time on average (under reasonable assumptions) and O(n) in the worst case.
Hashing is an improvement over the direct access table. The idea is to use a hash function that converts a given phone number or any other key to a smaller number and to use the small number as the index into a table called the hash table.
Hash Function: A function that converts a given big phone number to a small
practical integer value. The mapped integer value is used as an index in hash table.
In simple terms, a hash function maps a big number or string to a small integer that
can be used as index in hash table.
A good hash function should have the following properties:
1) It should be efficiently computable.
2) It should uniformly distribute the keys (each table position equally likely for each key).
For example, for phone numbers a bad hash function is to take the first three digits. A better function is to consider the last three digits. Please note that this may not be the best hash function. There may be better ways.
Hash Table: An array that stores pointers to records corresponding to a given phone
number. An entry in hash table is NIL if no existing phone number has hash function
value equal to the index for the entry.
Collision Handling: Since a hash function gives us a small number for a big key, it is possible that two keys result in the same value. The situation where a newly inserted key maps to an already occupied slot in the hash table is called a collision and must be handled using some collision handling technique. The following are the ways to handle collisions:
• Chaining: The idea is to make each cell of the hash table point to a linked list of records that have the same hash function value. Chaining is simple, but requires additional memory outside the table.
• Open Addressing: In open addressing, all elements are stored in the hash table itself. Each table entry contains either a record or NIL. When searching for an element, we examine the table slots one by one until the desired element is found or it is clear that the element is not in the table.
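
The phone number example with chaining can be sketched as below. The sketch rests on several assumptions: phone numbers are passed as digit strings, the hash function is the "last three digits" suggestion from the text (so the table has 1000 slots), each chain is a Python list rather than a hand-built linked list, and `PhoneDirectory` and `phone_hash` are illustrative names.

```python
TABLE_SIZE = 1000  # last three digits give indexes 000..999

def phone_hash(phone):
    """Use the last three digits of the phone number as the table index."""
    return int(phone[-3:])

class PhoneDirectory:
    def __init__(self):
        self.table = [None] * TABLE_SIZE  # None plays the role of NIL

    def insert(self, phone, info):
        index = phone_hash(phone)
        if self.table[index] is None:
            self.table[index] = []
        chain = self.table[index]
        for entry in chain:
            if entry[0] == phone:
                entry[1] = info  # phone already present: overwrite
                return
        chain.append([phone, info])

    def search(self, phone):
        chain = self.table[phone_hash(phone)]
        if chain is None:
            return None
        for stored_phone, info in chain:
            if stored_phone == phone:
                return info
        return None

    def delete(self, phone):
        index = phone_hash(phone)
        chain = self.table[index]
        if chain is None:
            return False
        for i, entry in enumerate(chain):
            if entry[0] == phone:
                del chain[i]
                if not chain:
                    self.table[index] = None  # slot becomes NIL again
                return True
        return False
```

Two numbers that share their last three digits, for example "5551234567" and "4449994567", land in the same slot (567) and are kept in the same chain, so both can still be inserted, found and deleted independently.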
What is Hashing in DBMS?
In DBMS, hashing is a technique for directly computing the location of the desired data on the disk without using an index structure. Data is stored in data blocks whose addresses are generated by applying a hash function. The memory location where these records are stored is known as a data block or data bucket.

Why do we need Hashing?


Here are the situations in a DBMS where you need to apply the hashing method:

• For a huge database structure, it is hard to search through all the index values at every level before reaching the destination data block to get the desired data.
• The hashing method is used to index and retrieve items in a database, as it is faster to search for a specific item using the shorter hashed key than using its original value.
• Hashing is an ideal method for calculating the direct location of a data record on the disk without using an index structure.
• It is also a helpful technique for implementing dictionaries.

Important Terminologies used in Hashing

Here are important terms which are used in hashing:

• Data bucket: Data buckets are memory locations where the records are stored. A data bucket is also known as a unit of storage.
• Key: A DBMS key is an attribute or set of attributes which helps you to identify a row (tuple) in a relation (table). This allows you to find the relationship between two tables.
• Hash function: A hash function is a mapping function which maps the set of all search keys to the addresses where the actual records are placed.
• Linear probing: Linear probing uses a fixed interval between probes. In this method, the next available data block is used to enter the new record, instead of overwriting the older record.
• Quadratic probing: It helps you to determine the new bucket address. The interval between probes is obtained by adding the consecutive outputs of a quadratic polynomial to the starting value given by the original computation.
• Hash index: It is the address of the data block. A hash function can be anything from a simple to a complex mathematical function.
• Double hashing: Double hashing is a computer programming method used in hash tables to resolve hash collisions.
• Bucket overflow: The condition of bucket overflow is called a collision. This is a fatal state for any static hash function.

There are mainly two types of SQL hashing methods:

1. Static Hashing
2. Dynamic Hashing
Static Hashing

In static hashing, the resultant data bucket address always remains the same.

Therefore, if you generate an address for, say, Student_ID = 10 using the hash function mod(3), the resultant bucket address will always be 1. So you will not see any change in the bucket address.

Therefore, in the static hashing method, the number of data buckets in memory always remains constant.

Static Hash Functions

• Inserting a record: When a new record needs to be inserted into the table, you generate an address for the new record using its hash key. Once the address is generated, the record is stored at that location.
• Searching: When you need to retrieve a record, the same hash function is used to retrieve the address of the bucket where the data is stored.
• Deleting a record: Using the hash function, you first fetch the record which you want to delete. Then you remove the record at that address in memory.

Static hashing is further divided into

1. Open hashing
2. Close hashing

Open Hashing

In the open hashing method, instead of overwriting the older record, the next available data block is used to enter the new record. This method is also known as linear probing.

For example, A2 is a new record which you want to insert. The hash function generates the address 222, but it is already occupied by some other value. That is why the system looks for the next data bucket, 501, and assigns A2 to it.
Close Hashing

In the close hashing method, when a bucket is full, a new bucket is allocated for the same hash result and is linked after the previous one.

Dynamic Hashing

Dynamic hashing offers a mechanism in which data buckets are added and removed
dynamically and on demand. In this hashing, the hash function helps you to create a large
number of values.

Comparison of Ordered Indexing and Hashing


| Parameters | Ordered Indexing | Hashing |
|---|---|---|
| Storing of address | Addresses in memory are sorted according to a key value called the primary key. | Addresses are always generated using a hash function on the key value. |
| Performance | It can decrease when the data in the file increases. As it stores the data in sorted form, any insert, delete or update operation decreases its performance. | Performance of hashing is best when there is constant addition and deletion of data. However, when the database is huge, hash file organization and its maintenance become costlier. |
| Use for | Preferred for range retrieval of data, which means that whenever data is retrieved for a particular range, this method is an ideal option. | This is an ideal method when you want to retrieve a particular record based on the search key. However, it only performs well when the hash function is on the search key. |
| Memory management | There will be many unused data blocks because of delete and update operations. These data blocks cannot be released for re-use. That is why regular maintenance of the memory is required. | In static and dynamic hashing methods, memory is always managed. Bucket overflow is also handled perfectly to extend static hashing. |
