B+ Tree in DBMS
Consider the STUDENT table below. It can be stored in a B+ tree structure as
shown. The tree divides the records in two, splitting them into a left node
and a right node: the left node holds all values less than or equal to the
root value, and the right node holds the values greater than it. The
intermediary nodes at level 2 hold only pointers to the leaf nodes; the
values shown in them are pointers to the next level. All the leaf nodes hold
the actual records in sorted order.
Every record is found at the leaf level, and all leaf nodes are at the same
depth, so searching for any record takes about the same time. The leaves are
also sorted, so once the right leaf is reached, scanning its records is a
quick sequential search.
Traversal through the nodes must be fast: to find a particular record, we
should be able to pass through the intermediary nodes easily. This is
achieved by keeping the pointers in the intermediary nodes and the records
in the leaf nodes sorted.
Any record can then be fetched quickly, because the tree is kept balanced
and all leaf nodes stay at the same distance from the root.
Insertion in B+ tree
Suppose we have to insert a record with key 60 into the structure below. It
belongs in the 3rd leaf node, after 55. But the tree is balanced and that
leaf node is already full, so we cannot simply place the record there: it
must be inserted without violating the fill factor, balance, and order of
the tree. The only option is to split the leaf node. But how do we split it?
The 3rd leaf node would have to hold the values (50, 55, 60, 65, 70), and
its current parent entry is 50. We split the leaf in the middle so that the
balance is not disturbed, grouping (50, 55) and (60, 65, 70) into two leaf
nodes. For these to be separate leaves, the intermediary node can no longer
branch only at 50: the key 60 is added to it, with a pointer to the new leaf
node.
This is how a new entry is inserted when a leaf overflows. In the normal
case it is simpler: we find the leaf where the key fits and place it there.
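The leaf-split step described above can be sketched in a few lines of
Python. This is a minimal illustration, not a full B+ tree: it assumes a
leaf holds at most `order` keys and that, on overflow, the first key of the
right half is copied up to the parent.

```python
# Minimal sketch of a B+ tree leaf split (illustrative, not a full tree).
# A leaf may hold at most `order` keys; on overflow it splits in the middle
# and the first key of the right half becomes the new separator for the parent.
import bisect

def insert_into_leaf(leaf, key, order=4):
    """Insert `key` into the sorted list `leaf`; split on overflow.

    Returns (left, right, separator) after a split, or (leaf, None, None)
    when the key fits without splitting.
    """
    bisect.insort(leaf, key)
    if len(leaf) <= order:
        return leaf, None, None
    mid = len(leaf) // 2
    left, right = leaf[:mid], leaf[mid:]
    return left, right, right[0]   # right[0] is copied into the parent

# The example from the text: leaf (50, 55, 65, 70) receives 60 and splits.
left, right, sep = insert_into_leaf([50, 55, 65, 70], 60)
print(left, right, sep)   # [50, 55] [60, 65, 70] 60
```

The split reproduces the grouping from the text: (50, 55) stays in the left
leaf, (60, 65, 70) forms the new right leaf, and 60 moves up to the
intermediary node.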
Delete in B+ tree
Suppose we have to delete 60 from the example above. What happens in this
case? We must remove 60 from the 4th leaf node as well as from the
intermediary node. If we simply remove it from the intermediary node, the
tree no longer satisfies the B+ tree rules, so the nodes must be modified to
keep the tree balanced. After deleting 60 and re-arranging the nodes, the
tree appears as below.
Suppose each index entry takes 40 bytes and each disk block is about
4 Kbytes. Then one node can hold about n = 100 entries. Say we have 1
million search-key values. Since every node is kept at least half full, each
step down the tree narrows the search by a factor of at least n/2 = 50, so
about log50(1,000,000) ≈ 4 nodes are accessed to reach any record. If each
block access takes a millisecond, fetching any record costs only about 4
milliseconds. This shows the advantage of a wide, shallow B+ tree: the more
the intermediary nodes fan out, the more efficiently records can be fetched.
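The arithmetic above can be checked with a few lines of Python. The ~4 Kbyte
block size is an assumption (the entry size of 40 bytes and the 1 million
keys come from the text):

```python
import math

block_size = 4000          # bytes per disk block (assumed, ~4 Kbytes)
entry_size = 40            # bytes per index entry (from the text)
n = block_size // entry_size          # entries that fit in one node
keys = 1_000_000                      # search-key values

# Nodes are kept at least half full, so the worst-case fanout is n/2 = 50
# and the tree height is about log base 50 of 1,000,000, rounded up.
height = math.ceil(math.log(keys, n // 2))
print(n, height)   # 100 4
```

At one block access per level, that height of 4 is where the "4
milliseconds per fetch" estimate comes from.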
Searching, inserting and deleting a record are done in the same way we have
seen above. Since the tree is balanced, it first locates the position of the
record in the file and then fetches, inserts or deletes it. If an insert,
delete or update would leave the tree unbalanced, the nodes are re-arranged
so that the B+ tree properties are preserved.
Below is a simple example of how student details are stored in B+ tree index
files.
Suppose we have a new student, Bryan. Where does he fit in the file? He fits
in the 1st leaf node, and since that leaf is not full, we can add him there
easily.
But what happens if we then insert another student, Ben? Some re-arrangement
of the nodes is needed to maintain the balance of the file.
B tree vs B+ tree (recovered from the comparison table):
Node capacity: in a B tree of height l, even the leaf level can contain 2^l
nodes, so the total number of nodes is 2^(l+1) - 1. In a B+ tree, a leaf
node stores between (n-1)/2 and n-1 values.
Disadvantages of the B tree: if the tree is very big, most of the nodes must
be traversed to reach the records; only a few records can be fetched at the
intermediary nodes or near the root, so this method can be slower. The
implementation of a B tree is also a little more difficult than a B+ tree.
Disadvantages of the B+ tree: any re-arrangement of nodes during insertion
or deletion is an overhead, costing some effort, time and space. But this
disadvantage is negligible compared to the speed of traversal.
B+ Tree indexing
This is the standard index in a database, where the primary key or the most
frequently used search-key column of the table is used to index. It has the
features discussed above, so it is efficient in retrieving data. These
indexes can be stored in different forms in a B+ tree; depending on how they
are organized, there are several types of B+ tree indexes.
Data bucket – Data buckets are the memory locations where the records are stored. These
buckets are also considered units of storage.
Hash function – A hash function is a mapping function that maps the set of search keys to
actual record addresses. Generally, the hash function uses the primary key to generate the
hash index – the address of the data block. The hash function can be anything from a simple
to a complex mathematical function.
Hash index – A prefix of the entire hash value is taken as the hash index. Every hash index
has a depth value signifying how many bits are used for computing the hash function; these
bits can address 2^n buckets. When all these bits are consumed, the depth value is increased
and twice as many buckets are allocated.
Static Hashing –
In static hashing, when a search-key value is provided, the hash function always computes
the same address. For example, if we generate an address for STUDENT_ID = 104 using the
mod(5) hash function, it always results in the same bucket address, 4. The bucket address
never changes, so the number of data buckets in memory for static hashing remains constant
throughout.
Operations –
Insertion – When a new record is inserted into the table, the hash function h generates a
bucket address for the new record based on its hash key K:
Bucket address = h(K)
Searching – When a record needs to be searched, the same hash function is used to retrieve
its bucket address. For example, to retrieve the whole record for ID 104 with the mod(5)
hash function on that ID, the bucket address generated is 4. We then go directly to
address 4 and retrieve the whole record for ID 104. Here the ID acts as the hash key.
Deletion – To delete a record, we first fetch the record to be deleted using the hash
function, then remove the record from that address in memory.
Updation – The data record that needs to be updated is first located using the hash
function, and then updated.
Now, suppose we want to insert a new record, but the data bucket address
generated by the hash function is not empty: data already exists at that
address. This is a critical situation to handle, and in static hashing it is
called bucket overflow.
How do we insert data in this case?
Several methods exist to overcome this situation; the most commonly used are
discussed below:
1. Open Hashing –
In the open hashing method, the next available data block is used to store the new record
instead of overwriting the older one. This method is also called linear probing.
For example, suppose D3 is a new record to be inserted and the hash function generates
the address 105, but that bucket is already full. The system then searches for the next
available data bucket, 123, and assigns D3 to it.
2. Closed hashing –
In the closed hashing method, a new data bucket is allocated with the same address and
linked after the full data bucket. This method is also known as overflow chaining.
For example, suppose we insert a new record D3 into the table and the static hash function
generates the data bucket address 105, but that bucket is too full to store the new data.
In this case a new data bucket is added after bucket 105 and linked to it, and the new
record D3 is inserted into the new bucket.
o Quadratic probing :
Quadratic probing is similar to open hashing (linear probing), but where linear
probing uses a fixed, linear step between buckets, quadratic probing uses a
quadratic function to determine the next bucket address.
o Double Hashing :
Double hashing is another method similar to linear probing. The step between probes
is fixed, as in linear probing, but this fixed step is computed by a second hash
function. That is why the name is double hashing.
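The three open-addressing variants above differ only in their probe
sequence, i.e. which address is tried on the i-th attempt. A minimal sketch,
assuming a table of m = 11 slots and a second hash h2 (both values chosen
here for illustration):

```python
# Probe sequences for linear, quadratic, and double hashing.
# Assumes a table of m slots; for double hashing, the second hash h2
# never evaluates to 0, so every probe actually moves.
m = 11

def h1(key):
    return key % m

def h2(key):
    return 1 + (key % (m - 1))          # second hash for double hashing

def linear_probe(key, i):
    return (h1(key) + i) % m            # fixed step of 1

def quadratic_probe(key, i):
    return (h1(key) + i * i) % m        # step grows quadratically

def double_hash_probe(key, i):
    return (h1(key) + i * h2(key)) % m  # step fixed per key, taken from h2

key = 105
print([linear_probe(key, i) for i in range(4)])      # [6, 7, 8, 9]
print([quadratic_probe(key, i) for i in range(4)])   # [6, 7, 10, 4]
print([double_hash_probe(key, i) for i in range(4)]) # [6, 1, 7, 2]
```

Note how linear probing walks adjacent slots (which causes clustering),
while quadratic and double hashing jump around the table.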
Dynamic Hashing –
The drawback of static hashing is that it does not expand or shrink dynamically as the
size of the database grows or shrinks. In dynamic hashing, data buckets grow or shrink
(are added or removed dynamically) as the number of records increases or decreases.
Dynamic hashing is also known as extended hashing.
In dynamic hashing, the hash function is made to produce a large number of values. For
example, say there are three data records D1, D2 and D3, for which the hash function
generates the addresses 1001, 0101 and 1010 respectively. This method stores records
using only part of the address – say only the first bit – so it tries to place the three
records at addresses 0 and 1.
But then no bucket address remains for D3. The buckets have to grow dynamically to
accommodate it: the addressing is changed to use 2 bits rather than 1, the existing
records are updated to 2-bit addresses, and then D3 can be accommodated.
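The grow-on-overflow idea can be sketched as follows. The bit strings below
are made up for illustration (chosen so that one bit collides but two bits
do not), and each bucket is assumed to hold exactly one record:

```python
# Sketch of directory doubling in dynamic (extendible) hashing.
# Bucket capacity is 1 record; `depth` bits of the hash value address
# 2**depth buckets. The bit strings here are invented for illustration.
hashes = {"D1": "1001", "D2": "0101", "D3": "1110"}

def place_all(depth):
    """Try to place every record using the first `depth` hash bits.

    Returns the directory, or None if two records need the same bucket.
    """
    directory = {}
    for record, bits in hashes.items():
        addr = bits[:depth]
        if addr in directory:
            return None                # bucket overflow at this depth
        directory[addr] = record
    return directory

depth = 1
while place_all(depth) is None:        # double the directory on overflow
    depth += 1
print(depth, place_all(depth))   # 2 {'10': 'D1', '01': 'D2', '11': 'D3'}
```

With one bit, D1 and D3 both map to bucket 1; growing to two bits gives
every record its own bucket, just as described above.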
Hashing Data Structure
Hashing is an important data structure designed around a special function,
called the hash function, which maps a given value to a particular key for
faster access of elements. The efficiency of the mapping depends on the
efficiency of the hash function used.
Let a hash function H(x) map a value x to the index x % 10 in an array. For
example, if the list of values is [11, 12, 13, 14, 15], they are stored at
positions {1, 2, 3, 4, 5} in the array (hash table) respectively.
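This mapping is direct to code; a few lines reproduce the example exactly:

```python
# The mapping from the text: H(x) = x % 10 places each value at index x % 10.
def H(x):
    return x % 10

table = [None] * 10
for value in [11, 12, 13, 14, 15]:
    table[H(value)] = value

print(table)   # [None, 11, 12, 13, 14, 15, None, None, None, None]
```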
Suppose we want to design a system for storing employee records keyed by
phone numbers, where the following queries must be performed efficiently:
1. Insert a phone number and corresponding information.
2. Search a phone number and fetch the information.
3. Delete a phone number and related information.
We can think of using the following data structures to maintain information about
different phone numbers.
1. Array of phone numbers and records.
2. Linked List of phone numbers and records.
3. Balanced binary search tree with phone numbers as keys.
4. Direct Access Table.
For arrays and linked lists, we need to search in a linear fashion, which
can be costly in practice. If we use an array and keep the data sorted, then
a phone number can be searched in O(log n) time using binary search, but
insert and delete operations become costly, as we have to maintain sorted
order.
With a balanced binary search tree, we get moderate search, insert and
delete times: all of these operations can be guaranteed to run in O(log n)
time.
Another solution is a direct access table: we make a big array and use phone
numbers as indexes into it. An array entry is NIL if the phone number is not
present; otherwise it stores a pointer to the record for that phone number.
Time-complexity-wise this solution is the best of all: every operation can
be done in O(1) time. For example, to insert a phone number we create a
record with the details, use the phone number as the index, and store the
pointer to the record in the table.
This solution has serious practical limitations. The first problem is that
the extra space required is huge: if phone numbers have n digits, we need
O(m * 10^n) space for the table, where m is the size of a pointer to a
record. Another problem is that an integer type in a programming language
may not be able to store n digits.
Due to these limitations a direct access table cannot always be used.
Hashing is the solution that can be used in almost all such situations, and
in practice it performs extremely well compared to the data structures above
(array, linked list, balanced BST). With hashing we get O(1) search time on
average (under reasonable assumptions) and O(n) in the worst case.
Hashing is an improvement over the direct access table. The idea is to use a
hash function that converts a given phone number (or any other key) to a
smaller number, and to use that small number as an index into a table called
the hash table.
Hash Function: A function that converts a given big phone number to a small,
practical integer value. The mapped integer is used as an index into the
hash table. In simple terms, a hash function maps a big number or string to
a small integer that can be used as an index into the hash table.
A good hash function should have the following properties:
1) It is efficiently computable.
2) It uniformly distributes the keys (each table position is equally likely
for each key).
For example, for phone numbers a bad hash function would take the first
three digits; a better one considers the last three digits. Note that even
this may not be the best hash function; there may be better ways.
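A quick illustration of why the last three digits beat the first three:
prefixes repeat across a whole region, while the trailing digits vary. The
phone numbers below are made-up examples:

```python
# First three digits collide for every number in the same area; the last
# three digits spread the keys out. Numbers are invented for illustration.
phones = ["5551001234", "5551005678", "5551009999"]

first_three = [int(p[:3]) for p in phones]   # all hash to the same value
last_three  = [int(p[-3:]) for p in phones]  # distinct values

print(first_three)   # [555, 555, 555]
print(last_three)    # [234, 678, 999]
```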
Hash Table: An array that stores pointers to the records for the given phone
numbers. An entry in the hash table is NIL if no existing phone number
hashes to that index.
Collision Handling: Since a hash function maps a big key to a small number,
two keys may hash to the same value. The situation where a newly inserted
key maps to an already occupied slot in the hash table is called a
collision, and it must be handled with some collision handling technique.
The common approaches are:
Chaining: The idea is to make each cell of the hash table point to a linked
list of records that share that hash value. Chaining is simple, but requires
additional memory outside the table.
Open Addressing: In open addressing, all elements are stored in the hash
table itself; each table entry contains either a record or NIL. When
searching for an element, we examine table slots one by one until the
desired element is found or it is clear that it is not in the table.
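Chaining is short enough to sketch directly. This minimal class uses
Python's built-in `hash`; the table size, keys, and record values are
invented for illustration:

```python
# A small chained hash table: each slot holds a list of (key, record)
# pairs that share the same hash value modulo the table size.
class ChainedHashTable:
    def __init__(self, size=7):
        self.slots = [[] for _ in range(size)]

    def _index(self, key):
        return hash(key) % len(self.slots)

    def insert(self, key, record):
        self.slots[self._index(key)].append((key, record))

    def search(self, key):
        for k, rec in self.slots[self._index(key)]:
            if k == key:
                return rec
        return None                    # not in the table

    def delete(self, key):
        chain = self.slots[self._index(key)]
        chain[:] = [(k, r) for k, r in chain if k != key]

t = ChainedHashTable()
t.insert(9876543210, "Alice")
t.insert(9876543217, "Bob")            # may land in the same chain
print(t.search(9876543210), t.search(9876543217))   # Alice Bob
```

Searching walks only the one chain selected by the hash, which stays short
as long as the hash function spreads the keys evenly.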
What is Hashing in DBMS?
In a DBMS, hashing is a technique to compute the location of the desired
data on disk directly, without using an index structure. Data is stored in
data blocks whose addresses are generated by applying a hash function; the
memory location where these records are stored is known as a data block or
data bucket.
For a huge database structure, it is expensive to search through all the
levels of an index and then reach the destination data block holding the
desired data.
The hashing method is used to index and retrieve items in a database
because it is faster to search for a specific item using a short hashed key
than using its original value.
Hashing is an ideal method to calculate the direct location of a data
record on disk without using an index structure.
It is also a helpful technique for implementing dictionaries.
Data bucket – Data buckets are memory locations where the records are stored. They are
also known as units of storage.
Key: A DBMS key is an attribute or set of attributes that helps you identify a row (tuple)
in a relation (table). Keys let you find the relationship between two tables.
Hash function: A hash function is a mapping function that maps the set of search keys to
the addresses where the actual records are placed.
Linear probing – Linear probing uses a fixed interval between probes. In this method, the
next available data block is used to store the new record, instead of overwriting the
older record.
Quadratic probing – Quadratic probing determines the new bucket address by adding
successive outputs of a quadratic polynomial to the starting value given by the original
hash computation.
Hash index – The address of a data block. A hash function can be anything from a simple to
a complex mathematical function.
Double hashing – Double hashing is a method used in hash tables to resolve hash
collisions.
Bucket overflow: The condition of bucket overflow is called a collision. This is a fatal
state for any static hash function.
There are two types of hashing:
1. Static Hashing
2. Dynamic Hashing
Static Hashing
In static hashing, the resultant data bucket address always remains the same.
For example, if you generate an address for Student_ID = 10 using the hash function
mod(3), the resultant bucket address is always 1; you will not see any change in the
bucket address.
Therefore, in this static hashing method, the number of data buckets in memory always
remains constant.
Inserting a record: When a new record needs to be inserted into the table, you
generate an address for it using its hash key. Once the address is generated,
the record is stored at that location.
Searching: When you need to retrieve a record, the same hash function is used to
compute the address of the bucket where the data is stored.
Deleting a record: Using the hash function, you first fetch the record that is to
be deleted, then remove the record from that address in memory.
There are two ways to handle bucket overflow in static hashing:
1. Open hashing
2. Close hashing
Open Hashing
In the open hashing method, instead of overwriting the older record, the next available
data block is used to store the new record. This method is also known as linear probing.
For example, suppose A2 is a new record you want to insert and the hash function generates
the address 222, but that bucket is already occupied by another value. The system
therefore looks for the next data bucket, 501, and assigns A2 to it.
Close Hashing
In the close hashing method, when a bucket is full, a new bucket is allocated for the same
hash value and linked after the previous one.
Dynamic Hashing
Dynamic hashing offers a mechanism in which data buckets are added and removed
dynamically, on demand. In this scheme, the hash function is made to produce a large
number of values.