Data Structure - Unit 5 - B.Tech 3rd
Graph
Definition
A graph G can be defined as an ordered set G(V, E) where V(G) represents the set of
vertices and E(G) represents the set of edges which are used to connect these vertices.
A Graph G(V, E) with 5 vertices (A, B, C, D, E) and six edges ((A,B), (B,C), (C,E), (E,D),
(D,B), (D,A)) is shown in the following figure.
Closed Path
A path is called a closed path if its initial node is the same as its terminal node, that
is, if V0 = VN.
Simple Path
A path is simple if all of its nodes are distinct. If all the nodes are distinct with the
exception V0 = VN, then the path P is called a closed simple path.
Cycle
A cycle can be defined as the path which has no repeated edges or vertices except the
first and last vertices.
Connected Graph
A connected graph is the one in which some path exists between every two vertices (u,
v) in V. There are no isolated nodes in connected graph.
Complete Graph
A complete graph is one in which every node is connected to all the other nodes. A
complete graph contains n(n-1)/2 edges, where n is the number of nodes in the graph.
Weighted Graph
In a weighted graph, each edge is assigned with some data such as length or weight.
The weight of an edge e can be given as w(e) which must be a positive (+) value
indicating the cost of traversing the edge.
Digraph
A digraph is a directed graph in which each edge of the graph is associated with some
direction and the traversing can be done only in the specified direction.
Loop
An edge whose two end points are the same vertex is called a loop.
Adjacent Nodes
If two nodes u and v are connected via an edge e, then the nodes u and v are called as
neighbours or adjacent nodes.
o Null Graph
o Trivial Graph
o Non-directed Graph
o Directed Graph
o Connected Graph
o Disconnected Graph
o Regular Graph
o Complete Graph
o Cycle Graph
o Cyclic Graph
o Acyclic Graph
o Finite Graph
o Infinite Graph
o Bipartite Graph
o Planar Graph
o Simple Graph
o Multi Graph
o Pseudo Graph
o Euler Graph
o Hamiltonian Graph
Null Graph
The Null Graph is also known as the order zero graph. The term "null graph"
refers to a graph with an empty edge set.
In other words, a null graph has no edges, and the null graph is present with only
isolated vertices in the graph.
The image displayed above is a null graph because there are no edges
between the three vertices of the graph.
Trivial Graph
A graph is called a trivial graph if it has only one vertex present in it.
The trivial graph is the smallest possible graph that can be created with the least
number of vertices that is one vertex only.
The above is an example of a trivial graph: the whole graph has only a single
vertex, named vertex A.
Non-Directed Graph
A graph is called a non-directed graph if all the edges present between any graph
nodes are non-directed.
By non-directed edges, we mean edges for which no starting node and no ending
node are defined.
Every edge of a graph must be non-directed for it to be called a non-directed graph.
The graph that is displayed above is an example of a non-directed graph.
This graph has four vertices named vertex A, vertex B, vertex C, and vertex D.
There are also exactly four edges between these vertices of the graph.
None of these edges is directed, which means the edges don't have any specific
direction.
For example, the edge between vertex A and vertex B doesn't have any direction,
so we cannot say whether it starts from vertex A or from vertex B.
Similarly, we can't determine the ending vertex of this edge.
Directed Graph
Another name for the directed graphs is digraphs.
A graph is called a directed graph or digraph if all the edges present between any
vertices or nodes of the graph are directed or have a defined direction.
By directed edges, we mean the edges of the graph that have a direction to
determine from which node it is starting and at which node it is ending.
All the edges for a graph need to be directed to call it a directed graph or digraph.
All the edges of a directed graph or digraph have a direction that will start from
one vertex and end at another.
The graph that is displayed above is an example of a directed graph.
This graph has four vertices named vertex A, vertex B, vertex C, and vertex D.
There are also exactly four edges between these vertices, and all of these edges
are directed (each points to one of the vertices), which means every edge has a
specific direction assigned to it.
For example, consider the edge that is present between vertex D and vertex A.
This edge shows that an arrowhead is pointing towards vertex A, which means
this edge starts from vertex D and ends at vertex A.
Connected Graph
For a graph to be labelled as a connected graph, there must be at least a single
path between every pair of the graph's vertices.
In other words, we can say that if we start from one vertex, we should be able to
move to any of the vertices that are present in that particular graph, which means
there exists at least one path between all the vertices of the graph.
The graph shown above is an example of a connected graph because we can start
from any one of the vertices of the graph and move to any of the remaining
vertices.
There will always exist at least one path between them.
For example, if we begin from vertex B and traverse to vertex H, there are
various paths for traversing. One of the paths is
vertex B -> vertex C -> vertex D -> vertex F -> vertex E -> vertex H.
Similarly, there are other paths from vertex B to vertex H, and there is at least
one path between every pair of the graph's nodes.
In other words, all the vertices of the graph are connected to each other via an
edge or a sequence of edges.
Disconnected Graph
A graph is said to be a disconnected graph where there does not exist any path
between at least one pair of vertices.
In other words, we can say that if we start from any one of the vertices of the
graph and try to move to the remaining present vertices of the graph and there
exists not even a single path to move to that vertex, then it is the case of the
disconnected graph.
If any one of such a pair of vertices doesn't have a path between them, it is called
a disconnected graph.
The graph shown above is a disconnected graph.
The above graph is called a disconnected graph because at least one pair of
vertices doesn't have a path to traverse starting from either node.
For example, no path exists between vertex A and vertex G.
In other words, not all the vertices of the graph are connected to each other via
an edge or a sequence of edges.
Regular Graph
For a graph to be called a regular, it should satisfy one primary condition: all
graph vertices should have the same degree.
By the degree of a vertex, we mean the number of edges incident on that
vertex.
If all the graph nodes have the same degree value, then the graph is called
a regular graph.
If all the vertices of a graph have the degree value of 6, then the graph is called a
6-regular graph.
If all the vertices in a graph are of degree 'k', then it is called a "k-regular graph".
The graphs that are displayed above are regular graphs.
In graph 1, there are three vertices named vertex A, vertex B, and vertex C, and
every vertex has degree 2.
The degree of each vertex is calculated by counting the number of edges
connected to that particular vertex.
Vertex A in graph 1 has two edges, one to each of the other two vertices. Thus,
the degree of vertex A is 2.
Similarly, each of the other two vertices has exactly two edges, so their degree
is also 2. As the degree of all three nodes is the same, that is 2, this graph is
called a 2-regular graph.
Similarly, the second graph shown above has four vertices, and the degree of all
four vertices is 2 because exactly two edges are associated with each of them.
As all the nodes of this graph have the same degree of 2, it is also a 2-regular
graph.
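The degree-counting argument above can be sketched in a few lines of Python. This is only an illustration: the function names `degrees` and `regularity` are made up for this note, and the graph is assumed to be given as a list of undirected edges (so isolated vertices cannot be represented).

```python
def degrees(edges):
    """Count how many edge ends touch each vertex of an undirected graph."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg


def regularity(edges):
    """Return k if every vertex has degree k, otherwise None."""
    values = set(degrees(edges).values())
    return values.pop() if len(values) == 1 else None


# Three vertices joined in a triangle, as in graph 1 of the text.
triangle = [("A", "B"), ("B", "C"), ("C", "A")]
print(regularity(triangle))  # 2

# A simple path A - B - C is not regular: B has degree 2, A and C have degree 1.
path = [("A", "B"), ("B", "C")]
print(regularity(path))  # None
```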
Complete Graph
A graph is said to be a complete graph if, for all the vertices of the graph, there
exists an edge between every pair of the vertices.
In other words, we can say that all the vertices are connected to the rest of all the
vertices of the graph.
There are two graphs named K3 and K4 shown in the above image, and both
graphs are complete graphs.
Graph K3 has three vertices, and each vertex has an edge to each of the other
vertices. Similarly, for graph K4, there are four nodes named vertex E,
vertex F, vertex G, and vertex H.
For example, the vertex F has three edges connected to it, one to each of the
three remaining vertices of the graph.
Likewise, for the other three remaining vertices, there are three edges associated
with each one of them.
As every vertex of this graph has a separate edge to every other vertex, it is
called a complete graph.
Cycle Graph
If a graph has three or more vertices and its edges form a single cycle, then the
graph is called a cycle graph.
In a cycle graph, the degree of every vertex is 2.
There are three graphs shown in the above image, and all of them are examples
of the cycle graph because the number of nodes in each of these graphs is greater
than two and the degree of every vertex in these graphs is exactly 2.
Cyclic Graph
For a graph to be called a cyclic graph, it should consist of at least one cycle. If a
graph has a minimum of one cycle present, it is called a cyclic graph.
The graph shown in the image has two cycles present, satisfying the required
condition for a graph to be cyclic, thus making it a cyclic graph.
Acyclic Graph
A graph is called an acyclic graph if zero cycles are present, and an acyclic graph
is the complete opposite of a cyclic graph.
The graph shown in the above image is acyclic because it has zero cycles present
in it.
That means if we begin traversing the graph from vertex B, there is no path that
follows distinct edges and returns to the same vertex, that is vertex B. The same
holds for every other vertex of the graph.
Finite Graph
If the number of vertices and the number of edges that are present in a graph are
finite in number, then that graph is called a finite graph.
The graph shown in the above image is the finite graph.
There are four vertices named vertex A, vertex B, vertex C, and vertex D, and the
number of edges present in this graph is also four. As both the number of vertices
and the number of edges of this graph are finite, it is called a finite graph.
Infinite Graph
If the number of vertices in the graph and the number of edges in the graph are
infinite in number, that means the vertices and the edges of the graph cannot be
counted, then that graph is called an infinite graph.
As we can see in the above image, the number of vertices in the graph and the
number of edges in the graph are infinite, so this graph is called an infinite graph.
Bipartite Graph
For a graph to be a Bipartite graph, it needs to satisfy some of the basic
preconditions. These conditions are:
o All the vertices of the graph should be divided into two disjoint sets of
vertices X and Y.
o Every edge should connect a vertex in the set X to a vertex in the set Y.
That means a vertex should not be connected to another vertex that is
present in the same set.
o The two sets should be disjoint, that means no vertex should appear in
both of them.
The vertices of the graph shown in the above image are divided into two sets
named set X and set Y. The contents of these sets are,
Set X = {vertex A, vertex B, vertex C, vertex D}
Set Y = {vertex P, vertex Q, vertex R}
The vertex A of the set X is connected to the vertex Q of the set Y, and the
vertex B is also connected to the vertex Q.
The vertex C of the set X is connected to the two vertices of the set Y named
vertex P and vertex R. The vertex D of the set X is connected to the vertex Q
of the set Y.
Similarly, all the vertices present in the set Y are only connected to the vertices
from the set X. And both set X and set Y have non-repeating or distinct elements
present in them.
The graph shown in the above image satisfies all the conditions for the Bipartite
graph, and thus it is a Bipartite graph.
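One common way to test the conditions above is to try to colour the vertices with two colours so that no edge joins two vertices of the same colour. The sketch below does this with a breadth-first traversal. The function names are made up for this note, and the edge list is taken from the description of set X and set Y in the text.

```python
from collections import deque


def build_adjacency(edges):
    """Build an undirected adjacency list from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj


def two_colour(edges):
    """Return a vertex -> 0/1 colouring if the graph is bipartite, else None."""
    adj = build_adjacency(edges)
    colour = {}
    for start in adj:                 # handle every connected component
        if start in colour:
            continue
        colour[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in colour:
                    colour[v] = 1 - colour[u]   # neighbour gets the other set
                    queue.append(v)
                elif colour[v] == colour[u]:
                    return None                 # edge inside one set
    return colour


# Edges described in the text: A-Q, B-Q, C-P, C-R, D-Q.
edges = [("A", "Q"), ("B", "Q"), ("C", "P"), ("C", "R"), ("D", "Q")]
colouring = two_colour(edges)
set_x = sorted(v for v, c in colouring.items() if c == 0)
set_y = sorted(v for v, c in colouring.items() if c == 1)
print(set_x, set_y)  # ['A', 'B', 'C', 'D'] ['P', 'Q', 'R']
```

A triangle, by contrast, cannot be two-coloured, so `two_colour` returns None for it.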
Planar Graph
A graph is called a planar graph if it can be drawn in a single plane without
any two of its edges crossing each other.
The graph shown in the above image can be drawn in a single plane without any
two edges crossing. Thus it is a planar graph.
Simple Graph
A graph is said to be a simple graph if it has no self-loops and no parallel
edges.
We have three vertices and three edges for the graph that is shown in the above
image. This graph has no self-loops and no parallel edges; therefore, it is called
a simple graph.
Multi Graph
A graph is said to be a multigraph if the graph doesn't consist of any self-loops,
but parallel edges are present in the graph.
If there is more than one edge present between two vertices, then that pair of
vertices is said to be having parallel edges.
We have three vertices and three edges for the graph that is shown in the above
image.
There are no self-loops, but two edges connect the same pair of vertices, vertex
A and vertex E.
In other words, if two vertices of a graph are connected with more than one edge,
then that pair is said to have parallel edges, thus making the graph a
multigraph.
Pseudo Graph
If a graph consists of no parallel edges, but self-loops are present in a graph, it is
called a pseudo graph.
A self-loop is an edge that starts from one of the graph's vertices and ends on
the same vertex.
The graph shown in the above image has vertex A, vertex B and vertex E.
There are four edges in this graph. Three of them are associated with
vertex A, and among these three edges, one is a self-loop.
None of the four edges is parallel to another. Since the graph shown above has
a self-loop and no parallel edges, it is a pseudo graph.
Euler Graph
If a graph is connected and all of its vertices have an even degree, then the
graph is known as an Euler graph.
By degree of a vertex, we mean the number of edges that are associated with a
vertex.
So for a graph to be an Euler graph, it is required that all the vertices in the graph
should be associated with an even number of edges.
In the graph shown in the above image, we have five vertices named vertex A,
vertex B, vertex C, vertex D and vertex E.
All the vertices except vertex C have a degree of 2, which means each of them is
associated with two edges.
Vertex C is associated with four edges, thus making its degree 4.
The degrees of vertex C and the other vertices are 4 and 2, respectively, which are
both even. Therefore, the graph displayed above is an Euler graph.
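The even-degree condition is easy to check mechanically. The sketch below is an illustration only: the figure is not reproduced here, so the edge list is an assumption (two triangles sharing vertex C) chosen to match the degrees described in the text, and the function name is made up. It checks degrees only and does not test connectivity.

```python
def all_degrees_even(edges):
    """Return True if every vertex of the undirected graph has even degree."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return all(d % 2 == 0 for d in deg.values())


# Assumed edge list: triangles A-B-C and C-D-E sharing vertex C,
# so C has degree 4 and every other vertex has degree 2.
euler_edges = [("A", "B"), ("B", "C"), ("C", "A"),
               ("C", "D"), ("D", "E"), ("E", "C")]
print(all_degrees_even(euler_edges))  # True

# Removing one edge leaves two vertices with odd degree.
print(all_degrees_even(euler_edges[:-1]))  # False
```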
Hamiltonian Graph
Suppose a connected graph has a closed walk that visits every vertex of the
graph exactly once (except the starting vertex, where the walk also ends) without
repeating any edge.
Such a graph is called a Hamiltonian graph, and such a walk is called
a Hamiltonian circuit. The Hamiltonian circuit is also known as a Hamiltonian cycle.
A Hamiltonian path visits every vertex exactly once but need not return to its
starting vertex. A Hamiltonian path whose last vertex is joined back to its first
vertex by an edge forms a Hamiltonian circuit.
Every graph that contains a Hamiltonian circuit also contains a Hamiltonian path,
but vice versa is not true.
There may exist more than one Hamiltonian path and Hamiltonian circuit in a
graph.
The graph shown in the above image consists of a closed path ABCDEFA which
starts from vertex A and traverses all other vertices or nodes without traversing
any of the nodes twice other than vertex A in the path of traversal.
Therefore, the graph shown in the above image is a Hamilton graph.
Directed Graph
In a directed graph, edges form an ordered pair. Edges represent a specific path
from some vertex A to another vertex B.
Node A is called initial node while node B is called terminal node.
A directed graph is shown in the following figure.
Sequential Representations of Graph
In this article, we will discuss the ways to represent the graph. By Graph
representation, we simply mean the technique to be used to store some graph
into the computer's memory.
A graph is a data structure that consists of a set of vertices (called nodes) and a
set of edges. There are two ways to store graphs in the computer's memory:
Sequential representation
In sequential representation, there is a use of an adjacency matrix to represent
the mapping between vertices and edges of the graph.
We can use an adjacency matrix to represent the undirected graph, directed
graph, weighted directed graph, and weighted undirected graph.
If adj[i][j] = w, it means that an edge exists from vertex i to vertex j with
weight w.
An entry Aij in the adjacency matrix representation of an undirected graph G will
be 1 if an edge exists between Vi and Vj.
If an undirected graph G consists of n vertices, then the adjacency matrix for that
graph is n x n, and the matrix A = [aij] can be defined as -
aij = 1 {if an edge exists between Vi and Vj}
aij = 0 {otherwise}
It means that, in an adjacency matrix, 0 represents that no edge exists between
the two nodes, whereas 1 represents the existence of an edge between
them.
If there is no self-loop present in the graph, it means that the diagonal entries of
the adjacency matrix will be 0.
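As a small illustration of the definition, the sketch below builds the adjacency matrix for the five-vertex graph given in the Definition section, with vertices A to E and edges (A,B), (B,C), (C,E), (E,D), (D,B), (D,A). The function name `adjacency_matrix` and the `directed` flag are choices made for this note, not part of any library.

```python
def adjacency_matrix(vertices, edges, directed=False):
    """Return an n x n list of lists with 1 where an edge exists, else 0."""
    index = {v: i for i, v in enumerate(vertices)}
    n = len(vertices)
    adj = [[0] * n for _ in range(n)]
    for u, v in edges:
        adj[index[u]][index[v]] = 1
        if not directed:
            adj[index[v]][index[u]] = 1   # undirected: mirror the entry
    return adj


vertices = ["A", "B", "C", "D", "E"]
edges = [("A", "B"), ("B", "C"), ("C", "E"),
         ("E", "D"), ("D", "B"), ("D", "A")]

matrix = adjacency_matrix(vertices, edges)
for name, row in zip(vertices, matrix):
    print(name, row)
# The row for A is [0, 1, 0, 1, 0]: A is adjacent to B and D only.
```

For the undirected case the matrix is symmetric, and because this graph has no self-loops every diagonal entry is 0. Passing `directed=True` stores each pair only in the direction it is written.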
Adjacency matrices
Adjacency matrix for an undirected graph
Now, let's see the adjacency matrix representation of an undirected graph.
In the above figure, an image shows the mapping among the vertices (A, B, C, D,
E), and this mapping is represented by using the adjacency matrix.
There exist different adjacency matrices for the directed and undirected graph. In
a directed graph, an entry Aij will be 1 only when there is an edge directed from
Vi to Vj.
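BFS (Breadth First Search)
We use Queue data structure with maximum size of total number of vertices
in the graph to implement BFS traversal.
o Step 1 - Define a Queue of size total number of vertices in the graph.
o Step 2 - Select any vertex as starting point for traversal. Visit that vertex
and insert it into the Queue.
o Step 3 - Visit all the non-visited adjacent vertices of the vertex which is at
the front of the Queue and insert them into the Queue.
o Step 4 - When there is no new vertex to be visited from the vertex which is
at the front of the Queue, delete that vertex from the Queue.
o Step 5 - Repeat steps 3 and 4 until the Queue becomes empty.
o Step 6 - When queue becomes empty, then produce final spanning tree
by removing unused edges from the graph.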
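The queue-based traversal can be sketched as follows. This is a minimal illustration, not the only way to write it: the function name `bfs` is made up, the graph is assumed to be stored as a dictionary of adjacency lists, and the example reuses the five-vertex graph from the Definition section.

```python
from collections import deque


def bfs(adj, start):
    """Breadth-first traversal; returns visit order and spanning-tree edges."""
    visited = [start]
    seen = {start}
    tree_edges = []
    queue = deque([start])            # visit the start vertex and enqueue it
    while queue:
        u = queue[0]                  # vertex at the front of the queue
        for v in adj[u]:              # visit all non-visited neighbours
            if v not in seen:
                seen.add(v)
                visited.append(v)
                tree_edges.append((u, v))
                queue.append(v)
        queue.popleft()               # nothing new from u: remove it
    return visited, tree_edges


graph = {
    "A": ["B", "D"],
    "B": ["A", "C", "D"],
    "C": ["B", "E"],
    "D": ["A", "B", "E"],
    "E": ["C", "D"],
}
order, tree = bfs(graph, "A")
print(order)  # ['A', 'B', 'D', 'C', 'E']
print(tree)   # [('A', 'B'), ('A', 'D'), ('B', 'C'), ('D', 'E')]
```

The edges that were used to reach a vertex for the first time form the spanning tree. The remaining edges of the graph are the unused ones.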
Example
DFS (Depth First Search)
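DFS traversal of a graph produces a spanning tree as final result. Spanning Tree is a
graph without loops. We use Stack data structure with maximum size of total number
of vertices in the graph to implement DFS traversal.
Step 1 - Define a Stack of size total number of vertices in the graph.
Step 2 - Select any vertex as starting point for traversal. Visit that vertex and push it
on to the Stack.
Step 3 - Visit any one of the non-visited adjacent vertices of a vertex which is at the
top of the Stack and push it on to the Stack.
Step 4 - Repeat step 3 until there is no new vertex to be visited from the vertex which
is at the top of the Stack.
Step 5 - When there is no new vertex to visit then use back tracking and pop one
vertex from the Stack.
Step 6 - Repeat steps 3, 4 and 5 until the Stack becomes empty.
Step 7 - When stack becomes Empty, then produce final spanning tree by removing
unused edges from the graph.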
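The stack-based traversal can be sketched in the same style. Again this is only an illustration under stated assumptions: `dfs` is a made-up name, the graph is a dictionary of adjacency lists, and the example is the five-vertex graph from the Definition section.

```python
def dfs(adj, start):
    """Depth-first traversal; returns visit order and spanning-tree edges."""
    visited = [start]
    seen = {start}
    tree_edges = []
    stack = [start]                   # visit the start vertex and push it
    while stack:
        u = stack[-1]                 # vertex at the top of the stack
        # pick any one non-visited neighbour of the top vertex
        nxt = next((v for v in adj[u] if v not in seen), None)
        if nxt is None:
            stack.pop()               # dead end: backtrack
        else:
            seen.add(nxt)
            visited.append(nxt)
            tree_edges.append((u, nxt))
            stack.append(nxt)
    return visited, tree_edges


graph = {
    "A": ["B", "D"],
    "B": ["A", "C", "D"],
    "C": ["B", "E"],
    "D": ["A", "B", "E"],
    "E": ["C", "D"],
}
order, tree = dfs(graph, "A")
print(order)  # ['A', 'B', 'C', 'E', 'D']
print(tree)   # [('A', 'B'), ('B', 'C'), ('C', 'E'), ('E', 'D')]
```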
Example
Connected Component
Definition
A connected component or simply component of an undirected graph is a subgraph in
which each pair of nodes is connected with each other via a path.
Let’s try to simplify it further, though. A set of nodes forms a connected component in an
undirected graph if any node from the set of nodes can reach any other node by
traversing edges. The main point here is reachability.
In connected components, all the nodes are always reachable from each other.
Let’s name this graph G1(V, E). Here V = {V1, V2, V3, V4, V5, V6} denotes the vertex set
and E = {E1, E2, E3, E4, E5, E6, E7} denotes the edge set of G1. The graph G1 has one
connected component, let’s name it C1, which contains all the vertices of G1. Now let’s
check whether the set C1 holds to the definition or not.
According to the definition, the vertices in the set C1 should reach one another via a
path. We’re choosing two random vertices V1 and V6:
V6 is reachable from V1 via:
E4 E7 or E3 E5 E7 or E1 E2 E6 E7
V1 is reachable from V6 via:
E7 E4 or E7 E5 E3 or E7 E6 E2 E1
The vertices V1 and V6 satisfied the definition, and we could do the same with other
vertex pairs in C1 as well.
More Than One Connected Component
In this example, the undirected graph has three connected components:
V6 is reachable to V4 via:
E2 E6 E7 or E1 E4 E7 or E1 E3 E5 E7
V4 is reachable to V6 via:
E7 E6 E2 or E7 E4 E1 or E7 E5 E3 E1
Now let’s pick the vertices V8 and V9 from the set C2.
V9 is reachable to V8: E9 E8
V8 is reachable to V9: E8 E9
Finally, let’s pick the vertices V11 and V12 from the set C3.
V11 is reachable to V12: E11 E10
V12 is reachable to V11: E10 E11
So from these simple demonstrations, it is clear that C1, C2, and C3 follow the
connected component definition.
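Since the definition rests on reachability, the components of an undirected graph can be found by repeatedly starting a traversal from a vertex that has not been reached yet. The sketch below is an illustration only: the function name is made up, and the small graph is hypothetical because the figures are not reproduced here.

```python
def connected_components(vertices, edges):
    """Return the connected components of an undirected graph as sorted lists."""
    adj = {v: [] for v in vertices}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    seen = set()
    components = []
    for start in vertices:
        if start in seen:
            continue                  # already part of an earlier component
        seen.add(start)
        component = []
        stack = [start]
        while stack:                  # collect everything reachable from start
            u = stack.pop()
            component.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        components.append(sorted(component))
    return components


# Hypothetical graph with seven vertices and three components.
vertices = [1, 2, 3, 4, 5, 6, 7]
edges = [(1, 2), (2, 3), (4, 5), (6, 7)]
print(connected_components(vertices, edges))  # [[1, 2, 3], [4, 5], [6, 7]]
```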
Spanning tree
Now, we will discuss the spanning tree and the minimum spanning tree. But before
moving directly towards the spanning tree, let's first see a brief description of the graph
and its types.
Graph
A graph can be defined as a group of vertices and edges to connect these vertices. The
types of graphs are given as follows -
o Undirected graph: An undirected graph is a graph in which all the edges do not
point to any particular direction, i.e., they are not unidirectional; they are
bidirectional. It can also be defined as a graph with a set of V vertices and a set
of E edges, each edge connecting two different vertices.
o Connected graph: A connected graph is a graph in which a path always exists
from a vertex to any other vertex. A graph is connected if we can reach any
vertex from any other vertex by following edges in either direction.
o Directed graph: Directed graphs are also known as digraphs. A graph is a
directed graph (or digraph) if all the edges present between any vertices or
nodes of the graph are directed or have a defined direction.
Now, let's move towards the topic spanning tree.
o Cluster Analysis
As discussed above, a spanning tree contains the same number of vertices as the
graph. The number of vertices in the above graph is 5; therefore, the spanning tree will
contain 5 vertices. The number of edges in the spanning tree is equal to the number of
vertices in the graph minus 1. So, there will be 4 edges in the spanning tree.
Some of the possible spanning trees that will be created from the above graph are given
as follows -
Properties of spanning-tree
Some of the properties of the spanning tree are given as follows -
o A spanning tree is minimally connected, so removing one edge from the tree
will make the graph disconnected.
o A spanning tree is maximally acyclic, so adding one edge to the tree will create
a loop.
o There can be a maximum of n^(n-2) spanning trees that can be created
from a complete graph with n vertices.
o A spanning tree has n-1 edges, where 'n' is the number of nodes.
o If the graph is a complete graph, then the spanning tree can be constructed by
removing maximum (e-n+1) edges, where 'e' is the number of edges and 'n' is
the number of vertices.
So, a spanning tree is a subset of connected graph G, and there is no spanning tree of
a disconnected graph.
So, the minimum spanning tree that is selected from the above spanning trees for the
given weighted graph is -
o Prim's Algorithm
o Kruskal's Algorithm
Let's see a brief description of both of the algorithms listed above.
Prim's algorithm - It is a greedy algorithm that starts with an empty spanning tree. It is
used to find the minimum spanning tree from the graph. This algorithm finds the subset
of edges that includes every vertex of the graph such that the sum of the weights of the
edges can be minimized.
To learn more about the prim's algorithm, you can click the below link -
https://www.javatpoint.com/prim-algorithm
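A minimal sketch of Prim's algorithm is given below, using Python's standard `heapq` module as the priority queue. It is an illustration under stated assumptions: the function name `prim` is made up, the graph is a dictionary mapping each vertex to a list of (neighbour, weight) pairs, and the weighted graph is hypothetical.

```python
import heapq


def prim(adj, start):
    """Grow a minimum spanning tree from start; returns a list of (u, v, w)."""
    in_tree = {start}
    tree = []
    # candidate edges leaving the tree, ordered by weight
    heap = [(w, start, v) for v, w in adj[start]]
    heapq.heapify(heap)
    while heap and len(in_tree) < len(adj):
        w, u, v = heapq.heappop(heap)     # cheapest candidate edge
        if v in in_tree:
            continue                      # would create a cycle, skip it
        in_tree.add(v)
        tree.append((u, v, w))
        for x, wx in adj[v]:
            if x not in in_tree:
                heapq.heappush(heap, (wx, v, x))
    return tree


# Hypothetical weighted undirected graph.
weighted = {
    "A": [("B", 4), ("C", 1)],
    "B": [("A", 4), ("C", 2), ("D", 5)],
    "C": [("A", 1), ("B", 2), ("D", 8)],
    "D": [("B", 5), ("C", 8)],
}
mst = prim(weighted, "A")
print(mst)                        # [('A', 'C', 1), ('C', 'B', 2), ('B', 'D', 5)]
print(sum(w for _, _, w in mst))  # 8
```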
Kruskal's algorithm - This algorithm is also used to find the minimum spanning tree for
a connected weighted graph. Kruskal's algorithm also follows the greedy approach: at
every stage it adds the cheapest remaining edge that does not create a cycle, making a
locally optimal choice instead of searching over all possible spanning trees.
To learn more about Kruskal's algorithm, you can click the below link -
https://www.javatpoint.com/kruskal-algorithm
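A minimal sketch of Kruskal's algorithm follows. It sorts the edges by weight and uses a simple union-find structure to detect cycles. The function name `kruskal`, the (weight, u, v) edge format, and the example graph are all assumptions made for this note.

```python
def kruskal(vertices, edges):
    """edges is a list of (weight, u, v); returns the chosen (u, v, weight)."""
    parent = {v: v for v in vertices}

    def find(v):
        # follow parent links to the set representative, compressing the path
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree = []
    for w, u, v in sorted(edges):         # cheapest edges first
        root_u, root_v = find(u), find(v)
        if root_u != root_v:              # different sets: no cycle is created
            parent[root_u] = root_v
            tree.append((u, v, w))
    return tree


# Hypothetical weighted undirected graph.
vertices = ["A", "B", "C", "D"]
edges = [(4, "A", "B"), (1, "A", "C"), (2, "B", "C"),
         (5, "B", "D"), (8, "C", "D")]
mst = kruskal(vertices, edges)
print(mst)                        # [('A', 'C', 1), ('B', 'C', 2), ('B', 'D', 5)]
print(sum(w for _, _, w in mst))  # 8
```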
File Structures:
Physical Storage Media
File Organization
The File is a collection of records. Using the primary key, we can access the
records. The type and frequency of access can be determined by the type of file
organization which was used for a given set of records.
File organization is used to describe the way in which the records are stored in
terms of blocks, and the blocks are placed on the storage medium.
The first approach to map the database to the file is to use the several files and
store only one fixed length record in any given file. An alternative approach is to
structure our files so that we can contain multiple lengths for records.
Files of fixed length records are easier to implement than the files of variable
length records.
Fixed-Length Records
Fixed-length records means setting a length and storing the records into the file. If the
record size exceeds the fixed size, it gets divided into more than one block. Due to the
fixed size there occurs following two problems:
1. Partially storing subparts of the record in more than one block requires access
to all the blocks containing the subparts to read or write in it.
2. It is difficult to delete a record in such a file organization. It is because if the size
of the existing record is smaller than the block size, then another record or a part
fills up the block.
However, reserving a certain number of bytes at the beginning of the file solves the
above problems. This reserved area is known as the File Header. The allocated file
header carries a variety of information about the file, such as the address of the first
record. The address of the second record gets stored in the first record and so on. This
process is similar to pointers. The method of insertion and deletion is easy in
fixed-length records because the space freed by a deleted record is exactly the same
as the space required to insert a new record. But this process fails for storing the
records of variable lengths.
Variable-Length Records
Variable-length records are the records that vary in size. It requires the creation of
multiple blocks of multiple sizes to store them. These variable-length records are kept in
the following ways in the database system:
1. Storage of multiple record types in a file.
2. It is kept as Record types that enable repeating fields like multisets or arrays.
3. It is kept as Record types that enable variable lengths either for one field or
more.
In variable-length records, there exist the following two problems:
1. Defining the way of representing a single record so as to extract the individual
attributes easily.
2. Defining the way of storing variable-length records within a block so as to
extract that record in a block easily.
Thus, the representation of a variable-length record can be divided into two parts:
1. An initial part of the record with fixed-length attributes such as numeric values,
dates, fixed-length character attributes for storing their value.
2. The data for variable-length attributes such as varchar type is represented in the
initial part of the record by (offset, length) pair. The offset refers to the place
within the record where the data for that attribute begins, and length refers to the
length of the variable-size attribute. Thus, the initial part stores fixed-size
information about each attribute, whether it is a fixed-length or a variable-length
attribute.
o In the case of modification of any record, it will update the record and then sort
the file, and lastly, the updated record is placed in the right place.
o In this method, files can be easily stored in cheaper storage mechanism like
magnetic tapes.
o This method is used when most of the records have to be accessed like grade
calculation of a student, generating the salary slip, etc.
o Sorted file method takes more time and space for sorting the records.
Indexing is a data structure technique to efficiently retrieve records from the database
files based on some attributes on which the indexing has been done. Indexing in
database systems is similar to what we see in books.
Indexing is defined based on its indexing attributes. Indexing can be of the following
types
Primary Index: A primary index is defined on an ordered data file. The data file is
ordered on a key field. The key field is generally the primary key of the relation.
Secondary Index: A secondary index may be generated from a field which is a
candidate key and has a unique value in every record, or a non-key with duplicate
values.
Clustering Index: A clustering index is defined on an ordered data file. The data file
is ordered on a non-key field.
Ordered Indexing is of two types
Dense Index
Sparse Index
Dense Index
In dense index, there is an index record for every search key value in the database.
This makes searching faster but requires more space to store index records itself. Index
records contain search key value and a pointer to the actual record on the disk.
Sparse Index
In sparse index, index records are not created for every search key. An index record
here contains a search key and an actual pointer to the data on the disk. To search a
record, we first proceed by index record and reach at the actual location of the data. If
the data we are looking for is not where we directly reach by following the index, then
the system starts sequential search until the desired data is found.
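The difference between the two lookups can be sketched as follows. This is a toy in-memory illustration, not how a database stores indices on disk: the records, the block size of 3, and the names `dense_lookup` and `sparse_lookup` are all assumptions. List positions stand in for disk pointers.

```python
from bisect import bisect_right

# Data file sorted on the search key; each entry is (key, payload).
records = [(10, "a"), (20, "b"), (30, "c"), (40, "d"), (50, "e"), (60, "f")]

# Dense index: one entry for every search-key value.
dense_index = {key: pos for pos, (key, _) in enumerate(records)}

# Sparse index: one entry per block of 3 records (first key of each block).
BLOCK = 3
sparse_pos = list(range(0, len(records), BLOCK))     # [0, 3]
sparse_keys = [records[p][0] for p in sparse_pos]    # [10, 40]


def dense_lookup(key):
    """Follow the index entry directly to the record."""
    pos = dense_index.get(key)
    return None if pos is None else records[pos]


def sparse_lookup(key):
    """Find the largest indexed key <= key, then scan sequentially."""
    i = bisect_right(sparse_keys, key) - 1
    if i < 0:
        return None                       # key is smaller than every indexed key
    pos = sparse_pos[i]
    while pos < len(records) and records[pos][0] <= key:
        if records[pos][0] == key:
            return records[pos]
        pos += 1                          # sequential search within the file
    return None


print(dense_lookup(50))   # (50, 'e')
print(sparse_lookup(50))  # (50, 'e')
print(sparse_lookup(35))  # None
```

The dense index holds six entries and the sparse index only two, which reflects the trade-off described above: faster direct access against less space for the index.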
Multilevel Index
Index records comprise search-key values and data pointers. Multilevel index is stored
on the disk along with the actual database files. As the size of the database grows, so
does the size of the indices. There is an immense need to keep the index records in the
main memory so as to speed up the search operations. If single-level index is used,
then a large size index cannot be kept in memory which leads to multiple disk accesses.
Multi-level Index helps in breaking down the index into several smaller indices in order
to make the outermost level so small that it can be saved in a single disk block, which
can easily be accommodated anywhere in the main memory.
Hashing
In a huge database structure, it is very inefficient to search all the index values and
reach the desired data. Hashing technique is used to calculate the direct location of a
data record on the disk without using index structure.
In this technique, data is stored at the data blocks whose address is generated by using
the hashing function. The memory location where these records are stored is known as
data bucket or data blocks.
In this, a hash function can choose any of the column value to generate the address.
Most of the time, the hash function uses the primary key to generate the address of the
data block. A hash function can be anything from a simple mathematical function to a
complex mathematical function. We can even consider the primary key itself as the
address of the data block. That means each row is stored in the data block whose
address is the same as its primary key.
The above diagram shows data block addresses that are the same as the primary key
values. The hash function can also be a simple mathematical function such as
exponential, mod, cos, sin, etc. Suppose we use a mod (5) hash function to determine the
address of the data block. In this case, it applies the mod (5) hash function to the
primary keys and generates 3, 3, 1, 4 and 2 respectively, and the records are stored at
those data block addresses.
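The mod (5) example can be sketched in a few lines. The primary keys below are hypothetical: they are chosen only so that they hash to 3, 3, 1, 4 and 2 as in the text, and are not taken from the diagram. The names NUM_BUCKETS and bucket_address are likewise assumptions for this illustration.

```python
NUM_BUCKETS = 5  # the "mod (5)" hash function from the text


def bucket_address(primary_key):
    """Map a primary key to a data block (bucket) address."""
    return primary_key % NUM_BUCKETS


# Hypothetical primary keys, picked to reproduce the addresses in the text.
keys = [103, 108, 101, 104, 102]

addresses = [bucket_address(k) for k in keys]
print(addresses)  # [3, 3, 1, 4, 2]

# Store each record in the bucket its key hashes to.
buckets = {}
for key in keys:
    buckets.setdefault(bucket_address(key), []).append(key)

print(buckets[3])  # [103, 108]: two keys share one bucket (a collision)
```

Note that two different keys land in bucket 3, so a bucket must be able to hold more than one record or the system needs some collision-handling scheme.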
Primary indices & Secondary indices
Primary Index
o If the index is created on the basis of the primary key of the table, then it is known
as primary indexing. Primary keys are unique to each record and have a 1:1 relation
with the records.
o As primary keys are stored in sorted order, the searching operation is quite
efficient.
o The primary index can be classified into two types: Dense index and Sparse
index.
Secondary Index
o In sparse indexing, as the size of the table grows, the size of the mapping also
grows.
o These mappings are usually kept in primary memory so that address fetches are
faster.
o The actual data is then looked up in secondary memory using the address obtained
from the mapping.
o If the mapping grows too large, fetching the address itself becomes slower. In this
case, the sparse index is no longer efficient.
o To overcome this problem, secondary indexing is introduced.
o In secondary indexing, to reduce the size of mapping, another level of indexing is
introduced.
o In this method, large ranges of column values are selected initially so that the
mapping at the first level stays small.
o Then each range is further divided into smaller ranges.
o The mapping of the first level is stored in the primary memory, so that address
fetch is faster.
o The mapping of the second level and actual data are stored in the secondary
memory (hard disk).
For example:
o If you want to find the record of roll 111 in the diagram, the search looks in the
first-level index for the largest entry that is smaller than or equal to 111. It
gets 100 at this level.
o Then, in the second-level index, it again looks for the largest entry that is smaller
than or equal to 111 and gets 110. Using the address 110, it goes to the data block
and searches each record in turn until it finds 111.
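The two-level search for roll 111 can be sketched as below. Only the values 100, 110 and 111 come from the example above; the remaining entries, the block size of ten and the names first_level, second_level, data_blocks and find are assumptions made for this illustration.

```python
from bisect import bisect_right


def floor_entry(sorted_keys, key):
    """Return the largest entry <= key, or None if there is none."""
    pos = bisect_right(sorted_keys, key) - 1
    return sorted_keys[pos] if pos >= 0 else None


# Hypothetical first-level index (kept in primary memory).
first_level = [100, 200, 300]

# Hypothetical second-level index (kept on disk), one list per first-level range.
second_level = {
    100: [100, 110, 120],
    200: [200, 210, 220],
    300: [300, 310, 320],
}

# Hypothetical data blocks: each block holds ten consecutive roll numbers.
data_blocks = {
    start: list(range(start, start + 10))
    for starts in second_level.values()
    for start in starts
}


def find(roll):
    """Return (first-level entry, second-level entry, record) or None."""
    outer = floor_entry(first_level, roll)
    if outer is None:
        return None
    inner = floor_entry(second_level[outer], roll)
    # Sequential search inside the data block the second level points to.
    for record in data_blocks[inner]:
        if record == roll:
            return (outer, inner, record)
    return None


print(find(111))  # (100, 110, 111)
```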
B+ Tree
o In the B+ tree, leaf nodes hold the actual data pointers. The B+ tree ensures that
all leaf nodes remain at the same height.
o In the B+ tree, the leaf nodes are linked using a linked list. Therefore, a B+ tree
can support random access as well as sequential access.
Structure of B+ Tree
o In the B+ tree, every leaf node is at an equal distance from the root node. The B+
tree is of order n, where n is fixed for a given B+ tree.
Internal node
o Every internal node of the B+ tree, except the root node, contains at least n/2
pointers to child nodes.
Leaf node
o Each leaf node of the B+ tree contains at least n/2 record pointers and n/2 key
values.
o Every leaf node of the B+ tree contains one block pointer P that points to the next
leaf node.
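The linked leaf level is what makes sequential access cheap. The following is a minimal sketch of that idea only: it models the leaves and their next pointers, not the internal nodes, and the class Leaf, the helpers link and range_scan and the key values are assumptions for this illustration. A real B+ tree would first descend from the root to the leaf containing the lower bound instead of starting at the first leaf.

```python
class Leaf:
    """A B+ tree leaf: sorted keys plus a pointer to the next leaf."""

    def __init__(self, keys):
        self.keys = keys
        self.next = None  # the block pointer P to the next leaf


def link(leaves):
    """Chain the leaves left to right and return the first one."""
    for left, right in zip(leaves, leaves[1:]):
        left.next = right
    return leaves[0]


def range_scan(first_leaf, low, high):
    """Collect all keys k with low <= k <= high by walking the leaf chain."""
    result = []
    leaf = first_leaf
    while leaf is not None:
        for k in leaf.keys:
            if k > high:
                return result  # keys are sorted, nothing further can match
            if k >= low:
                result.append(k)
        leaf = leaf.next
    return result


head = link([Leaf([10, 20]), Leaf([30, 40]), Leaf([50, 55]), Leaf([60, 65, 70])])
print(range_scan(head, 35, 60))  # [40, 50, 55, 60]
```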
B+ Tree Insertion
Suppose we want to insert a record 60 in the below structure. It belongs in the 3rd leaf
node, after 55. The tree is balanced and this leaf node is already full, so we cannot
simply insert 60 there.
In this case, we have to split the leaf node so that 60 can be inserted into the tree
without affecting the fill factor, balance and order.
With 60 added, the 3rd leaf node holds the values (50, 55, 60, 65, 70), and its current
key in the intermediate node is 50. We split the leaf node in the middle so that the
balance of the tree is not altered. So we can group (50, 55) and (60, 65, 70) into 2
leaf nodes.
If these two are to be leaf nodes, the intermediate node cannot branch from 50 alone. It
should have 60 added to it, and then we can have a pointer to the new leaf node.
This is how we insert an entry when there is an overflow. In the normal case, it is
easy to find the leaf node where the entry fits and place it there.
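The split described above can be sketched as follows. This is a simplified illustration, not a full B+ tree: it assumes a leaf capacity of four keys (MAX_KEYS), uses a hypothetical parent key list [25, 50, 75], and does not handle the case where the parent itself overflows. The function names split_leaf and insert_into_leaf are invented for the example.

```python
from bisect import insort

MAX_KEYS = 4  # assumed leaf capacity for this sketch


def split_leaf(keys):
    """Split an overflowing sorted leaf in the middle.

    Returns (left, right, separator); the separator is the first key of the
    right leaf and is copied up into the parent.
    """
    mid = len(keys) // 2
    left, right = keys[:mid], keys[mid:]
    return left, right, right[0]


def insert_into_leaf(leaf, parent_keys, key):
    """Insert key into leaf; on overflow, split and update parent_keys."""
    insort(leaf, key)  # keep the leaf sorted
    if len(leaf) <= MAX_KEYS:
        return [leaf]  # normal case: the key simply fits
    left, right, separator = split_leaf(leaf)
    insort(parent_keys, separator)  # parent gains a key for the new leaf
    return [left, right]


parent = [25, 50, 75]  # hypothetical keys in the intermediate node
leaves = insert_into_leaf([50, 55, 65, 70], parent, 60)
print(leaves)  # [[50, 55], [60, 65, 70]]
print(parent)  # [25, 50, 60, 75]
```

The separator 60 is copied into the parent rather than moved, because in a B+ tree every key must still appear in a leaf.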
B+ Tree Deletion
Suppose we want to delete 60 from the above example. In this case, we have to remove 60
from the intermediate node as well as from the 4th leaf node. If we simply remove it
from the intermediate node, the tree no longer satisfies the rules of the B+ tree, so
we need to rearrange it to keep the tree balanced.
After deleting 60 from the above B+ tree and re-arranging the nodes, it will look as
follows:
Indexing and Hashing Comparison
1. Indexing :
2. Hashing :
It calculates the direct location of a data record on disk without using an index
structure.
A good hash function uses a one-way hashing algorithm, so the hash cannot be converted
back into the original key.
In simple words, it is the process of converting a given key into another value, known
as the hash value or simply the hash.
Indexing | Hashing
It offers faster search and retrieval of data to users, helps to reduce table space, makes it possible to quickly retrieve or fetch data, can be used for sorting, etc. | It is faster than searching arrays and lists, provides a more flexible and reliable method of data retrieval than other data structures, can be used for comparing two files for equality, etc.
Its main purpose is to provide a basis for both rapid random lookups and efficient access to ordered records. | Its main purpose is to use a mathematical function to organize data into easily searchable buckets.