
Unit 7

Uploaded by

Ghanashyam Bk

Unit-7

Storage Management and Indexing
COMPILED BY: GHANASHYAM BK
File Organization
The logical relationships among the many records that make up the file, particularly in terms of
means of identification and access to any given record, are referred to as file organization.
A logical relationship between distinct records is referred to as file organization.
This method specifies how file records are mapped to disk blocks.
The word “file organization” refers to the method by which records are organized into blocks
and then placed on a storage media.
Simply put, file organization is the process of storing files in a specific order.
One approach to mapping a database to files is to use several files and to store records of only
one fixed length in any given file.
An alternative is to structure the files so that they can hold records of various lengths. Fixed-length
record files are easier to implement than variable-length record files.
File Organization
In file organization, there are two possible ways of representing the records:
fixed length records
Fixed-length records means choosing one record length and storing every record in the file at that length.
Unless the block size is a multiple of the record size, some records cross block boundaries.
Fixed-length storage therefore raises the following two problems:
A record stored partly in one block and partly in another requires access to both blocks
to read or write it.
It is difficult to delete a record in such a file organization, because the space left by the
deleted record must be filled with another record or marked so that it is ignored.
Reserving a certain number of bytes at the beginning of the file solves these problems. This area
is known as the file header.
The file header carries a variety of information about the file, such as the address of
the first record whose contents have been deleted.
That record in turn stores the address of the next available record, and so on.
These stored addresses act like pointers, and the chain of deleted records forms a free list.
Insertion and deletion are easy with fixed-length records because the space freed by a deleted
record is exactly the space required to insert a new record.
This approach does not work for records of variable length.
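The header-and-pointer idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 16-byte record size, the 4-byte header, the use of slot numbers instead of disk addresses, and the class name FixedLengthFile are all hypothetical choices, not part of the slides.

```python
import struct

RECORD_SIZE = 16                 # assumed fixed record length in bytes
HEADER = struct.Struct("<i")     # file header: slot of the first free record, -1 if none


class FixedLengthFile:
    """Fixed-length record file kept in memory as a bytearray."""

    def __init__(self):
        self.data = bytearray(HEADER.pack(-1))   # header only, empty free list

    def _offset(self, slot):
        return HEADER.size + slot * RECORD_SIZE

    def slot_count(self):
        return (len(self.data) - HEADER.size) // RECORD_SIZE

    def insert(self, payload):
        if len(payload) > RECORD_SIZE:
            raise ValueError("payload longer than the fixed record size")
        record = payload.ljust(RECORD_SIZE, b"\x00")
        (free,) = HEADER.unpack_from(self.data, 0)
        if free == -1:                            # no freed slot: append at the end
            slot = self.slot_count()
            self.data.extend(record)
        else:                                     # reuse the first freed slot
            slot = free
            off = self._offset(slot)
            (next_free,) = HEADER.unpack_from(self.data, off)
            HEADER.pack_into(self.data, 0, next_free)
            self.data[off:off + RECORD_SIZE] = record
        return slot

    def delete(self, slot):
        (free,) = HEADER.unpack_from(self.data, 0)
        off = self._offset(slot)
        # The freed slot stores the previous head of the free list.
        self.data[off:off + RECORD_SIZE] = HEADER.pack(free).ljust(RECORD_SIZE, b"\x00")
        HEADER.pack_into(self.data, 0, slot)

    def read(self, slot):
        off = self._offset(slot)
        return bytes(self.data[off:off + RECORD_SIZE]).rstrip(b"\x00")


f = FixedLengthFile()
slots = [f.insert(name) for name in (b"R1", b"R2", b"R3")]
f.delete(1)                 # slot 1 becomes the head of the free list
reused = f.insert(b"R4")    # the new record takes over the freed slot
```

Because every slot has the same size, a freed slot can always hold the next inserted record, which is why the file does not grow when R4 is inserted.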
File Organization
variable length records
Variable-length records are the records that vary in size.
It requires the creation of multiple blocks of multiple sizes to store them.
Variable-length records arise in a database system in the following ways:
Storage of multiple record types in a file.
Record types that allow variable lengths for one or more fields.
In variable-length records, there exist the following two problems:
Defining the way of representing a single record so as to extract the individual attributes
easily.
Defining the way of storing variable-length records within a block so as to extract that record
in a block easily.
Thus, the representation of a variable-length record can be divided into two parts:
An initial part of the record with fixed-length attributes such as numeric values, dates, fixed-
length character attributes for storing their value.
The data for variable-length attributes such as varchar is represented in the initial part of
the record by an (offset, length) pair.
The offset refers to the place within the record where the data for that attribute begins, and
the length is the length of the variable-size attribute.
Thus, the initial part stores fixed-size information about each attribute, regardless of whether
the attribute is fixed-length or variable-length.
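A small Python sketch of this layout follows. The attribute set (id, salary, name, dept), the field widths, and the function names are assumptions made for illustration; real systems use their own layouts.

```python
import struct

# Fixed-length part: id, salary, then one (offset, length) pair per varchar attribute.
FIXED = struct.Struct("<iiHHHH")


def encode_record(emp_id, salary, name, dept):
    name_b, dept_b = name.encode("utf-8"), dept.encode("utf-8")
    name_off = FIXED.size                 # variable data starts right after the fixed part
    dept_off = name_off + len(name_b)
    head = FIXED.pack(emp_id, salary, name_off, len(name_b), dept_off, len(dept_b))
    return head + name_b + dept_b


def decode_record(buf):
    emp_id, salary, n_off, n_len, d_off, d_len = FIXED.unpack_from(buf, 0)
    name = buf[n_off:n_off + n_len].decode("utf-8")
    dept = buf[d_off:d_off + d_len].decode("utf-8")
    return emp_id, salary, name, dept


record = encode_record(543, 65000, "Jaya", "Physics")
```

Every attribute can be located from the fixed-size initial part alone, so no scanning for delimiters is needed to extract a single field.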
Objectives of File Organization
Records should be selected optimally, i.e., as quickly as possible.
Insert, delete, and update operations on records should be simple and fast.
Inserting, updating, or deleting records should not create duplicate records.
Records should be stored efficiently to minimize storage cost.
Organization of Records in File-
(Heap)
It is the simplest and most basic type of organization.
It works with data blocks.
In heap file organization, the records are inserted at the file's end.
When the records are inserted, it doesn't require the sorting and ordering of records.
When the data block is full, the new record is stored in some other block.
This new data block need not be the very next data block; the DBMS can select any data block in
memory to store new records.
The heap file is also known as an unordered file.
In the file, every record has a unique id, and every page in a file is of the same size. It is the
DBMS responsibility to store and manage the new records.
Insertion of a new record
Suppose we have five records R1, R3, R6, R4, and R5 in a heap, and
we want to insert a new record R2.
If data block 3 is full, R2 is inserted into
any data block selected by the DBMS,
say data block 1.
To search for, update, or delete data in heap file
organization, we must traverse the file from the
start until we reach the requested record.
If the database is very large, searching, updating,
or deleting a record is time-consuming because the
records are neither sorted nor ordered. In the heap
file organization, we may need to check every record
before we find the requested one.
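The behaviour described above can be sketched as follows. The block capacity of two records and the class name HeapFile are assumptions for illustration only; the point is that insertion takes any block with free space, while search is a linear scan.

```python
BLOCK_CAPACITY = 2   # assumed number of records per data block


class HeapFile:
    def __init__(self):
        self.blocks = []             # each block is a list of (key, value) records

    def insert(self, key, value):
        # Any block with free space will do; no ordering is maintained.
        for block_no, block in enumerate(self.blocks):
            if len(block) < BLOCK_CAPACITY:
                block.append((key, value))
                return block_no
        self.blocks.append([(key, value)])
        return len(self.blocks) - 1

    def search(self, key):
        scanned = 0
        for block in self.blocks:
            for k, v in block:
                scanned += 1
                if k == key:
                    return v, scanned
        return None, scanned


heap = HeapFile()
for key in ("R1", "R3", "R6", "R4", "R5"):
    heap.insert(key, "data for " + key)
block_no = heap.insert("R2", "data for R2")
value, scanned = heap.search("R2")
```

Here the search for R2 has to look at all six records, which is the cost the slide warns about for large files.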
Pros and Cons of Heap file
organization
Pros:
It is a very good method of file organization for bulk insertion. If a large amount of data
needs to be loaded into the database at one time, this method is best suited.
For a small database, fetching and retrieving records is faster than with sequential
file organization.
Cons:
This method is inefficient for a large database because searching for or modifying a
record takes time.
It suffers from the problem of unused memory blocks.
Organization of Records in File-
(Sequential)
This method is the easiest method for file organization. In this method, files are stored
sequentially.
This method can be implemented in two ways:
Pile File Method
Sorted File Method
Pile File Method
It is a quite simple method. In this method, we store the record in a sequence, i.e., one after
another.
Here, records are stored in the order in which they are inserted into the table.
To update or delete a record, the record is first searched for in the memory
blocks.
When it is found, it is marked for deletion, and the new record is inserted.
Insertion of the new record:
Suppose we have a sequence of records R1, R3, and so on up to R9 and R8.
Here, a record is simply a row in a table.
Suppose we want to insert a new record R2 in the sequence; it will be placed at the end of
the file.
Sorted File Method
In this method, the new record is first inserted at the end of the file, and the sequence is
then sorted in ascending or descending order.
Sorting of records is based on the primary key or any other key.
When a record is modified, the record is updated and the file is sorted again, so that
the updated record ends up in the right place.
Insertion of the new record:
Suppose there is a preexisting sorted sequence of records R1, R3, and so on up to R6 and
R7.
Suppose a new record R2 has to be inserted in the sequence; it is first inserted at the end
of the file, and then the sequence is sorted.
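The two insertion strategies can be compared in a short Python sketch. Records are represented by their integer keys only (R1 becomes 1, and so on), and the function names are hypothetical. The last function shows what the sorting buys: a sorted file can be searched with binary search.

```python
import bisect


def pile_insert(records, key):
    records.append(key)            # pile file: always append at the end


def sorted_insert(records, key):
    records.append(key)            # first place the record at the end...
    records.sort()                 # ...then re-sort the whole sequence


def sorted_contains(records, key):
    i = bisect.bisect_left(records, key)     # binary search works only on sorted data
    return i < len(records) and records[i] == key


pile = [1, 3, 9, 8]
pile_insert(pile, 2)

ordered = [1, 3, 6, 7]
sorted_insert(ordered, 2)
```

Re-sorting after each insertion follows the slide literally; it is simple but costly, which matches the drawback listed for the sorted file method.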
Pros and Cons of sequential file
organization
Pros:
It is a fast and efficient method for handling huge amounts of data.
In this method, files can be easily stored on cheaper storage media such as magnetic tapes.
It is simple in design. It does not require much effort to store the data.
This method is used when most of the records have to be accessed, for example when calculating
student grades or generating salary slips.
This method is used for report generation or statistical calculations.
Cons:
It wastes time because we cannot jump directly to a required record; we have to
move through the file sequentially.
Sorted file method takes more time and space for sorting the records.
Organization of Records in File-
(Indexed sequential)
Indexed sequential access method (ISAM) is an
advanced sequential file organization.
In this method, records are stored in the file in order
of the primary key.
An index value is generated for each primary key and
mapped to the record.
This index contains the address of the record in the
file.
If a record has to be retrieved based on its index
value, the address of the data block is fetched
and the record is retrieved from that block.
Pros and Cons of ISAM
Pros:
In this method, the index holds the address of each record's data block, so searching for a record
in a huge database is quick and easy.
This method supports range retrieval and partial retrieval of records. Since the index is based
on the primary key values, we can retrieve the data for a given range of values. In the same
way, a partial value can also be searched easily, e.g., student names starting with 'JA'.
Cons:
This method requires extra space in the disk to store the index value.
When the new records are inserted, then these files have to be reconstructed to maintain the
sequence.
When the record is deleted, then the space used by it needs to be released. Otherwise, the
performance of the database will slow down.
Indexing
Indexing is used to optimize the performance of a database by minimizing the number of disk
accesses required when a query is processed.
The index is a type of data structure. It is used to locate and access the data in a database table
quickly.
Index structure:
Indexes can be created using one or more database columns.
The first column of the index is the search key, which contains a copy of the primary key or
candidate key of the table. The search-key values are stored in sorted order so that the
corresponding data can be accessed easily.
The second column of the index is the data reference. It contains a set of pointers holding
the address of the disk block where the value of the particular key can be found.
Indexing Methods
Ordered indices
The indices are usually sorted to make searching faster.
The indices which are sorted are known as ordered indices.
Example:
Suppose we have an employee table with thousands of records, each of which is 10 bytes
long. Suppose their IDs start with 1, 2, 3, and so on, and we have to find the employee with ID 543.
In the case of a database with no index, we have to scan the disk blocks from the start until we
reach 543. The DBMS will reach the record after reading 543*10 = 5430 bytes.
In the case of an index, assuming each index entry is 2 bytes long, the DBMS will reach the record
after reading 542*2 = 1084 bytes, which is far less than in the previous case.
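The arithmetic of this example can be checked with a few lines of Python. The 2-byte index entry is an assumption implied by the factor 2 in the example; the function names are hypothetical.

```python
RECORD_SIZE = 10        # bytes per employee record (from the example)
INDEX_ENTRY_SIZE = 2    # bytes per index entry (assumption implied by 542*2)


def bytes_read_without_index(target_id):
    # Scan records 1..target_id from the start of the file.
    return target_id * RECORD_SIZE


def bytes_read_with_index(target_id):
    # Skip over the index entries that precede the target's entry.
    return (target_id - 1) * INDEX_ENTRY_SIZE


no_index = bytes_read_without_index(543)
with_index = bytes_read_with_index(543)
```

With these sizes the index scan touches roughly one fifth of the bytes, and the gap widens as records get larger relative to index entries.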
Primary Index
If the index is created on the basis of the primary key of the table, then it is known as primary
indexing.
These primary keys are unique to each record, so there is a 1:1 relation between keys and records.
As primary keys are stored in sorted order, the performance of the searching operation is quite
efficient.
The primary index can be classified into two types:
Dense index and
Sparse index.
Primary Index- (Dense)
The dense index contains an index record for every search key value in the data file. It makes
searching faster.
In this, the number of records in the index table is same as the number of records in the main
table.
It needs more space to store index record itself. The index records have the search key and a
pointer to the actual record on the disk.
Primary Index- (Sparse)
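In a sparse index, index records appear for only some of the search-key values in the data file.
Each index record points to a block.
Instead of pointing to every record in the main table, the index points to records in
the main table at intervals.
To locate a record, we find the index record with the largest search-key value that is less than or
equal to the value we are looking for, and scan sequentially from the place it points to.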
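The difference between the two kinds of primary index can be sketched in Python. The data file contents, the three-record block size, and the names below are all made up for illustration.

```python
import bisect

# Data file: blocks of (key, value) records sorted by key (hypothetical contents).
BLOCKS = [
    [(10, "A"), (20, "B"), (30, "C")],
    [(40, "D"), (50, "E"), (60, "F")],
    [(70, "G"), (80, "H"), (90, "I")],
]

# Dense index: one entry per search-key value -> (block number, slot).
dense_index = {key: (b, s)
               for b, block in enumerate(BLOCKS)
               for s, (key, _) in enumerate(block)}

# Sparse index: one entry per block, holding the first key in that block.
sparse_keys = [block[0][0] for block in BLOCKS]


def dense_lookup(key):
    pos = dense_index.get(key)
    if pos is None:
        return None
    b, s = pos
    return BLOCKS[b][s][1]


def sparse_lookup(key):
    b = bisect.bisect_right(sparse_keys, key) - 1   # largest indexed key <= key
    if b < 0:
        return None
    for k, v in BLOCKS[b]:                          # sequential scan inside the block
        if k == key:
            return v
    return None
```

The dense index needs nine entries for nine records, while the sparse index needs only three, at the price of a short scan inside the chosen block.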
Clustering Index
A clustered index can be defined as an ordered data file. Sometimes the index is created on
non-primary key columns which may not be unique for each record.
In this case, to identify the record faster, we group two or more columns to get a unique
value and create an index out of them. This method is called a clustering index.
Records that have similar characteristics are grouped, and indexes are created for these
groups.
Example:
suppose a company contains several employees in each department.
Suppose we use a clustering index, where all employees which belong to the same Dept_ID are
considered within a single cluster, and index pointers point to the cluster as a whole.
Here Dept_Id is a non-unique key.
If one disk block is shared by records that belong to different clusters, the scheme is a little confusing.
Using a separate disk block for each cluster is considered the better technique.
Secondary Index
In the sparse indexing, as the size of the table grows, the size of mapping also grows.
These mappings are usually kept in primary memory so that the address fetch is faster.
The actual data is then looked up in secondary memory using the address obtained from the mapping.
If the mapping grows large, fetching the address itself becomes slower.
In this case, the sparse index will not be efficient. To overcome this problem, secondary indexing
is introduced.
In secondary indexing, to reduce the size of mapping, another level of indexing is introduced.
In this method, large ranges of the column values are selected initially so that the mapping size of
the first level stays small.
Then each range is further divided into smaller ranges. The mapping of the first level is stored in
the primary memory, so that address fetch is faster.
The mapping of the second level and actual data are stored in the secondary memory (hard disk).
Secondary Index- (Example)
To find the record with roll number 111 in the diagram, the search
looks in the first-level index for the largest entry that is smaller
than or equal to 111. It gets 100 at this level.
Then, in the second-level index, it again looks for the largest entry
that is smaller than or equal to 111 and gets 110. Using the address
stored with 110, it goes to the data block and scans each record until
it finds 111.
This is how a search is performed in this method. Inserting,
updating, or deleting is done in the same manner.
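The two-level search can be sketched in Python as below. Only the values 100, 110, and 111 come from the example; every other key, the block labels such as "s1" and "d3", and the function names are hypothetical.

```python
import bisect


def floor_entry(entries, key):
    """Return the (k, target) pair with the largest k <= key, or None."""
    keys = [k for k, _ in entries]
    i = bisect.bisect_right(keys, key) - 1
    return entries[i] if i >= 0 else None


# Hypothetical contents; only the values 100, 110 and 111 come from the example.
FIRST_LEVEL = [(1, "s0"), (100, "s1"), (200, "s2")]            # kept in primary memory
SECOND_LEVEL = {
    "s0": [(1, "d0"), (50, "d1")],
    "s1": [(100, "d2"), (110, "d3"), (120, "d4")],
    "s2": [(200, "d5")],
}
DATA_BLOCKS = {
    "d0": [1, 2], "d1": [50, 51], "d2": [100, 105],
    "d3": [110, 111, 112], "d4": [120, 125], "d5": [200, 210],
}


def lookup(roll):
    first = floor_entry(FIRST_LEVEL, roll)
    if first is None:
        return None
    second = floor_entry(SECOND_LEVEL[first[1]], roll)
    if second is None:
        return None
    found = roll in DATA_BLOCKS[second[1]]     # sequential scan of one data block
    return first[0], second[0], found
```

Calling lookup(111) follows the same path as the example: entry 100 in the first level, entry 110 in the second level, then a scan of one data block.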
B+ Tree Index Files
The B+ tree is a balanced multiway search tree (not a binary tree: a node can have more than two
children). It follows a multi-level index format.
In the B+ tree, leaf nodes hold the actual data pointers. The B+ tree ensures that all leaf nodes
remain at the same height.
In the B+ tree, the leaf nodes are linked using a linked list. Therefore, a B+ tree can support
random access as well as sequential access.
Structure of B+ Tree
In the B+ tree, every leaf node is at an equal distance from the root node. A B+ tree is of
order n, where n is fixed for a given B+ tree.
It contains internal nodes and leaf nodes.
B+ Tree Index Files
Internal node
An internal node of the B+ tree, except the root node, contains at least n/2 pointers to child nodes.
At most, an internal node of the tree contains n pointers.
Leaf node
A leaf node of the B+ tree contains at least n/2 record pointers and n/2 key values.
At most, a leaf node contains n record pointers and n key values.
Every leaf node of the B+ tree contains one block pointer P that points to the next leaf node.
Searching a record in B+ Tree
Suppose we have to search for 55 in the B+ tree structure below.
First, we look in the intermediary node for the branch that leads to the leaf node that can contain
a record for 55.
In the intermediary node, 55 falls on the branch between the keys 50 and 75.
That branch leads to the third leaf node. There the DBMS performs a
sequential search to find 55.
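The search procedure can be sketched in Python. The tree layout below is assumed: only the keys 50, 55, and 75 and the fact that 55 sits in the third leaf come from the example, and the class names Leaf and Internal are hypothetical.

```python
from bisect import bisect_right


class Leaf:
    def __init__(self, keys):
        self.keys = keys
        self.next = None          # pointer to the next leaf


class Internal:
    def __init__(self, keys, children):
        self.keys = keys          # len(children) == len(keys) + 1
        self.children = children


def search(node, key):
    while isinstance(node, Internal):
        # Follow the branch whose key range contains `key`.
        node = node.children[bisect_right(node.keys, key)]
    return key in node.keys, node         # sequential search inside the leaf


# Assumed layout: only the keys 50, 55 and 75 come from the example.
leaves = [Leaf([10, 15, 20]), Leaf([25, 30, 35]), Leaf([50, 55, 65, 70]), Leaf([75, 80, 85])]
for left, right in zip(leaves, leaves[1:]):
    left.next = right
root = Internal([25, 50, 75], leaves)

found, leaf = search(root, 55)
```

The next pointers are what make sequential access possible: starting from any leaf, a range scan simply follows the chain instead of going back to the root.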
B+ Tree Insertion
Suppose we want to insert a record 60 in the structure below.
It belongs in the 3rd leaf node, after 55. The tree is balanced, and this leaf node is
already full, so we cannot insert 60 there.
In this case, we have to split the leaf node so that the record can be inserted into the tree without
affecting the fill factor, balance, and order.
With 60 added, the 3rd leaf node would hold the values (50, 55, 60, 65, 70), and the key in the
intermediate node that leads to it is 50.
We split the leaf node in the middle so that the balance of the tree is not altered. So we can
group (50, 55) and (60, 65, 70) into 2 leaf nodes.
If these two are to be leaf nodes, the intermediate node cannot branch from 50 alone. It should have
60 added to it, and then we can have a pointer to the new leaf node.

This is how we can insert an entry when there is overflow. In a normal scenario, it is very easy
to find the node where it fits and then place it in that leaf node.
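The leaf split in this example can be sketched in Python. This is a simplified illustration: the leaf capacity of 4 keys and the parent keys 25, 50, 75 are assumptions, the function names are hypothetical, and a split of the parent node itself is not handled.

```python
import bisect


def split_leaf(keys):
    """Split an overfull, sorted leaf in the middle."""
    mid = len(keys) // 2
    left, right = keys[:mid], keys[mid:]
    return left, right, right[0]       # right[0] is copied up into the parent


def insert_into_leaf(parent_keys, leaf, key, capacity):
    bisect.insort(leaf, key)
    if len(leaf) <= capacity:
        return [leaf]                  # normal case: the key simply fits
    left, right, separator = split_leaf(leaf)
    bisect.insort(parent_keys, separator)
    return [left, right]


parent_keys = [25, 50, 75]             # assumed intermediate node
third_leaf = [50, 55, 65, 70]          # full leaf, assuming a capacity of 4 keys
new_leaves = insert_into_leaf(parent_keys, third_leaf, 60, capacity=4)
```

After the call, the leaf has been split into (50, 55) and (60, 65, 70), and 60 has been added to the intermediate node, as described above.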
B+ Tree Deletion
Suppose we want to delete 60 from the above example.
In this case, we have to remove 60 from the intermediate node as well as from the 4th leaf
node too.
If we remove it from the intermediate node, then the tree will not satisfy the rule of the B+
tree. So we need to modify it to have a balanced tree.
After deleting node 60 from above B+ tree and re-arranging the nodes, it will show as follows:
Hash Indices
Hashing technique is used to calculate the direct location of a data record on the disk without
using index structure.
In this technique, data is stored at the data blocks whose address is generated by using the
hashing function.
The memory locations where these records are stored are known as data buckets or data blocks.
A hash function can use any column value to generate the address.
Most of the time, the hash function uses the primary key to generate the address of the data
block.
A hash function can be anything from a simple mathematical function to a complex one.
We can even use the primary key itself as the address of the data block. That means each
row is stored in the data block whose address is the same as its primary key.
Hash Indices
The diagram shows data block addresses that are the same as
the primary key values.
The hash function can also be a simple mathematical
function such as exponential, mod, cos, or sin.
Suppose we use a mod (5) hash function to determine
the address of the data block.
In this case, applying the mod (5) hash function to the
primary keys generates 3, 3, 1, 4, and 2 respectively,
and the records are stored at those data block addresses.
Hash Indices
Types of Hashing
Static Hashing
In static hashing, the resultant data bucket address will
always be the same.
That means if we generate an address for EMP_ID =103
using the hash function mod (5) then it will always result in
same bucket address 3.
Here, there will be no change in the bucket address.
Hence in this static hashing, the number of data buckets in
memory remains constant throughout.
In this example, we will have five data buckets in the
memory used to store the data.
Operations of Static Hashing
Searching a record
When a record needs to be searched, the same hash function is applied to retrieve the address of
the bucket where the data is stored.
Insert a Record
When a new record is inserted into the table, an address is generated for the new
record based on the hash key, and the record is stored at that location.
Delete a Record
To delete a record, we first fetch the record that is to be deleted. Then we
delete the record stored at that address in memory.
Update a Record
To update a record, we first search for it using the hash function, and then the data record is
updated.
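The four operations can be sketched in Python with the mod (5) hash function. The class name StaticHashFile and the employee IDs other than 103 are hypothetical, and bucket overflow is deliberately ignored here.

```python
NUM_BUCKETS = 5      # fixed for the lifetime of the file


def bucket_address(key):
    return key % NUM_BUCKETS


class StaticHashFile:
    def __init__(self):
        # Each bucket maps key -> record; overflow is ignored in this sketch.
        self.buckets = [{} for _ in range(NUM_BUCKETS)]

    def insert(self, key, record):
        self.buckets[bucket_address(key)][key] = record

    def search(self, key):
        return self.buckets[bucket_address(key)].get(key)

    def update(self, key, record):
        bucket = self.buckets[bucket_address(key)]
        if key not in bucket:
            raise KeyError(key)
        bucket[key] = record

    def delete(self, key):
        return self.buckets[bucket_address(key)].pop(key, None)


emp = StaticHashFile()
for emp_id in (98, 103, 101, 104, 102):      # hypothetical keys
    emp.insert(emp_id, "employee %d" % emp_id)
addresses = [bucket_address(k) for k in (98, 103, 101, 104, 102)]
```

Every operation starts by recomputing the same address from the key, which is why EMP_ID 103 always lands in bucket 3.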
Static Hashing
Suppose we want to insert a new record into the file, but the data bucket address
generated by the hash function is not empty, i.e., data already exists at that
address. This situation in static hashing is known as bucket overflow. It is a
critical situation in this method.
There are various methods to overcome this situation. Some commonly used
methods are as follows:
Open Hashing
When a hash function generates an address at which data is already stored,
the next available bucket is allocated to the record. This mechanism is called linear
probing.
For example, suppose R3 is a new record that needs to be inserted, and the hash
function generates the address 112 for R3. But the bucket at the generated address is
already full. So the system searches for the next available data bucket, 113, and assigns R3 to it.
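Linear probing can be sketched in Python as follows. The table size of 5, the keys 12 and 7, and the class name are assumptions for illustration; each bucket holds a single record, and deletion is left out because it needs extra bookkeeping.

```python
class LinearProbingTable:
    def __init__(self, size):
        self.slots = [None] * size     # each bucket holds at most one (key, record)

    def insert(self, key, record):
        size = len(self.slots)
        start = key % size
        for step in range(size):
            i = (start + step) % size              # wrap around at the end
            if self.slots[i] is None or self.slots[i][0] == key:
                self.slots[i] = (key, record)
                return i
        raise OverflowError("every bucket is full")

    def search(self, key):
        size = len(self.slots)
        start = key % size
        for step in range(size):
            slot = self.slots[(start + step) % size]
            if slot is None:
                return None                        # an empty bucket ends the probe
            if slot[0] == key:
                return slot[1]
        return None


table = LinearProbingTable(5)
first = table.insert(12, "R1")    # 12 % 5 == 2
second = table.insert(7, "R3")    # 7 % 5 == 2 is taken, so bucket 3 is used
```

The second insert mirrors the example: the generated address is occupied, so the record moves to the next free bucket.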
Static Hashing
Closed Hashing
When buckets are full, a new data bucket is allocated for the same hash result and is
linked after the previous one. This mechanism is known as overflow chaining.
For example, suppose R3 is a new record that needs to be inserted into the table, and the hash
function generates the address 110 for it. But this bucket is too full to store the new data. In this
case, a new bucket is added at the end of bucket 110 and is linked to it.
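Overflow chaining can be sketched in Python. The bucket capacity of two records, the five primary buckets, and the class names are assumptions for illustration.

```python
BUCKET_CAPACITY = 2       # assumed records per bucket


class Bucket:
    def __init__(self):
        self.records = []
        self.overflow = None          # link to the next bucket in the chain


class ChainedHashFile:
    def __init__(self, num_buckets):
        self.buckets = [Bucket() for _ in range(num_buckets)]

    def insert(self, key, record):
        bucket = self.buckets[key % len(self.buckets)]
        while len(bucket.records) >= BUCKET_CAPACITY:
            if bucket.overflow is None:
                bucket.overflow = Bucket()         # allocate and link a new bucket
            bucket = bucket.overflow
        bucket.records.append((key, record))

    def search(self, key):
        bucket = self.buckets[key % len(self.buckets)]
        while bucket is not None:
            for k, record in bucket.records:
                if k == key:
                    return record
            bucket = bucket.overflow
        return None

    def chain_length(self, address):
        length, bucket = 0, self.buckets[address]
        while bucket is not None:
            length, bucket = length + 1, bucket.overflow
        return length


chained = ChainedHashFile(5)
for key, name in ((0, "R1"), (5, "R2"), (10, "R3")):
    chained.insert(key, name)
```

All three keys hash to address 0, so the third record goes into a newly linked overflow bucket, and a search for it has to walk the chain.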
Dynamic Hashing
The dynamic hashing method is used to overcome the problems of static hashing like bucket
overflow.
In this method, data buckets grow or shrink as the number of records increases or decreases. This
method is also known as the extendible hashing method.
This method makes hashing dynamic, i.e., it allows insertion or deletion without resulting in
poor performance.
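One common way to realize dynamic hashing is a directory of bucket pointers that doubles when a bucket overflows. The sketch below is a minimal, assumption-laden illustration of that idea: it uses the low-order bits of integer keys, a bucket capacity of 2, handles growth only (no shrinking), and would loop forever if more than two keys shared all their hash bits. The class and method names are hypothetical.

```python
BUCKET_CAPACITY = 2      # assumed records per bucket


class Bucket:
    def __init__(self, depth):
        self.local_depth = depth     # number of hash bits this bucket distinguishes
        self.items = {}


class ExtendibleHash:
    def __init__(self):
        self.global_depth = 1
        self.directory = [Bucket(1), Bucket(1)]      # bucket address table

    def _index(self, key):
        # Use the low-order global_depth bits of the hash value.
        return hash(key) & ((1 << self.global_depth) - 1)

    def get(self, key):
        return self.directory[self._index(key)].items.get(key)

    def put(self, key, value):
        while True:
            bucket = self.directory[self._index(key)]
            if key in bucket.items or len(bucket.items) < BUCKET_CAPACITY:
                bucket.items[key] = value
                return
            self._split(bucket)                      # overflow: split and retry

    def _split(self, bucket):
        if bucket.local_depth == self.global_depth:
            self.directory = self.directory + self.directory   # double the table
            self.global_depth += 1
        bucket.local_depth += 1
        sibling = Bucket(bucket.local_depth)
        high_bit = 1 << (bucket.local_depth - 1)
        for i, b in enumerate(self.directory):
            if b is bucket and i & high_bit:
                self.directory[i] = sibling          # half the entries move to the sibling
        old_items, bucket.items = bucket.items, {}
        for k, v in old_items.items():               # redistribute the records
            self.directory[self._index(k)].items[k] = v


table = ExtendibleHash()
for k in range(16):
    table.put(k, "R%d" % k)
```

Only the overflowing bucket is split and only its records move, which is why insertion cost stays low as the file grows; the price is the bucket address table that has to be maintained.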
Advantages of dynamic hashing
In this method, the performance does not decrease as the data grows in the system. It simply
increases the size of memory to accommodate the data.
In this method, memory is well utilized as it grows and shrinks with the data. No memory is
left lying unused.
This method is good for the dynamic database where data grows and shrinks frequently.
Disadvantages of dynamic
hashing
In this method, if the data size increases, the bucket size also increases. The addresses
of the data are maintained in the bucket address table, because the data addresses keep
changing as buckets grow and shrink. If there is a huge increase in data, maintaining the bucket
address table becomes tedious.
Bucket overflow can still occur in this case, although it may take longer to reach
this situation than with static hashing.
