DBMS - Unit 5 Notes
Data on External Storage, File Organization and Indexing, Cluster Indexes, Primary and Secondary Indexes,
Index Data Structures, Hash-Based Indexing, Tree-Based Indexing, Comparison of File Organizations,
Indexes and Performance Tuning, Intuitions for tree Indexes, Indexed Sequential Access Methods (ISAM),
B+ Trees: A Dynamic Index Structure
_______________________________________________________________________________________________
● The disk space manager is responsible for keeping track of available disk space.
● The file manager, which provides the abstraction of a file of records to higher levels of
DBMS code, issues requests to the disk space manager to obtain and relinquish space on
disk.
Storage Manager Component:
A Storage Manager is a component or program module that provides the interface between the low-level
data stored in the database and the application programs/queries submitted to
the system. The Storage Manager Components include –
1. File Manager – The file manager manages the file space and takes care of the structure of
the file. It manages the allocation of space on disk storage and the data structures used to
represent information stored on other media.
2. Buffer Manager – It transfers blocks between disk (or other devices) and main
memory. DMA (Direct Memory Access) is a form of input/output that controls the
exchange of blocks. When the processor receives a request for the transfer of a
block, it sends it to the DMA controller, which transfers the block without interrupting the CPU.
3. Authorization and Integrity Manager – This Component of storage manager checks
for the authority of the users to access and modify information, as well as integrity
constraints (keys, etc).
4. Disk Manager- The block requested by the file manager is transferred by the
Disk Manager.
Memory Hierarchy:
At the top, we have primary storage, which consists of cache and main memory , and provides very fast access
to data. then comes secondary storage, which consists of slower devices such as magnetic disks. tertiary storage
is the slowest class of storage devices; for example, optical disks and tapes.
Primary Storage:
1. At the primary storage level, the memory hierarchy includes, at the most expensive end, cache
memory, which is static RAM (Random Access Memory). Cache memory is mainly used by the
CPU to speed up the execution of programs.
2. The next level of primary storage is DRAM (Dynamic Random Access Memory), which
provides the main work area for the CPU for keeping programs and data, and is popularly
called main memory.
3. The advantage of DRAM is its low cost, which continues to decrease; the drawbacks
are its volatility and its lower speed compared with static RAM.
Secondary Storage:
At the secondary storage level, the hierarchy includes magnetic disks, as well as storage in the form of
CD-ROM (Compact Disk - Read Only Memory) devices.
Secondary storage devices are used to store data for future use or as backup. Secondary storage includes
memory devices that are not a part of the CPU chipset or motherboard, for example, magnetic disks,
optical disks (DVD, CD, etc.), hard disks, flash drives, and magnetic tapes.
Tertiary storage:
At the tertiary storage level, the hierarchy includes optical disks and tapes at the least expensive end.
Storage capacity anywhere in the hierarchy is measured in kilobytes (Kbyte, 1000 bytes),
megabytes (Mbyte, 1 million bytes), gigabytes (Gbyte, 1 billion bytes), and even terabytes (1000
Gbytes).
Explanation:
DRAM:
Programs reside and execute in DRAM. Generally, large permanent databases reside on secondary storage, and
portions of the database are read into and written from buffers in main memory as needed. Personal computers
and workstations have tens of megabytes of data in DRAM. It has become possible to load a large fraction of some
databases into main memory; an example is telephone switching applications, which store databases that contain
routing and line information in main memory.
Flash Memory:
1. Between DRAM and magnetic disk storage, another form of memory resides: flash
memory, which is becoming common, particularly because it is non-volatile.
2. Flash memories are high-density, high-performance memories using EEPROM
(Electrically Erasable Programmable Read-Only Memory) technology.
3. The advantage of flash memory is its fast access speed.
4. The disadvantage is that an entire block must be erased and written over at a time.
Magnetic disk storage:
● Primary medium for long-term storage.
● Typically the entire database is stored on disk.
● Data must be moved from disk to main memory in order for the data to be operated on.
● After operations are performed, data must be copied back to disk if any changes were
made.
● Disk storage is called direct access storage as it is possible to read data on the disk in
any order (unlike sequential access).
● Disk storage usually survives power failures and system crashes.
Access time: the time it takes from when a read or write request is issued to when data transfer begins.
Data-transfer rate– the rate at which data can be retrieved from or stored to the disk.
Mean time to failure (MTTF)– the average time the disk is expected to run
continuously without any failure.
CD-ROM:
CD-ROM disks store data optically and are read by a laser. CD-ROMs contain pre-recorded data that
cannot be overwritten. WORM (Write-Once-Read-Many) disks are a form of optical storage used for
archiving data; they allow data to be written once and read any number of times without the possibility of
erasing. The DVD (Digital Video Disk) is a more recent standard for optical disks, allowing 4.5 to 15
gigabytes of storage per disk.
Tapes:
1. Tapes are relatively inexpensive and can store very large amounts of data. They are used when we
maintain data for a long period but do not expect to access it very often.
Storing files in a certain order is called file organization. The main objective of file organization is
to store records so that they can be inserted, deleted and retrieved efficiently.
Sequential File Organization:
In sequential file organization, records are placed in the file in some sequential order based on a unique key
field or search key.
The easiest method of file organization is the sequential method. In this method the records are stored one after
another in a sequential manner. There are two ways to implement this method:
1. Pile File Method
2. Sorted File Method
1. Pile File Method – This method is quite simple: we store the records in a
sequence, i.e. one after another, in the order in which they are inserted into the tables.
Insertion of new record –
Let R1, R3, R5 and R4 be four records in the sequence. Here, a record is
nothing but a row in a table. Suppose a new record R2 has to be inserted in the sequence; then
it is simply placed at the end of the file.
2. Sorted File Method – In this method, as the name itself suggests, whenever a new record
has to be inserted, it is always inserted in a sorted (ascending or descending) manner.
Sorting of records may be based on the primary key or on any other key.
Cons –
● Time is wasted because we cannot jump directly to the particular record that is required; we
have to move through the file in a sequential manner, which takes time.
● The sorted file method is inefficient, as it takes time and space to keep the records sorted.
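A minimal sketch of the two sequential methods (the record names and key values are illustrative, not from the notes): the pile file simply appends new records at the end, while the sorted file must insert each new record at its key's position.

```python
import bisect

# Pile file method: new records are simply appended at the end of the file.
pile_file = ["R1", "R3", "R5", "R4"]
pile_file.append("R2")           # R2 goes to the end; arrival order is kept
print(pile_file)                 # ['R1', 'R3', 'R5', 'R4', 'R2']

# Sorted file method: records are kept ordered on the key, so every insert
# must first find the correct position (and may shift later records).
sorted_file = [1, 3, 4, 5]       # key values already in ascending order
bisect.insort(sorted_file, 2)    # insert key 2 at its sorted position
print(sorted_file)               # [1, 2, 3, 4, 5]
```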
Heap File Organization:
When a file is created using heap file organization, the operating system allocates a memory area to that
file without any further accounting details. File records can be placed anywhere in that memory area.
Heap file organization works with data blocks. In this method records are inserted at the end of the file,
into the data blocks. No sorting or ordering is required in this method. If a data block is full, the new
record is stored in some other block. Here the other data block need not be the very next data block; it
can be any block in the memory. It is the responsibility of the DBMS to store and manage the new records.
Pros –
● Fetching and retrieving records is faster than in sequential organization, but only in the case of small
databases.
● When a huge amount of data needs to be loaded into the database at one time,
this method of file organization is best suited.
Cons –
● Searching a record is inefficient for large databases, since the data blocks have to be scanned one by
one until the record is found.
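A minimal heap-file sketch, assuming a small block capacity chosen only for illustration: a new record goes into any block with free space, no ordering is maintained, and a search must scan block by block.

```python
BLOCK_CAPACITY = 4               # assumed number of records per data block

# Heap file: a list of data blocks; a new record goes into any block that
# still has room, otherwise a fresh block is allocated. No ordering is kept.
heap_file = [[]]

def heap_insert(record):
    for block in heap_file:              # any block with space will do
        if len(block) < BLOCK_CAPACITY:
            block.append(record)
            return
    heap_file.append([record])           # all blocks full: allocate a new one

def heap_search(key, key_of):
    # Every block must be scanned; this is why search is slow for big files.
    for block in heap_file:
        for record in block:
            if key_of(record) == key:
                return record
    return None

for r in [(7, "A"), (3, "B"), (9, "C"), (1, "D"), (5, "E")]:
    heap_insert(r)
print(heap_search(9, key_of=lambda rec: rec[0]))   # (9, 'C')
```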
Hash File Organization:
In this method of file organization, a hash function is used to calculate the address of the block in which a
record is to be stored.
The hash function is applied to some columns/attributes – either key or non-key columns – to get the
block address.
Hence each record is stored randomly, irrespective of the order in which records arrive; this method is
therefore also known as direct or random file organization.
If the hash function is applied to a key column, then that column is called the hash key; if it is applied
to a non-key column, then that column is called the hash column.
When a record has to be retrieved, the address is generated from the hash key column and the whole record
is retrieved directly from that address; there is no need to traverse the whole file. Similarly, when a new
record has to be inserted, the address is generated from the hash key and the record is directly inserted. The
same is the case with update and delete.
● Records need not be sorted after any transaction; hence the effort of sorting is
reduced in this method.
● Since the block address is known from the hash function, accessing any record is very fast
(see the sketch below). Similarly, updating or deleting a record is also very quick.
● This method can handle multiple transactions, as each record is independent of the others;
since there is no dependency on storage location for each record, multiple records
can be accessed at the same time.
● It is suitable for online transaction systems like online banking, ticket booking systems,
etc.
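A minimal sketch of hash file organization, assuming a mod hash function on the key column and a block count picked only for illustration: the block address is computed directly from the hash key, so insert and fetch touch a single block.

```python
NUM_BLOCKS = 5                    # assumed number of data blocks

# Hash file organization: the block address is computed directly from the
# hash key, so insert, fetch, update and delete avoid a full file scan.
blocks = [[] for _ in range(NUM_BLOCKS)]

def block_address(hash_key):
    return hash_key % NUM_BLOCKS          # simple mod hash function

def insert(record, hash_key):
    blocks[block_address(hash_key)].append(record)

def fetch(hash_key):
    # Only the one block produced by the hash function is examined.
    addr = block_address(hash_key)
    return [r for r in blocks[addr] if r["rno"] == hash_key]

insert({"rno": 103, "name": "Asha"}, 103)
insert({"rno": 217, "name": "Ravi"}, 217)
print(fetch(217))    # looked up directly in block 217 % 5 == 2
```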
Cluster File Organization:
In this method, two or more tables which are frequently joined to get results are stored in the same file,
called a cluster. These files have two or more tables in the same data block, and the key columns which map
these tables are stored only once. This method hence reduces the cost of searching for the related records in
different files. All the records are found at one place, making the search efficient.
Comparison of file organizations:
The operations to be considered for the comparison of file organizations are:
● Scan: fetch all records in the file.
● Search with an equality selection.
● Search with a range selection.
● Insert a record.
● Delete a record.
Indexing in DBMS
Indexing is used to optimize the performance of a database by minimizing the number of disk accesses
required when a query is processed. An index is a small table having two columns:
o The first column of the index is the search key: it contains a copy of the primary
key or a candidate key of the table. These values are stored in sorted
order so that the corresponding data can be accessed easily.
o The second column of the index is the data reference. It contains a set of pointers
holding the address of the disk block where the value of the particular key can be
found.
Ordered indices
The indices are usually sorted to make searching faster. The indices which are sorted are
known as ordered indices.
Example: Suppose we have an employee table with thousands of records, each of which is
10 bytes long. If the IDs start with 1, 2, 3 and so on, and we have to search for the employee with
ID 543.
o In the case of a database with no index, we have to scan the disk blocks from the start
until we reach 543. The DBMS will reach the record after reading 543*10 = 5430 bytes.
o In the case of an index (with, say, 2-byte index entries), we search the index instead, and the
DBMS will reach the record after reading only 542*2 = 1084 bytes, which is far less than in
the previous case.
Indexing Methods:
Primary Index
o If the index is created on the basis of the primary key of the table, then it is known as
primary indexing. The primary key is unique for each record, so there is a 1:1
relation between index entries and records.
o As primary keys are stored in sorted order, the performance of the searching operation is
quite efficient.
o The primary index can be classified into two types: Dense index and Sparse index.
Dense index
o The dense index contains an index record for every search key value in the data file. It
makes searching faster.
o In this, the number of records in the index table is same as the number of records in
the main table.
o It needs more space to store index record itself. The index records have the search key
and a pointer to the actual record on the disk.
Sparse index
o In the data file, an index record appears only for a few items; each index entry points to a block.
o Instead of pointing to each record in the main table, the index points to the
records in the main table with a gap.
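A minimal sketch contrasting the two primary-index types (the blocks and key values are invented for illustration): the dense index holds one entry per record, while the sparse index holds one entry per block and scans only the block it points to.

```python
# Data file: records sorted on the search key, grouped into blocks.
blocks = [
    [(1, "rec1"), (2, "rec2"), (3, "rec3")],   # block 0
    [(4, "rec4"), (5, "rec5"), (6, "rec6")],   # block 1
]

# Dense index: one <search key, block pointer> entry for every record.
dense_index = [(key, b) for b, block in enumerate(blocks) for key, _ in block]

# Sparse index: one entry per block, holding the first key of that block.
sparse_index = [(block[0][0], b) for b, block in enumerate(blocks)]

def sparse_lookup(key):
    # Pick the last index entry whose key is <= the searched key,
    # then scan only that one block.
    candidate = max((e for e in sparse_index if e[0] <= key), key=lambda e: e[0])
    block = blocks[candidate[1]]
    return next((rec for k, rec in block if k == key), None)

print(len(dense_index), len(sparse_index))   # 6 entries vs 2 entries
print(sparse_lookup(5))                      # 'rec5'
```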
Clustering Index
o A clustered index can be defined as an ordered data file. Sometimes the index is
created on non-primary key columns which may not be unique for each record.
o In this case, to identify the record faster, we will group two or more columns to get
the unique value and create index out of them. This method is called a clustering
index.
o The records which have similar characteristics are grouped, and indexes are created
for these group.
The previous scheme is a little confusing because one disk block is shared by records which
belong to different clusters. If we use a separate disk block for each cluster, it is considered a
better technique.
Secondary Index
In sparse indexing, as the size of the table grows, the size of the mapping also grows. These
mappings are usually kept in primary memory so that the address fetch is fast; the
secondary memory then searches the actual data based on the address obtained from the mapping. If the
mapping size grows, fetching the address itself becomes slower, and the sparse
index will no longer be efficient. To overcome this problem, secondary indexing is introduced.
In secondary indexing, to reduce the size of mapping, another level of indexing is introduced.
In this method, the huge range for the columns is selected initially so that the mapping size of
the first level becomes small. Then each range is further divided into smaller ranges. The
mapping of the first level is stored in the primary memory, so that address fetch is faster. The
mapping of the second level and actual data are stored in the secondary memory (hard disk).
For example:
o If you want to find the record of roll 111 in the diagram, it will search for the highest
entry which is smaller than or equal to 111 in the first-level index. It will get 100 at this
level.
o Then in the second index level, it again searches for the largest entry <= 111 and gets 110.
Using the address stored with 110, it goes to the data block and searches each record until it
finds 111.
o This is how a search is performed in this method. Inserting, updating or deleting is
also done in the same manner.
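A minimal two-level index sketch mirroring the roll-111 example above; the ranges, block names and record contents are invented for illustration. At each level the highest entry not exceeding the searched key is chosen, so only one data block is finally scanned.

```python
# First-level index (kept in primary memory): wide ranges of roll numbers.
first_level = [(1, 0), (100, 1), (200, 2)]        # (lowest roll, slot in 2nd level)

# Second-level index (on disk): finer ranges, each pointing to a data block.
second_level = [
    [(1, "blk0"), (50, "blk1")],
    [(100, "blk2"), (110, "blk3")],
    [(200, "blk4"), (250, "blk5")],
]

# Data blocks (on disk), each holding a few records keyed by roll number.
data_blocks = {"blk3": [(110, "..."), (111, "record of roll 111"), (115, "...")]}

def find(roll):
    # At each level pick the highest entry that is <= the searched key.
    lvl1 = max((e for e in first_level if e[0] <= roll), key=lambda e: e[0])
    lvl2 = max((e for e in second_level[lvl1[1]] if e[0] <= roll), key=lambda e: e[0])
    # Finally scan the one data block the second level points to.
    return next((rec for k, rec in data_blocks[lvl2[1]] if k == roll), None)

print(find(111))     # 'record of roll 111' — only one data block was touched
```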
B+ Tree
o The B+ tree is a balanced multiway search tree (not a binary tree). It follows a multi-level index format.
o In the B+ tree, the leaf nodes hold the actual data pointers. The B+ tree ensures that all leaf
nodes remain at the same height.
o In the B+ tree, the leaf nodes are linked in a linked list. Therefore, a B+ tree can
support sequential access as well as random access.
Structure of B+ Tree
o In the B+ tree, every leaf node is at an equal distance from the root node. A B+ tree is
of order n, where n is fixed for the whole tree.
o It contains internal nodes and leaf nodes.
Internal node
o An internal node of the B+ tree can contain at least n/2 pointers, except the root
node.
o At most, an internal node of the tree contains n pointers.
Leaf node
o A leaf node of the B+ tree can contain at least n/2 record pointers and n/2 key
values.
o At most, a leaf node contains n record pointers and n key values.
o Every leaf node of the B+ tree also contains one block pointer P to point to the next leaf node.
Suppose we have to search for 55 in the B+ tree structure below. First, we fetch the
intermediary node, which will direct us to the leaf node that can contain the record for 55.
In the intermediary node, we find the branch between 50 and 75. At the end, we are
redirected to the third leaf node, where the DBMS performs a sequential search to find
55.
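A minimal sketch of this descent, assuming a small hand-built tree (the node contents are invented, since the figure is not reproduced here): internal nodes route the search by comparing keys, and the chosen leaf is scanned sequentially.

```python
# A node is either {'keys': [...], 'children': [...]} (internal)
# or {'keys': [...], 'next': leaf} (leaf). Keys mimic the example above.
leaf1 = {"keys": [10, 20, 30], "next": None}
leaf2 = {"keys": [50, 55, 65, 70], "next": None}
leaf3 = {"keys": [75, 80, 90], "next": None}
leaf1["next"], leaf2["next"] = leaf2, leaf3          # leaves form a linked list
root = {"keys": [50, 75], "children": [leaf1, leaf2, leaf3]}

def bplus_search(node, key):
    while "children" in node:                        # descend internal nodes
        i = 0
        while i < len(node["keys"]) and key >= node["keys"][i]:
            i += 1
        node = node["children"][i]                   # follow the chosen branch
    return key in node["keys"]                       # sequential scan of leaf

print(bplus_search(root, 55))   # True: the root sends us between 50 and 75
```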
B+ Tree Insertion
Suppose we want to insert a record with key 60 into the structure below. It should go into the 3rd leaf node,
after 55. This is a balanced tree, and that leaf node is already full, so we cannot insert
60 there.
In this case, we have to split the leaf node so that the key can be inserted into the tree without affecting the
fill factor, balance and order.
With 60 included, the 3rd leaf node would hold the values (50, 55, 60, 65, 70), and its branch in the
intermediate node is 50. We split the leaf node in the middle so that its balance is not altered, grouping
(50, 55) and (60, 65, 70) into two leaf nodes.
If these two are to be leaf nodes, the intermediate node cannot branch only from 50: 60 should be
added to it, and then we can have a pointer to the new leaf node.
This is how we can insert an entry when there is overflow. In a normal scenario, it is very
easy to find the node where it fits and then place it in that leaf node.
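A small sketch of just the split step described above: an over-full leaf is cut in the middle, and the first key of the right half (60 here) is copied up into the parent.

```python
def split_leaf(keys):
    # Split an over-full leaf in the middle; in a B+ tree the middle key is
    # COPIED up to the parent (it must also remain in a leaf).
    mid = len(keys) // 2
    left, right = keys[:mid], keys[mid:]
    return left, right, right[0]          # right[0] goes into the parent

left, right, up = split_leaf([50, 55, 60, 65, 70])
print(left, right, up)    # [50, 55] [60, 65, 70] 60 — as in the example above
```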
B+ Tree Deletion
Suppose we want to delete 60 from the above example. In this case, we have to remove 60 from
the intermediate node as well as from the 4th leaf node. If we simply remove it from the
intermediate node, the tree will no longer satisfy the rules of a B+ tree, so we need to modify the
tree to keep it balanced.
After deleting node 60 from above B+ tree and re-arranging the nodes, it will show as
follows:
HASHING
In a huge database structure, it is very inefficient to search all the index values and reach
the desired data. Hashing technique is used to calculate the direct location of a data record
on the disk without using index structure.
In this technique, data is stored at the data blocks whose address is generated by using the
hashing function. The memory location where these records are stored is known as data bucket
or data blocks.
In this, the hash function can use any column value to generate the address. Most of the
time, the hash function uses the primary key to generate the address of the data block. A hash
function can be a simple mathematical function or a complex mathematical function. We can even
consider the primary key itself to be the address of the data block; that means each row is stored in the
data block whose address equals its primary key value.
The above diagram shows the data block addresses being the same as the primary key values. The hash
function can also be a simple mathematical function like mod, exponential, cos, sin, etc. Suppose we
have a mod(5) hash function to determine the address of the data block. In this case, mod(5) is applied
to the primary keys, which generates 3, 3, 1, 4 and 2 respectively, and the
records are stored at those data block addresses.
ISAM (Indexed Sequential Access Method):
The potentially large size of the index file motivates the ISAM idea: build an auxiliary
index file on the index file, and so on recursively, until the final auxiliary file fits on one page.
This repeated construction of a one-level index leads to a tree structure, illustrated in
the figure. The data entries of the ISAM index are in the leaf pages of the tree and in additional
overflow pages that are chained to some leaf page. In addition, some systems carefully
organize the layout of pages so that page boundaries correspond closely to the physical
characteristics of the underlying storage device. The ISAM structure is completely static
and facilitates such low-level optimizations.
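A minimal sketch of the ISAM construction idea, assuming an index-page capacity chosen only for illustration: an index level is built over the level below, repeatedly, until the newest level fits on a single page.

```python
ENTRIES_PER_PAGE = 4          # assumed capacity of an index page

def build_isam_levels(leaf_keys):
    # Repeatedly build a one-level index on the level below until the
    # newest level fits on a single page; the result is a static tree.
    levels = [leaf_keys]
    while len(levels[-1]) > ENTRIES_PER_PAGE:
        below = levels[-1]
        # one separator entry per page of the level below
        levels.append(below[::ENTRIES_PER_PAGE])
    return levels

leaves = list(range(0, 640, 10))           # 64 leaf-level data entries
for depth, level in enumerate(build_isam_levels(leaves)):
    print("level", depth, "entries:", len(level))
# level 0: 64, level 1: 16, level 2: 4  -> the top level fits on one page
```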
1) Hash Based Indexing: This type of indexing is used to find records quickly, by
providing a search key value.
In this, the file records are stored in pages that are grouped into buckets. Each
bucket contains a primary page, possibly with other pages chained to it. In
order to determine the bucket for a record, a special function called a hash
function is applied to the search key. Given a bucket number, we can
obtain the primary page in one or more disk I/O operations.
Record Insertion into a Bucket: Records are inserted into a bucket by
allocating additional "overflow" pages as needed.
Record Searching: the hash function is used first to find the bucket containing the
record, and then the pages in that bucket are scanned to find the record with the given
search key.
If the selection condition does not involve the search key, then all the pages in the file
need to be scanned.
Record Retrieval: By applying a hash function to the record's search key, the page
containing the needed record can be identified and retrieved in one disk I/O.
Consider a file of student records with hash key rno. Applying the hash function to rno
gives the page that contains the needed record. The hash function 'h' uses
the last two digits of the binary value of rno as the bucket identifier. A
search-key index on marks obtained, i.e. mrks, contains <mrks, rid> pairs as data
entries in an auxiliary index file, which is shown in the fig. The rid (record id)
points to the record whose search key value is mrks.
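A tiny sketch of the bucket identifier just described: the hash function keeps only the last two binary digits of rno (the roll numbers used here are invented).

```python
def bucket_of(rno):
    # h uses the last two digits of the binary value of rno,
    # i.e. rno mod 4, as the bucket identifier.
    return rno & 0b11

for rno in [19, 22, 24, 40]:               # invented roll numbers
    print(rno, bin(rno), "-> bucket", bucket_of(rno))
# e.g. 22 = 0b10110 ends in 10, so it hashes to bucket 2
```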
2) Tree-Based Indexing: In order to find the students whose roll numbers lie between 19 and 24, the
direction of the search is shown in the fig.
Suppose we want to find all the students whose roll numbers lie between 17 and 40: we
first direct the search to node A1 and, after analyzing its contents, forward
the search to B1 and then to the leaf node L11, which actually contains
the first required data entry. The other leaf nodes L12 and L13 also contain data
entries that fulfil our search criteria. For this, all the leaf pages must be
linked in a doubly linked list.
Static Hashing:
To search for a data entry, we apply a hash function 'h' to identify the
bucket to which it belongs and then search this bucket.
To insert a data entry, we use the hash function to identify the correct
bucket and then put the data entry there. If there is no space for this data
entry, we allocate a new overflow page, put the data entry there, and add the
page to the overflow chain of the bucket.
To delete a data entry, we use the hash function to identify the correct
bucket, locate the data entry by searching the bucket, and then remove it.
If this data entry is the last one in an overflow page, the overflow page is
removed and added to a list of free pages.
Thus, the number of buckets in a static hashing file is fixed when the file
is created, and the primary pages can be stored as successive disk pages. (A small
sketch of these operations follows below.)
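A minimal static-hashing sketch, assuming a fixed bucket count and a tiny page capacity for illustration: each bucket is a primary page plus a chain of overflow pages, and search, insert and delete all start by hashing to the bucket.

```python
NUM_BUCKETS = 4               # fixed when the file is created (static hashing)
PAGE_CAPACITY = 2             # assumed data entries per page

# Each bucket is a list of pages; pages[0] is the primary page,
# the rest are overflow pages chained to it.
buckets = [[[]] for _ in range(NUM_BUCKETS)]

def h(key):
    return key % NUM_BUCKETS

def insert(key):
    pages = buckets[h(key)]
    for page in pages:                    # use the first page with free space
        if len(page) < PAGE_CAPACITY:
            page.append(key)
            return
    pages.append([key])                   # no space: allocate an overflow page

def search(key):
    return any(key in page for page in buckets[h(key)])

def delete(key):
    for page in buckets[h(key)]:
        if key in page:
            page.remove(key)
            return True
    return False

for k in [8, 12, 16, 20, 24]:             # all hash to bucket 0 -> overflow
    insert(k)
print(buckets[0])                         # [[8, 12], [16, 20], [24]]
print(search(16), delete(16), search(16)) # True True False
```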
🡪The main problem with static hashing is that the number of buckets is fixed.
Dynamic Hashing: The dynamic hashing technique allows the hash function
to be modified dynamically to accommodate the growth or shrinkage of the
database, because most databases grow larger over time and static hashing
presents serious problems in dealing with them.
If we use static hashing on such growing databases, we have three options:
1. Choose a hash function based on the current file size; performance will degrade as the file grows.
2. Choose a hash function based on the anticipated future size of the file; a large amount of space
will be wasted initially.
3. Periodically reorganize the hash structure with a new hash function; this is expensive, and the file
is unavailable during reorganization.
Sorted Files: (In the cost formulas below, B is the number of data pages, R the number of records per
page, D the average time to read or write a disk page, C the average time to process a record, H the time
to apply the hash function, and F the fan-out of a tree index.)
1) Cost of Scanning: The cost of scanning sorted files is B(D + RC),
because all the pages need to be scanned in order to retrieve all the records; i.e. the cost of
scanning sorted files = the cost of scanning heap files.
2) Cost of Insertion: The cost of insertion in sorted files is given by search cost + B(D + RC).
It includes finding the correct position of the record + adding the record + fetching the
pages + rewriting the pages.
3) Cost of Deletion: The cost of deletion in sorted files is given by search cost + B(D + RC).
It includes searching for the record + removing the record + rewriting the modified pages.
Note: the record to be deleted is specified using an equality condition.
4) Cost of Searching with an Equality Selection: For sorted files this cost is D log2 B, the time required
to perform a binary search for the page that contains the first matching record, plus C log2 R to locate
the record within that page, i.e. D log2 B + C log2 R.
If many records satisfy the selection, the cost is D log2 B + C log2 R + the cost of sequentially
reading all the matching records.
5) Cost of Searching with a Range Selection: This cost is given as
the cost of fetching the first matching record's page + the cost of fetching the remaining pages of
qualifying records.
If the range is small, then a single page contains all the matching records; otherwise additional
pages need to be fetched.
Clustered Files:
1) Cost of Scanning: The cost of scanning clustered files is the same as that of
scanning sorted files, except that a clustered file occupies more pages (about 1.5B, because pages are
typically only about 67% full). Scanning 1.5B pages at time D per page, with R records per page at
time C per record, gives a total cost of 1.5B(D + RC).
2) Cost of Insertion: The cost of insertion in clustered files is search + write, i.e. (D logF 1.5B +
C log2 R) + D.
3) Cost of Deletion: It is the same as the cost of insertion and includes
the cost of searching for the record + removing the record + rewriting the modified page,
i.e. D logF 1.5B + C log2 R + D.
4) Equality Selection Search:
i) For a single qualifying record: The cost of finding a single qualifying
record in a clustered file is the sum of the cost of locating the first matching page,
D logF 1.5B, and the cost of finding the first matching record on that page, C log2 R;
i.e. D logF 1.5B + C log2 R.
ii) For several qualifying records: If more than one record satisfies the
selection criteria, the matching records are assumed to be located consecutively.
The cost required to find them is
D logF 1.5B + C log2 R + the cost of sequentially reading all the matching records.
5) Range Selection Search: This cost is the same as that of an equality search with several matching records.
Heap File with an Unclustered Hash Index:
4) Equality Selection Search: The total cost of the search accounts for:
i) The page containing the qualifying data entries is identified at cost H.
ii) Retrieving that page, assuming it is the only page in the bucket, costs D.
iii) The cost of finding an entry after scanning half the entries on the page is 4RC.
iv) Fetching the matching record from the file costs D.
The total cost is H + D + 4RC + D 🡺 (H + 2D + 4RC).
If many records match, the cost is H + D + 4RC + one I/O for each record that qualifies.
5) Range Selection Search: The cost of this is B(D + RC), since a hash index is of no use for a
range selection and the whole file must be scanned.
4) Heap file with an unclustered tree index – Fast insertion, deletion and searching; scanning and range
searches are slow.
5) Heap file with an unclustered hash index – Fast insertion, deletion and searching; does not support
range searches.
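A small sketch that plugs sample values into the cost formulas above (B, R, D, C, H and F as defined in the sorted-files section); the parameter values are assumptions chosen only to make the comparison concrete.

```python
import math

# Sample parameter values (assumptions, only for illustration):
B, R = 1000, 100                  # data pages and records per page
D, C, H = 15e-3, 0.1e-6, 0.1e-6   # page I/O, per-record CPU, hash CPU (seconds)
F = 100                           # fan-out of the tree index

costs = {
    "heap / sorted scan        ": B * (D + R * C),
    "sorted equality search    ": D * math.log2(B) + C * math.log2(R),
    "sorted insert             ": D * math.log2(B) + B * (D + R * C),
    "clustered scan            ": 1.5 * B * (D + R * C),
    "clustered equality search ": D * math.log(1.5 * B, F) + C * math.log2(R),
    "unclustered hash equality ": H + 2 * D + 4 * R * C,
}
for name, seconds in costs.items():
    print(name, round(seconds, 4), "s")
```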
Dangling Pointer: A dangling pointer is a pointer that does not point to a valid
object of the appropriate type. Dangling pointers arise when an object is deleted or
de-allocated without modifying the value of the pointer, so that the pointer still
points to the memory location of the de-allocated object.
In an object-oriented database, a dangling pointer occurs if we move or delete a
record to which another record contains a pointer; that pointer no longer points to
the desired record.
Physical identifiers encode the location of the object so the object can be found
directly. Physical OIDs have the following parts.
1) A volume or file identifier.
2) A page identifier within the volume or file.
3) An offset within the page.
Physical OIDs may also contain a unique identifier. This identifier is stored in the object as well, and
is used to detect references made through dangling pointers.
1) Equality:
🡪 An equality query for a composite search key is one in which
each field of the search key is bound to a constant.
For example, data entries in a student file where rno = 15 and mrks = 90 can be
retrieved by using such an equality query.
Thus, tree-based indexing supports both equality and range selection criteria, as well as inserts,
deletes and updates, whereas hash-based indexing supports only equality selections, apart from
insertion, deletion and updating.
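A tiny sketch of an equality query on the composite search key <rno, mrks>; the student records are invented. Because every field of the key is bound to a constant, a hash index built on the composite key answers the query directly.

```python
# Invented student records.
students = [
    {"rno": 15, "mrks": 90, "name": "Kiran"},
    {"rno": 15, "mrks": 75, "name": "Mani"},
    {"rno": 22, "mrks": 90, "name": "Devi"},
]

# Hash index on the composite search key <rno, mrks>.
index = {}
for rid, rec in enumerate(students):
    index.setdefault((rec["rno"], rec["mrks"]), []).append(rid)

# Equality query: every field of the composite key is bound to a constant,
# so the hash index locates the matching record ids directly.
print([students[rid] for rid in index.get((15, 90), [])])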
Disadvantages:
In a static structure such as ISAM, the stored file pages follow the disk's sequential order, so
sequential retrieval of such pages is quick; this is not always possible in dynamic tree-structured indexes.