
UNIT 5

OVERVIEW OF STORAGE AND INDEXING

Data on External Storage, File Organization and Indexing, Cluster Indexes, Primary and Secondary Indexes,
Index Data Structures, Hash-Based Indexing, Tree-Based Indexing, Comparison of File Organizations,
Indexes and Performance Tuning, Intuitions for Tree Indexes, Indexed Sequential Access Method (ISAM),
B+ Trees: A Dynamic Index Structure
_______________________________________________________________________________________________

1. Data on external storage:

● Data in a DBMS is stored on storage devices such as disks and tapes.

● The disk space manager is responsible for keeping track of available disk space.
● The file manager, which provides the abstraction of a file of records to higher levels of
DBMS code, issues requests to the disk space manager to obtain and relinquish space on
disk.
Storage Manager Component:

A storage manager is a component or program module that provides the interface between the low-level
data stored in the database and the application programs and queries submitted to the system. The
storage manager components include:

1. File Manager – Manages the file space and takes care of the structure of the files. It
handles the allocation of space on disk storage and the data structures used to represent
information stored on other media.
2. Buffer Manager – Transfers blocks between disk (or other devices) and main memory.
DMA (Direct Memory Access) is a form of input/output that manages this exchange of
blocks: when the processor requests the transfer of a block, the request is passed to the
DMA controller, which transfers the block without interrupting the CPU. (A minimal
buffer-manager sketch follows this list.)
3. Authorization and Integrity Manager – Checks that users have the authority to access
and modify information, and enforces integrity constraints (keys, etc.).
4. Disk Manager – Transfers the blocks requested by the file manager.
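To make the buffer manager's role concrete, here is a minimal Python sketch of a buffer pool that caches disk blocks in main-memory frames and evicts the least recently used frame when the pool is full. The pool size, the dict-backed "disk", and the dirty-flag write-back are illustrative assumptions, not the design of any specific DBMS.

```python
from collections import OrderedDict

class BufferPool:
    """Minimal buffer-manager sketch: caches disk blocks in memory frames."""

    def __init__(self, disk, capacity=4):
        self.disk = disk              # hypothetical "disk": dict block_id -> bytes
        self.capacity = capacity      # number of in-memory frames
        self.frames = OrderedDict()   # block_id -> (data, dirty_flag)

    def get_block(self, block_id):
        """Return a block, reading it from disk only on a cache miss."""
        if block_id in self.frames:
            self.frames.move_to_end(block_id)       # mark as recently used
            return self.frames[block_id][0]
        if len(self.frames) >= self.capacity:       # pool full: evict LRU frame
            victim, (data, dirty) = self.frames.popitem(last=False)
            if dirty:
                self.disk[victim] = data            # write back before eviction
        data = self.disk[block_id]                  # block transfer from "disk"
        self.frames[block_id] = (data, False)
        return data

    def put_block(self, block_id, data):
        """Modify a block in memory and mark it dirty (written back on eviction)."""
        self.get_block(block_id)
        self.frames[block_id] = (data, True)
        self.frames.move_to_end(block_id)

# Usage: disk = {0: b"block0", 1: b"block1"}; pool = BufferPool(disk)
# pool.get_block(0) reads from "disk" once; a second call hits the buffer.
```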

Memory Hierarchy:

Figure: Memory Hierarchy

At the top we have primary storage, which consists of cache and main memory and provides very fast access
to data. Next comes secondary storage, which consists of slower devices such as magnetic disks. Tertiary
storage is the slowest class of storage devices; examples are optical disks and tapes.
Primary Storage:
1. At the primary storage level, the memory hierarchy includes, at the most expensive end, cache
memory, which is static RAM (Random Access Memory). Cache memory is mainly used by the
CPU to speed up the execution of programs.
2. The next level of primary storage is DRAM (Dynamic Random Access Memory), which
provides the main work area for the CPU for keeping programs and data, and which is popularly
called main memory.
3. The advantage of DRAM is its low cost, which continues to decrease; the drawbacks
are its volatility and its lower speed compared with static RAM.
Secondary Storage:
At the secondary storage level, the hierarchy includes magnetic disks, as well as storage in the form of
CD-ROM (Compact Disk - Read Only Memory) devices.

Secondary storage devices are used to store data for future use or as backup. Secondary storage includes
memory devices that are not part of the CPU chipset or motherboard, for example, magnetic disks,
optical disks (DVD, CD, etc.), hard disks, flash drives, and magnetic tapes.
Tertiary storage:
At the tertiary storage level, the hierarchy includes optical disks and tapes at the least expensive end.
Storage capacity anywhere in the hierarchy is measured in kilobytes (Kbytes, about a thousand bytes),
megabytes (Mbytes, about a million bytes), gigabytes (Gbytes, about a billion bytes), and even terabytes
(1000 Gbytes).
Explanation:
DRAM:

Programs reside and execute in DRAM. Generally, large permanent databases reside on secondary storage,
and portions of the database are read into and written from buffers in main memory as needed. Personal
computers and workstations have tens of megabytes of data in DRAM, so it has become possible to load a
large fraction of a database into main memory. An example is telephone switching applications, which store
databases containing routing and line information in main memory.
Flash Memory:
1. Between DRAM and magnetic disk storage resides another form of memory, flash
memory, which is becoming common, particularly because it is non-volatile.
2. Flash memories are high-density, high-performance memories using EEPROM
(Electrically Erasable Programmable Read-Only Memory) technology.
3. The advantage of flash memory is its fast access speed.
4. The disadvantage is that an entire block must be erased and written over at a time.
Magnetic disk storage:
● primary medium for long-term storage.
● Typically the entire database is stored on disk.
● Data must be moved from disk to main memory in order for the data to be operated on.
● After operations are performed, data must be copied back to disk if any changes were
made.
● Disk storage is called direct access storage as it is possible to read data on the disk in
any order (unlike sequential access).
● Disk storage usually survives power failures and system crashes.

Figure: Structure of magnetic disk

Access time – the time from when a read or write request is issued until data transfer begins.
Data-transfer rate – the rate at which data can be retrieved from or stored to the disk.
Mean time to failure (MTTF) – the average time the disk is expected to run
continuously without any failure.

CD-ROM:
CD-ROM disks store data optically and are read by a laser. CD-ROMs contain pre-recorded data that
cannot be overwritten. WORM (Write-Once, Read-Many) disks are a form of optical storage used for
archiving data; they allow data to be written once and read any number of times, without the possibility of
erasing. DVD (Digital Video Disk) is a more recent standard for optical disks, allowing fourteen to fifteen
gigabytes of storage per disk.
Tapes:

1. Tapes are relatively inexpensive and can store very large amounts of data; they are used
when we maintain data for a long period but do not expect to access it very often.

2. Used primarily for backup and archival data.


3. Cheaper, but much slower access, since a tape must be read sequentially from the
beginning.
4. Used as protection from disk failures.
5. A Quantum DLT 4000 drive is a typical tape device; it stores 20 GB of data and can
store about twice as much by compressing the data.

Figure: storage device hierarchy


2. File Organizations:

Storing files in a certain order is called file organization. The main objectives of file organization are:

● Optimal selection of records, i.e., records should be accessed as fast as possible.


● Any insert, update or delete transaction on records should be easy and quick, and should
not harm other records.
● No duplicate records should be induced as a result of an insert, update or delete.
● Records should be stored efficiently so that the cost of storage is minimal.

Some of the file organizations are

1. Sequential File Organization


2. Heap File Organization
3. Hash/Direct File Organization
4. Indexed Sequential Access Method
5. B+ Tree File Organization
6. Cluster File Organization

1. Sequential File Organization:

In sequential file organization, records are placed in the file in some sequential order based on a unique key
field or search key.

The simplest method of file organization is the sequential method, in which records are stored one after
another. There are two ways to implement this method:

1. Pile File Method

2. Sorted File Method

1. Pile File Method – This method is quite simple: we store the records in a
sequence, i.e., one after the other, in the order in which they are inserted into the tables.
Insertion of new record –
Let R1, R3, R5 and R4 be four records in the sequence. (Records here are simply rows
of a table.) Suppose a new record R2 has to be inserted in the sequence; then it is
simply placed at the end of the file.

2. Sorted File Method – In this method, as the name suggests, whenever a new record
has to be inserted, it is inserted so that the file stays sorted (ascending or descending).
Sorting of records may be based on a primary key or on any other key.

Insertion of new record –


Assume there is a pre-existing sorted sequence of four records R1, R3, R7 and R8.
Suppose a new record R2 has to be inserted; it is first placed at the end of the file and
the sequence is then re-sorted, as the sketch below illustrates.
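The difference between the two methods can be sketched in a few lines of Python; `bisect` keeps the sorted file ordered on the key. The (key, payload) record layout is an assumption for illustration only.

```python
import bisect

pile_file = []          # pile method: records kept in arrival order
sorted_file = []        # sorted method: records kept ordered by key

def pile_insert(file, record):
    """Pile file: simply append at the end of the file."""
    file.append(record)

def sorted_insert(file, record):
    """Sorted file: place the record so that key order is preserved."""
    keys = [r[0] for r in file]
    pos = bisect.bisect(keys, record[0])   # find the insertion point
    file.insert(pos, record)

for rec in [(1, "R1"), (3, "R3"), (7, "R7"), (8, "R8")]:
    pile_insert(pile_file, rec)
    sorted_insert(sorted_file, rec)

pile_insert(pile_file, (2, "R2"))      # R2 lands at the end
sorted_insert(sorted_file, (2, "R2"))  # R2 lands between R1 and R3
print([r[1] for r in pile_file])       # ['R1', 'R3', 'R7', 'R8', 'R2']
print([r[1] for r in sorted_file])     # ['R1', 'R2', 'R3', 'R7', 'R8']
```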
Pros and Cons of Sequential File Organization –

Pros –

● Fast and efficient method for huge amounts of data.


● Simple design.
● Files can easily be stored on magnetic tape, i.e., a cheaper storage mechanism.

Cons –

● Time is wasted, because we cannot jump directly to a required record but have to
move through the file sequentially.
● The sorted file method is inefficient, as sorting the records takes extra time and space.

2. Heap File Organization:

When a file is created using Heap File Organization, the Operating System allocates memory area to that
file without any further accounting details. File records can be placed anywhere in that memory area.

Heap File Organization works with data blocks. In this method records are inserted at the end of the file,
into the data blocks. No sorting or ordering is required in this method. If a data block is full, the new
record is stored in some other block; this other data block need not be the very next data block, but
can be any block in the memory. It is the responsibility of the DBMS to store and manage the new records.

Insertion of new record –


Suppose we have five records in the heap, R1, R5, R6, R4 and R3, and a new record R2 has to be
inserted into the heap. Since the last data block, data block 3, is full, R2 will be inserted into any
data block selected by the DBMS, let's say data block 1. (A minimal sketch follows below.)
If we want to search, delete or update data in heap file organization, we must traverse the data from the
beginning of the file until we get the requested record. Thus if the database is very huge, searching, deleting
or updating a record will take a lot of time.
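A heap file can be sketched as a list of fixed-capacity blocks: insertion uses any block with free space, while search must scan from the beginning of the file. The block capacity and record shape below are assumptions for illustration.

```python
BLOCK_CAPACITY = 2                      # assumed number of records per data block

class HeapFile:
    def __init__(self):
        self.blocks = []                # each block is a list of (key, payload)

    def insert(self, record):
        """Store the record in any block with free space (not necessarily the last)."""
        for block in self.blocks:
            if len(block) < BLOCK_CAPACITY:
                block.append(record)
                return
        self.blocks.append([record])    # all blocks full: allocate a new one

    def search(self, key):
        """Heap search: traverse every block from the start of the file."""
        for block in self.blocks:
            for record in block:
                if record[0] == key:
                    return record
        return None

# Usage: after inserting R1, R5, R6, R4 and R3, inserting R2 places it in
# the first block that has room, mirroring the example above.
```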

Pros and Cons of Heap File Organization –

Pros –

● Fetching and retrieving records is faster than in a sequential file, but only in the case of
small databases.
● When a huge amount of data needs to be loaded into the database at once, this
method of file organization is best suited.

Cons –

● Problem of unused memory blocks.


● Inefficient for larger databases.

3. Hash File Organization:


Hash File Organization uses Hash function computation on some fields of the records. The output of the hash
function determines the location of disk block where the records are to be placed.

In this method of file organization, hash function is used to calculate the address of the block to store the
records.

The hash function can be any simple or complex mathematical function.

The hash function is applied on some columns/attributes – either key or non-key columns to get the
block address.

Hence each record is stored randomly irrespective of the order they come. Hence this method is also
known as Direct or Random file organization.

If the hash function is applied to a key column, that column is called the hash key; if it is applied
to a non-key column, the column is called a hash column.
When a record has to be retrieved, the address is generated from the hash key column and the whole
record is retrieved directly from that address; there is no need to traverse the whole file. Similarly, when a
new record has to be inserted, the address is generated from the hash key and the record is inserted
directly. The same holds for update and delete, as the short sketch below shows.
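The core computation is a single line: hash the key column and take the result modulo the number of blocks. The block count and record below are hypothetical, and real systems use more careful hash functions plus overflow handling.

```python
NUM_BLOCKS = 10                       # assumed number of data blocks

def block_address(key):
    """Map a record's hash-key value to the data block that stores it."""
    return hash(key) % NUM_BLOCKS

# Insert, fetch, update and delete all start the same way:
record = {"rno": 15, "name": "A"}
addr = block_address(record["rno"])   # no file traversal needed
print(f"record with rno=15 lives in block {addr}")
```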

Advantages of Hash File Organization

● Records need not be sorted after any transaction. Hence the effort of sorting is
reduced in this method.
● Since the block address is known from the hash function, accessing any record is very fast.
Similarly, updating or deleting a record is also very quick.
● This method can handle multiple transactions, as each record is independent of the others:
since there is no dependency on storage location for each record, multiple records
can be accessed at the same time.
● It is suitable for online transaction systems like online banking, ticket booking systems,
etc.

Clustered File Organization:


Clustered file organization is not considered good for large databases. In this mechanism, related records
from one or more relations are kept in the same disk block; that is, the ordering of records is not based on a
primary key or search key.

In this method, two or more tables that are frequently joined to get results are stored in the same file,
called a cluster. These files hold two or more tables in the same data block, and the key columns which map
these tables are stored only once. This method hence reduces the cost of searching for various records in
different files. All the records are found in one place, making search efficient.
Comparison of file organizations:
The quantities used for comparing file organizations are:

B – number of data pages
R – number of records per page
D – average time to read or write a disk page
C – average time to process a record
Indexing and Hashing
Data is stored in the form of records, and every record has a key field which helps it to be
recognized uniquely. Indexing is a data structure technique to efficiently retrieve records from the
database based on the attributes on which the indexing has been done. Indexing in a database is
similar to the index we see in books.

Indexing in DBMS

o Indexing is used to optimize the performance of a database by minimizing the number


of disk accesses required when a query is processed.
o The index is a type of data structure. It is used to locate and access the data in a
database table quickly.
o It is defined based on the indexing attribute.

Index structure: indexes can be created using some database columns.

o The first column of the index is the search key: it contains a copy of the primary
key or candidate key of the table. The values of the primary key are stored in sorted
order so that the corresponding data can be accessed easily.
o The second column of the index is the data reference. It contains a set of pointers
holding the address of the disk block where the value of the particular key can be
found.

Ordered indices

The indices are usually sorted to make searching faster. The indices which are sorted are
known as ordered indices.

Example: Suppose we have an employee table with thousands of records, each of which is
10 bytes long. If their IDs start with 1, 2, 3 and so on, and we have to search for the record with
ID 543:

o In the case of a database with no index, we have to search the disk blocks from the start
until we reach 543. The DBMS will reach the record after reading 543*10 = 5430 bytes.
o In the case of an index, we search using the index, and (each index entry here being
2 bytes) the DBMS will reach the record after reading 542*2 = 1084 bytes, which is far
less than in the previous case.
Indexing Methods:

Primary Index

o If the index is created on the basis of the primary key of the table, then it is known as
primary indexing. The primary key is unique for each record, so there is a 1:1
relation between index entries and records.
o As primary keys are stored in sorted order, the performance of the searching operation is
quite efficient.
o The primary index can be classified into two types: dense index and sparse index.

Dense index

o The dense index contains an index record for every search key value in the data file. It
makes searching faster.
o In this, the number of records in the index table is the same as the number of records in
the main table.
o It needs more space to store the index records themselves. The index records contain the
search key and a pointer to the actual record on the disk.

Sparse index

o In the data file, an index record appears only for a few items. Each item points to a block.
o Instead of pointing to each record in the main table, the index points to records in the
main table at intervals, as the sketch below illustrates.
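A sparse index can be sketched as one (key, block) entry per data block: lookup finds the largest indexed key that is less than or equal to the search key, then scans only that block. The block contents below are illustrative assumptions.

```python
import bisect

# Sorted data file split into blocks (one sparse index entry per block).
blocks = [[(10, "a"), (20, "b")], [(30, "c"), (40, "d")], [(50, "e"), (60, "f")]]
sparse_index = [(block[0][0], i) for i, block in enumerate(blocks)]  # first key of each block

def sparse_lookup(key):
    """Find the last index entry with key <= search key, then scan that block."""
    keys = [k for k, _ in sparse_index]
    pos = bisect.bisect_right(keys, key) - 1
    if pos < 0:
        return None                          # key precedes the whole file
    _, block_no = sparse_index[pos]
    for k, value in blocks[block_no]:        # scan only one block, not the file
        if k == key:
            return value
    return None

print(sparse_lookup(40))  # 'd' -- found via index entry (30, block 1)
```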
Clustering Index

o A clustered index can be defined as an ordered data file. Sometimes the index is
created on non-primary key columns which may not be unique for each record.
o In this case, to identify the record faster, we will group two or more columns to get
the unique value and create index out of them. This method is called a clustering
index.
o The records which have similar characteristics are grouped, and indexes are created
for these group.

Example: suppose a company contains several employees in each department. Suppose we


use a clustering index, where all employees which belong to the same Dept_ID are considered
within a single cluster, and index pointers point to the cluster as a whole. Here Dept_Id is a
non-unique key.

The previous scheme can be confusing, because one disk block may be shared by records that belong
to different clusters. Using a separate disk block for each cluster is considered a better
technique.
Secondary Index

In the sparse indexing, as the size of the table grows, the size of mapping also grows. These
mappings are usually kept in the primary memory so that address fetch should be faster. Then
the secondary memory searches the actual data based on the address got from mapping. If the
mapping size grows then fetching the address itself becomes slower. In this case, the sparse
index will not be efficient. To overcome this problem, secondary indexing is introduced.

In secondary indexing, to reduce the size of mapping, another level of indexing is introduced.
In this method, the huge range for the columns is selected initially so that the mapping size of
the first level becomes small. Then each range is further divided into smaller ranges. The
mapping of the first level is stored in the primary memory, so that address fetch is faster. The
mapping of the second level and actual data are stored in the secondary memory (hard disk).
For example:

o If you want to find the record with roll number 111 in the diagram, the search finds the
highest entry that is smaller than or equal to 111 in the first-level index. It gets 100 at this
level.
o Then, in the second index level, it again finds the highest entry <= 111 and gets 110. Using
the address associated with 110, it goes to the data block and searches each record until it
finds 111.
o This is how a search is performed in this method (sketched below). Inserting, updating or
deleting is done in the same manner.
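The two-level lookup described above can be sketched directly: each level stores (smallest key, pointer) pairs, and at each level we take the highest entry less than or equal to the search key. The roll numbers and block contents are illustrative assumptions.

```python
import bisect

# Second-level index: (smallest roll number, data block) pairs.
second_level = [(100, [(100, "x"), (105, "y")]),
                (110, [(110, "z"), (111, "w")]),
                (120, [(120, "v")])]
# First level indexes wide ranges of the second level (kept in primary memory).
first_level = [(100, 0)]   # roll numbers from 100 onward start at entry 0

def lookup(roll):
    """Descend both index levels, then scan records in the final data block."""
    keys1 = [k for k, _ in first_level]
    start = first_level[bisect.bisect_right(keys1, roll) - 1][1]
    keys2 = [k for k, _ in second_level[start:]]
    block = second_level[start + bisect.bisect_right(keys2, roll) - 1][1]
    for k, record in block:                  # sequential scan inside the block
        if k == roll:
            return record
    return None

print(lookup(111))  # 'w' -- first level gives 100, second level gives block 110
```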
B+ Tree

o The B+ tree is a balanced search tree (each node may have many children, so it is not
binary). It follows a multi-level index format.
o In the B+ tree, leaf nodes hold the actual data pointers. The B+ tree ensures that all leaf
nodes remain at the same height.
o In the B+ tree, the leaf nodes are linked in a linked list. Therefore, a B+ tree can
support sequential access as well as random access.

Structure of B+ Tree
o In the B+ tree, every leaf node is at an equal distance from the root node. The B+ tree is
of order n, where n is fixed for every B+ tree.
o It contains internal nodes and leaf nodes.

Internal node

o An internal node of the B+ tree can contain at least n/2 child pointers, except the root
node.
o At most, an internal node of the tree contains n pointers.

Leaf node

o A leaf node of the B+ tree can contain at least n/2 record pointers and n/2 key
values.
o At most, a leaf node contains n record pointers and n key values.
o Every leaf node of the B+ tree contains one block pointer P to point to the next leaf node.

Searching a record in B+ Tree

Suppose we have to search for 55 in the B+ tree structure below. First, we look in the
intermediary node, which will direct us to the leaf node that may contain the record for 55.

In the intermediary node, we find the branch between the keys 50 and 75. Following it, we are
redirected to the third leaf node, where the DBMS performs a sequential search to find 55. A
minimal sketch of this descent follows.
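Below is a minimal Python sketch of the descent for key 55. The tiny hand-built tree, whose root separates at 50 and 75, is an assumption for illustration, not the textbook's exact figure; real B+ tree nodes are disk pages, not in-memory objects.

```python
class Node:
    def __init__(self, keys, children=None, records=None, next_leaf=None):
        self.keys = keys              # separator keys (internal) or stored keys (leaf)
        self.children = children     # child pointers, internal nodes only
        self.records = records       # data pointers, leaf nodes only
        self.next_leaf = next_leaf   # linked list of leaves for sequential access

def search(node, key):
    """Descend from root to a leaf, then scan the leaf sequentially."""
    while node.children is not None:            # internal node: pick a branch
        i = 0
        while i < len(node.keys) and key >= node.keys[i]:
            i += 1
        node = node.children[i]
    for k, rec in zip(node.keys, node.records): # leaf: sequential search
        if k == key:
            return rec
    return None

# Tiny illustrative tree: the root separates at 50 and 75.
leaf1 = Node(keys=[10, 20], records=["r10", "r20"])
leaf2 = Node(keys=[50, 55, 60], records=["r50", "r55", "r60"])
leaf3 = Node(keys=[75, 90], records=["r75", "r90"])
leaf1.next_leaf, leaf2.next_leaf = leaf2, leaf3
root = Node(keys=[50, 75], children=[leaf1, leaf2, leaf3])

print(search(root, 55))   # 'r55' -- the root directs us to the middle leaf
```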
B+ Tree Insertion
Suppose we want to insert a record with key 60 in the structure below. It belongs in the 3rd leaf
node, after 55. The tree is balanced, and that leaf node is already full, so we cannot insert
60 there.

In this case, we have to split the leaf node so that the key can be inserted into the tree without
affecting the fill factor, balance and order.

The 3rd leaf node would hold the values (50, 55, 60, 65, 70), and its key in the parent node is
currently 50. We split the leaf node in the middle so that the balance of the tree is not altered,
grouping (50, 55) and (60, 65, 70) into two leaf nodes.

If these two are to be leaf nodes, the intermediate node cannot branch only at 50. It should have 60
added to it, with a pointer to the new leaf node.

This is how we insert an entry when there is an overflow. In a normal scenario, it is very
easy to find the node where the key fits and place it in that leaf node.

B+ Tree Deletion

Suppose we want to delete 60 from the above example. In this case, we have to remove 60 from
the intermediate node as well as from the 4th leaf node. If we simply remove it from the
intermediate node, the tree will no longer satisfy the rules of the B+ tree, so we must modify it
to keep the tree balanced.

After deleting node 60 from above B+ tree and re-arranging the nodes, it will show as
follows:
HASHING

In a huge database structure, it is very inefficient to search all the index values and reach
the desired data. Hashing technique is used to calculate the direct location of a data record
on the disk without using index structure.

In this technique, data is stored at the data blocks whose address is generated by using the
hashing function. The memory location where these records are stored is known as data bucket
or data blocks.

In this, a hash function can use any column value to generate the address. Most of the
time, the hash function uses the primary key to generate the address of the data block. A hash
function can be any simple or complex mathematical function. We can even
consider the primary key itself as the address of the data block: in that case, each row is stored
in the data block whose address equals its primary key.
The diagram above shows data block addresses that are the same as the primary key values. The hash
function can also be a simple mathematical function like exponential, mod, cos, sin, etc. Suppose we
use a mod(5) hash function to determine the address of the data block. In this case, applying
mod(5) to the primary keys generates 3, 3, 1, 4 and 2 respectively, and the records are stored at
those data block addresses, as the short check below illustrates.
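A quick check of the mod(5) example, with hypothetical primary-key values chosen so that the addresses come out as 3, 3, 1, 4 and 2 (the original figure's keys are not given):

```python
def mod5(key):
    """Hash function: data block address = primary key mod 5."""
    return key % 5

primary_keys = [103, 108, 101, 104, 102]   # hypothetical key values
for pk in primary_keys:
    print(pk, "-> data block", mod5(pk))   # 3, 3, 1, 4, 2 respectively
# Keys 103 and 108 collide in block 3, so that bucket needs overflow handling.
```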

INDEXED SEQUENTIAL ACCESS METHOD (ISAM)

The potentially large size of the index file motivates the ISAM idea: what if we build an auxiliary
index on the index file, and so on recursively, until the final auxiliary index fits on one page?
This repeated construction of a one-level index leads to the tree structure illustrated in the
figure. The data entries of the ISAM index are in the leaf pages of the tree, plus additional
overflow pages chained to some leaf pages. In addition, some systems carefully
organize the layout of pages so that page boundaries correspond closely to the physical
characteristics of the underlying storage device. The ISAM structure is completely static
and facilitates such low-level optimizations.

Fig ISAM Index Structure


Each tree node is a disk page, and all the data resides in the leaf pages. This corresponds to an
index that uses Alternative (1) for data entries; we can instead create an index with Alternative (2)
by storing the data records in a separate file and storing <key, rid> pairs in the leaf pages of the
ISAM index. When the file is created, all leaf pages are allocated sequentially and sorted on the
search key value. The non-leaf-level pages are then allocated. If there are subsequent inserts to the
file, so that more entries are inserted into a leaf than will fit on a single page, additional pages are
needed, because the index structure is static. These additional pages are allocated from an
overflow area. The allocation of pages is illustrated in the figure below.

Fig: Page allocation in ISAM

Index Data Structures: There are two ways in which file data
entries can be organized:
1) hash-based indexing, which uses a search key, and
2) tree-based indexing.
Indexing refers to the process of:
i) finding a particular record in a file using one or more indexes, and
ii) storing records in any order (randomly on the disk).

1) Hash-Based Indexing: This type of indexing is used to find records quickly,
given a search key value.
In this, a group of file records is stored in pages based on the bucket method. Each
bucket consists of a primary page together with any other pages chained to it. To
determine the bucket for a record, a special function called a hash function is
applied to the search key. Given a bucket number, we can obtain the primary
page in one or more disk I/O operations.
Record Insertion into the Bucket: Records are inserted into the bucket by
allocating the needed "overflow" pages.
Record Searching: A hash function is used to find, first, the bucket containing the
records; then, by scanning all the pages in that bucket, the record with a given
search key can be found.
If the search is not on the search key value, then all the pages in the file
need to be scanned.
Record Retrieval: By applying the hash function to the record's search key, the page
containing the needed record can be identified and retrieved in one disk I/O.
Consider a file student with hash key rno. Applying the hash function to the rno
identifies the page that contains the needed record. The hash function 'h' uses
the last two digits of the binary value of the rno as the bucket identifier. A
search key index on marks obtained, i.e., mrks, contains <mrks, rid> pairs as data
entries in an auxiliary index file, as shown in the figure. The rid (record id)
points to the record whose search key value is mrks.

2) Tree-Based Indexing: In tree-based indexing, the records are arranged in a tree-like
structure. The data entries are sorted according to the search key values, and
they are arranged hierarchically so that the correct page of data entries can be
found.
Examples:
1) Consider the student records with search key rno arranged in a
tree-structured index. Retrieving the nodes (A'1, B'1, L'11, L'12 and L'13)
requires performing disk I/O for each.
The lowest leaf level contains the records. Additional records with rno < 19 are
added to the left of the leaf node L'11, and those with rno > 42 to the right of the
leaf node L'13.
The root node is the start of the search, and searches are then directed to the
correct leaf pages by the non-leaf pages, which contain node pointers separated
by search key values. The data entries in a subtree smaller than the key value ki
are pointed to by the left node pointer of ki, as shown in the figure.

2) To find the students whose roll numbers lie between 19 and 24, the
direction of the search is shown in the figure.
Suppose we want to find all the students with roll numbers lying between 17 and
40: we first direct the search to node A'1 and, after analyzing its contents,
forward the search to B'1, followed by the leaf node L'11, which actually contains
a required data entry. The other leaf nodes L'12 and L'13 also contain data
entries that fulfil our search criteria. For this, all the leaf pages must be
maintained in a doubly linked list.

Thus, L'12 can be fetched using the next pointer on L'11, and L'13 can be obtained
using the next pointer on L'12.
Number of disk I/Os = length of the path from the root to a leaf (incurred in the
search) + the number of leaf pages containing satisfying data entries.

Closed and Open Hashing: File organization based on the
technique of hashing allows us to avoid accessing an index structure.
Hashing also provides a way of constructing indexes. There are two types of
hashing techniques:
1) Static/open hashing, and 2) Dynamic/closed hashing.
In a hash file organization, we obtain the address of the disk block
containing a desired record directly by computing a function on the search
key value of the record. In our description of hashing, we use the term
bucket to denote a unit of storage that can store one or more records. A
bucket is typically a disk block, but it could be chosen to be smaller or
larger than a disk block.

Static hashing obtains the address of the disk block containing a desired record
directly by computing a hash function on the search key value of the record. In
static hashing, the number of buckets is static (fixed). The static hashing scheme
is illustrated in the figure.

The pages containing the index data can be viewed as a collection of buckets,
with one primary page per bucket and possibly additional overflow pages. A file
consists of buckets 0 through N - 1, for N buckets. Buckets contain data entries,
which can be any of the three choices: k*, a <k, rid> pair, or a <k, rid-list> pair.

To search for a data entry, we apply a hash function 'h' to identify the
bucket to which it belongs and then search this bucket.
To insert a data entry, we use the hash function to identify the correct
bucket and then put the data entry there. If there is no space for the data
entry, we allocate a new overflow page, put the data entry there, and add the
page to the bucket's overflow chain.
To delete a data entry, we use the hash function to identify the correct
bucket, locate the data entry by searching the bucket, and then remove it.
If this data entry is the last in an overflow page, the overflow page is
removed and added to a list of free pages.

Thus, the number of buckets in a static hashing file is known when the file
is created, and the pages can be stored as successive disk pages. A sketch of
these three operations follows.
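These three operations can be sketched with N fixed buckets, each a chain of pages beginning with the primary page. The bucket count and page capacity below are assumptions for illustration.

```python
N_BUCKETS = 4
PAGE_CAPACITY = 2                     # assumed data entries per page

# Each bucket is a chain of pages; the first page is the primary page.
buckets = [[[]] for _ in range(N_BUCKETS)]

def h(key):
    return hash(key) % N_BUCKETS      # static: the number of buckets never changes

def insert(key, rid):
    pages = buckets[h(key)]
    for page in pages:
        if len(page) < PAGE_CAPACITY:
            page.append((key, rid))
            return
    pages.append([(key, rid)])        # all pages full: allocate an overflow page

def search(key):
    return [rid for page in buckets[h(key)] for k, rid in page if k == key]

def delete(key, rid):
    for page in buckets[h(key)]:
        if (key, rid) in page:
            page.remove((key, rid))
            return

insert(7, "r1"); insert(11, "r2"); insert(3, "r3")   # 7, 11 and 3 share bucket 3
print(search(7))   # ['r1'] -- found after hashing straight to bucket 3
```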

Drawbacks of Static Hashing:

🡪The main problem with static hashing is that the number of buckets is fixed.

🡪 If a file shrinks greatly, a lot of space is wasted.

🡪 If a file grows a lot, long overflow chains develop, resulting in poor performance.

Dynamic Hashing: Dynamic hashing techniques allow the hash function
to be modified dynamically to accommodate the growth or shrinkage of the
database. Most databases grow larger over time, and static hashing
techniques present serious problems in dealing with them.
Thus, if we are using static hashing on such growing databases, we have three
options:

1) Choose a hash function based on the current file size. This option will result in
performance degradation as the database grows.
2) Choose a hash function based on the predicted future size of the file. This option
will result in the wastage of space.
3) Periodically reorganize the hash structure in response to file growth.
Thus, using dynamic hashing techniques is the best solution. They are two types,
1. Extensible Hashing Scheme: Uses a directory to support inserts and deletes
efficiently with no overflow pages.
2. Linear Hashing Scheme: Uses a clever policy for creating new buckets and
supports inserts and deletes efficiently without the use of a directory.
Comparison of File Organization: To compare file organizations, we
consider the following operations that can be performed on a record. They are,
1) Record Insertion: For inserting a record we need to identify and fetch the page
from the disk. The record is then added and the (modified) page is written back
to the disk.
2) Record Deletion: It follows the same procedure as record insertion, except that
after identifying and fetching the page, the record with the given rid is deleted and
the changed page is written back to the disk.
3) Record Scanning: In this, all the file pages must be fetched from the disk and
stored in a pool of buffers. Then the corresponding records can be retrieved.
4) Record Searching Based on Equality Selection: In this, all the records that
satisfy a given equality selection criterion are fetched from the disk.
Example: Find a student record based on the equality selection criterion "the
student whose roll number (rno) is 15 and whose marks (mrks) are 90".
5) Record Searching Based on Range Selection: In this, all the records that
satisfy a given range selection are fetched.
Example: Find all the records of the students whose secured marks are greater than
50.
Cost Model in terms of time needed for execution:
It is a method used to calculate the costs of different operations that are performed on the
database.
Notations:
B = the number of data pages, when records are packed onto pages with no wasted space.
R = the number of records per page.
D = the average time needed to read or write (R/W) a disk page.
C = the average time needed to process a record.
H = the time required to apply the hash function to a record in hashed-file organization.
F = fan-out (in tree indexes).
For calculating I/O costs (which are the basis for the costs of the
database operations) we take D = 15 ms, and C and H = 100 ns.
Heap Files:
1) Cost of Scanning: The cost of scanning heap files is given by B(D + RC). That is,
scanning R records on each of B pages at time C per record takes BRC, and
scanning B pages at time D per page takes BD. Therefore the total cost of
scanning is BD + BRC 🡺 B(D + RC).
2) Cost of Insertion: The cost to insert a record in a heap file is given as 2D + C. That is,
to insert a record, first we need to fetch the last page of the file, which takes time D;
then we need to add the record, which takes time C; and finally the page is written back
to the disk from main memory, which takes time D. So the total cost is D + D + C 🡺
2D + C.
3) Cost of Deletion: The cost to delete a record from a heap file is given as D + C
+ D = 2D + C. That is, to delete a record, we first find it by reading its page (time D),
remove the record (time C), and write the modified page back (time D).

4) Record Searching Based on Equality Criteria: Finding exactly one
record that meets the equality criterion involves scanning half of the file, on the
assumption that the record exists.
This takes time = 1/2 x scanning cost 🡺 1/2 x B(D + RC).
In the case of multiple matching records, the entire file needs to be scanned.
5) Record Searching with a Range Selection: This is the same as the cost of
scanning, because it is not known in advance how many records will satisfy the
particular range. Thus we need to scan the entire file, which takes B(D + RC). The
sketch below evaluates these formulas numerically.
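The heap-file formulas above can be checked numerically with the parameter values given earlier (D = 15 ms, C = 100 ns); the values of B and R below are hypothetical.

```python
# Cost-model parameters (D and C from the notes; B and R are hypothetical).
D = 15e-3      # average time to read/write a disk page: 15 ms
C = 100e-9     # average time to process a record: 100 ns
B = 1000       # number of data pages
R = 100        # records per page

scan       = B * (D + R * C)          # B(D + RC): read every page, touch every record
insert     = 2 * D + C                # fetch last page + add record + write back
delete     = 2 * D + C                # find page + remove record + write back
eq_search  = 0.5 * B * (D + R * C)    # on average, scan half the file
rng_search = B * (D + R * C)          # range selection must scan the whole file

print(f"scan: {scan:.3f}s, equality search: {eq_search:.3f}s, insert: {insert:.4f}s")
# Disk I/O (the D terms) dominates: 15 ms per page dwarfs 100 ns per record.
```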

Sorted Files:
1) Cost of Scanning: The cost of scanning sorted files is given by B(D + RC),
because all the pages need to be scanned to retrieve a record; i.e., the cost of
scanning sorted files = the cost of scanning heap files.
2) Cost of Insertion: The cost of insertion in sorted files is given by search cost + B(D + RC).
It includes finding the correct position of the record, adding the record, fetching the
pages, and rewriting the pages.
3) Cost of Deletion: The cost of deletion in sorted files is given by search cost + B(D + RC).
It includes searching for the record, removing the record, and rewriting the modified page.
Note: The record deletion is based on equality.

4) Cost of Searching with Equality Selection Criteria: For sorted files this cost is
D log2 B, the time required to perform a binary search for the page that contains the
record.
If many records qualify, the cost is D log2 B + C log2 R + the cost of sequentially
reading all the matching records.
5) Cost of Searching with Range Selection: This cost is given as
the cost of fetching the first matching record's page + the cost of obtaining the set of
qualifying records.
If the range is small, a single page contains all the matching records; otherwise
additional pages need to be fetched.
Clustered Files:
1) Cost of Scanning: The cost of scanning clustered files is the same as the cost of
scanning sorted files, except that clustered files occupy more pages (about 1.5B):
scanning the pages at time D per page, and R records per page at time C per record,
gives a total cost of 1.5B(D + RC).
2) Cost of Insertion: The cost of insertion in clustered files is search + write:
(D logF 1.5B + C log2 R) + D.
3) Cost of Deletion: It is the same as the cost of insertion and includes
the cost of searching for the record, removing the record, and rewriting the modified page,
i.e., D logF 1.5B + C log2 R + D.
4) Equality Selection Search:
i) For a Single Qualifying Record: The cost of finding a single qualifying
record in clustered files is the sum of the binary search for the first page,
D logF 1.5B, and the search for the first matching record within it, C log2 R,
i.e., D logF 1.5B + C log2 R.
ii) For Several Qualifying Records: If more than one record satisfies the
selection criterion, the matching records are assumed to be located consecutively.
The cost required to find them is
D logF 1.5B + C log2 R + the cost of sequentially reading all matching records.

5) Range Selection Search: This cost is the same as that of an equality search with
several matching records.

Heap File with Un-clustered Tree Index:


1) Scanning: To scan the students file:
i) scan the index's leaf level,
ii) fetch the relevant record from the file for each data entry, and
iii) obtain the data records sorted according to <rno, mrks>.
The cost of reading all the data entries is 0.15B(D + 6.7RC); in addition, for each
data entry a record has to be fetched in one I/O.
2) Insertion: The record is first inserted, at cost 2D + C, into the students heap file,
and the associated entry is inserted into the index: the correct leaf page is found in
D logF 0.15B + C log2 6.7R, followed by the addition of the new entry and rewriting,
in D.
3) Deletion: The cost of deletion includes
the cost of finding the record in the file + the cost of finding the entry in the index +
the cost of rewriting the modified pages in the index and the file.
It comes to D logF 0.15B + C log2 6.7R + D + 2D.
4) Equality Selection Search: The cost of this operation is the sum of:
i) the cost of finding the page containing a matching entry,
ii) the cost of finding the first matching entry, and
iii) the cost of fetching the first matching record.
It is given as D logF 0.15B + C log2 6.7R + D.
5) Range Selection Based Search: This is the same as search with range selection in
clustered files, except that it goes through data entries rather than data pages.

Heap file with Un-clustered Hash Index:


1) Scanning: The total cost is the sum of the cost of retrieving all data entries and
one I/O for each data record. It is given as 0.125B(D + BRC) + BR(D + C).
2) Insertion: It involves the cost of inserting a record into the heap file, i.e., 2D + C,
plus the cost of finding the right index page, adding the new entry, and rewriting the
page; it is expressed as
2D + C + (H + 2D + C).
3) Deletion: It involves the cost of finding the data record and the data entry, at H
+ 2D + 4RC, and writing back the changed pages to the index and the file, at 2D. The
total cost is (H + 2D + 4RC) + 2D.

4) Equality Selection Search: The total cost of the search is the sum of:
i) identifying the page containing the qualifying entries, at cost H;
ii) retrieving the page, assuming it is the only page present in the bucket, at cost D;
iii) finding the entry after scanning half the records on the page, at cost 4RC; and
iv) fetching the record from the file, at cost D. The total cost is

H + D + 4RC + D 🡺 (H + 2D + 4RC)
If many records match, the cost is
H + D + 4RC + one I/O for each qualifying record.
5) Range Selection Search: The cost of this is B(D + RC), since the hash index is of
no help for ranges.

Comparing Advantages and Disadvantages of Different File Organizations:

File Organization              Advantages                                   Disadvantages

1) Heap file                   - Good storage efficiency                    - Slow searches
                               - Rapid scanning                             - Slow deletion
                               - Fast insertion

2) Sorted file                 - Good storage efficiency                    - Slow insertion
                               - Search faster than heap file               - Slow deletion

3) Clustered file              - Good storage efficiency                    - Space overhead
                               - Fast searches
                               - Efficient insertion and deletion

4) Heap file with              - Fast insertion, deletion and searching     - Scanning and range searches are slow
   un-clustered tree index

5) Heap file with              - Fast insertion, deletion and searching     - Does not support range searches
   un-clustered hash index
Dangling Pointer: A dangling pointer is a pointer that does not point to a valid
object of the appropriate type. Dangling pointers arise when an object is deleted or
de-allocated without modifying the value of the pointer, so that the pointer still
points to the memory location of the de-allocated object.
In an object-oriented database, a dangling pointer occurs if we move or delete a
record to which another record contains a pointer: that pointer no longer points to
the desired record.

Detecting Dangling Pointers in Object-Oriented Databases: Mapping objects to
files is similar to mapping tuples to files in a relational system; object data can be
stored using file structures. Objects are identified by an object identifier (OID), and
the storage system needs a mechanism to locate an object given its OID.

Logical identifiers do not directly specify an object's physical location; the system
must maintain an index that maps an OID to the object's actual location.

Physical identifiers encode the location of the object, so the object can be found
directly. Physical OIDs have the following parts:
1) a volume or file identifier,
2) a page identifier within the volume or file, and
3) an offset within the page.
Physical OIDs may also contain a unique identifier. This identifier is stored in the
object as well, and is used to detect references via dangling pointers, as the sketch
below shows.
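The unique-identifier check can be sketched as follows: the OID carries (volume, page, offset) plus a unique id, the stored object repeats that id, and a mismatch reveals a dangling pointer. The data structures below are illustrative assumptions, not any particular system's layout.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PhysicalOID:
    volume: int      # volume or file identifier
    page: int        # page identifier within the volume or file
    offset: int      # offset within the page
    unique: int      # unique id, also stored inside the object itself

storage = {}                                    # (volume, page, offset) -> object

def store(oid, data):
    storage[(oid.volume, oid.page, oid.offset)] = {"unique": oid.unique, "data": data}

def dereference(oid):
    """Follow a physical OID, detecting dangling references via the unique id."""
    obj = storage.get((oid.volume, oid.page, oid.offset))
    if obj is None or obj["unique"] != oid.unique:
        raise ValueError("dangling pointer: object moved or deleted")
    return obj["data"]

oid = PhysicalOID(volume=1, page=42, offset=3, unique=9001)
store(oid, "customer record")
print(dereference(oid))                         # 'customer record'
del storage[(1, 42, 3)]                         # object deleted elsewhere...
# dereference(oid) now raises: the pointer dangles and the check catches it.
```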

Indexes and Performance Tuning:


The performance of the system depends greatly on its indexes, which must be chosen
in terms of the expected workload.
Workload Impact: Data entries that qualify under particular selection criteria
can be retrieved effectively by means of indexes. Two selection types are:
1) Equality and 2) Range selection.

1) Equality:

🡪 An equality query on a composite search key is one in which each field of the key
is bound to a constant.
For example, data entries in a student file where rno = 15 and mrks = 90 can be
retrieved by an equality query.

🡪 This is supported by hash-file organization.

2) Range Query: A range query on a composite search key is one in which not all
fields of the key are bound to constants.
Example: Data entries in a student file where rno = 15, with any mrks, can be retrieved.

Thus, tree-based indexing supports both selection criteria, as well as inserts,
deletes and updates, whereas hash-based indexing supports only equality selection,
apart from insertion, deletion and updating.

Advantages of using tree-structured indexes:


1) By using tree-structured indexes, insertion and deletion of data entries can be handled
effectively.
2) It finds the correct leaf page faster than binary search in a sorted file.

Disadvantages:
1) In a sorted file, pages are stored in disk order, so sequential retrieval of pages is
quick; this is not guaranteed with tree-structured indexes.