File Structure and Indexing
Storage of Databases:
Databases typically store large amounts of data on hard disks, which remain the primary choice for large databases. However, in the future data may reside at different levels of the memory hierarchy.
• Hence, it is important to study and understand the properties and characteristics of magnetic disks, and the organization of data files and records on disk, for effective physical design of a database with acceptable performance.
• Usually, the DBMS has several options available for organizing the data. The process of physical database design involves choosing, from those options, the one that best suits the given application requirements.
• There are several primary file organizations, such as heap files (unordered files), sorted files (sequential files), hashed files, and B-trees. We will discuss some of these file organizations and the access techniques for each.
Storage Hierarchy:
The following diagram shows the various storage devices available and their hierarchy:
• Cache – fastest and most costly form of storage; volatile; managed by the system hardware.
• Main memory - fast access (tens to hundreds of nanoseconds; 1 nanosecond = 10^-9 seconds) and volatile. Generally too small (or too expensive) to store the entire database.
• Flash memory
• Data survives power failure.
• Data can be written at a location only once, but the location can be erased and written to again.
• Can support only a limited number (10K – 1M) of write/erase cycles.
• Erasing has to be done to an entire bank of memory.
• Reads are roughly as fast as main memory.
• But writes are slow (a few microseconds), and erase is slower.
• Widely used in embedded devices such as digital cameras, phones, and USB keys.
• Magnetic disk - Data is stored on a spinning disk, read/written magnetically, and survives power failures and system crashes. It is the primary medium for the long-term storage of data and typically stores the entire database. Data must be moved from disk to main memory for access, and written back for storage.
• Optical storage
• non-volatile, data is read optically from a spinning disk using a laser
• CD-ROM (640 MB) and DVD (4.7 to 17 GB) most popular forms
• Blu-ray disks: 27 GB to 54 GB
• Write-once, read-many (WORM) optical disks used for archival storage (CD-R, DVD-R, DVD+R)
• Multiple write versions also available (CD-RW, DVD-RW, DVD+RW, and DVD-RAM)
• Reads and writes are slower than with magnetic disk
• Jukebox systems, with large numbers of removable disks, a few drives, and a mechanism for automatic loading/unloading of disks, are available for storing large volumes of data.
File Organization
File organization determines how records are represented in a file. Various file organizations are used to store a relation on disk.
• A file consists of records. These records are mapped onto disk blocks. Block size is fixed but record
size can vary.
• Each file has a file header containing a variety of information about the file.
• In a relational database, tuples of distinct relations are generally of different sizes. The following are approaches for mapping the database to files:
• Fixed-length record files, storing one relation per file.
• Variable-length record files, storing one relation per file.
In an RDBMS, a particular relation (a single record type) may need variable-length records for several reasons:
• One or more of the fields are of varying size (variable-length fields). For example, the NAME field of EMPLOYEE can be a varchar(50) field.
• One or more of the fields may have multiple values for an individual record.
• One or more of the fields are optional.
• The file contains records of different record types and hence of varying size (a mixed file).
Files of fixed-length records are easier to implement than files of variable-length records, and many of the techniques used for the former can be applied to the variable-length case. Thus, we begin by considering a file of fixed-length records.
Fixed-Length Records
As an example, consider a file of account records for a bank database, and assume a record length of 40 bytes. A simple approach is to store the records one after another, each occupying 40 bytes, as shown in figure 1.
Fig. 1
However, there are two problems with this simple approach:
1. It is difficult to delete a record. Either the space of the deleted record must be reused when a new record is inserted, or we must have a way of marking deleted records so that they can be ignored.
2. Unless the block size happens to be a multiple of 40 (which is unlikely), some records will cross
block boundaries. That is, part of the record will be stored in one block and part in another. It would
thus require two block accesses to read or write such a record.
One option is, after deleting record 2, to move all the records that follow it up one by one to occupy the freed space, as shown in figure 2. Such an approach requires moving a large number of records.
Fig. 2
Another approach is to move the final record (record 8) of the file into the space occupied by the deleted record, as shown in figure 3. But doing so still requires additional block accesses, so it is undesirable. Since insertions tend to be more frequent than deletions, it is acceptable to leave open the space occupied by the deleted record, and to wait for a subsequent insertion before reusing the space.
Fig. 3
With this approach a simple marker on a deleted record is not sufficient, since it is hard to find the available space when an insertion is being done. So we use the file header, which stores the address of the first record whose contents have been deleted. That first deleted record stores the address of the second available record, and so on. The deleted records thus form a linked list, which is often referred to as a free list. Figure 4 shows the free list after records 1, 4, and 6 have been deleted.
Fig. 4
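The free-list scheme above can be sketched in a few lines. This is an illustrative toy, not actual DBMS code: an in-memory slot list stands in for the fixed-length disk slots, and the `free_head` field stands in for the address stored in the file header.

```python
# Toy fixed-length record file that reuses deleted slots via a free list.
RECORD_SIZE = 40  # bytes per record, as in the example above

class FixedLengthFile:
    def __init__(self):
        self.slots = []          # each slot holds a record or a free-list link
        self.free_head = None    # header field: first free slot, or None

    def insert(self, record):
        if self.free_head is not None:          # reuse a deleted slot first
            slot = self.free_head
            self.free_head = self.slots[slot]["next_free"]
            self.slots[slot] = {"record": record}
        else:                                    # otherwise append at the end
            slot = len(self.slots)
            self.slots.append({"record": record})
        return slot

    def delete(self, slot):
        # Link the freed slot at the head of the free list.
        self.slots[slot] = {"next_free": self.free_head}
        self.free_head = slot

f = FixedLengthFile()
for name in ["A-101", "A-102", "A-103", "A-104"]:
    f.insert(name)
f.delete(1)
f.delete(3)                 # free list is now 3 -> 1
slot = f.insert("A-201")    # reuses slot 3, the head of the free list
```

Note how a subsequent insertion consumes the most recently freed slot first, exactly as the linked free list implies.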
Variable-Length Records
For purposes of illustration we consider a different representation of the account information stored
in the file, in which we use one variable-length record for each branch name and for all the account
information for that branch.
Various ways to implement variable-length records are discussed below.
• Byte-string representation: each record is simply stored as a contiguous string of bytes. Some disadvantages:
• It is not easy to reuse space occupied formerly by a deleted record. Although techniques exist to
manage insertion and deletion, they lead to a large number of small fragments of disk storage that
are wasted.
• There is no space, in general, for records to grow longer. If a variable-length record becomes longer,
it must be moved—movement is costly if pointers to the record are stored elsewhere in the database
(e.g., in indices, or in other records), since the pointers must be located and updated.
Thus, the basic byte-string representation is not usually used for implementing variable-length
records. However, a modified form of the byte-string representation, called the slotted-page
structure (as shown in figure below), is commonly used for organizing records within a single block.
• There is a header at the beginning of each block, containing the following information:
➢ The number of record entries in the header
➢ The end of free space in the block
➢ An array whose entries contain the location and size of each record in the block.
The actual records are allocated contiguously in the block, starting from the end of the block. The
free space in the block is contiguous, between the final entry in the header array, and the first record.
The slotted-page structure requires that there be no pointers that point directly to records. Instead,
pointers must point to the entry in the header that contains the actual location of the record. This
level of indirection allows records to be moved to prevent fragmentation of space inside a block,
while supporting indirect pointers to the record.
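The slotted-page idea can be sketched as follows. This is a simplified illustration, not a real DBMS page layout: the 64-byte block size, the field names, and the sample records are assumptions made for the example.

```python
# Toy slotted page: header array grows from the front, records are allocated
# from the end of the block, and outside pointers refer to slot numbers.
BLOCK_SIZE = 64

class SlottedPage:
    def __init__(self):
        self.data = bytearray(BLOCK_SIZE)
        self.slots = []              # header array: (offset, size) per record
        self.free_end = BLOCK_SIZE   # end of free space (records grow downward)

    def insert(self, record: bytes):
        offset = self.free_end - len(record)
        self.data[offset:self.free_end] = record
        self.free_end = offset
        self.slots.append((offset, len(record)))
        return len(self.slots) - 1   # the slot number is the stable "pointer"

    def get(self, slot):
        offset, size = self.slots[slot]
        return bytes(self.data[offset:offset + size])

page = SlottedPage()
s0 = page.insert(b"Perryridge|A-102|400")
s1 = page.insert(b"Round Hill|A-305|350")
```

Because callers hold only slot numbers, the page is free to slide records around internally (for example, to compact free space) by updating offsets in the header array.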
• Reserved-space method: if there is a maximum record length that is never exceeded, we can use fixed-length records of that length, leaving unused fields empty. Those branches with fewer than three accounts (for example, Round Hill) have records with null fields. The reserved-space method is useful when most records have a length close to the maximum; otherwise, a significant amount of space may be wasted.
• List representation: We can represent variable-length records by lists of fixed length records,
chained together by pointers.
As shown in the above figure, pointers are used to chain together all records pertaining to the same branch, whereas for fixed-length records we used pointers to chain together only the deleted records. A disadvantage of this structure is that wasted space appears in every record except the first in a chain: only the first record needs to hold the branch-name value, but subsequent records carry the field anyway. This wasted space is significant, since we expect, in practice, that each branch has a large number of accounts.
To deal with this problem, we allow two kinds of blocks in our file:
• Anchor block, which contains the first record of a chain
• Overflow block, which contains records other than those that are the first record of a chain
Thus, all records within a block have the same length, even though not all records in the file have the
same length. Following figure shows this file structure.
Organization of Records in Files
An instance of a relation is a set of records. Given a set of records, the next question is how to
organize them in a file. Several of the possible ways of organizing records in files are:
1. Heap file organization: Any record can be placed anywhere in the file where there is space for the
record. There is no ordering of records. Typically, there is a single file for each relation.
2. Sequential file organization: Records are stored in sequential order, according to the value of a
“search key” of each record.
3. Hashing file organization: A hash function is computed on some attribute of each record. The
result of the hash function specifies in which block of the file the record should be placed.
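The hashing idea can be sketched with a toy hash function. The `branch_name` attribute and the character-sum hash are assumptions for illustration; real systems use much stronger hash functions, but the placement principle is the same.

```python
# Hashing file organization sketch: a hash function computed on one attribute
# of the record determines which block of the file the record is placed in.
NUM_BLOCKS = 8

def block_for(branch_name: str) -> int:
    # Illustrative hash: sum of character codes modulo the number of blocks.
    return sum(ord(c) for c in branch_name) % NUM_BLOCKS

blocks = [[] for _ in range(NUM_BLOCKS)]
for branch in ["Perryridge", "Downtown", "Mianus", "Round Hill"]:
    blocks[block_for(branch)].append(branch)
```

To retrieve all records for a branch, we recompute the hash and read only that one block, rather than scanning the file.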
An index for a file in a DBMS works in much the same way as the index of a textbook.
• The index at the back of a book lists topics (specified by a word or a phrase) along with the page numbers where each appears.
• We can search for the topic in the index page, find the pages where it occurs, and then read the
pages to find the information we are looking for.
• The words in the index are in sorted order, making it easy to find the word we are looking for.
Moreover, the index is much smaller than the book, further reducing the effort.
Database system indices play the same role as book indices. For example, to retrieve an account
record given the account number, the database system would look up an index to find on which disk
block the corresponding record resides, and then fetch the disk block, to get the account record. We
will discuss several indexing techniques and their advantages and disadvantages.
Indexing in Databases
• Indexing is a way to optimize performance of a database by minimizing the number of disk accesses
required when a query is processed.
• An index or database index is a data structure which is used to quickly locate and access the data
in a database table.
An attribute or set of attributes used to look up records in a file is called a search key. Indexes are created on some database columns, and each index entry has two pieces of information:
• The first column is the search key, which contains a copy of the primary key or candidate key of the table. These values are stored in sorted order so that the corresponding data can be accessed quickly. (Note that the data itself may or may not be stored in sorted order.)
• The second column is the data reference, which contains a set of pointers holding the address of the disk block where that particular key value can be found.
Suppose a table has 100 rows of data, each of size 20 bytes, and there is no index file. To read record number 100, the DBMS reads each row in turn, and only after reading 99 × 20 = 1980 bytes does it reach record number 100.
If we have an index, the search for record number 100 starts by reading the index file, not the table. The index, containing only two columns, may be just 4 bytes wide in each of its rows. After reading only 99 × 4 = 396 bytes of data from the index, the management system finds the entry for record number 100, reads the address of the disk block where record number 100 is stored, and points directly at the record in the physical storage device. The result is much quicker access to the record.
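The arithmetic above can be checked directly:

```python
# Bytes scanned to reach record number 100: full-table scan vs. index scan,
# using the row and index-entry sizes from the example above.
ROWS, ROW_SIZE, INDEX_ENTRY_SIZE = 100, 20, 4

bytes_scanned_without_index = (ROWS - 1) * ROW_SIZE       # 99 rows of 20 bytes
bytes_scanned_with_index = (ROWS - 1) * INDEX_ENTRY_SIZE  # 99 entries of 4 bytes

print(bytes_scanned_without_index)  # 1980
print(bytes_scanned_with_index)     # 396
```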
• The only minor disadvantage of using an index is that it takes up a little more space than the main table alone. Additionally, the index needs to be updated on insertion or deletion of records in the main table.
The commonly used types of single-level ordered indices are:
• Primary Index
• Clustering Index
• Secondary Index
Primary Index
• A primary index requires the rows in the data blocks to be ordered on the index key, or search key.
• So, apart from the index entries themselves being sorted in the index blocks, a primary index also enforces an ordering of rows in the data blocks.
The figure shows a sequential file of account records of a bank. Here data records are stored in search-key order, with branch-name used as the search key, so both the index file and the data file are ordered on the same field, branch-name.
If the ordering field in data file is not a key field, i.e. numerous records in the file can have the same
value for the ordering field, we can use another type of index, called a clustering index. Note that a
file can have at most one physical ordering field, so it can have at most one primary index or one
clustering index, but not both.
• Dense index: an index record appears for every search-key value in the file.
• Sparse index: an index record appears for only some of the search-key values. Each index entry points to the first data record with that search-key value. To locate a record with a given search key, we find the index entry with the largest search-key value less than or equal to the one we seek, start at the record pointed to by that index entry, and follow the pointers in the file until we find the desired record.
Suppose that we are looking up records for the Downtown branch, and there is no index record for that search key. Since the last entry (in alphabetic order) before “Downtown” is “Brighton,” we follow that pointer. We process this record, then follow the pointer in that record to locate the next record in search-key (branch-name) order, and continue processing records until we encounter a record for a branch other than Downtown in the data file.
As we have seen, it is generally faster to locate a record if we have a dense index rather than a sparse
index. However, sparse indices have advantages over dense indices in that they require less space
and they impose less maintenance overhead for insertions and deletions.
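The sparse-index lookup described above can be sketched as follows. The keys and block contents are illustrative sample data; a real sparse index would hold one entry per disk block, not an in-memory list.

```python
# Sparse index: one (search_key, block_number) entry for the first record of
# each block. Lookup: binary-search for the largest entry <= key, then scan.
import bisect

sparse_index = [("Brighton", 0), ("Mianus", 1), ("Redwood", 2)]

blocks = [
    ["Brighton", "Downtown", "Downtown"],    # block 0
    ["Mianus", "Perryridge", "Perryridge"],  # block 1
    ["Redwood", "Round Hill"],               # block 2
]

def lookup(key):
    keys = [k for k, _ in sparse_index]
    i = max(bisect.bisect_right(keys, key) - 1, 0)  # largest entry <= key
    results = []
    # Scan sequentially from that block until we pass the key.
    for b in range(sparse_index[i][1], len(blocks)):
        for record in blocks[b]:
            if record == key:
                results.append(record)
            elif record > key:
                return results
    return results
```

For “Downtown”, the search lands on the “Brighton” entry (block 0), then scans forward until the first record beyond “Downtown”, exactly as described above.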
Insertion and deletion of records in an ordered list poses some problems. With a primary index, the problem is compounded because, if we attempt to insert a record in its correct position in the data file, we not only have to move records to make space for the new record but also have to change some index entries, since moving records will change the anchor records of some blocks.
Using an unordered overflow file can reduce this problem. Another possibility is to use a linked list
of overflow records for each block in the data file.
• Secondary Indices
A secondary index provides a secondary means of accessing a file for which some primary access
already exists. The secondary index may be on a field which is a candidate key and has a unique value
in every record, or a non-key with duplicate values.
Secondary indices must be dense, with an index entry for every search-key value, and a pointer to
every record in the file. If a secondary index stores only some of the search-key values, records with
intermediate search-key values may be anywhere in the file and, in general, we cannot find them
without searching the entire file.
The index is an ordered file with two fields. The first field is an indexing field (some non-ordering field of the data file). The second field is either a block pointer or a record pointer. There can be many
secondary indexes (and hence, indexing fields) for the same file. A secondary index usually needs
more storage space and longer search time than does a primary index, because of its larger number
of entries.
The above diagram shows how the index entries in the index blocks (left side) contain pointers (row locators, in database terminology) to the corresponding rows in the data blocks (right side).
The data blocks do not have rows sorted on the index key. Each secondary index contains the pointer
to the block where the record exists. Once the appropriate block is transferred to main memory, a
search for the desired record within the block can be carried out.
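A dense secondary index on a non-ordering field can be sketched like this. The account numbers, branch names, and balances are illustrative sample data; the record positions stand in for the row locators in the diagram.

```python
# Data file ordered on account number; the secondary index is built on the
# non-ordering branch-name field, with one entry per record (dense index).
records = [
    ("A-101", "Downtown", 500),
    ("A-215", "Mianus", 700),
    ("A-102", "Perryridge", 400),
    ("A-305", "Downtown", 350),
]

# Secondary index: branch name -> list of record positions (row locators).
secondary = {}
for pos, (_, branch, _) in enumerate(records):
    secondary.setdefault(branch, []).append(pos)

# Fetch every Downtown record by following the pointers, no full scan needed.
downtown = [records[p] for p in secondary["Downtown"]]
```

Note that the matching rows (positions 0 and 3) are scattered through the data file, which is exactly why the index must be dense.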
• Multilevel Indices
Even if we use a sparse index, the index file itself may become very large for a big data file, so access time may be long even though we are using binary search. Note that if overflow blocks have been used, binary search will not be possible; in that case a sequential search is typically used, which takes even longer. Thus, the process of searching a large index may be costly. To deal with this problem, we treat the index just as we would treat any other sequential file, and construct a sparse index on the primary index, as shown below.
we treat the index just as we would treat any other sequential file, and construct a sparse index on
the primary index, as shown below.
To locate a record, we first use binary search on the outer index to find the record for the largest
search-key value less than or equal to the one that we desire. The pointer points to
a block of the inner index. We scan this block until we find the record that has the largest search-key
value less than or equal to the one that we desire. The pointer in this record points to the block of
the file that contains the record for which we are looking. Using the two levels of
indexing, we have read only one index block, rather than the seven we read with binary search, if we
assume that the outer index is already in main memory.
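The two-level lookup can be sketched as follows. The keys are illustrative, and the small in-memory lists stand in for the outer index (assumed to fit in main memory) and the blocks of the inner index.

```python
# Two-level index: binary search the small outer index, then scan one block
# of the inner index to find the data-file block to fetch.
import bisect

outer = ["Brighton", "Perryridge"]          # first key of each inner block
inner_blocks = [
    [("Brighton", 0), ("Downtown", 1), ("Mianus", 2)],       # inner block 0
    [("Perryridge", 3), ("Redwood", 4), ("Round Hill", 5)],  # inner block 1
]

def lookup(key):
    # Largest outer entry <= key selects a single inner-index block.
    i = max(bisect.bisect_right(outer, key) - 1, 0)
    # Scan that block for the largest entry <= key.
    ptr = None
    for k, block in inner_blocks[i]:
        if k <= key:
            ptr = block
        else:
            break
    return ptr   # number of the data-file block to fetch
```

Only one inner-index block is read per lookup, regardless of how many inner blocks exist.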
If our file is extremely large, even the outer index may grow too large to fit in main memory. In such
a case, we can create yet another level of index. Indeed, we can repeat this process as many times
as necessary. Indices with two or more levels are called multilevel indices. Searching for records with
a multilevel index requires significantly fewer I/O operations than does searching for records by
binary search. Each level of index could correspond to a unit of physical storage. Thus, we may have
indices at the track, cylinder, and disk levels. Multilevel indices are closely related to tree structures,
such as the binary trees used for in-memory indexing.
B-tree
A multilevel index is a form of search tree; however, insertion and deletion of index entries is a severe problem because every level of the index is an ordered file. It can be very expensive to keep the index in sorted order, so we need a way to make insertions and deletions in indexes that does not require massive reorganization. B-trees address the problem of speeding up indexing schemes that are too large to fit into memory.
• A B-tree is a self-balancing tree data structure that maintains sorted data and allows searches,
sequential access, insertions, and deletions in logarithmic time.
• Each internal node of a B-tree contains a number of keys. The keys act as separation values
which divide its subtrees. For example, if an internal node has 3 child nodes (or subtrees) then
it must have 2 keys: a1 and a2.
• All values in the leftmost subtree will be less than a1, all values in the middle subtree will be
between a1 & a2, and all values in the rightmost subtree will be greater than a2.
• Each B-tree node contains key values and respective data (block) pointers to the actual data
records. Additionally, there are node pointers for the left and right intervals around a key
value.
Insertion
In a B-Tree, a new element is always added at a leaf node; that is, the new key value is always attached to a leaf node.
Step 1 - Check whether tree is Empty.
Step 2 - If tree is Empty, then create a new node with new key value and insert it into the tree as a
root node.
Step 3 - If tree is Not Empty, then find the suitable leaf node to which the new key value is added
using Binary Search Tree logic.
Step 4 - If that leaf node has empty position, add the new key value to that leaf node in ascending
order of key value within the node.
Step 5 - If that leaf node is already full, split that leaf node by sending its middle value to its parent node. Repeat the same process until the promoted value fits into a node.
Step 6 - If the splitting is performed at root node then the middle value becomes new root node for
the tree and the height of the tree is increased by one.
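The steps above can be sketched for a B-Tree of order 3 (at most 2 keys per node). This is a simplified in-memory illustration, not a disk-based implementation; it omits data pointers and deletion.

```python
# B-Tree insertion sketch: an overfull node splits and sends its middle key
# up to the parent (Steps 5-6 above); a root split grows the tree by one level.
class Node:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children or []   # empty for leaf nodes

def insert(root, key):
    # Returns the (possibly new) root after inserting key.
    split = _insert(root, key)
    if split is None:
        return root
    mid, left, right = split             # root itself split: tree grows taller
    return Node([mid], [left, right])

def _insert(node, key):
    if not node.children:                # Steps 3-4: reached a leaf, add key
        node.keys.append(key)
        node.keys.sort()
    else:
        i = sum(k < key for k in node.keys)   # choose subtree to descend into
        split = _insert(node.children[i], key)
        if split is not None:            # Step 5: child split, absorb its middle
            mid, left, right = split
            node.keys.insert(i, mid)
            node.children[i:i + 1] = [left, right]
    if len(node.keys) <= 2:              # node fits: nothing to propagate
        return None
    # Overfull (3 keys): split around the middle key, pass it to the parent.
    return (node.keys[1],
            Node(node.keys[:1], node.children[:2]),
            Node(node.keys[2:], node.children[2:]))

root = Node()
for k in range(1, 11):                   # insert 1 to 10, as in the example
    root = insert(root, k)
```

After inserting 1 to 10, the root holds [4], with subtrees keyed on [2] and [6, 8], matching the construction in the example.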
Example: Let us understand the insertion operation by constructing a B-Tree of order 3, inserting the numbers from 1 to 10.
While adding 4 we find an empty position in the leaf node, so we add it to that node.
While adding 8 we find an empty position in a leaf node, so we add it there.
B+Tree
• In order to implement dynamic multilevel indexing, B-trees and B+ trees are generally employed.
• The drawback of the B-tree is that it stores the data pointer (a pointer to the disk file block containing the key value) together with that key value in each node of the tree. This reduces the number of entries that can be packed into a node, which increases the number of levels in the B-tree and hence the search time for a record.
• B+ tree eliminates the above drawback by storing data pointers only at the leaf nodes of the
tree. Thus, the structure of leaf nodes of a B+ tree is quite different from the structure of
internal nodes of the B+ tree.
• Internal nodes contain only search keys (no data pointers).
• Leaf nodes contain the pointers to data records.
• Data records are in sorted order by the search key.
• All leaves are at the same depth.
Example of B+ tree of order 3
B+Tree Index:
Modern databases mainly use the B+Tree index, which performs range queries efficiently.
• Only the leaf nodes store information (the location of the rows in the associated table).
• The other nodes are just there to route the search to the right node.
With this B+Tree, if we are looking for values between 40 and 100:
• We just have to look for 40 (or the closest value after 40 if 40 doesn’t exist) like we did with the
previous tree.
• Then gather the successors of 40 using the direct links to the successors until we reach 100.
Let’s say you found M successors and the tree has N nodes. The search for a specific node costs log(N), as with the previous tree. But once you have this node, you get the M successors in M operations using the links between successors. This search costs only (M + log(N)) operations versus N operations with the previous B-tree. Moreover, you don’t need to read the full tree (just M + log(N) nodes), which means less disk usage. If M is low (say 200 rows) and N is large (1,000,000 rows), it makes a big difference.
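The leaf-chain range scan can be sketched as follows. The leaf contents and integer keys are illustrative; the linear hunt for the starting leaf stands in for the log(N) tree descent a real B+Tree would perform.

```python
# B+Tree range query sketch: find the leaf containing the lower bound, then
# follow the next-leaf links, collecting keys until the upper bound is passed.
leaves = [
    {"keys": [10, 20, 30], "next": 1},
    {"keys": [40, 50, 60], "next": 2},
    {"keys": [70, 80, 90], "next": 3},
    {"keys": [100, 110], "next": None},
]

def range_query(lo, hi):
    # Locate the first leaf that could contain lo (stand-in for tree descent).
    i = 0
    while leaves[i]["next"] is not None and leaves[i]["keys"][-1] < lo:
        i = leaves[i]["next"]
    # Follow the leaf chain, collecting keys until we pass hi.
    result = []
    while i is not None:
        for k in leaves[i]["keys"]:
            if lo <= k <= hi:
                result.append(k)
            elif k > hi:
                return result
        i = leaves[i]["next"]
    return result
```

The scan touches only the leaves holding the answer (plus the descent), never the whole tree.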
Disadvantage of B+ tree:
If we add or remove a row in a database (and therefore in the associated B+Tree index):
• We have to keep the order between nodes inside the B+Tree, otherwise we won’t be able to find
nodes.
• We have to keep the lowest possible number of levels in the B+Tree, otherwise the O(log(N)) time complexity will degrade towards O(N).
Add D: no empty place in the leaf node, so C is promoted to the parent node and D is added in the leaf.
Add E: no empty place in the leaf node, so D is promoted, but there is no empty place in the parent node either. Split the parent node and promote C to the next higher level.
Add G: no empty place in the leaf, so split it and promote F (as it is the middle value). But the higher-level parent also has no empty place, so split it too and promote E to the next higher level.
Add H: no empty place in the leaf, so split it and promote G (as it is the middle value) to the higher-level parent.
Add I: no empty place in the leaf, so split it and promote H (as it is the middle value) to the higher-level parent. But the higher-level parent also has no empty place, so split it and promote G to the next higher level. That parent also has no empty place (C, E, G), so split it and promote E to the next higher level.