Index Structures

The document discusses index structures in databases, explaining their importance in efficiently retrieving records based on specific search keys. It covers various types of indexes, including dense and sparse indexes, as well as primary and secondary indexes, and introduces B-trees as a balanced data structure for indexing. Additionally, it outlines operations for inserting and deleting keys in B-trees, emphasizing the maintenance of properties that ensure efficient data retrieval.

UNIT-II
Index Structures
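 It is not sufficient simply to scatter the records that represent tuples of a relation among various
blocks. That layout is adequate only for a query that reads every tuple, such as SELECT * FROM R;
 Instead, a query that asks for particular tuples, such as
 SELECT * FROM R WHERE x = 5; should not have to examine every block.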
 An index is any data structure that takes the value of one or more fields and finds the records
with that value “quickly.” In particular, an index lets us find a record without having to look at
more than a small fraction of all possible records.
 The field(s) on whose values the index is based is called the search key, or just “key” if the
index is understood.
 Indexes help to speed up queries that specify values for one or more attributes.

Fig: An index takes a value for some field(s) and finds records with the matching value
Basics of Index Structures
 Storage structures consist of files, which are similar to the files used by operating systems. A data
file may be used to store a relation, for example. The data file may have one or more index files.
 Each index file associates values of the search key with pointers to data-file records that have that
value for the attribute(s) of the search key.
 Indexes are of two types:
 Dense: There is an entry in the index file for every record of the data file
 Sparse: Only some of the data records are represented in the index, often one index entry per block
of the data file.
 Indexes can also be
 Primary: The index determines the location of the records of the data file.
 Secondary: The index does not determine the location of the records of the data file.
 For example, it is common to create a primary index on the primary key of a relation and to create
secondary indexes on some of the other attributes.
Sequential files and Dense index
 A sequential file is created by sorting the tuples of a relation by their primary
key. The tuples are then distributed among blocks, in this order.
 If records are sorted, we can build on them a dense index, which is a sequence
of blocks holding only the keys of the records and pointers to the records
themselves.

Fig: A dense index (left) on a sequential data file (right)


Sequential files and Dense index
 The dense index supports queries that ask for records with a given search key
value. Given key value K, we search the index blocks for K, and when we find
it, we follow the associated pointer to the record with key K.
 It might appear that we need to examine every block of the index, or half the
blocks of the index, on average, before we find K. However, there are several
factors that make the index-based search more efficient than it seems.
1. The number of index blocks is usually small compared with the number of data
blocks.
2. Since keys are sorted, we can use binary search to find K. If there are n blocks of
the index, we only look at about log2 n of them.
3. The index may be small enough to be kept permanently in main memory buffers.
If so, the search for key K involves only main-memory accesses, and there are no
expensive disk I/Os to be performed.
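The lookup described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the data file is a hypothetical in-memory list, a "pointer" is just a list position, and the names `data_file`, `dense_keys`, `dense_ptrs` and `dense_lookup` are invented for the example.

```python
from bisect import bisect_left

# Hypothetical sequential data file: records sorted by key.
data_file = [(10, "rec-10"), (20, "rec-20"), (30, "rec-30"), (40, "rec-40")]

# Dense index: one (key, pointer) entry for every record, in key order.
dense_keys = [key for key, _ in data_file]
dense_ptrs = list(range(len(data_file)))   # pointer = position in data_file

def dense_lookup(k):
    """Binary-search the index for k, then follow the pointer to the record."""
    i = bisect_left(dense_keys, k)
    if i < len(dense_keys) and dense_keys[i] == k:
        return data_file[dense_ptrs[i]]
    return None  # k does not appear in the index, so no record has this key
```

Because every key is in the index, a failed index search already tells us the record does not exist; the data file is never touched in that case.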
Sparse index
 A sparse index typically has only one key-pointer pair per block of the data file. It thus uses
less space than a dense index, at the expense of somewhat more time to find a record given
its key.
 You can only use a sparse index if the data file is sorted by the search key, while a dense
index can be used for any search key.

Fig: A sparse index (index file) on a sequential data file (data file)
Searching using Sparse index

 To find the record with search-key value K, we search the sparse index for
the largest key less than or equal to K.
 Since the index file is sorted by key, a binary search can locate this entry. We
follow the associated pointer to a data block.
 Now, we must search this block for the record with key K. Of course the
block must have enough format information that the records and their contents
can be identified.
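A minimal Python sketch of this search, assuming a hypothetical data file held in memory as a list of blocks and a sparse index that stores the first key of each block (the names `blocks`, `sparse_keys` and `sparse_lookup` are made up for the illustration):

```python
from bisect import bisect_right

# Hypothetical data file: sorted records, two per block.
blocks = [
    [(10, "a"), (20, "b")],
    [(30, "c"), (40, "d")],
    [(50, "e"), (60, "f")],
]

# Sparse index: one entry per block (its first key); pointer = block number.
sparse_keys = [block[0][0] for block in blocks]

def sparse_lookup(k):
    """Find the largest index key <= k, then scan that one data block."""
    i = bisect_right(sparse_keys, k) - 1
    if i < 0:
        return None            # k is smaller than every key in the file
    for key, value in blocks[i]:
        if key == k:
            return (key, value)
    return None                # the block was the right one, but k is absent
```

Unlike the dense case, a key missing from the index may still be in the file, so the data block always has to be read.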
Multiple Levels of Index
 An index file can cover many blocks. Even if we use binary search to find the
desired index entry, we still may need to do many disk I/Os to get to the record
we want. By putting an index on the index, we can make the use of the first level of
index more efficient.

Figure 14.4: Adding a second level of sparse index
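As a rough sketch of the idea, the following Python fragment places a second-level sparse index on top of a first-level sparse index. The block contents, the constant `ENTRIES_PER_BLOCK` and the function `find_data_block` are assumptions chosen for the illustration, and data blocks are assumed to be numbered in key order.

```python
from bisect import bisect_right

ENTRIES_PER_BLOCK = 3   # assumed number of keys per first-level index block

# Hypothetical first-level sparse index, split into blocks.
# Each key is the first key of one data block; data blocks are numbered 0, 1, 2, ...
level1_blocks = [[10, 30, 50], [70, 90, 110], [130, 150, 170]]

# Second-level sparse index: the first key of each first-level block.
level2_keys = [block[0] for block in level1_blocks]

def find_data_block(k):
    """Return the number of the data block that may hold key k, or None."""
    j = bisect_right(level2_keys, k) - 1        # pick one first-level block
    if j < 0:
        return None                             # k precedes every key
    i = bisect_right(level1_blocks[j], k) - 1   # pick the entry inside it
    return j * ENTRIES_PER_BLOCK + i
```

Only one first-level block is read per lookup, however many first-level blocks exist; the second level is small enough that it is the natural candidate for staying in memory.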


Secondary Indexes
 A secondary index serves the purpose of any index: it is a data structure that
facilitates finding records given a value for one or more fields.
 However, the secondary index is distinguished from the primary index in that
a secondary index does not determine the placement of records in the data file.
 Rather, the secondary index tells us the current locations of records; that
location may have been decided by a primary index on some other field.
 An important consequence of the distinction between primary and secondary
indexes is that:
• Secondary indexes are always dense. It makes no sense to talk of a sparse,
secondary index. Since the secondary index does not influence location, we could
not use it to predict the location of any record whose key was not mentioned in the
index file explicitly.
Secondary Indexes
 The keys in the index file are sorted. The result is that the pointers in one index block
can go to many different data blocks, instead of one or a few consecutive blocks.
 For example, to retrieve all the records with search key 20, we not only have to look at
two index blocks, but we are sent by their pointers to three different data blocks. Thus,
using a secondary index may result in many more disk I/Os than if we get the same
number of records via a primary index.
 However, there is no help for this problem; we cannot control the order of tuples in the
data block, because they are presumably ordered according to some other attribute(s).

Example of secondary index


Indirection in Secondary Indexes
 There is some wasted space, perhaps a significant amount of wastage, in the structure. If a
search-key value appears n times in the data file, then the value is written n times in the
index file. It would be better if we could write the key value once for all the pointers to data
records with that value.
 A convenient way to avoid repeating values is to use a level of indirection, called buckets,
between the secondary index file and the data file.
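The bucket idea can be illustrated with a short Python sketch. The record list and the names `records`, `buckets`, `index` and `lookup` are hypothetical; a real system would store buckets in their own file of blocks rather than in a dictionary.

```python
from collections import defaultdict

# Hypothetical data file that is NOT sorted on the indexed field.
records = [(20, "r0"), (40, "r1"), (20, "r2"), (10, "r3"), (20, "r4")]

# Bucket file: for each key value, the pointers (record positions) that carry it.
buckets = defaultdict(list)
for ptr, (key, _) in enumerate(records):
    buckets[key].append(ptr)

# Secondary index: every key value appears once, in sorted order,
# and leads to its bucket instead of directly to a record.
index = sorted(buckets)

def lookup(k):
    """Follow index -> bucket -> data records for key k."""
    return [records[p] for p in buckets.get(k, [])]
```

Here the key 20 is stored once in `index` even though three records carry it; only the pointers are repeated.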
Document Retrieval and Inverted Indexes
 With the advent of the World-Wide Web and the feasibility of keeping all documents on-line, the
retrieval of documents given keywords has become one of the largest database problems.
 A document may be thought of as a tuple in a relation Doc. This relation has very many attributes, one
corresponding to each possible word in a document. Each attribute is boolean — either the word is present in
the document, or it is not. Thus, the relation schema may be thought of as
Doc(hasCat, hasDog, ... )
where hasCat is true if and only if the document has the word “cat” at least once.
 There is a secondary index on each of the attributes of Doc. However, we save the trouble of indexing those
tuples for which the value of the attribute is FALSE; instead, the index leads us to only the documents for
which the word is present. That is, the index has entries only for the search-key value TRUE.
 Instead of creating a separate index for each attribute (i.e., for each word), the indexes are combined into one,
called an inverted index. This index uses indirect buckets for space efficiency.
 An inverted index is a data structure used primarily in text search engines. It
maps terms (words) to their locations in a set of documents.
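A small Python sketch of an inverted index follows. The three documents and the helper names `docs_with` and `docs_with_all` are invented for the example; each bucket entry is a (document, word position) pair.

```python
from collections import defaultdict

# Hypothetical document collection: document id -> text.
docs = {1: "the cat sat", 2: "the dog barked at the cat", 3: "a dog"}

# Inverted index: word -> bucket of (document id, word position) pairs.
inverted = defaultdict(list)
for doc_id, text in docs.items():
    for position, word in enumerate(text.split()):
        inverted[word].append((doc_id, position))

def docs_with(word):
    """Documents that contain the word at least once."""
    return sorted({doc_id for doc_id, _ in inverted.get(word, [])})

def docs_with_all(*words):
    """Documents that contain every given word (intersection of buckets)."""
    sets = [set(docs_with(w)) for w in words]
    return sorted(set.intersection(*sets)) if sets else []
```

Only words that actually occur get an entry, which corresponds to indexing only the value TRUE of each hasWord attribute.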
Document Retrieval and Inverted Indexes

An inverted index on documents


Document Retrieval and Inverted Indexes
Example: The figure illustrates a bucket file that has been used to indicate
occurrences of words in HTML documents.

• The first column indicates the type of occurrence, i.e., its marking, if any.
• The second and third columns are together the pointer to the occurrence.
• The third column indicates the document, and the second column gives the
number of the word in the document.
B Trees
 While one or two levels of index are often very helpful in speeding up queries,
there is a more general structure that is commonly used in commercial systems.
 This family of data structures is called B-trees, and the particular variant that is
most often used is known as a B+ tree.
 B-trees automatically maintain as many levels of index as is appropriate for the size of
the file being indexed.
 B-trees manage the space on the blocks they use so that every block is between half
used and completely full.
The Structure of B-Trees
 A B-tree organizes its blocks into a tree that is balanced, meaning that all paths
from the root to a leaf have the same length.
 Typically, there are three layers in a B-tree: the root, an intermediate layer, and
leaves, but any number of layers is possible.
The Structure of B-Trees
 There is a parameter ‘n’ associated with each B-tree index, and this parameter
determines the layout of all blocks of the B-tree.
 Each block will have space for n search-key values and n + 1 pointers.

Example: Finding the number of keys that fit in a B-tree block

 Suppose our blocks are 4096 bytes. Also let keys be integers of 4 bytes and let
pointers be 8 bytes. If there is no header information kept on the blocks, then we
want to find the largest integer value of n such that 4n + 8(n + 1) ≤ 4096, i.e.,
12n + 8 ≤ 4096. That value is n = 340.
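The same arithmetic can be checked with a few lines of Python. The constant names are chosen for the illustration only.

```python
BLOCK_SIZE = 4096    # bytes per block
KEY_SIZE = 4         # bytes per integer key
POINTER_SIZE = 8     # bytes per pointer

# n keys and n + 1 pointers must fit in one block:
#   KEY_SIZE * n + POINTER_SIZE * (n + 1) <= BLOCK_SIZE
n = (BLOCK_SIZE - POINTER_SIZE) // (KEY_SIZE + POINTER_SIZE)

bytes_used = KEY_SIZE * n + POINTER_SIZE * (n + 1)
print(n, bytes_used)   # 340 4088
```

With n = 340 the block holds 4088 bytes of keys and pointers; n = 341 would need 4100 bytes and no longer fit.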
The Structure of B-Trees
 There are several important rules about what can appear in the blocks of a B-tree:
1. The keys in leaf nodes are copies of keys from the data file. These keys are distributed among
the leaves in sorted order, from left to right.
2. At the root, there are at least two used pointers. All pointers point to B-tree blocks at the level
below.
3. At a leaf, the last pointer points to the next leaf block to the right, i.e., to the block with the
next higher keys. Among the other n pointers in a leaf block, at least ceil((n + 1)/2) of these
pointers are used and point to data records; unused pointers are null and do not point
anywhere. The ith pointer, if it is used, points to a record with the ith key.
Properties of B-Trees
 A B-Tree of order m has the following properties:
 Property #1 - All leaf nodes must be at the same level.
 Property #2 - All nodes except the root must have at least ceil(m/2) - 1 keys and at most
m - 1 keys.
 Property #3 - All non-leaf nodes except the root (i.e., all internal nodes) must have at least
ceil(m/2) children.
 Property #4 - If the root node is a non-leaf node, then it must have at least 2 children.
 Property #5 - A non-leaf node with n - 1 keys must have n children.
 Property #6 - All the key values in a node must be in ascending order.
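To make the bounds concrete, here is a small Python helper. It reads the minimum in Property #2 as ceil(m/2) - 1 keys and the minimum in Property #3 as ceil(m/2) children, which is an assumption about the intended rounding; the function names are hypothetical.

```python
def key_bounds(m, is_root=False):
    """Return (min_keys, max_keys) for a node in a B-tree of order m."""
    min_keys = 1 if is_root else (m + 1) // 2 - 1   # ceil(m/2) - 1
    return min_keys, m - 1

def child_bounds(m, is_root=False):
    """Return (min_children, max_children) for a non-leaf node."""
    min_children = 2 if is_root else (m + 1) // 2   # ceil(m/2)
    return min_children, m
```

For example, in a tree of order 5 every non-root node holds between 2 and 4 keys, and every internal node has between 3 and 5 children.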
Insertion Operation in B-Tree
 In a B-Tree, a new element must be added only at a leaf node. That means the new key value
is always attached to a leaf node. The insertion operation is performed as follows:
 Step 1 - Check whether the tree is empty.
 Step 2 - If the tree is empty, then create a new node with the new key value and insert it into the tree as the root
node.
 Step 3 - If the tree is not empty, then find the suitable leaf node to which the new key value is added, using
the same key comparisons as in a binary search tree.
 Step 4 - If that leaf node has an empty position, add the new key value to that leaf node in ascending order
of key value within the node.
 Step 5 - If that leaf node is already full, split that leaf node by sending the middle value to its parent node.
Repeat the same until the value sent up fits into a node.
 Step 6 - If the splitting is performed at the root node, then the middle value becomes the new root node for the
tree and the height of the tree is increased by one.
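The steps above can be sketched in Python as follows. This is one possible reading, not a definitive implementation: keys are assumed to be distinct, a node of order m is allowed to hold m keys momentarily before it is split (equivalent to splitting a full node on insert), and the class and function names are invented for the illustration.

```python
from bisect import bisect_left, insort

class Node:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children or []     # empty list marks a leaf

class BTree:
    def __init__(self, order):
        self.order = order                 # m: at most m - 1 keys per node
        self.root = None

    def insert(self, key):
        if self.root is None:              # Steps 1-2: empty tree
            self.root = Node([key])
            return
        split = self._insert(self.root, key)
        if split is not None:              # Step 6: the root itself was split
            middle, right = split
            self.root = Node([middle], [self.root, right])

    def _insert(self, node, key):
        if not node.children:              # Step 4: add to the leaf in order
            insort(node.keys, key)
        else:                              # Step 3: descend toward the leaf
            i = bisect_left(node.keys, key)
            split = self._insert(node.children[i], key)
            if split is not None:          # child sent its middle key up
                middle, right = split
                node.keys.insert(i, middle)
                node.children.insert(i + 1, right)
        if len(node.keys) < self.order:
            return None                    # no overflow, nothing to send up
        mid = len(node.keys) // 2          # Step 5: split around the middle key
        middle = node.keys[mid]
        right = Node(node.keys[mid + 1:], node.children[mid + 1:])
        node.keys = node.keys[:mid]
        node.children = node.children[:mid + 1]
        return middle, right

def inorder(node):
    """List all keys of the subtree in ascending order."""
    if node is None:
        return []
    if not node.children:
        return list(node.keys)
    out = []
    for i, child in enumerate(node.children):
        out += inorder(child)
        if i < len(node.keys):
            out.append(node.keys[i])
    return out

tree = BTree(order=3)
for k in [10, 20, 30, 40, 50, 60, 70]:
    tree.insert(k)
```

With order 3 and the keys 10 to 70 inserted in ascending order, the root splits twice, leaving a three-level tree whose root holds the single key 40.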
Deletion Operation in B-Tree

• To delete an element X from a B-tree, starting at a leaf node:
• Remove X from the current node. Being a leaf node, there are no subtrees to worry about.
• Removing X might cause the node containing it to have too few values.
• Recall that we require the root to have at least 1 value in it and all other nodes to have
at least (M-1)/2 values (rounded down) in them, where M is the order of the tree. If the node
has too few values, we say it has underflowed.
• If underflow does not occur, then we finish the deletion process. If it does occur, it must
be fixed.
• The process for fixing a root is slightly different from the process for fixing the other
nodes.
Deletion operation
The deletion operation in a B tree is slightly different from the deletion
operation of a Binary Search Tree. The procedure to delete a key from a B tree
is:

Case 1 − If the key to be deleted is in a leaf node and the deletion does not
violate the minimum key property, just delete the key.
Case 2 − If the key to be deleted is in a leaf node but the deletion violates the
minimum key property, borrow a key from either its left sibling or right sibling.
If both siblings have exactly the minimum number of keys, merge the node
with either of them.
Case 3 − If the key to be deleted is in an internal node, it is replaced by a
key from either the left child or the right child, based on which child has more keys. But
if both child nodes have the minimum number of keys, they are merged together.
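The three cases can be sketched in Python as below. Several simplifications are assumed and are not taken from the slides: keys are distinct, the order is at least 3, a borrowed key is rotated through the parent (the parent's separator moves down and the sibling's key moves up), and Case 3 always uses the in-order predecessor from the left subtree rather than choosing the fuller child. All names are hypothetical.

```python
from bisect import bisect_left

class Node:
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children or []     # empty list marks a leaf

def min_keys(order):
    return (order + 1) // 2 - 1            # ceil(order / 2) - 1

def delete(node, key, order):
    """Delete key below node, repairing any underflow on the way back up."""
    i = bisect_left(node.keys, key)
    found = i < len(node.keys) and node.keys[i] == key
    if not node.children:                  # leaf: simply remove the key
        if found:
            node.keys.pop(i)
        return
    if found:                              # Case 3: replace by the predecessor
        pred = node.children[i]
        while pred.children:
            pred = pred.children[-1]
        key = pred.keys[-1]
        node.keys[i] = key                 # then delete the predecessor below
    delete(node.children[i], key, order)
    if len(node.children[i].keys) < min_keys(order):
        fix_underflow(node, i, order)      # Case 2

def fix_underflow(parent, i, order):
    child = parent.children[i]
    left = parent.children[i - 1] if i > 0 else None
    right = parent.children[i + 1] if i + 1 < len(parent.children) else None
    if left is not None and len(left.keys) > min_keys(order):
        child.keys.insert(0, parent.keys[i - 1])       # borrow via the parent
        parent.keys[i - 1] = left.keys.pop()
        if left.children:
            child.children.insert(0, left.children.pop())
    elif right is not None and len(right.keys) > min_keys(order):
        child.keys.append(parent.keys[i])
        parent.keys[i] = right.keys.pop(0)
        if right.children:
            child.children.append(right.children.pop(0))
    else:                                  # both siblings minimal: merge
        if left is None:                   # no left sibling, use the right one
            left, child, i = child, right, i + 1
        left.keys += [parent.keys.pop(i - 1)] + child.keys
        left.children += child.children
        parent.children.pop(i)

def delete_key(root, key, order):
    """Delete key and return the (possibly new) root."""
    delete(root, key, order)
    if not root.keys and root.children:    # root emptied by a merge
        return root.children[0]
    return root

def inorder(node):
    if not node.children:
        return list(node.keys)
    out = []
    for i, child in enumerate(node.children):
        out += inorder(child)
        if i < len(node.keys):
            out.append(node.keys[i])
    return out

# Order-3 tree: root [20, 40] over the leaves [10], [30], [50, 60].
root = Node([20, 40], [Node([10]), Node([30]), Node([50, 60])])
for k in (30, 40, 50):
    root = delete_key(root, k, order=3)
```

In this run, deleting 30 borrows from the right sibling, deleting 40 forces a merge with the left sibling, and deleting 50 removes a key from an internal node by substituting its predecessor.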
