UNIT-5: Indexing and Hashing
INDEXING AND HASHING
BASIC CONCEPTS
Indexing mechanisms used to speed up access to desired data.
• E.g., author catalog in library
Search Key - attribute or set of attributes used to look up records
in a file.
An index file consists of records (called index entries) of the
form (search-key, pointer).
Index files are typically much smaller than the original file.
• For un-clustered index: sparse index on top of dense index (multilevel index)
SECONDARY INDICES EXAMPLE
Secondary index on salary field of instructor
Index record points to a bucket that contains pointers to all the actual
records with that particular search-key value.
Secondary indices have to be dense
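The bucket structure described above can be sketched as follows. This is a minimal illustration, not a real storage layout: the records are hypothetical, and a list position stands in for a record pointer.

```python
from collections import defaultdict

# Hypothetical instructor records (name, salary); the list position
# stands in for a pointer to the record in the file.
instructors = [
    ("Srinivasan", 65000),
    ("Wu", 90000),
    ("Mozart", 40000),
    ("Einstein", 95000),
    ("Gold", 87000),
    ("Katz", 75000),
    ("Singh", 80000),
    ("Kim", 80000),
]

def build_secondary_index(records, field):
    """Dense secondary index: one entry per distinct search-key value,
    each pointing to a bucket of pointers to the matching records."""
    index = defaultdict(list)
    for pointer, record in enumerate(records):
        index[record[field]].append(pointer)
    return dict(index)

salary_index = build_secondary_index(instructors, field=1)

# Follow the bucket for salary 80000 to the actual records.
names = [instructors[p][0] for p in salary_index[80000]]
```

Because the file is not sorted on salary, every search-key value needs its own index entry, which is why the index has to be dense.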
MULTILEVEL INDEX
Typical node
Result of splitting node containing Brandt, Califieri and Crick on inserting Adams
Next step: insert entry with (Califieri, pointer-to-new-node) into parent
B+-TREE INSERTION
Affected nodes
INSERTION IN B+-TREES (CONT.)
Splitting a non-leaf node: when inserting (k,p) into an already full internal node N
• Copy N to an in-memory area M with space for n+1 pointers and n keys
• Insert (k,p) into M
• Copy P_1, K_1, …, K_{⌈n/2⌉-1}, P_⌈n/2⌉ from M back into node N
• Copy P_{⌈n/2⌉+1}, K_{⌈n/2⌉+1}, …, K_n, P_{n+1} from M into newly allocated node N'
• Insert (K_⌈n/2⌉, N') into the parent of N
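The split steps above can be sketched in Python. This is an illustration under stated assumptions: a node is modelled as two plain lists (keys and pointers), the split point is taken as ⌈n/2⌉, and the function name `split_internal` is hypothetical.

```python
import bisect
import math

def split_internal(keys, pointers, k, p, n):
    """Split a full internal node (n pointers, n-1 keys) on inserting (k, p).

    Returns (left_node, key_for_parent, right_node), where each node is a
    (keys, pointers) pair. Assumes p points to the subtree to the right of k.
    """
    assert len(pointers) == n and len(keys) == n - 1

    # Copy N into the in-memory area M and insert (k, p) there.
    m_keys, m_ptrs = list(keys), list(pointers)
    pos = bisect.bisect_right(m_keys, k)
    m_keys.insert(pos, k)
    m_ptrs.insert(pos + 1, p)

    h = math.ceil(n / 2)
    # P_1 .. P_h and K_1 .. K_{h-1} go back into N.
    left = (m_keys[:h - 1], m_ptrs[:h])
    # K_h is not kept in either node; it moves up to the parent.
    up_key = m_keys[h - 1]
    # P_{h+1} .. P_{n+1} and K_{h+1} .. K_n go into the new node N'.
    right = (m_keys[h:], m_ptrs[h:])
    return left, up_key, right

# Example with n = 4: a full node with 3 keys and 4 pointers.
left, up_key, right = split_internal([10, 20, 30], ["a", "b", "c", "d"], 25, "x", 4)
```

Unlike a leaf split, the middle key is moved up rather than copied, so it appears in neither of the two resulting nodes.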
Example
Affected nodes
Leaf containing Singh and Wu became underfull, and borrowed a value Kim from its left
sibling
Search-key value in the parent changes as a result
EXAMPLE OF B+-TREE DELETION (CONT.)
Node with Gold and Katz became underfull, and was merged with its sibling
Parent node becomes underfull, and is merged with its sibling
• Value separating two nodes (at the parent) is pulled down when merging
Root node then has only one child, and is deleted
UPDATES ON B+-TREES: DELETION
Assume record already deleted from file. Let V be the search key value of the record, and Pr
be the pointer to the record.
Remove (Pr, V) from the leaf node
If the node has too few entries due to the removal, and the entries in the node and a sibling fit
into a single node, then merge siblings:
• Insert all the search-key values in the two nodes into a single node (the one on the
left), and delete the other node.
• Delete the pair (K_{i-1}, P_i), where P_i is the pointer to the deleted node, from its parent,
recursively using the above procedure.
UPDATES ON B+-TREES: DELETION (CONT.)
Otherwise, if the node has too few entries due to the removal, but the entries in the node
and a sibling do not fit into a single node, then redistribute pointers:
• Redistribute the pointers between the node and a sibling such that both have more
than the minimum number of entries.
• Update the corresponding search-key value in the parent of the node.
The node deletions may cascade upwards until a node which has ⌈n/2⌉ or more pointers is
found.
If the root node has only one pointer after deletion, it is deleted and the sole child
becomes the root.
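The choice between merging and redistributing can be sketched for two adjacent leaves. This is a simplified illustration: leaves are plain sorted lists of keys, a leaf is assumed to hold between ⌈(n-1)/2⌉ and n-1 keys, and the function name `rebalance_leaves` is hypothetical. Updating the parent is left to the caller.

```python
import math

def rebalance_leaves(left, right, n):
    """Fix an underfull leaf using its adjacent sibling.

    Returns (action, new_left, new_right, new_separator). On a merge,
    new_right and new_separator are None and the parent entry for the
    right node has to be deleted by the caller.
    """
    max_keys = n - 1
    min_keys = math.ceil((n - 1) / 2)

    # Entries fit into a single node: merge into the left node.
    if len(left) + len(right) <= max_keys:
        return "merge", left + right, None, None

    # Otherwise borrow entries from the fuller sibling.
    left, right = list(left), list(right)
    while len(right) < min_keys:
        right.insert(0, left.pop())
    while len(left) < min_keys:
        left.append(right.pop(0))
    # The parent's separating key becomes the smallest key of the right node.
    return "redistribute", left, right, right[0]

# n = 4: the right leaf is underfull and borrows Kim from its left sibling.
borrowed = rebalance_leaves(["Gold", "Katz", "Kim"], ["Mozart"], 4)
# n = 4: both leaves together fit into one node, so they are merged.
merged = rebalance_leaves(["Katz"], ["Kim", "Mozart"], 4)
```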
COMPLEXITY OF UPDATES
Cost (in terms of number of I/O operations) of insertion and deletion of a single entry
proportional to height of the tree
• With K entries and maximum fanout of n, worst case complexity of insert/delete of an
entry is O(log_⌈n/2⌉(K))
In practice, number of I/O operations is less:
• Internal nodes tend to be in buffer
• Splits/merges are rare, most insert/delete operations only affect a leaf node
Average node occupancy depends on insertion order
• About 2/3 full with random insertion order, 1/2 full with insertion in sorted order
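To get a feel for the logarithmic bound, the worst-case height can be computed directly. The sketch below assumes every node is only half full (fanout ⌈n/2⌉) and counts levels with integer arithmetic to avoid floating-point rounding. The function name and the sample numbers are illustrative.

```python
import math

def max_height(num_entries, fanout):
    """Smallest h such that ceil(fanout / 2) ** h >= num_entries."""
    base = math.ceil(fanout / 2)
    if base < 2:
        raise ValueError("fanout must be at least 3")
    height, reach = 0, 1
    while reach < num_entries:
        reach *= base
        height += 1
    return height

# One million entries with a maximum fanout of 100:
# 50**3 = 125,000 is too small, 50**4 = 6,250,000 is enough.
levels = max_height(1_000_000, 100)
```

Under these assumptions an update touches about four levels of the tree, and fewer of them cost an I/O when the upper levels are already in the buffer.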
NON-UNIQUE SEARCH KEYS
Alternatives to scheme described earlier
• Buckets on separate block (bad idea)
• List of tuple pointers with each key
Extra code to handle long lists
Deletion of a tuple can be expensive if there are many duplicates on search key
(why?)
• Worst case complexity may be linear!
Low space overhead, no extra cost for queries
• Make search key unique by adding a record-identifier
Extra storage overhead for keys
Simpler code for insertion/deletion
Widely used
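The last alternative, making the search key unique by appending a record-identifier, can be sketched with a sorted list standing in for the leaf level of the tree. The function names and the integer record-ids are assumptions made for the illustration.

```python
import bisect

def insert(entries, key, rid):
    """Insert the composite key (key, rid), keeping the list sorted."""
    bisect.insort(entries, (key, rid))

def lookup(entries, key):
    """Return the record-ids of all entries with the given search key."""
    # (key,) sorts before every (key, rid), so this finds the first duplicate.
    i = bisect.bisect_left(entries, (key,))
    rids = []
    while i < len(entries) and entries[i][0] == key:
        rids.append(entries[i][1])
        i += 1
    return rids

def delete(entries, key, rid):
    """Delete exactly one entry; no scan over the other duplicates is needed."""
    i = bisect.bisect_left(entries, (key, rid))
    if i < len(entries) and entries[i] == (key, rid):
        del entries[i]
        return True
    return False

entries = []
insert(entries, 80000, 7)
insert(entries, 80000, 3)
insert(entries, 65000, 1)
before = lookup(entries, 80000)
removed = delete(entries, 80000, 7)
after = lookup(entries, 80000)
```

Because the composite key is unique, deletion locates its entry with a single search, which is the reason the code is simpler than with per-key pointer lists.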
B+-TREE FILE ORGANIZATION
B+-Tree File Organization:
• Leaf nodes in a B+-tree file organization store records, instead of pointers
• Helps keep data records clustered even when there are insertions/deletions/updates
Leaf nodes are still required to be half full
• Since records are larger than pointers, the maximum number of records that can be
stored in a leaf node is less than the number of pointers in a nonleaf node.
Insertion and deletion are handled in the same way as insertion and deletion of entries in a
B+-tree index.
B+-TREE FILE ORGANIZATION (CONT.)
Example of B+-tree File Organization
Good space utilization important since records use more space than pointers.
To improve space utilization, involve more sibling nodes in redistribution during splits and
merges
• Involving 2 siblings in redistribution (to avoid split / merge where possible) results in
each node having at least ⌊2n/3⌋ entries
OTHER ISSUES IN INDEXING
Record relocation and secondary indices
• If a record moves, all secondary indices that store record pointers have to be
updated
• Node splits in B+-tree file organizations become very expensive
• Solution: use search key of B+-tree file organization instead of record pointer in
secondary index
Add record-id if B+-tree file organization search key is non-unique
Extra traversal of file organization to locate record
• Higher cost for queries, but node splits are cheap
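The solution above can be sketched as follows. A sorted list stands in for the B+-tree file organization, so inserting a record shifts the positions of later records, much as a node split relocates records. The class, the function names and the sample IDs are hypothetical.

```python
import bisect

class PrimaryFile:
    """Stand-in for a B+-tree file organization: records sorted by primary key."""

    def __init__(self):
        self.keys = []
        self.records = []

    def insert(self, key, record):
        i = bisect.bisect_left(self.keys, key)
        self.keys.insert(i, key)        # records after position i move
        self.records.insert(i, record)

    def find(self, key):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.records[i]
        return None

primary = PrimaryFile()
secondary = {}  # salary -> list of primary search keys, not record pointers

def add_instructor(pk, name, salary):
    primary.insert(pk, (name, salary))
    secondary.setdefault(salary, []).append(pk)

def find_by_salary(salary):
    # Extra traversal: each hit is looked up again in the file organization.
    return [primary.find(pk) for pk in secondary.get(salary, [])]

add_instructor("45565", "Katz", 75000)
add_instructor("10101", "Srinivasan", 65000)  # moves Katz to a new position
result = find_by_salary(75000)
```

The secondary index never needs updating when records move, at the price of one additional lookup per matching record.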
INDEXING STRINGS
Variable length strings as keys
• Variable fanout
• Use space utilization as criterion for splitting, not number of pointers
Prefix compression
• Key values at internal nodes can be prefixes of full key
Keep enough characters to distinguish entries in the subtrees separated by the
key value
• E.g., “Silas” and “Silberschatz” can be separated by “Silb”
• Keys in leaf node can be compressed by sharing common prefixes
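The choice of a separating prefix for an internal node can be sketched as below. The function name is hypothetical, and the sketch assumes plain lexicographic string comparison and covers only the internal-node case, not the compression of leaf keys.

```python
def shortest_separator(left_key, right_key):
    """Shortest prefix s of right_key such that left_key < s <= right_key.

    left_key is the largest key in the left subtree and right_key the
    smallest key in the right subtree.
    """
    assert left_key < right_key
    for length in range(1, len(right_key) + 1):
        prefix = right_key[:length]
        if prefix > left_key:
            return prefix
    return right_key  # not reached when left_key < right_key

separator = shortest_separator("Silas", "Silberschatz")
```

For the pair from the example, the prefixes "S", "Si" and "Sil" all compare below "Silas", so four characters are the minimum that still routes searches to the correct subtree.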
BULK LOADING AND BOTTOM-UP BUILD
Inserting entries one-at-a-time into a B+-tree requires at least 1 I/O per entry