0% found this document useful (0 votes)
21 views

Indexing and Hashing

Indexing mechanisms like indices and hashing are used to speed up access to desired data records. There are two main types of indices: ordered indices which store search keys in sorted order, and hash indices which distribute search keys uniformly using a hash function. B-tree indexing structures are commonly used as they can efficiently support insertion, deletion and search operations in logarithmic time. B-trees keep the tree balanced through splitting nodes and propagating keys during insertion if nodes overflow.

Uploaded by

Mukul Dilwaria
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
21 views

Indexing and Hashing

Indexing mechanisms like indices and hashing are used to speed up access to desired data records. There are two main types of indices: ordered indices which store search keys in sorted order, and hash indices which distribute search keys uniformly using a hash function. B-tree indexing structures are commonly used as they can efficiently support insertion, deletion and search operations in logarithmic time. B-trees keep the tree balanced through splitting nodes and propagating keys during insertion if nodes overflow.

Uploaded by

Mukul Dilwaria
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 20

26

Indexing and Hashing


Indexing mechanisms used to speed up access to
desired data. E.g., author catalog in library.
Search Key - attribute to set of attributes used to look
up records in a file.
An index file consists of records (called index entries)
of the form
search-key

pointer

Index files are typically much smaller than the original file
Two basic kinds of indices:
Ordered indices: search keys are stored in sorted
order
Hash indices: search keys are distributed uniformly
across buckets using a hash function.

DBMS: Rajeev Wankar

27

Main concepts
search keys are sorted in the index file and point to
the actual records
primary vs. secondary indices
Clustering (sparse) vs non-clustering (dense) indices
Primary key index: on primary key (no duplicates)
123
234
345
456
567

STUDENT
Ssn
123
234
678
456
345

Nam e
smith
jones
tom s o n
stevens
smith

Address
main str
forbes a ve
main str
forbes a ve
forbes a ve

Secondary key index: duplicates may exist


forbes ave
main str

Address-index

DBMS: Rajeev Wankar

STUDENT
Ssn
Name
123 smith
234 jones
345 tomson
456 stevens
567 smith

Address
main str
forbes ave
main str
forbes ave
forbes ave

28

secondary key index: typically, with postings lists

Postings lists
STUDENT
Ssn
Name
123 smith
234 jones
345 tomson
456 stevens
567 smith

forbes ave
main str

Address
main str
forbes ave
main str
forbes ave
forbes ave

Clustering (= sparse) index: records are physically


sorted on that key (and not all key values are
needed in the index)
Non-clustering (=dense) index: the opposite
E.g.: Clustering/sparse index on ssn
123
456

>=123

>=456

DBMS: Rajeev Wankar

STUDENT
Ssn
Name
123 smith
234 jones
345 tomson
456 stevens
567 smith

Address
main str
forbes ave
main str
forbes ave
forbes ave

29

Sparse Index: contains index records for only some


search-key values.
Dense Index Files: Index record appears for every
search-key value in the file
Non-clustering / dense index
123
234
345
456
567

Ssn
345
234
567
456
123

Nam e
tom s o n
jones
smith
ste v e n s
smith

Address
main str
forbes a ve
forbes a ve
forbes a ve
main str

ISAM
What if index is too large to search sequentially?
>=123
123
3,423

123
456

>=456
block

DBMS: Rajeev Wankar

STUDENT
Ssn
Name
123 smith
234 jones
345 tomson
456 stevens
567 smith

Address
main str
forbes ave
main str
forbes ave
forbes ave

30

Multilevel Index
If primary index does not fit in memory, access
becomes expensive.
if index is too large, store it on disk and keep
index-on-the-index
usually two levels of indices, one first- level entry
per disk block (why? )
What about insertions/deletions?
Index Update: Deletion
If deleted record was the only record in the file with
its particular search-key value, the search-key is
deleted from the index also.
>=123
123
3,423

123
456

>=456
124; peterson; fifth ave.

DBMS: Rajeev Wankar

STUDENT
Ssn
Name
123 smith
234 jones
345 tomson
456 stevens
567 smith

Address
main str
forbes ave
main str
forbes ave
forbes ave

31

123
3,423

123
456

STUDENT
Ssn
Name
123 smith
234 jones
345 tomson
456 stevens
567 smith

overflows
Address
main str
forbes ave
main str
forbes ave
forbes ave

124; peterson; fifth ave.

Problems?

overflow chains may become very long - what to


do?
overflow chains may become very long - thus:
shut-down & reorganize
start with ~80% utilization
if index is too large, store it on disk and keep index
on the index (in memory)
usually two levels of indices, one first- level entry per
disk block (why? )
indices (like ISAM) suffer in the presence of frequent updates
alternative indexing structure: B - trees
DBMS: Rajeev Wankar

32

the most successful family of index schemes


(B-trees, B+-trees, B*-trees)
Can be used for primary/secondary, clustering/
non-clustering index.

B-trees
Eg., B-tree of order 3:

<6
>6
1

<9
7

9
>9
13

A B/B+ tree is a rooted tree satisfying the following


properties:
All paths from root to leaf are of the same length
Each node that is not a root or a leaf has between
( ( n 2 ) and n children.
A leaf node has between ( ( ( n 1 ) 2 ) and n 1
values. Special cases: If the root is not a leaf, it has
at least 2 children.
DBMS: Rajeev Wankar

33

If the root is a leaf (that is, there are no other nodes


in the tree), it can have between 0 and (n1) values.
O(log (N)) for everything! (ins/del/search)
typically, if m = 50 - 100, then 2 - 3 levels
utilization >= 50%, guaranteed; on average 69%
Queries: Algorithm for exact match query?
(eg., ssn=8?)

<6
>6
3

H steps
>9

<9
7

13

B-tree print keys in sorted order?

<6
>6
1

DBMS: Rajeev Wankar

<9
7

9
>9
13

34

Solution B+-Tree Index Files


facilitate sequential ops.
They string all leaf nodes together
AND
replicate keys from non-leaf nodes, to make sure
every key appears at the leaf level
6

<6
>=6
1

9
>=9

<9
6

13

Advantage of B+-tree index files: automatically


reorganizes itself with small, local, changes, in the
face of insertions and deletions. Reorganization of
entire file is not required to maintain performance.
Disadvantage of B+-trees: extra insertion and deletion overhead, space overhead.
Advantages of B+-trees outweigh disadvantages,
and they are used extensively.

DBMS: Rajeev Wankar

35

B+-Tree Node Structure

Ki are the search-key values


Pi are pointers to children (for non-leaf nodes) or
pointers to records or buckets of records (for leaf
nodes).
The search-keys in a node are ordered
K1 < K2 < K3 < . . . < Kn1
Leaf Nodes in B+-Trees
For i = 1, 2, . . ., n1, pointer Pi either points to a file
record with search-key value Ki, or to a bucket of
pointers to file records, each record having searchkey value Ki. Only need bucket structure if searchkey does not form a primary key.
Pn points to next leaf node in search-key order

DBMS: Rajeev Wankar

36

Non-Leaf Nodes in B+-Trees


Non leaf nodes form a multi-level sparse index on
the leaf nodes. For a non-leaf node with m pointers:
All the search-keys in the sub-tree to which P1 points
are less than K1
For 2 i n 1 , all the search-keys in the sub-tree to
which Pi points have values greater than or equal to
Ki1 and less than Km1
B+-tree for account file (n = 3)

DBMS: Rajeev Wankar

37

Leaf nodes must have between 2 and 4 values


( ( ( n 1 ) 2 ) and n 1, with n = 5).
Non-leaf nodes other than root must have between 3 and
5 children ( ( n 2 ) and n with n =5).
Root must have at least 2 children.

Observations about B+-trees


Since the inter-node connections are done by
pointers, logically close blocks need not be
physically close.
The non-leaf levels of the B+-tree form a hierarchy
of sparse indices.
The B+-tree contains a relatively small number of
levels (logarithmic in the size of the main file), thus
searches can be conducted efficiently.

DBMS: Rajeev Wankar

38

Queries on B+-Trees

Find all records with a search-key value of k.


Start with the root node

1. Examine the node for the smallest search-key value


> k.
2. If such a value exists, assume it is Kj. Then follow Pi
to the child node
3. Otherwise k K m 1 , where there are m pointers in
the node. Then follow Pm to the child node.

If the node reached by following the pointer above


is not a leaf node, repeat the above procedure on
the node, and follow the corresponding pointer.
Eventually reach a leaf node. If for some i, key Ki =
k follow pointer Pi to the desired record or bucket.
Else no record with search-key value k exists.
Insertions and deletions to the main file can be
handled efficiently, as the index can be restructured in logarithmic time (as we shall see).
If there are K search-key values in the file, the path is
no longer than log n 2 ( K ) .
With 1 million search key values and n = 100, at most
log50(1,000,000) = 4 nodes are accessed in a lookup.
DBMS: Rajeev Wankar

39

B+ tree insertion
Find the leaf node in which the search-key value
would appear
If the search-key value is already there in the leaf
node, record is added to file and if necessary a
pointer is inserted into the bucket.
If the search-key value is not there, then add the
record to the main file and create a bucket if necessary. Then:
If there is room in the leaf node, insert (key-value,
pointer) pair in the leaf node
Otherwise, split the node (along with the new (keyvalue, pointer) entry).

Splitting a node:
take the n(search-key value, pointer) pairs (including
the one being inserted) in sorted order. Place the first
n 2 in the original node, and the rest in a new
node.

DBMS: Rajeev Wankar

40

let the new node be p, and let k be the least key value
in p. Insert (k,p) in the parent of the node being split.
If the parent is full, split it and propagate the split
further up.
The splitting of nodes proceeds upwards till a node
that is not full is found. In the worst case the root
node may be split increasing the height of the tree
by 1.
/* ATTENTION:
a split at the LEAF level is handled by COPYING
the middle key upstairs;
A split at a higher level is handled by PUSHING
the middle key upstairs */
INSERTION OF KEY K
insert search-key value to L such that the keys are in
order;
if ( L overflows) {
split L ;
insert (ie., COPY) smallest search-key value
of new node to parent node P;
if (P overflows) {
repeat the B-tree split procedure recursively;
/* Notice: the B-TREE split; NOT the B+ -tree */
}
}
DBMS: Rajeev Wankar

41

E g ., in s e r t 8
6

<6

>=9

<9

>=6
3

13

E g ., in s e r t 8
6

<6

>=9

<9

>=6
1

13

C O P Y m id d le u p s t a ir s

Eg., insert 8
6

<6

>=9
>=6

<9
6

7
7

COPY middle upstairs

DBMS: Rajeev Wankar

13

42

N o n -leaf overflow
just PU S H t h e
m iddle

E g ., insert 8
6

<6

>=9

<9

>=6
1

13

C O P Y m iddle upstairs

E g ., in s e r t 8

<7

>=7
9

<6

<9

>=9

>=6
1

FIN A L T R E E

DBMS: Rajeev Wankar

13

43

B+-Tree before and after insertion of Clearview


B+-Tree File Organization
Index file degradation problem is solved by using
B+-Tree indices.
The leaf nodes in a B+-tree file organization store
records, instead of pointers.
Since records are larger than pointers, the maximum
number of records that can be stored in a leaf node is
less than the number of pointers in a nonleaf node.
Leaf nodes are still required to be half full.
Insertion and deletion are handled in the same way
as insertion and deletion of entries in a B+-tree index.
DBMS: Rajeev Wankar

44

B-Tree Index Files


Similar to B+-tree, but B-tree allows search-key values to appear only once; eliminates redundant storage of search keys.
Search keys in nonleaf nodes appear nowhere else in
the B-tree; an additional pointer field for each search
key in a nonleaf node must be included.

DBMS: Rajeev Wankar

45

B-tree (above) and B+-tree (below) on same data

DBMS: Rajeev Wankar

You might also like